Skip to content

Conversation

tukwila
Copy link
Contributor

@tukwila tukwila commented Sep 8, 2025

Summary

Details

I hope data file can support ShareGPT as benchmark test data such as: ShareGPT_V3_unfiltered_cleaned_split.json; In this PR, user can abstract testing prompts from origin file and filter human prompts (10 < words < 1000) to save into local file, refer to:

image image
  • [ ]

Test Plan

Related Issues

  • Resolves #

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

@tukwila tukwila changed the title support ShareGPT dataset as data file draft: support ShareGPT dataset as data file Sep 8, 2025
@tukwila tukwila changed the title draft: support ShareGPT dataset as data file support ShareGPT dataset as data file Sep 8, 2025
@tukwila tukwila force-pushed the support_sharegpt branch 3 times, most recently from 1cf7e56 to e98bd0e Compare September 9, 2025 04:21
@tukwila
Copy link
Contributor Author

tukwila commented Sep 9, 2025

@sjmonson
Copy link
Collaborator

This seems external to the GuideLLM. Can you please move all code and documentation to /contrib/sharegpt_preprocess.

@tukwila
Copy link
Contributor Author

tukwila commented Sep 12, 2025

This seems external to the GuideLLM. Can you please move all code and documentation to /contrib/sharegpt_preprocess.

Done

@sjmonson
Copy link
Collaborator

Sorry I forgot about this PR due to the sudden flurry of new PRs. Can you also move the changes in docs/datasets.md to contrib/sharegpt_preprocess/README.md.

@tukwila
Copy link
Contributor Author

tukwila commented Sep 16, 2025

Sorry I forgot about this PR due to the sudden flurry of new PRs. Can you also move the changes in docs/datasets.md to contrib/sharegpt_preprocess/README.md.

Done

Copy link
Collaborator

@jaredoconnell jaredoconnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the requirements.txt supposed to include all dependencies? I had to install datasets and transformers for it to work.

It may be beneficial to also note that you need to run it with the HF_TOKEN value set.

Once I addressed these it appears to have worked.

# except special characters
not re.search(r"[<>{}[\]\\]", prompt_text)
and not prompt_text.isdigit()
): # except pure numbers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment belongs above the line that's above it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants