-
Notifications
You must be signed in to change notification settings - Fork 32
Combine GPTHuggingfaceDatasetConfig input sources into source_schema
#255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: generalize_dynamic_classes
Are you sure you want to change the base?
Combine GPTHuggingfaceDatasetConfig input sources into source_schema
#255
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor suggestion, otherwise LGTM!
class SourceSchemaConfig(Config): | ||
pass | ||
|
||
class TextColumnConfig(SourceSchemaConfig): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class TextColumnConfig(SourceSchemaConfig): | |
class TextColumnConfig(SourceSchemaConfig): | |
type: ClassVar[str] = "text_column" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jlamypoirier, what's the endorsed idiom for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing until we merge @245. Then we can add to registry. This one won't work unless it's marked as a classvar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nitsanluke's PR is assuming #245 is getting merged. It's branching off from the generalize_dynamic_classes
branch of #245.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, then it needs to be registered after the class definition, ex. https://github.com/ServiceNow/Fast-LLM/pull/245/files#diff-7001a0c0d12d5adf8e07c112218448985de3f25255f166f01d001768c7130788R148. Here something like SourceSchemaConfig.register_subclass("text_column", TextColumnConfig)
.
Co-authored-by: Torsten Scholak <torsten.scholak@googlemail.com>
Co-authored-by: Torsten Scholak <torsten.scholak@googlemail.com>
…com:ServiceNow/Fast-LLM into restructure_dataset_config_for_multi_source
Co-authored-by: Torsten Scholak <torsten.scholak@googlemail.com>
…com:ServiceNow/Fast-LLM into restructure_dataset_config_for_multi_source
_data_column: str | ||
_loss_masking_spans_column: str | None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq: do we need to make these class variables? With more schemas coming in, we would have to define class variables for fields in each of them. Wouldn't it be simpler to directly access them from the schema config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the type class variable is a discriminator field. It tells fast-llm to instantiate the right class when deserializing the config from yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need one such discriminator field (type: ClassVar[str]) per source schema config class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait a sec. This isn’t the class I thought I was looking at. I think you’re right @sohamparikh, I don’t think we need these fields here. @nitsanluke?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sohamparikh The idea was to separate the set of vars used by prepare from the config; since we can't change config vars after defining. So it become complex to use the same var with multiple data sources (need to set during validation). Eg: For the concat case the new col that would be generated during the execution will take the place of _data_column
and rest of the prepare code doesn't have to change. The separation allows for different source schema to use the class vars to setup for the specific use-case they are dealing with. We can discuss further to see if the alternative has more benefits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to use a typing.ClassVar[str]
in the config class. Or for more flexibility it could be a @property
or @cachedproperty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nitsanluke, makes sense to separate them from the config
With more modalities coming in (image, audio), could we rename _data_column
to _text_column
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to _text_column
data_source
source_schema
default="text", | ||
desc="Field of the dataset to use.", | ||
source_schema: SourceSchemaConfig = Field( | ||
default_factory=TextColumnConfig, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default_factory
doesn't actually work and is removed for config classes in #245. I'm working on a way to set a default subclass https://github.com/ServiceNow/Fast-LLM/pull/245/files#diff-7001a0c0d12d5adf8e07c112218448985de3f25255f166f01d001768c7130788R44.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove it from here and keep a TODO.
✨ Description
This PR creates a common interface for all
GPTHuggingfaceDatasetConfig
input columns via the newsource_schema
variable. Beyond the variablefiled
we require additional keys to preprocess and tokenize different types of datasets. (eg: SFT, combine cols, etc).Therefore we have created a new variable
source_schema
which can accommodate these different data sources specific preprocessing and tokenization. Current variablesfield
andloss_masking_spans
are moved intoTextColumnConfig
as a type of input/data source.Merge after #245
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing