-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds ByTermSplit splitter #499
base: main
Are you sure you want to change the base?
Adds ByTermSplit splitter #499
Conversation
Hey Faisal,
and the error:
|
Hey @kmanpearl, thanks for trying it out, and for the complete example and bug report! The issue occurred because I was putting literal gene IDs (which also were strings, rather than ints) in the splits rather than indices into the My latest commit changes it to integer indices, and with that change I was able to run your example without any obvious errors. If you could try an example again and let me know if the output looks reasonable, I'd appreciate it, since I don't know how to interpret it myself. |
hey @falquaddoomi, When I try to run the same code after the latest commit I am now getting the following different error. I thought possibly it was related to having the
|
Brings in testing fixes from main so this PR's tests can pass
for more information, see https://pre-commit.ci
Hey @kmanpearl, thanks again for the error traceback. I accidentally introduced that error in commit f1e5dd7 when I was making changes to get the testing suites to pass; specifically, I changed It's fixed as of e2b28f0 and I re-ran your code to make sure it's actually working this time, but please feel free to run it on your end as well. |
@falquaddoomi I was able to get the code to run without error but I'm not sure it is behaving as desired. I was comparing the results of the term split and the original gene split using the following two functions:
The shape of y_mask was as expected. However, when I look at the contents of
|
This PR adds a new splitter,
ByTermSplit
, that like the existing splitters divides a set of genes into multiple "splits", typically a training/testing split, or a training, validation, and test split.The
ByTermSplit
splitter takes the following arguments in its constructor:labelset
, which should be an instance ofLabelSetCollection
split_terms
, which should be an iterable, which contains as many iterables as you want splits.exclusive
, a boolean that, if true, only allows a specific gene to occur in the first split to which it's been assignedFor example, to produce three splits, you'd supply the following for
split_terms
:The resulting splits will contain all genes to which any term in the split corresponds. Note that while it's expected that you'd make the splits disjoint (i.e., with no terms shared between splits), you don't have to; the same genes would then end up in both splits. Also, if the same gene happens to be identified by two different terms in different splits, it will be included in both (unless
exclusive
is specified).Optionally, you can specify up to one "catch-all" split, identified by a single
"*"
. The "catch-all" split will contain any genes that weren't assigned to other splits. For example:Finally, the
exclusive
flag ensures that no split contains genes present in another split. Preference is given to the first split in which the gene occurs, so if gene ID 3841 occurred in splits 1 and 3, it would just be retained in 1.Partially addresses #498.