A program to split (very) large .csv files column-wise based on some other metadata file with minimal memory overhead.
This needs the `xsv` binary in your PATH, since it uses it as a backend to do the heavy lifting. Install it from the BurntSushi/xsv repository.
You need to have Python 3.10 or later installed. Install metasplit with:
pip install git+https://github.com/MrHedmad/metasplit@main
Then you can use metasplit with the `metasplit` command.
Use `metasplit --help` for a list of arguments. The arguments should be self-explanatory, with the exception of the selection strings, which are explained here.
A selection string is a structured string with this form:
|--------------------| |---------||------------ >>>
/path/to/metadata/file@id_variable?meta_var1=value&meta_var2=value ...
                      ^           ^               ^
The query has many parts:
- `/path/to/metadata/file`: The (full) path to the metadata file used to select the columns of the input file.
- `id_variable`: The name of the column in the metadata file that holds the ids of the columns in the input file. Must be preceded by an `@`.
- After these two parts, the rest of the string is made up of selections:
  - The first selection always starts with a `?`. This marks the beginning of the selections.
  - Every selection is of the form `variable` + `sign` + `value(s)`. The variable is the column to consider in the metadata. The value(s) are either a single value (`value`) or a list of values (`[value1,value2,value3]`) to select the ids with. The sign may be either `=` or `!=`, for the variable being equal to or not equal to the values, respectively.
  - Multiple selections may be chained together by starting each new selection with either `&` or `|`, for a logical AND or a logical OR with the previous selection.
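To make the grammar above concrete, here is a minimal Python sketch that splits a selection string into its parts. It is not metasplit's own parser: the names `Selection` and `parse_selection_string` are hypothetical, and the sketch assumes that neither the path nor the values contain `?`, `&`, or `|`.

```python
import re
from typing import NamedTuple


class Selection(NamedTuple):
    connector: str      # "?" for the first selection, then "&" or "|"
    variable: str       # metadata column to test
    sign: str           # "=" or "!="
    values: list[str]   # one or more accepted values


# One selection: connector, variable, sign, then either [a,b,c] or a bare value.
_SELECTION = re.compile(r"([?&|])([^=!?&|]+)(!=|=)(\[[^\]]*\]|[^&|]*)")


def parse_selection_string(text: str) -> tuple[str, str, list[Selection]]:
    """Split 'path@id_variable?selections' into its parts (illustrative only)."""
    head, question, tail = text.partition("?")
    path, at, id_variable = head.rpartition("@")
    if not at or not path or not id_variable:
        raise ValueError("expected '/path/to/file@id_variable' before the '?'")

    selections = []
    query = question + tail
    position = 0
    while position < len(query):
        match = _SELECTION.match(query, position)
        if match is None:
            raise ValueError(f"cannot parse selection at: {query[position:]!r}")
        connector, variable, sign, raw = match.groups()
        if raw.startswith("["):
            # A bracketed list: strip the brackets and split on commas.
            values = [value.strip() for value in raw[1:-1].split(",")]
        else:
            values = [raw]
        selections.append(Selection(connector, variable, sign, values))
        position = match.end()
    return path, id_variable, selections
```

For instance, `parse_selection_string("~/metadata.csv@gene_id?study=tcga")` returns the path `~/metadata.csv`, the id variable `gene_id`, and a single selection on `study`.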
You can pass multiple selection strings as input, even from different metadata files. Each selection from every metadata file will be summed together (a sort of "OR") to subset the final data file.
If you instead wish to only keep IDs that satisfy your selections in every metadata file (a sort of "AND"), you can pass the `--intersect` flag to do just that.
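The difference between the default behaviour and `--intersect` can be pictured with plain Python sets. This is only a sketch of the idea, assuming each selection string has already been resolved to a set of ids; `combine_ids` is a hypothetical helper and not part of metasplit.

```python
from functools import reduce


def combine_ids(id_sets: list[set[str]], intersect: bool = False) -> set[str]:
    """Merge per-file id sets: union by default, intersection if requested."""
    if not id_sets:
        return set()
    operation = set.intersection if intersect else set.union
    return reduce(operation, id_sets)


# Hypothetical ids selected from two different metadata files.
from_metadata = {"sample_1", "sample_2"}
from_clinical = {"sample_2", "sample_3"}

everything = combine_ids([from_metadata, from_clinical])
shared = combine_ids([from_metadata, from_clinical], intersect=True)
```

Here `everything` holds all three samples, while `shared` holds only `sample_2`, the one id present in both sets.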
Some examples of query strings:
- `~/metadata.csv@gene_id?sample_type=tumor`: Read the `~/metadata.csv` file, and select column ids in the `gene_id` column where the column `sample_type` is equal to `tumor`.
- `~/metadata.csv@gene_id?type=[primary_tumor,metastasis]&study=tcga`: Similar to the previous example, select where `type` is either `primary_tumor` or `metastasis` AND the `study` is `tcga`.
- `~/metadata.csv@gene_id?study=tcga|selection=manually_selected`: select where `study` is equal to `tcga` OR the `selection` is `manually_selected`.
- `~/metadata.csv@sample_id?study=tcga ~/clinical_metadata.csv@patient_id?smoker=true|exposed_to_asbestos=true --intersect`: select in the `metadata.csv` file where `study` is equal to `tcga`. Then, select in the `clinical_metadata.csv` file where `smoker` is `true` OR `exposed_to_asbestos` is `true`. Keep only samples that satisfy both selections (due to the `--intersect` flag).
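As a rough illustration of how such a query picks ids, the sketch below applies the second example to a small made-up metadata table. The rows, the `select_ids` helper, and the strict left-to-right evaluation of `&` and `|` are all assumptions made for this example; metasplit itself may resolve chained selections differently.

```python
def select_ids(rows, id_variable, selections):
    """Return the ids of the rows that match the chained selections.

    `selections` is a list of (connector, variable, sign, values) tuples.
    ASSUMPTION: connectors are applied strictly left to right.
    """
    selected = []
    for row in rows:
        keep = None
        for connector, variable, sign, values in selections:
            # "=" keeps rows whose value is listed, "!=" keeps the others.
            matches = (row[variable] in values) == (sign == "=")
            if keep is None:
                keep = matches          # first selection, introduced by "?"
            elif connector == "&":
                keep = keep and matches
            else:                       # "|"
                keep = keep or matches
        if keep:
            selected.append(row[id_variable])
    return selected


# Made-up metadata rows, for illustration only.
rows = [
    {"gene_id": "s1", "type": "primary_tumor", "study": "tcga"},
    {"gene_id": "s2", "type": "metastasis", "study": "other"},
    {"gene_id": "s3", "type": "normal", "study": "tcga"},
]

# Equivalent of: ?type=[primary_tumor,metastasis]&study=tcga
selections = [
    ("?", "type", "=", ["primary_tumor", "metastasis"]),
    ("&", "study", "=", ["tcga"]),
]

print(select_ids(rows, "gene_id", selections))  # prints ['s1']
```

Only `s1` satisfies both conditions, so only that column of the input file would be kept.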