Training

Warning: for advanced users. Knowledge of Python and scikit-learn is recommended.

This is a guide for users to retrain the default SV² classifiers and to train custom supervised classifiers.

The SV² source package includes the original training set and a Jupyter notebook containing instructions for (re)training.

SV² Training Set

The default training set is packaged with the source distribution.

The files are located here after extracting the source package:

$ ls sv2-VERSION/sv2/training/1kgp_training_data

1kgp_highcov_del_gt1kb.txt
1kgp_highcov_del_lt1kb.txt         
1kgp_highcov_del_malesexchrom.txt           
1kgp_highcov_dup_snv.txt
1kgp_lowcov_dup_breakpoint.txt
1kgp_lowcov_dup_malesexchrom.txt

These files can be used for retraining, as described in the Training SVM Classifiers section.

Custom Feature Extraction

sv2train is a script designed for advanced users who wish to train genotyping classifiers with their own data.

Given SV input, SV² generates features that, together with user-defined genotype labels, are used to train a new classifier.

$ sv2train -i <in.txt> [-b ...] [-v ...] -o <sv2>

$ ls sv2_training_features/

   sv2_deletion_gt1kb_training_features.txt
   sv2_deletion_lt1kb_training_features.txt
   sv2_deletion_male_sex_chrom_features.txt
   sv2_duplication_breakpoint_training_features.txt
   ...

The header is formatted for the companion Jupyter notebook; please do not alter it.

VERY IMPORTANT

Before training, users must populate the values in the copy_number column. The default output is NA, and the companion Jupyter notebook expects the following values:

Biallelic SVs

copy_number	VCF genotype
0	1/1 (DEL:HOM)
1	0/1 (DEL:HET)
2	0/0 (REF)
3	0/1 (DUP:HET)
4	1/1 (DUP:HOM)

SVs on Male Sex Chromosomes

copy_number	VCF genotype
0	1 (DEL:ALT)
1	0 (REF)
2	1 (DUP:ALT)

The tables above list the expected values for copy_number in the sv2train output. The companion Jupyter notebook encodes the genotype labels as copy number for simplicity. This is useful if a user wants to include variants with multiple alternate alleles, such as:

ALT	Genotype	copy_number
,	2/2	4
,	1/2	2
,	0/2	4

Training SVM Classifiers

The jupyter notebook is located in the source package here: sv2-VERSION/sv2/training/sv2_training.ipynb

A copy is also available on GitHub.

This notebook guides users through training genotyping classifiers. It is important to choose a name for your classifier; this name is used later when the classifier is loaded into SV².

The output of the Jupyter notebook is a JSON file containing the paths to the trained classifiers. The models are saved as pickle (.pkl) files.

It is very important not to alter the paths in the JSON file or the pickle files themselves.
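Because the JSON file refers to the pickle files by path, a quick check that every listed path still exists can catch accidental moves or renames before loading. The sketch below is illustrative only: the layout of the JSON file is not documented here, so the code makes no assumption about its keys and simply walks the whole structure looking for strings that end in .pkl. The function names are hypothetical.

```python
import json
import os


def find_pickle_paths(node):
    """Recursively collect every string ending in .pkl from parsed JSON."""
    if isinstance(node, str):
        return [node] if node.endswith(".pkl") else []
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return []
    found = []
    for child in children:
        found.extend(find_pickle_paths(child))
    return found


def missing_pickles(json_path):
    """Return the .pkl paths listed in json_path that are not files on disk.

    Relative paths are resolved against the current working directory.
    """
    with open(json_path) as handle:
        config = json.load(handle)
    return [p for p in find_pickle_paths(config) if not os.path.isfile(p)]


if __name__ == "__main__":
    import sys

    # Usage: python check_clf.py myclf.json
    for path in missing_pickles(sys.argv[1]):
        print("missing:", path)
```

An empty result means every referenced model file was found. The check only reads the JSON file and never modifies it.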

Adding New Classifiers to SV²

A JSON file containing paths to classifier models saved in pickle files is required to add new classifiers.

Pass the JSON file to the SV² -load-clf command:

$ sv2 -load-clf myclf.json

This command appends new classifiers to the SV² classifier JSON file located here: $SV2_INSTALL_LOCATION/sv2/config/sv2_clf.json

Genotyping with New Classifiers

After loading the classifiers with the -load-clf command, users can specify which classifier to genotype with using the -clf <classifier-name> option.

# genotype with default classifiers
$ sv2 -i in.txt [-b ...] [-v ...] -clf default

# genotype with a classifier named "myclf"
$ sv2 -i in.txt [-b ...] [-v ...] -clf myclf
