GitHub - ioerger/TTN-Fitness: Statistical analysis of Himar1 TnSeq data.

ioerger / TTN-Fitness Public

Notifications You must be signed in to change notification settings
Fork 3
Star 2

Statistical analysis of Himar1 TnSeq data.

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
demodata		demodata
Fitness_Estimation.py		Fitness_Estimation.py
LFCs_to_TTN.py		LFCs_to_TTN.py
README		README
compute_LFCs.py		compute_LFCs.py
predict_with_STLM.py		predict_with_STLM.py
train_STLM.py		train_STLM.py

Repository files navigation

This software is an implementation of an algorithm for predicting insertion counts based on nucleotide patterns surrounfign TA dinulceotide sites using the Himar1 transposon
The input files are wig files containing the insertion counts at TA site coordinates in a bacterial genome. It generates an output of essentiality of genes based on comparing observed counts to expected counts

Note: These scripts may have errors when run on MacOS
################# SCRIPTS ########################
---- Pre-Processing ----

With wig files and a prot table that matches the genome used to create the wig files, compute the LFCs. The output file does not contain headers. The columns are:
	- TA site coordinate
	- ORF ID at this coordinate 
	- ORF Name of the ORF ID at this coordinate
	- Nucleotides in postions -20 to +20 from the TA site, including the TA dinucleotide itself.
	- State of the TA site. 
		-> 'E' : if TA site is in a region of 6 or more consecutive sites with 2 or less insertions
		-> 'NE': otherwise
	- mean insertion count at TA site coordinate across wig files prvided
	- LFC at the TA site, using insertion counts at surrounding TA sites 
	- Description of the ORF at this coordinate

>python3 compute_LFCs.py <.fna file> <annotation .prot_table> <.wig file 1> <.wig file 2> <.wig file 3> ...<wig file n>  > <LFC file>


We create the tetra-nucleotide vectors at each TA site that will be used to train the STLM and calculate expected LFCs/expected counts. The output file contains headers. They contain columns in the LFC file generated in the previous step, expect instead one column with "Nucleotides in postions -20 to +20 from the TA site", the file contains columns of binary values of whether the tetranucleotides exist at TA sites at specific coordinates. 

>python3 LFCs_to_TTN.py <LFC file> > <TTN .csv>


---- Train STLM ----
The STLM is a linear model trained on log-fold-changes (LFCs) of insertion counts of a sample dataset. It generates the coefficents for the tetra-nucleotides.
>python3 train_STLM.py <TTN .csv> <.pickle file to save to> #pickle file created wil tared


---- Predict using STLM ---- 
With a trained STLM, the expected count at each TA site accounting for surrounding nucleotides can be calculated for a given sample dataset.
>python3 predict_with_STLM.py <.pickle file without the tar.gz suffix> <train TTN .csv file> <test TTN .csv> > <output file to write the STLM predictions to> 


--- Fitness Estimation ----
The Gene+TTN model is a linear model that incorporates the expected counts at TA sites to estimate the fitness of genes. The genes are classified by their coefficients and significance in the following manner:
	- ES : Transit Gumbel labels it 'E'  
	- ESB : The gene is of N TA sites, such that N > log10(0.05)/log10(1-saturation)
	- NE : Gene+TTN Coef = 0 and FDR ADjusted Gene+TTN Pval < 0.05    OR    FDR Adjusted Pval > 0.05
	- GA : Gene+TTN Coef > 0 and FDR Adjusted Gene+TTN Pval < 0.05
	- GD : Gene+TTN Coef < 0 and FDR Adjusted Gene+TTN Pval < 0.05

The output file contains the columns:
	- ORF ID
	- ORF Name
	- ORF Description 
	- Total # of TA sites in the ORF
	- Number of TA sites in the ORF with insertions
	- Gene-Only Model(M0) Coeffients
	- Gene-Only Model (M0) Pvals (FDR-Adjusted)
	- Gene+TTN Model (M1) Coefficents -> these values are used to categorize genes when adjusted pval < 0.05
	- Gene+TTN Model (M1) Pvals (FDR-Adjusted) -> values can be sorted in increasing order and those greater than 0.05 are marked 'NE' by the TTN-Fitness
	- M0 Fitness Estimation -> log (Counts expected by M0/Counts Observed). Ratio of expected to actual counts of genes using M0.
		- None -> Gene was essential
		- < 1 -> GD
		- = 1 -> NE
		- > 1 -> GA
	- M1 Fitness Estimation -> log (Counts expected by M1/Counts Observed). Ratio of expected to actual counts of genes using M1.	
	        - None -> Gene was essential
                - < 1 -> GD
                - = 1 -> NE
                - > 1 -> GA
	- Mean Actual Count
	- Gumbel Fitness Calls -> calls made from transit's gumbel run
	- TTN Fitness Calls -> calls made with the method in its entirity

>transit gumbel <comma-separated .wig files> <annotation .prot_table or GFF3> <output file> 
>python3 Fitness_Estimation.py <STLM output.csv> <annotation .prot_table> <output gumbel file> > <output file to save Gene Essentiality predictions>


===============================================================================
                             EXAMPLE WORKFLOWS
===============================================================================

--------------- Get Estimated Fitness of genes in H37Rv --------------------
Get expected counts of TA sites in H37Rv using the STLM trained on H37Rv. Using these expected counts, estimate fitness of genes in the H37Rv
	1. Obtain LFC data and tetranucleotide vectors for H37Rv
	2. Run Transit Gumbel for the H37Rv data
	3. Get expected counts at TA sites in H37Rv
	4. Using Gumbel and expected counts of H37Rv, get fitness estimate for genes in H37Rv

The following commands should be run from the main directory. They generate files in a folder labeled 'out' in the main directory. 

>python3 compute_LFCs.py demodata/H37Rv/H37RvBD1.fna demodata/H37Rv/H37RvBD1.prot_table demodata/H37Rv/*.wig > demodata/out/H37Rv_LFC.txt
>python3 LFCs_to_TTN.py demodata/out/H37Rv_LFC.txt > demodata/out/H37Rv_TTN.csv

>transit gumbel demodata/H37Rv/Cara_WT.wig,demodata/H37Rv/CS_TraCS053.wig,demodata/H37Rv/CS_TraCS054.wig,demodata/H37Rv/WT1_SWP.wig,demodata/H37Rv/WT2_SWP.wig,demodata/H37Rv/WT3_SWP.wig,demodata/H37Rv/WT4_SWP.wig,demodata/H37Rv/WT5_SWP.wig,demodata/H37Rv/WT6_SWP.wig,demodata/H37Rv/WT7_SWP.wig,demodata/H37Rv/WX_WT1.wig,demodata/H37Rv/WX_WT2.wig,demodata/H37Rv/WX_WT3.wig,demodata/H37Rv/WX_WT4.wig demodata/H37Rv/H37RvBD1.prot_table demodata/out/H37Rv_gumbel_gene_predictions.csv
>python3 predict_with_STLM.py demodata/H37Rv/STLM_model_trained_on_H37Rv.pickle demodata/out/H37Rv_TTN.csv demodata/out/H37Rv_TTN.csv > demodata/out/H37Rv_expected_counts.csv
>python3 Fitness_Estimation.py demodata/out/H37Rv_expected_counts.csv demodata/H37Rv/H37RvBD1.prot_table demodata/out/H37Rv_gumbel_gene_predictions.csv > demodata/out/Fitness_Estimation_of_H37Rv_genes.csv


------------- Getting fitness estimation of genes in abscessus ATCC-19977 using our trained STLM -------------------
Using our STLM, trained on the H37Rv, get expected insertion counts at TA sites in abscessus ATCC 19977:
	1. Obtain LFC data and tetranucleotide vectors for both H37Rv and abscessus
	2. Using the STLM already trained on H37Rv, calculate expected counts of absessus

The following commands should be run from the main directory. They generate files in a folder labeled 'out' in the main directory  

>python3 compute_LFCs.py demodata/H37Rv/H37RvBD1.fna demodata/H37Rv/H37RvBD1.prot_table demodata/H37Rv/*.wig > demodata/out/H37Rv_LFC.txt
>python3 compute_LFCs.py demodata/abscessus/ATCC_19977/abscessus.fna demodata/abscessus/ATCC_1977/abscessus.prot_table demodata/abscessus/ATCC_19977/*.wig > demodata/out/abscessus_LFC.txt
>python3 LFCs_to_TTN.py out/H37Rv_LFC.txt > demodata/out/H37Rv_TTN.csv
>python3 LFCs_to_TTN.py out/abscessus_LFC.txt > demodata/out/abscessus_TTN.csv


>python3 predict_with_STLM.py demodata/H37Rv/STLM_model_trained_on_H37Rv.pickle demodata/out/H37Rv_TTN.csv demodata/out/abscessus_TTN.csv > demodata/out/abscessus_expected_counts.csv
>transit gumbel demodata/abscessus/ATCC_19977/TnSeq-Ref-1.wig,demodata/abscessus/ATCC_19977/TnSeq-Ref-2.wig,demodata/abscessus/ATCC_19977/TnSeq-Ref-3.wig demodata/abscessus/ATCC_19977/abscessus.prot_table demodata/out/abscessus_gumbel_gene_predictions.csv
>python3 Fitness_Estimation.py out/abscessus_expected_counts.csv demodata/abscessus/ATCC_19977/abscessus.prot_table demodata/out/abscessus_gumbel_gene_predictions.csv > demodata/out/Fitness_Estimation_of_abscessus_ATCC_19977_genes.csv


--------- Retraining STLM on abscessus --------------------
Train a new STLM using the abscessus ATCC-19977 data:
	1. Obtain LFC data and tetranucleotide vectors for the abscessus ATCC-19977
	2. Train a new STLM using the abscessus TTN vectors and LFCs 

The following commands should be run from the main directory. They generate files in a folder labeled 'out' in the main directory

>python3 compute_LFCs.py demodata/abscessus/ATCC_19977/abscessus.fna demodata/abscessus/ATCC_1977/abscessus.prot_table demodata/abscessus/ATCC_19977/TnSeq-Ref-1.wig demodata/abscessus/ATCC_19977/TnSeq-Ref-2.wig demodata/abscessus/ATCC_19977/TnSeq-Ref-3.wig > demodata/out/abscessus_LFC.txt
>python3 LFCs_to_TTN.py demodata/out/abscessus_LFC.txt > demodata/out/abscessus_TTN.csv
>python3 train_STLM.py demodata/out/H37Rv_TTN.csv demodata/out/STLM_model_trained_on_abscessus.pickle