Usage
InterTADs consists of three major scripts: two perform the main functionality and one handles the visualization.
To run the tool, execute the scripts in the following order:
Data_Integration.R
TADiff.R
Visualization.R
For the Data Integration part, all omics datasets are separated into two folders, freq and counts (the user can define different folder names), based on the information they carry (frequency or score count values). Each file should contain four major columns (INDEX, CHROMOSOME, START POSITION ON CHROMOSOME, END POSITION ON CHROMOSOME) followed by the data.
An example of a frequency table can be found below:
ID chromosome_name start_position end_position S6_GR-10824 S8_GR-1615 S8_GR-7341 D S6_GR-3810 S6_GR-6296 D S6_GR-9320 S8_GR-2980 D S8_GR-1114 S6_GR-7150
cg13869341 1 15864 15865 0.83029719 0.87924895 0.76518576 0.74811852 0.77672951 0.79969665 0.82446763 0.77489221 0.84642477
cg14008030 1 18826 18827 0.46587484 0.36375759 0.42892993 0.43966841 0.34872824 0.44089334 0.42953261 0.37386713 0.52129949
cg12045430 1 29406 29407 0.06704503 0.06174565 0.05452966 0.05422707 0.06373271 0.05207358 0.06518455 0.03379153 0.07121873
cg20826792 1 29424 29425 0.19085259 0.13481677 0.16600570 0.18401117 0.15711119 0.18683240 0.13360431 0.12754390 0.26688323
cg00381604 1 29434 29435 0.05333948 0.03570165 0.04609482 0.05332919 0.04466450 0.04220193 0.05977976 0.02439079 0.02971444
cg20253340 1 68848 68849 0.51558781 0.56116867 0.45075386 0.27855765 0.49120712 0.45580421 0.59264789 0.44046663 0.56385297
An example of a count table can be found below:
LOC Chromosome Start End gene.features.locus Genes S1 S2 S8 S9 S11 S4 S5 S6 S7 S10 S12
XLOC_000001 chr1 11873 29370 chr1:11873-29370 DDX11L1 0.1677510 0.1999510 0.0779174 0.1908730 0.0773154 0.1045200 0.0488422 0.0840655 0.0383534 0.1518370 0.0836381
XLOC_000002 chr1 11873 29370 chr1:11873-29370 WASH7P 0.0000000 1.1977200 0.7615410 0.8504720 1.2126800 3.2438200 0.0722865 0.6610380 0.8709270 0.5822600 1.2383400
XLOC_000003 chr1 30365 30503 chr1:30365-30503 MIR1302-10 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
XLOC_000004 chr1 69090 70008 chr1:69090-70008 OR4F5 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
XLOC_000005 chr1 323891 328581 chr1:323891-328581 LOC100132062,LOC100133331 0.0363166 0.0449181 0.0000000 0.0583464 0.0861039 0.0201765 0.0000000 0.0377682 0.1620190 0.0643269 0.0000000
XLOC_000006 chr1 367658 368597 chr1:367658-368597 OR4F16 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
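The column layout described above can be checked before the integration is run. The sketch below is only an illustration: check_omics_table is a hypothetical helper that is not part of InterTADs, and it assumes the first four columns are the ID, chromosome, start and end, in that order, as in both example tables.

```r
# Hypothetical helper, not part of InterTADs: checks that the first four
# columns of an input table look like ID, chromosome, start, end and that
# at least one data column follows them.
check_omics_table <- function(tbl) {
  if (ncol(tbl) < 5) {
    stop("expected four leading columns followed by at least one data column")
  }
  start <- tbl[[3]]
  end <- tbl[[4]]
  if (!is.numeric(start) || !is.numeric(end)) {
    stop("start and end positions must be numeric")
  }
  if (any(start > end)) {
    stop("found rows where the start position exceeds the end position")
  }
  invisible(TRUE)
}

# First two rows of the frequency example, with two of the sample columns
freq <- data.frame(
  ID = c("cg13869341", "cg14008030"),
  chromosome_name = c("1", "1"),
  start_position = c(15864, 18826),
  end_position = c(15865, 18827),
  `S6_GR-10824` = c(0.83029719, 0.46587484),
  `S8_GR-1615` = c(0.87924895, 0.36375759),
  check.names = FALSE
)

check_omics_table(freq)  # stops with an error if the layout is wrong
```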
The two folders are placed in a directory, along with a meta data file that provides the mapping between the sample columns of each dataset. An example of the structure of this file is shown below:
sample_methylation_data sample_expr_data sample_mutation_data groups newNames
S6_GR-10824 S10 10824_S10 ss6 S10
S8_GR-1615 S11 6242_S6 ss8 S11
S8_GR-7341 D S2 7341R_S2 ss8 S2
S6_GR-3810 S5 4229_S5 ss6 S5
S6_GR-6296 D S7 6296R_S7 ss6 S7
S6_GR-9320 S12 9577_S7 ss6 S12
S8_GR-2980 D S9 2980R_S9 ss8 S9
S8_GR-1114 S1 5452_S1 ss8 S1
S6_GR-7150 S6 7150_S6 ss6 S6
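The role of the meta data file can be illustrated with a small sketch. It assumes that each dataset-specific sample name is replaced by the matching entry in newNames, which is consistent with the column names of the integrated table; rename_samples is a hypothetical helper and may differ from what Data_Integration.R does internally.

```r
# Three rows of the meta data example above
meta <- data.frame(
  sample_methylation_data = c("S6_GR-10824", "S8_GR-1615", "S8_GR-7341 D"),
  sample_expr_data = c("S10", "S11", "S2"),
  groups = c("ss6", "ss8", "ss8"),
  newNames = c("S10", "S11", "S2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename the sample columns of one dataset to newNames,
# failing early if a sample listed in the meta data file is missing.
rename_samples <- function(tbl, meta, meta_col) {
  idx <- match(meta[[meta_col]], colnames(tbl))
  if (anyNA(idx)) {
    stop("samples missing from table: ",
         paste(meta[[meta_col]][is.na(idx)], collapse = ", "))
  }
  colnames(tbl)[idx] <- meta$newNames
  tbl
}

# First row of the frequency example, restricted to the three samples above
freq <- data.frame(
  ID = "cg13869341",
  `S6_GR-10824` = 0.83029719,
  `S8_GR-1615` = 0.87924895,
  `S8_GR-7341 D` = 0.76518576,
  check.names = FALSE
)

renamed <- rename_samples(freq, meta, "sample_methylation_data")
colnames(renamed)  # "ID" "S10" "S11" "S2"
```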
The user can also choose a folder name for the output table and an option for the Human Genome reference being used (accepted values are hg19 or hg38).
The full list of inputs for the Data Integration part is the following:
- dir_name: Directory of input datasets containing feature counts and frequency tables
- output_folder: Folder name for printing output tables
- tech: Human Genome Reference preferred
- meta: Meta data file name used
- counts_dir: Directory name of counts NGS data
- freq_dir: Directory name of freq NGS data
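Before running the script, it can help to confirm that the inputs are consistent with the layout described above. The following is a minimal sketch, not part of InterTADs: check_integration_inputs is a hypothetical helper, and the values "Datasets", "output_tables" and "meta-data.csv" are placeholders rather than names the tool requires.

```r
# Placeholder values (assumptions); adjust them to your own layout
dir_name <- "Datasets"
output_folder <- "output_tables"
tech <- "hg19"
meta <- "meta-data.csv"
counts_dir <- "counts"
freq_dir <- "freq"

# Hypothetical helper: returns a character vector describing every problem
# found, or an empty vector when the inputs look consistent.
check_integration_inputs <- function(dir_name, tech, meta, counts_dir, freq_dir) {
  problems <- character(0)
  if (!(tech %in% c("hg19", "hg38"))) {
    problems <- c(problems, "tech must be hg19 or hg38")
  }
  for (sub in c(counts_dir, freq_dir)) {
    if (!dir.exists(file.path(dir_name, sub))) {
      problems <- c(problems, paste("missing folder:", file.path(dir_name, sub)))
    }
  }
  if (!file.exists(file.path(dir_name, meta))) {
    problems <- c(problems, paste("missing meta data file:", file.path(dir_name, meta)))
  }
  problems
}

problems <- check_integration_inputs(dir_name, tech, meta, counts_dir, freq_dir)
if (length(problems) > 0) print(problems)
```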
Once every input is provided, the script can be run by:
source("Data_Integration.R")
The generated table follows the format below:
ID chromosome_name start_position end_position Gene_id Gene_locus S10 S11 S2 S5 S7 S12 S9 S1 S6 parent
chr1,100491168:T:TA 1 100491168 100491169 SLC35A3|SLC35A3|SLC35A3 transcript|exon|threeUTR 0.000000 0.000000 0.000000 50.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3
cg25151559 1 1223441 1223442 SCNN1D|SCNN1D transcript|intron 33.045591 76.854680 37.288103 0.000000 6.593783 76.784934 77.551448 40.368888 29.723830 1
chr1,100503564:T:C 1 100503564 100503565 MFSD14A|NA promoter|intergenic 0.000000 50.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3
cg02913364 1 1208572 1208573 UBE2J2|UBE2J2 transcript|intron 6.740306 11.566024 9.024343 9.028459 12.387940 5.701391 9.662759 7.551389 5.517274 1
cg16619049 1 805540 805541 FAM41C|FAM41C transcript|intron 0.712985 9.954163 28.119258 62.539744 50.121639 56.854456 21.807748 58.569130 23.909206 1
cg15994267 1 805351 805352 FAM41C|FAM41C transcript|intron 43.039788 34.019183 41.020202 45.317996 43.767269 38.059623 38.256512 32.864791 43.418376 1
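In the table above, Gene_id and Gene_locus hold several annotations separated by "|". The sketch below splits them into one row per annotation, assuming that the i-th gene corresponds to the i-th locus; this pairing is inferred from the example rows and is not confirmed by the tool. split_annotation is a hypothetical helper.

```r
# Hypothetical helper: split the "|"-separated annotations of one row into
# a data frame with one gene/locus pair per row.
split_annotation <- function(gene_id, gene_locus) {
  ids <- strsplit(gene_id, "|", fixed = TRUE)[[1]]
  loci <- strsplit(gene_locus, "|", fixed = TRUE)[[1]]
  stopifnot(length(ids) == length(loci))
  data.frame(gene = ids, locus = loci, stringsAsFactors = FALSE)
}

# Annotation of the first row of the example table
pairs <- split_annotation("SLC35A3|SLC35A3|SLC35A3", "transcript|exon|threeUTR")
print(pairs)
```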
For the TADiff part, the paths to the input meta data file and to the output table of the Data Integration part must be provided. In addition, a BED file containing information about the TADs is needed:
chr1 521368 750000 TAD1
chr1 750000 1850000 TAD2
chr1 1850000 6000000 TAD3
chr1 6000000 6750000 TAD4
chr1 6750000 7800000 TAD5
chr1 7800000 8050000 TAD6
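To make the role of the BED file concrete, the sketch below assigns a genomic position to a TAD. It is an illustration only: assign_tad is a hypothetical helper, and the half-open interval test (start <= position < end) follows the usual BED convention, which may not match the overlap rule used in TADiff.R. Note that the BED file names chromosomes as chr1, while the integrated table uses 1.

```r
# The six TADs of the BED example above
tads <- data.frame(
  chr = "chr1",
  start = c(521368, 750000, 1850000, 6000000, 6750000, 7800000),
  end = c(750000, 1850000, 6000000, 6750000, 7800000, 8050000),
  name = paste0("TAD", 1:6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the name of the first TAD that contains the
# position, or NA when the position falls outside every TAD.
assign_tad <- function(chrom, pos, tads) {
  hit <- tads$chr == chrom & tads$start <= pos & pos < tads$end
  if (any(hit)) tads$name[which(hit)[1]] else NA_character_
}

# Position 948624 on chromosome 1 falls inside 750000-1850000
assign_tad("chr1", 948624, tads)  # "TAD2"
```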
The full input list for the script is the following:
- dir_name: Directory of input datasets containing meta data file
- output_folder: Folder name for printing output tables
- image_output_folder: Folder name for printing output images
- tad_file: BED file containing information about TADs
- meta: Meta data file name used
- paired.data: Boolean flag indicating whether input data is paired or not
- FDR_criterion: User defined FDR criterion
In order to run the script:
source("TADiff.R")
The output table produced by the script is the one below:
chromosome_name tad_name tad_start tad_end ID start_position end_position Gene_id Gene_locus parent S10 S11 S2 S5 S7 S12 S9 S1 S6 diff
1 TAD2 750000 1850000 cg18792131 1099165 1099166 <NA> intergenic 1 96.105811 94.965889 94.3893586 97.222416 91.741260 96.100389 97.402224 98.120703 97.371092 0.5113503
1 TAD127 100200000 100950000 chr1,100476244:C:CA 100476244 100476245 SLC35A3|SLC35A3 transcript|intron 3 0.000000 0.000000 50.0000000 0.000000 0.000000 0.000000 50.000000 0.000000 0.000000 25.0000000
1 TAD127 100200000 100950000 chr1,100442764:A:G 100442764 100442765 SLC35A3|SLC35A3 transcript|intron 3 0.000000 0.000000 0.0000000 0.000000 100.000000 0.000000 0.000000 0.000000 0.000000 -20.0000000
1 TAD2 750000 1850000 cg25610492 948624 948625 ISG15|NA promoter|intergenic 1 5.267563 2.626257 0.3441028 5.189518 2.784986 1.788684 4.188292 11.576175 5.362097 0.6051372
1 TAD2 750000 1850000 cg22379708 982917 982918 AGRN|AGRN transcript|intron 1 79.367710 4.115131 73.0741959 75.720993 75.668553 72.052459 83.662421 82.700215 77.902354 -15.2544232
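In the example rows, the diff column is consistent with the mean of the ss8 samples minus the mean of the ss6 samples, using the groups from the meta data example. The sketch below reproduces the value for the row chr1,100476244:C:CA. Whether TADiff.R computes diff this way in general (for instance with paired data) is an assumption.

```r
# Group of each sample, taken from the meta data example (newNames, groups)
groups <- c(S10 = "ss6", S11 = "ss8", S2 = "ss8", S5 = "ss6", S7 = "ss6",
            S12 = "ss6", S9 = "ss8", S1 = "ss8", S6 = "ss6")

# Sample values of the row chr1,100476244:C:CA
values <- c(S10 = 0, S11 = 0, S2 = 50, S5 = 0, S7 = 0,
            S12 = 0, S9 = 50, S1 = 0, S6 = 0)

# Mean per group, then the difference between the two groups
group_means <- tapply(values, groups[names(values)], mean)
diff_value <- as.numeric(group_means["ss8"] - group_means["ss6"])
diff_value  # 25
```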
In addition to the output table above, a statistical analysis between two groups of samples can be performed in order to locate the main differences among them. The output is a table containing information about the TADs and the level of significance of each one:
tad_name count mean IQR ttest wilcoxon FDR
TAD1 26 3.320106 3.332445 5.905208e-01 3.924510e-01 1.000000e+00
TAD10 170 16.505051 20.676735 1.369320e-06 2.367577e-08 1.562601e-06
TAD11 1 1.687172 0.000000 2.910859e-01 2.857143e-01 1.000000e+00
TAD12 6 1.585812 3.109927 9.942975e-01 9.167833e-01 1.000000e+00
TAD126 7 20.000000 0.000000 6.245001e-03 1.277503e-02 2.810507e-01
TAD127 444 16.019144 22.500000 6.713911e-03 2.275272e-05 7.508397e-04
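The FDR column can be compared against the user-defined FDR_criterion to select significant TADs. The sketch below does this for the example rows; the threshold of 0.05 is a hypothetical value, and the strict comparison is an assumption about how the criterion is applied.

```r
# Selected columns of the example table above
tad_stats <- data.frame(
  tad_name = c("TAD1", "TAD10", "TAD11", "TAD12", "TAD126", "TAD127"),
  count = c(26, 170, 1, 6, 7, 444),
  FDR = c(1.000000e+00, 1.562601e-06, 1.000000e+00,
          1.000000e+00, 2.810507e-01, 7.508397e-04),
  stringsAsFactors = FALSE
)

FDR_criterion <- 0.05  # hypothetical threshold

# Keep only the TADs whose FDR is below the criterion
significant <- tad_stats[tad_stats$FDR < FDR_criterion, ]
significant$tad_name  # "TAD10" "TAD127"
```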
For the visualization of the results, the paths to the input meta data file and to the output tables generated by the TADiff part need to be provided.
The full list of inputs for the script is the following:
- dir_name: Directory of input datasets
- tad_folder: Folder name for reading InterTADs output tables
- meta: meta-data file name used
- image_folder_name: Directory name of output visualizations
- diff_col: Group column to compare
- my_colors: Colors for printing images
- my_cex: Sizes for printing images
Once every input is provided, you can run the script by:
source("Visualization.R")