Nikolaos Pechlivanis edited this page Jul 8, 2020 · 15 revisions

InterTADs consists of three major scripts: two perform the main functionalities and one handles the visualization.

In order to run the tool, the scripts should be executed in the following order:

  1. Data_Integration.R
  2. TADiff.R
  3. Visualization.R

1. Data Integration

For the Data Integration part, all omics datasets are separated into two folders, freq and counts (the user can define different folder names), according to the kind of values they carry (frequency or score count values). Each file should contain four major columns (INDEX, CHROMOSOME, START POSITION ON CHROMOSOME, END POSITION ON CHROMOSOME), followed by the data.

An example of a frequency table can be found below:

        ID chromosome_name start_position end_position S6_GR-10824 S8_GR-1615 S8_GR-7341 D S6_GR-3810 S6_GR-6296 D S6_GR-9320 S8_GR-2980 D S8_GR-1114 S6_GR-7150
cg13869341               1          15864        15865  0.83029719 0.87924895   0.76518576 0.74811852   0.77672951 0.79969665   0.82446763 0.77489221 0.84642477
cg14008030               1          18826        18827  0.46587484 0.36375759   0.42892993 0.43966841   0.34872824 0.44089334   0.42953261 0.37386713 0.52129949
cg12045430               1          29406        29407  0.06704503 0.06174565   0.05452966 0.05422707   0.06373271 0.05207358   0.06518455 0.03379153 0.07121873
cg20826792               1          29424        29425  0.19085259 0.13481677   0.16600570 0.18401117   0.15711119 0.18683240   0.13360431 0.12754390 0.26688323
cg00381604               1          29434        29435  0.05333948 0.03570165   0.04609482 0.05332919   0.04466450 0.04220193   0.05977976 0.02439079 0.02971444
cg20253340               1          68848        68849  0.51558781 0.56116867   0.45075386 0.27855765   0.49120712 0.45580421   0.59264789 0.44046663 0.56385297

and an example of a count table:

        LOC Chromosome  Start    End gene.features.locus                     Genes        S1        S2        S8        S9       S11        S4        S5        S6        S7       S10       S12
XLOC_000001       chr1  11873  29370    chr1:11873-29370                   DDX11L1 0.1677510 0.1999510 0.0779174 0.1908730 0.0773154 0.1045200 0.0488422 0.0840655 0.0383534 0.1518370 0.0836381
XLOC_000002       chr1  11873  29370    chr1:11873-29370                    WASH7P 0.0000000 1.1977200 0.7615410 0.8504720 1.2126800 3.2438200 0.0722865 0.6610380 0.8709270 0.5822600 1.2383400
XLOC_000003       chr1  30365  30503    chr1:30365-30503                MIR1302-10 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
XLOC_000004       chr1  69090  70008    chr1:69090-70008                     OR4F5 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
XLOC_000005       chr1 323891 328581  chr1:323891-328581 LOC100132062,LOC100133331 0.0363166 0.0449181 0.0000000 0.0583464 0.0861039 0.0201765 0.0000000 0.0377682 0.1620190 0.0643269 0.0000000
XLOC_000006       chr1 367658 368597  chr1:367658-368597                    OR4F16 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
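
Before running the integration, it can help to confirm that a table follows the layout described above. The snippet below is a minimal sketch in base R, not part of InterTADs: the helper name `check_omics_table` is hypothetical, and the small data frame stands in for a file read from the freq or counts folder (its values are copied from the frequency example above).

```r
# Hypothetical helper (not part of InterTADs): checks the four leading columns
# (index, chromosome, start, end) and that at least one data column follows.
check_omics_table <- function(tbl) {
  if (ncol(tbl) < 5) {
    stop("expected four leading columns followed by at least one data column")
  }
  start <- tbl[[3]]
  end <- tbl[[4]]
  if (!is.numeric(start) || !is.numeric(end)) {
    stop("columns 3 and 4 must hold numeric start and end positions")
  }
  if (any(start > end)) {
    stop("every start position must be <= its end position")
  }
  invisible(TRUE)
}

# Two rows and one sample column taken from the frequency example.
freq <- data.frame(
  ID = c("cg13869341", "cg14008030"),
  chromosome_name = c("1", "1"),
  start_position = c(15864, 18826),
  end_position = c(15865, 18827),
  `S6_GR-10824` = c(0.83029719, 0.46587484),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

ok <- check_omics_table(freq)
```

Setting `check.names = FALSE` keeps sample names such as `S6_GR-10824` unchanged, which matters because the meta data file refers to them verbatim.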

The two folders are placed in a directory together with a meta data file, which maps the sample columns of each dataset to one another. The structure of this file is shown in the following example:

 sample_methylation_data sample_expr_data sample_mutation_data groups newNames
             S6_GR-10824              S10            10824_S10    ss6      S10
              S8_GR-1615              S11              6242_S6    ss8      S11
            S8_GR-7341 D               S2             7341R_S2    ss8       S2
              S6_GR-3810               S5              4229_S5    ss6       S5
            S6_GR-6296 D               S7             6296R_S7    ss6       S7
              S6_GR-9320              S12              9577_S7    ss6      S12
            S8_GR-2980 D               S9             2980R_S9    ss8       S9
              S8_GR-1114               S1              5452_S1    ss8       S1
              S6_GR-7150               S6              7150_S6    ss6       S6
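
To illustrate what such a mapping makes possible, the sketch below renames the sample columns of a frequency table to the common names in `newNames`. It assumes that this is how the columns are harmonized across datasets; the actual logic inside Data_Integration.R may differ. The values come from the examples on this page.

```r
# First three rows of the meta data example (methylation column only).
meta <- data.frame(
  sample_methylation_data = c("S6_GR-10824", "S8_GR-1615", "S8_GR-7341 D"),
  newNames = c("S10", "S11", "S2"),
  stringsAsFactors = FALSE
)

# One row of the frequency example, keeping the original column names.
freq <- data.frame(
  ID = "cg13869341",
  `S6_GR-10824` = 0.83029719,
  `S8_GR-1615` = 0.87924895,
  `S8_GR-7341 D` = 0.76518576,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Look up each column name in the meta table; unmatched columns keep their name.
idx <- match(names(freq), meta$sample_methylation_data)
names(freq)[!is.na(idx)] <- meta$newNames[idx[!is.na(idx)]]

print(names(freq))
```

After the rename, tables from different folders can be combined on identical sample names.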

Moreover, the user can choose a folder name for the output table and specify which Human Genome reference is being used (accepted values are hg19 or hg38).

The entire list of the inputs for the Data Integration is the following:

  • dir_name: Directory of input datasets containing feature counts and frequency tables
  • output_folder: Folder name for printing output tables
  • tech: Human Genome Reference preferred
  • meta: Meta data file name used
  • counts_dir: Directory name of counts NGS data
  • freq_dir: Directory name of freq NGS data
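
As an illustration, the inputs listed above might look like the following. This is only a sketch: every path and file name is a placeholder, and this page does not state where the inputs are defined, so collecting them in a list is an assumption made for the example.

```r
# Example values only; every path and file name below is a placeholder.
inputs <- list(
  dir_name = "Datasets",
  output_folder = "output_tables",
  tech = "hg19",
  meta = "meta-data.csv",
  counts_dir = "counts",
  freq_dir = "freq"
)

# Only hg19 and hg38 are accepted for the genome reference.
valid_tech <- inputs$tech %in% c("hg19", "hg38")
if (!valid_tech) {
  stop("tech must be 'hg19' or 'hg38'")
}
```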

Once every input is provided, the script can be run by:

source("Data_Integration.R")

The generated table follows the format below:

                 ID chromosome_name start_position end_position                 Gene_id               Gene_locus       S10       S11        S2        S5        S7       S12        S9        S1        S6 parent
chr1,100491168:T:TA               1      100491168    100491169 SLC35A3|SLC35A3|SLC35A3 transcript|exon|threeUTR  0.000000  0.000000  0.000000 50.000000  0.000000  0.000000  0.000000  0.000000  0.000000      3
         cg25151559               1        1223441      1223442           SCNN1D|SCNN1D        transcript|intron 33.045591 76.854680 37.288103  0.000000  6.593783 76.784934 77.551448 40.368888 29.723830      1
 chr1,100503564:T:C               1      100503564    100503565              MFSD14A|NA      promoter|intergenic  0.000000 50.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000      3
         cg02913364               1        1208572      1208573           UBE2J2|UBE2J2        transcript|intron  6.740306 11.566024  9.024343  9.028459 12.387940  5.701391  9.662759  7.551389  5.517274      1
         cg16619049               1         805540       805541           FAM41C|FAM41C        transcript|intron  0.712985  9.954163 28.119258 62.539744 50.121639 56.854456 21.807748 58.569130 23.909206      1
         cg15994267               1         805351       805352           FAM41C|FAM41C        transcript|intron 43.039788 34.019183 41.020202 45.317996 43.767269 38.059623 38.256512 32.864791 43.418376      1

2. TADiff

For the TADiff part, the paths to the input meta data file and to the output table (from the Data Integration part) must be provided. In addition, a BED file containing information about the TADs is needed:

chr1  521368  750000 TAD1
chr1  750000 1850000 TAD2
chr1 1850000 6000000 TAD3
chr1 6000000 6750000 TAD4
chr1 6750000 7800000 TAD5
chr1 7800000 8050000 TAD6
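
The sketch below shows one way a genomic position could be assigned to a TAD from such a BED file. It is not the TADiff implementation: the helper `assign_tad` is hypothetical, and treating intervals as half-open (start included, end excluded) is an assumption. The three CpG positions are taken from the TADiff output example on this page, where they are reported in TAD2.

```r
# First three TADs of the BED example.
tads <- data.frame(
  chr = "chr1",
  start = c(521368, 750000, 1850000),
  end = c(750000, 1850000, 6000000),
  name = c("TAD1", "TAD2", "TAD3"),
  stringsAsFactors = FALSE
)

features <- data.frame(
  ID = c("cg25610492", "cg22379708", "cg18792131"),
  chromosome_name = "1",
  start_position = c(948624, 982917, 1099165),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the name of the first TAD containing the position,
# or NA when the position falls outside every TAD.
assign_tad <- function(chrom, pos, tads) {
  hit <- which(tads$chr == paste0("chr", chrom) &
                 tads$start <= pos & pos < tads$end)
  if (length(hit) == 0) NA_character_ else tads$name[hit[1]]
}

features$tad_name <- mapply(
  assign_tad, features$chromosome_name, features$start_position,
  MoreArgs = list(tads = tads), USE.NAMES = FALSE
)

print(features)
```

The `paste0("chr", chrom)` step bridges the two naming styles seen on this page (`1` in the integrated table, `chr1` in the BED file).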

The full input list for the script is the following:

  • dir_name: Directory of input datasets containing meta data file
  • output_folder: Folder name for printing output tables
  • image_output_folder: Folder name for printing output images
  • tad_file: BED file containing information about TADs
  • meta: Meta data file name used
  • paired.data: Boolean flag indicating whether input data is paired or not
  • FDR_criterion: User defined FDR criterion

In order to run the script:

source("TADiff.R")

The output table produced by the script is the one below:

 chromosome_name tad_name tad_start   tad_end                  ID start_position end_position         Gene_id          Gene_locus parent       S10       S11         S2        S5         S7       S12        S9        S1        S6        diff
               1     TAD2    750000   1850000          cg18792131        1099165      1099166            <NA>          intergenic      1 96.105811 94.965889 94.3893586 97.222416  91.741260 96.100389 97.402224 98.120703 97.371092   0.5113503
               1   TAD127 100200000 100950000 chr1,100476244:C:CA      100476244    100476245 SLC35A3|SLC35A3   transcript|intron      3  0.000000  0.000000 50.0000000  0.000000   0.000000  0.000000 50.000000  0.000000  0.000000  25.0000000
               1   TAD127 100200000 100950000  chr1,100442764:A:G      100442764    100442765 SLC35A3|SLC35A3   transcript|intron      3  0.000000  0.000000  0.0000000  0.000000 100.000000  0.000000  0.000000  0.000000  0.000000 -20.0000000
               1     TAD2    750000   1850000          cg25610492         948624       948625        ISG15|NA promoter|intergenic      1  5.267563  2.626257  0.3441028  5.189518   2.784986  1.788684  4.188292 11.576175  5.362097   0.6051372
               1     TAD2    750000   1850000          cg22379708         982917       982918       AGRN|AGRN   transcript|intron      1 79.367710  4.115131 73.0741959 75.720993  75.668553 72.052459 83.662421 82.700215 77.902354 -15.2544232
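
In the rows above, the diff column is consistent with the mean of the ss8 samples minus the mean of the ss6 samples, using the groups from the meta data example. This reading is inferred from the example values and is not confirmed by the script documentation. The sketch below, with a hypothetical helper `group_mean_diff`, reproduces two of the rows under that assumption.

```r
# Group of each sample, from the meta data example (newNames -> groups).
groups <- c(S10 = "ss6", S11 = "ss8", S2 = "ss8", S5 = "ss6", S7 = "ss6",
            S12 = "ss6", S9 = "ss8", S1 = "ss8", S6 = "ss6")

# Hypothetical helper: mean of the ss8 samples minus mean of the ss6 samples.
group_mean_diff <- function(values, groups) {
  ss8 <- names(groups)[groups == "ss8"]
  ss6 <- names(groups)[groups == "ss6"]
  mean(values[ss8]) - mean(values[ss6])
}

# Row cg18792131 (reported diff: 0.5113503).
cpg <- c(S10 = 96.105811, S11 = 94.965889, S2 = 94.3893586, S5 = 97.222416,
         S7 = 91.741260, S12 = 96.100389, S9 = 97.402224, S1 = 98.120703,
         S6 = 97.371092)

# Row chr1,100476244:C:CA (reported diff: 25).
variant <- c(S10 = 0, S11 = 0, S2 = 50, S5 = 0, S7 = 0,
             S12 = 0, S9 = 50, S1 = 0, S6 = 0)

diff_cpg <- group_mean_diff(cpg, groups)
diff_variant <- group_mean_diff(variant, groups)
```

Both results agree with the reported values up to the rounding of the printed sample columns.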

Along with the previous table, a statistical analysis between two groups of samples can be performed in order to locate the main differences between them. The output is a table containing information about the TADs and the significance level of each one:

 tad_name count      mean       IQR        ttest     wilcoxon          FDR
     TAD1    26  3.320106  3.332445 5.905208e-01 3.924510e-01 1.000000e+00
    TAD10   170 16.505051 20.676735 1.369320e-06 2.367577e-08 1.562601e-06
    TAD11     1  1.687172  0.000000 2.910859e-01 2.857143e-01 1.000000e+00
    TAD12     6  1.585812  3.109927 9.942975e-01 9.167833e-01 1.000000e+00
   TAD126     7 20.000000  0.000000 6.245001e-03 1.277503e-02 2.810507e-01
   TAD127   444 16.019144 22.500000 6.713911e-03 2.275272e-05 7.508397e-04
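
A table of this shape can be sketched with base R as follows. The data are simulated and the TAD names are placeholders. Which values enter each test, which p-value is corrected, and which correction method is used are all assumptions here (Benjamini-Hochberg on the t-test p-values); none of this is stated on this page.

```r
set.seed(1)

# Simulated values for two hypothetical TADs; not real InterTADs output.
tad_values <- list(
  TAD_A = list(g1 = rnorm(20, mean = 10), g2 = rnorm(20, mean = 12)),
  TAD_B = list(g1 = rnorm(20, mean = 10), g2 = rnorm(20, mean = 10))
)

paired.data <- FALSE   # same meaning as the input flag: paired samples or not
FDR_criterion <- 0.05  # example threshold

res <- data.frame(
  tad_name = names(tad_values),
  ttest = sapply(tad_values, function(v) {
    t.test(v$g1, v$g2, paired = paired.data)$p.value
  }),
  wilcoxon = sapply(tad_values, function(v) {
    wilcox.test(v$g1, v$g2, paired = paired.data)$p.value
  }),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Assumption: Benjamini-Hochberg correction applied to the t-test p-values.
res$FDR <- p.adjust(res$ttest, method = "BH")

# TADs passing the user-defined criterion.
significant <- res$tad_name[res$FDR < FDR_criterion]
```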

3. Visualization

For the visualization of the results, the paths to the input meta data file and to the output tables generated by the TADiff part need to be provided.

The full list of inputs for the script is the following:

  • dir_name: Directory of input datasets
  • tad_folder: Folder name for reading InterTADs output tables
  • meta: Meta data file name used
  • image_folder_name: Directory name of output visualizations
  • diff_col: Group column to compare
  • my_colors: Colors for printing images
  • my_cex: Sizes for printing images

Once every input is provided, you can run the script by:

source("Visualization.R")
