This workflow identifies stably unmethylated regions in plant genomes using methylation data.
Running environment:
- The workflow was developed on macOS Catalina 10.15.7 with the Oracle Java Runtime Environment (JRE) v1.8. It can also be run on your preferred Linux distribution.
Required software and versions:
The example data used here are paired-end FASTQ files generated on the Illumina platform.
- R1 FASTQ file:
input/B73_chr1_subset_reads_1.fastq
- R2 FASTQ file:
input/B73_chr1_subset_reads_2.fastq
Each entry in a FASTQ file consists of 4 lines:
- A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL to FASTQ conversion software used.
- The sequence (the base calls; A, C, T, G and N).
- A separator, which is simply a plus (+) sign.
- The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.
The first entry of the input data:
@SRR8738272.153232
TGATTTGAAATTAAACGAATATGGAAATCGGTTTGAAGGTTTTGGAATCGAGTATAATTGGATTTACAAATGTGGTTTATGGGAATTTTTTTATGTGAAAGTTTTGATTCTGATGTATAATATTGA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG@
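The four-line record layout and the Phred+33 encoding described above can be checked with a few lines of Python. This is only a minimal sketch and not part of the workflow scripts: the function names `read_fastq` and `phred33_scores` and the toy record `read_1` are made up for illustration.

```python
def read_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        sep = next(it, None)
        qual = next(it, None)
        if qual is None:
            raise ValueError(f"truncated record: {header!r}")
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], seq, qual


def phred33_scores(quality):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality]


# Hypothetical toy record, not taken from the example data.
record = ["@read_1", "ACGTN", "+", "CGGF@"]
for name, seq, qual in read_fastq(record):
    print(name, seq, phred33_scores(qual))
```

With Phred+33, the characters `C`, `G`, `F` and `@` that appear in the entry above decode to quality scores 34, 38, 37 and 31. To read a real file, pass an open file handle, for example `read_fastq(open("input/B73_chr1_subset_reads_1.fastq"))`.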
Other input files are also required, such as:
- A reference genome.
- A file containing chromosome sizes. Each entry consists of two columns: the chromosome and the size of the chromosome.
Here is the example file:
maize_chr1_reference 20000
- A reference genome cytosine tile file.
The file contains 6 columns:
1) The chromosome name
2) Start of the 100bp tile
3) End of the 100bp tile
4) Number of CG sites in the 100bp tile
5) Number of CHG sites in the 100bp tile
6) Number of CHH sites in the 100bp tile
Here are the first 5 lines of the example tile file:
chr start end cg_sites chg_sites chh_sites
maize_chr1_reference 1 100 4 6 29
maize_chr1_reference 101 200 6 7 25
maize_chr1_reference 201 300 6 4 36
maize_chr1_reference 301 400 2 10 28
More example tile files can be found in the example_genomes folder inside the input folder. They will be provided through UQeSpace, and the DOI will be made available when the article is published.
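To make the two auxiliary file layouts concrete, the sketch below derives a chromosome size and a table of per-tile CG/CHG/CHH counts from a FASTA sequence. It is an illustration only, not the procedure used to build the provided files. It assumes that cytosines on both strands are counted, that each site is assigned to the tile containing the cytosine, and that a trailing partial tile is kept. Counts produced this way may therefore differ from the shipped tile files. The function names and the toy sequence `toy_chr` are hypothetical.

```python
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(lines):
    """Return {sequence name: sequence} from an iterable of FASTA lines."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line.upper())
    return {k: "".join(v) for k, v in seqs.items()}


def context(seq, i):
    """Classify the cytosine at seq[i] as CG, CHG or CHH (None if undetermined)."""
    nxt = seq[i + 1:i + 3]
    if len(nxt) >= 1 and nxt[0] == "G":
        return "CG"
    if len(nxt) == 2 and nxt[0] in "ACT":
        if nxt[1] == "G":
            return "CHG"
        if nxt[1] in "ACT":
            return "CHH"
    return None  # sequence end or ambiguous base (e.g. N)


def tile_counts(seq, tile=100):
    """Return rows of (start, end, CG, CHG, CHH) with 1-based inclusive coordinates."""
    seq = seq.upper()
    n = len(seq)
    rev = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    counts = defaultdict(Counter)
    # Assumption: cytosines on both strands are counted.
    for strand_seq, is_reverse in ((seq, False), (rev, True)):
        for j, base in enumerate(strand_seq):
            if base != "C":
                continue
            ctx = context(strand_seq, j)
            if ctx is None:
                continue
            pos = n - 1 - j if is_reverse else j  # forward-strand coordinate
            counts[pos // tile][ctx] += 1
    rows = []
    for t in range((n + tile - 1) // tile):
        c = counts[t]
        rows.append((t * tile + 1, min((t + 1) * tile, n), c["CG"], c["CHG"], c["CHH"]))
    return rows


# Hypothetical toy reference; a tile width of 4 keeps the output small.
fasta = [">toy_chr", "CGCA", "GCTT"]
for name, seq in read_fasta(fasta).items():
    print(name, len(seq))  # one chromosome sizes entry
    for row in tile_counts(seq, tile=4):
        print(name, *row)  # one tile file row
```

For a real genome, the chromosome sizes can also be taken from the first two columns of the index written by `samtools faidx`.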
- Note that you have to adjust the paths in the shell scripts to match your system before running them.
sh workflow/1_trim_reads.sh
sh workflow/2_map_reads.sh <samtools 0.1.18 path>
- The results can be converted to bigWig format, which can be visualized in IGV.
sh 3_visualize_results.sh <bedGraphToBigWig path>
sh 4_find_UMRs.sh
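The four steps above are meant to run in order, so a small driver can help stop at the first failing step. The following is a hedged sketch, not part of the workflow. It assumes all four scripts live in the `workflow` folder and that step 4 takes no arguments. The function names and the `/path/to/...` placeholders are hypothetical and must be replaced with real locations.

```python
import subprocess


def build_commands(samtools_path, bedgraph_to_bigwig_path, workflow_dir="workflow"):
    """Return the four workflow steps as argument lists, in execution order."""
    return [
        ["sh", f"{workflow_dir}/1_trim_reads.sh"],
        ["sh", f"{workflow_dir}/2_map_reads.sh", samtools_path],
        ["sh", f"{workflow_dir}/3_visualize_results.sh", bedgraph_to_bigwig_path],
        ["sh", f"{workflow_dir}/4_find_UMRs.sh"],  # assumption: no arguments
    ]


def run_workflow(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status,
            # so later steps do not run on incomplete output.
            subprocess.run(cmd, check=True)


# Placeholder tool paths; the default dry run only prints the commands.
run_workflow(build_commands("/path/to/samtools", "/path/to/bedGraphToBigWig"))
```

Calling `run_workflow(..., dry_run=False)` executes the steps for real.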
This workflow is free and open-source software, licensed under GPLv3.