by Anastasiia Hryhorzhevska
Special thanks to Linda: https://github.com/LindaDi
Brief introduction to DNAm array analysis
2. Quality control check for samples
6. Surrogate Variable Analysis
10. Methylation age estimation
On cluster:
- Installation
minfi
:
Sys.setenv(LC_CTYPE="en_US.UTF-8")
Sys.setenv(LC_ALL="en_US.UTF-8")
BiocManager::install("minfi")
## OR
install.packages("remotes")
remotes::install_github("hansenlab/minfi")
or
export LC_CTYPE="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"
-
For each CpG, there are two measurements: a methylated intensity (denoted by M) and an unmethylated intensity (denoted by U). These intensity values can be used to determine the proportion of methylation at each CpG locus. Methylation levels are commonly reported as either beta values (β = M/(M+U)) or M-values (M-value = log2(M/U)).
-
For practical purposes, a small offset, α, can be added to the denominator of the β value equation to avoid dividing by small values, which is the default behaviour of the getBeta function in minfi. The default value for α is 100. It may also be desirable to add a small offset to the numerator and denominator when calculating M-values to avoid dividing by zero in rare cases, however the default getM function in minfi does not do this.
-
Beta values and M-values are related through a logit transformation.
-
Beta values are generally preferable for describing the level of methylation at a locus or for graphical presentation because percentage methylation is easily interpretable. However, due to their distributional properties, M-values are more appropriate for statistical testing
The DNAm QC Roadmap
- iData data :
/binder/mgp/workspace/2020_DexStim_Array_Human/methylation/
- RData:
/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/
rgSet_dex
: raw data from the dex IDAT files; organized at the probe (not CpG locus) level and has two channels (Red and Green).
The following data are obtained from rgSet_dex:
-
Mset_original
: data organized by the CpG locus level, but not mapped to a genome and has two channels, Meth (methylated) and Unmeth (unmethylated); -
RatioSet_original
: data organized by the CpG locus level, but not mapped to a genome, and has at least one of two channels, Beta and/or M (logratio of Beta); -
RawBetas
: Beta values matrix, obtained from RatioSet; -
gRatioSet_original
: the same as RatioSet but mapped to a genome; -
pd_original
: phenotype data
To get rgSet and the rest of the data, run:
cd ~/01_prepare_data_formats/
chmod +x ./getRgSetFormats.sh
sbatch ./getRgSetFormats.sh
-
Calculate detection p-values. We can generate a detection p-value for every CpG in every sample, which is indicative of the quality of the signal. The method used by minfi to calculate detection p-values compares the total signal (M + U) for each probe to the background signal level, which is estimated from the negative control probes. Very small p-values are indicative of a reliable signal whilst large p-values, for example > 0.01, generally indicate a poor quality signal.
-
Remove poor quality samples with a mean detection p-value > 0.01:
- from rgSet
- from beta matrix
- from p-values table
- from targets table
Result:
Data
-
rgSet_qc.Rdata
-
RawBetas_qc.Rdata
-
detP_qc.Rdata
: p-values table;
Reports: detectionP.pdf
, qcReport.pdf
- Distribution artefacts
Result:
Reports: 03_beta_densities_report.pdf
Suspicious observations:
-
From array R01C01;
-
Partially from R03C01;
-
Sample 200712160065.
- Sex mismatches
By looking at the median total intensity of the X chromosome-mapped probes, denoted xMed, and the median total intensity of the Y-chromosome-mapped probes, denoted yMed, one can observe two different clusters of points corresponding to which gender the samples belong to. To predict the gender, minfi separates the points by using a cutoff on log2(yMed)−log2(xMed). The default cutoff is −2.
Result:
The result should be cheked with genotype data
Data: sex_predicted.Rdata
- Remove sampleID found at the steps 3 and 4
Result:
Data
-
rgSet_clean.Rdata
-
RawBetas_clean.Rdata
-
detP_clean.Rdata
-
pd_celan.Rdata
-
annotated_data_clean.Rdata
Beta values were normalized using stratified quantile normalization (Touleimat & Tost, 2012), followed by BMIQ (Teschendorff et al., 2013).
Result:
Data
Folder: /binder/mgp/datasets/2020_DexStim_Array_Human/methylation/rData
:
-
BMIQ.quantileN.Rdata
-
quantileN.Rdata
-
quantileNBetas.Rdata
-
quantileNMs.Rdat
-
BMIQ_quantileN.Rdata
Reports: BetaValue_Distributions_Norm_Quantile.pdf
The following probes in one or more samples were removed:
-
probes on X or Y chromosomes
-
probes containing SNPs
-
cross-hybridizing and polymorphic probes according to Chen et al. (2013)
-
and McCartney et al. (2016)
-
probes with a detection p-value > 0.01
Result:
Data
Folder: /binder/mgp/datasets/2020_DexStim_Array_Human/methylation/rData
-
quantileN_filtered.Rdata
-
BMIQ.quantileN_filtered.Rdata
-
Betas_quantileN_filtered.Rdata
-
Ms_quantileN_filtered.Rdata
Reports: BetaValue_Distributions_Norm_quantile_Filter.png
Beta values were transformed into M-values, and batch-effects were removed using Combat. For this, principal component analysis (PCA) was performed on the M-values and then checked which batches were most strongly associated with the principal components (PCs). The strongest batches for the respective data set were iteratively removed. First, the the correction was done for Plate. Then, PCA was performed on corrected data and p-values were computed. The results showed associaiton between PC2 and Array. Therefore, the correction was done for Array. And later, for Slide, since the was still strong association between PC2 and Slide. Corrected M-values were re-transformed into beta values. Attached pdf files shows the evaluation results after each step. Important to mention that the most variation was explained by DEX treatment.
Result:_
Data
Folder: /binder/mgp/datasets/2020_DexStim_Array_Human/methylation/rData
-
Betas_combated.Rdata
: normalized and batch-adjusted beta values df -
Betas_combated_ExprSet.Rdata
: expression set with normalized and batch-adjusted beta values
Reports: /binder/mgp/datasets/2020_DexStim_Array_Human/methylation/14_reports/05_batch_effects_correction
-
00_PCA-map_ANOVA-res_before_correction.pdf
: PCA Individual Map, Density Plot for each batch, Group (dex or veh) and Gender before any batch correction, ANOVA summary statistics -
01_PCA-map_ANOVA-res_PLATE_correction.pdf
: PCA Individual Map, Density Plot for each batch, Group (dex or veh) and Gender after correction on PLATE, ANOVA summary statistics -
01_PCA-map_ANOVA-res_ARRAY_correction.pdf
: PCA Individual Map, Density Plot for each batch, Group (dex or veh) and Gender after correction on ARRAY, ANOVA summary statistics -
01_PCA-map_ANOVA-res_SLIDE_correction.pdf
: PCA Individual Map, Density Plot for each batch, Group (dex or veh) and Gender after correction on SLIDE, ANOVA summary statistics -
04_BetaComBated_Distributions_Plot.pdf
: Normalized and batch corrected beta distribution plots
SVs were calculated using the R package sva with the following design:
mod0 <- model.matrix(~ 1 + individual_id, data = pheno)
mod <- model.matrix(~ individual_id + Sample_Group, data = pheno)
Returned 62 SVs.
More information for methylation mixupmapping:
Required files description can be found here:
Data
-
Methylation data :
/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/mixupmapper
-
SNPs:
/binder/mgp/datasets/2020_DexStim_Array_Human/snps/mixupmapper/
-
Imputed genotypes:
/binder/mgp/datasets/2020_DexStim_Array_Human/snps/Dex_genoData_SNPs.bed
-
Batch-adjusted beta values:
/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/rData/Betas_combated.Rdata
-
TRITYPER genotypes:
/binder/mgp/datasets/2020_DexStim_Array_Human/snps/mixupmapper/
-
Trait file: batch-adjusted beta values:
/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/mixupmapper/beta_combated_for_mixupmapper.txt
-
Genotype - phenotype coupling:
/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/mixupmapper/genotypemethylationcoupling.txt
Run:
- Convert genotypes .bed to TriTyper format:
GENOTYPES_BED_DIR="/binder/mgp/datasets/2020_DexStim_Array_Human/snps/Dex_genoData_SNPs"
GENOTYPES_TRITYPER_DIR="/binder/mgp/datasets/2020_DexStim_Array_Human/snps/mixupmapper"
wget https://github.com/molgenis/systemsgenetics/releases/download/1.4.0_20-8.1/GenotypeHarmonizer-1.4.23-dist.tar.gz
tar -xvf GenotypeHarmonizer-1.4.23-dist.tar.gz
cd GenotypeHarmonizer-1.4.23
java -jar GenotypeHarmonizer.jar -i $DIR_GENOTYPES_BED -I PLINK_BED -o $DIR_GENOTYPES_TRITYPER -O TRITYPER --update-id
- Beta values normalization
screen -S mixupmapper
BETA_VALUES_FILENAME=/binder/mgp/datasets/2020_DexStim_Array_Human/methylation//mixupmapper/betas_combat_veh_mixupmapper_removed_mixups.txt
MIXUPMAPPER_DATA_DIR=/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/mixupmapper/out_eqtl_normalization_removed_mixups
MIXUPMAPPER_DIR=/path/to/MixupMapper/eqtl-mapping-pipeline-1.2.4E-SNAPSHOT
# java -jar $MIXUPMAPPER_DIR/eqtl-mapping-pipeline.jar --mode normalize --in $BETAS_VALUE_FILENAME --out $MIXUPMAPPER_DATA_DIR --centerscale
srun --part=pe -c 8 --mem=200G java -jar $MIXUPMAPPER_DIR/eqtl-mapping-pipeline.jar --mode normalize --in $BETA_VALUES_FILENAME --out $MIXUPMAPPER_DATA_DIR --centerscale
- Run MixupMapper
GENOTYPES_TRITYPER_DIR=/binder/mgp/datasets/2020_DexStim_Array_Human/snps/mixupmapper
TRAIT_NORM_FILENAME=/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/mixupmapper/out_eqtl_normalization_removed_mixups/betas_combat_veh_mixupmapper_removed_mixups.ProbesCentered.SamplesZTransformed.txt.gz
ANNOTATION_FILENAME=/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/mixupmapper/annotation.txt
COUPLING_FILENAME=/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/mixupmapper/genotypemethylationcoupling_removed_mixups.txt
OUT_MIXUPMAPPER_DIR=/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/mixupmapper/out_mixupmapper_removed_mixups
MIXUPMAPPER_DIR=/path/to/MixupMapper/eqtl-mapping-pipeline-1.2.4E-SNAPSHOT
srun --job-name="eqtl_mapping_pl" --part=pe --mem=200G --output=$OUT_MIXUPMAPPER_DIR/eqtl_mixupmapper.out java -Xmx30g -Xms30g -jar $MIXUPMAPPER_DIR/eqtl-mapping-pipeline.jar --mode mixupmapper --in $GENOTYPES_TRITYPER_DIR --out $OUT_MIXUPMAPPER_DIR --inexp $TRAIT_NORM_FILENAME --inexpplatform EPIC --inexpannot $ANNOTATION_FILENAME --gte $COUPLING_FILENAME
Results:
Two mix-ups were found
Genotype | OriginalLinkedTrait | OriginalLinkedTraitScore | BestMatchingTrait | BestMatchingTraitScore | Mixup |
---|---|---|---|---|---|
MPIPSYKL_007875 | 200712160042_R01C01 | -4.9666422637773895 | 200720060022_R04C01 | -12.023177359249031 | true |
MPIPSYKL_007893 | 200720060022_R04C01 | -5.231460612404898 | 200712160042_R01C01 | -6.322773218809481 | true |
- Remove the individuas (both dex and veh) that we setermined as mix-ups
Results:
The results data are the final data. Please look at the Final result section to get location where the data are stored.
Input Data
dex_methyl_beta_combat_mtrx.rds
: final beta matrix after normalization, probes and samples filtering, batch correction and mix-ups removal (740'357 x 399)
Result
Folder: /binder/mgp/datasets/2020_DexStim_Array_Human/methylation/13_gaphunter
:
-
gaphunter RDSs
: R objectsdex_methyl_betas_mtrx_after_gap_outliers_na.rds
: beta matrix obtained from the beta combated matrix, dex_methyl_beta_combat_mtrx.rds, with all outliers detected as NAsdex_methyl_betas_mtrx_after_gap_extreme_outliers_na.rds
: beta matrix obtained from the beta combated matrix, dex_methyl_beta_combat_mtrx.rds, with additional NAs for extreme outliers
-
01_Gaphunter_CpGs_Summary_threshold_0.05.csv
: summary files for probes -
02_Gaphunter_Samples_Summary_threshold_0.05.csv
: summary files for samples
Biological findings in blood samples can often be confounded with cell type composition. In order to estimate the confounding levels between phenotype and cell type composition, we estimate the cell type composition of blood samples by using a modified version of the Houseman algorithm (E Andres Houseman et al. 2012). Based on FlowSorted.Blood.450k. The function takes as input a RGChannelSet and returns a cell counts vector for each samples.
Input Data
dex_methyl_beta_combat_mtrx.rds
: final beta matrix after normalization, probes and samples filtering, batch correction and mix-ups removal (740'357 x 399)
Result
Folder: /binder/mgp/datasets/2020_DexStim_Array_Human/methylation/11_cell_types_estimation
:
-
dex_stim_array_human_cellcounts.csv
: cell types estimation table -
dex_stim_cell_type_estimation_plot.pdf
: plots the average DNA methylation across the cell-type discrimating probes within the mixed and sorted samples
Methylation age can reflect a person’s biological age, which may be more related to the person’s health status than chronological age. Three different types of methylation age are estimated using methyAge: Horvath, Hannum and PhenoAge.
Input Data
dex_methyl_beta_combat_mtrx.rds
: final beta matrix after normalization, probes and samples filtering, batch correction and mix-ups removal (740'357 x 399)
Result
Folder: /binder/mgp/datasets/2020_DexStim_Array_Human/methylation/12_DNAm_age
:
-
dex_stim_array_human_meth_age.csv
: table which contains DNAm age predicitons ( Horvath, Hannum and PhenoAge) and chronological age -
DNAm_Age_and_Chronological_Age_Relation_plot.pdf
Path on cluster :
/binder/mgp/datasets/2020_DexStim_Array_Human/methylation/
and
/binder/common/methylation/qc_methylation/DexStim_EPIC_2020
-
10_final_qc_data
: R objectsdex_methyl_phenotype.rds
: final phenotype data after excluding poor qc and mix-ups samples (399 x 16)dex_methyl_detP.rds
: final p-values matrix after excluding poor qc and mix-ups samples but not probes (866'836 x 399)dex_methyl_bmiq_quantileN.rds
: final beta matrix after quantile + BMIQ normalization, samples filtering and mix-ups removal (865'859 x 399)dex_methyl_bmiq_beta_mtrx.rds
: final beta matrix after BMIQ normalization, samples filtering and mix-ups removal (862'805 x 399)dex_methyl_bmiq_quantileN_filtered.rds
: final beta matrix after normalization, probes and samples filtering and mix-ups removal (740'357 x 399)dex_methyl_beta_combat_mtrx.rds
: final beta matrix after normalization, probes and samples filtering, batch correction and mix-ups removal (740'357 x 399)dex_methyl_beta_combat_exprset.rds
: final beta expression set after normalization, probes and samples filtering, batch correction and mix-ups removal (740'357 x 399)dex_methyl_rgset_final.rds
: final RGChannel Set after removing poor qc samples and mix-ups, number of probes is the same as in initial (1'052'641 x 399)
-
11_cell_types_estimation
:dex_stim_array_human_cellcounts.csv
: cell types estimation tabledex_stim_cell_type_estimation_plot.pdf
: plots the average DNA methylation across the cell-type discrimating probes within the mixed and sorted samples
-
12_DNAm_age
:dex_stim_array_human_meth_age.csv
: table which contains DNAm age predicitons ( Horvath, Hannum and PhenoAge) and chronological ageDNAm_Age_and_Chronological_Age_Relation_plot.pdf
-
13_gaphunter
:gaphunter RDSs
: R objectsdex_methyl_betas_mtrx_after_gap_outliers_na.rds
: beta matrix obtained from the beta combated matrix, dex_methyl_beta_combat_mtrx.rds, with all outliers detected as NAsdex_methyl_betas_mtrx_after_gap_extreme_outliers_na.rds
: beta matrix obtained from the beta combated matrix, dex_methyl_beta_combat_mtrx.rds, with additional NAs for extreme outliers
-
14_reports
: Reports generated during processing the dataThe folder contains the set of folders with generated reports for particular step, e.g.:
05_batch_effects_correction
: folder contains all reports generated during batch effect correction step:00_PCA-map_ANOVA-res_before_correction.pdf
01_PCA-map_ANOVA-res_PLATE_correction.pdf
02_PCA-map_ANOVA-res_ARRAY_correction.pdf
03_PCA-map_ANOVA-res_SLIDE_correction.pdf
04_BetaComBated_Distributions_Plot.pdf
Notes:
R objects generated during analysis are stored in the folder on cluster:
/binder/mgp/workspace/2020_DexStim_Array_Human/methylation/rData