diff --git a/docs/404.html b/docs/404.html deleted file mode 100644 index 7cd223a..0000000 --- a/docs/404.html +++ /dev/null @@ -1,121 +0,0 @@ - - -
-MungeSumstats: Getting started
-vignettes/MungeSumstats.Rmd
The MungeSumstats package is designed to facilitate the -standardisation of GWAS summary statistics as utilised in our Nature -Genetics paper1.
-The package is designed to handle the lack of standardisation of -output files by the GWAS community. There is a group who have now -manually standardised many GWAS: R interface to the IEU GWAS -database API • ieugwasr and gwasvcf but because a lot -of GWAS remain closed access, these repositories are not all -encompassing.
-The GWAS-Download -project has collated summary statistics from 200+ GWAS. This -repository has been utilised to identify the most common formats, all of -which can be standardised with MungeSumstats.
-Moreover, there is an emerging standard of VCF format for summary -statistics files with multiple, useful, associated R packages such as -vcfR. However, there is currently no method to convert VCF -formats to a standardised format that matches older approaches.
-The MungeSumstats package standardises both VCF and the most -common summary statistic file formats to enable downstream integration -and analysis.
-MungeSumstats also offers comprehensive Quality Control (QC) -steps which are important prerequisites for downstream analysis like -Linkage disequilibrium score regression (LDSC) and MAGMA.
-Moreover, MungeSumstats is efficiently written resulting in
-all reformatting and quality control checks completing in minutes for
-GWAS summary statistics with 500k SNPs on a standard desktop machine.
-This speed can be increased further by increasing the number of threads
-(nThread) for data.table
to use.
Currently MungeSumstats only works on data from humans, as -it uses human-based genome references.
-MungeSumstats will ensure that all essential columns for -analysis are present and syntactically correct. Generally, summary -statistic files include (but are not limited to) the columns:
-MungeSumstats uses a mapping file to infer the inputted
-column names (run data("sumstatsColHeaders")
to view
-these). This mapping file is far more comprehensive than any other
-publicly available munging tool containing more than 200 unique mappings
-at the time of writing this vignette. However, if your column headers
-are missing or if you want to change the mapping, you can do so by
-passing your own mapping file (see
-format_sumstats(mapping_file)
).
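As a rough illustration of how a header mapping of this kind can work, the base R sketch below applies a small lookup table to a set of input headers. The toy table, its column names (`Uncorrected`, `Corrected`) and the helper `standardise_names()` are assumptions made for this example; they are not taken from the package's own mapping file or code.

```r
# Toy mapping table. The column names `Uncorrected` and `Corrected`
# are assumptions for this sketch, not a documented schema.
toy_mapping <- data.frame(
  Uncorrected = c("MARKERNAME", "RSID", "POS", "EAF", "PVAL"),
  Corrected   = c("SNP", "SNP", "BP", "FRQ", "P"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: upper-case each header, look it up in the mapping,
# and keep the upper-cased header when no mapping entry exists.
standardise_names <- function(headers, mapping) {
  key <- toupper(headers)
  idx <- match(key, mapping$Uncorrected)
  ifelse(is.na(idx), key, mapping$Corrected[idx])
}

input_headers <- c("MarkerName", "CHR", "POS", "A1", "A2",
                   "EAF", "Beta", "SE", "Pval")
new_headers <- standardise_names(input_headers, toy_mapping)
print(new_headers)
```

Headers without a mapping entry are only upper-cased here, so adding a row to the table is enough to teach the lookup a new alias.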
MungeSumstats offers unmatched levels of quality control to -ensure, for example, consistency of allele assignment and direction of -effects. Tests run by MungeSumstats include:
-Users can specify which checks to run on their data. A -note on the allele flipping check: -MungeSumstats assumes the effect allele will always be -the A2 allele; this is the approach taken for IEU GWAS -VCF and as such has also been adopted here. This inference is first made -from the inputted file’s column headers; however, the allele flipping -check verifies it by comparing A1, which should be the reference allele, -to the reference genome. If a SNP’s A1 DNA base doesn’t match the -reference genome but its A2 (which should be the alternative allele) -does, the alleles will be flipped along with the effect information -(e.g. Beta, Odds Ratio, signed summary statistics, FRQ, Z-score*).
-*-by default the Z-score is assumed to be calculated off the effect -size not the P-value and so will be flipped. This can be changed by a -user.
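The flipping rule described above can be sketched in base R on a toy table. This is an illustration only, not the package's implementation: the SNP IDs, the `ref_base` vector (standing in for a reference genome lookup) and the choice to flip FRQ as `1 - FRQ` are all assumptions made for the example.

```r
# Toy summary statistics (made-up values).
sumstats <- data.frame(
  SNP  = c("rs1", "rs2", "rs3"),
  A1   = c("A", "G", "T"),
  A2   = c("G", "A", "C"),
  BETA = c(0.02, -0.01, 0.03),
  FRQ  = c(0.30, 0.60, 0.10),
  stringsAsFactors = FALSE
)

# Hypothetical reference-genome base at each SNP position.
ref_base <- c("A", "A", "T")

# Flip only when A1 disagrees with the reference but A2 agrees.
flip <- sumstats$A1 != ref_base & sumstats$A2 == ref_base

# Swap the alleles for the flagged rows.
old_a1 <- sumstats$A1[flip]
sumstats$A1[flip] <- sumstats$A2[flip]
sumstats$A2[flip] <- old_a1

# Reverse the effect information so it still refers to A2.
sumstats$BETA[flip] <- -sumstats$BETA[flip]
sumstats$FRQ[flip]  <- 1 - sumstats$FRQ[flip]

print(sumstats)
```

Only the second row is flipped in this toy input; rows where neither allele matches the reference are left untouched by this sketch.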
-If a test is failed, the user will be notified and if possible, the
-input will be corrected. The QC steps from the checks above can also be
-adjusted to suit the user’s analysis, see
-MungeSumstats::format_sumstats
.
MungeSumstats can handle VCF, txt, tsv, csv file types or -.gz/.bgz versions of these file types. The package also gives the user -the flexibility to export the reformatted file as tab-delimited, VCF or -R native objects such as data.table, GRanges or VRanges objects. The -output can also be written in an LDSC-ready format, -which means the file can be fed directly into LDSC without the need for -additional munging. NOTE - If LDSC format is used, the -naming convention of A1 as the reference (genome build) allele and A2 as -the effect allele will be reversed to match LDSC (A1 will now be the -effect allele). See more info on this here. -Note that any effect columns (e.g. Z) will be in relation to A1 now -instead of A2.
-Please read carefully through our FAQ -Website to gain insight on how best to run MungeSumstats on your -data.
-The MungeSumstats package contains small subsets of GWAS -summary statistics files. Firstly, on Educational Attainment by Okbay et -al 2016: PMID: 27898078 PMCID: PMC5509058 DOI: 10.1038/ng1216-1587b.
-Secondly, a VCF file (VCFv4.2) relating to the GWAS Amyotrophic -lateral sclerosis from ieu open GWAS project. Dataset: ebi-a-GCST005647: -https://gwas.mrcieu.ac.uk/datasets/ebi-a-GCST005647/
-These datasets will be used to showcase MungeSumstats -functionality.
-MungeSumstats is available on Bioconductor. To install the -package from Bioconductor run the following lines of code:
-if (!require("BiocManager")) install.packages("BiocManager")
-BiocManager::install("MungeSumstats")
-Once installed, load the package:
-library(MungeSumstats)
-To standardise the summary statistics’ file format, simply call
-format_sumstats()
passing in the path to your summary
-statistics file or directly pass the summary statistics as a data.frame
-or data.table. You can specify which genome build was used in the
-GWAS (GRCh37 or GRCh38) or, by default, infer the genome build from the
-data. The reference genome is used for multiple checks like deriving
-missing data such as SNP/BP/CHR/A1/A2 and for QC steps like removing
-non-biallelic SNPs, strand-ambiguous SNPs or ensuring correct allele and
-direction of SNP effects. The path to the reformatted summary statistics
-file can be returned by the function call, the user can specify a
-location to save the file or the user can return an R native object for
-the data: data.table, VRanges or GRanges object.
Note that for a number of the checks employed by
-MungeSumstats a reference genome is used. If your GWAS summary
-statistics file of interest relates to GRCh38, you will need to
-install SNPlocs.Hsapiens.dbSNP155.GRCh38
and
-BSgenome.Hsapiens.NCBI.GRCh38
from Bioconductor as
-follows:
#increase permissible time to download data, in case of slow internet access
-options(timeout=2000)
-BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38")
-BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")
-If your GWAS summary statistics file of interest relates to
-GRCh37, you will need to install
-SNPlocs.Hsapiens.dbSNP155.GRCh37
and
-BSgenome.Hsapiens.1000genomes.hs37d5
from Bioconductor as
-follows:
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh37")
-BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
-These may take some time to install and are not included in the -package as some users may only need one of -GRCh37/GRCh38.
-The Educational Attainment by Okbay GWAS summary statistics file is -saved as a text document in the package’s external data folder so we can -just pass the file path to it straight to MungeSumstats.
-NOTE - By default, formatted results will be saved
-to tempdir()
. This means all formatted summary stats will
-be deleted upon ending the R session if not copied to a local file path.
-To keep formatted summary stats, change
-save_path
 (
-e.g. file.path('./formatted',basename(path))
), or make sure
-to copy files elsewhere after processing (
-e.g. file.copy(save_path, './formatted/')
).
-eduAttainOkbayPth <- system.file("extdata","eduAttainOkbay.txt",
- package="MungeSumstats")
-reformatted <-
- MungeSumstats::format_sumstats(path=eduAttainOkbayPth,
- ref_genome="GRCh37")
##
-##
-## ******::NOTE::******
-## - Formatted results will be saved to `tempdir()` by default.
-## - This means all formatted summary stats will be deleted upon ending the R session.
-## - To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
-## ********************
-## Formatted summary statistics will be saved to ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpJonkzo/filec4ec6d3b393.tsv.gz
-## Warning: replacing previous import 'utils::findMatches' by
-## 'S4Vectors::findMatches' when loading 'SNPlocs.Hsapiens.dbSNP155.GRCh37'
-## Importing tabular file: /private/var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T/RtmpKLvRpi/temp_libpath17f3d19176b21/MungeSumstats/extdata/eduAttainOkbay.txt
-## Checking for empty columns.
-## Infer Effect Column
-## First line of summary statistics file:
-## MarkerName CHR POS A1 A2 EAF Beta SE Pval
-## Allele columns are ambiguous, attempting to infer direction
-## Can't infer allele columns from sumstats
-## Standardising column headers.
-## First line of summary statistics file:
-## MarkerName CHR POS A1 A2 EAF Beta SE Pval
-## Summary statistics report:
-## - 93 rows
-## - 93 unique variants
-## - 70 genome-wide significant variants (P<5e-8)
-## - 20 chromosomes
-## Checking for multi-GWAS.
-## Checking for multiple RSIDs on one row.
-## Checking SNP RSIDs.
-## Checking for merged allele column.
-## Checking A1 is uppercase
-## Checking A2 is uppercase
-## Checking for incorrect base-pair positions
-## Checking for missing data.
-## Checking for duplicate columns.
-## Checking for duplicated rows.
-## INFO column not available. Skipping INFO score filtering step.
-## Filtering SNPs, ensuring SE>0.
-## Ensuring all SNPs have N<5 std dev above mean.
-## 47 SNPs (50.5%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
-## The FRQ column was mapped from one of the following from the inputted summary statistics file:
-## FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.B, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, ALL_AF
-## As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
-## set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
-## Sorting coordinates with 'data.table'.
-## Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpJonkzo/filec4ec6d3b393.tsv.gz
-## Summary statistics report:
-## - 93 rows (100% of original 93 rows)
-## - 93 unique variants
-## - 70 genome-wide significant variants (P<5e-8)
-## - 20 chromosomes
-## Done munging in 0.051 minutes.
-## Successfully finished preparing sumstats file, preview:
-## Reading header.
-## SNP CHR BP A1 A2 FRQ BETA SE P
-## <char> <int> <int> <char> <char> <num> <num> <num> <num>
-## 1: rs301800 1 8490603 T C 0.17910 0.019 0.003 1.794e-08
-## 2: rs11210860 1 43982527 A G 0.36940 0.017 0.003 2.359e-10
-## 3: rs34305371 1 72733610 A G 0.08769 0.035 0.005 3.762e-14
-## 4: rs2568955 1 72762169 T C 0.23690 -0.017 0.003 1.797e-08
-## Returning path to saved data.
-Here we know the summary statistics are based on the reference genome
-GRCh37; GRCh38 can also be inputted. Moreover, if you are unsure of the
-genome build, leave it as NULL
 and MungeSumstats will infer
-it from the data.
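Because results are written to `tempdir()` by default, it can help to copy a formatted file somewhere persistent before the session ends. The base R sketch below shows one way to do that; the helper name `keep_formatted()` and the demo file are hypothetical, and the demo copies into a sub-folder of `tempdir()` only so that it runs anywhere. In practice you would point `out_dir` at a persistent location such as `./formatted`.

```r
# Hypothetical helper: copy a formatted file into a directory of your choice
# and return the new path.
keep_formatted <- function(formatted_path, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(out_dir, basename(formatted_path))
  ok <- file.copy(formatted_path, dest, overwrite = TRUE)
  if (!ok) stop("Could not copy ", formatted_path, " to ", out_dir)
  dest
}

# Stand-in for a file that a formatting run left in tempdir().
demo_file <- tempfile(fileext = ".tsv")
writeLines("SNP\tCHR\tBP", demo_file)

# For the demo only: a sub-folder of tempdir(). Use a persistent
# directory (e.g. "./formatted") for real results.
demo_out <- file.path(tempdir(), "formatted_demo")
kept <- keep_formatted(demo_file, demo_out)
print(kept)
```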
Also note that the default dbSNP version used along with the
-reference genome is the latest version available on Bioconductor
-(currently dbSNP 155) but older versions are also available. Use the
-dbSNP
input parameter to control this.
The arguments in format_sumstats that control the level
-of QC conducted by MungeSumstats are:
p-values < 5e-324
be converted to 0? Small p-values pass
-the R limit and can cause errors with LDSC/MAGMA and should be
-converted. Default is TRUE.1-22, X, Y, MT
; the UCSC style is
-chr1-chr22, chrX, chrY, chrM
; and the dbSNP style is
-ch1-ch22, chX, chY, chMT
. Default is Ensembl.c("X", "Y", "MT")
which removes all non-autosomal
-SNPs.NULL
, all columns will be checked for missing values.
-Default columns are SNP, chromosome, position, allele 1, allele 2,
-effect columns (frequency, beta, Z-score, standard error, log odds,
-signed sumstats, odds ratio), p value and N columns.data.table
,
-GRanges
or VRanges
directly to user. Otherwise,
-return the path to the save data. Default is FALSE.data(sumstatsColHeaders)
for default mapping and necessary
-format.See ?MungeSumstats::format_sumstats()
for the full list
-of parameters to control MungeSumstats QC and standardisation steps.
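To make the chromosome naming styles mentioned above concrete, the base R sketch below converts UCSC-style (`chr1`, `chrM`) and dbSNP-style (`ch1`, `chMT`) labels to the Ensembl style (`1`, `MT`). The helper `to_ensembl_style()` is hypothetical and written only for this example; it is not how the package performs the conversion.

```r
# Hypothetical helper: normalise chromosome labels to Ensembl style.
to_ensembl_style <- function(chr) {
  chr <- as.character(chr)
  chr <- sub("^chr", "", chr, ignore.case = TRUE)  # UCSC prefix
  chr <- sub("^ch", "", chr, ignore.case = TRUE)   # dbSNP prefix
  chr <- toupper(chr)
  chr[chr == "M"] <- "MT"                          # UCSC names the mitochondrion chrM
  chr
}

labels <- c("chr1", "chrX", "chrM", "ch22", "chMT", "7")
converted <- to_ensembl_style(labels)
print(converted)
```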
VCF files can also be standardised to the same format as other -summary statistic files. A subset of the Amyotrophic lateral sclerosis -GWAS from the ieu open GWAS project (a .vcf file) has been added to -MungeSumstats to demonstrate this functionality. Simply pass the -path to the file in the same manner you would for other summary -statistic files:
-
-#save ALS GWAS from the ieu open GWAS project to a temp directory
-ALSvcfPth <- system.file("extdata","ALSvcf.vcf", package="MungeSumstats")
-reformatted_vcf <-
- MungeSumstats::format_sumstats(path=ALSvcfPth,
- ref_genome="GRCh37")
You can also get more information on the SNPs which have had data
-imputed or have been filtered out by MungeSumstats by using the
-imputation_ind
and log_folder_ind
parameters
-respectively. For example:
-#set
-reformatted_vcf_2 <-
- MungeSumstats::format_sumstats(path=ALSvcfPth,
- ref_genome="GRCh37",
- log_folder_ind=TRUE,
- imputation_ind=TRUE,
- log_mungesumstats_msgs=TRUE)
## Time difference of 0.1 secs
-## Time difference of 0.4 secs
-Check the file snp_bi_allelic.tsv.gz
in the
-log_folder
directory you supply (by default a temp
-directory), for a list of SNPs removed as they are non-bi-allelic. The
-text files containing the console output and messages are also stored in
-the same directory.
Note you can also control the dbSNP version used as a reference
-dataset by MungeSumstats using the dbSNP
parameter. By
-default this will be set to the most recent dbSNP version available
-(155).
Note that using log_folder_ind
returns a list from
-format_sumstats
which includes the file locations of the
-differing classes of removed SNPs. Using
-log_mungesumstats_msgs
 saves the console messages to
-a file which is returned in the same list. Note that some
-messages will no longer print to the screen when you set
-log_mungesumstats_msgs
:
-names(reformatted_vcf_2)
## [1] "sumstats" "log_files"
-A user can load a file to view the excluded SNPs.
-In this case, SNPs were filtered based on non-bi-allelic -criterion:
-
-print(reformatted_vcf_2$log_files$snp_bi_allelic)
## NULL
-The different types of exclusion which lead to the names are -explained below:
-Note to export to another type such as R native objects including
-data.table, GRanges, VRanges or save as a VCF file, set
-return_data=TRUE
and choose your
-return_format
:
-#set
-reformatted_vcf_2 <-
- MungeSumstats::format_sumstats(path=ALSvcfPth,
- ref_genome="GRCh37",
- log_folder_ind=TRUE,
- imputation_ind=TRUE,
- log_mungesumstats_msgs=TRUE,
- return_data=TRUE,
- return_format="GRanges")
Also you can now output a VCF compatible with IEU OpenGWAS format (Note that -currently all IEU OpenGWAS sumstats are GRCh37, MungeSumstats will throw -a warning if your data isn’t GRCh37 when saving):
-
-#set
-reformatted_vcf_2 <-
- MungeSumstats::format_sumstats(path=ALSvcfPth,
- ref_genome="GRCh37",
- write_vcf=TRUE,
- save_format ="openGWAS")
See our publication for further discussion of these checks and -options:
- -MungeSumstats also contains a function to quickly infer the
-genome build of multiple summary statistic files. This can be called
-separately to format_sumstats()
which is useful if you want
-to just quickly check the genome build:
-# Paths to the example Educational Attainment (Okbay) and ALS summary statistics files
-eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats")
-ALSvcfPth <- system.file("extdata","ALSvcf.vcf", package="MungeSumstats")
-sumstats_list <- list(ss1 = eduAttainOkbayPth, ss2 = ALSvcfPth)
-
-ref_genomes <- MungeSumstats::get_genome_builds(sumstats_list = sumstats_list)
MungeSumstats exposes the liftover()
function
-as a general utility for users.
Useful features include:
- Automatic standardisation of genome build names (i.e. “hg19”, “hg37”, and “GRCh37” will all be recognised as the same genome build).
- Ability to specify chrom_col as well as both start_col and end_col (for variants that span >1bp).
- Ability to return in data.table or GRanges format.
- Ability to specify which chromosome format (e.g. “chr1” vs. 1) to return GRanges as.
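The idea of treating several build names as one genome build can be sketched with a small lookup in base R. The helper `standardise_build()` and the canonical labels it returns are assumptions for this illustration; the package's own handling may differ.

```r
# Hypothetical helper: collapse common aliases onto one label per build.
# Unrecognised names are returned as NA rather than guessed.
standardise_build <- function(build) {
  key <- toupper(build)
  out <- rep(NA_character_, length(key))
  out[key %in% c("HG19", "HG37", "GRCH37")] <- "GRCH37"
  out[key %in% c("HG38", "GRCH38")] <- "GRCH38"
  out
}

builds <- standardise_build(c("hg19", "hg37", "GRCh37", "hg38", "mm10"))
print(builds)
```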
-sumstats_dt <- MungeSumstats::formatted_example()
## Standardising column headers.
-## First line of summary statistics file:
-## MarkerName CHR POS A1 A2 EAF Beta SE Pval
-## Sorting coordinates with 'data.table'.
-
-sumstats_dt_hg38 <- MungeSumstats::liftover(sumstats_dt = sumstats_dt,
- ref_genome = "hg19",
- convert_ref_genome = "hg38")
## Performing data liftover from hg19 to hg38.
-## Converting summary statistics to GenomicRanges.
-## Downloading chain file...
-## Downloading chain file from Ensembl.
-## /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpJonkzo/GRCh37_to_GRCh38.chain.gz
-## Reordering so first three column headers are SNP, CHR and BP in this order.
-## Reordering so the fourth and fifth columns are A1 and A2.
-
SNP | CHR | BP | A1 | A2 | FRQ | BETA | SE | P | IMPUTATION_gen_build
---|---|---|---|---|---|---|---|---|---
rs301800 | 1 | 8430543 | T | C | 0.17910 | 0.019 | 0.003 | 0e+00 | TRUE
rs11210860 | 1 | 43516856 | A | G | 0.36940 | 0.017 | 0.003 | 0e+00 | TRUE
rs34305371 | 1 | 72267927 | A | G | 0.08769 | 0.035 | 0.005 | 0e+00 | TRUE
rs2568955 | 1 | 72296486 | T | C | 0.23690 | -0.017 | 0.003 | 0e+00 | TRUE
rs1008078 | 1 | 90724174 | T | C | 0.37310 | -0.016 | 0.003 | 0e+00 | TRUE
rs61787263 | 1 | 98153158 | T | C | 0.76120 | 0.016 | 0.003 | 1e-07 | TRUE
In some cases, users may not want to run the full munging pipeline
-provided by MungeSumstats::format_sumstats
, but still would like to
-take advantage of the file type conversion and column header
-standardisation features. This will not be nearly as robust as the full
-pipeline, but can still be helpful.
To do this, simply run the following:
-
-eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats")
-formatted_path <- tempfile(fileext = "_eduAttainOkbay_standardised.tsv.gz")
-
-
-#### 1. Read in the data and standardise header names ####
-dat <- MungeSumstats::read_sumstats(path = eduAttainOkbayPth,
- standardise_headers = TRUE)
## Importing tabular file: /private/var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T/RtmpKLvRpi/temp_libpath17f3d19176b21/MungeSumstats/extdata/eduAttainOkbay.txt
-## Checking for empty columns.
-## Standardising column headers.
-## First line of summary statistics file:
-## MarkerName CHR POS A1 A2 EAF Beta SE Pval
-
SNP | CHR | BP | A1 | A2 | FRQ | BETA | SE | P
---|---|---|---|---|---|---|---|---
rs10061788 | 5 | 87934707 | A | G | 0.2164 | 0.021 | 0.004 | 0e+00
rs1007883 | 16 | 51163406 | T | C | 0.3713 | -0.015 | 0.003 | 1e-07
rs1008078 | 1 | 91189731 | T | C | 0.3731 | -0.016 | 0.003 | 0e+00
rs1043209 | 14 | 23373986 | A | G | 0.6026 | 0.018 | 0.003 | 0e+00
rs10496091 | 2 | 61482261 | A | G | 0.2705 | -0.018 | 0.003 | 0e+00
rs10930008 | 2 | 161854736 | A | G | 0.7183 | -0.016 | 0.003 | 1e-07
-#### 2. Write to disk as a compressed, tab-delimited, tabix-indexed file ####
-formatted_path <- MungeSumstats::write_sumstats(sumstats_dt = dat,
- save_path = formatted_path,
- tabix_index = TRUE,
- write_vcf = FALSE,
- return_path = TRUE)
## Sorting coordinates with 'data.table'.
-## Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpJonkzo/filec4ec45c572ed_eduAttainOkbay_standardised.tsv
-## Writing uncompressed instead of gzipped to enable tabix indexing.
-## Converting full summary stats file to tabix format for fast querying...
-## Reading header.
-## Ensuring file is bgzipped.
-## Tabix-indexing file.
-## Removing temporary .tsv file.
-data.table
-If you already have your data imported as a data.table
,
-you can also standardise its headers like so:
-#### Mess up some column names ####
-dat_raw <- data.table::copy(dat)
-data.table::setnames(dat_raw, c("SNP","CHR"), c("rsID","Seqnames"))
-#### Add a non-standard column that I want to keep the casing for ####
-dat_raw$Support <- runif(nrow(dat_raw))
-
-dat2 <- MungeSumstats::standardise_header(sumstats_dt = dat_raw,
- uppercase_unmapped = FALSE,
- return_list = FALSE )
## Standardising column headers.
-## First line of summary statistics file:
-## rsID Seqnames BP A1 A2 FRQ BETA SE P Support
-## Returning unmapped column names without making them uppercase.
-
SNP | CHR | BP | A1 | A2 | FRQ | BETA | SE | P | Support
---|---|---|---|---|---|---|---|---|---
rs301800 | 1 | 8490603 | T | C | 0.17910 | 0.019 | 0.003 | 0e+00 | 0.9805397
rs11210860 | 1 | 43982527 | A | G | 0.36940 | 0.017 | 0.003 | 0e+00 | 0.7415215
rs34305371 | 1 | 72733610 | A | G | 0.08769 | 0.035 | 0.005 | 0e+00 | 0.0514463
rs2568955 | 1 | 72762169 | T | C | 0.23690 | -0.017 | 0.003 | 0e+00 | 0.5302125
rs1008078 | 1 | 91189731 | T | C | 0.37310 | -0.016 | 0.003 | 0e+00 | 0.6958239
rs61787263 | 1 | 98618714 | T | C | 0.76120 | 0.016 | 0.003 | 1e-07 | 0.6885560
The MungeSumstats package aims to be able to handle the most -common summary statistic file formats including VCF. If your file cannot -be formatted by MungeSumstats feel free to report the bug -on github: https://github.com/neurogenomics/MungeSumstats along -with your summary statistic file header.
-We also encourage people to edit the code to resolve their particular -issues too and are happy to incorporate these through pull requests on -github. If your summary statistic file headers are not recognised by -MungeSumstats but correspond to one of:
-SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS,
-SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ
-feel free to update the
-MungeSumstats::sumstatsColHeaders
following the approach in
-the data.R file and add your mapping. Then use a pull request on github
-and we will incorporate this change into the package.
A note on MungeSumstats::sumstatsColHeaders
for summary
-statistic files with A0/A1. The mapping in
-MungeSumstats::sumstatsColHeaders
converts A0 to A*, this
-is a special case so that the code knows to map A0/A1 to A1/A2
-(ref/alt). The special case is needed since ordinarily A1 refers to the
-reference not the alternative allele.
A note on MungeSumstats::sumstatsColHeaders
for summary
-statistic files with Effect Size (ES). By default, MSS takes BETA to be
-any BETA-like value (including ES). This is coded into the mapping file
-- MungeSumstats::sumstatsColHeaders
. If this isn’t the case
-for your sumstats, you can set the es_is_beta
parameter in
-MungeSumstats::format_sumstats()
to FALSE to avoid this.
-Note this is done to try and capture most use cases of MSS.
See the Open -GWAS vignette for how MungeSumstats can be used along with data from -the MRC IEU Open GWAS Project and also MungeSumstats’ functionality to -handle lists of summary statistics files.
-## R version 4.3.0 (2023-04-21)
-## Platform: x86_64-apple-darwin20 (64-bit)
-## Running under: macOS 15.1.1
-##
-## Matrix products: default
-## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
-## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
-##
-## locale:
-## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
-##
-## time zone: Europe/London
-## tzcode source: internal
-##
-## attached base packages:
-## [1] stats graphics grDevices utils datasets methods base
-##
-## other attached packages:
-## [1] MungeSumstats_1.15.4 BiocStyle_2.30.0
-##
-## loaded via a namespace (and not attached):
-## [1] tidyselect_1.2.1
-## [2] dplyr_1.1.4
-## [3] blob_1.2.4
-## [4] filelock_1.0.3
-## [5] R.utils_2.12.3
-## [6] Biostrings_2.70.3
-## [7] bitops_1.0-9
-## [8] fastmap_1.2.0
-## [9] RCurl_1.98-1.16
-## [10] BiocFileCache_2.10.2
-## [11] VariantAnnotation_1.48.1
-## [12] GenomicAlignments_1.38.2
-## [13] XML_3.99-0.17
-## [14] digest_0.6.37
-## [15] lifecycle_1.0.4
-## [16] KEGGREST_1.42.0
-## [17] RSQLite_2.3.7
-## [18] magrittr_2.0.3
-## [19] compiler_4.3.0
-## [20] rlang_1.1.4
-## [21] sass_0.4.9
-## [22] progress_1.2.3
-## [23] tools_4.3.0
-## [24] utf8_1.2.4
-## [25] yaml_2.3.10
-## [26] data.table_1.16.0
-## [27] rtracklayer_1.62.0
-## [28] knitr_1.48
-## [29] prettyunits_1.2.0
-## [30] S4Arrays_1.2.1
-## [31] htmlwidgets_1.6.4
-## [32] curl_5.2.3
-## [33] bit_4.5.0
-## [34] DelayedArray_0.28.0
-## [35] ieugwasr_1.0.1
-## [36] xml2_1.3.6
-## [37] abind_1.4-8
-## [38] BiocParallel_1.36.0
-## [39] purrr_1.0.2
-## [40] BiocGenerics_0.48.1
-## [41] desc_1.4.3
-## [42] R.oo_1.26.0
-## [43] grid_4.3.0
-## [44] stats4_4.3.0
-## [45] fansi_1.0.6
-## [46] biomaRt_2.58.2
-## [47] SummarizedExperiment_1.32.0
-## [48] cli_3.6.3
-## [49] rmarkdown_2.28
-## [50] crayon_1.5.3
-## [51] generics_0.1.3
-## [52] ragg_1.3.1
-## [53] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
-## [54] rstudioapi_0.16.0
-## [55] httr_1.4.7
-## [56] rjson_0.2.23
-## [57] DBI_1.2.3
-## [58] cachem_1.1.0
-## [59] stringr_1.5.1
-## [60] zlibbioc_1.48.2
-## [61] parallel_4.3.0
-## [62] AnnotationDbi_1.64.1
-## [63] BiocManager_1.30.25
-## [64] XVector_0.42.0
-## [65] restfulr_0.0.15
-## [66] matrixStats_1.4.1
-## [67] vctrs_0.6.5
-## [68] Matrix_1.6-5
-## [69] jsonlite_1.8.9
-## [70] bookdown_0.39
-## [71] IRanges_2.36.0
-## [72] hms_1.1.3
-## [73] S4Vectors_0.40.2
-## [74] bit64_4.5.2
-## [75] GenomicFiles_1.38.0
-## [76] systemfonts_1.0.6
-## [77] GenomicFeatures_1.54.4
-## [78] jquerylib_0.1.4
-## [79] glue_1.8.0
-## [80] pkgdown_2.0.9
-## [81] codetools_0.2-20
-## [82] stringi_1.8.4
-## [83] GenomeInfoDb_1.38.8
-## [84] BiocIO_1.12.0
-## [85] GenomicRanges_1.54.1
-## [86] tibble_3.2.1
-## [87] pillar_1.9.0
-## [88] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.24
-## [89] rappdirs_0.3.3
-## [90] htmltools_0.5.8.1
-## [91] GenomeInfoDbData_1.2.11
-## [92] BSgenome_1.70.2
-## [93] dbplyr_2.5.0
-## [94] R6_2.5.1
-## [95] textshaping_0.3.7
-## [96] evaluate_1.0.0
-## [97] lattice_0.22-6
-## [98] Biobase_2.62.0
-## [99] R.methodsS3_1.8.2
-## [100] png_0.1-8
-## [101] Rsamtools_2.18.0
-## [102] memoise_2.0.1
-## [103] bslib_0.8.0
-## [104] SparseArray_1.2.4
-## [105] xfun_0.48
-## [106] fs_1.6.4
-## [107] MatrixGenerics_1.14.0
-## [108] pkgconfig_2.0.3
-vignettes/OpenGWAS.Rmd
- OpenGWAS.Rmd
MungeSumstats now offers high throughput query and import -functionality to data from the MRC IEU Open GWAS Project.
-This is made possible by the use of the IEU OpenGWAS R package:
-ieugwasr
.
Before you can use this functionality however, please complete the -following steps:
-To authenticate, you need to generate a token from the OpenGWAS
-website. The token behaves like a password, and it will be used to
-authorise the requests you make to the OpenGWAS API. Here are the steps
-to generate the token and then have ieugwasr
automatically
-use it for your queries:
OPENGWAS_JWT=<token>
 to your .Renviron file,
-this can be edited in R by running
-usethis::edit_r_environ()
-ieugwasr::get_opengwas_jwt()
. If it returns a long random
-string then you are authenticated.
-ieugwasr::user()
. It will make a request to the API for
-your user information using your token. It should return a list with
-your user information. If it returns an error, then your token is not
-working.
-We can search by terms and with other filters like sample size:
-
-#### Search for datasets ####
-metagwas <- MungeSumstats::find_sumstats(traits = c("parkinson","alzheimer"),
- min_sample_size = 1000)
-head(metagwas,3)
-ids <- (dplyr::arrange(metagwas, nsnp))$id
## id trait group_name year author
-## 1 ieu-a-298 Alzheimer's disease public 2013 Lambert
-## 2 ieu-b-2 Alzheimer's disease public 2019 Kunkle BW
-## 3 ieu-a-297 Alzheimer's disease public 2013 Lambert
-## consortium
-## 1 IGAP
-## 2 Alzheimer Disease Genetics Consortium (ADGC), European Alzheimer's Disease Initiative (EADI), Cohorts for Heart and Aging Research in Genomic Epidemiology Consortium (CHARGE), Genetic and Environmental Risk in AD/Defining Genetic, Polygenic and Environmental Risk for Alzheimer's Disease Consortium (GERAD/PERADES),
-## 3 IGAP
-## sex population unit nsnp sample_size build
-## 1 Males and Females European log odds 11633 74046 HG19/GRCh37
-## 2 Males and Females European NA 10528610 63926 HG19/GRCh37
-## 3 Males and Females European log odds 7055882 54162 HG19/GRCh37
-## category subcategory ontology mr priority pmid sd
-## 1 Disease Psychiatric / neurological NA 1 1 24162737 NA
-## 2 Binary Psychiatric / neurological NA 1 0 30820047 NA
-## 3 Disease Psychiatric / neurological NA 1 2 24162737 NA
-## note ncase
-## 1 Exposure only; Effect allele frequencies are missing; forward(+) strand 25580
-## 2 NA 21982
-## 3 Effect allele frequencies are missing; forward(+) strand 17008
-## ncontrol N
-## 1 48466 74046
-## 2 41944 63926
-## 3 37154 54162
-You can also search by ID:
-
-### By ID and sample size
-metagwas <- find_sumstats(
- ids = c("ieu-b-4760", "prot-a-1725", "prot-a-664"),
- min_sample_size = 5000
-)
You can supply import_sumstats()
with a list of as many
-OpenGWAS IDs as you want, but we’ll just give one to save time.
-datasets <- MungeSumstats::import_sumstats(ids = "ieu-a-298",
- ref_genome = "GRCH37")
By default, import_sumstats
 returns a named list where
-the names are the Open GWAS dataset IDs and the items are the respective
-paths to the formatted summary statistics.
-print(datasets)
## $`ieu-a-298`
-## [1] "/var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpIEX1aF/ieu-a-298.tsv.gz"
-You can easily turn this into a data.frame as well.
-
-results_df <- data.frame(id=names(datasets),
- path=unlist(datasets))
-print(results_df)
## id
-## ieu-a-298 ieu-a-298
-## path
-## ieu-a-298 /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpIEX1aF/ieu-a-298.tsv.gz
-Optional: Speed up with multi-threaded download via axel.
-
-datasets <- MungeSumstats::import_sumstats(ids = ids,
- vcf_download = TRUE,
- download_method = "axel",
- nThread = max(2,future::availableCores()-2))
See the Getting -started vignette for more information on how to use MungeSumstats -and its functionality.
-
-utils::sessionInfo()
## R version 4.3.0 (2023-04-21)
-## Platform: x86_64-apple-darwin20 (64-bit)
-## Running under: macOS 15.1.1
-##
-## Matrix products: default
-## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
-## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
-##
-## locale:
-## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
-##
-## time zone: Europe/London
-## tzcode source: internal
-##
-## attached base packages:
-## [1] stats graphics grDevices utils datasets methods base
-##
-## other attached packages:
-## [1] MungeSumstats_1.15.4 BiocStyle_2.30.0
-##
-## loaded via a namespace (and not attached):
-## [1] tidyselect_1.2.1 dplyr_1.1.4
-## [3] blob_1.2.4 filelock_1.0.3
-## [5] R.utils_2.12.3 Biostrings_2.70.3
-## [7] bitops_1.0-9 fastmap_1.2.0
-## [9] RCurl_1.98-1.16 BiocFileCache_2.10.2
-## [11] VariantAnnotation_1.48.1 GenomicAlignments_1.38.2
-## [13] XML_3.99-0.17 digest_0.6.37
-## [15] lifecycle_1.0.4 KEGGREST_1.42.0
-## [17] RSQLite_2.3.7 magrittr_2.0.3
-## [19] compiler_4.3.0 rlang_1.1.4
-## [21] sass_0.4.9 progress_1.2.3
-## [23] tools_4.3.0 utf8_1.2.4
-## [25] yaml_2.3.10 data.table_1.16.0
-## [27] rtracklayer_1.62.0 knitr_1.48
-## [29] prettyunits_1.2.0 S4Arrays_1.2.1
-## [31] htmlwidgets_1.6.4 curl_5.2.3
-## [33] bit_4.5.0 DelayedArray_0.28.0
-## [35] ieugwasr_1.0.1 xml2_1.3.6
-## [37] abind_1.4-8 BiocParallel_1.36.0
-## [39] purrr_1.0.2 BiocGenerics_0.48.1
-## [41] desc_1.4.3 R.oo_1.26.0
-## [43] grid_4.3.0 stats4_4.3.0
-## [45] fansi_1.0.6 biomaRt_2.58.2
-## [47] SummarizedExperiment_1.32.0 cli_3.6.3
-## [49] rmarkdown_2.28 crayon_1.5.3
-## [51] generics_0.1.3 ragg_1.3.1
-## [53] rstudioapi_0.16.0 httr_1.4.7
-## [55] rjson_0.2.23 DBI_1.2.3
-## [57] cachem_1.1.0 stringr_1.5.1
-## [59] zlibbioc_1.48.2 parallel_4.3.0
-## [61] AnnotationDbi_1.64.1 BiocManager_1.30.25
-## [63] XVector_0.42.0 restfulr_0.0.15
-## [65] matrixStats_1.4.1 vctrs_0.6.5
-## [67] Matrix_1.6-5 jsonlite_1.8.9
-## [69] bookdown_0.39 IRanges_2.36.0
-## [71] hms_1.1.3 S4Vectors_0.40.2
-## [73] bit64_4.5.2 systemfonts_1.0.6
-## [75] GenomicFeatures_1.54.4 jquerylib_0.1.4
-## [77] glue_1.8.0 pkgdown_2.0.9
-## [79] codetools_0.2-20 stringi_1.8.4
-## [81] GenomeInfoDb_1.38.8 BiocIO_1.12.0
-## [83] GenomicRanges_1.54.1 tibble_3.2.1
-## [85] pillar_1.9.0 rappdirs_0.3.3
-## [87] htmltools_0.5.8.1 GenomeInfoDbData_1.2.11
-## [89] BSgenome_1.70.2 dbplyr_2.5.0
-## [91] R6_2.5.1 textshaping_0.3.7
-## [93] evaluate_1.0.0 lattice_0.22-6
-## [95] Biobase_2.62.0 R.methodsS3_1.8.2
-## [97] png_0.1-8 Rsamtools_2.18.0
-## [99] memoise_2.0.1 bslib_0.8.0
-## [101] SparseArray_1.2.4 xfun_0.48
-## [103] fs_1.6.4 MatrixGenerics_1.14.0
-## [105] pkgconfig_2.0.3
-vignettes/docker.Rmd
- docker.Rmd
MungeSumstats is now available via ghcr.io -as a containerised environment with RStudio and all necessary -dependencies pre-installed.
-First, install -Docker if you have not already.
-Create an image of the Docker -container in command line:
-docker pull ghcr.io/neurogenomics/MungeSumstats
-Once the image has been created, you can launch it with:
-docker run \
- -d \
- -e ROOT=true \
- -e PASSWORD="<your_password>" \
- -v ~/Desktop:/Desktop \
- -v /Volumes:/Volumes \
- -p 8900:8787 \
- ghcr.io/neurogenomics/MungeSumstats
-<your_password>
above with
-whatever you want your password to be.-v
flags for your
-particular use case.-d
ensures the container will run in “detached”
-mode, which means it will persist even after you’ve closed your command
-line session.If you are using a system that does not allow Docker (as is the case -for many institutional computing clusters), you can instead install -Docker images via Singularity.
-singularity pull docker://ghcr.io/neurogenomics/MungeSumstats
-For troubleshooting, see the Singularity -documentation.
-Finally, launch the containerised RStudio by entering the following -URL in any web browser: http://localhost:8900/
-Log in using the credentials set during the Installation steps.
-
-utils::sessionInfo()
## R version 4.3.0 (2023-04-21)
-## Platform: x86_64-apple-darwin20 (64-bit)
-## Running under: macOS 15.1.1
-##
-## Matrix products: default
-## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
-## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
-##
-## locale:
-## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
-##
-## time zone: Europe/London
-## tzcode source: internal
-##
-## attached base packages:
-## [1] stats graphics grDevices utils datasets methods base
-##
-## other attached packages:
-## [1] MungeSumstats_1.15.4 BiocStyle_2.30.0
-##
-## loaded via a namespace (and not attached):
-## [1] tidyselect_1.2.1 dplyr_1.1.4
-## [3] blob_1.2.4 filelock_1.0.3
-## [5] R.utils_2.12.3 Biostrings_2.70.3
-## [7] bitops_1.0-9 fastmap_1.2.0
-## [9] RCurl_1.98-1.16 BiocFileCache_2.10.2
-## [11] VariantAnnotation_1.48.1 GenomicAlignments_1.38.2
-## [13] XML_3.99-0.17 digest_0.6.37
-## [15] lifecycle_1.0.4 KEGGREST_1.42.0
-## [17] RSQLite_2.3.7 magrittr_2.0.3
-## [19] compiler_4.3.0 rlang_1.1.4
-## [21] sass_0.4.9 progress_1.2.3
-## [23] tools_4.3.0 utf8_1.2.4
-## [25] yaml_2.3.10 data.table_1.16.0
-## [27] rtracklayer_1.62.0 knitr_1.48
-## [29] prettyunits_1.2.0 S4Arrays_1.2.1
-## [31] htmlwidgets_1.6.4 curl_5.2.3
-## [33] bit_4.5.0 DelayedArray_0.28.0
-## [35] ieugwasr_1.0.1 xml2_1.3.6
-## [37] abind_1.4-8 BiocParallel_1.36.0
-## [39] purrr_1.0.2 BiocGenerics_0.48.1
-## [41] desc_1.4.3 R.oo_1.26.0
-## [43] grid_4.3.0 stats4_4.3.0
-## [45] fansi_1.0.6 biomaRt_2.58.2
-## [47] SummarizedExperiment_1.32.0 cli_3.6.3
-## [49] rmarkdown_2.28 crayon_1.5.3
-## [51] generics_0.1.3 ragg_1.3.1
-## [53] rstudioapi_0.16.0 httr_1.4.7
-## [55] rjson_0.2.23 DBI_1.2.3
-## [57] cachem_1.1.0 stringr_1.5.1
-## [59] zlibbioc_1.48.2 parallel_4.3.0
-## [61] AnnotationDbi_1.64.1 BiocManager_1.30.25
-## [63] XVector_0.42.0 restfulr_0.0.15
-## [65] matrixStats_1.4.1 vctrs_0.6.5
-## [67] Matrix_1.6-5 jsonlite_1.8.9
-## [69] bookdown_0.39 IRanges_2.36.0
-## [71] hms_1.1.3 S4Vectors_0.40.2
-## [73] bit64_4.5.2 systemfonts_1.0.6
-## [75] GenomicFeatures_1.54.4 jquerylib_0.1.4
-## [77] glue_1.8.0 pkgdown_2.0.9
-## [79] codetools_0.2-20 stringi_1.8.4
-## [81] GenomeInfoDb_1.38.8 BiocIO_1.12.0
-## [83] GenomicRanges_1.54.1 tibble_3.2.1
-## [85] pillar_1.9.0 rappdirs_0.3.3
-## [87] htmltools_0.5.8.1 GenomeInfoDbData_1.2.11
-## [89] BSgenome_1.70.2 dbplyr_2.5.0
-## [91] R6_2.5.1 textshaping_0.3.7
-## [93] evaluate_1.0.0 lattice_0.22-6
-## [95] Biobase_2.62.0 R.methodsS3_1.8.2
-## [97] png_0.1-8 Rsamtools_2.18.0
-## [99] memoise_2.0.1 bslib_0.8.0
-## [101] SparseArray_1.2.4 xfun_0.48
-## [103] fs_1.6.4 MatrixGenerics_1.14.0
-## [105] pkgconfig_2.0.3
-The MungeSumstats
package is designed to facilitate the standardisation of GWAS summary statistics.
The package is designed to handle the lack of standardisation of output files by the GWAS community. The MRC IEU Open GWAS team have provided full summary statistics for >10k GWAS, which are API-accessible via the ieugwasr
and gwasvcf
packages. But these GWAS are only standardised in the sense that they are in VCF format, and can be fully standardised with MungeSumstats
.
MungeSumstats
provides a framework to standardise the format for any GWAS summary statistics, including those in VCF format, enabling downstream integration and analysis. It addresses the most common discrepancies across summary statistic files, and offers a range of adjustable Quality Control (QC) steps.
If you use MungeSumstats
, please cite the original authors of the GWAS as well as:
--Alan E Murphy, Brian M Schilder, Nathan G Skene (2021) MungeSumstats: A Bioconductor package for the standardisation and quality control of many GWAS summary statistics. Bioinformatics, btab665, https://doi.org/10.1093/bioinformatics/btab665
-
MungeSumstats
-
-MungeSumstats
is available on Bioconductor. To install MungeSumstats
from Bioconductor, run:
-if (!require("BiocManager")) install.packages("BiocManager")
-
-BiocManager::install("MungeSumstats")
You can then load the package and data package:
- -Note that there is also a docker image for MungeSumstats.
-Note that for a number of the checks employed by MungeSumstats
a reference genome is used. If your GWAS summary statistics file of interest relates to GRCh38, you will need to install SNPlocs.Hsapiens.dbSNP155.GRCh38
and BSgenome.Hsapiens.NCBI.GRCh38
from Bioconductor as follows:
-BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38")
-BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")
If your GWAS summary statistics file of interest relates to GRCh37, you will need to install SNPlocs.Hsapiens.dbSNP155.GRCh37
and BSgenome.Hsapiens.1000genomes.hs37d5
from Bioconductor as follows:
-BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh37")
-BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
These may take some time to install and are not included in the package as some users may only need one of GRCh37/GRCh38. If you are unsure of the genome build, MungeSumstats can also infer this information from your data.
-See the Getting started vignette website for up-to-date instructions on usage.
-See the OpenGWAS vignette website for information on how to use MungeSumstats to access, standardise and perform quality control on GWAS Summary Statistics from the MRC IEU Open GWAS Project.
-Please read carefully through the FAQ website for any queries about running MungeSumstats. If you have any problems outside of this, please do file an Issue here on GitHub.
-The MungeSumstats
package aims to be able to handle the most common summary statistic file formats including VCF. If your file can not be formatted by MungeSumstats
feel free to report the Issue on GitHub along with your summary statistics file header.
We also encourage people to edit the code to resolve their particular issues and are happy to incorporate these through pull requests on GitHub. If your summary statistic file headers are not recognised by MungeSumstats
but correspond to one of
SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT, N, N_CAS, N_CON,
-NSTUDY, INFO or FRQ,
feel free to update the data("sumstatsColHeaders")
following the approach in the data.R file and add your mapping. Then use a Pull Request on GitHub and we will incorporate this change into the package.
NEWS.md
- *Updated retrieval of IEU OpenGWAS to new approach requiring login. Also updated to use the IEU OpenGWAS R package as a dependency.
-*FAQ Website updated.
-*FAQ Website added.
-eff_on_minor_alleles
parameter added (off by default) - controls whether MungeSumstats should assume that the effects are predominantly measured on the minor alleles. Default is FALSE as this is an assumption that won’t be appropriate in all cases. However, the benefit is that if we know the majority of SNPs have their effects based on the minor alleles, we can catch cases where the allele columns have been mislabelled.local_chain
in format_sumstats()
and liftover()
).drop_na_cols
in format_sumstats()
). By default, SNP, effect columns and P/N columns are checked. Set to Null to check all columns or choose specific columns.check_no_rs_snp()
check with imputation_ind=TRUE
.get_genome_builds()
to help with RAM & CPU usage during unit tests. No change in functionality for end user.make_ordered
from sort_coords()
-rmv_chrPrefix
parameter in format_sumstats()
has been replaced with the new chr_style
parameter, which allows users to specify their desired chromosome name style. The supported chromosome styles are “NCBI”, “UCSC”, “dbSNP”, and “Ensembl” with “Ensembl” being the default.check_chr()
now automatically removes all SNPs with nonstandard CHR entries (anything other than 1-22, X, Y, and MT in the Ensembl naming style).write_sumstats
:
-NULL
to ref_genome
.ref_genome
(only in conditions where its used).sort_coord
:
-sort_methods
, including improved/more robust data.table
-native method.test-index_tabular.R
.check_numeric
:
-sort_coord
, read_header
-run_biocheck
-sed -E
rather than sed -r
as it's compatible with macOS, which has issues with sed -r
-log_folder
parameter in format_sumstats()
has been updated. It is still used to point to the directory for the log files and the log of MungeSumstats messages to be stored. And the default is still a temporary directory. However, now the name of the log files (log messages and log outputs) are the same as the name of the file specified in the save_path
parameter with the extension ’_log_msg.txt’ and ’_log_output.txt’ respectively.data.table::fread()
leaves NAs blank instead of including a literal NA. That’s fine for CSVs and if the output is read in by fread, but it breaks other tools for TSVs and is hard to read. Updated that and added a message when the table is switched to uncompressed for indexing.read_header
:
-n=NULL
.seqminer
from all code (too buggy).import_sumstats
:
-@inheritDotParams format_sumstats
for better documentation.parse_logs
: Added new fields.format_sumstats
: Added time report at the end (minutes taken total). Since this is a message, will be included in the logs, and is now parsed by parse_logs
and put into the column “time”.find_sumstats()
:
-vcf2df
.
-read_vcf
can now be parallelised: splits query into chunks, imports them, and (optionally) converts them to data.table
before rbinding them back into one object.
-mt_thresh
to avoid using parallelisation when VCFs are small, due to the overhead outweighing the benefits in these cases.tryCatch
to downloader
with different download.file
parameters that may work better on certain machines.file.path
to specify URL in:
-get_chain_file
import_sumstats
download_vcf
to pass URLs directly (without downloading the files) when vcf_download=FALSE
.download_vcf
:
-load_ref_genome_data
:
-read_vcf_genome
: more robust way to get genome build from VCF.read_sumstats
: Speed up by using remove_empty_cols(sampled_rows=)
, and only run for tabular file (read_vcf
already does this internally).select_vcf_field
: Got rid of “REF col doesn’t exists” warning by omitting rowRanges
.vignettes/MungeSumstats.Rmd
were surrounded by ticks.vcf2df
: Accounted for scenarios where writeVcf
accidentally converts geno
data into redundant 3D matrices.
-data.table::rbindlist(fill=TRUE)
to bind chunks back together.read_vcf
upgrades:
-infer_vcf_sample_ids
is_vcf_parsed
check_tab_delimited
read_vcf_data
remove_nonstandard_vcf_cols
dt_to_granges
by merging functionality into to_granges
.
-liftover
to accommodate the slight change.is_tabix
(I had incorrectly made path
all lowercase).index_vcf
recognize all compressed vcf suffixes.
-BiocParallel
registered threads back to 1 after read_vcf_parallel
finishes, to avoid potential conflicts with downstream steps.find_sumstats
output to keep track of search parameters.import_sumstats
:
-save_path
) exists before downloading to save time.force_new
in addition to force_new_vcf
.MungeSumstats
.read_vcf
to be more robust.IRanges
to Imports.stringr
(no longer used)is_tabix
to check whether a file is already tabix-indexed.read_sumstats
:
-samples
as an arg.GenomicFiles
.read_sumstats
: now takes samples
as an arg.INFO_filter=
from ALS VCF examples in vignettes (no longer necessary now that INFO parsing has been corrected).download_vcf
can now handle situations where vcf_url=
is actually a local file (not remote).check_info_score
step.check_info_score
:
-log_files$info_filter
in these instances.check_empty_cols
was accidentally dropping more columns than it should have.write_sumstats
when indexing VCF.read_sumstats
can read in any VCF files (local/remote, indexed/non-indexed).test-vcf_formatting.R
-test-check_impute_se_beta
-setkey
on SNP (now automatically renamed from ID by read_vcf
).test-read_sumstats
:
-read_sumstats
.vcf_ss
are dropped.parse_logs
: Add lines to parsing subfunctions to allow handling of logs that don’t contain certain info (thus avoid warnings when creating the final data.table).check_pos_se
check_signed_col
Rsamtools::bgzip
does compression in Bioc 3.15. Switched to using fread + readLines
in:
-read_header
read_sumstats
read_header
: wasn’t reading in enough lines to get past the VCF header. Increase to readLines(n=1000)
.read_vcf
: Would sometimes induce duplicate rows. Now only unique rows are used (after sample and columns filtering).liftover
-GenomeInfoDb::mapGenomeBuilds
to standardise build names.standardise_sumstats_column_headers_crossplatform
-standardise_header
while keeping the original function name as an internal function (they call the same code).vignette -
liftover` tutorial
-compute_nsize
standardise_sumstats_column_headers_crossplatform
formatted_example
standardise_sumstats_column_headers_crossplatform
: Added arg uppercase_unmapped
to allow users to specify whether they want to make the columns that could not be mapped to a standard name uppercase (default=TRUE
for backcompatibility). Added arg return_list
to specify whether to return a named list (default) or just the data.table
.formatted_example
: Added args formatted
to specify whether the file should have its colnames standardised. Added args sorted
to specify whether the file should sort the data by coordinates. Added arg return_list
to specify whether to return a named list (default) or just the data.table
..datatable.aware=TRUE
to .zzz as extra precaution.vcf2df
: Documented arguments.import_sumstats
: Create individual folders for each GWAS dataset, with a respective logs
subfolder to avoid overwriting log files when processing multiple GWAS.parse_logs
: New function to convert logs from one or more munged GWAS into a data.table
.list_sumstats
: New function to recursively search for local summary stats files previously munged with MungeSumstats
.inst/extdata/MungeSumstats_log_msg.txt
to test logs files.list_sumstats
and parse_logs
.gh-pages
branch automatically by new GHA workflow.convert_large_p
and convert_neg_p
, respectively. These are both handled by the new internal function check_range_p_val
, which also reports the number of SNPs found meeting these criteria to the console/logs.check_small_p_val
records which SNPs were imputed in a more robust way, by recording which SNPs met the criteria before making the changes (as opposed to inferring this info from which columns are 0 after making the changes). This function now only handles non-negative p-values, so that rows with negative p-values can be recorded/reported separately in the check_range_p_val
step.check_small_p_val
now reports the number of SNPs <= 5e-324 to console/logs.check_range_p_val
and check_small_p_val
.parse_logs
can now extract information reported by check_range_p_val
and check_small_p_val
.logs_example
provides easy access to log file stored in inst/extdata, and includes documentation on how it was created.check_range_p_val
and check_small_p_val
now use #' @inheritParams format_sumstats
to improve consistency of documentation.suppressWarnings
where possible.validate_parameters
can now handle ref_genome=NULL
to_GRanges
/to_GRanges
functions to all-lowercase functions (for consistency with other functions).nThread=1
in data.table
test functions.get_genome_builds
save_path
is in was actually created (as opposed to finding out at the very end of the pipeline).read_header
and read_sumstats
now both work with .bgz files.format_sumstats(FRQ_filter)
added so SNPs can now be filtered by allele frequencyformat_sumstats(frq_is_maf)
check added to infer if FRQ column values are minor/effect allele frequencies or not. frq_is_maf allows users to rename the FRQ column as MAJOR_ALLELE_FRQ if some values appear to be major allele frequenciesget_genome_builds()
can now be called to quickly get the genome build without running the whole reformatting.format_sumstats(compute_n)
now has more methods to compute the effective sample size with “ldsc”, “sum”, “giant” or “metal”.format_sumstats(convert_ref_genome)
now implemented which can perform liftover to GRCh38 from GRCh37 and vice-versa enabling better cohesion between different studies’ summary statistics.check_no_rs_snp
can now handle extra information after an RS ID. So if you have rs1234:A:G
that will be separated into two columns.check_two_step_col
and check_four_step_col
, the two checks for when multiple columns are in one, have been updated so if not all SNPs have multiple columns or some have more than the expected number, this can now be handled.FRQ
column have been added to the mapping filecheck_multi_rs_snp
can now handle all punctuation with/without spaces. So if a row contains rs1234,rs5678
or rs1234, rs5678
or any other punctuation character other than ,
these can be handled.format_sumstats(path)
can now be passed a dataframe/datatable of the summary statistics directly as well as a path to their saved location.A0/A1
corresponding to ref/alt can now be handled by the mapping file as well as A1/A2
corresponding to ref/alt.import_sumstats
reads GWAS sum stats directly from Open GWAS. Now parallelised and reports how long each dataset took to import/format in total.find_sumstats
searches Open GWAS for datasets.compute_z
computes Z-score from P.compute_n
computes N for all SNPs from user-defined sample size.format_sumstats(ldsc_format=TRUE)
ensures sum stats can be fed directly into LDSC without any additional munging.read_sumstats
, write_sumstats
, and download_vcf
functions now exported.format_sumstats(sort_coordinates=TRUE)
sorts results by their genomic coordinates.format_sumstats(return_data=TRUE)
returns data directly to user. Can be returned in either data.table
(default), GRanges
or VRanges
format using format_sumstats(return_format="granges")
.format_sumstats(N_dropNA=TRUE)
(default) drops rows where N is missing.format_sumstats(snp_ids_are_rs_ids=TRUE)
(default) Should the SNP IDs inputted be inferred as RS IDs or some arbitrary ID.format_sumstats(write_vcf=TRUE)
writes a tabix-indexed VCF file instead of tabular format.format_sumstats(save_path=...)
lets users decide where their results are saved and what they’re named.save_path
indicates it’s in tempdir()
, message warns users that these files will be deleted when R session ends.format_sumstats
via report_summary()
.preview_sumstats()
messages improved.format_sumstats(pos_se=TRUE,effect_columns_nonzero=TRUE)
-format_sumstats(log_folder_ind=TRUE,log_folder=tempdir())
-format_sumstats(imputation_ind=TRUE)
-data(sumstatsColHeaders)
. See format_sumstats(mapping_file = mapping_file)
.read_vcf
upgraded to account for more VCF formats.check_n_num
now accounts for situations where N is a character vector and converts to numeric.Efficiently convert DataFrame to -data.table.
-DF_to_dt(DF)
DataFrame object.
VCF data in data.table format.
-R wrapper for axel, which enables multi-threaded download -of a single large file.
-axel(
- input_url,
- output_path,
- background = FALSE,
- nThread = 1,
- force_overwrite = FALSE,
- quiet = TRUE,
- alternate = TRUE,
- check_certificates = FALSE
-)
input_url.
output_path.
Run in background
Number of threads to parallelize over.
Overwrite existing file.
Run quietly.
alternate,
check_certificates
Path where the file has been downloaded
-https://github.com/axel-download-accelerator/axel/
-Other downloaders:
-downloader()
R/check_allele_flip.R
- check_allele_flip.Rd
Ensure A1 & A2 are correctly named, if GWAS SNP constructed as -Alternative/Reference or Risk/Nonrisk alleles these SNPs will need to be -converted to Reference/Alternative or Nonrisk/Risk. Here non-risk is defined -as what's on the reference genome (this may not always be the case).
-check_allele_flip(
- sumstats_dt,
- path,
- ref_genome,
- rsids,
- allele_flip_check,
- allele_flip_drop,
- allele_flip_z,
- allele_flip_frq,
- bi_allelic_filter,
- flip_frq_as_biallelic,
- imputation_ind,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- standardise_headers = FALSE,
- mapping_file,
- dbSNP
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary Should the allele columns be checked against -reference genome to infer if flipping is necessary. Default is TRUE.
Binary Should the SNPs for which neither their A1 or -A2 base pair values match a reference genome be dropped. Default is TRUE.
Binary should the Z-score be flipped along with effect -and FRQ columns like Beta? It is assumed to be calculated off the effect size -not the P-value and so will be flipped i.e. default TRUE.
Binary should the frequency (FRQ) column be flipped -along with effect and z-score columns like Beta? Default TRUE.
Binary Should non-bi-allelic SNPs be removed. -Default is TRUE.
Binary Should non-bi-allelic SNPs frequency -values be flipped as 1-p despite there being other alternative alleles? -Default is FALSE but if set to TRUE, this allows non-bi-allelic SNPs to be -kept despite needing flipping.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). On the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
Run
-standardise_sumstats_column_headers_crossplatform
first.
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing or the mapping we -give is incorrect you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
version of dbSNP to be used for imputation (144 or 155).
A list containing two data tables:
sumstats_dt
: the modified summary statistics
-data.table
object.
rsids
: snpsById, filtered to SNPs of interest if
-loaded already. Or else NULL.
log_files
: log file list
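The flipping described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy data, the `REF` column and the `needs_flip` helper are hypothetical, and it assumes A1 is meant to hold the reference allele, with effect and frequency flipped as `-BETA` and `1 - FRQ`.

```r
# Hypothetical toy data: A1/A2 as reported, REF is the reference-genome allele.
sumstats <- data.frame(
  SNP  = c("rs1", "rs2", "rs3"),
  A1   = c("A", "G", "C"),
  A2   = c("G", "A", "T"),
  BETA = c(0.10, -0.20, 0.05),
  FRQ  = c(0.30, 0.80, 0.45),
  REF  = c("A", "A", "C"),
  stringsAsFactors = FALSE
)

# A row needs flipping when the reference allele sits in A2 rather than A1.
needs_flip <- sumstats$A1 != sumstats$REF & sumstats$A2 == sumstats$REF

# Swap the allele labels, negate the effect and take 1 - p for the frequency.
old_a1 <- sumstats$A1
sumstats$A1[needs_flip]   <- sumstats$A2[needs_flip]
sumstats$A2[needs_flip]   <- old_a1[needs_flip]
sumstats$BETA[needs_flip] <- -sumstats$BETA[needs_flip]
sumstats$FRQ[needs_flip]  <- 1 - sumstats$FRQ[needs_flip]

# Record which rows were flipped, similar in spirit to imputation_ind.
sumstats$flipped <- needs_flip
```

Rows where neither allele matches the reference are left untouched here; in the package, such rows are governed by allele_flip_drop.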
R/check_allele_merge.R
- check_allele_merge.Rd
Ensure that A1:A2 or A1/A2 or A1>A2 or A2>A1 aren't merged into 1 column
-check_allele_merge(sumstats_dt, path)
data table obj of the summary statistics file for the GWAS
Filepath for the summary statistics file to be formatted
list containing sumstats_dt, the modified summary -statistics data table object.
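As a rough illustration of what un-merging an allele column involves, the base R sketch below splits values such as `A:G`, `C/T` or `G>A` into two columns. The input vector and the `split_alleles` name are hypothetical, and the sketch assumes the first allele listed is A1, which the package may resolve differently (for example for the A2>A1 case).

```r
# Hypothetical merged allele column using the separators named above.
alleles <- c("A:G", "C/T", "G>A")

# Split on any of ':', '/' or '>' and take the two pieces.
parts <- strsplit(alleles, "[:/>]")
split_alleles <- data.frame(
  A1 = vapply(parts, `[`, character(1), 1),
  A2 = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
```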
-Remove non-biallelic SNPs
-check_bi_allelic(
- sumstats_dt,
- path,
- ref_genome,
- bi_allelic_filter,
- rsids,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- dbSNP
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary Should non-bi-allelic SNPs be removed. -Default is TRUE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
version of dbSNP to be used for imputation (144 or 155).
A list containing two data tables:
sumstats_dt
: the modified summary statistics data table object
rsids
: snpsById, filtered to SNPs of interest
-if loaded already. Or else NULL
.
log_files
: log file list
R/check_bp_range.R
- check_bp_range.Rd
Ensure that the Base-pair column values are all within the range for the -chromosome
-check_bp_range(
- sumstats_dt,
- path,
- ref_genome,
- log_folder_ind,
- imputation_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). On the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and the log file list
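The range check can be pictured with the following base R sketch. The chromosome lengths are toy values chosen for illustration, not real genome build lengths, and `chr_len` and `in_range` are hypothetical names; the package itself takes the lengths from the reference genome.

```r
# Toy chromosome lengths (NOT real build lengths), keyed by chromosome name.
chr_len <- c("1" = 1000, "2" = 800)

sumstats <- data.frame(
  SNP = c("rs1", "rs2", "rs3"),
  CHR = c("1", "2", "2"),
  BP  = c(500, 900, 0),
  stringsAsFactors = FALSE
)

# Look up each row's chromosome length, then keep positions within 1..length.
max_bp   <- unname(chr_len[sumstats$CHR])
in_range <- sumstats$BP >= 1 & sumstats$BP <= max_bp

kept    <- sumstats[in_range, , drop = FALSE]
dropped <- sumstats[!in_range, , drop = FALSE]  # candidates for a log file
```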
-Maps chromosome names to the default Ensembl/NCBI naming style and removes -SNPs with nonstandard CHR entries. Optionally, also removes SNPs on -user-specified chromosomes.
-check_chr(
- sumstats_dt,
- log_files,
- check_save_out,
- rmv_chr,
- nThread,
- tabix_index,
- log_folder_ind
-)
data.table with summary statistics
list of locations for all log files
list of parameters for saved files
Chromosomes to exclude from the formatted summary statistics
-file. Use NULL if no filtering is necessary. Default is c("X", "Y", "MT")
-which removes all non-autosomal SNPs.
Number of threads to use for parallel processes.
Index the formatted summary statistics with -tabix for fast querying.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
list containing the updated summary statistics data.table and the -updated log file locations list
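A minimal base R sketch of the chromosome clean-up described above, under stated assumptions: the input vector is hypothetical, the mapping is reduced to stripping a `chr` prefix and renaming `M` to `MT`, and the package's own mapping may cover more naming styles.

```r
# Hypothetical CHR values in mixed naming styles.
chr <- c("chr1", "CHR2", "X", "chrM", "1_random", "MT")

# Map to the Ensembl/NCBI style: no "chr" prefix, mitochondrial as "MT".
clean <- sub("^chr", "", chr, ignore.case = TRUE)
clean[clean == "M"] <- "MT"

# Drop anything other than 1-22, X, Y and MT.
standard <- c(as.character(1:22), "X", "Y", "MT")
keep <- clean %in% standard

# Optionally also drop user-specified chromosomes (here the rmv_chr default).
rmv_chr <- c("X", "Y", "MT")
autosomal <- clean[keep & !(clean %in% rmv_chr)]
```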
-R/check_col_order.R
- check_col_order.Rd
Ensure that the first three columns are SNP, CHR, BP in that order and -then A1, A2 if present
-check_col_order(sumstats_dt, path)
data table obj of the summary statistics file for the GWAS
Filepath for the summary statistics file to be formatted
list containing sumstats_dt, the modified summary statistics -data table object
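The reordering can be sketched in base R as follows. The data frame is hypothetical and this is only an illustration of the ordering rule (SNP, CHR, BP, then A1, A2 if present, then everything else), not the package's code.

```r
# Hypothetical summary statistics with columns in an arbitrary order.
sumstats <- data.frame(
  P   = c(0.01, 0.2),
  A2  = c("G", "T"),
  BP  = c(100, 200),
  SNP = c("rs1", "rs2"),
  A1  = c("A", "C"),
  CHR = c("1", "2"),
  stringsAsFactors = FALSE
)

# Leading columns in the required order, restricted to those present.
lead <- c("SNP", "CHR", "BP", "A1", "A2")
lead <- lead[lead %in% names(sumstats)]

# Put the leading columns first and keep the rest in their original order.
sumstats <- sumstats[, c(lead, setdiff(names(sumstats), lead)), drop = FALSE]
```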
-Drop Indels from summary statistics
-check_drop_indels(
- sumstats_dt,
- drop_indels,
- path,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
-sumstats_dt <- MungeSumstats:::formatted_example()
-sumstats <- check_drop_indels(sumstats_dt = sumstats_dt,
- drop_indels = TRUE)
-
data table obj of the summary statistics file for the GWAS
Binary, should any indels found in the sumstats be -dropped? These can not be checked against a reference dataset and will have -the same RS ID and position as SNPs which can affect downstream analysis. -Default is False.
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list containing sumstats_dt, -the modified summary statistics data table object
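As a rough illustration of dropping indels, the base-R sketch below assumes an indel can be recognised by an allele string longer than one character. That rule, and the helper name `drop_indels_sketch`, are assumptions made for this example and may differ from what check_drop_indels does internally.

```r
# Hypothetical helper: treat any row whose A1 or A2 has more than one
# character as an indel and drop it (assumed rule, for illustration only).
drop_indels_sketch <- function(df) {
  is_indel <- nchar(df$A1) > 1 | nchar(df$A2) > 1
  df[!is_indel, , drop = FALSE]
}

toy <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                  A1  = c("A", "AT", "C"),
                  A2  = c("G", "A", "CTT"),
                  stringsAsFactors = FALSE)
drop_indels_sketch(toy)$SNP
```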
-R/check_dup_bp.R
- check_dup_bp.Rd
Ensure all rows have unique positions, drop those that don't
-check_dup_bp(
- sumstats_dt,
- bi_allelic_filter,
- check_dups,
- indels,
- path,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
Binary Should non-bi-allelic SNPs be removed. -Default is TRUE.
whether to check for duplicates - if formatting QTL -datasets this should be set to FALSE otherwise keep as TRUE. Default is TRUE.
Binary does your Sumstats file contain Indels? These don't -exist in our reference file so they will be excluded from checks if this -value is TRUE. Default is TRUE.
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and log files list
-Ensure that no columns are duplicated
-check_dup_col(sumstats_dt, path)
data table obj of the summary statistics file for the GWAS
Filepath for the summary statistics file to be formatted
list containing sumstats_dt, the modified -summary statistics data table object
-R/check_dup_row.R
- check_dup_row.Rd
Ensure all rows are unique based on SNP,CHR,BP,A1,A2, drop those that aren't
-check_dup_row(
- sumstats_dt,
- check_dups,
- path,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
whether to check for duplicates - if formatting QTL -datasets this should be set to FALSE otherwise keep as TRUE. Default is TRUE.
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and log files list
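Row de-duplication on the key columns named above can be sketched with base R's `duplicated()`. The helper `drop_dup_rows` is hypothetical and keeps the first occurrence of each key combination, which is an assumption about tie-breaking rather than documented behaviour.

```r
# Hypothetical helper: keep the first row for each SNP/CHR/BP/A1/A2 combination.
drop_dup_rows <- function(df, keys = c("SNP", "CHR", "BP", "A1", "A2")) {
  keys <- intersect(keys, names(df))  # ignore key columns that are absent
  df[!duplicated(df[, keys, drop = FALSE]), , drop = FALSE]
}

toy <- data.frame(SNP = c("rs1", "rs2", "rs1"),
                  CHR = c("1", "1", "1"),
                  BP  = c(100L, 200L, 100L),
                  A1  = c("A", "C", "A"),
                  A2  = c("G", "T", "G"),
                  P   = c(0.01, 0.5, 0.02),
                  stringsAsFactors = FALSE)
nrow(drop_dup_rows(toy))
```

The third row is dropped even though its P value differs, because only the key columns are compared.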
-R/check_dup_snp.R
- check_dup_snp.Rd
Ensure all rows have unique SNP IDs, drop those that don't
-check_dup_snp(
- sumstats_dt,
- indels,
- path,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- bi_allelic_filter,
- check_dups
-)
Binary does your Sumstats file contain Indels? These don't -exist in our reference file so they will be excluded from checks if this -value is TRUE. Default is TRUE.
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
Binary Should non-bi-allelic SNPs be removed. -Default is TRUE.
whether to check for duplicates - if formatting QTL -datasets this should be set to FALSE otherwise keep as TRUE. Default is TRUE.
list containing sumstats_dt, the modified summary statistics data -table object and log files list
-R/check_effect_columns_nonzero.R
- check_effect_columns_nonzero.Rd
Ensure that the effect columns are non-zero for all SNPs
-check_effect_columns_nonzero(
- sumstats_dt,
- path,
- effect_columns_nonzero,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
Binary, should the effect columns in the data -(BETA, OR (odds ratio), LOG_ODDS, SIGNED_SUMSTAT) be checked to ensure none equals 0 for any SNP? -SNPs with a zero value are removed (if the column is present in the sumstats file). Default is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and the log file list
-Empty columns contain only ".", NA, or 0
-check_empty_cols(sumstats_dt, sampled_rows = NULL, verbose = TRUE)
First N rows to sample.
-Set NULL to use the full sumstats_file -when determining whether cols are empty.
Print messages.
empty_cols
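The "empty column" rule stated above (only ".", NA, or 0) can be sketched in base R as follows. `find_empty_cols` is a hypothetical stand-in written for this example; it compares values as strings, which is an assumption about how mixed column types would be handled.

```r
# Hypothetical helper: return names of columns containing only ".", NA, or 0.
find_empty_cols <- function(df, sampled_rows = NULL) {
  # Optionally inspect only the first N rows
  if (!is.null(sampled_rows)) df <- utils::head(df, sampled_rows)
  is_empty <- vapply(df, function(col) {
    all(is.na(col) | as.character(col) %in% c(".", "0"))
  }, logical(1))
  names(df)[is_empty]
}

toy <- data.frame(SNP  = c("rs1", "rs2"),
                  INFO = c(".", NA),
                  N    = c(0, 0),
                  P    = c(0.1, 0),
                  stringsAsFactors = FALSE)
find_empty_cols(toy)
```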
-R/check_four_step_col.R
- check_four_step_col.Rd
Ensure that CHR:BP:A2:A1 aren't merged into 1 column
-check_four_step_col(sumstats_dt, path)
data table obj of the summary statistics file for the GWAS
Filepath for the summary statistics file to be formatted
list containing sumstats_dt, the modified -summary statistics data table object
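A merged CHR:BP:A2:A1 column can be separated with `strsplit()`. The sketch below assumes a colon delimiter and that every value has exactly four fields; `split_four_step` is a hypothetical name used only here.

```r
# Hypothetical helper: split "CHR:BP:A2:A1" strings into four columns.
split_four_step <- function(x) {
  parts <- do.call(rbind, strsplit(x, ":", fixed = TRUE))
  data.frame(CHR = parts[, 1],
             BP  = as.integer(parts[, 2]),
             A2  = parts[, 3],
             A1  = parts[, 4],
             stringsAsFactors = FALSE)
}

res <- split_four_step(c("1:100:G:A", "X:2500:C:T"))
res
```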
-Ensure all SNPs have frq score above threshold
-check_frq(
- sumstats_dt,
- path,
- FRQ_filter,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
numeric The minimum value permissible of the frequency(FRQ) -of the SNP (i.e. Allele Frequency (AF)) (if present in sumstats file). By -default no filtering is done, i.e. value of 0.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and the log file list
-R/check_frq_maf.R
- check_frq_maf.Rd
Check that FRQ column refers to minor/effect allele frequency not major
-check_frq_maf(sumstats_dt, frq_is_maf)
Conventionally the FRQ column is intended to show the -minor/effect allele frequency (MAF) but sometimes the major allele frequency -can be inferred as the FRQ column. This logical variable indicates whether the -FRQ column should be renamed to MAJOR_ALLELE_FRQ if the frequency values -appear to relate to the major allele, i.e. >0.5. By default this renaming won't -occur, i.e. the value is TRUE.
sumstats_dt, the modified summary statistics data table object
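One possible reading of the frq_is_maf rule is sketched below: when frq_is_maf is FALSE and any FRQ value exceeds 0.5, the column is renamed. Both the trigger condition (any value above 0.5) and the helper name `rename_major_frq` are assumptions for illustration.

```r
# Hypothetical helper: rename FRQ to MAJOR_ALLELE_FRQ when the values look
# like major allele frequencies and the user has set frq_is_maf = FALSE.
rename_major_frq <- function(df, frq_is_maf = TRUE) {
  looks_major <- "FRQ" %in% names(df) && any(df$FRQ > 0.5, na.rm = TRUE)
  if (!frq_is_maf && looks_major) {
    names(df)[names(df) == "FRQ"] <- "MAJOR_ALLELE_FRQ"
  }
  df
}

toy <- data.frame(SNP = c("rs1", "rs2"), FRQ = c(0.9, 0.7))
names(rename_major_frq(toy, frq_is_maf = FALSE))
```

With the default `frq_is_maf = TRUE` the column names are returned unchanged.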
-R/check_info_score.R
- check_info_score.Rd
Ensure all SNPs have info score above threshold
-check_info_score(
- sumstats_dt,
- INFO_filter,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
numeric The minimum value permissible of the imputation -information score (if present in sumstats file). Default 0.9.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations.
list containing sumstats_dt, the modified summary statistics data -table object and the log file list
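The threshold filters in this group (FRQ, INFO) share one shape: split rows into kept and dropped, so the dropped rows can be logged. The sketch below shows this for INFO. It assumes that rows with a missing INFO value are kept and that the comparison is inclusive (>=); neither detail is stated in the text above, and `filter_info` is a hypothetical helper.

```r
# Hypothetical helper: split rows by an INFO threshold.
filter_info <- function(df, INFO_filter = 0.9) {
  if (!"INFO" %in% names(df)) {
    # Nothing to filter on: keep everything, drop nothing
    return(list(sumstats = df, dropped = df[0, , drop = FALSE]))
  }
  keep <- is.na(df$INFO) | df$INFO >= INFO_filter  # assumed NA handling
  list(sumstats = df[keep, , drop = FALSE],
       dropped  = df[!keep, , drop = FALSE])
}

toy <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                  INFO = c(0.95, 0.5, NA),
                  stringsAsFactors = FALSE)
res <- filter_info(toy)
res$dropped$SNP
```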
-R/check_ldsc_format.R
- check_ldsc_format.Rd
Format summary statistics for direct input to
-Linkage Disequilibrium SCore (LDSC) regression without the need
-to use their munge_sumstats.py script first.
check_ldsc_format(
- sumstats_dt,
- save_format,
- convert_n_int,
- allele_flip_check,
- compute_z,
- compute_n
-)
data table obj of the summary statistics file for the -GWAS.
Output format of sumstats. Options are NULL - standardised -output format from MungeSumstats, LDSC - output format compatible with LDSC -and openGWAS - output compatible with openGWAS VCFs. Default is NULL. -NOTE - If LDSC format is used, the naming convention of A1 as the -reference (genome build) allele and A2 as the effect allele will be reversed -to match LDSC (A1 will now be the effect allele). See more info on this -here. Note that any -effect columns (e.g. Z) will be in relation to A1 now instead of A2.
Binary, if N (the number of samples) is not an integer, -should this be rounded? Default is TRUE.
Binary Should the allele columns be checked against -reference genome to infer if flipping is necessary. Default is TRUE.
Whether to compute Z-score column. Default is FALSE. This -can be computed from Beta and SE with (Beta/SE) or P -(Z:=sign(BETA)*sqrt(stats::qchisq(P,1,lower=FALSE))). -Note that imputing the Z-score from P for every SNP will not be -perfectly correct and may result in a loss of power. This should only be done -as a last resort. Use 'BETA' to impute by BETA/SE and 'P' to impute by SNP -p-value.
Whether to impute N. Default of 0 won't impute, any other -integer will be imputed as the N (sample size) for every SNP in the dataset. -Note that imputing the sample size for every SNP is not correct and -should only be done as a last resort. N can also be inputted with "ldsc", -"sum", "giant" or "metal" by passing one of these for this field or a vector -of multiple. Sum and an integer value creates an N column in the output -whereas giant, metal or ldsc create an Neff or effective sample size. If -multiples are passed, the formula used to derive it will be indicated.
Formatted summary statistics
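The two Z-score formulas quoted in the compute_z description can be written out directly in base R. The function names `z_from_beta` and `z_from_p` are hypothetical; the formulas themselves are the ones given above.

```r
# Z from effect size and standard error: BETA / SE
z_from_beta <- function(beta, se) beta / se

# Z from the p-value, signed by BETA:
# sign(BETA) * sqrt(qchisq(P, 1, lower.tail = FALSE))
z_from_p <- function(beta, p) {
  sign(beta) * sqrt(stats::qchisq(p, df = 1, lower.tail = FALSE))
}

z_from_beta(0.5, 0.25)
z_from_p(-0.2, 0.05)
```

The p-value route recovers only the magnitude implied by P, which is why the description treats it as a last resort.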
-Remove SNPs with missing data
-check_miss_data(
- sumstats_dt,
- path,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- drop_na_cols
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
A character vector of column names to be checked for
-missing values. Rows with missing values in any of these columns (if present
-in the dataset) will be dropped. If NULL, all columns will be checked for
-missing values. Default columns are SNP, chromosome, position, allele 1,
-allele2, effect columns (frequency, beta, Z-score, standard error, log odds,
-signed sumstats, odds ratio), p value and N columns.
list containing sumstats_dt, the modified summary statistics data -table object and a log file list.
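The drop_na_cols behaviour described above can be approximated with `stats::complete.cases()`. This is a sketch on a plain data.frame; `drop_missing` is a hypothetical helper and does not reproduce the package's default column list.

```r
# Hypothetical helper: drop rows with NA in any of the requested columns.
# drop_na_cols = NULL means "check every column".
drop_missing <- function(df, drop_na_cols = NULL) {
  cols <- if (is.null(drop_na_cols)) names(df) else intersect(drop_na_cols, names(df))
  if (length(cols) == 0) return(df)  # none of the requested columns exist
  df[stats::complete.cases(df[, cols, drop = FALSE]), , drop = FALSE]
}

toy <- data.frame(SNP   = c("rs1", NA, "rs3"),
                  P     = c(0.1, 0.2, NA),
                  EXTRA = c(NA, 1, 1),
                  stringsAsFactors = FALSE)
drop_missing(toy, c("SNP", "P"))$SNP
```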
-R/check_multi_gwas.R
- check_multi_gwas.Rd
Ensure that only one model in GWAS sumstats or only one trait tested
-check_multi_gwas(
- sumstats_dt,
- path,
- analysis_trait,
- ignore_multi_trait,
- mapping_file
-)
data table obj of the summary statistics file for the GWAS
Filepath for the summary statistics file to be formatted
If multiple traits were studied, name of the trait for -analysis from the GWAS. Default is NULL
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing or the mapping we -give is incorrect, you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
list containing sumstats_dt, the modified summary statistics data -table object
-R/check_multi_rs_snp.R
- check_multi_rs_snp.Rd
Ensure that SNP ids don't have multiple rs ids on one line
-check_multi_rs_snp(
- sumstats_dt,
- path,
- remove_multi_rs_snp,
- imputation_ind,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
Binary Sometimes summary statistics can have -multiple RSIDs on one row (i.e. related to one SNP), for example -"rs5772025_rs397784053". This can cause an error so by default, the first -RS ID will be kept and the rest removed e.g."rs5772025". If you want to just -remove these SNPs entirely, set it to TRUE. Default is FALSE.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). For the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and the log file list.
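Both options described for remove_multi_rs_snp (keep the first RS ID, or drop the row) can be sketched with base-R regular expressions. The helper names are hypothetical, and the pattern assumes RS IDs have the form "rs" followed by digits.

```r
# Hypothetical helper: keep only the first RS ID in each SNP string.
first_rsid <- function(snp) sub("^(rs[0-9]+).*$", "\\1", snp)

# Hypothetical helper: flag SNP strings that contain more than one RS ID.
has_multi_rsid <- function(snp) {
  lengths(regmatches(snp, gregexpr("rs[0-9]+", snp))) > 1
}

snp <- c("rs5772025_rs397784053", "rs123")
first_rsid(snp)             # default behaviour: keep the first RS ID
snp[!has_multi_rsid(snp)]   # alternative: remove multi-RSID rows entirely
```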
-Ensure that the N column is all integers
-check_n_int(sumstats_dt, path, convert_n_int, imputation_ind)
data table obj of the summary statistics file for the GWAS
Filepath for the summary statistics file to be formatted
Binary, if N (the number of samples) is not an integer, -should this be rounded? Default is TRUE.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). Note -these columns will be in the formatted summary statistics returned. Default -is FALSE.
list containing sumstats_dt, the modified summary -statistics data table object.
-R/check_n_num.R
- check_n_num.Rd
Some SNPs may have been genotyped by a specialized genotyping array and -have substantially more samples than others. These will be removed.
-check_n_num(
- sumstats_dt,
- path,
- N_std,
- N_dropNA = FALSE,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
numeric The number of standard deviations above the mean that a SNP's -N must be for the SNP to be removed. Default is 5.
Drop rows where N is missing. Default is TRUE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and the log file list
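The N_std rule (remove SNPs whose N lies more than N_std standard deviations above the mean) can be sketched as below. `drop_high_n` is a hypothetical helper; keeping rows with a missing N is an assumption here and corresponds to leaving N_dropNA off.

```r
# Hypothetical helper: drop SNPs whose N exceeds mean(N) + N_std * sd(N).
drop_high_n <- function(df, N_std = 5) {
  cutoff <- mean(df$N, na.rm = TRUE) + N_std * stats::sd(df$N, na.rm = TRUE)
  df[is.na(df$N) | df$N <= cutoff, , drop = FALSE]
}

# 50 SNPs with N = 1000 and one SNP typed on a much larger array
toy <- data.frame(SNP = paste0("rs", 1:51),
                  N   = c(rep(1000, 50), 100000),
                  stringsAsFactors = FALSE)
nrow(drop_high_n(toy))
```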
-R/check_no_allele.R
- check_no_allele.Rd
More care needs to be taken if only one of A1/A2 is present: before imputing the -other allele, flipping needs to be checked
-check_no_allele(
- sumstats_dt,
- path,
- ref_genome,
- rsids,
- imputation_ind,
- allele_flip_check,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- bi_allelic_filter,
- dbSNP
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). For the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Binary Should the allele columns be checked against -reference genome to infer if flipping is necessary. Default is TRUE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
Binary Should non-bi-allelic SNPs be removed. -Default is TRUE.
version of dbSNP to be used for imputation (144 or 155).
A list containing:
-sumstats_dt: the modified summary statistics data table object
-rsids: snpsById, filtered to SNPs of interest if loaded already, or else NULL
-allele_flip_check: does the dataset require allele flip check
-log_files: log file list
-bi_allelic_filter: should multi-allelic SNPs be filtered out
R/check_no_chr_bp.R
- check_no_chr_bp.Rd
Ensure that CHR and BP are present; if they are missing and SNP is present, they can be found
-check_no_chr_bp(
- sumstats_dt,
- path,
- ref_genome,
- rsids,
- imputation_ind,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- dbSNP
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). For the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
version of dbSNP to be used for imputation (144 or 155).
A list containing:
-sumstats_dt: the modified summary statistics data table object
-rsids: snpsById, filtered to SNPs of interest if loaded already, or else NULL
-log_files: log file list
R/check_no_rs_snp.R
- check_no_rs_snp.Rd
Ensure that SNP IDs appear to be valid RSIDs (start with rs)
-check_no_rs_snp(
- sumstats_dt,
- path,
- ref_genome,
- snp_ids_are_rs_ids,
- indels,
- imputation_ind,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- dbSNP
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary Should the supplied SNP IDs be assumed to -be RSIDs. If not, imputation using the SNP ID for other columns like -base-pair position or chromosome will not be possible. If set to FALSE, the -SNP RS ID will be imputed from the reference genome if possible. Default is -TRUE.
Binary does your Sumstats file contain Indels? These don't -exist in our reference file so they will be excluded from checks if this -value is TRUE. Default is TRUE.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). For the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
version of dbSNP to be used for imputation (144 or 155).
list containing sumstats_dt, the modified summary statistics data -table object and the log file list.
-R/check_no_snp.R
- check_no_snp.Rd
Ensure that SNP is present; if not, it can be found with CHR and BP
-check_no_snp(
- sumstats_dt,
- path,
- ref_genome,
- indels,
- imputation_ind,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- dbSNP,
- verbose = TRUE
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary does your Sumstats file contain Indels? These don't -exist in our reference file so they will be excluded from checks if this -value is TRUE. Default is TRUE.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). For the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
version of dbSNP to be used for imputation (144 or 155).
should messages be printed. Default is TRUE.
list containing sumstats_dt, the modified summary statistics data -table object and the log files list
-Checks for any columns that should be numeric, -and ensures that they are indeed numeric.
-check_numeric(sumstats_dt, cols = c("P", "SE", "FRQ", "MAF", "BETA"))
Summary stats with column names already standardised by -format_sumstats.
Names of columns that should be numeric.
-If any of these columns are not actually present in sumstats_dt,
-they will be skipped.
sumstats_dt
-R/check_on_ref_genome.R
- check_on_ref_genome.Rd
Ensure all SNPs are on the reference genome
-check_on_ref_genome(
- sumstats_dt,
- path,
- ref_genome,
- on_ref_genome,
- indels = indels,
- rsids,
- imputation_ind,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- dbSNP
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary Should a check take place that all SNPs are on -the reference genome by SNP ID. Default is TRUE.
Binary does your Sumstats file contain Indels? These don't -exist in our reference file so they will be excluded from checks if this -value is TRUE. Default is TRUE.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). For the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
version of dbSNP to be used for imputation (144 or 155).
A list containing:
-sumstats_dt: the modified summary statistics data table object
-rsids: snpsById, filtered to SNPs of interest if loaded already, or else NULL
-log_files: log file list
R/check_pos_se.R
- check_pos_se.Rd
Ensure that the standard error (se) is positive for all SNPs. -Also impute se if missing.
-check_pos_se(
- sumstats_dt,
- path,
- pos_se,
- log_folder_ind,
- imputation_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files,
- impute_se
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
Binary Should the standard error (SE) column be checked to -ensure it is greater than 0? SNPs where it is not are removed (if present in -sumstats file). Default TRUE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). For the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
Binary, whether the standard error should be imputed using -other effect data if it isn't present in the sumstats. Note that this -imputation is an approximation so could have an effect on downstream -analysis. Use with caution. The different methods MungeSumstats will try to -impute se (in this order of priority) are:
1. BETA / Z 2. abs(BETA / qnorm(P/2)) -Default is FALSE.
list containing sumstats_dt, the modified summary statistics data -table object and the log file list
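The two SE imputation formulas listed for impute_se can be written in base R as follows. `impute_se_sketch` is a hypothetical helper that tries BETA / Z first and falls back to abs(BETA / qnorm(P/2)), mirroring the stated priority.

```r
# Hypothetical helper: impute SE from BETA and Z, or from BETA and P.
impute_se_sketch <- function(beta, z = NULL, p = NULL) {
  if (!is.null(z)) return(beta / z)                        # 1. BETA / Z
  if (!is.null(p)) return(abs(beta / stats::qnorm(p / 2))) # 2. abs(BETA / qnorm(P/2))
  stop("Either z or p is needed to impute SE")
}

impute_se_sketch(0.4, z = 2)
impute_se_sketch(0.4, p = 0.05)
```

The second route is the approximate one, since it reconstructs the test statistic from a two-sided p-value.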
-R/check_range_pval.R
- check_range_p_val.Rd
Ensure that the p values are not >1 and if so set to 1
-check_range_p_val(sumstats_dt, convert_large_p, convert_neg_p, imputation_ind)
-sumstats_dt <- MungeSumstats:::formatted_example()
-sumstats_dt$P[1:3] <- 5
-sumstats_dt$P[6:10] <- -5
-sumstats <- check_range_p_val(sumstats_dt = sumstats_dt,
- convert_large_p = TRUE,
- convert_neg_p = TRUE,
- imputation_ind = TRUE)
-
data table obj of the summary statistics file for the GWAS
Binary, should p-values >1 be converted to 1? -P-values >1 should not be possible and can cause errors with LDSC/MAGMA and -should be converted. Default is TRUE.
Binary, should p-values <0 be converted to 0? -Negative p-values should not be possible and can cause errors -with LDSC/MAGMA and should be converted. Default is TRUE.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). For the flipped -value, this denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
list containing sumstats_dt, -the modified summary statistics data table object
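The p-value range correction described by convert_large_p and convert_neg_p amounts to clamping values into [0, 1]. A minimal base-R sketch, with `clamp_p` as a hypothetical helper operating on a plain numeric vector:

```r
# Hypothetical helper: set p-values above 1 to 1 and below 0 to 0.
clamp_p <- function(p, convert_large_p = TRUE, convert_neg_p = TRUE) {
  if (convert_large_p) p[!is.na(p) & p > 1] <- 1
  if (convert_neg_p)   p[!is.na(p) & p < 0] <- 0
  p
}

clamp_p(c(5, 0.2, -5, NA))
```

Missing values are left untouched so that they can be handled by a separate missing-data check.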
-R/check_row_snp.R
- check_row_snp.Rd
Ensure all rows have SNPs beginning with rs or SNP, drop those that don't
-check_row_snp(
- sumstats_dt,
- path,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and log file list
-R/check_save_path.R
- check_save_path.Rd
Check if save path and log folder is appropriate
-check_save_path(
- save_path,
- log_folder,
- log_folder_ind,
- tabix_index,
- write_vcf = FALSE,
- verbose = TRUE
-)
File path to save formatted data. Defaults to
-tempfile(fileext=".tsv.gz").
Filepath to the directory for the log files and the log of -MungeSumstats messages to be stored. Default is a temporary directory. Note -the name of the log files (log messages and log outputs) are now the same as -the name of the file specified in the save path parameter with the extension -'_log_msg.txt' and '_log_output.txt' respectively.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Whether to write as VCF (TRUE) or tabular file (FALSE).
Print messages.
Corrected save_path
, the file type, the separator, corrected
-log_folder
,the log file extension.
R/check_signed_col.R
- check_signed_col.Rd
Ensure that there is at least one signed column in summary statistics file -Impute beta if user requests
-check_signed_col(
- sumstats_dt,
- impute_beta,
- log_folder_ind,
- rsids,
- imputation_ind,
- check_save_out,
- tabix_index,
- log_files,
- nThread
-)
data table obj of the summary statistics -file for the GWAS
Binary, whether BETA should be imputed using other effect -data if it isn't present in the sumstats. Note that this imputation is an -approximation (for Z & SE approach) so could have an effect on downstream -analysis. Use with caution. The different methods MungeSumstats will try to -impute beta (in this order of priority) are:
1. log(OR) 2. Z x SE -Default value is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Binary Should a column be added for each imputation -step to show which SNPs have imputed values for differing fields? This -includes a field denoting SNP allele flipping (flipped). The flipped -value denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Index the formatted summary statistics with -tabix for fast querying.
list of log file locations
Number of threads to use for parallel processes.
null
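The two fallbacks listed for impute_beta can be sketched in base R. This is only an illustration of the arithmetic on a toy data frame, not the package's internal code; the values and object names are made up for the example.

```r
# Toy summary statistics (illustrative values only).
sumstats <- data.frame(
  SNP = c("rs1", "rs2", "rs3"),
  OR  = c(1.10, 0.95, 1.00),
  Z   = c(2.0, -1.5, 0.0),
  SE  = c(0.05, 0.04, 0.03)
)

# Priority 1: BETA as the natural log of the odds ratio.
beta_from_or <- log(sumstats$OR)

# Priority 2: BETA approximated as Z x SE.
beta_from_z <- sumstats$Z * sumstats$SE
```

An odds ratio of 1 maps to a BETA of 0, which is why the third SNP has no effect under either method.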
-R/check_small_p_val.R
- check_small_p_val.Rd
Ensure that the non-negative p-values are not 5e-324 or lower, if so set to 0
-check_small_p_val(sumstats_dt, convert_small_p, imputation_ind)
-sumstats_dt <- MungeSumstats:::formatted_example()
-sumstats_dt$P[1:3] <- 5e-324
-sumstats_dt$P[6:10] <- "5e-324"
-sumstats <- check_small_p_val(sumstats_dt = sumstats_dt,
- convert_small_p = TRUE,
- imputation_ind = TRUE)
-
data table obj of the summary statistics file for the GWAS
Binary, should non-negative -p-values <= 5e-324 be converted to 0? -P-values this small are at the limit of what R can represent and can cause errors with LDSC/MAGMA, so they -should be converted. Default is TRUE.
Binary Should a column be added for each imputation -step to show which SNPs have imputed values for differing fields? This -includes a field denoting SNP allele flipping (flipped). The flipped -value denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
list containing sumstats_dt, -the modified summary statistics data table object
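As a rough illustration of the rule described above, the conversion can be written in base R on a plain numeric vector. This is a sketch of the idea only and does not reproduce how check_small_p_val handles data.table columns or character p-values.

```r
# Toy p-values: two ordinary values, one at the 5e-324 threshold, one zero.
p <- c(0.05, 1e-300, 5e-324, 0)

# Flag non-negative p-values at or below the threshold and set them to 0.
too_small <- p >= 0 & p <= 5e-324
p[too_small] <- 0
```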
-R/check_strand_ambiguous.R
- check_strand_ambiguous.Rd
Remove SNPs with strand-ambiguous alleles
-check_strand_ambiguous(
- sumstats_dt,
- path,
- ref_genome,
- strand_ambig_filter,
- log_folder_ind,
- check_save_out,
- tabix_index,
- nThread,
- log_files
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary Should SNPs with strand-ambiguous alleles -be removed. Default is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Index the formatted summary statistics with -tabix for fast querying.
Number of threads to use for parallel processes.
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and the log file list
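For context, strand-ambiguous SNPs are conventionally those whose two alleles are complementary bases (A/T or C/G), so the strand cannot be told apart from the alleles alone. A minimal base-R sketch of that test, using a made-up data frame rather than the package's own implementation:

```r
# Toy allele table (illustrative values only).
alleles <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  A1  = c("A", "C", "A", "G"),
  A2  = c("T", "G", "G", "C")
)

# Complementary allele pairs are ambiguous with respect to strand.
ambiguous_pairs <- c("AT", "TA", "CG", "GC")
is_ambiguous <- paste0(toupper(alleles$A1), toupper(alleles$A2)) %in%
  ambiguous_pairs

# Keep only the unambiguous SNPs.
filtered <- alleles[!is_ambiguous, , drop = FALSE]
```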
-Ensure valid tabular format
-check_tabular(header)
The summary statistics file for the GWAS
Whether the file is tabular
-R/check_two_step_col.R
- check_two_step_col.Rd
Ensure that CHR:BP aren't merged into 1 column
-check_two_step_col(sumstats_dt, path)
data table obj of the summary statistics -file for the GWAS
Filepath for the summary statistics file to be formatted
list containing sumstats_dt, the modified summary -statistics data table object
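To make the idea concrete, a merged "CHR:BP" column can be split into two columns as follows. This is a base-R sketch under the assumption that the separator is a colon; the vector name `merged` is hypothetical and the package's own logic may handle more separators and edge cases.

```r
# Toy merged column holding chromosome and base-pair position together.
merged <- c("1:12345", "X:999", "22:5000")

# Split on the literal colon and pull out each half.
parts <- strsplit(merged, ":", fixed = TRUE)
CHR <- vapply(parts, `[`, character(1), 1)
BP  <- as.integer(vapply(parts, `[`, character(1), 2))
```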
-Check if the inputted file is in VCF format
-check_vcf(header)
Header of the GWAS summary statistics file.
Whether the file is vcf or not
-R/check_vital_col.R
- check_vital_col.Rd
Ensure that all necessary columns are in the summary statistics file
-check_vital_col(sumstats_dt)
data table obj of the summary statistics file for the GWAS
null
-The following ensures that a Z-score column is present. -The Z-score formula used here is an R implementation of the formula -used in LDSC's munge_sumstats.py:
-check_zscore(
- sumstats_dt,
- imputation_ind,
- compute_z = "BETA",
- force_new_z = FALSE,
- standardise_headers = FALSE,
- mapping_file
-)
data table obj of the summary statistics file for the -GWAS.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). Note -these columns will be in the formatted summary statistics returned. Default -is FALSE.
Whether to compute Z-score column. Default is FALSE. This -can be computed from Beta and SE with (Beta/SE) or P -(Z:=sign(BETA)*sqrt(stats::qchisq(P,1,lower=FALSE))). -Note that imputing the Z-score from P for every SNP will not be -perfectly correct and may result in a loss of power. This should only be done -as a last resort. Use 'BETA' to impute by BETA/SE and 'P' to impute by SNP -p-value.
When a "Z" column already exists, it will be used by
-default. To override and compute a new Z-score column from P set
-force_new_z=TRUE
.
Run
-standardise_sumstats_column_headers_crossplatform
first.
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing, or the mapping we -give is incorrect, you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
list("sumstats_dt"=sumstats_dt)
np.sqrt(chi2.isf(P, 1))
The R implementation is adapted from the GenomicSEM::munge
function,
-after optimizing for speed using data.table
:
sumstats_dt[,Z:=sign(BETA)*sqrt(stats::qchisq(P,1,lower=FALSE))]
NOTE: compute_z
is set to TRUE
by
-default to ensure standardisation
-of the "Z" column (which can be computed differently in different datasets).
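The two ways of obtaining Z described above can be compared on toy values in base R. This sketch uses plain vectors instead of a data.table, and the p-values are constructed from BETA/SE so that both routes should agree; with real data they generally will not match exactly.

```r
# Toy effect sizes and standard errors (illustrative values only).
BETA <- c(0.10, -0.20, 0.03)
SE   <- c(0.05, 0.10, 0.02)

# Route 1 (compute_z = "BETA"): Z = BETA / SE.
z_from_beta <- BETA / SE

# Two-sided p-values implied by those Z-scores.
P <- 2 * stats::pnorm(-abs(z_from_beta))

# Route 2 (compute_z = "P"): recover |Z| from P via the chi-square
# quantile with 1 degree of freedom, then reapply the sign of BETA.
z_from_p <- sign(BETA) * sqrt(stats::qchisq(P, df = 1, lower.tail = FALSE))
```

Route 2 recovers only the magnitude from P, which is why the sign of BETA is needed and why any rounding of P in a real file reduces accuracy.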
Useful in situations where you need to specify columns by -index instead of name (e.g. awk queries).
-column_dictionary(file_path)
Borrowed function from - -echotabix.
-
-eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats"
-)
-tmp <- tempfile(fileext = ".tsv")
-file.copy(eduAttainOkbayPth, tmp)
-cdict <- MungeSumstats:::column_dictionary(file_path = tmp)
-
Path to full summary stats file -(or really any file you want to make a column dictionary for).
Named list of column positions.
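The kind of object returned can be illustrated with a small base-R sketch that reads only the header line of a tab-separated file. It assumes a tab delimiter and an uncompressed file, and is not the echotabix-derived implementation itself.

```r
# Write a tiny tab-separated file to a temporary location.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("SNP\tCHR\tBP\tP", "rs1\t1\t100\t0.5"), tmp)

# Read the header only and map each column name to its position.
header <- strsplit(readLines(tmp, n = 1), "\t", fixed = TRUE)[[1]]
cdict <- as.list(stats::setNames(seq_along(header), header))
```

A lookup such as `cdict$BP` then gives the column index to use in a positional query (for example in awk).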
-R/compute_nsize.R
- compute_nsize.Rd
Check for an N column; if it is not present and the user requests it, impute N based on the user's -sample size. NOTE this will be the same value for each SNP which is not -necessarily correct and may cause issues down the line. N can also be -imputed with "ldsc", "sum", "giant" or "metal" by passing one or -multiple of these.
-compute_nsize(
- sumstats_dt,
- imputation_ind = FALSE,
- compute_n = c("ldsc", "giant", "metal", "sum"),
- standardise_headers = FALSE,
- force_new = FALSE,
- return_list = TRUE
-)
data table obj of the summary statistics file for the -GWAS.
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). Note -these columns will be in the formatted summary statistics returned. Default -is FALSE.
How to compute per-SNP sample size (new column "N").
0
: N will not be computed.
>0
: If any number >0 is provided,
-that value will be set as N for every row.
-Note: Computing N this way is incorrect and should be avoided
-if at all possible.
"sum"
: N will be computed as:
-cases (N_CAS) + controls (N_CON), so long as both columns are present.
"ldsc"
: N will be computed as effective sample size:
-Neff = (N_CAS+N_CON) * (N_CAS/(N_CAS+N_CON)) / mean((N_CAS/(N_CAS+N_CON))[(N_CAS+N_CON)==max(N_CAS+N_CON)]).
"giant"
: N will be computed as effective sample size:
-Neff = 2 / (1/N_CAS + 1/N_CON).
"metal"
: N will be computed as effective sample size:
-Neff = 4 / (1/N_CAS + 1/N_CON).
Standardise headers first.
If "Neff" (or "N") already exists in sumstats_dt
,
-replace it with the recomputed version.
Return the sumstats_dt
within a named list
-(default: TRUE
).
list("sumstats_dt"=sumstats_dt)
sumstats_dt <- MungeSumstats::formatted_example()
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-#> Sorting coordinates with 'data.table'.
-sumstats_dt2 <- MungeSumstats::compute_nsize(sumstats_dt=sumstats_dt,
- compute_n=10000)
-#> Assigning N=10000 for all SNPs.
-
Computes sample sum (as new column "N") or -effective sample size (ESS) (as new column "Neff"). -Computing ESS is important as it takes into account -the proportion of cases to controls (i.e. class imbalance) so as not to -overestimate your statistical power.
-compute_sample_size(
- sumstats_dt,
- method = c("ldsc", "giant", "metal", "sum"),
- force_new = FALSE,
- append_method_name = FALSE
-)
Summary statistics data.table.
Method for computing (effective) sample size.
-"ldsc" :
-\(Neff = (N_CAS+N_CON) * (N_CAS/(N_CAS+N_CON)) /
- mean((N_CAS/(N_CAS+N_CON))[(N_CAS+N_CON)==max(N_CAS+N_CON)])\)
-bulik/ldsc GitHub Issue
-
-bulik/ldsc GitHub code
"giant" :
-\(Neff = 2 / (1/N_CAS + 1/N_CON)\)
-Winkler et al. 2014, Nature
"metal" :
-\(Neff = 4 / (1/N_CAS + 1/N_CON)\)
-Willer et al. 2010, Bioinformatics
"sum" :
-\(N = N_CAS + N_CON\)
-Simple summation of cases and controls
-that does not account for class imbalance.
"\<integer\>" : N = \<integer\>
-If method is a positive integer, it will be used as N
-for every row.
If "Neff" (or "N") already exists in sumstats_dt
,
-replace it with the recomputed version.
Should the Neff column name include an indicator of the -method used to compute it? Default is FALSE unless multiple methods are passed.
A data.table with a new column "Neff" or "N"
-There are many different formulas for calculating ESS, -but LDSC is probably the best method available here, as it -doesn't assume that the proportion of controls:cases -is 2:1 (as in GIANT) or 4:1 (as in METAL).
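The formulas listed for each method can be evaluated directly on toy case and control counts. The following base-R sketch works on plain vectors and simply transcribes the formulas above; it is not the package's data.table implementation.

```r
# Toy per-SNP case and control counts (illustrative values only).
N_CAS <- c(1000, 1000, 500)
N_CON <- c(3000, 3000, 500)

# "sum": simple total, ignoring class imbalance.
n_sum <- N_CAS + N_CON

# "giant": Neff = 2 / (1/N_CAS + 1/N_CON).
n_giant <- 2 / (1 / N_CAS + 1 / N_CON)

# "metal": Neff = 4 / (1/N_CAS + 1/N_CON).
n_metal <- 4 / (1 / N_CAS + 1 / N_CON)

# "ldsc": scale each SNP's case count by the mean case proportion
# among the SNPs with the largest total sample size.
total <- N_CAS + N_CON
prop_cases <- N_CAS / total
n_ldsc <- total * prop_cases / mean(prop_cases[total == max(total)])
```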
-Add user supplied sample size
-compute_sample_size_n(sumstats_dt, method, force_new = FALSE)
Summary statistics data.table.
Method for computing (effective) sample size.
-"ldsc" :
-\(Neff = (N_CAS+N_CON) * (N_CAS/(N_CAS+N_CON)) /
- mean((N_CAS/(N_CAS+N_CON))[(N_CAS+N_CON)==max(N_CAS+N_CON)])\)
-bulik/ldsc GitHub Issue
-
-bulik/ldsc GitHub code
"giant" :
-\(Neff = 2 / (1/N_CAS + 1/N_CON)\)
-Winkler et al. 2014, Nature
"metal" :
-\(Neff = 4 / (1/N_CAS + 1/N_CON)\)
-Willer et al. 2010, Bioinformatics
"sum" :
-\(N = N_CAS + N_CON\)
-Simple summation of cases and controls
-that does not account for class imbalance.
"\<integer\>" : N = \<integer\>
-If method is a positive integer, it will be used as N
-for every row.
If "Neff" (or "N") already exists in sumstats_dt
,
-replace it with the recomputed version.
No return
-Compute Neff/N
-compute_sample_size_neff(
- sumstats_dt,
- method,
- force_new = FALSE,
- append_method_name = FALSE
-)
Summary statistics data.table.
Method for computing (effective) sample size.
-"ldsc" :
-\(Neff = (N_CAS+N_CON) * (N_CAS/(N_CAS+N_CON)) /
- mean((N_CAS/(N_CAS+N_CON))[(N_CAS+N_CON)==max(N_CAS+N_CON)])\)
-bulik/ldsc GitHub Issue
-
-bulik/ldsc GitHub code
"giant" :
-\(Neff = 2 / (1/N_CAS + 1/N_CON)\)
-Winkler et al. 2014, Nature
"metal" :
-\(Neff = 4 / (1/N_CAS + 1/N_CON)\)
-Willer et al. 2010, Bioinformatics
"sum" :
-\(N = N_CAS + N_CON\)
-Simple summation of cases and controls
-that does not account for class imbalance.
"\<integer\>" : N = \<integer\>
-If method is a positive integer, it will be used as N
-for every row.
If "Neff" (or "N") already exists in sumstats_dt
,
-replace it with the recomputed version.
Should the Neff column name include an indicator of the -method used to compute it? Default is FALSE unless multiple methods are passed.
No return
-R/convert_sumstats.R
- convert_sumstats.Rd
Convert summary statistics to desired object type
-convert_sumstats(
- sumstats_dt,
- return_format = c("data.table", "vranges", "granges")
-)
Object type to convert to;
-"data.table"
, "GenomicRanges"
or
-"VRanges"
(default is "data.table"
).
Summary statistics in the converted format
-R/download_vcf.R
- download_vcf.Rd
Ideally, we would use gwasvcf -instead but it hasn't been made available on CRAN or Bioconductor yet, -so we can't include it as a dep.
-download_vcf(
- vcf_url,
- vcf_dir = tempdir(),
- vcf_download = TRUE,
- download_method = "download.file",
- force_new = FALSE,
- quiet = FALSE,
- timeout = 10 * 60,
- nThread = 1
-)
Remote URL to VCF file.
Where to download the original VCF from Open GWAS.
-WARNING: This is set to tempdir()
by default.
-This means the raw (pre-formatted) VCFs will be deleted upon ending the R session.
-Change this to keep the raw VCF file on disk
-(e.g. vcf_dir="./raw_vcf"
).
Download the original VCF from Open GWAS.
"axel"
(multi-threaded) or
-"download.file"
(single-threaded) .
Overwrite a previously downloaded VCF -with the same path name.
Run quietly.
How many seconds before giving up on download.
-Passed to download.file
. Default: 10*60
(10min).
Number of threads to parallelize over.
List containing the paths to the downloaded VCF and its index file.
-#only run the examples if user has internet access:
-# getURL() is provided by the RCurl package, so call it with its namespace
-if(try(is.character(RCurl::getURL("www.google.com")))==TRUE){
-vcf_url <- "https://gwas.mrcieu.ac.uk/files/ieu-a-298/ieu-a-298.vcf.gz"
-out_paths <- download_vcf(vcf_url = vcf_url)
-}
-
R wrapper for -axel -(multi-threaded) and -download.file (single-threaded) -download functions.
-downloader(
- input_url,
- output_path,
- download_method = "axel",
- background = FALSE,
- force_overwrite = FALSE,
- quiet = TRUE,
- show_progress = TRUE,
- continue = TRUE,
- nThread = 1,
- alternate = TRUE,
- check_certificates = TRUE,
- timeout = 10 * 60
-)
input_url.
output_path.
"axel"
(multi-threaded) or
-"download.file"
(single-threaded) .
Run in background
Overwrite existing file.
Run quietly.
show_progress.
continue.
Number of threads to parallelize over.
alternate,
check_certificates
How many seconds before giving up on download.
-Passed to download.file
. Default: 10*60
(10min).
Local path to downloaded file.
-Other downloaders:
-axel()
Drop columns with identical names (if any exist) within a data.table.
-drop_duplicate_cols(dt)
data.table
Null output
-Drop rows with duplicate values across all columns.
-drop_duplicate_rows(dt, verbose = TRUE)
data.table
Print messages.
Filtered dt
.
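Both de-duplication steps (columns by name, rows by value) have simple base-R analogues. The sketch below uses a plain data.frame with made-up values to show the intended effect; the package functions themselves operate on a data.table.

```r
# Toy table with a repeated column name and a repeated row.
df <- data.frame(
  SNP = c("rs1", "rs1", "rs2"),
  P   = c(0.1, 0.1, 0.2),
  P   = c(0.1, 0.1, 0.2),
  check.names = FALSE
)

# Drop columns whose name has already appeared.
df <- df[, !duplicated(names(df)), drop = FALSE]

# Drop rows that are identical across all remaining columns.
dedup <- df[!duplicated(df), , drop = FALSE]
```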
For each argument, searches for any datasets matching -a case-insensitive substring search in the respective metadata column. -Users can supply a single character string or a -list/vector of character strings.
-find_sumstats(
- ids = NULL,
- traits = NULL,
- years = NULL,
- consortia = NULL,
- authors = NULL,
- populations = NULL,
- categories = NULL,
- subcategories = NULL,
- builds = NULL,
- pmids = NULL,
- min_sample_size = NULL,
- min_ncase = NULL,
- min_ncontrol = NULL,
- min_nsnp = NULL,
- include_NAs = FALSE
-)
List of Open GWAS study IDs
-(e.g. c("prot-a-664", "ieu-b-4760")
).
List of traits
-(e.g. c("parkinson", "Alzheimer")
).
List of years
-(e.g. seq(2015,2021)
or c(2010, 2012, 2021)
).
List of consortia
-(e.g. c("MRC-IEU","Neale Lab")
).
List of authors
-(e.g. c("Elsworth","Kunkle","Neale")
).
List of populations
-(e.g. c("European","Asian")
).
List of categories
-(e.g. c("Binary","Continuous","Disease","Risk factor")
).
List of subcategories
-(e.g. c("neurological","Immune","cardio")
).
List of genome builds
-(e.g. c("hg19","grch37")
).
List of PubMed IDs (exact matches only)
-(e.g. c(29875488, 30305740, 28240269)
).
Minimum total number of study participants
-(e.g. 5000
).
Minimum number of case participants
-(e.g. 1000
).
Minimum number of control participants
-(e.g. 1000
).
Minimum number of SNPs
-(e.g. 200000
).
Include datasets with missing metadata for size criteria
-(i.e. min_sample_size
, min_ncase
, or min_ncontrol
).
(Filtered) GWAS metadata table.
-By default, returns metadata for all studies currently in Open GWAS database.
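The case-insensitive substring matching described above can be sketched on a toy metadata table in base R. The table, its IDs and its column names are hypothetical stand-ins for the Open GWAS metadata, and the real function queries the live database instead.

```r
# Hypothetical metadata table (IDs and values invented for illustration).
meta <- data.frame(
  id    = c("study-1", "study-2", "study-3"),
  trait = c("Alzheimer's disease", "Parkinson's disease", "Height"),
  year  = c(2019, 2020, 2012),
  stringsAsFactors = FALSE
)

# Several search terms are combined so that a match on any one counts.
traits <- c("alzheimer", "parkinson")
trait_hit <- grepl(paste(traits, collapse = "|"), meta$trait,
                   ignore.case = TRUE)

# Each additional criterion narrows the result further.
year_hit <- meta$year %in% seq(2015, 2021)
hits <- meta[trait_hit & year_hit, , drop = FALSE]
```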
-# Only run the examples if user has internet access
-# and if access token has been added
-if(try(is.character(RCurl::getURL("www.google.com")))==TRUE && ieugwasr::get_opengwas_jwt()!=""){
-### By ID
-metagwas <- find_sumstats(ids = c(
- "ieu-b-4760",
- "prot-a-1725",
- "prot-a-664"
-))
-### By ID and sample size
-metagwas <- find_sumstats(
- ids = c("ieu-b-4760", "prot-a-1725", "prot-a-664"),
- min_sample_size = 5000
-)
-### By criteria
-metagwas <- find_sumstats(
- traits = c("alzheimer", "parkinson"),
- years = seq(2015, 2021)
-)
-}
-
R/format_sumstats.R
- format_sumstats.Rd
Check that summary statistics from GWAS are in a homogeneous format
-format_sumstats(
- path,
- ref_genome = NULL,
- convert_ref_genome = NULL,
- chain_source = "ensembl",
- local_chain = NULL,
- convert_small_p = TRUE,
- convert_large_p = TRUE,
- convert_neg_p = TRUE,
- compute_z = FALSE,
- force_new_z = FALSE,
- compute_n = 0L,
- convert_n_int = TRUE,
- impute_beta = FALSE,
- es_is_beta = TRUE,
- impute_se = FALSE,
- analysis_trait = NULL,
- ignore_multi_trait = FALSE,
- INFO_filter = 0.9,
- FRQ_filter = 0,
- pos_se = TRUE,
- effect_columns_nonzero = FALSE,
- N_std = 5,
- N_dropNA = TRUE,
- chr_style = "Ensembl",
- rmv_chr = c("X", "Y", "MT"),
- on_ref_genome = TRUE,
- infer_eff_direction = TRUE,
- eff_on_minor_alleles = FALSE,
- strand_ambig_filter = FALSE,
- allele_flip_check = TRUE,
- allele_flip_drop = TRUE,
- allele_flip_z = TRUE,
- allele_flip_frq = TRUE,
- bi_allelic_filter = TRUE,
- flip_frq_as_biallelic = FALSE,
- snp_ids_are_rs_ids = TRUE,
- remove_multi_rs_snp = FALSE,
- frq_is_maf = TRUE,
- indels = TRUE,
- drop_indels = FALSE,
- drop_na_cols = c("SNP", "CHR", "BP", "A1", "A2", "FRQ", "BETA", "Z", "OR", "LOG_ODDS",
- "SIGNED_SUMSTAT", "SE", "P", "N"),
- dbSNP = 155,
- check_dups = TRUE,
- sort_coordinates = TRUE,
- nThread = 1,
- save_path = tempfile(fileext = ".tsv.gz"),
- write_vcf = FALSE,
- tabix_index = FALSE,
- return_data = FALSE,
- return_format = "data.table",
- ldsc_format = FALSE,
- save_format = NULL,
- log_folder_ind = FALSE,
- log_mungesumstats_msgs = FALSE,
- log_folder = tempdir(),
- imputation_ind = FALSE,
- force_new = FALSE,
- mapping_file = sumstatsColHeaders,
- rmv_chrPrefix = NULL
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
name of the reference genome to convert to -("GRCh37" or "GRCh38"). This will only occur if the current genome build does -not match. Default is not to convert the genome build (NULL).
source of the chain file to use in liftover, if converting -genome build ("ucsc" or "ensembl"). Note that the UCSC chain files require a -license for commercial use. The Ensembl chain is used by default ("ensembl").
Path to local chain file to use instead of downloading. -Default of NULL i.e. no local file to use. NOTE if passing a local chain file -make sure to specify the path to convert from and to the correct build like -GRCh37 to GRCh38. We cannot sense check this for local files. The chain file -can be submitted as a gz file (as downloaded from source) or unzipped.
Binary, should non-negative -p-values <= 5e-324 be converted to 0? -P-values this small are at the limit of what R can represent and can cause errors with LDSC/MAGMA, so they -should be converted. Default is TRUE.
Binary, should p-values >1 be converted to 1? -P-values >1 should not be possible and can cause errors with LDSC/MAGMA and -should be converted. Default is TRUE.
Binary, should p-values <0 be converted to 0? -Negative p-values should not be possible and can cause errors -with LDSC/MAGMA and should be converted. Default is TRUE.
Whether to compute Z-score column. Default is FALSE. This -can be computed from Beta and SE with (Beta/SE) or P -(Z:=sign(BETA)*sqrt(stats::qchisq(P,1,lower=FALSE))). -Note that imputing the Z-score from P for every SNP will not be -perfectly correct and may result in a loss of power. This should only be done -as a last resort. Use 'BETA' to impute by BETA/SE and 'P' to impute by SNP -p-value.
When a "Z" column already exists, it will be used by
-default. To override and compute a new Z-score column from P set
-force_new_z=TRUE
.
Whether to impute N. Default of 0 won't impute, any other -integer will be imputed as the N (sample size) for every SNP in the dataset. -Note that imputing the sample size for every SNP is not correct and -should only be done as a last resort. N can also be imputed with "ldsc", -"sum", "giant" or "metal" by passing one of these for this field or a vector -of multiple. Sum and an integer value create an N column in the output -whereas giant, metal or ldsc create an Neff or effective sample size. If -multiples are passed, the formula used to derive it will be indicated.
Binary, if N (the number of samples) is not an integer, -should this be rounded? Default is TRUE.
Binary, whether BETA should be imputed using other effect -data if it isn't present in the sumstats. Note that this imputation is an -approximation (for Z & SE approach) so could have an effect on downstream -analysis. Use with caution. The different methods MungeSumstats will try to -impute beta (in this order of priority) are:
1. log(OR) 2. Z x SE -Default value is FALSE.
Binary, whether to map ES to BETA. We take BETA to be any -BETA-like value (including Effect Size). If this is not the case for your -sumstats, change this to FALSE. Default is TRUE.
Binary, whether the standard error should be imputed using -other effect data if it isn't present in the sumstats. Note that this -imputation is an approximation so could have an effect on downstream -analysis. Use with caution. The different methods MungeSumstats will try to -impute se (in this order of priority) are:
1. BETA / Z 2. abs(BETA / qnorm(P/2)) -Default is FALSE.
If multiple traits were studied, name of the trait for -analysis from the GWAS. Default is NULL.
If you have multiple traits (p-values) in the study -but you want to ignore these and instead use a standard named p-value, set -to TRUE. Default is FALSE, which will check for multi-traits.
numeric The minimum value permissible of the imputation -information score (if present in sumstats file). Default 0.9.
numeric The minimum value permissible of the frequency(FRQ) -of the SNP (i.e. Allele Frequency (AF)) (if present in sumstats file). By -default no filtering is done, i.e. value of 0.
Binary Should the standard Error (SE) column be checked to -ensure it is greater than 0? Rows where it is not are removed (if present in -sumstats file). Default TRUE.
Binary should the effect columns in the data -BETA,OR (odds ratio),LOG_ODDS,SIGNED_SUMSTAT be checked to ensure none equal 0? -SNPs where they do are removed (if present in sumstats file). Default FALSE.
numeric The number of standard deviations above the mean that a SNP's -N must be for the SNP to be removed. Default is 5.
Drop rows where N is missing. Default is TRUE.
Chromosome naming style to use in the formatted summary
-statistics file ("NCBI", "UCSC", "dbSNP", or "Ensembl"). The NCBI and
-Ensembl styles both code chromosomes as 1-22, X, Y, MT
; the UCSC style is
-chr1-chr22, chrX, chrY, chrM
; and the dbSNP style is
-ch1-ch22, chX, chY, chMT
. Default is Ensembl.
Chromosomes to exclude from the formatted summary statistics
-file. Use NULL if no filtering is necessary. Default is c("X", "Y", "MT")
-which removes all non-autosomal SNPs.
Binary Should a check take place that all SNPs are on -the reference genome by SNP ID. Default is TRUE.
Binary Should a check take place to ensure the -alleles match the effect direction? Default is TRUE.
Binary Should MungeSumstats assume that the -effects are predominantly measured on the minor alleles? Default is FALSE as -this is an assumption that won't be appropriate in all cases. However, the -benefit is that if we know the majority of SNPs have their effects based on -the minor alleles, we can catch cases where the allele columns have been -mislabelled.
Binary Should SNPs with strand-ambiguous alleles -be removed. Default is FALSE.
Binary Should the allele columns be checked against -reference genome to infer if flipping is necessary. Default is TRUE.
Binary Should the SNPs for which neither their A1 nor -A2 base pair values match a reference genome be dropped. Default is TRUE.
Binary should the Z-score be flipped along with effect -and FRQ columns like Beta? It is assumed to be calculated off the effect size -not the P-value and so will be flipped i.e. default TRUE.
Binary should the frequency (FRQ) column be flipped -along with effect and z-score columns like Beta? Default TRUE.
Binary Should non-bi-allelic SNPs be removed. -Default is TRUE.
Binary Should non-bi-allelic SNPs frequency -values be flipped as 1-p despite there being other alternative alleles? -Default is FALSE but if set to TRUE, this allows non-bi-allelic SNPs to be -kept despite needing flipping.
Binary Should the supplied SNP ID's be assumed to -be RSIDs. If not, imputation using the SNP ID for other columns like -base-pair position or chromosome will not be possible. If set to FALSE, the -SNP RS ID will be imputed from the reference genome if possible. Default is -TRUE.
Binary Sometimes summary statistics can have -multiple RSIDs on one row (i.e. related to one SNP), for example -"rs5772025_rs397784053". This can cause an error so by default, the first -RS ID will be kept and the rest removed e.g."rs5772025". If you want to just -remove these SNPs entirely, set it to TRUE. Default is FALSE.
Conventionally the FRQ column is intended to show the -minor/effect allele frequency (MAF) but sometimes the major allele frequency -can be inferred as the FRQ column. This logical variable indicates that the -FRQ column should be renamed to MAJOR_ALLELE_FRQ if the frequency values -appear to relate to the major allele i.e. >0.5. By default this mapping won't -occur i.e. is TRUE.
Binary does your Sumstats file contain Indels? These don't -exist in our reference file so they will be excluded from checks if this -value is TRUE. Default is TRUE.
Binary, should any indels found in the sumstats be -dropped? These cannot be checked against a reference dataset and will have -the same RS ID and position as SNPs which can affect downstream analysis. -Default is FALSE.
A character vector of column names to be checked for
-missing values. Rows with missing values in any of these columns (if present
-in the dataset) will be dropped. If NULL
, all columns will be checked for
-missing values. Default columns are SNP, chromosome, position, allele 1,
-allele2, effect columns (frequency, beta, Z-score, standard error, log odds,
-signed sumstats, odds ratio), p value and N columns.
version of dbSNP to be used for imputation (144 or 155).
whether to check for duplicates - if formatting QTL -datasets this should be set to FALSE otherwise keep as TRUE. Default is TRUE.
Whether to sort by coordinates of resulting sumstats
Number of threads to use for parallel processes.
File path to save formatted data. Defaults to
-tempfile(fileext=".tsv.gz")
.
Whether to write as VCF (TRUE) or tabular file (FALSE).
Index the formatted summary statistics with -tabix for fast querying.
Return data.table
, GRanges
or VRanges
-directly to user. Otherwise, return the path to the save data. Default is
-FALSE.
If return_data is TRUE. Object type to be returned -("data.table","vranges","granges").
DEPRECATED, do not use. Use save_format="LDSC" instead.
Output format of sumstats. Options are NULL - standardised -output format from MungeSumstats, LDSC - output format compatible with LDSC -and openGWAS - output compatible with openGWAS VCFs. Default is NULL. -NOTE - If LDSC format is used, the naming convention of A1 as the -reference (genome build) allele and A2 as the effect allele will be reversed -to match LDSC (A1 will now be the effect allele). See more info on this -here. Note that any -effect columns (e.g. Z) will be in relation to A1 now instead of A2.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Binary Should a log be stored containing all -messages and errors printed by MungeSumstats in a run. Default is FALSE
Filepath to the directory for the log files and the log of -MungeSumstats messages to be stored. Default is a temporary directory. Note -the name of the log files (log messages and log outputs) are now the same as -the name of the file specified in the save path parameter with the extension -'_log_msg.txt' and '_log_output.txt' respectively.
Binary Should a column be added for each imputation -step to show which SNPs have imputed values for differing fields? This -includes a field denoting SNP allele flipping (flipped). The flipped -value denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
If a formatted file of the same name as save_path
-exists, formatting will be skipped and this file will be imported instead
-(default). Set force_new=TRUE
to override this.
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing, or the mapping we -give is incorrect, you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
DEPRECATED, do not use. Use chr_style instead - -chr_style = 'Ensembl' will give the same result as rmv_chrPrefix=TRUE used to -give.
The path to the modified sumstats file or the data itself, -depending on user choice. If log files are requested by the user, the return -value in both cases is a list.
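A few of the QC rules described in the arguments above can be sketched in base R on toy data. This is a hedged illustration: the clamping and SE formulas follow the argument descriptions, while the exact N_std rule (drop SNPs whose N exceeds the mean by more than N_std standard deviations) is an assumption about how the threshold is applied.

```r
# Toy summary statistics (illustrative values only).
sumstats <- data.frame(
  P    = c(0.05, 1.2, -0.01),
  BETA = c(0.10, 0.20, -0.30),
  Z    = c(2.0, 4.0, -3.0)
)

# convert_large_p / convert_neg_p: clamp p-values into [0, 1].
sumstats$P[sumstats$P > 1] <- 1
sumstats$P[sumstats$P < 0] <- 0

# impute_se, priority 1: SE = BETA / Z.
se_from_z <- sumstats$BETA / sumstats$Z

# impute_se, priority 2: SE = abs(BETA / qnorm(P / 2)), shown for one SNP.
se_from_p <- abs(0.10 / stats::qnorm(0.05 / 2))

# N_std (assumed rule): flag SNPs whose N is more than N_std standard
# deviations above the mean N.
N <- c(rep(1000, 99), 100000)
N_std <- 5
keep <- N <= mean(N) + N_std * stats::sd(N)
```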
-# Pass path to Educational Attainment Okbay sumstat file to a temp directory
-
-eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats"
-)
-
-## Call uses reference genome as default with more than 2GB of memory,
-## which is more than what 32-bit Windows can handle so remove certain checks
-## Using dbSNP = 144 for speed as it's smaller but you should use 155 unless
-## you know what you are doing and need 144
-
-is_32bit_windows <-
- .Platform$OS.type == "windows" && .Platform$r_arch == "i386"
-if (!is_32bit_windows) {
- reformatted <- format_sumstats(
- path = eduAttainOkbayPth,
- ref_genome = "GRCh37",
- dbSNP = 144
- )
-} else {
- reformatted <- format_sumstats(
- path = eduAttainOkbayPth,
- ref_genome = "GRCh37",
- on_ref_genome = FALSE,
- strand_ambig_filter = FALSE,
- bi_allelic_filter = FALSE,
- allele_flip_check = FALSE,
- dbSNP=144
- )
-}
-#>
-#>
-#> ******::NOTE::******
-#> - Formatted results will be saved to `tempdir()` by default.
-#> - This means all formatted summary stats will be deleted upon ending the R session.
-#> - To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
-#> ********************
-#> Formatted summary statistics will be saved to ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp4DII6I/filec16d74d43914.tsv.gz
-#> Warning: replacing previous import ‘utils::findMatches’ by ‘S4Vectors::findMatches’ when loading ‘SNPlocs.Hsapiens.dbSNP144.GRCh37’
-#> Importing tabular file: /private/var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T/RtmpKLvRpi/temp_libpath17f3d19176b21/MungeSumstats/extdata/eduAttainOkbay.txt
-#> Checking for empty columns.
-#> Infer Effect Column
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-#> Allele columns are ambiguous, attempting to infer direction
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-#> Loading SNPlocs data.
-#> Loading reference genome data.
-#> Preprocessing RSIDs.
-#> Validating RSIDs of 93 SNPs using BSgenome::snpsById...
-#> BSgenome::snpsById done in 25 seconds.
-#> Effect/frq column(s) relate to A2 in the inputted sumstats
-#> Found direction from matching reference genome - NOTE this assumes non-effect allele will match the reference genome
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-#> Summary statistics report:
-#> - 93 rows
-#> - 93 unique variants
-#> - 70 genome-wide significant variants (P<5e-8)
-#> - 20 chromosomes
-#> Checking for multi-GWAS.
-#> Checking for multiple RSIDs on one row.
-#> Checking SNP RSIDs.
-#> Checking for merged allele column.
-#> Checking A1 is uppercase
-#> Checking A2 is uppercase
-#> Checking for incorrect base-pair positions
-#> Ensuring all SNPs are on the reference genome.
-#> Loading SNPlocs data.
-#> Loading reference genome data.
-#> Preprocessing RSIDs.
-#> Validating RSIDs of 93 SNPs using BSgenome::snpsById...
-#> BSgenome::snpsById done in 14 seconds.
-#> Checking for correct direction of A1 (reference) and A2 (alternative allele).
-#> There are 46 SNPs where A1 doesn't match the reference genome.
-#> These will be flipped with their effect columns.
-#> Checking for missing data.
-#> Checking for duplicate columns.
-#> Checking for duplicate SNPs from SNP ID.
-#> Checking for SNPs with duplicated base-pair positions.
-#> INFO column not available. Skipping INFO score filtering step.
-#> Filtering SNPs, ensuring SE>0.
-#> Ensuring all SNPs have N<5 std dev above mean.
-#> Checking for bi-allelic SNPs.
-#> 67 SNPs (72%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
-#> The FRQ column was mapped from one of the following from the inputted summary statistics file:
-#> FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.B, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, ALL_AF
-#> As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
-#> set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
-#> Sorting coordinates with 'data.table'.
-#> Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp4DII6I/filec16d74d43914.tsv.gz
-#> Summary statistics report:
-#> - 93 rows (100% of original 93 rows)
-#> - 93 unique variants
-#> - 70 genome-wide significant variants (P<5e-8)
-#> - 20 chromosomes
-#> Done munging in 0.753 minutes.
-#> Successfully finished preparing sumstats file, preview:
-#> Reading header.
-#> SNP CHR BP A1 A2 FRQ BETA SE P
-#> <char> <int> <int> <char> <char> <num> <num> <num> <num>
-#> 1: rs301800 1 8490603 T C 0.17910 0.019 0.003 1.794e-08
-#> 2: rs11210860 1 43982527 G A 0.63060 -0.017 0.003 2.359e-10
-#> 3: rs34305371 1 72733610 G A 0.91231 -0.035 0.005 3.762e-14
-#> 4: rs2568955 1 72762169 T C 0.23690 -0.017 0.003 1.797e-08
-#> Returning path to saved data.
-# returned location has the updated summary statistics file
-
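The NOTE printed in the output above warns that results written to tempdir() are lost when the R session ends. One way to build a persistent save_path is sketched below in base R; the input path and the "formatted" directory name are placeholders, and the directory is created under tempdir() here only so that the sketch has no side effects.

```r
# Stand-in for the input path used in the example above
path <- file.path("some", "dir", "eduAttainOkbay.txt")

# Hypothetical output directory. Replace tempdir() with a persistent
# location (e.g. the project folder) in real use.
out_dir <- file.path(tempdir(), "formatted")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Swap the input extension for .tsv.gz, matching the extension of the
# default save_path
out_name  <- paste0(tools::file_path_sans_ext(basename(path)), ".tsv.gz")
save_path <- file.path(out_dir, out_name)

# Not run: pass the path on to format_sumstats()
# reformatted <- MungeSumstats::format_sumstats(path = path,
#                                               ref_genome = "GRCh37",
#                                               save_path = save_path)
```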
Returns an example of summary stats that have had their column names -already standardised with -standardise_header.
-formatted_example(
- path = system.file("extdata", "eduAttainOkbay.txt", package = "MungeSumstats"),
- formatted = TRUE,
- sorted = TRUE
-)
Path to raw example file. Default to built-in dataset.
Whether the column names should be formatted
-(default:TRUE
).
Whether the rows should be sorted by genomic coordinates
-(default:TRUE
).
sumstats_dt
sumstats_dt <- MungeSumstats::formatted_example()
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-#> Sorting coordinates with 'data.table'.
-
Download chain file for liftover
-genome build converted from ("hg38", "hg19")
genome build converted to ("hg19", "hg38")
chain file source used ("ucsc" as default, or "ensembl")
where is the chain file saved? Default is a temp directory
extra messages printed? Default is TRUE
loaded chain file for liftover
-R/get_eff_frq_allele_combns.R
- get_eff_frq_allele_combns.Rd
Get combinations of uncorrected allele and effect (and frq) columns
-get_eff_frq_allele_combns(
- mapping_file = sumstatsColHeaders,
- eff_frq_cols = c("BETA", "OR", "LOG_ODDS", "SIGNED_SUMSTAT", "Z", "FRQ")
-)
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing or the mapping we -give is incorrect you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
Corrected effect or frequency column names found in a -sumstats. Default of BETA, OR, LOG_ODDS, SIGNED_SUMSTAT, Z and FRQ.
datatable containing uncorrected and corrected combinations
-R/get_genome_build.R
- get_genome_build.Rd
Infers the genome build of the summary statistics file (GRCh37 or GRCh38) -from the data. Uses SNP (RSID) & CHR & BP to get genome build.
-get_genome_build(
- sumstats,
- nThread = 1,
- sampled_snps = 10000,
- standardise_headers = TRUE,
- mapping_file = sumstatsColHeaders,
- dbSNP = 155,
- header_only = FALSE,
- allele_match_ref = FALSE,
- ref_genome = NULL,
- chr_filt = NULL
-)
data table/data frame object of the summary statistics file for -the GWAS, or file path to summary statistics file.
Number of threads to use for parallel processes.
Downsample the number of SNPs used when inferring genome -build to save time.
Run
-standardise_sumstats_column_headers_crossplatform
.
MungeSumstats has a pre-defined
-column-name mapping file
-which should cover the most common column headers and their interpretations.
-However, if a column header that is in your file is missing or the mapping we
-give is incorrect you can supply your own mapping file. Must be a 2 column
-dataframe with column names "Uncorrected" and "Corrected". See
-data(sumstatsColHeaders)
for default mapping and necessary format.
version of dbSNP to be used (144 or 155). Default is 155.
Instead of reading in the entire sumstats
file,
-only read in the first N rows where N=sampled_snps
.
-This should help speed up cases where you have to read in sumstats
-from disk each time.
Instead of returning the genome_build this will -return the proportion of matches to each genome build for each allele -(A1,A2).
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Internal for testing - filter reference genomes and sumstats -to specific chromosomes for testing. Pass a list of chroms in format: -c("1","2"). Default is NULL i.e. no filtering
ref_genome the genome build of the data
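The idea of inferring a build from SNP, CHR and BP can be illustrated with a toy base R sketch. This is not the package's implementation: the reference tables, coordinates and the helper prop_match() are all made up, and the real function queries dbSNP reference packages instead.

```r
# Toy summary statistics (coordinates are invented for illustration)
sumstats <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  CHR = c("1", "1", "2", "2"),
  BP  = c(100L, 200L, 300L, 400L)
)

# Toy reference tables standing in for the SNP locations of each build
ref_grch37 <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  CHR = c("1", "1", "2", "2"),
  BP  = c(100L, 200L, 300L, 999L)
)
ref_grch38 <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  CHR = c("1", "1", "2", "2"),
  BP  = c(150L, 250L, 300L, 450L)
)

# Hypothetical helper: proportion of SNPs whose chromosome and position
# agree with a given reference table
prop_match <- function(ss, ref) {
  merged <- merge(ss, ref, by = "SNP", suffixes = c("", "_ref"))
  mean(merged$CHR == merged$CHR_ref & merged$BP == merged$BP_ref)
}

props <- c(
  GRCh37 = prop_match(sumstats, ref_grch37),
  GRCh38 = prop_match(sumstats, ref_grch38)
)

# The build with the larger share of matching positions is chosen
inferred <- names(props)[which.max(props)]
```

In this toy data three of four positions agree with the GRCh37 table and one of four with the GRCh38 table, so GRCh37 is selected.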
-Infers the genome build of summary statistics files (GRCh37 or GRCh38) -from the data. Uses SNP (RSID) & CHR & BP to get genome build.
-get_genome_builds(
- sumstats_list,
- header_only = TRUE,
- sampled_snps = 10000,
- names_from_paths = FALSE,
- dbSNP = 155,
- nThread = 1,
- chr_filt = NULL
-)
A named list of paths to summary statistics,
-or a named list of data.table
objects.
Instead of reading in the entire sumstats
file,
-only read in the first N rows where N=sampled_snps
.
-This should help speed up cases where you have to read in sumstats
-from disk each time.
Downsample the number of SNPs used when inferring genome -build to save time.
Infer the name of each item in sumstats_list
-from its respective file path.
-Only works if sumstats_list
is a list of paths.
version of dbSNP to be used (144 or 155). Default is 155.
Number of threads to use for parallel processes.
Internal for testing - filter reference genomes and sumstats -to specific chromosomes for testing. Pass a list of chroms in format: -c("1","2"). Default is NULL i.e. no filtering
ref_genome the genome build of the data
-Iterative version of get_genome_build
.
# Pass path to Educational Attainment Okbay sumstat file to a temp directory
-
-eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats"
-)
-sumstats_list <- list(ss1 = eduAttainOkbayPth, ss2 = eduAttainOkbayPth)
-
-## Call uses reference genome as default with more than 2GB of memory,
-## which is more than what 32-bit Windows can handle so remove certain checks
-is_32bit_windows <-
- .Platform$OS.type == "windows" && .Platform$r_arch == "i386"
-if (!is_32bit_windows) {
-
- #multiple sumstats can be passed at once to get all their genome builds:
- #ref_genomes <- get_genome_builds(sumstats_list = sumstats_list)
- #just passing first here for speed
- sumstats_list_quick <- list(ss1 = eduAttainOkbayPth)
- ref_genomes <- get_genome_builds(sumstats_list = sumstats_list_quick,
- dbSNP=144)
-}
-#> Inferring genome build of 1 sumstats file(s).
-#> Inferring genome build.
-#> Reading in only the first 10000 rows of sumstats.
-#> Importing tabular file: /private/var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T/RtmpKLvRpi/temp_libpath17f3d19176b21/MungeSumstats/extdata/eduAttainOkbay.txt
-#> Checking for empty columns.
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-#> Loading SNPlocs data.
-#> Loading reference genome data.
-#> Preprocessing RSIDs.
-#> Validating RSIDs of 93 SNPs using BSgenome::snpsById...
-#> BSgenome::snpsById done in 16 seconds.
-#> Loading SNPlocs data.
-#> Warning: replacing previous import ‘utils::findMatches’ by ‘S4Vectors::findMatches’ when loading ‘SNPlocs.Hsapiens.dbSNP144.GRCh38’
-#> Loading reference genome data.
-#> Preprocessing RSIDs.
-#> Validating RSIDs of 93 SNPs using BSgenome::snpsById...
-#> BSgenome::snpsById done in 29 seconds.
-#> Inferred genome build: GRCH37
-#> Time difference of 47.96309 secs
-#> GRCH37: 1 file(s)
-
R/get_unique_name_log_file.R
- get_unique_name_log_file.Rd
Simple function to ensure the new entry name to a list doesn't have the same -name as another entry
-get_unique_name_log_file(name, log_files)
proposed name for the entry
list of log file locations
a unique name (character)
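A base R sketch of the same idea, ensuring a proposed entry name does not clash with names already in a list, is shown below. The helper unique_entry_name() and the log file names are hypothetical stand-ins, not the package's own code.

```r
# Hypothetical stand-in for the behaviour described above
unique_entry_name <- function(name, existing) {
  # make.unique() appends a numeric suffix to later duplicates, so the
  # proposed name is placed last and read back after de-duplication
  candidates <- make.unique(c(existing, name), sep = "_")
  candidates[length(candidates)]
}

# Made-up list of log file locations
log_files <- list(snp_missing = "a.tsv.gz", snp_dup = "b.tsv.gz")

clashing <- unique_entry_name("snp_dup", names(log_files))
fresh    <- unique_entry_name("info_filter", names(log_files))
```

A name that is already taken comes back with a suffix, while an unused name is returned unchanged.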
-Get VCF sample ID(s)
-get_vcf_sample_ids(path)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
sample_id
-Convert a GRanges into a data.table.
-granges_to_dt(gr)
A GRanges object.
A data.table object.
-UCSC Chain file hg19 to hg38, .chain.gz file, downloaded from -https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/ on 09/10/21
-gunzipped chain file
-The chain file was downloaded from
-https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/
-
-utils::download.file('ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz',tempdir())
-
UCSC Chain file hg19 to hg38, .chain.gz file, downloaded on 09/10/21 -To be used as a back up if the download from UCSC fails.
-NA
-UCSC Chain file hg38 to hg19, .chain.gz file, downloaded from -https://hgdownload.cse.ucsc.edu/goldenpath/hg38/liftOver/ on 09/10/21
-gunzipped chain file
-The chain file was downloaded from
-https://hgdownload.cse.ucsc.edu/goldenpath/hg38/liftOver/
-
-utils::download.file('ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz',tempdir())
-
UCSC Chain file hg38 to hg19, .chain.gz file, downloaded on 09/10/21 -To be used as a back up if the download from UCSC fails.
-NA
-Local ieu-a-298 file from IEU Open GWAS, downloaded on 09/10/21.
-gunzipped tsv file
-The file was downloaded with:
-
-MungeSumstats::import_sumstats(ids = "ieu-a-298",ref_genome = "GRCH37")
-
Local ieu-a-298 file from IEU Open GWAS, downloaded on 09/10/21. -This is done in case the download in the package vignette fails.
-NA
-R/import_sumstats.R
- import_sumstats.Rd
Requires internet access to run.
-List of Open GWAS study IDs
-(e.g. c("prot-a-664", "ieu-b-4760")
).
Where to download the original VCF from Open GWAS.
-WARNING: This is set to tempdir()
by default.
-This means the raw (pre-formatted) VCFs will be deleted upon ending the R session.
-Change this to keep the raw VCF file on disk
-(e.g. vcf_dir="./raw_vcf"
).
Download the original VCF from Open GWAS.
Directory to save formatted summary statistics in.
Whether to write as VCF (TRUE) or tabular file (FALSE).
"axel"
(multi-threaded) or
-"download.file"
(single-threaded) .
Run quietly.
If a formatted file of the same names as save_path
-exists, formatting will be skipped and this file will be imported instead
-(default). Set force_new=TRUE
to override this.
Overwrite a previously downloaded VCF -with the same path name.
Number of threads to use for parallel processes.
If parallel_across_ids=TRUE
-and nThread>1
,
-then each ID in ids
will be processed in parallel.
Arguments passed on to format_sumstats
path
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
ref_genome
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
convert_ref_genome
name of the reference genome to convert to -("GRCh37" or "GRCh38"). This will only occur if the current genome build does -not match. Default is not to convert the genome build (NULL).
chain_source
source of the chain file to use in liftover, if converting -genome build ("ucsc" or "ensembl"). Note that the UCSC chain files require a -license for commercial use. The Ensembl chain is used by default ("ensembl").
local_chain
Path to local chain file to use instead of downloading. -Default of NULL i.e. no local file to use. NOTE if passing a local chain file -make sure to specify the path to convert from and to the correct build like -GRCh37 to GRCh38. We can not sense check this for local files. The chain file -can be submitted as a gz file (as downloaded from source) or unzipped.
convert_small_p
Binary, should non-negative -p-values <= 5e-324 be converted to 0? -Small p-values pass the R limit and can cause errors with LDSC/MAGMA and -should be converted. Default is TRUE.
convert_large_p
Binary, should p-values >1 be converted to 1? -P-values >1 should not be possible and can cause errors with LDSC/MAGMA and -should be converted. Default is TRUE.
convert_neg_p
Binary, should p-values <0 be converted to 0? -Negative p-values should not be possible and can cause errors -with LDSC/MAGMA and should be converted. Default is TRUE.
compute_z
Whether to compute Z-score column. Default is FALSE. This -can be computed from Beta and SE with (Beta/SE) or P -(Z:=sign(BETA)*sqrt(stats::qchisq(P,1,lower=FALSE))). -Note that imputing the Z-score from P for every SNP will not be -perfectly correct and may result in a loss of power. This should only be done -as a last resort. Use 'BETA' to impute by BETA/SE and 'P' to impute by SNP -p-value.
force_new_z
When a "Z" column already exists, it will be used by
-default. To override and compute a new Z-score column from P set
-force_new_z=TRUE
.
compute_n
Whether to impute N. Default of 0 won't impute, any other -integer will be imputed as the N (sample size) for every SNP in the dataset. -Note that imputing the sample size for every SNP is not correct and -should only be done as a last resort. N can also be inputted with "ldsc", -"sum", "giant" or "metal" by passing one of these for this field or a vector -of multiple. Sum and an integer value creates an N column in the output -whereas giant, metal or ldsc create an Neff or effective sample size. If -multiples are passed, the formula used to derive it will be indicated.
convert_n_int
Binary, if N (the number of samples) is not an integer, -should this be rounded? Default is TRUE.
impute_beta
Binary, whether BETA should be imputed using other effect -data if it isn't present in the sumstats. Note that this imputation is an -approximation (for Z & SE approach) so could have an effect on downstream -analysis. Use with caution. The different methods MungeSumstats will try and -impute beta (in this order or priority) are:
log(OR) 2. Z x SE -Default value is FALSE.
es_is_beta
Binary, whether to map ES to BETA. We take BETA to be any -BETA-like value (including Effect Size). If this is not the case for your -sumstats, change this to FALSE. Default is TRUE.
impute_se
Binary, whether the standard error should be imputed using -other effect data if it isn't present in the sumstats. Note that this -imputation is an approximation so could have an effect on downstream -analysis. Use with caution. The different methods MungeSumstats will try and -impute se (in this order or priority) are:
BETA / Z 2. abs(BETA/ qnorm(P/2)) -Default is FALSE.
analysis_trait
If multiple traits were studied, name of the trait for -analysis from the GWAS. Default is NULL.
ignore_multi_trait
If you have multiple traits (p-values) in the study -but you want to ignore these and instead use a standard named p-value, set -to TRUE. Default is FALSE, which will check for multi-traits.
INFO_filter
numeric The minimum value permissible of the imputation -information score (if present in sumstats file). Default 0.9.
FRQ_filter
numeric The minimum value permissible of the frequency(FRQ) -of the SNP (i.e. Allele Frequency (AF)) (if present in sumstats file). By -default no filtering is done, i.e. value of 0.
pos_se
Binary Should the standard error (SE) column be checked to -ensure it is greater than 0? SNPs for which it is not are removed (if present in -sumstats file). Default TRUE.
effect_columns_nonzero
Binary Should the effect columns in the data -BETA, OR (odds ratio), LOG_ODDS, SIGNED_SUMSTAT be checked to ensure no SNP=0. -Those that do are removed (if present in sumstats file). Default FALSE.
N_std
numeric The number of standard deviations above the mean a SNP's -N is needed to be removed. Default is 5.
N_dropNA
Drop rows where N is missing. Default is TRUE.
chr_style
Chromosome naming style to use in the formatted summary
-statistics file ("NCBI", "UCSC", "dbSNP", or "Ensembl"). The NCBI and
-Ensembl styles both code chromosomes as 1-22, X, Y, MT
; the UCSC style is
-chr1-chr22, chrX, chrY, chrM
; and the dbSNP style is
-ch1-ch22, chX, chY, chMT
. Default is Ensembl.
rmv_chrPrefix
Is now deprecated, do not use. Use chr_style instead - -chr_style = 'Ensembl' will give the same result as rmv_chrPrefix=TRUE used to -give.
rmv_chr
Chromosomes to exclude from the formatted summary statistics
-file. Use NULL if no filtering is necessary. Default is c("X", "Y", "MT")
-which removes all non-autosomal SNPs.
on_ref_genome
Binary Should a check take place that all SNPs are on -the reference genome by SNP ID. Default is TRUE.
infer_eff_direction
Binary Should a check take place to ensure the -alleles match the effect direction? Default is TRUE.
eff_on_minor_alleles
Binary Should MungeSumstats assume that the -effects are predominantly measured on the minor alleles? Default is FALSE as -this is an assumption that won't be appropriate in all cases. However, the -benefit is that if we know the majority of SNPs have their effects based on -the minor alleles, we can catch cases where the allele columns have been -mislabelled.
strand_ambig_filter
Binary Should SNPs with strand-ambiguous alleles -be removed. Default is FALSE.
allele_flip_check
Binary Should the allele columns be checked against -reference genome to infer if flipping is necessary. Default is TRUE.
allele_flip_drop
Binary Should the SNPs for which neither their A1 or -A2 base pair values match a reference genome be dropped. Default is TRUE.
allele_flip_z
Binary should the Z-score be flipped along with effect -and FRQ columns like Beta? It is assumed to be calculated off the effect size -not the P-value and so will be flipped i.e. default TRUE.
allele_flip_frq
Binary should the frequency (FRQ) column be flipped -along with effect and z-score columns like Beta? Default TRUE.
bi_allelic_filter
Binary Should non-bi-allelic SNPs be removed. -Default is TRUE.
flip_frq_as_biallelic
Binary Should non-bi-allelic SNPs frequency -values be flipped as 1-p despite there being other alternative alleles? -Default is FALSE but if set to TRUE, this allows non-bi-allelic SNPs to be -kept despite needing flipping.
snp_ids_are_rs_ids
Binary Should the supplied SNP ID's be assumed to -be RSIDs. If not, imputation using the SNP ID for other columns like -base-pair position or chromosome will not be possible. If set to FALSE, the -SNP RS ID will be imputed from the reference genome if possible. Default is -TRUE.
remove_multi_rs_snp
Binary Sometimes summary statistics can have -multiple RSIDs on one row (i.e. related to one SNP), for example -"rs5772025_rs397784053". This can cause an error so by default, the first -RS ID will be kept and the rest removed e.g."rs5772025". If you want to just -remove these SNPs entirely, set it to TRUE. Default is FALSE.
frq_is_maf
Conventionally the FRQ column is intended to show the -minor/effect allele frequency (MAF) but sometimes the major allele frequency -can be inferred as the FRQ column. Set this logical variable to FALSE to have the -FRQ column renamed to MAJOR_ALLELE_FRQ if the frequency values -appear to relate to the major allele i.e. >0.5. By default this renaming won't -occur, i.e. the default is TRUE.
indels
Binary does your Sumstats file contain Indels? These don't -exist in our reference file so they will be excluded from checks if this -value is TRUE. Default is TRUE.
drop_indels
Binary, should any indels found in the sumstats be -dropped? These can not be checked against a reference dataset and will have -the same RS ID and position as SNPs which can affect downstream analysis. -Default is FALSE.
drop_na_cols
A character vector of column names to be checked for
-missing values. Rows with missing values in any of these columns (if present
-in the dataset) will be dropped. If NULL
, all columns will be checked for
-missing values. Default columns are SNP, chromosome, position, allele 1,
-allele2, effect columns (frequency, beta, Z-score, standard error, log odds,
-signed sumstats, odds ratio), p value and N columns.
dbSNP
version of dbSNP to be used for imputation (144 or 155).
check_dups
whether to check for duplicates - if formatting QTL -datasets this should be set to FALSE otherwise keep as TRUE. Default is TRUE.
sort_coordinates
Whether to sort by coordinates of resulting sumstats
save_path
File path to save formatted data. Defaults to
-tempfile(fileext=".tsv.gz")
.
tabix_index
Index the formatted summary statistics with -tabix for fast querying.
return_data
Return data.table
, GRanges
or VRanges
-directly to user. Otherwise, return the path to the save data. Default is
-FALSE.
return_format
If return_data is TRUE. Object type to be returned -("data.table","vranges","granges").
ldsc_format
DEPRECATED, do not use. Use save_format="LDSC" instead.
save_format
Output format of sumstats. Options are NULL - standardised -output format from MungeSumstats, LDSC - output format compatible with LDSC -and openGWAS - output compatible with openGWAS VCFs. Default is NULL. -NOTE - If LDSC format is used, the naming convention of A1 as the -reference (genome build) allele and A2 as the effect allele will be reversed -to match LDSC (A1 will now be the effect allele). See more info on this -here. Note that any -effect columns (e.g. Z) will be in relation to A1 now instead of A2.
log_folder_ind
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
log_mungesumstats_msgs
Binary Should a log be stored containing all -messages and errors printed by MungeSumstats in a run. Default is FALSE
log_folder
Filepath to the directory for the log files and the log of -MungeSumstats messages to be stored. Default is a temporary directory. Note -the name of the log files (log messages and log outputs) are now the same as -the name of the file specified in the save path parameter with the extension -'_log_msg.txt' and '_log_output.txt' respectively.
imputation_ind
Binary Should a column be added for each imputation -step to show which SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). The flipped -value denotes whether the alleles were switched based on -MungeSumstats' initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
mapping_file
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing or the mapping we -give is incorrect you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
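The chromosome naming styles listed for chr_style above can be illustrated with a small base R sketch. The helper convert_chr_style() is hypothetical and only mirrors the naming rules stated in the argument description; it is not how the package performs the conversion.

```r
# Hypothetical helper mirroring the documented naming styles:
#   NCBI / Ensembl: 1-22, X, Y, MT
#   UCSC:           chr1-chr22, chrX, chrY, chrM
#   dbSNP:          ch1-ch22, chX, chY, chMT
convert_chr_style <- function(chr,
                              style = c("Ensembl", "NCBI", "UCSC", "dbSNP")) {
  style <- match.arg(style)
  # Strip any existing "chr"/"ch" prefix and normalise the mitochondrial label
  bare <- sub("^(chr|ch)", "", as.character(chr), ignore.case = TRUE)
  bare <- toupper(bare)
  bare[bare == "M"] <- "MT"
  switch(style,
    Ensembl = bare,
    NCBI    = bare,
    UCSC    = paste0("chr", ifelse(bare == "MT", "M", bare)),
    dbSNP   = paste0("ch", bare)
  )
}

ensembl_style <- convert_chr_style(c("chr1", "X", "chrM"), "Ensembl")
ucsc_style    <- convert_chr_style(c("1", "X", "MT"), "UCSC")
dbsnp_style   <- convert_chr_style(c("1", "X", "MT"), "dbSNP")
```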
Either a named list of data objects or paths,
-depending on the arguments passed to format_sumstats
.
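The formulas quoted in the compute_z, impute_se and p-value conversion arguments can be worked through in base R. The sketch below uses the first three rows of the formatted preview shown earlier as toy input; the variable names are illustrative, and the vector p_raw is made up to show the clamping.

```r
# First three rows of the formatted preview, used as toy input
toy <- data.frame(
  BETA = c(0.019, -0.017, -0.035),
  SE   = c(0.003, 0.003, 0.005),
  P    = c(1.794e-08, 2.359e-10, 3.762e-14)
)

# compute_z = "BETA": Z is simply BETA / SE
z_from_beta <- toy$BETA / toy$SE

# compute_z = "P": the magnitude comes from the p-value and the sign from
# BETA, as in Z := sign(BETA) * sqrt(qchisq(P, 1, lower = FALSE))
z_from_p <- sign(toy$BETA) *
  sqrt(stats::qchisq(toy$P, df = 1, lower.tail = FALSE))

# impute_se, second method: abs(BETA / qnorm(P / 2))
se_from_p <- abs(toy$BETA / stats::qnorm(toy$P / 2))

# convert_neg_p and convert_large_p: clamp p-values into [0, 1]
p_raw     <- c(-0.1, 0.5, 1.2)
p_clamped <- pmin(pmax(p_raw, 0), 1)
```

The two Z-score routes need not agree exactly, which is one reason the documentation treats the p-value route as a last resort.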
#only run the examples if user has internet access:
-if(try(is.character(getURL("www.google.com")))==TRUE){
-### Search by criteria
-metagwas <- find_sumstats(
- traits = c("parkinson", "alzheimer"),
- min_sample_size = 5000
-)
-### Only use a subset for testing purposes
-ids <- (dplyr::arrange(metagwas, nsnp))$id
-
-### Default usage
-## You can supply import_sumstats()
-## with a list of as many OpenGWAS IDs as you want,
-## but we'll just give one to save time.
-
-## Call uses reference genome as default with more than 2GB of memory,
-## which is more than what 32-bit Windows can handle so remove certain checks
-## commented out down to runtime
-# datasets <- import_sumstats(ids = ids[1])
-}
-#> Error in getURL("www.google.com") : could not find function "getURL"
-
- All functions- - |
- |
---|---|
- - | -Ensures that parameters are compatible with LDSC format |
-
- - | -Check for N column if not present and user wants, impute N based on user's sample size. NOTE this will be the same value for each SNP which is not necessarily correct and may cause issues down the line. N can also be inputted with "ldsc", "sum", "giant" or "metal" by passing one or multiple of these. |
-
- - | -Download VCF file and its index file from Open GWAS |
-
- - | -Search Open GWAS for datasets matching criteria |
-
- - | -Check that summary statistics from GWAS are in a homogeneous format |
-
- - | -Formatted example |
-
- - | -Get combinations of uncorrected allele and effect (and frq) columns |
-
- - | -Infer genome builds |
-
- - | -UCSC Chain file hg19 to hg38 |
-
- - | -UCSC Chain file hg38 to hg19 |
-
- - | -Local ieu-a-298 file from IEU Open GWAS |
-
- - | -Import full genome-wide GWAS summary statistics from Open GWAS |
-
- - | -Tabix-index file: table |
-
- - | -Infer if effect relates to A1 or A2 if ambiguously named |
-
- - | -Genome build liftover |
-
- - | -List munged summary statistics |
-
- - | -Load the reference genome data for SNPs of interest |
-
- - | -Loads the SNP locations and alleles for Homo sapiens extracted from NCBI dbSNP Build 144. Reference genome version is dependent on user input. |
-
- - | -Parse data from log files |
-
- - | -GWAS Amyotrophic lateral sclerosis ieu open GWAS project - Subset |
-
- - | -GWAS Educational Attainment Okbay 2016 - Subset |
-
- - | -Read in file header |
-
- - | -Determine summary statistics file type and read them into memory |
-
- - | -Read in VCF file |
-
- - | -Register cores |
-
- - | -Standardise the column headers in the Summary Statistics files |
-
- - | -Summary Statistics Column Headers |
-
- - | -VCF to DF |
-
- - | -Write sum stats file to disk |
-
Convert summary stats file to tabix format.
-index_tabular(
- path,
- chrom_col = "CHR",
- start_col = "BP",
- end_col = start_col,
- overwrite = TRUE,
- remove_tmp = TRUE,
- verbose = TRUE
-)
Borrowed function from - -echotabix.
-Path to GWAS summary statistics file.
Name of the chromosome column in
-sumstats_dt
(e.g. "CHR").
Name of the starting genomic position
-column in sumstats_dt
(e.g. "POS","start").
Name of the ending genomic position
-column in sumstats_dt
(e.g. "POS","end").
-Can be the same as start_col
when sumstats_dt
-only contains SNPs that span 1 base pair (bp) each.
A logical(1) indicating whether dest
should
- be over-written, if it already exists.
Remove the temporary uncompressed version of the file -(.tsv).
Print messages.
Path to tabix-indexed tabular file
-Other tabix:
-index_vcf()
sumstats_dt <- MungeSumstats::formatted_example()
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-#> Sorting coordinates with 'data.table'.
-path <- tempfile(fileext = ".tsv")
-MungeSumstats::write_sumstats(sumstats_dt = sumstats_dt, save_path = path)
-#> Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp4DII6I/filec16d4d6776ee.tsv
-indexed_file <- MungeSumstats::index_tabular(path = path)
-#> Converting full summary stats file to tabix format for fast querying...
-#> Reading header.
-#> Ensuring file is bgzipped.
-#> Tabix-indexing file.
-#> Removing temporary .tsv file.
-
Convert summary stats file to tabix format
-index_vcf(path, verbose = TRUE)
Borrowed function from - -echotabix.
-Path to VCF.
Print messages.
Path to tabix-indexed tabular file
-Other tabix:
-index_tabular()
eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats")
-sumstats_dt <- data.table::fread(eduAttainOkbayPth, nThread = 1)
-sumstats_dt <-
-MungeSumstats:::standardise_sumstats_column_headers_crossplatform(
- sumstats_dt = sumstats_dt)$sumstats_dt
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-sumstats_dt <- MungeSumstats:::sort_coords(sumstats_dt = sumstats_dt)
-#> Sorting coordinates with 'data.table'.
-path <- tempfile(fileext = ".tsv")
-MungeSumstats::write_sumstats(sumstats_dt = sumstats_dt, save_path = path)
-#> Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp4DII6I/filec16d1d8e92cf.tsv
-
-indexed_file <- MungeSumstats::index_tabular(path = path)
-#> Converting full summary stats file to tabix format for fast querying...
-#> Reading header.
-#> Ensuring file is bgzipped.
-#> Tabix-indexing file.
-#> Removing temporary .tsv file.
-
R/infer_effect_column.R
- infer_effect_column.Rd
Three checks are made to infer which allele the effect/frequency information -relates to if they are ambiguous (named A0, A1 and A2 or equivalent):
Check if ambiguous naming conventions are used (i.e. allele 0, 1 and 2 or -equivalent). If not, exit; otherwise continue to the next checks. This can be -checked by using the mapping file and splitting A1/A2 mappings into those that -contain 0, 1 or 2 (ambiguous) and those that don't, e.g. effect, -tested (unambiguous, so fine for MungeSumstats to handle as is).
Look for an effect or frequency column in which A0/A1/A2 is explicitly -mentioned. If found, we know the direction and should update the A0/A1/A2 -naming so that A2 is the effect allele. We can look for such columns by getting -every combination of A0/A1/A2 naming and effect/frq naming.
If not found in check 2, a final check is made against the reference genome: -whichever of A0, A1 and A2 matches the reference genome more often -is taken as the non-effect allele. This relies on an assumption, -but it is still better than guessing from the ambiguous allele naming.
infer_effect_column(
- sumstats_dt,
- dbSNP = 155,
- sampled_snps = 10000,
- mapping_file = sumstatsColHeaders,
- nThread = nThread,
- ref_genome = NULL,
- on_ref_genome = TRUE,
- infer_eff_direction = TRUE,
- eff_on_minor_alleles = FALSE,
- return_list = TRUE
-)
data table obj of the summary statistics file for the -GWAS.
version of dbSNP to be used for imputation (144 or 155).
Downsample the number of SNPs used when inferring genome -build to save time.
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing or the mapping we -give is incorrect, you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
Number of threads to use for parallel processes.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
Binary Should a check take place that all SNPs are on -the reference genome by SNP ID. Default is TRUE.
Binary Should a check take place to ensure the -alleles match the effect direction? Default is TRUE.
Binary Should MungeSumstats assume that the -effects are predominantly measured on the minor alleles? Default is FALSE as -this is an assumption that won't be appropriate in all cases. However, the -benefit is that if we know the majority of SNPs have their effects based on -the minor alleles, we can catch cases where the allele columns have been -mislabelled.
Return the sumstats_dt
within a named list
-(default: TRUE
).
list containing sumstats_dt, the modified summary statistics data -table object
-Also, if eff_on_minor_alleles=TRUE, check 3 will be used in all cases. -However, this assumes that the effects are predominantly measured on the -minor alleles and should be used with caution as this is an assumption that -won't be appropriate in all cases. However, the benefit is that if we know -the majority of SNPs have their effects based on the minor alleles, we can -catch cases where the allele columns have been mislabelled. If -eff_on_minor_alleles=TRUE, checks 1 and 2 will be skipped.
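Check 1 above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the header names and the `is_ambiguous` vector are hypothetical, and MungeSumstats works from its mapping file rather than a bare regular expression.

```r
# Hypothetical allele column headers, as they might appear in a sumstats file
allele_headers <- c("A1", "A2", "ALLELE0", "EFFECT_ALLELE", "TESTED_ALLELE")

# A header is treated as ambiguous if it names the allele by 0, 1 or 2
is_ambiguous <- grepl("[012]", allele_headers)

# Split the headers into ambiguous and unambiguous groups
split(allele_headers, ifelse(is_ambiguous, "ambiguous", "unambiguous"))
```

Only the ambiguous group would need checks 2 and 3; headers such as EFFECT_ALLELE already state the direction.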
-sumstats <- MungeSumstats::formatted_example()
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-#> Sorting coordinates with 'data.table'.
-#for speed, don't run on_ref_genome part of check (on_ref_genome = FALSE)
-sumstats_dt2<-infer_effect_column(sumstats_dt=sumstats,on_ref_genome = FALSE)
-#> Infer Effect Column
-#> First line of summary statistics file:
-#> SNP CHR BP A1 A2 FRQ BETA SE P
-#> Allele columns are ambiguous, attempting to infer direction
-#> Can't infer allele columns from sumstats
-
Is a file bgz-compressed and tabix-indexed.
-is_tabix(path)
Path to file.
logical: whether the file is tabix-indexed or not.
- - -logical
-Transfer genomic coordinates from one genome build to another.
-liftover(
- sumstats_dt,
- convert_ref_genome,
- ref_genome,
- chain_source = "ensembl",
- imputation_ind = TRUE,
- chrom_col = "CHR",
- start_col = "BP",
- end_col = start_col,
- as_granges = FALSE,
- style = "NCBI",
- local_chain = NULL,
- verbose = TRUE
-)
data table obj of the summary statistics -file for the GWAS.
name of the reference genome to convert to -("GRCh37" or "GRCh38"). This will only occur if the current genome build does -not match. Default is not to convert the genome build (NULL).
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
chain file source used ("ensembl" as default, or "ucsc")
Binary Should a column be added for each imputation -step to show what SNPs have imputed values for differing fields. This -includes a field denoting SNP allele flipping (flipped). The flipped -value denotes whether the alleles were switched based on -MungeSumstats initial choice of A1, A2 from the input column headers and thus -may not align with what the creator intended. Note these columns will be -in the formatted summary statistics returned. Default is FALSE.
Name of the chromosome column in
-sumstats_dt
(e.g. "CHR").
Name of the starting genomic position
-column in sumstats_dt
(e.g. "POS","start").
Name of the ending genomic position
-column in sumstats_dt
(e.g. "POS","end").
-Can be the same as start_col
when sumstats_dt
-only contains SNPs that span 1 base pair (bp) each.
Return results as GRanges
-instead of a data.table (default: FALSE
).
Style to return GRanges object in
-(e.g. "NCBI" = 4; "UCSC" = "chr4";) (default: "NCBI"
).
Path to local chain file to use instead of downloading. -Default of NULL i.e. no local file to use. NOTE if passing a local chain file -make sure to specify the path to convert from and to the correct build like -GRCh37 to GRCh38. We cannot sense check this for local files. The chain file -can be submitted as a gz file (as downloaded from source) or unzipped.
Print messages.
sumstats_dt <- MungeSumstats::formatted_example()
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-#> Sorting coordinates with 'data.table'.
-
-sumstats_dt_hg38 <- liftover(sumstats_dt=sumstats_dt,
- ref_genome = "hg19",
- convert_ref_genome="hg38")
-#> Performing data liftover from hg19 to hg38.
-#> Converting summary statistics to GenomicRanges.
-#> Downloading chain file...
-#> Downloading chain file from Ensembl.
-#> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp4DII6I/GRCh37_to_GRCh38.chain.gz
-#> Reordering so first three column headers are SNP, CHR and BP in this order.
-#> Reordering so the fourth and fifth columns are A1 and A2.
-
Searches for and lists local GWAS summary statistics files munged by -format_sumstats or -import_sumstats.
-list_sumstats(
- save_dir = getwd(),
- pattern = "*.tsv.gz$",
- ids_from_file = TRUE,
- verbose = TRUE
-)
Top-level directory to recursively search -for summary statistics files within.
Regex pattern to search for files with.
Try to extract dataset IDs from file names.
-If FALSE
, will infer IDs from the directory names instead.
Print messages.
Named vector of summary stats paths.
-save_dir <- system.file("extdata",package = "MungeSumstats")
-munged_files <- MungeSumstats::list_sumstats(save_dir = save_dir)
-#> 1 file(s) found.
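The recursive search described above can be approximated with base R's `list.files()`. This is a sketch under stated assumptions: the directory layout and file names are made up, and the ID extraction is a simplification of what `ids_from_file = TRUE` might do.

```r
# Build a small, made-up directory tree to search
demo_dir <- file.path(tempdir(), "sumstats_demo")
dir.create(file.path(demo_dir, "gwas_A"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "gwas_A", "gwas_A.tsv.gz"))
file.create(file.path(demo_dir, "gwas_A", "notes.txt"))

# Recursively find files whose names end in .tsv.gz
found <- list.files(demo_dir, pattern = "\\.tsv\\.gz$",
                    recursive = TRUE, full.names = TRUE)

# Name each path by a dataset ID taken from the file name
names(found) <- sub("\\.tsv\\.gz$", "", basename(found))
found
```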
-
-
R/load_ref_genome_data.R
- load_ref_genome_data.Rd
Load the reference genome data for SNPs of interest
-load_ref_genome_data(
- snps,
- ref_genome,
- dbSNP = c(144, 155),
- msg = NULL,
- chr_filt = NULL
-)
-sumstats_dt <- formatted_example()
-rsids <- MungeSumstats:::load_ref_genome_data(snps = sumstats_dt$SNP,
- ref_genome = "GRCH37",
- dbSNP=144)
-
Character vector SNPs by rs_id from sumstats file of interest.
Name of the reference genome used for the GWAS -(GRCh37 or GRCh38)
version of dbSNP to be used (144 or 155)
Optional name of the column missing from the dataset in question. -Default is NULL
Internal for testing - filter reference genomes and sumstats -to specific chromosomes for testing. Pass a list of chroms in format: -c("1","2"). Default is NULL i.e. no filtering.
data table of snpsById, filtered to SNPs of interest.
-R/load_snp_loc_data.R
- load_snp_loc_data.Rd
Loads the SNP locations and alleles for Homo sapiens extracted from -NCBI dbSNP Build 144. Reference genome version is dependent on user input.
-load_snp_loc_data(ref_genome, dbSNP = c(144, 155), msg = NULL)
name of the reference genome used for the GWAS -(GRCh37 or GRCh38)
version of dbSNP to be used (144 or 155)
Optional name of the column missing from the dataset in question
SNP_LOC_DATA SNP positions and alleles for Homo sapiens extracted -from NCBI dbSNP Build 144
-SNP_LOC_DATA <- load_snp_loc_data("GRCH37",dbSNP=144)
-#> Loading SNPlocs data.
-
Example logs file produced by format_sumstats.
-logs_example(read = FALSE)
-eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats")
-sumstats_dt <- data.table::fread(eduAttainOkbayPth)
-#### Introduce values that need to be fixed ####
-sumstats_dt$Pval[10:15] <- 5
-sumstats_dt$Pval[20:22] <- -5
-sumstats_dt$Pval[23:25] <- "5e-324"
-ss_path <- tempfile()
-data.table::fwrite(sumstats_dt, ss_path)
-log_folder <- tempdir()
-reformatted <- MungeSumstats::format_sumstats(
- path = ss_path,
- ref_genome = "GRCh37",
- log_folder = log_folder,
- log_mungesumstats_msgs = TRUE,
- log_folder_ind = TRUE,
-)
-file.copy(reformatted$log_files$MungeSumstats_log_msg,
- "inst/extdata",overwrite = TRUE)
-
Whether to read the logs file into memory.
Path to logs file.
-Ensure A1 and A2 are upper case
-make_allele_upper(sumstats_dt, log_files)
list of log file locations
list containing sumstats_dt, the modified summary statistics data -table object and the log file list
-R/message_parallel.R
- message_parallel.Rd
Send messages to console even from within parallel processes
-message_parallel(...)
A message
-Print messages with option to silence.
-messager(..., v = TRUE)
Message input.
Whether to print messages.
Null output.
-R/parse_dropped_INFO.R
- parse_dropped_INFO.Rd
Support function for parse_logs.
-parse_dropped_INFO(l)
Lines of text from log file.
Numeric
-R/parse_dropped_chrom.R
- parse_dropped_chrom.Rd
Support function for parse_logs.
-parse_dropped_chrom(l)
Lines of text from log file.
Numeric
-R/parse_dropped_duplicates.R
- parse_dropped_duplicates.Rd
Support function for parse_logs.
-parse_dropped_duplicates(l)
Lines of text from log file.
Numeric
-R/parse_dropped_nonA1A2.R
- parse_dropped_nonA1A2.Rd
Support function for parse_logs.
-parse_dropped_nonA1A2(l)
Lines of text from log file.
Numeric
-R/parse_dropped_nonBiallelic.R
- parse_dropped_nonBiallelic.Rd
Support function for parse_logs.
-parse_dropped_nonBiallelic(l)
Lines of text from log file.
Numeric
-R/parse_dropped_nonRef.R
- parse_dropped_nonRef.Rd
Support function for parse_logs.
-parse_dropped_nonRef(l)
Lines of text from log file.
Numeric
-R/parse_flipped.R
- parse_flipped.Rd
Support function for parse_logs.
-parse_flipped(l)
Lines of text from log file.
Numeric
-R/parse_genome_build.R
- parse_genome_build.Rd
Support function for parse_logs.
-parse_genome_build(l)
Lines of text from log file.
Character
-Support function for parse_logs.
-parse_idStandard(l)
Lines of text from log file.
Character
-Parses data from the log files generated by
-format_sumstats or
-import_sumstats when the argument
-log_mungesumstats_msgs
is set to TRUE
.
parse_logs(
- save_dir = getwd(),
- pattern = "MungeSumstats_log_msg.txt$",
- verbose = TRUE
-)
Top-level directory to recursively search -for log files within.
Regex pattern to search for files with.
Print messages.
data.table of parsed log data.
-save_dir <- system.file("extdata",package = "MungeSumstats")
-log_data <- MungeSumstats::parse_logs(save_dir = save_dir)
-#> Parsing info from 1 log file(s).
-
Support function for parse_logs.
-parse_pval_large(l)
Lines of text from log file.
Numeric
-Support function for parse_logs.
-parse_pval_neg(l)
Lines of text from log file.
Numeric
-R/parse_pval_small.R
- parse_pval_small.Rd
Support function for parse_logs.
-parse_pval_small(l)
Lines of text from log file.
Numeric
-Support function for parse_logs.
-parse_report(l, entry = 1, line = 1)
Lines of text from log file.
Numeric
-R/parse_snps_freq_05.R
- parse_snps_freq_05.Rd
Support function for parse_logs.
-parse_snps_freq_05(l, percent = FALSE)
Lines of text from log file.
Numeric
-R/parse_snps_not_formatted.R
- parse_snps_not_formatted.Rd
Support function for parse_logs.
-parse_snps_not_formatted(l)
Lines of text from log file.
Numeric
-Support function for parse_logs.
-parse_time(l)
Lines of text from log file.
Character
-Prints the first n
lines of the sum stats.
preview_sumstats(save_path, nrows = 5L)
File path to save formatted data. Defaults to
-tempfile(fileext=".tsv.gz")
.
No return
-R/data.R
- raw_ALSvcf.Rd
VCF (VCFv4.2) of the GWAS Amyotrophic lateral sclerosis ieu -open GWAS project Dataset: ebi-a-GCST005647. -A subset of 99 SNPs
-vcf document with 528 items relating to 99 SNPs
-The summary statistics VCF (VCFv4.2) file was downloaded from
-https://gwas.mrcieu.ac.uk/datasets/ebi-a-GCST005647/
-and formatted to a .rda with the following:
-
-#Get example VCF dataset, use GWAS Amyotrophic lateral sclerosis
-ALS_GWAS_VCF <- readLines("ebi-a-GCST005647.vcf.gz")
-#Subset to just the first 99 SNPs
-ALSvcf <- ALS_GWAS_VCF[1:528]
-writeLines(ALSvcf,"inst/extdata/ALSvcf.vcf")
-
A VCF file (VCFv4.2) of the GWAS Amyotrophic lateral sclerosis ieu -open GWAS project has been subsetted here to act as an example summary -statistic file in VCF format which has some issues in the formatting. -MungeSumstats can correct these issues and produce a standardised summary -statistics format.
-NA
-GWAS Summary Statistics on Educational Attainment by Okbay et -al 2016: -PMID: 27898078 PMCID: PMC5509058 DOI: 10.1038/ng1216-1587b. -A subset of 93 SNPs
-txt document with 94 items
-The summary statistics file was downloaded from
-https://www.nature.com/articles/ng.3552
-and formatted to a .rda with the following:
-
-#Get example dataset, use Educational-Attainment_Okbay_2016
-link<-"Educational-Attainment_Okbay_2016/EduYears_Discovery_5000.txt"
-eduAttainOkbay<-readLines(link,n=100)
-#There is an issue where values end with .0, this 0 is removed in func
-#There are also SNPs not on ref genome or are bi/tri allelic
-#So need to remove these in this dataset as its used for testing
-tmp <- tempfile()
-writeLines(eduAttainOkbay,con=tmp)
-eduAttainOkbay <- data.table::fread(tmp) #DT read removes the .0's
-#remove those not on ref genome and with bi/tri allelic
-rmv <- c("rs192818565","rs79925071","rs1606974","rs1871109",
- "rs73074378","rs7955289")
-eduAttainOkbay <- eduAttainOkbay[!MarkerName %in% rmv]
-data.table::fwrite(eduAttainOkbay,file=tmp,sep="\t")
-eduAttainOkbay <- readLines(tmp)
-writeLines(eduAttainOkbay,"inst/extdata/eduAttainOkbay.txt")
-
GWAS Summary Statistics on Educational Attainment by Okbay et -al 2016 has been subsetted here to act as an example summary statistic file -which has some issues in the formatting. MungeSumstats can correct these -issues.
-NA
-Read in file header
-read_header(path, n = 2L, skip_vcf_metadata = FALSE, nThread = 1)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
integer. The (maximal) number of lines to read. Negative values -indicate that one should read up to the end of input on the connection.
logical, should VCF metadata be ignored
Number of threads to use for parallel processes.
First n
lines of the VCF header
path <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats")
-header <- read_header(path = path)
-#> Reading header.
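The behaviour of the `n` argument mirrors base R's `readLines()`. The sketch below uses a made-up temporary file and does not call `read_header` itself; it only shows how a positive and a negative `n` differ.

```r
# Write a small, made-up tab-separated file
tmp <- tempfile(fileext = ".txt")
writeLines(c("MarkerName\tCHR\tPOS", "rs1\t1\t100", "rs2\t1\t200"), tmp)

# n = 2L reads at most the first two lines (header plus one data row)
header_lines <- readLines(tmp, n = 2L)

# A negative n reads to the end of the file
all_lines <- readLines(tmp, n = -1L)
```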
-
Parse the p-value column in a VCF file, or other general -log10 p-values
-read_log_pval(
- sumstats_dt,
- mapping_file = sumstatsColHeaders,
- return_list = TRUE
-)
Summary stats data.table.
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing or the mapping we -give is incorrect, you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
Binary, whether to return the dt in a list or not - list
-is standard for the format_sumstats()
function.
Null output.
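The conversion this function is concerned with is simple arithmetic: a value stored as -log10(p) is turned back into a p-value with 10^(-x). A minimal base R sketch follows; the vector name `neg_log10_p` is hypothetical and the values are made up.

```r
# Made-up -log10 p-values, as stored in some VCF-style summary statistics
neg_log10_p <- c(0, 1, 2, 7.30103)

# Convert back to unadjusted p-values
p <- 10^(-neg_log10_p)
p
```

The last value corresponds to roughly 5e-8, the conventional genome-wide significance threshold.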
-R/read_sumstats.R
- read_sumstats.Rd
Determine summary statistics file type and read them into memory
-read_sumstats(
- path,
- nrows = Inf,
- standardise_headers = FALSE,
- samples = 1,
- sampled_rows = 10000L,
- nThread = 1,
- mapping_file = sumstatsColHeaders
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
integer. The (maximal) number of lines to read.
-If Inf
, will read in all rows.
Standardise headers first.
Which samples to use:
1 : Only the first sample will be used (DEFAULT).
NULL : All samples will be used.
c("<sample_id1>","<sample_id2>",...) : -Only user-selected samples will be used (case-insensitive).
First N rows to sample.
-Set NULL
to use full sumstats_file
.
-when determining whether cols are empty.
Number of threads to use for parallel processes.
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing or the mapping we -give is incorrect, you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
data.table
of formatted summary statistics
path <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats"
-)
-eduAttainOkbay <- read_sumstats(path = path)
-#> Importing tabular file: /private/var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T/RtmpKLvRpi/temp_libpath17f3d19176b21/MungeSumstats/extdata/eduAttainOkbay.txt
-#> Checking for empty columns.
-
Read in a VCF file as a VCF or a -data.table. -Can optionally save the VCF/data.table as well.
-read_vcf(
- path,
- as_datatable = TRUE,
- save_path = NULL,
- tabix_index = FALSE,
- samples = 1,
- which = NULL,
- use_params = TRUE,
- sampled_rows = 10000L,
- download = TRUE,
- vcf_dir = tempdir(),
- download_method = "download.file",
- force_new = FALSE,
- mt_thresh = 100000L,
- nThread = 1,
- verbose = TRUE
-)
Path to local or remote VCF file.
Return the data as a
-data.table (default: TRUE
)
-or a VCF (FALSE
).
File path to save formatted data. Defaults to
-tempfile(fileext=".tsv.gz")
.
Index the formatted summary statistics with -tabix for fast querying.
Which samples to use:
1 : Only the first sample will be used (DEFAULT).
NULL : All samples will be used.
c("<sample_id1>","<sample_id2>",...) : -Only user-selected samples will be used (case-insensitive).
Genomic ranges to be added if supplied. Default is NULL.
When TRUE
(default), increases the speed of reading in the VCF by
-omitting columns that are empty based on the head of the VCF (NAs only).
-NOTE that this requires the VCF to be sorted, bgzip-compressed, and
-tabix-indexed, which read_vcf will attempt to do.
First N rows to sample.
-Set NULL
to use full sumstats_file
.
-when determining whether cols are empty.
Download the VCF (and its index file)
-to a temp folder before reading it into R.
-This is important to keep TRUE
when nThread>1
to avoid
-making too many queries to remote file.
Where to download the original VCF from Open GWAS.
-WARNING: This is set to tempdir()
by default.
-This means the raw (pre-formatted) VCFs will be deleted upon ending the R session.
-Change this to keep the raw VCF file on disk
-(e.g. vcf_dir="./raw_vcf"
).
"axel"
(multi-threaded) or
-"download.file"
(single-threaded).
If a formatted file of the same name as save_path
-exists, formatting will be skipped and this file will be imported instead
-(default). Set force_new=TRUE
to override this.
When the number of rows (variants) in the VCF is
-< mt_thresh
, only use single-threading for reading in the VCF.
-This is because the overhead of parallelisation outweighs the speed benefits
-when VCFs are small.
Number of threads to use for parallel processes.
Print messages.
The VCF file in data.table format.
-#### Local file ####
-path <- system.file("extdata","ALSvcf.vcf", package="MungeSumstats")
-sumstats_dt <- read_vcf(path = path)
-#> Loading required namespace: GenomicFiles
-#> Using local VCF.
-#> bgzip-compressing VCF file.
-#> Finding empty VCF columns based on first 10,000 rows.
-#> Dropping 1 duplicate column(s).
-#> 1 sample detected: EBI-a-GCST005647
-#> Constructing ScanVcfParam object.
-#> VCF contains: 39,630,630 variant(s) x 1 sample(s)
-#> Reading VCF file: single-threaded
-#> Converting VCF to data.table.
-#> Expanding VCF first, so number of rows may increase.
-#> Dropping 1 duplicate column(s).
-#> Checking for empty columns.
-#> Unlisting 3 columns.
-#> Dropped 314 duplicate rows.
-#> Time difference of 0.1 secs
-#> VCF data.table contains: 101 rows x 11 columns.
-#> Time difference of 0.4 secs
-#> Renaming ID as SNP.
-#> sumstats has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
-#> No INFO (SI) column detected.
-
-#### Remote file ####
-## Small GWAS (0.2Mb)
-# path <- "https://gwas.mrcieu.ac.uk/files/ieu-a-298/ieu-a-298.vcf.gz"
-# sumstats_dt2 <- read_vcf(path = path)
-
-## Large GWAS (250Mb)
-# path <- "https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz"
-# sumstats_dt3 <- read_vcf(path = path, nThread=11)
-
-### Very large GWAS (500Mb)
-# path <- "https://gwas.mrcieu.ac.uk/files/ieu-a-1124/ieu-a-1124.vcf.gz"
-# sumstats_dt4 <- read_vcf(path = path, nThread=11)
-
Get the genome build of a remote or local VCF file.
-read_vcf_genome(
- header = NULL,
- validate = FALSE,
- default_genome = "HG19/GRCh37",
- verbose = TRUE
-)
Header extracted by scanVcfHeader.
Validate genome name using -mapGenomeBuilds.
When no genome can be extracted, -default to this genome build.
Print messages.
genome
-Parse INFO column in VCF file.
-read_vcf_info(sumstats_dt)
Summary stats data.table.
Null output.
-Parse MarkerName/SNP column in VCF file.
-read_vcf_markername(sumstats_dt)
Summary stats data.table.
Null output.
-Read a VCF file across 1 or more threads in parallel.
-If tilewidth
is not specified, the size of each chunk will be
-determined by total genome size divided by ntile
.
-By default, ntile
is equal to the number of threads, nThread
.
-For further discussion on how this function was optimised,
-see
-here
-and
-here.
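The chunking rule described above (tile width derived from genome size and `ntile`, with single-threading for small files) can be sketched as follows. The genome size is an assumed round figure, and `use_parallel` is a hypothetical helper written for illustration; neither is taken from the package source.

```r
# Assumed approximate human genome size in base pairs (illustrative only)
genome_size <- 3.1e9

# By default the number of tiles equals the number of threads
nThread <- 4
ntile <- nThread

# When tilewidth is not supplied, derive it from genome size and ntile
tilewidth <- ceiling(genome_size / ntile)

# Hypothetical helper: only parallelise when the VCF is large enough
use_parallel <- function(n_variants, mt_thresh = 100000L, nThread = 1) {
  n_variants >= mt_thresh && nThread > 1
}

use_parallel(101, nThread = nThread)     # small VCF: single-threaded
use_parallel(200000, nThread = nThread)  # large VCF: parallel
```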
read_vcf_parallel(
- path,
- samples = 1,
- which = NULL,
- use_params = TRUE,
- as_datatable = TRUE,
- sampled_rows = 10000L,
- include_xy = FALSE,
- download = TRUE,
- vcf_dir = tempdir(),
- download_method = "download.file",
- force_new = FALSE,
- tilewidth = NULL,
- mt_thresh = 100000L,
- nThread = 1,
- ntile = nThread,
- verbose = TRUE
-)
-path <- "https://gwas.mrcieu.ac.uk/files/ieu-a-298/ieu-a-298.vcf.gz"
-#### Single-threaded ####
-vcf <- MungeSumstats:::read_vcf_parallel(path = path)
-#### Parallel ####
-vcf2 <- MungeSumstats:::read_vcf_parallel(path = path, nThread=11)
-
Path to local or remote VCF file.
Which samples to use:
1 : Only the first sample will be used (DEFAULT).
NULL : All samples will be used.
c("<sample_id1>","<sample_id2>",...) : -Only user-selected samples will be used (case-insensitive).
Genomic ranges to be added if supplied. Default is NULL.
When TRUE
(default), increases the speed of reading in the VCF by
-omitting columns that are empty based on the head of the VCF (NAs only).
-NOTE that this requires the VCF to be sorted, bgzip-compressed, and
-tabix-indexed, which read_vcf will attempt to do.
Return the data as a
-data.table (default: TRUE
)
-or a VCF (FALSE
).
First N rows to sample.
-Set NULL
to use full sumstats_file
.
-when determining whether cols are empty.
Download the VCF (and its index file)
-to a temp folder before reading it into R.
-This is important to keep TRUE
when nThread>1
to avoid
-making too many queries to remote file.
Where to download the original VCF from Open GWAS.
-WARNING: This is set to tempdir()
by default.
-This means the raw (pre-formatted) VCFs will be deleted upon ending the R session.
-Change this to keep the raw VCF file on disk
-(e.g. vcf_dir="./raw_vcf"
).
"axel"
(multi-threaded) or
-"download.file"
(single-threaded).
If a formatted file of the same name as save_path
-exists, formatting will be skipped and this file will be imported instead
-(default). Set force_new=TRUE
to override this.
The desired tile width. The effective tile width might be slightly - different but is guaranteed to never be more than the desired width.
When the number of rows (variants) in the VCF is
-< mt_thresh
, only use single-threading for reading in the VCF.
-This is because the overhead of parallelisation outweighs the speed benefits
-when VCFs are small.
Number of threads to use for parallel processes.
The number of tiles to generate.
Print messages.
VCF file.
-Register a multi-threaded instance using BiocParallel.
-register_cores(workers = 1, progressbar = TRUE)
integer(1)
Number of workers. Defaults to the maximum of 1 or
- the number of cores determined by detectCores
minus 2 unless
- environment variables R_PARALLELLY_AVAILABLECORES_FALLBACK
or
- BIOCPARALLEL_WORKER_NUMBER
are set otherwise. For a
- SOCK
cluster, workers
can be a character()
- vector of host names.
logical(1)
Enable progress bar (based on plyr:::progress_text).
Null output.
-Remove columns that are empty or contain all the same values in a data.table.
-remove_empty_cols(sumstats_dt, sampled_rows = NULL, verbose = TRUE)
First N rows to sample.
-Set NULL
to use full sumstats_file
.
-when determining whether cols are empty.
Print messages.
Null output.
-R/report_summary.R
- report_summary.Rd
Prints report.
-report_summary(sumstats_dt, orig_dims = NULL)
data table obj of the summary -statistics file for the GWAS.
No return
-Select non-empty columns from each VCF field type.
-select_vcf_fields(
- path,
- sampled_rows = 10000L,
- which = NULL,
- samples = NULL,
- nThread = 1,
- verbose = TRUE
-)
Path to local or remote VCF file.
First N rows to sample.
-Set NULL
to use full sumstats_file
.
-when determining whether cols are empty.
Genomic ranges to be added if supplied. Default is NULL.
Which samples to use:
1 : Only the first sample will be used (DEFAULT).
NULL : All samples will be used.
c("<sample_id1>","<sample_id2>",...) : -Only user-selected samples will be used (case-insensitive).
Number of threads to use for parallel processes.
Print messages.
ScanVcfParam
object.
R/sort_coord_genomicranges.R
- sort_coord_genomicranges.Rd
Sort summary statistics table by genomic coordinates using a slower
-(but in some cases more robust) GenomicRanges
strategy
sort_coord_genomicranges(sumstats_dt)
data.table obj of the -summary statistics file for the GWAS.
Sorted sumstats_dt
-Sort summary statistics table by genomic coordinates.
-sort_coords(
- sumstats_dt,
- sort_coordinates = TRUE,
- sort_method = c("data.table", "GenomicRanges")
-)
data.table obj of the -summary statistics file for the GWAS.
Method to sort coordinates by:
"data.table" (default): Uses setorderv, -which is much faster than "GenomicRanges" -but less robust to variations in some sum stats files.
"GenomicRanges": Uses sort.GenomicRanges, -which is more robust to variations in sum stats files -but much slower than the "data.table" method.
Whether to sort by coordinates.
Sorted sumstats_dt
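Sorting by genomic coordinates means ordering by chromosome first and then by base-pair position. The base R sketch below shows the idea on made-up data; the chromosome ordering in `chr_levels` is an assumption for illustration and is not necessarily how either `sort_method` orders chromosomes internally.

```r
# Toy summary statistics; the values are made up for illustration
sumstats <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  CHR = c("2", "10", "1", "2"),
  BP  = c(500L, 100L, 300L, 50L),
  stringsAsFactors = FALSE
)

# Order chromosomes 1-22, then X, Y, MT, rather than alphabetically
chr_levels <- c(as.character(1:22), "X", "Y", "MT")

# Sort by chromosome rank first, then by position within chromosome
ord <- order(match(sumstats$CHR, chr_levels), sumstats$BP)
sorted <- sumstats[ord, ]
sorted
```

Matching against an explicit chromosome order avoids the alphabetical trap where "10" would sort before "2".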
-Sort summary statistics table by genomic coordinates using a fast
-data.table
-native strategy
sort_coords_datatable(
- sumstats_dt,
- chr_col = "CHR",
- start_col = "BP",
- end_col = start_col
-)
data.table obj of the -summary statistics file for the GWAS.
Chromosome column name.
Genomic end position column name.
Sorted sumstats_dt
-R/standardise_sumstats_column_headers_crossplatform.R
- standardise_header.Rd
Use a reference data table of common column header names (stored in
-sumstatsColHeaders
or user inputted mapping file) to convert them to a
-standard set, i.e. chromosome -> CHR. This function does not check that all
-the required column headers are present. The amended header is written
-directly back into the file
standardise_header(
- sumstats_dt,
- mapping_file = sumstatsColHeaders,
- uppercase_unmapped = TRUE,
- convert_A0 = TRUE,
- return_list = TRUE
-)
data table obj of the summary statistics file for the -GWAS.
MungeSumstats has a pre-defined column-name mapping file -which should cover the most common column headers and their interpretations. -However, if a column header that is in your file is missing or the mapping we -give is incorrect, you can supply your own mapping file. Must be a 2 column -dataframe with column names "Uncorrected" and "Corrected". See -data(sumstatsColHeaders) for default mapping and necessary format.
For columns that could not be identified in
-the mapping_file
, return them in the same format they were input as
-(without forcing them to uppercase).
Whether to convert A* (representing A0) to A1/A2. This -should be done unless you are checking whether A0 was present in the input, because -once converted this can no longer be inferred. Default is TRUE.
Return the sumstats_dt
within a named list
-(default: TRUE
).
list containing sumstats_dt, the modified summary statistics data -table object
-sumstats_dt <- data.table::fread(system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats"))
-sumstats_dt2 <- standardise_header(sumstats_dt=sumstats_dt)
-#> Standardising column headers.
-#> First line of summary statistics file:
-#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
-
List of uncorrected column headers often found in GWAS Summary -Statistics column headers. Note the effect allele will always be the A2 -allele, this is the approach done for -VCF(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7805039/). This is enforced -with the column header corrections here and also the check allele flipping -test.
-data("sumstatsColHeaders")
dataframe with 2 columns
-The code to prepare the .Rda file from the marker file is:
-
-# Most of the data in the table below comes from the LDSC GitHub wiki
-data("sumstatsColHeaders")
-# Make additions to sumstatsColHeaders using github version of MungeSumstats-
-# Shown is an example of adding new A1 and A2 naming
-a1_name <- c("NON","RISK","ALLELE")
-a2_name <- c("RISK","ALLELE")
-all_delims <- c("_",".",""," ","-")
-all_uncorr_a1 <- vector(mode="list",length = length(all_delims))
-all_corr_a1 <- vector(mode="list",length = length(all_delims))
-all_uncorr_a2 <- vector(mode="list",length = length(all_delims))
-all_corr_a2 <- vector(mode="list",length = length(all_delims))
-for(i in seq_along(all_delims)){
-delim <- all_delims[i]
-a1 <- unlist(paste(a1_name,collapse=delim))
-a2 <- unlist(paste(a2_name,collapse=delim))
-all_uncorr_a1[[i]] <- a1
-all_uncorr_a2[[i]] <- a2
-all_corr_a1[[i]] <- "A1"
- all_corr_a2[[i]] <- "A2"
-}
-se_cols <- data.frame("Uncorrected"=c(unlist(all_uncorr_a1),unlist(all_uncorr_a2)),
- "Corrected"=c(unlist(all_corr_a1),unlist(all_corr_a2)))
-# Or another example .....
-# shown is an example of adding columns for Standard Error (SE)
-se_cols <- data.frame("Uncorrected"=c("SE","se","STANDARD.ERROR",
- "STANDARD_ERROR","STANDARD-ERROR"),
- "Corrected"=rep("SE",5))
-sumstatsColHeaders <- rbind(sumstatsColHeaders,se_cols)
-#Once additions are made, order & save the new mapping dataset
-#now sort ordering -important for logic that
-# uncorrected=corrected comes first
-sumstatsColHeaders$ordering <-
- sumstatsColHeaders$Uncorrected==sumstatsColHeaders$Corrected
-sumstatsColHeaders <-
- sumstatsColHeaders[order(sumstatsColHeaders$Corrected,
- sumstatsColHeaders$ordering,decreasing = TRUE),]
-rownames(sumstatsColHeaders)<-1:nrow(sumstatsColHeaders)
-sumstatsColHeaders$ordering <- NULL
-#manually move FREQUENCY to above MAF - github issue 95
-frequency <- sumstatsColHeaders[sumstatsColHeaders$Uncorrected=="FREQUENCY",]
-maf <- sumstatsColHeaders[sumstatsColHeaders$Uncorrected=="MAF",]
-if(as.integer(rownames(frequency))>as.integer(rownames(maf))){
- sumstatsColHeaders[as.integer(rownames(frequency)),] <- maf
- sumstatsColHeaders[as.integer(rownames(maf)),] <- frequency
-}
-usethis::use_data(sumstatsColHeaders,overwrite = TRUE, internal=TRUE)
-save(sumstatsColHeaders,
- file="data/sumstatsColHeaders.rda")
-# You will need to restart your R session for the changes to take effect
-
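The ordering step in the code above (rows where the uncorrected name equals the corrected name come first within each corrected name) can be seen on a small table. This is a minimal sketch with hypothetical rows in toy_map, not the shipped sumstatsColHeaders data.

```r
# Toy mapping table (hypothetical rows, same two-column layout)
toy_map <- data.frame(
  Uncorrected = c("STDERR", "SE", "BETA", "EFFECT"),
  Corrected   = c("SE", "SE", "BETA", "BETA"),
  stringsAsFactors = FALSE
)

# Flag rows whose uncorrected name already equals the corrected name
toy_map$ordering <- toy_map$Uncorrected == toy_map$Corrected

# Sort by corrected name, then by the flag, both in decreasing order,
# so that flagged (TRUE) rows lead each group
toy_map <- toy_map[order(toy_map$Corrected, toy_map$ordering,
                         decreasing = TRUE), ]
rownames(toy_map) <- seq_len(nrow(toy_map))
toy_map$ordering <- NULL
print(toy_map)
```

Because decreasing = TRUE applies to both keys, the corrected names end up in reverse alphabetical order, with the identity row at the top of each group.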
List supported file formats
-supported_suffixes(
- tabular = TRUE,
- tabular_compressed = TRUE,
- vcf = TRUE,
- vcf_compressed = TRUE
-)
Include tabular formats.
Include compressed tabular formats.
Include Variant Call Format.
Include compressed Variant Call Format.
File formats
-Convert a data.table to GRanges.
-to_granges(
- sumstats_dt,
- seqnames.field = "CHR",
- start.field = "BP",
- end.field = "BP",
- style = c("NCBI", "UCSC")
-)
data table obj of the summary statistics file -for the GWAS.
A character vector of recognized names for the column in df
- that contains the chromosome name (a.k.a. sequence name) associated
- with each genomic range.
- Only the first name in seqnames.field
that is found
- in colnames(df)
is used.
- If none is found, then an error is raised.
A character vector of recognized names for the column in df
- that contains the start positions of the genomic ranges.
- Only the first name in start.field
that is found
- in colnames(df)
is used.
- If none is found, then an error is raised.
A character vector of recognized names for the column in df
- that contains the end positions of the genomic ranges.
- Only the first name in end.field
that is found
- in colnames(df)
is used.
- If none is found, then an error is raised.
GRanges
style to convert to, "NCBI" or "UCSC".
GRanges
object
Convert to VRanges
to_vranges(sumstats_dt)
data table obj of the summary statistics -file for the GWAS.
VRanges
object
Identify columns that are lists and turn them into vectors.
-unlist_dt(dt, verbose = TRUE)
data.table
Print messages.
dt
with list columns turned into vectors.
R/validate_parameters.R
- validate_parameters.Rd
Ensure that the input parameters are logical
-validate_parameters(
- path,
- ref_genome,
- convert_ref_genome,
- convert_small_p,
- es_is_beta,
- compute_z,
- compute_n,
- convert_n_int,
- analysis_trait,
- INFO_filter,
- FRQ_filter,
- pos_se,
- effect_columns_nonzero,
- N_std,
- N_dropNA,
- chr_style,
- rmv_chr,
- on_ref_genome,
- infer_eff_direction,
- eff_on_minor_alleles,
- strand_ambig_filter,
- allele_flip_check,
- allele_flip_drop,
- allele_flip_z,
- allele_flip_frq,
- bi_allelic_filter,
- flip_frq_as_biallelic,
- snp_ids_are_rs_ids,
- remove_multi_rs_snp,
- frq_is_maf,
- indels,
- drop_indels,
- check_dups,
- dbSNP,
- write_vcf,
- return_format,
- ldsc_format,
- save_format,
- imputation_ind,
- log_folder_ind,
- log_mungesumstats_msgs,
- mapping_file,
- tabix_index,
- chain_source,
- local_chain,
- drop_na_cols,
- rmv_chrPrefix
-)
Filepath for the summary statistics file to be formatted. A -dataframe or datatable of the summary statistics file can also be passed -directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
name of the reference genome to convert to -("GRCh37" or "GRCh38"). This will only occur if the current genome build does -not match. Default is not to convert the genome build (NULL).
Binary, should non-negative p-values <= 5e-324 be converted to 0? P-values this small fall below R's numeric limit and can cause errors with LDSC/MAGMA, so they should be converted. Default is TRUE.
Binary, whether to map ES to BETA. We take BETA to be any -BETA-like value (including Effect Size). If this is not the case for your -sumstats, change this to FALSE. Default is TRUE.
Whether to compute Z-score column. Default is FALSE. This -can be computed from Beta and SE with (Beta/SE) or P -(Z:=sign(BETA)*sqrt(stats::qchisq(P,1,lower=FALSE))). -Note that imputing the Z-score from P for every SNP will not be -perfectly correct and may result in a loss of power. This should only be done -as a last resort. Use 'BETA' to impute by BETA/SE and 'P' to impute by SNP -p-value.
Whether to impute N. Default of 0 won't impute, any other -integer will be imputed as the N (sample size) for every SNP in the dataset. -Note that imputing the sample size for every SNP is not correct and -should only be done as a last resort. N can also be inputted with "ldsc", -"sum", "giant" or "metal" by passing one of these for this field or a vector -of multiple. Sum and an integer value creates an N column in the output -whereas giant, metal or ldsc create an Neff or effective sample size. If -multiples are passed, the formula used to derive it will be indicated.
Binary, if N (the number of samples) is not an integer, -should this be rounded? Default is TRUE.
If multiple traits were studied, name of the trait for -analysis from the GWAS. Default is NULL.
numeric The minimum value permissible of the imputation -information score (if present in sumstats file). Default 0.9.
numeric The minimum value permissible of the frequency(FRQ) -of the SNP (i.e. Allele Frequency (AF)) (if present in sumstats file). By -default no filtering is done, i.e. value of 0.
Binary Should the standard error (SE) column be checked to ensure it is greater than 0? SNPs for which it is not are removed (if present in sumstats file). Default TRUE.
Binary Should the effect columns in the data (BETA, OR (odds ratio), LOG_ODDS, SIGNED_SUMSTAT) be checked to ensure that no SNP has a value of 0? SNPs that do are removed (if present in sumstats file). Default FALSE.
numeric The number of standard deviations above the mean that a SNP's N must exceed for the SNP to be removed. Default is 5.
Drop rows where N is missing. Default is TRUE.
Chromosome naming style to use in the formatted summary
-statistics file ("NCBI", "UCSC", "dbSNP", or "Ensembl"). The NCBI and
-Ensembl styles both code chromosomes as 1-22, X, Y, MT
; the UCSC style is
-chr1-chr22, chrX, chrY, chrM
; and the dbSNP style is
-ch1-ch22, chX, chY, chMT
. Default is Ensembl.
Chromosomes to exclude from the formatted summary statistics
-file. Use NULL if no filtering is necessary. Default is c("X", "Y", "MT")
-which removes all non-autosomal SNPs.
Binary Should a check take place that all SNPs are on -the reference genome by SNP ID. Default is TRUE.
Binary Should a check take place to ensure the -alleles match the effect direction? Default is TRUE.
Binary Should MungeSumstats assume that the effects are predominantly measured on the minor alleles? Default is FALSE, as this assumption won't be appropriate in all cases. However, the benefit is that if we know the majority of SNPs have their effects based on the minor alleles, we can catch cases where the allele columns have been mislabelled.
Binary Should SNPs with strand-ambiguous alleles -be removed. Default is FALSE.
Binary Should the allele columns be checked against -reference genome to infer if flipping is necessary. Default is TRUE.
Binary Should the SNPs for which neither the A1 nor the A2 base pair value matches the reference genome be dropped? Default is TRUE.
Binary should the Z-score be flipped along with effect -and FRQ columns like Beta? It is assumed to be calculated off the effect size -not the P-value and so will be flipped i.e. default TRUE.
Binary should the frequency (FRQ) column be flipped -along with effect and z-score columns like Beta? Default TRUE.
Binary Should non-bi-allelic SNPs be removed. -Default is TRUE.
Binary Should non-bi-allelic SNPs frequency -values be flipped as 1-p despite there being other alternative alleles? -Default is FALSE but if set to TRUE, this allows non-bi-allelic SNPs to be -kept despite needing flipping.
Binary Should the supplied SNP IDs be assumed to be RSIDs? If not, imputation using the SNP ID for other columns like base-pair position or chromosome will not be possible. If set to FALSE, the SNP RS ID will be imputed from the reference genome if possible. Default is TRUE.
Binary Sometimes summary statistics can have -multiple RSIDs on one row (i.e. related to one SNP), for example -"rs5772025_rs397784053". This can cause an error so by default, the first -RS ID will be kept and the rest removed e.g."rs5772025". If you want to just -remove these SNPs entirely, set it to TRUE. Default is FALSE.
Conventionally the FRQ column is intended to show the -minor/effect allele frequency (MAF) but sometimes the major allele frequency -can be inferred as the FRQ column. This logical variable indicates that the -FRQ column should be renamed to MAJOR_ALLELE_FRQ if the frequency values -appear to relate to the major allele i.e. >0.5. By default this mapping won't -occur i.e. is TRUE.
Binary does your Sumstats file contain Indels? These don't -exist in our reference file so they will be excluded from checks if this -value is TRUE. Default is TRUE.
Binary, should any indels found in the sumstats be dropped? These cannot be checked against a reference dataset and will have the same RS ID and position as SNPs, which can affect downstream analysis. Default is FALSE.
whether to check for duplicates - if formatting QTL -datasets this should be set to FALSE otherwise keep as TRUE. Default is TRUE.
version of dbSNP to be used for imputation (144 or 155).
Whether to write as VCF (TRUE) or tabular file (FALSE).
If return_data is TRUE, the object type to be returned ("data.table", "vranges", "granges").
DEPRECATED, do not use. Use save_format="LDSC" instead.
Output format of sumstats. Options are NULL - standardised output format from MungeSumstats, LDSC - output format compatible with LDSC and openGWAS - output compatible with openGWAS VCFs. Default is NULL. NOTE - If LDSC format is used, the naming convention of A1 as the reference (genome build) allele and A2 as the effect allele will be reversed to match LDSC (A1 will now be the effect allele). See more info on this here. Note that any effect columns (e.g. Z) will now be in relation to A1 instead of A2.
Binary Should a column be added for each imputation step to show which SNPs have imputed values for differing fields? This includes a field denoting SNP allele flipping (flipped). The flipped value denotes whether the alleles were switched based on MungeSumstats' initial choice of A1 and A2 from the input column headers, and thus may not align with what the creator intended. Note that these columns will be in the formatted summary statistics returned. Default is FALSE.
Binary Should log files be stored containing all -filtered out SNPs (separate file per filter). The data is outputted in the -same format specified for the resulting sumstats file. The only exception to -this rule is if output is vcf, then log file saved as .tsv.gz. Default is -FALSE.
Binary Should a log be stored containing all -messages and errors printed by MungeSumstats in a run. Default is FALSE
MungeSumstats has a pre-defined column-name mapping file which should cover the most common column headers and their interpretations. However, if a column header in your file is missing or the mapping we give is incorrect, you can supply your own mapping file. It must be a 2-column dataframe with column names "Uncorrected" and "Corrected". See data(sumstatsColHeaders) for the default mapping and necessary format.
Index the formatted summary statistics with -tabix for fast querying.
source of the chain file to use in liftover, if converting -genome build ("ucsc" or "ensembl"). Note that the UCSC chain files require a -license for commercial use. The Ensembl chain is used by default ("ensembl").
Path to local chain file to use instead of downloading. Default is NULL, i.e. no local file to use. NOTE if passing a local chain file, make sure to specify the path to convert from and to the correct build, like GRCh37 to GRCh38. We cannot sense check this for local files. The chain file can be submitted as a gz file (as downloaded from source) or unzipped.
A character vector of column names to be checked for
-missing values. Rows with missing values in any of these columns (if present
-in the dataset) will be dropped. If NULL
, all columns will be checked for
-missing values. Default columns are SNP, chromosome, position, allele 1,
-allele2, effect columns (frequency, beta, Z-score, standard error, log odds,
-signed sumstats, odds ratio), p value and N columns.
Is now deprecated, do not use. Use chr_style instead - chr_style = 'Ensembl' will give the same result as rmv_chrPrefix=TRUE used to give.
No return
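The two Z-score routes described for compute_z can be compared on toy data. This is a minimal base-R sketch under stated assumptions: the sumstats values are hypothetical, and P is derived exactly from BETA/SE so that both routes can be checked against each other. It is not the package's internal code.

```r
# Toy summary statistics (hypothetical values)
sumstats <- data.frame(
  SNP  = c("rs1", "rs2", "rs3"),
  BETA = c(0.10, -0.25, 0.05),
  SE   = c(0.02, 0.05, 0.05)
)

# Two-sided p-values derived exactly from BETA/SE (assumption for the demo)
sumstats$P <- 2 * stats::pnorm(-abs(sumstats$BETA / sumstats$SE))

# Route 1: Z from effect size and standard error
z_from_beta <- sumstats$BETA / sumstats$SE

# Route 2: Z from the p-value, signed by the effect direction
z_from_p <- sign(sumstats$BETA) *
  sqrt(stats::qchisq(sumstats$P, 1, lower.tail = FALSE))

print(data.frame(SNP = sumstats$SNP, z_from_beta, z_from_p))
```

The two columns agree here only because P was computed without rounding. Reported p-values are usually rounded or truncated, which is one reason the P route is described above as a last resort.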
-Function to convert a VariantAnnotation
-CollapsedVCF
/ExpandedVCF
-object to a data.frame
.
vcf2df(
- vcf,
- add_sample_names = TRUE,
- add_rowranges = TRUE,
- drop_empty_cols = TRUE,
- unique_cols = TRUE,
- unique_rows = TRUE,
- unlist_cols = TRUE,
- sampled_rows = NULL,
- verbose = TRUE
-)
Variant Call Format (VCF) file imported into R -as a VariantAnnotation -CollapsedVCF/ -ExpandedVCF object.
Append sample names to column names -(e.g. "EZ" --> "EZ_ubm-a-2929").
Include rowRanges
from VCF as well.
Drop columns that are filled entirely with:
-NA
, "."
, or ""
.
Only keep uniquely named columns.
Only keep unique rows.
If any columns are lists instead of vectors, unlist them.
-Required to be TRUE
when unique_rows=TRUE
.
First N rows to sample. Set NULL to use full sumstats_file when determining whether cols are empty.
Print messages.
data.frame version of VCF
-
-#### VariantAnnotation ####
-# path <- "https://github.com/brentp/vcfanno/raw/master/example/exac.vcf.gz"
-path <- system.file("extdata", "ALSvcf.vcf",
- package = "MungeSumstats")
-
-vcf <- VariantAnnotation::readVcf(file = path)
-vcf_df <- MungeSumstats:::vcf2df(vcf = vcf)
-#> Converting VCF to data.table.
-#> Expanding VCF first, so number of rows may increase.
-#> Checking for empty columns.
-#> Removing 2 empty columns.
-#> Unlisting 4 columns.
-#> Dropped 314 duplicate rows.
-#> Time difference of 0.1 secs
-#> VCF data.table contains: 101 rows x 12 columns.
-
Write sum stats file to disk
-write_sumstats(
- sumstats_dt,
- save_path,
- ref_genome = NULL,
- sep = "\t",
- write_vcf = FALSE,
- save_format = NULL,
- tabix_index = FALSE,
- nThread = 1,
- return_path = FALSE,
- save_path_check = FALSE
-)
data table obj of the summary statistics -file for the GWAS.
File path to save formatted data. Defaults to
-tempfile(fileext=".tsv.gz")
.
name of the reference genome used for the GWAS ("GRCh37" or -"GRCh38"). Argument is case-insensitive. Default is NULL which infers the -reference genome from the data.
The separator between columns. Defaults to the character in the set [,\t |;:]
that separates the sample of rows into the most number of lines with the same number of fields. Use NULL
or ""
to specify no separator; i.e. each line a single character column like base::readLines
does.
Whether to write as VCF (TRUE) or tabular file (FALSE).
Output format of sumstats. Options are NULL - standardised output format from MungeSumstats, LDSC - output format compatible with LDSC and openGWAS - output compatible with openGWAS VCFs. Default is NULL. NOTE - If LDSC format is used, the naming convention of A1 as the reference (genome build) allele and A2 as the effect allele will be reversed to match LDSC (A1 will now be the effect allele). See more info on this here. Note that any effect columns (e.g. Z) will now be in relation to A1 instead of A2.
Index the formatted summary statistics with -tabix for fast querying.
The number of threads to use. Experiment to see what works best for your data on your hardware.
Return save_path
.
-This will have been modified in some cases
-(e.g. after compressing and tabix-indexing a
-previously un-compressed file).
Ensure path name is valid (given the other arguments) -before writing (default: FALSE).
If return_path=TRUE
, returns save_path
.
-Else returns NULL
.
path <- system.file("extdata", "eduAttainOkbay.txt",
- package = "MungeSumstats"
-)
-eduAttainOkbay <- read_sumstats(path = path)
-#> Importing tabular file: /private/var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T/RtmpKLvRpi/temp_libpath17f3d19176b21/MungeSumstats/extdata/eduAttainOkbay.txt
-#> Checking for empty columns.
-write_sumstats(
- sumstats_dt = eduAttainOkbay,
- save_path = tempfile(fileext = ".tsv.gz")
-)
-#> Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp4DII6I/filec16d7adaa0e3.tsv.gz
-