2 Overview

The MungeSumstats package is designed to facilitate the standardisation of GWAS summary statistics as utilised in our Nature Genetics paper1.

The package is designed to handle the lack of standardisation of output files by the GWAS community. There is a group who have now manually standardised many GWAS: R interface to the IEU GWAS database API • ieugwasr and gwasvcf but because a lot of GWAS remain closed access, these repositories are not all encompassing.

The GWAS-Download project has collated summary statistics from 200+ GWAS. This repository has been utilsed to identify the most common formats, all of which can be standardised with MungeSumstats.

Moreover, there is an emerging standard of VCF format for summary statistics files with multiple, useful, associated R packages such as vcfR. However, there is currently no method to convert VCF formats to a standardised format that matches older approaches.

The MungeSumstats package standardises both VCF and the most common summary statistic file formats to enable downstream integration and analysis.

MungeSumstats also offers a number of Quality Control (QC) steps which are important prerequisites for downstream analysis like Linkage disequilibrium score regression (LDSC) and MAGMA.

3 Aim

MungeSumstats will ensure that the all essential columns for analysis are present and syntactically correct. Generally, summary statistic files include (but are not limited to) the columns:

  • SNP : SNP ID (rs IDs)
  • CHR : Chromosome number
  • BP : Base pair positions
  • A1 : Effect allele
  • A2 : Non-effect allele
  • Z : Z-score
  • BETA : Effect size estimate relative to the alternative allele
  • P : Unadjusted p-value for SNP
  • N : Sample size
  • INFO: The imputation information score
  • FRQ: The minor allele frequency (MAF) of the SNP

Tests run by MungeSumstats include:

  • Check VCF format
  • Check tab, space or comma delimited
  • Check for header name synonyms
  • Check for multiple models or traits in GWAS
  • Check for uniformity in SNP ID - no mix of rs/missing rs/chr:bp
  • Check for CHR:BP:A2:A1 all in one column
  • Check for CHR:BP in one column
  • Check for A1/A2 in one column
  • Check if CHR and/or BP is missing (infer from reference genome)
  • Check if SNP ID is missing (infer from reference genome)
  • Check if A1 and/or A2 are missing (infer from reference genome)
  • Check that vital columns are present (SNP,CHR,BP,P,A1,A2)
  • Check for one signed/effect column (Z,OR,BETA,LOG_ODDS,SIGNED_SUMSTAT)
  • Check for missing data
  • Check for duplicated columns
  • Check for small p-values (lower than 5e-324)
  • Check N column is an integer
  • Check for SNPs with N greater than 5 times standard dev. plus the mean
  • Check SNPs are RS ID’s
  • Check for duplicated rows, based on SNP ID
  • Check for duplicated rows, based on base-pair position
  • Check for SNPs on reference genome. Correct not found SNP IDs using CHR and BP (infer from reference genome)
  • Check INFO score
  • Check for strand-ambiguous SNPs
  • Check for non-biallelic SNPs (infer from reference genome)
  • Check for allele flipping
  • Check for SNPs on chromosome X, Y, and mitochondrial SNPs (MT)

If a test is failed, the user will be notified and if possible, the input will be corrected. The QC steps from the checks above can also be adjusted to suit the user’s analysis, see MungeSumstats::format_sumstats.

4 Data

The MungeSumstats package contains small subsets of GWAS summary statistics files. Firstly, on Educational Attainment by Okbay et al 2016: PMID: 27898078 PMCID: PMC5509058 DOI: 10.1038/ng1216-1587b.

Secondly, a VCF file (VCFv4.2) relating to the GWAS Amyotrophic lateral sclerosis from ieu open GWAS project. Dataset: ebi-a-GCST005647: https://gwas.mrcieu.ac.uk/datasets/ebi-a-GCST005647/

These datasets will be used to showcase MungeSumstats functionality.

5 Running MungeSumstats

MungeSumstats is available on Bioconductor. To install the package on Bioconductor run the following lines of code:

if (!require("BiocManager"))
    install.packages("BiocManager")

BiocManager::install("MungeSumstats")

Once installed, load the package:

library(MungeSumstats)

To standardise the summary statistics’ file format, simply call format_sumstats() passing in the path to your summary statistics file. You must also specify which genome build was used in the GWAS(GRCh37 or GRCh38). The reference genome is used for multiple checks like deriving missing data such SNP/BP/CHR/A1/A2 and for QC steps like removing non-biallelic SNPs or strand-ambiguous SNPs. The path to the reformatted summary statistics file is returned by the function call.

Note that for a number of the checks implored by MungeSumstats a reference genome is used. If your GWAS summary statistics file of interest relates to GRCh38, you will need to install SNPlocs.Hsapiens.dbSNP144.GRCh38 and BSgenome.Hsapiens.NCBI.GRCh38 from Bioconductor as follows:

BiocManager::install("SNPlocs.Hsapiens.dbSNP144.GRCh38")
BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")

If your GWAS summary statistics file of interest relates to GRCh37, you will need to install SNPlocs.Hsapiens.dbSNP144.GRCh37 and BSgenome.Hsapiens.1000genomes.hs37d5 from Bioconductor as follows:

BiocManager::install("SNPlocs.Hsapiens.dbSNP144.GRCh37")
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")

These may take some time to install and are not included in the package as some users may only need one of GRCh37/GRCh38.

The Educational Attainment by Okbay GWAS summary statistics file is saved as a text document in the package’s external data folder so we can just pass the file path to it straight to MungeSumstats.

eduAttainOkbayPth <- system.file("extdata","eduAttainOkbay.txt",
                                  package="MungeSumstats")
reformatted <- 
  MungeSumstats::format_sumstats(path=eduAttainOkbayPth,ref_genome="GRCh37")
## First line of summary statistics file:
## MarkerName   CHR POS A1  A2  EAF Beta    SE  Pval    
## Succesfully finished preparing sumstats file, preview:
## SNP  CHR BP  A1  A2  FRQ BETA    SE  P
## rs12646808   4   3249828 T   C   0.6418  0.016   0.003   4.002e-08
## rs301800 1   8490603 T   C   0.1791  0.019   0.003   1.794e-08

The hyperparameters format_sumstats in that control the level of QC conducted by MungeSumstats are:

  • convert_small_p Binary, should p-values < 5e-324 be converted to 0? Small p-values pass the R limit and can cause errors with LDSC/MAGMA and should be converted. Default is TRUE.
  • convert_n_int Binary, if N (the number of samples) is not an integer, should this be rounded? Default is TRUE. analysis_trait If multiple traits were studied, name of the trait for analysis from the GWAS. Default is NULL
  • INFO_filter 0-1 The minimum value permissible of the imputation information score (if present in sumstatsfile). Default 0.9
  • N_std Numeric, the number of standard deviations above the mean a SNP’s N is needed to be removed. Default is 5.
  • rmv_chr vector or character The chromosomes on which the SNPs should be removed. Use NULL if no filtering necessary. Default is X, Y and mitochondrial.
  • on_ref_genome Binary, should a check take place that all SNPs are on the reference genome by SNP ID. Any that aren’t on SNPs not on the reference genome, will be corrected from the reference genome (if possible) from the chromosome and base pair postion data. Default is TRUE
  • strand_ambig_filter Binary, should SNPs with strand-ambiguous alleles be removed. Default is FALSE
  • allele_flip_check Binary, should the allele columns be chacked against reference genome to infer if flipping is necessary. Default is TRUE
  • bi_allelic_filter Binary, should non-biallelic SNPs be removed. Default is TRUE

VCF files can also be standardised to the same format as other summary statistic files. A subset of the Amyotrophic lateral sclerosis GWAS from the ieu open GWAS project (a .vcf file) has been added to MungeSumstats to demonstrate this functionality.Simply pass the path to the file in the same manner you would for other summary statistic files:

#save ALS GWAS from the ieu open GWAS project to a temp directory
ALSvcfPth <- system.file("extdata","ALSvcf.vcf",package="MungeSumstats")
#set a low INFO filter so we get a return
reformatted_vcf <- 
  MungeSumstats::format_sumstats(path=ALSvcfPth, ref_genome="GRCh37",INFO_filter=0.1)
## VCF format detected, this will be converted to a standard summary statistics file format.
## Inputted VCF format has -log10 P-values, these will be converted to unadjusted p-values in the 'P' column.
## First line of summary statistics file:
## CHROM    POS REF ALT INFO    ES  SE  LP  AF  ID  P   
## 12 SNPs are not on the reference genome.  These will be corrected from the reference genome.
## Reordering so first three column headers are SNP, CHR and BP in this order.
## 67 SNPs are below the INFO threshold of 0.1 and will be removed
## Succesfully finished preparing sumstats file, preview:
## SNP  CHR BP  A1  A2  INFO    BETA    SE  LP  AF  P
## rs58108140   1   10583   G   A   0.1589  0.0312  0.0393  0.369267    0.1589  0.427300105456596
## rs806731 1   30923   G   T   0.7843  -0.0114 0.0353  0.126854    0.7843  0.746699739815279

6 Future Enhancements

The MungeSumstats package aims to be able to handle the most common summary statistic file formats including VCF. If your file can not be formatted by MungeSumstats feel free to report the bug on github: https://github.com/neurogenomics/MungeSumstats along with your summary statistic file header.

We also encourage people to edit the code to resolve their particular issues too and are happy to incorporate these through pull requests on github. If your summary statistic file headers are not recognised by MungeSumstats but correspond to one of:

SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS,
SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ 

feel free to update the MungeSumstats::sumstatsColHeaders following the approach in the data.R file and add your mapping. Then use a pull request on github and we will incorporate this change into the package.

7 Session Information

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MungeSumstats_1.0.1 BiocStyle_2.20.2   
## 
## loaded via a namespace (and not attached):
##  [1] restfulr_0.0.13                            
##  [2] bslib_0.2.5.1                              
##  [3] compiler_4.1.0                             
##  [4] BiocManager_1.30.16                        
##  [5] jquerylib_0.1.4                            
##  [6] GenomeInfoDb_1.28.0                        
##  [7] XVector_0.32.0                             
##  [8] MatrixGenerics_1.4.0                       
##  [9] bitops_1.0-7                               
## [10] tools_4.1.0                                
## [11] zlibbioc_1.38.0                            
## [12] digest_0.6.27                              
## [13] lattice_0.20-44                            
## [14] jsonlite_1.7.2                             
## [15] evaluate_0.14                              
## [16] BSgenome_1.60.0                            
## [17] rlang_0.4.11                               
## [18] Matrix_1.3-4                               
## [19] DelayedArray_0.18.0                        
## [20] yaml_2.2.1                                 
## [21] parallel_4.1.0                             
## [22] xfun_0.24                                  
## [23] GenomeInfoDbData_1.2.6                     
## [24] rtracklayer_1.52.0                         
## [25] stringr_1.4.0                              
## [26] knitr_1.33                                 
## [27] Biostrings_2.60.1                          
## [28] S4Vectors_0.30.0                           
## [29] sass_0.4.0                                 
## [30] IRanges_2.26.0                             
## [31] grid_4.1.0                                 
## [32] stats4_4.1.0                               
## [33] data.table_1.14.0                          
## [34] Biobase_2.52.0                             
## [35] R6_2.5.0                                   
## [36] XML_3.99-0.6                               
## [37] BiocParallel_1.26.0                        
## [38] rmarkdown_2.9                              
## [39] bookdown_0.22                              
## [40] magrittr_2.0.1                             
## [41] GenomicAlignments_1.28.0                   
## [42] Rsamtools_2.8.0                            
## [43] matrixStats_0.59.0                         
## [44] htmltools_0.5.1.1                          
## [45] BiocGenerics_0.38.0                        
## [46] GenomicRanges_1.44.0                       
## [47] SummarizedExperiment_1.22.0                
## [48] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
## [49] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20   
## [50] stringi_1.6.2                              
## [51] RCurl_1.98-1.3                             
## [52] rjson_0.2.20                               
## [53] crayon_1.4.1                               
## [54] BiocIO_1.2.0

References

1. Nathan G. Skene, T. E. B., Julien Bryois. Genetic identification of brain cell types underlying schizophrenia. Nature Genetics (2018). doi:10.1038/s41588-018-0129-5