1 Introduction

This vignette describes the package Methylation-Aware Genotype Association with R (MAGAR) available from GitHub. MAGAR uses DNA methylation data obtained using the Illumina BeadArrays, and genotyping data from Illumina genotyping microarrays or whole genome sequencing to compute methylation quantitative trait loci (methQTL). The package provides mutliple flavors of linear modeling strategies to compute methQTL as statistically significant interactions between single nucleotide polymorphisms (SNPs) and changes in the DNA methylation state of individual CpGs. DNA methylation values at single CpGs are first summarized into correlation blocks, and a representative of this correlation block (tag-CpG) is used in the methQTL calling.

2 Installation

MAGAR can be installed using the basic Bioconductor installation functions.

if(!requireNamespace("BiocManager")){
    install.packages("BiocManager")
}
if(!requireNamespace("MAGAR")){
    BiocManager::install("MAGAR")
}
suppressPackageStartupMessages(library(MAGAR))  
## No methods found in package 'oligoClasses' for request: 'mean' when loading 'crlmm'

2.1 Required external software tools

MAGAR depends on a funtional installation of PLINK for handling genotyping data. Follow the installation instructions available here and specify the path to the installation for the package qtlSetOption(plink.path="path_to_plink").

Additionally, depending on the type of the analysis, more external software tools are required (). More specifically, if you want to perform imputation bgzip and tabix from the htslib package are needed, as well as vcftools for sorting VCF files. Lastly, if methQTL calling is to be performed through fastQTL, the software tool needs to be installed and the location of the executable specified to MAGAR using the qtlSetOption function with the value bgzip.path, tabix.path, vcftools.path, fast.qtl.path.

Tool Description Required for URL
PLINK Toolsuite for processing genotyping data Import, Processing genotyping data in binary PLINK format or as VCF files http://zzz.bwh.harvard.edu/plink/
bgzip Tool for compression of files Imputation, Formating data for upload to the Michigan Imputation Server http://www.htslib.org/download/
tabix Tool for indexing sequencing files Imputation, Formating data for upload to the Michigan Imputation Server http://www.htslib.org/download/
vcftools Tool for handling VCF files Imputation, Formating data for upload to the Michigan Imputation Server https://vcftools.github.io/downloads.html
fastQTL Tool for determining eQTLs (methQTLs) from genotyping and DNA methylation data methQTL calling, Alternative methQTL calling method in comparison to the default linear model http://fastqtl.sourceforge.net/

3 Input data

MAGAR uses two types of data as input: DNA methylation data obtained using the Illumina Infinium BeadArrays or bisulfite sequencing and genotyping data obtained using genotyping microarrays or whole genome sequencing.

3.1 DNA methylation data (microarrays)

MAGAR utilizes the widely used RnBeads software package for DNA methylation data import. Thus, MAGAR supports the various input options available in RnBeads, including a direct download from the Gene Expression Omnibus (GEO), IDAT, and BED files. For further options, we refer to the RnBeads vignette and documentation. In addition to the raw methylation data, a sample annotation sheet specifying the samples to be analyzed needs to be provided. The sheet contains a line for each sample and looks as follows:

SampleID,age,sex,barcode
Sample_1,14,f,209054857842_R01C01
Sample_2,42,f,209054857842_R02C01
Sample_3,45,m,209054857842_R03C01

For further details on the import process, we refer to the RnBeads vignette. Most importantly, analysis options need to be specified for the import and preprocessing modules of RnBeads. MAGAR provides a default setting, which is available in extdata/rnbeads_options.xml. You can use this file as a template for your own setting and then specify it to the package:

opts <- rnb.xml2options(system.file("extdata","rnbeads_options.xml",package="MAGAR"))
rnb.options(identifiers.column="geo_accession",
    import.idat.platform="probes450")
xml.fi <- file.path(getwd(),"rnbeads_options.xml")
cat(rnb.options2xml(),file=xml.fi)
qtlSetOption(rnbeads.options = xml.fi)

3.2 Genotyping data

3.2.2 IDAT files

MAGAR also supports raw IDAT files and uses the CRLMM R-package, together with PLINK to perform genotype calling and data import. The package requires a single sample annotation sheet in the format described in the DNA methylation data section. In addition to the column names specified above, a column named GenoSentrixPosition has to be added, which specifies the IDAT file IDs.

SampleID,age,sex,barcode,GenoSentrixPosition
Sample_1,14,f,209054857842_R01C01,9701756058_R05C01
Sample_2,42,f,209054857842_R02C01,9701756058_R07C01
Sample_3,45,m,209054857842_R03C01,9742011016_R04C01

3.2.3 Imputation

Since Illumina SNP BeadArray data is typically imputed before further analysis, the package integrates a imputation functionality through the Michigan Imputation Server. Using the option setting described below, the package will automatically submit imputation jobs to the server and process the resulting files. In order to be able to perform computation on the server, an account is required. After the account is created, one has to request an API token in the user settings and specify it to MAGAR using the option imputation.user.token. During the imputation process, the package will stall for a while and wait for the job to finish. After the job is completed, the package will prompt for entering the password send via e-mail to the user account. The imputation process has to be split according to chromosomes, which is why multiple e-mails will be send to the account, and the imputation process can take up to several days. However, after imputation, the imputed data will be available as PLINK files, such that the imputation has to be performed only once. For preprocessing the data for upload to the imputation server, the package requires the bgzip and tabix tools from the htslib package. Also see further options to configure the imputation jobs at the Michigan Imputation Server documentation:

qtlSetOption(
  impute.geno.data=TRUE,
  imputation.reference.panel="apps@hrc-r1.1",
  imputation.phasing.method="shapeit",
  imputation.population="eur"
)

3.3 Perform data import

The doImport function requires the paths to the respective genotyping and DNA methylation data, as well as a sample annotation sheet as discussed earlier. In this vignette, we will describe the import of DNA methylation data in IDAT format and genotyping data as PLINK files. First, you’ll have to specify the paths to the corresponding IDAT and plink files. Additionally, you have to specify the sample identifier column in the sample annotation sheet that determines the samples in both the genotyping and DNA methylation data. For larger files, we recommend to activate the option to store large matrices on disk rather than in main memory (hdf5dump).

For imputed data, no further processing is performed on the genotyping data and the dosage values are used as they are:

idat.dir <- "idat_dir"
geno.dir <- "geno_dir"
anno.sheet <- "sample_annotation.tsv"
qtlSetOption(hdf5dump=TRUE)
imp.data <- doImport(data.location = c(idat.dir=idat.dir,geno.dir=geno.dir),
                    s.anno = anno.sheet,
                    s.id.col = "ID",
                    tab.sep = "\t",
                    out.folder = getwd())

Please note that the recode.allele.frequencies option specifies, if, according to the cohort analyzed, SNP reference and alternative allele are to be recoded according to the allele frequencies in the available samples. Alternatively, a path to a local version of dbSNP (Sherry et al. 2001) can be provided through db.snp.ref, and reference/alternative allele information will be automatically parsed from the database. This is especially crucial, if imputation is to be performed, since the Michigan Imputation Server is sensitive to reference mismatches.

4 methQTL calling

Although MAGAR conceptually splits the methQTL calling into two steps ((i) compute correlation block, (ii) call methQTL per correlation block), only a single function call is needed. The function only requires the input methQTLInput object produced in the previous step, but further options, such as covariates and the p-value cutoff can be directly specified as a function parameter, or as global parameters using ?qtlSetOption.

imp.data <- loadMethQTLInput(system.file("extdata","reduced_methQTL",package="MAGAR"))
qtlSetOption(standard.deviation.gauss=100,
    cluster.cor.threshold=0.75)
meth.qtl.res <- doMethQTL(imp.data,default.options=FALSE,p.val.cutoff=0.05)
## 2022-04-26 17:02:18     1.3  STATUS STARTED Imputation procedure knn 
## 2022-04-26 17:02:18     1.3  STATUS COMPLETED Imputation procedure knn 
## 
## 2022-04-26 17:02:18     1.3  STATUS STARTED Computing methQTLs
## 2022-04-26 17:02:18     1.3  STATUS     STARTED Computing methQTL for chromosome chr18
## 2022-04-26 17:02:18     1.3  STATUS         STARTED Compute correlation blocks
## 2022-04-26 17:02:18     1.3  STATUS             STARTED Compute correlation matrix
## 2022-04-26 17:02:18     1.3  STATUS             COMPLETED Compute correlation matrix
## 2022-04-26 17:02:18     1.3  STATUS             STARTED Compute pairwise distances
## 2022-04-26 17:02:19     1.3  STATUS             COMPLETED Compute pairwise distances
## 2022-04-26 17:02:20     1.3  STATUS             STARTED Weight distances
## 2022-04-26 17:02:20     1.3  STATUS             COMPLETED Weight distances
## 2022-04-26 17:02:22     1.3  STATUS             STARTED Compute graph
## 2022-04-26 17:02:22     1.3  STATUS             COMPLETED Compute graph
## 2022-04-26 17:02:22     1.3  STATUS             STARTED Compute clustering
## 2022-04-26 17:02:23     1.3  STATUS             COMPLETED Compute clustering
## 2022-04-26 17:02:23     1.3  STATUS         COMPLETED Compute correlation blocks
## Saving 7 x 5 in image
## 2022-04-26 17:02:23     1.3  STATUS         STARTED Compute methQTL per correlation block
## 2022-04-26 17:02:23     1.3  STATUS             STARTED Setting up Multicore
## 2022-04-26 17:02:23     1.3    INFO                 Using 1 cores
## 2022-04-26 17:02:23     1.3  STATUS             COMPLETED Setting up Multicore
## 2022-04-26 17:02:41     1.6  STATUS         COMPLETED Compute methQTL per correlation block
## 2022-04-26 17:02:41     1.6  STATUS     COMPLETED Computing methQTL for chromosome chr18
## 2022-04-26 17:02:42     1.6  STATUS COMPLETED Computing methQTLs

We will now present the two steps of the methQTL calling procedure in more detail.

4.1 Compute CpG correlation blocks

Since neighboring CpGs are often highly correlated, using each CpG independently as a potential methQTL candidate leads to many redundant results. We thus aimed to approximate DNA methylation haplotypes by determining highly correlated CpGs in close vicinity. The procedure itself is split into six steps, and is performed for each chromosome independently:

  1. Compute the (Pearson) correlation matrix between all CpGs (futher correlation types available in option correlation.type)
  2. Construct the distance matrix from the correlation matrix
  3. Discard all interactions with a correlation lower than a given threshold (option: cluster.cor.threshold)
  4. Weight the distance according to the genomic distance between the two CpGs with a Gaussian (option: standard.deviation.gauss). Higher values for the standard deviation lead to a lower penalty on distal CpGs, thus the clusters will become larger.
  5. Discard all interactions ranging longer than the option absolute.distance.cutoff
  6. Compute the Louvain clustering on the undirected, weighted graph induced by the distance matrix

This will return a clustering according to the correlation structure between neighboring CpGs that we will later use for methQTL calling. Note that we used simultation experiments to determine the parameters for each data type individually. They will be automatically loaded for the dataset that is used and are:

  • 450k: cluster.cor.threshold=0.2, standard.deviation.gauss=5,000, absolute.distance.cutoff=500,000
  • EPIC: cluster.cor.threshold=0.2, standard.deviation.gauss=3,000, absolute.distance.cutoff=500,000
  • RRBS/WGBS: cluster.cor.threshold=0.2, standard.deviation.gauss=250, absolute.distance.cutoff=500,000

4.2 Call methQTL per correlation block

From the list of correlation blocks, MAGAR computes methQTL interactions with all SNPs on the same chromosome. The process is split into three steps:

  1. Compute a representative CpG (tag-CpG) per correlation block, as specified with the option representative.cpg.computation (default: row.medians).
  2. Discard all SNPs that are further than absolute.distance.cutoff (default: 1,000,000) away from the representative CpG
  3. Call methQTL by using linear models. Multiple options of methQTL calling are available and can be selected via the option linear.model.type (default: classical.linear). Alternatively, fastQTL can be set as an option for meth.qtl.type. This will tell the package to use the fastQTL software (Ongen et al. 2016).

The meth.qtl.type tells, how a methQTL interaction is defined and provides three options, in addition to the already mentioned fastQTL:

  1. oneVSall: A CpG can only be influenced by one SNP. We choose the one with the lowest p-value.
  2. twoVSall: A CpG can both positively and negatively be influenced by two independent SNPs. The package will output those fulfilling the p-value cutoff.
  3. allVSall: For each CpG, all SNPs showing a p-value lower than the p-value cutoff will be returned.

In the methQTL calling process, potential covariates can be specified using the option sel.covariates. We recommend to include at least age and sex as covariates, as they have a strong influence on the DNA methylation pattern.

5 Downstream analysis and interpretation

5.1 How to use methQTLResult

The above procedure will create an object of class methQTLResult, which contains the methQTL that are called in the previous step. To get a table of all the methQTL, you need to extract the information from the object. In the majority of the function calls below, there is the option type, which takes on the values: * ‘SNP’: To characterize the SNPs that influence any DNA methylation state * ‘CpG’: To characterize the representative CpGs per correlation block that are influences by any SNP genotype * ‘cor.block’: To characterize all CpGs, which are part of a correlation block, whose representative CpG is influenced by any genotype

Furthermore, you can obtain genomic annotations for both the CpGs and the SNPs involved in the methQTL interactions:

result.table <- getResult(meth.qtl.res)
head(result.table)
anno.meth <- getAnno(meth.qtl.res,"meth")
head(anno.meth)
anno.geno <- getAnno(meth.qtl.res,"geno")
head(anno.geno)

For more detailed information about the output, also see the function getResultsGWASMap.

5.2 Plots

To visualize methQTL, MAGAR provides some plotting functions. Most functions return an object of type ggplot, which can be subsequently stored or viewed. Either all methQTL can be simultaneously visualized in a single plot, or a specific methQTL can be visualized:

result.table <- result.table[order(result.table$P.value,decreasing=FALSE),]
qtlPlotSNPCpGInteraction(imp.data,result.table$CpG[1],result.table$SNP[1])
## `geom_smooth()` using formula 'y ~ x'

qtlDistanceScatterplot(meth.qtl.res)