1 Using RaMWAS with other methylation platforms or data types

RaMWAS is primarily designed for studies of methylation measurements from enrichment platforms.

However, RaMWAS can also be useful for the analysis of methylation measurements from other platforms (e.g. Illumina HumanMethylation450K array) or other data types such as gene expression levels or genotype information. RaMWAS can perform several analysis steps on such data including: principal component analysis (PCA), association testing (MWAS, TWAS, GWAS), and multimarker analysis with cross validation using the elastic net.

1.1 Import data from other sources

Since no external data source is at hand, we show how to create and fill data matrices with artificial data. Real data can be imported in a similar way, with the random data generation replaced by reading data from existing sources.

We create data files in the same format as produced by Step 3 of RaMWAS.

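These files include a filematrix with the variable locations (CpG_locations), a text file with the chromosome names (CpG_chromosome_names.txt), and a filematrix with the data for all samples and variables (Coverage).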

First, we load the package and set up a working directory. The project directory dr can be set to a more convenient location when running the code.

library(ramwas)
library(pander) # used below to print tables

# work in a temporary directory
dr = paste0(tempdir(), "/simulated_matrix_data")
dir.create(dr, showWarnings = FALSE)
cat(dr,"\n")
## /tmp/RtmpK0APgL/simulated_matrix_data

Let the sample data matrix have 200 samples and 100,000 variables.

nsamples = 200
nvariables = 100000

For these 200 samples we generate a data frame with age and sex phenotypes and a batch effect covariate.

covariates = data.frame(
    sample = paste0("Sample_",seq_len(nsamples)),
    sex = seq_len(nsamples) %% 2,
    age = runif(nsamples, min = 20, max = 80),
    batch = paste0("batch",(seq_len(nsamples) %% 3))
)
pander(head(covariates))
sample    sex  age   batch
Sample_1  1    71.5  batch1
Sample_2  0    35.8  batch2
Sample_3  1    60.4  batch0
Sample_4  0    64.5  batch1
Sample_5  1    28.4  batch2
Sample_6  0    26.3  batch0

Next, we create the genomic locations for 100,000 variables.

temp = cumsum(sample(20e7 / nvariables, nvariables, replace = TRUE) + 0)
chr      = as.integer(temp %/% 1e7) + 1L
position = as.integer(temp %% 1e7)

locmat = cbind(chr = chr, position = position)
chrnames = paste0("chr", 1:22)
pander(head(locmat))
chr  position
1    1717
1    2245
1    3591
1    5074
1    5355
1    5565
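
As a quick check of the arithmetic above, the sketch below applies the same two formulas to a few hand-picked cumulative coordinates. The input values are made up for illustration; the chromosome length of 1e7 is the one used in the code above.

```r
# Hand-picked cumulative coordinates (illustrative values only)
temp = c(1717, 9999999, 10000000, 23456789)

# Same formulas as above: each chromosome spans 1e7 positions
chr      = as.integer(temp %/% 1e7) + 1L
position = as.integer(temp %% 1e7)
print(cbind(chr = chr, position = position))

# Expected number of chromosomes covered by the simulated locations:
# the mean step is mean(1:2000), taken nvariables = 100000 times
nchr_expected = mean(seq_len(2000)) * 100000 / 1e7
print(nchr_expected)
```

The expected total length is about 1e8, so the simulated variables fall on roughly the first ten chromosomes, well within the 22 names in chrnames.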

Now we save locations in a filematrix and create a text file with chromosome names.

fmloc = fm.create.from.matrix(
            filenamebase = paste0(dr,"/CpG_locations"),
            mat = locmat)
close(fmloc)
writeLines(con = paste0(dr,"/CpG_chromosome_names.txt"),
           text = chrnames)

Finally, we create the data matrix. Out of each 2000 variables, we add a sex effect to 225 and an age effect to 16 (the age-affected variables are among those with a sex effect). Each variable is also affected by noise and batch effects.

fm = fm.create(paste0(dr,"/Coverage"), nrow = nsamples, ncol = nvariables)

# Row names of the matrix are set to sample names
rownames(fm) = as.character(covariates$sample)

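# The matrix is filled 2000 variables (columns) at a time
byrows = 2000
for( i in seq_len(nvariables/byrows) ){
    # Uniform noise for all samples and the current block of variables
    slice = matrix(runif(nsamples*byrows), nrow = nsamples, ncol = byrows)
    # Sex effect in the first 225 variables of the block
    slice[,  1:225] = slice[,  1:225] + covariates$sex / 30 / sd(covariates$sex)
    # Age effect in variables 101 to 116 of the block
    slice[,101:116] = slice[,101:116] + covariates$age / 10 / sd(covariates$age)
    # Batch effect, shifted from block to block
    slice = slice + ((as.integer(factor(covariates$batch))+i) %% 3) / 40
    # Write the block into the filematrix
    fm[,(1:byrows) + byrows*(i-1)] = slice
}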
close(fm)
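
For real data, the random slices above would be replaced by measurements read from a file. The following is a minimal sketch, assuming a hypothetical tab-separated export with variables in rows and samples in columns. The file, its layout, and the names infile, datamat, and outdir are illustrative and not part of RaMWAS.

```r
# A tiny stand-in for a real export: variables in rows, samples in columns.
# The file layout and all names here are hypothetical.
infile = tempfile(fileext = ".txt")
beta = matrix(round(runif(12), 3), nrow = 4, ncol = 3,
              dimnames = list(paste0("cg", 1:4), paste0("Sample_", 1:3)))
write.table(beta, file = infile, sep = "\t", quote = FALSE, col.names = NA)

# Read the table and transpose it, so that rows are samples
# and columns are variables, as in the Coverage filematrix above.
tbl = read.table(infile, header = TRUE, sep = "\t", row.names = 1,
                 check.names = FALSE)
datamat = t(as.matrix(tbl))

# Save as a filematrix if the filematrix package is available.
if (requireNamespace("filematrix", quietly = TRUE)) {
    outdir = paste0(tempdir(), "/imported_matrix_data")
    dir.create(outdir, showWarnings = FALSE)
    fmdat = filematrix::fm.create.from.matrix(
                filenamebase = paste0(outdir, "/Coverage"),
                mat = datamat)
    close(fmdat)
}
```

For matrices too large to hold in memory, the same idea applies block by block, as in the loop above.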

1.2 Principal Component Analysis (PCA)

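To run PCA with RaMWAS we specify three parameters: the directory with the data matrix (dircoveragenorm), the data frame with covariates (covariates), and the names of the covariates to regress out (modelcovariates, none for now).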

param = ramwasParameters(
    dircoveragenorm = dr,
    covariates = covariates,
    modelcovariates = NULL
)

Now we run PCA.

ramwas4PCA(param)

The top several PCs are marginally distinct from the rest.

There are strong correlations between the top PCs and the sex, age, and batch covariates.
Note that for the categorical covariate (batch) the table shows R² instead of correlations.

# Get the directory with PCA results
pfull = parameterPreprocess(param)
tblcr = read.table(paste0(pfull$dirpca, "/PC_vs_covs_corr.txt"),
                 header = TRUE,
                 sep = "\t")
pander(head(tblcr, 5))
name  sex      age      batch_R2
PC1   -0.0278  -0.0991  0.984
PC2   0.0326   0.0372   0.986
PC3   -0.938   -0.163   0.00167
PC4   0.286    -0.942   0.000988
PC5   0.0126   -0.0021  8.94e-05
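
The distinction between a correlation (numeric covariate) and an R² (categorical covariate) can be illustrated in base R. This is only a sketch of the two statistics on a made-up score. It is an assumption, not a statement about how RaMWAS computes the table internally.

```r
set.seed(1)
n     = 200
sex   = seq_len(n) %% 2
batch = factor(paste0("batch", seq_len(n) %% 3))

# A made-up score standing in for one principal component:
# it depends on batch, not on sex
batch_means = c(batch0 = 0, batch1 = 1, batch2 = 2)
pc = unname(batch_means[as.character(batch)]) + rnorm(n, sd = 0.2)

# Numeric covariate: Pearson correlation and its p-value
r_sex = cor(pc, sex)
p_sex = cor.test(pc, sex)$p.value

# Categorical covariate: R2 and the F-test p-value from a one-way model
fit    = lm(pc ~ batch)
r2_bat = summary(fit)$r.squared
p_bat  = anova(fit)[["Pr(>F)"]][1]

print(c(r_sex = r_sex, p_sex = p_sex, r2_bat = r2_bat, p_bat = p_bat))
```

A correlation has a sign, while R² for a factor with several levels does not, which is why the batch column is reported on a different scale.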

The p-values for these correlations and R² show that the top two PCs are driven by the sample batch effects, while PC3 and PC4 are correlated with sex and age, respectively.

pfull = parameterPreprocess(param)
tblpv = read.table(paste0(pfull$dirpca, "/PC_vs_covs_pvalue.txt"),
                 header = TRUE,
                 sep = "\t")
pander(head(tblpv, 5))
name  sex       age       batch_R2
PC1   0.696     0.163     1.11e-178
PC2   0.647     0.601     1.6e-183
PC3   2.55e-93  0.0211    0.848
PC4   4.06e-05  6.37e-96  0.907
PC5   0.86      0.976     0.991

1.3 PCA with batch regressed out

It is common to regress out batch and lab-technical effects from the data in the analysis.

Let us regress out batch in our example by changing the modelcovariates parameter.

param$modelcovariates = "batch"

ramwas4PCA(param)

With batch regressed out, the top two PCs are now associated with sex and age instead of batch:

# Get the directory with PCA results
pfull = parameterPreprocess(param)
tblpv = read.table(paste0(pfull$dirpca, "/PC_vs_covs_pvalue.txt"),
                 header = TRUE,
                 sep = "\t")
pander(head(tblpv, 5))
name  sex       age       batch_R2
PC1   4.54e-93  0.0185    1
PC2   4.41e-05  1.25e-98  1
PC3   0.86      0.997     1
PC4   0.852     0.584     1
PC5   0.883     0.692     1

Note that the PCs are now orthogonal to the batch effects, and thus the corresponding p-values are all equal to 1.
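
The orthogonality can be reproduced on a small scale in base R. The sketch below regresses batch out of a simulated matrix, computes the first PC of the residuals, and checks its R² against batch. The dimensions and the eigendecomposition approach are assumptions chosen for illustration, not the RaMWAS implementation.

```r
set.seed(1)
n = 60    # samples
p = 500   # variables
batch = factor(paste0("batch", seq_len(n) %% 3))

# Simulated data: noise plus a batch shift in every variable
dat = matrix(rnorm(n * p), nrow = n, ncol = p) + as.integer(batch) / 2

# Regress the intercept and batch out of every column
X   = model.matrix(~ batch)
res = qr.resid(qr(X), dat)

# First PC of the residuals: top eigenvector of the
# sample-by-sample cross-product matrix
e   = eigen(tcrossprod(res), symmetric = TRUE)
pc1 = e$vectors[, 1]

# R2 and p-value of the PC against batch
fit = lm(pc1 ~ batch)
r2  = summary(fit)$r.squared
pv  = anova(fit)[["Pr(>F)"]][1]
print(c(r2 = r2, pvalue = pv))
```

Because the residuals lie in the space orthogonal to the batch indicators, so does every PC with a nonzero eigenvalue, giving an R² of zero up to rounding error and a p-value of 1.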

1.4 Association testing

Let us test for association between variables in the data matrix and the sex covariate (modeloutcome parameter), correcting for batch effects (modelcovariates parameter). We save the top 20 results (toppvthreshold parameter) in a text file.

param$modelcovariates = "batch"
param$modeloutcome = "sex"
param$toppvthreshold = 20

ramwas5MWAS(param)

The QQ-plot shows mild enrichment among a large number of variables, which is consistent with how the data was generated: about 11% of the variables (225 out of each 2000) are affected by sex.
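
The size of the enrichment can be anticipated from the simulation settings. The sketch below works out the expected correlation and approximate t-statistic for a sex-affected variable, using the standard deviation of uniform noise and the effect scaling from the code above. It ignores the batch term and uses nsamples - 2 degrees of freedom, so the numbers are approximate.

```r
nsamples = 200

# Standard deviation of the runif(0, 1) noise
sd_noise = sqrt(1 / 12)

# covariates$sex / 30 / sd(covariates$sex) has standard deviation 1/30,
# covariates$age / 10 / sd(covariates$age) has standard deviation 1/10
sd_sex = 1 / 30
sd_age = 1 / 10

# Expected correlation between an affected variable and the covariate
r_sex = sd_sex / sqrt(sd_noise^2 + sd_sex^2)
r_age = sd_age / sqrt(sd_noise^2 + sd_age^2)

# Approximate expected t-statistic for the sex effect
t_sex = r_sex * sqrt(nsamples - 2) / sqrt(1 - r_sex^2)

print(round(c(r_sex = r_sex, r_age = r_age, t_sex = t_sex), 3))
```

An expected t-statistic of about 1.6 is modest for any single variable, so the signal shows up as a broad shift in the QQ-plot rather than as a few extreme hits.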

The top findings saved in the text file are:

# Get the directory with testing results
pfull = parameterPreprocess(param)
toptbl = read.table(
                paste0(pfull$dirmwas,"/Top_tests.txt"),
                header = TRUE,
                sep = "\t")
pander(head(toptbl, 5))
chr   position  tstat  pvalue    qvalue
chr9  3952879   5.63   6.12e-08  0.00612
chr2  8054473   5.18   5.46e-07  0.0162
chr9  5871812   5.15   6.45e-07  0.0162
chr1  214380    5.15   6.47e-07  0.0162
chr2  6013918   5.09   8.34e-07  0.0163
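
For a single variable, the test above corresponds to a linear regression with batch as a covariate and sex as the tested term. The base R sketch below illustrates that statistical model on one simulated variable. It is not the RaMWAS implementation, which runs the tests in bulk, and the variable y is generated here purely for illustration.

```r
set.seed(1)
nsamples = 200
sex   = seq_len(nsamples) %% 2
batch = factor(paste0("batch", seq_len(nsamples) %% 3))

# One simulated variable with noise, a sex effect, and a batch effect
y = runif(nsamples) + sex / 30 / sd(sex) + as.integer(batch) / 40

# Linear model with batch as a covariate and sex as the tested term
fit  = lm(y ~ batch + sex)
test = coef(summary(fit))["sex", c("t value", "Pr(>|t|)")]
print(test)

# Residual degrees of freedom: 200 samples minus
# intercept, two batch indicators, and sex
print(fit$df.residual)
```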

1.5 Further steps of RaMWAS pipeline

Steps 6 and 7 of the RaMWAS pipeline can also be applied to the data matrix exactly as described in the overview vignette.

1.6 Cleanup

Here we remove all the files created by the code above.

unlink(paste0(dr,"/*"), recursive=TRUE)

2 Version information

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Ecoli.NCBI.20080805_1.3.1000
##  [2] BSgenome_1.44.0                      
##  [3] rtracklayer_1.36.0                   
##  [4] Biostrings_2.44.0                    
##  [5] XVector_0.16.0                       
##  [6] GenomicRanges_1.28.0                 
##  [7] GenomeInfoDb_1.12.0                  
##  [8] IRanges_2.10.0                       
##  [9] S4Vectors_0.14.0                     
## [10] BiocGenerics_0.22.0                  
## [11] ramwas_1.0.0                         
## [12] filematrix_1.1.0                     
## [13] RSQLite_1.1-2                        
## [14] pander_0.6.0                         
## [15] knitr_1.15.1                         
## [16] BiocStyle_2.4.0                      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10               compiler_3.4.0            
##  [3] iterators_1.0.8            bitops_1.0-6              
##  [5] tools_3.4.0                zlibbioc_1.22.0           
##  [7] biomaRt_2.32.0             digest_0.6.12             
##  [9] evaluate_0.10              memoise_1.1.0             
## [11] lattice_0.20-35            foreach_1.4.3             
## [13] Matrix_1.2-9               DelayedArray_0.2.0        
## [15] DBI_0.6-1                  yaml_2.1.14               
## [17] GenomeInfoDbData_0.99.0    stringr_1.2.0             
## [19] glmnet_2.0-5               rprojroot_1.2             
## [21] grid_3.4.0                 Biobase_2.36.0            
## [23] AnnotationDbi_1.38.0       XML_3.98-1.6              
## [25] BiocParallel_1.10.0        rmarkdown_1.4             
## [27] magrittr_1.5               codetools_0.2-15          
## [29] backports_1.0.5            Rsamtools_1.28.0          
## [31] htmltools_0.3.5            matrixStats_0.52.2        
## [33] GenomicAlignments_1.12.0   SummarizedExperiment_1.6.0
## [35] KernSmooth_2.23-15         stringi_1.1.5             
## [37] RCurl_1.95-4.8