Projects in biomedicine are generating information about multiple types of omics data. Most of these datasets are publicly available through on-line data backs like GEO (Gene Expression Omnibus), TCGA (The Cancer Genome Atlas) or EGA (European Genome-phenome Archive) among others. Standard association analyses between a given phenotype and a single omic dataset (aka. SNPs, gene expression, methylations,…) are insufficient to address most of the scientific questions that rely on knowing the mechanisms involved in complex diseases. Methods to properly integrate multi-data sets have to be used to analyze such amount of information.
One of the computational challenges when dealing with multi-omic data is to define an easy way to compact multiple pieces of information. This methodology must allow retrieving data for common-samples, subsetting individuals having a specific characteristic or selecting features from a given genomic region or a selected gene. MultiDataSet
is conceived as a container and manager of multiple omic datasets while allowing performing common object operations such as accession to elements or subsetting among others. The vignette of MultiDataSet
describes how to manipulate a MultiDataSet object as well as how to perform other basic operations.
In this document, we show how to implement a new integration function using MultiDataSet. The process is exemplified with a function that integrates methylation and expression, which can be found in MEAL package. We propose a general scheme that could be used for any integration function.
The objective of this document is to illustrate how to create an omic data integration function taking profit of MultiDataSet
features. This tutorial is meant for software developers who have created a statistical method to analyze multiple omic datasets and want to implement it using MultiDataSet
objects as input.
We will exemplify the implementation of an integration algorithm with a function that integrates methylation and expression. This example function can be found in MEAL package (Ruiz, Hernandez-Ferrer, and González 2016) and it is an adaptation of an existing method described in a previous paper (Steenaard et al. 2015). Briefly, the method consists on pairing CpGs to nearby expression probes and to measure the correlation between CpG and expression probes in each pair.
A typical integration function requires 4 parts:
The following sub-sections contain the code and an explanation of how to write the 4 parts of an integration function. The complete definition of the function can be found at the end of this section.
Before starting with the body of our function, we need to set the parameters. Our function will be called correlationMethExprs
and will have 11 arguments:
correlationMethExprs <- function(multiset,
meth_set_name = NULL, exprs_set_name = NULL,
vars_meth = NULL, vars_exprs = NULL,
vars_meth_types = rep(NA, length(vars_meth)),
vars_exprs_types = rep(NA, length(vars_exprs)),
sel_cpgs, flank = 250000, num_cores = 1, verbose = TRUE){
multiset
is the MultiDataSet
containing the data. meth_set_name
and exprs_set_name
are the names of the methylation and expression datasets in multiset
object. The arguments vars_meth
and vars_exprs
denotes the variables used to adjust linear models between gene expression and methylation (e.g. sex, age, …). The arguments vars_meth_types
and vars_exprs_types
encodes the type of variables we are adjusting for. The argument sel_cpgs
is used to select those CpGs we are interested in analyzing. The argument flank
indicates the minimum distance to pair a CpG to an expression probe. Finally, num_cores
will be used to allow parallelizing the algorithm and verbose
is a flag to decide whether display messages during the executing. In the following subsections, these arguments will be further discussed and explained.
When users work with a package they could easily pass unsuitable objects to the function (e.g. a data.frame
with misspelled column names). If our function does not detect these mistakes, their inner processes may throw an error which will be very difficult to understand by the users. Consequently, they will not be able to fix it and run the function properly. A good practice to avoid these problems is to introduce data checks to the input object to ensure that the function could run properly. If there are problems, it returns an error with the mistake and, sometimes, proposes ways to solve it.
When creating integration functions with MultiDataSet
, data checking is performed at two levels. On one hand, functions that add data to MultiDataSet
check that the introduced datasets fulfill some requirements. In our example, we will use the functions of MultiDataSet
package to add methylation and expression that already includes checks for the incoming datasets. For more information on how to create a function to add new classes to a MultiDataSet
take a look at “Adding a new type of data to MultiDataSet objects” (Supplementary Material 3).
On the other hand, integration functions must do other checks to the incoming MultiDataSet
. In our function, we will include three checks. The first is just to check that our object is a MultiDataSet
. Second, we will check that meth_set_name
and exprs_set_name
are characters or NULL. These parameters are useful to choose the datasets used in the analysis when we have several methylation or expression sets. If these parameters have the right format, we will check that our MultiDataSet
contains datasets with these names and that these datasets have a GenomicRanges
. Finally, we will check that the parameter flank
is a numeric that contains only one positive number.
# [Check 1] Is multiset a MultiDataSet object?
if (!is(multiset, "MultiDataSet")){
stop("multiset must be a MultiDataSet")
}
# [Check 2A] Are meth_set_name and exprs_set_name characters?
if (!(is.character(meth_set_name) | is.null(meth_set_name))){
stop("meth_set_name must be a character.")
}
if (!(is.character(exprs_set_name) | is.null(exprs_set_name))){
stop("exprs_set_name must be a character.")
}
# Add the dataset type to the names
meth_set_name <- paste(c("methylation", meth_set_name), collapse = "+")
exprs_set_name <- paste(c("expression", exprs_set_name), collapse = "+")
# [Check 2B] Has multiset the right sets?
if (!all(c(meth_set_name, exprs_set_name) %in% names(multiset))){
stop("multiset must contain meth_set_name and exprs_set_name.")
}
if (!all(c(meth_set_name, exprs_set_name) %in% rowRangesElements(multiset))){
stop("multiset must contain meth_set_name and exprs_set_name.")
}
# [Check 3] Is flank numeric, positive and has only one number?
if (!is(flank, "numeric") || length(flank) > 1 || flank < 0){
stop("flank must be a positive integer")
}
It is important to notice the lines where we add the dataset type to the set names. When users add a set to a MultiDataSet
, the name of the incoming set includes its data type. Consequently, all methylation objects start by “methylation” and all expression sets by “expression”. This procedure reduces the checks in the integration function. Using the name of the MultiDataSet
datasets, we can deduce their original class. Then, we can be sure that these datasets already have some specific features that we do not need to check again.
MultiDataSet
is useful for managing omic datasets but the implementation of the algorithm requires using simpler data structures. For example, imagine that we want to run a simpler linear regression. R has implemented in its base package the function lm
, which requires data.frame
s or list
s. Therefore, the first step would be transforming the data of MultiDataSet
to a data structure compatible with the functions that we will call. In this subsection, we will exemplify how to perform this process. We will show how to obtain the classes required by the functions that perform the algorithm: ExpressionSet
, MethylationSet
, and GenomicRanges
.
Our algorithm requires that all the datasets have the same samples. Consequently, the first step is to filter the datasets to contain only those samples present in both datasets. This step can be easily done using the method commonSamples
:
# Select only our datasets
multiset <- multiset[ , c(meth_set_name, exprs_set_name)]
## Select common samples
multiset <- commonSamples(multiset)
We can select specific datasets of MultiDataSet
using [
and pass their names to the second argument. This code returns a MultiDataSet
with the specified datasets. The method commonSamples
returns a MultiDataSet
where all datasets have the same samples in the same order. Given that multiset
can have other datasets, it is advisable to previously reduce it to only the datasets that we will use. Otherwise, we might lose samples that are not present in other datasets.
Next, we will get the original MethylationSet
and ExpressionSet
from multiset
. The original classes will help us to perform easily perform other operations that we will do later. The [[
operator retrieves the original datasets from a MultiDataSet
:
# Obtain the original objects
mset <- multiset[[meth_set_name]]
eset <- multiset[[exprs_set_name]]
In this kind of analysis, it is very common to use only the CpGs that in a previous analysis had been found differentially methylated. We have included the parameter sel_cpgs
to select CpGs:
if (!missing(sel_cpgs)){
if (!is.character(sel_cpgs) | length(sel_cpgs) == 0){
stop("sel_cpgs must be a character vector with the name of the CpGs to be used.")
} else{
mset <- mset[sel_cpgs, ]
}
}
This parameter is not mandatory, so the first line checks if it is present. The second line ensures that the parameter contains the names of CpGs in a character vector.
Finally, we get the GenomicRanges
of the methylation and expression sets, with rowRanges
accessors:
# Get GRanges
rangesMeth <- rowRanges(multiset)[[meth_set_name]]
rangesMeth <- rangesMeth[featureNames(mset)]
rangesExprs <- rowRanges(multiset)[[exprs_set_name]]
The integration algorithm that we will use consists of three steps:
To pair the CpGs and the expression probes, we will use the following method. For each CpG, we will define a region of interest using the position of the CpG plus-minus the parameter flank
(flank
units are bases: 250Kb by default). Then, we will pair this CpG to all the expression probes whose start and end coordinates are inside the region, using the function findOverlaps
of GenomicRanges
package. If there are no CpG-expression probes pairs, an empty data.frame is returned and a warning is thrown:
start(rangesMeth) <- start(rangesMeth) - flank
end(rangesMeth) <- end(rangesMeth) + flank
pairs <- GenomicRanges::findOverlaps(rangesExprs, rangesMeth, type = "within")
pairs <- data.frame(cpg = rownames(mset)[S4Vectors::subjectHits(pairs)],
exprs = rownames(eset)[S4Vectors::queryHits(pairs)],
stringsAsFactors = FALSE)
# If there are no pairs, do not continue
if (nrow(pairs) == 0){
warning("There are no expression probes in the range of the cpgs. An empty data.frame will be returned.")
return(data.frame(cpg = character(0), exprs = character(0), Beta = integer(0),
se = integer(0), P.Value = integer(0), adj.P.val = integer(0)))
}
Now, we will introduce the possibility of removing the effect of co-variates. We will perform a linear regression of our data against the co-variables and we will compute the residuals. Then, these residuals will be used to compute the association. We have defined the function setResidues
to do this process taking profit of eSet
derived objects. Although its definition can be found in the complete integration function, an exhaustive explanation will not be provided.
Four additional parameters of the integration function are needed here. vars_meth
and vars_meth_type
contain the name and the type of the methylation co-variables. vars_exprs
and vars_exprs_types
are the equivalent for expression:
# Computing residuals
methres <- setResidues(mset, vars_names = vars_meth, vars_types = vars_meth_types)
exprsres <- setResidues(eset, vars_names = vars_exprs, vars_types = vars_exprs_types)
The last step is to compute the association between each CpG-expression probe pair using a linear regression, where expression will be the outcome. We have implemented this step using the lm
function and mclapply
of parallel package. This design allows the user to parallelize this step:
## We define the function to be used in the mclapply
residualsCorr <- function (methy_res, exprs_res){
fit <- lm(exprs_res ~ methy_res)
return(c(summary(fit)$coef[2, 1], summary(fit)$coef[2,2], summary(fit)$coef[2,4]))
}
# use mclapply to speed up computation
regvals <- mclapply(1:nrow(pairs),
function(x) residualsCorr(methres[pairs[x, 1], ], exprsres[pairs[x, 2], ]),
mc.cores = num_cores)
The very last step is to generate an output suitable for users. In our case, the results will be returned in a data.frame
containing the names of the CpGs and expression probes and the estimates of the linear regression:
res <- data.frame(pairs, t(data.frame(regvals)))
colnames(res) <- c("cpg", "exprs", "Beta", "se", "P.Value")
res$adj.P.Val <- p.adjust(res$P.Value, "BH")
res <- res[order(res$adj.P.Val), ]
rownames(res) <- NULL
res
This integration function was previously implemented in MEAL package. The next chunk contains the whole function as it can be found in the package. It should be noticed that there are lines that have not been previously commented, such as code to display messages during the execution:
correlationMethExprs <- function(multiset,
meth_set_name = NULL, exprs_set_name = NULL,
vars_meth = NULL, vars_exprs = NULL,
vars_meth_types = rep(NA, length(vars_meth)),
vars_exprs_types = rep(NA, length(vars_exprs)),
sel_cpgs, flank = 250000, num_cores = 1, verbose = TRUE){
######################################################################################
## Data Checking
# Check object is a MultiDataSet
if (!is(multiset, "MultiDataSet")){
stop("multiset must be a MultiDataSet")
}
# Check meth_set_name and exprs_set_name are characters
if (!(is.character(meth_set_name) | is.null(meth_set_name))){
stop("meth_set_name must be a character.")
}
if (!(is.character(exprs_set_name) | is.null(exprs_set_name))){
stop("exprs_set_name must be a character.")
}
# Add the dataset type to the names
meth_set_name <- paste(c("methylation", meth_set_name), collapse = "+")
exprs_set_name <- paste(c("expression", exprs_set_name), collapse = "+")
# Check our object has the right sets
if (!all(c(meth_set_name, exprs_set_name) %in% names(multiset))){
stop("multiset must contain meth_set_name and exprs_set_name.")
}
if (!all(c(meth_set_name, exprs_set_name) %in% rowRangesElements(multiset))){
stop("multiset must contain meth_set_name and exprs_set_name.")
}
# Check that flank is numeric, positive and is only one number
if (!is(flank, "numeric") || length(flank) > 1 || flank < 0){
stop("flank must be a positive integer")
}
#####################################################################################
#####################################################################################
## Preparation of data
#############################
## Preparation of data 1
# Select only our sets
multiset <- multiset[, c(meth_set_name, exprs_set_name)]
## Select common samples
multiset <- commonSamples(multiset)
#############################
## Preparation of data 2
mset <- multiset[[meth_set_name]]
eset <- multiset[[exprs_set_name]]
#############################
## Preparation of data 3
if (!missing(sel_cpgs)){
if (!is.character(sel_cpgs) | length(sel_cpgs) == 0){
stop("sel_cpgs must be a character vector with the name of the CpGs to be used.")
} else{
mset <- mset[sel_cpgs, ]
}
}
#############################
## Preparation of data 4
# Compute Methylation-Expression pairs
rangesMeth <- rowRanges(multiset)[[meth_set_name]]
rangesMeth <- rangesMeth[featureNames(mset)]
rangesExprs <- rowRanges(multiset)[[exprs_set_name]]
#####################################################################################
#####################################################################################
## Implementation of the algorithm
#############################
## Implementation of the algorithm 1
start(rangesMeth) <- start(rangesMeth) - flank
end(rangesMeth) <- end(rangesMeth) + flank
pairs <- GenomicRanges::findOverlaps(rangesExprs, rangesMeth, type = "within")
pairs <- data.frame(cpg = rownames(mset)[S4Vectors::subjectHits(pairs)],
exprs = rownames(eset)[S4Vectors::queryHits(pairs)],
stringsAsFactors = FALSE)
if (nrow(pairs) == 0){
warning("There are no expression probes in the range of the cpgs. An empty data.frame will be returned.")
return(data.frame(cpg = character(0), exprs = character(0), Beta = integer(0),
se = integer(0), P.Value = integer(0), adj.P.val = integer(0)))
}
#############################
## Filter sets to only features in the pairs
mset <- mset[unique(pairs[ , 1]), ]
eset <- eset[unique(pairs[ , 2]), ]
if (verbose){
message("Computing residuals")
}
#############################
## Implementation of the algorithm 2
# Computing residuals
methres <- setResidues(mset, vars_names = vars_meth, vars_types = vars_meth_types)
exprsres <- setResidues(eset, vars_names = vars_exprs, vars_types = vars_exprs_types)
#############################
if (verbose){
message("Computing correlation Methylation-Expression")
}
#############################
## Implementation of the algorithm 3
residualsCorr <- function (methy_res, exprs_res){
fit <- lm(exprs_res ~ methy_res)
return(c(summary(fit)$coef[2, 1], summary(fit)$coef[2,2], summary(fit)$coef[2,4]))
}
regvals <- mclapply(1:nrow(pairs),
function(x) residualsCorr(methres[pairs[x, 1], ], exprsres[pairs[x, 2], ]),
mc.cores = num_cores)
#####################################################################################
#####################################################################################
## Foprmatting the results
res <- data.frame(pairs, t(data.frame(regvals)))
colnames(res) <- c("cpg", "exprs", "Beta", "se", "P.Value")
res$adj.P.Val <- p.adjust(res$P.Value, "BH")
res <- res[order(res$adj.P.Val), ]
rownames(res) <- NULL
res
#####################################################################################
}
## Function to remove covariables effect of methylation and expression data
setResidues <- function(set, vars_names, vars_types){
if (length(vars_names) != 0){
if (ncol(pData(set)) == 0 | nrow(pData(set)) == 0){
warning("set has no phenotypes. No residues will be computed")
} else if (!sum(vars_names %in% colnames(pData(set)))){
if (is(set, "ExpressionSet")){
warning("vars_exprs is/are not a valid column of the eset phenoData. No residues will be computed")
} else{
warning("vars_meth is/are not a valid column of the mset phenoData. No residues will be computed")
}
}else{
pData(set) <- preparePhenotype(phenotypes = pData(set),
variable_names = vars_names,
variable_types = vars_types)
model <- createModel(data = pData(set))
if (is(set, "ExpressionSet")){
vals <- exprs(set)
} else if (is(set, "RatioSet")){
vals <- minfi::getBeta(set)
} else{
vals <- betas(set)
## Convert methylation to M values prior fitting the linear model
vals <- minfi::logit2(vals)
}
res <- residuals(limma::lmFit(vals, model), vals)
if (is(set, "MethylationSet")){
res <- minfi::ilogit2(res)
}
return(res)
}
}
if (is(set, "ExpressionSet")){
return(exprs(set))
} else if (is(set, "RatioSet")){
return(minfi::getBeta(set))
} else{
return(betas(set))
}
}
Now, it is time to check that the function is working. To this end, we will use MEALData package that contains methylation and expression data from the same set of samples. The following lines load the libraries used in our function:
library(MultiDataSet)
library(GenomicRanges)
library(MEALData)
library(limma)
library(parallel)
library(minfi)
MEALData contains the expression data and the methylation data of few samples. We will add this data to a MultiDataSet
to test our function:
multi <- createMultiDataSet()
multi <- add_methy(multi, mset)
multi <- add_genexp(multi, eset)
We are ready to use our function. To speed up the computation, we will select three CpGs to do the pairing:
res <- correlationMethExprs(multi, sel_cpgs = c("cg17504453", "cg25946965", "cg07938442"))
res
## cpg exprs Beta se P.Value adj.P.Val
## 1 cg17504453 ILMN_1739283 -4.95014769 1.9694751 0.01491473 0.1715194
## 2 cg07938442 ILMN_1784515 -0.69234873 0.2596194 0.01003772 0.1715194
## 3 cg07938442 ILMN_2384857 0.77879064 0.3333493 0.02314957 0.1774800
## 4 cg07938442 ILMN_1669714 -0.54672695 0.2884685 0.06331630 0.2912550
## 5 cg07938442 ILMN_2276598 0.52183955 0.2736896 0.06179083 0.2912550
## 6 cg07938442 ILMN_1775373 0.36647374 0.2131860 0.09123263 0.3497251
## 7 cg07938442 ILMN_1765248 0.41135640 0.2575939 0.11601422 0.3811896
## 8 cg07938442 ILMN_1720243 0.42683608 0.2848812 0.13977448 0.3992447
## 9 cg07938442 ILMN_1748738 -0.48864726 0.3399159 0.15622618 0.3992447
## 10 cg17504453 ILMN_2383693 -1.03293435 0.7796320 0.19068270 0.4385702
## 11 cg07938442 ILMN_1759154 -1.79026109 1.4866907 0.23367216 0.4885873
## 12 cg07938442 ILMN_2401884 0.31572131 0.2883531 0.27832685 0.5334598
## 13 cg07938442 ILMN_1653200 1.10543266 1.3711013 0.42357860 0.7494083
## 14 cg07938442 ILMN_1737168 0.32640112 0.4951686 0.51253605 0.8033805
## 15 cg17504453 ILMN_2072178 2.14361394 3.3422229 0.52394380 0.8033805
## 16 cg17504453 ILMN_1679417 -0.17916674 0.4757439 0.70791746 0.8141051
## 17 cg07938442 ILMN_1725726 0.15661949 0.3833937 0.68448674 0.8141051
## 18 cg07938442 ILMN_1746561 0.56952998 1.1585006 0.62495167 0.8141051
## 19 cg07938442 ILMN_1754179 -0.42327291 0.9499611 0.65765783 0.8141051
## 20 cg17504453 ILMN_2128639 0.68924507 1.5422864 0.65670343 0.8141051
## 21 cg07938442 ILMN_1758829 -0.05436785 0.2598779 0.83506086 0.9049254
## 22 cg07938442 ILMN_2401883 -0.03880677 0.2281840 0.86558079 0.9049254
## 23 cg07938442 ILMN_1702105 0.01347782 1.2222576 0.99124182 0.9912418
res
is a data.frame
with the results of the algorithm.
Ruiz, C, C Hernandez-Ferrer, and J González. 2016. “MEAL: Perform methylation analysis. R package version 1.2.3.”
Steenaard, Rebecca V, Symen Ligthart, Lisette Stolk, Marjolein J Peters, Joyce B van Meurs, Andre G Uitterlinden, Albert Hofman, Oscar H Franco, and Abbas Dehghan. 2015. “Tobacco smoking is associated with methylation of genes related to coronary artery disease.” Clinical Epigenetics 7 (1): 54. doi:10.1186/s13148-015-0088-y.