missMethyl 1.22.0
The missMethyl package contains functions to analyse methylation data from Illumina’s HumanMethylation450 and MethylationEPIC beadchip. These arrays are a cost-effective alternative to whole genome bisulphite sequencing, and as such are widely used to profile DNA methylation. Specifically, missMethyl contains functions to perform SWAN normalisation (Maksimovic, Gordon, and Oshlack 2012), perform differential methylation analysis using RUVm (Maksimovic et al. 2015), differential variability analysis (Phipson and Oshlack 2014) and gene set analysis (Phipson, Maksimovic, and Oshlack 2016). As our lab’s research into specialised analyses of these arrays continues we anticipate that the package will be continuously updated with new functions.
Raw data files are in IDAT format, which can be read into R using the minfi package (Aryee et al. 2014). Statistical analyses are usually performed on M-values, and \(\beta\) values are used for visualisation, both of which can be extracted from objects, which is a class of object created by minfi. For detecting differentially variable CpGs we recommend that the analysis is performed on M-values. All analyses described here are performed at the CpG site level.
We will use the data in the minfiData package to demonstrate the
functions in missMethyl.
The example dataset has 6 samples across two slides. The sample
information is in the targets file. An essential column in the targets
file is the Basename
column which tells where the idat files to be
read in are located. The R commands to read in the data are taken from
the minfi User’s Guide. For additional details on how to read
the IDAT files into R, as well as information regarding quality control please
refer to the minfi User’s Guide.
library(missMethyl)
library(limma)
library(minfi)
library(minfiData)
baseDir <- system.file("extdata", package = "minfiData")
targets <- read.metharray.sheet(baseDir)
## [1] "/home/biocbuild/bbs-3.11-bioc/R/library/minfiData/extdata/SampleSheet.csv"
targets[,1:9]
## Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID person age sex
## 1 GroupA_3 H5 <NA> GroupA <NA> id3 83 M
## 2 GroupA_2 D5 <NA> GroupA <NA> id2 58 F
## 3 GroupB_3 C6 <NA> GroupB <NA> id3 83 M
## 4 GroupB_1 F7 <NA> GroupB <NA> id1 75 F
## 5 GroupA_1 G7 <NA> GroupA <NA> id1 75 F
## 6 GroupB_2 H7 <NA> GroupB <NA> id2 58 F
## status
## 1 normal
## 2 normal
## 3 cancer
## 4 cancer
## 5 normal
## 6 cancer
targets[,10:12]
## Array Slide
## 1 R02C02 5723646052
## 2 R04C01 5723646052
## 3 R05C02 5723646052
## 4 R04C02 5723646053
## 5 R05C02 5723646053
## 6 R06C02 5723646053
## Basename
## 1 /home/biocbuild/bbs-3.11-bioc/R/library/minfiData/extdata/5723646052/5723646052_R02C02
## 2 /home/biocbuild/bbs-3.11-bioc/R/library/minfiData/extdata/5723646052/5723646052_R04C01
## 3 /home/biocbuild/bbs-3.11-bioc/R/library/minfiData/extdata/5723646052/5723646052_R05C02
## 4 /home/biocbuild/bbs-3.11-bioc/R/library/minfiData/extdata/5723646053/5723646053_R04C02
## 5 /home/biocbuild/bbs-3.11-bioc/R/library/minfiData/extdata/5723646053/5723646053_R05C02
## 6 /home/biocbuild/bbs-3.11-bioc/R/library/minfiData/extdata/5723646053/5723646053_R06C02
rgSet <- read.metharray.exp(targets = targets)
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
## Warning in readChar(con, nchars = n): truncating string with embedded nuls
The data is now an RGChannelSet
object and needs to be normalised and
converted to a MethylSet
object.
SWAN (subset-quantile within array normalization) is a within-array normalization method for Illumina 450k & EPIC BeadChips. Technical differencs have been demonstrated to exist between the Infinium I and Infinium II assays on a single Illumina HumanMethylation array (Bibikova et al. 2011, Dedeurwaerder, Defrance, and Calonne (2011)). Using the SWAN method substantially reduces the technical variability between the assay designs whilst maintaining important biological differences. The SWAN method makes the assumption that the number of CpGs within the 50bp probe sequence reflects the underlying biology of the region being interrogated. Hence, the overall distribution of intensities of probes with the same number of CpGs in the probe body should be the same regardless of assay type. The method then uses a subset quantile normalization approach to adjust the intensities of each array (Maksimovic, Gordon, and Oshlack 2012).
SWAN
can take a MethylSet
, RGChannelSet
or MethyLumiSet
as input. It
should be noted that, in order to create the normalization subset, SWAN
randomly selects Infinium I and II probes that have one, two and three
underlying CpGs; as such, we recommend using set.seed
before to ensure that
the normalized intensities will be
identical, if the normalization is repeated.
The technical differences between Infinium I and II assay designs can result in aberrant beta value distributions (Figure (???)(fig:betasByType), panel “Raw”). Using SWAN corrects for the technical differences between the Infinium I and II assay designs and produces a smoother overall \(\beta\) value distribution (Figure (???)(fig:betasByType), panel “SWAN”).
mSet <- preprocessRaw(rgSet)
mSetSw <- SWAN(mSet,verbose=TRUE)
## [SWAN] Preparing normalization subset
## 450k
## [SWAN] Normalizing methylated channel
## [SWAN] Normalizing array 1 of 6
## [SWAN] Normalizing array 2 of 6
## [SWAN] Normalizing array 3 of 6
## [SWAN] Normalizing array 4 of 6
## [SWAN] Normalizing array 5 of 6
## [SWAN] Normalizing array 6 of 6
## [SWAN] Normalizing unmethylated channel
## [SWAN] Normalizing array 1 of 6
## [SWAN] Normalizing array 2 of 6
## [SWAN] Normalizing array 3 of 6
## [SWAN] Normalizing array 4 of 6
## [SWAN] Normalizing array 5 of 6
## [SWAN] Normalizing array 6 of 6
par(mfrow=c(1,2), cex=1.25)
densityByProbeType(mSet[,1], main = "Raw")
densityByProbeType(mSetSw[,1], main = "SWAN")
Poor quality probes can be filtered out based on the detection p-value. For this example, to retain a CpG for further analysis, we require that the detection p-value is less than 0.01 in all samples.
detP <- detectionP(rgSet)
keep <- rowSums(detP < 0.01) == ncol(rgSet)
mSetSw <- mSetSw[keep,]
Now that the data has been SWAN
normalised we can extract \(\beta\) and
M-values from the object. We prefer to add an offset to the methylated
and unmethylated intensities when calculating M-values, hence we extract
the methylated and unmethylated channels separately and perform our own
calculation. For all subsequent analysis we use a random selection of
20000 CpGs to reduce computation time.
mset_reduced <- mSetSw[sample(1:nrow(mSetSw), 20000),]
meth <- getMeth(mset_reduced)
unmeth <- getUnmeth(mset_reduced)
Mval <- log2((meth + 100)/(unmeth + 100))
beta <- getBeta(mset_reduced)
dim(Mval)
## [1] 20000 6
par(mfrow=c(1,1))
plotMDS(Mval, labels=targets$Sample_Name, col=as.integer(factor(targets$status)))
legend("topleft",legend=c("Cancer","Normal"),pch=16,cex=1.2,col=1:2)
An MDS plot (Figure 2) is a good sanity check to make sure samples cluster together according to the main factor of interest, in this case, cancer and normal.
To test for differential methylation we use the limma
package (Smyth 2005), which employs an empirical Bayes framework based on
Guassian model theory. First we need to set up the design matrix.
There are a number of
ways to do this, the most straightforward is directly from the targets
file. There are a number of variables, with the status
column indicating
cancer/normal samples. From the person
column of the targets file, we
see that the cancer/normal samples are matched, with 3 individuals each
contributing both a cancer and normal sample. Since the
limma model framework can handle any experimental design which
can be summarised by
a design matrix, we can take into account the paired nature of the data
in the analysis. For more complicated experimental designs, please refer
to the limma User’s Guide.
group <- factor(targets$status,levels=c("normal","cancer"))
id <- factor(targets$person)
design <- model.matrix(~id + group)
design
## (Intercept) idid2 idid3 groupcancer
## 1 1 0 1 0
## 2 1 1 0 0
## 3 1 0 1 1
## 4 1 0 0 1
## 5 1 0 0 0
## 6 1 1 0 1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$id
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
Now we can test for differential methylation using the lmFit
and eBayes
functions from limma. As input data we use the matrix of
M-values.
fit.reduced <- lmFit(Mval,design)
fit.reduced <- eBayes(fit.reduced)
The numbers of hyper-methylated (1) and hypo-methylated (-1) can be
displayed using the decideTests
function in limma and the top
10 differentially methylated CpGs for cancer versus normal extracted using
topTable
.
summary(decideTests(fit.reduced))
## (Intercept) idid2 idid3 groupcancer
## Down 7106 1 98 681
## NotSig 3299 19999 19895 18807
## Up 9595 0 7 512
top<-topTable(fit.reduced,coef=4)
top
## logFC AveExpr t P.Value adj.P.Val B
## cg25937714 4.216479 -0.29048044 16.88118 8.110645e-06 0.02685689 4.180446
## cg27028555 4.372173 -0.09128251 15.04708 1.477539e-05 0.02685689 3.749281
## cg11117364 4.965181 -0.53852862 14.74633 1.641156e-05 0.02685689 3.669933
## cg11430157 4.841606 -1.28300292 14.65509 1.694976e-05 0.02685689 3.645328
## cg12804278 3.918198 -1.71951466 14.29423 1.929390e-05 0.02685689 3.545505
## cg00082713 3.895754 -1.57777784 14.28159 1.938274e-05 0.02685689 3.541934
## cg04770504 3.866282 -2.03860245 14.01199 2.139801e-05 0.02685689 3.464540
## cg27493301 3.941008 -1.86633266 13.77597 2.336872e-05 0.02685689 3.394801
## cg11198851 -3.533764 1.38946030 -13.68847 2.415340e-05 0.02685689 3.368464
## cg00662647 4.819181 -0.39535341 13.67846 2.424514e-05 0.02685689 3.365433
Note that since we performed our analysis on M-values, the logFC
and
AveExpr
columns are computed on the M-value scale. For interpretability
and visualisation we can look at the \(\beta\) values. The beta values for
the top 4 differentially methylated CpGs shown in Figure 3.
cpgs <- rownames(top)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(beta[rownames(beta)==cpgs[i],]~design[,4],method="jitter",
group.names=c("Normal","Cancer"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(cpgs[i],cex.main=1.5)
}
Like other platforms, 450k array studies are subject to unwanted technical variation such as batch effects and other, often unknown, sources of variation. The adverse effects of unwanted variation have been extensively documented in gene expression array studies and have been shown to be able to both reduce power to detect true differences and to increase the number of false discoveries. As such, when it is apparent that data is significantly affected by unwanted variation, it is advisable to perform an adjustment to mitigate its effects.
missMethyl provides a limma inspired interface to functions from the CRAN package ruv, which enable the removal of unwanted variation when performing a differential methylation analysis (Maksimovic et al. 2015).
RUVfit
uses the RUV-inverse method by default, as this does not require the
user to specify a \(k\) parameter. The ridged version of RUV-inverse is also
available by setting method = rinv
. The RUV-2 and RUV-4 functions can also
be used by setting method = ruv2
or method = ruv4
, respectively, and
specifying an appropriate value for k (number of components of unwanted
variation to remove) where \(0 \leq k < no. samples\).
All of the methods rely on negative control features to accurately estimate the components of unwanted variation. Negative control features are probes/genes/etc. that are known a priori to not truly be associated with the biological factor of interest, but are affected by unwanted variation. For example, in a microarray gene expression study, these could be house-keeping genes or a set of spike-in controls. Negative control features are extensively discussed in Gagnon-Bartsch and Speed (2012) and Gagnon-Bartsch et al. (2013). Once the unwanted factors are accurately estimated from the data, they are adjusted for in the linear model that describes the differential analysis.
If the negative control features are not known a priori, they can be identified empirically. This can be achieved via a 2-stage approach, RUVm. Stage 1 involves performing a differential methylation analysis using RUV-inverse (by default) and the 613 Illumina negative controls (INCs) as negative control features. This will produce a list of CpGs ranked by p-value according to their level of association with the factor of interest. This list can then be used to identify a set of empirical control probes (ECPs), which will capture more of the unwanted variation than using the INCs alone. ECPs are selected by designating a proportion of the CpGs least associated with the factor of interest as negative control features; this can be done based on either an FDR cut-off or by taking a fixed percentage of probes from the bottom of the ranked list. Stage 2 involves performing a second differential methylation analysis on the original data using RUV-inverse (by default) and the ECPs. For simplicity, we are ignoring the paired nature of the cancer and normal samples in this example.
# get M-values for ALL probes
meth <- getMeth(mSet)
unmeth <- getUnmeth(mSet)
M <- log2((meth + 100)/(unmeth + 100))
# setup the factor of interest
grp <- factor(targets$status, labels=c(0,1))
# extract Illumina negative control data
INCs <- getINCs(rgSet)
head(INCs)
## 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
## 13792480 -0.3299654 -1.0955482 -0.5266103
## 69649505 -1.0354488 -1.4943396 -1.0067050
## 34772371 -1.1286422 -0.2995603 -0.8192636
## 28715352 -0.5553373 -0.7599489 -0.7186973
## 74737439 -1.1169178 -0.8656399 -0.6429681
## 33730459 -0.7714684 -0.5622424 -0.7724825
## 5723646053_R04C02 5723646053_R05C02 5723646053_R06C02
## 13792480 -0.6374299 -1.116598 -0.4332793
## 69649505 -0.8854881 -1.586679 -0.9217329
## 34772371 -0.6895514 -1.161155 -0.6186795
## 28715352 -1.7903619 -1.348105 -1.0067259
## 74737439 -0.8872082 -1.064986 -0.9841833
## 33730459 -1.5623138 -2.079184 -1.0445246
# add negative control data to M-values
Mc <- rbind(M,INCs)
# create vector marking negative controls in data matrix
ctl1 <- rownames(Mc) %in% rownames(INCs)
table(ctl1)
## ctl1
## FALSE TRUE
## 485512 613
rfit1 <- RUVfit(Y = Mc, X = grp, ctl = ctl1) # Stage 1 analysis
rfit2 <- RUVadj(Y = Mc, fit = rfit1)
Now that we have performed an initial differential methylation analysis to rank the CpGs with respect to their association with the factor of interest, we can designate the CpGs that are least associated with the factor of interest based on FDR-adjusted p-value as ECPs.
top1 <- topRUV(rfit2, num=Inf, p.BH = 1)
head(top1)
## F.p F.p.BH p_X1.1 p.BH_X1.1 b_X1.1 sigma2
## cg04743961 3.516091e-07 0.01017357 3.516091e-07 0.01017357 -4.838190 0.10749571
## cg07155336 3.583107e-07 0.01017357 3.583107e-07 0.01017357 -5.887409 0.16002329
## cg20925841 3.730375e-07 0.01017357 3.730375e-07 0.01017357 -4.790211 0.10714494
## cg03607359 4.721205e-07 0.01017357 4.721205e-07 0.01017357 -4.394397 0.09636129
## cg10566121 5.238865e-07 0.01017357 5.238865e-07 0.01017357 -4.787914 0.11780108
## cg07655636 6.080091e-07 0.01017357 6.080091e-07 0.01017357 -4.571758 0.11201883
## var.b_X1.1 fit.ctl mean
## cg04743961 0.07729156 FALSE -2.31731496
## cg07155336 0.11505994 FALSE -1.27676413
## cg20925841 0.07703935 FALSE -0.87168892
## cg03607359 0.06928569 FALSE -2.19187147
## cg10566121 0.08470133 FALSE 0.03138961
## cg07655636 0.08054378 FALSE -1.29294851
ctl2 <- rownames(M) %in% rownames(top1[top1$p.BH_X1.1 > 0.5,])
table(ctl2)
## ctl2
## FALSE TRUE
## 172540 312972
We can then use the ECPs to perform a second differential methylation with RUV-inverse, which is adjusted for the unwanted variation estimated from the data.
# Perform RUV adjustment and fit
rfit3 <- RUVfit(Y = M, X = grp, ctl = ctl2) # Stage 2 analysis
rfit4 <- RUVadj(Y = M, fit = rfit3)
# Look at table of top results
topRUV(rfit4)
## F.p F.p.BH p_X1.1 p.BH_X1.1 b_X1.1 sigma2 var.b_X1.1 fit.ctl
## cg07155336 1e-24 1e-24 1e-24 1e-24 -5.769286 0.1668414 0.1349771 FALSE
## cg06463958 1e-24 1e-24 1e-24 1e-24 -5.733093 0.1668414 0.1349771 FALSE
## cg00024472 1e-24 1e-24 1e-24 1e-24 -5.662959 0.1668414 0.1349771 FALSE
## cg02040433 1e-24 1e-24 1e-24 1e-24 -5.651399 0.1668414 0.1349771 FALSE
## cg13355248 1e-24 1e-24 1e-24 1e-24 -5.595396 0.1668414 0.1349771 FALSE
## cg02467990 1e-24 1e-24 1e-24 1e-24 -5.592707 0.1668414 0.1349771 FALSE
## cg00817367 1e-24 1e-24 1e-24 1e-24 -5.527501 0.1668414 0.1349771 FALSE
## cg11396157 1e-24 1e-24 1e-24 1e-24 -5.487992 0.1668414 0.1349771 FALSE
## cg16306898 1e-24 1e-24 1e-24 1e-24 -5.466780 0.1668414 0.1349771 FALSE
## cg03735888 1e-24 1e-24 1e-24 1e-24 -5.396242 0.1668414 0.1349771 FALSE
## mean
## cg07155336 -1.2767641
## cg06463958 0.2776252
## cg00024472 -2.4445762
## cg02040433 -1.2918259
## cg13355248 -0.8483387
## cg02467990 -0.4154370
## cg00817367 -0.2911294
## cg11396157 -1.3800170
## cg16306898 -2.1469768
## cg03735888 -1.1527557
If the number of samples in your experiment is greater than the number of Illumina negative controls on the array platform used - 613 for 450k, 411 for EPIC - stage 1 of RUVm will not work. In such cases, we recommend performing a standard limma analysis in stage 1.
# setup design matrix
des <- model.matrix(~grp)
des
## (Intercept) grp1
## 1 1 1
## 2 1 1
## 3 1 0
## 4 1 0
## 5 1 1
## 6 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$grp
## [1] "contr.treatment"
# limma differential methylation analysis
lfit1 <- lmFit(M, design=des)
lfit2 <- eBayes(lfit1) # Stage 1 analysis
# Look at table of top results
topTable(lfit2)
## Removing intercept from test coefficients
## logFC AveExpr t P.Value adj.P.Val B
## cg07155336 -6.037439 -1.276764 -19.22210 1.175108e-07 0.005755968 7.635736
## cg04743961 -4.887986 -2.317315 -19.21709 1.177367e-07 0.005755968 7.634494
## cg03607359 -4.393946 -2.191871 -18.07007 1.852304e-07 0.005755968 7.334032
## cg13272280 -4.559707 -2.099665 -17.25531 2.599766e-07 0.005755968 7.099628
## cg22263007 -4.438420 -1.010994 -17.12384 2.749857e-07 0.005755968 7.060036
## cg03556069 -5.456754 -1.811718 -17.00720 2.891269e-07 0.005755968 7.024476
## cg08443814 -4.597347 -2.062275 -16.80835 3.151706e-07 0.005755968 6.962907
## cg18672939 -5.159383 -0.705992 -16.65643 3.368597e-07 0.005755968 6.915046
## cg24385334 -4.157473 -1.943370 -16.59313 3.463909e-07 0.005755968 6.894890
## cg18044663 -4.426118 -1.197724 -16.57851 3.486357e-07 0.005755968 6.890216
The results of this can then be used to define ECPs for stage 2, as in the previous example.
topl1 <- topTable(lfit2, num=Inf)
## Removing intercept from test coefficients
head(topl1)
## logFC AveExpr t P.Value adj.P.Val B
## cg07155336 -6.037439 -1.276764 -19.22210 1.175108e-07 0.005755968 7.635736
## cg04743961 -4.887986 -2.317315 -19.21709 1.177367e-07 0.005755968 7.634494
## cg03607359 -4.393946 -2.191871 -18.07007 1.852304e-07 0.005755968 7.334032
## cg13272280 -4.559707 -2.099665 -17.25531 2.599766e-07 0.005755968 7.099628
## cg22263007 -4.438420 -1.010994 -17.12384 2.749857e-07 0.005755968 7.060036
## cg03556069 -5.456754 -1.811718 -17.00720 2.891269e-07 0.005755968 7.024476
ctl3 <- rownames(M) %in% rownames(topl1[topl1$adj.P.Val > 0.5,])
table(ctl3)
## ctl3
## FALSE TRUE
## 199150 286362
We can then use the ECPs to perform a second differential methylation
with RUV-inverse
as before.
# Perform RUV adjustment and fit
rfit5 <- RUVfit(Y = M, X = grp, ctl = ctl3) # Stage 2 analysis
rfit6 <- RUVadj(Y = M, fit = rfit5)
# Look at table of top results
topRUV(rfit6)
## F.p F.p.BH p_X1.1 p.BH_X1.1 b_X1.1 sigma2 var.b_X1.1 fit.ctl
## cg06463958 1e-24 1e-24 1e-24 1e-24 -5.910598 0.1667397 0.1170589 FALSE
## cg07155336 1e-24 1e-24 1e-24 1e-24 -5.909549 0.1667397 0.1170589 FALSE
## cg02467990 1e-24 1e-24 1e-24 1e-24 -5.841079 0.1667397 0.1170589 FALSE
## cg00024472 1e-24 1e-24 1e-24 1e-24 -5.823529 0.1667397 0.1170589 FALSE
## cg01893212 1e-24 1e-24 1e-24 1e-24 -5.699627 0.1667397 0.1170589 FALSE
## cg11396157 1e-24 1e-24 1e-24 1e-24 -5.699331 0.1667397 0.1170589 FALSE
## cg13355248 1e-24 1e-24 1e-24 1e-24 -5.658606 0.1667397 0.1170589 FALSE
## cg00817367 1e-24 1e-24 1e-24 1e-24 -5.649284 0.1667397 0.1170589 FALSE
## cg16306898 1e-24 1e-24 1e-24 1e-24 -5.610118 0.1667397 0.1170589 FALSE
## cg16556906 1e-24 1e-24 1e-24 1e-24 -5.567659 0.1667397 0.1170589 FALSE
## mean
## cg06463958 0.27762518
## cg07155336 -1.27676413
## cg02467990 -0.41543703
## cg00024472 -2.44457624
## cg01893212 -0.08273355
## cg11396157 -1.38001701
## cg13355248 -0.84833866
## cg00817367 -0.29112939
## cg16306898 -2.14697683
## cg16556906 -0.96821744
To visualise the effect that the RUVm adjustment is having on the data,
using an MDS plot for example, the getAdj
function can be used to extract
the adjusted values from the RUVm fit object produced by RUVfit
.
NOTE: The adjusted values should only be used for visualisations - it is NOT
recommended that they are used in any downstream analysis.
Madj <- getAdj(M, rfit5) # get adjusted values
The MDS plots below show how the relationship between the samples changes with and without RUVm adjustment. RUVm reduces the distance between the samples in each group by removing unwanted variation. It can be useful to examine this type of plot when trying to decide on the best set of ECPs or to help select the optimal value of \(k\), if using RUV-4 or RUV-2.
par(mfrow=c(1,2))
plotMDS(M, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Unadjusted", gene.selection = "common")
legend("topleft",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
plotMDS(Madj, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Adjusted: RUV-inverse", gene.selection = "common")
legend("topleft",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
To illustrate how the getAdj
function can be used to help select an
appropriate value for \(k\), we will run the second stage of the RUVm analysis
using RUV-4 with two different \(k\) values.
# Use RUV-4 in stage 2 of RUVm with k=1 and k=2
rfit7 <- RUVfit(Y = M, X = grp, ctl = ctl3,
method = "ruv4", k=1) # Stage 2 with RUV-4, k=1
rfit9 <- RUVfit(Y = M, X = grp, ctl = ctl3,
method = "ruv4", k=2) # Stage 2 with RUV-4, k=2
# get adjusted values
Madj1 <- getAdj(M, rfit7)
Madj2 <- getAdj(M, rfit9)
The following MDS plots show how the relationship between the samples changes from the unadjusted data to data adjusted with RUV-inverse and RUV-4 with two different \(k\) values. For this small dataset, RUV-inverse appears to be removing far too much variation as we can see the samples in each group are completely overlapping. Using RUV-4 and choosing a smaller value for \(k\) produces more sensible results.
par(mfrow=c(2,2))
plotMDS(M, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Unadjusted", gene.selection = "common")
legend("topleft",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
plotMDS(Madj, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Adjusted: RUV-inverse", gene.selection = "common")
legend("topleft",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
plotMDS(Madj1, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Adjusted: RUV-4, k=1", gene.selection = "common")
legend("bottomleft",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
plotMDS(Madj2, labels=targets$Sample_Name, col=as.integer(factor(targets$status)),
main="Adjusted: RUV-4, k=2", gene.selection = "common")
legend("bottomright",legend=c("Cancer","Normal"),pch=16,cex=1,col=1:2)
More information about the various RUV methods can be found at [http://www-personal.umich.edu/~johanngb/ruv/](http://www-personal.umich.edu/~johanngb/ruv/), including links to all relevant publications. Further examples of RUV analyses, with code, can be found at https://github.com/johanngb/ruv-useR2018. The tutorials demonstrate how the various plotting functions available in the ruv package (which are not covered in this vignette) can be used to select sensible parameters and assess if the adjustment is “helping” your analysis.
Rather than testing for differences in mean methylation, we may be interested in testing for differences between group variances. For example, it has been hypothesised that highly variable CpGs in cancer are important for tumour progression (Hansen et al. 2011). Hence we may be interested in CpG sites that are consistently methylated in the normal samples, but variably methylated in the cancer samples.
In general we recommend at least 10 samples in each group for accurate
variance estimation, however for the purpose of this vignette we perform
the analysis on 3 vs 3. In this example, we are interested in testing
for differential variability in the cancer versus normal group. Note
that when we specify the coef
parameter, which corresponds to the
columns of the design matrix to be used for testing differential
variability, we need to specify both the intercept and the fourth
column. The ID variable is a nuisance parameter and not used when
obtaining the absolute deviations, however it can be included in the
linear modelling step. For methylation data, the function will take
either a matrix of M-values, \(\beta\) values or a object as input. If
\(\beta\) values are supplied, a logit transformation is performed. Note
that as a default, varFit
uses the robust setting in the limma
framework, which requires the use of the statmod package.
fitvar <- varFit(Mval, design = design, coef = c(1,4))
The numbers of hyper-variable (1) and hypo-variable (-1) genes in cancer
vs normal can be obtained using decideTests
.
summary(decideTests(fitvar))
## (Intercept) idid2 idid3 groupcancer
## Down 0 1 0 1
## NotSig 19789 19996 19992 19997
## Up 211 3 8 2
topDV <- topVar(fitvar, coef=4)
topDV
## SampleVar LogVarRatio DiffLevene t P.Value Adj.P.Value
## cg05196820 7.509582 4.602670 2.959474 4.873331 1.101519e-06 0.02203038
## cg07446674 5.742262 -5.162693 -2.912521 -4.558453 5.168423e-06 0.04804416
## cg26943708 8.107262 2.561826 2.439186 4.488086 7.206625e-06 0.04804416
## cg10639888 7.603196 5.658425 2.803961 4.287154 1.813999e-05 0.09069996
## cg08190044 5.271941 6.261236 2.790618 4.187969 2.820644e-05 0.11282575
## cg27018185 7.524526 2.581696 2.345547 4.110629 3.953685e-05 0.11342059
## cg22793988 4.669360 4.057126 2.592355 4.093937 4.249411e-05 0.11342059
## cg19729116 6.585301 10.149999 2.765000 4.065933 4.793205e-05 0.11342059
## cg16087447 4.681714 5.364508 2.680076 4.023779 5.737684e-05 0.11342059
## cg15063355 6.393190 8.241551 2.705779 3.976513 7.005618e-05 0.11342059
An alternate parameterisation of the design matrix that does not include
an intercept term can also be used, and specific contrasts tested with
contrasts.varFit
.
Here we specify the design matrix such that the first two columns
correspond to the normal and cancer groups, respectively.
design2 <- model.matrix(~0+group+id)
fitvar.contr <- varFit(Mval, design=design2, coef=c(1,2))
contr <- makeContrasts(groupcancer-groupnormal,levels=colnames(design2))
fitvar.contr <- contrasts.varFit(fitvar.contr,contrasts=contr)
The results are identical to before.
summary(decideTests(fitvar.contr))
## groupcancer - groupnormal
## Down 1
## NotSig 19997
## Up 2
topVar(fitvar.contr,coef=1)
## SampleVar LogVarRatio DiffLevene t P.Value Adj.P.Value
## cg05196820 7.509582 4.602670 2.959474 4.873331 1.101519e-06 0.02203038
## cg07446674 5.742262 -5.162693 -2.912521 -4.558453 5.168423e-06 0.04804416
## cg26943708 8.107262 2.561826 2.439186 4.488086 7.206625e-06 0.04804416
## cg10639888 7.603196 5.658425 2.803961 4.287154 1.813999e-05 0.09069996
## cg08190044 5.271941 6.261236 2.790618 4.187969 2.820644e-05 0.11282575
## cg27018185 7.524526 2.581696 2.345547 4.110629 3.953685e-05 0.11342059
## cg22793988 4.669360 4.057126 2.592355 4.093937 4.249411e-05 0.11342059
## cg19729116 6.585301 10.149999 2.765000 4.065933 4.793205e-05 0.11342059
## cg16087447 4.681714 5.364508 2.680076 4.023779 5.737684e-05 0.11342059
## cg15063355 6.393190 8.241551 2.705779 3.976513 7.005618e-05 0.11342059
The \(\beta\) values for the top 4 differentially variable CpGs can be seen in Figure 6.
cpgsDV <- rownames(topDV)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(beta[rownames(beta)==cpgsDV[i],]~design[,4],method="jitter",
group.names=c("Normal","Cancer"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(cpgsDV[i],cex.main=1.5)
}
Testing for differential variability in expression data is
straightforward if the technology is gene expression microarrays. The
matrix of expression values can be supplied directly to the varFit
function.
For RNA-Seq data, the mean-variance relationship that occurs in count
data needs to be taken into account. In order to deal with this issue,
we apply a voom
transformation (Law et al. 2014) to obtain observation weights, which
are then used in the linear modelling step. For RNA-Seq data, the varFit
function will take a DGElist
object as input.
To demonstrate this, we use data from the tweeDEseqCountData package. This data is part of the International HapMap project, consisting of RNA-Seq profiles from 69 unrelated Nigerian individuals (Pickrell et al. 2010). The only covariate is gender, so we can look at differentially variable expression between males and females. We follow the code from the limma vignette to read in and process the data before testing for differential variability.
First we load up the data and extract the relevant information.
library(tweeDEseqCountData)
data(pickrell1)
counts<-exprs(pickrell1.eset)
dim(counts)
## [1] 38415 69
gender <- pickrell1.eset$gender
table(gender)
## gender
## female male
## 40 29
rm(pickrell1.eset)
data(genderGenes)
data(annotEnsembl63)
annot <- annotEnsembl63[,c("Symbol","Chr")]
rm(annotEnsembl63)
We now have the counts, gender of each sample and annotation (gene
symbol and chromosome) for each Ensemble gene. We can form a DGElist
object
using the edgeR package.
library(edgeR)
y <- DGEList(counts=counts, genes=annot[rownames(counts),])
We filter out lowly expressed genes by keeping genes with at least 1 count per million reads in at least 20 samples, as well as genes that have defined annotation. Finally we perform scaling normalisation.
isexpr <- rowSums(cpm(y)>1) >= 20
hasannot <- rowSums(is.na(y$genes))==0
y <- y[isexpr & hasannot,,keep.lib.sizes=FALSE]
dim(y)
## [1] 17310 69
y <- calcNormFactors(y)
We set up the design matrix and test for differential variability. In
this case there are no nuisance parameters, so coef
does not need to
be explicitly specified.
design.hapmap <- model.matrix(~gender)
fitvar.hapmap <- varFit(y, design = design.hapmap)
## Converting counts to log counts-per-million using voom.
fitvar.hapmap$genes <- y$genes
We can display the results of the test:
summary(decideTests(fitvar.hapmap))
## (Intercept) gendermale
## Down 0 2
## NotSig 0 17308
## Up 17310 0
topDV.hapmap <- topVar(fitvar.hapmap,coef=ncol(design.hapmap))
topDV.hapmap
## Symbol Chr SampleVar LogVarRatio DiffLevene t
## ENSG00000213318 RP11-331F4.1 16 5.69839463 -2.562939 -0.9859943 -8.031243
## ENSG00000129824 RPS4Y1 Y 2.32497726 -2.087025 -0.4585620 -4.957005
## ENSG00000233864 TTTY15 Y 6.79004140 -2.245369 -0.6085233 -4.612934
## ENSG00000176171 BNIP3 10 0.41317384 1.199292 0.3632133 4.219404
## ENSG00000197358 BNIP3P1 14 0.39969125 1.149754 0.3353288 4.058147
## ENSG00000025039 RRAGD 6 0.91837213 1.091229 0.4926839 3.977022
## ENSG00000103671 TRIP4 15 0.07456448 -1.457139 -0.1520583 -3.911300
## ENSG00000171100 MTM1 X 0.44049558 -1.133295 -0.3334619 -3.896490
## ENSG00000149476 DAK 11 0.13289523 -1.470460 -0.1919880 -3.785893
## ENSG00000064886 CHI3L2 1 2.70234584 1.468059 0.8449434 3.782010
## P.Value Adj.P.Value
## ENSG00000213318 7.238039e-12 1.252905e-07
## ENSG00000129824 3.960855e-06 3.428120e-02
## ENSG00000233864 1.496237e-05 8.633290e-02
## ENSG00000176171 6.441668e-05 2.787632e-01
## ENSG00000197358 1.147886e-04 3.973982e-01
## ENSG00000025039 1.527695e-04 4.375736e-01
## ENSG00000103671 1.921104e-04 4.375736e-01
## ENSG00000171100 2.022293e-04 4.375736e-01
## ENSG00000149476 2.956364e-04 4.425050e-01
## ENSG00000064886 2.995692e-04 4.425050e-01
The log counts per million for the top 4 differentially variable genes can be seen in Figure 7.
genesDV <- rownames(topDV.hapmap)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(cpm(y,log=TRUE)[rownames(y)==genesDV[i],]~design.hapmap[,ncol(design.hapmap)],method="jitter",
group.names=c("Female","Male"),pch=16,cex=1.5,col=c(4,2),ylab="Log counts per million",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(genesDV[i],cex.main=1.5)
}
Once a differential methylation or differential variability analysis has been performed, it may be of interest to know which gene pathways are targeted by the significant CpG sites. It is not entirely clear from the literature how best to perform such an analysis, however Geeleher et al. (Geeleher et al. 2013) showed there is a severe bias when performing gene ontology analysis with methylation data. This is due to the fact that there are differing numbers of probes per gene on several different array technologies. For the Illumina Infinium HumanMethylation450 array the number of probes per gene ranges from 1 to 1299, with a median of 15 probes per gene. For the EPIC array, the range is 1 to 1487, with a median of 20 probes per gene. This means that when mapping CpG sites to genes, a gene is more likely to be selected if there are many CpG sites associated with the gene.
One way to take into account this selection bias is to model the relationship between the number of probes per gene and the probability of being selected. This can be performed by adapting the goseq method of Young et al. (Young et al. 2010). Each gene then has a prior probability associated with it, and a modified version of a hypergeometric test can be performed, testing for over-representation of the selected genes in each gene set.
The gometh
function performs gene set testing on GO categories or KEGG pathways
(Phipson, Maksimovic, and Oshlack 2016). The gsameth
function is a more generalised gene set testing
function which can take as input a list of user specified gene sets.
Note that for gsameth
, the format for the gene ids for each gene in the gene
set needs to be Entrez Gene IDs. For example, the entire curated gene
set list (C2) from the Broad’s Molecular Signatures Database can be
specified as input. The R version of these lists can be downloaded from
http://bioinf.wehi.edu.au/software/MSigDB/index.html. Both functions
take a vector of significant CpG probe names as input.
To illustrate how to use gometh
, consider the results from the differential
methylation analysis with RUVm.
top <- topRUV(rfit4, number = Inf, p.BH = 1)
table(top$p.BH_X1.1 < 0.01)
##
## FALSE TRUE
## 424168 61344
At a 1% false discovery rate cut-off, there are still tens of thousands of CpG sites differentially methylated. These will undoubtably map to almost all the genes in the genome, making a gene ontology analysis irrelevant. One option for selecting CpGs in this context is to apply not only a false discovery rate cut-off, but also a \(\Delta\beta\) cut-off. However, for this dataset, taking a relatively large \(\Delta\beta\) cut-off of 0.25 still leaves more than 30000 CpGs differentially methylated.
beta <- getBeta(mSet)
# make sure that order of beta values matches orer after analysis
beta <- beta[match(rownames(top),rownames(beta)),]
beta_norm <- rowMeans(beta[,grp==0])
beta_can <- rowMeans(beta[,grp==1])
Delta_beta <- beta_can - beta_norm
sigDM <- top$p.BH_X1.1 < 0.01 & abs(Delta_beta) > 0.25
table(sigDM)
## sigDM
## FALSE TRUE
## 451760 33748
Instead, we take the top 10000 CpG sites as input to gometh
.
topCpGs<-topRUV(rfit4,number=10000)
sigCpGs <- rownames(topCpGs)
sigCpGs[1:10]
## [1] "cg07155336" "cg06463958" "cg00024472" "cg02040433" "cg13355248"
## [6] "cg02467990" "cg00817367" "cg11396157" "cg16306898" "cg03735888"
The takes as input a character vector of CpG names, and optionally, a
character vector of all CpG sites tested. If the all.cpg
argument is
omitted, all the CpGs on the array are used as background. To change the
array type, the array.type
argument can be specified as either
“450K” or “EPIC”. The default is “450K”.
If the plot.bias
argument is TRUE
, a figure showing the relationship
between the probability of being selected and the number of probes per
gene will be displayed.
For testing of GO terms, the collection
argument takes the value
“GO”, which is the default setting. For KEGG pathway analysis, set
collection
to “KEGG”. The function topGSA
shows the top enriched GO
categories. The function gsameth is
called for GO and KEGG pathway analysis with the appropriate inputs.
For GO testing on our example dataset:
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
gst <- gometh(sig.cpg=sigCpGs, all.cpg=rownames(top), collection="GO")
topGSA(gst)
## ONTOLOGY TERM N DE
## GO:0007399 BP nervous system development 2349 589.0000
## GO:0048731 BP system development 4833 951.3333
## GO:0048856 BP anatomical structure development 5888 1096.8333
## GO:0007275 BP multicellular organism development 5401 1022.8333
## GO:0031226 CC intrinsic component of plasma membrane 1685 386.0000
## GO:0007417 BP central nervous system development 991 293.5000
## GO:0032502 BP developmental process 6290 1136.8333
## GO:0005887 CC integral component of plasma membrane 1606 367.0000
## GO:0030182 BP neuron differentiation 1352 375.5000
## GO:0048699 BP generation of neurons 1510 404.5000
## GO:0022008 BP neurogenesis 1609 422.5000
## GO:0030154 BP cell differentiation 4160 811.0000
## GO:0032501 BP multicellular organismal process 7540 1255.6190
## GO:0048869 BP cellular developmental process 4349 834.0000
## GO:0009653 BP anatomical structure morphogenesis 2710 603.3333
## GO:0005886 CC plasma membrane 5209 922.0000
## GO:0071944 CC cell periphery 5326 937.5000
## GO:0044459 CC plasma membrane part 2769 563.0000
## GO:0007420 BP brain development 722 216.5000
## GO:0031224 CC intrinsic component of membrane 5189 840.0000
## P.DE FDR
## GO:0007399 1.311192e-29 3.000400e-25
## GO:0048731 1.772587e-26 2.028105e-22
## GO:0048856 4.234268e-25 3.229759e-21
## GO:0007275 2.028367e-24 1.160378e-20
## GO:0031226 9.641687e-24 4.412615e-20
## GO:0007417 1.345937e-23 5.133181e-20
## GO:0032502 7.331934e-23 2.396809e-19
## GO:0005887 1.225636e-22 3.505778e-19
## GO:0030182 1.778893e-22 4.522934e-19
## GO:0048699 5.037918e-22 1.152827e-18
## GO:0022008 8.162777e-22 1.698080e-18
## GO:0030154 2.534460e-21 4.833004e-18
## GO:0032501 8.247924e-21 1.451825e-17
## GO:0048869 2.617455e-20 4.278230e-17
## GO:0009653 4.478328e-20 6.831839e-17
## GO:0005886 5.389914e-20 7.708587e-17
## GO:0071944 3.363401e-19 4.527335e-16
## GO:0044459 5.194549e-18 6.603714e-15
## GO:0007420 1.005120e-17 1.210535e-14
## GO:0031224 1.058405e-17 1.210974e-14
For a more generalised version of gene set testing in methylation data
where the user can specify the gene set to be tested, the function gsameth
can
be used. To display the top 20 pathways, topGSA
can be called. gsameth
can
take a single gene set, or a list of gene sets. The gene identifiers in the
gene set must be Entrez Gene IDs. To demonstrate gsameth
, a toy example is
shown below, with gene sets made up of randomly selected genes from the
org.Hs.eg.db package.
library(AnnotationDbi)
library(org.Hs.eg.db)
genes <- select(org.Hs.eg.db, columns=c("ENTREZID"), keys = keys(org.Hs.eg.db))
set1 <- sample(genes$ENTREZID,size=80)
set2 <- sample(genes$ENTREZID,size=100)
set3 <- sample(genes$ENTREZID,size=30)
genesets <- list(set1,set2,set3)
gsa <- gsameth(sig.cpg=sigCpGs, all.cpg=rownames(top), collection=genesets)
topGSA(gsa)
## N DE P.DE FDR
## 2 34 5 0.3813073 0.7037813
## 1 23 3 0.4691875 0.7037813
## 3 13 0 1.0000000 1.0000000
Note that if it is of interest to obtain the Entrez Gene IDs that the
significant CpGs are mapped to, the getMappedEntrezIDs
can be called.
sessionInfo()
R version 4.0.0 (2020-04-24) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS
Matrix products: default BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base
other attached packages:
[1] org.Hs.eg.db_3.10.0
[2] AnnotationDbi_1.50.0
[3] edgeR_3.30.0
[4] tweeDEseqCountData_1.25.0
[5] minfiData_0.33.0
[6] IlluminaHumanMethylation450kmanifest_0.4.0
[7] limma_3.44.0
[8] missMethyl_1.22.0
[9] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
[10] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
[11] minfi_1.34.0
[12] bumphunter_1.30.0
[13] locfit_1.5-9.4
[14] iterators_1.0.12
[15] foreach_1.5.0
[16] Biostrings_2.56.0
[17] XVector_0.28.0
[18] SummarizedExperiment_1.18.0
[19] DelayedArray_0.14.0
[20] matrixStats_0.56.0
[21] Biobase_2.48.0
[22] GenomicRanges_1.40.0
[23] GenomeInfoDb_1.24.0
[24] IRanges_2.22.0
[25] S4Vectors_0.26.0
[26] BiocGenerics_0.34.0
[27] BiocStyle_2.16.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 ellipsis_0.3.0
[3] siggenes_1.62.0 mclust_5.4.6
[5] base64_2.0 bit64_0.9-7
[7] xml2_1.3.2 codetools_0.2-16
[9] splines_4.0.0 scrime_1.3.5
[11] knitr_1.28 Rsamtools_2.4.0
[13] annotate_1.66.0 GO.db_3.10.0
[15] dbplyr_1.4.3 HDF5Array_1.16.0
[17] BiocManager_1.30.10 readr_1.3.1
[19] compiler_4.0.0 httr_1.4.1
[21] assertthat_0.2.1 Matrix_1.2-18
[23] htmltools_0.4.0 prettyunits_1.1.1
[25] tools_4.0.0 gtable_0.3.0
[27] glue_1.4.0 GenomeInfoDbData_1.2.3
[29] dplyr_0.8.5 rappdirs_0.3.1
[31] doRNG_1.8.2 Rcpp_1.0.4.6
[33] vctrs_0.2.4 multtest_2.44.0
[35] preprocessCore_1.50.0 nlme_3.1-147
[37] rtracklayer_1.48.0 DelayedMatrixStats_1.10.0
[39] xfun_0.13 stringr_1.4.0
[41] lifecycle_0.2.0 rngtools_1.5
[43] statmod_1.4.34 XML_3.99-0.3
[45] beanplot_1.2 scales_1.1.0
[47] zlibbioc_1.34.0 MASS_7.3-51.6
[49] hms_0.5.3 rhdf5_2.32.0
[51] GEOquery_2.56.0 RColorBrewer_1.1-2
[53] yaml_2.2.1 curl_4.3
[55] gridExtra_2.3 memoise_1.1.0
[57] ggplot2_3.3.0 biomaRt_2.44.0
[59] reshape_0.8.8 stringi_1.4.6
[61] RSQLite_2.2.0 highr_0.8
[63] genefilter_1.70.0 GenomicFeatures_1.40.0
[65] BiocParallel_1.22.0 rlang_0.4.5
[67] pkgconfig_2.0.3 bitops_1.0-6
[69] nor1mix_1.3-0 evaluate_0.14
[71] lattice_0.20-41 ruv_0.9.7.1
[73] purrr_0.3.4 Rhdf5lib_1.10.0
[75] GenomicAlignments_1.24.0 bit_1.1-15.2
[77] tidyselect_1.0.0 plyr_1.8.6
[79] magrittr_1.5 bookdown_0.18
[81] R6_2.4.1 magick_2.3
[83] DBI_1.1.0 pillar_1.4.3
[85] survival_3.1-12 RCurl_1.98-1.2
[87] tibble_3.0.1 crayon_1.3.4
[89] BiocFileCache_1.12.0 rmarkdown_2.1
[91] progress_1.2.2 grid_4.0.0
[93] data.table_1.12.8 blob_1.2.1
[95] digest_0.6.25 xtable_1.8-4
[97] tidyr_1.0.2 illuminaio_0.30.0
[99] munsell_0.5.0 openssl_1.4.1
[101] BiasedUrn_1.07 askpass_1.1
[103] quadprog_1.5-8
Aryee, Martin J, Andrew E Jaffe, Hector Corrada-Bravo, Christine Ladd-Acosta, Andrew P Feinberg, Kasper D Hansen, and Rafael a Irizarry. 2014. “Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays.” Bioinformatics (Oxford, England) 30 (10):1363–9. https://doi.org/10.1093/bioinformatics/btu049.
Bibikova, Marina, Bret Barnes, Chan Tsan, Vincent Ho, Brandy Klotzle, Jennie M Le, David Delano, et al. 2011. “High density DNA methylation array with single CpG site resolution.” Genomics 98 (4). Elsevier Inc.:288–95. https://doi.org/10.1016/j.ygeno.2011.07.007.
Dedeurwaerder, Sarah, Matthieu Defrance, and Emilie Calonne. 2011. “Evaluation of the Infinium Methylation 450K technology.” Epigenomics 3 (6):771–84. http://www.futuremedicine.com/doi/abs/10.2217/epi.11.105.
Gagnon-Bartsch, Johann A, Laurent Jacob, and Terence P Speed. 2013. Removing Unwanted Variation from High Dimensional Data with Negative Controls. Berkeley: Tech. Rep. 820, Department of Statistics, University of California.
Gagnon-Bartsch, Johann a, and Terence P Speed. 2012. “Using control genes to correct for unwanted variation in microarray data.” Biostatistics (Oxford, England) 13 (3):539–52. https://doi.org/10.1093/biostatistics/kxr034.
Geeleher, Paul, Lori Hartnett, Laurance J Egan, Aaron Golden, Raja Affendi Raja Ali, and Cathal Seoighe. 2013. “Gene-set analysis is severely biased when applied to genome-wide methylation data.” Bioinformatics (Oxford, England) 29 (15). Oxford University Press:1851–7. https://doi.org/10.1093/bioinformatics/btt311.
Hansen, Kasper Daniel, Winston Timp, Héctor Corrada Bravo, Sarven Sabunciyan, Benjamin Langmead, Oliver G McDonald, Bo Wen, et al. 2011. “Increased methylation variation in epigenetic domains across cancer types.” Nature Genetics 43 (8):768–75. https://doi.org/10.1038/ng.865.
Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15 (2):R29. https://doi.org/10.1186/gb-2014-15-2-r29.
Maksimovic, Jovana, Johann A Gagnon-Bartsch, Terence P Speed, and Alicia Oshlack. 2015. “Removing unwanted variation in a differential methylation analysis of Illumina HumanMethylation450 array data.” Nucleic Acids Research, May, gkv526. https://doi.org/10.1093/nar/gkv526.
Maksimovic, Jovana, Lavinia Gordon, and Alicia Oshlack. 2012. “SWAN: Subset-quantile within array normalization for illumina infinium HumanMethylation450 BeadChips.” Genome Biology 13 (6). BioMed Central Ltd:R44. https://doi.org/10.1186/gb-2012-13-6-r44.
Phipson, Belinda, Jovana Maksimovic, and Alicia Oshlack. 2016. “missMethyl: an R package for analyzing data from Illumina’s HumanMethylation450 platform.” Bioinformatics (Oxford, England) 32 (2):286–88. https://doi.org/10.1093/bioinformatics/btv560.
Phipson, Belinda, and Alicia Oshlack. 2014. “DiffVar: a new method for detecting differential variability with application to methylation in cancer and aging.” Genome Biology 15 (9). BioMed Central:465. https://doi.org/10.1186/s13059-014-0465-4.
Pickrell, Joseph K., John C. Marioni, Athma A. Pai, Jacob F. Degner, Barbara E. Engelhardt, Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens, Yoav Gilad, and Jonathan K. Pritchard. 2010. “Understanding mechanisms underlying human gene expression variation with RNA sequencing.” Nature 464 (7289). Nature Publishing Group:768–72. https://doi.org/10.1038/nature08872.
Smyth, G. K. 2005. “limma: Linear Models for Microarray Data.” In Bioinformatics and Computational Biology Solutions Using R and Bioconductor, 397–420. New York: Springer-Verlag. https://doi.org/10.1007/0-387-29362-0_23.
Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2010. “Gene ontology analysis for RNA-seq: accounting for selection bias.” Genome Biology 11 (2):R14. https://doi.org/10.1186/gb-2010-11-2-r14.