SeSAMe User Guide

Wanding Zhou, Timothy J Triche Jr, Hui Shen

Installation

From Bioconductor

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("sesame")

Development version can be installed from github

library(devtools)
install_github('zwdzwd/sesameData')
install_github('zwdzwd/sesame')

Data Structure for Signal Intensities

SeSAMe is designed to process Illumina Infinium DNA methylation data. It currently supports EPIC, HM450 and HM27 platforms. The design includes a light-weight full exposure of internal signal intensities (essential information for users of Illumina methylation array data, as demonstrated in Zhou et al 2018), which permits sensitive and specific joint inference on copy number and DNA methylation.

Central to the SeSAMe platform is the SigSet data structure, an S4 class with slots containing signals for six different classes of probes:

  1. II - Type-II probes;
  2. IR - Type-I Red channel probes;
  3. IG - Type-I Grn channel probes;
  4. oobG - Out-of-band Grn channel probes (matching Type-I Red channel probes in number);
  5. oobR - Out-of-band Red channel probes (matching Type-I Grn channel probes in number);
  6. ctl - control probes.

For all save control probes, signal intensities are stored as an Nx2 numeric matrix, with N representing the number of probes in the class. The two columns of the matrix represent the methylated probe intensity and the unmethylated probe intensity. (Previously, this was implemented in an R6 Reference class, SignalSet. The current S4 implementation in SigSet complies with Bioconductor guidelines, and for backwards compatibility, the signalR6toS4 function transforms a SignalSet to a SigSet.

library(sesameData)
## Loading required package: ExperimentHub
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colMeans, colSums, colnames,
##     dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
##     intersect, is.unsorted, lapply, lengths, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
##     rowMeans, rowSums, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min
## Loading required package: AnnotationHub
## Loading sesameData.
library(sesame)
sset <- sesameDataGet('EPIC.1.LNCaP')$sset

For example, printing the SigSet directly shows its content

sset
## SigSet EPIC 
##  - @IG probes: 49989 - 332 4145 70 7094 599 2958 ...
##   - @IR probes: 92294 - 183 8040 1949 6152 833 89 ...
##   - @II probes: 724612 - 6543 1596 3133 1011 3035 2837 ...
##   - @oobG probes: 92294 - 138 277 107 218 232 80 ...
##   - @oobR probes: 49989 - 1013 150 81 910 448 183 ...
##   - @ctl probes: 635 ...
##   - @pval: 866895 - 0.005141179 0.04914081 0.002757492 ...

Type-II probe signal can be browsed in

head(slot(sset, 'II'))
##               M    U
## cg07881041 6543 1011
## cg23229610 1596 3035
## cg03513874 3133 2837
## cg05451842  376 5673
## cg14797042 3571  368
## cg09838562  131 2523

or via the getter function

head(II(sset))
##               M    U
## cg07881041 6543 1011
## cg23229610 1596 3035
## cg03513874 3133 2837
## cg05451842  376 5673
## cg14797042 3571  368
## cg09838562  131 2523

Similarly, signals for Type-I probes (sset@IR and sset@IG) and out-of-band probes (sset@oobG and sset@oobR) can be found in

head(IR(sset))
##               M    U
## cg09835024  183 6152
## cg14361672 8040  833
## cg12950382 1949   89
## cg02115394  287 4494
## cg12480843  163 8861
## cg26724186 8874  668
head(oobG(sset))
##              M   U
## cg09835024 138 218
## cg14361672 277 232
## cg12950382 107  80
## cg02115394  91 145
## cg12480843 125 499
## cg26724186 227 240

as one can see the probe names (row names) of IR always coincide with the probe names (row names) of oobG (and vice versa). This is because the out-of-band probe signal for red channel probes is in green channel (and vice versa). Lastly, Control probes are represented in a data frame with the last column holding the type of the control.

head(ctl(sset))
##                  G     R    col     type
## DNP.20K.       192 26107    -99 STAINING
## Biotin.5K.    6068   141    -99 STAINING
## DNP..High.      90 20356    Red STAINING
## Biotin..Bkg.    82   215   Blue STAINING
## Biotin..High. 8849   285  Green STAINING
## DNP..Bkg.      107   183 Purple STAINING

The openSesame pipeline

The openSesame pipeline is composed of noob, nonlinear dye bias correction and pOOBAH, achieved through:

IDATprefixes <- searchIDATprefixes(
    system.file("extdata/", package = "sesameData"))
betas <- openSesame(IDATprefixes)

or equivalently

betas <- getBetas(dyeBiasCorrTypeINorm(noob(sset)))

behind the scene.

Functionalities

SeSAMe implements stricter QC and preprocessing standards: comprehensive probe quality masking, bleed-through correction in background subtraction, nonlinear dye bias correction, stricter nondetection calling and control for bisulfite conversion based on C/T-extension probes. The package also provides convenient, performant implementations of typical analysis steps, such as the inference of gender, age, ethnicity (based on both internal SNP probes and channel-switching Type-I probes) directly from the data. This allows users to infer these common covariates if such information is not provided, and to check for potential sample swaps when it is provided. SeSAMe also provides functionality for calling differential methylation and segmented copy number.

Read IDATs into SigSet list

A simple list of “SigSet”s are returned. One can also just provide a vector of file paths prefixes (excluding _Red.idat and _Grn.idat, one prefix for a pair of IDATs) and call readIDATpair directly.

Background subtraction

Like many other Infinium Methylation-targeted software, SeSAMe implements the background subtraction based on normal-exponential deconvolution using out-of-band probes noob (Triche et al. 2013) and optionally with extra bleed-through subtraction. Signal bleed-through happens when measurement from one channel affects the measurement in the other channel. SeSAMe’s noobsb further removes residual background by regressing out the green-to-red and red-to-green relationship using Type-I probes.

Dye bias correction

Dye bias refers to the difference in signal intensity between the two color channel. SeSAMe offers two flavors of dye bias correction: linear scaling (dyeBiasCorr) and nonlinear scaling (dyeBiasCorrTypeINorm). Linear scaling equalize the mean of all probes from the two color channel.

Residual dye bias can be corrected using nonlinear quantile interpolation with Type-I probes.

Under this correction, Type-I Red probes and Type-I Grn probes have the same distribution of signal.

Note that linear scaling does not shift beta values of Type-I probes while nonlinear scaling does shift beta values of Type-I probes.

Get betas

Beta values are defined as methylated signal/(methylated signal + unmethylated signal). It can be computed using getBetas function. The output is a named vector with probe ID as name. There are two options for getBetas that affects probe masking. The first is quality.mask=TRUE/FALSE which switches probe quality masking. The quality masking includes mapping issues, SNPs and non-uniqueness, and is described in Zhou et al 2017. nondetection.mask = TRUE/FALSE is used to switch masking of nondetection based on detection P-value. Both masks are recommended to ensure data quality and defaulted to TRUE.

## cg00000029 cg00000103 cg00000109 cg00000155 cg00000158 cg00000165 
##  0.8237945         NA  0.8125637  0.9152265  0.9105163  0.8196466

Beta values for Type-I probes can also be obtained by summing up the two in-band channel and out-of-band channel. This rescues probes with SNP hitting the extension base and hence switching color channel. More details can be found in Zhou et al 2017.

For such probes, extra SNP allele frequencies can be derived by summing up methylated and umethylated alleles.

Sample/experiment QC

SeSAMe implements inference of sex, age, ethnicity. These are valuable information for checking the integrity of the experiment and detecting sample swaps.

Sex

Sex is inferred based on our curated X-linked probes and Y chromosome probes excluding pseudo-autosomal regions.

## [1] "MALE"
## [1] "XaY"

Ethnicity

Ethnicity is inferred using a random forest model trained based on both the built-in SNPs (rs probes) and channel-switching Type-I probes.

## [1] "WHITE"

Age

SeSAMe provides age regression a la the Horvath 353 model.

## [1] 84.13913

Mean intensity

The mean intensity of all the probes characterize the quantity of input DNA and efficiency of probe hybridization.

## [1] 3171.483

Bisulfite conversion control using GCT scores

Infinium platforms are intrinsically robust to incomplete bisulfite conversion as non-converted probes would fail to hybridize to the target. Residual incomplete bisulfite conversion can be quantified using GCT score based on C/T-extension probes. Details of this method can be found in Zhou et al. 2017. The closer the score to 1.0, the more complete the bisulfite conversion.

## [1] 1.067894

Probe retrieval and \(\beta\)-value visualization

To visualize all probes from a gene

To visualize probes from arbitrary region

To visualize by probe names

CNV

SeSAMe performs copy number variation in three steps: 1) normalizes the signal intensity using a copy-number-normal data set; 2) groups adjacent probes into bins; 3) runs DNAcopy internally to group bins into segments.

To visualize segmentation in SeSAMe,

Cell Composition Deconvolution

SeSAMe estimates leukocyte fraction using a two-component model.This function works for samples whose targeted cell-of-origin is not related to white blood cells.

## [1] 0.2197238