Shearwater ML

Inigo Martincorena and Moritz Gerstung

6 March 2015

ShearwaterML is a maximum-likelihood adaptation of the original Shearwater algorithm. Unlike the original algorithm, ShearwaterML does not use prior information and yields p-values, instead of Bayes factors, using a Likelihood-Ratio Test. This allows using standard multiple testing correction methods to obtain a list of significant variants with a controlled false discovery rate.

For a detailed description of the algorithm see:

Martincorena I, Roshan A, Gerstung M, et al. (2015). High burden and pervasive positive selection of somatic mutations in normal human skin. Science (2015).

Load data from deepSNV example

library(deepSNV)
## Loading required package: parallel
## Loading required package: Rhtslib
## Rhtslib htslib version 1.7
## Loading required package: IRanges
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect,
##     is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unsplit, which,
##     which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## Loading required package: BiocParallel
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
## 
##     aperm, apply, rowsum
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: VGAM
## Loading required package: splines
## Loading required package: VariantAnnotation
## Loading required package: Rsamtools
## 
## Attaching package: 'VariantAnnotation'
## The following object is masked from 'package:base':
## 
##     tabulate
## 
## Attaching package: 'deepSNV'
## The following objects are masked from 'package:VGAM':
## 
##     dbetabinom, pbetabinom
## The following object is masked from 'package:BiocGenerics':
## 
##     normalize
regions <- GRanges("B.FR.83.HXB2_LAI_IIIB_BRU_K034", IRanges(start = 3120, end=3140))
files <- c(system.file("extdata", "test.bam", package="deepSNV"), system.file("extdata", "control.bam", package="deepSNV"))
counts <- loadAllData(files, regions, q=30)

ShearwaterML: “betabinLRT” calculates p-values for each possible mutation

pvals <- betabinLRT(counts, rho=1e-4, maxtruncate = 1)$pvals
qvals <- p.adjust(pvals, method="BH")
dim(qvals) = dim(pvals)
vcfML = qvals2Vcf(qvals, counts, regions, samples = files, mvcf = TRUE)

Original Shearwater: “bbb” computes the original Bayes factors

bf <- bbb(counts, model = "OR", rho=1e-4)
vcfBF <- bf2Vcf(bf, counts, regions, samples = files, prior = 0.5, mvcf = TRUE)

plot(pvals[1,,], bf[1,,]/(1+bf[1,,]), log="xy")