The MADSEQ package is a group of hierarchical Bayesian model for the detection and quantification of potential mosaic aneuploidy in sample using massive parallel sequencing data.
The MADSEQ package takes two pieces of information for the detection and quantification of mosaic aneuploidy:
MADSEQ works on the whole chromosome resolution. It applies all of the five models (normal, monosomy, mitotic trisomy, meiotic trisomy, loss of heterozygosity) to fit the distribution of the AAF of all the heterozygous sites, and fit the distribution of the coverage from that chromosome. After fitting the same data using all models, it does model comparison using BIC (Bayesian Information Criteria) to select the best model. The model selected tells us whether the chromosome is aneuploid or not, and also the type of mosaic aneuploidy. Then, from the posterior distribution of the best model, we could get the estimation of the fraction of aneuploidy cells.
Note: Currently our package only supports one bam and one vcf file per sample. If you have more than one sample, please prepare multiple bam and vcf files for each of them.
There are two sets of example data come with the package:
To access the data use
system.file("extdata","aneuploidy.bam",package="MADSEQ")
system.file("extdata","aneuploidy.vcf.gz",package="MADSEQ")
Note:This is just a set of example data, only contains a very little region of the genome.
We will start with the bam file, vcf file and bed file in the example data to show you each step for the analysis.
Started with bam file and bed file, you can use prepareCoverageGC function to get the coverage and GC information for each targeted regions.
## load the package
suppressMessages(library("MADSEQ"))
## get path to the location of example data
aneuploidy_bam = system.file("extdata","aneuploidy.bam",package="MADSEQ")
normal_bam = system.file("extdata","normal.bam",package="MADSEQ")
target = system.file("extdata","target.bed",package="MADSEQ")
## Note: for your own data, just specify the path to the location
## of your file using character.
## prepare coverage and GC content for each targeted region
# aneuploidy sample
aneuploidy_cov = prepareCoverageGC(target_bed=target,
bam=aneuploidy_bam,
"hg19")
#> 384 regions from 24 chromosomes in the bed file.
#> calculating depth from BAM...
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, parApply, parCapply, parLapply,
#> parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#> as.data.frame, cbind, colMeans, colSums, colnames, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect,
#> is.unsorted, lapply, lengths, mapply, match, mget, order,
#> paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rowMeans,
#> rowSums, rownames, sapply, setdiff, sort, table, tapply,
#> union, unique, unsplit, which, which.max, which.min
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#>
#> expand.grid
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> calculating GC content...
# normal sample
normal_cov = prepareCoverageGC(target_bed=target,
bam=normal_bam,
"hg19")
#> 384 regions from 24 chromosomes in the bed file.
#> calculating depth from BAM...
#> calculating GC content...
## view the first two rows of prepared coverage data (A GRanges Object)
aneuploidy_cov[1:2]
#> GRanges object with 2 ranges and 2 metadata columns:
#> seqnames ranges strand | depth GC
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr14 [20528156, 20528257] * | 0 0.303921568627451
#> [2] chr14 [21538016, 21538117] * | 21 0.647058823529412
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
normal_cov[1:2]
#> GRanges object with 2 ranges and 2 metadata columns:
#> seqnames ranges strand | depth GC
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr14 [20528156, 20528257] * | 0 0.303921568627451
#> [2] chr14 [21538016, 21538117] * | 4 0.647058823529412
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
The normalization function takes prepared coverage GRanges object from prepareCoverageGC function, normalize the coverage and calculate the expected coverage for the sample. If there is only one sample, the function will correct the coverage by GC content, and take the average coverage for the whole genome as expected coverage. If there are more than one samples given, the function will first quantile normalize coverage across samples, then correct the coverage by GC for each sample. If control sample is not specified, the expected coverage is the median coverage across all samples, if a normal control is specified, the average coverage for control sample is taken as expected coverage for further analysis.
Note:
If you choose to write the output to file (recommended)
## normalize coverage data
## set plot=FALSE here because similar plot will show in the following example
normalizeCoverage(aneuploidy_cov,writeToFile=TRUE, destination=".",plot=FALSE)
#> no control provided
#> there are 1 samples
#> correct GC bias in sample 'aneuploidy_cov' ...
#> normalized depth for sample aneuploidy_cov is written to ./aneuploidy_cov_normed_depth.txt
If you don’t want to write output to file
## normalize coverage data
aneuploidy_normed = normalizeCoverage(aneuploidy_cov,writeToFile=FALSE,
plot=FALSE)
#> no control provided
#> there are 1 samples
#> correct GC bias in sample 'aneuploidy_cov' ...
## a GRangesList object will be produced by the function, look at it by
names(aneuploidy_normed)
#> [1] "aneuploidy_cov"
aneuploidy_normed[["aneuploidy_cov"]]
#> GRanges object with 188 ranges and 4 metadata columns:
#> seqnames ranges strand | depth GC
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr14 [21538016, 21538117] * | 21 0.647058823529412
#> [2] chr14 [22377854, 22377955] * | 24 0.362745098039216
#> [3] chr14 [26201540, 26201641] * | 45 0.362745098039216
#> [4] chr14 [26475535, 26475636] * | 8 0.264705882352941
#> [5] chr14 [29080232, 29080333] * | 6 0.323529411764706
#> ... ... ... ... . ... ...
#> [184] chr18 [69135718, 69135819] * | 14 0.323529411764706
#> [185] chr18 [72988854, 72988955] * | 118 0.519607843137255
#> [186] chr18 [73100816, 73100917] * | 12 0.254901960784314
#> [187] chr18 [73278496, 73278597] * | 91 0.441176470588235
#> [188] chr18 [77147421, 77147522] * | 75 0.588235294117647
#> normed_depth ref_depth
#> <numeric> <numeric>
#> [1] 30 0
#> [2] 28 0
#> [3] 49 0
#> [4] 30 0
#> [5] 25 0
#> ... ... ...
#> [184] 33 0
#> [185] 100 0
#> [186] 32 0
#> [187] 85 0
#> [188] 67 0
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
If you choose to write the output to file (recommended)
## normalize coverage data
normalizeCoverage(aneuploidy_cov, normal_cov,
writeToFile =TRUE, destination = ".", plot=FALSE)
#> no control provided
#> there are 2 samples
#> Quantile normalizing ...
#> correct GC bias in sample' aneuploidy_cov '...
#> correct GC bias in sample' normal_cov '...
#> normalized depth for sample aneuploidy_cov is written to ./aneuploidy_cov_normed_depth.txt
#> normalized depth for sample normal_cov is written to ./normal_cov_normed_depth.txt
If you don’t want to write output to file
## normalize coverage data
normed_without_control = normalizeCoverage(aneuploidy_cov, normal_cov,
writeToFile=FALSE, plot=TRUE)
#> no control provided
#> there are 2 samples
#> Quantile normalizing ...
#> correct GC bias in sample' aneuploidy_cov '...
#> correct GC bias in sample' normal_cov '...
## a GRangesList object will be produced by the function
length(normed_without_control)
#> [1] 2
names(normed_without_control)
#> [1] "aneuploidy_cov" "normal_cov"
## subsetting
normed_without_control[["aneuploidy_cov"]]
#> GRanges object with 188 ranges and 5 metadata columns:
#> seqnames ranges strand | depth quantiled_depth
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr14 [21538016, 21538117] * | 21 17
#> [2] chr14 [22377854, 22377955] * | 24 19
#> [3] chr14 [26201540, 26201641] * | 45 36
#> [4] chr14 [26475535, 26475636] * | 8 6
#> [5] chr14 [29080232, 29080333] * | 6 5
#> ... ... ... ... . ... ...
#> [184] chr18 [69135718, 69135819] * | 14 12
#> [185] chr18 [72988854, 72988955] * | 118 93
#> [186] chr18 [73100816, 73100917] * | 12 10
#> [187] chr18 [73278496, 73278597] * | 91 70
#> [188] chr18 [77147421, 77147522] * | 75 58
#> GC normed_depth ref_depth
#> <numeric> <numeric> <numeric>
#> [1] 0.647058823529412 24 37.2867772108844
#> [2] 0.362745098039216 22 37.2867772108844
#> [3] 0.362745098039216 39 37.2867772108844
#> [4] 0.264705882352941 23 37.2867772108844
#> [5] 0.323529411764706 20 37.2867772108844
#> ... ... ... ...
#> [184] 0.323529411764706 27 35.76
#> [185] 0.519607843137255 80 35.76
#> [186] 0.254901960784314 25 35.76
#> [187] 0.441176470588235 65 35.76
#> [188] 0.588235294117647 51 35.76
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
normed_without_control[["normal_cov"]]
#> GRanges object with 190 ranges and 5 metadata columns:
#> seqnames ranges strand | depth quantiled_depth
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr14 [21538016, 21538117] * | 4 5
#> [2] chr14 [22377854, 22377955] * | 32 43
#> [3] chr14 [26201540, 26201641] * | 24 32
#> [4] chr14 [26475535, 26475636] * | 12 15
#> [5] chr14 [29080232, 29080333] * | 11 14
#> ... ... ... ... . ... ...
#> [186] chr18 [69135718, 69135819] * | 13 18
#> [187] chr18 [72988854, 72988955] * | 57 78
#> [188] chr18 [73100816, 73100917] * | 10 12
#> [189] chr18 [73278496, 73278597] * | 29 37
#> [190] chr18 [77147421, 77147522] * | 30 38
#> GC normed_depth ref_depth
#> <numeric> <numeric> <numeric>
#> [1] 0.647058823529412 17 37.2867772108844
#> [2] 0.362745098039216 43 37.2867772108844
#> [3] 0.362745098039216 32 37.2867772108844
#> [4] 0.264705882352941 31 37.2867772108844
#> [5] 0.323529411764706 24 37.2867772108844
#> ... ... ... ...
#> [186] 0.323529411764706 28 35.76
#> [187] 0.519607843137255 69 35.76
#> [188] 0.254901960784314 28 35.76
#> [189] 0.441176470588235 32 35.76
#> [190] 0.588235294117647 40 35.76
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
If you choose to write the output to file (recommended)
## normalize coverage data, normal_cov is the control sample
normalizeCoverage(aneuploidy_cov, control=normal_cov,
writeToFile=TRUE, destination = ".",plot=FALSE)
#> control: normal_cov
#> there are 2 samples
#> Quantile normalizing ...
#> correct GC bias in sample' normal_cov '...
#> correct GC bias in sample' aneuploidy_cov '...
#> normalized depth for sample normal_cov is written to ./normal_cov_normed_depth.txt
#> normalized depth for sample aneuploidy_cov is written to ./aneuploidy_cov_normed_depth.txt
If you don’t want to write output to file
normed_with_control = normalizeCoverage(aneuploidy_cov, control=normal_cov,
writeToFile =FALSE, plot=FALSE)
#> control: normal_cov
#> there are 2 samples
#> Quantile normalizing ...
#> correct GC bias in sample' normal_cov '...
#> correct GC bias in sample' aneuploidy_cov '...
## a GRangesList object will be produced by the function
length(normed_without_control)
#> [1] 2
names(normed_with_control)
#> [1] "normal_cov" "aneuploidy_cov"
Having vcf.gz file and target bed file ready, use prepareHetero function to process the heterozygous sites.
## specify the path to vcf.gz file
aneuploidy_vcf = system.file("extdata","aneuploidy.vcf.gz",package="MADSEQ")
## target bed file specified before
## If you choose to write the output to file (recommended)
prepareHetero(aneuploidy_vcf, target, genome="hg19",
writeToFile=TRUE, destination=".")
#> starting filter
#> filtering 387 records
#> completed filtering
#> filtered heterozygous sites for sample aneuploidy.vcf.gz is written to ./aneuploidy.vcf.gz_filtered_heterozygous.txt
## If you don't want to write output to file
aneuploidy_hetero = prepareHetero(aneuploidy_vcf, target,
genome="hg19", writeToFile=FALSE)
#> starting filter
#> filtering 387 records
#> completed filtering
The function runMadSeq will run the models and select the best model for the input data.
Note:
## specify the path to processed files
aneuploidy_hetero = "./aneuploidy.vcf.gz_filtered_heterozygous.txt"
aneuploidy_normed_cov = "./aneuploidy_cov_normed_depth.txt"
## run the model
aneuploidy_chr18 = runMadSeq(hetero=aneuploidy_hetero,
coverage=aneuploidy_normed_cov,
target_chr="chr18",
nChain=1, nStep=1000, thinSteps=1,
adapt=100,burnin=200)
#> total number of heterozygous site: 17
#> total number of coverage 34
#> module mix loaded
#> 1. running normal model
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 51
#> Unobserved stochastic nodes: 19
#> Total graph size: 141
#>
#> Initializing model
#> 2. running monosomy model
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 53
#> Unobserved stochastic nodes: 18
#> Total graph size: 188
#>
#> Initializing model
#> 3. running mitotic trisomy model
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 53
#> Unobserved stochastic nodes: 18
#> Total graph size: 188
#>
#> Initializing model
#> 4. running meiotic trisomy model
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 53
#> Unobserved stochastic nodes: 19
#> Total graph size: 218
#>
#> Initializing model
#> 5. running loss of heterozygosity model
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 51
#> Unobserved stochastic nodes: 22
#> Total graph size: 1105
#>
#> Initializing model
#> models done, comparing models
#> Order and delta BIC of the preference of models
#> BIC_normal BIC_mitotic_trisomy BIC_meiotic_trisomy
#> 0.000000 9.220744 16.529681
#> BIC_monosomy BIC_LOH
#> 20.260492 33.074884
#> model selected: normal
## An MadSeq object will be returned
aneuploidy_chr18
#> MadSeq object with the posterior distribution from normal model
#> kappa m_cov mu p_cov r_cov
#> [1,] 1.908792 31.96 0.4687737 0.16946992 6.521448
#> [2,] 4.928790 31.96 0.4687737 0.09850659 3.492283
#> [3,] 2.523402 31.96 0.4687737 0.10912061 3.914666
#> [4,] 2.060676 31.96 0.4687737 0.10089398 3.586420
#> [5,] 2.094958 31.96 0.4687737 0.22641513 9.354148
#> [6,] 2.035052 31.96 0.4687737 0.13647759 5.051199
#> ------
#> BIC_normal BIC_mitotic_trisomy BIC_meiotic_trisomy
#> 0.000000 9.220744 16.529681
#> BIC_monosomy BIC_LOH
#> 20.260492 33.074884
Note: In order to save time, we only run 1 chain with a much less steps compared with default settings. For real cases, the default settings are recommended.
## subset normalized coverage for aneuploidy sample from the GRangesList
## returned by normalizeCoverage function
aneuploidy_normed_cov = normed_with_control[["aneuploidy_cov"]]
## run the model
aneuploidy_chr18 = runMadSeq(hetero=aneuploidy_hetero,
coverage=aneuploidy_normed_cov,
target_chr="chr18")
## An MadSeq object will be returned
aneuploidy_chr18
The MadSeq object from the runMadSeq function contains:
Note: The value of delta BIC suggests the strength of the confidence of the selected model against other models. It can be summarized as
deltaBIC | Evidence against higher BIC |
---|---|
[0,2] | Not worth more than a bare mention |
(2,6] | Positive |
(6,10] | Strong |
>10 | Very Strong |
There are a group of plot functions to plot the output MadSeq object from the runMadSeq.
## plot the posterior distribution for all the parameters in selected model
plotMadSeq(aneuploidy_chr18)
## plot the histogram for the estimated fraction of aneuploidy
plotFraction(aneuploidy_chr18, prob=0.95)
#> selected model is normal, f is estimated to be 0%
## plot the distribution of AAF as estimated by the model
plotMixture(aneuploidy_chr18)
parameters | description |
---|---|
f | Fraction of mosaic aneuploidy |
m | The midpoint of the alternative allele frequency (AAF) for all heterozygous sites |
mu[1] | Mean AAF of mixture 1: the AAFs of this mixture shifted from midpoint to some higher values |
mu[2] | Mean AAF of mixture 2: the AAFs of this mixture shifted from midpoint to some lower values |
mu[3] (LOH model) | Mean AAF of mixture 3: In LOH model, mu[3] indicates normal sites without loss of heterozygosity |
mu[3] (meiotic trisomy model) | Mean AAF of mixture 3: In meiotic model, the AAFs of this mixture shifted from 0 to some higher value |
mu[4] | Mean AAF of mixture 4: the AAFs of this mixture shifted from 1 to some lower value (only in meiotic model) |
kappa | Indicate variance of the AAF mixtures: larger kappa means smaller variance |
p[1] | Weight of mixture 1: indicate the proportion of heterozygous sites in the mixture 1 |
p[2] | Weight of mixture 2: indicate the proportion of heterozygous sites in the mixture 2 |
p[3] | Weight of mixture 3: indicate the proportion of heterozygous sites in the mixture 3 (only in LOH and meiotic model) |
p[4] | Weight of mixture 4: indicate the proportion of heterozygous sites in the mixture 4 (only in meiotic model) |
p[5] | Weight of outlier component: the AAF of 1% sites might not well behaved, so these sites are treated as noise. |
m_cov | Mean coverage of all the sites from the chromosome, estimated from a negative binomial distribution |
p_cov | Prob of the negative binomial distribution for the coverage |
r_cov | Another parameter (r) for the negative binomial disbribution of the coverage, small r means large variance |