In this vignette, we describe usage of a suite of tools, SEESAW, Statistical Estimation of allelic Expression using Salmon and Swish, which allow for testing allelic imbalance across samples.
The methods are described in Wu et al. (2022) doi: 10.1101/2022.08.12.503785.
SEESAW makes use of Swish (Zhu et al. 2019) for paired inference, which is an extension of the SAMseq (Li and Tibshirani 2011) methods for permutation-based FDR control.
Type of tests
SEESAW allows for testing global allelic imbalance across all samples (pairwise testing within each individual), as well as differential, or dynamic allelic imbalance (pairwise allelic fold changes estimated within individual, followed by testing across two groups, or along an additional covariate). Each of these allelic imbalance (AI) analyses takes into account the potentially heterogeneous amount of inferential uncertainty per sample, per feature (transcript, transcript-group, or gene), and per allele.
Steps in SEESAW
Running SEESAW involves generation of a diploid transcriptome
(e.g. using g2gtools), construction of a diploid Salmon index
(specifying --keepDuplicates), followed by Salmon quantification
with a number of bootstrap inferential replicates (we recommend
30 bootstrap replicates). These
three steps (diploid reference preparation, indexing, quantification
with bootstraps) provide the input data for the following statistical
analyses in R/Bioconductor. The steps shown in this vignette leverage
Bioconductor infrastructure including SummarizedExperiment for
storage of input data and results, tximport for data import,
and GRanges and Gviz for plotting.
In short the SEESAW steps are as listed, and diagrammed below:
1. Diploid reference preparation (e.g. with g2gtools)
2. Salmon indexing with --keepDuplicates
3. Salmon quantification with bootstrap inferential replicates
4. makeTx2Tss() aggregates data to TSS-level (optional)
5. importAllelicCounts() creates a SummarizedExperiment
6. labelKeep() and swish() (skip scaling)
Below we demonstrate an analysis where transcripts are grouped by
their transcription start site (TSS), although gene-level or
transcript-level analysis is also possible. Additionally, any custom
grouping could be used, by manually generating a t2g table
as shown below. Special plotting functions in fishpond
facilitate visualization of allelic and isoform changes at different
resolutions, alongside gene models. In three examples, we perform global
AI testing, differential AI testing, and dynamic AI testing, in all
cases on simulated data associated with human genes.
We begin assuming steps 1-3 have been completed. We can use the makeTx2Tss function to generate a GRanges object t2g that connects transcripts to transcript groups.
suppressPackageStartupMessages(library(ensembldb))
library(EnsDb.Hsapiens.v86)
library(fishpond)
edb <- EnsDb.Hsapiens.v86
t2g <- makeTx2Tss(edb) # GRanges object
mcols(t2g)[,c("tx_id","group_id")]
## DataFrame with 216741 rows and 2 columns
## tx_id group_id
## <character> <character>
## ENST00000456328 ENST00000456328 ENSG00000223972-11869
## ENST00000450305 ENST00000450305 ENSG00000223972-12010
## ENST00000488147 ENST00000488147 ENSG00000227232-29570
## ENST00000619216 ENST00000619216 ENSG00000278267-17436
## ENST00000473358 ENST00000473358 ENSG00000243485-29554
## ... ... ...
## ENST00000420810 ENST00000420810 ENSG00000224240-2654..
## ENST00000456738 ENST00000456738 ENSG00000227629-2659..
## ENST00000435945 ENST00000435945 ENSG00000237917-2663..
## ENST00000435741 ENST00000435741 ENSG00000231514-2662..
## ENST00000431853 ENST00000431853 ENSG00000235857-5685..
Alternatively for gene-level analysis, one could either prepare a t2g data.frame with tx_id and gene_id columns, or a t2g GRanges object with a column group_id that is equal to gene_id.
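As a minimal sketch of the data.frame option, the following base R code builds a small gene-level table by hand. The transcript and gene IDs are copied from the makeTx2Tss() output printed above; the object name t2g_gene is only illustrative, and the step that strips the TSS position from group_id assumes the "gene-position" format seen in that output.

```r
# IDs taken from the makeTx2Tss() output shown above;
# group_id there has the form <gene_id>-<TSS position>
tx_id <- c("ENST00000456328", "ENST00000450305", "ENST00000488147")
group_id <- c("ENSG00000223972-11869",
              "ENSG00000223972-12010",
              "ENSG00000227232-29570")

# drop the trailing "-<position>" to recover the gene ID
gene_id <- sub("-[0-9]+$", "", group_id)

# illustrative gene-level table with the two required columns
t2g_gene <- data.frame(tx_id = tx_id, gene_id = gene_id,
                       stringsAsFactors = FALSE)

# two transcripts map to the first gene, one to the second
table(t2g_gene$gene_id)
```

For a full annotation one would build the same two columns for every transcript in the reference.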
Here we will use simulated data, but we can import allelic counts with the importAllelicCounts() function. It is best to read over the manual page for this function. For TSS-level analysis, the t2g GRanges generated above should be passed to the tx2gene argument. This will summarize transcript-level counts to the TSS level, and will attach rowRanges that provide the genomic locations of the grouped transcripts. Note that importAllelicCounts does not yet have the ability to automatically generate ranges based on sequence hashing (as occurs in tximeta).
Because we use --keepDuplicates in the step when we build the Salmon index, there will be a number of features in which there is no information about the allelic expression in the reads. We can find these features in bootstrap data by examining when the inferential replicates are nearly identical for the two alleles, as this is how the EM will split the reads. Removing these features avoids downstream problems during differential testing. Code for this filtering follows:
n <- ncol(y)/2
rep1a1 <- assay(y, "infRep1")[,y$allele == "a1"]
rep1a2 <- assay(y, "infRep1")[,y$allele == "a2"]
mcols(y)$someInfo <- rowSums(abs(rep1a1 - rep1a2) < 1) < n
y <- y[ mcols(y)$someInfo, ]
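To see what this filter does, here is a small base R sketch that applies the same rule to made-up matrices standing in for the first inferential replicate of each allele. The counts and feature names are hypothetical. A feature is kept only if at least one sample shows an allelic difference of 1 or more.

```r
# Hypothetical first-replicate counts: 4 features (rows) x 3 samples (columns).
# matrix() fills column by column, so each group of four values is one sample.
rn <- list(paste0("f", 1:4), paste0("s", 1:3))
rep1a1 <- matrix(c(10.2, 50.0, 7.5, 30.0,
                   12.0, 48.0, 7.1, 31.0,
                    9.8, 55.0, 6.9, 29.5), nrow = 4, dimnames = rn)
rep1a2 <- matrix(c(10.0, 20.0, 7.5, 30.4,
                   12.3, 22.0, 7.0, 12.0,
                    9.9, 25.0, 7.2, 29.9), nrow = 4, dimnames = rn)

n <- ncol(rep1a1)  # number of samples

# number of samples in which the two alleles are nearly identical
nearIdentical <- rowSums(abs(rep1a1 - rep1a2) < 1)

# keep a feature unless the alleles are nearly identical in every sample
someInfo <- nearIdentical < n
someInfo
```

Here f1 and f3 have near-identical allelic counts in all three samples and would be removed, while f2 and f4 are retained.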
We begin by generating a simulated data object that resembles what one would obtain with importAllelicCounts(). The import function arranges the a2 (non-effect) allelic counts first, followed by the a1 (effect) allelic counts. Allelic ratios are calculated as a1/a2, which follows the notational standard in PLINK and other tools.
suppressPackageStartupMessages(library(SummarizedExperiment))
set.seed(1)
y <- makeSimSwishData(allelic=TRUE)
colData(y)
## DataFrame with 20 rows and 2 columns
## allele sample
## <factor> <factor>
## s1-a2 a2 sample1
## s2-a2 a2 sample2
## s3-a2 a2 sample3
## s4-a2 a2 sample4
## s5-a2 a2 sample5
## ... ... ...
## s6-a1 a1 sample6
## s7-a1 a1 sample7
## s8-a1 a1 sample8
## s9-a1 a1 sample9
## s10-a1 a1 sample10
levels(y$allele) # a1/a2 allelic fold changes
## [1] "a2" "a1"
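The ordering and ratio convention can be illustrated with a few lines of base R. This is only a sketch with made-up counts for a single feature across three samples; it mimics the column layout (a2 first, then a1) but does not use the simulated object.

```r
# allele factor with a2 as the reference (first) level, as in colData(y)
allele <- factor(rep(c("a2", "a1"), each = 3), levels = c("a2", "a1"))

# made-up counts for one feature: three a2 columns, then three a1 columns
counts <- c(20, 40, 10, 40, 80, 20)

# allelic ratio is a1 over a2, matched by sample position
ratio <- counts[allele == "a1"] / counts[allele == "a2"]
log2fc <- log2(ratio)
log2fc
```

A positive log2 fold change therefore indicates higher expression of the a1 (effect) allele.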
A hidden code chunk is used to add ranges from the EnsDb to the simulated dataset. For a real dataset, the ranges would be added either by importAllelicCounts (if using tx2gene) or could be added manually for transcript- or gene-level analysis, using the rowRanges<- setter function. The ranges are only needed for the plotAllelicGene plotting function below.
<hidden code chunk>
We can already plot a heatmap of allelic ratios, before performing statistical testing. We can see in the first gene, ADSS, there appear to be two groups of transcripts with opposing allelic fold change. SEESAW makes use of pheatmap for plotting a heatmap of allelic ratios.
y <- computeInfRV(y) # for posterior mean, variance
gene <- rowRanges(y)$gene_id[1]
idx <- mcols(y)$gene_id == gene
plotAllelicHeatmap(y, idx=idx)
The following two functions perform a Swish analysis, comparing the allelic counts within sample, while accounting for uncertainty in the assignment of the reads. The underlying test statistic is a Wilcoxon signed-rank statistic, which compares the two allele counts from each sample, making this a paired analysis.
Scaling: Note that we do not use scaleInfReps in the allelic pipeline. Because we compare the two alleles within samples, there is no need to perform scaling of the counts to adjust for sequencing depth. We simply import counts, filter low counts with labelKeep and then run the statistical testing with swish.
Fast mode: for basic allelic analysis, we use a paired test, comparing one allele to the other. The default in swish for a simple paired test is to use a Wilcoxon signed rank test statistic with bootstrap aggregation and permutation significance. The ranks must be recomputed per permutation, which is a slow operation that is not necessary with other designs in swish. A faster test statistic is the one-sample z-score, which gives similar results. Here we demonstrate using the fast version of the paired test. Note that fast=1 is only relevant for simple paired tests, not for other designs, which are already fast.
y <- labelKeep(y)
y <- swish(y, x="allele", pair="sample", fast=1)
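To give a feel for the two paired statistics mentioned above, the sketch below computes a textbook Wilcoxon signed-rank statistic and a one-sample z-score on hypothetical within-sample log fold changes for one feature. It is not fishpond's internal code: swish additionally averages over inferential replicates and uses permutations for significance, and its exact definitions may differ from these textbook forms.

```r
set.seed(1)
# hypothetical log2(a1/a2) per sample for one feature, one inferential replicate
lfc <- rnorm(10, mean = 1, sd = 0.5)

# signed-rank statistic: sum of the ranks of |lfc| over positive values;
# the ranking step is what makes this costly to repeat per permutation
signedRank <- sum(rank(abs(lfc))[lfc > 0])

# one-sample z-score: mean divided by its standard error, no ranking needed
z <- mean(lfc) / (sd(lfc) / sqrt(length(lfc)))

# cross-check the signed-rank value against stats::wilcox.test
V <- unname(wilcox.test(lfc)$statistic)
c(signedRank = signedRank, V = V, z = z)
```

Both statistics grow with a consistent allelic fold change across samples, which is why the two modes tend to agree.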
We can return to the heatmap, and now add q-values, etc. For details on adding metadata to a pheatmap plot object, see ?pheatmap.
dat <- data.frame(minusLogQ=-log10(mcols(y)$qvalue[idx]),
                  row.names=rownames(y)[idx])
plotAllelicHeatmap(y, idx=idx, annotation_row=dat)
In order to visualize the inferential uncertainty, we can make use of plotInfReps():
par(mfrow=c(2,1), mar=c(1,4.1,2,2))
plotInfReps(y, idx=1, x="allele", cov="sample", xaxis=FALSE, xlab="")
plotInfReps(y, idx=2, x="allele", cov="sample", xaxis=FALSE, xlab="")
Plotting results in genomic context
For analysis at the isoform or TSS-level, it may be useful to display results within a gene, relating the allelic differences to various gene features. SEESAW provides plotAllelicGene() in order to build visualization of Swish test statistics, allelic proportions, and isoform proportions, in a genomic context, making use of Gviz. Note that this function is not relevant for gene-level AI analysis. The first three arguments to plotAllelicGene() are the SummarizedExperiment object, the name of a gene (should match gene_id column), and a TxDb or EnsDb to use for plotting the gene model at the top. The statistics and proportions are then plotted at the first position of the feature (start for + features and end for - features).
gene <- rowRanges(y)$gene_id[1]
plotAllelicGene(y, gene, edb)
You can also specify the gene using symbol:
plotAllelicGene(y, symbol="ADSS", db=edb)
In the allelic proportion and isoform proportion tracks, a line is drawn through the mean proportion for a2 and a1 allele, and for the isoform proportion, across samples, at the start site for each transcript group. The line is meant only to help visualize the mean value as it may change across transcript groups, but the line has no meaning in the ranges in between features. That is, unlike continuous genomic features (methylation or accessibility), there is no meaning to the allelic proportion or isoform proportion outside of measured start sites of transcription.
We can further customize the plot, for example, changing the labels displayed on the gene model, and changing the labels for the alleles. An ideogram can be added with ideogram=TRUE, although this requires connecting to an external FTP site. See the plotAllelicGene() manual page for more details.
plotAllelicGene(y, gene, edb,
transcriptAnnotation="transcript",
labels=list(a2="maternal",a1="paternal"))
We can also customize the display of the alleles in the plotInfReps() plots, by adding a new factor, while carefully noting the existing and new allele labels, to make sure the annotation is correct:
y$allele_new <- y$allele
# note a2 is non-effect, a1 is effect:
levels(y$allele)
## [1] "a2" "a1"
# replace a2 then a1:
levels(y$allele_new) <- c("maternal","paternal")
plotInfReps(y, idx=1, x="allele_new",
legend=TRUE, legendPos="bottom")
Above, we tested for global AI, where the allelic fold change is consistent across all samples. We can also test for differential or dynamic AI, by adding specification of a cov (covariate) which can be either a two-group factor, or a continuous variable. Here we demonstrate differential AI, when cov is a two-group factor, in this case called "condition".
set.seed(1)
y <- makeSimSwishData(diffAI=TRUE, n=12)
colData(y)
## DataFrame with 24 rows and 3 columns
## allele sample condition
## <factor> <factor> <factor>
## s1-a2 a2 sample1 A
## s2-a2 a2 sample2 A
## s3-a2 a2 sample3 A
## s4-a2 a2 sample4 A
## s5-a2 a2 sample5 A
## ... ... ... ...
## s8-a1 a1 sample8 B
## s9-a1 a1 sample9 B
## s10-a1 a1 sample10 B
## s11-a1 a1 sample11 B
## s12-a1 a1 sample12 B
table(y$condition, y$allele)
##
## a2 a1
## A 6 6
## B 6 6
In the following, we test for changes in allelic imbalance across condition. This is implemented as an “interaction” test, where we test if the fold change associated with allele, for paired samples, differs across condition.
y <- labelKeep(y)
y <- swish(y, x="allele", pair="sample",
           cov="condition", interaction=TRUE)
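The idea behind the interaction test can be sketched in base R: compute a log fold change within each sample, then ask whether those fold changes differ between the two groups. The example below uses hypothetical log fold changes and a plain Wilcoxon rank-sum test. It is an illustration of the concept under those assumptions, not the swish implementation, which also aggregates over inferential replicates and uses permutations.

```r
set.seed(1)
condition <- factor(rep(c("A", "B"), each = 6))

# hypothetical per-sample log2(a1/a2) for one feature:
# no imbalance in group A, strong imbalance in group B
lfc <- c(rnorm(6, mean = 0, sd = 0.3),
         rnorm(6, mean = 2, sd = 0.3))

# compare the within-sample fold changes across the two groups
res <- wilcox.test(lfc[condition == "B"], lfc[condition == "A"])
res$statistic
res$p.value
```

Because each fold change is already computed within an individual, differences in sequencing depth between samples cancel before the groups are compared.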
In this simulated data, the top two features exhibit differential AI with low uncertainty, so these emerge as highly significant, as expected.
mcols(y)[1:2,c("stat","qvalue")]
## DataFrame with 2 rows and 2 columns
## stat qvalue
## <numeric> <numeric>
## gene-1 17.3 0.005
## gene-2 -17.5 0.005
The non-AI features have roughly uniform p-values:
hist(mcols(y)[-c(1:6),"pvalue"])
We can plot the allelic counts with uncertainty, grouped by the condition (black and grey lines at bottom).
plotInfReps(y, idx=1, x="allele", cov="condition",
xaxis=FALSE, legend=TRUE, legendPos="bottomright")
We can also visualize the data across multiple features, in terms of allelic ratios:
idx <- c(1:6)
row_dat <- data.frame(minusLogQ=-log10(mcols(y)$qvalue[idx]),
                      row.names=rownames(y)[idx])
col_dat <- data.frame(condition=y$condition[1:12],
                      row.names=paste0("s",1:12))
plotAllelicHeatmap(y, idx=idx,
annotation_row=row_dat,
annotation_col=col_dat,
cluster_rows=FALSE)
Now we demonstrate dynamic AI testing when cov (covariate) is a continuous variable. In this case, the user should specify a correlation test, either cor="pearson" or "spearman", which is the underlying test statistic used by Swish (it will then be averaged over bootstraps and a permutation null is generated to assess FDR). We have found that Pearson correlations work well in our testing, but the Spearman correlation offers additional robustness against outlying values in cov.
set.seed(1)
y <- makeSimSwishData(dynamicAI=TRUE)
colData(y)
## DataFrame with 20 rows and 3 columns
## allele sample time
## <factor> <factor> <numeric>
## s1-a2 a2 sample1 0.00
## s2-a2 a2 sample2 0.11
## s3-a2 a2 sample3 0.22
## s4-a2 a2 sample4 0.33
## s5-a2 a2 sample5 0.44
## ... ... ... ...
## s6-a1 a1 sample6 0.56
## s7-a1 a1 sample7 0.67
## s8-a1 a1 sample8 0.78
## s9-a1 a1 sample9 0.89
## s10-a1 a1 sample10 1.00
A hidden code chunk adds ranges to our simulation data.
<hidden code chunk>
In the following, we test for changes in allelic imbalance within sample that correlate with a covariate time.
y <- labelKeep(y)
y <- swish(y, x="allele", pair="sample", cov="time", cor="pearson")
Note the first two features have small q-values and test statistics of opposite sign; here the test statistic is the average Pearson correlation of the allelic log fold change with the time variable, averaging over bootstrap replicates.
mcols(y)[1:2,c("stat","qvalue")]
## DataFrame with 2 rows and 2 columns
## stat qvalue
## <numeric> <numeric>
## ADSS-244452134 0.870969 0.005
## ADSS-244419273 -0.861573 0.005
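The statistic described above can be mimicked on toy data: correlate the allelic log fold change with time within each replicate, then average the correlations. In this base R sketch the log fold changes, the noise level, and the use of 30 replicates are all assumptions for illustration; it does not reproduce the permutation null that swish uses for FDR.

```r
set.seed(1)
time <- seq(0, 1, length.out = 10)
nboot <- 30

# hypothetical allelic log2 fold changes that increase with time;
# after t(), rows are bootstrap replicates and columns are samples
lfc <- t(replicate(nboot, 2 * time + rnorm(length(time), sd = 0.3)))

# Pearson correlation with time within each replicate
corPerRep <- apply(lfc, 1, cor, y = time, method = "pearson")

# averaging over replicates gives one statistic for the feature
stat <- mean(corPerRep)
stat
```

Swapping method = "pearson" for "spearman" gives the rank-based alternative mentioned earlier.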
For plotting inferential replicates over a continuous variable, we must first compute summary statistics of inferential mean and variance:
y <- computeInfRV(y)
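Conceptually, the inferential mean and variance are summaries taken across the inferential replicates for each feature and sample. The base R sketch below shows that calculation for one feature on a made-up matrix of replicates; the Poisson draws and the names infReps, infMean and infVar are illustrative assumptions, not fishpond objects.

```r
set.seed(1)
# hypothetical inferential replicates for one feature:
# rows are 4 samples, columns are 30 bootstrap replicates
infReps <- matrix(rpois(4 * 30, lambda = 100), nrow = 4)

# per-sample mean and variance across the replicates
infMean <- rowMeans(infReps)
infVar <- apply(infReps, 1, var)

data.frame(infMean = infMean, infVar = infVar)
```

In the vignette code that follows, the corresponding per-sample means are read with assay(y, "mean").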
Now we can examine the allelic counts across the time variable:
par(mfrow=c(2,1), mar=c(2.5,4,2,2))
plotInfReps(y, idx=1, x="time", cov="allele", shiftX=.01, xaxis=FALSE, xlab="", main="")
par(mar=c(4.5,4,0,2))
plotInfReps(y, idx=2, x="time", cov="allele", shiftX=.01, main="")
With a little more code, we can add a lowess line for each series:
plotInfReps(y, idx=1, x="time", cov="allele", shiftX=.01)
dat <- data.frame(
  time = y$time[1:10],
a2 = assay(y, "mean")[1,y$allele=="a2"],
a1 = assay(y, "mean")[1,y$allele=="a1"])
lines(lowess(dat[,c(1,2)]), col="dodgerblue")
lines(lowess(dat[,c(1,3)]), col="goldenrod4")
Visualizing the allelic proportion in a heatmap helps to see relationships with the time variable, while also showing data from multiple features at once:
idx <- c(1:4)
row_dat <- data.frame(minusLogQ=-log10(mcols(y)$qvalue[idx]),
                      row.names=rownames(y)[idx])
col_dat <- data.frame(time=y$time[1:10],
                      row.names=paste0("s",1:10))
plotAllelicHeatmap(y, idx=idx,
annotation_row=row_dat,
annotation_col=col_dat)
Plotting results in genomic context
Previously, in the global AI section, we demonstrated how to plot TSS-level results in a genomic context using plotAllelicGene(). Here we demonstrate how to repeat such a plot for differential or dynamic AI analysis. There is an extra step for dynamic analysis (binning the continuous covariate into groups) but otherwise the code would be similar.
We begin by binning the time covariate into a few groups, so that we can diagram the allelic and isoform proportions in the genomic context, but facetting across time.
We create the binned covariate using cut, and rename the bins to give nicer labels in our plot. For differential AI, this step would be skipped (as there already exists a two-group covariate for grouping samples).
y$time_bins <- cut(y$time,breaks=c(0,.25,.75,1),
                   include.lowest=TRUE, labels=FALSE)
y$time_bins <- paste0("time-",y$time_bins)
table(y$time_bins[ y$allele == "a2" ])
##
## time-1 time-2 time-3
## 3 4 3
We can then make our facetted allelic proportion plot:
gene <- rowRanges(y)$gene_id[1]
plotAllelicGene(y, gene, edb, cov="time_bins",
qvalue=FALSE, log2FC=FALSE)
If we also want to visualize how isoform proportions may be changing, we can add covFacetIsoform=TRUE, which additionally facets the isoform proportion plot by the covariate:
plotAllelicGene(y, gene, edb, cov="time_bins",
covFacetIsoform=TRUE,
qvalue=FALSE, log2FC=FALSE)
For further questions about the SEESAW steps, please post to one of these locations:
fishpond
or swish
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.28.0 MatrixGenerics_1.10.0
## [3] matrixStats_0.63.0 fishpond_2.4.1
## [5] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.22.0
## [7] AnnotationFilter_1.22.0 GenomicFeatures_1.50.4
## [9] AnnotationDbi_1.60.0 Biobase_2.58.0
## [11] GenomicRanges_1.50.2 GenomeInfoDb_1.34.7
## [13] IRanges_2.32.0 S4Vectors_0.36.1
## [15] BiocGenerics_0.44.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.1-0 rjson_0.2.21
## [3] deldir_1.0-6 ellipsis_0.3.2
## [5] htmlTable_2.4.1 biovizBase_1.46.0
## [7] qvalue_2.30.0 XVector_0.38.0
## [9] base64enc_0.1-3 dichromat_2.0-0.1
## [11] rstudioapi_0.14 farver_2.1.1
## [13] bit64_4.0.5 fansi_1.0.4
## [15] xml2_1.3.3 codetools_0.2-18
## [17] splines_4.2.2 cachem_1.0.6
## [19] knitr_1.41 Formula_1.2-4
## [21] jsonlite_1.8.4 Rsamtools_2.14.0
## [23] cluster_2.1.4 dbplyr_2.3.0
## [25] png_0.1-8 pheatmap_1.0.12
## [27] compiler_4.2.2 httr_1.4.4
## [29] backports_1.4.1 assertthat_0.2.1
## [31] Matrix_1.5-3 fastmap_1.1.0
## [33] lazyeval_0.2.2 cli_3.6.0
## [35] htmltools_0.5.4 prettyunits_1.1.1
## [37] tools_4.2.2 gtable_0.3.1
## [39] glue_1.6.2 GenomeInfoDbData_1.2.9
## [41] reshape2_1.4.4 dplyr_1.0.10
## [43] rappdirs_0.3.3 Rcpp_1.0.10
## [45] jquerylib_0.1.4 vctrs_0.5.2
## [47] Biostrings_2.66.0 rtracklayer_1.58.0
## [49] xfun_0.36 stringr_1.5.0
## [51] lifecycle_1.0.3 restfulr_0.0.15
## [53] gtools_3.9.4 XML_3.99-0.13
## [55] zlibbioc_1.44.0 scales_1.2.1
## [57] BSgenome_1.66.2 VariantAnnotation_1.44.0
## [59] hms_1.1.2 ProtGenerics_1.30.0
## [61] parallel_4.2.2 RColorBrewer_1.1-3
## [63] SingleCellExperiment_1.20.0 yaml_2.3.7
## [65] curl_5.0.0 gridExtra_2.3
## [67] memoise_2.0.1 ggplot2_3.4.0
## [69] sass_0.4.5 rpart_4.1.19
## [71] biomaRt_2.54.0 latticeExtra_0.6-30
## [73] stringi_1.7.12 RSQLite_2.2.20
## [75] highr_0.10 BiocIO_1.8.0
## [77] checkmate_2.1.0 filelock_1.0.2
## [79] BiocParallel_1.32.5 rlang_1.0.6
## [81] pkgconfig_2.0.3 bitops_1.0-7
## [83] evaluate_0.20 lattice_0.20-45
## [85] htmlwidgets_1.6.1 GenomicAlignments_1.34.0
## [87] bit_4.0.5 tidyselect_1.2.0
## [89] plyr_1.8.8 magrittr_2.0.3
## [91] R6_2.5.1 generics_0.1.3
## [93] Hmisc_4.7-2 DelayedArray_0.24.0
## [95] DBI_1.1.3 pillar_1.8.1
## [97] foreign_0.8-84 svMisc_1.2.3
## [99] survival_3.5-0 KEGGREST_1.38.0
## [101] abind_1.4-5 RCurl_1.98-1.9
## [103] nnet_7.3-18 tibble_3.1.8
## [105] crayon_1.5.2 interp_1.1-3
## [107] utf8_1.2.2 BiocFileCache_2.6.0
## [109] rmarkdown_2.20 jpeg_0.1-10
## [111] progress_1.2.2 grid_4.2.2
## [113] data.table_1.14.6 blob_1.2.3
## [115] digest_0.6.31 munsell_0.5.0
## [117] Gviz_1.42.0 bslib_0.4.2