require(IgGeneUsage)
require(knitr)
require(ggplot2)
require(ggforce)
require(gridExtra)
require(ggrepel)
require(rstan)
require(reshape2)
rstan_options(auto_write = TRUE)
Decoding the properties of immune repertoires is key in understanding the response of adaptive immunity to challenges such as viral infection. One important property is biases in Ig gene usage between biological conditions (e.g. healthy vs turmor). Yet, most analyses for differential gene usage are performed qualitatively, or with inadequate statistical methods. Here we introduce IgGeneUsage, a computational tool for the analysis of differential gene usage.
The input to IgGeneUsage is a data.frame object with usage frequencies for each gene of a repertoire that belongs to a particular biological condition. The usage data.frame has the following 4 columns:
The sum of all gene usage counts (column 4) for a given repertoire should be equal to the total gene usage in that repertoire.
IgGeneUsage transforms the provided input in the following way. Given \(R\) repertoires, each having \(G\) genes, IgGeneUsage generates a gene usage matrix \(Y^{R \times G}\). Row sums in \(Y\) define the total usage in each repertoire (\(N\)). The design variable \(X\) is set to \(X = 1\) for repertoires that belong to the first condition, and \(X = -1\) otherwise.
For the analysis of DGU between two biological conditions, we designed the following Bayesian model (\(M\)) for zero-inflated beta-binomial regression. This model can fit over-dispersed gene usage data. The immune repertoire data is also not exhaustive, which leads to misdetection of genes that are systematically rearranged at low probability. The zero-inflated component of our model accounts for this:
\[\begin{align} p(Y_{ij} \mid M) = \begin{cases} \kappa + (1 - \kappa) \operatorname{BB}\left(0 \mid N_{i}, \theta_{ij}, \phi \right), & \text{if $Y_{ij}$ = 0} \\ (1 - \kappa) \operatorname{BB}\left(Y_{ij} \mid N_{i}, \theta_{ij}, \phi \right), & \text{if $Y_{ij}$ > 0} \end{cases}\\ \theta_{ij} = \operatorname{logit^{-1}}\left(\alpha_{j}+\beta_{ij}X_{i}\right)\\ \beta_{ij}\sim\operatorname{Normal}\left(\gamma_{j},\gamma_{\sigma} \right)\\ \gamma_{j}\sim\operatorname{Normal}\left(\hat{\gamma},\hat{\gamma}_{\sigma} \right) \\ \alpha_{j}\sim\operatorname{Normal}\left(\hat{\alpha},\hat{\alpha}_{\sigma} \right) \\ \hat{\gamma} \sim \operatorname{Normal}\left(0, 5\right) \\ \hat{\alpha} \sim \operatorname{Normal}\left(0, 10\right) \\ \gamma_{\sigma}, \hat{\gamma}_{\sigma}, \hat{\alpha}_{\sigma} \sim \operatorname{Cauchy^{+}}\left(0, 1\right) \\ \phi \sim \operatorname{Exponential}\left(\tau\right) \\ \tau \sim \operatorname{Gamma}\left(3, 0.1\right) \\ \kappa \sim \operatorname{Beta}\left(1, 3\right) \end{align}\]
Model \(M\) legend:
In the output of IgGeneUsage, we report the mean effect size (\(\gamma\)) and its 95% highest density interval (HDI). Genes with \(\gamma \neq 0\) (e.g. if 95% HDI of \(\gamma\) excludes 0) are most likely to experience differential usage. Additionally, we report the probability of differential gene usage (\(\pi\)): \[\begin{align} \pi = 2 \cdot \max\left(\int_{\gamma = -\infty}^{0} p(\gamma)\mathrm{d}\gamma, \int_{\gamma = 0}^{\infty} p(\gamma)\mathrm{d}\gamma\right) - 1 \end{align}\] with \(\pi = 1\) for genes with strong differential usage, and \(\pi = 0\) for genes with negligible differential gene usage. Both metrics are computed based on the posterior distribution of \(\gamma\), and are thus related. We find \(\pi\) slightly easier to interpret.
IgGeneUsage provides built-in datasets from studies that evaluate biases in Ig gene usage. The dataset IGHV_HCV contains publicly available data of human immunoglobulin heavy chain VDJ rearrangements from a study that evaluates the effect of HCV infection on the human BCR repertoire1 Tucci, Felicia A., et al. “Biased IGH VDJ gene repertoire and clonal expansions in B cells of chronically hepatitis C virus–infected individuals.” Blood 131.5 (2018): 546-557.]. The dataset consists of a population of class-switched memory (CSM) B cells of 22 HCV-infected patients (HCV+) and 7 healthy donors (HD). Gene usage data is available for 69 IGHV gene segments.
The data is already formatted such that it can directly be passed as input to IgGeneUsage. The following steps allow you to load the data, and to inspect its column names and table entries.
data("IGHV_HCV", package = "IgGeneUsage")
kable(x = head(IGHV_HCV), row.names = FALSE)
sample_id | condition | gene_name | gene_usage_count |
---|---|---|---|
HD1 | healthy | IGHV1-18 | 69 |
HD1 | healthy | IGHV4-30-2 | 16 |
HD1 | healthy | IGHV7-4-1 | 0 |
HD1 | healthy | IGHV3-19 | 1 |
HD1 | healthy | IGHV3-33 | 110 |
HD1 | healthy | IGHV3-69-1 | 4 |
Lets visualize the gene usage with ggplot2.
# we can compute the total number of rearrangements per sample
total.usage <- aggregate(gene_usage_count~sample_id+condition,
FUN = sum, data = IGHV_HCV)
total.usage$total <- total.usage$gene_usage_count
total.usage$gene_usage_count <- NULL
# merge it with the original data
viz <- merge(x = IGHV_HCV, y = total.usage,
by = c("sample_id", "condition"),
all.x = TRUE)
# compute %
viz$gene_usage_pct <- viz$gene_usage_count/viz$total*100
# For this example lets only consider the top 45 genes
# Hint: In real analyses you should not filter any gene
top <- aggregate(gene_usage_pct~gene_name, data = viz, FUN = mean)
top <- top[order(top$gene_usage_pct, decreasing = TRUE), ]
IGHV_HCV <- IGHV_HCV[IGHV_HCV$gene_name %in% top$gene_name[seq_len(45)], ]
viz <- viz[viz$gene_name %in% top$gene_name[seq_len(45)], ]
# visualize
ggplot(data = viz)+
geom_point(aes(x = gene_name, y = gene_usage_pct,
fill = condition, shape = condition),
position = position_dodge(width = .7), stroke = 0)+
theme_bw(base_size = 11)+
ylab(label = "Usage [%]")+
xlab(label = '')+
theme(legend.position = "top")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))+
scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
scale_shape_manual(name = "Condition", values = c(21, 22))
This is the main method of IgGeneUsage. The primary input is the dataset IGHV_HCV. Other inputs allow you to configure specific settings of the Markov Chain Monte Carlo (MCMC) simulation.
In this example we analyze IGHV_HCV with 2 MCMC chains (750 iterations each, including 250 warm-ups) in parallel using 2 CPU cores. We compute for each parameter of our model its 95% highest density interval (HDIs).
Important remark: you should run your analysis using the default argument values of the function DGU. These values have been carefully chosen to fit most analyses for differential gene usage of immune repertoires. If any messages or warnings are reported concerning the MCMC sampling, please consult the Stan manual2 https://mc-stan.org/misc/warnings.html and adjust the MCMC arguments accordingly. If the warnings persist, please submit an issue with a reproducible script at the Bioconductor support site or on Github3 https://github.com/snaketron/IgGeneUsage/issues.
M <- DGU(usage.data = IGHV_HCV, # input data
mcmc.warmup = 750, # how many MCMC warm-ups per chain (default: 500)
mcmc.steps = 2000, # how many MCMC steps per chain (default: 1,500)
mcmc.chains = 2, # how many MCMC chain to run (default: 4)
mcmc.cores = 1, # how many PC cores to use? (e.g. parallel chains)
hdi.level = 0.95, # highest density interval level (de fault: 0.95)
adapt.delta = 0.95, # MCMC target acceptance rate (default: 0.95)
max.treedepth = 10) # tree depth evaluated at each step (default: 12)
FALSE
FALSE SAMPLING FOR MODEL 'zibb' NOW (CHAIN 1).
FALSE Chain 1:
FALSE Chain 1: Gradient evaluation took 0.00166 seconds
FALSE Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 16.6 seconds.
FALSE Chain 1: Adjust your expectations accordingly!
FALSE Chain 1:
FALSE Chain 1:
FALSE Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
FALSE Chain 1: Iteration: 250 / 2000 [ 12%] (Warmup)
FALSE Chain 1: Iteration: 500 / 2000 [ 25%] (Warmup)
FALSE Chain 1: Iteration: 750 / 2000 [ 37%] (Warmup)
FALSE Chain 1: Iteration: 751 / 2000 [ 37%] (Sampling)
FALSE Chain 1: Iteration: 1000 / 2000 [ 50%] (Sampling)
FALSE Chain 1: Iteration: 1250 / 2000 [ 62%] (Sampling)
FALSE Chain 1: Iteration: 1500 / 2000 [ 75%] (Sampling)
FALSE Chain 1: Iteration: 1750 / 2000 [ 87%] (Sampling)
FALSE Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
FALSE Chain 1:
FALSE Chain 1: Elapsed Time: 169.082 seconds (Warm-up)
FALSE Chain 1: 187.928 seconds (Sampling)
FALSE Chain 1: 357.01 seconds (Total)
FALSE Chain 1:
FALSE
FALSE SAMPLING FOR MODEL 'zibb' NOW (CHAIN 2).
FALSE Chain 2:
FALSE Chain 2: Gradient evaluation took 0.001126 seconds
FALSE Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 11.26 seconds.
FALSE Chain 2: Adjust your expectations accordingly!
FALSE Chain 2:
FALSE Chain 2:
FALSE Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
FALSE Chain 2: Iteration: 250 / 2000 [ 12%] (Warmup)
FALSE Chain 2: Iteration: 500 / 2000 [ 25%] (Warmup)
FALSE Chain 2: Iteration: 750 / 2000 [ 37%] (Warmup)
FALSE Chain 2: Iteration: 751 / 2000 [ 37%] (Sampling)
FALSE Chain 2: Iteration: 1000 / 2000 [ 50%] (Sampling)
FALSE Chain 2: Iteration: 1250 / 2000 [ 62%] (Sampling)
FALSE Chain 2: Iteration: 1500 / 2000 [ 75%] (Sampling)
FALSE Chain 2: Iteration: 1750 / 2000 [ 87%] (Sampling)
FALSE Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
FALSE Chain 2:
FALSE Chain 2: Elapsed Time: 170.236 seconds (Warm-up)
FALSE Chain 2: 187.853 seconds (Sampling)
FALSE Chain 2: 358.088 seconds (Total)
FALSE Chain 2:
The following objects are provided as part of the output of DGU:
summary(M)
FALSE Length Class Mode
FALSE glm.summary 9 data.frame list
FALSE test.summary 9 data.frame list
FALSE glm 1 stanfit S4
FALSE ppc.data 2 -none- list
FALSE usage.data 8 -none- list
Do not blindly trust your model \(\rightarrow\) check it extensively. You can use the object glm for these checks.
rstan::check_hmc_diagnostics(M$glm)
FALSE
FALSE Divergences:
FALSE
FALSE Tree depth:
FALSE
FALSE Energy:
gridExtra::grid.arrange(rstan::stan_rhat(object = M$glm),
rstan::stan_ess(object = M$glm),
nrow = 1)
IgGeneUsage has built-in checks for repertoire-specific posterior prediction of gene usage. It uses the fitted model to predict the usage of each gene in a given repertoire (y-axis). Error bars show 95% HDI of mean posterior prediction. The predictions can be compared with the observed data (x-axis). For points near the diagnoal \(\rightarrow\) accurate prediction.
The following figure shows that our model can reproduce the fitted data \(\rightarrow\) this is an indicator of a valid model.
ggplot(data = M$ppc.data$ppc.repertoire)+
facet_wrap(facets = ~sample_name, ncol = 5)+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
geom_errorbar(aes(x = observed_count, y = ppc_mean_count,
ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
geom_point(aes(x = observed_count, y = ppc_mean_count,
fill = condition), shape = 21, size = 1)+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
scale_x_log10(breaks = c(1, 10, 100, 1000),
labels = expression(10^0, 10^1, 10^2, 10^3))+
scale_y_log10(breaks = c(1, 10, 100, 1000),
labels = expression(10^0, 10^1, 10^2, 10^3))+
xlab(label = "Observed usage [counts]")+
ylab(label = "Predicted usage [counts]")+
annotation_logticks(base = 10, sides = "lb")
IgGeneUsage has built-in checks for gene-specific posterior prediction of gene usage within each biological condition. We show the predictions (y-axis) of the model, and compare them against the observed mean usage (x-axis). If the points are near the diagnoal \(\rightarrow\) accurate prediction. Errors show 95% HDI of mean posterior prediction.
ggplot(data = M$ppc.data$ppc.gene)+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
geom_errorbar(aes(x = observed_prop*100, ymin = ppc_L_prop*100,
ymax = ppc_H_prop*100), col = "darkgray")+
geom_point(aes(x = observed_prop*100, y = ppc_mean_prop*100,
fill = condition), shape = 21, size = 1)+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
xlab(label = "Observed usage [%]")+
ylab(label = "Predicted usage [%]")+
scale_x_log10()+
scale_y_log10()+
annotation_logticks(base = 10, sides = "lb")
Next we present the main results of IgGeneUsage contained in the data.frame glm.summary. Each row of glm.summary summarizes the degree of differential gene usage in a specific immune gene using two metrics (es and pmax).
Legend:
kable(x = head(M$glm.summary), row.names = FALSE, digits = 3)
es_mean | es_mean_se | es_sd | es_median | es_L | es_H | contrast | pmax | gene_name |
---|---|---|---|---|---|---|---|---|
-0.050 | 0.001 | 0.066 | -0.049 | -0.179 | 0.079 | healthy - hcv | 0.565 | IGHV1-18 |
0.086 | 0.001 | 0.072 | 0.083 | -0.048 | 0.232 | healthy - hcv | 0.783 | IGHV1-2 |
-0.021 | 0.001 | 0.090 | -0.020 | -0.199 | 0.157 | healthy - hcv | 0.182 | IGHV1-24 |
-0.102 | 0.002 | 0.092 | -0.098 | -0.290 | 0.067 | healthy - hcv | 0.747 | IGHV1-3 |
-0.034 | 0.001 | 0.077 | -0.033 | -0.198 | 0.115 | healthy - hcv | 0.331 | IGHV1-46 |
-0.070 | 0.002 | 0.102 | -0.065 | -0.282 | 0.126 | healthy - hcv | 0.513 | IGHV1-58 |
The effect size and \(\pi\) are related. Lets visualize them for each gene (shown as a point). Names are shown of genes with \(\pi \geq 0.9\). Dashed horizontal line represents null-effect (\(\gamma = 0\)). Notice that the gene with \(\pi \approx 1\) also has an effect size whose 95% HDI (error bar) does not overlap the null-effect. The genes with high degree of differential usage are easy to detect with this figure.
# format data
stats <- M$glm.summary
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = stats$gene_name)
stats <- merge(x = stats, y = M$test.summary, by = "gene_name")
ggplot(data = stats)+
geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H),
col = "darkgray")+
geom_point(aes(x = pmax, y = es_mean))+
geom_text_repel(data = stats[stats$pmax >= 0.9, ],
aes(x = pmax, y = es_mean, label = gene_fac))+
theme_bw(base_size = 11)+
xlab(label = expression(pi))
Lets visualize the observed data of the genes with high probability of differential gene usage (\(\pi \geq 0.9\)). Here we show the gene usage in %.
promising.genes <- stats$gene_name[stats$pmax >= 0.9]
ppc.gene <- M$ppc.data$ppc.gene
ppc.gene <- ppc.gene[ppc.gene$gene_name %in% promising.genes, ]
ggplot()+
geom_errorbar(data = ppc.gene,
aes(x = gene_name, ymin = ppc_L_prop*100,
ymax = ppc_H_prop*100, col = condition),
position = position_dodge(width = .8), width = 0.75)+
geom_point(data = viz[viz$gene_name %in% promising.genes, ],
aes(x = gene_name, y = gene_usage_pct, col = condition),
shape = 21, size = 1.5, fill = "black",
position = position_jitterdodge(jitter.width = 0.15,
jitter.height = 0,
dodge.width = 0.8))+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
ylab(label = "Usage [%]")+
xlab(label = '')
Lets also visualize the gene usage frequencies. Point size represents total usage in repertoire.
promising.genes <- stats$gene_name[stats$pmax >= 0.9]
ggplot(data = viz[viz$gene_name %in% promising.genes, ])+
facet_wrap(facets = ~gene_name, ncol = 1, scales = "free_y")+
geom_point(aes(x = sample_id, y = gene_usage_count, fill = condition,
size = total/10^3), shape = 21)+
theme_bw(base_size = 11)+
ylab(label = "Usage [count]")+
xlab(label = '')+
theme(legend.position = "top")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))+
scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
scale_size_continuous(name = expression("N ("*10^3*")"), range = c(1, 5))+
ylab(label = "Usage [count]")+
xlab(label = '')+
scale_y_log10()+
annotation_logticks(base = 10, sides = "l")
Despite the fact that the data is not normaly distributed, we performed the analysis of differential gene usage with the T-test. Prior to using the T-test, we must convert the usage frequencies into proportions. This is automatically done by IgGeneUsage (object test.summary).
Next, we compare the probabilities of differential gene usage (\(\pi\)) with the FDR corrected P-values (-log10 scale) from the T-test. Dashed lines show significance levels of 0.05 and 0.01. We observe few disagreements between the two tests. Lets inspect them in more detail.
ggplot()+
geom_hline(yintercept = c(-log10(0.05), -log10(0.01)),
linetype = "dashed", col = "darkgray")+
geom_point(data = stats, col = "red", size = 2,
aes(x = pmax, y = -log10(t.test.fdr.pvalue)))+
geom_text_repel(data = stats[stats$pmax >= 0.5, ],
aes(x = pmax, y = -log10(t.test.fdr.pvalue),
label = gene_name), size = 4,
min.segment.length = 0.1)+
xlim(0, 1)+
ylab(label = "-log10 (P-value) from t-test [FDR corrected]")+
xlab(label = expression(pi))+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
scale_color_discrete(name = '')
The two genes IGHV1-58 and IGHV3-72 are prioritized by the T-test. Their \(\pi\) estimates are however far from 1. Violation of T-test’s assumptions is the most probable cause for the disagreement. Discarding information about the sample size, and thus about uncertainty could be an alternative explanation. Lets visualize the data for IGHV1-58 and IGHV3-72.
promising.genes <- c("IGHV1-58", "IGHV3-72")
ppc.gene <- M$ppc.data$ppc.gene
ppc.gene <- ppc.gene[ppc.gene$gene_name %in% promising.genes, ]
ggplot()+
geom_errorbar(data = ppc.gene,
aes(x = gene_name, ymin = ppc_L_prop*100,
ymax = ppc_H_prop*100, col = condition),
position = position_dodge(width = .8), width = 0.75)+
geom_point(data = viz[viz$gene_name %in% promising.genes, ],
aes(x = gene_name, y = gene_usage_pct, col = condition),
shape = 21, size = 1.5, fill = "black",
position = position_jitterdodge(jitter.width = 0.15,
jitter.height = 0,
dodge.width = 0.8))+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
ylab(label = "Gene usage [%]")+
xlab(label = '')
promising.genes <- c("IGHV1-58", "IGHV3-72")
ggplot(data = viz[viz$gene_name %in% promising.genes, ])+
facet_wrap(facets = ~gene_name, ncol = 1, scales = "free_y")+
geom_point(aes(x = sample_id, y = gene_usage_count, fill = condition,
size = total/10^3), shape = 21)+
theme_bw(base_size = 11)+
ylab(label = "Usage [count]")+
xlab(label = '')+
theme(legend.position = "top")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))+
scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
scale_size_continuous(name = expression("N ("*10^3*")"), range = c(1, 5))+
ylab(label = "Usage [count]")+
xlab(label = '')+
scale_y_log10()+
annotation_logticks(base = 10, sides = "l")
The nonparametric U-test can also be used for the analysis of differential gene usage. It assumes data with equal shape in both groups (also not met by our data). Prior to using the U-test, we must again convert the usage frequencies into proportions. This is automatically done by IgGeneUsage.
Lets also compare \(\pi\) with the FDR corrected P-values (-log10 scale) from the U-test. Dashed lines show significance levels of 0.05 and 0.01. The U-test finds no evidence of differential gene usage.
ggplot()+
geom_hline(yintercept = c(-log10(0.05), -log10(0.01)),
linetype = "dashed", col = "darkgray")+
geom_point(data = stats, col = "red", size = 2,
aes(x = pmax, y = -log10(u.test.fdr.pvalue)))+
geom_text_repel(data = stats[stats$pmax >= 0.5, ],
aes(x = pmax, y = -log10(u.test.fdr.pvalue),
label = gene_name), size = 4,
min.segment.length = 0.1)+
xlim(0, 1)+
ylab(label = "-log10 (P-value) from U-test [FDR corrected]")+
xlab(label = expression(pi))+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
scale_color_discrete(name = '')
For our second case study we use a dataset from a study evaluating vaccine-induced changes in B-cell populations 5 Laserson U and Vigneault F, et al. High-resolution antibody dynamics of vaccine-induced immune responses. Proc Natl Acad Sci USA. 2014 111:4928-33.. The data is publicly provided via the R-package alakazam (version 0.2.11). The IGHV family usage is reported in four B-cell populations (repertoires IgM, IgD, IgG and IgA) across two timepoints (conditions = -1 hour vs. +7 days). In this case study we investigate the overal effect of the vaccine on the IGHV family usage.
The data is already formatted such that it can directly be passed as input to IgGeneUsage. Lets load the data, and to inspect its content.
data(Ig, package = "IgGeneUsage")
kable(x = head(Ig), row.names = FALSE)
gene_name | sample_id | condition | gene_usage_count |
---|---|---|---|
IGHV1 | -1h_IgA | -1h | 8 |
IGHV1 | -1h_IgD | -1h | 38 |
IGHV1 | -1h_IgG | -1h | 7 |
IGHV1 | -1h_IgM | -1h | 110 |
IGHV1 | +7d_IgA | +7d | 2 |
IGHV1 | +7d_IgD | +7d | 7 |
Lets look at the gene usage data with ggplot2.
# we can compute the total number of rearrangements per sample
total.usage <- aggregate(gene_usage_count~sample_id+condition,
FUN = sum, data = Ig)
total.usage$total <- total.usage$gene_usage_count
total.usage$gene_usage_count <- NULL
# merge it with the original data
viz <- merge(x = Ig, y = total.usage,
by = c("sample_id", "condition"),
all.x = TRUE)
# compute %
viz$gene_usage_pct <- viz$gene_usage_count/viz$total*100
# visualize
ggplot(data = viz)+
geom_point(aes(x = gene_name, y = gene_usage_pct, fill = condition,
shape = condition), stroke = 0, size = 3,
position = position_jitterdodge(jitter.width = 0.25,
dodge.width = 0.75))+
theme_bw(base_size = 11)+
ylab(label = "Usage [%]")+
xlab(label = '')+
theme(legend.position = "top")+
scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
scale_shape_manual(name = "Condition", values = c(21, 22))
Lets analyze the differential gene usage in the data Ig. We use the default values of most DGU arguments.
M <- DGU(usage.data = Ig,
mcmc.warmup = 500,
mcmc.steps = 1500,
mcmc.chains = 2,
mcmc.cores = 1,
hdi.level = 0.95,
adapt.delta = 0.95,
max.treedepth = 12)
FALSE
FALSE SAMPLING FOR MODEL 'zibb' NOW (CHAIN 1).
FALSE Chain 1:
FALSE Chain 1: Gradient evaluation took 9.5e-05 seconds
FALSE Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.95 seconds.
FALSE Chain 1: Adjust your expectations accordingly!
FALSE Chain 1:
FALSE Chain 1:
FALSE Chain 1: Iteration: 1 / 1500 [ 0%] (Warmup)
FALSE Chain 1: Iteration: 250 / 1500 [ 16%] (Warmup)
FALSE Chain 1: Iteration: 500 / 1500 [ 33%] (Warmup)
FALSE Chain 1: Iteration: 501 / 1500 [ 33%] (Sampling)
FALSE Chain 1: Iteration: 750 / 1500 [ 50%] (Sampling)
FALSE Chain 1: Iteration: 1000 / 1500 [ 66%] (Sampling)
FALSE Chain 1: Iteration: 1250 / 1500 [ 83%] (Sampling)
FALSE Chain 1: Iteration: 1500 / 1500 [100%] (Sampling)
FALSE Chain 1:
FALSE Chain 1: Elapsed Time: 2.2332 seconds (Warm-up)
FALSE Chain 1: 3.86954 seconds (Sampling)
FALSE Chain 1: 6.10273 seconds (Total)
FALSE Chain 1:
FALSE
FALSE SAMPLING FOR MODEL 'zibb' NOW (CHAIN 2).
FALSE Chain 2:
FALSE Chain 2: Gradient evaluation took 6.1e-05 seconds
FALSE Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.61 seconds.
FALSE Chain 2: Adjust your expectations accordingly!
FALSE Chain 2:
FALSE Chain 2:
FALSE Chain 2: Iteration: 1 / 1500 [ 0%] (Warmup)
FALSE Chain 2: Iteration: 250 / 1500 [ 16%] (Warmup)
FALSE Chain 2: Iteration: 500 / 1500 [ 33%] (Warmup)
FALSE Chain 2: Iteration: 501 / 1500 [ 33%] (Sampling)
FALSE Chain 2: Iteration: 750 / 1500 [ 50%] (Sampling)
FALSE Chain 2: Iteration: 1000 / 1500 [ 66%] (Sampling)
FALSE Chain 2: Iteration: 1250 / 1500 [ 83%] (Sampling)
FALSE Chain 2: Iteration: 1500 / 1500 [100%] (Sampling)
FALSE Chain 2:
FALSE Chain 2: Elapsed Time: 2.25325 seconds (Warm-up)
FALSE Chain 2: 3.86539 seconds (Sampling)
FALSE Chain 2: 6.11865 seconds (Total)
FALSE Chain 2:
rstan::check_hmc_diagnostics(M$glm)
FALSE
FALSE Divergences:
FALSE
FALSE Tree depth:
FALSE
FALSE Energy:
gridExtra::grid.arrange(rstan::stan_rhat(object = M$glm),
rstan::stan_ess(object = M$glm),
nrow = 1)
ggplot(data = M$ppc$ppc.repertoire)+
facet_wrap(facets = ~sample_name, ncol = 4)+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
geom_errorbar(aes(x = observed_count, y = ppc_mean_count,
ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
geom_point(aes(x = observed_count, y = ppc_mean_count,
fill = condition), shape = 21, size = 1)+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
scale_x_log10(breaks = c(1, 10, 100, 1000),
labels = expression(10^0, 10^1, 10^2, 10^3))+
scale_y_log10(breaks = c(1, 10, 100, 1000),
labels = expression(10^0, 10^1, 10^2, 10^3))+
xlab(label = "Observed usage [counts]")+
ylab(label = "Predicted usage [counts]")+
annotation_logticks(base = 10, sides = "bl")
ggplot(data = M$ppc.data$ppc.gene)+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
geom_errorbar(aes(x = observed_prop*100, ymin = ppc_L_prop*100,
ymax = ppc_H_prop*100), col = "darkgray")+
geom_point(aes(x = observed_prop*100, y = ppc_mean_prop*100,
fill = condition), shape = 21, size = 1)+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
xlab(label = "Observed usage [%]")+
ylab(label = "Predicted usage [%]")+
scale_x_log10()+
scale_y_log10()+
annotation_logticks(base = 10, sides = "bl")
Main results:
kable(x = M$glm.summary, row.names = FALSE, digits = 3)
es_mean | es_mean_se | es_sd | es_median | es_L | es_H | contrast | pmax | gene_name |
---|---|---|---|---|---|---|---|---|
0.225 | 0.006 | 0.245 | 0.197 | -0.198 | 0.767 | -1h - +7d | 0.658 | IGHV1 |
-0.023 | 0.007 | 0.297 | 0.004 | -0.699 | 0.510 | -1h - +7d | 0.013 | IGHV2 |
-0.124 | 0.005 | 0.215 | -0.111 | -0.579 | 0.261 | -1h - +7d | 0.411 | IGHV3 |
0.101 | 0.004 | 0.204 | 0.093 | -0.296 | 0.530 | -1h - +7d | 0.389 | IGHV4 |
0.213 | 0.005 | 0.245 | 0.189 | -0.220 | 0.761 | -1h - +7d | 0.649 | IGHV5 |
0.046 | 0.005 | 0.267 | 0.050 | -0.523 | 0.595 | -1h - +7d | 0.187 | IGHV6 |
The effect size and \(\pi\) are related. Lets visualize them for each gene (shown as a point). There are no genes with \(\pi \approx 1\) \(\rightarrow\) no sufficiently strong evidence of differential gene usage based on IgGeneUsage.
# format data
stats <- M$glm.summary
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = stats$gene_name)
stats <- merge(x= stats, y = M$test.summary, by = "gene_name")
ggplot(data = stats)+
geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H),
col = "darkgray")+
geom_point(aes(x = pmax, y = es_mean))+
geom_text_repel(data = stats, aes(x = pmax, y = es_mean, label = gene_fac))+
theme_bw(base_size = 11)+
xlab(label = expression(pi))+
xlim(0, 1)
ggplot()+
geom_hline(yintercept = c(-log10(0.05), -log10(0.01)),
linetype = "dashed", col = "darkgray")+
geom_point(data = stats, col = "red", size = 2,
aes(x = pmax, y = -log10(t.test.fdr.pvalue)))+
geom_text_repel(data = stats,
aes(x = pmax, y = -log10(t.test.fdr.pvalue),
label = gene_name), size = 4,
min.segment.length = 0.1)+
xlim(0, 1)+
ylab(label = "-log10 (P-value) from t-test [FDR corrected]")+
xlab(label = expression(pi))+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
scale_color_discrete(name = '')
promising.genes <- unique(M$usage.data$gene_names)
ppc.gene <- M$ppc.data$ppc.gene
ppc.gene <- ppc.gene[ppc.gene$gene_name %in% promising.genes, ]
ggplot()+
geom_errorbar(data = ppc.gene,
aes(x = gene_name, ymin = ppc_L_prop*100,
ymax = ppc_H_prop*100, col = condition),
position = position_dodge(width = .8), width = 0.75)+
geom_point(data = viz[viz$gene_name %in% promising.genes, ],
aes(x = gene_name, y = gene_usage_pct, col = condition),
shape = 21, size = 1.5, fill = "black",
position = position_jitterdodge(jitter.width = 0.15,
jitter.height = 0,
dodge.width = 0.8))+
theme_bw(base_size = 11)+
theme(legend.position = "top")
promising.genes <- "IGHV5"
ggplot(data = viz[viz$gene_name %in% promising.genes, ])+
facet_wrap(facets = ~gene_name, ncol = 1, scales = "free_y")+
geom_point(aes(x = sample_id, y = gene_usage_count, fill = condition,
size = total/10^3), shape = 21)+
theme_bw(base_size = 11)+
ylab(label = "Usage [count]")+
xlab(label = '')+
theme(legend.position = "top")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))+
scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
scale_size_continuous(name = expression("N ("*10^3*")"), range = c(1, 5))+
ylab(label = "Usage [count]")+
xlab(label = '')+
scale_y_log10()+
annotation_logticks(base = 10, sides = "l")
No evidence of differential usage (at FDR level 0.05)
ggplot()+
geom_hline(yintercept = c(-log10(0.05), -log10(0.01)),
linetype = "dashed", col = "darkgray")+
geom_point(data = stats, col = "red", size = 2,
aes(x = pmax, y = -log10(u.test.fdr.pvalue)))+
geom_text_repel(data = stats,
aes(x = pmax, y = -log10(u.test.fdr.pvalue),
label = gene_name), size = 4,
min.segment.length = 0.1)+
xlim(0, 1)+
ylab(label = "-log10 (P-value) from U-test [FDR corrected]")+
xlab(label = expression(pi))+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
scale_color_discrete(name = '')
The main purpose of using IgGeneUsage is to discover differential gene usage. However, the tool can be be used for all differential count analyses. In the third case study we explain how to detect differential usage of net CDR3 sequence charge between repertoires belonging to two biological conditions. For this purpose we use CDR3 sequence data of human T-cells (TRB-chain) downloaded from VDJdb 6 https://vdjdb.cdr3.net/.
Each CDR3 sequence is annotated to a specific viral epitope. Here we focus on the two viruses InfluenzaA and CMV which represent the two biological conditions. We consider the different data sources (publications) as different repertoires. To compute the net sequence charge, we consider the amino acids K, R and H as +1 charged, while D and E as -1 charged. Thus, we computed the net charge of a CDR3 sequence by adding up the individual residue charges.
The data is already formatted such that it can directly be passed as input to IgGeneUsage. Lets load the data, and to inspect its content.
data(CDR3_Epitopes, package = "IgGeneUsage")
kable(x = head(CDR3_Epitopes), row.names = FALSE)
sample_id | condition | gene_name | gene_usage_count |
---|---|---|---|
https://github.com/antigenomics/vdjdb-db/issues/267_InfluenzaA | InfluenzaA | -5 | 0 |
https://github.com/antigenomics/vdjdb-db/issues/267_InfluenzaA | InfluenzaA | -4 | 0 |
https://github.com/antigenomics/vdjdb-db/issues/267_InfluenzaA | InfluenzaA | -3 | 0 |
https://github.com/antigenomics/vdjdb-db/issues/267_InfluenzaA | InfluenzaA | -2 | 6 |
https://github.com/antigenomics/vdjdb-db/issues/267_InfluenzaA | InfluenzaA | -1 | 46 |
https://github.com/antigenomics/vdjdb-db/issues/267_InfluenzaA | InfluenzaA | 0 | 164 |
Lets look at the input data with ggplot2.
# we can compute the total number of rearrangements per sample
total.usage <- aggregate(gene_usage_count~sample_id+condition,
FUN = sum, data = CDR3_Epitopes)
total.usage$total <- total.usage$gene_usage_count
total.usage$gene_usage_count <- NULL
# merge it with the original data
viz <- merge(x = CDR3_Epitopes, y = total.usage,
by = c("sample_id", "condition"),
all.x = TRUE)
# compute %
viz$gene_usage_pct <- viz$gene_usage_count/viz$total*100
# visualize
ggplot(data = viz)+
geom_point(aes(x = as.numeric(gene_name),
y = gene_usage_pct, fill = condition,
shape = condition), stroke = 0, size = 2, alpha = 0.5,
position = position_jitterdodge(jitter.width = 0.15,
dodge.width = 0.5))+
theme_bw(base_size = 11)+
ylab(label = "Usage [%]")+
xlab(label = 'Net Charge')+
theme(legend.position = "top")+
scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
scale_shape_manual(name = "Condition", values = c(21, 22))+
scale_x_continuous(breaks = -7:7, labels = -7:7)
Here we study differential gene usage in the data CDR3_Epitopes. We use the default values of most DGU arguments.
M <- DGU(usage.data = CDR3_Epitopes,
mcmc.chains = 2,
mcmc.cores = 1)
FALSE
FALSE SAMPLING FOR MODEL 'zibb' NOW (CHAIN 1).
FALSE Chain 1:
FALSE Chain 1: Gradient evaluation took 0.000171 seconds
FALSE Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.71 seconds.
FALSE Chain 1: Adjust your expectations accordingly!
FALSE Chain 1:
FALSE Chain 1:
FALSE Chain 1: Iteration: 1 / 1500 [ 0%] (Warmup)
FALSE Chain 1: Iteration: 250 / 1500 [ 16%] (Warmup)
FALSE Chain 1: Iteration: 500 / 1500 [ 33%] (Warmup)
FALSE Chain 1: Iteration: 501 / 1500 [ 33%] (Sampling)
FALSE Chain 1: Iteration: 750 / 1500 [ 50%] (Sampling)
FALSE Chain 1: Iteration: 1000 / 1500 [ 66%] (Sampling)
FALSE Chain 1: Iteration: 1250 / 1500 [ 83%] (Sampling)
FALSE Chain 1: Iteration: 1500 / 1500 [100%] (Sampling)
FALSE Chain 1:
FALSE Chain 1: Elapsed Time: 26.7638 seconds (Warm-up)
FALSE Chain 1: 38.7611 seconds (Sampling)
FALSE Chain 1: 65.525 seconds (Total)
FALSE Chain 1:
FALSE
FALSE SAMPLING FOR MODEL 'zibb' NOW (CHAIN 2).
FALSE Chain 2:
FALSE Chain 2: Gradient evaluation took 0.000161 seconds
FALSE Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 1.61 seconds.
FALSE Chain 2: Adjust your expectations accordingly!
FALSE Chain 2:
FALSE Chain 2:
FALSE Chain 2: Iteration: 1 / 1500 [ 0%] (Warmup)
FALSE Chain 2: Iteration: 250 / 1500 [ 16%] (Warmup)
FALSE Chain 2: Iteration: 500 / 1500 [ 33%] (Warmup)
FALSE Chain 2: Iteration: 501 / 1500 [ 33%] (Sampling)
FALSE Chain 2: Iteration: 750 / 1500 [ 50%] (Sampling)
FALSE Chain 2: Iteration: 1000 / 1500 [ 66%] (Sampling)
FALSE Chain 2: Iteration: 1250 / 1500 [ 83%] (Sampling)
FALSE Chain 2: Iteration: 1500 / 1500 [100%] (Sampling)
FALSE Chain 2:
FALSE Chain 2: Elapsed Time: 28.5554 seconds (Warm-up)
FALSE Chain 2: 37.9834 seconds (Sampling)
FALSE Chain 2: 66.5388 seconds (Total)
FALSE Chain 2:
# default DGU parameters:
# mcmc.warmup = 500
# mcmc.steps = 1,500
# mcmc.chains = 2
# mcmc.cores = 1
# hdi.level = 0.95
# adapt.delta = 0.95
# max.treedepth = 12
rstan::check_hmc_diagnostics(M$glm)
FALSE
FALSE Divergences:
FALSE
FALSE Tree depth:
FALSE
FALSE Energy:
grid.arrange(rstan::stan_rhat(object = M$glm),
rstan::stan_ess(object = M$glm),
nrow = 1)
ggplot(data = M$ppc.data$ppc.repertoire)+
facet_wrap(facets = ~sample_name, ncol = 4)+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
geom_errorbar(aes(x = observed_count, y = ppc_mean_count,
ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
geom_point(aes(x = observed_count, y = ppc_mean_count, fill = condition),
shape = 21, size = 1)+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
xlab(label = "Observed usage [counts]")+
ylab(label = "Predicted usage [counts]")+
scale_x_log10()+
scale_y_log10()+
annotation_logticks(base = 10, sides = "bl")
ggplot(data = M$ppc.data$ppc.gene)+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
geom_errorbar(aes(x = observed_prop*100, ymin = ppc_L_prop*100,
ymax = ppc_H_prop*100), col = "darkgray")+
geom_point(aes(x = observed_prop*100, y = ppc_mean_prop*100,
fill = condition), shape = 21, size = 1)+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
xlab(label = "Observed usage [%]")+
ylab(label = "Predicted usage [%]")+
scale_x_log10()+
scale_y_log10()+
annotation_logticks(base = 10, sides = "bl")
Main results:
kable(x = head(M$glm.summary), row.names = FALSE, digits = 3)
es_mean | es_mean_se | es_sd | es_median | es_L | es_H | contrast | pmax | gene_name |
---|---|---|---|---|---|---|---|---|
-0.397 | 0.003 | 0.113 | -0.394 | -0.621 | -0.179 | InfluenzaA - CMV | 0.998 | -1 |
-0.442 | 0.003 | 0.136 | -0.441 | -0.726 | -0.179 | InfluenzaA - CMV | 0.998 | -2 |
-0.322 | 0.004 | 0.212 | -0.313 | -0.764 | 0.070 | InfluenzaA - CMV | 0.884 | -3 |
-0.120 | 0.006 | 0.307 | -0.110 | -0.754 | 0.452 | InfluenzaA - CMV | 0.293 | -4 |
-0.090 | 0.007 | 0.333 | -0.082 | -0.777 | 0.550 | InfluenzaA - CMV | 0.230 | -5 |
0.399 | 0.004 | 0.109 | 0.398 | 0.193 | 0.612 | InfluenzaA - CMV | 1.000 | 0 |
The effect size and \(\pi\) are related. Lets visualize them for each gene (shown as a point). There are couple of net-charges groups with \(\pi \approx 1\).
# format data
stats <- M$glm.summary
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = stats$gene_name)
stats <- merge(x = stats, y = M$test.summary, by = "gene_name")
ggplot(data = stats)+
geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H),
col = "darkgray")+
geom_point(aes(x = pmax, y = es_mean))+
geom_text_repel(data = stats,
aes(x = pmax, y = es_mean, label = gene_fac))+
theme_bw(base_size = 11)+
xlab(label = expression(pi))+
scale_x_continuous(limits = c(0, 1))
ggplot(data = stats)+
geom_hline(yintercept = c(-log10(0.05), -log10(0.01)),
linetype = "dashed", col = "darkgray")+
geom_point(aes(x = pmax, y = -log10(t.test.fdr.pvalue)),
size = 1.5, col = "red")+
geom_text_repel(aes(x = pmax, y = -log10(t.test.fdr.pvalue),
label = gene_name), size = 5)+
xlim(0, 1)+
ylab(label = "-log10 (P-value) from t-test [FDR corrected]")+
xlab(label = expression(pi))+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
scale_color_discrete(name = '')
ggplot(data = stats)+
geom_hline(yintercept = c(-log10(0.05), -log10(0.01)),
linetype = "dashed", col = "darkgray")+
geom_point(aes(x = pmax, y = -log10(u.test.fdr.pvalue)),
size = 1.5, col = "red")+
geom_text_repel(aes(x = pmax, y = -log10(u.test.fdr.pvalue),
label = gene_name), size = 5)+
xlim(0, 1)+
ylab(label = "-log10(p-value) from U-test [FDR corrected]")+
xlab(label = expression(pi))+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
scale_color_discrete(name = '')
promising.genes <- M$usage.data$gene_names
ppc.gene <- M$ppc.data$ppc.gene
ppc.gene <- ppc.gene[ppc.gene$gene_name %in% promising.genes, ]
ppc.gene$gene_name <- factor(x = ppc.gene$gene_name, levels = -5:6)
ggplot()+
geom_errorbar(data = ppc.gene,
aes(x = gene_name, ymin = ppc_L_prop*100,
ymax = ppc_H_prop*100, col = condition),
position = position_dodge(width = .8), width = 0.75)+
geom_point(data = viz[viz$gene_name %in% promising.genes, ],
aes(x = gene_name, y = gene_usage_pct, col = condition),
shape = 21, size = 1.5, fill = "black",
position = position_jitterdodge(jitter.width = 0.2,
jitter.height = 0,
dodge.width = 0.75))+
theme_bw(base_size = 11)+
theme(legend.position = "top")+
ylab(label = "Gene usage [%]")+
xlab(label = '')