library(ClustIRR)
library(knitr)
library(visNetwork)
Adaptive immune repertoires of B- and T-cell receptors (BCRs/TCRs) play an important role in protecting the host against genetically diverse and rapidly evolving pathogens, such as viruses, bacteria, or cancers. The BCR and TCR sequence diversity originates in part due to V(D)J recombination, in which different germline-encoded genes are joined to form immune receptors (IRs). As a result of this process, practically every newly formed naive mature T cell and B cell is equipped with a distinct IR, and this allows them to recognize distinct sets of antigens.
B-cells bind antigens directly via the complementarity determining regions (CDR) of their BCRs, and T-cells recognize antigenic peptides presented by major histocompatibility (MHC) molecules via the CDRs of their TCRs. Antigen recognition may lead to B/T cell activation, and in such a case, the cells start to proliferate rapidly, forming antigen-specific clones that are capable of mounting effective immune response.
Recent studies have shown that similarity in TCR sequences implies shared antigen specificity between receptors. Hence, by clustering of TCR sequences of a repertoire derived by high-throughput sequencing (HT-seq), we can identify groups of TCRs with shared antigen specificity, which is essential for the development of cancer immunotherapies, vaccines, antiviral drugs, etc.
This vignette introduces ClustIRR, a computational method for clustering of immune receptor repertoires.
ClustIRR is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of adaptive immune repertoires.
To install ClustIRR please start R and enter:
if(!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("ClustIRR")
The algorithm of ClustIRR performs clustering of IR sequences to find groups of IRs with similar specificity.
The main input of ClustIRR are two repertoires, i.e., sets of amino acid sequences of the complementarity determining region 3 (CDR3). The CDR3s may come from one T cell receptor chain (e.g., only CDR3\(\alpha\)s or only CDR3\(\beta\)s) or from both chains (CDR3\(\alpha\beta\)). The two sets represent two repertoires, and each element in the sets represents a T-cell:
s
: IR repertoire under investigation (case sample)r
: reference IR repertoire (control/reference sample)Hint: s
and r
may also contain CDR3s from \(\gamma\delta\) T-cells
(CDR3\(\gamma\) and CDR3\(\delta\)) or B-cells (CDR3H and CDR3L). This is
explained in case study 4.
Let’s have a look at an example data set which we will use as input:
data("CDR3ab")
# take the first 500 CDR3b sequences from the data -> s
# s <- base::data.frame(CDR3b = CDR3ab$CDR3b[1:500])
s <- base::data.frame(CDR3b = c(CDR3ab$CDR3b[1:500], "CASSSSPGDQEQFF"))
# take 5000 CDR3b sequences 501:5500 from the data -> r
r <- base::data.frame(CDR3b = CDR3ab$CDR3b[501:5500])
str(s)
#> 'data.frame': 501 obs. of 1 variable:
#> $ CDR3b: chr "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
str(r)
#> 'data.frame': 5000 obs. of 1 variable:
#> $ CDR3b: chr "CASSPRWGSGNTIYF" "CASSDTPNYGYTF" "CAWTPGTRIQETQYF" "CATSRGRYGKQFF" ...
ClustIRR employs two clustering strategies:
s
compared to r
s
that have similar
sequencesThe rationale behind these two clustering strategies is the following: two identical CDR3 sequences have the same specificity. Similar CDR3 sequences (e.g., CDR3 sequences that differ by only one amino acid) may have similar specificity. Global clustering is designed to find pairs of CDR3s that are globally (based on the complete CDR3 sequences) similar.
We also know that two CDR3s with significantly different sequences may still
recognize the same peptide1 Glanville, Jacob, et al. “Identifying specificity groups in the T cell
receptor repertoire.” Nature 547.7661 (2017): 94-98. if they share a motif in their core regions
(e.g., identical 4-mer). Such “useful” motifs may be enriched in s
but not
in r
, and local clustering aims to identify them.
CDR3 sequences are segmented into overlapping motifs (\(k\)-mers), where \(k\) is specified by setting the input \(ks = 4\).
Example of segmenting CDR3 sequence into 4-mers:
cdr3 <- "CASSTTTGTGELFF"
k <- 4
base::colnames(stringdist::qgrams(cdr3, q = k))
#> [1] "CASS" "SSTT" "STTT" "TTTG" "TTGT" "TGTG" "TGEL" "GTGE" "GELF" "ELFF"
#> [11] "ASST"
\(k\)-mers found in the core region of the CDR3 loop are more likely to
establish contact with an antigenic peptide than the \(k\)-mers found at the
flanks of the CDR3 sequence. Hence, the user is encouraged to remove a few
amino acids from the flanks of each CDR3 sequence. This can be done by changing
the control input from control$trim_flank_aa,
e.g.:
control$trim_flank_aa = 0
: no trimmingcontrol$trim_flank_aa = 3
: trim three amino acids from both flanks of the
CDR3 sequenceAn example of trimming CDR3 flanks and segmenting the core of the CDR3 sequence
into 4-mers is shown below. We now focus on five overlapping motifs found in
the core region of the CASSTTTGTGELFF
.
t <- 3
cdr3_trimmed <- base::substr(x = cdr3, start = t + 1, stop = nchar(cdr3) - t)
base::colnames(stringdist::qgrams(cdr3_trimmed, q = k))
#> [1] "STTT" "TTTG" "TTGT" "TGTG" "GTGE"
A motif is considered enriched if the following (user-defined) criteria are satisfied:
control$local_min_o
: minimum motif frequency in s
control$local_min_ove
: minimum ratio of observed vs. expected (OvE)
relative motif frequency, with \(OvE=\dfrac{f_s}{n_s}/\dfrac{f_r}{n_r}\)
s
and r
s
and r
control$local_max_fdr
: maximum false discovery rate (FDR). Corrected
p-value, computed by Fisher’s exact test (actually hypergeometic test)For global clustering, ClustIRR quantifies the dissimilarity
between pairs of CDR3 sequences using Hamming distances. Two CDR3 sequences
with Hamming distance \(\leq x\) are considered globally similar, where \(x\) is
the user-defined threshold control$global_max_dist
(default = 1).
With this, ClustIRR provides a very simple and intuitive heuristic for identifying globally similar CDR3s. But this approach also has drawbacks:
To overcome these challenges, ClustIRR provides a second input
option (see red input in ClustIRR workflow), which allows the
user to provide a matrix of globally similar CDR3 sequences computed by
complementary approaches (e.g., tcrdist). This input can be provided via
the input parameter control$global_pairs
.
The main function in ClustIRR is cluster_irr
. This function
returns an S4 object of class clust_irr
. The object contains two sublists
(slots):
s
) and parametersWe will describe the inputs, outputs, and the algorithm of ClustIRR with the help of the following case studies.
In this example, we will insert the motif LEAR in the core regions of the
first 20 CDR3b sequences of repertoire s
. With this, we simulate enrichment
of this motif.
This motif is not enriched in repertoire r
, and ClustIRR
should be able to detect LEAR as enriched:
# insert motif LEAR
base::substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "LEAR"
… and then we perform clustering with ClustIRR:
o <- cluster_irr(s = s,
r = r,
version = 2,
ks = 4,
cores = 1,
control = list(trim_flank_aa = 3))
In the following table we see that ClustIRR has detected
enrichment of LEAR in s
compared to r
:
# extract local motifs
local_motifs <- get_clustirr_clust(o)$CDR3b$local$m
# display only passed motifs
knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE)
motif | f_s | n_s | f_r | n_r | k | ove | ove_ci_l95 | ove_ci_h95 | p_value | fdr | pass |
---|---|---|---|---|---|---|---|---|---|---|---|
LEAR | 20 | 2844 | 2 | 28286 | 4 | 100.1755 | 28.54331 | Inf | 0.0e+00 | 0.0000000 | TRUE |
LLEA | 5 | 2844 | 0 | 28286 | 4 | Inf | 12.13360 | Inf | 6.3e-06 | 0.0073939 | TRUE |
Interestingly, the motif LLEA is also enriched. This is probably because LLEA and LEAR share the substring LEA, hence, the enrichment of LLEA can be seen as a byproduct of the inserted motif LEAR.
This can also be seen from the lower frequency of LLEA (\(f_s\)=5) in s
compared to LEAR (\(f_s\)=20). For LEAR we see FDR\(\approx 10^{-15}\),
which is significantly smaller than the FDR\(\approx 10^{-2}\) observed for
LLEA. Finally, for LEAR we see OvE\(\approx 100\), whereas for LLEA’s
OvE\(=\infty\) (LLEA has f_r=0
; division by 0 when calculating OvE).
In our data we had only one pair of globally similar CDR3 sequences, i.e. the CDR3 sequences CASSPLEARGYTF and CASRPLEARGYTF which differ by one amino acid at position 4. ClustIRR has identified this:
# display globally similar pairs
knitr::kable(get_clustirr_clust(o)$CDR3b$global, row.names = FALSE)
CASSPLEARGYTF | CASRPLEARGYTF |
CASSSSPGDQEQFF | CASSSSPGDNEQFF |
To interpret the ClustIRR output we can inspect the tables
of locally/globally similar CDR3s. However, sometimes these tables can be
massive and difficult to interpret. Hence, we provide the functions get_graph
,
which allows us to generate an undirected graph (igraph object)
from the main outputs of ClustIRR, and plot_graph
for
visualization of the graph.
Each vertex in the graph is a T-cell clone, and we draw an edge between two vertices if a) they are globally similar; or b) if they share an enriched motif (locally similar).
Let’s visualize the graph output for this case study:
par(mai = c(0,0,0,0))
plot_graph(clust_irr = o)
The graph shows a cluster of 20 T-cell clones (vertices). These are the 20 clones with CDR3 sequences in which we inserted the motif LEAR. The remaining clones (about 480) are shown as singleton vertices.
We also wee two vertices connected by an edge. These are the two clones that are globally similar.
We can also plot an interactive graph with visNetwork.
plot_graph(clust_irr = o, as_visnet = TRUE)
Between a pair of vertices, we draw an edge if: a) they are globally similar; or b) they share an enriched motif. Furthermore, each chain type may have an edge if the repertoire contains paired information, e.g., on \(\beta\) and \(\alpha\) chain CDR3 sequences.
The color, linetype and thickness of the edges are determined as follows.
edge colors
edge linetypes
edge thickness: number of edges between two clones
In this case study, we assume that the data contains a clone with 20 T-cells ($$4% of the size of the initial repertoire). The T-cells in the expanded clone have the same CDR3b sequence CATSRPDGLAQYF. ClustIRR should detect most motifs at the core of CATSRPDGLAQYF as enriched while also reporting that all cells within have globally similar (in fact identical) CDR3b sequences.
Let’s insert a clone in dataset s
:
# create a clone of 10 T-cells
clone <- base::data.frame(CDR3b = rep("CATSRPDGLAQYF", times = 20))
# append the clone to the original repertoire 's'
s <- base::rbind(s, clone)
… and once again perform clustering with ClustIRR:
o <- cluster_irr(s = s,
r = r,
version = 2,
ks = 4,
cores = 1,
control = list(trim_flank_aa = 3))
ClustIRR once again reports enrichment of LEAR, but also of many additional motifs that are part of the core of CATSRPDGLAQYF, such as DGLA, PDGL, RPDG, SRPG, etc.
# extract local motifs
local_motifs <- get_clustirr_clust(o)$CDR3b$local$m
# display only passed motifs
knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE)
motif | f_s | n_s | f_r | n_r | k | ove | ove_ci_l95 | ove_ci_h95 | p_value | fdr | pass |
---|---|---|---|---|---|---|---|---|---|---|---|
DGLA | 20 | 2924 | 4 | 28286 | 4 | 48.66629 | 18.734764 | Inf | 0.0000000 | 0.0000000 | TRUE |
EARG | 5 | 2924 | 2 | 28286 | 4 | 24.21081 | 5.015885 | Inf | 0.0001285 | 0.0375044 | TRUE |
EARN | 4 | 2924 | 0 | 28286 | 4 | Inf | 8.684378 | Inf | 0.0000769 | 0.0256516 | TRUE |
LEAR | 20 | 2924 | 2 | 28286 | 4 | 97.41392 | 27.747649 | Inf | 0.0000000 | 0.0000000 | TRUE |
LLEA | 5 | 2924 | 0 | 28286 | 4 | Inf | 11.801297 | Inf | 0.0000072 | 0.0028003 | TRUE |
PDGL | 20 | 2924 | 1 | 28286 | 4 | 194.74092 | 37.363708 | Inf | 0.0000000 | 0.0000000 | TRUE |
RPDG | 20 | 2924 | 1 | 28286 | 4 | 194.74092 | 37.363708 | Inf | 0.0000000 | 0.0000000 | TRUE |
SRPD | 20 | 2924 | 1 | 28286 | 4 | 194.74092 | 37.363708 | Inf | 0.0000000 | 0.0000000 | TRUE |
Once again, ClustIRR finds the same pair of globally similar
CDR3 sequences. However, remember that this time our repertoire s
contains
20 identical CDR3s belonging to the expanded clone’s T-cells. Each pair
of CDR3 sequences in the clone are globally similar. Meanwhile, the CDR3
sequences CASSPLEARGYTF and CASRPLEARGYTF differ by one amino
acid at position 4.
Let’s check how the global and local similarities are represented in the graph (see next section).
Let’s plot the clustering results:
# plot the clust_irr object
plot_graph(clust_irr = o, as_visnet = TRUE)
We see one connected component, which is identical to the one we saw in case study 1. Furthermore, we see a large vertex. This represents the clonal expansion, where the size of the vertex scales as the logarithm of the number of T-cells in the clone.
Clonally expanded cells are globally similar to each other. If the specific
clonal expansion is only found in s
but not in r
, then it is likely that
we will also see an enrichment of certain motifs from the core of the
corresponding CDR3 sequences. In summary, CDR3s of expanded clones are similar
in terms of the global sequences but may also be locally similar.
Single-cell technology allows us to sequence entire TCR repertoires, and to extract the sequences of both TCR chains: \(\beta\) and \(\alpha\).
ClustIRR can analyze such data. Clustering is performed separately using the CDR3 sequences of each chain.
Let’s create the input data. We create two repetoires:
S0
: 1,000 CDR3\(\beta\) and CDR3\(\alpha\) sequence pairs.S1
: 1,000 CDR3\(\beta\) and CDR3\(\alpha\) sequence pairs.Imagine that S0
and S1
are two TCR repertoires of a cancer patient,
sequenced before and after cancer therapy, respectively.
data("CDR3ab")
S0 <- base::data.frame(CDR3a = CDR3ab$CDR3a[3001:4000],
CDR3b = CDR3ab$CDR3b[3001:4000])
S1 <- base::data.frame(CDR3a = CDR3ab$CDR3a[5001:6000],
CDR3b = CDR3ab$CDR3b[5001:6000])
TCR repertoire S0
has
# insert motif LEAR in CDR3b
base::substr(x = S0$CDR3b[1:10], start = 6, stop = 9) <- "LEAR"
# insert motif REAL in CDR3a
base::substr(x = S0$CDR3a[50:59], start = 6, stop = 9) <- "REAL"
# create 3 clones
clone_1 <- data.frame(CDR3a = rep(x = "CASSEGEQFF", times = 100),
CDR3b = rep(x = "CASSLLARAEQFF", times = 100))
clone_2 <- data.frame(CDR3a = rep(x = "CASSLESPLHF", times = 50),
CDR3b = rep(x = "CASSLEEEEEEPLHF", times = 50))
clone_3 <- data.frame(CDR3a = rep(x = "CASSLESPLHF", times = 10),
CDR3b = rep(x = "CASSLAAAAASPLHF", times = 10))
# append clones to s
S0 <- rbind(S0, clone_1, clone_2, clone_3)
TCR repertoire S1
has
S0
]base::substr(x = S1$CDR3b[1:20], start = 6, stop = 9) <- "WWWW"
S1 <- rbind(S1, clone_1)
We will perform two sets of analysis with ClustIRR.
First, we will inspect how the specificity structure of repertoire S1
(s
=S1
) is modulated compare to repertoire S0
(r
=S0
). Second, we
will do the reverse, i.e. we will inspect the specificity structure of
repertoire S0
(s
=S0
) compared to S1
(r
=S1
).
o_S1_vs_S0 <- cluster_irr(s = S1,
r = S0,
version = 2,
ks = 4,
cores = 1,
control = list(trim_flank_aa = 3))
o_S0_vs_S1 <- cluster_irr(s = S0,
r = S1,
version = 2,
ks = 4,
cores = 1,
control = list(trim_flank_aa = 3))
Let’s plot the joint graph:
# beta & alpha chain
plot_joint_graph(clust_irr_1 = o_S1_vs_S0,
clust_irr_2 = o_S0_vs_S1,
as_visnet = TRUE)
The clones of repertoire S0
and S1
are shown are orange and blue
vertices, respectively.
Repertoire S0
contains three expanded clones. Hence, we see three large
orange vertices. Repertoire S1
contains only one expanded clone shown as
a large blue vertex. The remaining vertices are small and likely contain a
single T-cell.
Within each repertoire, the edges are drawn as explained earlier.
Meanwhile, in the joint graph we also have edges between the vertices from
the two repertoires. These will be drawn if a pair of clones in S0
and S1
have globally similar CDR3 sequences. In this particular example, the vertex
that represents the expanded clone 1 (large orange and blue vertices) are
connected by such an edge.
ClustIRR can be employed to study the specificity structure of:
To carry out this analysis, the user must appropriately configure the inputs. Here we demonstrate this point. The rest of the clustering analysis proceeds similarly as outlined in case studies 1-3.
To analyze \(\alpha\beta\) chain TCR repertoires, the input data sets s
and r
are required to have one or both (e.g., if we have paired data as explained in
case study 3) of the columns: CDR3a
and CDR3b
.
For the analysis of \(\gamma\delta\) chain TCR repertoires the columns CDR3g
and CDR3d
have to be specified, whereas for the analysis of heavy(H)/light(L)
chain BCR repertoires ClustIRR uses the two columns CDR3h and
CDR3l.
Dummy example:
data("CDR3ab")
# gamma/delta chain TCR data -> notice CDR3g and CDR3d columns of 's' and 'r'
s <- base::data.frame(CDR3g = CDR3ab$CDR3a[4501:5000],
CDR3d = CDR3ab$CDR3b[4501:5000])
r <- base::data.frame(CDR3g = CDR3ab$CDR3a[5001:10000],
CDR3d = CDR3ab$CDR3b[5001:10000])
data("CDR3ab")
# heavy/light chain BCR data -> notice CDR3h and CDR3l columns of 's' and 'r'
s <- base::data.frame(CDR3h = CDR3ab$CDR3a[4501:5000],
CDR3l = CDR3ab$CDR3b[4501:5000])
r <- base::data.frame(CDR3h = CDR3ab$CDR3a[5001:10000],
CDR3l = CDR3ab$CDR3b[5001:10000])
The rest of the analysis proceeds as usual, i.e., by calling the function
cluster_irr
.
utils::sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] visNetwork_2.1.2 knitr_1.44 ClustIRR_1.0.0 BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.1 magick_2.8.1 rlang_1.1.1
#> [4] xfun_0.40 jsonlite_1.8.7 listenv_0.9.0
#> [7] future.apply_1.11.0 htmltools_0.5.6.1 sass_0.4.7
#> [10] rmarkdown_2.25 evaluate_0.22 jquerylib_0.1.4
#> [13] ellipsis_0.3.2 fastmap_1.1.1 yaml_2.3.7
#> [16] bookdown_0.36 BiocManager_1.30.22 compiler_4.3.1
#> [19] igraph_1.5.1 codetools_0.2-19 Rcpp_1.0.11
#> [22] pkgconfig_2.0.3 htmlwidgets_1.6.2 future_1.33.0
#> [25] digest_0.6.33 R6_2.5.1 parallelly_1.36.0
#> [28] parallel_4.3.1 magrittr_2.0.3 stringdist_0.9.10
#> [31] bslib_0.5.1 tools_4.3.1 globals_0.16.2
#> [34] cachem_1.0.8