library(ClustIRR)
library(knitr)
library(visNetwork)

1 Introduction

Adaptive immune repertoires of B- and T-cell receptors (BCRs/TCRs) play an important role in protecting the host against genetically diverse and rapidly evolving pathogens, such as viruses, bacteria, or cancers. The BCR and TCR sequence diversity originates in part due to V(D)J recombination, in which different germline-encoded genes are joined to form immune receptors (IRs). As a result of this process, practically every newly formed naive mature T cell and B cell is equipped with a distinct IR, and this allows them to recognize distinct sets of antigens.

B-cells bind antigens directly via the complementarity determining regions (CDR) of their BCRs, and T-cells recognize antigenic peptides presented by major histocompatibility (MHC) molecules via the CDRs of their TCRs. Antigen recognition may lead to B/T cell activation, and in such a case, the cells start to proliferate rapidly, forming antigen-specific clones that are capable of mounting effective immune response.

Recent studies have shown that similarity in TCR sequences implies shared antigen specificity between receptors. Hence, by clustering of TCR sequences of a repertoire derived by high-throughput sequencing (HT-seq), we can identify groups of TCRs with shared antigen specificity, which is essential for the development of cancer immunotherapies, vaccines, antiviral drugs, etc.

This vignette introduces ClustIRR, a computational method for clustering of immune receptor repertoires.

2 Installation

ClustIRR is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of adaptive immune repertoires.

To install ClustIRR please start R and enter:

if(!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("ClustIRR")

3 ClustIRR algorithm

The algorithm of ClustIRR performs clustering of IR sequences to find groups of IRs with similar specificity.

3.1 Input

The main input of ClustIRR are two repertoires, i.e., sets of amino acid sequences of the complementarity determining region 3 (CDR3). The CDR3s may come from one T cell receptor chain (e.g., only CDR3$\alpha$s or only CDR3$\beta$s) or from both chains (CDR3$\alpha\beta$). The two sets represent two repertoires, and each element in the sets represents a T-cell:

s: IR repertoire under investigation (case sample)
r: reference IR repertoire (control/reference sample)

Hint: s and r may also contain CDR3s from $\gamma\delta$ T-cells (CDR3$\gamma$ and CDR3$\delta$) or B-cells (CDR3H and CDR3L). This is explained in case study 4.

Let’s have a look at an example data set which we will use as input:

data("CDR3ab")

# take the first 500 CDR3b sequences from the data -> s
# s <- base::data.frame(CDR3b = CDR3ab$CDR3b[1:500])
s <- base::data.frame(CDR3b = c(CDR3ab$CDR3b[1:500], "CASSSSPGDQEQFF"))

# take 5000 CDR3b sequences 501:5500 from the data -> r
r <- base::data.frame(CDR3b = CDR3ab$CDR3b[501:5500])

str(s)
#> 'data.frame':    501 obs. of  1 variable:
#>  $ CDR3b: chr  "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...

str(r)
#> 'data.frame':    5000 obs. of  1 variable:
#>  $ CDR3b: chr  "CASSPRWGSGNTIYF" "CASSDTPNYGYTF" "CAWTPGTRIQETQYF" "CATSRGRYGKQFF" ...

3.2 Clustering

ClustIRR employs two clustering strategies:

local clustering: detects enrichment of motifs in s compared to r
global clustering: identifies pairs of CDR3s in s that have similar sequences

The rationale behind these two clustering strategies is the following: two identical CDR3 sequences have the same specificity. Similar CDR3 sequences (e.g., CDR3 sequences that differ by only one amino acid) may have similar specificity. Global clustering is designed to find pairs of CDR3s that are globally (based on the complete CDR3 sequences) similar.

We also know that two CDR3s with significantly different sequences may still recognize the same peptide111 Glanville, Jacob, et al. “Identifying specificity groups in the T cell receptor repertoire.” Nature 547.7661 (2017): 94-98. if they share a motif in their core regions (e.g., identical 4-mer). Such “useful” motifs may be enriched in s but not in r, and local clustering aims to identify them.

3.2.1 Local clustering

CDR3 sequences are segmented into overlapping motifs ($k$-mers), where $k$ is specified by setting the input $ks = 4$.

Example of segmenting CDR3 sequence into 4-mers:

cdr3 <- "CASSTTTGTGELFF"
k <- 4
base::colnames(stringdist::qgrams(cdr3, q = k))
#>  [1] "CASS" "SSTT" "STTT" "TTTG" "TTGT" "TGTG" "TGEL" "GTGE" "GELF" "ELFF"
#> [11] "ASST"

$k$-mers found in the core region of the CDR3 loop are more likely to establish contact with an antigenic peptide than the $k$-mers found at the flanks of the CDR3 sequence. Hence, the user is encouraged to remove a few amino acids from the flanks of each CDR3 sequence. This can be done by changing the control input from control$trim_flank_aa, e.g.:

control$trim_flank_aa = 0: no trimming
control$trim_flank_aa = 3: trim three amino acids from both flanks of the CDR3 sequence

An example of trimming CDR3 flanks and segmenting the core of the CDR3 sequence into 4-mers is shown below. We now focus on five overlapping motifs found in the core region of the CASSTTTGTGELFF.

t <- 3
cdr3_trimmed <- base::substr(x = cdr3, start = t + 1, stop = nchar(cdr3) - t)
base::colnames(stringdist::qgrams(cdr3_trimmed, q = k))
#> [1] "STTT" "TTTG" "TTGT" "TGTG" "GTGE"

A motif is considered enriched if the following (user-defined) criteria are satisfied:

control$local_min_o: minimum motif frequency in s
control$local_min_ove: minimum ratio of observed vs. expected (OvE) relative motif frequency, with $OvE=\dfrac{f_s}{n_s}/\dfrac{f_r}{n_r}$
- $f_{s}$ and $f_{r}$: motif frequencies in repertoires s and r
- $n_{s}$ and $n_{r}$: total number of motifs in repertoires s and r
control$local_max_fdr: maximum false discovery rate (FDR). Corrected p-value, computed by Fisher’s exact test (actually hypergeometic test)

3.2.2 Global clustering

For global clustering, ClustIRR quantifies the dissimilarity between pairs of CDR3 sequences using Hamming distances. Two CDR3 sequences with Hamming distance $\leq x$ are considered globally similar, where $x$ is the user-defined threshold control$global_max_dist (default = 1).

With this, ClustIRR provides a very simple and intuitive heuristic for identifying globally similar CDR3s. But this approach also has drawbacks:

CDR3 sequences with different lengths are not compared
Hamming distance, by definition, does not take into account the properties of the replaced amino acids

To overcome these challenges, ClustIRR provides a second input option (see red input in ClustIRR workflow), which allows the user to provide a matrix of globally similar CDR3 sequences computed by complementary approaches (e.g., tcrdist). This input can be provided via the input parameter control$global_pairs.

3.3 Output

The main function in ClustIRR is cluster_irr. This function returns an S4 object of class clust_irr. The object contains two sublists (slots):

clustering results: tables and lists
processed inputs: processed version of the input data (s) and parameters

We will describe the inputs, outputs, and the algorithm of ClustIRR with the help of the following case studies.

4 Case study 1: simple TCR repertoire analysis with ClustIRR

In this example, we will insert the motif LEAR in the core regions of the first 20 CDR3b sequences of repertoire s. With this, we simulate enrichment of this motif.

This motif is not enriched in repertoire r, and ClustIRR should be able to detect LEAR as enriched:

# insert motif LEAR
base::substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "LEAR"

… and then we perform clustering with ClustIRR:

o <- cluster_irr(s = s,
                 r = r,
                 version = 2,
                 ks = 4,
                 cores = 1,
                 control = list(trim_flank_aa = 3))

4.1 Local clustering results

In the following table we see that ClustIRR has detected enrichment of LEAR in s compared to r:

# extract local motifs
local_motifs <- get_clustirr_clust(o)$CDR3b$local$m

# display only passed motifs
knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE)

motif	f_s	n_s	f_r	n_r	k	ove	ove_ci_l95	ove_ci_h95	p_value	fdr	pass
LEAR	20	2844	2	28286	4	100.1755	28.54331	Inf	0.0e+00	0.0000000	TRUE
LLEA	5	2844	0	28286	4	Inf	12.13360	Inf	6.3e-06	0.0073939	TRUE

Interestingly, the motif LLEA is also enriched. This is probably because LLEA and LEAR share the substring LEA, hence, the enrichment of LLEA can be seen as a byproduct of the inserted motif LEAR.

This can also be seen from the lower frequency of LLEA ($f_s$=5) in s compared to LEAR ($f_s$=20). For LEAR we see FDR$\approx 10^{-15}$, which is significantly smaller than the FDR$\approx 10^{-2}$ observed for LLEA. Finally, for LEAR we see OvE$\approx 100$, whereas for LLEA’s
OvE$=\infty$ (LLEA has f_r=0; division by 0 when calculating OvE).

4.2 Global clustering results

In our data we had only one pair of globally similar CDR3 sequences, i.e. the CDR3 sequences CASSPLEARGYTF and CASRPLEARGYTF which differ by one amino acid at position 4. ClustIRR has identified this:

# display globally similar pairs
knitr::kable(get_clustirr_clust(o)$CDR3b$global, row.names = FALSE)

CASSPLEARGYTF	CASRPLEARGYTF
CASSSSPGDQEQFF	CASSSSPGDNEQFF

4.3 Putting it all together $\rightarrow$ graph

To interpret the ClustIRR output we can inspect the tables of locally/globally similar CDR3s. However, sometimes these tables can be massive and difficult to interpret. Hence, we provide the functions get_graph, which allows us to generate an undirected graph (igraph object) from the main outputs of ClustIRR, and plot_graph for visualization of the graph.

Each vertex in the graph is a T-cell clone, and we draw an edge between two vertices if a) they are globally similar; or b) if they share an enriched motif (locally similar).

Let’s visualize the graph output for this case study:

par(mai = c(0,0,0,0))
plot_graph(clust_irr = o)

4.3.1 Vertices

The graph shows a cluster of 20 T-cell clones (vertices). These are the 20 clones with CDR3 sequences in which we inserted the motif LEAR. The remaining clones (about 480) are shown as singleton vertices.

We also wee two vertices connected by an edge. These are the two clones that are globally similar.

We can also plot an interactive graph with visNetwork.

plot_graph(clust_irr = o, as_visnet = TRUE)

4.3.2 Edges

Between a pair of vertices, we draw an edge if: a) they are globally similar; or b) they share an enriched motif. Furthermore, each chain type may have an edge if the repertoire contains paired information, e.g., on $\beta$ and $\alpha$ chain CDR3 sequences.

The color, linetype and thickness of the edges are determined as follows.

edge colors
- purple: local CDR3 similarity
- green: global CDR3 similarity
- black: local and global CDR3 similarity
edge linetypes
- dashed: similarity between CDR3$\beta$, CDR3$\delta$ and CDR3H
- dotted: similarity between CDR3$\alpha$, CDR3$\gamma$ and CDR3L
- solid: similarity between CDR3s from both chains (e.g. CDR3$\alpha$ and CDR3$\beta$)
edge thickness: number of edges between two clones

5 Case study 2: analysis of TCR repertoire with large expanded clone

In this case study, we assume that the data contains a clone with 20 T-cells ($$4% of the size of the initial repertoire). The T-cells in the expanded clone have the same CDR3b sequence CATSRPDGLAQYF. ClustIRR should detect most motifs at the core of CATSRPDGLAQYF as enriched while also reporting that all cells within have globally similar (in fact identical) CDR3b sequences.

Let’s insert a clone in dataset s:

# create a clone of 10 T-cells
clone <- base::data.frame(CDR3b = rep("CATSRPDGLAQYF", times = 20))

# append the clone to the original repertoire 's'
s <- base::rbind(s, clone)

… and once again perform clustering with ClustIRR:

o <- cluster_irr(s = s,
                 r = r,
                 version = 2,
                 ks = 4,
                 cores = 1,
                 control = list(trim_flank_aa = 3))

5.1 Local clustering results

ClustIRR once again reports enrichment of LEAR, but also of many additional motifs that are part of the core of CATSRPDGLAQYF, such as DGLA, PDGL, RPDG, SRPG, etc.

# extract local motifs
local_motifs <- get_clustirr_clust(o)$CDR3b$local$m

# display only passed motifs
knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE)

motif	f_s	n_s	f_r	n_r	k	ove	ove_ci_l95	ove_ci_h95	p_value	fdr	pass
DGLA	20	2924	4	28286	4	48.66629	18.734764	Inf	0.0000000	0.0000000	TRUE
EARG	5	2924	2	28286	4	24.21081	5.015885	Inf	0.0001285	0.0375044	TRUE
EARN	4	2924	0	28286	4	Inf	8.684378	Inf	0.0000769	0.0256516	TRUE
LEAR	20	2924	2	28286	4	97.41392	27.747649	Inf	0.0000000	0.0000000	TRUE
LLEA	5	2924	0	28286	4	Inf	11.801297	Inf	0.0000072	0.0028003	TRUE
PDGL	20	2924	1	28286	4	194.74092	37.363708	Inf	0.0000000	0.0000000	TRUE
RPDG	20	2924	1	28286	4	194.74092	37.363708	Inf	0.0000000	0.0000000	TRUE
SRPD	20	2924	1	28286	4	194.74092	37.363708	Inf	0.0000000	0.0000000	TRUE

5.2 Global clustering results

Once again, ClustIRR finds the same pair of globally similar CDR3 sequences. However, remember that this time our repertoire s contains 20 identical CDR3s belonging to the expanded clone’s T-cells. Each pair of CDR3 sequences in the clone are globally similar. Meanwhile, the CDR3 sequences CASSPLEARGYTF and CASRPLEARGYTF differ by one amino acid at position 4.

Let’s check how the global and local similarities are represented in the graph (see next section).

5.3 Graph output

Let’s plot the clustering results:

# plot the clust_irr object
plot_graph(clust_irr = o, as_visnet = TRUE)

We see one connected component, which is identical to the one we saw in case study 1. Furthermore, we see a large vertex. This represents the clonal expansion, where the size of the vertex scales as the logarithm of the number of T-cells in the clone.

Clonally expanded cells are globally similar to each other. If the specific clonal expansion is only found in s but not in r, then it is likely that we will also see an enrichment of certain motifs from the core of the corresponding CDR3 sequences. In summary, CDR3s of expanded clones are similar in terms of the global sequences but may also be locally similar.

6 Case study 3: analysis of TCR repertoire with paired $\alpha\beta$ TCR chains

Single-cell technology allows us to sequence entire TCR repertoires, and to extract the sequences of both TCR chains: $\beta$ and $\alpha$.

ClustIRR can analyze such data. Clustering is performed separately using the CDR3 sequences of each chain.

Let’s create the input data. We create two repetoires:

repertoire S0: 1,000 CDR3$\beta$ and CDR3$\alpha$ sequence pairs.
repertoire S1: 1,000 CDR3$\beta$ and CDR3$\alpha$ sequence pairs.

Imagine that S0 and S1 are two TCR repertoires of a cancer patient, sequenced before and after cancer therapy, respectively.

data("CDR3ab")

S0 <- base::data.frame(CDR3a = CDR3ab$CDR3a[3001:4000],
                      CDR3b = CDR3ab$CDR3b[3001:4000])

S1 <- base::data.frame(CDR3a = CDR3ab$CDR3a[5001:6000],
                      CDR3b = CDR3ab$CDR3b[5001:6000])

TCR repertoire S0 has

two enriched motifs: LEAR and REAL
three expanded clones
- clone 1: 100 T-cells (CDR3a: CASSEGEQFF and CDR3b CASSLLARAEQFF)
- clone 2: 50 T-cells (CDR3a: CASSLESPLHF and CDR3b: CASSLEEEEEEPLHF)
- clone 3: 10 T-cells (CDR3a: CASSLESPLHF and CDR3b: CASSLAAAAASPLHF)

# insert motif LEAR in CDR3b
base::substr(x = S0$CDR3b[1:10], start = 6, stop = 9) <- "LEAR"
# insert motif REAL in CDR3a
base::substr(x = S0$CDR3a[50:59], start = 6, stop = 9) <- "REAL"

# create 3 clones
clone_1 <- data.frame(CDR3a = rep(x = "CASSEGEQFF", times = 100),
                      CDR3b = rep(x = "CASSLLARAEQFF", times = 100))

clone_2 <- data.frame(CDR3a = rep(x = "CASSLESPLHF", times = 50),
                      CDR3b = rep(x = "CASSLEEEEEEPLHF", times = 50))

clone_3 <- data.frame(CDR3a = rep(x = "CASSLESPLHF", times = 10),
                      CDR3b = rep(x = "CASSLAAAAASPLHF", times = 10))

# append clones to s
S0 <- rbind(S0, clone_1, clone_2, clone_3)

TCR repertoire S1 has

one enriched motif: WWWW
one expanded clone 1: 100 T-cells (CDR3a: CASSEGEQFF and CDR3b CASSLLARAEQFF) [same as clone 1 in repertoire S0]

base::substr(x = S1$CDR3b[1:20], start = 6, stop = 9) <- "WWWW"
S1 <- rbind(S1, clone_1)

We will perform two sets of analysis with ClustIRR.

First, we will inspect how the specificity structure of repertoire S1 (s=S1) is modulated compare to repertoire S0 (r=S0). Second, we will do the reverse, i.e. we will inspect the specificity structure of repertoire S0 (s=S0) compared to S1 (r=S1).

o_S1_vs_S0 <- cluster_irr(s = S1,
                          r = S0,
                          version = 2,
                          ks = 4,
                          cores = 1,
                          control = list(trim_flank_aa = 3))


o_S0_vs_S1 <- cluster_irr(s = S0,
                          r = S1,
                          version = 2,
                          ks = 4,
                          cores = 1,
                          control = list(trim_flank_aa = 3))

Let’s plot the joint graph:

# beta & alpha chain
plot_joint_graph(clust_irr_1 = o_S1_vs_S0, 
                 clust_irr_2 = o_S0_vs_S1,
                 as_visnet = TRUE)

6.1 How to interpret the joint graph?

The clones of repertoire S0 and S1 are shown are orange and blue vertices, respectively.

Repertoire S0 contains three expanded clones. Hence, we see three large orange vertices. Repertoire S1 contains only one expanded clone shown as a large blue vertex. The remaining vertices are small and likely contain a single T-cell.

Within each repertoire, the edges are drawn as explained earlier.

Meanwhile, in the joint graph we also have edges between the vertices from the two repertoires. These will be drawn if a pair of clones in S0 and S1 have globally similar CDR3 sequences. In this particular example, the vertex that represents the expanded clone 1 (large orange and blue vertices) are connected by such an edge.

7 Case study 4: analysis of TCR repertoires with paired $\gamma\delta$ chains

ClustIRR can be employed to study the specificity structure of:

$\alpha\beta$ chain TCR repertoires
$\gamma\delta$ chain TCR repertoires
heavy(H)/light(L) chain BCR repertoires

To carry out this analysis, the user must appropriately configure the inputs. Here we demonstrate this point. The rest of the clustering analysis proceeds similarly as outlined in case studies 1-3.

7.1 Configuring the input

To analyze $\alpha\beta$ chain TCR repertoires, the input data sets s and r are required to have one or both (e.g., if we have paired data as explained in case study 3) of the columns: CDR3a and CDR3b.

For the analysis of $\gamma\delta$ chain TCR repertoires the columns CDR3g and CDR3d have to be specified, whereas for the analysis of heavy(H)/light(L) chain BCR repertoires ClustIRR uses the two columns CDR3h and CDR3l.

Dummy example:

data("CDR3ab")

# gamma/delta chain TCR data -> notice CDR3g and CDR3d columns of 's' and 'r'
s <- base::data.frame(CDR3g = CDR3ab$CDR3a[4501:5000],
                      CDR3d = CDR3ab$CDR3b[4501:5000])

r <- base::data.frame(CDR3g = CDR3ab$CDR3a[5001:10000],
                      CDR3d = CDR3ab$CDR3b[5001:10000])

data("CDR3ab")

# heavy/light chain BCR data -> notice CDR3h and CDR3l columns of 's' and 'r'
s <- base::data.frame(CDR3h = CDR3ab$CDR3a[4501:5000],
                      CDR3l = CDR3ab$CDR3b[4501:5000])

r <- base::data.frame(CDR3h = CDR3ab$CDR3a[5001:10000],
                      CDR3l = CDR3ab$CDR3b[5001:10000])

The rest of the analysis proceeds as usual, i.e., by calling the function cluster_irr.

utils::sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] visNetwork_2.1.2 knitr_1.44       ClustIRR_1.0.0   BiocStyle_2.30.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.1           magick_2.8.1        rlang_1.1.1        
#>  [4] xfun_0.40           jsonlite_1.8.7      listenv_0.9.0      
#>  [7] future.apply_1.11.0 htmltools_0.5.6.1   sass_0.4.7         
#> [10] rmarkdown_2.25      evaluate_0.22       jquerylib_0.1.4    
#> [13] ellipsis_0.3.2      fastmap_1.1.1       yaml_2.3.7         
#> [16] bookdown_0.36       BiocManager_1.30.22 compiler_4.3.1     
#> [19] igraph_1.5.1        codetools_0.2-19    Rcpp_1.0.11        
#> [22] pkgconfig_2.0.3     htmlwidgets_1.6.2   future_1.33.0      
#> [25] digest_0.6.33       R6_2.5.1            parallelly_1.36.0  
#> [28] parallel_4.3.1      magrittr_2.0.3      stringdist_0.9.10  
#> [31] bslib_0.5.1         tools_4.3.1         globals_0.16.2     
#> [34] cachem_1.0.8

Introduction to ClustIRR

2023-10-24

Contents