immGLIPH provides an R implementation of the GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots) and GLIPH2 algorithms for clustering T cell receptors (TCRs) that are predicted to bind the same HLA-restricted peptide antigen.
The package identifies TCR specificity groups by detecting statistically enriched CDR3\(\beta\) motifs (local similarity) and structurally similar CDR3\(\beta\) sequences (global similarity), then clusters them into convergence groups and scores each group for biological significance.
immGLIPH is an R implementation of existing algorithms. Users should cite the original publications:
GLIPH: Glanville, J. et al. Identifying specificity groups in the T cell receptor repertoire. Nature 547, 94–98 (2017). doi:10.1038/nature22976
GLIPH2: Huang, H. et al. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nature Biotechnology 38, 1194–1202 (2020). doi:10.1038/s41587-020-0505-4
immGLIPH can be installed from Bioconductor:
The reference repertoire data (~19 MB) is downloaded automatically
the first time you run runGLIPH() and cached locally via BiocFileCache.
You can pre-download the data with:
immGLIPH integrates with the scRepertoire
ecosystem through immApex. Both
scRepertoire and immApex are Bioconductor packages and can be installed
via BiocManager::install(). This means
runGLIPH() can directly accept:
immGLIPH includes built-in example data derived from the scRepertoire example dataset (Yost et al. 2019):
gliph_input_data: A data frame of 365
TRB CDR3 sequences with V-gene and patient annotations.gliph_sce: A SingleCellExperiment
object with TCR clonotype information in colData,
demonstrating the single-cell workflow.## CDR3b TRBV patient
## 1 CAISEQGKGELFF TRBV10-3 P17B
## 2 CASSVRRERANTGELFF TRBV9 P17B
## 3 CASSVRRERANTGELFF TRBV9 P17B
## 4 CASSSRQDSTDTQYF TRBV27 P17B
## 5 CASSVRRERANTGELFF TRBV9 P17B
## 6 CASSPRAGTPNTEAFF TRBV4-1 P17B
## [1] 365 3
runGLIPH() accepts a data frame with the following
columns:
| Column | Required | Description |
|---|---|---|
| CDR3b | Yes | CDR3\(\beta\) amino acid sequences |
| TRBV | No | V-gene usage (e.g., “TRBV5-1”) |
| patient | No | Donor/sample identifier |
| HLA | No | HLA alleles, comma-separated |
| counts | No | Clone frequency |
Alternative column names are automatically recognized (e.g.,
cdr3, v_gene, sample,
clone_count).
When working with single-cell immune repertoire data, you can use scRepertoire to prepare your data and pass the output directly to immGLIPH.
library(scRepertoire)
data("contig_list")
# After processing with cellranger/etc, combine contigs
combined <- combineTCR(contig_list[seq_len(2)],
samples = c("P1", "P2"))
# Take a small slice so the example runs quickly. In real use, pass
# all samples and rely on the bundled reference downloaded by
# getGLIPHreference().
combined_small <- lapply(combined, function(x) x[seq_len(50), ])
# Use a small custom reference built from the bundled example data
small_ref <- gliph_input_data[, c("CDR3b", "TRBV")]
# Pass scRepertoire output directly to runGLIPH
results_sc <- runGLIPH(combined_small,
method = "gliph2",
refdb_beta = small_ref,
sim_depth = 100,
n_cores = 1)For SingleCellExperiment objects that already
contain TCR metadata (e.g., added via
scRepertoire::combineExpression()), immGLIPH extracts the
receptor data automatically using immApex::getIR(). Here is
an example using the bundled gliph_sce dataset:
runGLIPH() Function| Argument | Default | Description |
|---|---|---|
cdr3_sequences |
– | Input data (data frame, vector, SCE, or list) |
method |
"gliph2" |
Algorithm preset: "gliph1",
"gliph2", or "custom" |
sim_depth |
1000 | Simulation depth (higher = more reproducible, slower) |
n_cores |
1 | Number of parallel cores |
refdb_beta |
"human_v2.0_CD48" |
Reference database (built-in name or custom data frame) |
local_similarities |
TRUE |
Search for local (motif-based) similarities |
global_similarities |
TRUE |
Search for global (structural) similarities |
structboundaries |
TRUE |
Trim structural boundaries before comparison |
accept_CF |
TRUE |
Filter to sequences starting with C, ending with F |
The method parameter configures a coordinated set of
algorithm choices:
| Setting | "gliph1" |
"gliph2" |
"custom" |
|---|---|---|---|
| Local method | Repeated random sampling | Fisher’s exact test | User-defined |
| Global method | Hamming distance cutoff | Struct-based + Fisher | User-defined |
| Clustering | Connected components | Per-motif isolated | User-defined |
| Scoring | GLIPH1 formula | GLIPH2 formula | User-defined |
By default runGLIPH() downloads a reference repertoire
via BiocFileCache. For this vignette we supply a custom reference data
frame directly using refdb_beta to avoid network
access:
The output is a list with the following elements:
## [1] "sample_log" "motif_enrichment" "global_enrichment"
## [4] "connections" "cluster_properties" "cluster_list"
## [7] "parameters"
The cluster_properties data frame summarizes each
convergence group with enrichment scores:
## NULL
The cluster_list is a named list where each element
contains the member TCRs of a convergence group:
## [1] 0
The motif_enrichment element contains the locally
enriched motifs:
# Significantly enriched motifs
if (!is.null(res_gliph2$motif_enrichment$selected_motifs)) {
head(res_gliph2$motif_enrichment$selected_motifs)
}## [1] motif counts num_in_ref avgRef topRef OvE p.value
## <0 rows> (or 0-length row.names)
method = "custom"The "custom" method allows independent control over each
algorithmic component:
res_custom <- runGLIPH(
cdr3_sequences = gliph_input_data[seq_len(200), ],
method = "custom",
refdb_beta = ref_df,
local_method = "fisher", # or "rrs"
global_method = "cutoff", # or "fisher"
clustering_method = "GLIPH1.0", # or "GLIPH2.0"
scoring_method = "GLIPH2.0", # or "GLIPH1.0"
sim_depth = 100,
n_cores = 1
)For the Fisher-based local method (GLIPH2), you can adjust:
lcminp: Maximum p-value for a motif to
be considered enriched (default 0.01)lcminove: Minimum fold-enrichment per
motif length (default c(1000, 100, 10) for 2-mers, 3-mers,
4-mers)kmer_mindepth: Minimum motif
observations in the sample (default 3)immGLIPH ships with reference repertoires for human and mouse from the original GLIPH publications. Each is available as CD4, CD8, or combined (CD48) subsets:
| Name | Species | Version | Subset | Source |
|---|---|---|---|---|
"human_v1.0_CD4" |
Human | v1.0 | CD4 | Glanville et al. (2017) |
"human_v1.0_CD8" |
Human | v1.0 | CD8 | Glanville et al. (2017) |
"human_v1.0_CD48" |
Human | v1.0 | CD4+CD8 | Glanville et al. (2017) |
"human_v2.0_CD4" |
Human | v2.0 | CD4 | Huang et al. (2020) |
"human_v2.0_CD8" |
Human | v2.0 | CD8 | Huang et al. (2020) |
"human_v2.0_CD48" |
Human | v2.0 | CD4+CD8 | Huang et al. (2020) |
"mouse_v1.0_CD4" |
Mouse | v1.0 | CD4 | Huang et al. (2020) |
"mouse_v1.0_CD8" |
Mouse | v1.0 | CD8 | Huang et al. (2020) |
"mouse_v1.0_CD48" |
Mouse | v1.0 | CD4+CD8 | Huang et al. (2020) |
"gliph_reference" |
Human | v1.0 | CD4+CD8 | Legacy alias for human_v1.0_CD48 |
Each reference includes pre-computed V-gene usage and CDR3 length frequency distributions, which are automatically used for cluster scoring.
To analyse mouse data, supply mouse CDR3\(\beta\) sequences as
cdr3_sequences and set
refdb_beta = "mouse_v1.0_CD48" (or one of the CD4/CD8
subsets in the table). The reference is fetched and cached the first
time it is requested.
You can also supply your own reference as a data frame. A minimal
reference is a two-column table of CDR3\(\beta\) amino-acid sequences and their
corresponding V-gene names. Here we build a small one from the bundled
gliph_input_data for illustration:
## CDR3b TRBV
## 1 CAISEQGKGELFF TRBV10-3
## 2 CASSVRRERANTGELFF TRBV9
## 3 CASSVRRERANTGELFF TRBV9
findMotifs()The findMotifs() function searches for continuous and
discontinuous k-mer motifs in a set of CDR3 sequences. It is used
internally by runGLIPH() but can also be called
independently.
| Argument | Default | Description |
|---|---|---|
seqs |
– | Character vector of CDR3\(\beta\) sequences |
q |
2:4 |
Motif lengths to search |
kmer_mindepth |
NULL |
Minimum motif count to return |
discontinuous |
FALSE |
Include discontinuous motifs (with one variable position) |
data("gliph_input_data")
sample_seqs <- as.character(gliph_input_data$CDR3b[seq_len(100)])
# Find all 3-mers appearing at least 5 times
motifs <- findMotifs(seqs = sample_seqs,
q = 3,
kmer_mindepth = 5)
head(motifs[order(motifs$V1, decreasing = TRUE), ])## motif V1
## 49 CAS 85
## 46 ASS 77
## 45 QYF 34
## 30 SSL 27
## 19 LFF 26
## 21 GEL 22
Including discontinuous motifs (e.g., C.S where
. is any amino acid):
disc_motifs <- findMotifs(seqs = sample_seqs,
q = 2,
kmer_mindepth = 5,
discontinuous = TRUE
)
# Show discontinuous motifs (those containing a dot)
disc_only <- disc_motifs[grep("\\.", disc_motifs$motif), ]
head(disc_only[order(disc_only$V1, decreasing = TRUE), ])## motif V1
## 52 .S 233
## 71 S. 233
## 58 .A 180
## 64 A. 178
## 41 .F 163
## 24 G. 162
clusterScoring()The clusterScoring() function evaluates convergence
groups using up to five metrics. This is called automatically by
runGLIPH(), but can be re-run with different parameters on
existing results.
| Argument | Default | Description |
|---|---|---|
cluster_list |
– | Named list of cluster data frames (from
runGLIPH()$cluster_list) |
cdr3_sequences |
– | Original input data frame |
refdb_beta |
"human_v2.0_CD48" |
Reference database |
gliph_version |
1 | Scoring formula: 1 (GLIPH) or 2 (GLIPH2) |
sim_depth |
1000 | Resampling depth for score estimation |
The total score is derived from up to five components (depending on available data):
network.size.score: Probability of
observing a cluster of this size by chancecdr3.length.score: Enrichment of CDR3
length distribution within the clustervgene.score: Enrichment of V-gene
usage (requires TRBV column)clonal.expansion.score: Enrichment of
expanded clones (requires counts column)hla.score: Enrichment of shared HLA
alleles among donors (requires patient + HLA columns)deNovoTCRs()The deNovoTCRs() function generates artificial CDR3\(\beta\) sequences that resemble the
positional amino acid composition of a given convergence group. This can
be used to predict novel TCR sequences with similar binding
characteristics.
| Argument | Default | Description |
|---|---|---|
convergence_group_tag |
– | Tag identifying the cluster (from
cluster_properties$tag) |
clustering_output |
NULL |
Output list from runGLIPH() |
result_folder |
"" |
Alternative: load from files |
sims |
100,000 | Number of de novo sequences to generate |
num_tops |
1,000 | Return top N highest-scoring sequences |
normalization |
FALSE |
Normalize scores against the reference database |
make_figure |
FALSE |
Plot score vs. rank |
# Generate de novo TCRs for the first convergence group (if any found)
if (length(res_gliph1$cluster_list) > 0) {
de_novo <- deNovoTCRs(
convergence_group_tag = names(res_gliph1$cluster_list)[1],
clustering_output = res_gliph1,
refdb_beta = ref_df,
sims = 10000,
num_tops = 100,
make_figure = FALSE,
n_cores = 1
)
# Top predicted sequences
head(de_novo$de_novo_sequences)
# Positional weight matrix used for generation
head(de_novo$PWM_Scoring)
}plotNetwork()The plotNetwork() function creates an interactive
network visualization of the convergence groups using the visNetwork
package.
| Argument | Default | Description |
|---|---|---|
clustering_output |
NULL |
Output list from runGLIPH() |
color_info |
"total.score" |
Column name for node coloring |
color_palette |
viridis::viridis |
Color palette function |
local_edge_color |
"orange" |
Color for local similarity edges |
global_edge_color |
"#68bceb" |
Color for global similarity edges |
size_info |
NULL |
Column name for node sizing |
cluster_min_size |
3 | Minimum cluster size to display |
loadGLIPH()If you save results to disk using result_folder, you can
reload them later. We use tempdir() here so the example
does not write to your working directory:
out_dir <- file.path(tempdir(), "gliph_results")
res_saved <- runGLIPH(
cdr3_sequences = gliph_input_data[seq_len(200), ],
method = "gliph2",
refdb_beta = ref_df,
result_folder = out_dir,
sim_depth = 100,
n_cores = 1
)
reloaded <- loadGLIPH(result_folder = out_dir)
names(reloaded)## [1] "motif_enrichment" "global_enrichment" "connections"
## [4] "cluster_properties" "cluster_list" "parameters"
When result_folder is specified, runGLIPH()
writes several output files:
| File | Description |
|---|---|
local_similarities.txt |
Enriched motifs |
all_motifs.txt |
All tested motifs with statistics |
clone_network.txt |
Network edge list |
convergence_groups.txt |
Cluster properties and scores |
cluster_member_details.txt |
Full member information per cluster |
parameters.txt |
All parameters used |
When immApex is installed, immGLIPH automatically uses its C++-accelerated backends for two computationally intensive steps:
Motif enumeration (findMotifs()):
Uses immApex::calculateMotif() with OpenMP multithreading
instead of the pure-R stringdist::qgrams()
approach.
Global Hamming distance network (GLIPH1 method):
Uses immApex::buildNetwork() to compute pairwise distances
in a single C++ call, replacing the parallel loop over
stringdist::stringdist().
If immApex is not installed, immGLIPH falls back to the original pure-R implementations transparently, no code changes are needed.
Start with GLIPH2: The Fisher-based approach is generally more statistically rigorous than repeated random sampling.
Sample size matters: GLIPH works best with >200 unique CDR3\(\beta\) sequences. Very small samples may yield few or no convergence groups.
Include V-gene information: When available, TRBV data improves both global similarity detection and scoring accuracy.
Adjust sim_depth: For
publication-quality results, use sim_depth >= 1000. For
exploratory analysis, sim_depth = 100 is faster.
Parallelization: For large datasets (>5,000
sequences), set n_cores > 1 to use parallel processing
via BiocParallel.
Install immApex: For best performance, install immApex to enable C++-accelerated motif enumeration and network construction (see Performance section above).
Choose the right reference: For mouse data, use
refdb_beta = "mouse_v1.0_CD48". For specialized
repertoires, provide a custom data frame via
refdb_beta.
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SingleCellExperiment_1.35.1 SummarizedExperiment_1.43.0
## [3] Biobase_2.73.1 GenomicRanges_1.65.0
## [5] Seqinfo_1.3.0 IRanges_2.47.2
## [7] S4Vectors_0.51.3 BiocGenerics_0.59.6
## [9] generics_0.1.4 MatrixGenerics_1.25.0
## [11] matrixStats_1.5.0 scRepertoire_2.9.0
## [13] ggplot2_4.0.3 immGLIPH_0.99.5
## [15] BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.2.0 magrittr_2.0.5
## [4] otel_0.2.0 compiler_4.6.0 vctrs_0.7.3
## [7] reshape2_1.4.5 ggalluvial_0.12.6 gsl_2.1-9
## [10] quantreg_6.1 stringr_1.6.0 pkgconfig_2.0.3
## [13] fastmap_1.2.0 XVector_0.53.0 ggraph_2.2.2
## [16] rmarkdown_2.31 MatrixModels_0.5-4 purrr_1.2.2
## [19] xfun_0.58 cachem_1.1.0 jsonlite_2.0.0
## [22] DelayedArray_0.39.3 BiocParallel_1.47.0 tweenr_2.0.3
## [25] parallel_4.6.0 R6_2.6.1 bslib_0.11.0
## [28] stringi_1.8.7 RColorBrewer_1.1-3 parallelly_1.47.0
## [31] jquerylib_0.1.4 Rcpp_1.1.1-1.1 knitr_1.51
## [34] future.apply_1.20.2 iNEXT_3.0.2 Matrix_1.7-5
## [37] splines_4.6.0 igraph_2.3.2 tidyselect_1.2.1
## [40] stringdist_0.9.17 abind_1.4-8 yaml_2.3.12
## [43] viridis_0.6.5 codetools_0.2-20 listenv_0.10.1
## [46] lattice_0.22-9 tibble_3.3.1 plyr_1.8.9
## [49] withr_3.0.2 S7_0.2.2 evaluate_1.0.5
## [52] survival_3.8-6 future_1.70.0 polyclip_1.10-7
## [55] pillar_1.11.1 BiocManager_1.30.27 sp_2.2-1
## [58] scales_1.4.0 globals_0.19.1 immApex_1.7.0
## [61] glue_1.8.1 maketools_1.3.2 tools_4.6.0
## [64] sys_3.4.3 SparseM_1.84-2 buildtools_1.0.0
## [67] graphlayouts_1.2.3 dotCall64_1.2 tidygraph_1.3.1
## [70] grid_4.6.0 tidyr_1.3.2 evmix_2.12
## [73] ggforce_0.5.0 cli_3.6.6 spam_2.11-4
## [76] S4Arrays_1.13.0 viridisLite_0.4.3 ggdendro_0.2.0
## [79] dplyr_1.2.1 gtable_0.3.6 sass_0.4.10
## [82] digest_0.6.39 progressr_0.19.0 SparseArray_1.13.2
## [85] ggrepel_0.9.8 rjson_0.2.23 SeuratObject_5.4.0
## [88] farver_2.1.2 memoise_2.0.1 htmltools_0.5.9
## [91] lifecycle_1.0.5 MASS_7.3-65
runGLIPH() Function
findMotifs()
clusterScoring()
deNovoTCRs()
plotNetwork()
loadGLIPH()