In this vignette we demonstrate clustering of complementarity-determining region 3 (CDR3) sequences and V-J gene identity of mouse T cells, ways to visualize and explore expanded clusters, pairing of alpha-beta clusters, tests of differential CDR3 usage, and permutation tests for overall clonal properties.

library(CellaRepertorium)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
library(tidyr)
library(stringr)
library(purrr)

1 Load filtered contig files

We begin with a data.frame of concatenated contig files (‘all_contig_annotations.csv’), output from the Cellranger VDJ pipeline.

data(contigs_qc)
MIN_CDR3_AA = 6


cdb = ContigCellDB_10XVDJ(contigs_qc, contig_pk = c('barcode', 'pop', 'sample', 'contig_id'), cell_pk = c('barcode', 'pop', 'sample'))
cdb
#> ContigCellDB of 1508 contigs; 832 cells; and 0 clusters.
#> Contigs keyed by barcode, pop, sample, contig_id; cells keyed by barcode, pop, sample.

Initially we start with 832 cells and 1508 contigs. We keep contigs that are

full-length
productive
high-confidence
only from T cells
and with CDR3 sufficiently long.

Then we add a descriptive readable name for each contig.

cdb$contig_tbl = dplyr::filter(cdb$contig_tbl, full_length, productive == 'True', high_confidence, chain != 'Multi', str_length(cdr3) > MIN_CDR3_AA) %>% mutate( fancy_name = fancy_name_contigs(., str_c(pop, '_', sample)))

After filtering, there are 832 cells and 1496 contigs.

2 Clustering contigs by sequence characteristics

As a first step toward defining clonotypes, we find equivalence classes of CDR3 sequences with the program CD-HIT. In this case, we use the translated amino acid residues, but often one might prefer to use the DNA sequences, by setting the sequence_key accordingly and type = 'DNA'. Additionally, a higher identity threshold might be appropriate (see below).

aa80 = cdhit_ccdb(cdb, sequence_key = 'cdr3', type = 'AA', cluster_pk = 'aa80', 
                  identity = .8, min_length = 5, G = 1)
aa80 = fine_clustering(aa80, sequence_key = 'cdr3', type = 'AA', keep_clustering_details = TRUE)
#> Calculating intradistances on 988 clusters.
#> Summarizing

This partitions sequences into sets with >80% mutual similarity in the amino acid sequence, adds some additional information about the clustering, and returns it as a ContigCellDB object named aa80. The primary key for the clusters is aa80. The min_length can be set somewhat smaller, but there is a lower limit for the cdhit algorithm. G=1, the default, specifies a global alignment. This is almost always what is desired, but local alignment is available if G=0.
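To make the idea of an identity threshold concrete, here is a minimal base-R sketch of greedy clustering in the spirit of CD-HIT. It is not the CD-HIT algorithm and not what cdhit_ccdb runs: identity is approximated here as one minus the Levenshtein distance divided by the length of the longer sequence, and the function greedy_cluster and the toy sequences are hypothetical names introduced for illustration.

```r
# Toy greedy clustering (an approximation, NOT the real CD-HIT algorithm).
# Assumption: identity ~ 1 - Levenshtein distance / length of the longer sequence.
greedy_cluster = function(seqs, identity = 0.8) {
  ord = order(nchar(seqs), decreasing = TRUE)  # visit longest sequences first
  reps = character(0)                          # current cluster representatives
  cluster = integer(length(seqs))
  for (i in ord) {
    assigned = FALSE
    for (k in seq_along(reps)) {
      d = drop(utils::adist(seqs[i], reps[k]))
      ident = 1 - d / max(nchar(seqs[i]), nchar(reps[k]))
      if (ident >= identity) {
        cluster[i] = k        # join the first representative that is similar enough
        assigned = TRUE
        break
      }
    }
    if (!assigned) {
      reps = c(reps, seqs[i]) # otherwise found a new cluster
      cluster[i] = length(reps)
    }
  }
  cluster
}

# Hypothetical toy input: the first two sequences differ by one residue.
toy_cdr3 = c("CAVSRASSGSWQLIF", "CAVSRASSGSWQLTF", "CATGNYAQGLTF", "CVVGDRGSALGRLHF")
toy_clusters = greedy_cluster(toy_cdr3, identity = 0.8)
```

In this toy input the first two sequences share a cluster and the other two each form their own, which mirrors how most clusters in the real data contain a single contig.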

head(aa80$cluster_tbl)
#> # A tibble: 6 × 4
#>    aa80 avg_distance fc               n_cluster
#>   <dbl>        <dbl> <list>               <int>
#> 1     1            0 <named list [5]>         1
#> 2     2            0 <named list [5]>         1
#> 3     3            0 <named list [5]>         2
#> 4     4            0 <named list [5]>         1
#> 5     5            0 <named list [5]>         1
#> 6     6            0 <named list [5]>         1
head(aa80$contig_tbl) %>% select(contig_id, aa80, is_medoid, `d(medoid)`)
#> # A tibble: 6 × 4
#>   contig_id                    aa80 is_medoid `d(medoid)`
#>   <chr>                       <dbl> <lgl>           <dbl>
#> 1 ATCTACTCAGTATGCT-1_contig_3     1 TRUE                0
#> 2 ACTGTCCTCAATCACG-1_contig_3     2 TRUE                0
#> 3 CACCTTGTCCAATGGT-1_contig_2     3 TRUE                0
#> 4 CACCTTGTCCAATGGT-1_contig_2     3 FALSE               0
#> 5 CGGACGTGTTCATGGT-1_contig_1     4 TRUE                0
#> 6 CTGCTGTTCCCTAATT-1_contig_4     5 TRUE                0

The cluster_tbl lists the 988 80% identity groups found, including the number of contigs in each cluster and the average distance between elements in the group. In the contig_tbl, two columns have been added: is_medoid, which flags whether the contig is the most representative element of its cluster, and d(medoid), the distance from the contig to that medoid element.
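As a rough illustration of what a medoid is, the following base-R sketch picks the element with the smallest total Levenshtein distance to the other members of a cluster. The sequences are hypothetical, and the distance used by fine_clustering may differ from utils::adist, so treat this as a sketch of the concept rather than the package's computation.

```r
# Hypothetical cluster of four similar CDR3 amino acid sequences.
toy_cdr3 = c("CAVSRASSGSWQLIF", "CAVSRASSGSWQLTF", "CAVSRASSGSWQLIY", "CAVSRASSGSWQLIW")

# Pairwise Levenshtein distances (base R).
d = utils::adist(toy_cdr3)

# The medoid minimizes the summed distance to all other members.
medoid_idx = which.min(rowSums(d))
is_medoid = seq_along(toy_cdr3) == medoid_idx

# Distance of every member to the medoid, analogous to the d(medoid) column.
d_medoid = d[, medoid_idx]
```

Here the first sequence is one substitution away from each of the others, so it is selected as the medoid.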

cluster_plot(aa80)
#> Loading required namespace: cowplot
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.1 Cluster CDR3 DNA sequences

cdb = cdhit_ccdb(cdb, 'cdr3_nt', type = 'DNA', cluster_pk = 'DNA97', identity = .965, min_length = MIN_CDR3_AA*3-1, G = 1)
cdb = fine_clustering(cdb, sequence_key = 'cdr3_nt', type = 'DNA')
#> Calculating intradistances on 1342 clusters.
#> Summarizing

cluster_plot(cdb)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also cluster by DNA identity.

2.2 Cluster by V-J identity

germline_cluster = cluster_germline(cdb, segment_keys = c('v_gene', 'j_gene', 'chain'), cluster_pk = 'segment_idx')
#> Warning in replace_cluster_tbl(ccdb, cluster_tbl, cl_con_tbl, cluster_pk =
#> cluster_pk): Replacing `cluster_tbl` with DNA97.

We can cluster by any other feature of the contigs. Here we cluster each contig based on the chain and V-J genes. This gives us the set of observed V-J pairings:

germline_cluster = fine_clustering(germline_cluster, sequence_key = 'cdr3_nt', type = 'DNA')
#> Calculating intradistances on 700 clusters.
#> Summarizing
#> Warning in left_join_warn(d_medoid, contig_tbl, by = ccdb$contig_pk, overwrite =
#> TRUE): Overwriting fields d(medoid), is_medoid in table contig_tbl
filter_cdb(germline_cluster, chain == 'TRB') %>% plot_cluster_factors(factors = c('v_gene','j_gene'), statistic = 'contigs', type = 'heatmap')

Number of pairs. The Pearson residual (the difference from the counts expected given the marginals) is probably more informative; set statistic = 'residual' for this.
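For reference, a Pearson residual for a V-J count table can be computed by hand as (observed - expected) / sqrt(expected), where the expected counts come from the row and column margins. The sketch below uses a hypothetical 2 x 2 table with made-up gene labels; it does not reproduce how plot_cluster_factors computes its statistic internally.

```r
# Hypothetical V-by-J contig counts (labels are placeholders, not real genes).
toy_counts = matrix(c(30, 10, 10, 30), nrow = 2,
                    dimnames = list(v_gene = c("V_a", "V_b"), j_gene = c("J_a", "J_b")))

# Expected counts under independence of V and J, given the margins.
expected = outer(rowSums(toy_counts), colSums(toy_counts)) / sum(toy_counts)

# Pearson residuals: positive values mark pairings seen more often than expected.
pearson = (toy_counts - expected) / sqrt(expected)
```

The same values are returned by chisq.test(toy_counts)$residuals in base R.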

ggplot(germline_cluster$cluster_tbl %>% filter(chain == 'TRB'), aes(x = v_gene, y = j_gene, fill = avg_distance)) + geom_tile() + theme(axis.text.x = element_text(angle = 90))

Average Levenshtein distance of CDR3 within each pair. This might be turned into a z-score by fitting a weighted linear model with sum-to-zero contrasts and returning the studentized residuals. This could determine whether a pairing has an unexpectedly small, or large, within-cluster distance.
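The z-score idea above could be sketched as follows. This is one possible reading of the suggestion, run on simulated data: the data frame toy, its simulated columns, and the choice of cluster size as the weight are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in for cluster_tbl: one row per V-J pairing (hypothetical labels).
toy = expand.grid(v_gene = paste0("V", 1:4), j_gene = paste0("J", 1:3))
toy$n_cluster = sample(2:20, nrow(toy), replace = TRUE)  # contigs per pairing
toy$avg_distance = rnorm(nrow(toy), mean = 5, sd = 1)    # simulated average distance

# Additive model in V and J with sum-to-zero contrasts, weighted by cluster size.
fit = lm(avg_distance ~ v_gene + j_gene, data = toy, weights = n_cluster,
         contrasts = list(v_gene = "contr.sum", j_gene = "contr.sum"))

# Studentized residuals serve as approximate z-scores for each pairing.
toy$z = rstudent(fit)
```

Pairings with a large absolute z would be candidates for an unusually tight or unusually diverse set of CDR3 sequences, subject to the usual caveats about small clusters.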

2.3 Expanded clusters

Next, we will examine the clusters that are found in many contigs. First we will get a canonical contig to represent each cluster. This will be the medoid contig, by default.

aa80 = canonicalize_cluster(aa80, representative = 'cdr3', contig_fields = c('cdr3', 'cdr3_nt', 'chain', 'v_gene', 'd_gene', 'j_gene'))
#> Filtering `contig_tbl` by `is_medoid`, override by setting `contig_filter_args == TRUE`

aa80 now includes the fields listed in contig_fields in the cluster_tbl, using the values found in the medoid contig.

MIN_OLIGO = 7
oligo_clusters = filter(aa80$cluster_tbl, n_cluster >= MIN_OLIGO)
oligo_contigs = aa80
oligo_contigs$contig_tbl = semi_join(oligo_contigs$contig_tbl, oligo_clusters, by = 'aa80')
oligo_contigs
#> ContigCellDB of 54 contigs; 832 cells; and 4 clusters.
#> Contigs keyed by barcode, pop, sample, contig_id; cells keyed by barcode, pop, sample.

Get contigs/cells/clusters found at least 7 times (across contigs). Note that replacing contig_tbl with the subset selected by the semi_join also automatically subset the cluster_tbl (4 clusters remain), while the printed summary still reports all 832 cells.

oligo_clusters = oligo_contigs$contig_tbl %>% group_by(aa80) %>% summarize(`n subjects observed` = length(unique(sample))) %>% left_join(oligo_clusters)
#> Joining, by = "aa80"

knitr::kable(oligo_clusters %>% select(aa80:cdr3, chain:j_gene, avg_distance, n_cluster))

aa80	n subjects observed	cdr3	chain	v_gene	d_gene	j_gene	avg_distance	n_cluster
111	6	CVVGDRGSALGRLHF	TRA	TRAV11	None	TRAJ18	0.6071429	28
172	5	CAVSRASSGSWQLIF	TRA	TRAV9N-3	None	TRAJ22	2.1111111	9
296	6	CAASASSGSWQLIF	TRA	TRAV14D-2	None	TRAJ22	1.5000000	8
808	4	CATGNYAQGLTF	TRA	TRAV8D-2	None	TRAJ26	1.3333333	9

Report some statistics about these expanded clusters, such as how often they are found, how many subjects, etc.

oligo_plot = ggplot(oligo_contigs$contig_tbl, aes(x = representative, fill = chain)) + geom_bar() + coord_flip() + scale_fill_brewer(type = 'qual') + theme_minimal()
oligo_plot

These always come from a single chain.

oligo_plot + aes(fill =   sample) + facet_wrap(~pop)

But come from multiple populations and samples.

2.4 Some simple phylogenetic relationships

By using the within-cluster distances, some rudimentary plots attempting to show phylogenetic associations are possible. (These are most biologically appropriate for B cells that undergo somatic hypermutation.)

library(ggdendro)

dendro_plot = function(ccdb, idx, method = 'complete'){
    # fc holds the clustering details for this cluster, including distance_mat
    h = filter(ccdb$cluster_tbl, !!sym(ccdb$cluster_pk) == idx) %>% pull(fc) %>% .[[1]]
    # contigs belonging to the cluster, used to label the leaves
    quer = filter(ccdb$contig_tbl, !!sym(ccdb$cluster_pk) == idx)
    hc = hclust(as.dist(h$distance_mat), method = method) %>% dendro_data(type = "rectangle")
    hc$labels = cbind(hc$labels, quer)
    ggplot(hc$segments, aes(x=x, y=y)) + geom_segment(aes(xend=xend, yend=yend)) + 
        theme_classic() + 
        geom_text(data = hc$labels, aes(color = sample, label = fancy_name), size = 3, angle = 60, hjust =0, vjust = 0) + 
        scale_x_continuous(breaks = NULL) + ylab('AA Distance') + xlab('')
}

to_plot = aa80$cluster_tbl %>% filter(min_rank(-n_cluster) == 1)

map(to_plot$aa80, ~ dendro_plot(aa80, .))
#> [[1]]

A full-blown generative model of clonal generation and selection would be recommended for any actual analysis, but these plots may suffice to get a quick idea of the phylogenetic structure.

2.5 Formal testing for frequency differences

We can test for differential usage of a clone or cluster with cluster_logistic_test and cluster_test_by. The latter splits the cluster_tbl by fields = 'chain', thereby adjusting the number of cell trials included in the “denominator” of the logistic regression. The formula tests for differences between populations, including the sample as a random effect, and only tests clusters that are included in the oligo_clusters set.

mm_out = cluster_test_by(aa80, fields = 'chain', tbl = 'cluster_tbl', formula = ~ pop + (1|sample), filterset = cluster_filterset(white_list = oligo_clusters)) %>%
  left_join(oligo_clusters)
#> Fitting mixed logistic models to 4 clusters.
#> Loading required namespace: broom
#> Loading required namespace: lme4
#> Loading required namespace: broom.mixed
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> Fitting mixed logistic models to 0 clusters.
#> Joining, by = c("chain", "aa80")

mm_out = mutate(mm_out, conf.low = estimate-1.96*std.error, 
                conf.high = estimate + 1.96*std.error)

mm_outj = filter(ungroup(mm_out), term == 'popbalbc') %>% arrange(desc(representative))

ggplot(mm_outj, aes(x = representative, ymin = conf.low, ymax = conf.high, y = estimate)) + geom_pointrange()  + coord_flip() + theme_minimal() + geom_hline(yintercept = 0, lty = 2) + xlab("Isomorph") + ylab("log odds of isomorph")

We test if the binomial rate of clone expression differs between balbc and b6, for the selected clones. None appear to be different.
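To show the shape of the test for a single cluster, here is a simplified base-R sketch using a fixed-effects binomial GLM on made-up per-sample counts. It deliberately omits the random sample effect that the formula above includes (which would need lme4), and all counts and column names are hypothetical.

```r
# Hypothetical per-sample counts: cells carrying the cluster vs. all cells of that chain.
toy = data.frame(
  pop = c("b6", "b6", "b6", "balbc", "balbc", "balbc"),
  sample = paste0("s", 1:6),
  n_in_cluster = c(5, 4, 6, 4, 5, 3),
  n_cells = c(70, 65, 72, 60, 66, 63)
)

# Binomial GLM: successes are cells in the cluster, failures are the remaining cells.
fit = glm(cbind(n_in_cluster, n_cells - n_in_cluster) ~ pop, family = binomial, data = toy)

# Log odds ratio for balbc vs. b6 with a Wald 95% interval, as in the plot above.
est = coef(summary(fit))["popbalbc", ]
ci = est["Estimate"] + c(-1.96, 1.96) * est["Std. Error"]
```

With these toy counts the interval spans zero, the same qualitative conclusion as for the selected clones, though the numbers themselves carry no meaning.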

2.6 Length of CDR3

aa80$contig_tbl = aa80$contig_tbl %>% mutate(cdr3_length = str_length(cdr3_nt))
ggplot(aa80$contig_tbl, aes(fill = pop, x= cdr3_length)) +
  geom_histogram(binwidth = 1, mapping = aes(y = ..density..)) + 
  theme_minimal() + scale_fill_brewer(type = 'qual') + 
  facet_grid(sample ~chain) + theme(strip.text.y = element_text(angle = 0)) + coord_cartesian(xlim = c(25, 55))

Some authors have noted that the length of the CDR3 region can be predictive of T cell differentiation. In our study, there doesn’t appear to be a noticeable difference between BALB/c and C57BL/6J (b6) mice, but if we needed to make sure, an appropriate procedure would be to run a mixed model with a random sample effect (assumed to represent a biological replicate).

cdr_len = aa80$contig_tbl %>% group_by(chain) %>% do(broom::tidy(lme4::lmer(cdr3_length ~ pop + (1|sample), data = .), conf.int = TRUE))
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
ggplot(cdr_len %>% filter(term == 'popbalbc'), aes(x = interaction(chain, term), y = estimate, ymin = conf.low, ymax = conf.high)) + 
  geom_pointrange() + theme_minimal() + coord_flip() + 
  ylab('Length(CDR3 Nt)') + xlab('Term/Chain') + geom_hline(yintercept = 0, lty = 2)

We end up with a (harmless) warning about a singular fit. This is expected, because the samples aren’t actually replicates; they are just subsamples drawn for illustrative purposes. The BALB/c mice have about 0.5 fewer nucleotides per contig, on average, and this difference is not significant.

3 Clonal pairs

Next, we can examine the pairing between \(\alpha-\beta\) chains and see if any pairs are found more than once.

aa80$cluster_pk = 'representative'
aa80 = rank_prevalence_ccdb(aa80)
pairing_list = pairing_tables(aa80, table_order = 2, orphan_level = 1, min_expansion = 3, cluster_keys = c('cdr3', 'representative', 'chain', 'v_gene', 'j_gene', 'avg_distance'))

pairing_tables finds all contig combinations of order table_order across cells. It reports the combinations that occur at least min_expansion times (the expanded combinations), together with any other combinations that share a cluster with an expanded combination.
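The core counting step can be pictured with a small base-R sketch: pair the TRB and TRA cluster of each cell, tabulate the pairs, and keep those seen at least a minimum number of times. This is only an illustration on hypothetical data; it ignores orphan handling and the ranking that pairing_tables performs.

```r
# Hypothetical contigs: one TRA and one TRB cluster label per cell.
toy = data.frame(
  cell = c("c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4"),
  chain = rep(c("TRA", "TRB"), 4),
  cluster = c("a1", "b1", "a1", "b1", "a1", "b1", "a2", "b1"),
  stringsAsFactors = FALSE
)

# One row per cell, with the TRB cluster as index 1 and the TRA cluster as index 2.
tra = toy[toy$chain == "TRA", c("cell", "cluster")]
trb = toy[toy$chain == "TRB", c("cell", "cluster")]
pairs = merge(trb, tra, by = "cell", suffixes = c(".1", ".2"))

# Count each TRB/TRA combination and keep those seen at least min_expansion times.
min_expansion = 3
pair_counts = as.data.frame(table(TRB = pairs$cluster.1, TRA = pairs$cluster.2),
                            stringsAsFactors = FALSE)
expanded = pair_counts[pair_counts$Freq >= min_expansion, ]
```

In the toy data only the b1/a1 pairing reaches the threshold, being found in three of the four cells.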


pairs_plt = ggplot(pairing_list$cell_tbl, aes(x = cluster_idx.1_fct, y = cluster_idx.2_fct)) + geom_jitter(aes(color = sample, shape = pop), width = .2, height = .2) + theme_minimal() + xlab('TRB') + ylab('TRA') + theme(axis.text.x = element_text(angle = 45))

pairs_plt = map_axis_labels(pairs_plt, pairing_list$idx1_tbl, pairing_list$idx2_tbl, aes_label  = 'chain')
pairs_plt

3.1 Expanded clones

whitelist = oligo_clusters %>% dplyr::select(cluster_idx.1 = representative) %>% unique()
pairing_list = pairing_tables(aa80, table_order = 2, orphan_level = 1, min_expansion = Inf, cluster_whitelist = whitelist,  cluster_keys = c('cdr3', 'representative', 'chain', 'v_gene', 'j_gene', 'avg_distance'))

pairs_plt = ggplot(pairing_list$cell_tbl, aes(x = cluster_idx.1_fct, y = cluster_idx.2_fct)) + geom_jitter(aes(color = sample, shape = pop), width = .2, height = .2) + theme_minimal() + xlab('TRB') + ylab('TRA') + theme(axis.text.x = element_text(angle = 45))

pairs_plt = map_axis_labels(pairs_plt, pairing_list$idx1_tbl, pairing_list$idx2_tbl, aes_label  = 'chain')
pairs_plt

By setting min_expansion = Inf, cluster_whitelist = whitelist we can examine any pairings for a set of cluster_idx, in this case the ones that were seen multiple times. Interestingly (and unlike some human samples) the expanded clusters are \(\beta\)-chain, and their \(\alpha\) chains are sprinkled quite evenly across clusters.

4 Permutation tests

Permutation tests allow tests of independence between cluster assignments and other cell-level covariates (such as the sample from which the cell was derived). The cluster label is permuted to break the link between cell and cluster, and an arbitrary statistic of both the cluster label and the cell covariate is evaluated.

aa80_chain = split_cdb(aa80, 'chain') %>% lapply(canonicalize_cell, contig_fields = 'aa80')

compare_expanded = function(cluster_idx, grp){
  # cluster_idx contains the permuted cluster assignments
  # grp the cell_covariate_keys.
  # NB: this is always a data.frame even if it is just a single column
  # cross tab by pop
  tab = table(cluster_idx, grp[[1]])
  # count number of times an aa80 class was expanded
  expanded = colSums(tab>=2)
  # compare difference
  expanded['b6'] - expanded['balbc']
}

The statistic should be a function that takes a vector cluster_idx of cluster labels and a data.frame of cell covariates.
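The mechanics of such a test can be sketched in a few lines of base R. The helper permute_test_sketch below is hypothetical and is not the implementation of cluster_permute_test; in particular, the two-sided p-value with an add-one correction is a common convention assumed here, not something taken from the package.

```r
# Hypothetical helper: permute labels, recompute the statistic, compare to the observed value.
permute_test_sketch = function(labels, covariates, statistic, n_perm = 200) {
  observed = statistic(labels, covariates)
  permuted = replicate(n_perm, {
    shuffled = labels[sample.int(length(labels))]  # break the cell/cluster link
    statistic(shuffled, covariates)
  })
  expected = mean(permuted)
  # Assumed convention: two-sided p-value around the permutation mean, add-one corrected.
  extreme = sum(abs(permuted - expected) >= abs(observed - expected))
  list(observed = observed, expected = expected, p.value = (1 + extreme) / (1 + n_perm))
}

# Toy statistic with the same signature as compare_expanded: (vector, data.frame).
n_expanded_diff = function(cluster_idx, grp) {
  tab = table(cluster_idx, grp[[1]])
  expanded = colSums(tab >= 2)
  unname(expanded["b6"] - expanded["balbc"])
}

set.seed(42)
toy_labels = c(rep("k1", 4), rep("k2", 2), paste0("u", 1:14))  # hypothetical cluster labels
toy_cov = data.frame(pop = rep(c("b6", "balbc"), each = 10))
res = permute_test_sketch(toy_labels, toy_cov, n_expanded_diff, n_perm = 200)
```

Passing the covariates as a data.frame, even for a single column, matches the convention noted in the comments of compare_expanded.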

set.seed(1234)
perm1 = cluster_permute_test(aa80_chain$TRB, cell_covariate_keys = 'pop', cell_label_key = 'aa80', n_perm = 100, statistic = compare_expanded)

perm1
#> $observed
#> b6 
#>  5 
#> 
#> $expected
#> [1] 4.92
#> 
#> $p.value
#> [1] 0.86
#> 
#> $mc.se
#> [1] 0.7172732
#> 
#> $statistics
#>   [1]   5   0   2   6   6   9  10   6  23  -3   5 -10  12  16   4  17  -9   0
#>  [19]   6   5  14  -9  21   0  -1  -5   6  -7  13   6   2   9  -5   6   1   9
#>  [37]  -3  10   7   9   8   5   8   4  -7   4  10   8   5   1   4  -1   6   9
#>  [55]  24   3   4  11   6  10   8   7  -5   4  10   2  10 -15   1   4   2   3
#>  [73]  11   1   9  -8  12   7   2   9  -2   9   4  -2  -1  11   8  -4   6  18
#>  [91]   6   5   1  11  15 -11  10  -5  13  11
#> 
#> $call
#> .cluster_permute_test(labels = label, covariates = covariates, 
#>     strata = strata, statistic = statistic, n_perm = n_perm, 
#>     alternative = alternative)
#> 
#> attr(,"class")
#> [1] "PermuteTest"

Although b6 mice had 5 more clones observed to be expanded (occurring >=2 times) than balbc, this is not significant under a null model where cells were permuted between mouse types (populations), where b6 are expected to have about 5 more expanded clones, just due to the additional number of cells sampled in b6 and the particular spectrum of clonal frequencies in this experiment:

knitr::kable(table(pop = aa80_chain$TRB$pop))

pop	Freq
b6	398
balbc	377

Indeed if we resample in a way that fixes each group to have the same number of cells:

rarify = aa80_chain$TRB$cell_tbl %>% group_by(pop) %>% do(slice_sample(., n = 377))

aa80_chain$TRB$cell_tbl = semi_join(aa80_chain$TRB$cell_tbl, rarify)
#> Joining, by = c("aa80", "barcode", "pop", "sample")

cluster_permute_test(aa80_chain$TRB, cell_covariate_keys = 'pop', cell_label_key = 'aa80', n_perm = 500, statistic = compare_expanded)
#> $observed
#> b6 
#> -1 
#> 
#> $expected
#> [1] 0.726
#> 
#> $p.value
#> [1] 0.74
#> 
#> $mc.se
#> [1] 0.3024381
#> 
#> $statistics
#>   [1]   9   1  -8   1  11   6  -9  -6  -4   6  -5   4   0  -7   2  -8  -5  10
#>  [19]  -7  -6 -10   6  -6   5   7  -4   0  -2  -3   8   5  -1   1   4  -1  -8
#>  [37]  -5   4  -1  -3   6  -6   4   2  -3  -1 -16   8  16  -1   0  -6  -1  -9
#>  [55] -10  -1  -3  -6  -8  -4  13   0  -5   3   1  -1   8 -14   3   2  -7  -7
#>  [73]   2   0   7  -7   5   8   1   1   0   3  10  -4   6   0   8   5  -3   6
#>  [91]  10   7   5  -3  10  -3  -3   8  -2   6   2  10   2  -5   1   3   2   2
#> [109]  14  -5   1  -5 -10  -4 -14  11   5   4  -6   3   8  -8  12 -13  -7   6
#> [127]  -2  13  -5   2 -17   2  -5  11   6  -4   6  14   3  -5  -9 -14  10  -4
#> [145]   2 -10  -5   3  -8  -9  -8 -12   4  10  11   5   0  -9  -5  -5   5   7
#> [163]  -5   5   7  -9  -2   1   2   3 -11  12   0   1 -10   0   2   4 -11 -11
#> [181]  11   2  -7  15   0  -6   4  21 -12   4   2 -10   9  -4   2  -9   0   2
#> [199]  -4   3  -4  -2  -1  -3  -1 -12   6  11  -7  -7   5   4   0   2   3  -2
#> [217]  -4  -1   8   3   5  -2  10   5  -3  -4  -9   2  -4   4  -4  -3  -4  -1
#> [235]  -7   7  -6   8  -4  -5  -9   2   0   6   7  -2   6   9  -2   0   0   5
#> [253] -13   9   3  -9   1  -5   0 -12  -6   4 -13  -1  -1  -2  -8  16   6  -3
#> [271] -10   1  -7   5 -17   3 -12 -10  15   2  12  -5  -6   0  -3   6  -3   2
#> [289] -14  -3   1  -2  -8   6  10   6   6   0  -3   5   5  -2  12  -8   1   2
#> [307]  -9   1 -10  -4  -2   0  -6   6   0  -3   0 -16  -5   8  -3   1   1  10
#> [325]   1  -8  12   0  -7   7  -1  -1   0   1  -4  11  -1   6   0   5   6   0
#> [343]  -7  18  -5   5  -3  -5   0  -6   3   4  10  -6  12   2   0   8  16   2
#> [361]  -4   6  -3   4  -4  10  -2  -6   3   7  10   1   9   5   6   2   0   4
#> [379]   2  -4   4   7   6   0   6  -4  -4   7  -6  -1  -2  -3   5  -1   3   0
#> [397]   9   3  -7 -10   1  -1  -3 -12  -1   8   9   2   4  -1  11  11  -4  -2
#> [415]   6   2   1   3   6  -7  -5  -3   0  -7   3  10   7  -2  11  -4  -4   4
#> [433]   1  16   8   0   1   4   2  16  -8  -3  14   0 -11   6  -8  -1   7   0
#> [451]  11  -1   3   8   5  -4   6   3   1  -2  14  11  13   4   6  -1   2   8
#> [469]   9   3   4  14  -1   9  12   9   7   5 -11  -6  11   0   2   5  17  -7
#> [487]  -3   7   6  -7   0  14   6   5  -2  -7  14   9   0  -4
#> 
#> $call
#> .cluster_permute_test(labels = label, covariates = covariates, 
#>     strata = strata, statistic = statistic, n_perm = n_perm, 
#>     alternative = alternative)
#> 
#> attr(,"class")
#> [1] "PermuteTest"

We see that this discrepancy in the number of expanded clones between subpopulations is mostly explained by a greater number of cells sampled in b6, but random variability also plays a role.

We can also test for oligoclonality, e.g., how often a beta chain is expanded in a sample:

count_expanded = function(cluster_idx, grp){
  # clusters x sample contingency table
  tab = table(cluster_idx, grp[[1]])
  # number of cluster x sample combinations that occurred more than once
  expanded = sum(tab>1)
  expanded
}

perm3 = cluster_permute_test(aa80_chain$TRB,  cell_covariate_keys = 'sample', cell_label_key = 'aa80', n_perm = 500, statistic = count_expanded)
perm3
#> $observed
#> [1] 27
#> 
#> $expected
#> [1] 37.046
#> 
#> $p.value
#> [1] 0.032
#> 
#> $mc.se
#> [1] 0.1999593
#> 
#> $statistics
#>   [1] 43 40 37 37 38 28 39 42 33 39 38 40 35 44 37 27 37 36 40 39 39 42 38 34 42
#>  [26] 35 34 37 37 43 36 38 29 36 37 39 44 41 42 36 40 50 35 36 42 33 41 40 41 34
#>  [51] 36 39 26 36 38 36 41 36 43 38 35 38 45 37 33 24 36 38 35 38 39 40 38 43 41
#>  [76] 35 29 39 38 31 41 35 33 32 32 30 42 34 40 39 41 46 46 40 37 42 35 33 38 28
#> [101] 39 33 39 42 37 42 36 41 49 38 30 35 37 37 43 37 40 36 31 34 36 36 37 30 45
#> [126] 40 36 35 34 31 39 43 36 35 30 44 40 39 39 35 34 37 35 37 36 36 41 37 44 42
#> [151] 42 36 36 41 38 35 36 43 34 37 34 36 41 28 31 45 40 29 32 33 32 33 38 40 34
#> [176] 44 36 36 30 38 38 37 38 33 31 37 37 33 32 34 37 37 41 39 37 38 37 33 32 39
#> [201] 36 36 36 44 34 34 30 45 30 43 33 41 35 41 32 37 38 43 30 38 35 28 32 40 43
#> [226] 42 37 32 35 39 42 42 41 31 32 31 37 43 41 34 35 42 37 37 39 34 36 39 36 32
#> [251] 42 42 38 38 48 36 40 40 31 41 42 36 33 38 36 36 30 43 34 35 24 37 34 47 44
#> [276] 38 42 38 41 39 37 33 39 34 39 38 33 39 30 33 39 37 35 40 40 44 39 32 35 38
#> [301] 38 31 36 32 33 41 28 38 36 42 42 30 36 24 39 36 25 39 33 33 37 37 38 37 44
#> [326] 42 46 39 43 28 35 28 43 41 38 49 37 36 40 44 35 44 30 33 33 36 32 37 38 40
#> [351] 39 43 38 34 40 39 38 37 39 35 36 41 37 32 26 36 42 39 45 40 39 37 43 39 37
#> [376] 33 31 34 40 31 36 42 41 45 45 39 41 39 37 28 31 36 37 39 33 41 39 43 45 35
#> [401] 35 39 38 32 34 34 36 41 40 32 28 34 33 30 43 38 38 43 39 34 34 40 34 38 34
#> [426] 31 29 39 44 38 39 42 36 41 42 42 35 30 35 45 36 35 38 37 33 41 41 39 35 30
#> [451] 40 38 33 32 39 30 26 36 35 38 34 35 40 34 36 40 38 34 34 36 43 24 36 34 43
#> [476] 33 29 32 34 31 39 40 35 41 37 35 37 39 32 42 43 41 27 48 36 34 42 35 43 47
#> 
#> $call
#> .cluster_permute_test(labels = label, covariates = covariates, 
#>     strata = strata, statistic = statistic, n_perm = n_perm, 
#>     alternative = alternative)
#> 
#> attr(,"class")
#> [1] "PermuteTest"

27 expanded cluster-by-sample combinations were observed vs about 37 expected, and this discrepancy would be significant at \(p<\) 0.04. This indicates underdispersion: fewer clusters are expanded than expected, given the spectrum of clonal frequencies and the number of cells per sample.

To further elucidate this, we can restrict the permutations to maintain certain margins of the table by specifying cell_stratify_keys. This doesn’t affect the observed values of the statistics, but will change the expected values (since these are now conditional expectations). Here we restrict the permutations within levels of pop (e.g., only permuting within balbc, and within b6).
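A restricted permutation of this kind can be sketched in base R by shuffling labels separately inside each stratum. The function permute_within and the toy vectors are hypothetical, and the package's own handling of cell_stratify_keys may differ in detail.

```r
# Hypothetical helper: shuffle labels only among cells that share a stratum.
permute_within = function(labels, strata) {
  out = labels
  for (idx in split(seq_along(labels), strata)) {
    out[idx] = labels[idx][sample.int(length(idx))]
  }
  out
}

set.seed(7)
toy_labels = c("k1", "k1", "k2", "k3", "k4", "k4", "k5", "k6")  # cluster labels per cell
toy_pop = rep(c("b6", "balbc"), each = 4)                        # stratum of each cell
perm = permute_within(toy_labels, toy_pop)
```

Because labels never cross between b6 and balbc, the cluster-by-population margin of the table is preserved in every permutation, which is why the expected values become conditional expectations.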

cluster_permute_test(aa80_chain$TRB,   cell_covariate_keys = 'sample', cell_stratify_keys = 'pop', cell_label_key = 'aa80', n_perm = 500, statistic = count_expanded)
#> $observed
#> [1] 27
#> 
#> $expected
#> [1] 44.652
#> 
#> $p.value
#> [1] 0.002
#> 
#> $mc.se
#> [1] 0.1967007
#> 
#> $statistics
#>   [1] 53 42 42 44 42 44 44 47 49 41 48 46 46 44 48 37 42 42 52 42 44 47 48 45 47
#>  [26] 52 49 42 50 40 46 46 44 37 46 42 47 46 41 53 42 51 42 48 44 48 41 43 44 35
#>  [51] 43 40 41 47 49 49 43 45 44 44 43 46 47 44 48 42 43 44 44 44 46 43 38 43 38
#>  [76] 41 50 39 45 44 37 37 47 46 46 49 49 44 49 41 46 46 44 51 39 48 48 49 42 41
#> [101] 49 53 50 48 44 50 49 54 50 39 40 39 45 42 46 55 42 48 46 46 49 41 49 50 44
#> [126] 49 47 47 47 40 48 45 50 43 43 39 42 43 36 38 40 42 36 44 41 42 43 44 43 35
#> [151] 43 42 41 40 47 51 45 52 43 42 37 46 40 41 51 40 55 45 41 43 50 42 39 46 47
#> [176] 48 46 43 56 46 45 47 46 44 39 46 44 37 47 47 44 44 47 50 49 42 55 47 47 49
#> [201] 44 44 35 43 52 43 44 54 49 48 49 43 43 40 45 48 42 40 44 48 47 48 57 44 41
#> [226] 48 48 46 49 47 49 46 55 37 47 46 40 42 41 37 51 36 35 48 35 52 33 51 40 40
#> [251] 44 46 49 47 53 45 48 41 42 39 41 46 51 46 36 49 43 43 44 46 48 45 47 41 47
#> [276] 40 47 47 44 44 43 46 41 40 41 45 48 48 49 45 45 39 48 43 44 41 41 47 54 45
#> [301] 41 52 48 46 49 41 47 46 50 49 45 35 41 43 49 44 38 56 50 42 50 50 40 43 49
#> [326] 54 57 46 45 51 44 45 43 52 47 49 42 40 43 42 44 46 49 44 43 43 48 49 49 45
#> [351] 46 44 40 47 40 43 48 41 47 41 41 44 50 39 41 50 40 40 45 45 43 42 48 43 49
#> [376] 41 44 49 48 43 50 47 46 45 47 48 44 41 48 53 45 34 46 44 43 53 39 42 36 39
#> [401] 50 39 39 48 42 50 43 50 45 40 55 45 47 41 43 38 45 39 48 42 48 42 46 40 40
#> [426] 38 42 44 42 39 45 46 43 42 38 38 45 53 51 40 50 42 43 42 48 36 44 42 47 44
#> [451] 48 45 42 47 48 40 39 47 45 37 42 49 46 41 41 36 40 53 55 48 51 41 44 44 42
#> [476] 36 42 53 45 38 39 40 47 34 53 42 34 47 41 39 43 44 42 43 49 43 46 45 43 46
#> 
#> $call
#> .cluster_permute_test(labels = label, covariates = covariates, 
#>     strata = strata, statistic = statistic, n_perm = n_perm, 
#>     alternative = alternative)
#> 
#> attr(,"class")
#> [1] "PermuteTest"

In the restricted permutations, the expected number of expanded clusters is even greater. Both of these effects are due to the fact that the “sample” replicates within each population are not actually biological replicates, which inflates the cluster_idx margin of the table.

5 Colophon

sessionInfo()
#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggdendro_0.1.23        purrr_0.3.4            stringr_1.4.0         
#> [4] tidyr_1.2.0            readr_2.1.2            ggplot2_3.3.5         
#> [7] dplyr_1.0.8            CellaRepertorium_1.6.0 BiocStyle_2.24.0      
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-157           bitops_1.0-7           RColorBrewer_1.1-3    
#>  [4] progress_1.2.2         GenomeInfoDb_1.32.0    tools_4.2.0           
#>  [7] backports_1.4.1        bslib_0.3.1            utf8_1.2.2            
#> [10] R6_2.5.1               DBI_1.1.2              BiocGenerics_0.42.0   
#> [13] colorspace_2.0-3       withr_2.5.0            prettyunits_1.1.1     
#> [16] tidyselect_1.1.2       compiler_4.2.0         cli_3.3.0             
#> [19] labeling_0.4.2         bookdown_0.26          sass_0.4.1            
#> [22] scales_1.2.0           digest_0.6.29          minqa_1.2.4           
#> [25] rmarkdown_2.14         XVector_0.36.0         pkgconfig_2.0.3       
#> [28] htmltools_0.5.2        parallelly_1.31.1      lme4_1.1-29           
#> [31] fastmap_1.1.0          highr_0.9              rlang_1.0.2           
#> [34] jquerylib_0.1.4        farver_2.1.0           generics_0.1.2        
#> [37] jsonlite_1.8.0         broom.mixed_0.2.9.4    RCurl_1.98-1.6        
#> [40] magrittr_2.0.3         GenomeInfoDbData_1.2.8 Matrix_1.4-1          
#> [43] Rcpp_1.0.8.3           munsell_0.5.0          S4Vectors_0.34.0      
#> [46] fansi_1.0.3            lifecycle_1.0.1        furrr_0.2.3           
#> [49] stringi_1.7.6          yaml_2.3.5             MASS_7.3-57           
#> [52] zlibbioc_1.42.0        plyr_1.8.7             grid_4.2.0            
#> [55] parallel_4.2.0         listenv_0.8.0          forcats_0.5.1         
#> [58] crayon_1.5.1           lattice_0.20-45        Biostrings_2.64.0     
#> [61] cowplot_1.1.1          splines_4.2.0          hms_1.1.1             
#> [64] magick_2.7.3           knitr_1.38             pillar_1.7.0          
#> [67] boot_1.3-28            reshape2_1.4.4         codetools_0.2-18      
#> [70] stats4_4.2.0           glue_1.6.2             evaluate_0.15         
#> [73] BiocManager_1.30.17    vctrs_0.4.1            nloptr_2.0.0          
#> [76] tzdb_0.3.0             gtable_0.3.0           future_1.25.0         
#> [79] assertthat_0.2.1       xfun_0.30              broom_0.8.0           
#> [82] tibble_3.1.6           IRanges_2.30.0         globals_0.14.0        
#> [85] ellipsis_0.3.2

Clustering and differential usage of repertoire CDR3 sequences

26 April 2022
