scRepertoire 1.2.0
scRepertoire is designed to take filter contig outputs from the 10x Genomics Cell Ranger pipeline, processes that data to assign clonotype based on two TCR or Ig chains, and analyze the clonotype dynamics. The latter can be separated into 1) clonotype-only analysis functions, such as unique clonotypes or clonal space quantification and 2) interaction with mRNA expression data using Seurat, SingleCellExperiment or Monocle 3 packages.
suppressMessages(library(scRepertoire))
scRepertoire comes with a data set derived from T cells derived from three patients with renal clear cell carcinoma in order to demonstrate the functionality of the R package. More information on the data set can be found at preprint 1 and preprint 2. The samples consist of paired peripheral-blood and tumor-infiltrating runs, effectively creating 6 distinct runs for T cell receptor (TCR) enrichment. We can preview the elements in the list by using the head function and looking at the first contig annotation. Here notice the barcode is labeled as PX_P_############# - this refers to Patient X (PX) and Peripheral Blood (P).
The data, contig_list, is 6 filtered_contig.csv file outputs from Cell Ranger that made into a list.
data("contig_list") #the data built into scRepertoire
head(contig_list[[1]])
## barcode is_cell contig_id high_confidence length
## 1 AAACCTGAGAGCTGGT TRUE AAACCTGAGAGCTGGT-1_contig_1 TRUE 705
## 2 AAACCTGAGAGCTGGT TRUE AAACCTGAGAGCTGGT-1_contig_2 TRUE 502
## 3 AAACCTGAGCATCATC TRUE AAACCTGAGCATCATC-1_contig_1 TRUE 693
## 4 AAACCTGAGCATCATC TRUE AAACCTGAGCATCATC-1_contig_2 TRUE 567
## 5 AAACCTGAGCATCATC TRUE AAACCTGAGCATCATC-1_contig_5 TRUE 361
## 6 AAACCTGAGTGGTCCC TRUE AAACCTGAGTGGTCCC-1_contig_1 TRUE 593
## chain v_gene d_gene j_gene c_gene full_length productive cdr3
## 1 TRB TRBV20-1 TRBD1 TRBJ1-5 TRBC1 TRUE TRUE CSASMGPVVSNQPQHF
## 2 TRB None None TRBJ1-5 TRBC1 FALSE None None
## 3 TRB TRBV5-1 TRBD2 TRBJ2-2 TRBC2 TRUE TRUE CASSWSGAGDGELFF
## 4 TRA TRAV12-1 None TRAJ37 TRAC TRUE TRUE CVVNDEGSSNTGKLIF
## 5 TRB None None TRBJ1-5 TRBC1 FALSE None None
## 6 TRB TRBV7-9 TRBD1 TRBJ2-5 TRBC2 TRUE TRUE CASSPSEGGRQETQYF
## cdr3_nt reads umis raw_clonotype_id
## 1 TGCAGTGCTAGCATGGGACCGGTAGTGAGCAATCAGCCCCAGCATTTT 16718 6 clonotype96
## 2 None 6706 3 clonotype96
## 3 TGCGCCAGCAGCTGGTCAGGAGCGGGAGACGGGGAGCTGTTTTTT 26719 11 clonotype97
## 4 TGTGTGGTGAACGATGAAGGCTCTAGCAACACAGGCAAACTAATCTTT 18297 6 clonotype97
## 5 None 882 4 clonotype97
## 6 TGTGCCAGCAGCCCCTCCGAAGGGGGGAGACAAGAGACCCAGTACTTC 11218 6 clonotype98
## raw_consensus_id
## 1 clonotype96_consensus_1
## 2 None
## 3 clonotype97_consensus_2
## 4 clonotype97_consensus_1
## 5 None
## 6 clonotype98_consensus_1
Some workflows will have the additional labeling of the standard barcode. Before we proceed, we will use the function stripBarcode() in order to avoid any labeling issues down the line. Importantly, stripBarcode() is for removing prefixes on barcodes that have resulted from other pipelines.
No need for stripBarcode function, if the barcodes look like: + AAACGGGAGATGGCGT-1 + AAACGGGAGATGGCGT
In terms of using stripBarcode(), please think about the following parameters.
column + The column in which the barcodes are present
connector + The character that is connecting the barcode with the prefix
num_connects + the levels of barcode prefix, where X_X_AAACGGGAGATGGCGT-1 == 3, X_AAACGGGAGATGGCGT-1 = 2.
for (i in seq_along(contig_list)) {
contig_list[[i]] <- stripBarcode(contig_list[[i]], column = 1, connector = "_", num_connects = 3)
}
You can see now the barcode in column 1, we have removed the P## prefixes.
As the output of CellRanger are quantifications of both the TCRA and TCRB chains, the next step is to create a single list object with the TCR gene and CDR3 sequences by cell barcode. This is performed using the combineTCR(), where the input is the stripped contig_list. There is also the relabeling of the barcodes by sample and ID information to prevent duplicates.
cells + T-AB - T cells, alpha-beta TCR + T-GD - T cells, gamma-delta TCR
removeNA + TRUE - this is a stringent filter to remove any cell barcode with an NA value in at least one of the chains + FALSE - the default setting to include and incorporate cells with 1 NA value
removeMulti + TRUE - this is a stringent filter to remove any cell barcode with more than 2 immune receptor chains + FALSE - the default setting to include and incorporate cells with > 2 chains
filterMulti + TRUE - Isolated the top 2 expressed chains in cell barcodes with multiple chains + FALSE - the default setting to include and incorporate cells with > 2 chains
combined <- combineTCR(contig_list,
samples = c("PY", "PY", "PX", "PX", "PZ","PZ"),
ID = c("P", "T", "P", "T", "P", "T"), cells ="T-AB")
The output of combineTCR() will be a list of contig data frames that will be reduced to the reads associated with a single cell barcode. It will also combine the multiple reads into clonotype calls by either the nucleotide sequence (CTnt), amino acid sequence (CTaa), the gene sequence (CTgene) or the combination of the nucleotide and gene sequence (CTstrict). The analogous function for B cells, combineBCR() functions similarly with 2 major caveats: 1) Each barcode can only have a maximum of 2 sequences, if greater exists, the 2 with the highest reads are selected. 2) The strict definition of clonotype (CTstrict) is based on the v gene and >85% normalized hamming distance of the nucleotide sequence. The hamming distance is calculated across all BCR sequences recovered, regardless of the run.
What if there are more variables to add than just sample and ID? We can add them by using the addVariable() function. All we need is the name of the variable you’d like to add and the specific character or numeric values (variables). As an example, here we add the batches in which the samples were processed and sequenced.
example <- addVariable(combined, name = "batch",
variables = c("b1", "b1", "b2", "b2", "b2", "b2"))
example[[1]][1:5,ncol(example[[1]])] # This is showing the first 5 values of the new column added
## [1] "b1" "b1" "b1" "b1" "b1"
Likewise we can remove specific list elements after combineTCR() using the subsetContig() function. In order to subset, we need to identify the vector we would like to use for subsetting (name) and also the variable values to subset (variables). Below you can see us isolate just the 4 sequencing results from PX and PY.
subset <- subsetContig(combined, name = "sample",
variables = c("PX", "PY"))
cloneCall + “gene” - use the genes comprising the TCR/Ig + “nt” - use the nucleotide sequence of the CDR3 region + “aa” - use the amino acid sequence of the CDR3 region + “gene+nt” - use the genes comprising the TCR/Ig + the nucleotide sequence of the CDR3 region. This is the proper definition of clonotype.
Important to note, that the clonotype is called using essentially the combination of genes or nt/aa CDR3 sequences for both loci. As of this implementation of scRepertoire, clonotype calling is not incorporating small variations within the CDR3 sequences. As such the gene approach will be the most sensitive, while the use of nt or aa moderately so, and the most specific for clonotypes being gene+nt. Additionally, the clonotype call is trying to incorporate both loci, i.e, both TCRA and TCRB chains and if a single cell barcode has multiple sequences identified (i.e., 2 TCRA chains expressed in one cell). Using the 10x approach, there is a subset of barcodes that only return one of the immune receptor chains, the unreturned chain is assigned an NA value.
The first function to explore the clonotypes is quantContig() to return the total or relative numbers of unique clonotypes. scale + TRUE - relative percent of unique clonotypes scaled by total size of the size of the clonotype repertoire + FALSE - Report the total number of unique clonotypes
quantContig(combined, cloneCall="gene+nt", scale = TRUE)
Within each of the general analysis functions, there is the ability to export the data frame used to create the visualization. To get the exported values, use exportTable = TRUE. It will return the data frame used to make the graph, instead of the visual output.
quantContig_output <- quantContig(combined, cloneCall="gene+nt",
scale = TRUE, exportTable = TRUE)
quantContig_output
## contigs values total scaled
## 1 2691 PY_P 3208 83.88404
## 2 1508 PY_T 3119 48.34883
## 3 823 PX_P 1068 77.05993
## 4 925 PX_T 1678 55.12515
## 5 1147 PZ_P 1434 79.98605
## 6 761 PZ_T 2768 27.49277
The other option here is to be able to define the visualization by data classes. Here we used the combineTCR() to define the ID variable as part of the naming structure. We can the group to specifically use a column in the data set to organize the visualization.
quantContig(combined, cloneCall="gene", group = "ID", scale = TRUE)
We can also examine the relative distribution of clonotypes by abundance. Here abundanceContig() will produce a line graph with a total number of clonotypes by the number of instances within the sample or run. Like above, we can also group this by vectors within the contig object using the group variable in the function
abundanceContig(combined, cloneCall = "gene", scale = FALSE)