Application of high-throughput sequencing of T and B lymphocyte antigen receptors has great potential for improving the monitoring of lymphoid malignancies, assessing immune reconstitution after hematopoietic stem cell transplantation, and characterizing the composition of lymphocyte repertoires (Warren, E. H. et al. Blood 2013;122:19–22). LymphoSeq is an R package designed to import, analyze, and visualize antigen receptor sequencing data from Adaptive Biotechnologies’ ImmunoSEQ assay. The package can also be adapted to analyze T and B cell receptor sequencing data processed on other platforms such as MiXCR or IMGT/HighV-QUEST. This vignette highlights some of the features of LymphoSeq and guides the user through a typical workflow.

Importing data

The LymphoSeq function readImmunoSeq imports tab-separated value (.tsv) files exported by Adaptive Biotechnologies’ ImmunoSEQ analyzer v2, where each row represents a unique sequence and each column is a variable with information about that sequence, such as read count, frequency, or variable gene name. Note that the file format for ImmunoSEQ analyzer v3 is not yet supported, so users must choose to export the v2 format from the analyzer software. Only files with the extension .tsv are imported; all others are disregarded. Files processed using other platforms can also be imported as long as they are tab-delimited, have the extension .tsv, and use the same column names as the ImmunoSEQ files (see the readImmunoSeq manual for a list of column names used by this file type). Refer to the LymphoSeq manual for the column names required by each function.

To explore the features of LymphoSeq, this package includes two example data sets. The first is a data set of T cell receptor beta (TCRB) sequencing from 10 blood samples acquired serially from a single patient who underwent a bone marrow transplant (Kanakry, C.G., et al. JCI Insight 2016;1(5):pii: e86252). The second is a data set of B cell receptor immunoglobulin heavy (IGH) chain sequencing from Burkitt lymphoma tumor biopsies acquired from 10 different individuals (Lombardo, K.A., et al. Blood Advances 2017 1:535-544). To improve performance, both data sets contain only the top 1,000 most frequent sequences. The complete data sets are publicly available through Adaptive’s immuneACCESS portal. As shown in the example below, you can specify the path to the example data sets using the command system.file("extdata", "TCRB_sequencing", package = "LymphoSeq") for the TCRB files and system.file("extdata", "IGH_sequencing", package = "LymphoSeq") for the IGH files.

readImmunoSeq imports each file in the specified directory as a list object where each file becomes a data frame. You can import all columns from each file by setting the columns parameter to "all" or list just those columns of interest. Be aware that Adaptive Biotechnologies has changed the column names of their files over time and if the headings of your files are not all the same, you will need to specify "all" or provide all variations of the column header. By default, the columns parameter is set to import only those columns used by LymphoSeq.
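
One way to find out whether the headings of your files agree before importing them is to compare the header lines directly. The following is a minimal base R sketch that does not use LymphoSeq. The demo directory, file names, and column names (sequence, count, reads) are hypothetical placeholders, not actual ImmunoSEQ headers.

```r
# Hypothetical demo files with deliberately different headers
hdr.dir <- file.path(tempdir(), "header_check_demo")
dir.create(hdr.dir, showWarnings = FALSE)
write.table(data.frame(sequence = "AAA", count = 10),
            file.path(hdr.dir, "old_export.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(sequence = "CCC", reads = 5),
            file.path(hdr.dir, "new_export.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)

files <- list.files(hdr.dir, pattern = "\\.tsv$", full.names = TRUE)

# Read only the first line of each file and split it on tabs
headers <- lapply(files, function(f) {
  strsplit(readLines(f, n = 1), "\t", fixed = TRUE)[[1]]
})
names(headers) <- basename(files)

# TRUE only if every file has exactly the same header
same.header <- length(unique(headers)) == 1

# All column-name variants seen across the files
all.variants <- unique(unlist(headers))
```

If same.header is FALSE, the all.variants vector shows which heading variations are present, which can help decide between setting columns to "all" and listing each variation.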

library(LymphoSeq)
## Loading required package: LymphoSeqDB
TCRB.path <- system.file("extdata", "TCRB_sequencing", package = "LymphoSeq")

TCRB.list <- readImmunoSeq(path = TCRB.path)
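
For intuition about the kind of object this import step produces, the following base R sketch builds a named list of data frames from the .tsv files in a directory. It is an illustration only and not the readImmunoSeq source code. The demo directory, file names, and column names are hypothetical.

```r
# Create a temporary directory with two toy .tsv files and one other file
demo.dir <- file.path(tempdir(), "tsv_import_demo")
dir.create(demo.dir, showWarnings = FALSE)

toy <- data.frame(sequence = c("AAA", "CCC"), count = c(10, 5))
write.table(toy, file.path(demo.dir, "sample_A.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(toy, file.path(demo.dir, "sample_B.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)
writeLines("not a sequencing file", file.path(demo.dir, "notes.txt"))

# Keep only files ending in .tsv; notes.txt is skipped
tsv.files <- list.files(demo.dir, pattern = "\\.tsv$", full.names = TRUE)

# Read each file into a data frame and name it after its file
demo.list <- lapply(tsv.files, read.delim, stringsAsFactors = FALSE)
names(demo.list) <- tools::file_path_sans_ext(basename(tsv.files))

names(demo.list)
```

The result is a list with one data frame per .tsv file, named after the files, which is the same overall shape as the TCRB.list object.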

Notice that each data frame listed in the TCRB.list object is named according to the ImmunoSEQ file names. If different names are desired, you may rename the original .tsv files or assign names(TCRB.list) to a new character vector of desired names in the same order as the list.

names(TCRB.list)
 [1] "TRB_CD4_949"       "TRB_CD8_949"       "TRB_CD8_CMV_369"  
 [4] "TRB_Unsorted_0"    "TRB_Unsorted_1320" "TRB_Unsorted_1496"
 [7] "TRB_Unsorted_32"   "TRB_Unsorted_369"  "TRB_Unsorted_83"  
[10] "TRB_Unsorted_949" 
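
Renaming by assignment can be tried on a small stand-in list first. This is a minimal sketch in base R. The toy.list object, its contents, and the replacement names are hypothetical and not part of the example data.

```r
# Toy list standing in for an imported list of data frames
toy.list <- list(TRB_Unsorted_0  = data.frame(count = 1:3),
                 TRB_Unsorted_32 = data.frame(count = 4:6))

# Replacement names must be given in the same order as the list elements
names(toy.list) <- c("Sample_A", "Sample_B")

names(toy.list)
```

Only the names change. The data frames keep their original order and contents, so a vector given in the wrong order would silently mislabel the samples.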

Having the data in the form of a list makes it easy to apply a function over that list using the base function lapply. For example, you may use the function dim to report the dimensions of each data frame as shown below. Notice that each data frame in the example below has at most 1,000 rows and exactly 11 columns.

lapply(TCRB.list, dim)
$TRB_CD4_949
[1] 1000   11

$TRB_CD8_949
[1] 1000   11

$TRB_CD8_CMV_369
[1] 414  11

$TRB_Unsorted_0
[1] 1000   11

$TRB_Unsorted_1320
[1] 1000   11

$TRB_Unsorted_1496
[1] 1000   11

$TRB_Unsorted_32
[1] 920  11

$TRB_Unsorted_369
[1] 1000   11

$TRB_Unsorted_83
[1] 1000   11

$TRB_Unsorted_949
[1] 1000   11

In place of dim, you may also use colnames, nrow, ncol, or other more complex functions that perform operations on subsetted columns.
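
As a rough illustration of a more complex function, the sketch below passes an anonymous function to lapply that sums one column of each data frame. The toy list and the column name count are hypothetical stand-ins. Check colnames on your own list for the actual column names.

```r
# Toy list standing in for an imported list of data frames
toy.list <- list(sample_A = data.frame(count = c(10, 5, 1)),
                 sample_B = data.frame(count = c(7, 3)))

# Number of rows in each data frame, returned as a list
row.counts <- lapply(toy.list, nrow)

# Anonymous function operating on a single column of each data frame
total.counts <- lapply(toy.list, function(df) sum(df$count))

# sapply performs the same operation but simplifies the result to a named vector
total.vector <- sapply(toy.list, function(df) sum(df$count))
```

lapply always returns a list, whereas the base function sapply is convenient when each result is a single value and a named vector is easier to read.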

Subsetting data

If you imported all of the files from your project but just want to perform an analysis on a subset, use standard R methods to subset the list. Remember that a single bracket [ returns a list and a double bracket [[ returns a single data frame.
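
The difference between the two bracket types can be verified on a toy list. This is a minimal base R sketch with hypothetical object and column names.

```r
# Toy list standing in for an imported list of data frames
toy.list <- list(sample_A = data.frame(count = 1:2),
                 sample_B = data.frame(count = 3:4))

sub.list <- toy.list["sample_A"]     # single bracket: a list of length 1
one.df   <- toy.list[["sample_A"]]   # double bracket: the data frame itself

class(sub.list)
class(one.df)
```

Functions that expect a list of data frames should be given the single-bracket result, while functions that expect one data frame should be given the double-bracket result.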

CMV <- TCRB.list[grep("CMV", names(TCRB.list))]
names(CMV)
[1] "TRB_CD8_CMV_369"
TRB_Unsorted_0 <- TCRB.list[["TRB_Unsorted_0"]]
head(TRB_Unsorted_0)