Select target genes for TAP-seq

Andreas Gschwind

2020-04-28

One application of TAP-seq is to measure the expression of genes that enable assigning cells to different cell types. The TAPseq package offers a simple approach to identify a desired number of cell type markers based on differentially expressed genes between cell types.

This optional functionality requires additional R-packages such as Seurat. Please install the TAPseq package with all suggested dependencies.

Data

The TAPseq package contains a small subset of the mouse bone marrow data from Baccin et al., 2019 (https://www.nature.com/articles/s41556-019-0439-6). The full dataset can be downloaded here. This dataset is stored in a Seurat object which contains both gene expression and cell type data for every cell.

library(TAPseq)
library(Seurat)

# example of mouse bone marrow 10x gene expression data
data("bone_marrow_genex")

# transcript counts
GetAssayData(bone_marrow_genex)[1:10, 1:10]

# number of cells per cell type
table(Idents(bone_marrow_genex))

Select target genes

A linear model with lasso regularization is used to select target genes that best identify cell types. By default 10-fold cross-validation is used to select the number of target genes based on the largest value of lambda within 1 standard error of the minimum. See ?glmnet::cv.glmnet for more details.

# automatically select a number of target genes using 10-fold cross-validation
target_genes_cv <- selectTargetGenes(bone_marrow_genex)
head(target_genes_cv)
length(target_genes_cv)

This example results in a reasonable target gene panel, but in cases with many different cell types the resulting panels might be very large.

To avoid this, we can specify a desired number of targets. This selects a value for lamda in the lasso model that results in approximately this number of non-zero coefficients, i.e. marker genes.

# identify approximately 100 target genes that can be used to identify cell populations
target_genes_100 <- selectTargetGenes(bone_marrow_genex, targets = 100)
length(target_genes_100)

Assess target gene panels

To intuitively assess how well a chosen set of target genes distinguishes cell types, we can use UMAP plots based on the full gene expression data and on target genes only.

plotTargetGenes(bone_marrow_genex, target_genes = target_genes_100)

We can see that the expression of the 99 selected target genes groups cells of different populations together.

A good follow up would be to cluster the cells based on only the target genes following the same workflow used to define the cell identities in the original object. This could then be used to verify that the selected target genes reliably identify the correct cell types.