Contents

This document provides details on the import of single-cell data into a SingleCellExperiment object for use with the scater package (frequent use case) and how to quantify single-cell gene expression from within an R session using the popular RNA-seq quantification tools Salmon and kallisto.

1 Importing expression data into R

2 Read 10x results

3 Read Salmon or kallisto results

4 Using tximport

5 Using kallisto and Salmon to quantify transcript abundance from within R

Lior Pachter’s group at Berkeley has recently released a new software tool called kallisto for rapid quantification of transcript abundance from RNA-seq data. Similarly, Rob Patro at Stony Brook University and colleagues have released Salmon, a similarly fast and accurate RNA-seq quantification tool. In scater, a couple of wrapper functions to kallisto and Salmon enable easy and extremely fast quantification of transcript abundance from within R.

For simplicity, examples are shown below demonstrating the use of kallisto, but the same could be done with Salmon using almost identical interfaces (replacing runKallisto with runSalmon, readKallistoResults with readSalmonResults and so on).

################################################################################
### Tests and Examples

# Example if in the kallisto/test directory
setwd("/home/davis/kallisto/test")
kallisto_log <- runKallisto("targets.txt", "transcripts.idx", single_end=FALSE,
            output_prefix="output", verbose=TRUE, n_bootstrap_samples=10)

sce_test <- readKallistoResults(kallisto_log, read_h5=TRUE)
sce_test

An example using real data from a project investigating cell cycle. Note that this analysis is not ‘live’ (the raw data are not included in the package), but it demonstrates what can be done with scater and kallisto.

setwd("/home/davis/021_Cell_Cycle/data/fastq")
system("wc -l targets.txt")
ave_frag_len <- mean(c(855, 860, 810, 760, 600, 690, 770, 690))

kallisto_test <- runKallisto("targets.txt",
                             "Mus_musculus.GRCm38.rel79.cdna.all.ERCC.idx",
                             output_prefix="kallisto_output_Mmus", n_cores=12,
                             fragment_length=ave_frag_len, verbose=TRUE)
sce_kall_mmus <- readKallistoResults(kallisto_test, read_h5=TRUE)
sce_kall_mmus

sce_kall_mmus <- readKallistoResults(kallisto_test)

sce_kall_mmus <- getBMFeatureAnnos(sce_kall_mmus)

head(fData(sce_kall_mmus))
head(pData(sce_kall_mmus))
sce_kall_mmus[["start_time"]]

counts(sce_kall_mmus)[sample(nrow(sce_kall_mmus), size=15), 1:6]

## Summarise expression at the gene level
sce_kall_mmus_gene <- summariseExprsAcrossFeatures(
    sce_kall_mmus, exprs_values="tpm", summarise_by="feature_id")

datatable(fData(sce_kall_mmus_gene))

sce_kall_mmus_gene <- getBMFeatureAnnos(
    sce_kall_mmus_gene, filters="ensembl_gene_id",
    attributes=c("ensembl_gene_id", "mgi_symbol", "chromosome_name",
                 "gene_biotype", "start_position", "end_position",
                 "percentage_gc_content", "description"),
    feature_symbol="mgi_symbol", feature_id="ensembl_gene_id",
    biomart="ensembl", dataset="mmusculus_gene_ensembl")

datatable(fData(sce_kall_mmus_gene))

## Add gene symbols to featureNames to make them more intuitive
new_feature_names <- featureNames(sce_kall_mmus_gene)
notna_mgi_symb <- !is.na(fData(sce_kall_mmus_gene)$mgi_symbol)
new_feature_names[notna_mgi_symb] <- fData(sce_kall_mmus_gene)$mgi_symbol[notna_mgi_symb]
notna_ens_gid <- !is.na(fData(sce_kall_mmus_gene)$ensembl_gene_id)
new_feature_names[notna_ens_gid] <- paste(new_feature_names[notna_ens_gid],
          fData(sce_kall_mmus_gene)$ensembl_gene_id[notna_ens_gid], sep="_")
sum(duplicated(new_feature_names))
featureNames(sce_kall_mmus_gene) <- new_feature_names
head(featureNames(sce_kall_mmus_gene))
tail(featureNames(sce_kall_mmus_gene))
sum(duplicated(fData(sce_kall_mmus_gene)$mgi_symbol))