1 About the template

This section provides general description and how to use this single cell RNAseq (scRNAseq) workflow. In the actual analysis report, this section is usually removed.

This scRNAseq workflow template is based on the counted 10x scRNA data. It means this workflow expect users have output from cell counting programs, like cell ranger. If you have the raw sequencing data and would like to count the cell gene counts, please use another workflow in systemPipeR, such as SPcellranger.

This workflow does:

  1. Read in single cell count data.
  2. Basic stats on input data.
  3. Create some basic QC on cell counting.
  4. Normalization.
  5. Find high variable genes.
  6. Scaling.
  7. dim reduction, PCA.
  8. Clustering with tSNE, uMAP.
  9. Find clustering markers (marker gene).
  10. Find cell types.
  11. Visualize cell types and clustering together.

This template is modified from Seurat Tutorial.

2 Introduction

Users want to provide here background information about the design of their scRNAseq project.

This report describes the analysis of a scRNAseq project studying drug …

2.1 Experimental design

Typically, users want to specify here all information relevant for the analysis of their scRNAseq study. This includes detailed descriptions of files, experimental design, reference genome, gene annotations, etc.

3 Workflow environment

systemPipeR workflows can be designed and built from start to finish with a single command, importing from an R Markdown file or stepwise in interactive mode from the R console.

This tutorial will demonstrate how to build the workflow in an interactive mode, appending each step. The workflow is constructed by connecting each step via appendStep method. Each SYSargsList instance contains instructions needed for processing a set of input files with a specific command-line or R software and the paths to the corresponding outfiles generated by a particular tool/step.

To create a Workflow within systemPipeR, we can start by defining an empty container and checking the directory structure:

library(systemPipeR)
sal <- SPRproject()
sal

3.1 Load packages

This is an empty template that contains only one demo step. Refer to our website for how to add more steps. If you prefer a more enriched template, read this page for other pre-configured templates.

cat(crayon::blue$bold("To use this workflow, following R packages are expected:\n"))
cat(c("'Seurat", "ggplot2", "ggpubr", "patchwork", "dplyr", "tibble",
    "readr'\n"), sep = "', '")
### pre-end
appendStep(sal) <- LineWise(code = {
    library(systemPipeR)
    library(Seurat)
    library(dplyr)
    library(ggplot2)
    library(ggpubr)
    library(patchwork)
}, step_name = "load_packages")

3.2 Load data

In this example, the single cell data is preprocessed/filtered 10x data from a healthy donor. Samples taken from peripheral blood mononuclear cells (PBMCs), about 3000 cells.

Dataset can be downloaded with this link: https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

If the link is not working, visit 10x website for updated links.

For your real data, please preprocess and put the dataset inside data directory

appendStep(sal) <- LineWise(code = {
    # unzip the data
    untar("data/pbmc3k_filtered_gene_bc_matrices.tar.gz", exdir = "data")
    # load data
    pbmc.data <- Read10X(data.dir = "data/filtered_gene_bc_matrices/hg19/")
    # Use dim to see the size of dataset, example data has
    # 2700 cells x 32738 genes
    dim(pbmc.data)
}, step_name = "load_data", dependency = "load_packages")

3.3 Simple count visulization

We can plot to see how many cells have good expressions.

appendStep(sal) <- LineWise(code = {
    at_least_one <- apply(pbmc.data, 2, function(x) sum(x > 0))
    count_p1 <- tibble::as_tibble(at_least_one) %>%
        ggplot() + geom_histogram(aes(x = value), binwidth = floor(nrow(pbmc.data)/400),
        fill = "#6b97c2", color = "white") + theme_pubr(16) +
        scale_y_continuous(expand = c(0, 0)) + scale_x_continuous(expand = c(0,
        0)) + labs(title = "Distribution of detected genes",
        x = "Genes with at least one tag")

    count_p2 <- tibble::as_tibble(BiocGenerics::colSums(pbmc.data)) %>%
        ggplot() + geom_histogram(aes(x = value), bins = floor(ncol(pbmc.data)/50),
        fill = "#6b97c2", color = "white") + theme_pubr(16) +
        scale_y_continuous(expand = c(0, 0)) + scale_x_continuous(expand = c(0,
        0)) + labs(title = "Expression sum per cell", x = "Sum expression")

    png("results/count_plots.png", 1000, 700)
    count_p1 + count_p2 + patchwork::plot_annotation(tag_levels = "A")
    dev.off()
}, step_name = "count_plot", dependency = "load_data")

3.4 Create Seurat object

appendStep(sal) <- LineWise(code = {
    sce <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k",
        min.cells = 3, min.features = 200)
    # calculate mitochondria gene ratio
    sce[["percent.mt"]] <- PercentageFeatureSet(sce, pattern = "^MT-")
}, step_name = "create_seurat", dependency = "load_data")

Filters applied here are:

  • min.cells = 3: for a single gene, it must be presented in at least 3 cells.
  • min.features = 200: for any cell, it must at least have 200 genes with readable gene counts.

In your real data, you may want to adjust the filters for different projects.

3.5 QC of cells

Some indicators are important to show the quality of the cell batch, for example

  • N genes (features) per cell
  • N counts per cell
  • Mitochondria (mt) gene ratio.
appendStep(sal) <- LineWise(code = {

    png("results/qc1.png", 700, 700)
    VlnPlot(sce, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 3)
    dev.off()

    qc_p1 <- FeatureScatter(sce, feature1 = "nCount_RNA", feature2 = "percent.mt")
    qc_p2 <- FeatureScatter(sce, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
    png("results/qc2.png", 700, 450)
    qc_p1 + qc_p2 + patchwork::plot_annotation(tag_levels = "A")
    dev.off()
}, step_name = "qc_seurat", dependency = "create_seurat")