1 Introduction
2 Step.1 : Create gene-level expression matrix
3 Step.2: Convert the row names of a matrix as NCBI Gene ID (ENTREZID)
- 3.1 Case I: Ensembl Gene ID to NCBI Gene ID
- 3.2 Case II: Gene Symbol to NCBI Gene ID
4 Step.3: Normalize the count matrix
Session information

1 Introduction

We explain the way to create a matrix, in which the row names are NCBI Gene ID (ENTREZID), for specifying an input of scTensor. Typical scTensor workflow can be described like below.

library("scTensor")
library("LRBase.XXX.eg.db") # e.g. LRBase.Hsa.eg.db
# Input matrix
input <- ...
sce <- SingleCellExperiment(assays=list(counts = input))
# Color vector
color <- ...
# Celltype vector
label <- ...
cellCellSetting(sce, LRBase.XXX.eg.db, color, label)

In scTensor, the row names of the input matrix is limited only NCBI Gene ID to cooperate with the other R packages (cf. data(“Germline”)). Since the user has many different types of the data matrix, here we introduce some situations and the way to convert the row names of the user’s matrix as NCBI Gene ID.

2 Step.1 : Create gene-level expression matrix

First of all, we have to prepare the expression matrix (gene \(\times\) cell). There are many types of single-cell RNA-Seq (scRNA-Seq) technologies and the situation will be changed by the used experimental methods and quantification tools described below.

2.1 Case I: Gene-level quantification

In Plate-based scRNA-Seq experiment (i.e. Smart-Seq2, Quart-Seq2, CEL-Seq2, MARS-Seq,…etc), FASTQ file is generated in each cell. After the mapping of reads in each FASTQ file to reference genome, the same number of BAM files will be generated. By using some quantification tools such as featureCounts, or HTSeq-count, user can get the expression matrix and used as the input of scTensor. These tools simply count the number of reads in union-exon in each gene. However, these tools do not take “multimap” of not unique reads into consideration and the quantification is not accurate. Therefore, we recommend the transcript-level quantification and gene-level summarization explained below.

2.2 Case II : Transcript-level quantification

Some quantification tools such as RSEM, Sailfish, Salmon, Kallisto, and StringTie use the reference transcriptome instead of genome, and quantify the expression level in each transcript. After the quantification, the transcript-level expression can be summarized to gene-level by using summarizeToGene function of tximport. The paper of tximport showed that the gene-level summalization is more accurate than featureCounts.

Note that if you use the reference transcriptome of GENCODE, this step becomes slightly complicated. Although the number of transcripts of GENCODE and that of Ensembl is almost the same, and actually, most of the transcript is duplicated in these two databases, the gene identifier used in GENCODE looks complicated like “ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|”. In such case, firstly only Ensembl Transcript ID should be extracted (e.g. ENST00000456328.2), removed the version (e.g. ENST00000456328), summarized to Ensembl Gene ID by tximport (e.g. ENSG00000223972), and then converted to NCBI Gene ID (e.g. 100287102).

2.3 Case III: UMI-count

In the droplet-based scRNA-Seq experiment (i.e. Drop-Seq, inDrop RNA-Seq, 10X Chromium), unique molecular identifier (UMI) is introduced for avoiding the bias of PCR amplification, and after multiplexing by cellular barcode, digital gene expression (DGE) matrix is generated by counting the number of types of UMI mapped in each gene.

When user perform Drop-seq, Drop-Seq tool can generate the DGE matrix.

Another tool Alevin, which is a subcommand of Salmon is also applicable to Drop-seq data. In such case [tximport] with option “type = ‘alevin’” can import the result of Alevin into R and summarize the DGE matrix.

When the user performs 10X Chromium, using Cell Ranger developed by 10X Genomics is straightforward.

Although Cell Ranger is implemented by Python, starting from the files generated by Cell Ranger (e.g. filtered_gene_bc_matrices/{hg19,mm10}/{barcodes.tsv,genes.tsv,matrix.mtx}), Seurat can import these files to an R object.

For example, according to the tutorial of Seurat (Seurat - Guided Clustering Tutorial), PBMC data of Cell Ranger can be imported by Read10X function and DGE matrix of UMI-count is available by the output of CreateSeuratObject function.

library("Seurat")

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data,
  project = "pbmc3k", min.cells = 3, min.features = 200)

Note that the matrix is formatted as a sparse matrix of Matrix package (MM: Matrix market), but the scTensor assumes dense matrix. By using as.matrix function, the sparse matrix is easily converted to dense matrix as follows.

# Sparse matrix to dense matrix
for_sc <- as.matrix(pbmc.data)

3 Step.2: Convert the row names of a matrix as NCBI Gene ID (ENTREZID)

Even after creating the gene-level expression matrix in Step.1, many kinds of gene-level gene identifiers can be assigned as row names of the matrix such as Ensembl Gene ID, RefSeq, or Gene Symbol. Again, only NCBI Gene ID can be used as row names of input matrix of scTensor. To do such a task, we originally implemented a function convertToNCBIGeneID. The only user has to prepare for using this function is the 1. input matrix (or data.frame) filled with only numbers, 2. current gene-level gene identifier in each row of the input matrix, and 3. corresponding table containing current gene-level gene identifier (left) and corresponding NCBI Gene ID (right). The usage of this function is explained below.

3.1 Case I: Ensembl Gene ID to NCBI Gene ID

In addition to 1. and 2., the user has to prepare the 3. corresponding table. Here we introduce two approaches to assign user’s Ensembl Gene ID to NCBI Gene ID. First approarch is using Organism DB packages such as Homo.sapiens, Mus.musculus, and Rattus.norvegicus.

Using the select function of Organism DB, the corresponding table can be retrieved like below.

suppressPackageStartupMessages(library("scTensor"))
suppressPackageStartupMessages(library("Homo.sapiens"))

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("ENSG00000204531", "ENSG00000181449",
  "ENSG00000136997", "ENSG00000136826")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
  column=c("ENSEMBL", "ENTREZID"),
  keytype="ENSEMBL", keys=rowID)

## 'select()' returned 1:1 mapping between keys and columns

# ID conversion
(input <- convertToNCBIGeneID(input, rowID, LefttoRight))

##      [,1] [,2] [,3] [,4] [,5]
## 5460    1    5    9   13   17
## 6657    2    6   10   14   18
## 4609    3    7   11   15   19
## 9314    4    8   12   16   20

Second approarch is using AnnotationHub package.

Although only three Organism DB packages are explicitly developed, even if the data is generated from other species (e.g. Zebrafish, Arabidopsis thaliana), similar database is also available from AnnotationHub, and select function can be performed like below.

suppressPackageStartupMessages(library("AnnotationHub"))

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()

## snapshotDate(): 2019-10-29

# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]

## loading from cache

LefttoRight <- select(hs,
  column=c("ENSEMBL", "ENTREZID"),
  keytype="ENSEMBL", keys=rowID)

## 'select()' returned 1:1 mapping between keys and columns

(input <- convertToNCBIGeneID(input, rowID, LefttoRight))

##      [,1] [,2] [,3] [,4] [,5]
## 5460    1    5    9   13   17
## 6657    2    6   10   14   18
## 4609    3    7   11   15   19
## 9314    4    8   12   16   20

3.2 Case II: Gene Symbol to NCBI Gene ID

When using cellranger or Seurat to quantify UMI-count (cf. Step1, Case III), the row names of input matrix might be Gene Symbol, and have to be converted to NCBI Gene ID. As well as the Case I described above, Organism DB and AnnotationHub will support such a task like below.

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("POU5F1", "SOX2", "MYC", "KLF4")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
  column=c("SYMBOL", "ENTREZID"),
  keytype="SYMBOL", keys=rowID)

## 'select()' returned 1:1 mapping between keys and columns

# ID conversion
(input <- convertToNCBIGeneID(input, rowID, LefttoRight))

##      [,1] [,2] [,3] [,4] [,5]
## 5460    1    5    9   13   17
## 6657    2    6   10   14   18
## 4609    3    7   11   15   19
## 9314    4    8   12   16   20

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()

## snapshotDate(): 2019-10-29

# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]

## loading from cache

LefttoRight <- select(hs,
  column=c("SYMBOL", "ENTREZID"),
  keytype="SYMBOL", keys=rowID)

## 'select()' returned 1:1 mapping between keys and columns

(input <- convertToNCBIGeneID(input, rowID, LefttoRight))

##      [,1] [,2] [,3] [,4] [,5]
## 5460    1    5    9   13   17
## 6657    2    6   10   14   18
## 4609    3    7   11   15   19
## 9314    4    8   12   16   20

4 Step.3: Normalize the count matrix

Finally, we introduce some situations to perform some normalization methods of gene expression matrix.

If a user convert a Seurat object to a SingleCellExperient object by using as.SingleCellExperiment, the result of NormalizeData function (logcounts) is inherited to the SingleCellExperient object as follows;

pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize",
    scale.factor = 10000)
sce <- as.SingleCellExperiment(pbmc2)
assayNames(sce) # counts, logcounts

If the user want to use scater package, calculateCPM or normalize function can calculate the normalized expression values as follows; (see also the vignette of scater).

library("scater")
sce <- SingleCellExperiment(assays=list(counts = input))
cpm(sce) <- calculateCPM(sce)
sce <- normalize(sce)
assayNames(sce) # counts, normcounts, logcounts, cpm

Actually, any original normalization can be stored in the sce. For example, we can calculate the value of count per median (CPMED) as follows;

# User's Original Normalization Function
CPMED <- function(input){
    libsize <- colSums(input)
    median(libsize) * t(t(input) / libsize)
}
# Normalization
normcounts(sce) <- log10(CPMED(counts(sce)) + 1)

We recommend using the normcounts slot to save such original normalization values. After the normalization, such values can be specified by assayNames option in cellCellRanks cellCellDecomp and cellCellReport functions.

Session information

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] AnnotationHub_2.18.0                   
##  [2] BiocFileCache_1.10.2                   
##  [3] dbplyr_1.4.2                           
##  [4] Homo.sapiens_1.3.1                     
##  [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [6] org.Hs.eg.db_3.10.0                    
##  [7] GO.db_3.10.0                           
##  [8] OrganismDbi_1.28.0                     
##  [9] GenomicFeatures_1.38.0                 
## [10] AnnotationDbi_1.48.0                   
## [11] MeSH.Mmu.eg.db_1.13.0                  
## [12] LRBase.Mmu.eg.db_1.2.0                 
## [13] MeSH.Hsa.eg.db_1.13.0                  
## [14] MeSHDbi_1.22.0                         
## [15] SingleCellExperiment_1.8.0             
## [16] SummarizedExperiment_1.16.0            
## [17] DelayedArray_0.12.0                    
## [18] BiocParallel_1.20.0                    
## [19] matrixStats_0.55.0                     
## [20] Biobase_2.46.0                         
## [21] GenomicRanges_1.38.0                   
## [22] GenomeInfoDb_1.22.0                    
## [23] IRanges_2.20.0                         
## [24] S4Vectors_0.24.0                       
## [25] BiocGenerics_0.32.0                    
## [26] scTensor_1.2.1                         
## [27] RSQLite_2.1.2                          
## [28] LRBase.Hsa.eg.db_1.2.0                 
## [29] LRBaseDbi_1.4.0                        
## [30] BiocStyle_2.14.0                       
## 
## loaded via a namespace (and not attached):
##   [1] rappdirs_0.3.1                rtracklayer_1.46.0           
##   [3] AnnotationForge_1.28.0        tidyr_1.0.0                  
##   [5] ggplot2_3.2.1                 acepack_1.4.1                
##   [7] bit64_0.9-7                   knitr_1.25                   
##   [9] data.table_1.12.6             rpart_4.1-15                 
##  [11] RCurl_1.95-4.12               AnnotationFilter_1.10.0      
##  [13] cowplot_1.0.0                 europepmc_0.3                
##  [15] bit_1.1-14                    enrichplot_1.6.0             
##  [17] webshot_0.5.1                 xml2_1.2.2                   
##  [19] httpuv_1.5.2                  assertthat_0.2.1             
##  [21] viridis_0.5.1                 xfun_0.10                    
##  [23] hms_0.5.2                     evaluate_0.14                
##  [25] promises_1.1.0                TSP_1.1-7                    
##  [27] progress_1.2.2                caTools_1.17.1.2             
##  [29] dendextend_1.12.0             Rgraphviz_2.30.0             
##  [31] igraph_1.2.4.1                DBI_1.0.0                    
##  [33] htmlwidgets_1.5.1             MeSH.db_1.13.0               
##  [35] purrr_0.3.3                   dplyr_0.8.3                  
##  [37] backports_1.1.5               bookdown_0.14                
##  [39] annotate_1.64.0               biomaRt_2.42.0               
##  [41] vctrs_0.2.0                   ensembldb_2.10.0             
##  [43] abind_1.4-5                   ggforce_0.3.1                
##  [45] Gviz_1.30.0                   triebeard_0.3.0              
##  [47] BSgenome_1.54.0               checkmate_1.9.4              
##  [49] GenomicAlignments_1.22.0      gclus_1.3.2                  
##  [51] fdrtool_1.2.15                prettyunits_1.0.2            
##  [53] cluster_2.1.0                 DOSE_3.12.0                  
##  [55] dotCall64_1.0-0               lazyeval_0.2.2               
##  [57] crayon_1.3.4                  genefilter_1.68.0            
##  [59] pkgconfig_2.0.3               tweenr_1.0.1                 
##  [61] ProtGenerics_1.18.0           seriation_1.2-8              
##  [63] nnet_7.3-12                   rlang_0.4.1                  
##  [65] lifecycle_0.1.0               meshr_1.22.0                 
##  [67] registry_0.5-1                MeSH.PCR.db_1.13.0           
##  [69] rTensor_1.4                   GOstats_2.52.0               
##  [71] dichromat_2.0-0               polyclip_1.10-0              
##  [73] graph_1.64.0                  Matrix_1.2-17                
##  [75] urltools_1.7.3                base64enc_0.1-3              
##  [77] ggridges_0.5.1                viridisLite_0.3.0            
##  [79] MeSH.AOR.db_1.13.0            bitops_1.0-6                 
##  [81] visNetwork_2.0.8              KernSmooth_2.23-16           
##  [83] spam_2.4-0                    MeSH.Bsu.168.eg.db_1.13.0    
##  [85] Biostrings_2.54.0             blob_1.2.0                   
##  [87] stringr_1.4.0                 qvalue_2.18.0                
##  [89] nnTensor_1.0.2                gridGraphics_0.4-1           
##  [91] reactome.db_1.70.0            scales_1.0.0                 
##  [93] graphite_1.32.0               memoise_1.1.0                
##  [95] GSEABase_1.48.0               magrittr_1.5                 
##  [97] plyr_1.8.4                    gplots_3.0.1.1               
##  [99] gdata_2.18.0                  zlibbioc_1.32.0              
## [101] compiler_3.6.1                RColorBrewer_1.1-2           
## [103] plotrix_3.7-6                 Rsamtools_2.2.1              
## [105] XVector_0.26.0                Category_2.52.1              
## [107] MeSH.Aca.eg.db_1.13.0         htmlTable_1.13.2             
## [109] Formula_1.2-3                 MASS_7.3-51.4                
## [111] tidyselect_0.2.5              stringi_1.4.3                
## [113] highr_0.8                     MeSH.Syn.eg.db_1.13.0        
## [115] yaml_2.2.0                    GOSemSim_2.12.0              
## [117] askpass_1.1                   latticeExtra_0.6-28          
## [119] ggrepel_0.8.1                 grid_3.6.1                   
## [121] VariantAnnotation_1.32.0      fastmatch_1.1-0              
## [123] tools_3.6.1                   rstudioapi_0.10              
## [125] foreach_1.4.7                 foreign_0.8-72               
## [127] tagcloud_0.6                  outliers_0.14                
## [129] gridExtra_2.3                 farver_1.1.0                 
## [131] ggraph_2.0.0                  rvcheck_0.1.6                
## [133] digest_0.6.22                 BiocManager_1.30.9           
## [135] shiny_1.4.0                   Rcpp_1.0.3                   
## [137] BiocVersion_3.10.1            later_1.0.0                  
## [139] httr_1.4.1                    cummeRbund_2.28.0            
## [141] biovizBase_1.34.0             colorspace_1.4-1             
## [143] XML_3.98-1.20                 splines_3.6.1                
## [145] fields_9.9                    RBGL_1.62.1                  
## [147] graphlayouts_0.5.0            ggplotify_0.0.4              
## [149] plotly_4.9.1                  xtable_1.8-4                 
## [151] jsonlite_1.6                  heatmaply_0.16.0             
## [153] tidygraph_1.1.2               zeallot_0.1.0                
## [155] R6_2.4.0                      Hmisc_4.3-0                  
## [157] pillar_1.4.2                  htmltools_0.4.0              
## [159] mime_0.7                      glue_1.3.1                   
## [161] fastmap_1.0.1                 interactiveDisplayBase_1.24.0
## [163] codetools_0.2-16              maps_3.3.0                   
## [165] fgsea_1.12.0                  lattice_0.20-38              
## [167] tibble_2.1.3                  curl_4.2                     
## [169] gtools_3.8.1                  ReactomePA_1.30.0            
## [171] misc3d_0.8-4                  openssl_1.4.1                
## [173] survival_3.1-7                rmarkdown_1.16               
## [175] munsell_0.5.0                 DO.db_2.9                    
## [177] GenomeInfoDbData_1.2.2        plot3D_1.1.1                 
## [179] iterators_1.0.12              reshape2_1.4.3               
## [181] gtable_0.3.0

Roadmap to prepare the input matrix for `scTensor`

9 November 2019

Package

Contents

1 Introduction

2 Step.1 : Create gene-level expression matrix

2.1 Case I: Gene-level quantification

2.2 Case II : Transcript-level quantification

2.3 Case III: UMI-count

3 Step.2: Convert the row names of a matrix as NCBI Gene ID (ENTREZID)

3.1 Case I: Ensembl Gene ID to NCBI Gene ID

3.2 Case II: Gene Symbol to NCBI Gene ID

4 Step.3: Normalize the count matrix

Session information

Roadmap to prepare the input matrix for scTensor

9 November 2019

Package

Contents

1 Introduction

2 Step.1 : Create gene-level expression matrix

2.1 Case I: Gene-level quantification

2.2 Case II : Transcript-level quantification

2.3 Case III: UMI-count

3 Step.2: Convert the row names of a matrix as NCBI Gene ID (ENTREZID)

3.1 Case I: Ensembl Gene ID to NCBI Gene ID

3.2 Case II: Gene Symbol to NCBI Gene ID

4 Step.3: Normalize the count matrix

Session information

Roadmap to prepare the input matrix for `scTensor`