1 Introduction

1.1 About Cell-Cell Interaction (CCI)

Due to the rapid development of single-cell RNA-Seq (scRNA-Seq) technologies, a wide variety of cell types, such as those found in multiple organs of a healthy person, stem cell niches, and cancer stem cells, have been found. Such complex systems are composed of communication between cells (cell-cell interaction or CCI).

Many CCI studies are based on the ligand-receptor (L-R)-pair list of the FANTOM5 project (Jordan A. Ramilowski et al., A draft network of ligand-receptor-mediated multicellular signaling in human, Nature Communications, 2015) as the evidence of CCI (http://fantom.gsc.riken.jp/5/suppl/Ramilowski_et_al_2015/data/PairsLigRec.txt). The project proposed the L-R-candidate genes based on the following two criteria.

  1. Subcellular Localization
    1. Known Annotation (UniProtKB and HPRD) : The term “Secreted” for candidate ligand genes and “Plasma Membrane” for candidate receptor genes
    2. Computational Prediction (LocTree3 and PolyPhobius)
  2. Physical Binding of Proteins : Experimentally validated PPI (protein-protein interaction) information of HPRD and STRING

The project also merged the data with previous L-R databases such as IUPHAR/DLRP/HPMR and filtered out the pairs without PMIDs.

Here, we extend a similar approach to the construction of the L-R-pair lists of 12 organisms and implement them as multiple R/Bioconductor annotation packages for sustainable maintenance (LRBaseDbi and LRBase.XXX.eg.db-type packages; Figure 1). XXX is the abbreviation of the scientific name of the organism, such as LRBase.Hsa.eg.db for the L-R database of Homo sapiens. In addition, we developed scTensor, which is a method to detect CCIs and the CCI-related L-R pairs simultaneously. This document describes how to use the LRBaseDbi, LRBase.XXX.eg.db-type, and scTensor packages.

Figure 1 : Workflow of L-R-related packages

2 Usage

2.1 LRBase.XXX.eg.db (ligand-receptor database of 12 organisms)

To create the L-R-lists of the 12 organisms, we used the information about the subcellular localization of proteins from SWISSPROT (knowledge database) and TREMBL (computational prediction). We also used the PPI information from the STRING database (Figure 1). The proteins that are assigned the terms “Secreted” and “Cellular Membrane” are retrieved as candidate ligands and receptors, respectively. Finally, the L-R pairs that are registered in STRING are extracted as the candidate L-R-lists.

The following 12 organisms are implemented as LRBase.XXX.eg.db-type packages.
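The filtering described above can be sketched in base R. This is only a toy illustration, not the pipeline used to build the packages: the gene names, the `localization` and `ppi` tables, and the choice to treat each PPI as undirected are all assumptions made for this example.

```r
# Toy localization annotation (hypothetical genes and terms)
localization <- data.frame(
  gene = c("g1", "g2", "g3", "g4", "g5"),
  term = c("Secreted", "Secreted", "Cellular Membrane",
           "Cellular Membrane", "Nucleus"),
  stringsAsFactors = FALSE
)

# Toy PPI table standing in for STRING (hypothetical)
ppi <- data.frame(
  a = c("g1", "g4", "g5", "g3"),
  b = c("g3", "g2", "g4", "g4"),
  stringsAsFactors = FALSE
)

# Candidate ligands and receptors by localization term
ligands   <- localization$gene[localization$term == "Secreted"]
receptors <- localization$gene[localization$term == "Cellular Membrane"]

# Assumption: a PPI is undirected, so both orientations are considered
both <- rbind(ppi, data.frame(a = ppi$b, b = ppi$a, stringsAsFactors = FALSE))

# Keep interactions whose first partner is a ligand and second a receptor
keep <- both$a %in% ligands & both$b %in% receptors
lr_pairs <- unique(data.frame(
  GENEID_L = both$a[keep],
  GENEID_R = both$b[keep],
  stringsAsFactors = FALSE
))
print(lr_pairs)
```

Interactions between two membrane proteins (g3 and g4 here) or involving a protein with another localization (g5) are dropped, which is the intended effect of combining the two sources of evidence.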

Table 1: Summary of the 12 organisms of the LRBase.XXX.eg.db packages.

| Organism | Package Name | # Swiss-Prot (Secreted / Membrane) | # TrEMBL (Secreted / Membrane) | # STRING (PPI) | # Pair (Swiss-Prot × STRING) | # Pair (TrEMBL × STRING) |
|---|---|---|---|---|---|---|
| Homo sapiens | LRBase.Hsa.eg.db | 1592 / 2269 | 176 / 334 | 18838 | 21882 | 472 |
| Mus musculus | LRBase.Mmu.eg.db | 1309 / 1806 | 325 / 1555 | 19715 | 16386 | 476 |
| Arabidopsis thaliana | LRBase.Ath.eg.db | 1260 / 1001 | 244 / 80 | 24174 | 8697 | 94 |
| Rattus norvegicus | LRBase.Rno.eg.db | 643 / 983 | 232 / 1229 | 19963 | 5270 | 65 |
| Bos taurus | LRBase.Bta.eg.db | 517 / 448 | 192 / 390 | 18349 | 2220 | 237 |
| Caenorhabditis elegans | LRBase.Cel.eg.db | 198 / 247 | 28 / 60 | 13545 | 106 | 1 |
| Drosophila melanogaster | LRBase.Dme.eg.db | 249 / 333 | 89 / 148 | 11903 | 384 | 9 |
| Danio rerio | LRBase.Dre.eg.db | 119 / 169 | 318 / 376 | 21746 | 99 | 432 |
| Gallus gallus | LRBase.Gga.eg.db | 175 / 173 | 185 / 154 | 13084 | 140 | 105 |
| Pongo abelii | LRBase.Pab.eg.db | 80 / 134 | 212 / 211 | 16691 | 34 | 184 |
| Xenopus Silurana tropicalis | LRBase.Xtr.eg.db | 57 / 83 | 141 / 114 | 15338 | 19 | 107 |
| Sus scrofa | LRBase.Ssc.eg.db | 223 / 153 | 202 / 445 | 18683 | 277 | 130 |

2.1.1 columns, keytypes, keys, and select

Some data access functions are available for LRBase.XXX.eg.db-type packages. Any data table is retrieved by four functions defined by AnnotationDbi (columns, keytypes, keys, and select), which are commonly implemented by the LRBaseDbi package. columns returns the columns that can be retrieved from LRBase.XXX.eg.db-type packages. keytypes returns the columns that can be used as the keytype parameter of the keys and select functions. keys returns the values of a given keytype. select returns the rows that match the user-specified keys, restricted to the requested columns, as a data frame. See the vignette of AnnotationDbi for more details.

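# Load the class definitions and the human L-R database package
suppressPackageStartupMessages(library("LRBaseDbi"))
suppressPackageStartupMessages(library("LRBase.Hsa.eg.db"))
columns(LRBase.Hsa.eg.db)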
## [1] "GENEID_L" "GENEID_R" "SOURCEDB" "SOURCEID"
keytypes(LRBase.Hsa.eg.db)
## [1] "GENEID_L" "GENEID_R" "SOURCEDB" "SOURCEID"
key_HSA <- keys(LRBase.Hsa.eg.db, keytype="GENEID_L")
head(select(LRBase.Hsa.eg.db, keys=key_HSA[1:2],
            columns=c("GENEID_L", "GENEID_R"), keytype="GENEID_L"))
##   GENEID_L GENEID_R
## 1     4016       14
## 2   344752       14
## 3     4016     2678
## 4     4016     5251
## 5   344752    56670

2.1.2 Other functions

Other additional functions such as lrPackageName, lrNomenclature, species, lrListDatabases, and lrVersion are available. In each LRBase.XXX.eg.db-type package, lrPackageName returns the package name, species returns the common name, and lrNomenclature returns the scientific name of the organism. lrListDatabases returns the sources of the data, and lrVersion returns the version of the L-R data. dbInfo returns the information of the package. dbfile returns the path of the SQLite file. dbschema returns the schema of the database. dbconn returns the connection to the SQLite database.

lrPackageName(LRBase.Hsa.eg.db)
## [1] "LRBase.Hsa.eg.db"
lrNomenclature(LRBase.Hsa.eg.db)
## [1] "Homo sapiens"
species(LRBase.Hsa.eg.db)
## [1] "Human"
lrListDatabases(LRBase.Hsa.eg.db)
##           SOURCEDB
## 1 SWISSPROT_STRING
## 2    TREMBL_STRING
## 3         SOURCEDB
## 4           IUPHAR
## 5             DLRP
lrVersion(LRBase.Hsa.eg.db)
##        NAME VALUE
## 1 LRVERSION  2019
dbInfo(LRBase.Hsa.eg.db)
##               NAME                                              VALUE
## 1       SOURCEDATE                                         7-Oct-2019
## 2      SOURCENAME1                                          SWISSPROT
## 3      SOURCENAME2                                             TREMBL
## 4      SOURCENAME3                                             STRING
## 5       SOURCEURL1 http://www.uniprot.org/uniprot/?query=reviewed:yes
## 6       SOURCEURL2  http://www.uniprot.org/uniprot/?query=reviewed:no
## 7       SOURCEURL3              https://string-db.org/cgi/download.pl
## 8         DBSCHEMA                                   LRBase.Hsa.eg.db
## 9  DBSCHEMAVERSION                                              1.2.0
## 10        ORGANISM                                       Homo sapiens
## 11         SPECIES                                              Human
## 12         package                                      AnnotationDbi
## 13         Db type                                           LRBaseDb
## 14       LRVERSION                                               2019
dbfile(LRBase.Hsa.eg.db)
## [1] "/home/biocbuild/bbs-3.10-bioc/R/library/LRBase.Hsa.eg.db/extdata/LRBase.Hsa.eg.db.sqlite"
dbschema(LRBase.Hsa.eg.db)
## [1] "CREATE TABLE `METADATA` (\n  `NAME` TEXT,\n  `VALUE` TEXT\n)"                                           
## [2] "CREATE TABLE `DATA` (\n  `GENEID_L` TEXT,\n  `GENEID_R` TEXT,\n  `SOURCEID` TEXT,\n  `SOURCEDB` TEXT\n)"
dbconn(LRBase.Hsa.eg.db)
## <SQLiteConnection>
##   Path: /home/biocbuild/bbs-3.10-bioc/R/library/LRBase.Hsa.eg.db/extdata/LRBase.Hsa.eg.db.sqlite
##   Extensions: TRUE

Combined with the dbGetQuery function of the RSQLite package, more complicated queries can also be submitted.

suppressPackageStartupMessages(library("RSQLite"))
dbGetQuery(dbconn(LRBase.Hsa.eg.db),
  "SELECT * FROM DATA WHERE GENEID_L = '9068' AND GENEID_R = '14' LIMIT 10")
## [1] GENEID_L GENEID_R SOURCEID SOURCEDB
## <0 rows> (or 0-length row.names)
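As a further illustration of what a more complicated query might look like, the sketch below counts the L-R pairs per source database with GROUP BY. To keep it self-contained, it runs against an in-memory SQLite database holding a few made-up rows rather than the real package. Only the table name DATA and its four columns are taken from the schema shown above; every row value is hypothetical.

```r
library("DBI")

# In-memory stand-in for the package database (toy rows, not real data)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(
  GENEID_L = c("1", "2", "3"),
  GENEID_R = c("10", "20", "30"),
  SOURCEID = c("x", "y", "z"),
  SOURCEDB = c("SWISSPROT_STRING", "SWISSPROT_STRING", "TREMBL_STRING"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "DATA", toy)

# Count the L-R pairs contributed by each source database
res <- dbGetQuery(con,
  "SELECT SOURCEDB, COUNT(*) AS n FROM DATA GROUP BY SOURCEDB ORDER BY SOURCEDB")
print(res)

dbDisconnect(con)
```

The same SQL string could presumably be passed to dbGetQuery together with dbconn(LRBase.Hsa.eg.db), since the DATA table there has the same columns.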

2.2 LRBaseDbi (Class definition and meta-packaging)

LRBaseDbi provides the class definition of the LRBaseDb object, which is instantiated from LRBaseDb-class. In addition, the LRBaseDbi package generates a user’s original LRBase.XXX.eg.db-type package with the makeLRBasePackage function. This function is inspired by our previous package MeSHDbi, which constructs a user’s original MeSH.XXX.eg.db-type package. Here we call this function “meta”-packaging. The 12 LRBase.XXX.eg.db-type packages described above are also generated by this “meta”-packaging. In this case, the user only has to specify 1. an L-R-list containing the columns “GENEID_L” (ligand NCBI Gene IDs) and “GENEID_R” (receptor NCBI Gene IDs) and 2. a meta information table describing the L-R-list. The makeLRBasePackage function then generates an LRBase.XXX.eg.db-type package as shown below. The gene identifier is limited to NCBI Gene ID for now.

example("makeLRBasePackage")
## 
## mkLRBP> if(interactive()){
## mkLRBP+     ## makeLRBasePackage enable users to construct
## mkLRBP+     ## user's own custom LRBase package
## mkLRBP+     data(FANTOM5)
## mkLRBP+     head(FANTOM5)
## mkLRBP+ 
## mkLRBP+     # We are also needed to prepare meta data as follows.
## mkLRBP+     data(metaFANTOM5)
## mkLRBP+     metaFANTOM5
## mkLRBP+ 
## mkLRBP+     ## sets up a temporary directory for this example
## mkLRBP+     ## (users won't need to do this step)
## mkLRBP+     tmp <- tempfile()
## mkLRBP+     dir.create(tmp)
## mkLRBP+ 
## mkLRBP+     ## makes an Organism package for human called Homo.sapiens
## mkLRBP+     makeLRBasePackage(pkgname = "FANTOM5.Hsa.eg.db",
## mkLRBP+         data = FANTOM5,
## mkLRBP+         metadata = metaFANTOM5,
## mkLRBP+         organism = "Homo sapiens",
## mkLRBP+         version = "0.99.0",
## mkLRBP+         maintainer = "Koki Tsuyuzaki <k.t.the-answer@hotmail.co.jp>",
## mkLRBP+         author = "Koki Tsuyuzaki",
## mkLRBP+         destDir = tmp,
## mkLRBP+         license="Artistic-2.0")
## mkLRBP+ }

Although any package name is acceptable, if the organism of the user’s L-R-list is one of those listed above (Table 1), using the same XXX abbreviation is recommended. This is because the HTML report function described later identifies the XXX abbreviation, and if it corresponds to one of the 12 organisms, the gene annotation in the generated HTML report becomes richer.

2.3 scTensor (CCI-tensor construction, decomposition, and HTML reporting)

Combined with an LRBase.XXX.eg.db-type package and the user’s scRNA-Seq gene expression matrix, scTensor detects CCIs and generates HTML reports for exploratory data inspection. The algorithm of scTensor is as follows.

First, scTensor calculates the cell type-level mean vectors, searches for the ligand and receptor genes of each L-R pair in the row names of the matrix, and extracts them as two vectors.
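This first step can be sketched in base R as follows. It is a toy illustration rather than scTensor's internal code: the expression matrix, the cell type labels, and the L-R list (including the Gene IDs "10", "20", "30", and "99") are invented for the example.

```r
# Toy expression matrix: genes (rows, named by Gene ID) x cells (columns)
expr <- matrix(c(
  1, 3, 5, 7,   # gene "10"
  2, 2, 0, 4,   # gene "20"
  0, 6, 1, 1    # gene "30"
), nrow = 3, byrow = TRUE,
dimnames = list(c("10", "20", "30"), paste0("cell", 1:4)))
celltype <- c("A", "A", "B", "B")

# Cell type-level mean expression: genes x cell types
mean_by_type <- sapply(
  split(seq_len(ncol(expr)), celltype),
  function(idx) rowMeans(expr[, idx, drop = FALSE])
)

# Toy L-R list; gene "99" is absent from the matrix
lr <- data.frame(
  GENEID_L = c("10", "30", "99"),
  GENEID_R = c("20", "10", "20"),
  stringsAsFactors = FALSE
)

# Keep only the pairs whose two genes are both found in the row names
found <- lr$GENEID_L %in% rownames(mean_by_type) &
         lr$GENEID_R %in% rownames(mean_by_type)
lr <- lr[found, ]

# One ligand vector and one receptor vector (over cell types) per pair
ligand_means   <- mean_by_type[lr$GENEID_L, , drop = FALSE]
receptor_means <- mean_by_type[lr$GENEID_R, , drop = FALSE]
print(ligand_means)
print(receptor_means)
```

Each row of ligand_means and receptor_means corresponds to one retained L-R pair, so the two matrices stay aligned row by row.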

Next, for each L-R pair, the cell type-level mean vector of ligand expression and that of receptor expression are multiplied as an outer product, which yields a cell type \(\times\) cell type matrix. The matrices of all L-R pairs can then be represented as a third-order “tensor” (Ligand-Cell \(\times\) Receptor-Cell \(\times\) L-R-Pair). scTensor decomposes the tensor into a small tensor (core tensor) and two factor matrices. Tensor decomposition is very similar to matrix decomposition methods such as PCA (principal component analysis). The core tensor is similar to the eigenvalues of PCA; it indicates how prominent each pattern is. Likewise, the factor matrices are similar to the PC scores/loadings of PCA; they represent which ligand-cell/receptor-cell/L-R-pair patterns are informative. When the matrices have negative values, interpreting which direction (+/-) is important and which is not is a difficult and laborious task. For this reason, scTensor performs non-negative Tucker2 decomposition (NTD2), which is a non-negative version of tensor decomposition (cf. nnTensor).
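The construction of the CCI tensor can be sketched in base R. This is a minimal toy example, not scTensor's own implementation, and it stops before the decomposition step: the two mean matrices and the names "pair1", "pair2", "A", and "B" are assumptions made for illustration.

```r
# Toy cell type-level means: one row per L-R pair, one column per cell type
ligand_means <- matrix(c(2, 6,
                         3, 1), nrow = 2, byrow = TRUE,
                       dimnames = list(c("pair1", "pair2"), c("A", "B")))
receptor_means <- matrix(c(2, 2,
                           2, 6), nrow = 2, byrow = TRUE,
                         dimnames = list(c("pair1", "pair2"), c("A", "B")))

n_type <- ncol(ligand_means)
n_pair <- nrow(ligand_means)

# Third-order array: ligand cell type x receptor cell type x L-R pair
cci <- array(0, dim = c(n_type, n_type, n_pair),
             dimnames = list(ligand_cell   = colnames(ligand_means),
                             receptor_cell = colnames(receptor_means),
                             pair          = rownames(ligand_means)))

# Each frontal slice is the outer product of the ligand and receptor vectors
for (k in seq_len(n_pair)) {
  cci[, , k] <- outer(ligand_means[k, ], receptor_means[k, ])
}
print(cci)
```

Because mean expression values are non-negative, every entry of the array is non-negative as well, which is what makes a non-negative decomposition applicable.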

Finally, the result of NTD2 is summarized as an HTML report. Because most of the plots are rendered with the plotly package, the details of each plot can be inspected interactively in the user’s web browser. The two factor matrices can be viewed interactively to see which cell types and which L-R pairs are likely to interact with each other. The mode-3 (L-R-pair direction) sum of the core tensor is calculated and visualized as Ligand-Receptor Patterns. Details of the (Ligand-Cell, Receptor-Cell, L-R-pair) Patterns are also visualized.
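The mode-3 sum mentioned above can be illustrated with a small base R sketch. The `core` array below is filled with arbitrary numbers and merely stands in for a core tensor; it is not the output of an actual NTD2 run.

```r
# Toy core tensor: 2 ligand patterns x 2 receptor patterns x 3 L-R pairs
core <- array(1:12, dim = c(2, 2, 3))

# Summing over the third mode gives a ligand-pattern x receptor-pattern matrix
pattern_sum <- apply(core, c(1, 2), sum)
print(pattern_sum)

# Index of the strongest (ligand pattern, receptor pattern) combination
strongest <- which(pattern_sum == max(pattern_sum), arr.ind = TRUE)
print(strongest)
```

Large entries of pattern_sum point to combinations of a ligand-cell pattern and a receptor-cell pattern that are supported by many or strongly weighted L-R pairs.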

2.3.1 Creating a SingleCellExperiment object

Here, we use the scRNA-Seq dataset of male germline cells and somatic cells\(^{3}\) (GSE86146) as demo data. To keep the package size small, the number of genes is strictly reduced by selecting highly variable genes with a p-value threshold of 1E-150 (cf. Identifying highly variable genes). Therefore, we do not discuss the scientific interpretation of the data here.

We assume that the user has an scRNA-Seq data matrix containing expression count data summarised at the level of the gene. First, we create a SingleCellExperiment object containing the data. The rows of the object correspond to features, and the columns correspond to cells. The gene identifier is limited to NCBI Gene ID for now.

To improve the interpretability of the HTML report, we highly recommend that the user also provides a two-dimensional representation of the input data (e.g. PCA, t-SNE, or UMAP). Such information is easily specified with the reducedDims function of the SingleCellExperiment package and is saved to the reducedDims slot of the SingleCellExperiment object (Figure 1).

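# Load the packages providing the demo data and the SingleCellExperiment class
suppressPackageStartupMessages(library("scTensor"))
suppressPackageStartupMessages(library("SingleCellExperiment"))

data(GermMale)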
data(labelGermMale)
data(tsneGermMale)

sce <- SingleCellExperiment(assays=list(counts = GermMale))
reducedDims(sce) <- SimpleList(TSNE=tsneGermMale$Y)
plot(reducedDims(sce)[[1]], col=labelGermMale, pch=16, cex=2,
  xlab="Dim1", ylab="Dim2", main="Germline, Male, GSE86146")
legend("topleft", legend=c(paste0("FGC_", 1:3), paste0("Soma_", 1:4)),
  col=c("#9E0142", "#D53E4F", "#F46D43", "#ABDDA4", "#66C2A5", "#3288BD", "#5E4FA2"),
  pch=16)