This vignette introduces the usage of the Bioconductor package ISLET (Individual-Specific ceLl typE referencing Tool). Complementary to classic deconvolution algorithms, ISLET can take cell type proportions as input, and infer the individual-specific and cell-type-specific reference panels. ISLET also offers functions to detect cell-type specific differential expression (csDE) genes. Additionally, it can test for csDE genes change rate difference between two groups, given an additional covariate of time points or age. ISLET is based on rigorous statistical framework of Expectation–Maximization(EM) algorithm, and has parallel computing embedded to provide superior computational performance.
ISLET 1.2.0
To install the package, start R (version 4.1.0 or higher) and enter:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ISLET")
You may post your question on ISLET’s GitHub Issue section: https://github.com/haoharryfeng/ISLET/issues.
In clinical samples, the observed bulk sequencing/microarray data are often a mixture of different cell types. Because each unique cell type has its own gene expression profile, the real sequencing/microarray data are the weighted average of signals from multiple pure cell types. In high-throughput data analysis, the mixing proportions will confound with the primary phenotype-of-interest, if not properly accounted for. Over the past several years, researchers have gained substantial interests in using computational methods to deconvolute cell compositions. Under the assumption of a commonly shared feature-by-cell-type reference panel across all samples, deconvolution methods were developed. However, this assumption may not hold. For example, when repeated samples are measured for each subject, assuming a shared reference panel across different time points for each subject is a preferred choice over assuming a shared one across all the samples.
Here, we developed a method called ISLET
(Individual-Specific ceLl typE referencing Tool), to solve for the individual-specific and cell-type-specific reference panels, once the cell type proportions are given. ISLET
can leverage on multiple observations or temporal measurements of the same subject. ISLET
adopted a more reasonable assumption that repeated samples from the same subject would share the same reference panel. This unknown subject-specific panel, treated as missing values, are modeled by Gaussian distribution in the mixed-effect regression framework and estimated by an iterative Expectation–Maximization (EM) algorithm, when combining all samples from all subjects together. This is the first statistical framework to estimate the subject-level cell-type-specific reference panel, for repeated measures. Our modeling can effectively borrow information across samples within the same subject. ISLET
can deconvolve reference panels based on the raw counts without batch effect in library size or the normalized counts such as Transcript Per Million (TPM). In the current version, ISLET
performs cell-type-specific differential expression analysis for two groups of subjects. Other covariates and additional groups will be added in future versions.
ISLET
depends on the following packages:
ISLET needs one input file organized into SummarizedExperiment objects, combining cases and controls. The input file should contain a feature by sample matrix for observed values stored in the counts
slot. It should also use the first column in colData
slot to store the group status (i.e. case/control), the second column in the colData
slot to store the subject IDs mapping to each sample. The remaining columns in the colData
slot should store the cell type proportions. In other words, use the column 3 to K+2 to store the cell type proportions for all K cell types. An example dataset GE600
is included:
Step 1: Load in example data.
library(ISLET)
data(GE600)
ls()
## [1] "GE600_se"
GE600_se
## class: SummarizedExperiment
## dim: 10 520
## metadata(0):
## assays(1): counts
## rownames(10): gene1 gene2 ... gene9 gene10
## rowData names(0):
## colnames(520): 6454256 1716203 ... 3657905 2440389
## colData names(8): group subject_ID ... Mono Others
It contains a SummarizedExperiment
object containing the following elements:
counts
stores the gene expression value data frame of 10 genes by 520 sample, with 83 cases and 89 controls, and multiple repeated measurements (i.e. time points) per subject. Each row is a gene and each column is a sample.
assays(GE600_se)$counts[1:5, 1:6]
## 6454256 1716203 8125261 6264143 5640428 3764673
## gene1 51.796233 50.593053 29.870425 55.052521 193.789977 60.638259
## gene2 1.486545 1.995879 2.670510 1.915750 1.456993 1.786843
## gene3 34.213228 41.148280 50.366612 15.964298 45.800864 22.525563
## gene4 5.788845 4.489374 7.776873 1.094732 1.451257 1.388009
## gene5 67.010747 76.308153 106.936595 257.108819 86.269951 67.152534
colData
stores the sample meta-data and the input cell type proportions. The first column is the group status (i.e. case/ctrl), the second column is the subject ID, shows the relationship between the 520 samples IDs and their 172 subject IDs. The remaining 6 columns (i.e. column 3-8) are the cell type proportions of all samples by their 6 cell types. The 6 cell types are: B cells, Tcells_CD4, Tcells_CD8, NK cells, Mono cells, and others cells.
colData(GE600_se)
## DataFrame with 520 rows and 8 columns
## group subject_ID Bcells Tcells_CD4 Tcells_CD8 NKcells
## <character> <integer> <numeric> <numeric> <numeric> <numeric>
## 6454256 case 210298 0.294597 0.0459207 0.0960261 0.0245194
## 1716203 case 210298 0.229228 0.0307202 0.0874901 0.0237722
## 8125261 case 210298 0.229506 0.0429694 0.1207701 0.0212622
## 6264143 case 223361 0.262023 0.0127117 0.0520090 0.0194373
## 5640428 case 223361 0.124125 0.0645530 0.0586977 0.0615492
## ... ... ... ... ... ... ...
## 5220586 ctrl 954888 0.426594 0.04046180 0.0854448 0.0184139
## 4601267 ctrl 954888 0.332744 0.04181961 0.0995010 0.0267642
## 6500466 ctrl 999257 0.311047 0.01287898 0.1226221 0.0312183
## 3657905 ctrl 999257 0.242521 0.01412359 0.1105289 0.0241399
## 2440389 ctrl 999257 0.353854 0.00908941 0.1042287 0.0192127
## Mono Others
## <numeric> <numeric>
## 6454256 0.1003072 0.438630
## 1716203 0.1284324 0.500357
## 8125261 0.0736778 0.511814
## 6264143 0.0608441 0.592975
## 5640428 0.2664628 0.424613
## ... ... ...
## 5220586 0.1106113 0.318474
## 4601267 0.0876010 0.411570
## 6500466 0.1019383 0.420296
## 3657905 0.0509589 0.557728
## 2440389 0.0952407 0.418374
This is the first step required before using ISLET for individual-specific reference deconvolution or testing. This step will prepare your input data for the downstream reference panels deconvolution (function isletSolve
) and/or testing of differential expression gene (function isletTest
). During this step, the input data in SummarizedExperiment
format will be further processed for ISLET. The expression values, from cases and controls respectively, will be extracted. The cell type names, number of cell types, number of cases/controls subject, number of cases/controls samples, will be obtained.
Step 2: Data preparation for downstream ISLET analysis.
study123input <- dataPrep(dat_se=GE600_se)
The output, are the extracted information in a S4 object, and can be overviewed by the function below:
study123input
## First couple of elements from cases and controls:
## 6454256 1716203 8125261 6264143 5640428 3764673
## gene1 51.796233 50.593053 29.87042 55.05252 193.789977 60.638259
## gene2 1.486545 1.995879 2.67051 1.91575 1.456993 1.786843
## gene3 34.213228 41.148280 50.36661 15.96430 45.800864 22.525563
## 1622468 6724003 3390865 2961023 2297235 1104596
## gene1 69.360096 42.421512 72.869948 59.052826 53.256419 77.130120
## gene2 3.823171 2.414935 3.391997 3.035638 3.539411 3.906642
## gene3 48.030845 40.214322 37.962721 13.309217 11.876977 16.314866
## Design matrices hidded.
## Total cell type number:
## [1] 6
## Cell type categories:
## [1] "Bcells" "Tcells_CD4" "Tcells_CD8" "NKcells" "Mono"
## [6] "Others"
## Total sample number and subject number:
## [1] 520 172
## Total case number and ctrl number:
## [1] 83 89
## First several subject ID for the samples:
## [1] 210298 210298 210298 223361 223361 223361 228055 228055 228055 228055
## Data preparation type (intercept/slope):
## [1] "intercept"
[Attention] Here we have strict requirement for the input data. Each subject ID represents a unique participant across cases and controls. Subjects also need to be sorted.
Step 3: With the curated data study123input
from the previous step, now we can use ISLET
to conduct deconvolution and obtain the individual-specific and cell-type-specific reference panels. This process can be achieved by running:
#Use ISLET for deconvolution
res.sol <- isletSolve(input=study123input)
The res.sol
is the deconvolution result list. For both case and control group, the deconvolution result is a list of length K
, where K
is the number of cell types. For each of the K
elements, it is a matrix of dimension G
by N
. For each of the K
cell types, it stores the deconvoluted values in a feature (G
) by subject (N
) matrix,
#View the deconvolution results
caseVal <- caseEst(res.sol)
ctrlVal <- ctrlEst(res.sol)
length(caseVal) #For cases, a list of 6 cell types' matrices.
## [1] 6
length(ctrlVal) #For controls, a list of 6 cell types' matrices.
## [1] 6
caseVal$Bcells[1:5, 1:4] #view the reference panels for B cells, for the first 5 genes and first 4 subjects, in Case group.
## 210298 223361 228055 229203
## gene1 0.0000000 0.004321134 1.134649 0.00000
## gene2 0.7402177 0.858061741 1.481642 0.11822
## gene3 9.3032183 4.673933671 4.991038 0.00000
## gene4 15.0234343 2.922345346 5.235796 15.70844
## gene5 26.9082246 31.428902564 15.347462 30.43319
ctrlVal$Bcells[1:5, 1:4] #view the reference panels for B cells, for the first 5 genes and first 4 subjects, in Control group.
## 225490 230198 248848 253527
## gene1 0.00000 0.000000 0.000000 0.000000
## gene2 2.73228 2.678124 2.163444 2.285876
## gene3 28.81544 2.697245 37.830728 5.347082
## gene4 13.66613 29.445485 6.849386 8.915755
## gene5 62.50238 53.985388 74.033019 52.470559
case.ind.ref
A list of length K
, where K
is the number of cell types. For each of the K
elements in this list, it is a feature by subject matrix containing all the feature values (i.e. gene expression values), for case group. It is one of the main products the individual-specific and cell-type-specific solve algorithm. ctrl.ind.ref
A list of length K
, where K
is the number of cell types. For each of the K
elements in this list, it is a feature by subject matrix containing all the feature values (i.e., gene expression values), for control group. It is one of the main products the individual-specific and cell-type-specific solve algorithm. mLLK
A scalar, the optimized marginal log-likelihood for the current model. It will be used in Likelihood Ratio Test (LRT).
Also, with the curated data study123input
from the previous Step 2, now we can test the group effect on individual reference panels, i.e., identifying csDE genes in mean or intercept. In this ‘intercept test’, we assume that the individual-specific reference panel is unchanged across time points. Note that Step 3 can be skipped, if one only need to call csDE genes. This test is done by the following line of code:
#Test for csDE genes
res.test <- isletTest(input=study123input)
## csDE testing on cell type 1
## csDE testing on cell type 2
## csDE testing on cell type 3
## csDE testing on cell type 4
## csDE testing on cell type 5
## csDE testing on cell type 6
## csDE testing on 6 cell types finished
The result res.test
is a matrix of p-values, in the dimension of feature by cell type. Each element is the LRT p-value, by contrasting case group and control group, for one feature in one cell type.
#View the test p-values
head(res.test)
## Bcells Tcells_CD4 Tcells_CD8 NKcells Mono Others
## [1,] 0.06649261 0.77489318 0.6077971 0.15315973 0.69627533 0.02982063
## [2,] 0.41736594 0.05028712 0.6539022 0.39584728 0.88311825 0.79493488
## [3,] 0.58806617 0.78104069 0.4299641 0.17664435 0.05892608 0.37546502
## [4,] 0.25000763 0.09684269 0.6367174 0.90274779 0.67013276 1.00000000
## [5,] 0.40303484 0.40255103 0.7866814 0.04603692 0.62486315 0.58097702
## [6,] 0.02558441 0.55927759 0.1807293 0.85494477 0.88159601 0.94549772
Given an additional continuous variable such as time or age, ISLET is able to compare cases and controls in the change-rate of reference profile over time. This is the ‘slope test’. Here, the assumption is that for the participants or subjects in a group, the individual reference profile could change over time, with change-rate fixed by group. At a given time point, there may be no (significant) group effect in the reference panel, but the participants still have distinct underlying reference profiles. Under this setting, it is of interest to test for such difference. Below is an example to detect reference panel change-rate difference between two groups, from data preparation to test.
We provide an additional example dataset GE600age
from the initial step to illustrate this. Different from the dataset GE600
above, here GE600age
has an additional age
column in the colData
, besides subject ID and cell type proportions. This covariate age
is required for the test.
Step 1: Load example dataset.
#(1) Example dataset for 'slope' test
data(GE600age)
ls()
## [1] "GE600_se" "GE600age_se" "caseVal" "ctrlVal"
## [5] "res.sol" "res.test" "study123input"
Similar to previous sections, it contains one SummarizedExperiment
objects containing the following elements:
counts
has the gene expression value data frame of 10 genes by 520 sample, with 83 cases and 89 controls, and multiple repeated measurements (i.e. time points) per subject. Each row is a gene and each column is a sample.
assays(GE600age_se)$counts[1:5, 1:6]
## 6454256 1716203 8125261 6264143 5640428 3764673
## gene1 51.796233 50.593053 29.870425 55.052521 193.789977 60.638259
## gene2 1.486545 1.995879 2.670510 1.915750 1.456993 1.786843
## gene3 34.213228 41.148280 50.366612 15.964298 45.800864 22.525563
## gene4 5.788845 4.489374 7.776873 1.094732 1.451257 1.388009
## gene5 67.010747 76.308153 106.936595 257.108819 86.269951 67.152534
colData
contains the sample meta-data. The first column is the case/ctrl group status, the second column is the subject ID, shows the relationship between the samples IDs and the corresopnding subject IDs. The third column is the age variable for each sample, which is the main variable in downstream testing. The remaining 6 columns (i.e. column 4-9) are the cell type proportions of all samples by their 6 cell types. The 6 cell types are: B cells, Tcells_CD4, Tcells_CD8, NK cells, Mono cells, and others cells.
colData(GE600age_se)
## DataFrame with 520 rows and 9 columns
## group subject_ID age Bcells Tcells_CD4 Tcells_CD8
## <character> <integer> <numeric> <numeric> <numeric> <numeric>
## 6454256 case 210298 9.63333 0.294597 0.0459207 0.0960261
## 1716203 case 210298 12.26667 0.229228 0.0307202 0.0874901
## 8125261 case 210298 15.50000 0.229506 0.0429694 0.1207701
## 6264143 case 223361 8.43333 0.262023 0.0127117 0.0520090
## 5640428 case 223361 16.66667 0.124125 0.0645530 0.0586977
## ... ... ... ... ... ... ...
## 5220586 ctrl 954888 16.2333 0.426594 0.04046180 0.0854448
## 4601267 ctrl 954888 19.1000 0.332744 0.04181961 0.0995010
## 6500466 ctrl 999257 12.8667 0.311047 0.01287898 0.1226221
## 3657905 ctrl 999257 15.2000 0.242521 0.01412359 0.1105289
## 2440389 ctrl 999257 18.0333 0.353854 0.00908941 0.1042287
## NKcells Mono Others
## <numeric> <numeric> <numeric>
## 6454256 0.0245194 0.1003072 0.438630
## 1716203 0.0237722 0.1284324 0.500357
## 8125261 0.0212622 0.0736778 0.511814
## 6264143 0.0194373 0.0608441 0.592975
## 5640428 0.0615492 0.2664628 0.424613
## ... ... ... ...
## 5220586 0.0184139 0.1106113 0.318474
## 4601267 0.0267642 0.0876010 0.411570
## 6500466 0.0312183 0.1019383 0.420296
## 3657905 0.0241399 0.0509589 0.557728
## 2440389 0.0192127 0.0952407 0.418374
[Attention] This time/age covariate must be stored in the third column in colData
, to successfully execute this testing. The data must be sorted by subject ID, so that the multiple replicates per subject are close to each other.
Step 2: Data preparation.
#(2) Data preparation
study456input <- dataPrepSlope(dat_se=GE600age_se)
## Begin: working on data preparation as the input for ISLET algorithm.
## Complete: data preparation for ISLET.
Step 3: ‘Slope’ testing.
#(3) Test for slope effect(i.e. age) difference in csDE testing
age.test <- isletTest(input=study456input)
## csDE testing on cell type 1
## csDE testing on cell type 2
## csDE testing on cell type 3
## csDE testing on cell type 4
## csDE testing on cell type 5
## csDE testing on cell type 6
## csDE testing on 6 cell types finished
The result age.test
is a matrix of p-values, in the dimension of feature by cell type. Each element is the LRT p-value, by contrasting case group and control group, for one feature in one cell type. In contrast to the (intercept) test described before, here is a test for difference of the expression CHANGE IN REFERENCE over time, between cases and controls.
#View the test p-values
head(age.test)
## Bcells Tcells_CD4 Tcells_CD8 NKcells Mono Others
## [1,] 1.0000000 0.7031954 0.03177445 0.8321539 0.02039022 0.09609757
## [2,] 0.7085139 0.9329100 0.79584834 0.3971310 0.80106928 0.76964312
## [3,] 0.3392235 0.1902524 0.06803089 0.1597954 0.03443943 0.85405799
## [4,] 0.1280650 0.5198595 0.40505410 0.1427035 0.53713379 0.11159749
## [5,] 0.0851573 0.6204602 0.87195749 0.7323157 0.24655800 0.50021983
## [6,] 0.4048984 0.6680858 0.71375795 0.9476769 0.50651788 0.30298067
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ISLET_1.2.0 SummarizedExperiment_1.30.0
## [3] Biobase_2.60.0 GenomicRanges_1.52.0
## [5] GenomeInfoDb_1.36.0 IRanges_2.34.0
## [7] S4Vectors_0.38.0 BiocGenerics_0.46.0
## [9] MatrixGenerics_1.12.0 matrixStats_0.63.0
## [11] BiocParallel_1.34.0 Matrix_1.5-4
## [13] BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.4 compiler_4.3.0 BiocManager_1.30.20
## [4] bitops_1.0-7 jquerylib_0.1.4 yaml_2.3.7
## [7] fastmap_1.1.1 mime_0.12 lattice_0.21-8
## [10] R6_2.5.1 XVector_0.40.0 knitr_1.42
## [13] DelayedArray_0.26.0 bookdown_0.33 GenomeInfoDbData_1.2.10
## [16] bslib_0.4.2 rlang_1.1.0 cachem_1.0.7
## [19] xfun_0.39 sass_0.4.5 cli_3.6.1
## [22] zlibbioc_1.46.0 digest_0.6.31 grid_4.3.0
## [25] evaluate_0.20 codetools_0.2-19 RCurl_1.98-1.12
## [28] rmarkdown_2.21 tools_4.3.0 htmltools_0.5.5