LineagePulse 1.4.0

LineagePulse is a differential expression algorithm for single-cell RNA-seq (scRNA-seq) data. It is based on a zero-inflated negative binomial noise model and can capture both discrete and continuous population structures. Discrete population structures are groups of cells (e.g. experimental conditions or tSNE clusters). Continuous population structures are, for example, pseudotemporal or temporal orderings of cells. The main use and novelty of LineagePulse lies in its ability to fit gene expression trajectories on pseudotemporal orderings of cells. Note that LineagePulse does not infer a pseudotemporal ordering; it is a downstream tool for analysing gene expression trajectories on a given pseudotemporal ordering (such as one from diffusion pseudotime or monocle2).
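To make the noise model concrete, the following is a minimal sketch of a zero-inflated negative binomial probability mass function in base R. It assumes the mean/size parameterisation of `dnbinom` and a single drop-out probability per observation. The function name `dzinb` and its argument names are illustrative only and are not part of the LineagePulse API, whose internal parameterisation may differ.

```r
# Hypothetical helper (not a LineagePulse function): ZINB probability mass.
# x    : observed counts
# mu   : negative binomial mean
# size : negative binomial size (inverse dispersion)
# pi0  : drop-out probability (probability of a technical zero)
dzinb <- function(x, mu, size, pi0) {
    nb <- dnbinom(x, mu = mu, size = size)
    # A zero can come from drop-out or from the negative binomial itself;
    # a non-zero count can only come from the negative binomial.
    ifelse(x == 0, pi0 + (1 - pi0) * nb, (1 - pi0) * nb)
}

# Drop-out inflates the probability of observing a zero.
pZeroNB <- dnbinom(0, mu = 5, size = 2)
pZeroZINB <- dzinb(0, mu = 5, size = 2, pi0 = 0.1)
cat("P(0) under NB:", round(pZeroNB, 3),
    " under ZINB:", round(pZeroZINB, 3), "\n")
```

In this sketch the drop-out component only adds mass at zero, which is why a model without drop-out can underestimate the mean of a gene with many technical zeros.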

To run LineagePulse on scRNA-seq data, the user only needs to supply a minimal set of input parameters to the wrapper function runLineagePulse, which then performs all normalisation, model fitting and differential expression analysis steps without further user interaction:

- counts the count matrix (genes x cells), which MUST NOT be normalised in any way. A valid input is expected counts from an aligner. Note that TPM or depth-normalised expected counts are NOT count data. The statistical framework of LineagePulse rests on the assumption that counts contains count data. The count matrix can also be supplied as the path to a .mtx file (sparse matrix format) or as a SummarizedExperiment or SingleCellExperiment object.
- dfAnnotation data frame that contains cell-wise annotation. The row names of dfAnnotation must be equal to the column names of counts, as both correspond to cells. Note that if counts is a SummarizedExperiment or SingleCellExperiment object and dfAnnotation is NULL, the annotation data frame is taken to be colData(counts). dfAnnotation must contain:
  - a column named “continuous” if a continuous model is fit (e.g. strMuModel is impulse or splines and expression is fit as a function of time or pseudotime coordinates)
  - a column named “groups” if a discrete population model is fit (e.g. strMuModel is groups and expression is fit as a function of group assignment, e.g. clusters or experimental conditions) that contains the assignment of cells to these groups as strings
  - columns that describe the batch structure (if any).
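The input requirements above can be sketched in base R as follows. The data are synthetic, and the sizes, the gene and cell names and the batch column name `batch` are assumptions made for illustration; only the column names “continuous” and “groups” and the row-name rule come from the description above.

```r
set.seed(1)
nGenes <- 5
nCells <- 8

# Raw, unnormalised integer counts (genes x cells) with gene and cell names.
counts <- matrix(
    rnbinom(nGenes * nCells, mu = 10, size = 2),
    nrow = nGenes, ncol = nCells,
    dimnames = list(paste0("gene_", seq_len(nGenes)),
                    paste0("cell_", seq_len(nCells))))

# Cell-wise annotation: one row per cell, row names equal to the cell names.
dfAnnotation <- data.frame(
    continuous = seq(0, 1, length.out = nCells),     # e.g. pseudotime
    groups = rep(c("A", "B"), each = nCells / 2),    # group labels as strings
    batch = rep(c("b1", "b2"), times = nCells / 2),  # hypothetical batch column
    row.names = colnames(counts),
    stringsAsFactors = FALSE)

# Sanity checks mirroring the requirements listed above.
stopifnot(identical(rownames(dfAnnotation), colnames(counts)))
stopifnot(all(counts == round(counts)), all(counts >= 0))
```

Running such checks before calling runLineagePulse can catch a mismatch between cell names in the two inputs early.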

Additionally, one can provide:

- matPiConstPredictors a matrix of gene-specific predictors of the drop-out rate (genes x predictors). We suggest using the log average expression (unless strDropModel is “logistic_ofMu”) and potentially parameters which may affect sequencing efficiency, such as the GC content of the gene.
- strMuModel the type of expression model to use as the alternative model for differential expression analysis: “impulse” for an impulse model, “splines” for a natural cubic spline model and “groups” for a discrete population model.
- vecConfoundersMu a vector of strings which correspond to column names in dfAnnotation which describe the batch structure to be corrected for.
- scaDFSplinesMu the degrees of freedom of the spline-based model for the mean parameter if strMuModel was set to “splines”.
- vecNormConstExternal cell-wise normalisation constants to be used (e.g. sequencing depth correction factors). The names of the elements have to be the column names of counts (cells).
- scaNProc to set the number of processes for parallelization.
- boolVerbose output basic progress reports while the wrapper function runs.
- boolSuperVerbose output detailed progress reports for each step of the wrapper function.
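As an illustration of two of these optional inputs, the sketch below builds a one-column predictor matrix from the log average expression and a named vector of depth-based normalisation constants. The pseudocount of 1, the scaling of the constants to mean one and the column name `logMeanExpr` are assumptions of this sketch, not requirements stated by LineagePulse.

```r
set.seed(1)
counts <- matrix(
    rnbinom(4 * 6, mu = 20, size = 2), nrow = 4, ncol = 6,
    dimnames = list(paste0("gene_", 1:4), paste0("cell_", 1:6)))

# Gene-specific drop-out predictors (genes x predictors).
# Assumption: a pseudocount of 1 keeps the log finite for all-zero genes.
matPiConstPredictors <- matrix(
    log(rowMeans(counts) + 1), ncol = 1,
    dimnames = list(rownames(counts), "logMeanExpr"))

# Cell-wise normalisation constants, named by cell.
# Assumption: relative sequencing depth, scaled to have mean one.
vecDepth <- colSums(counts)
vecNormConstExternal <- vecDepth / mean(vecDepth)

print(matPiConstPredictors)
print(round(vecNormConstExternal, 2))
```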

Lastly, the experienced user who has a solid grasp of the mathematical and algorithmic basis of LineagePulse may change the defaults of these advanced input options:

- vecConfoundersDisp batch variables to be used to correct the dispersion (variance).
- strDispModelFull the dispersion model to be used for the full model.
- strDispModelRed the dispersion model to be used for the reduced model.
- strDropModel the drop-out model to be used.
- strDropFitGroup the groups of cells which receive one parameterisation of the drop-out model.
- scaDFSplinesDisp the degrees of freedom of the spline-based model for the dispersion parameter if strDispModel was set to “splines”.
- boolEstimateNoiseBasedOnH0 whether to estimate the drop-out model on the null or alternative expression model. Note that setting this to FALSE strongly increases the run time.
- scaMaxEstimationCycles maximum number of drop-out and expression model estimation iteration cycles.

Here, we present a differential expression analysis scenario on a longitudinal ordering. The differential expression results are in a data frame which can be accessed from the output object via list-like properties ($). The core results of the differential expression analysis are the p-value and the false-discovery-rate corrected p-value of differential expression, which result from a gene-wise hypothesis test of a non-constant expression model (impulse, splines or groups) versus a constant expression model.

`library(LineagePulse)`

```
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
```

```
lsSimulatedData <- simulateContinuousDataSet(
    scaNCells = 100,
    scaNConst = 10,
    scaNLin = 10,
    scaNImp = 10,
    scaMumax = 100,
    scaSDMuAmplitude = 3,
    vecNormConstExternal = NULL,
    vecDispExternal = rep(20, 30),
    vecGeneWiseDropoutRates = rep(0.1, 30))
```

`## Draw mean trajectories`

`## Setting size factors uniformly =1`

`## Draw dispersion`

`## Simulate negative binomial noise`

`## Simulate drop-out`

```
objLP <- runLineagePulse(
    counts = lsSimulatedData$counts,
    dfAnnotation = lsSimulatedData$annot)
```

`## LineagePulse for count data: v1.4.0`

`## --- Data preprocessing`

`## # 0 out of 100 cells did not have a continuous covariate and were excluded.`

`## # 0 out of 30 genes did not contain non-zero observations and are excluded from analysis.`

`## # 0 out of 100 cells did not contain non-zero observations and are excluded from analysis.`

`## --- Compute normalisation constants:`

`## # All size factors are set to one.`

`## --- Fit ZINB model for both H1 and H0.`

`## ### a) Fit ZINB model A (H0: mu=constant disp=constant) with noise model.`

`## # . Initialisation: ll -24950.2826257684`

`## # 1. Iteration with ll -13057.3434879246 in 0.02 min.`

`## # 2. Iteration with ll -12725.6100800047 in 0.05 min.`

`## # 3. Iteration with ll -12725.6087943848 in 0.02 min.`

`## Finished fitting zero-inflated negative binomial model A with noise model in 0.12 min.`

`## ### b) Fit ZINB model B (H1: mu=splines disp=constant).`

`## # . Initialisation: ll -14388.5875174006`

`## # 1. Iteration with ll -12337.0338195748 in 0.02 min.`

`## Finished fitting zero-inflated negative binomial model B in 0.03 min.`

`## ### c) Fit NB model A (H0: mu=constant disp=constant).`

`## # . Initialisation: ll -14251.5454698876`

`## # 1. Iteration with ll -14072.0690962406 in 0.01 min.`

`## Finished fitting NB model B in 0.02 min.`

`## ### d) Fit NB model B (H1: mu=splines disp=constant).`

`## # . Initialisation: ll -14458.6945343055`

`## # 1. Iteration with ll -13957.998971559 in 0.02 min.`

`## Finished fitting NB model B in 0.03 min.`

`## Time elapsed during ZINB fitting: 0.22 min`

`## --- Run differential expression analysis.`

`## Finished runLineagePulse().`

`head(objLP$dfResults)`

```
##          gene          p      padj   mean_H0       p_nb   padj_nb
## gene_1 gene_1 0.71078558 0.7615560 70.923131 0.71078558 0.9990978
## gene_2 gene_2 0.08328868 0.1469800  6.249991 0.08328868 0.2803342
## gene_3 gene_3 0.64299702 0.7419196  3.149716 0.64299702 0.9990978
## gene_4 gene_4 0.57069463 0.6848336 35.019657 0.57069463 0.9990978
## gene_5 gene_5 0.36380894 0.4547612 87.571101 0.36380894 0.9990978
## gene_6 gene_6 0.07616542 0.1428102 77.777865 0.07616542 0.9990978
##        df_full_zinb df_red_zinb df_full_nb df_red_nb loglik_full_zinb
## gene_1            7           2          7         2        -441.3621
## gene_2            7           2          7         2        -279.0289
## gene_3            7           2          7         2        -204.6424
## gene_4            7           2          7         2        -408.0237
## gene_5            7           2          7         2        -454.5199
## gene_6            7           2          7         2        -442.8618
##        loglik_red_zinb loglik_full_nb loglik_red_nb allZero
## gene_1       -442.8271      -519.4502     -520.7991   FALSE
## gene_2       -283.8934      -279.6291     -284.4807   FALSE
## gene_3       -206.3279      -206.5198     -208.9529   FALSE
## gene_4       -409.9504      -451.1589     -452.4751   FALSE
## gene_5       -457.2433      -542.6512     -543.0586   FALSE
## gene_6       -447.8455      -522.8982     -523.3171   FALSE
```
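The relation between the log-likelihood, degrees-of-freedom and p-value columns can be checked by hand. The sketch below assumes that the gene-wise test is a likelihood ratio test whose deviance is compared against a chi-squared distribution with df_full - df_red degrees of freedom. The numbers are copied from the gene_2 row of the table above; the variable names are illustrative.

```r
# Values for gene_2, copied from the results table above (ZINB columns).
loglikFull <- -279.0289  # loglik_full_zinb
loglikRed <- -283.8934   # loglik_red_zinb
dfFull <- 7              # df_full_zinb
dfRed <- 2               # df_red_zinb

# Likelihood ratio test statistic and its chi-squared p-value.
# Assumption: the deviance is compared against a chi-squared distribution
# with (dfFull - dfRed) degrees of freedom.
scaDeviance <- 2 * (loglikFull - loglikRed)
scaPval <- pchisq(scaDeviance, df = dfFull - dfRed, lower.tail = FALSE)
cat("Deviance:", round(scaDeviance, 3),
    " p-value:", signif(scaPval, 3), "\n")
```

For gene_2 this yields a p-value of about 0.083, in agreement with the p column of the table.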

In addition to the raw p-values, one may be interested in further details of the expression models, such as the shape of the expression mean as a function of pseudotime, log fold changes (LFC) and global expression trends as a function of pseudotime. We address each of these follow-up questions in a separate section below. Note that all of these follow-up questions are answered based on the models that were fit to compute the p-value of differential expression. Therefore, once runLineagePulse() has been called, no further model fitting is required.

# Further inspection of results

## Plot gene-wise trajectories

Multiple options are available for gene-wise expression trajectory plotting: Observations can be coloured by the posterior probability of drop-out (boolColourByDropout). Observations can be normalised based on the alternative expression model or shown as raw observations in the scatter plot (boolH1NormCounts). Lineage contours can be added to aid visual interpretation of effects related to non-uniform population density in pseudotime (boolLineageContour). Log counts can be displayed instead of counts if the fold changes are large (boolLogPlot). In any case, the output of the gene-wise expression trajectory plotting function plotGene is a ggplot2 object which can then be printed or modified.

```
# plot the gene with the lowest p-value of differential expression
gplotExprProfile <- plotGene(
objLP = objLP, boolLogPlot = FALSE,
strGeneID = objLP$dfResults[which.min(objLP$dfResults$p),]$gene,
boolLineageContour = FALSE)
gplotExprProfile
```

The function plotGene also shows the H1 model fit under a negative binomial noise model (“H1(NB)”) as a reference to show what the model fit looks like if drop-out is not accounted for.

LineagePulse provides the user with parameter extraction functions that allow the user to interact directly with the raw model fits for analytic tasks or questions not addressed above.

```
# extract the mean parameter fits per cell of the gene with the lowest p-value.
matMeanParamFit <- getFitsMean(
lsMuModel = lsMuModelH1(objLP),
vecGeneIDs = objLP$dfResults[which.min(objLP$dfResults$p),]$gene)
cat("Minimum fitted mean parameter: ", round(min(matMeanParamFit),1) )
```

`## Minimum fitted mean parameter: 69.3`

`cat("Mean fitted mean parameter: ", round(mean(matMeanParamFit),1) )`

`## Mean fitted mean parameter: 168.2`

Given a discrete population structure, such as tSNE clusters or experimental conditions, a fold change is the ratio of the mean expression values of two groups. The definition of a fold change is less clear if a continuous expression trajectory is considered: of interest may be, for example, the fold change from the first to the last cell on the expression trajectory or from the minimum to the maximum expression value. Note that in both cases, we compute fold changes on the model fit of the expression mean parameter, which is corrected for noise and therefore more stable than an estimate based on the raw expression count observations.

```
# first, extract the model fits for a given gene again
vecMeanParamFit <- getFitsMean(
    lsMuModel = lsMuModelH1(objLP),
    vecGeneIDs = objLP$dfResults[which.min(objLP$dfResults$p),]$gene)
# compute log2-fold change from first to last cell on trajectory;
# the ordering is stored in the "continuous" column of the annotation
idxFirstCell <- which.min(dfAnnotationProc(objLP)$continuous)
idxLastCell <- which.max(dfAnnotationProc(objLP)$continuous)
cat("LFC first to last cell on trajectory: ",
    round( (log(vecMeanParamFit[idxLastCell]) -
        log(vecMeanParamFit[idxFirstCell])) / log(2) ,1) )
```

`## LFC first to last cell on trajectory:`

```
# compute log2-fold change from minimum to maximum value of expression trajectory
cat("LFC minimum to maximum expression value of model fit: ",
round( (log(max(vecMeanParamFit)) -
log(min(vecMeanParamFit))) / log(2),1) )
```

`## LFC minimum to maximum expression value of model fit: 2.1`

Global expression profiles or expression profiles across large groups of genes can be visualised via heatmaps of expression z-scores. One could extract the expression mean parameter fits as described above and create such heatmaps from scratch. LineagePulse also offers a wrapper for creating such a heatmap:

```
# create heatmap with all differentially expressed genes
lsHeatmaps <- sortGeneTrajectories(
vecIDs = objLP$dfResults[which(objLP$dfResults$padj < 0.01),]$gene,
lsMuModel = lsMuModelH1(objLP),
dirHeatmap=NULL)
print(lsHeatmaps$hmGeneSorted)
```
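For readers who prefer to build such a heatmap from scratch, the following base R sketch shows the z-score and sorting steps on a small synthetic matrix of fitted means. The matrix `matMeanFits`, the gene names and the peak positions are invented for illustration. Sorting genes by the position of their maximum is one plausible ordering and is not necessarily the one used by sortGeneTrajectories.

```r
nCells <- 50
vecPseudotime <- seq(0, 1, length.out = nCells)

# Hypothetical fitted mean trajectories (genes x cells): three bell-shaped
# profiles peaking at different pseudotime coordinates.
vecPeaks <- c(geneA = 0.8, geneB = 0.2, geneC = 0.5)
matMeanFits <- t(sapply(vecPeaks, function(peak) {
    10 + 100 * exp(-(vecPseudotime - peak)^2 / 0.02)
}))

# Gene-wise z-scores: centre and scale each row.
matZ <- t(scale(t(matMeanFits)))

# Sort genes by the position of their peak along pseudotime.
vecOrder <- order(apply(matZ, 1, which.max))
matZSorted <- matZ[vecOrder, ]

# Draw the z-scores without clustering or additional scaling.
heatmap(matZSorted, Rowv = NA, Colv = NA, scale = "none")
```

With real data, `matMeanFits` would be replaced by the matrix of fitted means extracted for the genes of interest.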

`sessionInfo()`

```
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] LineagePulse_1.4.0 BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 lattice_0.20-38
## [3] circlize_0.4.6 png_0.1-7
## [5] gtools_3.8.1 assertthat_0.2.1
## [7] digest_0.6.18 SingleCellExperiment_1.6.0
## [9] R6_2.4.0 GenomeInfoDb_1.20.0
## [11] plyr_1.8.4 stats4_3.6.0
## [13] evaluate_0.13 ggplot2_3.1.1
## [15] pillar_1.3.1 gplots_3.0.1.1
## [17] GlobalOptions_0.1.0 zlibbioc_1.30.0
## [19] rlang_0.3.4 lazyeval_0.2.2
## [21] gdata_2.18.0 S4Vectors_0.22.0
## [23] GetoptLong_0.1.7 Matrix_1.2-17
## [25] rmarkdown_1.12 labeling_0.3
## [27] splines_3.6.0 BiocParallel_1.18.0
## [29] stringr_1.4.0 RCurl_1.95-4.12
## [31] munsell_0.5.0 DelayedArray_0.10.0
## [33] compiler_3.6.0 xfun_0.6
## [35] pkgconfig_2.0.2 BiocGenerics_0.30.0
## [37] shape_1.4.4 htmltools_0.3.6
## [39] tidyselect_0.2.5 SummarizedExperiment_1.14.0
## [41] tibble_2.1.1 GenomeInfoDbData_1.2.1
## [43] bookdown_0.9 IRanges_2.18.0
## [45] matrixStats_0.54.0 crayon_1.3.4
## [47] dplyr_0.8.0.1 bitops_1.0-6
## [49] grid_3.6.0 gtable_0.3.0
## [51] magrittr_1.5 scales_1.0.0
## [53] KernSmooth_2.23-15 stringi_1.4.3
## [55] XVector_0.24.0 rjson_0.2.20
## [57] RColorBrewer_1.1-2 tools_3.6.0
## [59] Biobase_2.44.0 glue_1.3.1
## [61] purrr_0.3.2 parallel_3.6.0
## [63] yaml_2.2.0 clue_0.3-57
## [65] colorspace_1.4-1 cluster_2.0.9
## [67] BiocManager_1.30.4 caTools_1.17.1.2
## [69] GenomicRanges_1.36.0 ComplexHeatmap_2.0.0
## [71] knitr_1.22
```