coseq package: Quick-start guide

Andrea Rau, Antoine Godichon-Baggioni, and Cathy Maugis-Rabusseau

2017-10-30

coseq is a package to perform clustering analysis of sequencing data (e.g., co-expression analysis of RNA-seq data), using transformed profiles rather than the raw counts directly. The package implements two distinct strategies in conjunction with transformations: 1) MacQueen’s K-means algorithm and 2) Gaussian mixture models. For both strategies model selection is provided using the slope heuristics approach and integrated completed likelihood (ICL) criterion, respectively. Note that for backwards compatibility, coseq also provides a wrapper for the Poisson mixture model originally proposed in Rau et al. (2015) and implemented in the HTSCluster package.

The methods implemented in this package are described in detail in the following three publications.

Below, we provide a quick-start guide using a subset of RNA-seq data to illustrate the functionalities of the coseq package. In this document, we focus on the methods described in Rau and Maugis-Rabusseau (2017) and Godichon-Baggioni et al. (2017). For more information about the method described in Rau et al. (2015), see the HTSCluster vignette.

Quick start (tl;dr)

A standard coseq analysis takes the following form, where counts represents a matrix or data.frame of gene-level counts arising from an RNA-seq experiment (of dimension n x d for n genes and d samples). Results, exported in the form of a coseqResults S4 object, can easily be examined using standard summary and plot functions (see below and the User’s Guide for examples of plots that may be obtained):

run <- coseq(counts, K=2:25)
summary(run)
plot(run)

The cluster labels obtained from the coseq clustering may easily be obtained using the clusters functions:

clusters(run)

Note that unless otherwise specified by the user, coseq uses the following default parameters:

Co-expression analysis with coseq

For the remainder of this vignette, we make use of the mouse neocortex RNA-seq data from Fietz et al. (2012) (available at https://perso.math.univ-toulouse.fr/maugis/mixstatseq/packages and as an ExpressionSet object called fietz in coseq). The aim in this study was to investigate the expansion of the neocortex in five embryonic (day 14.5) mice by analyzing the transcriptome of three regions: the ventricular zone (VZ), subventricular zone (SVZ) and cortical place (CP).

We begin by first loading the necessary R packages as well as the data.

library(coseq)
library(Biobase)
library(corrplot)

data("fietz")
counts <- exprs(fietz)
conds <- pData(fietz)$tissue

## Equivalent to the following:
## counts <- read.table("http://www.math.univ-toulouse.fr/~maugis/coseq/Fietz_mouse_counts.txt",
##                       header=TRUE)
## conds <- c(rep("CP",5),rep("SVZ",5),rep("VZ",5))

The coseq package fits either a Gaussian mixture model (Rau and Maugis-Rabusseau, 2017) or uses the K-means algorithm (Godichon-Baggioni et al., 2017) to cluster transformed normalized expression profiles of RNA-seq data. Normalized expression profiles correspond to the proportion of normalized reads observed for gene i with respect to the total observed for gene i across all samples: \[ p_{ij} = \frac{y_{ij}/s_{j} +1}{\sum_{j'} y_{ij'}/s_{j'} +1}, \] where \(s_j\) are normalization scaling factors (e.g., after applying TMM normalization to library sizes) and \(y_{ij}\) represents the raw count for gene \(i\) in sample \(j\).

Transformations for normalized profiles with the Gaussian mixture model

Since the coordinates of \(\mathbf{p}_i\) are linearly dependent (causing estimation problems for a Gaussian mixture distribution), weconsider either the arcsine or logit transformation of the normalized profiles \(p_{ij}\): \[g_{\text{arcsin}}(p_{ij}) = \text{arcsin}\left( \sqrt{ p_{ij} } \right) \in [0, \pi/2], \text{ and}\] \[g_{\text{logit}}(p_{ij}) \text{log}_2 \left( \frac{p_{ij}}{1-p_{ij}} \right) \in (-\infty, \infty).\]

Then the distribution of the transformed normalized expression profiles is modeled by a general multidimensional Gaussian mixture

\[f(.|\theta_K) = \sum_{k=1}^K \pi_k \phi(.|\mu_k,\Sigma_k)\]

where \(\theta_K=(\pi_1,\ldots,\pi_{K-1},\mu_1,\ldots,\mu_K,\Sigma_1,\ldots,\Sigma_K)\), \(\pi=(\pi_1,\ldots,\pi_K)\) are the mixing proportions and \(\phi(.|\mu_k,\Sigma_k)\) is the \(q\)-dimensional Gaussian density function with mean \(\mu_k\) and covariance matrix \(\Sigma_k\). To estimate mixture parameters \(\theta_K\) by computing the maximum likelihood estimate (MLE), an Expectation-Maximization (EM) algorithm is used via the Rmixmod package. Finally, let \(\hat t_{ik}\) be the conditional probability that observation \(i\) arises from the \(k\)th component of the mixture \(f(.|\hat \theta_{\hat K})\). Each observation \(i\) is assigned to the component maximizing the conditional probability \(\hat t_{ik}\) i.e., using the so-called maximum a posteriori (MAP) rule.

Transformations for normalized profiles with K-means

For the K-means algorithm, we consider three separate transformations of the profiles \(p_{i}\) that are well adapted to compositional data (see Godichon-Baggioni et al., 2017 for more details):

Then, the aim is to minimize the sum of squared errors (SSE), defined for each set of clustering \(\mathcal{C}^{(K)}=\left( C_{1},...,C_{k}\right)\) by \[\begin{equation*} \text{SSE} \left( \mathcal{C}^{(K)}\right) := \sum_{k=1}^{K}\sum_{i \in C_{k}} \left\| h \left( y_{i}\right) - \mu_{k,h} \right\|_{2}^{2} , \end{equation*}\]

with \(i \in C_{k}\) if \(\left\| h\left( y_{i}\right) - \mu_{k,h} \right\|_{2} = \min_{k'=1,\ldots,K} \left\| y_{i}- \mu_{k',h} \right\|_{2}\), and \[ \mu_{k,h}= \frac{1}{|C_{k} |}\sum_{i \in C_{k}}h \left( y_{i} \right), \] and \(h\) is the chosen transformation. In order to minimize the SSE, we use the well-known MacQueen’s K-means algorithm.

Model selection

Because the number of clusters \(K\) is not known a priori, we fit a collection of models (here \(K\) = 2, …, 25) and use either the Integrated Completed Likelihood (ICL) criterion (in the case of the Gaussian mixture model) or the slope heuristics (in the case of the K-means algorithm) to select the best model in terms of fit, complexity, and cluster separation.

Other options

If desired, we can set a filtering cutoff on the mean normalized counts via the meanFilterCutoff argument to remove very weakly expressed genes from the co-expression analysis; in the interest of faster computation for this vignette, in this example we filter all genes with mean normalized expression less than 200 (for the Gaussian mixture model) or less than 50 (for the K-means algorithm).

Note that if desired, parallel execution using BiocParallel can be specified via the parallel argument:

run <- coseq(..., parallel=TRUE)

Running coseq

The collection of co-expression models for the logCLR-transformed profiles using the K-means algorithm may be obtained as follows (note that we artificially set the number of maximum allowed iterations and number of random starts to be low for faster computational time in this vignette):

runLogCLR <- coseq(counts, K=2:25, transformation="logclr",norm="TMM", 
                      meanFilterCutoff=50, model="kmeans",
                   nstart=1, iter.max=10)
## ****************************************
## coseq analysis: kmeans approach & logclr transformation
## K = 2 to 25 
## ****************************************
## Running K = 2 ...
## Running K = 3 ...
## Running K = 4 ...
## Running K = 5 ...
## Running K = 6 ...
## Running K = 7 ...
## Running K = 8 ...
## Running K = 9 ...
## Running K = 10 ...
## Running K = 11 ...
## Running K = 12 ...
## Running K = 13 ...
## Running K = 14 ...
## Running K = 15 ...
## Running K = 16 ...
## Running K = 17 ...
## Running K = 18 ...
## Running K = 19 ...
## Running K = 20 ...
## Running K = 21 ...
## Running K = 22 ...
## Running K = 23 ...
## Running K = 24 ...
## Running K = 25 ...

The collection of Gaussian mixture models for the arcsine-transformed and logit-transformed profiles may be obtained as follows (as before, we set the number of iterations to be quite low for computational speed in this vignette):

runArcsin <- coseq(counts, K=2:20, model="Normal", transformation="arcsin", 
                   meanFilterCutoff=200, iter=10)
## ****************************************
## coseq analysis: Normal approach & arcsin transformation
## K = 2 to 20 
## ****************************************
## Running K = 2 ...
## Running K = 3 ...
## Running K = 4 ...
## Running K = 5 ...
## Running K = 6 ...
## Running K = 7 ...
## Running K = 8 ...
## Running K = 9 ...
## Running K = 10 ...
## Running K = 11 ...
## Running K = 12 ...
## Running K = 13 ...
## Running K = 14 ...
## Running K = 15 ...
## Running K = 16 ...
## Running K = 17 ...
## Running K = 18 ...
## Running K = 19 ...
## Running K = 20 ...
runLogit <- coseq(counts, K=2:20, model="Normal", transformation="logit", 
                  meanFilterCutoff=200, verbose=FALSE, iter=10)
## ****************************************
## coseq analysis: Normal approach & logit transformation
## K = 2 to 20 
## ****************************************

In all cases, the resulting output of a call to coseq is an S4 object of class coseqResults.

class(runArcsin)
## [1] "coseqResults"
## attr(,"package")
## [1] "coseq"
runArcsin
## An object of class coseqResults
##  4230 features by 15 samples. 
##  Models fit: K = 2 ... 20
##  Chosen clustering model: K = 10

To choose the most appropriate transformation to use (arcsine versus logit) in a Gaussian mixture model, we may use the corrected ICL, where the minimum value corresponds to the selected model. Note that this feature is not available for the K-means algorithm.

compareICL(list(runArcsin, runLogit))

This indicates that the preferred transformation is the arcsine, and the selected model has $K = $ {r} ncol(runArcsin) clusters. We can additionally explore how similar the models with $K = $ {r ncol(runArcsin)-1} to {r ncol(runArcsin)+1} are using the adjusted Rand index (ARI) via the compareARI function. Values close to 1 indicate perfect agreement, while values close to 0 indicate near random partitions.

compareARI(runArcsin, K=8:12)

##      K=8  K=9 K=10 K=11 K=12
## K=8    1 0.59 0.67 0.49 0.42
## K=9         1 0.56 0.72 0.49
## K=10             1 0.51  0.5
## K=11                  1  0.5
## K=12                       1

Note that because the results of coseq depend on the initialization point, results from one run to another may vary; as such, in practice, it is typically a good idea to re-estimate the same collection of models a few times (e.g., 5) to avoid problems linked to initialization.

Exploring coseq results

Results from a coseq analysis can be explored and summarized in a variety of ways. First, a call to the summary function provides the number of clusters selected for the ICL model selection approach, number of genes assigned to each cluster, and if desired the per-gene cluster means.

summary(runArcsin)
## *************************************************
## Model: Gaussian_pk_Lk_Ck
## Transformation: arcsin
## *************************************************
## Clusters fit: 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20
## Clusters with errors: 19
## Selected number of clusters via ICL: 10
## ICL of selected model: -366018.8
## *************************************************
## Number of clusters = 10
## ICL = -366018.8
## *************************************************
## Cluster sizes:
##  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6 
##        832        379       1009        323         91        162 
##  Cluster 7  Cluster 8  Cluster 9 Cluster 10 
##        418        307        180        529 
## 
## Number of observations with MAP > 0.90 (% of total):
## 3404 (80.47%)
## 
## Number of observations with MAP > 0.90 per cluster (% of total per cluster):
##  Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7
##  602       289       753       287       72        155       338      
##  (72.36%)  (76.25%)  (74.63%)  (88.85%)  (79.12%)  (95.68%)  (80.86%) 
##  Cluster 8 Cluster 9 Cluster 10
##  293       160       455       
##  (95.44%)  (88.89%)  (86.01%)

Next, a variety of plots may be explored using the plot function:

By default, all of these plots are produced simultaneously unless specific graphics are requested via the graphs argument. In addition, a variety of options are available to specify the number of graphs per row/column in a grid layout, whether profiles should be averaged over conditions, whether condition labels should be used, whether clusters should be ordered according to their similarity, etc. Note that the histogram of maximum conditional probabilities of cluster membership for all genes (probapost_histogram), and boxplots and barplots of maximum conditional probabilities of cluster membership for the genes assigned to each cluster (probapost_boxplots, probapost_barplots) help to evaluate the degree of certitude accorded by the model in assigning genes to clusters, as well as whether some clusters are attribued a greater degree of uncertainty than others.

## To obtain all plots
## plot(runArcsin)
plot(runArcsin, graphs="boxplots")
## $boxplots

plot(runArcsin, graphs="boxplots", conds=conds)
## $boxplots

plot(runArcsin, graphs="boxplots", conds=conds, average_over_conds=TRUE)
## $boxplots

plot(runArcsin, graphs="profiles")
## $profiles

plot(runArcsin, graphs="probapost_boxplots")
## $probapost_boxplots

plot(runArcsin, graphs="probapost_histogram")
## $probapost_histogram

If desired the per-cluster correlation matrices estimated in the Gaussian mixture model may be obtained by calling NormMixParam. These matrices may easily be visualized using the corrplot package.

rho <- NormMixParam(runArcsin)$rho
## Covariance matrix for cluster 1
rho1 <- rho[,,1]
colnames(rho1) <- rownames(rho1) <- paste0(colnames(rho1), "\n", conds)
corrplot(rho1)

Finally, cluster labels (as well as a variety of other information) are easily accessible via a set of accessor functions.

labels <- clusters(runArcsin)
table(labels)
## labels
##    1    2    3    4    5    6    7    8    9   10 
##  832  379 1009  323   91  162  418  307  180  529
metadata(runArcsin)
## $nbCluster
##  K=2  K=3  K=4  K=5  K=6  K=7  K=8  K=9 K=10 K=11 K=12 K=13 K=14 K=15 K=16 
##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## K=17 K=18 K=20 
##   17   18   20 
## 
## $logLike
##      K=2      K=3      K=4      K=5      K=6      K=7      K=8      K=9 
## 180274.3 182148.5 184595.7 185639.2 186506.7 187189.1 188267.6 188650.3 
##     K=10     K=11     K=12     K=13     K=14     K=15     K=16     K=17 
## 189374.4 189610.7 190289.4 190375.3 190767.3 190994.8 191587.4 191817.5 
##     K=18     K=20 
## 192056.1 192505.0 
## 
## $ICL
##       K=2       K=3       K=4       K=5       K=6       K=7       K=8 
## -357872.4 -360426.6 -363971.3 -364750.1 -365182.3 -365286.6 -365950.2 
##       K=9      K=10      K=11      K=12      K=13      K=14      K=15 
## -365774.7 -366018.8 -365256.0 -365354.2 -364022.1 -363925.2 -363060.0 
##      K=16      K=17      K=18      K=20 
## -363124.0 -362529.5 -361760.4 -360318.7 
## 
## $nbClusterError
## [1] 19
## 
## $GaussianModel
## [1] "Gaussian_pk_Lk_Ck"
likelihood(runArcsin)
##      K=2      K=3      K=4      K=5      K=6      K=7      K=8      K=9 
## 180274.3 182148.5 184595.7 185639.2 186506.7 187189.1 188267.6 188650.3 
##     K=10     K=11     K=12     K=13     K=14     K=15     K=16     K=17 
## 189374.4 189610.7 190289.4 190375.3 190767.3 190994.8 191587.4 191817.5 
##     K=18     K=20 
## 192056.1 192505.0
nbCluster(runArcsin)
##  K=2  K=3  K=4  K=5  K=6  K=7  K=8  K=9 K=10 K=11 K=12 K=13 K=14 K=15 K=16 
##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## K=17 K=18 K=20 
##   17   18   20
ICL(runArcsin)
##       K=2       K=3       K=4       K=5       K=6       K=7       K=8 
## -357872.4 -360426.6 -363971.3 -364750.1 -365182.3 -365286.6 -365950.2 
##       K=9      K=10      K=11      K=12      K=13      K=14      K=15 
## -365774.7 -366018.8 -365256.0 -365354.2 -364022.1 -363925.2 -363060.0 
##      K=16      K=17      K=18      K=20 
## -363124.0 -362529.5 -361760.4 -360318.7
model(runArcsin)
## [1] "Normal"
transformationType(runArcsin)
## [1] "arcsin"

The data used to fit the mixture model (transformed normalized profiles) as well as the normalized profiles themselves are stored as DataFrame objects that may be accessed with the corresponding functions:

## arcsine-transformed normalized profiles
tcounts(runArcsin)
## DataFrame with 4230 rows and 15 columns
##                 CP       CP.1       CP.2       CP.3       CP.4       SVZ
##          <numeric>  <numeric>  <numeric>  <numeric>  <numeric> <numeric>
## Gene1    0.2365982  0.2281048  0.2412887  0.2473080  0.2489826 0.2629261
## Gene4    0.2848261  0.2931767  0.3118907  0.2882754  0.2864151 0.2732570
## Gene10   0.2533920  0.2638799  0.2530599  0.2632555  0.2606190 0.2562390
## Gene14   0.2498710  0.2418410  0.2536331  0.2589391  0.2603361 0.2770880
## Gene16   0.2300029  0.2357965  0.2494971  0.2383596  0.2453366 0.2531277
## ...            ...        ...        ...        ...        ...       ...
## Gene8946 0.1074535 0.03192211 0.05026663 0.02474984 0.05811146 0.1259358
## Gene8948 0.2229960 0.21811146 0.24818159 0.24328239 0.25228416 0.2682103
## Gene8950 0.2304332 0.21964058 0.26689686 0.23939748 0.21978256 0.2558527
## Gene8951 0.1680933 0.13250889 0.15236214 0.16021283 0.22036502 0.2279859
## Gene8952 0.3231309 0.32351511 0.36509345 0.33938134 0.32256340 0.2557594
##              SVZ.1     SVZ.2     SVZ.3     SVZ.4        VZ      VZ.1
##          <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## Gene1    0.2551189 0.2642252 0.2541767 0.2828720 0.2695606 0.2685841
## Gene4    0.2702218 0.2645290 0.2775838 0.2706835 0.2147412 0.2015032
## Gene10   0.2601825 0.2552897 0.2560920 0.2629374 0.2663525 0.2620199
## Gene14   0.2684733 0.2876669 0.2663481 0.2842205 0.2422776 0.2577315
## Gene16   0.2228943 0.2582456 0.2529375 0.2598008 0.2728374 0.2904795
## ...            ...       ...       ...       ...       ...       ...
## Gene8946 0.1064392 0.1378225 0.1500380 0.1358239 0.4029652 0.4669373
## Gene8948 0.2532188 0.2871137 0.2860507 0.3019755 0.2456510 0.2521769
## Gene8950 0.2392094 0.2948982 0.2841849 0.2950181 0.2566831 0.2594619
## Gene8951 0.2041124 0.2384300 0.2520194 0.2572120 0.3320978 0.3345008
## Gene8952 0.2658962 0.2844873 0.2611099 0.2536597 0.1571320 0.1634018
##               VZ.2      VZ.3      VZ.4
##          <numeric> <numeric> <numeric>
## Gene1    0.2796562 0.2915050 0.2783121
## Gene4    0.2147695 0.2309719 0.2023530
## Gene10   0.2735773 0.2629053 0.2667900
## Gene14   0.2457918 0.2707092 0.2471942
## Gene16   0.2988195 0.3108463 0.2820386
## ...            ...       ...       ...
## Gene8946 0.4335344 0.4591869 0.4183262
## Gene8948 0.2641034 0.2979577 0.2610596
## Gene8950 0.2876879 0.2941666 0.2565660
## Gene8951 0.3588670 0.3726418 0.3435751
## Gene8952 0.1542864 0.1354519 0.1590296
## normalized profiles
profiles(runArcsin)
## DataFrame with 4230 rows and 15 columns
##                  CP        CP.1        CP.2         CP.3        CP.4
##           <numeric>   <numeric>   <numeric>    <numeric>   <numeric>
## Gene1    0.05494192  0.05113558  0.05709909   0.05992446  0.06072184
## Gene4    0.07895568  0.08351799  0.09416224   0.08082604  0.07981480
## Gene10   0.06284502  0.06803128  0.06268394   0.06771718  0.06639829
## Gene14   0.06114689  0.05735569  0.06296208   0.06556424  0.06625749
## Gene16   0.05197502  0.05457714  0.06096785   0.05574744  0.05899210
## ...             ...         ...         ...          ...         ...
## Gene8946 0.01150189 0.001018675 0.002524607 0.0006124297 0.003373143
## Gene8948 0.04890841 0.046822993 0.060339831 0.0580278232 0.062308380
## Gene8950 0.05216625 0.047471194 0.069558492 0.0562246310 0.047531594
## Gene8951 0.02799024 0.017456077 0.023035143 0.0254492834 0.047779767
## Gene8952 0.10082974 0.101061220 0.127475125 0.1108249301 0.100488237
##                 SVZ      SVZ.1      SVZ.2      SVZ.3      SVZ.4         VZ
##           <numeric>  <numeric>  <numeric>  <numeric>  <numeric>  <numeric>
## Gene1    0.06755175 0.06368580 0.06820528 0.06322640 0.07790500 0.07091992
## Gene4    0.07282927 0.07125971 0.06835855 0.07509393 0.07149749 0.04540929
## Gene10   0.06423393 0.06618112 0.06376926 0.06416188 0.06755744 0.06928177
## Gene14   0.07483284 0.07036272 0.08049461 0.06927953 0.07862937 0.05755888
## Gene16   0.06271678 0.04886452 0.06522134 0.06262462 0.06599147 0.07261135
## ...             ...        ...        ...        ...        ...        ...
## Gene8946 0.01577617 0.01128659 0.01887507 0.02234299 0.01833497 0.15377984
## Gene8948 0.07022826 0.06276099 0.08019391 0.07961743 0.08845085 0.05914033
## Gene8950 0.06404466 0.05613800 0.08447303 0.07861023 0.08453972 0.06445189
## Gene8951 0.05108323 0.04108650 0.05577975 0.06218046 0.06471184 0.10629357
## Gene8952 0.06399898 0.06905023 0.07877305 0.06664295 0.06297502 0.02448794
##                VZ.1       VZ.2       VZ.3       VZ.4
##           <numeric>  <numeric>  <numeric>  <numeric>
## Gene1    0.07041939 0.07618994 0.08259536 0.07547827
## Gene4    0.04005695 0.04542110 0.05240606 0.04039090
## Gene10   0.06709758 0.07299583 0.06754132 0.06950415
## Gene14   0.06496771 0.05920676 0.07151071 0.05987047
## Gene16   0.08203165 0.08666678 0.09355311 0.07745883
## ...             ...        ...        ...        ...
## Gene8946 0.20263823 0.17646788 0.19644345 0.16502406
## Gene8948 0.06225654 0.06814390 0.08618246 0.06661790
## Gene8950 0.06582331 0.08050605 0.08406656 0.06439438
## Gene8951 0.10777937 0.12335102 0.13255220 0.11347155
## Gene8952 0.02646335 0.02361602 0.01823529 0.02507793

Finally, if the results for a model in the collection other than that chosen by ICL are desired, they may be accessed in the form of a list using the coseqFullResults function.

fullres <- coseqFullResults(runArcsin)
class(fullres)
## [1] "list"
length(fullres)
## [1] 19
names(fullres)
##  [1] "K=2"  "K=3"  "K=4"  "K=5"  "K=6"  "K=7"  "K=8"  "K=9"  "K=10" "K=11"
## [11] "K=12" "K=13" "K=14" "K=15" "K=16" "K=17" "K=18" "K=19" "K=20"

Running coseq on a DESeq2 or edgeR results object

In many cases, it is of interest to run a co-expression analysis on the subset of genes identified as differentially expressed in a prior analysis. To facilitate this, coseq may be directly inserted into a DESeq2 analysis pipeline as follows:

library(DESeq2)
dds <- DESeqDataSetFromMatrix(counts, DataFrame(group=factor(conds)), ~group)
dds <- DESeq(dds, test="LRT", reduced = ~1)
res <- results(dds)
summary(res)
## 
## out of 8962 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 3910, 44% 
## LFC < 0 (down)   : 3902, 44% 
## outliers [1]     : 13, 0.15% 
## low counts [2]   : 0, 0% 
## (mean count < 45)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## By default, alpha = 0.10
run <- coseq(dds, K=2:15, verbose=FALSE)
## ****************************************
## Co-expression analysis on DESeq2 output:
## 7812 DE genes at p-adj < 0.1 
## ****************************************
## coseq analysis: kmeans approach & logclr transformation
## K = 2 to 15 
## ****************************************
## Running K = 2 ...
## Running K = 3 ...
## Running K = 4 ...
## Running K = 5 ...
## Running K = 6 ...
## Running K = 7 ...
## Running K = 8 ...
## Running K = 9 ...
## Running K = 10 ...
## Running K = 11 ...
## Running K = 12 ...
## Running K = 13 ...
## Running K = 14 ...
## Running K = 15 ...
## The following two lines provide identical results
run <- coseq(dds, K=2:15, alpha=0.05)
## ****************************************
## Co-expression analysis on DESeq2 output:
## 7535 DE genes at p-adj < 0.05 
## ****************************************
## coseq analysis: kmeans approach & logclr transformation
## K = 2 to 15 
## ****************************************
## Running K = 2 ...
## Running K = 3 ...
## Running K = 4 ...
## Running K = 5 ...
## Running K = 6 ...
## Running K = 7 ...
## Running K = 8 ...
## Running K = 9 ...
## Running K = 10 ...
## Running K = 11 ...
## Running K = 12 ...
## Running K = 13 ...
## Running K = 14 ...
## Running K = 15 ...
run <- coseq(dds, K=2:15, subset=results(dds, alpha=0.05))
## ****************************************
## Co-expression analysis on DESeq2 output:
## 7812 DE genes at p-adj < 0.1 
## ****************************************
## coseq analysis: kmeans approach & logclr transformation
## K = 2 to 15 
## ****************************************
## Running K = 2 ...
## Running K = 3 ...
## Running K = 4 ...
## Running K = 5 ...
## Running K = 6 ...
## Running K = 7 ...
## Running K = 8 ...
## Running K = 9 ...
## Running K = 10 ...
## Running K = 11 ...
## Running K = 12 ...
## Running K = 13 ...
## Running K = 14 ...
## Running K = 15 ...

A co-expression analysis following the edgeR analysis pipeline may be done in a similar way, depending on whether the quasi-likelihood or likelihood ratio test is used:

library(edgeR)
y <- DGEList(counts=counts, group=factor(conds))
y <- calcNormFactors(y)
design <- model.matrix(~conds)
y <- estimateDisp(y, design)

## edgeR: QLF test
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef=2)

## edgeR: LRT test
fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=2)

run <- coseq(counts, K=2:15, subset=lrt, alpha=0.1)
## ****************************************
## Co-expression analysis on edgeR output:
## 5775 DE genes at p-adj < 0.1 
## ****************************************
## coseq analysis: kmeans approach & logclr transformation
## K = 2 to 15 
## ****************************************
## Running K = 2 ...
## Running K = 3 ...
## Running K = 4 ...
## Running K = 5 ...
## Running K = 6 ...
## Running K = 7 ...
## Running K = 8 ...
## Running K = 9 ...
## Running K = 10 ...
## Running K = 11 ...
## Running K = 12 ...
## Running K = 13 ...
## Running K = 14 ...
## Running K = 15 ...
run <- coseq(counts, K=2:15, subset=qlf, alpha=0.1)
## ****************************************
## Co-expression analysis on edgeR output:
## 5717 DE genes at p-adj < 0.1 
## ****************************************
## coseq analysis: kmeans approach & logclr transformation
## K = 2 to 15 
## ****************************************
## Running K = 2 ...
## Running K = 3 ...
## Running K = 4 ...
## Running K = 5 ...
## Running K = 6 ...
## Running K = 7 ...
## Running K = 8 ...
## Running K = 9 ...
## Running K = 10 ...
## Running K = 11 ...
## Running K = 12 ...
## Running K = 13 ...
## Running K = 14 ...
## Running K = 15 ...

coseq FAQs

  1. Should I use the Gaussian mixture model (model="Normal") or the K-means (model="kmeans") approach for my co-expression analysis? And what about the Poisson mixture model?

The Gaussian mixture model is more time-intensive but enables the estimation of per-cluster correlation structures among samples. The K-means algorithm has a much smaller computational cost, at the expense of assuming a diagonl per-cluster covariance structures. Generally speaking, we feel that the K-means approach is a good first approach to use, although this can be somewhat context-dependent. Finally, although we feel the Poisson mixture model was a good first approach for RNA-seq co-expression, we now recommend either the Gaussian mixture model or K-means approach instead.

  1. There are a lot of proposed transformations in the package. Which one should I use?

In our experience, when using the Gaussian mixture model approach the arcsine transformation performs well. When using the K-means algorithm, if highly specific profiles should be highlighted (e.g., genes expressed in a single tissue), the logCLR transformation performs well; if on the other hand the user wishes to distinguish the fine differences among genes with generally equal expression across conditions (e.g., non-differentially expressed genes), the CLR transformation performs well.

  1. Why do I get different results when I rerun the analysis?

Because the results of coseq depend on the initialization point, results from one run to another may vary; as such, in practice, it is typically a good idea to re-estimate the same collection of models a few times (e.g., 5) to avoid problems linked to initialization.

  1. How do I output the clustering results?

Use the clusters(...) command.

  1. How can I check whether all models converged?

Look at the nbClusterError slot of the coseqResults metadata using metadata(...)$nbClusterError to examine the models that did not converge.