The package moanin was developed to provide a simple and efficient workflow for long time-course gene expression data. It makes it simple to do differential expression between conditions based on either individual condition comparisons per time point, or by fitting spline functions per gene. There are also functions to help with clustering and visualization.

We do not detail all of the features of the package here, but some of the main functions. See our more in-depth workflow paper at https://github.com/NelleV/2019timecourse-rnaseq-pipeline.

Setup

We will work with time course data available in the accompanying timecoursedata package. This is mRNA-Seq timecourse data on a plant, sorghum. The plants were sampled every week for 17 weeks. The plants were of two different varieties (i.e. different genotypes), and under three different watering conditions:

Control: regular weekly watering
Preflowering drought: plants under drought under pre-flowering (weeks before week 9), and watered like control for week 9 and afterward.
Postflowering drought: plants were regularly watered pre-flowering (weeks through week 9), and then watering withheld weeks 10 onward.

This data is available on both the leaf and the root sampled from the same plant. We will concentrate on the leaf samples, available as a dataset varoquaux2019leaf.

library(moanin)
library(timecoursedata)
data(varoquaux2019leaf)
names(varoquaux2019leaf)

## [1] "data" "meta"

The data element contains the gene expression data, while the meta data consists of information regarding the samples.

For simplicity, we are going to focus on comparing the three different watering conditions within the “BT642” variety (the variety known for comparative tolerance of pre-flowering drought). We are also going to drop Week 2, which is only measured in the control samples.

whSamples<-with(varoquaux2019leaf$meta,which(Genotype=="BT642" & Week >2))
preData<-varoquaux2019leaf$data[,whSamples]
preMeta<-varoquaux2019leaf$meta[whSamples,]
dim(preData)

## [1] 34211    97

dim(preMeta)

## [1] 97 22

Set up `moanin` class

The first step is to create a moanin class, which uses our meta data to create the needed information for future analysis. Specifically, in addition to storing the meta information, the function create_moanin_model will define a splines basis for future calculations. By default the model is given as a separate spline basis for each level of a factor Group, plus a group intercept with the following R formula

~Group:ns(Timepoint, df=4) + Group + 0

moaninObject <- create_moanin_model(data=preData, meta=preMeta, 
                            group_variable="Condition", time_variable="Week")
moaninObject

## Moanin object on 97 samples containing the following information:
## Group variable given by 'Condition' with the following levels:
##       Control Postflowering  Preflowering 
##            37            21            39 
## Time variable given by 'Week'
## Basis matrix with 15 basis_matrix functions
## Basis matrix was constructed with the following spline_formula
## ~Condition + Condition:splines::ns(Week, df = 4) + 0 
## 
## Information about the data (a SummarizedExperiment object):
## class: SummarizedExperiment 
## dim: 34211 97 
## metadata(0):
## assays(1): ''
## rownames(34211): Sobic.001G000100.v3.1 Sobic.001G000200.v3.1 ...
##   Sobic.K044420.v3.1 Sobic.K044505.v3.1
## rowData names(0):
## colnames(97): 0622162L06 0622162L14 ... 0928169L11 0928169L19
## colData names(23): Barcode libraryName ... Group WeeklyGroup

Moanin and SummarizedExperiment

Moanin extends the SummarizedExperiment class, and as such the data as well as the meta data are saved in one object. This means you can index the Moanin object like a regular matrix, and safely be indexing the related meta data:

moaninObject[,1:2]

## Moanin object on 2 samples containing the following information:
## Group variable given by 'Condition' with the following levels:
##       Control Postflowering  Preflowering 
##             0             0             2 
## Time variable given by 'Week'
## Basis matrix with 15 basis_matrix functions
## Basis matrix was constructed with the following spline_formula
## ~Condition + Condition:splines::ns(Week, df = 4) + 0 
## 
## Information about the data (a SummarizedExperiment object):
## class: SummarizedExperiment 
## dim: 34211 2 
## metadata(0):
## assays(1): ''
## rownames(34211): Sobic.001G000100.v3.1 Sobic.001G000200.v3.1 ...
##   Sobic.K044420.v3.1 Sobic.K044505.v3.1
## rowData names(0):
## colnames(2): 0622162L06 0622162L14
## colData names(23): Barcode libraryName ... Group WeeklyGroup

Log-tranformation

Because we have count data from mRNA-Seq, we set the argument to log_transform=TRUE. This means that many of the functions will internally use the transformation log(x+1) on the assay(moaninObject) for computations and visualizations. However, for some steps of the DE process the counts will be used (see below). The user can also just transform the data herself,

logMoaninObject<-moaninObject
assay(logMoaninObject)<-log(assay(moaninObject)+1)

Weekly differential expression analysis

One type of DE analysis we can do is to compare our watering conditions to each other, for every time point. We do this via a call to limma for the DE analysis, but before we can do this, we need to set up the appropriate contrasts. For example, “Preflowering - Control” (see ?makeContrasts in the limma package). We provide a function to do this for every week, so as to avoid this step.

preContrasts <- create_timepoints_contrasts(moaninObject,"Preflowering", "Control")

Notice we also get a warning that timepoint 16 is missing in our Control Samples (in fact, some of these time points only have a single observations per time point, which is not particularly appropriate for DE analysis per week).

We can also create contrasts to compare Postflowering and Control, and Preflowering and Postflowering. We get many warnings here, because Post-flowering only has samples after week 9.

postContrasts <- create_timepoints_contrasts(
    moaninObject, "Postflowering", "Control" )
prepostContrasts <- create_timepoints_contrasts(
    moaninObject, "Postflowering", "Preflowering")

We can run the DE analysis, setting use_voom_weights=TRUE to make use of limma correction for low-counts. We will subset to just the first 500 genes for illustration purposes, though the full set of genes does not take very long

weeklyPre <- DE_timepoints(moaninObject[1:500,], 
    contrasts=c(preContrasts,postContrasts,prepostContrasts),
     use_voom_weights=TRUE)

By setting use_voom_weights=TRUE, the DE step is done after an upperquartile normalization via limma using voom weights (see ?DE_timepoints for details). Therefore this should only be done if the input data to the Moanin class are counts (or on a count scale, such as expected counts from TopHat); in this case the user should set log_transform=TRUE in the construction of the Moanin object, so that other functions will take the log appropriately.

The results give the raw p-value (_pval), the FDR adjusted p-values (_qval), and the estimate of log-fold-change (_lfc) for each week (3 columns for each of the 31 contrasts):

dim(weeklyPre)

## [1] 500  93

head(weeklyPre[,1:10])

##                       Preflowering.3-Control.3_pval
## Sobic.001G000100.v3.1                    0.76935240
## Sobic.001G000200.v3.1                    0.01705126
## Sobic.001G000300.v3.1                    0.75896765
## Sobic.001G000400.v3.1                    0.23770107
## Sobic.001G000501.v3.1                    0.75896765
## Sobic.001G000700.v3.1                    0.05734872
##                       Preflowering.3-Control.3_qval
## Sobic.001G000100.v3.1                     0.9806377
## Sobic.001G000200.v3.1                     0.1792723
## Sobic.001G000300.v3.1                     0.9795879
## Sobic.001G000400.v3.1                     0.6740071
## Sobic.001G000501.v3.1                     0.9795879
## Sobic.001G000700.v3.1                     0.3530203
##                       Preflowering.3-Control.3_lfc
## Sobic.001G000100.v3.1                    2.4312650
## Sobic.001G000200.v3.1                    1.5852637
## Sobic.001G000300.v3.1                   -1.9235186
## Sobic.001G000400.v3.1                    0.9811749
## Sobic.001G000501.v3.1                   -1.0133240
## Sobic.001G000700.v3.1                   -0.6885880
##                       Preflowering.4-Control.4_pval
## Sobic.001G000100.v3.1                     0.2290217
## Sobic.001G000200.v3.1                     0.9860256
## Sobic.001G000300.v3.1                     0.2083560
## Sobic.001G000400.v3.1                     0.9176221
## Sobic.001G000501.v3.1                     0.2083560
## Sobic.001G000700.v3.1                     0.1970270
##                       Preflowering.4-Control.4_qval
## Sobic.001G000100.v3.1                     0.6639340
## Sobic.001G000200.v3.1                     0.9996942
## Sobic.001G000300.v3.1                     0.6398885
## Sobic.001G000400.v3.1                     0.9921867
## Sobic.001G000501.v3.1                     0.6398885
## Sobic.001G000700.v3.1                     0.6282489
##                       Preflowering.4-Control.4_lfc
## Sobic.001G000100.v3.1                    1.0249975
## Sobic.001G000200.v3.1                    1.1352912
## Sobic.001G000300.v3.1                    0.6779068
## Sobic.001G000400.v3.1                   -0.4052482
## Sobic.001G000501.v3.1                    3.0543973
## Sobic.001G000700.v3.1                   -2.5260550
##                       Preflowering.5-Control.5_pval
## Sobic.001G000100.v3.1                    0.64964783
## Sobic.001G000200.v3.1                    0.03384720
## Sobic.001G000300.v3.1                    0.63455782
## Sobic.001G000400.v3.1                    0.62792598
## Sobic.001G000501.v3.1                    0.63455782
## Sobic.001G000700.v3.1                    0.08759985
##                       Preflowering.5-Control.5_qval
## Sobic.001G000100.v3.1                     0.9465706
## Sobic.001G000200.v3.1                     0.2640320
## Sobic.001G000300.v3.1                     0.9417509
## Sobic.001G000400.v3.1                     0.9405412
## Sobic.001G000501.v3.1                     0.9417509
## Sobic.001G000700.v3.1                     0.4388142
##                       Preflowering.5-Control.5_lfc
## Sobic.001G000100.v3.1                    0.9699917
## Sobic.001G000200.v3.1                   -5.4320308
## Sobic.001G000300.v3.1                    1.7300535
## Sobic.001G000400.v3.1                    1.2150398
## Sobic.001G000501.v3.1                   -0.7634983
## Sobic.001G000700.v3.1                   -0.7326084
##                       Preflowering.6-Control.6_pval
## Sobic.001G000100.v3.1                     0.9438199
## Sobic.001G000200.v3.1                     0.3823328
## Sobic.001G000300.v3.1                     0.9412143
## Sobic.001G000400.v3.1                     0.3198618
## Sobic.001G000501.v3.1                     0.9412143
## Sobic.001G000700.v3.1                     0.6082270

Time-course differential expression analysis between two groups

The results of such a weekly analysis can be quite difficult to interpret, the number of replicates per week can be quite low, and different numbers of replicates in different weeks can make it difficult to compare the analysis across weeks.

An alternative approach is to fit a smooth spline to each gene per group, and perform differential tests as to whether there are differences between the spline functions between two groups. We provide this function in DE_timecourse, where the user provides a string of comparisons that they wish to make:

timecourseContrasts <- c("Preflowering-Control",
             "Postflowering-Control",
             "Postflowering-Preflowering")
splinePre <- DE_timecourse(moaninObject[1:500,], contrasts=timecourseContrasts,
     use_voom_weights=TRUE)

Again, because our input data are counts, we set use_voom_weights=TRUE to weight our observations based on their variability. Because we set log_transform=TRUE in our creation of moaninObject, the function will internally take the log of the data for fitting the splines, but use the counts for creating the voom weights (see ?DE_timecourse).

Each contrast has a pval and a qval column:

head(splinePre)

##                       Preflowering-Control_pval Preflowering-Control_qval
## Sobic.001G000100.v3.1                 0.8672454                 0.8899313
## Sobic.001G000200.v3.1                 0.4591360                 0.5665431
## Sobic.001G000300.v3.1                       NaN                       NaN
## Sobic.001G000400.v3.1                 0.2247364                 0.3468600
## Sobic.001G000501.v3.1                       NaN                       NaN
## Sobic.001G000700.v3.1                 0.7780892                 0.8298662
##                       Postflowering-Control_pval Postflowering-Control_qval
## Sobic.001G000100.v3.1                          0                          0
## Sobic.001G000200.v3.1                          0                          0
## Sobic.001G000300.v3.1                        NaN                        NaN
## Sobic.001G000400.v3.1                          0                          0
## Sobic.001G000501.v3.1                        NaN                        NaN
## Sobic.001G000700.v3.1                          0                          0
##                       Postflowering-Preflowering_pval
## Sobic.001G000100.v3.1                               0
## Sobic.001G000200.v3.1                               0
## Sobic.001G000300.v3.1                             NaN
## Sobic.001G000400.v3.1                               0
## Sobic.001G000501.v3.1                             NaN
## Sobic.001G000700.v3.1                               0
##                       Postflowering-Preflowering_qval
## Sobic.001G000100.v3.1                               0
## Sobic.001G000200.v3.1                               0
## Sobic.001G000300.v3.1                             NaN
## Sobic.001G000400.v3.1                               0
## Sobic.001G000501.v3.1                             NaN
## Sobic.001G000700.v3.1                               0

There is not a log-fold change column, because it is more complicated to define the log-fold change over a series of timepoints, where potentially the means may even switch from being over-expressed to under-expressed. We provide a function estimate_log_fold_change which gives the option of estimating several kinds of log-fold-change. We demonstrate with the method abs_sum, which gives the sum of the absolute difference in the means across time points:

log_fold_change_timepoints <- estimate_log_fold_change(
    moaninObject[1:500,], contrasts=timecourseContrasts, method="sum")
head(log_fold_change_timepoints)

##                       Preflowering-Control Postflowering-Control
## Sobic.001G000100.v3.1           -0.3333333            -0.3333333
## Sobic.001G000200.v3.1          212.6666667             7.0000000
## Sobic.001G000300.v3.1            0.0000000             0.0000000
## Sobic.001G000400.v3.1          103.0000000            78.3333333
## Sobic.001G000501.v3.1            0.0000000             0.0000000
## Sobic.001G000700.v3.1          -68.3333333           875.0000000
##                       Postflowering-Preflowering
## Sobic.001G000100.v3.1                    0.00000
## Sobic.001G000200.v3.1                 -205.66667
## Sobic.001G000300.v3.1                    0.00000
## Sobic.001G000400.v3.1                  -24.66667
## Sobic.001G000501.v3.1                    0.00000
## Sobic.001G000700.v3.1                  943.33333

See ?estimate_log_fold_change to see the full set of methods.

Visualizing Genes of Interest

The package moanin also provides a simple utility function (plot_splines_data) to visualize gene time-course data from different conditions. We use this to plot the 10 genes with the largest log-fold-change in Preflowering-Control contrast (since we only looked at the first top 500, these aren't representative of the overall signal in the data set, which is much stronger).

whSig <- which(splinePre[,"Preflowering-Control_qval"]<0.05)
deGenes <- order(abs(
    log_fold_change_timepoints)[whSig,"Preflowering-Control"],
    decreasing=TRUE)[1:10]

plot_splines_data(moaninObject[whSig, ][deGenes,],
    smooth=TRUE, mar=c(1.5,2.5,2,0.1))

plot of chunk unnamed-chunk-14

Clustering of time-course data

We can also cluster the data based on their spline fits. Here we would like to work with the log transform of the counts which are stored in moaninObject, so we set log_transform=TRUE.

# First fit the kmeans clusters
kmeans_clusters <- splines_kmeans(moaninObject[1:500,], n_clusters=3,
    random_seed=42, 
    n_init=20)

We then use the plot_splines_data function, only now applied to the centroids of the kmeans clustering, to visualize the centroids of each cluster obtained with the splines k-means model.

plot_splines_data(
    data=kmeans_clusters$centroids, moaninObject,
    smooth=TRUE)

plot of chunk unnamed-chunk-15

Assigning genes to clusters

Because there is variability in how well the genes fit a cluster, we would like to be able to score how well a gene fits a cluster. Furthermore, we often chose a subset of genes based on a filtering process, and we would like to have a mechanism to assign all genes to a cluster.

The function splines_kmeans_score_and_label gives a goodness-of-fit score between each gene and each cluster.

scores_and_labels <- splines_kmeans_score_and_label(
    object=moaninObject,data=preData[1:2000,], kmeans_clusters=kmeans_clusters)

Notice that we used more genes here (first 2000), even though previously we only used first 500 to make the clusters. These choices were unrealistic, since in practice we would probably pick high variable genes or differentially expressed genes, rather than the first 500, but at least give a sense of how this works.

The result is a list with elements scores and labels. scores gives the goodness-of-fit score between each gene and each cluster

head(scores_and_labels$scores)

##                            [,1]      [,2]      [,3]
## Sobic.001G000100.v3.1 0.9979535 0.9867139 0.9999825
## Sobic.001G000200.v3.1 0.9999940 0.9030656 0.9999523
## Sobic.001G000300.v3.1 1.0000000 1.0000000 1.0000000
## Sobic.001G000400.v3.1 1.0000000 0.8524792 1.0000000
## Sobic.001G000501.v3.1 1.0000000 1.0000000 1.0000000
## Sobic.001G000700.v3.1 0.9689555 0.9146831 0.9965017

labels gives the best cluster assignment for each gene, but only if its score in its best cluster is above a certain threshold (see ?splines_kmeans_score_and_label)

head(scores_and_labels$labels)

## Sobic.001G000100.v3.1 Sobic.001G000200.v3.1 Sobic.001G000300.v3.1 
##                    NA                     2                    NA 
## Sobic.001G000400.v3.1 Sobic.001G000501.v3.1 Sobic.001G000700.v3.1 
##                     2                    NA                     2

# How many are not assigned a label?
sum(is.na(scores_and_labels$labels))

## [1] 1000