DiscoRhythm is a framework for analyzing periodicity of large scale temporal biological datasets (e.g. circadian transcriptomic experiments). The main goal of this package is to characterize the rhythmicity present in the provided dataset by performing the following steps:
The entire workflow can be run interactively in the web application or run directly in R.
Section 4 specifies the data input format expected by the DiscoRhythm app.
Next, section 5 will describe all the analysis steps and their purpose. Section 6 will then demonstrate how to generate the same results using the DiscoRhythm R package directly from the R command line.
The public server for DiscoRhythm is available here2 Users will be timed out of this server after a session is idle for 30 minutes after the last computation has completed..
The web application currently has no auto-save functionality, results will be lost upon session termination. It may be helpful to use the report feature (see 5.6.2.2) which automatically downloads results upon completion.
To start using DiscoRhythm directly from the web application proceed to 5.
To run the application locally or use DiscoRhythm with R, the DiscoRhythm R package and its dependencies3 Installation of pandoc is required in order to use the report generation features of DiscoRhythm. must be installed by executing the following commands in R:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("DiscoRhythm")
The following commands then can be used in order to launch the DiscoRhythm application:
library(DiscoRhythm)
discoApp()
For improved performance DiscoRhythm can be parallelized using multiple cores:
discoApp(ncores=number_of_available_cores)
Alternatively, if docker is installed, the DiscoRhythm container on Docker Hub can be pulled and used to run the web application locally4 See instructions on Docker Hub. avoiding the need to install DiscoRhythm and its dependencies
The same computations performed in the web application can be executed directly in R. This may help with:
Section 6 describes using DiscoRhythm from R in more detail.
Sinusoid - A mathematical curve that describes periodic oscillations following a sine function.
Period - Parameter describing the duration after which a temporal pattern repeats itself.
Acrophase - The time in a periodic cycle where a temporal pattern is at its maximum value, typically referring to the peak of a sinusoid curve.
Amplitude - The amount of change in signal seen during the course of an oscillation.
P-value - Described simply, it is the probability that a finding could have occurred under the null hypothesis. For rhythm detection, this would be the probability that the “goodness of fit” of a rhythmic model would occur at a value greater than or equal to the observed value when applied to a non-rhythmic signal.
Biological replicates - Independent random samples from the population of interest with the same time of collection.
Technical replicates - A single biological sample split into multiple samples which are processed independently are typically labelled as “technical replicates”. They are useful for determining the variance introduced simply by the process of sample preprocessing.
DiscoRhythm was designed for large scale “-omic” data matrices such as those found in the following studies:
Krishnaiah et al. (2017) - circadian metabolomics study, where data matrices from supplemental table 2 may be used as input to DiscoRhythm after removing the first row from each sheet and saving as CSV. The “Liver_data” sheet may be used after the following modifications:
Hurley et al. (2018) - A circadian proteomic study, where supplemental dataset 2 may be used as input to DiscoRhythm after saving sheet “2B” or “2H” as CSV.
Wu et al. (2016) - This software package contains truncated transcriptomic datasets that can be exported from R for use with DiscoRhythm, including a simulated dataset (object name cycSimu4h2d) containing various waveforms to test the performance of the rhythm detection algorithms.
Singer, Fu, and Hughey (2019) - The simphony R package can be used to generate complex simulated “-omic” datasets, which can easily be formatted for input to DiscoRhythm.
See section 4 for full details on input data format.
For a more detailed background on statistical methods, experiment design, and method selection see the following references:
Cornelissen (2014) - A detailed review of Cosinor based techniques for rhythm analysis. The cosinor test as presented in Cornelissen et. al. (2014) is a common hypothesis test, which may appear in the literature as “Cosinor”, “harmonic regression”, “sinusoid regression”, and other similar names.
Hughes et al. (2017) - Experiment design for circadian sequencing experiments. Provides additional comments on p-values obtained from rhythm detection methods.
Deckard et al. (2013) - Evaluates 4 methods (de Lichtenberg, Lomb-scargle, JTK Cycle, and persistent homology) and Fig. 6 provides a clear decision tree to aid with the selection of the most suitable method for practical data analysis. This publication references additional method evaluation articles.
Wu et al. (2014) provides an empirical evaluation of 5 methods (ARSER, JTK Cycle, COSOPT, Fisher’s G test, and HAYSTACK).
These references also provide some valuable insights about the experiment design stage of rhythm detection. (see section 5.6.1.1 for a few additional comments).
The four currently implemented methods in DiscoRhythm are well established and have sound statistical methodology. These methods were easily accessed or implemented in R and fit well into the workflow for rhythmicity analysis. Additional algorithms fitting this criteria may be added to DiscoRhythm in the future. Below are some quotes taken directly from publications of these methods, briefly describing the rationale for each:
Cornelissen (2014)
“Conceived as a regression problem, the method is applicable to non-equidistant data, a major advantage.”
and referring to least squares techniques more generally:
“useful in curve-fitting problems, where it is desirable to obtain a functional form that best fits a given set of measurements.”
Hughes, Hogenesch, and Kornacker (2010)
“JTK_CYCLE’s increased resistance to outliers results in considerably greater sensitivity and specificity”
Glynn, Chen, and Mushegian (2006)
“Our approach should be applicable for detection and quantification of periodic patterns in any unevenly spaced gene expression time-series data.”
Yang and Su (2010)
“ARSER was superior to two existing and widely-used methods, COSOPT and Fisher’s G-test, during identification of sinusoidal and non-sinusoidal periodic patterns in short, noisy and non-stationary time-series”
Description of the input format expected by DiscoRhythm.
Below is a small simulated circadian transcriptomic dataset generated using simphony, which follows the expected input format for DiscoRhythm. The dataset was generated to contain ~50% rhythmic transcripts with many different phases of oscillation.
IDs | CT0_1_A | CT0_2_B | CT0_3_C | CT4_4_D | CT4_5_E |
---|---|---|---|---|---|
nonRhythmGene_Id1 | 6.3717658 | 4.0909466 | 8.3010577 | 7.0240708 | -4.6510406 |
nonRhythmGene_Id2 | -2.4340798 | -5.6695362 | -3.0800470 | -0.1800393 | 3.1282445 |
nonRhythmGene_Id3 | -9.8373929 | -5.8738313 | -0.7793628 | -2.7236067 | 2.6574806 |
nonRhythmGene_Id4 | 4.5951368 | 0.2952068 | 3.5660223 | -1.2458999 | 6.1385621 |
nonRhythmGene_Id5 | 0.3980836 | 3.9928686 | 1.5626314 | 0.0613089 | 0.7013376 |
nonRhythmGene_Id6 | -0.2139567 | -2.4819291 | -6.5602675 | -1.9901769 | -7.5526251 |
The first column should contain unique feature IDs (e.g. gene names in this case).
IDs | CT0_1_A | CT0_2_B | CT0_3_C | CT4_4_D | CT4_5_E |
---|---|---|---|---|---|
nonRhythmGene_Id1 | 6.3717658 | 4.0909466 | 8.3010577 | 7.0240708 | -4.6510406 |
nonRhythmGene_Id2 | -2.4340798 | -5.6695362 | -3.0800470 | -0.1800393 | 3.1282445 |
nonRhythmGene_Id3 | -9.8373929 | -5.8738313 | -0.7793628 | -2.7236067 | 2.6574806 |
nonRhythmGene_Id4 | 4.5951368 | 0.2952068 | 3.5660223 | -1.2458999 | 6.1385621 |
nonRhythmGene_Id5 | 0.3980836 | 3.9928686 | 1.5626314 | 0.0613089 | 0.7013376 |
nonRhythmGene_Id6 | -0.2139567 | -2.4819291 | -6.5602675 | -1.9901769 | -7.5526251 |
All subsequent columns contain experimental sample data.
IDs | CT0_1_A | CT0_2_B | CT0_3_C | CT4_4_D | CT4_5_E |
---|---|---|---|---|---|
nonRhythmGene_Id1 | 6.3717658 | 4.0909466 | 8.3010577 | 7.0240708 | -4.6510406 |
nonRhythmGene_Id2 | -2.4340798 | -5.6695362 | -3.0800470 | -0.1800393 | 3.1282445 |
nonRhythmGene_Id3 | -9.8373929 | -5.8738313 | -0.7793628 | -2.7236067 | 2.6574806 |
nonRhythmGene_Id4 | 4.5951368 | 0.2952068 | 3.5660223 | -1.2458999 | 6.1385621 |
nonRhythmGene_Id5 | 0.3980836 | 3.9928686 | 1.5626314 | 0.0613089 | 0.7013376 |
nonRhythmGene_Id6 | -0.2139567 | -2.4819291 | -6.5602675 | -1.9901769 | -7.5526251 |
Sample metadata is extracted from the column names of the matrix.
IDs | CT0_1_A | CT0_2_B | CT0_3_C | CT4_4_D | CT4_5_E |
---|---|---|---|---|---|
nonRhythmGene_Id1 | 6.3717658 | 4.0909466 | 8.3010577 | 7.0240708 | -4.6510406 |
nonRhythmGene_Id2 | -2.4340798 | -5.6695362 | -3.0800470 | -0.1800393 | 3.1282445 |
nonRhythmGene_Id3 | -9.8373929 | -5.8738313 | -0.7793628 | -2.7236067 | 2.6574806 |
nonRhythmGene_Id4 | 4.5951368 | 0.2952068 | 3.5660223 | -1.2458999 | 6.1385621 |
nonRhythmGene_Id5 | 0.3980836 | 3.9928686 | 1.5626314 | 0.0613089 | 0.7013376 |
nonRhythmGene_Id6 | -0.2139567 | -2.4819291 | -6.5602675 | -1.9901769 | -7.5526251 |
Names5 All fields should contain only alphanumeric values (with the
exception
of ‘.’ in the Time
field which is allowed for decimal values. are expected to follow the pattern:
Prefix
Time
*_Unique Id
_Replicate Id
Field | Description | Examples |
---|---|---|
Prefix |
A unit of time that will be used by the web application. | hr , ZT , CT |
Time * |
Indicates the time of collection for the respective sample. Can only be positive. | 20 , 2.1 , 0.3 |
Unique Id |
Used to uniquely identify samples in visualizations and summaries. | GSM3186429 , sample1 , subjectA , AX |
Replicate Id |
Used to identify each biological sample uniquely when combined with Time . |
1 , A , rep1 |
Sample Time - Time
6 32
, CT32
, CT32_AS_1
, 32_AS_1
are all valid naming styles. is the only mandatory field and the other
fields
are used to
specify the additional information about samples when necessary.
Biological vs Technical Replicates - Time
+ Replicate Id
7 If no Replicate Id
is provided, all samples are assumed to be
independent biological replicates. are used
to identify independent samples collected at the same timepoint (biological
replicates). Samples with the same Time
and Replicate Id
are assumed to be technical replicates originating from a single biological
sample.
Unique Sample Identity - Each column name should be unique such that all
samples can be uniquely identified to the user. The Unique Id
field is
intended for this purpose. If column
names are not unique, the Unique Id
field will be generated to provide
unique sample names during usage of DiscoRhythm.
The Time
and Replicate Id
data extracted from the column names
will be seen throughout the DiscoRhythm application8 This table is the colData(se)
data stored in R. as:
ID | Time | ReplicateID | |
---|---|---|---|
CT0_1_A | CT0_1_A | 0 | A |
CT0_2_B | CT0_2_B | 0 | B |
CT0_3_C | CT0_3_C | 0 | C |
CT4_4_D | CT4_4_D | 4 | D |
CT4_5_E | CT4_5_E | 4 | E |
CT4_6_F | CT4_6_F | 4 | F |
Note: If there is more relevant metadata in the experiment that does not fit into the design specified in this section, it may be appropriate to split the dataset into several parts and use each subgroup as a separate input for DiscoRhythm. Example: If the experiment includes multiple conditions, each condition may be an independent input dataset for DiscoRhythm.
DiscoRhythm defines time within a dataset in one of two ways:
Linear time exists in systems where an experiment start time is meaningful (often setting t=0 to some specific event). Circular time exists in experiments where the start of experiment is not meaningful or left unobserved (e.g. time-of-day in a cross-sectional study). For a specific dataset, one of these two types has to be selected and it will have an influence on how the DiscoRhythm analysis will be performed.
Example of linear time in hours: 1,2,3 ... 24,25,26
Example of circular time, time-of-day in hours: 1,2,3 ... 24,1,2
For example, circadian studies with mice often entrain to a 24hr light/dark schedule and then release mice into total darkness in order to remove external cues. The presence of a specific event (release into total darkness) would make the dataset suitable for the “Linear time” setting of DiscoRhythm.
On the other hand, if samples were collected during the entrainment to the light/dark base-cycle, “Circular time” would be appropriate as mice sampled at the same point in the cycle on different days may be treated as biological replicates (Hughes et al. 2017).
This section will walk the user through each section of the DiscoRhythm web application. It is recommended to keep this documentation open during the first use of the application in order to track what the application is doing in each section.
The DiscoRhythm web interface is a dashboard where the sections of the analysis progress in a sequential fashion with data from the previous step being carried over to the next. Each section can be accessed using the sidebar on the left.
Interactive controls allow users to set parameters specific to each section’s analysis. When parameters relevant to a figure change, the corresponding figure will dynamically update to reflect the newly calculated results. The exception to this is the oscillation detection section which requires user input to re-compute results.
Various download buttons are available throughout the application for extracting plot outputs and numerical results.
Purpose: To upload, clean, and summarize the experimental design of the provided dataset.
An input dataset is expected to be in comma separated value (CSV) format as specified in section 4. Upload the dataset using the “upload CSV” input method.
The simulated dataset (from section 4.1) is available to test the features of DiscoRhythm.
Messages or warnings may be seen at this point as DiscoRhythm imports the dataset and performs a few clean-up tasks:
Unique Id
with unique sample keys.Some oscillation detection algorithms are capable of handling rows with
missing values, however, this currently may only be performed via R
(using the discoODAs
function).
Time Type - See 4.4.
Period of Interest - The main hypothesized period should be specified in order to set appropriate defaults throughout the application. If unknown, set it to the range of the sample collection times.
Time Unit/Observation Unit - Units to display in the axis labels throughout the application.
If the “Sampling Summary” table does not seem to accurately reflect the data, please refer back to section 4.3. It is also a good idea to expand the “Inspect Input Data Matrix” and “Inspect Parsed Metadata” boxes to ensure the data has been read correctly.
In “Inspect Parsed Metadata” it is possible to override the metadata extracted from the column names with an independent CSV table which matches the metadata table format outlined in 4.3.1, however, this is not recommended for regular use.
Experimental artifacts commonly result in data not accurately reflecting the true biological phenomenon. This can often be observed through systematic signals from a single sample that do not have biological plausibility. DiscoRhythm attempts to detect such systematic outliers with the use of two separate methods:
Each method is applied independently to the dataset in order to detect outliers. The filtering summary section is then used to decide which detected outliers to remove. A reasonable standard deviation threshold for both methods would be around 2 to 3.
By default, no outliers will be flagged for removal. The DiscoRhythm web application will set the default threshold such that no outliers are flagged.
Purpose: In order to detect outliers, samples are pairwise correlated using either the Pearson or Spearman method.
Heatmap: The values of these pairwise correlations can be visualized in this tab, where samples with similar correlation values are grouped together using clustering.
Outlier Detection: The average correlation value for each sample is used as a metric of its overall similarity to all other samples and is summarized in this figure.
Samples with average correlations significantly below the mean will be flagged as outliers and the user may specify a number of standard deviations below the mean to use as a threshold.
Purpose: Utilize principal component analysis (PCA) to detect outliers.
PCA is used to extract the strongest recurring patterns in the dataset. Outliers detected in these patterns (PC scores) are flagged by their deviation from the mean where again the user may specify a threshold in units of standard deviations.
Scale Before PCA: Whether to scale rows to a standard deviation of 1 prior to PCA, such that all rows are on an equal scale.
PCs to Use For Outlier Detection: Click to change the list of PCs to use for outlier removal in the case the scores of a PC seem inappropriate for use in outlier detection. You can remove unwanted PCs by pressing “delete” and add extra ones by typing their number.
Before CSV/After CSV: Downloadable summaries of the PCA before and after the detected outliers are removed.
Figures:
Distributions: The distributions of PC scores used to detect outliers. Only the PCs colored darkly are used for outlier detection. Outliers will be shown with an ‘x’.
Scree: PCs are numbered, where the amount of variance explained by each PC (their ‘importance’) decreases with increasing PC number. This can be seen in the “Scree” figure. Users should choose an appropriate number of PCs to use for outlier detection by the shape of this scree plot.
One Pair and All Pairs: Plotting the PC scores of the components versus one another may reveal grouping that cannot be determined from simple analysis of individual PCs.
Purpose: Determine how to proceed with outlier removal.
The user may, at this point, choose to remove the flagged outliers or may disregard these flags if it is suspected the dissimilarity of these samples may be biologically relevant. The user may also manually remove samples they deem unreliable for further analysis.
Raw Distributions: A boxplot for the values of each individual sample that can be used to further evaluate sample selection.
Input and Final: Shows summary tables for data before and after outlier removal.
Purpose: Utilize any technical replicates present in the dataset to quantify the signal-to-noise ratio for each row. Then combine technical replicates for downstream analysis.
Technical replicates are not useful for the statistical tests used by DiscoRhythm for oscillation detection, as they are not representative of the populational variance of the data (i.e. do not satisfy the independence assumptions). They will instead be used to identify rows of the dataset where the biological variation is greater than the technical variation. ANOVA procedures are used to determine whether a row has detectable biological signal.
ANOVA Method: 3 options are available for ANOVA:
F-statistic Cutoff: The user may choose to filter rows by the magnitude of the signal-to-noise ratio rather than by statistical significance.
Replicates should be combined for downstream rhythmicity analysis. DiscoRhythm provides three methods for combining technical replicates:
Note: Users may also choose not to combine technical replicates. This is only advisable if the technical replicates do in fact represent independent samples of the population/dataset (i.e. if they were erroneously labelled in section 4).
Purpose: Summarize the strength of multiple periodicities across the entire dataset. This section may be used for broad characterization of the dataset, or as an exploratory analysis tool for determining the main periodicities of the dataset to use for oscillation detection.
Available periods will be limited10 For circular time, only harmonics of the base-cycle will be available for testing (e.g. for 24hr base-cycle: 12hr, 8hr etc.). starting from a smallest period of 3 times the sampling-interval up to the sampling duration. 12 periods, evenly spaced across this range, will be used for period detection.
The aim of this section is to evaluate various periodicities across all features, such that a fixed period may be chosen for oscillation detection. While methods for determining period strength for individual time series have been established, methods for determining a common period across multiple parallel time series are less clear. The period detection section attempts to determine this optimal period by obtaining a “goodness of fit” across all features and allows for inspection for the period with the highest median fit. Cosinor’s R2 was chosen over other methods of estimating model fit quality due to: the maturity of the method, its execution speed, and its ability to be executed under most experiment design conditions (see 5.6.1.1).
To obtain an abstract representation of the rhythmic patterns seen across the dataset, PCA is performed to reduce dimensionality and test the summarized temporal signal for rhythmicity (using the Cosinor method). Using an a priori hypothesized period, a significant fit on a single PC indicates the presence of few phases11 e.g. an oscillating principal component with an 18hr acrophase indicates many features oscillate near 18hr and, due to negative loadings, there may also be many features oscillating near a 6hr acrophase. in the dataset, while strong fits to multiple PCs suggests that there exist multiple phases of oscillation. Oscillations detected among the first principal components would be more meaningful than oscillations in the last components. When a period is selected based on period detection results, more caution should be taken as the reliability of the findings might suffer due to selection bias and overfitting.
Inspection of the patterns seen in the principal component scores may hint at other effects in the data that one should be aware of (e.g. batch effects or unobserved confounders).
The two statistical summaries of the periodicity provided in the period detection section may be used to proceed with the analysis in “Oscillation Detection” in multiple ways. Three scenarios are provided below to aid in illustrating some approaches for utilizing the results. The aim often should be to proceed with a hypothesis driven analysis; however, data driven exploratory approaches may also be used.
An exact periodicity is hypothesized. The goal is to determine which features fit this hypothesis.
An approximate periodicity is hypothesized, such as when subjects are in free running conditions. The goal is to determine which features fit this approximate hypothesis.
The period is entirely unknown. The goal may be to find a dominant periodicity with a secondary goal of identifying which features follow this period.
Purpose: Individually quantify rhythmicity of remaining rows in the dataset where each row will be tested for rhythmicity using methods suitable for the sample collections present.
The user must choose a single period12 If it is unknown which periodicity to test start with the dominant period seen in section 5.5. of oscillation to test across all rows of the dataset. The application may show warnings/messages regarding the choice of period. By default, the period specified in the ‘Select Data’ section will be chosen.
JTK Cycle, Lomb-Scargle, and ARSER results are all obtained through the MetaCycle R package (meta2d function using minper=maxper). Cosinor is provided by DiscoRhythm. A brief summary of each method:
Cosinor - a.k.a “Harmonic Regression” Fits a sinusoid
with a free phase parameter.
JTK Cycle - non-parametric test of rhythmicity robust to outliers.
Lomb-Scargle - an approach using spectral power density.
ARSER - removes linear trends and performs the Cosinor test.
Exclusion Criteria Matrix: A table is presented describing the criteria that exclude a method from being used. All exclusion criteria which are true for the loaded dataset are shown.13 If no criteria are present, this table will be absent. The reasons for exclusion may be due to either computational (causes errors under given conditions) or statistical restrictions (requirements of study design) of the method.
Criteria | Description |
---|---|
missing_value | Rows contain missing values. |
with_bio_replicate | Biological replicates are present. |
non_integer_interval | The spacing between samples is not an integer value. |
uneven_interval | Time between collections is not uniform. |
circular_t | Time is circular (see 4.4). |
invalidPeriod | Chosen period to test is not valid. |
invalidJTKperiod | Chosen period to test is not valid for JTK Cycle. |
In the below matrix,14 This matrix is from the object: discoODAexclusionMatrix
FALSE
indicates that the method is unable to
handle this condition:
CS | JTK | LS | ARS | |
---|---|---|---|---|
missing_value | TRUE | TRUE | TRUE | FALSE |
with_bio_replicate | TRUE | TRUE | TRUE | FALSE |
non_integer_interval | TRUE | FALSE | TRUE | FALSE |
uneven_interval | TRUE | FALSE | TRUE | FALSE |
circular_t | TRUE | TRUE | TRUE | FALSE |
invalidPeriod | FALSE | FALSE | FALSE | FALSE |
invalidJTKperiod | TRUE | FALSE | TRUE | TRUE |
This criteria should be considered during the design phase of an experiment when possible to ensure the most appropriate method will be available for use. It may not always be ideal to utilize multiple methods for rhythm detection as the process of integrating the results is not well defined and may interfere with drawing clear conclusions. Discussions regarding methods for integrating rhythm detection results may be found in the literature:
Wu et al. (2016) - Proposes Fisher’s method for integrating p-values
Hutchison and Dinner (2017) - Proposes Brown’s method for integrating p-values
Users preferring to decide on a single method are encouraged to consult the literature on which methods are suitable under given conditions (see 3.3).
Two modes of execution are available for oscillation detection:
On the public server only, email features are available for receiving notifications and results.
When computations are finished, results can be visualized in the application as seen in section 5.6.3. If an email address is provided, an email will be sent to indicate that the computations are complete.
A report of the results and the rest of the DiscoRhythm session can be generated by selecting the report method. This mode executes the entire DiscoRhythm workflow from scratch with the current parameter settings of the session to generate a zip file containing:
discoBatch()
)discoODAs()
)If the input dataset is large, this may be a preffered mode of execution over the interactive mode which could time out before results are viewed and/or exported. Results from the report mode will be automatically downloaded/saved upon completion.
This is the recommended method for archiving results as the inputs, outputs, all software versions, and the precise code used to produce the results will be archived in one location.
If an email address is provided, an email will be sent with the results attached upon completion.
Section 6.6.2 describes how to use the Rdata archive file to recompute results and continue an analysis in R.
Once rhythmicity computation is completed, 3 sections become available for inspecting the results:
Individual Models: Allows inspection of individual rows from the raw dataset. User may click a row on the table in order to display the raw data values along with a fitted periodic curve. If the Cosinor method is being viewed, the line will be the Cosinor fit, all other methods utilize a loess fit. If the error bar option is selected, a 95% confidence interval on the mean will be displayed for each timepoint.
Summary: Summarizes calculated rhythm parameters across all tested rows by all executed methods.
Method Comparison: Offers pairwise comparison of rhythmic parameters calculated by each method to determine the degree of agreement between them.
Note on method comparison: This section should not be used as evidence of rhythmicity, and rather should be used for testing purposes with simulated datasets to determine the degree of agreement of the detection methods under various conditions. For real datasets, this section could be used to compare the results of each method to determine which features are sensitive to method choice.
The remainder of this vignette will focus on programmatic usage of DiscoRhythm functions using R.
This section will demonstrate usage of the R functions used to perform the
analysis in section 5. For each of
the sections below, refer to the DiscoRhythm R package manual for
specific technical details on usage, arguments and methods or use ?
to
access individual manual pages. For instance,
get more
help for the function discoBatch()
with command ?discoBatch
.
SummarizedExperiment objects such as those generated
by other Bioconductor packages
should be suitable inputs to DiscoRhythm functions
once modified to contain the required metadata.
For a “Summarized experiment” object named se
the following fields are assumed
to be present:
rownames(se)
- The feature IDs.
assay(se)
15 At present DiscoRhythm will use only the
first assay (assays(se)[[1]]
) of the SummarizedExperiment
and all others will be ignored. - A matrix containing experimental data.
colData(se)
- Stores sample metadata (See 4.3.1).
3 columns are required:
Objects with this structure will be used throughout the package.
The CSV inputs to the DiscoRhythm web interface described in section
4 can be read into R as a data.frame
.
To allow the users of the web application to use the same input for their
analysis in R, the function discoDFtoSE()
can convert the tabular
input into an apporpriate format described in
6.1 above.
The sample metadata will be
extracted from the column
names by matching the format16 See ?discoParseMeta
for regular expression specifications. described in section
4.3.
The checks for validity and uniqueness mentioned in section 5.2
will also be performed. Alternatively, this metadata may be provided directly
to discoDFtoSE
as a data.frame
.
Loading in the example dataset described in section 4 using
discoGetSimu()
to get the demo dataset as a data.frame
:
library(DiscoRhythm)
indata <- discoGetSimu()
knitr::kable(head(indata[,1:6]), format = "markdown") # Inspect the data
IDs | CT0_1_A | CT0_2_B | CT0_3_C | CT4_4_D | CT4_5_E |
---|---|---|---|---|---|
nonRhythmGene_Id1 | 6.3717658 | 4.0909466 | 8.3010577 | 7.0240708 | -4.6510406 |
nonRhythmGene_Id2 | -2.4340798 | -5.6695362 | -3.0800470 | -0.1800393 | 3.1282445 |
nonRhythmGene_Id3 | -9.8373929 | -5.8738313 | -0.7793628 | -2.7236067 | 2.6574806 |
nonRhythmGene_Id4 | 4.5951368 | 0.2952068 | 3.5660223 | -1.2458999 | 6.1385621 |
nonRhythmGene_Id5 | 0.3980836 | 3.9928686 | 1.5626314 | 0.0613089 | 0.7013376 |
nonRhythmGene_Id6 | -0.2139567 | -2.4819291 | -6.5602675 | -1.9901769 | -7.5526251 |
And importing as a SummarizedExperiment
.
se <- discoDFtoSE(indata)
The reverse operation, discoSEtoDF
, is also available and is mainly intended
for exporting the data to a “csv” format to be used as an input to the web
application.
write.csv(discoSEtoDF(se),file = "DiscoRhythmInputFile.csv",row.names = FALSE)
This function performs the row-wise checks for missing values and constant values as mentioned in section 5.2.
selectDataSE <- discoCheckInput(se)
The sample collection information present in colData(selectDataSE)
can be
summarized by the discoDesignSummary()
function to detail the number of
biological and technical replicates available at each collection time. Number of
technical replicates is shown in brackets.
library(SummarizedExperiment)
Metadata <- colData(selectDataSE)
knitr::kable(discoDesignSummary(Metadata),format = "markdown")
0 | 4 | 8 | 12 | 16 | 20 | |
---|---|---|---|---|---|---|
Total | 9 | 9 | 9 | 9 | 9 | 9 |
Biological Sample A (3) | Biological Sample D (3) | Biological Sample G (3) | Biological Sample J (3) | Biological Sample M (3) | Biological Sample P (3) | |
Biological Sample B (3) | Biological Sample E (3) | Biological Sample H (3) | Biological Sample K (3) | Biological Sample N (3) | Biological Sample Q (3) | |
Biological Sample C (3) | Biological Sample F (3) | Biological Sample I (3) | Biological Sample L (3) | Biological Sample O (3) | Biological Sample R (3) |
Performs the analysis described in 5.3.1 to return some intermediate results and a vector indicating which samples were flagged as outliers using the inter-sample correlation method.
CorRes <- discoInterCorOutliers(selectDataSE,
cor_method="pearson",
threshold=3,
thresh_type="sd")
Performs the analysis described in 5.3.2 to return some intermediate results and a vector indicating which samples were flagged as outliers using the PCA method.
PCAres <- discoPCAoutliers(selectDataSE,
threshold=3,
scale=TRUE,
pcToCut = c("PC1","PC2","PC3","PC4"))
A simple wrapper was written for the stats::prcomp
function for better use
with
the web application and it can be utilized as:
discoPCAres <- discoPCA(selectDataSE)
This returns the same output as prcomp with the addition of a reformatted
summary table (available as PCAresAfter$table
).
The results of the outlier detection analysis (CorRes
and PCAres
)
are used to remove outliers by subsetting the data.
FilteredSE <- selectDataSE[,!PCAres$outliers & !CorRes$outliers]
DT::datatable(as.data.frame(
colData(selectDataSE)[PCAres$outliers | CorRes$outliers,]
))
knitr::kable(discoDesignSummary(colData(FilteredSE)),format = "markdown")
0 | 4 | 8 | 12 | 16 | 20 | |
---|---|---|---|---|---|---|
Total | 9 | 8 | 9 | 8 | 9 | 9 |
Biological Sample A (3) | Biological Sample D (3) | Biological Sample G (3) | Biological Sample J (3) | Biological Sample M (3) | Biological Sample P (3) | |
Biological Sample B (3) | Biological Sample E (3) | Biological Sample H (3) | Biological Sample K (3) | Biological Sample N (3) | Biological Sample Q (3) | |
Biological Sample C (3) | Biological Sample F (2) | Biological Sample I (3) | Biological Sample L (2) | Biological Sample O (3) | Biological Sample R (3) |
Performs the analysis described in 5.4 returning the results
of the ANOVA test and the se
data object where technical replicates
are combined.
ANOVAres <- discoRepAnalysis(FilteredSE,
aov_method="Equal Variance",
aov_pcut=0.05,
aov_Fcut=1,
avg_method="Median")
FinalSE <- ANOVAres$se
Performs the analysis described in 5.5 to return a
data.frame
of Cosinor results across a range of periods.
PeriodRes <- discoPeriodDetection(FinalSE,
timeType="linear",
main_per=24)
The main period of interest is fit using a Cosinor model to principal component scores as described in 5.5.
OVpca <- discoPCA(FinalSE)
OVpcaSE <- discoDFtoSE(data.frame("PC"=1:ncol(OVpca$x),t(OVpca$x)),
colData(FinalSE))
knitr::kable(discoODAs(OVpcaSE,period = 24,method = "CS")$CS,
format = "markdown")
acrophase | amplitude | pvalue | qvalue | Rsq | mesor | sincoef | coscoef |
---|---|---|---|---|---|---|---|
17.712807 | 8.8193393 | 0.0000000 | 0.0000000 | 0.9855849 | 0 | -8.7944228 | -0.6624756 |
23.688091 | 5.4627944 | 0.0000000 | 0.0000000 | 0.9182224 | 0 | -0.4455825 | 5.4445918 |
14.950576 | 0.2226963 | 0.9816116 | 0.9973953 | 0.0024716 | 0 | -0.1554194 | -0.1594943 |
10.367958 | 0.1284881 | 0.9936085 | 0.9973953 | 0.0008546 | 0 | 0.0532436 | -0.1169372 |
12.875265 | 0.3915672 | 0.9364723 | 0.9973953 | 0.0087132 | 0 | -0.0889421 | -0.3813321 |
15.091166 | 0.2544040 | 0.9696152 | 0.9973953 | 0.0041057 | 0 | -0.1841327 | -0.1755465 |
11.569215 | 0.4873137 | 0.8860104 | 0.9973953 | 0.0160074 | 0 | 0.0548425 | -0.4842179 |
23.145543 | 0.2364532 | 0.9697559 | 0.9973953 | 0.0040864 | 0 | -0.0524537 | 0.2305618 |
2.219717 | 0.3157691 | 0.9426803 | 0.9973953 | 0.0078395 | 0 | 0.1733449 | 0.2639350 |
20.716605 | 0.0657149 | 0.9973953 | 0.9973953 | 0.0003477 | 0 | -0.0497840 | 0.0428952 |
Performs the analysis described in 5.6.1
using just the Cosinor method.
discoODAs()
will automatically run all appropraite methods
if none are provided.
discoODAres <- discoODAs(FinalSE,
period=24,
method="CS",
ncores=1,
circular_t=FALSE)
The entire analysis performed in section 6 may be
run through a single call to discoBatch()
to obtain the final discoODAres
results.
discoBatch(indata=indata,
report="discoBatch_example.html",
ncores=1,
main_per=24,
timeType="linear",
cor_threshold=3,
cor_method="pearson",
cor_threshType="sd",
pca_threshold=3,
pca_scale=TRUE,
pca_pcToCut=paste0("PC",seq_len(4)),
aov_method="None",
aov_pcut=0.05,
aov_Fcut=0,
avg_method="Median",
osc_method="CS",
osc_period=24)
This command will generate an html report called “discoBatch_example.html”
, which includes the
visualizations seen in the DiscoRhythm application. indata
may be in either
of the two input formats described in 6.1 (data.frame
or
SummarizedExperiment
).
Section 5.6.2.2 describes how to archive results in a DiscoRhythm web session. The RDS file produced by this feature contains the parameters used in the session which can be used by one of the methods below to continue an analysis in R.
First the CSV input dataset can be read into R. This may be done as:
indata <- read.csv(path_to_csv_file)
And the R data file can be read into R with:
discorhythm_inputs <- readRDS('discorhythm_inputs.RDS')
The batch analysis can then be computed to create a report and obtain a list of the oscillation detection results.
discoODAres <- do.call(discoBatch, c(list(indata=indata,
report='discorhythm_report.html'),
discorhythm_inputs))
The above code is simply a call to discoBatch
using a list of arguments
as input.
Alternatively, a customized DiscoRhythm workflow may be built on the code template below which computes the major results. First, to get all necessary objects into the workspace, the list will be converted into individual objects using the code below.
list2env(discorhythm_inputs,envir=.GlobalEnv)
The code from this template can then be used directly and modified for further analysis. This code and visualization code can be found in the discorhythm_report.html file.
######################################################################
# Intended for use by discoBatch or through the DiscoRhythm_report.Rmd
# Includes all R code for the DiscoRhythm data processing
# Expects all arguments to discoBatch in the environment
#####################################################################
library(DiscoRhythm)
# Preprocess inputs
selectDataSE <- discoCheckInput(discoDFtoSE(indata))
# Intersample correlations
CorRes <- discoInterCorOutliers(selectDataSE,cor_method,
cor_threshold,cor_threshType)
# PCA for outlier detection
PCAres <- discoPCAoutliers(selectDataSE,pca_threshold,pca_scale,pca_pcToCut)
PCAresAfter <- discoPCA(selectDataSE[,!PCAres$outliers])
# Removing the outliers from the main data.frame and metadata data.frame
FilteredSE <- selectDataSE[,!PCAres$outliers & !CorRes$outliers]
# Running ANOVA and merging replicates
ANOVAres <- discoRepAnalysis(FilteredSE, aov_method,
aov_pcut, aov_Fcut, avg_method)
# Data to be used for Period Detection and Oscillation Detection
FinalSE <- ANOVAres$se
# Perform PCA on the final dataset
OVpca <- discoPCA(FinalSE)
# Period Detection
PeriodRes <- discoPeriodDetection(FinalSE,
timeType,
main_per)
# Oscillation Detection
discoODAres <- discoODAs(FinalSE,
circular_t = timeType=="circular",
period=osc_period,
osc_method,ncores)
Usage in R provides additional columns from discoODAs
which are excluded from
web application outputs. Additionally, the web application provides row means
which may be easily calculated in R with rowMeans
.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.32.0 Biobase_2.62.0
## [3] GenomicRanges_1.54.0 GenomeInfoDb_1.38.0
## [5] IRanges_2.36.0 BiocGenerics_0.48.0
## [7] MatrixGenerics_1.14.0 matrixStats_1.0.0
## [9] S4Vectors_0.40.0 DiscoRhythm_1.18.0
## [11] BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 gridExtra_2.3 formatR_1.14
## [4] rlang_1.1.1 magrittr_2.0.3 shinydashboard_0.7.2
## [7] compiler_4.3.1 reshape2_1.4.4 systemfonts_1.0.5
## [10] vctrs_0.6.4 stringr_1.5.0 rvest_1.0.3
## [13] pkgconfig_2.0.3 crayon_1.5.2 fastmap_1.1.1
## [16] backports_1.4.1 magick_2.8.1 XVector_0.42.0
## [19] ellipsis_0.3.2 ca_0.71.1 utf8_1.2.4
## [22] promises_1.2.1 rmarkdown_2.25 UpSetR_1.4.0
## [25] purrr_1.0.2 xfun_0.40 zlibbioc_1.48.0
## [28] cachem_1.0.8 jsonlite_1.8.7 matrixTests_0.2.3
## [31] highr_0.10 later_1.3.1 DelayedArray_0.28.0
## [34] broom_1.0.5 R6_2.5.1 stringi_1.7.12
## [37] bslib_0.5.1 RColorBrewer_1.1-3 jquerylib_0.1.4
## [40] Rcpp_1.0.11 bookdown_0.36 assertthat_0.2.1
## [43] iterators_1.0.14 knitr_1.44 VennDiagram_1.7.3
## [46] Matrix_1.6-1.1 httpuv_1.6.12 tidyselect_1.2.0
## [49] rstudioapi_0.15.0 abind_1.4-5 yaml_2.3.7
## [52] viridis_0.6.4 TSP_1.2-4 miniUI_0.1.1.1
## [55] codetools_0.2-19 lattice_0.22-5 tibble_3.2.1
## [58] plyr_1.8.9 shiny_1.7.5.1 evaluate_0.22
## [61] lambda.r_1.2.4 futile.logger_1.4.3 heatmaply_1.5.0
## [64] xml2_1.3.5 zip_2.3.0 pillar_1.9.0
## [67] shinycssloaders_1.0.0 BiocManager_1.30.22 DT_0.30
## [70] foreach_1.5.2 shinyjs_2.1.0 plotly_4.10.3
## [73] generics_0.1.3 RCurl_1.98-1.12 ggplot2_3.4.4
## [76] munsell_0.5.0 scales_1.2.1 xtable_1.8-4
## [79] glue_1.6.2 lazyeval_0.2.2 tools_4.3.1
## [82] dendextend_1.17.1 data.table_1.14.8 webshot_0.5.5
## [85] registry_0.5-1 grid_4.3.1 tidyr_1.3.0
## [88] crosstalk_1.2.0 seriation_1.5.1 shinyBS_0.61.1
## [91] colorspace_2.1-0 GenomeInfoDbData_1.2.11 cli_3.6.1
## [94] kableExtra_1.3.4 futile.options_1.0.1 fansi_1.0.5
## [97] S4Arrays_1.2.0 viridisLite_0.4.2 svglite_2.1.2
## [100] dplyr_1.1.3 gtable_0.3.4 sass_0.4.7
## [103] digest_0.6.33 SparseArray_1.2.0 htmlwidgets_1.6.2
## [106] htmltools_0.5.6.1 lifecycle_1.0.3 httr_1.4.7
## [109] mime_0.12 ggExtra_0.10.1
Cornelissen, Germaine. 2014. “Cosinor-Based Rhythmometry.” Theor. Biol. Med. Model. 11 (April): 16.
Deckard, Anastasia, Ron C Anafi, John B Hogenesch, Steven B Haase, and John Harer. 2013. “Design and Analysis of Large-Scale Biological Rhythm Studies: A Comparison of Algorithms for Detecting Periodic Signals in Biological Data.” Bioinformatics 29 (24): 3174–80.
Glynn, Earl F, Jie Chen, and Arcady R Mushegian. 2006. “Detecting Periodic Patterns in Unevenly Spaced Gene Expression Time Series Using Lomb-Scargle Periodograms.” Bioinformatics 22 (3): 310–16.
Hughes, Michael E, Katherine C Abruzzi, Ravi Allada, Ron Anafi, Alaaddin Bulak Arpat, Gad Asher, Pierre Baldi, et al. 2017. “Guidelines for Genome-Scale Analysis of Biological Rhythms.” J. Biol. Rhythms 32 (5): 380–93.
Hughes, Michael E, John B Hogenesch, and Karl Kornacker. 2010. “JTK_CYCLE: An Efficient Nonparametric Algorithm for Detecting Rhythmic Components in Genome-Scale Data Sets.” J. Biol. Rhythms 25 (5): 372–80.
Hurley, Jennifer M, Meaghan S Jankowski, Hannah De Los Santos, Alexander M Crowell, Samuel B Fordyce, Jeremy D Zucker, Neeraj Kumar, et al. 2018. “Circadian Proteomic Analysis Uncovers Mechanisms of Post-Transcriptional Regulation in Metabolic Pathways.” Cell Syst 7 (6): 613–626.e5.
Hutchison, Alan L., and Aaron R. Dinner. 2017. “Correcting for Dependent P-Values in Rhythm Detection.” bioRxiv. https://doi.org/10.1101/118547.
Krishnaiah, Saikumari Y, Gang Wu, Brian J Altman, Jacqueline Growe, Seth D Rhoades, Faith Coldren, Anand Venkataraman, et al. 2017. “Clock Regulation of Metabolites Reveals Coupling Between Transcription and Metabolism.” Cell Metab. 25 (4): 961–974.e4.
Singer, Jordan M, Darwin Y Fu, and Jacob J Hughey. 2019. “Simphony: Simulating Large-Scale, Rhythmic Data.” PeerJ 7 (May): e6985.
Wu, Gang, Ron C Anafi, Michael E Hughes, Karl Kornacker, and John B Hogenesch. 2016. “MetaCycle: An Integrated R Package to Evaluate Periodicity in Large Scale Data.” Bioinformatics 32 (21): 3351–3.
Wu, Gang, Jiang Zhu, Jun Yu, Lan Zhou, Jianhua Z Huang, and Zhang Zhang. 2014. “Evaluation of Five Methods for Genome-Wide Circadian Gene Identification.” J. Biol. Rhythms 29 (4): 231–42.
Yang, Rendong, and Zhen Su. 2010. “Analyzing Circadian Expression Data by Harmonic Regression Based on Autoregressive Spectral Estimation.” Bioinformatics 26 (12): i168–74.