Contents

1 Purpose

Coordinate Covariation Analysis (COCOA) is a method to understand variation among samples and can be used with data that includes genomic coordinates such as DNA methylation. To describe the method on a high level, COCOA uses a database of “region sets” and principal component analysis (PCA) of the data to identify sources of variation among samples. A region set is a set of genomic regions that share a biological annotation, for instance transcription factor (TF) binding regions, histone modification regions, or open chromatin regions. COCOA can identify region sets where there is intersample variation, giving you biologically meaningful insight into variation in your data. For a more in-depth description of the method, see the “Method details” section of this vignette. In contrast to some other common techniques, COCOA is unsupervised, meaning that samples do not have to be divided into groups such as case/control or healthy/disease, although COCOA works in those situations as well. Also, COCOA focuses on continuous variation between samples instead of having cutoffs. Because of this, COCOA can be used as a complementary method alongside “differential” methods that find discrete differences between groups of samples and it can also be used in situations where there are no groups.

2 Our test case

In this vignette, we will see how COCOA can find meaningful sources of intersample variation in breast cancer patients. We will use data from The Cancer Genome Atlas: 450k DNA methylation microarrays for 657 breast cancer patients (TCGA-BRCA, https://portal.gdc.cancer.gov/). This vignette begins after PCA has been performed on the DNA methylation data. As you can see below, PC1 and PC4 cluster patients based on estrogen receptor status which is an important prognostic marker for breast cancer:

Based on this, we might expect that the variation captured by these PCs would be associated with estrogen receptor (consistent with interpatient variation in estrogen receptor status). We are using an easy example for this vignette but COCOA will still work even if there are not distinct clusters in your PCA plot and if you do not have known groups of samples. Outside of this vignette, COCOA was performed with a region set database of over 2000 region sets. To keep things simple for this vignette, we will only use two of the highest scoring region sets and two of the lowest scoring region sets from that analysis to demonstrate how COCOA works.

3 Inputs

There are three main inputs for COCOA:

  1. PCA loadings (prcomp()$rotation).
  2. Genomic coordinates associated with the raw data used for the PCA and with the PCA loadings.
  3. Region sets (normally a database of region sets).

We did PCA with DNA methylation data for all autosomal chromosomes but only the DNA methylation data and loadings for cytosines in chromosome 1 are included in this vignette to minimize memory requirements. brcaLoadings1 has the chr1 loadings. brcaMCoord1 has the coordinates for the chr1 DNA methylation data, which are also the coordinates for the chr1 loadings. For more on the relationship between the loadings and the original data see the “Method details” section. In this vignette, we will use only a few region sets to demonstrate how to use the COCOA functions. For actual use, COCOA should be done with many region sets (ie hundreds or > 1000). Publicly available collections of region sets can be found online (eg http://databio.org/regiondb).

4 Running COCOA

First, we load the example data and required packages:

library(COCOA)
library(data.table)
library(ggplot2)
data("esr1_chr1")
data("gata3_chr1")
data("nrf1_chr1")
data("atf3_chr1")
data("brcaLoadings1")
data("brcaMCoord1")
data("brcaPCScores")
data("brcaMethylData1")

We have region sets for four transcription factors from ChIP-seq experiments (with the same reference genome as our breast cancer data): Esr1 in MCF-7 cells, Gata3 in MCF-7 cells, Nrf1 in HepG2 cells, and Atf1 in K562 cells.

# prepare data
GRList <- GRangesList(esr1_chr1, gata3_chr1, nrf1_chr1, atf3_chr1) 
regionSetName <- c("esr1_chr1", "gata3_chr1", "nrf1_chr1", "atf3_chr1")

Now let’s give each region set a score with runCOCOA() to quantify how much it is associated with each principal component:

PCsToAnnotate <- paste0("PC", 1:4)
regionSetScores <- runCOCOA(loadingMat=brcaLoadings1, 
                            signalCoord=brcaMCoord1, 
                            GRList=GRList, 
                            PCsToAnnotate=PCsToAnnotate, 
                            scoringMetric="regionMean")
regionSetScores$regionSetName <- regionSetName

As an easy way to visualize the results, we can see how the region sets are ranked for each PC:

rsScoreHeatmap(regionSetScores, 
               PCsToAnnotate=paste0("PC", 1:4), 
               rsNameCol = "regionSetName",
               orderByPC = "PC1", 
               column_title = "Region sets ordered by score for PC1")

We can see that Gata3 had the highest score for PCs 1, 3, and 4 but that Esr1 had a higher score for PC2. While you might expect that Esr1 would be ranked first for all PCs, Gata3 is known to be associated with breast cancer and Esr1 activity so these results still make sense. As mentioned before, this vignette is showing two of the top scoring and bottom scoring region sets from a previous COCOA analysis with over 2200 region sets. So despite Esr1 being ranked second out of four for some PCs in this vignette which may not seem very impressive, it had one of the best scores in the previous larger analysis. On a practical note, if you want to arrange the heatmap by region set scores for another PC, you can just change the orderByPC parameter like so:

rsScoreHeatmap(regionSetScores, 
               PCsToAnnotate=paste0("PC", 1:4),
               rsNameCol = "regionSetName",
               orderByPC = "PC2", 
               column_title = "Region sets ordered by score for PC2")

5 Understanding the results

We can further understand the variability in these region sets in several ways:

  1. Look at whether variability is specific to the regions of interest compared to the genome around these regions.
  2. Look at the genomic signal in these regions, in our case DNA methylation, and whether it follows the same trends as the PC scores.
  3. Look at the loadings in each region of a given region set to see whether all regions have high loadings or just a subset the regions. Also we can see if the same regions have high loadings for multiple PCs.

5.1 Specificity of variation to the regions of interest

We can see whether variability along the PC is specific to the region of interest by comparing the region of interest to the surrounding genome. To do this, we will calculate the average loadings of a wide area surrounding the regions of interest.

wideGRList <- lapply(GRList, resize, width=14000, fix="center")
loadProfile <- lapply(wideGRList, function(x) getLoadingProfile(loadingMat=brcaLoadings1,
                                                                signalCoord=brcaMCoord1,
                                                                regionSet=x, 
                                                                PCsToAnnotate=PCsToAnnotate,
                                                                binNum=21))

We will normalize the result for each PC so we can better compare them. Here we normalize by subtracting the mean absolute loading of each PC from the region set profiles for the corresponding PC. Then we get the plot scale so we can easily compare the different profiles. These normalization steps are helpful for comparing the loading profiles but not necessarily required so it’s not essential that you understand the below code.

# average loading value from each PC to normalize so PCs can be compared with each other
avLoad <- apply(X=brcaLoadings1[, PCsToAnnotate], 
                MARGIN=2, 
                FUN=function(x) mean(abs(x)))

# normalize
loadProfile <- lapply(loadProfile, 
                      FUN=function(x) as.data.frame(mapply(FUN = function(y, z) x[, y] - z, 
                                                           y=PCsToAnnotate, z=avLoad)))
binID = 1:nrow(loadProfile[[1]])
loadProfile <- lapply(loadProfile, FUN=function(x) cbind(binID, x))

# for the plot scale
maxVal <- max(sapply(loadProfile, FUN=function(x) max(x[, PCsToAnnotate])))
minVal <- min(sapply(loadProfile, FUN=function(x) min(x[, PCsToAnnotate])))

# convert to long format for plots
loadProfile <- lapply(X=loadProfile, FUN=function(x) tidyr::gather(data=x, key="PC", value="loading_value", PCsToAnnotate))
loadProfile <- lapply(loadProfile, 
                      function(x){x$PC <- factor(x$PC, levels=PCsToAnnotate); return(x)})

Let’s look at the plots!

wrapper <- function(x, ...) paste(strwrap(x, ...), collapse="\n") 
profilePList <- list()
for (i in seq_along(loadProfile)) {
    
    thisRS <- loadProfile[[i]]
    
    profilePList[[i]] <- ggplot(data=thisRS, 
                                mapping=aes(x=binID , y=loading_value)) + 
        geom_line() + ylim(c(minVal, maxVal)) + facet_wrap(facets="PC") + 
        ggtitle(label=wrapper(regionSetName[i], width=30)) + 
        xlab("Genome around region set, 14 kb") + 
        ylab("Normalized loading value") + 
        theme(panel.grid.major.x=element_blank(), 
              panel.grid.minor.x=element_blank(), 
              axis.text.x=element_blank(), 
              axis.ticks.x=element_blank())
    profilePList[[i]]

}
profilePList[[1]]

profilePList[[2]]

profilePList[[3]]

profilePList[[4]]

These plots show the average magnitude of the loadings in the genome around and including the regions of interest. The loading for an input variable (for us an input variable is DNA methylation at a single cytosine), indicates how much that input variable varies in the same direction as the PC. A peak in the middle of the profile indicates that there is increased covariation in the regions of interest compared to the surrounding genome. It indicates that those regions are changing in a coordinated way whereas the surrounding genome is not changing in a coordinated way to the same extent. A peak suggests that the variation in a PC may be somehow specifically related to the region set although it is not clear whether the region set is causally linked to the variation or just affected by other things that are causing the variation captured by the PC. Some region sets may have an increased loading but no peak, for example, some histone modification region sets like H3K27me3 or H3K9me3. This doesn’t necessarily mean these regions are not relevant. It could just mean that there is variability in larger blocks of the genome around these histone modifications. For details on how the loading profile was created, check out the “Method details” section of this vignette or see the docs for getLoadingProfile with ?COCOA::getLoadingProfile.

These plots show that Esr1 and Gata3 binding regions demonstrate higher covariation along the PC axes than the surrounding genome (because they have a peak in the middle) while Nrf1 and Atf3 do not. So for example, if you line up samples by their PC1 score, then as you go from low to high PC1 score, the DNA methylation of ESR1 binding regions will generally change in a coordinated way across samples but the DNA methylation in the surrounding genome would not change as much or would not change in a coordinated way. The results from our four region sets suggest that Esr1 and Gata3 regions specifically contribute to the variation along the PCs 1 and 3 (as well as perhaps 4), helping us understand the biological meaning of the variation captured by those PCs (at least in part related to estrogen receptor).

5.2 The raw data

If a region set has a high score for a PC, we would expect that DNA methylation in at least some of those regions would correlate with PC score. In other words, as you go from high PC score to low PC score along the PC axis, DNA methylation will either go up or down. Let’s look at DNA methylation at CpGs in Esr1 regions. In the following plot, each column is a CpG in an Esr1 region and each row is a patient. Patients are ordered by their PC score for PC1.

signalAlongPC(genomicSignal=brcaMethylData1,
                signalCoord=brcaMCoord1,
                regionSet=esr1_chr1,
                pcScores=brcaPCScores,
                orderByPC="PC1", cluster_columns=TRUE,
                column_title = "Individual cytosine/CpG",
                name = "DNA methylation level")

It appears that some but not all CpGs vary greatly along the PC axis. The CpGs that show high variation along the PC axis are the ones that contribute to the Esr1 region set being ranked highly in our analysis. Looking at the raw data confirms that DNA methylation is in fact varying along the PC axis, which gives us more confidence in our results.

Now let’s look at one of the region sets, Nrf1, that had a low score for PC1:

signalAlongPC(genomicSignal=brcaMethylData1,
              signalCoord=brcaMCoord1,
              regionSet=nrf1_chr1,
              pcScores=brcaPCScores,
              orderByPC="PC1", 
              cluster_columns=TRUE, 
              column_title = "Individual cytosine/CpG",
              name = "DNA methylation level")

When patients are ordered according to PC1 score, we can see that there is very little covariation in DNA methylation in these regions. Therefore, it is not surprising that the Nrf1 region set had a low score for this PC.

Since COCOA ranks region sets based on their relative scores in comparison to other region sets tested, there will always be a region set with a best score. Therefore, it is always a good idea to check the raw genomic signal in your top region sets to make sure that there really is variation along the PC.

5.3 Loadings of individual regions

This plot can help you learn more about the contribution of individual regions to the region set score for each PC. For example, if the estrogen receptor region set was associated with PCs 1, 3, and 4, we might wonder whether the same regions are causing the association with these PCs or whether different regions are associated with each PC. To do this, we will first calculate the average absolute loading value for each region in a region set (obtained by averaging a given PC’s loadings for CpGs within that region). Then we can use the distribution of loadings for each PC to convert each region’s loading to a percentile to see how extreme/high that region is for each PC. Let’s look at the plot for estrogen receptor:

regionQuantileByPC(loadingMat = brcaLoadings1,
                                signalCoord = brcaMCoord1,
                                regionSet = esr1_chr1,
                                rsName = "Estrogen receptor (chr1)",
                                PCsToAnnotate=paste0("PC", 1:4),
                                maxRegionsToPlot = 8000,
                                cluster_rows = TRUE,
                                cluster_columns = FALSE,
                                column_title = rsName,
                                name = "Percentile of loading scores in PC")

We can see that some of the same regions have high loadings for multiple PCs (ie these regions are important for these PCs). Also, there are some regions that do not have high loadings for any of the top 4 PCs, suggesting that these regions are not associated with the largest sources of covariation in the data. Overall, PC2 does not have as high loadings as PCs 1, 3, and 4, consistent with our loading profiles (no peak for PC2). While Esr1 was ranked highest for PC2 for the initial COCOA results, this was only the highest score out of 4 region sets. This illustrates two important points for using COCOA. First, you almost always will want to use many region sets so that you can compare the COCOA score for a region set to many others to see if it really is relatively high. Second, you should do further confirmation on top-ranked region sets such as looking at the raw DNA methylation values and looking at the loading profiles to make sure your top region sets are capturing real variation.

For contrast, we can look at the regions of Nrf1:

regionQuantileByPC(loadingMat = brcaLoadings1,
                                signalCoord = brcaMCoord1,
                                regionSet = nrf1_chr1,
                                rsName = "Nrf1 (chr1)",
                                PCsToAnnotate=paste0("PC", 1:4),
                                maxRegionsToPlot = 8000,
                                cluster_rows = TRUE,
                                cluster_columns = FALSE,
                                column_title = rsName,
                                name = "Percentile of loading scores in PC")

We can see that only a very small proportion of regions in the Nrf1 region set have high loadings for PCs 1-4. This is consistent with this region set being ranked low by COCOA for association with these PCs.

These visualization functions can help you explore your data and understand its variation. In conclusion, COCOA can help you identify biologically meaningful sources of variation that contribute to each PC, giving you insight into variability in your data.

6 Method details

There are two main conceptual steps:

  1. Quantify how each individual genomic feature/signal contributes to intersample variation using PCA.
  2. Aggregate information from individual genomic features to identify biologically meaningful sources of intersample variation using a region set database and COCOA functions.

6.1 PCA

PCA is a common dimensionality reduction method that produces new (orthogonal) dimensions in the directions of highest intersample variation. Each new dimension (principal component) is a linear combination of the original dimensions/features. For the linear combination for a given principal component (PC), each original feature has a coefficient called a “loading” which quantifies how much that feature contributes to that PC. To run PCA and get the loadings, first you need a matrix with features that all samples share (samples as rows, features as columns). Then run PCA on the data with the prcomp() function. The loadings can be found in the prcomp()$rotation output. Here is a small toy example with made up data, where the columns F1, F2, and F3 represent three original features and the five rows represent five samples:

toyData = data.frame(F1 = 1:5, F2 = c(2, 0, 6, 4, 3), F3 = c(1, 3, 2, 10, 5))

Let’s look at the loading matrix:

prcomp(toyData)$rotation
##          PC1        PC2        PC3
## F1 0.3193463 -0.1784098 -0.9306922
## F2 0.2005095 -0.9471601  0.2503670
## F3 0.9261824  0.2665664  0.2666992

The values in this loading matrix are the coefficients of the linear combination for each PC as we see below (each column in the loading matrix corresponds to an equation below):

PC1 = 0.319 * F1 + 0.201 * F2 + 0.926 * F3
PC2 = -0.178 * F1 + -0.947 * F2 + 0.267 * F3
PC3 = -0.931 * F1 + 0.250 * F2 + 0.267 * F3

COCOA will use these loadings as feature level information which will be aggregated to gain insight into the broader sources of variation in the data.

6.2 Region set database

COCOA uses a database of region sets to gain biological insight into sources of variability in your data. A region set is a set of genomic regions that share a biological annotation. This includes transcription factor (TF) binding regions (eg from ChIP-seq), regions with a certain histone modification (eg ChIP-seq) or chromatin accessibility regions (eg DNase/ATAC-seq). Most of these region sets are from experimental data but don’t necessarily have to be. For instance, you could use predicted TF binding regions based on the TF motif. The big picture goal of using a region set database is to connect variation between samples to an interpretable biological meaning: the known annotation of a region set. For each PC, COCOA will give a score to each region set that quantifies how much that region set is associated with that PC (and therefore with the intersample variation captured by that PC).

COCOA should be done with many region sets (ie hundreds or > 1000). A region set can be a simple “.bed”" file with three columns containing the genomic locations of the regions: chr (for chromosome), start, and end. In R, this data can be represented as a data.frame or as a GRanges object. Publicly available collections of region sets can be found online (eg http://databio.org/regiondb) and region sets can be accessed through Bioconductor packages (eg LOLA and AnnotationHub). The region sets must be from the same reference genome as your sample data (although you could use the liftOver tool to convert from one genome version to another). The region sets can come from anywhere so if you experimentally or computationally generate your own region sets, you can just include those with the others when running the COCOA analysis. For an example of working with a region set database, see the ‘COCOA_Workflow’ vignette.

6.3 Aggregating info from individual features

Since differences between samples in individual features (nucleotides/regions) may be hard to interpret, COCOA uses region sets to aggregate nucleotide/region level info into a more condensed, interpretable form. As mentioned above, each feature has a “loading” for a given PC and the magnitude of the loading represents how much that feature contributes to that PC. We will use the absolute value of the loading since both a very positive and very negative loading reflect an association between that feature and the PC. Also, each original feature is associated with a genomic coordinate (or a range but ranges are not supported by COCOA in this release version). COCOA will use this information to give each region set in the region set database a score for each PC. For a given PC-region set combination (for example PC1 and the region set esr1_chr1), we first identify all the original features that overlap with the region set. Then the scoring of the region set depends on the scoring metric chosen. The “regionMean” method is described here although the other methods are described in function docs. For the “regionMean” method, we take the PC loadings (eg PC1 loadings) for the features that overlap the region set and average the absolute loadings by region (average loadings in each region to get one value per region of the region set). Then we average the region values to get a single average for that region set which is its score. We repeat this calculation for all PC-region set combinations. Now for a given PC, we can rank the region sets by their score/loading average to see which region sets are most associated with that PC (higher loading average means a greater association with the PC). Since the “regionMean” scoring method only tells you which region sets are the most associated with a PC out of those tested, it is important to use a large database of region sets and then to validate by looking at the raw data in these regions which can be done with COCOA visualization functions. Also, you can screen out region sets that have low coverage in your input data (the region_coverage column of runCOCOA() output) since high scores for these regions are more likely to have happened by chance. The biological annotation of the top ranked region sets for the top PCs can help you understand variation among your samples.

6.4 Making a “meta-region” loading profile

A “meta-region” loading profile is a summary of the loadings in the genome in and around the regions of a region set. This is created with the getLoadingProfile function. The calculations are similar to those of runCOCOA with a few major differences. Instead of using the region set as is, we will expand each region in the region set on both sides so we can also look at the surrounding genome. We will then split each region into the same number of bins (of approximately equal size). Then we average the magnitude of all loadings that overlap a bin to get a single average loading for each bin. We combine information from all the regions by averaging the corresponding bins from the different regions (all bin1’s averaged together, all bin2’s averaged together, etc.). Finally, we average the profile symmetrically over the center (the first bin becomes the average of the first and last bin, the second bin becomes the average of the second and second to last bin, etc.). We do this because the orientation of the regions in the genome is arbitrary: we did not use any strand information or relate the regions to any directional information. The “meta-region” profile gives a summary in a single profile of all the regions in a region set and allows you to compare the regions to the surrounding genome.

7 Q and A

  1. What data types can COCOA be used with? So far, COCOA has been validated on single base pair resolution DNA methylation data. Theoretically, COCOA could work with any type of genomic coordinate-based data: data where you have a genomic coordinate or range and an associated value. This could include ATAC-seq data, single nucleotide polymorphism/mutation data, copy number variation etc. although COCOA would probably work better for data where smaller regions or single bases are measured. Future releases of COCOA will likely have support for more of these data types. If you are interested in using COCOA for these analyses before the next Bioconductor release, you can check out the Github page to see the latest version or download the “dev” version of COCOA from Bioconductor.

  2. Can COCOA be used with other dimensionality reduction techniques such as t-SNE? Short answer for t-SNE: no. In general though, it depends. COCOA must have a score for each original dimension that quantifies how much it contributes to the new dimension. Since t-SNE maps the original dimensions to new dimensions in a nonlinear way, the mappings of the original dimensions to the new dimensions are not comparable to each other and cannot be aggregated into a single score for a region set in a uniform way.

  3. Where did the name COCOA come from?

The method is called Coordinate Covariation Analysis because it uses PCA to capture covariation of individual signals/features at genomic coordinates. The top PCs capture both variation between samples and covariation between individual genomic features. COCOA annotates the covariation of individual genomic features with region sets in order to gain insight into variation between samples.