**Coordinate Covariation Analysis (COCOA) is a method to understand epigenetic variation among samples by annotating that variation with reference data. COCOA can be used on epigenetic data that includes genomic coordinates such as DNA methylation and chromatin accessibility data.**

Epigenetic differences between samples can be hard to interpret because epigenetic data is high dimensional. Grouping high dimensional data into functional categories helps to interpret data. An effective grouping method for epigenetic data is to group by related genomic regions, which collectively are called a region set. A region set is a set of genomic regions that share the same biological annotation. A few examples of region sets are binding regions for a certain transcription factor, cell-type specific open chromatin or enhancer regions, or regions with a certain histone modification. Reference region sets derived from public data are used by COCOA to annotate inter-sample epigenetic variation. Epigenetic variation can be annotated by COCOA in a supervised or unsupervised analysis. A supervised COCOA analysis annotates epigenetic variation that is related to a specific sample phenotype of interest (whether a molecular or higher-level phenotype). For example, we might want to see what epigenetic variation is associated with disease status, disease severity, survival, a certain mutation, or the expression of a certain gene transcript or protein. In contrast, an unsupervised COCOA analysis annotates inter-sample epigenetic variation without requiring a sample phenotype. Often, unsupervised analyses will look at the major sources of inter-sample variation by using a dimensionality reduction technique such as principal component analysis (PCA). COCOA can annotate those major sources of inter-sample variation (e.g. the principal components) for linear dimensionality reduction methods like PCA. To annotate the epigenetic variation of interest, whether supervised or unsupervised, COCOA finds reference region sets that are most associated with that variation, giving you biologically meaningful insight into variation in your data.

COCOA can be used in many different contexts so feel free to skip around to the section that is most relevant to you although you might also want to read the Basic workflow section for a high level overview of the COCOA workflow. We’ll show COCOA workflows with DNA methylation and chromatin accessibility data and for each of those we will demonstrate a supervised and unsupervised analysis. The vignettes are very similar since a similar COCOA workflow is used. For a more in-depth description of the method, see the “Method details” section.

We start off with a matrix of epigenetic data in which rows correspond to the epigenetic features (e.g. CpG or accessible chromatin region) and columns correspond to samples. Each epigenetic feature must have genomic coordinates, which are stored in a separate object.

The first step in COCOA is to quantify the association between your epigenetic data and the target variable/s (i.e. sample phenotype for supervised analysis or principal component for unsupervised analysis). The goal of this step is to get a score for each epigenetic feature that represents how much it contributes to the target variable. We’ll refer to these scores as feature contribution scores.

For a supervised analysis, we quantify the association between each epigenetic feature and the target variable using a metric such as correlation. For correlation, we take the correlation coefficient as the feature contribution score for each epigenetic feature. The metrics that are built in to COCOA are covariation, Pearson correlation, and Spearman correlation but you could also use another metric of your choice that results in a score for each epigenetic feature representing its association with the target variable.

For an unsupervised analysis, you must first do dimensionality reduction on your epigenetic data. Then, you take the principal component/latent factor sample scores for the principal components you are interested in and treat them as your target variables for COCOA. As described for a supervised analysis above, you will use a metric such as correlation to quantify the association between the epigenetic features and the target variables (principal component scores). For example, the correlation coefficient for each epigenetic feature represents how much it is associated with a given principal component. (For those familiar with PCA, you might wonder why we are using correlation instead of taking the PC loadings as the feature contribution scores. There is a reason, which will be explained in the section about permutation testing for statistical significance.)

The second step in COCOA is use the feature contribution scores from step 1 to find out which reference region sets are most associated with variation in the target variable. To do this, we use the COCOA algorithm to score each reference region set based on the epigenetic features from your data that it overlaps. Check out the `scoringMetric`

parameter in ?runCOCOA but essentially, the score for a region set is the average of the scores of the epigenetic features that it overlaps. The region sets can be ranked according to their score to see which region sets are most associated with the target variable. High scores for small region sets are more likely to be due to noise but this is addressed by the permutation test. Finding region sets where epigenetic variation is associated with variation in the target variable can give biological insight into your data.

To account for the different sizes of region sets and to assess statistical significance of the results, we can do a permutation test to estimate p-values. We run COCOA on shuffled data to get null distributions for each region set. More specifically, we shuffle the samples’ values for the target variable and recompute COCOA scores as described above. After doing this many times, the permutation COCOA scores for a given region set make up a region set-specific null distribution. This null distribution accounts for the size of the region set. If multiple target variables are used, each target variable, region set combination will have its own unique null distribution. To avoid the need to run a large number of permutations, we can run a small number of permutations (a few hundred) and then fit a gamma distribution to each null distribution. We use the gamma distribution to estimate a p-value for the real COCOA score for each region set. As mentioned before, the unsupervised COCOA test does not use PCA loadings as the feature contribution scores. This is done so that PCA, a time-consuming computation, does not have to be recomputed for every permutation. Instead, the PC sample scores are treated as the target variable and shuffled so that the feature contribution scores can be recalculated.

COCOA uses a database of region sets to annotate inter-sample epigenetic variation in your data. In this vignette, we will see how COCOA can find meaningful sources of DNA methylation variation in breast cancer patients.

First, we will show how to use COCOA to find region sets where DNA methylation variation is associated with variation in a phenotype of interest (the target variable) (Supervised COCOA). Then, we will show how to find region sets that are associated with the principal components of PCA of our DNA methylation data (Unsupervised COCOA).

We will use data from The Cancer Genome Atlas: 450k DNA methylation microarrays for breast cancer patients (TCGA-BRCA, https://portal.gdc.cancer.gov/). We (the authors) ran an unsupervised COCOA analysis outside of this vignette on the full methylation data for 657 patients and a large region set database of over 2000 region sets. We used those results to select region sets and a subset of the DNA methylation data for this vignette. We are using two of the highest and two of the lowest scoring region sets from that analysis. Those region sets are transcription factor binding regions from ChIP-seq experiments (with the same reference genome as our breast cancer data): ESR1 in MCF-7 cells, GATA3 in MCF-7 cells, NRF1 in HepG2 cells, and Atf1 in K562 cells. Only chr1 regions are included in the vignette. For a real analysis, we recommend using hundreds or thousands of region sets. For sources of region sets, see the “Region set database” section. We also have subsetted the DNA methylation data to a small number of CpGs that we determined were necessary for the vignette in order to keep the vignette data small. Since we’re using only a small subset of the data, the vignette results may not necessarily be generalizable to the full data. In your real analysis, we recommend using as many CpGs as possible.

Goal: Understand epigenetic variation related to a specific phenotype

In a supervised COCOA analysis, we want to annotate epigenetic variation that is related to variation in one or more sample phenotypes. These sample phenotypes could be molecular (e.g. expression of a protein marker, presence/variant allele frequency of a mutation, or expression of a certain gene). The sample phenotypes can also be higher-level organism phenotypes (e.g. patient survival, cancer stage, disease severity). COCOA will identify region sets where the epigenetic signal is most related to the phenotype. These region sets then can help to understand the relationship between the phenotype and epigenetic state.

Vignette analysis goal: As mentioned previously, we are looking at DNA methylation in breast cancer patients. Our goal is to understand epigenetic variation related to our phenotype of interest, estrogen receptor (ER) status.

First we will load the necessary data: BRCA DNA methylation data with genomic coordinates, region sets to test, and our phenotype of interest, ER status (part of `brcaMetadata`

).

```
library(COCOA)
data("esr1_chr1")
data("gata3_chr1")
data("nrf1_chr1")
data("atf3_chr1")
data("brcaMCoord1")
data("brcaMethylData1")
data("brcaMetadata")
```

As mentioned, ER status is our target phenotype so we’ll pull that info out of the patient metadata. Since we only have one target variable, we are keeping the data as a data.frame object using drop=FALSE (instead of targetVarDF becoming a vector).

```
myPhen <- "ER_status"
targetVarDF <- brcaMetadata[colnames(brcaMethylData1), myPhen, drop=FALSE]
```

Let’s convert estrogen receptor status to a number so we can do calculations with it. Then we will quantify the association between ER status and the DNA methylation level at each cytosine using Pearson correlation. If NA’s are present for any samples for a given CpG, the default correlation parameters will return NA. The parameter “pairwise.complete.obs” can be used to get correlation values even if some samples have NAs/missing data for a CpG (we use this in our code as an example even though our data does not have NAs). Whether to use this parameter is up to the judgment of the user, which could be decided based on what percent of samples have no data for a given CpG (filter out CpGs missing data for a certain percent of samples, use the rest).

```
targetVarDF$ER_status <- scale(as.numeric(targetVarDF$ER_status),
center=TRUE, scale=FALSE)
methylCor <- cor(t(brcaMethylData1), targetVarDF$ER_status,
use = "pairwise.complete.obs")
# if the standard deviation of the methylation level
# for a CpG across samples is 0,
# cor() will return NA, so manually set the correlation to 0 for these CpGs
methylCor[is.na(methylCor)] <- 0
colnames(methylCor) <- myPhen
```

Now we have a feature contribution score (the correlation coefficient in this case) for each methylation site that represents how much the methylation at that site is associated with ER status. We can use those individual cytosine scores to score region sets. We’ll score our region sets with the `aggregateSignalGRList`

function.

```
GRList <- GRangesList(esr1_chr1, gata3_chr1, atf3_chr1, nrf1_chr1)
regionSetNames <- c("ESR1", "GATA3", "ATF3", "NRF1")
rsScores <- aggregateSignalGRList(signal=methylCor,
signalCoord=brcaMCoord1,
GRList=GRList,
signalCol=myPhen,
scoringMetric="default",
absVal=TRUE)
rsScores$regionSetName <- regionSetNames
rsScores
```

```
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1 0.24468219 440 216 765
## 2 0.32829812 45 39 559
## 3 0.07115301 296 88 116
## 4 0.06487258 484 145 163
## meanRegionSize regionSetName
## 1 1196.2 ESR1
## 2 494.6 GATA3
## 3 283.4 ATF3
## 4 273.4 NRF1
```

Now we have a score for each region set that represents how much epigenetic variation in that region set is associated with the phenotype of interest, ER status. Since we are using the absolute value of the correlation coefficient for each CpG to score our region sets, the possible region set scores range from 0 to 1, with 0 representing no correlation and 1 representing complete correlation. From our initial results, it appears that the DNA methylation in ER and GATA3 region sets is much more highly associated with ER status than in the NRF1 and ATF3 region sets.

Both the correlation and scoring steps can be combined into one step with the `runCOCOA`

function as will be shown in the next section.

Now that we have a score for each region set, we can to get an idea of the statistical significance of these results. We will do this using a permutation test. The permutation test is computationally expensive and, in some cases, may not be needed since the COCOA scores by themselves might be enough to gain insight your data. However, if a p-value estimate is desired, the permutation test can return that.

By shuffling the sample phenotypes and calculating fake COCOA scores, we can create a null distribution for each region set. The `runCOCOA`

function does the two main COCOA steps: quantifying the variation and scoring the region sets. If the samples were in the correct order, `runCOCOA`

would return the real COCOA scores but since we are giving a shuffled sample order (`sampleOrder`

parameter), we will be correlating the epigenetic data with the shuffled sample phenotypes to create null distribution COCOA scores. We will be running five permutations for the vignette but many more should be done for a real analysis (probably at least a few hundred). We need to set the seed for reproducible results since we are doing random shuffling of the sample phenotypes.

```
set.seed(100)
nPerm <- 5
permRSScores <- list()
for (i in 1:nPerm) {
# shuffling sample labels
sampleOrder <- sample(1:nrow(targetVarDF), nrow(targetVarDF))
permRSScores[[i]] <- runCOCOA(sampleOrder=sampleOrder,
genomicSignal=brcaMethylData1,
signalCoord=brcaMCoord1,
GRList=GRList,
signalCol=myPhen,
targetVar=targetVarDF,
variationMetric="cor")
permRSScores[[i]]$regionSetName <- regionSetNames
}
permRSScores[1:3]
```

```
## [[1]]
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1 0.03785689 440 216 765
## 2 0.04190073 45 39 559
## 3 0.04719021 296 88 116
## 4 0.04658327 484 145 163
## meanRegionSize regionSetName
## 1 1196.2 ESR1
## 2 494.6 GATA3
## 3 283.4 ATF3
## 4 273.4 NRF1
##
## [[2]]
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1 0.03978633 440 216 765
## 2 0.05742872 45 39 559
## 3 0.04219523 296 88 116
## 4 0.03830465 484 145 163
## meanRegionSize regionSetName
## 1 1196.2 ESR1
## 2 494.6 GATA3
## 3 283.4 ATF3
## 4 273.4 NRF1
##
## [[3]]
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1 0.03266484 440 216 765
## 2 0.03681228 45 39 559
## 3 0.04749759 296 88 116
## 4 0.04767661 484 145 163
## meanRegionSize regionSetName
## 1 1196.2 ESR1
## 2 494.6 GATA3
## 3 283.4 ATF3
## 4 273.4 NRF1
```

We now have a list where each item is a result of `aggregateSignalGRList`

for a permutation of the sample phenotypes. We have `nPerm`

list items in permRSScores (only showing 3 above). We’ll reformat it so that we have a list where each item is a null distribution for a single region set:

```
nullDistList <- convertToFromNullDist(permRSScores)
names(nullDistList) <- regionSetNames
nullDistList
```

```
## $ESR1
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1 0.03785689 440 216 765
## 2 0.03978633 440 216 765
## 3 0.03266484 440 216 765
## 4 0.05175149 440 216 765
## 5 0.06422102 440 216 765
## meanRegionSize regionSetName
## 1 1196.2 ESR1
## 2 1196.2 ESR1
## 3 1196.2 ESR1
## 4 1196.2 ESR1
## 5 1196.2 ESR1
##
## $GATA3
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1 0.04190073 45 39 559
## 2 0.05742872 45 39 559
## 3 0.03681228 45 39 559
## 4 0.05885779 45 39 559
## 5 0.06096974 45 39 559
## meanRegionSize regionSetName
## 1 494.6 GATA3
## 2 494.6 GATA3
## 3 494.6 GATA3
## 4 494.6 GATA3
## 5 494.6 GATA3
##
## $ATF3
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1 0.04719021 296 88 116
## 2 0.04219523 296 88 116
## 3 0.04749759 296 88 116
## 4 0.03752185 296 88 116
## 5 0.04578851 296 88 116
## meanRegionSize regionSetName
## 1 283.4 ATF3
## 2 283.4 ATF3
## 3 283.4 ATF3
## 4 283.4 ATF3
## 5 283.4 ATF3
##
## $NRF1
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1 0.04658327 484 145 163
## 2 0.03830465 484 145 163
## 3 0.04767661 484 145 163
## 4 0.03933045 484 145 163
## 5 0.04451599 484 145 163
## meanRegionSize regionSetName
## 1 273.4 NRF1
## 2 273.4 NRF1
## 3 273.4 NRF1
## 4 273.4 NRF1
## 5 273.4 NRF1
```

We have four null distributions, one for each region set in `GRList`

.

To reduce the number of permutations (and the run time for COCOA), we will approximate each null distribution with a gamma distribution (Winkler et al., 2016). After fitting the gamma distribution to each null distribution, we can use the gamma distributions to get p-values. We fit the gamma distribution and get the p-value with `getGammaPVal`

.

```
# p-values based on fitted gamma distributions
gPValDF <- getGammaPVal(rsScores = rsScores,
nullDistList = nullDistList,
signalCol = myPhen,
method = "mme", realScoreInDist = TRUE)
gPValDF <- cbind(gPValDF,
rsScores[, colnames(rsScores)[!(colnames(rsScores)
%in% myPhen)]])
gPValDF <- cbind(gPValDF, regionSetNames)
gPValDF
```

```
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1 0.03973931 440 216 765
## 2 0.04020263 45 39 559
## 3 0.02697259 296 88 116
## 4 0.02884609 484 145 163
## meanRegionSize regionSetName regionSetNames
## 1 1196.2 ESR1 ESR1
## 2 494.6 GATA3 GATA3
## 3 283.4 ATF3 ATF3
## 4 273.4 NRF1 NRF1
```

Fitting a gamma distribution with only 5 permutations does not really make sense so that is one reason the p-values for ER and GATA3 (which we would expect to be lower than NRF1/ATF3) are higher than those of NRF1 and ATF3. If you increase the number of permutations to 50, you will see that ER and GATA3 have lower p-values than ATF3 and NRF1 in that more realistic scenario. Also, keep in mind that a low p-value does not necessarily indicate a large effect size. The actual COCOA scores for ER and GATA3 were much larger than those of NRF1 and ATF3.

After getting p-values, we also want to do multiple testing correction to account for the number of region sets in our region set database (not shown here but see `?p.adjust`

).

We can also use the null distributions to directly calculate empirical p-values:

```
getPermStat(rsScores = rsScores,
nullDistList = nullDistList,
signalCol = myPhen)
```

```
## ER_status signalCoverage regionSetCoverage totalRegionNumber
## 1: 0 440 216 765
## 2: 0 45 39 559
## 3: 0 296 88 116
## 4: 0 484 145 163
## meanRegionSize regionSetName
## 1: 1196.2 ESR1
## 2: 494.6 GATA3
## 3: 283.4 ATF3
## 4: 273.4 NRF1
```

An empirical p-value of zero is returned when the real region set score was more extreme than any scores in its null distribution. What this really suggests is that the true p-value is less than ( 1/nPerm ) although it is unclear how much less.

The empirical p-values are limited by the number of permutations done. When multiple testing correction is done to correct for testing many region sets in a large region set database, the empirical test normally cannot achieve significant p-values with a reasonable number of permutations. For that reason, the gamma distribution approximation is recommended to allow the possibility of lower p-values.

Goal: Understand major sources of epigenetic variation present in data without knowing beforehand what they are.

In an unsupervised COCOA analysis, we normally begin with a dimensionality reduction technique such as PCA that identifies major sources/axes of inter-sample variation. Then the PC/latent factors are treated as the target variables for COCOA and a workflow similar to the supervised COCOA analysis is used. We identify region sets where the epigenetic signal varies along the PC axis. These region sets can help understand the biological meaning of the PCs/latent factors.

Vignette analysis goal: As mentioned previously, we are looking at DNA methylation in breast cancer patients. Our goal is to understand sources of inter-sample epigenetic variation in an unsupervised way (without using sample phenotypes or groups).

First we will load the required packages and necessary data: BRCA DNA methylation data with genomic coordinates and region sets to test:

```
library(COCOA)
library(data.table)
library(ggplot2)
data("esr1_chr1")
data("gata3_chr1")
data("nrf1_chr1")
data("atf3_chr1")
data("brcaMCoord1")
data("brcaMethylData1")
```

Next, we will do PCA on our breast cancer DNA methylation data. PCA could take a while for large datasets (e.g. longer than 30 minutes).

```
pca <- prcomp(t(brcaMethylData1))
pcScores <- pca$x
plot(pcScores[, c("PC1", "PC2")])
```

Each point is a sample. It is not immediately clear what the meaning of PC1 (or PC2) is and we see a wide spectrum of variation among samples. COCOA will help us understand the biological meaning of the PCs.

After the PCA, you might want to look at plots of the first few PCs and consider removing extreme outliers and rerunning the PCA since PCA can be affected by outliers, although this depends on your analysis. We note that COCOA will still work even if there are not distinct clusters of samples in your PCA plot and if you do not have known groups of samples. In this vignette, we will look at principal components 1-4 but this choice also depends on the context of your analysis.

We treat the PC scores as our “target variables” for this analysis and calculate the correlation between the DNA methylation level of each CpG and each PC.

```
PCsToAnnotate <- paste0("PC", 1:4)
targetVar <- pcScores[, PCsToAnnotate]
targetVar <- as.matrix(scale(targetVar,
center=TRUE, scale=FALSE))
methylCor <- cor(t(brcaMethylData1), targetVar,
use = "pairwise.complete.obs")
# if the standard deviation of the methylation level
# for a CpG across samples is 0,
# cor() will return NA, so manually set the correlation to 0 for these CpGs
methylCor[is.na(methylCor)] <- 0
```

Let’s put our region sets into one object to simplify the next steps of the analysis.

```
# prepare data
GRList <- GRangesList(esr1_chr1, gata3_chr1, nrf1_chr1, atf3_chr1)
regionSetNames <- c("esr1_chr1", "gata3_chr1", "nrf1_chr1", "atf3_chr1")
```

Now let’s give each region set a score with `aggregateSignalGRList()`

to quantify how much it is associated with each principal component:

```
regionSetScores <- aggregateSignalGRList(signal=methylCor,
signalCoord=brcaMCoord1,
GRList=GRList,
signalCol=PCsToAnnotate,
scoringMetric="default")
regionSetScores$regionSetName <- regionSetNames
regionSetScores
```

```
## PC1 PC2 PC3 PC4 signalCoverage
## 1 0.40626294 0.2790623 0.19914102 0.11770782 440
## 2 0.52578555 0.2718046 0.18056909 0.10463801 45
## 3 0.06974181 0.1061399 0.08018811 0.05219728 484
## 4 0.08536608 0.1053894 0.07422950 0.05996384 296
## regionSetCoverage totalRegionNumber meanRegionSize regionSetName
## 1 216 765 1196.2 esr1_chr1
## 2 39 559 494.6 gata3_chr1
## 3 145 163 273.4 nrf1_chr1
## 4 88 116 283.4 atf3_chr1
```

Now we have a score for each region set that represents how much epigenetic variation in that region set is associated with each PC. Since we are using the absolute value of the correlation coefficient for each CpG to score our region sets, the possible region set scores range from 0 to 1, with 0 representing no correlation and 1 representing complete correlation. From our initial results, it appears that the DNA methylation in ER and GATA3 region sets is more highly associated with the PC scores than the DNA methylation in the NRF1 and ATF3 region sets is. Out of PCs 1-4, ER and GATA3 have the highest scores for PC1. This result makes sense when we visualize ER status on the PCA plot. As you can see in the figure below, PC1 separates samples pretty well based on ER status (even though ER status was not used when doing the PCA).

```
annoPCScores <- data.frame(pcScores, ER_status=as.factor(brcaMetadata[row.names(pcScores), "ER_status"]))
ggplot(data = annoPCScores, mapping = aes(x=PC1, y=PC2, col=ER_status)) + geom_point() + ggtitle("PCA of a subset of DNA methylation data from breast cancer patients") + theme_classic()
```