1 Introduction

The spqn package contains an implementation of spatial quantile normalization (SpQN) as described in (Wang, Hicks, and Hansen 2020). In this work we describe how correlation matrices estimated from gene expression data exhibits a mean-correlation relationship, in which highly expressed genes appear more correlated than lowly expressed genes. This relationship is not observed in protein-protein interaction networks, which suggests it is technical rather than biological. To correct for this bias in estimation, we developed SpQN, a method for normalizing the correlation matrix. In our work, we show how correcting for this relationship increases the correlations involving transcription factors, an important class of gene regulators.

We believe spatial quantile normalization has applications outside of co-expression analysis including beyond correlation matrices. However, the implementation in this package is currently focused on correlation matrices.

The package contains various plotting functions for assessing the mean-correlation relationship, mirroring plots in (Wang, Hicks, and Hansen 2020).

2 Preliminaries

Let’s load the package and some example data.

library(spqn)
library(spqnData)
library(SummarizedExperiment)

This dataset contains 4,000 genes from the GTEx tissue “Adipose Subcutaneous” which we use as exemplar tissue in our preprint (Wang, Hicks, and Hansen 2020). The data is from GTEx v6p. The genes have been restricted to a random sample of 4,000 genes from genes which are

  • either protein-coding or lincRNAs
  • expressed (median expression greater than 0 on the \(\log_2(\text{RPKM})\) scale

The data is counts (from the GTEx website) of bulk RNA-seq data which has been transformed to the expression matrix in the \(\log_2(\text{RPKM})\) scale. Each gene has been scaled to have mean 0 and variance 1, and 4 principal components have been removed from the data matrix. For SpQN we need access to the gene expression level, which we take as the mean or median expression level across samples. So this quantity needs to be computed (and stored) prior to removing principal components. For this example data, we have stored the average expression value on the log-RPKM scale in rowData(gtex.4k)$ave_logrpkm.

To compare the distribution of correlation across different expression levels, we get the correlation matrix and average gene expression level (\(\log_2(\text{RPKM})\)) for each gene.

data(gtex.4k)
cor_m <- cor(t(assay(gtex.4k)))
ave_logrpkm <- rowData(gtex.4k)$ave_logrpkm

3 Preparing your own data

Note: code in this section is not being evaluated.

We assume you have counts (or something similar) stored in a SummarizedExperiment. In our case, we have downloaded the GTEx v6p counts from the GTEx portal website (code is available in the spqnData package inside scripts).

It makes sense to do some correction for library size. In this data we have simply divided by the number of reads in millions. There are multiple normalization strategies for estimating a library size correction factor. The literature shows that using such methods are essential when comparing expression across different tissues / cell types / conditions. However, here we are looking at correlation patterns within tissue.

We also need to remove genes which are not expressed. The threshold we have been using for GTEx is fairly liberal as it removes less than 50% of protein-coding codes. We apply this cutoff on the log2RPKM scale as it is necessary to consider gene length when applying a cutoff across genes. However, the exact way of doing this is probably not too important.

Finally, following (Parsana et al. 2019) we strongly recommend removing principal components from the data matrix prior to analysis. A fast convenience function for this is removePrincipalComponents() from the WGCNA package. How many principal components to remove is an open question. In (Freytag et al. 2015) the authors recommend looking at control genes to assess this, under the hypothesis that random genes should be uncorrelated and we have existing groups of genes which are believe to be highly correlated in all tissues, such as the PRC2 complex (Freytag et al. 2015) or the ribosomal RNA genes (Boukas et al. 2019). Another approach is to use the num.sv() function from the SVA package. In our experience, this yields a substantially higher number of PCs than the graphical methods. Finally, as we shown (Wang, Hicks, and Hansen 2020), the QC plots we present below might also help making this decision. At the end of the day, it is still an open problem.

Example code is as follows, the starting point is a SummarizedExperiment called gtex with the relevant data. It has two relevant columns in rowData(gtex). One column is gene_length and another is gene_type.

# only keep coding genes and lincRNA in the counts matrix
gtex <- gtex[rowData(gtex)$gene_type %in%
                          c("lincRNA", "protein_coding"), ]

# normalize the counts matrix by log2rpkm
cSums <- colSums(assay(gtex))
logrpkm <- sweep(log2(assay(gtex) + 0.5), 2, FUN = "-",
                 STATS = log2(cSums / 10^6))
logrpkm <- logrpkm - log2(rowData(gtex)$gene_length / 1000)

# only keep those genes with median log2rpkm above 0
wh.expressed  <- which(rowMedians(logrpkm) > 0)

gtex.0pcs <- gtex[wh.expressed,]
logrpkm.0pcs <- logrpkm[wh.expressed,]
ave_logrpkm <- rowMeans(logrpkm.0pcs)
logrpkm <- logrpkm - ave_logrpkm # mean centering
logrpkm  <- logrpkm / matrixStats::rowSds(logrpkm) # variance scaling
assays(gtex.0pcs) <- SimpleList(logrpkm = logrpkm.0pcs)
rowData(gtex.0pcs)$ave_logrpkm <- ave_logrpkm

# remove PCs from the gene expression matrix after scaling each gene to have mean=0 and variance=1
gtex.4pcs <- gtex.0pcs
assay(gtex.4pcs) <- removePrincipalComponents(t(scale(t(logrpkm.0pcs))), n = 4)

4 Examine the mean-correlation relationship

We now recreate a number of diagnostic plots which illustrate the mean-correlation relationship. These plots are similar to that is being presented in (Wang, Hicks, and Hansen 2020).

To compare the distribution of correlation across different expression levels, we re-arrrange the correlation matrix by sortng the row and column by the expression level, and partitioned the correlation matrix into same-size bins, then used each bin to estimate the distribution of correlation for genes with that expression level. Here we use 10 by 10 bins. Each bin corresponds to two groups of genes that correspodnd to two expression levels.

We first looked into the bins on the diagnal, where the two groups of genes have the same expression level. We looked at the distribution of the correlations within each of these bins. We find that the correlations of highly expressed genes are more variant than those of lowly expressed genes. We call such relationship between gene expression level and the distribution of correlation as mean-correlation relationship.

plot_signal_condition_exp(cor_m, ave_logrpkm, signal=0)

Next, we examine the impact of this mean-correlation relationship on the downstream analysis. A number of studies predict the co-expressed genes by setting a threshold. We therefore examine how the mean-correlation relationship would bias such prediction. We set 0.1% highest correlations as predicted signals in each bin. We compare the distibution of signals for bins on the diagonal. We find that the the predicted signals are stronger for high expressed genes and weaker for low expressed genes. Because of such bias, the true signals among low expressed genes could be hidden and may not be detected.

plot_signal_condition_exp(cor_m, ave_logrpkm, signal=0.001)
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?