1 Introduction

Multi-sample normalization techniques such as quantile normalization Bolstad et al. (2003), Irizarry et al. (2003) have become a standard and essential part of analysis pipelines for high-throughput data. Although it was originally developed for gene expression microarrays, it is now used across many different high-throughput applications including genotyping arrays, DNA Methylation, RNA Sequencing (RNA-Seq) and Chromatin Immunoprecipitation Sequencing (ChIP-Seq). These techniques transform the original raw data to remove unwanted technical variation. However, quantile normalization and other global normalization methods rely on assumptions about the data generation process that are not appropriate in some context. Until now, it has been left to the researcher to check for the appropriateness of these assumptions.

Quantile normalization assumes that the statistical distribution of each sample is the same. Normalization is achieved by forcing the observed distributions to be the same and the average distribution, obtained by taking the average of each quantile across samples, is used as the reference. This method has worked very well in practice but note that when the assumptions are not met, global changes in distribution, that may be of biological interest, will be wiped out and features that are not different across samples can be artificially induced. These types of assumptions are justified in many biomedical applications, for example in gene expression studies in which only a minority of genes are expected to be differentially expressed. However, if, for example, a substantially higher percentage of genes are expected to be expressed in only one group of samples, it may not be appropriate to use global adjustment methods.

The quantro R-package can be used to test for global differences between groups of distributions which asses whether global normalization methods such as quantile normalization should be applied. Our method uses the raw unprocessed high-throughput data to test for global differences in the distributions across a set of groups. The main function quantro() will perform two tests:

  1. An ANOVA to test if the medians of the distributions are different across groups. Differences across groups could be attributed to unwanted technical variation (such as batch effects) or real global biological variation. This is a helpful step for the user to verify if there is any technical variation unaccounted for.

  2. A test for global differences between the distributions across groups which returns a test statistic called quantroStat. This test statistic is a ratio of two variances and is similar to the idea of ANOVA. The main idea is to compare the variability of distributions within groups relative to between groups. If the variability between groups is sufficiently larger than the variability within groups, then this suggests global adjustment methods may not be appropriate. As a default, we perform this test on the median normalized data, but the user may change this option.

2 Getting Started

2.1 Installation

The R-package quantro can be installed from the Bioconductor

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("quantro")

2.2 Load the package in R

After installation, the package can be loaded into R.

library(quantro)

3 Data

3.1 flowSorted Data Example

To explore how to use quantro(), we use the FlowSorted.DLPFC.450k data package in Bioconductor Jaffe and Kaminsky (2021). The data in this package originally came from Guintivano, Aryee, and Kaminsky (2013). This data set in FlowSorted.DLPFC.450k contains raw data objects of 58 Illumina 450K DNA methylation microarrays, formatted as RGset objects. The samples represent two different cellular populations of brain tissues on the same 29 individuals extracted using flow sorting. For more information on this data set, please see the FlowSorted.DLPFC.450k User’s Guide.For the purposes of this vignette, a MethylSet object from the minfi Bioconductor package Aryee et al. (2014) was created which is a subset of the rows from the original FlowSorted.DLPFC.450k data set. This MethylSet object is found in the /data folder and the script to create the object is found in /inst.

Here we will explore the distributions of these two cellular populations of brain tissue (NeuN_pos and NeuN_neg) and then test if there are global differences in the distributions across groups. First, load the MethylSet object (flowSorted) and compute the Beta values using the function getBeta() in the minfi Bioconductor package. We use an offset of 100 as this is the default used by Illumina.

library(minfi)
data(flowSorted)
p <- getBeta(flowSorted, offset = 100)
pd <- pData(flowSorted)
dim(p)
## [1] 10000    58
head(pd)
## DataFrame with 6 rows and 11 columns
##        Sample_Name    SampleID    CellType Sentrix.Barcode Sample.Section
##        <character> <character> <character>       <numeric>    <character>
## 813_N        813_N         813    NeuN_pos      7766130090         R02C01
## 1740_N      1740_N        1740    NeuN_pos      7766130090         R02C02
## 1740_G      1740_G        1740    NeuN_neg      7766130090         R04C01
## 1228_G      1228_G        1228    NeuN_neg      7766130090         R04C02
## 813_G        813_G         813    NeuN_neg      7766130090         R06C01
## 1228_N      1228_N        1228    NeuN_pos      7766130090         R06C02
##               diag         sex   ethnicity       age       PMI
##        <character> <character> <character> <integer> <integer>
## 813_N      Control      Female   Caucasian        30        14
## 1740_N     Control      Female     African        13        17
## 1740_G     Control      Female     African        13        17
## 1228_G     Control        Male   Caucasian        47        13
## 813_G      Control      Female   Caucasian        30        14
## 1228_N     Control        Male   Caucasian        47        13
##                      BasePath
##                   <character>
## 813_N  /dcs01/lieber/ajaffe..
## 1740_N /dcs01/lieber/ajaffe..
## 1740_G /dcs01/lieber/ajaffe..
## 1228_G /dcs01/lieber/ajaffe..
## 813_G  /dcs01/lieber/ajaffe..
## 1228_N /dcs01/lieber/ajaffe..

3.2 Plot distributions

quantro contains two functions to view the distributions of the samples of interest: matdensity() and matboxplot(). The function matdensity() computes the density for each sample (columns) and uses the matplot() function to plot all the densities. matboxplot() orders and colors the samples by a group level variable. These two functions use the RColorBrewer package and the brewer palettes can be changed using the arguments brewer.n and brewer.name.

The distributions of the two groups of cellular populations are shown here. The NeuN_neg samples are colored in green and the NeuN_pos are colored in red.

matdensity(p, groupFactor = pd$CellType, xlab = " ", ylab = "density",
           main = "Beta Values", brewer.n = 8, brewer.name = "Dark2")
legend('top', c("NeuN_neg", "NeuN_pos"), col = c(1, 2), lty = 1, lwd = 3)

matboxplot(p, groupFactor = pd$CellType, xaxt = "n", main = "Beta Values")

4 Using the quantro() function

4.1 Input for quantro()

The quantro() function must have two objects as input:

  1. An object which is a data frame or matrix with observations (e.g. probes or genes) on the rows and samples as the columns.

  2. A groupFactor which represents the group level information about each sample. For example if the samples represent tumor and normal samples, provide quantro() with a factor representing which columns in the object are normal and tumor samples.

4.2 Running quantro()

In this example, the groups we are interested in comparing are contained in the CellType column in the pd dataset. To run the quantro() function, input the data object and the object containing the phenotypic data. Here we use the flowSorted data set as an example.

qtest <- quantro(object = p, groupFactor = pd$CellType)
qtest
## quantro: Test for global differences in distributions
##    nGroups:  2 
##    nTotSamples:  58 
##    nSamplesinGroups:  29 29 
##    anovaPval:  0.01206 
##    quantroStat:  8.80735 
##    quantroPvalPerm:  Use B > 0 for permutation testing.

The details related to the experiment can be extracted using the summary accessor function:

summary(qtest)
## $nGroups
## [1] 2
## 
## $nTotSamples
## [1] 58
## 
## $nSamplesinGroups
## NeuN_neg NeuN_pos 
##       29       29

To asssess if the medians of the distributions different across groups, we perform an ANOVA on the medians from the samples. Those results can be found using anova:

anova(qtest)
## Analysis of Variance Table
## 
## Response: objectMedians
##             Df   Sum Sq   Mean Sq F value  Pr(>F)  
## groupFactor  1 0.006919 0.0069194  6.7327 0.01206 *
## Residuals   56 0.057553 0.0010277                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The full output can be seen The test statistic produced from quantro() testing for global differences between distributions is given by quantroStat:

quantroStat(qtest)
## [1] 8.807348

4.3 eSets

quantro() also can accept objects that inherit eSets such as an ExpressionSet or MethylSet. The groupFactor must still be provided.

is(flowSorted, "MethylSet")
## [1] TRUE
qtest <- quantro(flowSorted, groupFactor = pData(flowSorted)$CellType)
qtest 
## quantro: Test for global differences in distributions
##    nGroups:  2 
##    nTotSamples:  58 
##    nSamplesinGroups:  29 29 
##    anovaPval:  0.01206 
##    quantroStat:  8.80735 
##    quantroPvalPerm:  Use B > 0 for permutation testing.

4.4 Output from quantro()

Elements in the S4 object from quantro() include:

Element Description
summary A list that contains (1) number of groups (nGroups), (2) total number of samples (nTotSamples) (3) number of samples in each group (nSamplesinGroups)
anova ANOVA to test if the average medians of the distributions are different across groups
MSbetween mean squared error between groups
MSwithin mean squared error within groups
quantroStat test statistic which is a ratio of the mean squared error between groups of distributions to the mean squared error within groups of distributions
quantroStatPerm If B is not equal to 0, then a permutation test was performed to assess the statistical significance of quantroStat. These are the test statistics resulting from the permuted samples
quantroPvalPerm If B is not equal to 0, then this is the \(p\)-value associated with the proportion of times the test statistics (quantroStatPerm) resulting from the permuted samples were larger than quantroStat

5 Assessing the statistical significance

To assess statistical significance of the test statistic, we use permutation testing. We use the foreach package which distribute the computations across multiple cross in a single machine or across multiple machines in a cluster. The user must pick how many permutations to perform where B is the number of permutations. At each permutation of the samples, a test statistic is calculated. The proportion of test statistics (quantroStatPerm) that are larger than the quantroStat is reported as the quantroPvalPerm. To use the foreach package, we first register a backend, in this case a machine with 1 cores.

library(doParallel)
registerDoParallel(cores=1)
qtestPerm <- quantro(p, groupFactor = pd$CellType, B = 1000)
qtestPerm
## quantro: Test for global differences in distributions
##    nGroups:  2 
##    nTotSamples:  58 
##    nSamplesinGroups:  29 29 
##    anovaPval:  0.01206 
##    quantroStat:  8.80735 
##    quantroPvalPerm:  0.001

6 Visualizing the statistical significance from permutation tests

If permutation testing was used (i.e. specifying B \(>\) 0), then there is a second function in the package called quantroPlot() which will plot the test statistics of the permuted samples. The plot is a histogram of the null test statistics quantroStatPerm from quantro() and the red line is the observed test statistic quantroStat from quantro().

quantroPlot(qtestPerm)

Additional options in the quantroPlot() function include:

Element Description
xLab the x-axis label
yLab the y-axis label
mainLab title of the histogram
binWidth change the binwidth

7 SessionInfo

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] doParallel_1.0.17           minfi_1.44.0               
##  [3] bumphunter_1.40.0           locfit_1.5-9.6             
##  [5] iterators_1.0.14            foreach_1.5.2              
##  [7] Biostrings_2.66.0           XVector_0.38.0             
##  [9] SummarizedExperiment_1.28.0 Biobase_2.58.0             
## [11] MatrixGenerics_1.10.0       matrixStats_0.62.0         
## [13] GenomicRanges_1.50.0        GenomeInfoDb_1.34.0        
## [15] IRanges_2.32.0              S4Vectors_0.36.0           
## [17] BiocGenerics_0.44.0         quantro_1.32.0             
## [19] knitr_1.40                  BiocStyle_2.26.0           
## 
## loaded via a namespace (and not attached):
##   [1] BiocFileCache_2.6.0       plyr_1.8.7               
##   [3] splines_4.2.1             BiocParallel_1.32.0      
##   [5] ggplot2_3.3.6             digest_0.6.30            
##   [7] htmltools_0.5.3           magick_2.7.3             
##   [9] fansi_1.0.3               magrittr_2.0.3           
##  [11] memoise_2.0.1             tzdb_0.3.0               
##  [13] limma_3.54.0              readr_2.1.3              
##  [15] annotate_1.76.0           askpass_1.1              
##  [17] siggenes_1.72.0           prettyunits_1.1.1        
##  [19] colorspace_2.0-3          blob_1.2.3               
##  [21] rappdirs_0.3.3            xfun_0.34                
##  [23] dplyr_1.0.10              crayon_1.5.2             
##  [25] RCurl_1.98-1.9            jsonlite_1.8.3           
##  [27] genefilter_1.80.0         GEOquery_2.66.0          
##  [29] survival_3.4-0            glue_1.6.2               
##  [31] gtable_0.3.1              zlibbioc_1.44.0          
##  [33] DelayedArray_0.24.0       Rhdf5lib_1.20.0          
##  [35] HDF5Array_1.26.0          scales_1.2.1             
##  [37] DBI_1.1.3                 rngtools_1.5.2           
##  [39] Rcpp_1.0.9                xtable_1.8-4             
##  [41] progress_1.2.2            bit_4.0.4                
##  [43] mclust_6.0.0              preprocessCore_1.60.0    
##  [45] httr_1.4.4                RColorBrewer_1.1-3       
##  [47] ellipsis_0.3.2            farver_2.1.1             
##  [49] pkgconfig_2.0.3           reshape_0.8.9            
##  [51] XML_3.99-0.12             sass_0.4.2               
##  [53] dbplyr_2.2.1              utf8_1.2.2               
##  [55] labeling_0.4.2            tidyselect_1.2.0         
##  [57] rlang_1.0.6               AnnotationDbi_1.60.0     
##  [59] munsell_0.5.0             tools_4.2.1              
##  [61] cachem_1.0.6              cli_3.4.1                
##  [63] generics_0.1.3            RSQLite_2.2.18           
##  [65] evaluate_0.17             stringr_1.4.1            
##  [67] fastmap_1.1.0             yaml_2.3.6               
##  [69] bit64_4.0.5               beanplot_1.3.1           
##  [71] scrime_1.3.5              purrr_0.3.5              
##  [73] KEGGREST_1.38.0           nlme_3.1-160             
##  [75] doRNG_1.8.2               sparseMatrixStats_1.10.0 
##  [77] nor1mix_1.3-0             xml2_1.3.3               
##  [79] biomaRt_2.54.0            compiler_4.2.1           
##  [81] filelock_1.0.2            curl_4.3.3               
##  [83] png_0.1-7                 tibble_3.1.8             
##  [85] bslib_0.4.0               stringi_1.7.8            
##  [87] highr_0.9                 GenomicFeatures_1.50.0   
##  [89] lattice_0.20-45           Matrix_1.5-1             
##  [91] multtest_2.54.0           vctrs_0.5.0              
##  [93] pillar_1.8.1              lifecycle_1.0.3          
##  [95] rhdf5filters_1.10.0       BiocManager_1.30.19      
##  [97] jquerylib_0.1.4           data.table_1.14.4        
##  [99] bitops_1.0-7              rtracklayer_1.58.0       
## [101] R6_2.5.1                  BiocIO_1.8.0             
## [103] bookdown_0.29             codetools_0.2-18         
## [105] MASS_7.3-58.1             assertthat_0.2.1         
## [107] rhdf5_2.42.0              openssl_2.0.4            
## [109] rjson_0.2.21              GenomicAlignments_1.34.0 
## [111] Rsamtools_2.14.0          GenomeInfoDbData_1.2.9   
## [113] hms_1.1.2                 quadprog_1.5-8           
## [115] grid_4.2.1                tidyr_1.2.1              
## [117] base64_2.0.1              rmarkdown_2.17           
## [119] DelayedMatrixStats_1.20.0 illuminaio_0.40.0        
## [121] restfulr_0.0.15

Aryee, Martin J, Andrew E Jaffe, Hector Corrada-Bravo, Christine Ladd-Acosta, Andrew P Feinberg, Kasper D Hansen, and Rafael A Irizarry. 2014. “Minfi: A Flexible and Comprehensive Bioconductor Package for the Analysis of Infinium Dna Methylation Microarrays.” Bioinformatics 30 (10): 1363–9. https://doi.org/10.1093/bioinformatics/btu049.

Bolstad, B M, R A Irizarry, M Astrand, and T P Speed. 2003. “A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias.” Bioinformatics 19 (2): 185–93.

Guintivano, Jerry, Martin J Aryee, and Zachary A Kaminsky. 2013. “A cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression.” Epigenetics 8 (3): 290–302. https://doi.org/10.4161/epi.23924.

Irizarry, Rafael A, Bridget Hobbs, Francois Collin, Yasmin D Beazer-Barclay, Kristen J Antonellis, Uwe Scherf, and Terence P Speed. 2003. “Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data.” Biostatistics 4 (2): 249–64. https://doi.org/10.1093/biostatistics/4.2.249.

Jaffe, A J, and Z A Kaminsky. 2021. FlowSorted.DLPFC.450k: Illumina Humanmethylation Data on Sorted Frontal Cortex Cell Populations. R package version 1.29.0.