1 Introduction: What is ShinyÉPICo for?

shinyÉPICo is a web application based on shiny and created to accelerate and to streamline the study of Illumina Infinium DNA methylation arrays, including the 450k and EPIC. For this purpose, shinyÉPICo uses, among others, the widely used minfi (Aryee et al. 2014) and limma (Ritchie et al. 2015) packages for the array normalization and the differentially methylated positions (DMPs) calculation, respectively. Moreover, shinyÉPICo also allows calculation of differentially methylated regions calculation (DMRs), based on the mCSEA package (Martorell-Marugán, González-Rumayor, and Carmona-Sáez 2019).

It is conceived as a ‘graphical pipeline’ since, throughout the analysis, you can observe the result of the different options with multiple charts, and interactively modify decisions in current or prior steps. In addition, the application automates calculations that would otherwise take much more time and effort.

shinyÉPICo is a fully graphical application and the calculations performed in R are done entirely on the server-side. Therefore, the application can be used both on the local computer and through a web server, to which devices, such as computers, smartphones or tablets could be connected, without high RAM or CPU requirements and only a web browser as requirement.

The steps followed by the application includes the data importing, starting with iDAT files (Raw Illumina DNA methylation datasets), array normalization, quality control, and data exploring, and DMP and DMR calculation. Each step has multiple options and charts that we will describe in this manual.

1.1 What do I need in order to use ShinyÉPICo?

ShinyÉPICo can run in GNU/Linux, Windows, or macOS. The package dependencies are automatically installed with the package.

Since the application allows to follow interactively all the analysis process, many objects have to be stored in RAM memory. Therefore, the application can be memory demanding, especially when trying to analyze a large number of samples at the same time. We recommend at least 12GB of RAM for smooth use of the application, but depending on the number of samples analyzed and whether they are EPIC or 450k, the needs may be lower or higher.

1.2 How can I install ShinyÉPICo?

The package will be available soon from Bioconductor, where it can be installed following standard instructions:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("shinyepico")

Currently, you can install the application through GitHub using the ‘remotes’ R package:

install.packages("remotes")
library("remotes")
install_github("omorante/shinyepico", upgrade="always", dependencies = TRUE)

1.3 Dependencies and implementation

ShinyÉPICo implements internal functions to follow the array normalization, and the DMP/DMR calculations, including, for example, the differential of beta values calculation between the groups. For the main features of the application, the package dependencies are the following:

  • Application interface: shiny, shinyWidgets, and DT.
  • Array reading and normalization: minfi
  • Linear model generation and DMP calculation: limma
  • DMR calculation and genomic graph: mCSEA
  • Heatmaps: gplots::heatmap.2 and heatmaply
  • Other charts: ggplot2 and plotly

The complete list of the application dependencies are:

DT (>= 0.15.0),
data.table (>= 1.13.0),
doParallel (>= 1.0.0),
dplyr (>= 1.0.9),
foreach (>= 1.5.0),
GenomicRanges (>= 1.38.0),
ggplot2 (>= 3.3.0),
gplots (>= 3.0.0),
heatmaply (>= 1.1.0),
limma (>= 3.42.0),
minfi (>= 1.32.0),
plotly (>= 4.9.2),
reshape2 (>= 1.4.0),
rlang (>= 1.0.2),
rmarkdown (>= 2.3.0),
rtracklayer (>= 1.46.0),
shiny (>= 1.5.0),
shinyWidgets (>= 0.5.0),
shinycssloaders (>= 0.3.0),
shinyjs (>= 1.1.0),
shinythemes (>= 1.1.0),
statmod (>= 1.4.0),
tidyr (>= 1.2.0),
zip (>= 2.1.0)

For DMR calculation, additional suggested packages are required. You can use the application without installing this package, but you will not able to calculate DMRs. Moreover, for the array normalization, you will need to install annotation and manifest packages, for 450k or EPIC.

knitr (>= 1.30.0),
mCSEA (>= 1.10.0),
IlluminaHumanMethylation450kanno.ilmn12.hg19,
IlluminaHumanMethylation450kmanifest,
IlluminaHumanMethylationEPICanno.ilm10b4.hg19,
IlluminaHumanMethylationEPICmanifest,
testthat,
minfiData

And, to run the application:

library("shinyepico")
set.seed(123)
run_shinyepico()

The function run_shinyepico has 4 parameters that can be modified:

  • n_cores: The application is partially compatible with parallel computing. This numeric parameter controls the number of cores and, by default, it is half of the detected cores. If you have limited RAM (8GB or less) we recommend setting this to 1, in order to avoid the RAM overhead of multicore calculations.
  • max_upload_size: By default, shiny applications have an upload limit (in MB), useful when the application is running on a web server. By default, this parameter is 2000MB.
  • host: IP used to deploy the application. By default, this parameter is your local IP (127.0.0.1) which means that only you, from your computer, will have access to the application. However, it is possible to make the app reachable to other computers in the same LAN by changing the IP to 0.0.0.0.
  • port: Port used to deploy the application. By default, a random free port.

As can be seen in the example, we used the set.seed function before starting ShinyÉPICo. We recommend always setting up the same seed, which can be any number, to avoid uncertainties and ensure that the results obtained are reproducible. Particularly, DMRs calculation, that rely on the mCSEA package, utilizes permutations to estimate the p.value. For that reason, some uncertainty is expected, and results can be a bit different every time you run the application. Using a seed avoids this problem, as you will obtain always the same result with the same seed.

2 ShinyÉPICo workflow

shinyÉPICO workflow is divided into tabs that summarize the different steps to follow:

  • Data import
  • Quality control and array normalization
  • Calculation of Differentially Methylated Positions
  • Calculation of Differentially Methylated Regions
  • Data export

While it is possible to go back to previous steps and change options at any point in the process, some buttons are disabled until the steps required to perform them are completed. For example, you will not be able to generate a heatmap if you do not complete the normalization of the arrays and the calculation of the DMPs. This design is intended to avoid errors and guide the user through the analysis steps.

In the next sections, we will discuss the different steps, the options, and chart interpretation. Moreover, we will use a public DNA methylation array dataset to show an example of application use.

3 Using ShinyÉPICO: an explanation of the options and an example with real data

In the following example, iDATs from a study using Illumina EPIC DNA methylation arrays will be used to illustrate the steps (Li et al. 2020). Although shinyÉPICo can be used with multiple groups and large cohorts, and in that case, iteration and automation are more useful, for this manual we have decided to use a minimal example that can be easily reproduced on almost any computer. Specifically, monocyte and monocyte-derived macrophages (produced with M-CSF) has been selected, only 6 samples (3 monocyte samples and 3 macrophage samples from 3 different healthy donors).

The .zip file of this dataset can be found in the Github repository

In this example, the seed 123 has been set up using the function set.seed() before run_shinyepico().

3.1 Data Import and Sample Selection

The first step in the shinyÉPICo workflow is to prepare the data in the proper format. iDAT files should be compressed in a .zip file. The name of the files should follow the standard convention: XXXXXXXXXXXX_YYYYYY_ZZZ.idat being XXXXXXXXXXXX the Sentrix_ID, YYYYYY the Sentrix_Position, and ZZZ Grn or Red (corresponding, respectively, to the Red and Green signal file).

Moreover, a CSV (comma-separated) file with the annotation of the experiment should be included. It is mandatory to include the Sentrix_ID and Sentrix_Position columns that allow the software to find their respective iDAT files. Moreover, other columns should be added to reflect the different variables (e.g. sample name, health/disease, treatment/control, age, sex, hybridization day, etc.). Usually, iDATs and sample sheet have these features by default, and no additional work is required.

In this regard, 3 parameters are required in the Input tab to continue the analysis:

  • A column with the sample names, that should be unique, without duplicates.

  • A column with the variable of interest, in which sample groups will be found to calculate DMPs and DMRs.

  • Optionally, a donor variable is also included in the Input tab. If you have an experiment where several samples come from the same donor, it is recommended to add a column with this information and select it in the form. This information is used in the SNPs heatmap of the Normalization section (as we will discuss below), and also for the DMPs/DMR calculation. If you select a valid donor variable, this information is automatically added as covariable of the linear model in the DMPs section, and, therefore a “paired analysis” will be performed. If you do not have or you would not like to use this information, you can select the same column as sample names.

An aspect to take into account is that shinyÉPICo autodetects the variable types (numerical or categorical) depending on whether or not they can be coerced to numbers. For this purpose, variables are coerced to a numeric vector, and when the generated NAs are less than 75% of the total, the variable is set as numeric, and, otherwise, as categorical. Moreover, not informative categorical variables are excluded (e.g., a variable with the same value in all the samples, or with not equal values). Therefore, numbers should not be used for categorical variables in the sample sheet. Numerical variables with some NAs but less than 75% can be used in the exploratory analysis but not as covariables of the linear model, since Limma needs a design matrix without missing values.

In the next table, you can see the sample sheet of the example dataset:

Sample_Name Sample_Well Sample_Plate Sample_Group Donor Pool_ID Sentrix_ID Sentrix_Position Scan_Date barcode
MAC A B05 SMET0256 MAC A Epic 202163550095 R02C01 27-Apr-18 202163550095_R02C01
MO B D05 SMET0256 MO B Epic 202163550095 R04C01 27-Apr-18 202163550095_R04C01
MAC C F05 SMET0256 MAC C Epic 202163550095 R06C01 27-Apr-18 202163550095_R06C01
MO A G05 SMET0256 MO A Epic 202163550095 R07C01 27-Apr-18 202163550095_R07C01
MO C E06 SMET0256 MO C Epic 202163550097 R05C01 27-Apr-18 202163550097_R05C01
MAC B G06 SMET0256 MAC B Epic 202163550097 R07C01 27-Apr-18 202163550097_R07C01

It includes the Sentrix_ID and Sentrix_Position mandatory columns, and also other columns relevant for the experiment such as Sample_name, Sample_group, and donor.

After selecting the proper column in each form field, and the samples to be included (in this example, all), it is possible to proceed with the next step, pressing the “Continue” button. When the application is performing an operation expected to take time, a progress bar is displayed at the bottom right of the window to inform the user that the application is busy.

Internally, shinyÉPICO uses the “read.metharray.sheet” and “read.metharray.exp” functions to read the sample sheet and load the DNA methylation data, respectively.

After loading the methylation data of the selected samples, the Normalization tab is automatically shown, where you can see an overview of the data quality and perform the array normalization.

3.2 Quality control charts

First, the quality control tab shows two useful charts to identify bad samples.

On the one hand, the QC Signal plot shows the median methylated (mMed) and unmethylated (uMed) of each sample array. When mMed or uMed of a sample is less than 10, it is considered “Suboptimal”. Although this cutoff is arbitrary and it is the user who decides whether a sample is valid or not, a signal much lower than this threshold or very different from most samples indicates that there has been a problem with it. Depending on whether the signal is very far from this threshold and depending on the rest of the results, these types of samples should be excluded from the analysis. To exclude samples, you can go back to the Input tab to change the samples selected and load them again.

On the other hand, the Bisulfite conversion plot is calculated using the information of the bisulfite conversion II control probes of the 450k/EPIC arrays. When the bisulfite conversion reaction is successful, the probe of the control position will have intensity in the Red channel, whereas if the sample has unconverted DNA, the probe will have a high signal in the Green channel.

For each sample, we calculate all the Red/Green ratios for each control position, and the minimum ratio is shown in the chart. When the ratio of a sample is lower than 1.5, we flag it as “Failed”.

3.3 Array Normalization

A correct array normalization is essential for the results of the subsequent steps in the workflow. shinyÉPICo can use all the normalization methods available in the minfi package:

  • Raw: Without normalization.

  • Illumina: A reverse-engineered implementation of the Genome Studio normalization.

  • Funnorm: A between-array normalization method that relies in data from control probes of the arrays. It performs also Noob normalization before the Functional Normalization.

  • Noob: A within-array normalization method with dye-bias normalization.

  • SWAN: A within-array normalization method that allows Infinium I and Infinium II probes to be normalized together.

  • Quantile: A between-array normalization method that assumes no global differences in methylation between the samples. When global changes are expected, such as in cancer samples, other methods, such as Funnorm, are recommended.

  • Additionally, it offers the option of performing Noob within array normalization followed by Quantile normalization (Noob + Quantile), analogously to the Funnorm method. This non-standard approach has empirically shown good results in our experience.

For details about the methods, the minfi documentation and the respective publications are good references. Although depending on the experimental design and the type of changes expected a normalization method can be expected to be better than another, usually the best method should be determined empirically among the valid options.

Additionally, shinyÉPICo offers three more options in the Normalization tab:

  • Drop CpHs: When this option is selected, CH probes (non-CpG methylation) are removed. It uses the minfi::dropMethylationLoci function with standard parameters.

  • Drop SNPs: When this option is selected, positions with SNPs annotated by Illumina are removed, with a MAF (minimum allele frequency) higher than the cutoff selected are removed. It uses the minfi::dropLociwithSnps function with standard parameters.

  • Drop X/Y Chr: When this option is selected, all the positions of the sex chromosomes are removed. Since the methylation values of these chromosomes can be very different depending on the sex of the donors, removing them can help to homogenize the data. Alternatively, or additionally, sex information can be added to the linear model.

Moreover, Using the minfi recommended value, the genomic positions of the final obtained RGChannelSet object (Raw Data) are filtered by the detection p.value (using the minfi::detectionP function) in the normalized data, removing all the positions with an average p.value higher of 0.01.

shinyÉPICo simplify the testing of different methods and options. When a normalization method is selected, clicking the “Select” button, the different charts are automatically generated.

In this example, we are going to select the Quantile normalization. We show different examples of the type of charts generated by the software.

3.3.1 Density plot

This chart shows the distribution of the beta values of every sample. A bimodal distribution is expected, with two peaks centered around 0 (unmethylated) and 1 (methylated) beta values. When more peaks are shown in the middle, or when the pattern of different samples is very different, it indicates problems in some step of the sample preparation or hybridization. After normalization, we expect an improved alignment of the sample patterns, with little discrepancies between them.

Raw: