Dimensionality reduction is used to represent high dimensional data in a more tractable form. It is commonly used in RNA-seq analysis, where each sample is characterised by tens of thousands of gene expression values, to visualise samples in a 2D plane with distances between points representing similarity and dissimilarity. For RNA-seq the data used is generally gene counts, for methylation there are generally two relevant count matrices, the count of methylated bases, and the count of unmethylated bases. The information from these two matrices can be combined by taking log-methylation ratios as done in Chen et al. 2018.

Preparing data for dimensionality reduction

It is assumed that users of this package have imported the data into the gzipped tabix format as described in the “Importing Data” vignette. From there, further processing is required to create the log-methylation-ratio matrix used in dimensionality reduction. Namely we go through the BSseq format as it is easily coerced into the desired matrix and is itself useful for various other analyses.

library(NanoMethViz)
# import example NanoMethResult object
nmr <- load_example_nanomethresult()
nmr
## An object of class "NanoMethResult"
## Slot "methy":
## [1] "/tmp/Rtmp4hAX3j/Rinst1e5f87ced46af/NanoMethViz/methy_subset.tsv.bgz"
## 
## Slot "samples":
## # A tibble: 6 × 2
##   sample             group
##   <chr>              <fct>
## 1 B6Cast_Prom_1_bl6  bl6  
## 2 B6Cast_Prom_1_cast cast 
## 3 B6Cast_Prom_2_bl6  bl6  
## 4 B6Cast_Prom_2_cast cast 
## 5 B6Cast_Prom_3_bl6  bl6  
## 6 B6Cast_Prom_3_cast cast 
## 
## Slot "exons":
## # A tibble: 303 × 7
##    gene_id chr   strand     start       end transcript_id symbol
##    <chr>   <chr> <chr>      <int>     <int>         <int> <chr> 
##  1 12189   chr11 -      101551769 101551879         92234 Brca1 
##  2 12189   chr11 -      101549015 101549113         92234 Brca1 
##  3 12189   chr11 -      101539981 101540034         92234 Brca1 
##  4 12189   chr11 -      101535528 101535605         92234 Brca1 
##  5 12189   chr11 -      101533939 101534027         92234 Brca1 
##  6 12189   chr11 -      101532020 101532159         92234 Brca1 
##  7 12189   chr11 -      101530992 101531094         92234 Brca1 
##  8 12189   chr11 -      101529791 101529836         92234 Brca1 
##  9 12189   chr11 -      101528001 101528077         92234 Brca1 
## 10 12189   chr11 -      101523325 101526639         92234 Brca1 
## # … with 293 more rows
# convert to bsseq
bss <- methy_to_bsseq(nmr)
bss
## An object of type 'BSseq' with
##   4778 methylation loci
##   6 samples
## has not been smoothed
## All assays are in-memory

We can generate the log-methylation-ratio based on individual methylation sites or computed over genes, or other feature types. Aggregating over features will generally provide more stable and robust results, here we wil use genes.

# create gene annotation from exon annotation
gene_anno <- exons_to_genes(NanoMethViz::exons(nmr))

# create log-methylation-ratio matrix
lmr <- bsseq_to_log_methy_ratio(bss, regions = gene_anno)

NanoMethViz currently provides two options, a MDS plot based on the limma implementation of MDS, and a PCA plot using BiocSingular.

plot_mds(lmr) +
    ggtitle("MDS Plot")

plot_pca(lmr) +
    ggtitle("PCA Plot")

Additional coloring and labeling options can be provided via arguments to either function. Further customisations can be done using typical ggplot2 commands.

new_labels <- gsub("B6Cast_Prom_", "", colnames(lmr))
new_labels <- gsub("(\\d)_(.*)", "\\2 \\1", new_labels)
groups <- gsub(" \\d", "", new_labels)

plot_mds(lmr, labels = new_labels, groups = groups) +
    ggtitle("MDS Plot") +
    scale_colour_brewer(palette = "Set1")