Usage of this package

Load the data

library(ExperimentHub)

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colMeans, colSums, colnames,
##     dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
##     intersect, is.unsorted, lapply, lengths, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
##     rowMeans, rowSums, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min

## Loading required package: AnnotationHub

eh <- ExperimentHub()

## snapshotDate(): 2018-04-27

myfiles <- query(eh, "HarmonizedTCGAData")
Wall <- myfiles[[1]]

## see ?HarmonizedTCGAData and browseVignettes('HarmonizedTCGAData') for documentation

## downloading 0 resources

## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1014'

project_ids <- myfiles[[2]]

## see ?HarmonizedTCGAData and browseVignettes('HarmonizedTCGAData') for documentation

## downloading 0 resources

## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1015'

surv.plot <- myfiles[[3]]

## see ?HarmonizedTCGAData and browseVignettes('HarmonizedTCGAData') for documentation

## downloading 0 resources

## loading from cache 
##     '/home/biocbuild//.ExperimentHub/1016'

Brief background about Wall

TCGA has made comprehensive molecular profiling data available for dozens of cancer types. We selected cancers from five primary sites (in the following, we refer to the names of primary sites as cancer types). Each of these five cancer types has at least two known disease types. For other cancer types, we do not know the ground-truth disease types, so they are not included in the package.

names(Wall)

## [1] "adrenal_gland" "lung"          "uterus"        "kidney"       
## [5] "colorectal"

We aim to cluster patients into groups and identify cancer subtypes (i.e., disease types) using multi-omic data. Here we include three types of data: gene expression, miRNA expression and DNA methylation beta values. We selected the 2582 patients who have all three data types available and included them in this package.

# Note: despite its name, "fpkm" holds the gene expression measurement, which can be
# HTSeq-Counts, transformed HTSeq-Counts (log2 transformation or
# variance-stabilizing transformation), or FPKM values.
names(Wall[[1]][[1]])

## [1] "fpkm"     "mirnas"   "methy450"

For raw counts data (measuring gene expression and miRNA expression), we can perform feature selection (e.g., differential expression analysis) and feature transformation (e.g., log2 transformation and variance-stabilizing transformation). We have included six feature types in this package:

raw.all: Raw counts of all genes or miRNAs

raw.sel: Raw counts of selected (differentially expressed) genes or miRNAs (Differential expression analysis was performed using DESeq2)

log.all: Log transformation of raw counts of all genes or miRNAs

log.sel: Log transformation of raw counts of selected (differentially expressed) genes or miRNAs

vst.sel: Variance stabilizing transformation of raw counts of selected genes or miRNAs

normalized: FPKM values of all genes or normalized counts for all miRNAs

names(Wall[[1]])

## [1] "raw.all"    "raw.sel"    "log.all"    "log.sel"    "vst.sel"   
## [6] "normalized"

So for each of the five cancer types, we have the above six feature types (raw.all, raw.sel, log.all, log.sel, vst.sel and normalized). For each feature type, we have three “views”: gene expression (named fpkm, i.e., names(Wall[[1]][[1]])[1]), miRNA expression (mirnas), and DNA methylation beta values (methy450). In total, Wall contains 5 × 6 × 3 = 90 matrices. (For DNA methylation, we used beta values directly without feature transformation, so the six methy450 matrices of each cancer type are identical. Removing these 5 × 5 = 25 redundant copies leaves 65 unique matrices in Wall.)
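The counts above can be checked with a few lines of base R. This is only a sketch: it rebuilds the three levels of names from the output shown earlier instead of loading Wall, and the variable names are chosen for illustration.

```r
# Names copied from the output of names(Wall), names(Wall[[1]]) and
# names(Wall[[1]][[1]]) shown above; Wall itself is not needed here.
cancer_types  <- c("adrenal_gland", "lung", "uterus", "kidney", "colorectal")
feature_types <- c("raw.all", "raw.sel", "log.all", "log.sel", "vst.sel", "normalized")
views         <- c("fpkm", "mirnas", "methy450")

# Every matrix is addressed as Wall[[cancer]][[feature]][[view]],
# so one row of this grid corresponds to one matrix.
paths <- expand.grid(view = views, feature = feature_types, cancer = cancer_types,
                     stringsAsFactors = FALSE)
n_total <- nrow(paths)

# methy450 is identical across the six feature types of a cancer type,
# so five of the six copies per cancer type are redundant.
n_redundant <- length(cancer_types) * (length(feature_types) - 1)
n_unique <- n_total - n_redundant

print(c(total = n_total, redundant = n_redundant, unique = n_unique))
```

Each row of paths gives the three names needed to extract one matrix, for example Wall[["adrenal_gland"]][["log.sel"]][["fpkm"]].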

Spectral clustering using affinity matrices

We can perform spectral clustering on a patient affinity matrix. Take adrenal_gland cancer for example. We can cluster patients using the affinity matrix derived from the log2 transformation of raw counts of differentially expressed genes.

library(ANF)
affinity.mat <- Wall[["adrenal_gland"]][["log.sel"]][["fpkm"]]
labels <- spectral_clustering(affinity.mat, k = 2)
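For readers curious about what spectral clustering does with an affinity matrix, here is a minimal base-R sketch on a toy matrix. It is not the implementation behind ANF's spectral_clustering, which may differ in normalization and other details. The function name toy_spectral_clustering and the toy data are made up for illustration.

```r
set.seed(42)

# Toy symmetric affinity matrix (hypothetical data): 12 "patients" in two
# groups, with high affinity within a group and low affinity between groups.
n <- 12
group <- rep(1:2, each = 6)
A <- outer(group, group, function(a, b) ifelse(a == b, 0.9, 0.1))
noise <- matrix(runif(n * n, min = 0, max = 0.05), n, n)
A <- A + (noise + t(noise)) / 2  # symmetrize the noise so A stays symmetric

# Illustrative normalized spectral clustering (one common textbook variant).
toy_spectral_clustering <- function(A, k) {
  # Symmetric normalization: S = D^(-1/2) A D^(-1/2), D = diag(rowSums(A))
  d_inv_sqrt <- 1 / sqrt(rowSums(A))
  S <- A * outer(d_inv_sqrt, d_inv_sqrt)
  # eigen() sorts eigenvalues in decreasing order, so the first k columns
  # are the eigenvectors of the k largest eigenvalues.
  U <- eigen(S, symmetric = TRUE)$vectors[, seq_len(k), drop = FALSE]
  # Scale each row to unit length, then cluster the rows with k-means.
  U <- U / sqrt(rowSums(U^2))
  stats::kmeans(U, centers = k, nstart = 20)$cluster
}

toy_labels <- toy_spectral_clustering(A, k = 2)
print(table(toy_labels, group))
```

On this toy matrix the two groups are well separated, so the cluster labels should line up with group (cluster numbering is arbitrary).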

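Since we know the true disease types, which correspond to the project ids in project_ids, we can calculate the normalized mutual information (NMI) and the adjusted Rand index (ARI) between the true disease types and the cluster labels.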

true.disease.types <- as.factor(project_ids[rownames(affinity.mat)])
print(table(labels, true.disease.types))

##       true.disease.types
## labels TCGA-ACC TCGA-PCPG
##      1        0       176
##      2       76         1

nmi <- igraph::compare(true.disease.types, labels, method = "nmi")

adjusted.rand <- igraph::compare(true.disease.types, labels, method = "adjusted.rand")

# We can also calculate a p-value for the survival difference between the clusters using the `surv.plot` data
surv.plot <- surv.plot[rownames(affinity.mat), ]
f <- survival::Surv(surv.plot$time, !surv.plot$censored)
fit <- survival::survdiff(f ~ labels)
pval <- stats::pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)

print(paste("NMI =", nmi, ", ARI =", adjusted.rand, ", p-val =", pval))

## [1] "NMI = 0.962881312440823 , ARI = 0.983811442595234 , p-val = 1.46669868477277e-07"
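The NMI and ARI values above can be cross-checked by hand from the contingency table alone. The sketch below uses the standard definitions (NMI as twice the mutual information divided by the sum of the two entropies, and the usual pair-counting formula for ARI), which is our understanding of what igraph::compare computes. The helper names are made up for illustration.

```r
# Contingency table copied from the output above
# (rows: cluster labels, columns: true disease types).
tab <- matrix(c(0, 76, 176, 1), nrow = 2,
              dimnames = list(labels = c("1", "2"),
                              true.disease.types = c("TCGA-ACC", "TCGA-PCPG")))

# Shannon entropy (natural log) of a vector of counts; zero cells are skipped.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# NMI = 2 * I(rows; cols) / (H(rows) + H(cols))
nmi_from_table <- function(tab) {
  h_rows  <- entropy(rowSums(tab))
  h_cols  <- entropy(colSums(tab))
  h_joint <- entropy(as.vector(tab))
  2 * (h_rows + h_cols - h_joint) / (h_rows + h_cols)
}

# ARI = (index - expected index) / (maximum index - expected index),
# where each index counts pairs of patients placed in the same group.
ari_from_table <- function(tab) {
  n_same_pairs <- function(x) sum(choose(x, 2))
  row_pairs <- n_same_pairs(rowSums(tab))
  col_pairs <- n_same_pairs(colSums(tab))
  expected  <- row_pairs * col_pairs / choose(sum(tab), 2)
  maximum   <- (row_pairs + col_pairs) / 2
  (n_same_pairs(tab) - expected) / (maximum - expected)
}

nmi_check <- nmi_from_table(tab)
ari_check <- ari_from_table(tab)
print(c(NMI = nmi_check, ARI = ari_check))
```

Both values should agree closely with the NMI and ARI printed above.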

The ANF package provides a function, eval_clu, that streamlines the above process, from spectral clustering to calculating NMI, ARI and the p-value. Here is an example of how to use eval_clu:

res <- eval_clu(project_ids, w = affinity.mat, surv = surv.plot)

## nmi = 0.962881312440823

## adjusted.rand = 0.983811442595234

## survdiff p value = 1.46669868477277e-07

##            labels
## true_class    1   2
##   TCGA-ACC    0  76
##   TCGA-PCPG 176   1

res$labels

##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 1 1

For adrenal_gland cancer, we misclassify only one out of 253 patients using this affinity matrix. That is a pretty good result. (In fact, this is the best result we can achieve. Users can try other matrices and compare the results.) However, in many cases, using a single affinity matrix does a “terrible” job of clustering patients into the correct disease types. Take uterus cancer for example (the NMI is near 0).

res <- eval_clu(project_ids, w = Wall$uterus$raw.all$fpkm)

## nmi = 0.0563331568619408

## adjusted.rand = -0.0550627074255129

## survdiff p value = NA

##            labels
## true_class    1   2
##   TCGA-UCEC 153 268
##   TCGA-UCS    3  51

Use Affinity Network Fusion (ANF package) to fuse multiple affinity matrices for patient clustering

Instead of using one affinity matrix, we can “fuse” multiple affinity matrices using the ANF package, and then perform spectral clustering on the fused affinity matrix.

Let’s take uterus cancer for example.

# fuse three matrices: "fpkm" (gene expression), "mirnas" (miRNA expression) and "methy450" (DNA methylation)
fused.mat <- ANF(Wall = Wall$uterus$raw.all)
# Spectral clustering on fused patient affinity matrix
labels <- spectral_clustering(A = fused.mat, k = 2)
# Alternatively, evaluate the clustering directly with `eval_clu`, which calls
# `spectral_clustering` and calculates NMI and ARI (plus a p-value when patient
# survival data is available; `surv.plot` does not contain information for uterus cancer patients)
res <- eval_clu(true_class = project_ids[rownames(fused.mat)], w = fused.mat)

## nmi = 0.48526760996706

## adjusted.rand = 0.684204025562809

## survdiff p value = NA

##            labels
## true_class    1   2
##   TCGA-UCEC 410  11
##   TCGA-UCS   14  40

As we can see, spectral clustering on the fused affinity matrix substantially improves the results for uterus cancer: NMI rises from about 0.06 to 0.49, and ARI from about -0.06 to 0.68. This demonstrates the power of ANF. The paper at https://arxiv.org/abs/1708.07136 provides more results.

Affinity Matrices of 2582 Patients Derived from Harmonized TCGA Dataset

Tianle Ma

2018-05-01
