1 Introduction

The netboost package implements a three-step dimension reduction technique. First, a boosting-based filter is combined with the topological overlap measure to identify the essential edges of the network. Second, sparse hierarchical clustering is applied to the selected edges to identify modules. Third, the information in each module is aggregated by its first principal components. The primary analysis is then carried out on these summary measures instead of the original data.
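To make the three steps concrete, the following is a minimal base R sketch on simulated data. It is not the netboost implementation: the boosting-based filter and the topological overlap measure are replaced by a plain soft-thresholded correlation, and sparse hierarchical clustering is replaced by ordinary average-linkage clustering. The toy data, the cut height of 0.5 and the minimum module size of 2 are assumptions chosen for illustration only.

```r
# Toy data: 50 samples, 3 latent factors each driving 4 features,
# plus 8 pure-noise features.
set.seed(1)
n <- 50
latent <- matrix(rnorm(n * 3), n, 3)
signal <- latent[, rep(1:3, each = 4)] + matrix(rnorm(n * 12, sd = 0.3), n, 12)
x <- scale(cbind(signal, matrix(rnorm(n * 8), n, 8)))
colnames(x) <- paste0("f", seq_len(ncol(x)))

# Step 1 (stand-in): soft-thresholded absolute correlation as edge weights.
# netboost additionally uses a boosting-based filter and the topological
# overlap measure; both are omitted here.
adjacency <- abs(cor(x))^3

# Step 2 (stand-in): ordinary hierarchical clustering on 1 - adjacency,
# cut at a fixed height to obtain candidate modules.
tree <- hclust(as.dist(1 - adjacency), method = "average")
module <- cutree(tree, h = 0.5)

# Step 3: summarise every module with at least two features by its
# first principal component; singletons are treated as background.
sizes <- table(module)
keep <- names(sizes)[sizes >= 2]
summaries <- sapply(keep, function(m) {
  prcomp(x[, module == as.integer(m), drop = FALSE])$x[, 1]
})
dim(summaries)
```

Downstream models would then use the columns of summaries in place of the original 20 features.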

2 Loading an example dataset

The package comes with an example dataset included. We import the acute myeloid leukemia patient data from The Cancer Genome Atlas public domain database. The dataset consists of one thousand DNA methylation sites and gene expression levels on chromosome 18 for 80 patients.

require("netboost")
## Loading required package: netboost
## netboost 1.6.0 loaded
## Default CPU cores: 1
data("tcga_aml_meth_rna_chr18", package = "netboost")
dim(tcga_aml_meth_rna_chr18)
## [1]   80 1000

The netboost() function integrates all major analysis steps and generates multiple plots. In this step we also set analysis parameters:

stepno defines the number of boosting steps taken.

soft_power sets the exponent in the transformation of the correlation (if NULL, it is chosen automatically).

min_cluster_size sets the minimal size of clusters.

n_pc sets the maximum number of principal components computed per module.

scale determines whether the data are scaled and centered prior to analysis.

ME_diss_thres defines the merging threshold for identified clusters.

For details on the options please see ?netboost and the corresponding paper Schlosser et al. 2019 (under review).
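As a rough illustration of what soft_power and scale do, the sketch below uses base R only. The form abs(r)^soft_power is the usual soft-thresholding transformation in weighted correlation networks and is an assumption here, as is the helper name soft_threshold; see ?netboost for the exact behaviour.

```r
# Correlations of different strength.
r <- c(weak = 0.2, moderate = 0.5, strong = 0.9)

# Hypothetical helper: raising |r| to a power shrinks weak correlations
# much faster than strong ones (assumed form of the transformation).
soft_threshold <- function(r, soft_power) abs(r)^soft_power
sapply(c(power1 = 1, power3 = 3, power6 = 6),
       function(p) soft_threshold(r, p))

# scale = TRUE corresponds conceptually to centring each feature to
# mean 0 and scaling it to standard deviation 1, as base R's scale() does.
x <- cbind(a = c(1, 2, 3, 4), b = c(10, 30, 50, 70))
z <- scale(x)
colMeans(z)
apply(z, 2, sd)
```

A larger exponent therefore yields a sparser effective network, because only strong correlations retain appreciable weight.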

results <- netboost(datan = tcga_aml_meth_rna_chr18, stepno = 20L,
                    soft_power = 3L, min_cluster_size = 10L, n_pc = 2,
                    scale = TRUE, ME_diss_thres = 0.25)
## idx: 1 (0.1%) - Tue Oct 27 21:56:30 2020

## Netboost extracted 16 modules (including background) with an average size of 28.3333333333333 (excluding background) from Tree 1.
## Netboost detected 15 modules and 1 background modules in 1 trees resulting in 25 aggreagate measures.
## Average size of the modules was 28.3333333333333.
## 575 of 1000 features (57.5%) were not assigned to modules.

Three graphs are produced for each detected independent tree in the dataset (here one). The first shows a dendrogram of the initial modules and the level at which they are merged, the second shows the module dendrogram after merging, and the third shows the dendrogram of features together with the module color code.

results contains the dendrograms (dendros), the feature identifiers (names) matched to the module assignments (colors), the aggregated dataset (MEs), the rotation matrix used to compute the aggregated dataset (rotation) and the proportion of variance explained by the aggregate measures (var_explained). Depending on the minimum proportion of variance explained set in the netboost() call (default 0.5), up to n_pc principal components are exported per module.

names(results)
## [1] "dendros"       "names"         "colors"        "MEs"          
## [5] "rotation"      "var_explained" "filter"

colnames(results$MEs)
##  [1] "ME0_1_pc1" "ME0_1_pc2" "ME4_pc1"   "ME14_pc1"  "ME14_pc2"  "ME6_pc1"  
##  [7] "ME6_pc2"   "ME7_pc1"   "ME11_pc1"  "ME11_pc2"  "ME9_pc1"   "ME9_pc2"  
## [13] "ME8_pc1"   "ME13_pc1"  "ME13_pc2"  "ME2_pc1"   "ME12_pc1"  "ME15_pc1" 
## [19] "ME15_pc2"  "ME1_pc1"   "ME1_pc2"   "ME5_pc1"   "ME10_pc1"  "ME3_pc1"  
## [25] "ME3_pc2"

For 7 of the 15 modules only the first principal component is exported, which means that it alone explains more than 50% of the variance in the original features of that module. The remaining modules are represented by two components. ME0_X_pcY denotes the background module (unclustered features) of the independent tree X.

Explained variance is reported as a matrix with one row for each of the first n_pc principal components and one column per module. Here we list the first 5 modules:

results$var_explained[, 1:5]
##          ME0_1        ME4      ME14       ME6        ME7
## PC1 0.07681420 0.50125421 0.4419347 0.4777013 0.52079949
## PC2 0.05416773 0.06023611 0.1443083 0.1266931 0.07778952
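The link between this matrix and the exported columns of MEs can be sketched as follows. The helper n_components is hypothetical and encodes one plausible reading of the rule described above (export components until the cumulative proportion reaches the 0.5 threshold, capped at n_pc); it is not taken from the package source.

```r
# Hypothetical helper: smallest number of components whose cumulative
# proportion of explained variance reaches min_prop, capped at n_pc.
n_components <- function(var_explained, min_prop = 0.5, n_pc = 2) {
  cumulative <- cumsum(var_explained[seq_len(min(n_pc, length(var_explained)))])
  reached <- which(cumulative >= min_prop)
  if (length(reached) > 0) reached[1] else length(cumulative)
}

# Values copied from the var_explained output for the first 5 modules.
var_explained <- cbind(
  ME0_1 = c(0.07681420, 0.05416773),
  ME4   = c(0.50125421, 0.06023611),
  ME14  = c(0.4419347, 0.1443083),
  ME6   = c(0.4777013, 0.1266931),
  ME7   = c(0.52079949, 0.07778952)
)
rownames(var_explained) <- c("PC1", "PC2")

# Number of components per module under the assumed rule.
n_exported <- apply(var_explained, 2, n_components)
n_exported
```

Under this reading ME4 and ME7 contribute one column each and the other three modules contribute two, which agrees with the column names of results$MEs for these modules.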

results$colors uses a numeric coding for the modules which matches their module names. The features of module ME10, for example, can be extracted by:

results$names[results$colors == 10]
##  [1] "cg00018044" "cg00081465" "cg00142378" "cg00335802" "cg00602115"
##  [6] "cg00732383" "cg01074921" "cg01875596" "cg02351804" "cg02490253"
## [11] "cg02711020" "cg02976900" "cg03116030" "cg03686067" "cg03778258"
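Because names and colors are parallel vectors, the same indexing idea extends to all modules at once. The sketch below uses small made-up vectors in place of results$names and results$colors so that it runs without the package; the identifiers and codes are hypothetical.

```r
# Toy stand-ins for results$names and results$colors (hypothetical values).
feature_names <- c("cgA", "cgB", "cgC", "cgD", "cgE", "cgF")
module_colors <- c(10, 0, 10, 3, 3, 10)

# Features of a single module: logical indexing on the parallel vectors.
module_10 <- feature_names[module_colors == 10]
module_10

# All modules at once, as a list keyed by the numeric module code.
features_by_module <- split(feature_names, module_colors)
features_by_module

# Module sizes, one count per module code.
table(module_colors)
```

With real results, replacing the two toy vectors by results$names and results$colors should give the corresponding per-module feature lists.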

The final dendrogram including all trees can be plotted with labels (results$names) for the individual features. colorsrandom controls whether the module-color matching is randomized, which helps to obtain a clearly differentiable pattern when there are many modules. Labels are only suitable in applications with few features or with an appropriately large pdf device.

nb_plot_dendro(nb_summary = results, labels = FALSE, colorsrandom = TRUE)
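When labels are wanted for many features, one option is to write the plot to a PDF whose width grows with the number of labels. The helper plot_to_wide_pdf below is hypothetical, and the value of 0.15 inches per label is an arbitrary assumption; the demonstration uses a base R dendrogram so that it runs without netboost.

```r
# Hypothetical helper: open a PDF device whose width grows with the number
# of leaf labels, run the plotting code, and always close the device.
plot_to_wide_pdf <- function(file, n_labels, plot_fun, inches_per_label = 0.15) {
  pdf(file, width = max(7, n_labels * inches_per_label), height = 8)
  on.exit(dev.off())
  plot_fun()
  invisible(file)
}

# Stand-in dendrogram with 60 labelled leaves, built with base R only.
set.seed(1)
toy <- matrix(rnorm(60 * 10), nrow = 60,
              dimnames = list(paste0("feature", 1:60), NULL))
out <- plot_to_wide_pdf(tempfile(fileext = ".pdf"), nrow(toy),
                        function() plot(hclust(dist(toy)), cex = 0.6))
file.exists(out)
```

For the netboost results, plot_fun could be function() nb_plot_dendro(nb_summary = results, labels = TRUE, colorsrandom = TRUE) with n_labels set to length(results$names).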