1 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("curatedTCGAData")

Load packages:

library(curatedTCGAData)
library(MultiAssayExperiment)
library(TCGAutils)

2 Downloading datasets

Checking available cancer codes and assays in TCGA data:

curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)
## Please see the list below for available cohorts and assays
## Available Cancer codes:
##  ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
##  KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
##  PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM 
## Available Data Types:
##  CNACGH CNACGH_CGH_hg_244a
##  CNACGH_CGH_hg_415k_g4124a CNASNP CNASeq
##  CNVSNP GISTIC_AllByGene GISTIC_Peaks
##  GISTIC_ThresholdedByGene Methylation
##  Methylation_methyl27 Methylation_methyl450
##  Mutation RNASeq2GeneNorm RNASeqGene RPPAArray
##  mRNAArray mRNAArray_TX_g4502a
##  mRNAArray_TX_g4502a_1
##  mRNAArray_TX_ht_hg_u133a mRNAArray_huex
##  miRNAArray miRNASeqGene

Check potential files to be downloaded:

curatedTCGAData(diseaseCode = "COAD", assays = "RPPA*", dry.run = TRUE)
##                      Title DispatchClass
## 96 COAD_RPPAArray-20160128           Rda

3 Caveats for working with TCGA data

Not all TCGA samples are cancer, there are a mix of samples in each of the 33 cancer types. Use sampleTables on the MultiAssayExperiment object along with data(sampleTypes, package = "TCGAutils") to see what samples are present in the data. There may be tumors that were used to create multiple contributions leading to technical replicates. These should be resolved using the appropriate helper functions such as mergeReplicates. Primary tumors should be selected using TCGAutils::TCGAsampleSelect and used as input to the subsetting mechanisms. See the “Samples in Assays” section of this vignette.

3.1 ACC dataset example

(accmae <- curatedTCGAData("ACC", c("CN*", "Mutation"), FALSE))
## A MultiAssayExperiment object of 3 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 3:
##  [1] ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 180 columns
##  [2] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns
##  [3] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
## Features:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample availability DFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices

Note. For more on how to use a MultiAssayExperiment please see the MultiAssayExperiment vignette.

3.1.1 Subtype information

Some cancer datasets contain associated subtype information within the clinical datasets provided. This subtype information is included in the metadata of colData of the MultiAssayExperiment object. To obtain these variable names, use the getSubtypeMap function from TCGA utils:

head(getSubtypeMap(accmae))
##         ACC_annotations   ACC_subtype
## 1            Patient_ID     patientID
## 2 histological_subtypes     Histology
## 3         mrna_subtypes       C1A/C1B
## 4         mrna_subtypes       mRNA_K4
## 5                  cimp    MethyLevel
## 6     microrna_subtypes miRNA cluster

3.1.2 Typical clinical variables

Another helper function provided by TCGAutils allows users to obtain a set of consistent clinical variable names across several cancer types. Use the getClinicalNames function to obtain a character vector of common clinical variables such as vital status, years to birth, days to death, etc.

head(getClinicalNames("ACC"))
## [1] "years_to_birth"        "vital_status"          "days_to_death"        
## [4] "days_to_last_followup" "tumor_tissue_site"     "pathologic_stage"
colData(accmae)[, getClinicalNames("ACC")][1:5, 1:5]
## DataFrame with 5 rows and 5 columns
##              years_to_birth vital_status days_to_death days_to_last_followup
##                   <integer>    <integer>     <integer>             <integer>
## TCGA-OR-A5J1             58            1          1355                    NA
## TCGA-OR-A5J2             44            1          1677                    NA
## TCGA-OR-A5J3             23            0            NA                  2091
## TCGA-OR-A5J4             23            1           423                    NA
## TCGA-OR-A5J5             30            1           365                    NA
##              tumor_tissue_site
##                    <character>
## TCGA-OR-A5J1           adrenal
## TCGA-OR-A5J2           adrenal
## TCGA-OR-A5J3           adrenal
## TCGA-OR-A5J4           adrenal
## TCGA-OR-A5J5           adrenal

3.1.3 Samples in Assays

The sampleTables function gives an overview of sample types / codes present in the data:

sampleTables(accmae)
## $`ACC_CNASNP-20160128`
## 
## 01 10 11 
## 90 85  5 
## 
## $`ACC_CNVSNP-20160128`
## 
## 01 10 11 
## 90 85  5 
## 
## $`ACC_Mutation-20160128`
## 
## 01 
## 90

Often, an analysis is performed comparing two groups of samples to each other. To facilitate the separation of samples, the splitAssays TCGAutils function identifies all sample types in the assays and moves each into its own assay. By default, all discoverable sample types are separated into a separate experiment. In this case we requested only solid tumors and blood derived normal samples as seen in the sampleTypes reference dataset:

sampleTypes[sampleTypes[["Code"]] %in% c("01", "10"), ]
##    Code           Definition Short.Letter.Code
## 1    01  Primary Solid Tumor                TP
## 10   10 Blood Derived Normal                NB
splitAssays(accmae, c("01", "10"))
## Warning: Some 'sampleCodes' not found in assays
## A MultiAssayExperiment object of 5 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 5:
##  [1] 01_ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 90 columns
##  [2] 10_ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 85 columns
##  [3] 01_ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 90 columns
##  [4] 10_ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 85 columns
##  [5] 01_ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
## Features:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample availability DFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices

To obtain a logical vector that could be used for subsetting a MultiAsssayExperiment, refer to TCGAsampleSelect. To select only primary tumors, use the function on the colnames of the MultiAssayExperiment:

tums <- TCGAsampleSelect(colnames(accmae), "01")

You can subsequently provide this input to the subsetting function to select only primary tumors:

accmae[, tums, ]
## harmonizing input:
##   removing 180 sampleMap rows with 'colname' not in colnames of experiments
## A MultiAssayExperiment object of 3 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 3:
##  [1] ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 90 columns
##  [2] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 90 columns
##  [3] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
## Features:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample availability DFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices

3.2 Exporting Data

MultiAssayExperiment provides users with an integrative representation of multi-omic TCGA data at the convenience of the user. For those users who wish to use alternative environments, we have provided an export function to extract all the data from a MultiAssayExperiment instance and write them to a series of files:

td <- tempdir()
tempd <- file.path(td, "ACCMAE")
if (!dir.exists(tempd))
    dir.create(tempd)

exportClass(accmae, dir = tempd, fmt = "csv", ext = ".csv")
## Writing about 7 files to disk...
## [1] "/tmp/RtmpkCjmWF/ACCMAE/accmae_META_0.csv"               
## [2] "/tmp/RtmpkCjmWF/ACCMAE/accmae_META_1.csv"               
## [3] "/tmp/RtmpkCjmWF/ACCMAE/accmae_ACC_CNASNP-20160128.csv"  
## [4] "/tmp/RtmpkCjmWF/ACCMAE/accmae_ACC_CNVSNP-20160128.csv"  
## [5] "/tmp/RtmpkCjmWF/ACCMAE/accmae_ACC_Mutation-20160128.csv"
## [6] "/tmp/RtmpkCjmWF/ACCMAE/accmae_colData.csv"              
## [7] "/tmp/RtmpkCjmWF/ACCMAE/accmae_sampleMap.csv"

This works for all data classes stored (e.g., RaggedExperiment, HDF5Matrix, SummarizedExperiment) in the MultiAssayExperiment via the assays method which converts classes to matrix format (using individual assay methods).

4 Session Information

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] RaggedExperiment_1.12.0     TCGAutils_1.8.1            
##  [3] curatedTCGAData_1.10.1      MultiAssayExperiment_1.14.0
##  [5] SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
##  [7] matrixStats_0.56.0          Biobase_2.48.0             
##  [9] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
## [11] IRanges_2.22.2              S4Vectors_0.26.1           
## [13] BiocGenerics_0.34.0         BiocStyle_2.16.0           
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2                    bit64_4.0.5                  
##  [3] jsonlite_1.7.1                AnnotationHub_2.20.2         
##  [5] shiny_1.5.0                   assertthat_0.2.1             
##  [7] askpass_1.1                   interactiveDisplayBase_1.26.3
##  [9] BiocManager_1.30.10           BiocFileCache_1.12.1         
## [11] blob_1.2.1                    Rsamtools_2.4.0              
## [13] GenomeInfoDbData_1.2.3        progress_1.2.2               
## [15] yaml_2.2.1                    BiocVersion_3.11.1           
## [17] pillar_1.4.6                  RSQLite_2.2.0                
## [19] lattice_0.20-41               glue_1.4.2                   
## [21] digest_0.6.25                 promises_1.1.1               
## [23] XVector_0.28.0                rvest_0.3.6                  
## [25] htmltools_0.5.0               httpuv_1.5.4                 
## [27] Matrix_1.2-18                 XML_3.99-0.5                 
## [29] pkgconfig_2.0.3               biomaRt_2.44.1               
## [31] bookdown_0.20                 zlibbioc_1.34.0              
## [33] purrr_0.3.4                   xtable_1.8-4                 
## [35] later_1.1.0.1                 BiocParallel_1.22.0          
## [37] openssl_1.4.2                 tibble_3.0.3                 
## [39] generics_0.0.2                ellipsis_0.3.1               
## [41] GenomicFeatures_1.40.1        magrittr_1.5                 
## [43] crayon_1.3.4                  mime_0.9                     
## [45] memoise_1.1.0                 evaluate_0.14                
## [47] xml2_1.3.2                    prettyunits_1.1.1            
## [49] tools_4.0.2                   hms_0.5.3                    
## [51] lifecycle_0.2.0               stringr_1.4.0                
## [53] AnnotationDbi_1.50.3          Biostrings_2.56.0            
## [55] compiler_4.0.2                rlang_0.4.7                  
## [57] grid_4.0.2                    GenomicDataCommons_1.12.0    
## [59] RCurl_1.98-1.2                rappdirs_0.3.1               
## [61] bitops_1.0-6                  rmarkdown_2.3                
## [63] ExperimentHub_1.14.2          DBI_1.1.0                    
## [65] curl_4.3                      R6_2.4.1                     
## [67] GenomicAlignments_1.24.0      rtracklayer_1.48.0           
## [69] knitr_1.29                    dplyr_1.0.2                  
## [71] fastmap_1.0.1                 bit_4.0.4                    
## [73] readr_1.3.1                   stringi_1.5.3                
## [75] Rcpp_1.0.5                    vctrs_0.3.4                  
## [77] dbplyr_1.4.4                  tidyselect_1.1.0             
## [79] xfun_0.17