if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("curatedTCGAData")
Load packages:
library(curatedTCGAData)
library(MultiAssayExperiment)
library(TCGAutils)
Checking available cancer codes and assays in TCGA data:
curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)
## Please see the list below for available cohorts and assays
## Available Cancer codes:
## ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
## KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
## PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM
## Available Data Types:
## CNACGH CNACGH_CGH_hg_244a
## CNACGH_CGH_hg_415k_g4124a CNASNP CNASeq
## CNVSNP GISTIC_AllByGene GISTIC_Peaks
## GISTIC_ThresholdedByGene Methylation
## Methylation_methyl27 Methylation_methyl450
## Mutation RNASeq2GeneNorm RNASeqGene RPPAArray
## mRNAArray mRNAArray_TX_g4502a
## mRNAArray_TX_g4502a_1
## mRNAArray_TX_ht_hg_u133a mRNAArray_huex
## miRNAArray miRNASeqGene
Check potential files to be downloaded:
curatedTCGAData(diseaseCode = "COAD", assays = "RPPA*", dry.run = TRUE)
## Title DispatchClass
## 96 COAD_RPPAArray-20160128 Rda
Not all TCGA samples are cancer, there are a mix of samples in each of the
33 cancer types. Use sampleTables
on the MultiAssayExperiment
object
along with data(sampleTypes, package = "TCGAutils")
to see what samples are
present in the data. There may be tumors that were used to create multiple
contributions leading to technical replicates. These should be resolved using
the appropriate helper functions such as mergeReplicates
. Primary tumors
should be selected using TCGAutils::TCGAsampleSelect
and used as input
to the subsetting mechanisms. See the “Samples in Assays” section of this
vignette.
(accmae <- curatedTCGAData("ACC", c("CN*", "Mutation"), FALSE))
## A MultiAssayExperiment object of 3 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 3:
## [1] ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 180 columns
## [2] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns
## [3] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
Note. For more on how to use a MultiAssayExperiment
please see the
MultiAssayExperiment
vignette.
Some cancer datasets contain associated subtype information within the
clinical datasets provided. This subtype information is included in the
metadata of colData
of the MultiAssayExperiment
object. To obtain these
variable names, use the getSubtypeMap
function from TCGA utils:
head(getSubtypeMap(accmae))
## ACC_annotations ACC_subtype
## 1 Patient_ID patientID
## 2 histological_subtypes Histology
## 3 mrna_subtypes C1A/C1B
## 4 mrna_subtypes mRNA_K4
## 5 cimp MethyLevel
## 6 microrna_subtypes miRNA cluster
Another helper function provided by TCGAutils allows users to obtain a set
of consistent clinical variable names across several cancer types.
Use the getClinicalNames
function to obtain a character vector of common
clinical variables such as vital status, years to birth, days to death, etc.
head(getClinicalNames("ACC"))
## [1] "years_to_birth" "vital_status" "days_to_death"
## [4] "days_to_last_followup" "tumor_tissue_site" "pathologic_stage"
colData(accmae)[, getClinicalNames("ACC")][1:5, 1:5]
## DataFrame with 5 rows and 5 columns
## years_to_birth vital_status days_to_death days_to_last_followup
## <integer> <integer> <integer> <integer>
## TCGA-OR-A5J1 58 1 1355 NA
## TCGA-OR-A5J2 44 1 1677 NA
## TCGA-OR-A5J3 23 0 NA 2091
## TCGA-OR-A5J4 23 1 423 NA
## TCGA-OR-A5J5 30 1 365 NA
## tumor_tissue_site
## <character>
## TCGA-OR-A5J1 adrenal
## TCGA-OR-A5J2 adrenal
## TCGA-OR-A5J3 adrenal
## TCGA-OR-A5J4 adrenal
## TCGA-OR-A5J5 adrenal
The sampleTables
function gives an overview of sample types / codes
present in the data:
sampleTables(accmae)
## $`ACC_CNASNP-20160128`
##
## 01 10 11
## 90 85 5
##
## $`ACC_CNVSNP-20160128`
##
## 01 10 11
## 90 85 5
##
## $`ACC_Mutation-20160128`
##
## 01
## 90
Often, an analysis is performed comparing two groups of samples to each other.
To facilitate the separation of samples, the splitAssays
TCGAutils function
identifies all sample types in the assays and moves each into its own
assay. By default, all discoverable sample types are separated into
a separate experiment. In this case we requested only solid tumors and blood
derived normal samples as seen in the sampleTypes
reference dataset:
sampleTypes[sampleTypes[["Code"]] %in% c("01", "10"), ]
## Code Definition Short.Letter.Code
## 1 01 Primary Solid Tumor TP
## 10 10 Blood Derived Normal NB
splitAssays(accmae, c("01", "10"))
## Warning: Some 'sampleCodes' not found in assays
## A MultiAssayExperiment object of 5 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 5:
## [1] 01_ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 90 columns
## [2] 10_ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 85 columns
## [3] 01_ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 90 columns
## [4] 10_ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 85 columns
## [5] 01_ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
To obtain a logical vector that could be used for subsetting a
MultiAsssayExperiment
, refer to TCGAsampleSelect
. To select only primary
tumors, use the function on the colnames of the MultiAssayExperiment
:
tums <- TCGAsampleSelect(colnames(accmae), "01")
You can subsequently provide this input to the subsetting function to select only primary tumors:
accmae[, tums, ]
## harmonizing input:
## removing 180 sampleMap rows with 'colname' not in colnames of experiments
## A MultiAssayExperiment object of 3 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 3:
## [1] ACC_CNASNP-20160128: RaggedExperiment with 79861 rows and 90 columns
## [2] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 90 columns
## [3] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
MultiAssayExperiment provides users with an integrative representation of multi-omic TCGA data at the convenience of the user. For those users who wish to use alternative environments, we have provided an export function to extract all the data from a MultiAssayExperiment instance and write them to a series of files:
td <- tempdir()
tempd <- file.path(td, "ACCMAE")
if (!dir.exists(tempd))
dir.create(tempd)
exportClass(accmae, dir = tempd, fmt = "csv", ext = ".csv")
## Writing about 7 files to disk...
## [1] "/tmp/RtmpkCjmWF/ACCMAE/accmae_META_0.csv"
## [2] "/tmp/RtmpkCjmWF/ACCMAE/accmae_META_1.csv"
## [3] "/tmp/RtmpkCjmWF/ACCMAE/accmae_ACC_CNASNP-20160128.csv"
## [4] "/tmp/RtmpkCjmWF/ACCMAE/accmae_ACC_CNVSNP-20160128.csv"
## [5] "/tmp/RtmpkCjmWF/ACCMAE/accmae_ACC_Mutation-20160128.csv"
## [6] "/tmp/RtmpkCjmWF/ACCMAE/accmae_colData.csv"
## [7] "/tmp/RtmpkCjmWF/ACCMAE/accmae_sampleMap.csv"
This works for all data classes stored (e.g., RaggedExperiment
, HDF5Matrix
,
SummarizedExperiment
) in the MultiAssayExperiment
via the assays
method
which converts classes to matrix
format (using individual assay
methods).
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RaggedExperiment_1.12.0 TCGAutils_1.8.1
## [3] curatedTCGAData_1.10.1 MultiAssayExperiment_1.14.0
## [5] SummarizedExperiment_1.18.2 DelayedArray_0.14.1
## [7] matrixStats_0.56.0 Biobase_2.48.0
## [9] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2
## [11] IRanges_2.22.2 S4Vectors_0.26.1
## [13] BiocGenerics_0.34.0 BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 bit64_4.0.5
## [3] jsonlite_1.7.1 AnnotationHub_2.20.2
## [5] shiny_1.5.0 assertthat_0.2.1
## [7] askpass_1.1 interactiveDisplayBase_1.26.3
## [9] BiocManager_1.30.10 BiocFileCache_1.12.1
## [11] blob_1.2.1 Rsamtools_2.4.0
## [13] GenomeInfoDbData_1.2.3 progress_1.2.2
## [15] yaml_2.2.1 BiocVersion_3.11.1
## [17] pillar_1.4.6 RSQLite_2.2.0
## [19] lattice_0.20-41 glue_1.4.2
## [21] digest_0.6.25 promises_1.1.1
## [23] XVector_0.28.0 rvest_0.3.6
## [25] htmltools_0.5.0 httpuv_1.5.4
## [27] Matrix_1.2-18 XML_3.99-0.5
## [29] pkgconfig_2.0.3 biomaRt_2.44.1
## [31] bookdown_0.20 zlibbioc_1.34.0
## [33] purrr_0.3.4 xtable_1.8-4
## [35] later_1.1.0.1 BiocParallel_1.22.0
## [37] openssl_1.4.2 tibble_3.0.3
## [39] generics_0.0.2 ellipsis_0.3.1
## [41] GenomicFeatures_1.40.1 magrittr_1.5
## [43] crayon_1.3.4 mime_0.9
## [45] memoise_1.1.0 evaluate_0.14
## [47] xml2_1.3.2 prettyunits_1.1.1
## [49] tools_4.0.2 hms_0.5.3
## [51] lifecycle_0.2.0 stringr_1.4.0
## [53] AnnotationDbi_1.50.3 Biostrings_2.56.0
## [55] compiler_4.0.2 rlang_0.4.7
## [57] grid_4.0.2 GenomicDataCommons_1.12.0
## [59] RCurl_1.98-1.2 rappdirs_0.3.1
## [61] bitops_1.0-6 rmarkdown_2.3
## [63] ExperimentHub_1.14.2 DBI_1.1.0
## [65] curl_4.3 R6_2.4.1
## [67] GenomicAlignments_1.24.0 rtracklayer_1.48.0
## [69] knitr_1.29 dplyr_1.0.2
## [71] fastmap_1.0.1 bit_4.0.4
## [73] readr_1.3.1 stringi_1.5.3
## [75] Rcpp_1.0.5 vctrs_0.3.4
## [77] dbplyr_1.4.4 tidyselect_1.1.0
## [79] xfun_0.17