This vignette generates a MultiAssayExperiment object from TCGA (The Cancer Genome Atlas) data. Specifically, the getFirehoseData() method of the RTCGAToolbox package (the fork available from https://www.github.com/LiNk-NY/RTCGAToolbox) is used to access and read in data; the output is then further coerced to fit the MultiAssayExperiment object specifications.
Note: This vignette requires TCGAutils, which is currently available only on GitHub.
library(RTCGAToolbox)
## Download and install GitHub resource
if (!require(TCGAutils, quietly = TRUE)) {
    BiocInstaller::biocLite("waldronlab/TCGAutils", suppressAutoUpdate = TRUE,
        suppressUpdates = TRUE)
}
library(TCGAutils)
These and other packages available in Bioconductor or CRAN are loaded as follows.
# BIOCONDUCTOR
library(MultiAssayExperiment)
library(RaggedExperiment)
# CRAN
library(readr)
The RTCGAToolbox package provides the getFirehoseDatasets() method for obtaining the names of all 33 cohorts contained within the TCGA data. Beyond the 33 cohorts, there are 5 additional “pan” cohorts in which data from multiple cohorts were merged; information about the cohorts is available via the TCGA website. Additionally, the getFirehoseRunningDates() and getFirehoseAnalyzeDates() methods are used to obtain the most recent running and analysis dates.
dataset <- getFirehoseDatasets()[27] # PRAD
stopifnot(identical(dataset, "PRAD"))
runDate <- getFirehoseRunningDates(last = 1)
analyzeDate <- getFirehoseAnalyzeDates(last = 1)
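Selecting the cohort by position (`[27]`) depends on the order in which getFirehoseDatasets() returns the codes. A minimal sketch of selecting by name instead is shown here, assuming the function returns a character vector of cohort codes. The `cohorts` vector and the `selectCohort()` helper are hypothetical stand-ins for illustration and are not part of RTCGAToolbox.

```r
# Hypothetical stand-in for the character vector returned by
# getFirehoseDatasets(); the real vector is longer.
cohorts <- c("ACC", "BLCA", "BRCA", "OV", "PRAD", "READ")

# Select a cohort by its code rather than by position, failing early
# if the code is not present. (Illustrative helper, not a package function.)
selectCohort <- function(code, available) {
    hit <- match(code, available)
    if (is.na(hit))
        stop("Unknown cohort code: ", code)
    available[hit]
}

dataset <- selectCohort("PRAD", cohorts)
```

With the real package loaded, `cohorts` would be replaced by the result of getFirehoseDatasets().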
A function, buildMultiAssayExperiment(), is defined as shown below for the purpose of creating a new MultiAssayExperiment object with a single line of code. It accepts the arguments defined in the above chunk and works for a single dataset at a time.
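In the first part of the function, the existence of the data directory is checked and the directory is created if necessary. A cohort object is then either loaded from a previously serialized file in the data directory or downloaded with the getFirehoseData() method; the saveRDS() call that serializes it to the data directory is commented out in the code below and can be enabled. Next, colData is extracted from the clinical slot: the rownames are cleaned with gsub() and toupper(), and the columns are converted to appropriate types with type.convert().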
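The load-or-build step and the clean-up of the clinical table can be sketched with base R alone. In the sketch below, `loadOrBuild()`, `makeClinical()`, and the toy identifiers are hypothetical names chosen for illustration; `makeClinical()` merely stands in for the download step and does not reflect what getFirehoseData() returns.

```r
# Load a cached object if present; otherwise build it and cache it.
# `build` is a hypothetical zero-argument function standing in for
# the download step.
loadOrBuild <- function(path, build) {
    if (file.exists(path))
        return(readRDS(path))
    obj <- build()
    saveRDS(obj, file = path)
    obj
}

# Toy clinical table with lower-case, dot-separated row names and
# all-character columns (made-up values).
makeClinical <- function() {
    data.frame(
        age = c("61", "57"),
        stage = c("t2a", "t3b"),
        row.names = c("tcga.aa.0001", "tcga.aa.0002"),
        stringsAsFactors = FALSE
    )
}

cachePath <- file.path(tempdir(), "toy_clinical.rds")
clinical <- loadOrBuild(cachePath, makeClinical)

# Clean the row names: dots to dashes, then upper case.
rownames(clinical) <- toupper(gsub("\\.", "-", rownames(clinical)))

# Convert each column to its natural type; lapply() handles the columns
# one at a time, so numeric columns stay numeric.
clinical[] <- lapply(clinical, type.convert, as.is = TRUE)
```

Using lapply() rather than apply() here is a deliberate choice: apply() first coerces the data frame to a matrix, which forces all columns to a single type.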
A named vector of extraction targets, matching the slot names of the cohort object, is then created, and the biocExtract() function is called on each target within a tryCatch() call so that data types without data are skipped. The outputs are then passed to generateMap(), which generates a sampleMap specific to TCGA data.
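The extract-and-skip pattern described above can be illustrated without any TCGA data. The following sketch assumes a hypothetical extractor, `toyExtract()`, that stands in for biocExtract(): it raises an error for an absent data type and returns a list for a multi-part type. It is not the package's implementation.

```r
# Hypothetical extractor standing in for biocExtract(): it fails for
# data types that are absent and returns a list for multi-part types.
toyExtract <- function(type) {
    switch(type,
        RNASeqGene = matrix(1:4, nrow = 2),
        GISTIC = list(AllByGene = matrix(1:4, nrow = 2),
                      ThresholdedByGene = matrix(0L, 2, 2)),
        stop("no data for ", type)
    )
}

targets <- c("RNASeqGene", "Mutation", "GISTIC")
names(targets) <- targets

# Attempt every target; on error, report it and return NULL.
extracted <- lapply(targets, function(type) {
    tryCatch(toyExtract(type), error = function(e) {
        message(type, " does not contain any data!")
        NULL
    })
})

# Drop the empty results, then flatten one level of nesting so that
# every element is a single dataset. recursive = FALSE keeps the toy
# matrices intact instead of flattening them into vectors.
extracted <- Filter(length, extracted)
isList <- vapply(extracted, is.list, logical(1L))
flat <- c(extracted[!isList], unlist(extracted[isList], recursive = FALSE))
names(flat)
```

Flattening joins the outer and inner names with a dot, which is how names such as `GISTIC.AllByGene` arise.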
Finally, the named ExperimentList of extracted datasets, the colData, and the generated sample map are passed to the MultiAssayExperiment() constructor function. The constructor ensures that orphaned samples, that is, samples that don’t match a record in colData, are removed. The resulting MultiAssayExperiment is returned; it can be serialized and saved to the data directory, making it easier to return to in the future.
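To make the idea of a sample map and of orphaned samples concrete, here is a base R sketch on made-up barcodes. It assumes that the first 12 characters of a TCGA barcode identify the patient and that a sample map has the columns `assay`, `primary`, and `colname`. The names `assayColumns`, `patients`, `toyMap`, and `keptPatients` are hypothetical, and the code does not reproduce how generateMap() or the constructor are implemented.

```r
# Toy sample barcodes per assay (hypothetical values); the first 12
# characters of a TCGA barcode identify the patient.
assayColumns <- list(
    GISTIC = c("TCGA-AA-0001-01A", "TCGA-AA-0002-01A"),
    miRNASeqGene = c("TCGA-AA-0001-01A", "TCGA-AA-0003-01A")
)
# Patients with clinical records; TCGA-AA-0009 has no assay data.
patients <- c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0009")

# Build a map with three columns: assay name, primary (patient)
# identifier, and assay column name.
toyMap <- do.call(rbind, lapply(names(assayColumns), function(assay) {
    cols <- assayColumns[[assay]]
    data.frame(assay = assay, primary = substr(cols, 1, 12),
               colname = cols, stringsAsFactors = FALSE)
}))

# Keep only mapped samples with a clinical record, and only patients
# that appear in the map.
toyMap <- toyMap[toyMap$primary %in% patients, ]
keptPatients <- intersect(patients, toyMap$primary)
```

In this toy case the sample from patient TCGA-AA-0003 is dropped because it has no clinical record, and patient TCGA-AA-0009 is dropped because no assay column maps to it.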
buildMultiAssayExperiment <- function(TCGAcode, runDate, analyzeDate,
    dataType = c("RNASeqGene", "RNASeq2GeneNorm", "miRNASeqGene", "CNASNP",
        "CNVSNP", "CNASeq", "CNACGH", "Methylation", "mRNAArray",
        "miRNAArray", "RPPAArray", "Mutation", "GISTIC"),
    dataDirectory = "/tmp", force = FALSE) {

    ## Create the data directory if it does not exist
    if (!dir.exists(dataDirectory))
        dir.create(dataDirectory)

    message("\n######\n", "\nProcessing ", TCGAcode, " : )\n", "\n######\n")
    serialPath <- file.path(dataDirectory, paste0(TCGAcode, ".rds"))

    targets <- c("RNASeqGene", "RNASeq2GeneNorm", "miRNASeqGene",
        "CNASNP", "CNVSNP", "CNASeq", "CNACGH", "Methylation",
        "mRNAArray", "miRNAArray", "RPPAArray", "Mutation",
        "GISTIC")
    names(targets) <- targets

    ## Named logical flags (e.g., miRNASeqGene = TRUE) passed on to
    ## getFirehoseData(); the names must be kept for do.call() below
    dataType <- match.arg(dataType, targets, several.ok = TRUE)
    dataType <- setNames(rep(TRUE, length(dataType)), dataType)

    ## Load the serialized cohort object if present, otherwise download it
    if (file.exists(serialPath)) {
        cancer.object <- readRDS(serialPath)
    } else {
        cancer.object <- do.call(getFirehoseData,
            args = c(list(dataset = TCGAcode,
                runDate = runDate,
                gistic2Date = analyzeDate,
                RNAseqNorm = "raw_counts",
                RNAseq2Norm = "normalized_count",
                forceDownload = FALSE,
                destdir = dataDirectory,
                fileSizeLimit = 500000,
                getUUIDs = FALSE), as.list(dataType)))
        # Uncomment to save to your directory
        # saveRDS(cancer.object, file = serialPath, compress = "bzip2")
    }

    ## Add clinical data from RTCGAToolbox
    clinical.data <- selectType(cancer.object, "clinical")
    rownames(clinical.data) <- toupper(gsub("\\.", "-",
        rownames(clinical.data)))
    ## Convert column by column; apply() would coerce all columns to one type
    clinical.data[] <- lapply(clinical.data, type.convert, as.is = TRUE)

    ## Extract each target, returning NULL for targets without data
    dataList <- lapply(targets, function(datType) {
        tryCatch({biocExtract(cancer.object, datType)},
            error = function(e) {
                message(datType, " does not contain any data!")
                NULL
            })
    })

    ## Drop empty results and flatten targets that yield several datasets
    data.full <- Filter(length, dataList)
    isList <- vapply(data.full, is.list, logical(1L))
    unlisted <- unlist(data.full[isList])
    data.full <- c(data.full[!isList], unlisted)

    newmap <- TCGAutils::generateMap(data.full, clinical.data,
        TCGAutils::TCGAbarcode)
    mae <- MultiAssayExperiment(data.full, clinical.data, newmap)
    return(mae)
}
Lastly, it is necessary to call the buildMultiAssayExperiment() function defined above and pass it the arguments defined using the RTCGAToolbox package. Using this function, a MultiAssayExperiment object for the prostate adenocarcinoma cohort (PRAD) is created with a single call.
buildMultiAssayExperiment(TCGAcode = dataset, runDate = runDate,
    analyzeDate = analyzeDate, dataType = "miRNASeqGene", dataDirectory = "tmp")
##
## ######
##
## Processing PRAD : )
##
## ######
## gdac.broadinstitute.org_PRAD.Clinical_Pick_Tier1.Level_4.2016012800.0.0.tar.gz
## gdac.broadinstitute.org_PRAD.Clinical_Pick_Tier1.Level_4.2016012800.0.0
##
## gdac.broadinstitute.org_PRAD-TP.CopyNumber_Gistic2.Level_4.2016012800.0.0.tar.gz
##
Read 80.7% of 24776 rows
Read 24776 rows and 495 (of 495) columns from 0.074 GB file in 00:00:03
## working on: RNASeqGene
## working on: RNASeq2GeneNorm
## working on: miRNASeqGene
## working on: CNASNP
## working on: CNVSNP
## working on: CNASeq
## working on: CNACGH
## working on: Methylation
## working on: mRNAArray
## working on: miRNAArray
## working on: RPPAArray
## working on: Mutation
## working on: GISTIC
## harmonizing input:
## removing 7 colData rownames not in sampleMap 'primary'
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] GISTIC.AllByGene: SummarizedExperiment with 24776 rows and 492 columns
## [2] GISTIC.ThresholdedByGene: SummarizedExperiment with 24776 rows and 492 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices