1 Prerequisites

Methods from two packages hosted on GitHub are used in this vignette. The packages are installed as follows.

BiocInstaller::biocLite("LiNk-NY/RTCGAToolbox")
BiocInstaller::biocLite("waldronlab/TCGAutils")

These and other packages available in Bioconductor or CRAN are loaded as follows.

library(MultiAssayExperiment)
library(RTCGAToolbox)
library(TCGAutils)
library(readr)

2 Argument Definitions

The RTCGAToolbox package provides the getFirehoseDatasets() method for obtaining the names of all 33 cohorts contained within the TCGA data. Beyond the 33 cohorts, there are 5 additional “pan” cohorts in which data from multiple cohorts were merged; information about the cohorts is available via the TCGA website. Additionally, the getFirehoseRunningDates() and getFirehoseAnalyzeDates() methods are used to obtain the most recent running and analysis dates. Finally, a character vector dataDirectory is created to specify the directory where output should be saved.

TCGAcode <- getFirehoseDatasets()[27] # PRAD
stopifnot(identical(TCGAcode, "PRAD"))
runDate <- getFirehoseRunningDates()[1]
analyzeDate <- getFirehoseAnalyzeDates()[1]
dataDirectory <- "data"
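
Selecting the cohort by position (index 27) depends on the order in which the codes are returned. As a minimal sketch, the code can instead be looked up by name. The vector cohorts below is a hypothetical stand-in for the output of getFirehoseDatasets(); it lists only a few codes and its ordering is not meant to reflect the real one.

```r
# Hypothetical stand-in for getFirehoseDatasets(); only a few codes are
# listed and the real vector may be ordered differently
cohorts <- c("ACC", "BLCA", "BRCA", "PRAD", "READ")

# Look the cohort up by name instead of relying on a fixed position
TCGAcode <- cohorts[match("PRAD", cohorts)]

# Fails early if the code is missing (match() would then return NA)
stopifnot(identical(TCGAcode, "PRAD"))
```

With the real package loaded, cohorts would simply be replaced by the result of getFirehoseDatasets().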

3 Function Definition

A function, buildMultiAssayExperiments(), is defined below so that a new MultiAssayExperiment object can be created with a single line of code. It accepts the arguments defined in the previous chunk and can accept multiple cohort names (e.g. TCGAcode <- getFirehoseDatasets()[1:5]). Cohorts are processed one after another in a for loop rather than in parallel, but the low-level operations within each iteration remain vectorized.

In the first part of the function, the existence of the data directory is checked and it is created if necessary. Then a cohort object is either loaded from a previously serialized file or downloaded with the getFirehoseData() method and serialized to the data directory. The clinical data, which will serve as colData, is then extracted from the cohort object; its row names are cleaned with gsub() and toupper(), and its column types are inferred with type_convert().
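
The row-name cleaning step can be illustrated on a toy data frame. This is only a sketch: the barcodes and values are invented, and base R's type.convert() is used as a stand-in for readr's type_convert() so that the example runs without additional packages.

```r
# Toy clinical table; identifiers and values are invented for illustration
pd <- data.frame(
    age = c("61", "57"),
    stage = c("t2a", "t3b"),
    row.names = c("tcga.2a.a8vl", "tcga.2a.a8vo"),
    stringsAsFactors = FALSE
)

# Replace dots with dashes and upper-case the identifiers so that they
# take the form of TCGA patient barcodes
rownames(pd) <- toupper(gsub("\\.", "-", rownames(pd)))

# Infer column types from the character data (stand-in for type_convert())
clinicalToy <- utils::type.convert(pd, as.is = TRUE)
str(clinicalToy)
```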

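A named character vector of extraction targets is then created from the slot names of the cohort object, and the TCGAextract() function is called on each target within a tryCatch() statement so that a target without data produces a message rather than stopping the function. The successfully extracted outputs are then passed to generateMap(), which generates a sampleMap specific to TCGA data.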
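
The extract-and-filter pattern can be sketched without any TCGA data. Here toyExtract() and store are hypothetical stand-ins for TCGAextract() and the cohort object; the sketch assumes only that the extractor signals an error when a target holds no data.

```r
# Hypothetical extractor standing in for TCGAextract(): it signals an
# error when the requested element is missing or empty
toyExtract <- function(store, name) {
    value <- store[[name]]
    if (is.null(value) || !length(value))
        stop("no data in ", name)
    value
}

# Hypothetical container: one populated element, one empty element
store <- list(RNASeqGene = matrix(1:4, nrow = 2), Mutations = integer(0))

targets <- c("RNASeqGene", "Mutations", "Methylation")
names(targets) <- targets

# Errors are converted into a message and a NULL placeholder
dataList <- lapply(targets, function(datType) {
    tryCatch(toyExtract(store, datType),
             error = function(e) {
                 message(datType, " does not contain any data!")
                 NULL
             })
})

# Drop the placeholders, keeping only the targets that were extracted
dataFull <- Filter(Negate(is.null), dataList)
names(dataFull)
```

Naming the targets vector with its own values is what lets lapply() return a named list, so the surviving elements keep their target names after filtering.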

Finally, the named ExperimentList of extracted datasets, the colData, and the generated sample map are passed to the MultiAssayExperiment() constructor function. The constructor ensures that orphaned samples (samples that do not match a record in colData) are removed. The resulting MultiAssayExperiment is serialized and saved to the data directory, making it easy to return to in the future.
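
To make the idea of a sample map and of orphaned samples concrete, the following base R sketch builds a small map by hand. It is an illustration only and not the implementation of generateMap() or of the constructor: the barcodes are invented, and the sketch assumes that the first 12 characters of a sample barcode identify the patient.

```r
# Invented sample barcodes; the third has no matching patient record
sampleBarcodes <- c("TCGA-2A-A8VL-01A-21R-A37L-07",
                    "TCGA-2A-A8VO-01A-11R-A37L-07",
                    "TCGA-ZZ-ZZZZ-01A-11R-A37L-07")

# Patient identifiers, as they would appear in the row names of colData
patients <- c("TCGA-2A-A8VL", "TCGA-2A-A8VO")

# One row per sample: assay name, patient identifier, sample column name
toyMap <- data.frame(
    assay = "RNASeqGene",
    primary = substr(sampleBarcodes, 1, 12),
    colname = sampleBarcodes,
    stringsAsFactors = FALSE
)

# Samples whose patient is absent from the clinical data are orphaned
keptMap <- toyMap[toyMap$primary %in% patients, ]
keptMap
```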

buildMultiAssayExperiments <-
    function(TCGAcode, runDate, analyzeDate, dataDirectory) {
        if (!dir.exists(dataDirectory))
            dir.create(dataDirectory)
        for (cancer in TCGAcode) {
            message("\n######\n",
                    "\nProcessing ", cancer, " : )\n",
                    "\n######\n")
            serialPath <- file.path(dataDirectory, paste0(cancer, ".rds"))
            if (file.exists(serialPath)) {
                cancerObject <- readRDS(serialPath)
            } else {
                cancerObject <- getFirehoseData(cancer, runDate = runDate,
                                                gistic2_Date = analyzeDate,
                                                RNAseq_Gene = TRUE,
                                                Clinic = TRUE,
                                                miRNASeq_Gene = TRUE,
                                                RNAseq2_Gene_Norm = TRUE,
                                                CNA_SNP = TRUE,
                                                CNV_SNP = TRUE,
                                                CNA_Seq = TRUE,
                                                CNA_CGH = TRUE,
                                                Methylation = TRUE,
                                                Mutation = TRUE,
                                                mRNA_Array = TRUE,
                                                miRNA_Array = TRUE,
                                                RPPA_Array = TRUE,
                                                RNAseqNorm = "raw_counts",
                                                RNAseq2Norm =
                                                    "normalized_count",
                                                forceDownload = FALSE,
                                                destdir = "./tmp",
                                                fileSizeLimit = 500000,
                                                getUUIDs = FALSE)
                saveRDS(cancerObject, file = serialPath, compress = "bzip2")
            }
            ## Add clinical data from RTCGAToolbox
            pd <- Clinical(cancerObject)
            rownames(pd) <- toupper(gsub("\\.", "-", rownames(pd)))
            clinicalData <- type_convert(pd)
            ## slotNames in FirehoseData RTCGAToolbox class
            targets <- c("RNASeqGene", "RNASeq2GeneNorm", "miRNASeqGene",
                         "CNASNP", "CNVSNP", "CNAseq", "CNACGH", "Methylation",
                         "mRNAArray", "miRNAArray", "RPPAArray", "Mutations",
                         "gistica", "gistict")
            names(targets) <- targets
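            ## Extract each target; targets without data yield NULL
            dataList <- lapply(targets, function(datType) {
                tryCatch({TCGAutils::TCGAextract(cancerObject, datType)},
                         error = function(e) {
                             message(datType, " does not contain any data!")
                             NULL
                         })
            })
            ## Keep only the targets that were extracted successfully
            dataFull <- Filter(function(x) {!is.null(x)}, dataList)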
            NewMap <- generateMap(dataFull, clinicalData, TCGAbarcode)
            MAEO <- MultiAssayExperiment(dataFull, clinicalData, NewMap)
            saveRDS(MAEO, file = file.path(dataDirectory,
                                           paste0(cancer, "_MAEO.rds")),
                    compress = "bzip2")
        }
    }

4 Function Call

Lastly, the buildMultiAssayExperiments() function defined above is called with the arguments defined using the RTCGAToolbox package. With this single call, a MultiAssayExperiment object for the prostate adenocarcinoma cohort (PRAD) is created and saved to the data directory.

buildMultiAssayExperiments(TCGAcode, runDate, analyzeDate, dataDirectory)
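
Because the function saves its result rather than returning it, the object has to be read back from disk in a later session. A minimal sketch, assuming the call above completed with dataDirectory set to "data" and the cohort code "PRAD"; the object names maeoPath and prad are introduced here for illustration.

```r
# Path follows the naming used when saving: <dataDirectory>/<cohort>_MAEO.rds
maeoPath <- file.path("data", paste0("PRAD", "_MAEO.rds"))

# Only attempt to load the object if the file was actually written
if (file.exists(maeoPath)) {
    prad <- readRDS(maeoPath)
} else {
    message("No serialized object found at ", maeoPath)
}
```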