Using the RTCGA package to download the RNAseq data included in the RTCGA.rnaseq package

Date of datasets release: 2015-11-01

Marcin Kosinski

2016-10-19

RTCGA package

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes. The key is to understand genomics to improve cancer care.

The RTCGA package downloads and integrates the variety and volume of TCGA data using the patient barcode as a key, which makes the data easier to obtain. This may have a beneficial influence on the development of science and the improvement of patients’ treatment. RTCGA is an open-source R package, available to download from Bioconductor

source("http://bioconductor.org/biocLite.R")
biocLite("RTCGA")

or from GitHub

if (!require(devtools)) {
    install.packages("devtools")
    require(devtools)
}
biocLite("RTCGA/RTCGA")

Furthermore, the RTCGA package transforms TCGA data into a form that is convenient to use in the R statistical environment. These transformations can be part of a statistical analysis pipeline, which RTCGA helps make more reproducible.

Use cases and examples are shown in the RTCGA package vignettes:

browseVignettes("RTCGA")

How to download RNAseq data to obtain the same datasets as in the RTCGA.rnaseq package?

TCGA data have been released on many dates. To see them all, type:

library(RTCGA)
checkTCGA('Dates')

Version 20151101.*.* of the RTCGA.rnaseq package contains RNAseq datasets that were released on 2015-11-01. They were downloaded in the following way (mainly copied from http://rtcga.github.io/RTCGA/):

Available cohorts

All cohort names can be checked using:

library(magrittr) # provides the %>% pipe used in this document
(cohorts <- infoTCGA() %>% 
   rownames() %>% 
   sub("-counts", "", x=.))
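
To make the string manipulation above concrete, here is a minimal base R sketch. The row names are hypothetical values written in the style implied by the "-counts" suffix; they are not actual output of infoTCGA().

```r
# Hypothetical row names; real values come from infoTCGA() and may differ.
row_names <- c("ACC-counts", "BRCA-counts", "COADREAD-counts")

# Strip the "-counts" suffix to obtain bare cohort codes.
cohorts_demo <- sub("-counts", "", row_names)
print(cohorts_demo)
```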

For all cohorts the following code downloads the RNAseq data.

Downloading RNAseq files

# dir.create( "data2" ) # name of a directory in which data will be stored
releaseDate <- "2015-11-01"
sapply( cohorts, function(element){
   tryCatch({
      downloadTCGA( cancerTypes = element, 
                    dataSet = "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level",
                    destDir = "data2", 
                    date = releaseDate )
   },
   error = function(cond){
      cat("Error: Maybe there were no RNAseq data for ", element, " cancer.\n")
   })
})
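
The loop above relies on tryCatch so that one failing cohort does not stop the remaining downloads. The sketch below shows the same pattern in isolation, without network access. Both fake_download and safe_download are hypothetical stand-ins invented for illustration; they are not part of RTCGA.

```r
# Hypothetical stand-in for a download call; it fails for one cohort on purpose.
fake_download <- function(cohort) {
  if (cohort == "XYZ") stop("no data for cohort ", cohort)
  paste0("downloaded_", cohort)
}

# Wrap the call so that an error is reported and turned into NA
# instead of aborting the whole loop.
safe_download <- function(cohort) {
  tryCatch(
    fake_download(cohort),
    error = function(cond) {
      message("Skipping ", cohort, ": ", conditionMessage(cond))
      NA_character_
    }
  )
}

results <- vapply(c("ACC", "XYZ", "BRCA"), safe_download, character(1))

# Cohorts for which the download failed
failed <- names(results)[is.na(results)]
print(failed)
```

Returning an explicit NA from the error handler makes it easy to list the failed cohorts afterwards, rather than only printing a message.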

Reading downloaded RNAseq dataset

Shortening paths and directories

list.files( "data2") %>% 
   file.path( "data2", .) %>%
   file.rename( to = substr(.,start=1,stop=50))
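
As a rough illustration of what the truncation does, the sketch below applies substr to two path strings without touching the file system. The folder names are an assumption about the naming pattern of the downloaded archives and may not match the real names exactly.

```r
# Assumed folder naming pattern; real downloaded folder names may differ.
cohort_ids <- c("ACC", "BRCA")
long_paths <- file.path(
  "data2",
  sprintf(
    "gdac.broadinstitute.org_%s.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2015110100.0.0",
    cohort_ids
  )
)

# Keep only the first 50 characters of each path, as in the code above.
short_paths <- substr(long_paths, start = 1, stop = 50)
print(short_paths)

# Truncation is only safe if the shortened paths are still unique.
stopifnot(anyDuplicated(short_paths) == 0)
```

The uniqueness check is worth keeping in a real run: if two folders shared the same first 50 characters, the rename would collide.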

Removing NA files from data2 folder

If there were no RNAseq data for some cohorts, we should remove the corresponding NA files.

list.files( "data2") %>%
   file.path( "data2", .) %>%
   sapply(function(x){
      if (x == "data2/NA")
         file.remove(x)      
   })

Paths to RNAseq data

Below is the code that removes the unneeded “MANIFEST.txt” file from each RNAseq cohort folder.

list.files( "data2") %>% 
   file.path( "data2", .) %>%
   sapply(function(x){
      file.path(x, list.files(x)) %>%
         grep(pattern = "MANIFEST.txt", x = ., value=TRUE) %>%
         file.remove()
      })
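
The same clean-up can be tried safely on a throwaway directory. The following base R sketch builds a small hypothetical folder layout under tempdir() and removes the manifest files with a recursive list.files call; the folder and file names are made up for the demonstration.

```r
# Build a small hypothetical layout: one folder per cohort,
# each holding a MANIFEST.txt and one data file.
root <- file.path(tempdir(), "data2_demo")
for (cohort in c("ACC", "BRCA")) {
  dir.create(file.path(root, cohort), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, cohort, "MANIFEST.txt"))
  file.create(file.path(root, cohort, paste0(cohort, ".data.txt")))
}

# Find every MANIFEST.txt below the root and delete it.
manifests <- list.files(root, pattern = "^MANIFEST\\.txt$",
                        recursive = TRUE, full.names = TRUE)
removed <- file.remove(manifests)

# Only the data files should be left.
remaining <- list.files(root, recursive = TRUE)
print(remaining)
```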

Below is the code that automatically assigns the path of each RNAseq file, for every cohort downloaded to the data2 folder, to a variable named after the cohort.

list.files("data2") %>%
   file.path("data2", .) %>%
   sapply(function(y){
      file.path(y, list.files(y)) %>%
         assign(value = .,
                x = paste0(list.files(y) %>%
                              gsub(x = ., pattern = "\\..*", replacement = "") %>%
                              gsub(x = ., pattern = "-", replacement = "_"),
                           ".rnaseq.path"),
                envir = .GlobalEnv)
   })
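
The variable names are derived from the file names in two steps: everything from the first dot onwards is dropped, and hyphens are replaced with underscores. The sketch below shows that derivation on hypothetical file names and, as one possible alternative to assigning into the global environment, collects the paths in a named list. The file and folder names are invented for illustration.

```r
# Hypothetical file names and folders; real names come from the downloaded archives.
file_names <- c("ACC.rnaseqv2__illuminahiseq.data.txt",
                "GBM-LGG.rnaseqv2__illuminahiseq.data.txt")
folders <- c("data2/folderA", "data2/folderB")

# Step 1: keep the part before the first dot.
# Step 2: replace hyphens, which are awkward in R object names.
prefix <- gsub("-", "_", gsub("\\..*", "", file_names))
path_names <- paste0(prefix, ".rnaseq.path")

# Alternative to assign(): keep all paths together in one named list.
rnaseq_paths <- setNames(as.list(file.path(folders, file_names)), path_names)
print(names(rnaseq_paths))
```

A named list keeps the global environment clean and can be iterated over directly, at the cost of departing from the object-per-cohort layout used in the rest of this document.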

Reading RNAseq data using readTCGA

Because RNAseq data are transposed in the downloaded files, a special function, readTCGA, has been prepared to read and transpose the data automatically. The code is below.

ls() %>%
   grep("rnaseq\\.path", x = ., value = TRUE) %>% 
   sapply(function(element){
      tryCatch({
         readTCGA(get(element, envir = .GlobalEnv),
               dataType = "rnaseq") %>%
         assign(value = .,
                x = sub("\\.path", "", x = element),
                envir = .GlobalEnv )
      }, error = function(cond){
         cat(element)
      }) 
     invisible(NULL)
    }    
)
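
To show what "transposed" means here, the following base R sketch turns a small genes-in-rows table into a samples-in-rows data frame. It is only an illustration of the idea, not the implementation of readTCGA; the gene identifiers, barcodes, values, and the sample_barcode column name are all made up.

```r
# Toy input in the downloaded orientation: one row per gene, one column per sample.
raw <- data.frame(
  gene_id = c("TP53|7157", "BRCA1|672"),
  `TCGA-AA-0001-01A` = c(10.5, 3.2),
  `TCGA-AA-0002-01A` = c(8.1, 4.4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Transpose the numeric part so that samples become rows and genes become columns.
expr <- t(as.matrix(raw[, -1]))
colnames(expr) <- raw$gene_id

# Move the sample barcodes from the row names into an ordinary column.
tidy <- data.frame(sample_barcode = rownames(expr), expr,
                   check.names = FALSE, stringsAsFactors = FALSE)
rownames(tidy) <- NULL
print(tidy)
```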

Saving RNAseq data to RTCGA.rnaseq package

grep( "rnaseq", ls(), value = TRUE) %>%
   grep("path", x=., value = TRUE, invert = TRUE) %>% 
   cat( sep="," ) # can one do it better? From the use_data documentation:
   # ...    Unquoted names of existing objects to save
devtools::use_data(ACC.rnaseq,BLCA.rnaseq,BRCA.rnaseq,
                   CESC.rnaseq,CHOL.rnaseq,COADREAD.rnaseq,
                   COAD.rnaseq,DLBC.rnaseq,ESCA.rnaseq,
                   GBMLGG.rnaseq,GBM.rnaseq,HNSC.rnaseq,
                   KICH.rnaseq,KIPAN.rnaseq,KIRC.rnaseq,
                   KIRP.rnaseq,LAML.rnaseq,LGG.rnaseq,
                   LIHC.rnaseq,LUAD.rnaseq,LUSC.rnaseq,
                   OV.rnaseq,PAAD.rnaseq,PCPG.rnaseq,
                   PRAD.rnaseq,READ.rnaseq,SARC.rnaseq,
                   SKCM.rnaseq,STAD.rnaseq,STES.rnaseq,
                   TGCT.rnaseq,THCA.rnaseq,THYM.rnaseq,
                   UCEC.rnaseq,UCS.rnaseq,UVM.rnaseq,
                   # overwrite = TRUE,
                   compress="xz")
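
One possible answer to the question in the comment above is to avoid typing the object names by hand and save each dataset by name with base R's save(), which accepts a character vector through its list argument. The sketch below is an assumption-laden alternative, not the procedure used to build RTCGA.rnaseq: it works on toy objects in a scratch environment and writes one xz-compressed .rda file per object to a temporary directory.

```r
# Toy objects in a scratch environment; real datasets would live in the workspace.
demo_env <- new.env()
assign("ACC.rnaseq", data.frame(x = 1:2), envir = demo_env)
assign("BRCA.rnaseq", data.frame(x = 3:4), envir = demo_env)
assign("ACC.rnaseq.path", "data2/some/path", envir = demo_env)

# Select dataset objects only: names ending in ".rnaseq" (this excludes "*.path").
dataset_names <- grep("\\.rnaseq$", ls(demo_env), value = TRUE)

# Save each object to its own xz-compressed .rda file.
out_dir <- file.path(tempdir(), "rnaseq_rda_demo")
dir.create(out_dir, showWarnings = FALSE)
for (nm in dataset_names) {
  save(list = nm, envir = demo_env,
       file = file.path(out_dir, paste0(nm, ".rda")),
       compress = "xz")
}

saved_files <- list.files(out_dir)
print(saved_files)
```

In a package, out_dir would be the package's data directory; whether this fits a given build workflow is left to the reader.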