1 Introduction

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high-level sequence analysis of the tumor genomes. The aim is to improve cancer care through a better understanding of genomics.

The RTCGA package offers an easy interface for downloading and integrating a variety of TCGA data, using the patient barcode as a key. This makes data acquisition easier, which in turn supports scientific progress and the improvement of patients’ treatment. Furthermore, the RTCGA package transforms the TCGA data into a tidy form that is convenient to use in the R statistical environment.

2 RTCGA package

More detailed information about this package can be found at https://github.com/MarcinKosinski/RTCGA.

2.1 Installation of the RTCGA package

To get started, install the latest version of RTCGA from Bioconductor:

source("http://bioconductor.org/biocLite.R")
biocLite("RTCGA")

or install the development version:

if (!require(devtools)) {
    install.packages("devtools")
    library(devtools)
}
biocLite("MarcinKosinski/RTCGA")

If you use devtools on Windows, make sure that Rtools is installed on your computer.
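On Bioconductor 3.8 and later, biocLite() is no longer supported and the BiocManager package is used instead. A minimal sketch of the equivalent release installation, assuming a recent R version and network access:

```r
# Install BiocManager from CRAN if it is not available yet.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
# Install the release version of RTCGA from Bioconductor.
BiocManager::install("RTCGA")
```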

3 Light data management and manipulations

Below is an example of how to use the RTCGA package to download ACC cohort data containing clinical data, mutations data, and rnaseq v2 data. It also shows how to untar those data and read them into a tidy format.

3.1 Adrenal Cortex Cancer (Adrenocortical carcinoma - ACC) data downloading

We will download data from one of the newest release dates.

library(RTCGA)
releaseDate <- tail( checkTCGA('Dates'), 2 )[1]
# if the server doesn't respond, set the date manually:
# releaseDate <- "2015-06-01"

We will need a folder into which we will download data.
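A minimal sketch of creating that folder with base R. The name "data" matches the destDir argument used in the calls below; the existence check is only there so the snippet can be run more than once.

```r
# Create the destination folder used as destDir in the download calls,
# unless it already exists.
if (!dir.exists("data")) {
    dir.create("data")
}
```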

3.1.1 Clinical data

Let us download clinical data. Simply use this command:

downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate )

3.1.2 Rnaseq v2 data

Let us download rnaseq v2 data. Simply use this command:

downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
              dataSet = "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level" )
# one can check all available dataSets' names with
# checkTCGA('DataSets')

3.1.3 Mutations data

Let us download gene mutations data. Simply use this command:

downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
              dataSet = "Mutation_Packager_Calls.Level" )

3.2 untarFile and removeTar parameters

By default, the untarFile and removeTar parameters are set to TRUE, which means that after a file is downloaded it is untarred and the no longer needed *.tar.gz file is removed. If downloadTCGA() was called with those parameters set to FALSE, the files can be untarred and removed manually as shown below.

3.2.1 Untarring data

Let us use the untar() function to untar all downloaded sets.

list.files( "data/") %>%
   file.path( "data", .) %>%
   sapply( untar, exdir = "data/" )

3.2.2 Removing no longer needed tar.gz files

After the datasets are untarred, the tar.gz files are no longer needed and can be deleted.

list.files( "data/") %>% 
   file.path( "data", .) %>%
   grep( pattern = "tar.gz", x = ., value = TRUE) %>%
   sapply( file.remove )
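The two manual steps above can be tried out without touching the real downloads. The following is a sketch in base R only, under the assumption that a small archive built in a temporary directory is a fair stand-in for a downloaded TCGA file; the names demo_dir and example_set are hypothetical. It also anchors the pattern to the end of the file name, so only files that really end in .tar.gz are removed.

```r
# Work in a throwaway directory so nothing in "data/" is touched.
demo_dir <- file.path(tempdir(), "untar_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)

# Build a tiny archive that stands in for a downloaded TCGA file.
old_wd <- setwd(demo_dir)
dir.create("example_set")
writeLines("barcode\tvalue", file.path("example_set", "table.txt"))
tar("example_set.tar.gz", files = "example_set", compression = "gzip")
unlink("example_set", recursive = TRUE)
setwd(old_wd)

# Select only files whose names end in ".tar.gz".
archives <- list.files(demo_dir, pattern = "\\.tar\\.gz$", full.names = TRUE)

# Untar each archive into the same directory, then delete the archives.
for (archive in archives) {
    untar(archive, exdir = demo_dir)
}
removed <- file.remove(archives)
```

After this runs, the extracted folder remains in demo_dir and the archive is gone, which mirrors what untarFile = TRUE and removeTar = TRUE are described as doing.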

3.3 Shortening directories of downloaded files

Because the path to the rnaseq data is longer than 256 characters, we need to shorten the directory name so that R can detect that the file exists.

list.files( "data/") %>% 
   file.path( "data", .) %>%
   grep("rnaseq", x = ., value = TRUE) %>%    
   file.rename( to = substr(.,start=1,stop=50))
[1] TRUE
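Truncating names can map two different folders to the same shortened name. The sketch below is a variation on the step above rather than the package's own method: it works on a hypothetical long folder name in a temporary directory, truncates only the folder name (not the whole path), and checks for collisions before renaming. The cut-off of 20 characters is arbitrary.

```r
# Hypothetical long folder name standing in for an untarred rnaseq directory.
base_dir <- file.path(tempdir(), "rename_demo")
unlink(base_dir, recursive = TRUE)
dir.create(base_dir)
long_name <- paste0("rnaseq_", strrep("x", 100))
dir.create(file.path(base_dir, long_name))

old_paths <- list.files(base_dir, pattern = "rnaseq", full.names = TRUE)

# Keep the parent directory intact and truncate only the folder name itself.
new_paths <- file.path(dirname(old_paths), substr(basename(old_paths), 1, 20))

# Stop if truncation would give two folders the same name.
stopifnot(anyDuplicated(new_paths) == 0)

renamed <- file.rename(from = old_paths, to = new_paths)
```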

4 Reading TCGA data to the tidy format

4.1 Clinical data

All downloaded clinical datasets for all cohorts are available in the RTCGA.clinical package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. Clinical data format is explained here. Below is a single example of how to read the clinical data for the ACC cohort downloaded above.

list.files("data/") %>%
    grep("Clinical", x = ., value = TRUE) %>%
    file.path("data", .)  -> folder

folder %>%
    list.files() %>%
    grep("clin.merged", x = ., value=TRUE) %>%
    file.path(folder, .) %>%
    readTCGA(path = ., "clinical") -> ACC.clinical

dim(ACC.clinical)
[1]   92 1115

4.2 Rnaseq v2 data

All downloaded rnaseq datasets for all cohorts are available in the RTCGA.rnaseq package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. rnaseq data format is explained here. Below is a single example of how to read the rnaseq data for the ACC cohort downloaded above.

list.files("data/") %>%
    grep("rnaseq", x = ., value = TRUE) %>%
    file.path("data", .) -> folder

folder %>%
    list.files() %>%
    grep("illumina", x = ., value=TRUE) %>%
    file.path(folder, .) %>%
    readTCGA(path = ., "rnaseq") -> ACC.rnaseq

dim(ACC.rnaseq)
[1]    79 20532

4.3 Mutations data

All downloaded mutations datasets for all cohorts are available in the RTCGA.mutations package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. Mutations data format is explained here. Below is a single example of how to read the mutations data for the ACC cohort downloaded above.

list.files("data/") %>%
    grep("Mutation", x = ., value = TRUE) %>%
    file.path("data", .) -> folder

folder %>%
    readTCGA(path = ., "mutations") -> ACC.mutations
dim(ACC.mutations)
[1] 20255    53
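The introduction mentions integrating datasets through the patient barcode. A minimal sketch of that idea follows, using invented toy data instead of the objects read above. The column names patient.bcr_patient_barcode and bcr_patient_barcode are assumptions about what readTCGA() returns and should be checked against the actual objects; the barcodes and values are made up. In TCGA barcodes the first 12 characters identify the patient.

```r
# Toy stand-ins for objects returned by readTCGA(); all values are invented
# and the column names are assumptions.
clinical <- data.frame(
    patient.bcr_patient_barcode = c("tcga-xx-0001", "tcga-xx-0002"),
    patient.gender = c("female", "male"),
    stringsAsFactors = FALSE
)
rnaseq <- data.frame(
    bcr_patient_barcode = c("TCGA-XX-0001-01A-11R-A000-07",
                            "TCGA-XX-0003-01A-11R-A000-07"),
    GENE1 = c(10.5, 3.2),
    stringsAsFactors = FALSE
)

# Derive a common key: the 12-character patient part of the barcode,
# in upper case on both sides.
clinical$patient <- toupper(clinical$patient.bcr_patient_barcode)
rnaseq$patient <- substr(rnaseq$bcr_patient_barcode, 1, 12)

# Inner join: keeps only patients present in both tables.
merged <- merge(clinical, rnaseq, by = "patient")
```

Here only the patient TCGA-XX-0001 appears in both toy tables, so merged has a single row.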

5 Information about TCGA project datasets

5.1 Codes and counts for each cohort

# library(devtools)
# install_github('Rapporter/pander')
if( require(pander) ){
    infoTCGA() %>%
        pandoc.table()
}
Table continues below
                 Cohort    BCR   Clinical  CN    LowP
ACC-counts       ACC       92    92        90    0
BLCA-counts      BLCA      412   412       410   112
BRCA-counts      BRCA      1098  1098      1089  19
CESC-counts      CESC      307   307       295   50
CHOL-counts      CHOL      51    36        36    0
COAD-counts      COAD      460   457       450   69
COADREAD-counts  COADREAD  631   628       615   104
DLBC-counts      DLBC      58    47        48    0
ESCA-counts      ESCA      185   185       184   51
FPPP-counts      FPPP      38    38        0     0
GBM-counts       GBM       613   593       577   0
GBMLGG-counts    GBMLGG    1129  1108      1090  52
HNSC-counts      HNSC      528   528       522   108
KICH-counts      KICH      113   111       66    0
KIPAN-counts     KIPAN     973   938       883   0
KIRC-counts      KIRC      537   537       528   0
KIRP-counts      KIRP      323   290       289   0
LAML-counts      LAML      200   200       197   0
LGG-counts       LGG       516   515       513   52
LIHC-counts      LIHC      377   368       370   0
LUAD-counts      LUAD      585   521       516   120
LUSC-counts      LUSC      504   502       501   0
MESO-counts      MESO      87    87        87    0
OV-counts        OV        602   591       586   0
PAAD-counts      PAAD      185   185       184   0
PCPG-counts      PCPG      179   179       175   0
PRAD-counts      PRAD      499   498       492   115
READ-counts      READ      171   171       165   35
SARC-counts      SARC      260   260       256   0
SKCM-counts      SKCM      470   470       469   118
STAD-counts      STAD      443   443       442   107
STES-counts      STES      628   628       626   158
TGCT-counts      TGCT      150   134       150   0
THCA-counts      THCA      503   503       499   98
THYM-counts      THYM      124   124       123   0
UCEC-counts      UCEC      560   545       540   106
UCS-counts       UCS       57    57        56    0
UVM-counts       UVM       80    80        80    51
Table continues below
                 Methylation  mRNA  mRNASeq  miR
ACC-counts       80           0     79       0
BLCA-counts      412          0     408      0
BRCA-counts      1097         526   1093     0
CESC-counts      307          0     304      0
CHOL-counts      36           0     36       0
COAD-counts      457          153   457      0
COADREAD-counts  622          222   623      0
DLBC-counts      48           0     28       0
ESCA-counts      185          0     184      0
FPPP-counts      0            0     0        0
GBM-counts       420          540   160      565
GBMLGG-counts    936          567   676      565
HNSC-counts      528          0     520      0
KICH-counts      66           0     66       0
KIPAN-counts     892          88    889      0
KIRC-counts      535          72    533      0
KIRP-counts      291          16    290      0
LAML-counts      194          0     179      0
LGG-counts       516          27    516      0
LIHC-counts      377          0     371      0
LUAD-counts      578          32    515      0
LUSC-counts      503          154   501      0
MESO-counts      87           0     86       0
OV-counts        594          574   304      570
PAAD-counts      184          0     178      0
PCPG-counts      179          0     179      0
PRAD-counts      498          0     497      0
READ-counts      165          69    166      0
SARC-counts      260          0     258      0
SKCM-counts      470          0     468      0
STAD-counts      443          0     274      0
STES-counts      628          0     458      0
TGCT-counts      150          0     150      0
THCA-counts      503          0     501      0
THYM-counts      124          0     120      0
UCEC-counts      547          54    545      0
UCS-counts       57           0     57       0
UVM-counts       80           0     80       0

                 miRSeq  RPPA  MAF  rawMAF
ACC-counts       80      46    90   0
BLCA-counts      409     344   130  395
BRCA-counts      1078    410   977  0
CESC-counts      307     173   194  0
CHOL-counts      36      30    35   0
COAD-counts      406     360   154  367
COADREAD-counts  549     491   223  489
DLBC-counts      47      33    48   0
ESCA-counts      184     126   0    0
FPPP-counts      22      0     0    0
GBM-counts       0       238   290  290
GBMLGG-counts    512     668   576  803
HNSC-counts      523     212   279  508
KICH-counts      66      63    66   66
KIPAN-counts     873     756   644  678
KIRC-counts      516     478   417  451
KIRP-counts      291     215   161  161
LAML-counts      188     0     197  0
LGG-counts       512     430   286  513
LIHC-counts      372     63    198  0
LUAD-counts      513     365   230  542
LUSC-counts      478     328   178  0
MESO-counts      87      63    0    0
OV-counts        453     426   316  469
PAAD-counts      178     123   146  0
PCPG-counts      179     80    178  0
PRAD-counts      494     352   332  0
READ-counts      143     131   69   122
SARC-counts      258     222   247  0
SKCM-counts      448     204   343  366
STAD-counts      436     357   289  0
STES-counts      620     483   289  0
TGCT-counts      150     118   149  0
THCA-counts      502     222   402  0
THYM-counts      124     90    0    0
UCEC-counts      538     440   248  0
UCS-counts       56      48    57   0
UVM-counts       80      12    80   0

5.2 Available cohorts names

(cohorts <- infoTCGA() %>% 
   rownames() %>% 
   sub("-counts", "", x=.))
 [1] "ACC"      "BLCA"     "BRCA"     "CESC"     "CHOL"     "COAD"     "COADREAD" "DLBC"     "ESCA"     "FPPP"     "GBM"      "GBMLGG"   "HNSC"    
[14] "KICH"     "KIPAN"    "KIRC"     "KIRP"     "LAML"     "LGG"      "LIHC"     "LUAD"     "LUSC"     "MESO"     "OV"       "PAAD"     "PCPG"    
[27] "PRAD"     "READ"     "SARC"     "SKCM"     "STAD"     "STES"     "TGCT"     "THCA"     "THYM"     "UCEC"     "UCS"      "UVM"     

5.3 Dates of release

checkTCGA('Dates')
 [1] "2011-10-26" "2011-11-15" "2011-11-28" "2011-12-06" "2011-12-30" "2012-01-10" "2012-01-24" "2012-02-17" "2012-03-06" "2012-03-21" "2012-04-12"
[12] "2012-04-25" "2012-05-15" "2012-05-25" "2012-06-06" "2012-06-23" "2012-07-07" "2012-07-25" "2012-08-04" "2012-08-25" "2012-09-13" "2012-10-04"
[23] "2012-10-18" "2012-10-20" "2012-10-24" "2012-11-02" "2012-11-14" "2012-12-06" "2012-12-21" "2013-01-16" "2013-02-03" "2013-02-22" "2013-03-09"
[34] "2013-03-26" "2013-04-06" "2013-04-21" "2013-05-08" "2013-05-23" "2013-06-06" "2013-06-23" "2013-07-15" "2013-08-09" "2013-09-23" "2013-10-10"
[45] "2013-11-14" "2013-12-10" "2014-01-15" "2014-02-15" "2014-03-16" "2014-04-16" "2014-05-18" "2014-06-14" "2014-07-15" "2014-09-02" "2014-10-17"
[56] "2014-12-06" "2015-02-02" "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21"
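The release date used in section 3.1 is selected with tail(checkTCGA('Dates'), 2)[1]. A short offline sketch of what that expression does, using only the last three dates from the output above in place of a live checkTCGA() call:

```r
# The last three release dates copied from the output above.
dates <- c("2015-04-02", "2015-06-01", "2015-08-21")

# tail(dates, 2) keeps the two newest dates; [1] picks the older of the two,
# that is, the second newest release.
releaseDate <- tail(dates, 2)[1]

# The newest release can be found by comparing the values as Date objects.
newest <- as.character(max(as.Date(dates)))
```

With these inputs releaseDate is "2015-06-01", which matches the fallback date given in the comment in section 3.1.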

5.4 Names of available DataSets

checkTCGA('DataSets', 'ACC', releaseDate) %>%
    length()
[1] 138
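With 138 names, it helps to filter the list before choosing a value for the dataSet argument. The sketch below assumes that the result is a character vector of names, which the use of length() above suggests but which may differ between package versions. It uses a small stand-in vector: the first two names come from the download examples in section 3.1, and the third is hypothetical.

```r
# Stand-in for checkTCGA('DataSets', 'ACC', releaseDate).
# The first two names appear in the download examples; the third is made up.
dataSets <- c(
    "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level",
    "Mutation_Packager_Calls.Level",
    "Some_Other_DataSet.Level"
)

# Names containing the literal string "rnaseqv2".
rnaseqSets <- grep("rnaseqv2", dataSets, value = TRUE, fixed = TRUE)

# Names starting with "Mutation".
mutationSets <- grep("^Mutation", dataSets, value = TRUE)
```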