Updates

The TCGA data was recently moved from the DCC server to the National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Portal. In this version of the package, we rewrote all the functions that accessed the old TCGA server so that they use the GDC instead.

The GDC, which receives, processes, harmonizes, and distributes clinical, biospecimen, and genomic data from multiple cancer research programs, has data from the following programs:

The big change is that the GDC data is harmonized against GRCh38. However, not all data has been harmonized yet. The old TCGA data can be accessed through the GDC Legacy Archive, where the majority of the data can be found.

More information about the project can be found in the GDC FAQs.

The functions TCGAquery, TCGAdownload, TCGAprepare, TCGAquery_maf and TCGAquery_clinical were replaced by GDCquery, GDCdownload, GDCprepare, GDCquery_Maf and GDCquery_clinic.

The new functions can access both the GDC and the GDC Legacy Archive.

Note: Not all the examples in this vignette were updated.

Introduction

Motivation: The Cancer Genome Atlas (TCGA) provides us with an enormous collection of data sets, not only spanning a large number of cancers but also a large number of experimental platforms. Even though the data can be accessed and downloaded from the database, the possibility to analyse these downloaded data directly in one single R package has not yet been available.

TCGAbiolinks consists of three parts or levels. First, we provide different options to query and download relevant TCGA data from all currently available platforms, and to pre-process them for commonly used bioinformatics packages in Bioconductor or CRAN. Second, the package integrates different data types and supports different types of analysis across all platforms, such as differential expression, network inference or survival analysis, and then visualizes the results. Third, we added a social level where researchers can find others with similar interests in the bioinformatics community, look for validation of their results in the literature in PubMed, and retrieve questions and answers from sites such as support.bioconductor.org, biostars.org and Stack Overflow.

This document describes how to search, download and analyze TCGA data using the TCGAbiolinks package.

Installation

To install use the code below.

source("https://bioconductor.org/biocLite.R")
biocLite("TCGAbiolinks")

Citation

Please cite TCGAbiolinks package:

Related publications to this package:

Also, if you have used ELMER analysis please cite:

GDCquery: Searching TCGA open-access data

GDCquery: Searching GDC data for download

You can easily search GDC data using the GDCquery function.

Using a summary of filters as used in the TCGA portal, the function works with the following arguments:

The next subsections will detail each of the search arguments. Below, we show some search examples:

query <- GDCquery(project = "TARGET-AML",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")

# Using legacy
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation", 
                  platform = "Illumina Human Methylation 27", 
                  legacy = TRUE,
                  barcode = c("TCGA-02-0047-01A-01D-0186-05","TCGA-06-2559-01A-01D-0788-05"))

The list of projects is below:

List of projects

| released | state | dbgap_accession_number | primary_site | disease_type | project_id | name | tumor |
|---|---|---|---|---|---|---|---|
| True | legacy | NA | Soft Tissue | Sarcoma | TCGA-SARC | Sarcoma | SARC |
| True | legacy | NA | Pleura | Mesothelioma | TCGA-MESO | Mesothelioma | MESO |
| True | legacy | NA | Colorectal | Rectum Adenocarcinoma | TCGA-READ | Rectum Adenocarcinoma | READ |
| True | legacy | NA | Kidney | Kidney Renal Papillary Cell Carcinoma | TCGA-KIRP | Kidney Renal Papillary Cell Carcinoma | KIRP |
| True | legacy | phs000467 | Nervous System | Neuroblastoma | TARGET-NBL | Neuroblastoma | NBL |
| True | legacy | NA | Pancreas | Pancreatic Adenocarcinoma | TCGA-PAAD | Pancreatic Adenocarcinoma | PAAD |
| True | legacy | NA | Brain | Glioblastoma Multiforme | TCGA-GBM | Glioblastoma Multiforme | GBM |
| True | legacy | NA | Adrenal Gland | Adrenocortical Carcinoma | TCGA-ACC | Adrenocortical Carcinoma | ACC |
| True | legacy | phs000468 | Bone | Osteosarcoma | TARGET-OS | Osteosarcoma | OS |
| True | legacy | NA | Cervix | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma | TCGA-CESC | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma | CESC |
| True | legacy | phs000470 | Kidney | Rhabdoid Tumor | TARGET-RT | Rhabdoid Tumor | RT |
| True | legacy | NA | Breast | Breast Invasive Carcinoma | TCGA-BRCA | Breast Invasive Carcinoma | BRCA |
| True | legacy | NA | Esophagus | Esophageal Carcinoma | TCGA-ESCA | Esophageal Carcinoma | ESCA |
| True | legacy | NA | Lymph Nodes | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | TCGA-DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | DLBC |
| True | legacy | NA | Kidney | Kidney Chromophobe | TCGA-KICH | Kidney Chromophobe | KICH |
| True | legacy | NA | Kidney | Kidney Renal Clear Cell Carcinoma | TCGA-KIRC | Kidney Renal Clear Cell Carcinoma | KIRC |
| True | legacy | NA | Eye | Uveal Melanoma | TCGA-UVM | Uveal Melanoma | UVM |
| True | legacy | phs000465 | Blood | Acute Myeloid Leukemia | TARGET-AML | Acute Myeloid Leukemia | AML |
| True | legacy | NA | Bone Marrow | Acute Myeloid Leukemia | TCGA-LAML | Acute Myeloid Leukemia | LAML |
| True | legacy | NA | Skin | Skin Cutaneous Melanoma | TCGA-SKCM | Skin Cutaneous Melanoma | SKCM |
| True | legacy | NA | Adrenal Gland | Pheochromocytoma and Paraganglioma | TCGA-PCPG | Pheochromocytoma and Paraganglioma | PCPG |
| True | legacy | NA | Colorectal | Colon Adenocarcinoma | TCGA-COAD | Colon Adenocarcinoma | COAD |
| True | legacy | NA | Uterus | Uterine Carcinosarcoma | TCGA-UCS | Uterine Carcinosarcoma | UCS |
| True | legacy | NA | Lung | Lung Squamous Cell Carcinoma | TCGA-LUSC | Lung Squamous Cell Carcinoma | LUSC |
| True | legacy | NA | Brain | Brain Lower Grade Glioma | TCGA-LGG | Brain Lower Grade Glioma | LGG |
| True | legacy | NA | Head and Neck | Head and Neck Squamous Cell Carcinoma | TCGA-HNSC | Head and Neck Squamous Cell Carcinoma | HNSC |
| True | legacy | NA | Testis | Testicular Germ Cell Tumors | TCGA-TGCT | Testicular Germ Cell Tumors | TGCT |
| True | legacy | phs000466 | Kidney | Clear Cell Sarcoma of the Kidney | TARGET-CCSK | Clear Cell Sarcoma of the Kidney | CCSK |
| True | legacy | NA | Thyroid | Thyroid Carcinoma | TCGA-THCA | Thyroid Carcinoma | THCA |
| True | legacy | NA | Liver | Liver Hepatocellular Carcinoma | TCGA-LIHC | Liver Hepatocellular Carcinoma | LIHC |
| True | legacy | NA | Bladder | Bladder Urothelial Carcinoma | TCGA-BLCA | Bladder Urothelial Carcinoma | BLCA |
| True | legacy | NA | Uterus | Uterine Corpus Endometrial Carcinoma | TCGA-UCEC | Uterine Corpus Endometrial Carcinoma | UCEC |
| True | legacy | phs000471 | Kidney | High-Risk Wilms Tumor | TARGET-WT | High-Risk Wilms Tumor | WT |
| True | legacy | NA | Prostate | Prostate Adenocarcinoma | TCGA-PRAD | Prostate Adenocarcinoma | PRAD |
| True | legacy | NA | Ovary | Ovarian Serous Cystadenocarcinoma | TCGA-OV | Ovarian Serous Cystadenocarcinoma | OV |
| True | legacy | NA | Thymus | Thymoma | TCGA-THYM | Thymoma | THYM |
| True | legacy | NA | Bile Duct | Cholangiocarcinoma | TCGA-CHOL | Cholangiocarcinoma | CHOL |
| True | legacy | NA | Stomach | Stomach Adenocarcinoma | TCGA-STAD | Stomach Adenocarcinoma | STAD |
| True | legacy | NA | Lung | Lung Adenocarcinoma | TCGA-LUAD | Lung Adenocarcinoma | LUAD |

The list of sample.type is below:

List of sample types

| tissue.code | shortLetterCode | tissue.definition |
|---|---|---|
| 01 | TP | Primary solid Tumor |
| 02 | TR | Recurrent Solid Tumor |
| 03 | TB | Primary Blood Derived Cancer - Peripheral Blood |
| 04 | TRBM | Recurrent Blood Derived Cancer - Bone Marrow |
| 05 | TAP | Additional - New Primary |
| 06 | TM | Metastatic |
| 07 | TAM | Additional Metastatic |
| 08 | THOC | Human Tumor Original Cells |
| 09 | TBM | Primary Blood Derived Cancer - Bone Marrow |
| 10 | NB | Blood Derived Normal |
| 11 | NT | Solid Tissue Normal |
| 12 | NBC | Buccal Cell Normal |
| 13 | NEBV | EBV Immortalized Normal |
| 14 | NBM | Bone Marrow Normal |
| 20 | CELLC | Control Analyte |
| 40 | TRB | Recurrent Blood Derived Cancer - Peripheral Blood |
| 50 | CELL | Cell Lines |
| 60 | XP | Primary Xenograft Tissue |
| 61 | XCL | Cell Line Derived Xenograft Tissue |
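In a TCGA barcode, the sample type is encoded in the fourth hyphen-separated field, whose first two digits are the tissue.code listed above. The following base R sketch illustrates that lookup. The helper name `sample_type_code` and the small lookup vector are illustrative assumptions and are not part of TCGAbiolinks.

```r
# Subset of the tissue.code -> shortLetterCode table (illustrative only)
short_letter <- c("01" = "TP", "02" = "TR", "03" = "TB",
                  "10" = "NB", "11" = "NT")

# Hypothetical helper: extract the two-digit sample type code from a barcode.
# Assumes the fourth hyphen-separated field starts with the code, e.g. "01A".
sample_type_code <- function(barcode) {
  fields <- strsplit(barcode, "-", fixed = TRUE)
  vapply(fields, function(f) substr(f[4], 1, 2), character(1))
}

barcodes <- c("TCGA-02-0047-01A-01D-0186-05",
              "TCGA-G9-6336-11A-11R-1789-07")
codes <- sample_type_code(barcodes)
types <- unname(short_letter[codes])
print(data.frame(barcode = barcodes, code = codes, type = types))
```

Codes that are missing from the lookup vector come back as NA, which makes unexpected sample types easy to spot.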

The other fields (data.category, data.type, workflow.type, platform, file.type) can be found below. Please note that these tables are still incomplete.

Harmonized data

| Data.category | Data.type | Workflow Type |
|---|---|---|
| Transcriptome Profiling | Gene Expression Quantification | HTSeq - Counts |
| | | HTSeq - FPKM-UQ |
| | | HTSeq - FPKM |
| | Isoform Expression Quantification | - |
| | miRNA Expression Quantification | - |
| Copy number variation | Copy Number Segment | |
| | Masked Copy Number Segment | |
| Simple Nucleotide Variation | | |
| Raw Sequencing Data | | |
| Biospecimen | | |
| Clinical | | |

Legacy data

| Data.category | Data.type | Platform | file.type |
|---|---|---|---|
| Transcriptome Profiling | | | |
| Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg18.seg |
| | - | Affymetrix SNP Array 6.0 | hg18.seg |
| | - | Affymetrix SNP Array 6.0 | nocnv_hg19.seg |
| | - | Affymetrix SNP Array 6.0 | hg19.seg |
| | - | Illumina HiSeq | - |
| Simple Nucleotide Variation | | | |
| Raw Sequencing Data | | | |
| Biospecimen | | | |
| Clinical | | | |
| Protein expression | | MDA RPPA Core | - |
| Gene expression | Gene expression quantification | Illumina HiSeq | normalized_results |
| | | Illumina HiSeq | results |
| | | HT_HG-U133A | - |
| | | AgilentG4502A_07_2 | - |
| | | AgilentG4502A_07_1 | - |
| | | HuEx-1_0-st-v2 | FIRMA.txt |
| | | | gene.txt |
| | Isoform expression quantification | | |
| | miRNA gene quantification | | |
| | Exon junction quantification | | |
| | Exon quantification | | |
| | miRNA isoform quantification | | |
| DNA methylation | | Illumina Human Methylation 450 | - |
| | | Illumina Human Methylation 27 | - |
| | | Illumina DNA Methylation OMA003 CPI | - |
| | | Illumina DNA Methylation OMA002 CPI | - |
| | | Illumina Hi Seq | |
| Raw Microarray Data | | | |
| Structural Rearrangement | | | |
| Other | | | |

GDCquery_Maf: Working with mutation files

To download Mutation Annotation Format (MAF) files, we provide the GDCquery_Maf function. Briefly, it gets the open-access MAF files from https://gdc-docs.nci.nih.gov/Data/Release_Notes/Data_Release_Notes/

The arguments that will be used to filter the data are:

mutation <- GDCquery_Maf(tumor = "LGG")
mutation <- GDCquery_Maf(tumor = "LGG", save.csv = TRUE)

This mutation data can be visualized as an oncoprint using the ComplexHeatmap package (Gu, Eils, and Schlesner 2016).

mut <- GDCquery_Maf(tumor = "ACC")
clin <- GDCquery_clinic("TCGA-ACC","clinical")
clin <- clin[,c("bcr_patient_barcode","disease","gender","tumor_stage","race","vital_status")]
TCGAvisualize_oncoprint(mut = mut, genes = mut$Hugo_Symbol[1:20],
                        filename = "oncoprint.pdf",
                        annotation = clin,
                        color=c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown"),
                        rows.font.size=10,
                        width = 5,
                        heatmap.legend.side = "right",
                        dist.col = 0,
                        label.font.size = 10)

GDCquery_clinic and GDCprepare_clinic: Working with clinical data.

In the GDC database, clinical data can be retrieved from two sources:

There are two main differences:

clin.query <- GDCquery(project = "TCGA-BLCA", data.category = "Clinical", barcode = "TCGA-FD-A5C0")
 json  <- tryCatch(GDCdownload(clin.query), 
                   error = function(e) GDCdownload(clin.query, method = "client"))
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient")
clinical.patient.followup <- GDCprepare_clinic(clin.query, clinical.info = "follow_up")
clinical.index <- GDCquery_clinic("TCGA-BLCA")
# Example of the second difference:
clinical.patient[,c("vital_status","days_to_death","days_to_last_followup")]
##   vital_status days_to_death days_to_last_followup
## 1        Alive                                 209
clinical.patient.followup[,c("vital_status","days_to_death","days_to_last_followup")]
##   vital_status days_to_death days_to_last_followup
## 1        Alive                                 407
## 2         Dead           550
# indexed data is equivalent to follow ups information
clinical.index[clinical.index$submitter_id=="TCGA-FD-A5C0",
               c("vital_status","days_to_death","days_to_last_follow_up")]
##     vital_status days_to_death days_to_last_follow_up
## 197         dead           550                    407
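The output above suggests that the indexed record combines information spread over several follow-up rows. The base R sketch below shows one possible way to collapse follow-up rows into a single summary row that reproduces the indexed values for this patient. The helper `collapse_followup`, the `bcr_patient_barcode` column in the toy data, and the collapsing rule itself are assumptions made for illustration; this is not how the GDC or TCGAbiolinks builds the index.

```r
# Toy copy of the follow-up rows printed above for TCGA-FD-A5C0
followup <- data.frame(
  bcr_patient_barcode   = c("TCGA-FD-A5C0", "TCGA-FD-A5C0"),  # assumed column
  vital_status          = c("Alive", "Dead"),
  days_to_death         = c(NA, 550),
  days_to_last_followup = c(407, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse several follow-up rows into one summary row.
# Rule (an assumption): a patient is dead if any row says so, and each time
# field takes its largest reported value.
collapse_followup <- function(df) {
  max_or_na <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
  data.frame(
    vital_status           = if (any(df$vital_status == "Dead")) "dead" else "alive",
    days_to_death          = max_or_na(df$days_to_death),
    days_to_last_follow_up = max_or_na(df$days_to_last_followup),
    stringsAsFactors = FALSE
  )
}

summary_row <- collapse_followup(followup)
print(summary_row)
```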

You can retrieve clinical data using the GDCquery_clinic function. It returns only the indexed GDC clinical data, which is the same as clicking the download clinical button in the data portal. This function does not parse the XML files; for that, see the GDCprepare_clinic function.

The arguments of this function are:

Examples of use:

clin <- GDCquery_clinic("TCGA-ACC", type = "clinical", save.csv = TRUE)
clin <- GDCquery_clinic("TCGA-ACC", type = "biospecimen", save.csv = TRUE)

To parse the TCGA clinical XML files, please use the GDCprepare_clinic function. It takes as arguments the query object and the clinical information to retrieve. The possible values for the clinical.info argument are:

query <- GDCquery(project = "TCGA-COAD", 
                  data.category = "Clinical", 
                  barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")
clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")

Also, some functions to work with clinical data are provided.

For example, the function TCGAquery_SampleTypes filters barcodes based on the sample types given in the typesample argument. Some of the possible typesample values are TP (Primary solid Tumor), TR (Recurrent Solid Tumor) and NT (Solid Tissue Normal). Please see ?TCGAquery_SampleTypes for all the possible values.

The function TCGAquery_MatchedCoupledSampleTypes keeps only the samples of patients that have every typesample given as argument. For example, if TP and TR are set as typesample, the barcodes of a patient are returned only if the patient has both types: a patient with a TP but no TR yields no barcode, while a patient with both a TP and a TR yields both barcodes.
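The matching rule described above can be sketched in base R as follows. This is only an illustration of the logic, not the implementation used by TCGAquery_MatchedCoupledSampleTypes. The helper `matched_barcodes` is hypothetical, works on two-digit sample type codes rather than short letter codes, and assumes fixed-width TCGA barcodes in which characters 1 to 12 identify the patient and characters 14 to 15 hold the sample type code.

```r
# Hypothetical helper: keep barcodes of patients that have every requested
# sample type code (e.g. "01" = TP and "11" = NT).
matched_barcodes <- function(barcodes, codes) {
  patient <- substr(barcodes, 1, 12)   # e.g. "TCGA-G9-6336"
  code    <- substr(barcodes, 14, 15)  # e.g. "01"
  keep <- code %in% codes
  # For each patient, check that all requested codes are present
  has_all <- tapply(code[keep], patient[keep],
                    function(x) all(codes %in% x))
  complete <- names(has_all)[has_all]
  barcodes[keep & patient %in% complete]
}

bar <- c("TCGA-G9-6336-01A-11R-1789-07",  # TP
         "TCGA-G9-6336-11A-11R-1789-07",  # NT, same patient
         "TCGA-G9-6378-02A-11R-1789-07",  # TR only
         "TCGA-G9-7336-11A-11R-1789-07")  # NT only
matched <- matched_barcodes(bar, c("01", "11"))
print(matched)
```

Only the first patient has both a primary tumor and a solid tissue normal sample, so only its two barcodes are kept.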

The function TCGAquery_clinicFilt filters your data and returns the list of barcodes that match all the filters.

The arguments of TCGAquery_clinicFilt are:

An example of the function is below:

bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",  
         "TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
         "TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
         "TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
         "TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
         "TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
         "TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
         "TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
         "TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
         "TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
         "TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
         "TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
         "TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
         "TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
         "TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
         "TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
         "TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")

S <- TCGAquery_SampleTypes(bar,"TP")
S2 <- TCGAquery_SampleTypes(bar,"NB")

# Retrieve multiple tissue types  NOT FROM THE SAME PATIENTS
SS <- TCGAquery_SampleTypes(bar,c("TP","NB"))

# Retrieve multiple tissue types  FROM THE SAME PATIENTS
SSS <- TCGAquery_MatchedCoupledSampleTypes(bar,c("NT","TP"))

# Get clinical data
clinical_brca_data <- TCGAquery_clinic("brca","clinical_patient")
female_erpos_herpos <- TCGAquery_clinicFilt(bar,
                                            clinical_brca_data, 
                                            HER="Positive", 
                                            gender="FEMALE", 
                                            ER="Positive")

The result is shown below:

## ER Positive Samples:
##   TCGA-BH-A1ES 
##   TCGA-BH-A0BZ 
##   TCGA-D8-A1JS 
##   TCGA-AN-A0FN 
##   TCGA-AR-A2LQ 
##   TCGA-BH-A1F8 
##   TCGA-AR-A24T 
##   TCGA-AO-A0J5 
##   TCGA-BH-A0B4
## HER Positive Samples:
##   TCGA-AN-A0FN 
##   TCGA-BH-A1F8
## GENDER FEMALE Samples:
##   TCGA-BH-A1ES 
##   TCGA-BH-A1F0 
##   TCGA-BH-A0BZ 
##   TCGA-D8-A1JS 
##   TCGA-AN-A0FN 
##   TCGA-AR-A2LQ 
##   TCGA-AR-A2LH 
##   TCGA-BH-A1F8 
##   TCGA-AR-A24T 
##   TCGA-AO-A0J5 
##   TCGA-B6-A1KN
## [1] "TCGA-AN-A0FN" "TCGA-BH-A1F8"

TCGAquery_subtype: Working with molecular subtypes data.

The Cancer Genome Atlas (TCGA) Research Network has reported integrated genome-wide studies of various diseases. We have added some of the subtypes defined by these reports to our package. Subtype data was added for the following tumors: ACC (Cancer Genome Atlas Research Network et al. 2016), BRCA (Cancer Genome Atlas Research Network et al. 2012c), COAD (Cancer Genome Atlas Research Network et al. 2012b), GBM (Ceccarelli et al. 2016), HNSC (Cancer Genome Atlas Research Network et al. 2015a), KICH (Davis et al. 2014), KIRC (Cancer Genome Atlas Research Network et al. 2013a), KIRP (Linehan et al. 2016), LGG (Ceccarelli et al. 2016), LUAD (Cancer Genome Atlas Research Network et al. 2014b), LUSC (Cancer Genome Atlas Research Network et al. 2012a), PRAD (Cancer Genome Atlas Research Network et al. 2015c), READ (Cancer Genome Atlas Research Network et al. 2012b), SKCM (Cancer Genome Atlas Research Network et al. 2015b), STAD (Cancer Genome Atlas Research Network et al. 2014a), THCA (Cancer Genome Atlas Research Network et al. 2014c) and UCEC (Cancer Genome Atlas Research Network et al. 2013b).
These subtypes are automatically added to the SummarizedExperiment object by TCGAprepare, but you can also use the TCGAquery_subtype function to retrieve this information.

# Check with subtypes from TCGAprepare and update examples
GBM_path_subtypes <- TCGAquery_subtype(tumor = "gbm")

LGG_path_subtypes <- TCGAquery_subtype(tumor = "lgg")

A subset of the LGG subtypes is shown below:

Table with LGG molecular subtypes from TCGAquery_subtype

| | patient | Tissue.source.site | Study | BCR |
|---|---|---|---|---|
| 1 | TCGA-CS-4938 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |
| 2 | TCGA-CS-4941 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |
| 3 | TCGA-CS-4942 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |
| 4 | TCGA-CS-4943 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |
| 5 | TCGA-CS-4944 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |
| 6 | TCGA-CS-5390 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |
| 7 | TCGA-CS-5393 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |
| 8 | TCGA-CS-5394 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |
| 9 | TCGA-CS-5395 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |
| 10 | TCGA-CS-5396 | Thomas Jefferson University | Brain Lower Grade Glioma | IGC |

GDCdownload: Downloading GDC data

You can easily download data using the GDCdownload function. It uses the GDC Data Transfer Tool to download GDC data through a system call. For this reason, the progress output sometimes stops updating, which does not mean that the download has stopped. Once the process has finished, it signals R.

The data from the query will be saved in a folder: project/source/data.category (where source is harmonized or legacy).

The arguments are:

GDCdownload: Example of use

query <- GDCquery(project = "TCGA-ACC", data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment",
                  barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01"))
GDCdownload(query)
data <- GDCprepare(query)

query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", file.type  = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                  legacy = TRUE)
GDCdownload(query)
z <- GDCprepare(query)

# A function to download only 20 samples
legacyPipeline <- function(project, data.category, platform, n = 20){
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE)
    cases <- query$results[[1]]$cases[1:n] # Get the first n samples from the search
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE,
                      chunks.per.download = 5,
                      barcode = cases)
    GDCdownload(query)
    data <- GDCprepare(query)
    return(data)
}
# DNA methylation
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 27")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 450")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA003 CPI")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA002 CPI")

GDCprepare: Preparing the data

This function is still under development and does not work for all cases. See the tables below for the status. Examples of query, download and prepare can be found in this gist.

Harmonized data

| Data.category | Data.type | Workflow Type | Status |
|---|---|---|---|
| Transcriptome Profiling | Gene Expression Quantification | HTSeq - Counts | Data frame or SE (losing 5% of information when mapping to genomic regions) |
| | | HTSeq - FPKM-UQ | Returning only a (losing 5% of information when mapping to genomic regions) |
| | | HTSeq - FPKM | Returning only a (losing 5% of information when mapping to genomic regions) |
| | Isoform Expression Quantification | Not needed | |
| | miRNA Expression Quantification | Not needed | Returning only a dataframe for the moment |
| Copy number variation | Copy Number Segment | | Returning only a dataframe for the moment |
| | Masked Copy Number Segment | | Returning only a dataframe for the moment |
| Simple Nucleotide Variation | | | |
| Raw Sequencing Data | | | |
| Biospecimen | | | |
| Clinical | | | |

Legacy data

| Data.category | Data.type | Platform | file.type | Status |
|---|---|---|---|---|
| Transcriptome Profiling | | | | |
| Copy number variation | - | Affymetrix SNP Array 6.0 | nocnv_hg18.seg | Working |
| | - | Affymetrix SNP Array 6.0 | hg18.seg | Working |
| | - | Affymetrix SNP Array 6.0 | nocnv_hg19.seg | Working |
| | - | Affymetrix SNP Array 6.0 | hg19.seg | Working |
| | - | Illumina HiSeq | - | Working |
| Simple Nucleotide Variation | | | | |
| Raw Sequencing Data | | | | |
| Biospecimen | | | | |
| Clinical | | | | |
| Protein expression | | MDA RPPA Core | - | Working |
| Gene expression | Gene expression quantification | Illumina HiSeq | normalized_results | Working |
| | | Illumina HiSeq | results | Working |
| | | HT_HG-U133A | - | Working |
| | | AgilentG4502A_07_2 | - | Data frame only |
| | | AgilentG4502A_07_1 | - | Data frame only |
| | | HuEx-1_0-st-v2 | FIRMA.txt | Not Preparing |
| | | | gene.txt | Not Preparing |
| | Isoform expression quantification | | | |
| | miRNA gene quantification | | | |
| | Exon junction quantification | | | |
| | Exon quantification | | | |
| | miRNA isoform quantification | | | |
| DNA methylation | | Illumina Human Methylation 450 | Not used | Working |
| | | Illumina Human Methylation 27 | Not used | Working |
| | | Illumina DNA Methylation OMA003 CPI | Not used | Working |
| | | Illumina DNA Methylation OMA002 CPI | Not used | Working |
| | | Illumina Hi Seq | | Not working |
| Raw Microarray Data | | | | |
| Structural Rearrangement | | | | |
| Other | | | | |

You can easily read the downloaded data using the GDCprepare function. This function prepares the data into a SummarizedExperiment (Huber et al. 2015) object for downstream analysis. For the moment, this function works only with level 3 data.

The arguments are:

library(TCGAbiolinks)
# Downloading and prepare 
query <- GDCquery(project = "TARGET-AML", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(query)
data <- GDCprepare(query)

# Downloading and prepare using legacy
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Protein expression",
                  legacy = TRUE, 
                  barcode = c("TCGA-OX-A56R-01A-21-A44T-20","TCGA-08-0357-01A-21-1898-20"))
GDCdownload(query)
data <- GDCprepare(query, save = TRUE, 
                   save.filename = "gbmProteinExpression.rda",
                   remove.files.prepared = TRUE)

# Downloading and prepare using legacy
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation", 
                  platform = "Illumina Human Methylation 27",legacy = TRUE,
                  barcode = c("TCGA-02-0047-01A-01D-0186-05","TCGA-06-2559-01A-01D-0788-05"))
GDCdownload(query)
data <- GDCprepare(query)

TCGAanalyze: Analyze data from TCGA.

You can easily analyze data using the following functions:

TCGAanalyze_Preprocessing: Preprocessing of Gene Expression data (IlluminaHiSeq_RNASeqV2)

You can easily search TCGA samples, download and prepare a matrix of gene expression.

# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
                 "TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
                 "TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
                 "TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
                 "TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode 
query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Gene expression",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  barcode = listSamples, 
                  legacy = TRUE)

# Download the files for the listed barcodes (platform IlluminaHiSeq_RNASeqV2)
GDCdownload(query)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCARnaseq_assay <- GDCprepare(query)

# assay() is provided by the SummarizedExperiment package
library(SummarizedExperiment)
BRCAMatrix <- assay(BRCARnaseq_assay,"raw_counts")

# For gene expression, if you need a correlation boxplot and an AAIC plot to define outliers, you can run
BRCARnaseq_CorOutliers <- TCGAanalyze_Preprocessing(BRCARnaseq_assay)

The result is shown below:

Example of a matrix of gene expression (10 genes in rows and 2 samples in columns)

| | TCGA-BH-A0DK-11A-13R-A089-07 | TCGA-E9-A1NG-11A-52R-A14M-07 |
|---|---|---|
| C6orf191\|253582 | 0 | 0 |
| SLC2A6\|11182 | 46 | 106 |
| SCRN2\|90507 | 2491 | 3545 |
| ALOXE3\|59344 | 45 | 5 |
| THSD7A\|221981 | 184 | 429 |
| TGFBR3\|7049 | 10545 | 19966 |
| MAP6D1\|79929 | 166 | 69 |
| FGF22\|27006 | 2 | 2 |
| C10orf11\|83938 | 102 | 229 |
| KLRK1\|22914 | 72 | 209 |
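The row names of this matrix combine a gene symbol and a numeric gene ID separated by a vertical bar. If the two parts are needed separately, a small base R sketch such as the one below can split them. The column names `symbol` and `entrez` are illustrative choices, and treating the second part as an Entrez gene ID is an assumption.

```r
# Row names copied from the example matrix
ids <- c("SLC2A6|11182", "SCRN2|90507", "TGFBR3|7049")

# Split on the literal "|" (fixed = TRUE avoids regex interpretation)
parts <- strsplit(ids, "|", fixed = TRUE)
genes <- data.frame(
  symbol = vapply(parts, `[`, character(1), 1),
  entrez = vapply(parts, `[`, character(1), 2),  # assumed to be Entrez IDs
  stringsAsFactors = FALSE
)
print(genes)
```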

The result from TCGAanalyze_Preprocessing is shown below:

TCGAanalyze_DEA & TCGAanalyze_LevelTab: Differential expression analysis (DEA)

Perform DEA (Differential expression analysis) to identify differentially expressed genes (DEGs) using the TCGAanalyze_DEA function.

TCGAanalyze_DEA performs DEA using the following functions from the edgeR package:

  1. edgeR::DGEList converts the count matrix into an edgeR object.
  2. edgeR::estimateCommonDisp each gene gets assigned the same dispersion estimate.
  3. edgeR::exactTest performs pair-wise tests for differential expression between two groups.
  4. edgeR::topTags takes the output from exactTest(), adjusts the raw p-values using the False Discovery Rate (FDR) correction, and returns the top differentially expressed genes.
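The four steps listed above can be tried directly with edgeR on simulated counts. The sketch below is a minimal illustration under stated assumptions: the count matrix is random, the sample and gene names are made up, and the code calls edgeR itself rather than TCGAanalyze_DEA, so it does not reproduce any additional processing the wrapper may perform.

```r
library(edgeR)

set.seed(1)
# Simulated counts: 100 genes (rows) x 6 samples (columns), made up for illustration
counts <- matrix(rnbinom(600, mu = 50, size = 5), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))
group <- factor(c("Normal", "Normal", "Normal", "Tumor", "Tumor", "Tumor"))

y  <- DGEList(counts = counts, group = group)    # 1. build the edgeR object
y  <- estimateCommonDisp(y)                      # 2. one dispersion for all genes
et <- exactTest(y, pair = c("Normal", "Tumor"))  # 3. pairwise exact test
tt <- topTags(et, n = nrow(counts),              # 4. FDR-adjusted results
              adjust.method = "BH")

head(tt$table)
```

The resulting table has one row per gene with the logFC, logCPM, PValue and FDR columns, which is the kind of output that is then filtered by FDR and logFC cut-offs.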

This function receives as arguments:

Next, we filter the output of dataDEGs by abs(logFC) >= 1 and use the TCGAanalyze_LevelTab function to create a table with the DEGs (differentially expressed genes), the log fold change (logFC), the false discovery rate (FDR), the gene expression level for the samples in Cond1type and Cond2type, and a Delta value (the difference in gene expression between the two conditions multiplied by logFC). Note that in the example table below, Delta is numerically close to logFC multiplied by the Tumor column.

# Downstream analysis using gene expression data  
# TCGA samples from IlluminaHiSeq_RNASeqV2 with type rsem.genes.results
# save(dataBRCA, geneInfo , file = "dataGeneExpression.rda")
library(TCGAbiolinks)

# normalization of genes
dataNorm <- TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo =  geneInfo)

# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                   typesample = c("NT"))

# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), 
                                   typesample = c("TP"))

# Diff.expr.analysis (DEA)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT],
                            mat2 = dataFilt[,samplesTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")

# DEGs table with expression values in normal and tumor samples
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs,"Tumor","Normal",
                                          dataFilt[,samplesTP],dataFilt[,samplesNT])

The result is shown below:

Table of DEGs after DEA

| mRNA | logFC | FDR | Tumor | Normal | Delta |
|---|---|---|---|---|---|
| FN1 | 2.88 | 1.296151e-19 | 347787.48 | 41234.12 | 1001017.3 |
| COL1A1 | 1.77 | 1.680844e-08 | 358010.32 | 89293.72 | 633086.3 |
| C4orf7 | 5.20 | 2.826474e-50 | 87821.36 | 2132.76 | 456425.4 |
| COL1A2 | 1.40 | 9.480478e-06 | 273385.44 | 91241.32 | 383242.9 |
| GAPDH | 1.32 | 3.290678e-05 | 179057.44 | 63663.00 | 236255.5 |
| CLEC3A | 6.79 | 7.971002e-74 | 27257.16 | 259.60 | 185158.6 |
| IGFBP5 | 1.24 | 1.060717e-04 | 128186.88 | 53323.12 | 158674.6 |
| CPB1 | 4.27 | 3.044021e-37 | 37001.76 | 2637.72 | 157968.8 |
| CARTPT | 6.72 | 1.023371e-72 | 21700.96 | 215.16 | 145872.8 |
| DCD | 7.26 | 1.047988e-80 | 19941.20 | 84.80 | 144806.3 |
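To make the structure of this table concrete, the base R sketch below builds a similar table from toy data. It filters by abs(logFC) >= 1, averages expression per condition and sorts by Delta. The rows above are numerically consistent with Delta being logFC multiplied by the Tumor column (for FN1, 2.88 x 347787.48 is about 1.0e6), so the sketch assumes that definition and also takes the absolute value of logFC; both choices are assumptions, not a statement about how TCGAanalyze_LevelTab is implemented. All gene names and values are made up.

```r
# Toy DEA output: one row per gene (values are made up for illustration)
deg <- data.frame(logFC = c(2.5, -0.4, -1.8),
                  FDR   = c(1e-6, 0.2, 1e-3),
                  row.names = c("geneA", "geneB", "geneC"))

# Toy expression matrices: genes in rows, samples in columns (filled by column)
tumor  <- matrix(c(400, 100, 20,
                   600, 120, 40), nrow = 3,
                 dimnames = list(rownames(deg), c("T1", "T2")))
normal <- matrix(c(80, 130, 100,
                   100, 150, 120), nrow = 3,
                 dimnames = list(rownames(deg), c("N1", "N2")))

# Keep genes with at least a two-fold change in either direction
keep <- rownames(deg)[abs(deg$logFC) >= 1]

level_tab <- data.frame(
  mRNA   = keep,
  logFC  = deg[keep, "logFC"],
  FDR    = deg[keep, "FDR"],
  Tumor  = rowMeans(tumor[keep, , drop = FALSE]),
  Normal = rowMeans(normal[keep, , drop = FALSE]),
  stringsAsFactors = FALSE
)

# Assumption: Delta = |logFC| * mean expression in the first condition (Tumor)
level_tab$Delta <- abs(level_tab$logFC) * level_tab$Tumor
level_tab <- level_tab[order(level_tab$Delta, decreasing = TRUE), ]
print(level_tab)
```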

TCGAanalyze_EAcomplete & TCGAvisualize_EAbarplot: Enrichment Analysis

Researchers, in order to better understand the underlying biological processes, often want to retrieve a functional profile of a set of genes that might have an important role. This can be done by performing an enrichment analysis.

We will perform an enrichment analysis on gene sets using the TCGAanalyze_EAcomplete function. Given a set of genes that are up-regulated under certain conditions, an enrichment analysis will identify classes of genes or proteins that are over-represented using annotations for that gene set.
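Over-representation of an annotation term in a gene list is commonly assessed with a hypergeometric test, which is equivalent to a one-sided Fisher's exact test. The base R sketch below shows that calculation for a single term with made-up numbers. It is a generic illustration of the idea and is not claimed to be the procedure used inside TCGAanalyze_EAcomplete.

```r
# Toy numbers (made up for illustration)
N <- 20000  # genes in the background
K <- 200    # background genes annotated to the term
n <- 500    # genes in the DEG list
k <- 15     # DEG-list genes annotated to the term

# P(X >= k) under the hypergeometric null of no enrichment
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Same test as a one-sided Fisher's exact test on the 2x2 table
# rows: annotated / not annotated; columns: in list / not in list
tab <- matrix(c(k, n - k, K - k, N - K - (n - k)), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

With these numbers only 5 annotated genes would be expected in the list by chance, so observing 15 gives a small p-value. When many terms are tested, the p-values would usually be adjusted for multiple testing, for example with p.adjust(method = "BH").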

To view the results you can use the TCGAvisualize_EAbarplot function as shown below.

library(TCGAbiolinks)
# Enrichment Analysis EA
# Gene Ontology (GO) and Pathway enrichment by DEGs list
Genelist <- rownames(dataDEGsFiltLevel)

system.time(ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist))

# Enrichment Analysis EA (TCGAVisualize)
# Gene Ontology (GO) and Pathway enrichment barPlot

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP), 
                        GOBPTab = ansEA$ResBP,
                        GOCCTab = ansEA$ResCC,
                        GOMFTab = ansEA$ResMF,
                        PathTab = ansEA$ResPat,
                        nRGTab = Genelist, 
                        nBar = 10)

The result is shown below: