Updates

Recently the TCGA data has been moved from the DCC server to The National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Portal In this version of the package, we rewrote all the functions that were acessing the old TCGA server to GDC.

The GDC, which receives, processes, harmonizes, and distributes clinical, biospecimen, and genomic data from multiple cancer research programs, has data from the following programs:

The Cancer Genome Atlas (TCGA)
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
the Cancer Genome Characterization Initiative (CGCI)

The big change is that the GDC data is harmonized against GRCh38. However, not all data has been harmonized yet. The old TCGA data can be acessed through GDC legacy Archive, in which the majority of data can be found.

More information about the project can be found in GCD FAQS

The functions TCGAquery, TCGAdownload, TCGAPrepare, TCGAquery_maf, TCGAquery_clinical, were replaced by GDCquery, GDCdownload, GDCprepare, GDCquery_maf, GDCquery_clinical.

And it can acess both the GDC and GDC Legacy Archive.

Note: Not all the examples in this vignette were updated.

Introduction

Motivation: The Cancer Genome Atlas (TCGA) provides us with an enormous collection of data sets, not only spanning a large number of cancers but also a large number of experimental platforms. Even though the data can be accessed and downloaded from the database, the possibility to analyse these downloaded data directly in one single R package has not yet been available.

TCGAbiolinks consists of three parts or levels. Firstly, we provide different options to query and download from TCGA relevant data from all currently platforms and their subsequent pre-processing for commonly used bio-informatics (tools) packages in Bioconductor or CRAN. Secondly, the package allows to integrate different data types and it can be used for different types of analyses dealing with all platforms such as diff.expression, network inference or survival analysis, etc, and then it allows to visualize the obtained results. Thirdly we added a social level where a researcher can found a similar intereset in a bioinformatic community, and allows both to find a validation of results in literature in pubmed and also to retrieve questions and answers from site such as support.bioconductor.org, biostars.org, stackoverflow,etc.

This document describes how to search, download and analyze TCGA data using the TCGAbiolinks package.

Installation

To install use the code below.

source("https://bioconductor.org/biocLite.R")
biocLite("TCGAbiolinks")

Citation

Please cite TCGAbiolinks package:

“TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data.” Nucleic acids research (2015): gkv1507. (Colaprico, Antonio and Silva, Tiago C. and Olsen, Catharina and Garofano, Luciano and Cava, Claudia and Garolini, Davide and Sabedot, Thais S. and Malta, Tathiane M. and Pagnotta, Stefano M. and Castiglioni, Isabella and Ceccarelli, Michele and Bontempi, Gianluca and Noushmehr, Houtan 2016)

Related publications to this package:

“TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages”. F1000Research 10.12688/f1000research.8923.1 (Silva, TC and Colaprico, A and Olsen, C and D’Angelo, F and Bontempi, G and Ceccarelli, M and Noushmehr, H 2016)

Also, if you have used ELMER analysis please cite:

Yao, L., Shen, H., Laird, P. W., Farnham, P. J., & Berman, B. P. “Inferring regulatory element landscapes and transcription factor networks from cancer methylomes.” Genome Biol 16 (2015): 105.
Yao, Lijing, Benjamin P. Berman, and Peggy J. Farnham. “Demystifying the secret mission of enhancers: linking distal regulatory elements to target genes.” Critical reviews in biochemistry and molecular biology 50.6 (2015): 550-573.

`GDCquery`: Searching TCGA open-access data

`GDCquery`: Searching GDC data for download

You can easily search GDC data using the GDCquery function.

Using a summary of filters as used in the TCGA portal, the function works with the following arguments:

project A valid project (see table below)
data.category A valid project (see list with getProjectSummary(project))
data.type A data type to filter the files to download
sample.type A sample type to filter the files to download (See table below)
workflow.type GDC workflow type
barcode A list of barcodes to filter the files to download
legacy Search in the legacy repository? Default: FALSE
platform Experimental data platform (HumanMethylation450, AgilentG4502A_07 etc). Used only for legacy repository
file.type A string to filter files, based on its names. Used only for legacy repository

The next subsections will detail each of the search arguments. Below, we show some search examples:

query <- GDCquery(project = "TARGET-AML",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")

# Using legacy
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation", 
                  platform = "Illumina Human Methylation 27", 
                  legacy = TRUE,
                  barcode = c("TCGA-02-0047-01A-01D-0186-05","TCGA-06-2559-01A-01D-0788-05"))

The list of projects is below:

List of projects
released	state	dbgap_accession_number	primary_site	disease_type	project_id	name	tumor
True	legacy	NA	Soft Tissue	Sarcoma	TCGA-SARC	Sarcoma	SARC
True	legacy	NA	Pleura	Mesothelioma	TCGA-MESO	Mesothelioma	MESO
True	legacy	NA	Colorectal	Rectum Adenocarcinoma	TCGA-READ	Rectum Adenocarcinoma	READ
True	legacy	NA	Kidney	Kidney Renal Papillary Cell Carcinoma	TCGA-KIRP	Kidney Renal Papillary Cell Carcinoma	KIRP
True	legacy	phs000467	Nervous System	Neuroblastoma	TARGET-NBL	Neuroblastoma	NBL
True	legacy	NA	Pancreas	Pancreatic Adenocarcinoma	TCGA-PAAD	Pancreatic Adenocarcinoma	PAAD
True	legacy	NA	Brain	Glioblastoma Multiforme	TCGA-GBM	Glioblastoma Multiforme	GBM
True	legacy	NA	Adrenal Gland	Adrenocortical Carcinoma	TCGA-ACC	Adrenocortical Carcinoma	ACC
True	legacy	phs000468	Bone	Osteosarcoma	TARGET-OS	Osteosarcoma	OS
True	legacy	NA	Cervix	Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma	TCGA-CESC	Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma	CESC
True	legacy	phs000470	Kidney	Rhabdoid Tumor	TARGET-RT	Rhabdoid Tumor	RT
True	legacy	NA	Breast	Breast Invasive Carcinoma	TCGA-BRCA	Breast Invasive Carcinoma	BRCA
True	legacy	NA	Esophagus	Esophageal Carcinoma	TCGA-ESCA	Esophageal Carcinoma	ESCA
True	legacy	NA	Lymph Nodes	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma	TCGA-DLBC	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma	DLBC
True	legacy	NA	Kidney	Kidney Chromophobe	TCGA-KICH	Kidney Chromophobe	KICH
True	legacy	NA	Kidney	Kidney Renal Clear Cell Carcinoma	TCGA-KIRC	Kidney Renal Clear Cell Carcinoma	KIRC
True	legacy	NA	Eye	Uveal Melanoma	TCGA-UVM	Uveal Melanoma	UVM
True	legacy	phs000465	Blood	Acute Myeloid Leukemia	TARGET-AML	Acute Myeloid Leukemia	AML
True	legacy	NA	Bone Marrow	Acute Myeloid Leukemia	TCGA-LAML	Acute Myeloid Leukemia	LAML
True	legacy	NA	Skin	Skin Cutaneous Melanoma	TCGA-SKCM	Skin Cutaneous Melanoma	SKCM
True	legacy	NA	Adrenal Gland	Pheochromocytoma and Paraganglioma	TCGA-PCPG	Pheochromocytoma and Paraganglioma	PCPG
True	legacy	NA	Colorectal	Colon Adenocarcinoma	TCGA-COAD	Colon Adenocarcinoma	COAD
True	legacy	NA	Uterus	Uterine Carcinosarcoma	TCGA-UCS	Uterine Carcinosarcoma	UCS
True	legacy	NA	Lung	Lung Squamous Cell Carcinoma	TCGA-LUSC	Lung Squamous Cell Carcinoma	LUSC
True	legacy	NA	Brain	Brain Lower Grade Glioma	TCGA-LGG	Brain Lower Grade Glioma	LGG
True	legacy	NA	Head and Neck	Head and Neck Squamous Cell Carcinoma	TCGA-HNSC	Head and Neck Squamous Cell Carcinoma	HNSC
True	legacy	NA	Testis	Testicular Germ Cell Tumors	TCGA-TGCT	Testicular Germ Cell Tumors	TGCT
True	legacy	phs000466	Kidney	Clear Cell Sarcoma of the Kidney	TARGET-CCSK	Clear Cell Sarcoma of the Kidney	CCSK
True	legacy	NA	Thyroid	Thyroid Carcinoma	TCGA-THCA	Thyroid Carcinoma	THCA
True	legacy	NA	Liver	Liver Hepatocellular Carcinoma	TCGA-LIHC	Liver Hepatocellular Carcinoma	LIHC
True	legacy	NA	Bladder	Bladder Urothelial Carcinoma	TCGA-BLCA	Bladder Urothelial Carcinoma	BLCA
True	legacy	NA	Uterus	Uterine Corpus Endometrial Carcinoma	TCGA-UCEC	Uterine Corpus Endometrial Carcinoma	UCEC
True	legacy	phs000471	Kidney	High-Risk Wilms Tumor	TARGET-WT	High-Risk Wilms Tumor	WT
True	legacy	NA	Prostate	Prostate Adenocarcinoma	TCGA-PRAD	Prostate Adenocarcinoma	PRAD
True	legacy	NA	Ovary	Ovarian Serous Cystadenocarcinoma	TCGA-OV	Ovarian Serous Cystadenocarcinoma	OV
True	legacy	NA	Thymus	Thymoma	TCGA-THYM	Thymoma	THYM
True	legacy	NA	Bile Duct	Cholangiocarcinoma	TCGA-CHOL	Cholangiocarcinoma	CHOL
True	legacy	NA	Stomach	Stomach Adenocarcinoma	TCGA-STAD	Stomach Adenocarcinoma	STAD
True	legacy	NA	Lung	Lung Adenocarcinoma	TCGA-LUAD	Lung Adenocarcinoma	LUAD

The list of sample.type is below:

List sample types
tissue.code	shortLetterCode	tissue.definition
01	TP	Primary solid Tumor
02	TR	Recurrent Solid Tumor
03	TB	Primary Blood Derived Cancer - Peripheral Blood
04	TRBM	Recurrent Blood Derived Cancer - Bone Marrow
05	TAP	Additional - New Primary
06	TM	Metastatic
07	TAM	Additional Metastatic
08	THOC	Human Tumor Original Cells
09	TBM	Primary Blood Derived Cancer - Bone Marrow
10	NB	Blood Derived Normal
11	NT	Solid Tissue Normal
12	NBC	Buccal Cell Normal
13	NEBV	EBV Immortalized Normal
14	NBM	Bone Marrow Normal
20	CELLC	Control Analyte
40	TRB	Recurrent Blood Derived Cancer - Peripheral Blood
50	CELL	Cell Lines
60	XP	Primary Xenograft Tissue
61	XCL	Cell Line Derived Xenograft Tissue

The other fields (data.category, data.type, workflow.type, platform, file.type) can be found below. Please, not that these tables are still incomplete.

Harmonized data

Data.category	Data.type	Workflow Type
Transcriptome Profiling	Gene Expression Quantification	HTSeq - Counts
		HTSeq - FPKM-UQ
		HTSeq - FPKM
	Isoform Expression Quantification	-
	miRNA Expression Quantification	-
Copy number variation	Copy Number Segment
	Masked Copy Number Segment
Simple Nucleotide Variation
Raw Sequencing Data
Biospecimen
Clinical

Legacy data

Data.category	Data.type	Platform	file.type
Transcriptome Profiling
Copy number variation	-	Affymetrix SNP Array 6.0	nocnv_hg18.seg
	-	Affymetrix SNP Array 6.0	hg18.seg
	-	Affymetrix SNP Array 6.0	nocnv_hg19.seg
	-	Affymetrix SNP Array 6.0	hg19.seg
	-	Illumina HiSeq	-
Simple Nucleotide Variation
Raw Sequencing Data
Biospecimen
Clinical
Protein expression		MDA RPPA Core	-
Gene expression	Gene expression quantification	Illumina HiSeq	normalized_results
		Illumina HiSeq	results
		HT_HG-U133A	-
		AgilentG4502A_07_2	-
		AgilentG4502A_07_1	-
		HuEx-1_0-st-v2	FIRMA.txt
			gene.txt
	Isoform expression quantification
	miRNA gene quantification
	Exon junction quantification
	Exon quantification
	miRNA isoform quantification

DNA methylation		Illumina Human Methylation 450	-
		Illumina Human Methylation 27	-
		Illumina DNA Methylation OMA003 CPI	-
		Illumina DNA Methylation OMA002 CPI	-
		Illumina Hi Seq
Raw Microarray Data
Structural Rearrangement
Other

`GDCquery_maf`: Working with mutation files.

In order to download the Mutation Annotation Format (MAF), we provide the user with the GDCquery_maf function. Briefly, it will get the open acess maf files from https://gdc-docs.nci.nih.gov/Data/Release_Notes/Data_Release_Notes/

The arguments that will be used to filter the data are:

tumor
save.csv: Save the maf as csv file

mutation <- GDCquery_Maf(tumor = "LGG")
mutation <- GDCquery_Maf(tumor = "LGG", save.csv = TRUE)

This mutation data can be visualized through a oncoprint from the ComplexHeatmap package (Gu, Zuguang and Eils, Roland and Schlesner, Matthias 2016).

mut <- GDCquery_Maf(tumor = "ACC")
clin <- GDCquery_clinic("TCGA-ACC","clinical")
clin <- clin[,c("bcr_patient_barcode","disease","gender","tumor_stage","race","vital_status")]
TCGAvisualize_oncoprint(mut = mut, genes = mut$Hugo_Symbol[1:20],
                        filename = "oncoprint.pdf",
                        annotation = clin,
                        color=c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown"),
                        rows.font.size=10,
                        width = 5,
                        heatmap.legend.side = "right",
                        dist.col = 0,
                        label.font.size = 10)

`GDCquery_clinic` and `GDCprepare_clinic`: Working with clinical data.

In GDC database the clinical data can be retrieved in two sources:

indexed clinical (from json)
xml files

there are two main differences:

XML has more information: radiation, drugs information, follow-ups, biospecimen, etc. So the indexed one is only a subset of the XML files
The indexed data contains the updated data with the follow up informaiton. For example: if the patient is alive in the first time clinical data was collect and the in the next follow-up he is dead, the indexed data will show dead. The XML will have two fields, one for the first time saying he is alive (in the clinical part) and the follow-up saying he is dead. You can see this case here: https://gist.github.com/tiagochst/cfa384297ff0950a6489e3f4b296ecd7

clin.query <- GDCquery(project = "TCGA-BLCA", data.category = "Clinical", barcode = "TCGA-FD-A5C0")
 json  <- tryCatch(GDCdownload(clin.query), 
                   error = function(e) GDCdownload(clin.query, method = "client"))
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient")
clinical.patient.followup <- GDCprepare_clinic(clin.query, clinical.info = "follow_up")
clinical.index <- GDCquery_clinic("TCGA-BLCA")

# Example of the second difference:
clinical.patient[,c("vital_status","days_to_death","days_to_last_followup")]

##   vital_status days_to_death days_to_last_followup
## 1        Alive                                 209

clinical.patient.followup[,c("vital_status","days_to_death","days_to_last_followup")]

##   vital_status days_to_death days_to_last_followup
## 1        Alive                                 407
## 2         Dead           550

# indexed data is equivalent to follow ups information
clinical.index[clinical.index$submitter_id=="TCGA-FD-A5C0",
               c("vital_status","days_to_death","days_to_last_follow_up")]

##     vital_status days_to_death days_to_last_follow_up
## 197         dead           550                    407

You can retrieve clinical data using the GDCquery_clinic function. This will get only the indexed GDC clinical data. This is the same as clicking on the download clinical buttun in the data portal. This means this function is not parsing the XML files, see GDCprepare_clinic function.

The arguments of this function are:

project: project_id (“TCGA-OV”,“TCGA-BRCA”,“TARGET-WT”, etc)
type: (“clinical”,“biospecimen”)
save.csv: save the clinical/biospeciment into a csv file. Pattern: project_type.csv

Examples of use:

clin <- GDCquery_clinic("TCGA-ACC", type = "clinical", save.csv = TRUE)
clin <- GDCquery_clinic("TCGA-ACC", type = "biospecimen", save.csv = TRUE)

To parse the TCGA clinical XML files, please use GDCprepare_clinic function. It receives as argument the query object and which clinical information we should retrieve. The possible values for the argument clinical.info are:

drug
admin
radiation
patient
stage_event
new_tumor_event
follow_up

query <- GDCquery(project = "TCGA-COAD", 
                  data.category = "Clinical", 
                  barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")
clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")

Also, some functions to work with clinical data are provided.

For example the function TCGAquery_SampleTypes will filter barcodes based on a type the argument typesample. Some of the typesamples possibilities are: TP (PRIMARY SOLID TUMOR), TR (RECURRENT SOLID TUMOR), NT (Solid Tissue Normal) etc. Please, see ?TCGAquery_SampleTypes for all the possible values.

The function TCGAquery_MatchedCoupledSampleTypes will filter the samples that have all the typesample provided as argument. For example, if TP and TR are set as typesample, the function will return the barcodes of a patient if it has both types. So, if it has a TP, but not a TR, no barcode will be returned. If it has a TP and a TR both barcodes are returned.

The function TCGAquery_clinicFilt will filter your data, returning the list of barcodes that matches all the filter.

The arguments of TCGAquery_clinicFilt are:

barcode List of barcodes
clinical_patient_data clinical patient data obtained with clinic function Ex: clinical_patient_data <- TCGAquery_clinic(“LGG”,“clinical_patient”)
HER her2 immunohistochemistry receptor status: “Positive” or “Negative”
gender “MALE” or “FEMALE”
PR Progesterone receptor status: “Positive” or “Negative”
stage Pathologic Stage: “stage_IX”, “stage_I”, “stage_IA”, “stage_IB”, “stage_IIX”, “stage_IIA”, “stage_IIB”, “stage_IIIX”,“stage_IIIA”, “stage_IIIB”, “stage_IIIC”, “stage_IV” -
ER Estrogen receptor status: “Positive” or “Negative”

An example of the function is below:

bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",  
         "TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
         "TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
         "TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
         "TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
         "TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
         "TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
         "TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
         "TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
         "TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
         "TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
         "TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
         "TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
         "TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
         "TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
         "TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
         "TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")

S <- TCGAquery_SampleTypes(bar,"TP")
S2 <- TCGAquery_SampleTypes(bar,"NB")

# Retrieve multiple tissue types  NOT FROM THE SAME PATIENTS
SS <- TCGAquery_SampleTypes(bar,c("TP","NB"))

# Retrieve multiple tissue types  FROM THE SAME PATIENTS
SSS <- TCGAquery_MatchedCoupledSampleTypes(bar,c("NT","TP"))

# Get clinical data
clinical_brca_data <- TCGAquery_clinic("brca","clinical_patient")
female_erpos_herpos <- TCGAquery_clinicFilt(bar,
                                            clinical_brca_data, 
                                            HER="Positive", 
                                            gender="FEMALE", 
                                            ER="Positive")

The result is shown below:

## ER Positive Samples:

##   TCGA-BH-A1ES 
##   TCGA-BH-A0BZ 
##   TCGA-D8-A1JS 
##   TCGA-AN-A0FN 
##   TCGA-AR-A2LQ 
##   TCGA-BH-A1F8 
##   TCGA-AR-A24T 
##   TCGA-AO-A0J5 
##   TCGA-BH-A0B4

## HER Positive Samples:

##   TCGA-AN-A0FN 
##   TCGA-BH-A1F8

## GENDER FEMALE Samples:

##   TCGA-BH-A1ES 
##   TCGA-BH-A1F0 
##   TCGA-BH-A0BZ 
##   TCGA-D8-A1JS 
##   TCGA-AN-A0FN 
##   TCGA-AR-A2LQ 
##   TCGA-AR-A2LH 
##   TCGA-BH-A1F8 
##   TCGA-AR-A24T 
##   TCGA-AO-A0J5 
##   TCGA-B6-A1KN

## [1] "TCGA-AN-A0FN" "TCGA-BH-A1F8"

`TCGAquery_subtype`: Working with molecular subtypes data.

The Cancer Genome Atlas (TCGA) Research Network has reported integrated genome-wide studies of various diseases. We have added some of the subtypes defined by these report in our package. The ACC(Cancer Genome Atlas Research Network and others 2016), BRCA (Cancer Genome Atlas Research Network and others 2012c), COAD (Cancer Genome Atlas Research Network and others 2012b), GBM (Ceccarelli, Michele and Barthel, Floris P and Malta, Tathiane M and Sabedot, Thais S and Salama, Sofie R and Murray, Bradley A and Morozova, Olena and Newton, Yulia and Radenbaugh, Amie and Pagnotta, Stefano M and others 2016), HNSC (Cancer Genome Atlas Research Network and others 2015a), KICH (Davis, Caleb F and Ricketts, Christopher J and Wang, Min and Yang, Lixing and Cherniack, Andrew D and Shen, Hui and Buhay, Christian and Kang, Hyojin and Kim, Sang Cheol and Fahey, Catherine C and others 2014), KIRC(Cancer Genome Atlas Research Network and others 2013a), KIRP (Linehan, W Marston and Spellman, Paul T and Ricketts, Christopher J and Creighton, Chad J and Fei, Suzanne S and Davis, Caleb and Wheeler, David A and Murray, Bradley A and Schmidt, Laura and Vocke, Cathy D and others 2016), LGG (Ceccarelli, Michele and Barthel, Floris P and Malta, Tathiane M and Sabedot, Thais S and Salama, Sofie R and Murray, Bradley A and Morozova, Olena and Newton, Yulia and Radenbaugh, Amie and Pagnotta, Stefano M and others 2016), LUAD (Cancer Genome Atlas Research Network and others 2014b), LUSC(Cancer Genome Atlas Research Network and others 2012a), PRAD(Cancer Genome Atlas Research Network and others 2015c), READ (Cancer Genome Atlas Research Network and others 2012b), SKCM (Cancer Genome Atlas Research Network and others 2015b), STAD (Cancer Genome Atlas Research Network and others 2014a), THCA (Cancer Genome Atlas Research Network and others 2014c), UCEC (Cancer Genome Atlas Research Network and others 2013b) tumors have data added. These subtypes will be automatically added in the summarizedExperiment object through TCGAprepare. But you can also use the TCGAquery_subtype function to retrive this information.

# Check with subtypes from TCGAprepare and update examples
GBM_path_subtypes <- TCGAquery_subtype(tumor = "gbm")

LGG_path_subtypes <- TCGAquery_subtype(tumor = "lgg")

A subset of the lgg subytpe is shown below:

Table with LGG molecular subtypes from TCGAquery_subtype
	patient	Tissue.source.site	Study	BCR
1	TCGA-CS-4938	Thomas Jefferson University	Brain Lower Grade Glioma	IGC
2	TCGA-CS-4941	Thomas Jefferson University	Brain Lower Grade Glioma	IGC
3	TCGA-CS-4942	Thomas Jefferson University	Brain Lower Grade Glioma	IGC
4	TCGA-CS-4943	Thomas Jefferson University	Brain Lower Grade Glioma	IGC
5	TCGA-CS-4944	Thomas Jefferson University	Brain Lower Grade Glioma	IGC
6	TCGA-CS-5390	Thomas Jefferson University	Brain Lower Grade Glioma	IGC
7	TCGA-CS-5393	Thomas Jefferson University	Brain Lower Grade Glioma	IGC
8	TCGA-CS-5394	Thomas Jefferson University	Brain Lower Grade Glioma	IGC
9	TCGA-CS-5395	Thomas Jefferson University	Brain Lower Grade Glioma	IGC
10	TCGA-CS-5396	Thomas Jefferson University	Brain Lower Grade Glioma	IGC

`GDCdownload`: Downloading GDC data

You can easily download data using the GDCdownload function. It uses GDC transfer tool to download gdc data doing a system call. For this reason some times the update will stop to show, which does not means that the download process has stopped. Once the process has finished it will give a signal to R.

The data from query will be save in a folder: project/source/data.category (where source is harmonized or legacy)

The arguments are:

query A query for GDCquery function
token.file A token file to download protect data (not test yet)
method Use API (POST method) or gdc client tool (API is faster, but the data might get corrupted in the download, and it might need to be executed again). Options: “api”, “client”
directory Directory/Folder where the data was downloaded. Default: GDCdata
chunks.per.download This will make the API method only download n (chunks.per.download) files at a time. This may reduce the download problems when the data size is too large.

`GDCdownload`: Example of use

query <- GDCquery(project = "TCGA-ACC", data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment",
                  barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01"))
GDCdownload(query)
data <- GDCprepare(query)

query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", file.type  = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                  legacy = TRUE)
GDCdownload(query)
z <- GDCprepare(query)

# A function to download only 20 samples
legacyPipeline <- function(project, data.category, platform, n = 20){
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE)
    cases <- query$results[[1]]$cases[1:n] # Get two samples from the search
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE,
                      chunks.per.download = 5,
                      barcode = cases)
    GDCdownload(query)
    data <- GDCprepare(query)
    return(data)
}
# DNA methylation
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 27")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 450")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA003 CPI")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA002 CPI")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA002 CPI")

`GDCprepare`: Preparing the data

This function is still under development, it is not working for all cases. See the tables below with the status. Examples of query, download, prepare can be found in this gist.

Harmonized data

Data.category	Data.type	Workflow Type	Status
Transcriptome Profiling	Gene Expression Quantification	HTSeq - Counts	Data frame or SE (losing 5% of information when mapping to genomic regions)
		HTSeq - FPKM-UQ	Returning only a (losing 5% of information when mapping to genomic regions)
		HTSeq - FPKM	Returning only a (losing 5% of information when mapping to genomic regions)
	Isoform Expression Quantification	Not needed
	miRNA Expression Quantification	Not needed	Returning only a dataframe for the moment
Copy number variation	Copy Number Segment		Returning only a dataframe for the moment
	Masked Copy Number Segment		Returning only a dataframe for the moment
Simple Nucleotide Variation
Raw Sequencing Data
Biospecimen
Clinical

Legacy data

Data.category	Data.type	Platform	file.type	Status
Transcriptome Profiling
Copy number variation	-	Affymetrix SNP Array 6.0	nocnv_hg18.seg	Working
	-	Affymetrix SNP Array 6.0	hg18.seg	Working
	-	Affymetrix SNP Array 6.0	nocnv_hg19.seg	Working
	-	Affymetrix SNP Array 6.0	hg19.seg	Working
	-	Illumina HiSeq	-	Working
Simple Nucleotide Variation
Raw Sequencing Data
Biospecimen
Clinical
Protein expression		MDA RPPA Core	-	Working
Gene expression	Gene expression quantification	Illumina HiSeq	normalized_results	Working
		Illumina HiSeq	results	Working
		HT_HG-U133A	-	Working
		AgilentG4502A_07_2	-	Data frame only
		AgilentG4502A_07_1	-	Data frame only
		HuEx-1_0-st-v2	FIRMA.txt	Not Preparing
			gene.txt	Not Preparing
	Isoform expression quantification
	miRNA gene quantification
	Exon junction quantification
	Exon quantification
	miRNA isoform quantification

DNA methylation		Illumina Human Methylation 450	Not used	Working
		Illumina Human Methylation 27	Not used	Working
		Illumina DNA Methylation OMA003 CPI	Not used	Working
		Illumina DNA Methylation OMA002 CPI	Not used	Working
		Illumina Hi Seq		Not working
Raw Microarray Data
Structural Rearrangement
Other

You can easily read the downloaded data using the GDCprepare function. This function will prepare the data into a SummarizedExperiment (Huber, Wolfgang and Carey, Vincent J and Gentleman, Robert and Anders, Simon and Carlson, Marc and Carvalho, Benilton S and Bravo, Hector Corrada and Davis, Sean and Gatto, Laurent and Girke, Thomas and others 2015) object for downstream analysis. For the moment this function works only with data level 3.

The arguments are:

query Data frame as the one returned from TCGAquery
save Save a rda object with the prepared object? Default: FALSE
save.filename Name of the rda object that will be saved if save is TRUE
summarizedExperiment Should the output be a SummarizedExperiment object? Default: TRUE
directory Directory/Folder where the data was downloaded. Default: GDCdata
remove.files.prepared Remove the files read? Default: FALSE. This argument will be considered only if save argument is set to true

library(TCGAbiolinks)
# Downloading and prepare 
query <- GDCquery(project = "TARGET-AML", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(query)
data <- GDCprepare(query)

# Downloading and prepare using legacy
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Protein expression",
                  legacy = TRUE, 
                  barcode = c("TCGA-OX-A56R-01A-21-A44T-20","TCGA-08-0357-01A-21-1898-20"))
GDCdownload(query)
data <- GDCprepare(query, save = TRUE, 
                   save.filename = "gbmProteinExpression.rda",
                   remove.files.prepared = TRUE)

# Downloading and prepare using legacy
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation", 
                  platform = "Illumina Human Methylation 27",legacy = TRUE,
                  barcode = c("TCGA-02-0047-01A-01D-0186-05","TCGA-06-2559-01A-01D-0788-05"))
GDCdownload(query)
data <- GDCprepare(query)

`TCGAanalyze`: Analyze data from TCGA.

You can easily analyze data using following functions:

`TCGAanalyze_Preprocessing`: Preprocessing of Gene Expression data (IlluminaHiSeq_RNASeqV2)

You can easily search TCGA samples, download and prepare a matrix of gene expression.

# You can define a list of samples to query and download providing relative TCGA barcodes.
listSamples <- c("TCGA-E9-A1NG-11A-52R-A14M-07","TCGA-BH-A1FC-11A-32R-A13Q-07",
                 "TCGA-A7-A13G-11A-51R-A13Q-07","TCGA-BH-A0DK-11A-13R-A089-07",
                 "TCGA-E9-A1RH-11A-34R-A169-07","TCGA-BH-A0AU-01A-11R-A12P-07",
                 "TCGA-C8-A1HJ-01A-11R-A13Q-07","TCGA-A7-A13D-01A-13R-A12P-07",
                 "TCGA-A2-A0CV-01A-31R-A115-07","TCGA-AQ-A0Y5-01A-11R-A14M-07")

# Query platform Illumina HiSeq with a list of barcode 
query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Gene expression",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  barcode = listSamples, 
                  legacy = TRUE)

# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCAdownload(query)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCARnaseq_assay <- GDCprepare(query)

BRCAMatrix <- assay(BRCARnaseq_assay,"raw_counts")

# For gene expression if you need to see a boxplot correlation and AAIC plot to define outliers you can run
BRCARnaseq_CorOutliers <- TCGAanalyze_Preprocessing(BRCARnaseq_assay)

The result is shown below:

Example of a matrix of gene expression (10 genes in rows and 2 samples in columns)
	TCGA-BH-A0DK-11A-13R-A089-07	TCGA-E9-A1NG-11A-52R-A14M-07
C6orf191\|253582	0	0
SLC2A6\|11182	46	106
SCRN2\|90507	2491	3545
ALOXE3\|59344	45	5
THSD7A\|221981	184	429
TGFBR3\|7049	10545	19966
MAP6D1\|79929	166	69
FGF22\|27006	2	2
C10orf11\|83938	102	229
KLRK1\|22914	72	209

The result from TCGAanalyze_Preprocessing is shown below:

`TCGAanalyze_DEA` `&` `TCGAanalyze_LevelTab`: Differential expression analysis (DEA)

Perform DEA (Differential expression analysis) to identify differentially expressed genes (DEGs) using the TCGAanalyze_DEA function.

TCGAanalyze_DEA performs DEA using following functions from R :

edgeR::DGEList converts the count matrix into an edgeR object.
edgeR::estimateCommonDisp each gene gets assigned the same dispersion estimate.
edgeR::exactTest performs pair-wise tests for differential expression between two groups.
edgeR::topTags takes the output from exactTest(), adjusts the raw p-values using the False Discovery Rate (FDR) correction, and returns the top differentially expressed genes.

This function receives as arguments:

mat1 The matrix of the first group (in the example, group 1 is the normal samples),
mat2 The matrix of the second group (in the example, group 2 is tumor samples)
Cond1type Label for group 1
Cond1type Label for group 2

Next, we filter the output of dataDEGs by abs(LogFC) >=1, and uses the TCGAanalyze_LevelTab function to create a table with DEGs (differentially expressed genes), log Fold Change (FC), false discovery rate (FDR), the gene expression level for samples in Cond1type, and Cond2type, and Delta value (the difference of gene expression between the two conditions multiplied logFC).

# Downstream analysis using gene expression data  
# TCGA samples from IlluminaHiSeq_RNASeqV2 with type rsem.genes.results
# save(dataBRCA, geneInfo , file = "dataGeneExpression.rda")
library(TCGAbiolinks)

# normalization of genes
dataNorm <- TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo =  geneInfo)

# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                   typesample = c("NT"))

# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), 
                                   typesample = c("TP"))

# Diff.expr.analysis (DEA)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT],
                            mat2 = dataFilt[,samplesTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")

# DEGs table with expression values in normal and tumor samples
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs,"Tumor","Normal",
                                          dataFilt[,samplesTP],dataFilt[,samplesNT])

The result is shown below:

Table of DEGs after DEA
mRNA	logFC	FDR	Tumor	Normal	Delta
FN1	2.88	1.296151e-19	347787.48	41234.12	1001017.3
COL1A1	1.77	1.680844e-08	358010.32	89293.72	633086.3
C4orf7	5.20	2.826474e-50	87821.36	2132.76	456425.4
COL1A2	1.40	9.480478e-06	273385.44	91241.32	383242.9
GAPDH	1.32	3.290678e-05	179057.44	63663.00	236255.5
CLEC3A	6.79	7.971002e-74	27257.16	259.60	185158.6
IGFBP5	1.24	1.060717e-04	128186.88	53323.12	158674.6
CPB1	4.27	3.044021e-37	37001.76	2637.72	157968.8
CARTPT	6.72	1.023371e-72	21700.96	215.16	145872.8
DCD	7.26	1.047988e-80	19941.20	84.80	144806.3

`TCGAanalyze_EAcomplete & TCGAvisualize_EAbarplot`: Enrichment Analysis

Researchers, in order to better understand the underlying biological processes, often want to retrieve a functional profile of a set of genes that might have an important role. This can be done by performing an enrichment analysis.

We will perform an enrichment analysis on gene sets using the TCGAanalyze_EAcomplete function. Given a set of genes that are up-regulated under certain conditions, an enrichment analysis will identify classes of genes or proteins that are over-represented using annotations for that gene set.

To view the results you can use the TCGAvisualize_EAbarplot function as shown below.

library(TCGAbiolinks)
# Enrichment Analysis EA
# Gene Ontology (GO) and Pathway enrichment by DEGs list
Genelist <- rownames(dataDEGsFiltLevel)

system.time(ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist))

# Enrichment Analysis EA (TCGAVisualize)
# Gene Ontology (GO) and Pathway enrichment barPlot

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP), 
                        GOBPTab = ansEA$ResBP,
                        GOCCTab = ansEA$ResCC,
                        GOMFTab = ansEA$ResMF,
                        PathTab = ansEA$ResPat,
                        nRGTab = Genelist, 
                        nBar = 10)

The result is shown below:

The figure shows canonical pathways significantly overrepresented (enriched) by the DEGs (differentially expressed genes). The most statistically significant canonical pathways identified in DEGs list are listed according to their p value corrected FDR (-Log) (colored bars) and the ratio of list genes found in each pathway over the total number of genes in that pathway (Ratio, red line).

`TCGAanalyze_survival`: Survival Analysis: Cox Regression and dnet package

When analyzing survival, different problems come up than the ones discussed so far. One question is how do we deal with subjects dropping out of a study. For example, assuming that we test a new cancer drug. While some subjects die, others may believe that the new drug is not effective, and decide to drop out of the study before the study is finished. A similar problem would be faced when we investigate how long a machine lasts before it breaks down.

Using the clinical data, it is possible to create a survival plot with the function TCGAanalyze_survival as follows:

clin.gbm <- GDCquery_clinic("TCGA-GBM", "clinical")
TCGAanalyze_survival(clin.gbm,
                     "gender",
                     main = "TCGA Set\n GBM",height = 10, width=10)

The arguments of TCGAanalyze_survival are:

clinical_patient TCGA Clinical patient with the information days_to_death
clusterCol Column with groups to plot. This is a mandatory field, the caption will be based in this column
legend Legend title of the figure
cutoff xlim This parameter will be a limit in the x-axis. That means, that patients with days_to_deth > cutoff will be set to Alive.
main main title of the plot
ylab y-axis text of the plot
xlab x-axis text of the plot
filename The name of the pdf file
color Define the colors of the lines.

The result is shown below:

library(TCGAbiolinks)
# Survival Analysis SA

clinical_patient_Cancer <- TCGAquery_clinic("TCGA-BRCA","clinical")
dataBRCAcomplete <- log2(BRCA_rnaseqv2)

tokenStop<- 1

tabSurvKMcomplete <- NULL

for( i in 1: round(nrow(dataBRCAcomplete)/100)){
    message( paste( i, "of ", round(nrow(dataBRCAcomplete)/100)))
    tokenStart <- tokenStop
    tokenStop <-100*i
    tabSurvKM<-TCGAanalyze_SurvivalKM(clinical_patient_Cancer,
                                      dataBRCAcomplete,
                                      Genelist = rownames(dataBRCAcomplete)[tokenStart:tokenStop],
                                      Survresult = F,
                                      ThreshTop=0.67,
                                      ThreshDown=0.33)
    
    tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}

tabSurvKMcomplete <- tabSurvKMcomplete[tabSurvKMcomplete$pvalue < 0.01,]
tabSurvKMcomplete <- tabSurvKMcomplete[!duplicated(tabSurvKMcomplete$mRNA),]
rownames(tabSurvKMcomplete) <-tabSurvKMcomplete$mRNA
tabSurvKMcomplete <- tabSurvKMcomplete[,-1]
tabSurvKMcomplete <- tabSurvKMcomplete[order(tabSurvKMcomplete$pvalue, decreasing=F),]

tabSurvKMcompleteDEGs <- tabSurvKMcomplete[
    rownames(tabSurvKMcomplete) %in% dataDEGsFiltLevel$mRNA,
    ]

The result is shown below:

Table KM-survival genes after SA
	pvalue	Cancer Deaths	Cancer Deaths with Top	Cancer Deaths with Down
DCTPP1	6.204170e-08	66	46	20
APOO	9.390193e-06	65	49	16
LOC387646	1.039097e-05	69	48	21
PGK1	1.198577e-05	71	49	22
CCNE2	2.100348e-05	65	48	17
CCDC75	2.920614e-05	74	46	28
FGD3	3.039998e-05	69	23	46
FAM166B	3.575856e-05	68	25	43
MMP28	3.762361e-05	70	17	53
ADHFE1	3.907103e-05	67	22	45

	Mean Tumor Top	Mean Tumor Down	Mean Normal
DCTPP1	13.31	12.01	11.74
APOO	11.40	10.17	10.01
LOC387646	7.92	4.64	5.90
PGK1	15.66	14.18	14.28
CCNE2	11.07	8.23	7.03
CCDC75	9.47	-Inf	9.74
FGD3	12.30	8.57	8.90
FAM166B	6.82	-Inf	7.52
MMP28	8.55	-Inf	9.06
ADHFE1	9.04	6.13	10.10

`TCGAanalyze_DMR`: Differentially methylated regions Analysis

We will search for differentially methylated CpG sites using the TCGAanalyze_DMR function. In order to find these regions we use the beta-values (methylation values ranging from 0.0 to 1.0) to compare two groups.

First, it calculates the difference between the mean DNA methylation of each group for each probe.

Second, it test for differential expression between two groups using the wilcoxon test adjusting by the Benjamini-Hochberg method. The default arguments was set to require a minimum absolute beta-values difference of 0.2 and am adjusted p-value of < 0.01.

After these tests, we save a volcano plot (x-axis:diff mean methylation, y-axis: statistical significance) that will help the user identify the differentially methylated CpG sites, then the results are saved in a csv file (DMR_results.groupCol.group1.group2.csv) and finally the object is returned with the calculus in the rowRanges.

The arguments of TCGAanalyze_DMR are:

data SummarizedExperiment obtained from the GDCp/repare
groupCol Columns with the groups inside the SummarizedExperiment object. (This will be obtained by the function colData(data)
group1 In case our object has more than 2 groups, you should set the name of the group
group2 In case our object has more than 2 groups, you should set the name of the group
plot.filename pdf filename. Default: volcano.pdf
legend Legend title
color vector of colors to be used in graph
title main title. If not specified it will be “Volcano plot (group1 vs group2)
ylab y axis text
xlab x axis text
xlim x limits to cut image
ylim y limits to cut image
label vector of labels to be used in the figure. Example: c(“Not Significant”, “Hypermethylated in group1”, “Hypomethylated in group1”))
p.cut p values threshold. Default: 0.01
diffmean.cut diffmean threshold. Default: 0.2
adj.method Adjusted method for the p-value calculation
paired Wilcoxon paired parameter. Default: FALSE
overwrite Overwrite the pvalues and diffmean values if already in the object for both groups? Default: FALSE
save save the object with the results?
cores use multiple cores for non-parametric test
filename Name of the rda file to save the object

data <- TCGAanalyze_DMR(data, groupCol = "methylation_subtype",
                        group1 = "CIMP.H",
                        group2="CIMP.L",
                        p.cut = 10^-5,
                        diffmean.cut = 0.25,
                        legend = "State",
                        plot.filename = "coad_CIMPHvsCIMPL_metvolcano.png")

The output will be a plot such as the figure below. The green dots are the probes that are hypomethylated in group 2 compared to group 1, while the red dots are the hypermethylated probes in group 2 compared to group 1

Also, the TCGAanalyze_DMR function will save the plot as pdf and return the same SummarizedExperiment that was given as input with the values of p-value, p-value adjusted, diffmean and the group it belongs in the graph (non significant, hypomethylated, hypermethylated) in the rowRanges. The collumns will be (where group1 and group2 are the names of the groups):

diffmean.group1.group2 (mean.group2 - mean.group1)
diffmean.group2.group1 (mean.group1 - mean.group2)
p.value.group1.group2
p.value.adj.group1.group2
status.group1.group2 (Status of probes in group2 in relation to group1)
status.group2.group1 (Status of probes in group1 in relation to group2)

This values can be view/acessed using the rowRanges acessesor (rowRanges(data)).

Observation: Calling the same function again, with the same arguments will only plot the results, as it was already calculated. If you want to have them recalculated, please set overwrite to TRUE or remove the calculated collumns.

`TCGAvisualize`: Visualize results from analysis functions with TCGA’s data.

You can easily visualize results from some following functions:

`TCGAvisualize_Heatmap`: Create heatmaps with cluster bars

In order to have a better view of clusters, we normally use heatmaps. TCGAvisualize_Heatmap will plot a heatmap and add to each sample bars representing different features. This function is a wrapper to the package ComplexHeatmap package,

The arguments of this function are:

data The object with the heatmap data (expression, methylation)
col.metadata Metadata for the columns (patients). It should have the column bcr_patient_barcode or patient or ID with the patients barcodes.
row.metadata Metadata for the rows genes (expression) or probes (methylation)
col.colors A list of names colors
row.colors A list of named colors
show_column_names Show column names names? Dafault: FALSE
show_row_names Show row names? Dafault: FALSE
cluster_rows Cluster rows ? Dafault: FALSE
cluster_columns Cluster columns ? Dafault: FALSE
sortCol Name of the column to be used to sort the columns
title Title of the plot
type Select the colors of the heatmap values. Possible values are “expression” (default), “methylation”
scale Use z-score to make the heamat? If we want to show differences between genes, it is good to make Z-score by samples (force each sample to have zero mean and standard deviation=1). If we want to show differences between samples, it is good to make Z-score by genes (force each gene to have zero mean and standard deviation=1). Possibilities: “row”, “col. Default”none"

For more information please take a look on case study #2.

The result is shown below:

`TCGAvisualize_Volcano`: Create volcano plot

Creates a volcano plot for DNA methylation or expression

The arguments of this function are:

x x-axis data
y y-axis data
y.cut p-values threshold. Default: 0.01
x.cut x-axis threshold. Default: 0.0
filename Filename. Default: volcano.pdf, volcano.svg, volcano.png
legend Legend title
color vector of colors to be used in graph
title main title. If not specified it will be Volcano plot (group1 vs group2)
ylab y axis text
xlab x axis text
xlim x limits to cut image
ylim y limits to cut image
height Figure height
width Figure width
names Names to be ploted if significant. Should be the same size of x and y
names.fill Names should be filled in a color box? Default: TRUE
names.size Size of the names text
dpi Figure dpi
label vector of labels to be used in the figure. Example: c(“Not Significant”,“Hypermethylated in group1”, “Hypomethylated in group1”))
highlight List of genes/probes to be highlighted. It should be in the names argument.
highlight.color Color of the points highlighted
show.names What names will be showd? Possibilities: “both”, “significant”, “highlighted”

For more information please take a look on case study #3.

`TCGAvisualize_oncoprint`

The visualize the MAF files we created a function called TCGAvisualize_oncoprint that uses the ComplexHeatmap to create an oncoprint.

mut Mutation file (see TCGAquery_maf from TCGAbiolinks)
genes Gene list
filename name of the pdf
color named vector for the plot
height pdf height
rm.empty.columns If there is no alteration in that sample, whether remove it on the oncoprint
show.row.barplot Show barplot annotation on rows?
show.column.names Show column names? Default: FALSE
rows.font.size Size of the fonts
labels.font.size Size of the fonts
annotation Matrix or data frame with the annotation. Should have a column bcr_patient_barcode with the same ID of the mutation object
annotation.position Position of the annotation “bottom” or “top”
label.title Title of the label

mut <- GDCquery_Maf(tumor = "ACC")
clin <- GDCquery_clinic("TCGA-ACC","clinical")
clin <- clin[,c("bcr_patient_barcode","disease","gender","tumor_stage","race","vital_status")]
TCGAvisualize_oncoprint(mut = mut, genes = mut$Hugo_Symbol[1:20],
                        filename = "oncoprint.pdf",
                        annotation = clin,
                        color=c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown"),
                        rows.font.size=10,
                        heatmap.legend.side = "right",
                        dist.col = 0, 
                        width = 5,
                        label.font.size = 10)

The columns are the samples and the rows are genes. The upper histograms shows the number of mutation of each samples, the right histogram shows the number of patients that has that mutation (which is also represented by the percentage numbers in the left side of the plot).

`TCGAvisualize_PCA`: Principal Component Analysis plot for differentially expressed genes

In order to better understand our genes, we can perform a PCA to reduce the number of dimensions of our gene set. The function TCGAvisualize_PCA will plot the PCA for different groups.

The arguments of this function are:

dataFilt The expression matrix after normalization and quantile filter
dataDEGsFiltLevel The TCGAanalyze_LevelTab output
ntopgenes number of DEGs genes to plot in PCA

# normalization of genes
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)

# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

# Principal Component Analysis plot for ntop selected DEGs
TCGAvisualize_PCA(dataFilt,dataDEGsFiltLevel, ntopgenes = 200)

The result is shown below:

`TCGAvisualize_SurvivalCoxNET` Survival Analysis: Cox Regression and dnet package

TCGAvisualize_SurvivalCoxNET can help an user to identify a group of survival genes that are significant from univariate Kaplan Meier Analysis and also for Cox Regression.

It shows in the end a network build with community of genes with similar range of pvalues from Cox regression (same color) and that interaction among those genes is already validated in literatures using the STRING database (version 9.1).

library(TCGAbiolinks)
# Survival Analysis SA

clinical_patient_Cancer <- TCGAquery_clinic("brca","clinical_patient")
dataBRCAcomplete <- log2(BRCA_rnaseqv2)

tokenStop<- 1

tabSurvKMcomplete <- NULL

for( i in 1: round(nrow(dataBRCAcomplete)/100)){
    message( paste( i, "of ", round(nrow(dataBRCAcomplete)/100)))
    tokenStart <- tokenStop
    tokenStop <-100*i
    tabSurvKM <- TCGAanalyze_SurvivalKM(clinical_patient_Cancer,
                                        dataBRCAcomplete,
                                        Genelist = rownames(dataBRCAcomplete)[tokenStart:tokenStop],
                                        Survresult = F,ThreshTop=0.67,ThreshDown=0.33)
    tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}

tabSurvKMcomplete <- tabSurvKMcomplete[tabSurvKMcomplete$pvalue < 0.01,]
tabSurvKMcomplete <- tabSurvKMcomplete[!duplicated(tabSurvKMcomplete$mRNA),]
rownames(tabSurvKMcomplete) <-tabSurvKMcomplete$mRNA
tabSurvKMcomplete <- tabSurvKMcomplete[,-1]
tabSurvKMcomplete <- tabSurvKMcomplete[order(tabSurvKMcomplete$pvalue, decreasing=F),]

tabSurvKMcompleteDEGs <- tabSurvKMcomplete[rownames(tabSurvKMcomplete) %in% dataDEGsFiltLevel$mRNA,]

tflist <- EAGenes[EAGenes$Family == "transcription regulator","Gene"]
tabSurvKMcomplete_onlyTF <- tabSurvKMcomplete[rownames(tabSurvKMcomplete) %in% tflist,]

TabCoxNet <- TCGAvisualize_SurvivalCoxNET(clinical_patient_Cancer,dataBRCAcomplete,
                                          Genelist = rownames(tabSurvKMcompleteDEGs),
                                          scoreConfidence = 700,
                                          titlePlot = "TCGAvisualize_SurvivalCoxNET Example")

In particular the survival analysis with kaplan meier and cox regression allow user to reduce the feature / number of genes significant for survival. And using ‘dnet’ pipeline with ‘TCGAvisualize_SurvivalCoxNET’ function the user can further filter those genes according some already validated interaction according STRING database. This is important because the user can have an idea about the biology inside the survival discrimination and further investigate in a sub-group of genes that are working in as synergistic effect influencing the risk of survival. In the following picture the user can see some community of genes with same color and survival pvalues.

The result is shown below:

`TCGAvisualize_meanMethylation`: Mean DNA Methylation Analysis

Using the data and calculating the mean DNA methylation per group, it is possible to create a mean DNA methylation boxplot with the function TCGAvisualize_meanMethylation as follows:

query <- GDCquery(project = "TCGA-GBM",
                  data.category = "DNA methylation",
                  platform = "Illumina Human Methylation 27",
                  legacy = TRUE, 
                  barcode = c("TCGA-02-0058-01A-01D-0186-05", "TCGA-12-1597-01B-01D-0915-05",
                              "TCGA-12-0829-01A-01D-0392-05", "TCGA-06-0155-01B-01D-0521-05",
                              "TCGA-02-0099-01A-01D-0199-05", "TCGA-19-4068-01A-01D-1228-05",
                              "TCGA-19-1788-01A-01D-0595-05", "TCGA-16-0848-01A-01D-0392-05"))
GDCdownload(query, method = "api")
data <- GDCprepare(query)

# "shortLetterCode" is a column in the SummarizedExperiment::colData(data) matrix
TCGAvisualize_meanMethylation(data, groupCol = "shortLetterCode",filename = NULL)

## 
## 
## groups         Mean      Median         Max         Min
## -------  ----------  ----------  ----------  ----------
## TP        0.2785957   0.2733131   0.3270671   0.2341192

The arguments of TCGAvisualize_meanMethylation are:

data SummarizedExperiment object obtained from GDCprepare
groupCol Columns in colData(data) that defines the groups. If no columns defined a columns called “Patients” will be used
subgroupCol Columns in colData(data) that defines the subgroups.
shapes Shape vector of the subgroups. It must have the size of the levels of the subgroups. Example: shapes = c(21,23) if for two levels
filename The name of the pdf that will be saved
subgroup.legend Name of the subgroup legend. DEFAULT: subgroupCol
group.legend Name of the group legend. DEFAULT: groupCol
color vector of colors to be used in graph
title main title in the plot
ylab y axis text in the plot
print.pvalue Print p-value for two groups in the plot
xlab x axis text in the plot
labels Labels of the groups
y.limits Change lower/upper y-axis limit

Other examples using the parameters:

# setting y limits: lower 0, upper 1
TCGAvisualize_meanMethylation(data,groupCol = "shortLetterCode",filename = NULL, y.limits = c(0,1))

## 
## 
## groups         Mean      Median         Max         Min
## -------  ----------  ----------  ----------  ----------
## TP        0.2785957   0.2733131   0.3270671   0.2341192

# setting y limits: lower 0
TCGAvisualize_meanMethylation(data,groupCol = "shortLetterCode",filename = NULL, y.limits = 0)

## 
## 
## groups         Mean      Median         Max         Min
## -------  ----------  ----------  ----------  ----------
## TP        0.2785957   0.2733131   0.3270671   0.2341192

# Changing shapes of jitters to show subgroups
TCGAvisualize_meanMethylation(data,
                              groupCol  = "subtype_Pan.Glioma.DNA.Methylation.Cluster", 
                              subgroupCol="vital_status", filename = NULL)

## 
## 
## groups         Mean      Median         Max         Min
## -------  ----------  ----------  ----------  ----------
## LGm1      0.3270671   0.3270671   0.3270671   0.3270671
## LGm2      0.3232846   0.3232846   0.3232846   0.3232846
## LGm5      0.2577566   0.2487907   0.2961445   0.2373007
## NA        0.2736938   0.2736938   0.3132684   0.2341192
## 
## 
##        LGm1   LGm2   LGm5 
## -----  -----  -----  -----
## LGm1   NA     NA     NA   
## LGm2   NA     NA     NA   
## LGm5   NA     NA     NA

# Sorting bars by descending mean methylation
TCGAvisualize_meanMethylation(data,
                              groupCol  = "subtype_Pan.Glioma.DNA.Methylation.Cluster",sort="mean.desc",filename=NULL)

## 
## 
## groups         Mean      Median         Max         Min
## -------  ----------  ----------  ----------  ----------
## LGm1      0.3270671   0.3270671   0.3270671   0.3270671
## LGm2      0.3232846   0.3232846   0.3232846   0.3232846
## LGm5      0.2577566   0.2487907   0.2961445   0.2373007
## NA        0.2736938   0.2736938   0.3132684   0.2341192
## 
## 
##        LGm1   LGm2   LGm5 
## -----  -----  -----  -----
## LGm1   NA     NA     NA   
## LGm2   NA     NA     NA   
## LGm5   NA     NA     NA

# Sorting bars by asc mean methylation
TCGAvisualize_meanMethylation(data,groupCol  = "subtype_Pan.Glioma.DNA.Methylation.Cluster",
                              sort="mean.asc",filename=NULL)

## 
## 
## groups         Mean      Median         Max         Min
## -------  ----------  ----------  ----------  ----------
## LGm1      0.3270671   0.3270671   0.3270671   0.3270671
## LGm2      0.3232846   0.3232846   0.3232846   0.3232846
## LGm5      0.2577566   0.2487907   0.2961445   0.2373007
## NA        0.2736938   0.2736938   0.3132684   0.2341192
## 
## 
##        LGm1   LGm2   LGm5 
## -----  -----  -----  -----
## LGm1   NA     NA     NA   
## LGm2   NA     NA     NA   
## LGm5   NA     NA     NA

TCGAvisualize_meanMethylation(data,groupCol  = "vital_status",sort="mean.asc",filename=NULL)

## 
## 
## groups         Mean      Median         Max         Min
## -------  ----------  ----------  ----------  ----------
## alive     0.2961445   0.2961445   0.2961445   0.2961445
## dead      0.2760888   0.2504818   0.3270671   0.2341192
## 
## 
##         alive   dead 
## ------  ------  -----
## alive   NA      NA   
## dead    NA      NA

`TCGAvisualize_starburst`: Integration of gene expression and DNA methylation data

The starburst plot is proposed to combine information from two volcano plots, and is applied for a study of DNA methylation and gene expression. It first introduced in 2010 (Noushmehr, H., Weisenberger, D.J., Diefes, K., Phillips, H.S., Pujara, K., Berman, B.P., Pan, F., Pelloski, C.E., Sulman, E.P., Bhat, K.P. et al. 2010). In order to reproduce this plot, we will use the TCGAvisualize_starburst function.

The function creates Starburst plot for comparison of DNA methylation and gene expression. The log10 (FDR-corrected P value) for DNA methylation is plotted in the x axis, and for gene expression in the y axis, for each gene. The black dashed line shows the FDR-adjusted P value of 0.01.

The arguments of this function are:

met SummarizedExperiment with methylation data obtained from the GDCprepare or Data frame from DMR_results file. Expected colData columns: diffmean, p.value.adj and p.value Execute volcanoPlot function in order to obtain these values for the object.
exp Matrix with expression data obtained from the TCGAanalyze_DEA function. Expected colData columns: logFC, FDR
names Add the names of the significant genes? Default: FALSE
names.fill Names should be filled in a color box? Default: TRUE
circle Circle pair gene/probe that respects diffmean.cut and logFC.cut
filename pdf filename
legend legend title
color vector of colors to be used in graph
label vector of labels to be used in graph
title main title
ylab y axis text
xlab x axis text
xlim x limits to cut image
ylim y limits to cut image
p.cut p value cut-off
group1 The name of the group 1 Obs: Column p.value.adj.group1.group2 should exist
group2 The name of the group 2. Obs: Column p.value.adj.group1.group2 should exist
exp.p.cut expression p value cut-off
met.p.cut methylation p value cut-off
diffmean.cut If set, the probes with diffmean higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted.
logFC.cut If set, the probes with expression fold change higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted.
height Figure height
width Figure width
dpi Figure dpi

starburst <- TCGAvisualize_starburst(coad.SummarizeExperiment, 
                                     different.experssion.analysis.data,
                                     group1 = "CIMP.H",
                                     group2 = "CIMP.L",
                                     met.p.cut = 10^-5, 
                                     exp.p.cut=10^-5,
                                     names = TRUE)

As a result, the function will a plot the figure below and return a matrix with the Gene_symbol and it status in relation to expression (up regulated/down regulated) and to methylation (Hyper/Hypo methylated). The case study 3, shows the complete pipeline for creating this figure.

TCGA Downstream Analysis: Case Studies

Introduction

This vignette shows a complete workflow of the TCGAbiolinks package. The code is divided in 4 case study:

1. Expression pipeline (BRCA)
1. Expression pipeline (GBM)
1. Integration of DNA methylation and RNA expression pipeline (COAD)
1. Elmer pipeline (KIRC)

Arguments definition

library(SummarizedExperiment)
library(TCGAbiolinks)
query.exp <- GDCquery(project = "TCGA-BRCA", 
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      experimental.strategy = "RNA-Seq",
                      sample.type = c("Primary solid Tumor","Solid Tissue Normal"))
GDCdownload(query.exp)
brca.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "brcaExp.rda")

# get subtype information
dataSubt <- TCGAquery_subtype(tumor = "BRCA")

# get clinical data
dataClin <- GDCquery_clinic(project = "TCGA-BRCA","clinical") 

# Which samples are primary solid tumor
dataSmTP <- TCGAquery_SampleTypes(query.exp$results[[1]]$cases,"TP") 
# which samples are solid tissue normal
dataSmNT <- TCGAquery_SampleTypes(query.exp$results[[1]]$cases,"NT")

Using TCGAnalyze_DEA, we identified 3,390 differentially expression genes (DEG) (log fold change >=1 and FDR < 1%) between 114 normal and 1097 BRCA samples. In order to understand the underlying biological process from DEGs we performed an enrichment analysis using TCGAnalyze_EA_complete function.

dataPrep <- TCGAanalyze_Preprocessing(object = brca.exp, cor.cut = 0.6)                      

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfo,
                                      method = "gcContent")                

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT],
                            mat2 = dataFilt[,dataSmTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")

TCGAbiolinks outputs bar chart with the number of genes for the main categories of three ontologies (GO:biological process, GO:cellular component, and GO:molecular function, respectively).

ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",
                                RegulonList = rownames(dataDEGs))  

TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                        GOBPTab = ansEA$ResBP,
                        GOCCTab = ansEA$ResCC,
                        GOMFTab = ansEA$ResMF,
                        PathTab = ansEA$ResPat,
                        nRGTab = rownames(dataDEGs),
                        nBar = 20)

The figure resulted from the code above is shown below.

The Kaplan-Meier analysis was used to compute survival univariate curves, and
log-Ratio test was computed to assess the statistical significance by using TCGAanalyze_SurvivalKM function; starting with 3,390 DEGs genes we found 555 significantly genes with p.value <0.05.

dataSurv <- TCGAanalyze_SurvivalKM(clinical_patient = dataClin,
                                   dataGE = dataFilt,
                                   Genelist = rownames(dataDEGs),
                                   Survresult = FALSE,
                                   ThreshTop = 0.67,
                                   ThreshDown = 0.33,
                                   p.cut = 0.05)

Cox-regression analysis was used to compute survival multivariate curves, and cox p-value was computed to assess the statistical significance by using TCGAnalyze_SurvivalCoxNET function. Survival multivariate analysis found 160 significantly genes according to the cox p-value FDR 5.00e-02. From DEGs that we found to correlate significantly with survival by both univariate and multivariate analyses we analyzed the following network.

The interactome network graph was generated using STRING.,org.Hs.string version 10 (Human functional protein association network). The network graph was resized by dnet package considering only multivariate survival genes, with strong interaction (threshold = 700) we obtained a subgraphsub graph of 24 nodes and 31 edges.

require(dnet)  # to change
org.Hs.string <- dRDataLoader(RData = "org.Hs.string")

TabCoxNet <- TCGAvisualize_SurvivalCoxNET(dataClin,
                                          dataFilt, 
                                          Genelist = rownames(dataSurv),
                                          scoreConfidence = 700,
                                          org.Hs.string = org.Hs.string,
                                          titlePlot = "Case Study n.1 dnet")

The figure resulted from the code above is shown below.

Case study n. 2: Pan Cancer downstream analysis LGG

We focused on the analysis of LGG samples. In particular, we used TCGAbiolinks to download 293 samples with molecular subtypes. Link the complete complete code. .

library(TCGAbiolinks)
library(SummarizedExperiment)

query.exp <- GDCquery(project = "TCGA-LGG", 
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      experimental.strategy = "RNA-Seq",
                      sample.type = "Primary solid Tumor")
GDCdownload(query.exp)
lgg.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "lggExp.rda")

# get subtype information
dataSubt <- TCGAquery_subtype(tumor = "LGG")

# get indexed clinical data
dataClin <- GDCquery_clinic(project = "TCGA-LGG", "Clinical")

First, we searched for possible outliers using the TCGAanalyze_Preprocessing function, which performs an Array Array Intensity correlation AAIC. We used all samples in expression data which contain molecular subtypes, filtering out samples without molecular information, and using only IDHmut-codel (n=85), IDHmut-non-codel (n=141) and IDHwt (n=56), NA (11), to define a square symetric matrix of pearson correlation among all samples (n=293). According to this matrix we found no samples with low correlation (cor.cut = 0.6) that can be identified as possible outliers, so we continued our analysis with 70 samples.

Second, using the TCGAanalyze_Normalization function we normalized mRNA transcripts and miRNA, using EDASeq package. This function does use Within-lane normalization procedures to adjust for GC-content effect (or other gene-level effects) on read counts: loess robust local regression, global-scaling, and full-quantile normalization (Risso, D., Schwartz, K., Sherlock, G., & Dudoit, S. 2011) and between-lane normalization procedures to adjust for distributional differences between lanes (e.g., sequencing depth): global-scaling and full-quantile normalization (James H Bullard, Elizabeth Purdom, Kasper D Hansen and Sandrine Dudoit 2010).

Third, using the TCGAanalyze_Filtering function we applied 3 filters removing features / mRNAs with low signal across samples obtaining 4578, 4284, 1187 mRNAs respectively.

Then we applied two Hierarchical cluster analysis on 1187 mRNAs after the three filters described above, the first cluster using as method ward.D2, and the second with ConsensusClusterPlus.

After the two clustering analysis, with cut.tree =4 we obtained n=4 espression clusters (EC).

# expression data with molecular subtypes
lgg.exp <- subset(lgg.exp, select = colData(lgg.exp)$patient %in% dataSubt$patient)

dataPrep <- TCGAanalyze_Preprocessing(object = lgg.exp,cor.cut = 0.6)
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfo,
                                      method = "gcContent")

datFilt1 <- TCGAanalyze_Filtering(tabDF = dataNorm,method = "varFilter")
datFilt2 <- TCGAanalyze_Filtering(tabDF = datFilt1,method = "filter1")
datFilt <- TCGAanalyze_Filtering(tabDF = datFilt2,method = "filter2")

data_Hc1 <- TCGAanalyze_Clustering(tabDF = datFilt,
                                   method = "hclust",
                                   methodHC = "ward.D2")
data_Hc2 <- TCGAanalyze_Clustering(tabDF = datFilt,
                                   method = "consensus",
                                   methodHC = "ward.D2")

For the next steps, we will merge the clinical data with with the the cluster information and we will add the subtype information.

#------  Add cluster information
cluster <- data.frame("groupsHC" = data_Hc2[[4]]$consensusClass)
cluster$groupsHC <- paste0("EC",cluster$groupsHC)
cluster$patient <-  substr(colData(lgg.exp)$patient,1,12)

# Add information about gropus from consensus Cluster in clinical data
dataClin <- merge(dataClin,cluster, by.x="bcr_patient_barcode", by.y="patient")

# Merge subtype and clinical data
clin_subt <- merge(dataClin,dataSubt, by.x="bcr_patient_barcode", by.y="patient")
clin_subt_all <- merge(dataClin,dataSubt, 
                       by.x="bcr_patient_barcode", by.y="patient", all.x = TRUE)

The next steps will be to visualize the data. First, we created the survival plot.

#----------- VISUALIZE --------------------
# plotting survival for groups EC1, EC2, EC3, EC4
TCGAanalyze_survival(data = clin_subt_all,
                     clusterCol = "groupsHC",
                     main = "TCGA kaplan meier survival plot from consensus cluster",
                     legend = "RNA Group",
                     color = c("black","red","blue","green3"),
                     filename = "case2_surv.png")

The result is showed below:

We will also, create a heatmap of the expression.

TCGAvisualize_Heatmap(t(datFilt),
                      col.metadata =  clin_subt[,c("bcr_patient_barcode",
                                                   "groupsHC",
                                                   "Histology",
                                                   "IDH.codel.subtype")],
                      col.colors =  list(
                          groupsHC = c("EC1"="black",
                                       "EC2"="red",
                                       "EC3"="blue",
                                       "EC4"="green3"),
                          Histology=c("astrocytoma"="navy",
                                      "oligoastrocytoma"="green3",
                                      "oligodendroglioma"="red"),
                          IDH.codel.subtype = c("IDHmut-codel"="tomato",
                                                "IDHmut-non-codel"="navy",
                                                "IDHwt"="gold","NA"="white")),
                      sortCol = "groupsHC",
                      type = "expression", # sets default color
                      scale = "row", # use z-scores for better visualization
                      title = "Heatmap from concensus cluster", 
                      filename = "case2_Heatmap.pdf",
                      cluster_rows = TRUE)

The result is shown below:

Finally, we will take a look in the mutation genes. We will first download the maf file with TCGAquery_maf. In this example we will investigate the gene “ATRX”.

LGGmut <- GDCquery_Maf(tumor = "LGG") 
# Selecting gene
mRNAsel <- "ATRX"
LGGselected <- LGGmut[LGGmut$Hugo_Symbol == mRNAsel,]

dataMut <- LGGselected[!duplicated(LGGselected$Tumor_Sample_Barcode),]
dataMut$Tumor_Sample_Barcode <- substr(dataMut$Tumor_Sample_Barcode,1,12)

# Adding the Expression Cluster classification found before
dataMut <- merge(dataMut, cluster, by.y="patient", by.x="Tumor_Sample_Barcode")
dataMut <- dataMut[dataMut$Variant_Classification!=0,]

Case study n. 3: Integration of methylation and expression for ACC

In recent years, it has been described the relationship between DNA methylation and gene expression and the study of this relationship is often difficult to accomplish.

This case study will show the steps to investigate the relationship between the two types of data.

First, we downloaded ACC DNA methylation data for HumanMethylation450k platforms, and ACC RNA expression data for Illumina HiSeq platform.

TCGAbiolinks adds by default the subtypes classification already published by researchers.

We will use this classification to do our examples. So, selected the groups CIMP-low and CIMP-high to do RNA expression and DNA methylation comparison.

library(TCGAbiolinks)
library(SummarizedExperiment)
dir.create("case3")
setwd("case3")
#-----------------------------------
# STEP 1: Search, download, prepare |
#-----------------------------------
# 1.1 - DNA methylation
# ----------------------------------
query.met <- GDCquery(project = "TCGA-ACC", 
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      platform = "Illumina Human Methylation 450")
GDCdownload(query.met)

acc.met <- GDCprepare(query = query.met,
                      save = TRUE, 
                      save.filename = "accDNAmet.rda",
                      summarizedExperiment = TRUE)

#-----------------------------------
# 1.2 - RNA expression
# ----------------------------------
query.exp <- GDCquery(project = "TCGA-ACC", 
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results")
GDCdownload(query.exp)
acc.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "accExp.rda")

We will first investigate the mean methylation of the methylation groups and highlight the hypermutated subgroup.

#--------------------------------------------
# STEP 2: Analysis
#--------------------------------------------
# 2.1 - Mean methylation of samples
# -------------------------------------------
TCGAvisualize_meanMethylation(acc.met,
                              groupCol = "subtype_MethyLevel",
                              subgroupCol = "subtype_Histology",
                              group.legend  = "Groups",
                              subgroup.legend = "Histology",
                              filename = "acc_mean.png")

The figure resulted from the code above is shown below:

For DNA methylation, we do a DMR (different methylated region) analysis, which will give the difference of DNA methylation for the probes of the groups and their significance value. The output can be seen in a volcano plot. Note: Depending on the number of samples this function can be very slow due to the wilcoxon test, taking from hours to days.

# na.omit
acc.met <- subset(acc.met,subset = (rowSums(is.na(assay(acc.met))) == 0))

# Volcano plot
acc.met <- TCGAanalyze_DMR(acc.met, groupCol = "subtype_MethyLevel",
                           group1 = "CIMP-high",
                           group2="CIMP-low",
                           p.cut = 10^-5,
                           diffmean.cut = 0.25,
                           legend = "State",
                           plot.filename = "CIMP-highvsCIMP-low_metvolcano.png")

The figure resulted from the code above is shown below:

For the expression analysis, we do a DEA (differential expression analysis) which will give the fold change of gene expression and their significance value.

#-------------------------------------------------
# 2.3 - DEA - Expression analysis - volcano plot
# ------------------------------------------------
acc.exp.aux <- subset(acc.exp, 
                      select = colData(acc.exp)$subtype_MethyLevel %in% c("CIMP-high","CIMP-low"))

idx <- colData(acc.exp.aux)$subtype_MethyLevel %in% c("CIMP-high")
idx2 <- colData(acc.exp.aux)$subtype_MethyLevel %in% c("CIMP-low")

dataPrep <- TCGAanalyze_Preprocessing(object = acc.exp.aux, cor.cut = 0.6)

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfo,
                                      method = "gcContent")

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  qnt.cut = 0.25,
                                  method='quantile')

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,idx],
                            mat2 = dataFilt[,idx2],
                            Cond1type = "CIMP-high",
                            Cond2type = "CIMP-low",
                            method = "glmLRT")

TCGAVisualize_volcano(dataDEGs$logFC,dataDEGs$FDR,
                      filename = "Case3_volcanoexp.png",
                      x.cut = 3,
                      y.cut = 10^-5,
                      names = rownames(dataDEGs),
                      color = c("black","red","darkgreen"),
                      names.size = 2,
                      xlab = " Gene expression fold change (Log2)",
                      legend = "State",
                      title = "Volcano plot (CIMP-high vs CIMP-low)",
                      width = 10)

The figure resulted from the code above is shown below:

Finally, using both previous analysis we do a starburst plot to select the genes that are Candidate Biologically Significant.

Observation: over the time, the number of samples has increased and the clinical data updated. We used only the samples that had a classification in the examples.

#------------------------------------------
# 2.4 - Starburst plot
# -----------------------------------------
# If true the argument names of the geenes in circle 
# (biologically significant genes, has a change in gene
# expression and DNA methylation and respects all the thresholds)
# will be shown
# these genes are returned by the function see starburst object after the function is executed
starburst <- TCGAvisualize_starburst(acc.met, dataDEGs,"CIMP-high","CIMP-low",
                                     filename = "starburst.png",
                                     met.p.cut = 10^-5,
                                     exp.p.cut = 10^-5,
                                     diffmean.cut = 0.25,
                                     logFC.cut = 3,
                                     names = FALSE, 
                                     height=10,
                                     width=15,
                                     dpi=300)

The figure resulted from the code above is shown below:

Case study n. 4: Elmer pipeline - KIRC

One of the biggest problems related to anlysis of data is the preparation phase, which often consists of successive steps in order to prepare it to a format acceptable by certain algorithms and software.

With the objective of assisting users in this arduous step, TCGAbiolinks will offer to prepare the data in order to obtain the correct format for different packages.

An example of package to perform DNA methylation and RNA expression analysis is ELMER (Huber, Wolfgang and Carey, Vincent J and Gentleman, Robert and Anders, Simon and Carlson, Marc and Carvalho, Benilton S and Bravo, Hector Corrada and Davis, Sean and Gatto, Laurent and Girke, Thomas and others 2015). ELMER, which is designed to combine DNA methylation and gene expression data from human tissues to infer multi-level cis-regulatory networks. ELMER uses DNA methylation to identify enhancers, and correlates enhancer state with expression of nearby genes to identify one or more transcriptional targets. Transcription factor (TF) binding site analysis of enhancers is coupled with expression analysis of all TFs to infer upstream regulators. This package can be easily applied to TCGA public available cancer data sets and custom DNA methylation and gene expression data sets.

ELMER analyses have 5 main steps:

Identify distal enhancer probes on HM450K.
Identify distal enhancer probes with significantly different DNA methyaltion level in control group and experiment group.
Identify putative target genes for differentially methylated distal enhancer probes.
Identify enriched motifs for the distal enhancer probes which are significantly differentially methylated and linked to putative target gene.
Identify regulatory TFs whose expression associate with DNA methylation at motifs.

We will present this the study KIRC by TCGAbiolinks and ELMER integration. For more information, please consult the ELMER package.

For the DNA methylation data we will search the platform HumanMethylation450 and level 3 data with TCGAquery. After, we will download the data with TCGAdownload and the data will be prepared into a data.frame object with TCGAprepare. TCGAprepare_elmer will be used to prepare the data frame to ELMER.

library(TCGAbiolinks)
library(SummarizedExperiment)
library(ELMER)
library(parallel)
dir.create("case4")
setwd("case4")
#-----------------------------------
# STEP 1: Search, download, prepare |
#-----------------------------------
# 1.1 - DNA methylation
# ----------------------------------
query.met <- GDCquery(project = "TCGA-KIRC", 
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      platform = "Illumina Human Methylation 450")
GDCdownload(query.met)
kirc.met <- GDCprepare(query = query.met,
                       save = TRUE, 
                       save.filename = "kircDNAmet.rda",
                       summarizedExperiment = TRUE)


kirc.met <- TCGAprepare_elmer(kirc.met, 
                              platform = "HumanMethylation450",
                              save = TRUE,
                              met.na.cut = 0.2)

For the gene expression data we will search the platform IlluminaHiSeq_RNASeqV2 and level 3 data with TCGAquery. Next, we will download the normalized_results data with TCGAdownload and prepare it into a data.frame object with TCGAprepare. TCGAprepare_elmer will be used to prepare the data frame to ELMER.

# Step 1.2 download expression data
#-----------------------------------
# 1.2 - RNA expression
# ----------------------------------
query.exp <- GDCquery(project = "TCGA-KIRC", 
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "normalized_results")
GDCdownload(query.exp)
kirc.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "kirkExp.rda")

kirc.exp <- TCGAprepare_elmer(kirc.exp,
                              save = TRUE,
                              platform = "IlluminaHiSeq_RNASeqV2")

The ELMER input is a mee object that contains the methylation matrix, the expression matrix, the probe information, the gene information and a table with a summary of the data.

#-----------------------------------
# STEP 2: ELMER integration         |
#-----------------------------------
# Step 2.1 prepare mee object       |
# -----------------------------------
library(ELMER)
library(parallel)

geneAnnot <- txs()
geneAnnot$GENEID <- paste0("ID",geneAnnot$GENEID)
geneInfo <- promoters(geneAnnot,upstream = 0, downstream = 0)
probe <- get.feature.probe()
mee <- fetch.mee(meth = kirc.met, exp = kirc.exp, TCGA = TRUE,
                 probeInfo = probe, geneInfo = geneInfo)

The code below excute steps 2-5 showed above:

direction <- c("hyper","hypo")

for (j in direction){
    print(j)
    dir.out <- paste0("kirc/",j)
    dir.create(dir.out, recursive = TRUE)
    #--------------------------------------
    # STEP 3: Analysis                     |
    #--------------------------------------
    # Step 3.1: Get diff methylated probes |
    #--------------------------------------
    Sig.probes <- get.diff.meth(mee, cores=detectCores(),
                                dir.out =dir.out,
                                diff.dir=j,
                                pvalue = 0.01)
    
    #-------------------------------------------------------------
    # Step 3.2: Identify significant probe-gene pairs            |
    #-------------------------------------------------------------
    # Collect nearby 20 genes for Sig.probes
    nearGenes <- GetNearGenes(TRange=getProbeInfo(mee, probe=Sig.probes$probe),
                              cores=detectCores(),
                              geneAnnot=getGeneInfo(mee))
    
    pair <- get.pair(mee=mee,
                     probes=Sig.probes$probe,
                     nearGenes=nearGenes,
                     permu.dir=paste0(dir.out,"/permu"),
                     dir.out=dir.out,
                     cores=detectCores(),
                     label= j,
                     permu.size=10000,
                     Pe = 0.001)
    
    Sig.probes.paired <- fetch.pair(pair=pair,
                                    probeInfo = getProbeInfo(mee),
                                    geneInfo = getGeneInfo(mee))
    Sig.probes.paired <-read.csv(paste0(dir.out,"/getPair.",j,".pairs.significant.csv"),
                                 stringsAsFactors=FALSE)[,1]
    
    
    #-------------------------------------------------------------
    # Step 3.3: Motif enrichment analysis on the selected probes |
    #-------------------------------------------------------------
    if(length(Sig.probes.paired) > 0 ){
        #-------------------------------------------------------------
        # Step 3.3: Motif enrichment analysis on the selected probes |
        #-------------------------------------------------------------
        enriched.motif <- get.enriched.motif(probes=Sig.probes.paired,
                                             dir.out=dir.out, label=j,
                                             background.probes = probe$name)
        if(length(enriched.motif) > 0){
            #-------------------------------------------------------------
            # Step 3.4: Identifying regulatory TFs                        |
            #-------------------------------------------------------------
            print("get.TFs")
            
            TF <- get.TFs(mee = mee,
                          enriched.motif = enriched.motif,
                          dir.out = dir.out,
                          cores = detectCores(), label = j)
            save(TF, enriched.motif, Sig.probes.paired,
                 pair, nearGenes, Sig.probes,
                 file=paste0(dir.out,"/ELMER_results_",j,".rda"))
        }
    }
}

From this analysis it is possible to verify the relationship between nearby 20 gene expression vs DNA methylation at this probe. The result of this is show by the ELMER scatter plot.

scatter.plot(mee,byProbe=list(probe=c("cg00328720"),geneNum=20), category="TN", lm_line = TRUE)

The result is shown below:

Each scatter plot showing the average DNA methylation level of sites with the UA6 motif in all KIRC samples plotted against the expression of the transcription factor ZNF677 and PEG3 respectively.

scatter.plot(mee,byTF = list(TF = c("ZNF677","PEG3"),
                          probe = enriched.motif[["UA6"]]), category = "TN",
                          save = TRUE, lm_line = TRUE, dir.out = "kirc")

The result is shown below:

The schematic plot shows probe colored in blue and the location of nearby 20 genes. The genes significantly linked to the probe were shown in red.

pair <- fetch.pair(pair="./kirc/getPair.hypo.pairs.significant.csv",
                   probeInfo = mee@probeInfo, geneInfo = mee@geneInfo)
schematic.plot(pair=pair, byProbe="cg15862394",save=FALSE)

The result is shown below:

The plot shows the odds ratio (x axis) for the selected motifs with OR above 1.3 and lower boundary of OR above 1.3. The range shows the 95% confidence interval for each Odds Ratio.

Session Information

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] png_0.1-7           TCGAbiolinks_2.0.13 BiocStyle_2.0.3    
## 
## loaded via a namespace (and not attached):
##   [1] circlize_0.3.9                         
##   [2] aroma.light_3.2.0                      
##   [3] plyr_1.8.4                             
##   [4] igraph_1.0.1                           
##   [5] ConsensusClusterPlus_1.36.0            
##   [6] splines_3.3.1                          
##   [7] BiocParallel_1.6.6                     
##   [8] GenomeInfoDb_1.8.7                     
##   [9] ggplot2_2.1.0                          
##  [10] TH.data_1.0-7                          
##  [11] digest_0.6.10                          
##  [12] foreach_1.4.3                          
##  [13] BiocInstaller_1.22.3                   
##  [14] htmltools_0.3.5                        
##  [15] gdata_2.17.0                           
##  [16] magrittr_1.5                           
##  [17] memoise_1.0.0                          
##  [18] cluster_2.0.4                          
##  [19] doParallel_1.0.10                      
##  [20] limma_3.28.21                          
##  [21] ComplexHeatmap_1.10.2                  
##  [22] Biostrings_2.40.2                      
##  [23] readr_1.0.0                            
##  [24] annotate_1.50.0                        
##  [25] matrixStats_0.50.2                     
##  [26] R.utils_2.4.0                          
##  [27] sandwich_2.3-4                         
##  [28] colorspace_1.2-6                       
##  [29] ggrepel_0.5                            
##  [30] rvest_0.3.2                            
##  [31] dplyr_0.5.0                            
##  [32] crayon_1.3.2                           
##  [33] RCurl_1.95-4.8                         
##  [34] jsonlite_1.1                           
##  [35] hexbin_1.27.1                          
##  [36] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [37] graph_1.50.0                           
##  [38] roxygen2_5.0.1                         
##  [39] genefilter_1.54.2                      
##  [40] supraHex_1.10.0                        
##  [41] survival_2.39-5                        
##  [42] zoo_1.7-13                             
##  [43] iterators_1.0.8                        
##  [44] ape_3.5                                
##  [45] gtable_0.2.0                           
##  [46] zlibbioc_1.18.0                        
##  [47] XVector_0.12.1                         
##  [48] GetoptLong_0.1.5                       
##  [49] kernlab_0.9-25                         
##  [50] Rgraphviz_2.16.0                       
##  [51] shape_1.4.2                            
##  [52] prabclus_2.2-6                         
##  [53] BiocGenerics_0.18.0                    
##  [54] DEoptimR_1.0-6                         
##  [55] scales_0.4.0                           
##  [56] DESeq_1.24.0                           
##  [57] mvtnorm_1.0-5                          
##  [58] DBI_0.5-1                              
##  [59] GGally_1.2.0                           
##  [60] edgeR_3.14.0                           
##  [61] ggthemes_3.2.0                         
##  [62] Rcpp_0.12.7                            
##  [63] xtable_1.8-2                           
##  [64] matlab_1.0.2                           
##  [65] mclust_5.2                             
##  [66] preprocessCore_1.34.0                  
##  [67] stats4_3.3.1                           
##  [68] httr_1.2.1                             
##  [69] gplots_3.0.1                           
##  [70] RColorBrewer_1.1-2                     
##  [71] fpc_2.1-10                             
##  [72] modeltools_0.2-21                      
##  [73] reshape_0.8.5                          
##  [74] XML_3.98-1.4                           
##  [75] R.methodsS3_1.7.1                      
##  [76] flexmix_2.3-13                         
##  [77] nnet_7.3-12                            
##  [78] labeling_0.3                           
##  [79] reshape2_1.4.1                         
##  [80] AnnotationDbi_1.34.4                   
##  [81] munsell_0.4.3                          
##  [82] tools_3.3.1                            
##  [83] downloader_0.4                         
##  [84] RSQLite_1.0.0                          
##  [85] devtools_1.12.0                        
##  [86] evaluate_0.9                           
##  [87] stringr_1.1.0                          
##  [88] yaml_2.1.13                            
##  [89] knitr_1.14                             
##  [90] robustbase_0.92-6                      
##  [91] caTools_1.17.1                         
##  [92] dendextend_1.3.0                       
##  [93] coin_1.1-2                             
##  [94] EDASeq_2.6.2                           
##  [95] nlme_3.1-128                           
##  [96] whisker_0.3-2                          
##  [97] formatR_1.4                            
##  [98] R.oo_1.20.0                            
##  [99] xml2_1.0.0                             
## [100] biomaRt_2.28.0                         
## [101] curl_2.1                               
## [102] testthat_1.0.2                         
## [103] affyio_1.42.0                          
## [104] tibble_1.2                             
## [105] geneplotter_1.50.0                     
## [106] stringi_1.1.2                          
## [107] highr_0.6                              
## [108] GenomicFeatures_1.24.5                 
## [109] lattice_0.20-34                        
## [110] trimcluster_0.1-2                      
## [111] Matrix_1.2-7.1                         
## [112] GlobalOptions_0.0.10                   
## [113] data.table_1.9.6                       
## [114] bitops_1.0-6                           
## [115] parmigene_1.0.2                        
## [116] dnet_1.0.9                             
## [117] rtracklayer_1.32.2                     
## [118] GenomicRanges_1.24.3                   
## [119] R6_2.2.0                               
## [120] latticeExtra_0.6-28                    
## [121] affy_1.50.0                            
## [122] hwriter_1.3.2                          
## [123] ShortRead_1.30.0                       
## [124] KernSmooth_2.23-15                     
## [125] IRanges_2.6.1                          
## [126] codetools_0.2-14                       
## [127] MASS_7.3-45                            
## [128] gtools_3.5.0                           
## [129] assertthat_0.1                         
## [130] chron_2.3-47                           
## [131] SummarizedExperiment_1.2.3             
## [132] rjson_0.2.15                           
## [133] withr_1.0.2                            
## [134] GenomicAlignments_1.8.4                
## [135] Rsamtools_1.24.0                       
## [136] multcomp_1.4-6                         
## [137] S4Vectors_0.10.3                       
## [138] diptest_0.75-7                         
## [139] parallel_3.3.1                         
## [140] class_7.3-14                           
## [141] rmarkdown_1.0                          
## [142] Biobase_2.32.0

References

Cancer Genome Atlas Research Network and others. 2012a. “Comprehensive Genomic Characterization of Squamous Cell Lung Cancers.” doi:10.1038/nature11404.

———. 2012b. “Comprehensive Molecular Characterization of Human Colon and Rectal Cancer.” doi:10.1038/nature11252.

———. 2012c. “Comprehensive Molecular Portraits of Human Breast Tumours.” doi:10.1038/nature11412.

———. 2013a. “Comprehensive Molecular Characterization of Clear Cell Renal Cell Carcinoma.” doi:10.1038/nature12222.

———. 2013b. “Integrated Genomic Characterization of Endometrial Carcinoma.” doi:10.1038/nature12113.

———. 2014a. “Comprehensive Molecular Characterization of Gastric Adenocarcinoma.” doi:10.1038/nature13480.

———. 2014b. “Comprehensive Molecular Profiling of Lung Adenocarcinoma.” doi:10.1038/nature13385.

———. 2014c. “Integrated Genomic Characterization of Papillary Thyroid Carcinoma.” doi:10.1016/j.cell.2014.09.050.

———. 2015a. “Comprehensive Genomic Characterization of Head and Neck Squamous Cell Carcinomas.” doi:10.1038/nature14129.

———. 2015b. “Genomic Classification of Cutaneous Melanoma.” doi:10.1016/j.cell.2015.05.044.

———. 2015c. “The Molecular Taxonomy of Primary Prostate Cancer.” doi:10.1016/j.cell.2015.10.025.

———. 2016. “Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma.” doi:10.1016/j.ccell.2016.04.002.

Ceccarelli, Michele and Barthel, Floris P and Malta, Tathiane M and Sabedot, Thais S and Salama, Sofie R and Murray, Bradley A and Morozova, Olena and Newton, Yulia and Radenbaugh, Amie and Pagnotta, Stefano M and others. 2016. “Molecular Profiling Reveals Biologically Discrete Subsets and Pathways of Progression in Diffuse Glioma.” doi:10.1016/j.cell.2015.12.028.

Colaprico, Antonio and Silva, Tiago C. and Olsen, Catharina and Garofano, Luciano and Cava, Claudia and Garolini, Davide and Sabedot, Thais S. and Malta, Tathiane M. and Pagnotta, Stefano M. and Castiglioni, Isabella and Ceccarelli, Michele and Bontempi, Gianluca and Noushmehr, Houtan. 2016. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of TCGA Data.” doi:10.1093/nar/gkv1507.

Davis, Caleb F and Ricketts, Christopher J and Wang, Min and Yang, Lixing and Cherniack, Andrew D and Shen, Hui and Buhay, Christian and Kang, Hyojin and Kim, Sang Cheol and Fahey, Catherine C and others. 2014. “The Somatic Genomic Landscape of Chromophobe Renal Cell Carcinoma.” doi:10.1016/j.ccr.2014.07.014.

Gu, Zuguang and Eils, Roland and Schlesner, Matthias. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” doi:10.1016/j.ccell.2016.04.002.

Huber, Wolfgang and Carey, Vincent J and Gentleman, Robert and Anders, Simon and Carlson, Marc and Carvalho, Benilton S and Bravo, Hector Corrada and Davis, Sean and Gatto, Laurent and Girke, Thomas and others. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.”

James H Bullard, Elizabeth Purdom, Kasper D Hansen and Sandrine Dudoit. 2010. “Evaluation of Statistical Methods for Normalization and Differential Expression in MRNA-Seq Experiments.”

Linehan, W Marston and Spellman, Paul T and Ricketts, Christopher J and Creighton, Chad J and Fei, Suzanne S and Davis, Caleb and Wheeler, David A and Murray, Bradley A and Schmidt, Laura and Vocke, Cathy D and others. 2016. “Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma.” doi:10.1056/NEJMoa1505917.

Noushmehr, H., Weisenberger, D.J., Diefes, K., Phillips, H.S., Pujara, K., Berman, B.P., Pan, F., Pelloski, C.E., Sulman, E.P., Bhat, K.P. et al. 2010. “Identification of a CpG Island Methylator Phenotype That Defines a Distinct Subgroup of Glioma.”

Risso, D., Schwartz, K., Sherlock, G., & Dudoit, S. 2011. “GC-Content Normalization for RNA-Seq Data.”

Silva, TC and Colaprico, A and Olsen, C and D’Angelo, F and Bontempi, G and Ceccarelli, M and Noushmehr, H. 2016. “TCGA Workflow: Analyze Cancer Genomics and Epigenomics Data Using Bioconductor Packages.” doi:10.12688/f1000research.8923.1.

Working with TCGAbiolinks package

Antonio Colaprico, Tiago Chedraoui Silva, Luciano Garofano, Catharina Olsen, Davide Garolini, Claudia Cava, Isabella Castiglioni, Thais Sarraf Sabedot, Tathiane Maistro Malta, Stefano Pagnotta, Michele Ceccarelli, Gianluca Bontempi, Houtan Noushmehr

2016-10-05

Contents

Updates

Introduction

Installation

Citation

GDCquery: Searching TCGA open-access data

GDCquery: Searching GDC data for download

Harmonized data

Legacy data

GDCquery_maf: Working with mutation files.

GDCquery_clinic and GDCprepare_clinic: Working with clinical data.

TCGAquery_subtype: Working with molecular subtypes data.

GDCdownload: Downloading GDC data

GDCdownload: Example of use

GDCprepare: Preparing the data

Harmonized data

Legacy data

TCGAanalyze: Analyze data from TCGA.

TCGAanalyze_Preprocessing: Preprocessing of Gene Expression data (IlluminaHiSeq_RNASeqV2)

TCGAanalyze_DEA & TCGAanalyze_LevelTab: Differential expression analysis (DEA)

TCGAanalyze_EAcomplete & TCGAvisualize_EAbarplot: Enrichment Analysis

TCGAanalyze_survival: Survival Analysis: Cox Regression and dnet package

TCGAanalyze_DMR: Differentially methylated regions Analysis

TCGAvisualize: Visualize results from analysis functions with TCGA’s data.

TCGAvisualize_Heatmap: Create heatmaps with cluster bars

TCGAvisualize_Volcano: Create volcano plot

TCGAvisualize_oncoprint

TCGAvisualize_PCA: Principal Component Analysis plot for differentially expressed genes

TCGAvisualize_SurvivalCoxNET Survival Analysis: Cox Regression and dnet package

TCGAvisualize_meanMethylation: Mean DNA Methylation Analysis

TCGAvisualize_starburst: Integration of gene expression and DNA methylation data

TCGA Downstream Analysis: Case Studies

Introduction

Arguments definition

Case study n. 2: Pan Cancer downstream analysis LGG

Case study n. 3: Integration of methylation and expression for ACC

Case study n. 4: Elmer pipeline - KIRC

Session Information

References

`GDCquery`: Searching TCGA open-access data

`GDCquery`: Searching GDC data for download

`GDCquery_maf`: Working with mutation files.

`GDCquery_clinic` and `GDCprepare_clinic`: Working with clinical data.

`TCGAquery_subtype`: Working with molecular subtypes data.

`GDCdownload`: Downloading GDC data

`GDCdownload`: Example of use

`GDCprepare`: Preparing the data

`TCGAanalyze`: Analyze data from TCGA.

`TCGAanalyze_Preprocessing`: Preprocessing of Gene Expression data (IlluminaHiSeq_RNASeqV2)

`TCGAanalyze_DEA` `&` `TCGAanalyze_LevelTab`: Differential expression analysis (DEA)

`TCGAanalyze_EAcomplete & TCGAvisualize_EAbarplot`: Enrichment Analysis

`TCGAanalyze_survival`: Survival Analysis: Cox Regression and dnet package

`TCGAanalyze_DMR`: Differentially methylated regions Analysis

`TCGAvisualize`: Visualize results from analysis functions with TCGA’s data.

`TCGAvisualize_Heatmap`: Create heatmaps with cluster bars

`TCGAvisualize_Volcano`: Create volcano plot

`TCGAvisualize_oncoprint`

`TCGAvisualize_PCA`: Principal Component Analysis plot for differentially expressed genes

`TCGAvisualize_SurvivalCoxNET` Survival Analysis: Cox Regression and dnet package

`TCGAvisualize_meanMethylation`: Mean DNA Methylation Analysis

`TCGAvisualize_starburst`: Integration of gene expression and DNA methylation data