RTCGAToolbox

Mehmet Kemal Samur

2017-10-30

1 Introduction

Managing data from large scale projects such as The Cancer Genome Atlas (TCGA)(Cancer Genome Atlas Research Network 2008) for further analysis is an important and time consuming step for research projects. Several efforts, such as Firehose project, make TCGA pre-processed data publicly available via web services and data portals but it requires managing, downloading and preparing the data for following steps. We developed an open source and extensible R based data client for Firehose Level 3 and Level 4 data and demonstrated its use with sample case studies. RTCGAToolbox could improve data management for researchers who are interested with TCGA data. In addition, it can be integrated with other analysis pipelines for further data analysis.

RTCGAToolbox is open-source and licensed under the GNU General Public License Version 2.0. All documentation and source code for RTCGAToolbox is freely available.

Currently, following functions are provided to access datasets and process datasets.

2 Installation

To install RTCGAToolbox, you can use Bioconductor. Source code is also available on GitHub. First time users use the following code snippet to install the package

source("http://bioconductor.org/biocLite.R")
biocLite("RTCGAToolbox")

3 Data Client

Before getting the data from Firehose pipelines, users have to check valid dataset aliases, stddata run dates and analyze run dates. To provide valid information RTCGAToolbox comes with three control functions. Users can list datasets with “getFirehoseDatasets” function. In addition, users have to provide stddata run date or/and analyze run date for client function. Valid dates are accessible via “getFirehoseRunningDates” and “getFirehoseAnalyzeDates” functions. Below code chunk shows how to list datasets and dates.

library(RTCGAToolbox)

# Valid aliases
getFirehoseDatasets()
##  [1] "ACC"      "BLCA"     "BRCA"     "CESC"     "CHOL"     "COADREAD"
##  [7] "COAD"     "DLBC"     "ESCA"     "FPPP"     "GBMLGG"   "GBM"     
## [13] "HNSC"     "KICH"     "KIPAN"    "KIRC"     "KIRP"     "LAML"    
## [19] "LGG"      "LIHC"     "LUAD"     "LUSC"     "MESO"     "OV"      
## [25] "PAAD"     "PCPG"     "PRAD"     "READ"     "SARC"     "SKCM"    
## [31] "STAD"     "STES"     "TGCT"     "THCA"     "THYM"     "UCEC"    
## [37] "UCS"      "UVM"
# Valid stddata runs
stddata <- getFirehoseRunningDates()
stddata
##  [1] "20160128" "20151101" "20150821" "20150601" "20150402" "20150204"
##  [7] "20141206" "20141017" "20140902" "20140715" "20140614" "20140518"
## [13] "20140416" "20140316" "20140215" "20140115" "20131210" "20131114"
## [19] "20131010" "20130923" "20130809" "20130715" "20130623" "20130606"
## [25] "20130523" "20130508" "20130421" "20130406" "20130326" "20130309"
## [31] "20130222" "20130203" "20130116" "20121221" "20121206" "20121114"
## [37] "20121102" "20121024" "20121020" "20121018" "20121004" "20120913"
## [43] "20120825" "20120804" "20120725" "20120707" "20120623" "20120606"
## [49] "20120525" "20120515" "20120425" "20120412" "20120321" "20120306"
## [55] "20120217" "20120124" "20120110" "20111230" "20111206" "20111128"
## [61] "20111115" "20111026"
# Valid analysis running dates (will return 3 recent date)
gisticDate <- getFirehoseAnalyzeDates(last=3)
gisticDate
## [1] "20160128" "20150821" "20150402"

When the dates and datasets are determined users can call data client function (“getFirehoseData”) to access data. Current version can download multiple data types except ISOFORM and exon level data due to their huge data size. Below code chunk will download READ dataset with clinical and mutation data.

# READ mutation data and clinical data
brcaData <- getFirehoseData(dataset="READ", runDate="20150402",
    forceDownload=TRUE, clinical=TRUE, Mutation=TRUE)

Users have to set several parameters to get data they need. Below “getFirehoseData” options has been explained:

Following logic keys are provided for different data types. By default client only download clinical data.

Users can also set following parameters to set client behavior.

4 Post analysis functions

RTCGAToolbox has analyze functions to provide basic information about datasets. Analyze function includes differential gene expression analyze, correlation analysis between CN and GE data, univariate survival analysis, mutation frequency table and report figure.

4.1 Toy dataset

Since downloading new dataset takes some time and requires enough space on hard drive, I will use nonsense data set for following steps.

You can use following code snippet to load toy dataset.

data(RTCGASample)
RTCGASample
## TEST FirehoseData objectStandard run date: NA 
## Analysis running date: NA 
## Available data types: 
##   clinical: A data frame of phenotype data, dim:  100 x 7 
##   RNASeqGene: A matrix of count or normalized data, dim:  800 x 80 
##   GISTIC: A FirehoseGISTIC for copy number data 
##   Mutation: A data.frame, dim:  2685 x 30 
## To export data, use the 'getData' function.

4.2 Differential gene expression

RTCGAToolbox hires voom(Law, C. W. and Chen, Y. and Shi, W. and Smyth, G. K 2014) and limma(Smyth, G. K 2004) package functions to preform differential gene expression analysis between “Normal” and “Cancer” tissue samples. Every sample which is processed by TCGA project(Cancer Genome Atlas Research Network 2008) has a structured barcode number which includes the source of the tissue. RTCGAToolbox uses the barcode information to divide samples into “Normal” or “Tumor” groups and perform DGE analysis. Since “voom”(Law, C. W. and Chen, Y. and Shi, W. and Smyth, G. K 2014) requires raw count for RNASeq data, normalized RNASeq data cannot be used for the analysis. This function uses all gene expression datasets and returns a list which each member is “DGEResult” object. Each result object keeps top table from the genes that have 2 log fold change expression difference and significant adjusted p value. This function filters the results as a deafult behaviour using raw p value, adjusted p value and log fold change. Users can change “adj.pval”, “raw.pval” and “logFC” parameters to refine their results. Also function uses Benjamini & Hochberg adjustment for p values. For more details about adjment users can check base adjustment methods by calling “?p.adjust”. In addition to filter as a default behaviour function only draws heatmap for top 100 up and down regulated genes. Users can also adjust these values by using “hmTopUpN” and “hmTopDownN” parameters.

# Differential gene expression analysis for gene level RNA data.
diffGeneExprs <- getDiffExpressedGenes(dataObject=RTCGASample, DrawPlots=TRUE,
    adj.method="BH", adj.pval=0.05, raw.pval=0.05, logFC=2, hmTopUpN=10,
    hmTopDownN=10)

# Show head for expression outputs
diffGeneExprs
## [[1]]
## Dataset:RNASeq
## DGEResult object, dim: 4 6
showResults(diffGeneExprs[[1]])
## Dataset: RNASeq
##          logFC  AveExpr        t      P.Value    adj.P.Val        B
## TAP2  5.288573 1.743410 76.38279 7.496531e-76 4.760297e-73 150.8231
## GRTP1 6.187648 2.193494 48.25410 2.030875e-60 6.448029e-58 123.4407
## ENPP5 7.215676 2.707012 45.79069 1.110692e-58 2.350964e-56 120.1123
## APH1B 8.118533 3.158063 37.31744 5.796089e-52 6.134194e-50 106.2072
toptableOut <- showResults(diffGeneExprs[[1]])
## Dataset: RNASeq
##          logFC  AveExpr        t      P.Value    adj.P.Val        B
## TAP2  5.288573 1.743410 76.38279 7.496531e-76 4.760297e-73 150.8231
## GRTP1 6.187648 2.193494 48.25410 2.030875e-60 6.448029e-58 123.4407
## ENPP5 7.215676 2.707012 45.79069 1.110692e-58 2.350964e-56 120.1123
## APH1B 8.118533 3.158063 37.31744 5.796089e-52 6.134194e-50 106.2072

If “DrawPlots” set as FALSE, running code above won’t provide any output figure.

Voom + limma: To voom (variance modeling at the observational level) is to estimate the mean-variance relationship robustly and non-parametrically from the data. Voom works with log-counts that are normalized for sequence depth, in particular with log-counts per million (log-cpm). The mean-variance is fitted to the gene-wise standard deviations of the log-cpm, as a function of the average log-count. This method incorporates the mean-variance trend into a precision weight for each individual normalized observation. The normalized log-counts and associated precision weights can then be entered into the limma analysis pipeline, or indeed into any statistical pipeline for microarray data that is precision weight aware(Smyth, G. K 2004; Law, C. W. and Chen, Y. and Shi, W. and Smyth, G. K 2014). Users can check the following publications for more information about methods:

limma : Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 3.

Voom: Law, CW, Chen, Y, Shi, W, Smyth, GK (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology15, R29.

4.3 Correlation between gene expression and copy number data

“getCNGECorrelation” function provides correlation coefficient and adjusted p value between copy number and gene expression data for each dataset. This function takes main dataobject as an input (uses gene copy number estimates from GISTIC2 (Mermel, C. H. and Schumacher, S. E. and Hill, B. and Meyerson, M. L. and Beroukhim, R. and Getz, G 2011) algorithm and gen expression values from every platform (RNAseq and arrays) to prepare return lists. List object stores “CorResult” object that contains results for each comparison.)

#Correlation between gene expression values and copy number
corrGECN <- getCNGECorrelation(dataObject=RTCGASample, adj.method="BH",
    adj.pval=0.9, raw.pval=0.05)
corrGECN
## [[1]]
## Dataset:RNASeq
## CorResult object, dim: 43    4
showResults(corrGECN[[1]])
## Dataset: RNASeq
##     GeneSymbol        Cor adj.p.value    p.value
## 2       SEPHS1 -0.3769382   0.6287624 0.01650501
## 9         MSMB -0.3404472   0.8472802 0.03158898
## 39       PRMT3 -0.3806078   0.6287624 0.01540088
## 85       HTR3A  0.3378158   0.8472802 0.03301479
## 91      PHLDB1 -0.3260183   0.8688328 0.04007191
## 101       EMP1  0.3814696   0.6287624 0.01515084
corRes <- showResults(corrGECN[[1]])
## Dataset: RNASeq
##     GeneSymbol        Cor adj.p.value    p.value
## 2       SEPHS1 -0.3769382   0.6287624 0.01650501
## 9         MSMB -0.3404472   0.8472802 0.03158898
## 39       PRMT3 -0.3806078   0.6287624 0.01540088
## 85       HTR3A  0.3378158   0.8472802 0.03301479
## 91      PHLDB1 -0.3260183   0.8688328 0.04007191
## 101       EMP1  0.3814696   0.6287624 0.01515084

If the dataset has RNASeq data, data will be normalized for correlation analysis. Correlation function uses Benjamini & Hochberg adjustment for p values. For more details about adjment users can check base adjustment methods by calling “?p.adjust”. In addition, to filter results adjusted and raw p values are used. Users can change “adj.pval” and “raw.pval” parameters to refine results.

The RTCGAToolbox uses one of Pearson’s product moment correlation coefficient to test for associations between paired samples. Measures of association, all in the range [-1, 1] with 0 indicating no association, shows how copy number alterations affect gene expression changes. The test statistic follows a t-distribution, with length (x)-2 degrees of freedom if the samples follow independent normal distributions. Users can get detailed information by calling ?cor.test function

4.4 Mutation frequencies

“getMutationRate” function gets the data frame that stores mutation frequency for the genes. This function gets the mutation information for each sample that has data and calculates frequency for each gene.

# Mutation frequencies
mutFrq <- getMutationRate(dataObject=RTCGASample)
head(mutFrq[order(mutFrq[, 2], decreasing=TRUE), ])
##       Genes MutationRatio
## FCGBP FCGBP          0.46
## NF1     NF1          0.31
## ASTN1 ASTN1          0.24
## ODZ4   ODZ4          0.22
## BRWD1 BRWD1          0.22
## SYNE2 SYNE2          0.22

4.5 Univariate survival analysis

Survival analysis is considered as one of the methods that can provide clinically valuable information. To provide this information, the function creates 2 or 3 groups based on expression data.(If the dataset has RNASeq data, data will be normalized for survival analysis.). If function is triggered with 2 groups, RTCGAToolbox creates groups using the median expression level of individual genes. If group number is set to be 3, then the groups will be defined as: the samples in 1st. quartile (expression < 1st Q), the samples those have higher expression (expression > 3rd Q) and the samples lying in between these 2 groups.

This function also needs a survival data, which can be obtained using clinical data frame. Clinical data frames can be obtained from main data downloads. First column of the survival data frame should be sample barcodes, second column should be time and the last column should be event data. Below code chunk shows how survival data frame can be obtained from clinical data and how survival analysis can be done.

# Creating survival data frame and running analysis for
# FCGBP which is one of the most frequently mutated gene in the toy data
# Running following code will provide following KM plot.
clinicData <- getData(RTCGASample,"clinical")
head(clinicData)
##              Composite.Element.REF yearstobirth vitalstatus daystodeath
## TEST.00.0026                 value           53           0          NA
## TEST.00.0052                 value           50           0          NA
## TEST.00.0088                 value           56           0          NA
## TEST.00.0056                 value           56           0          NA
## TEST.00.0023                 value           56           0          NA
## TEST.00.0092                 value           52           0          NA
##              daystolastfollowup neoplasm.diseasestage pathology.T.stage
## TEST.00.0026               1183            stage iiia        stage iiia
## TEST.00.0052                897             stage iib         stage iib
## TEST.00.0088               1000             stage iib         stage iib
## TEST.00.0056               1134             stage iia         stage iia
## TEST.00.0023                794             stage iib         stage iib
## TEST.00.0092               1104             stage iia         stage iia
clinicData <- clinicData[, 3:5]
clinicData[is.na(clinicData[, 3]), 3] <- clinicData[is.na(clinicData[, 3]), 2]
survData <- data.frame(Samples=rownames(clinicData),
    Time=as.numeric(clinicData[, 3]), Censor=as.numeric(clinicData[, 1]))
getSurvival(dataObject=RTCGASample, geneSymbols=c("FCGBP"), sampleTimeCensor=survData)

5 Data Export

You can export downloaded data from FirehoseData object by using ‘getData()’ function.

# Note: This function is provided for real dataset test since the toy dataset is small.
RTCGASample
## TEST FirehoseData objectStandard run date: NA 
## Analysis running date: NA 
## Available data types: 
##   clinical: A data frame of phenotype data, dim:  100 x 7 
##   RNASeqGene: A matrix of count or normalized data, dim:  800 x 80 
##   GISTIC: A FirehoseGISTIC for copy number data 
##   Mutation: A data.frame, dim:  2685 x 30 
## To export data, use the 'getData' function.
RTCGASampleClinical <- getData(RTCGASample, "clinical")
RTCGASampleRNAseqCounts <- getData(RTCGASample, "RNASeqGene")
RTCGASampleCN <- getData(RTCGASample, "GISTIC", "AllByGene")

6 Reproducing BRCA results from original manuscript

Following code block is provided to reproduce case study in the RTCGAToolbox paper(Samur MK. 2014).

# BRCA data with mRNA (Both array and RNASeq), GISTIC processed copy number data
# mutation data and clinical data
# (Depends on bandwidth this process may take long time)
brcaData <- getFirehoseData (dataset="BRCA", runDate="20140416",
    gistic2Date="20140115", clinic=TRUE, RNAseqGene=TRUE, mRNAArray=TRUE,
    Mutation=TRUE)

# Differential gene expression analysis for gene level RNA data.
# Heatmaps are given below.
diffGeneExprs <- getDiffExpressedGenes(dataObject=brcaData,DrawPlots=TRUE,
    adj.method="BH", adj.pval=0.05, raw.pval=0.05, logFC=2, hmTopUpN=100,
    hmTopDownN=100)
# Show head for expression outputs
diffGeneExprs
showResults(diffGeneExprs[[1]])
toptableOut <- showResults(diffGeneExprs[[1]])

# Correlation between expresiion profiles and copy number data
corrGECN <- getCNGECorrelation(dataObject=brcaData, adj.method="BH",
    adj.pval=0.05, raw.pval=0.05)

corrGECN
showResults(corrGECN[[1]])
corRes <- showResults(corrGECN[[1]])

# Gene mutation frequincies in BRCA dataset
mutFrq <- getMutationRate(dataObject=brcaData)
head(mutFrq[order(mutFrq[,2],decreasing=TRUE),])

# PIK3CA which is one of the most frequently mutated gene in BRCA dataset
# KM plot is given below.
clinicData <- getData(brcaData,"clinical")
head(clinicData)
clinicData <- clinicData[, 3:5]
clinicData[is.na(clinicData[, 3]), 3] <- clinicData[is.na(clinicData[, 3]), 2]
survData <- data.frame(Samples=rownames(clinicData),
    Time=as.numeric(clinicData[, 3]), Censor=as.numeric(clinicData[, 1]))
getSurvival(dataObject=brcaData, geneSymbols=c("PIK3CA"),
    sampleTimeCensor=survData)

Differentially expressed genes.

KM plot for PIK3CA on BRCA dataset.

6.1 Report figure

This function provides an overall circle figure for the dataset by using the RCircos(Zhang, H. and Meltzer, P. and Davis, S 2013). This function uses differential gene expression analysis results (max results for 2 different platforms), copy number data estimates from GISTIC2 (Mermel, C. H. and Schumacher, S. E. and Hill, B. and Meyerson, M. L. and Beroukhim, R. and Getz, G 2011) and mutation data.

Outer circle shows the gene symbols that have mutation in at least 5% of the samples. Inner tracks show the significantly altered gene expressions as fold change and copy number changes where blue represents the deletions and red represents the amplifications.

This function needs a genes location data frame, which can be obtained from “hg19.ucsc.gene.locations” data object. Please see the next section.

# Creating dataset analysis summary figure with getReport.
# Figure will be saved as PDF file.
library("Homo.sapiens")
locations <- genes(Homo.sapiens, columns="SYMBOL")
locations <- as.data.frame(locations)
locations <- locations[,c(6,1,5,2:3)]
locations <- locations[!is.na(locations[,1]), ]
locations <- locations[!duplicated(locations[,1]), ]
rownames(locations) <- locations[,1]
getReport(dataObject=brcaData, DGEResult1=diffGeneExprs[[1]],
    DGEResult2=diffGeneExprs[[2]], geneLocations=locations)

Running code above will provide following circle plot.

7 Data Objects

RTCGAToolbox provides toy data object for testing functions.

data(RTCGASample)

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RTCGAToolbox_2.8.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13               knitr_1.17                
##  [3] XVector_0.18.0             magrittr_1.5              
##  [5] splines_3.4.2              GenomicRanges_1.30.0      
##  [7] BiocGenerics_0.24.0        zlibbioc_1.24.0           
##  [9] IRanges_2.12.0             lattice_0.20-35           
## [11] stringr_1.2.0              GenomeInfoDb_1.14.0       
## [13] tools_3.4.2                grid_3.4.2                
## [15] SummarizedExperiment_1.8.0 parallel_3.4.2            
## [17] data.table_1.10.4-3        Biobase_2.38.0            
## [19] matrixStats_0.52.2         htmltools_0.3.6           
## [21] survival_2.41-3            yaml_2.1.14               
## [23] rprojroot_1.2              digest_0.6.12             
## [25] RJSONIO_1.3-0              Matrix_1.2-11             
## [27] GenomeInfoDbData_0.99.1    S4Vectors_0.16.0          
## [29] bitops_1.0-6               RCurl_1.95-4.8            
## [31] evaluate_0.10.1            rmarkdown_1.6             
## [33] limma_3.34.0               DelayedArray_0.4.0        
## [35] stringi_1.1.5              compiler_3.4.2            
## [37] RCircos_1.2.0              backports_1.1.1           
## [39] stats4_3.4.2               XML_3.98-1.9

References

Cancer Genome Atlas Research Network. 2008. “Comprehensive Genomic Characterization Defines Human Glioblastoma Genes and Core Pathways.”

Law, C. W. and Chen, Y. and Shi, W. and Smyth, G. K. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-Seq Read Counts.”

Mermel, C. H. and Schumacher, S. E. and Hill, B. and Meyerson, M. L. and Beroukhim, R. and Getz, G. 2011. “GISTIC2.0 Facilitates Sensitive and Confident Localization of the Targets of Focal Somatic Copy-Number Alteration in Human Cancers.”

Samur MK. 2014. “RTCGAToolbox: A New Tool for Exporting TCGA Firehose Data.”

Smyth, G. K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.”

Zhang, H. and Meltzer, P. and Davis, S. 2013. “RCircos: an R Package for Circos 2D Track Plots.”