Contents

1 Introduction

An increasing number of precision oncology programmes are being launched world-wide. To support this development, we present the Cancer Variant Explorer (CVE), an R package with an interactive Shiny interface. Leveraging Oncotator annotations and the Drug Gene Interaction Database, CVE prioritises variants to identify drivers, resistance mechanisms and druggability. We encourage the extension of CVE by additional modules for more tailored analyses and provide a first CVE extension adding the exploration of variant genes in a melanoma-specific co-expression network. The goal of the tutorial is to present the functionality of the CVE package.

2 Installation

CVE can be installed via Bioconductor

source('http://www.bioconductor.org/biocLite.R')
biocLite('CVE')

and once installed loaded by

library(CVE)

3 Single-patient case study

Variant information is downloaded for a single colorectal cancer patient using the RTCGAToolbox package.

#load package
library(RTCGAToolbox)

#load all colorectal cancer data 
crcData <- getFirehoseData(dataset="COAD",
                          clinical=TRUE,
                          Mutation=TRUE,
                          runDate="20160128")
#pick a single patient for case study
mutData <- selectType(crcData, "Mutation")
crcCase_Firehouse <- mutData[mutData[["Tumor_Sample_Barcode"]] == 
    "TCGA-AA-A00N-01A-02W-A00E-09",]

The colon adeonocarcinoma sample harbors 4709 variants. CVE requires a data frame including the columns chromosome, start, end, reference allele and observed allele.

crcCase_input = data.frame(chr=crcCase_Firehouse$Chromosome,
                      start=crcCase_Firehouse$Start_position,
                      end=crcCase_Firehouse$End_position,
                      reference_allele=crcCase_Firehouse$Reference_Allele,
                      observed_allele=crcCase_Firehouse$Tumor_Seq_Allele2)
head(crcCase_input)
##   chr     start       end reference_allele observed_allele
## 1  10 100894110 100894110                T               G
## 2  10 100985376 100985376                C               A
## 3  10 101137905 101137905                G               A
## 4  10 101429058 101429058                A               C
## 5  10 101445832 101445832                G               A
## 6  10 101479316 101479316                G               T

4 Annotate variants using Oncotator

The following function retrieves the Oncotator annotation via the Application Programming Interface (an internet connection is needed for this step). The Oncotator Variant Annotation tool summarises variant-centric information from 14 different publicly available resources relevant for cancer researchers 1. For more information, see Oncotator webpage.

library(jsonlite)
crcCase <- get.oncotator.anno(crcCase_input)

5 Opening the Shiny app

The function openCVE opens the CVE Shiny app. It requires the Oncotator output file (MAF file) as a data frame or a list of multiple MAF files. The single-patient case study in a colon adenocarcinoma can be loaded with

openCVE(crcCase)

6 CVE functionality

The core implementation of the CVE Shiny app offers four different tabs to explore variants.

6.1 Annotation

The first part of the annotation tab summarises the functional consequence annotation from GENCODE including the variant classification (e.g. missense, nonsense, frame-shift etc.). The left side panel of CVE offers a filter to also include non-SNVs for further prioritisation. The bottom part displays the heatmap of the clusters of mutational effect prediction algorithm from the dbNSFP database2 for the dataset. The algorithms primarily exploit the reasoning that more deleterious gene regions have fewer observed substitutions across species due to tighter evolutionary constrains (i.e. conservation-based algorithms) or the different physico-chemical properties of amino acids and the corresponding three-dimensional protein structure (i.e. functional prediction algorithms). In addition, ensembl scores combining different approaches have been developed (e.g. CADD). For more information about the individual algorithm, see . Of note, CVE does neither benchmark the scores of the functional prediction algorithms, nor tries to derive the best score. Instead, it displays the heterogeneity of predication based on the rankscores of the 18 algorithms. The rankscores are between 0 and 1, where 1 indicates the highest rank amoung the 87,347,043 non-synonymous single-nucleotide variants. CVE depicts algorithms with similar rankscores for the variants by means of a heatmap of the consensus indices derived by consensus clustering Based on the heatmap, the user can select a single prediction algorithm resembling the information of one algorithm cluster of choice. In addition, for users unfamiliar with the prospects of the individual algorithms we propose the dbNSFP combination score \(c\)

\[c = \sum_{j= 1}^{m} y_j \qquad \text{with} \qquad y_j = \begin{cases} \overline{x_{ij}} & \text{if} \quad \overline{x_{ij}} \geq 0.75\\ 0 & \text{if} \quad \overline{x_{ij}} < 0.75 \end{cases}\]

where \(x_{ij}\) is the rankscore of algorithm \(i\) in cluster \(j\) and \(\overline{x_{ij}}\) the mean rankscore of algorithm cluster \(j\). \(\overline{x_{ij}}\) is only added to \(c\) if there is significant evidence for the variant in algorithm cluster \(j\), defined by a mean rankscore belonging to the upper quartile of rankscores.

6.2 Prioritisation

Depending on the scientific question, a more or less restrictive prioritisation of variants is warranted. A study aiming to suggest options for targeted therapies might only be interested in the most promising druggable variant within an exome data set. On the contrary, for targeted sequencing, 10-100 variants are a feasible number to monitor key genomic changes over the disease course (e.g. analysis of circulating cell free tumour DNA, ctDNA). Therefore, CVE offers the instant and flexible modification of key filters and cutoffs.At the core of the prioritisation workflow is the functional prediction algorithm of choice. An interactive slider in the left sidebar panel can be used to modify the cutoff. In addition to the dbNSFP data, the Oncotator annotation includes further information that can be used for priortisation. We recommend to

  • exclude germline SNVs identified by the 1000 Genomes project
  • include all overlapping COSMIC variants

Optionally, variants in known DNA repair genes as summarised by Wood et al.3 are displayed and can also be included by applying another filter.

6.3 Top table

A table of the prioritised variants can be accessed in the top table tab. For easy data handling, this top table can also be downloaded as a tab-separated spread sheet using the download button in the sidebar. The columns of the top table summarize:

  • gene: gene symbol
  • protein change: location of amino acid change in protein
  • type: SNV, dinucleotide substitution (DNP), deletion (DEL) or insertion (INS)
  • classification: functional consequence annotation from GENCODE
  • score: (rank)score of the mutational effect prediction algorithm selected
  • COSMIC entity: number of mutations identified per tumour entity in the COSMIC database

For reproduciblity, the header of the top table includes all filters seleted to prioritise the variants.

6.4 Druggability

At this point of the workflow, variants were annotated, ranked and prioritised. As a result, we are left with a handful of variants likely to be essential to the individual tumour biology. The next step required to guide precision cancer medicine is the assessment of the druggability of candidate variants.

Up to date, the Drug-Gene Interaction database (DGIdb)4 offers the most comprehensive collection of drug-gene interactions. Within DGIdb, CVE only queries the TEND and My Cancer Genome information, as both sources were expert-curated and comprise multiple tumour types. CVE accesses the DGIdb data via the application programming interface (API). This way, no local installation of the database is needed and entries are automatically up-to-date. The interactions found can be explored in a data table which can also also downloaded as a csv file in the sidepanel of CVE.

6.5 Summary of case study

CVE reveals the following points about the single colorectal cancer:

  • 68 protein-changing SNVs (total of 4709 variants)
  • only 5 variants in the genes TMPRSS12, HNRNPUL1, KLK11, DNAJC6 and PCDHGB1 have evidence for a function impact in at least one prediction cluster (dbNSFP cutoff >0.75)
  • 2 variants were previously identified in other tumour entities (TMPRSS12 p.D262N in urinary tract tumor; MOSPD3 p.G139E in prostate cancer)
  • NOTCH4 p.P1723S is putatively druggable by the drug RO4929097 and NTRK1 p.I328M by Imatinib, however, both variants have a dbNSFP combination score of 0 = no predicted functional impact

7 Melanoma cohort case study and WGCNA extension

CVE can also be applied in cohort variant data. As a meaningful cohort we chose BRAF-wt/RAS-wt melanomas, as they can neither be target with BRAF nor MEK inhibitors in the metastasised setting. We gathered the case study from The Cancer Genome Atlas 5. 93 of the 345 patients could be classfied as BRAF-wt/RAS-wt. The four tabs in the core implementation of CVE can be applied to all tumor entities. To illustrate the functionality and flexibility of an open-source R code, we developed a first extension to explore variant genes of interest in a melanoma-specific pathway context. To this end, we generated a melanoma-specfic coexpression network from TCGA data using the WGCNA methodology (see vignette WGCNA_from_TCGA_RNAseq). To start the melanoma cohort study with the extension:

openCVE(melanomaCase, extension="WGCNAmelanoma")

8 Session information

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RTCGAToolbox_2.10.0 BiocStyle_2.8.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16                compiler_3.5.0             
##  [3] GenomeInfoDb_1.16.0         XVector_0.20.0             
##  [5] bitops_1.0-6                tools_3.5.0                
##  [7] zlibbioc_1.26.0             digest_0.6.15              
##  [9] evaluate_0.10.1             lattice_0.20-35            
## [11] Matrix_1.2-14               DelayedArray_0.6.0         
## [13] yaml_2.1.18                 parallel_3.5.0             
## [15] xfun_0.1                    GenomeInfoDbData_1.1.0     
## [17] stringr_1.3.0               knitr_1.20                 
## [19] RCircos_1.2.0               S4Vectors_0.18.0           
## [21] IRanges_2.14.0              stats4_3.5.0               
## [23] rprojroot_1.3-2             grid_3.5.0                 
## [25] Biobase_2.40.0              data.table_1.10.4-3        
## [27] survival_2.42-3             XML_3.98-1.11              
## [29] BiocParallel_1.14.0         rmarkdown_1.9              
## [31] bookdown_0.7                RJSONIO_1.3-0              
## [33] limma_3.36.0                magrittr_1.5               
## [35] splines_3.5.0               backports_1.1.2            
## [37] htmltools_0.3.6             matrixStats_0.53.1         
## [39] BiocGenerics_0.26.0         GenomicRanges_1.32.0       
## [41] RaggedExperiment_1.4.0      SummarizedExperiment_1.10.0
## [43] stringi_1.1.7               RCurl_1.95-4.10

9 References