canceR

A Graphical User Interface for accessing and modeling the Cancer Genomics Data of MSKCC

Karim Mezhoud

2022-03-03

Introduction

canceR is a user-friendly graphical interface to explore, compare, and analyse all available cancer data (clinical data, gene mutation, gene methylation, gene expression, protein phosphorylation, copy number alteration) hosted by the Computational Biology Center (cBio) at Memorial Sloan Kettering Cancer Center (MSKCC). canceR implements functions from various packages:

  1. to access, explore, and extract data from the Cancer Genomics Database of MSKCC (cgdsr; Cerami et al. 2012; Gao et al. 2013),

  2. to associate phenotypes with gene expression (phenoTest; Planet 2013),

  3. to predict which biological processes, pathways, or immune system signatures differ significantly between phenotypes and which genes are associated with them (GSEA-R; Subramanian et al. 2005, 2007),

  4. to predict the most up- or down-regulated gene sets belonging to one of the MSigDB collections (Subramanian et al. 2005) (GSEAlm; Oron, Jiang, and Gentleman 2008),

  5. to classify genes by disease (geNetClassifier; Aibar et al. 2013),

  6. to classify genes by variable or phenotype (rpart; Therneau, Atkinson, and Ripley 2014),

  7. to plot gene correlations,

  8. to plot survival curves, and

  9. to plot multi-omics data in Circos style (circlize; Gu et al. 2014).

Installation

Supplementary libraries (not R packages)

For Debian distribution (GNU/Linux)

For Windows distribution

LibXml2: parser for XML

For OS X distribution

X11: graphics device

Run R and enter the following lines in the console to install the dependencies.

# Package names must be passed as a single character vector
install.packages(c("RCurl", "XML"))
install.packages(c("cgdsr", "tkrplot", "Formula", "RSvgDevice"))

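Dependencies from Bioconductor

# Install BiocManager first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("GSEABase", "GSEAlm", "geNetClassifier", "Biobase", "phenoTest"))
BiocManager::install("canceR")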

Get the development version from GitHub

library(devtools)
# install_github() accepts the "user/repo" shorthand
devtools::install_github("kmezhoud/canceR")

Starting Window

Run R and enter the following lines in the console to start canceR.

library(canceR)
canceR()

The starting window (Figure 1) loads all available cancer studies (Figure 1, item 3) or searches for studies by keyword (Figure 1, item 4). Before getting the cancer data (Figure 1, item 7), it is important to set the workspace for output files (Figure 1, item 1). The starting window also displays a Help menu from which the user can open this vignette (Figure 1, item 2).

Figure 1: Starting window. 1, File menu; 2, Help menu; 3, button to get all available studies; 4, button to get only the studies matching keywords; 5, list box that displays the number of studies listed in 6; 6, list box that displays the result of querying the Cancer Genomics Data Server, where the user can select one or multiple studies; 7, button to get Genetic Profiles and Clinical Data for the selected studies.
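The keyword search can be pictured as a simple filter over the list of study names. The following is a minimal sketch in base R, not the code used by canceR: the `studies` table and the `match_studies` helper are hypothetical, and the study names are taken from the examples in this vignette rather than from a live query of the Cancer Genomics Data Server.

```r
# Hypothetical study table; in canceR the list comes from the data server.
studies <- data.frame(
  index = 1:4,
  name = c(
    "Breast Invasive Carcinoma (TCGA, Nature 2012)",
    "Prostate Adenocarcinoma (Broad/Cornell, Cell 2013)",
    "Prostate Adenocarcinoma, Metastatic (Michigan, Nature 2012)",
    "Uterine Corpus Endometrioid Carcinoma (TCGA, Nature 2013)"
  ),
  stringsAsFactors = FALSE
)

# Keep the studies whose name contains the keyword (literal match, case-insensitive).
match_studies <- function(studies, keyword) {
  hits <- grepl(tolower(keyword), tolower(studies$name), fixed = TRUE)
  studies[hits, , drop = FALSE]
}

prostate <- match_studies(studies, "prostate")
print(prostate$name)
```

Matching on lower-cased text with `fixed = TRUE` treats the keyword literally, so characters such as parentheses in a study name do not need escaping.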

Setting Workspace

The canceR package uses input files to compute models and generates output files that capture the resulting biological knowledge. It is important to set the workspace and to know where the input files and results are located. The Set Workspace button lets the user set the workspace easily (Figure 2). The user only needs to browse to a workspace folder or create a new one. The other necessary folders are created by simply pressing the Set buttons.

Figure 2: Setting Workspace
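For readers who prefer to prepare the folders by hand, the layout can be sketched in base R. This is an illustration only: the `set_workspace` helper and the temporary location are assumptions, and the sub-folder names are limited to those mentioned in this vignette (gct_cls, MSigDB, Results/ProfilesData), so the Set buttons may create additional folders.

```r
# Create a workspace root and the sub-folders named in this vignette.
set_workspace <- function(root) {
  subfolders <- c("gct_cls", "MSigDB", file.path("Results", "ProfilesData"))
  for (folder in file.path(root, subfolders)) {
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(root)
}

# Hypothetical location; replace with a real project folder.
workspace <- file.path(tempdir(), "canceR_workspace")
set_workspace(workspace)
print(list.dirs(workspace, full.names = FALSE, recursive = TRUE))
```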

Main Window

After selecting studies and pressing the Get Cases and Genetic Profiles button, the main window appears (Figure 3) and displays the progress of loading the data of the selected studies. The main window has a toolbar with menus (see the following sections). It is divided into two columns. The first column lists the Cases for all selected studies. The first line of every study indicates its index and a short description. The remaining lines enumerate the Cases with a short description of the data type and the number of samples. The second list box shows the selected Cases. Similarly, the second column displays information about the Genetic Profiles. The user can select a single line or multiple lines, taking care to match each Case with the appropriate Genetic Profile.

Figure 3: Main Window of canceR package. 1, Toolbar; 2, list box of loaded Cases; 3, list box of selected Cases; 4, list box of loaded Genetic Profiles; 5, list box of selected Genetic Profiles

Gene List

The first step in getting genomics data is to specify the genes of interest. The Gene List button browses folders to load a gene list file or displays example gene lists. The genes can be provided in a text file (.txt) with one gene per line, using HUGO gene symbols. The function automatically removes duplicate genes.
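The expected file format and the removal of duplicates can be illustrated with a few lines of base R. This is a sketch, not the function used by canceR: `read_gene_list` is a hypothetical helper, the file content is made up, and trimming whitespace and skipping blank lines are extra assumptions.

```r
# Write a small example file: one HUGO symbol per line, with a duplicate,
# a blank line, and stray spaces.
gene_file <- tempfile(fileext = ".txt")
writeLines(c("TP53", "BRCA1", "TP53", "", " EGFR "), gene_file)

read_gene_list <- function(path) {
  genes <- trimws(readLines(path, warn = FALSE))
  genes <- genes[nzchar(genes)]   # drop empty lines
  unique(genes)                   # remove duplicates, keep first-seen order
}

genes <- read_gene_list(gene_file)
print(genes)
```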

Clinical Data

The Multiple Cases button displays the selected Cases successively. Results are returned in a table with a row for each case and a column for each clinical attribute (Figure 4B). The user can select all or some of the clinical data by ticking the check boxes in the dialog (Figure 4A). Figure 4 shows an example in which a subset of clinical attributes is selected.

Figure 4: Getting clinical data for Breast Invasive Carcinoma. A, Dialog Check Box to select clinical data; B, Results of querying clinical data of Breast Invasive Carcinoma (TCGA, Nature 2012).

Mutation

The user can search for all mutations in the gene list across all selected studies. To get mutations, the user needs to select All tumor samples in Cases and Mutations in Genetic Profiles (Figure 5).

The Mutation function lets the user select about 15 fields describing the mutations (Figure 6A). The result is a table with a row for each sample/case and a column for each field checked in the mutation dialog check box (Figure 6B).

The user can filter the mutation results to keep only a specific amino acid change (Figure 7).
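Filtering a mutation table by amino acid change amounts to a pattern match on one column. The sketch below uses a small illustrative table; the column names (`gene_symbol`, `case_id`, `amino_acid_change`) and the `filter_by_aa_change` helper are assumptions and may differ from the fields returned by the Mutation function.

```r
# Illustrative mutation table (not real query results).
mutations <- data.frame(
  gene_symbol = c("TP53", "TP53", "PIK3CA", "KRAS"),
  case_id = c("S1", "S2", "S3", "S4"),
  amino_acid_change = c("R175H", "R248Q", "H1047R", "G12D"),
  stringsAsFactors = FALSE
)

# Keep the rows whose amino acid change matches a regular expression.
filter_by_aa_change <- function(mutations, pattern) {
  keep <- grepl(pattern, mutations$amino_acid_change)
  mutations[keep, , drop = FALSE]
}

# Example: substitutions of an arginine (R) at any position.
arg_changes <- filter_by_aa_change(mutations, "^R[0-9]+")
print(arg_changes)
```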

Methylation

The user can search for gene methylation and its correlation with mRNA expression. The user needs to select Cases and Genetic Profiles with the same methylation assay (HM450 or HM27) for the same study. Selecting multiple Cases is allowed for one gene list (Figure 8).

Selecting methylation data

The dialog box of the methylation function lets the user specify the threshold of the correlation rate (Figure 9A). cBioPortal (Cerami et al. 2012; Gao et al. 2013) includes only methylation data from the probe with the strongest negative correlation between the methylation signal and the gene’s expression. The result table (Figure 9B) lists genes with a median correlation rate above 0.8.

Figure 9: A, Methylation dialog box; B, Methylation results. Correlation of the silencing of gene expression by methylation (HM27), with correlation rate r > 0.8, for the gene list in Breast Invasive Carcinoma (TCGA, Nature 2012).
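The idea of thresholding on the correlation between methylation and expression can be sketched with toy data. The matrices below are invented, and the code is only one reading of the threshold: it assumes the rate is a Pearson correlation and that silencing corresponds to a negative correlation stronger than the cutoff. It does not reproduce how canceR or cBioPortal compute the reported rate.

```r
samples <- paste0("S", 1:6)

# Toy methylation (beta values) and expression matrices, genes in rows.
methylation <- rbind(
  GENE_A = c(0.10, 0.20, 0.35, 0.50, 0.70, 0.90),
  GENE_B = c(0.40, 0.10, 0.80, 0.30, 0.60, 0.20)
)
expression <- rbind(
  GENE_A = c(9.1, 8.4, 7.0, 5.9, 4.2, 2.8),
  GENE_B = c(5.0, 5.2, 5.1, 4.9, 5.3, 5.0)
)
colnames(methylation) <- colnames(expression) <- samples

# Pearson correlation between methylation and expression, gene by gene.
r <- sapply(rownames(methylation), function(g) {
  cor(methylation[g, ], expression[g, ])
})

# Assumption: a gene is reported when its negative correlation exceeds the cutoff.
threshold <- 0.8
silenced <- names(r)[r < -threshold]
print(round(r, 2))
print(silenced)
```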

Profiles

The Get Profile Data function depends on the gene list, the Cases, and the Genetic Profiles. If the Single gene option is chosen, a dialog box appears to specify the gene symbol (Figure 10A). The dialog check box that follows lets the user choose some or all profile data (Figure 10B). The resulting table lists the selected genetic profile data in columns (CNA, Met, Mut, mRNA, RPPA) and all available samples in rows (Figure 10C). Conversely, if the Multiple genes option is chosen, the returned table displays the expression of the genes in the gene list (columns) for all samples (rows). In the case of multiple genes, the tables are saved in the Results/ProfilesData folder.

Getting profile data of a single gene

Profile data of multiple genes

PhenoTest

This function is implemented from the phenoTest package (Planet 2013). Its purpose is to test the association between a list of phenotype variables (Survival, DFS_STATUS, OS_STATUS) and gene expression. There are two possible formulas for testing associations:

  1. Three variables: survival status (event/time, e.g. Dead-Living / 30 months), a categorical or ordinal description (DFS_STATUS or tumor stage), and a continuous value (DFS_MONTHS, tumor size).
  2. Two variables: a categorical or ordinal description (DFS_STATUS or tumor stage) and a continuous value (DFS_MONTHS, tumor size). In this case the user does not need to select any survival variable in the phenoTest dialog box (Figure 11A).

The function is not expected to return a relevant association for every formula of the chosen variables. In some cases, however, it is possible to cluster a list of genes that are significantly regulated (gene expression) over a range of tumor size (continuous) or tumor stage (ordinal) for Recurred or DiseaseFree cases (survival and categorical). The type of each variable can be explored in the Clinical Data tables and selected in the phenoTest dialog box (Figure 11A). The effects of continuous, categorical, and ordinal phenotype variables on gene expression levels are tested via lmFit from the limma package (Wettenhall and Smyth 2004). The effects of gene expression on survival are tested via the Cox proportional hazards model (Cox 1972), as implemented in the coxph function from the survival package.
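To make the survival part concrete, here is a minimal sketch of a Cox proportional hazards fit for a single gene using `coxph` from the survival package. The data are simulated and the variable names (`time`, `status`, `expr`) are illustrative; this shows the kind of model involved, not the exact call made inside phenoTest.

```r
library(survival)

set.seed(1)
n <- 40
expr <- rnorm(n)                               # simulated expression of one gene
time <- rexp(n, rate = exp(0.8 * expr) / 30)   # higher expression, shorter survival
status <- rbinom(n, 1, 0.8)                    # 1 = event observed, 0 = censored
clinical <- data.frame(time = time, status = status, expr = expr)

# Cox model: does expression of the gene affect the hazard?
fit <- coxph(Surv(time, status) ~ expr, data = clinical)
cox <- summary(fit)$coefficients

hazard_ratio <- unname(cox[1, "exp(coef)"])
p_value <- unname(cox[1, "Pr(>|z|)"])
print(c(hazard_ratio = hazard_ratio, p_value = p_value))
```

In a real analysis this fit would be repeated for every gene in the list, and the resulting p-values would then be adjusted for multiple testing.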

Examples

Study: Prostate Adenocarcinoma (Broad/Cornell, Cell 2013)

  • Cases: All tumor samples (57 samples),

  • Genetic Profiles: mRNA expression,

  • Gene list: 1021.txt file

  • Survival variable: empty

  • Categorical variable: Pathology Tumor Stage

  • Numeric variable: Serum PSA level

  • pVal adjust method: BH

  • PhenoTest with two variables (Figure 11)

After running Pheno/Exp, the PhenoTest function returns two tables. The first table ranks the gene list by p-value (Figure 11B). The first part (red square) displays the p-values of the association between gene expression and tumor stage. The second part (blue square) displays the fold change (fc) by PSA level range.

Interpretation: Notice that a single p-value is reported for each phenotype variable. For categorical variables, this corresponds to the overall null hypothesis that there are no differences between groups.

In the second table, the PhenoTest function keeps only the genes with a significant p-value (pval < 0.05; Figure 11C, red). Here we see that tumor stage has been categorized into 2 groups (pT2c, pT3c) and PSA level has been binned into 2 ranges (7.3-12.9, 12.9-16.7). This result shows that the ANO3 gene is significantly down-regulated (negative fold change) in the two pathology tumor stages (pT2c, pT3c).

Figure 11: PhenoTest. A, dialog box used to select variables; B, results; C, only significant p-values.
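The interpretation given above, one overall p-value per categorical variable followed by BH adjustment and a 0.05 cutoff, can be illustrated in base R. The sketch uses simulated log-scale expression values and an ordinary linear model with an F-test per gene instead of limma's `lmFit`, so it is a simplified stand-in rather than the phenoTest computation. The gene names are placeholders.

```r
set.seed(42)
stage <- factor(rep(c("pT2c", "pT3c"), each = 10))
expr <- matrix(rnorm(5 * 20), nrow = 5,
               dimnames = list(paste0("GENE_", 1:5), NULL))

# Simulate a strong down-regulation of GENE_1 in the pT3c group.
expr["GENE_1", stage == "pT3c"] <- expr["GENE_1", stage == "pT3c"] - 5

# One p-value per gene: F-test of "no difference between groups".
overall_pval <- function(y, group) {
  fit <- lm(y ~ group)
  anova(fit)[["Pr(>F)"]][1]
}

pval <- apply(expr, 1, overall_pval, group = stage)
padj <- p.adjust(pval, method = "BH")   # Benjamini-Hochberg adjustment

# Difference of group means on the log scale (pT3c minus pT2c).
fold_change <- rowMeans(expr[, stage == "pT3c"]) - rowMeans(expr[, stage == "pT2c"])

significant <- names(padj)[padj < 0.05]
print(data.frame(pval, padj, fold_change))
```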

Heterogeneous Clinical Data

In some cases a single variable contains both numeric (0-9) and character (a-z) values. In this case, the phenoTest function treats it as a categorical variable (see the figure below).

Clinical data heterogeneity. The PURITY ABSOLUTE column contains both numeric and character values. In this case, PhenoTest treats the variable as categorical.
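One way to see why a mixed column ends up categorical is to try converting it to numbers. The helper below is a hypothetical illustration of that rule, not the check performed by phenoTest or canceR, and the example values are invented.

```r
# Classify a clinical column: "numeric" only if every non-empty value parses
# as a number, otherwise "categorical".
classify_clinical <- function(x) {
  x <- as.character(x)
  x <- x[!is.na(x) & nzchar(x)]
  as_number <- suppressWarnings(as.numeric(x))
  if (all(!is.na(as_number))) "numeric" else "categorical"
}

purity <- c("0.82", "0.65", "Not Available", "0.91")  # mixed values
psa <- c("7.3", "12.9", "16.7")                       # purely numeric

print(classify_clinical(purity))
print(classify_clinical(psa))
```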

Study: Prostate Adenocarcinoma, Metastatic (Michigan, Nature 2012)

  • Cases: All tumor samples (61 samples),

  • Genetic Profiles: mRNA expression

  • Gene list: 1021.txt file

  • Survival variable: OS MONTHS, OS STATUS

  • Categorical variable: OS STATUS

  • Numeric variable: Serum PSA level

  • pVal adjust method: BH

  • PhenoTest with three variables (see the figure below)

In this test, overall survival (OS_STATUS) was used as both the survival and the categorical variable. The clinical data do not have enough categorical variables. Panels B and C of the figure below show a significant association between 7 genes and living status (OS_STATUS.Living.pval column). The last two columns show opposite regulation of the expression of the 7 genes in living patients with serum PSA level. The Cox proportional hazards model does not give results with the survival variables (OS_STATUS column).

PhenoTest: Prostate Adenocarcinoma, Metastatic (Michigan, Nature 2012)

Study: Lung Adenocarcinoma (TCGA, Nature, in press)

  • Cases: All Samples with mRNA expression data (230 samples),

  • Genetic Profiles: mRNA expression z-Scores (RNA Seq V2 RSEM)

  • Gene list: 1021.txt file

  • Survival variable: empty

  • Categorical variable: OS STATUS

  • Numeric variable: OS_MONTHS

  • pVal adjust method: BH

  • PhenoTest with two variables.

In this lung cancer study, the test shows a significant association between living patients and the expression of 10 genes (see the figure below).

PhenoTest: Lung Adenocarcinoma (TCGA, Nature, in press)

GSEA-R

Gene Set Enrichment Analysis (GSEA) is a computational method that uses an expression matrix of thousands of genes, phenotype data (two biological states), and the Molecular Signatures Database (MSigDB) to determine which biological processes, pathways, or immune system signatures differ significantly between the phenotypes and which genes are associated with them (Subramanian et al. 2005).
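At the core of GSEA is a running-sum enrichment score computed over a ranked gene list, as described by Subramanian et al. (2005). The sketch below is a compact base R illustration of that statistic under simple assumptions: genes are already ranked by decreasing score, `p` is the weighting exponent, and no permutation test or normalization is performed. The function and gene names are hypothetical, and this is not the GSEA-R code called by canceR.

```r
# Running-sum enrichment score for one gene set.
# ranked_genes: gene names sorted by decreasing score
# scores:       ranking metric in the same order
enrichment_score <- function(ranked_genes, scores, gene_set, p = 1) {
  in_set <- ranked_genes %in% gene_set
  stopifnot(any(in_set), any(!in_set))
  hit_weight <- abs(scores)^p * in_set
  p_hit <- cumsum(hit_weight) / sum(hit_weight)   # weighted fraction of hits so far
  p_miss <- cumsum(!in_set) / sum(!in_set)        # fraction of misses so far
  running <- p_hit - p_miss
  running[which.max(abs(running))]                # maximum deviation from zero
}

ranked_genes <- paste0("g", 1:10)
scores <- c(5, 4, 3, 2, 1, -1, -2, -3, -4, -5)

es_top <- enrichment_score(ranked_genes, scores, c("g1", "g2", "g3"))
es_bottom <- enrichment_score(ranked_genes, scores, c("g8", "g9", "g10"))
print(c(top = es_top, bottom = es_bottom))
```

A set concentrated at the top of the ranking gives a positive score and a set concentrated at the bottom gives a negative one. In a full analysis, significance is assessed by recomputing the score on permuted phenotype labels.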

Preprocessing of Experimental Data

The getGCT_CLS function loads the Profile and Clinical data of the selected study and saves two files into the “gct_cls” folder (see panel C of the preprocessing figure in the Examples below).

  • The GCT file contains gene expression values, with genes in rows and samples in columns.

  • The CLS file contains the two biological phenotypes selected from the Clinical data. The user must select a clinical phenotype with exactly two classes.
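The two file layouts can be sketched with a small writer in base R. This assumes the standard Broad Institute GCT (version 1.2) and categorical CLS layouts; `write_gct` and `write_cls` are hypothetical helpers, the expression values are invented, and the files written by getGCT_CLS may differ in detail.

```r
# GCT: "#1.2", then "<genes>\t<samples>", then a header row and one row per gene.
write_gct <- function(expr, path) {
  header <- c(
    "#1.2",
    paste(nrow(expr), ncol(expr), sep = "\t"),
    paste(c("NAME", "Description", colnames(expr)), collapse = "\t")
  )
  rows <- vapply(seq_len(nrow(expr)), function(i) {
    paste(c(rownames(expr)[i], "na", expr[i, ]), collapse = "\t")
  }, character(1))
  writeLines(c(header, rows), path)
}

# CLS: "<samples> <classes> 1", then "# class names", then one label per sample.
write_cls <- function(phenotype, path) {
  phenotype <- as.character(phenotype)
  classes <- unique(phenotype)
  stopifnot(length(classes) == 2)   # GSEA here compares exactly two classes
  writeLines(c(
    paste(length(phenotype), length(classes), 1),
    paste(c("#", classes), collapse = " "),
    paste(match(phenotype, classes) - 1, collapse = " ")
  ), path)
}

expr <- matrix(c(1.5, 2.0, 0.3, 4.1, 7.2, 6.8, 7.9, 3.3), nrow = 2, byrow = TRUE,
               dimnames = list(c("TP53", "BRCA1"), paste0("S", 1:4)))
phenotype <- c("DiseaseFree", "DiseaseFree", "Recurred", "Recurred")

gct_path <- tempfile(fileext = ".gct")
cls_path <- tempfile(fileext = ".cls")
write_gct(expr, gct_path)
write_cls(phenotype, cls_path)
cat(readLines(gct_path), sep = "\n")
cat(readLines(cls_path), sep = "\n")
```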

Molecular Signatures DataBase

The Molecular Signatures Database (MSigDB) is a collection of annotated gene sets for use with the GSEA computational method. The MSigDB gene sets are divided into 7 collections (positional gene sets, curated gene sets, motif gene sets, computational gene sets, GO gene sets, oncogenic signatures, and immunological signatures). All these collections are available from the Broad Institute. Every collection consists of a tab-delimited file (.gmt file) that describes gene sets. Each row contains an annotation term with its associated genes. The user needs to download the .gmt files with gene symbols and save them into the “workspace/MSigDB/” folder. The MSigDB folder is created from the File menu in the starting window (Figure 1).

For more details about GCT, CLS, and GMT files, see this link.
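As an illustration of the .gmt layout, the sketch below parses a small hand-written file into a named list of gene sets. It assumes the usual GMT convention of set name, description, then gene symbols, all tab-separated; the set names and the `read_gmt` helper are hypothetical.

```r
# Write a tiny example .gmt file: name, description, then gene symbols.
gmt_file <- tempfile(fileext = ".gmt")
writeLines(c(
  "SET_A\thttp://example.org/SET_A\tTP53\tBRCA1\tEGFR",
  "SET_B\tna\tKRAS\tPIK3CA"
), gmt_file)

# Parse a .gmt file into a named list: one character vector of genes per set.
read_gmt <- function(path) {
  fields <- strsplit(readLines(path, warn = FALSE), "\t", fixed = TRUE)
  gene_sets <- lapply(fields, function(x) x[-(1:2)])   # drop name and description
  names(gene_sets) <- vapply(fields, `[`, character(1), 1)
  gene_sets
}

gene_sets <- read_gmt(gmt_file)
str(gene_sets)
```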

MSigDB Collection

  • C1: Positional Gene Sets. Gene sets corresponding to each human chromosome and each cytogenetic band that has at least one gene. These gene sets are helpful in identifying effects related to chromosomal deletions or amplification, epigenetic silencing, and other region effects.

  • C2: Curated Gene Sets into Pathways. Gene sets collected from various sources such as online pathway databases.

    • CGP: Chemical and Genetic Perturbation. Gene sets that represent expression signatures of genetic and chemical perturbations.
    • CP: Reactome gene sets. Gene sets derived from the Reactome pathway database.

  • C3: Motifs Gene Sets. Gene sets that contain genes that share:

    • MIR: microRNA targets. A 3’-UTR microRNA binding motif.

    • TFT: transcription factor targets. A transcription factor binding site defined in the TRANSFAC (version 7.4, http://www.gene-regulation.com/) database.

  • C4: Computational Gene Sets. Computational gene sets defined by mining large collections of cancer-oriented microarray data.

  • C5: GO Gene Sets. Gene sets are named by GO term and contain genes annotated by that term: Biological Process, Cellular Component, and Molecular Function.

  • C6: Oncogenic Signatures. Gene sets that represent signatures of cellular pathways which are often dysregulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO.

  • C7: Immunologic Signatures. Gene sets that represent cell states and perturbations within the immune system. This resource is generated as part of the Human Immunology Project Consortium (HIPC).

Examples

Study: Uterine Corpus Endometrioid Carcinoma (TCGA, Nature 2013)

  • Cases: All Samples with mRNA, CNA, and sequencing data (232 samples),

  • Genetic Profiles: mRNA expression (RNA Seq V2 RSEM)

  • MSigDB: c5.bp.v4.0.symbols

  • Gene list: 1021.txt file

  • Nbr of Samples: 100

  • Phenotype: DFS_STATUS

Based only on the gene list, the getGCT_CLS function builds the .gct and .cls files and saves them in the “/gct_cls/” folder. The figure below shows the preprocessing steps used to get the gct and cls files.

Preprocessing of gct and cls files. A, Sampling the Cases from the Uterine Corpus Endometrioid Carcinoma study; B, selecting a phenotype variable with only two classes; C, displaying the cls and gct files. The names of the two files are indicated in the titles of the windows.

For enrichment, the GSEA function needs three files: the gct file with gene expression, the cls file with phenotypes, and the gmt file with the molecular signatures of gene sets. There are two options for loading the gmt file: from the examples available in the canceR package (panel A of the figure below, MSigDB.gmt button) or from the “workspace/MSigDB/” folder (panel A, browse button). In both cases the gmt file must come from the Broad Institute and use gene symbols.

Gene Set Enrichment Analysis of the Uterine Corpus Endometrioid Carcinoma study using the “DiseaseFree/Recurred” phenotypes. A, Selecting the gct, cls, and gmt files and setting the output folder; B, specifying the phenotype; C, displaying the classes of the selected phenotype; D, selecting the summary result files of the output and setting the FDR; E, displaying the specific (FDR = 0.25) gene sets (GS) involved in the DiseaseFree phenotype. In this GSEA, no gene set is significantly and specifically involved in the Recurred phenotype.
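Selecting gene sets at a given FDR is a simple filter on the summary table. The sketch below uses an invented summary table; the column names (`gene_set`, `nes`, `fdr`) and the `filter_by_fdr` helper are assumptions and do not reproduce the column headers of the GSEA-R summary report files.

```r
# Illustrative GSEA summary table for one phenotype (not real results).
summary_results <- data.frame(
  gene_set = c("GS_1", "GS_2", "GS_3", "GS_4"),
  nes = c(2.1, 1.7, 1.2, 0.9),       # normalized enrichment score
  fdr = c(0.03, 0.20, 0.31, 0.75),   # false discovery rate q-value
  stringsAsFactors = FALSE
)

# Keep gene sets at or below the FDR cutoff, most significant first.
filter_by_fdr <- function(results, cutoff = 0.25) {
  kept <- results[results$fdr <= cutoff, , drop = FALSE]
  kept[order(kept$fdr), , drop = FALSE]
}

enriched <- filter_by_fdr(summary_results, cutoff = 0.25)
print(enriched)
```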

References

Aibar, Sara, Celia Fontanillo, and Javier De Las Rivas (Bioinformatics and Functional Genomics Group, Cancer Research Center, Salamanca, Spain). 2013. GeNetClassifier: Classify Diseases and Build Associated Gene Networks Using Gene Expression Profiles. http://bioinfow.dep.usal.es/.

Cerami, Ethan, Jianjiong Gao, Ugur Dogrusoz, Benjamin E. Gross, Selcuk Onur Sumer, Bülent Arman Aksoy, Anders Jacobsen, et al. 2012. “The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data.” Cancer Discov 2 (5): 401–4. https://doi.org/10.1158/2159-8290.CD-12-0095.

Cox, D. R. 1972. “Regression Models and Life Tables.” Journal of the Royal Statistical Society Series B 34 (2): 187–220.

Gao, Jianjiong, Bülent Arman Aksoy, Ugur Dogrusoz, Gideon Dresdner, Benjamin Gross, S Onur Sumer, Yichao Sun, et al. 2013. “Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal.” Sci Signal 6 (269): pl1. https://doi.org/10.1126/scisignal.2004088.

Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in R.” Bioinformatics 30 (19): 2811–2. https://doi.org/10.1093/bioinformatics/btu393.

Oron, Assaf P., Zhen Jiang, and Robert Gentleman. 2008. “Gene Set Enrichment Analysis Using Linear Models and Diagnostics.” Bioinformatics 24 (22): 2586–91. https://doi.org/10.1093/bioinformatics/btn465.

Planet, Evarist. 2013. “PhenoTest: Tools to Test Association Between Gene Expression and Phenotype in a Way That Is Efficient, Structured, Fast and Scalable.” 2013. http://www.bioconductor.org/packages/release/bioc/html/phenoTest.html.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proc Natl Acad Sci U S A 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

Therneau, T., B. Atkinson, and B. Ripley. 2014. “Rpart: Recursive Partitioning and Regression Trees.” R Project; R Project. 2014. http://cran.r-project.org/web/packages/rpart/index.html.

Wettenhall, James M., and Gordon K. Smyth. 2004. “LimmaGUI: A Graphical User Interface for Linear Modeling of Microarray Data.” Bioinformatics 20 (18): 3705–6. https://doi.org/10.1093/bioinformatics/bth449.