
1 Overview

This package provides a unified approach to programming with Bioconductor components to address problems in cancer genomics. Central concerns are:

2 Ontology

2.1 Oncotree

The NCI Thesaurus project distributes an OBO representation of oncotree. We can use this through the ontoProc (devel branch only) and ontologyPlot packages. Code for visualizing the location of ‘Glioblastoma’ in the context of its ‘siblings’ in the ontology follows.

3 Resource interfaces

3.1 PanCancer Atlas

Together with restfulSE, which handles aspects of the interface to BigQuery, this package provides tools for working with data from the PanCancer Atlas project.

3.1.1 Sample types

A key feature distinguishing the PanCancer Atlas project from TCGA is the availability of data from normal tissue and from metastatic or recurrent tumor samples. Letter codes distinguish the different sample sources:

##   SampleTypeLetterCode                                      SampleType
## 1                  TAM                           Additional Metastatic
## 2                  TAP                        Additional - New Primary
## 3                   TR                           Recurrent Solid Tumor
## 4                   TB Primary Blood Derived Cancer - Peripheral Blood
## 5                   TM                                      Metastatic
## 6                   NT                             Solid Tissue Normal
## 7                   TP                             Primary solid Tumor
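As a minimal sketch of how these codes might be used in practice, the listing above can be transcribed into a named character vector and used as a lookup. The `codes` vector is a made-up example, not data retrieved from BigQuery.

```r
# Lookup table transcribed from the sample type listing above
sample_types <- c(
  TAM = "Additional Metastatic",
  TAP = "Additional - New Primary",
  TR  = "Recurrent Solid Tumor",
  TB  = "Primary Blood Derived Cancer - Peripheral Blood",
  TM  = "Metastatic",
  NT  = "Solid Tissue Normal",
  TP  = "Primary solid Tumor"
)

# Hypothetical letter codes attached to four samples
codes <- c("TP", "NT", "TP", "TM")

# Translate codes to descriptions and flag the normal tissue samples
described <- unname(sample_types[codes])
is_normal <- codes == "NT"
```

Indexing by name keeps the translation vectorized, so the same lookup works for a whole column of codes.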

3.1.2 SummarizedExperiments per assay per tumor or other sample type

The following code will run if you have a valid setting for environment variable CGC_BILLING, to allow BiocOncoTK::pancan_BQ() to generate a proper BigQueryConnection.

The result is

> BRCA_mir
class: SummarizedExperiment 
dim: 743 1068 
metadata(0):
assays(1): assay
rownames(743): hsa-miR-30d-3p hsa-miR-486-3p ... hsa-miR-525-3p
  hsa-miR-892b
rowData names(0):
colnames(1068): TCGA-LD-A7W6 TCGA-BH-A18I ... TCGA-E9-A1N9 TCGA-B6-A0X0
colData names(746): bcr_patient_uuid bcr_patient_barcode ...
  bilirubin_upper_limit days_to_last_known_alive

3.1.3 Subsetting to normal

To shift attention to the normal tissue samples provided, use

to find

class: SummarizedExperiment 
dim: 743 90 
metadata(0):
assays(1): assay
rownames(743): hsa-miR-7641 hsa-miR-135a-5p ... hsa-miR-1323
  hsa-miR-520d-5p
rowData names(0):
colnames(90): TCGA-BH-A18P TCGA-BH-A18S ... TCGA-E9-A1N6 TCGA-E9-A1N9
colData names(746): bcr_patient_uuid bcr_patient_barcode ...
  bilirubin_upper_limit days_to_last_known_alive

The intersection of the colnames from the two SummarizedExperiments thus formed (patients contributing both solid tumor and matched normal) has length 89.
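The matching of tumor and normal samples described above can be sketched with plain character vectors. The barcodes are the ones visible in the two printouts; with the real objects one would presumably pass `colnames()` of each SummarizedExperiment instead, and the variable names here are illustrative only.

```r
# Patient barcodes visible in the two printed objects above
tumor_ids  <- c("TCGA-LD-A7W6", "TCGA-BH-A18I", "TCGA-E9-A1N9", "TCGA-B6-A0X0")
normal_ids <- c("TCGA-BH-A18P", "TCGA-BH-A18S", "TCGA-E9-A1N6", "TCGA-E9-A1N9")

# Patients contributing both a solid tumor and a matched normal sample
paired_ids <- intersect(tumor_ids, normal_ids)
n_paired <- length(paired_ids)
```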

3.1.5 Multiassay experiments per tumor

Suppose we want to work with the mRNA, RPPA, 27k/450k merged methylation and miRNA data together. We can invoke pancan_SE again, specifying the appropriate tables and fields.

After obtaining the clinical data for BRCA with

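library(dplyr)      # tbl(), filter()
library(magrittr)   # %>%
library(S4Vectors)  # DataFrame()
# pcbq: a BigQueryConnection, e.g. as produced by pancan_BQ()
clinBRCA = pcbq %>% tbl(pancan_longname("clinical")) %>% 
  filter(acronym=="BRCA") %>% as.data.frame() 
# use the second column (expected to be the patient barcode) as row names
rownames(clinBRCA) = clinBRCA[,2]
clinDF = DataFrame(clinBRCA)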

we use

library(MultiAssayExperiment)
brcaMAE = MultiAssayExperiment(
  ExperimentList(rnaseq=BRCA_mrna, meth=BRCA_meth, rppa=BRCA_rppa,
    mirna=BRCA_mir),colData=clinDF)

to generate brcaMAE. No assay data are present in this object, but data are retrieved on request.

> brcaMAE
A MultiAssayExperiment object of 4 listed
 experiments with user-defined names and respective classes. 
 Containing an ExperimentList class object of length 4: 
 [1] rnaseq: SummarizedExperiment with 20531 rows and 1097 columns 
 [2] meth: SummarizedExperiment with 22601 rows and 1067 columns 
 [3] rppa: SummarizedExperiment with 259 rows and 873 columns 
 [4] mirna: SummarizedExperiment with 743 rows and 1068 columns 
Features: 
 experiments() - obtain the ExperimentList instance 
 colData() - the primary/phenotype DataFrame 
 sampleMap() - the sample availability DataFrame 
 `$`, `[`, `[[` - extract colData columns, subset, or experiment 
 *Format() - convert into a long or wide DataFrame 
 assays() - convert ExperimentList to a SimpleList of matrices

It is convenient to check for sample availability for the different assays using upsetSamples in MultiAssayExperiment.

The upset diagram for brcaMAE, showing sample availability per assay.


3.2 Supporting infrastructure

The API for pancan_SE in restfulSE is complicated, as its argument list shows.

## function (bqcon, colDataTableName = "clinical_PANCAN_patient_with_followup", 
##     colDSubjectIdName = "bcr_patient_barcode", colDFilterField = "acronym", 
##     colDFilterValue = "BRCA", assayDataTableName = "pancanMiRs_EBadjOnProtocolPlatformWithoutRepsWithUnCorrectMiRs_08_04_16_annot", 
##     assayFeatureName = "ID", subjectIDName = "ParticipantBarcode", 
##     tumorFieldName = "Study", tumorFieldValue = "BRCA", assayValueFieldName = "miRNAexpr") 
## NULL

Long, metadata-laden names are used for some tables, the clinical characteristics table has over 700 variables, and fields bearing information common to different tables may not have common names. Help is needed to permit programming for integrative analysis. BiocOncoTK provides the following assistance:

  • pancan_app: a shiny app that provides interactive table and data overviews
  • pancan_longname: a helper for generating the long table names using a hint that will be processed by agrep:
##                                                RNASeqv2 
## "EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_annot"
  • pancan_BQ: a function that will generate a BigQueryConnection instance provided billing code and Google authentication succeed.
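To illustrate the idea behind hint-based name lookup, here is a minimal sketch built on base R's `agrep`. The function `longname_sketch` is hypothetical and is not the implementation of pancan_longname; the candidate names are the three table names that appear in this document.

```r
# Long table names that appear elsewhere in this document
table_names <- c(
  "EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_annot",
  "clinical_PANCAN_patient_with_followup",
  "pancanMiRs_EBadjOnProtocolPlatformWithoutRepsWithUnCorrectMiRs_08_04_16_annot"
)

# Hypothetical helper: approximate, case-insensitive matching of a short hint
longname_sketch <- function(hint, candidates) {
  hits <- agrep(hint, candidates, ignore.case = TRUE, value = TRUE)
  stats::setNames(hits, rep(hint, length(hits)))
}

rnaseq_table <- longname_sketch("RNASeqv2", table_names)
```

Returning a named vector mirrors the printed output above, where the hint labels the long name it resolved to.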

3.3 TARGET

We assume that an ISB-CGC Google BigQuery billing number is assigned to the environment variable CGC_BILLING.
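Before running any query it can help to confirm that the variable is actually set. The following is a minimal sketch using only base R; the helper name `has_env_sketch` and the variable `BIOCONCOTK_SKETCH_VAR` are hypothetical and not part of the package.

```r
# Hypothetical helper: TRUE when the named environment variable is non-empty
has_env_sketch <- function(name) {
  nzchar(Sys.getenv(name, unset = ""))
}

# Demonstrate on a throwaway variable so the real setting is left untouched
Sys.setenv(BIOCONCOTK_SKETCH_VAR = "demo-value")
set_case <- has_env_sketch("BIOCONCOTK_SKETCH_VAR")
Sys.unsetenv("BIOCONCOTK_SKETCH_VAR")
unset_case <- has_env_sketch("BIOCONCOTK_SKETCH_VAR")

# In an interactive session one might warn when billing is not configured
if (!has_env_sketch("CGC_BILLING")) {
  message("CGC_BILLING is not set; BigQuery examples will not run.")
}
```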

First we list the tables available and have a look at the RNA-seq table.

## Observations: NA
## Variables: 16
## $ project_short_name <chr> "TARGET-RT", "TARGET-RT", "TARGET-RT", "TARGE...
## $ case_barcode       <chr> "TARGET-52-PARPFY", "TARGET-52-PARPFY", "TARG...
## $ sample_barcode     <chr> "TARGET-52-PARPFY-11A", "TARGET-52-PARPFY-11A...
## $ aliquot_barcode    <chr> "TARGET-52-PARPFY-11A-01R", "TARGET-52-PARPFY...
## $ gene_name          <chr> "RIC8B", "ATOH7", "ZNF532", "XKR5", "RP11-33O...
## $ gene_type          <chr> "protein_coding", "protein_coding", "protein_...
## $ Ensembl_gene_id    <chr> "ENSG00000111785", "ENSG00000179774", "ENSG00...
## $ Ensembl_gene_id_v  <chr> "ENSG00000111785.17", "ENSG00000179774.8", "E...
## $ HTSeq__Counts      <int> 2396, 35, 5367, 17, 323, 1718, 1, 4, 3151, 25...
## $ HTSeq__FPKM        <dbl> 3.212811104, 0.247184268, 4.693986615, 0.0353...
## $ HTSeq__FPKM_UQ     <dbl> 7.790066e+04, 5.993448e+03, 1.138145e+05, 8.5...
## $ case_gdc_id        <chr> "5cdd05ea-5285-50b7-971a-8bc005d01669", "5cdd...
## $ sample_gdc_id      <chr> "7448bf2b-4ba0-5f98-ad0f-e87fa6619a43", "7448...
## $ aliquot_gdc_id     <chr> "TARGET-52-PARPFY-11A-01R", "TARGET-52-PARPFY...
## $ file_gdc_id        <chr> "f31fe296-402e-4e7d-b072-e4a6571a9c8a", "f31f...
## $ platform           <chr> "Illumina", "Illumina", "Illumina", "Illumina...

Now let’s see what tumor types are available.

## # Source: lazy query [?? x 2]
## # Database: BigQueryConnection
##   project_short_name        n
##   <chr>                 <int>
## 1 TARGET-NBL          9495831
## 2 TARGET-AML         11310321
## 3 TARGET-RT            302415
## 4 TARGET-WT           7983756

NBL is neuroblastoma, AML is acute myeloid leukemia, RT is rhabdoid tumor, and WT is Wilms’ tumor.

3.4 CCLE

Figure 3a of Barretina et al. 2012 shows that cell lines with NRAS mutations can be ordered according to a measure of PD-0325901 activity, and that this drug activity measure is correlated with expression of AHR. We will acquire the mutation and expression data using BigQuery as provided by ISB.

Here is a listing of all tables:

## [1] "AffyU133_RMA_expression" "Copy_Number_segments"   
## [3] "DataFile_info"           "Mutation_calls"         
## [5] "Sample_information"      "fastqc_metrics"

3.4.1 Mutation data

First we get an overview of the content:

## [1] 53

Now let’s filter by NRAS and get a feel for how many observations are returned per cell line.

We need to carve up the CCLE name to get the organ.

##   Variant_Classification Hugo_Symbol Cell_line_primary_name
## 1      Missense_Mutation        NRAS                SNU-387
## 2      Missense_Mutation        NRAS                SNU-719
## 3      Missense_Mutation        NRAS                 SNU-81
## 4      Missense_Mutation        NRAS                SW 1271
## 5      Missense_Mutation        NRAS                   TF-1
## 6      Missense_Mutation        NRAS                  THP-1
##                                 CCLE_name                              organ
## 1                            SNU387_LIVER                              LIVER
## 2                          SNU719_STOMACH                            STOMACH
## 3                   SNU81_LARGE_INTESTINE                    LARGE_INTESTINE
## 4                             SW1271_LUNG                               LUNG
## 5  TF1_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
## 6 THP1_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
## 
##                  AUTONOMIC_GANGLIA             CENTRAL_NERVOUS_SYSTEM 
##                                  2                                  1 
##                        ENDOMETRIUM HAEMATOPOIETIC_AND_LYMPHOID_TISSUE 
##                                  3                                 25 
##                    LARGE_INTESTINE                              LIVER 
##                                  2                                  2 
##                               LUNG                              OVARY 
##                                  9                                  2 
##                               SKIN                        SOFT_TISSUE 
##                                  6                                  2 
##                            STOMACH                            THYROID 
##                                  1                                  1 
##          UPPER_AERODIGESTIVE_TRACT                      URINARY_TRACT 
##                                  2                                  3
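The step of carving the organ out of the CCLE name can be sketched as follows, assuming the pattern seen in the listing above: a cell line token, an underscore, then the organ. This is one plausible way to do it and may differ from the code used to produce the output.

```r
# CCLE names taken from the listing above
ccle_names <- c(
  "SNU387_LIVER",
  "SW1271_LUNG",
  "TF1_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE"
)

# Drop everything up to and including the first underscore
organ <- sub("^[^_]+_", "", ccle_names)

# Tabulate organs, as in the per-organ counts above
organ_counts <- table(organ)
```

Anchoring the pattern at the first underscore matters because organ names such as LARGE_INTESTINE contain underscores themselves.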

3.4.2 Expression data

Let’s obtain the expression of AHR for these NRAS-mutated cell lines.

##   Cell_line_primary_name RMA_normalized_expression HGNC_gene_symbol
## 1                SNU-719                 11.859660              AHR
## 2                 SNU-81                 10.684210              AHR
## 3              SK-MEL-30                  9.901173              AHR
## 4               KU-19-19                  8.772602              AHR
## 5                  L-363                  3.704958              AHR
## 6                 DND-41                  4.110147              AHR

3.4.3 Drug responsiveness data from CCLE, using pogos

The pogos package (submitted, see github.com/vjcitn/pogos) includes software to query pharmacodb.pmgenomics.ca. We will use this to develop drug-response profiles for PD-0325901.

We’ll define a responsiveness method that takes a function f and applies it to the responses component of the dose-response profile.
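The pattern of applying a user-supplied summary to each profile can be sketched with plain lists. This is not the pogos data structure: the `profiles` list, its field names, the response values, and `responsiveness_sketch` are all made up for illustration.

```r
# Made-up dose-response profiles, one per cell line
profiles <- list(
  list(cell_line = "line_A", responses = c(10, 40, 80)),
  list(cell_line = "line_B", responses = c(5, 5, 20))
)

# Hypothetical helper: summarize each profile's responses with f
responsiveness_sketch <- function(profiles, f) {
  data.frame(
    Cell_line_primary_name = vapply(profiles, function(p) p$cell_line, character(1)),
    resp = vapply(profiles, function(p) f(p$responses), numeric(1)),
    stringsAsFactors = FALSE
  )
}

by_max  <- responsiveness_sketch(profiles, max)
by_mean <- responsiveness_sketch(profiles, mean)
```

Passing f as an argument lets the same traversal serve different summaries of the response curve.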

The activity area for a compound in this design is defined as

##   Cell_line_primary_name   resp       drug dataset
## 1              SK-MEL-30 2.8200 PD-0325901    CCLE
## 2                  L-363 8.0061 PD-0325901    CCLE
## 3                SK-N-AS 4.0700 PD-0325901    CCLE
## 4                SNU-387 7.5157 PD-0325901    CCLE
## 5                   Mino 6.8010 PD-0325901    CCLE
## 6                 MOLP-8 5.4010 PD-0325901    CCLE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.920   3.730   5.178   4.958   5.927   8.006

This is based on the supplement to Barretina et al. 2012. (There is a slightly different formula in the addendum, whose notation includes multiplying by a factor of i for dose index level i.)

Let’s merge the responsiveness data with the expression data for gene AHR.

##   Cell_line_primary_name   resp       drug dataset RMA_normalized_expression
## 1                  639-V 5.5290 PD-0325901    CCLE                  4.711943
## 2                    697 6.1672 PD-0325901    CCLE                  4.803876
##   HGNC_gene_symbol
## 1              AHR
## 2              AHR
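The merge can be sketched with base R's `merge`, joining on the shared cell line column. The rows below are copied from the partial listings shown earlier in this section, so the result covers only the lines visible in both; the data frame names are illustrative.

```r
# Responsiveness rows copied from the listing above
resp_df <- data.frame(
  Cell_line_primary_name = c("SK-MEL-30", "L-363", "SNU-387"),
  resp = c(2.8200, 8.0061, 7.5157),
  stringsAsFactors = FALSE
)

# AHR expression rows copied from the expression listing
expr_df <- data.frame(
  Cell_line_primary_name = c("SNU-719", "SK-MEL-30", "L-363"),
  RMA_normalized_expression = c(11.859660, 9.901173, 3.704958),
  stringsAsFactors = FALSE
)

# Inner join: only cell lines present in both tables survive
merged <- merge(resp_df, expr_df, by = "Cell_line_primary_name")
```

Because `merge` defaults to an inner join, cell lines lacking either a drug response or an expression value are dropped without warning, which is worth checking before computing correlations.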

3.4.4 CLUE

The CLUE platform is an interface to results of work on the connectivity map at Broad Institute. Usage of functions in this toolkit requires an API key, which can be acquired through registration at clue.io. Set the environment variable CLUE_KEY so that Sys.getenv can find it; the functions described here then use it as the default value of their key parameter.

A basic purpose of the interface to CLUE is to allow identification of gene signatures of perturbations in specific cellular contexts.

We have serialized data on cell lines and perturbagens available in the GSE70138 snapshot of LINCS.

##  [1] "cell_id"                 "cell_type"              
##  [3] "base_cell_id"            "precursor_cell_id"      
##  [5] "modification"            "sample_type"            
##  [7] "primary_site"            "subtype"                
##  [9] "original_growth_pattern" "provider_catalog_id"    
## [11] "original_source_vendor"  "donor_age"              
## [13] "donor_sex"               "donor_ethnicity"
## 
##                               -666                            adipose 
##                                  4                                  2 
##                  autonomic ganglia                              blood 
##                                  1                                  1 
##                               bone                             breast 
##                                  4                                  9 
##             central nervous system                        endometrium 
##                                  3                                  3 
## haematopoietic and lymphoid tissue                             kidney 
##                                  8                                  6 
##                    large intestine                              liver 
##                                 17                                  3 
##                               lung                             muscle 
##                                 13                                  3 
##                              ovary                           pancreas 
##                                  6                                  2 
##                           prostate                               skin 
##                                  5                                  6 
##                            stomach                    vascular system 
##                                  1                                  1
## [1] 2170    3
## [1] "pert_id"    "pert_iname" "pert_type"

A number of API services have demonstration query expressions available in the package:

## [1] "cells"         "genes"         "profiles"      "perts"        
## [5] "sigs"          "rep_drug_moas" "pcls"
## $where
## $where$pert_desc
## [1] "imatinib"
## 
## $where$cell_id
## [1] "MCF7"

We use query_clue to query a service. Here we ask for perturbagens that have EGFR among their targets. We’ll retrieve a single ‘gold’ signature identifier.

Now we obtain the metadata about this signature.

3.4.4.1 Example

Task: Assess the effects of perturbagens on transcription in the NPC cell line. We’ll check for recurrence of landmark genes among the top 50 upregulated for perturbagens that are identified as HDAC inhibitors.

We can abstract from this process a function that takes perturbagen classes and cell lines to deliver collections of LINCS signatures of genes considered to produce transcriptional activities of certain kinds.
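The recurrence check in the task above can be sketched with base R once the top upregulated genes per perturbagen are in hand. The gene and perturbagen names below are placeholders, and three genes per signature stand in for the top 50; no CLUE query is made.

```r
# Placeholder top-upregulated gene sets, one per perturbagen
top_up <- list(
  pert_1 = c("GENE_A", "GENE_B", "GENE_C"),
  pert_2 = c("GENE_B", "GENE_C", "GENE_D"),
  pert_3 = c("GENE_C", "GENE_E", "GENE_A")
)

# Count in how many signatures each gene appears
recurrence <- sort(table(unlist(lapply(top_up, unique))), decreasing = TRUE)

# Genes seen in at least two signatures
recurrent_genes <- names(recurrence)[recurrence >= 2]
```

Applying `unique` within each signature ensures a gene is counted once per perturbagen, so the tally measures recurrence across perturbagens rather than repeats within one.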

4 Curated single cell expression data from cancer studies

In this section we illustrate different modalities for acquiring and working with single cell transcriptomics data, after processing by the CONQUER workflow.

4.1 Patel 2014

The Patel et al. experiment assayed 864 cells. A standard in-memory representation is straightforward. The curated SummarizedExperiment is distributed in an AWS S3 bucket sponsored by the Bioconductor Foundation. The loadPatel function retrieves this and places it in a BiocFileCache instance.

## Loading required namespace: BiocFileCache
## Using temporary cache /tmp/RtmpLBu3qD/BiocFileCache
## adding RDS to local cache, future invocations will use local image
## class: RangedSummarizedExperiment 
## dim: 65218 864 
## metadata(2): MIAME README.txt
## assays(1): count_lstpm
## rownames(65218): ENSG00000000003.14 ENSG00000000005.5 ... ERCC-00170
##   ERCC-00171
## rowData names(3): gene genome symbol
## colnames(864): GSM1395399 GSM1395400 ... GSM1396261 GSM1396262
## colData names(44): title geo_accession ... relation.1
##   supplementary_file_1
##                    GSM1395399 GSM1395400 GSM1395401
## ENSG00000000003.14   35.04885  37.249969  335.54343
## ENSG00000000005.5    25.25627   8.198881   15.22254
## ENSG00000000419.12  138.79563   7.761093   52.63018
## ENSG00000000457.13   43.75390  27.625374   57.70224

Exploratory analysis of this dataset is described in the companion vignette on single cell transcriptomics for GBM.

4.2 Darmanis 2017

The Darmanis et al. experiment assayed over 3500 cells. The CONQUER compressed RDS representation of all the data is about 4 GB on disk. The gene level quantifications and sample-level data were manually extracted from this archive. The gene level quantifications in the count_lstpm form were then loaded into a public HDF object store sponsored by John Readey. These data will persist in this format for some time; a Bioconductor-sponsored representation will be introduced as soon as possible.

## <65218 x 3584> DelayedMatrix object of type "double":
##                    GSM2243439 GSM2243440 ... GSM2247076 GSM2247077
## ENSG00000000003.14   0.000000   0.000000   .    0.00000    0.00000
##  ENSG00000000005.5   0.000000   0.000000   .    0.00000    0.00000
## ENSG00000000419.12   0.000000   0.000000   .    0.00000    0.00000
## ENSG00000000457.13   5.335452  11.685833   .    0.00000   14.01612
## ENSG00000000460.16   0.000000   0.000000   .    0.00000    0.00000
##                ...          .          .   .          .          .
##         ERCC-00164     0.0000     0.0000   .    0.00000    0.00000
##         ERCC-00165   480.6895  1228.1385   .    0.00000    0.00000
##         ERCC-00168     0.0000     0.0000   .    0.00000    0.00000
##         ERCC-00170     0.0000   610.8300   .    0.00000    0.00000
##         ERCC-00171 10155.8034 25366.3010   .    4.01555 2531.88862

5 Summary

BiocOncoTK is a result of work carried out under NCI ITCR U01 “Accelerating cancer genomics with cloud-scale Bioconductor”. This package illustrates several Bioconductor-based representations of cancer data and metadata. Some of the resources, such as the PanCancer Atlas, CCLE, and high-resolution single-cell transcriptomics studies, are sufficiently large that cloud-oriented representation and analysis may be cost-effective. As this package matures, additional resources will be highlighted, with particular attention to integration processes.