Introduction

This vignette helps developers migrate from the now-defunct cgdsr CRAN package. The cgdsr code is shown for comparison only and is not guaranteed to work. If you have questions about the contents, please open an issue at the GitHub repository: https://github.com/waldronlab/cBioPortalData/issues

Loading the package

library(cBioPortalData)

Discovering studies

cgdsr

library(cgdsr)
cgds <- CGDS("http://www.cbioportal.org/")
getCancerStudies.CGDS(cgds)

cBioPortalData

Here we show the default inputs to the cBioPortal function for clarity.

cbio <- cBioPortal(
    hostname = "www.cbioportal.org",
    protocol = "https",
    api. = "/api/api-docs"
)
getStudies(cbio)
## # A tibble: 342 × 13
##    name          description publicStudy pmid  citation groups status importDate
##    <chr>         <chr>       <lgl>       <chr> <chr>    <chr>   <int> <chr>     
##  1 Adenoid Cyst… "Whole exo… TRUE        2609… Martelo… "ACYC…      0 2022-03-0…
##  2 Adenoid Cyst… "Multi-Ins… TRUE        3148… Allen e… "ACYC…      0 2022-03-0…
##  3 Adrenocortic… "TCGA Adre… TRUE        <NA>  <NA>     "PUBL…      0 2022-03-0…
##  4 Bladder Canc… "Whole exo… TRUE        2690… Al-Ahma… ""          0 2022-03-0…
##  5 Basal Cell C… "Whole-exo… TRUE        2695… Bonilla… "PUBL…      0 2022-03-0…
##  6 Acute Lympho… "Comprehen… TRUE        2573… Anderss… "PUBL…      0 2022-03-0…
##  7 Ampullary Ca… "Exome seq… TRUE        2680… Gingras… "PUBL…      0 2022-03-0…
##  8 Bladder Urot… "Whole exo… TRUE        2509… Van All… "PUBL…      0 2022-03-0…
##  9 Bladder Canc… "Comprehen… TRUE        2389… Iyer et… "PUBL…      0 2022-03-0…
## 10 Bladder Urot… "Whole-exo… TRUE        2412… Guo et … "PUBL…      0 2022-03-0…
## # … with 332 more rows, and 5 more variables: allSampleCount <int>,
## #   readPermission <lgl>, studyId <chr>, cancerTypeId <chr>,
## #   referenceGenome <chr>

Note that the studyId column is important for further queries.

head(getStudies(cbio)[["studyId"]])
## [1] "acbc_mskcc_2015"              "acc_2019"                    
## [3] "acc_tcga"                     "blca_plasmacytoid_mskcc_2016"
## [5] "bcc_unige_2016"               "all_stjude_2015"
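Because later queries are keyed on studyId, it can help to look an identifier up by keyword. The following is a minimal sketch rather than part of cBioPortalData: find_study_ids is a hypothetical helper that matches a pattern against the name column of a table shaped like the getStudies output. The toy table reuses two study identifiers from the output above, with placeholder names standing in for the truncated ones.

```r
# Hypothetical helper (not part of cBioPortalData): return the studyId
# values whose study name matches a case-insensitive pattern.
find_study_ids <- function(studies, pattern) {
    hits <- grepl(pattern, studies[["name"]], ignore.case = TRUE)
    studies[["studyId"]][hits]
}

# Toy stand-in for getStudies(cbio); the names are placeholders.
toy_studies <- data.frame(
    name = c("Placeholder adrenocortical study", "Placeholder bladder study"),
    studyId = c("acc_tcga", "blca_plasmacytoid_mskcc_2016"),
    stringsAsFactors = FALSE
)

find_study_ids(toy_studies, "bladder")
```

With a live connection, find_study_ids(getStudies(cbio), "glioblastoma") would apply the same filter to the real table.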

Obtaining Cases and Clinical data

cgdsr

Obtain the case lists for the study from the cgds object, then retrieve the clinical data with the first case list identifier (gbm_tcga_pub_all in this example).

clist1 <-
    getCaseLists.CGDS(cgds, cancerStudy = "gbm_tcga_pub")[1, "case_list_id"]

getClinicalData.CGDS(cgds, clist1)

cBioPortalData

For the case list identifiers, you can use sampleLists and inspect the sampleListId column. Here we take the first value as in the example above.

clist1 <- sampleLists(cbio, "gbm_tcga_pub")[1, "sampleListId", drop = TRUE]
clist1
## [1] "gbm_tcga_pub_all"

Note that the clinicalData function does not require a sample list ID: it uses the fetchAllClinicalDataInStudyUsingPOST endpoint internally and returns data for all patients in the study.

clinicalData(cbio, "gbm_tcga_pub")
## # A tibble: 206 × 24
##    patientId    DFS_MONTHS  DFS_STATUS KARNOFSKY_PERFORMANC… OS_MONTHS OS_STATUS
##    <chr>        <chr>       <chr>      <chr>                 <chr>     <chr>    
##  1 TCGA-02-0001 4.504109589 1:Recurred 80.0                  11.60547… 1:DECEAS…
##  2 TCGA-02-0003 1.315068493 1:Recurred 100.0                 4.734246… 1:DECEAS…
##  3 TCGA-02-0004 10.32328767 1:Recurred 80.0                  11.34246… 1:DECEAS…
##  4 TCGA-02-0006 9.928767123 1:Recurred 80.0                  18.34520… 1:DECEAS…
##  5 TCGA-02-0007 17.03013699 1:Recurred 80.0                  23.17808… 1:DECEAS…
##  6 TCGA-02-0009 8.679452055 1:Recurred 80.0                  10.58630… 1:DECEAS…
##  7 TCGA-02-0010 11.53972603 1:Recurred 80.0                  35.40821… 1:DECEAS…
##  8 TCGA-02-0011 4.734246575 1:Recurred 80.0                  20.71232… 1:DECEAS…
##  9 TCGA-02-0014 <NA>        <NA>       100.0                 82.55342… 1:DECEAS…
## 10 TCGA-02-0015 14.99178082 1:Recurred 80.0                  20.61369… 1:DECEAS…
## # … with 196 more rows, and 18 more variables: PRETREATMENT_HISTORY <chr>,
## #   PRIOR_GLIOMA <chr>, SAMPLE_COUNT <chr>, SEX <chr>, sampleId <chr>,
## #   ACGH_DATA <chr>, CANCER_TYPE <chr>, CANCER_TYPE_DETAILED <chr>,
## #   COMPLETE_DATA <chr>, FRACTION_GENOME_ALTERED <chr>, MRNA_DATA <chr>,
## #   MUTATION_COUNT <chr>, ONCOTREE_CODE <chr>, SAMPLE_TYPE <chr>,
## #   SEQUENCED <chr>, SOMATIC_STATUS <chr>, TMB_NONSYNONYMOUS <chr>,
## #   TREATMENT_STATUS <chr>
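The output above reports every column as <chr>, including numeric-looking ones such as DFS_MONTHS. The sketch below shows one way to convert selected columns before analysis. as_numeric_cols is a hypothetical helper, not a package function, and the toy table copies three rows (1, 2, and 9) from the output above.

```r
# Hypothetical helper: convert the named character columns to numeric,
# leaving all other columns untouched.
as_numeric_cols <- function(df, cols) {
    df[cols] <- lapply(df[cols], as.numeric)
    df
}

# Toy stand-in for clinicalData(cbio, "gbm_tcga_pub").
toy_clin <- data.frame(
    patientId = c("TCGA-02-0001", "TCGA-02-0003", "TCGA-02-0014"),
    DFS_MONTHS = c("4.504109589", "1.315068493", NA),
    DFS_STATUS = c("1:Recurred", "1:Recurred", NA),
    stringsAsFactors = FALSE
)

clin_num <- as_numeric_cols(toy_clin, "DFS_MONTHS")
str(clin_num)
```

Missing values stay NA after conversion, and columns that are not named in cols keep their character type.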

You can still call a different endpoint directly to obtain data for a single sample:

samplist <- samplesInSampleLists(cbio, clist1)
onesample <- samplist[["gbm_tcga_pub_all"]][1]
onesample
## [1] "TCGA-02-0001-01"
cbio$getAllClinicalDataOfSampleInStudyUsingGET(
    sampleId = onesample, studyId = "gbm_tcga_pub"
)
## Response [https://www.cbioportal.org/api/studies/gbm_tcga_pub/samples/TCGA-02-0001-01/clinical-data]
##   Date: 2022-04-26 20:13
##   Status: 200
##   Content-Type: application/json
##   Size: 3.33 kB
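The direct endpoint call prints an httr-style Response rather than a table. Parsing a JSON body with httr::content() typically yields a list of records, one per clinical attribute. The sketch below assumes such a list of flat records and binds it into a data frame. records_to_df is a hypothetical helper, and the clinicalAttributeId and value fields and their contents are placeholders, not actual API output.

```r
# Hypothetical helper: bind a list of flat records (named lists with the
# same fields and no NULL entries) into a single data frame.
records_to_df <- function(records) {
    rows <- lapply(records, function(rec) {
        as.data.frame(rec, stringsAsFactors = FALSE)
    })
    do.call(rbind, rows)
}

# Toy stand-in for a parsed response; attribute names and values are made up.
toy_records <- list(
    list(sampleId = "TCGA-02-0001-01", studyId = "gbm_tcga_pub",
        clinicalAttributeId = "ATTRIBUTE_A", value = "placeholder 1"),
    list(sampleId = "TCGA-02-0001-01", studyId = "gbm_tcga_pub",
        clinicalAttributeId = "ATTRIBUTE_B", value = "placeholder 2")
)

records_df <- records_to_df(toy_records)
records_df
```

With a live response stored in a variable such as resp, records_to_df(httr::content(resp)) should behave similarly, provided no field is NULL or nested.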

Other endpoints can be listed by keyword with searchOps:

searchOps(cbio, "clinical")
## [1] "getAllClinicalAttributesUsingGET"          
## [2] "fetchClinicalAttributesUsingPOST"          
## [3] "fetchClinicalDataUsingPOST"                
## [4] "getAllClinicalAttributesInStudyUsingGET"   
## [5] "getClinicalAttributeInStudyUsingGET"       
## [6] "getAllClinicalDataInStudyUsingGET"         
## [7] "fetchAllClinicalDataInStudyUsingPOST"      
## [8] "getAllClinicalDataOfPatientInStudyUsingGET"
## [9] "getAllClinicalDataOfSampleInStudyUsingGET"

Genetic profiles / molecular profiles

cgdsr

getGeneticProfiles.CGDS(cgds, cancerStudy = "gbm_tcga_pub")

cBioPortalData

molecularProfiles(cbio, "gbm_tcga_pub")
## # A tibble: 10 × 8
##    molecularAlteration… datatype name  description showProfileInAn… patientLevel
##    <chr>                <chr>    <chr> <chr>       <lgl>            <lgl>       
##  1 COPY_NUMBER_ALTERAT… DISCRETE Puta… Putative c… TRUE             FALSE       
##  2 COPY_NUMBER_ALTERAT… DISCRETE Puta… Putative c… TRUE             FALSE       
##  3 MUTATION_EXTENDED    MAF      Muta… Mutation d… TRUE             FALSE       
##  4 METHYLATION          CONTINU… Meth… Methylatio… FALSE            FALSE       
##  5 MRNA_EXPRESSION      CONTINU… mRNA… mRNA expre… FALSE            FALSE       
##  6 MRNA_EXPRESSION      Z-SCORE  mRNA… 18,698 gen… TRUE             FALSE       
##  7 MRNA_EXPRESSION      Z-SCORE  mRNA… Log-transf… TRUE             FALSE       
##  8 MRNA_EXPRESSION      CONTINU… micr… expression… FALSE            FALSE       
##  9 MRNA_EXPRESSION      Z-SCORE  micr… microRNA e… FALSE            FALSE       
## 10 MRNA_EXPRESSION      Z-SCORE  mRNA… mRNA and m… TRUE             FALSE       
## # … with 2 more variables: molecularProfileId <chr>, studyId <chr>

Note that we want to pull the molecularProfileId column to use in other queries.
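Since a mistyped profile identifier only fails once the query is sent, a small check against the molecularProfileId column can catch it earlier. This is a sketch under stated assumptions: check_profile_ids is a hypothetical helper, and the toy table lists only the two profile identifiers that this vignette uses in its queries.

```r
# Hypothetical helper: stop early if any requested profile ID is absent
# from a table shaped like the molecularProfiles() output.
check_profile_ids <- function(profiles, ids) {
    missing_ids <- setdiff(ids, profiles[["molecularProfileId"]])
    if (length(missing_ids) > 0) {
        stop("Unknown molecularProfileId: ",
            paste(missing_ids, collapse = ", "))
    }
    invisible(ids)
}

# Toy stand-in for molecularProfiles(cbio, "gbm_tcga_pub").
toy_profiles <- data.frame(
    molecularProfileId = c("gbm_tcga_pub_mrna", "gbm_tcga_pub_mutations"),
    studyId = "gbm_tcga_pub",
    stringsAsFactors = FALSE
)

# Passes silently and returns the IDs invisibly.
check_profile_ids(toy_profiles, "gbm_tcga_pub_mrna")

# A typo is reported as an error; here we capture the message instead.
typo_msg <- tryCatch(
    check_profile_ids(toy_profiles, "gbm_tcga_pub_mrn"),
    error = function(e) conditionMessage(e)
)
typo_msg
```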

Profile data or molecular data (mRNA expression) for a set of genes

cgdsr

getProfileData.CGDS(x = cgds,
    genes = c("NF1", "TP53", "ABL1"),
    geneticProfiles = "gbm_tcga_pub_mrna",
    caseList = "gbm_tcga_pub_all"
)

cBioPortalData

Currently, if you only have Hugo symbols, some conversion is needed before you can use the molecularData function directly. First, convert the symbols to Entrez gene IDs, then obtain all the samples in the sample list of interest.

genetab <- queryGeneTable(cbio,
    by = "hugoGeneSymbol",
    genes = c("NF1", "TP53", "ABL1")
)
genetab
## # A tibble: 3 × 3
##   entrezGeneId hugoGeneSymbol type          
##          <int> <chr>          <chr>         
## 1         4763 NF1            protein-coding
## 2           25 ABL1           protein-coding
## 3         7157 TP53           protein-coding
entrez <- genetab[["entrezGeneId"]]

allsamps <- samplesInSampleLists(cbio, "gbm_tcga_pub_all")

molecularData(cbio, "gbm_tcga_pub_mrna",
    entrezGeneIds = entrez, sampleIds = unlist(allsamps))
## $gbm_tcga_pub_mrna
## # A tibble: 618 × 8
##    uniqueSampleKey       uniquePatientKey entrezGeneId molecularProfil… sampleId
##    <chr>                 <chr>                   <int> <chr>            <chr>   
##  1 VENHQS0wMi0wMDAxLTAx… VENHQS0wMi0wMDA…           25 gbm_tcga_pub_mr… TCGA-02…
##  2 VENHQS0wMi0wMDAxLTAx… VENHQS0wMi0wMDA…         4763 gbm_tcga_pub_mr… TCGA-02…
##  3 VENHQS0wMi0wMDAxLTAx… VENHQS0wMi0wMDA…         7157 gbm_tcga_pub_mr… TCGA-02…
##  4 VENHQS0wMi0wMDAzLTAx… VENHQS0wMi0wMDA…           25 gbm_tcga_pub_mr… TCGA-02…
##  5 VENHQS0wMi0wMDAzLTAx… VENHQS0wMi0wMDA…         4763 gbm_tcga_pub_mr… TCGA-02…
##  6 VENHQS0wMi0wMDAzLTAx… VENHQS0wMi0wMDA…         7157 gbm_tcga_pub_mr… TCGA-02…
##  7 VENHQS0wMi0wMDA0LTAx… VENHQS0wMi0wMDA…           25 gbm_tcga_pub_mr… TCGA-02…
##  8 VENHQS0wMi0wMDA0LTAx… VENHQS0wMi0wMDA…         4763 gbm_tcga_pub_mr… TCGA-02…
##  9 VENHQS0wMi0wMDA0LTAx… VENHQS0wMi0wMDA…         7157 gbm_tcga_pub_mr… TCGA-02…
## 10 VENHQS0wMi0wMDA2LTAx… VENHQS0wMi0wMDA…           25 gbm_tcga_pub_mr… TCGA-02…
## # … with 608 more rows, and 3 more variables: patientId <chr>, studyId <chr>,
## #   value <dbl>

Note that this can also be done by using the getDataByGenes function, which automatically figures out all the sample identifiers in the study:

getDataByGenes(
    api = cbio,
    studyId = "gbm_tcga_pub",
    genes = c("NF1", "TP53", "ABL1"),
    by = "hugoGeneSymbol",
    molecularProfileIds = "gbm_tcga_pub_mrna"
)
## $gbm_tcga_pub_mrna
## # A tibble: 618 × 10
##    uniqueSampleKey       uniquePatientKey entrezGeneId molecularProfil… sampleId
##    <chr>                 <chr>                   <int> <chr>            <chr>   
##  1 VENHQS0wMi0wMDAxLTAx… VENHQS0wMi0wMDA…           25 gbm_tcga_pub_mr… TCGA-02…
##  2 VENHQS0wMi0wMDAxLTAx… VENHQS0wMi0wMDA…         4763 gbm_tcga_pub_mr… TCGA-02…
##  3 VENHQS0wMi0wMDAxLTAx… VENHQS0wMi0wMDA…         7157 gbm_tcga_pub_mr… TCGA-02…
##  4 VENHQS0wMi0wMDAzLTAx… VENHQS0wMi0wMDA…           25 gbm_tcga_pub_mr… TCGA-02…
##  5 VENHQS0wMi0wMDAzLTAx… VENHQS0wMi0wMDA…         4763 gbm_tcga_pub_mr… TCGA-02…
##  6 VENHQS0wMi0wMDAzLTAx… VENHQS0wMi0wMDA…         7157 gbm_tcga_pub_mr… TCGA-02…
##  7 VENHQS0wMi0wMDA0LTAx… VENHQS0wMi0wMDA…           25 gbm_tcga_pub_mr… TCGA-02…
##  8 VENHQS0wMi0wMDA0LTAx… VENHQS0wMi0wMDA…         4763 gbm_tcga_pub_mr… TCGA-02…
##  9 VENHQS0wMi0wMDA0LTAx… VENHQS0wMi0wMDA…         7157 gbm_tcga_pub_mr… TCGA-02…
## 10 VENHQS0wMi0wMDA2LTAx… VENHQS0wMi0wMDA…           25 gbm_tcga_pub_mr… TCGA-02…
## # … with 608 more rows, and 5 more variables: patientId <chr>, studyId <chr>,
## #   value <dbl>, hugoGeneSymbol <chr>, type <chr>
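Both functions return long-format tables with one row per gene and sample pair. For a gene-by-sample matrix, the table can be reshaped in base R. The sketch below is illustrative only: long_to_matrix is a hypothetical helper, and the toy values are borrowed from the assay output shown at the end of this vignette for the first two samples.

```r
# Hypothetical helper: reshape a long table with hugoGeneSymbol, sampleId,
# and value columns into a gene-by-sample numeric matrix.
long_to_matrix <- function(long) {
    genes <- sort(unique(long[["hugoGeneSymbol"]]))
    samples <- unique(long[["sampleId"]])
    mat <- matrix(NA_real_,
        nrow = length(genes), ncol = length(samples),
        dimnames = list(genes, samples)
    )
    # A two-column character matrix indexes cells by row and column name.
    mat[cbind(long[["hugoGeneSymbol"]], long[["sampleId"]])] <- long[["value"]]
    mat
}

# Toy stand-in for one element of the getDataByGenes() result.
toy_long <- data.frame(
    hugoGeneSymbol = rep(c("ABL1", "NF1", "TP53"), times = 2),
    sampleId = rep(c("TCGA-02-0001-01", "TCGA-02-0003-01"), each = 3),
    value = c(-0.1744878, -0.2966920, 0.6213171,
        -0.177096729, -0.001066810, 0.006435625),
    stringsAsFactors = FALSE
)

expr_mat <- long_to_matrix(toy_long)
expr_mat
```

Gene and sample pairs that are absent from the long table remain NA in the matrix.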

Mutation data

cgdsr

getMutationData.CGDS(
    x = cgds,
    caseList = "gbm_tcga_pub_all",
    geneticProfile = "gbm_tcga_pub_mutations",
    genes = c("NF1", "TP53", "ABL1")
)

cBioPortalData

Similar to molecularData, mutation data can be obtained with the mutationData function or the getDataByGenes function.

mutationData(
    api = cbio,
    molecularProfileIds = "gbm_tcga_pub_mutations",
    entrezGeneIds = entrez,
    sampleIds = unlist(allsamps)
)
## $gbm_tcga_pub_mutations
## # A tibble: 57 × 28
##    uniqueSampleKey          uniquePatientKey molecularProfil… sampleId patientId
##    <chr>                    <chr>            <chr>            <chr>    <chr>    
##  1 VENHQS0wMi0wMDAxLTAxOmd… VENHQS0wMi0wMDA… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  2 VENHQS0wMi0wMDAxLTAxOmd… VENHQS0wMi0wMDA… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  3 VENHQS0wMi0wMDAzLTAxOmd… VENHQS0wMi0wMDA… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  4 VENHQS0wMi0wMDAzLTAxOmd… VENHQS0wMi0wMDA… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  5 VENHQS0wMi0wMDEwLTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  6 VENHQS0wMi0wMDEwLTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  7 VENHQS0wMi0wMDEwLTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  8 VENHQS0wMi0wMDExLTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  9 VENHQS0wMi0wMDE0LTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
## 10 VENHQS0wMi0wMDI0LTAxOmd… VENHQS0wMi0wMDI… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
## # … with 47 more rows, and 23 more variables: entrezGeneId <int>,
## #   studyId <chr>, center <chr>, mutationStatus <chr>, validationStatus <chr>,
## #   startPosition <int>, endPosition <int>, referenceAllele <chr>,
## #   proteinChange <chr>, mutationType <chr>, functionalImpactScore <chr>,
## #   fisValue <dbl>, linkXvar <chr>, linkPdb <chr>, linkMsa <chr>,
## #   ncbiBuild <chr>, variantType <chr>, keyword <chr>, chr <chr>,
## #   variantAllele <chr>, refseqMrnaId <chr>, proteinPosStart <int>, …

getDataByGenes(
    api = cbio,
    studyId = "gbm_tcga_pub",
    genes = c("NF1", "TP53", "ABL1"),
    by = "hugoGeneSymbol",
    molecularProfileIds = "gbm_tcga_pub_mutations"
)
## $gbm_tcga_pub_mutations
## # A tibble: 57 × 30
##    uniqueSampleKey          uniquePatientKey molecularProfil… sampleId patientId
##    <chr>                    <chr>            <chr>            <chr>    <chr>    
##  1 VENHQS0wMi0wMDAxLTAxOmd… VENHQS0wMi0wMDA… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  2 VENHQS0wMi0wMDAxLTAxOmd… VENHQS0wMi0wMDA… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  3 VENHQS0wMi0wMDAzLTAxOmd… VENHQS0wMi0wMDA… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  4 VENHQS0wMi0wMDAzLTAxOmd… VENHQS0wMi0wMDA… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  5 VENHQS0wMi0wMDEwLTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  6 VENHQS0wMi0wMDEwLTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  7 VENHQS0wMi0wMDEwLTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  8 VENHQS0wMi0wMDExLTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
##  9 VENHQS0wMi0wMDE0LTAxOmd… VENHQS0wMi0wMDE… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
## 10 VENHQS0wMi0wMDI0LTAxOmd… VENHQS0wMi0wMDI… gbm_tcga_pub_mu… TCGA-02… TCGA-02-…
## # … with 47 more rows, and 25 more variables: entrezGeneId <int>,
## #   studyId <chr>, center <chr>, mutationStatus <chr>, validationStatus <chr>,
## #   startPosition <int>, endPosition <int>, referenceAllele <chr>,
## #   proteinChange <chr>, mutationType <chr>, functionalImpactScore <chr>,
## #   fisValue <dbl>, linkXvar <chr>, linkPdb <chr>, linkMsa <chr>,
## #   ncbiBuild <chr>, variantType <chr>, keyword <chr>, chr <chr>,
## #   variantAllele <chr>, refseqMrnaId <chr>, proteinPosStart <int>, …

cBioPortalData main function

End users who want to obtain the data as easily as possible should use the main cBioPortalData function:

gbm_pub <- cBioPortalData(
    api = cbio,
    studyId = "gbm_tcga_pub",
    genes = c("NF1", "TP53", "ABL1"), by = "hugoGeneSymbol",
    molecularProfileIds = "gbm_tcga_pub_mrna"
)

assay(gbm_pub[["gbm_tcga_pub_mrna"]])[, 1:4]
##      TCGA-02-0001-01 TCGA-02-0003-01 TCGA-02-0004-01 TCGA-02-0006-01
## ABL1      -0.1744878    -0.177096729     -0.08782114      -0.1733767
## NF1       -0.2966920    -0.001066810     -0.23626512      -0.1691507
## TP53       0.6213171     0.006435625     -0.30507285       0.3967758
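As a quick reference while migrating, the function pairings used throughout this vignette can be kept in a named character vector. This is only a convenience sketch. The name cgdsr_to_cbio is hypothetical, and the pairings are the closest counterparts shown in the sections above, not exact drop-in replacements.

```r
# Old cgdsr function -> closest cBioPortalData counterpart, as paired in
# the sections of this vignette. Arguments and return types differ.
cgdsr_to_cbio <- c(
    getCancerStudies.CGDS = "getStudies",
    getCaseLists.CGDS = "sampleLists",
    getClinicalData.CGDS = "clinicalData",
    getGeneticProfiles.CGDS = "molecularProfiles",
    getProfileData.CGDS = "molecularData",
    getMutationData.CGDS = "mutationData"
)

# Look up the counterpart of a single cgdsr function.
cgdsr_to_cbio[["getCaseLists.CGDS"]]
```

When only Hugo symbols are available, getDataByGenes can stand in for both molecularData and mutationData, as shown in the sections above.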