As an example of the utility of the PDATK package, we provide code
replicating the analysis published in Sandhu et al. (2019). While the
code presented here is run on a subset of the original data to keep the
PDATK installation size reasonable, the full data from that study
is available via the MetaGxPancreas Bioconductor data package.
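For readers who want the complete cohorts, retrieval might look something like
the following sketch; it assumes MetaGxPancreas is installed and uses its
loadPancreasDatasets function, so consult that package's documentation for the
exact return structure.
if (!requireNamespace('MetaGxPancreas', quietly=TRUE))
    BiocManager::install('MetaGxPancreas')
library(MetaGxPancreas)
# downloads the full pancreas cohorts as a list of SummarizedExperiment objects
pancreasData <- loadPancreasDatasets()
For this vignette, however, we use the subset bundled with PDATK: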
data(sampleCohortList)
sampleCohortList
## CohortList of length 11
## names(11): ICGCMICRO ICGCSEQ PCSI TCGA KIRBY UNC CHEN COLLISON ZHANG OUH WINTER
To get started using PDATK, place each of your patient cohorts into
SurvivalExperiment objects and then assemble those into a master CohortList,
which holds the training and validation data for use with the various
SurvivalModels in this package.
commonGenes <- findCommonGenes(sampleCohortList)
# Subsets all list items, with subset for specifying rows and select for
# specifying columns
cohortList <- subset(sampleCohortList, subset=commonGenes)
ICGCcohortList <- cohortList[grepl('ICGC', names(cohortList), ignore.case=TRUE)]
validationCohortList <- cohortList[!grepl('icgc', names(cohortList),
ignore.case=TRUE)]
Since we are interested in predicting survival, it is necessary to remove patients with insufficient data to be useful. In general, we want to remove patients who did not have an event in our period of interest. As such, we remove samples from patients who dropped out of a study but did not pass away before the first year, since their one-year survival status cannot be determined.
validationCohortList <- dropNotCensored(validationCohortList)
ICGCcohortList <- dropNotCensored(ICGCcohortList)
We have now split our patient cohorts into training and validation data. For this analysis we will train using the ICGC cohorts, which include one cohort with RNA micro-array data and another with RNA sequencing data. When using multiple cohorts to train a model, those cohorts must share samples. As a result, we take as training data all patients shared between the two cohorts and leave the remaining patients as part of our validation data.
# find common samples between our training cohorts in a cohort list
commonSamples <- findCommonSamples(ICGCcohortList)
# split into shared samples for training, the rest for testing
ICGCtrainCohorts <- subset(ICGCcohortList, select=commonSamples)
ICGCtestCohorts <- subset(ICGCcohortList, select=commonSamples, invert=TRUE)
# merge our training cohort test data into the rest of the validation data
validationCohortList <- c(ICGCtestCohorts, validationCohortList)
# drop ICGCSEQ from the validation data, because it only has 7 patients
validationCohortList <-
validationCohortList[names(validationCohortList) != 'ICGCSEQ']
PCOSP Model Object

We now have patient molecular data, annotated with the number of days survived
since treatment and the survival status, and are ready to apply a SurvivalModel
to this data. In this example, we are applying a Pancreatic Cancer Overall
Survival Predictor (PCOSP) model, as described in Sandhu et al. (2019), which
uses the switchBox package to create an ensemble of binary classifiers whose
votes are then tallied into a PCOSP score. A PCOSP score is simply the
proportion of models predicting good survival out of the total number of models
in the ensemble.
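As a toy illustration of the score calculation (the votes vector here is made
up for demonstration, not drawn from a real model):
# hypothetical votes from a five-classifier ensemble for a single sample
votes <- c('good', 'bad', 'good', 'good', 'bad')
# the PCOSP score is the proportion of classifiers voting 'good'
mean(votes == 'good')
## [1] 0.6
With the training data in hand, we can construct the model object: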
set.seed(1987)
PCOSPmodel <- PCOSP(ICGCtrainCohorts, minDaysSurvived=365, randomSeed=1987)
# view the model parameters; these make your model run reproducible
metadata(PCOSPmodel)$modelParams
## $randomSeed
## [1] 1987
##
## $RNGkind
## [1] "Mersenne-Twister" "Inversion" "Rejection"
##
## $minDaysSurvived
## [1] 365
To simplify working with different SurvivalModel sub-classes, we have
implemented a standard work-flow that is valid for all SurvivalModels.
This involves first training the model, then using it to predict
risk/risk-classes for a set of validation cohorts, and finally assessing
performance on the validation data.
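Put together, the three steps for this example are the calls below, each of
which is unpacked in the remainder of this section:
# 1. train the model, 2. predict risk classes, 3. validate performance
trainedPCOSPmodel <- trainModel(PCOSPmodel, numModels=15, minAccuracy=0.6)
PCOSPpredValCohorts <- predictClasses(validationCohortList,
    model=trainedPCOSPmodel)
validatedPCOSPmodel <- validateModel(trainedPCOSPmodel,
    valData=PCOSPpredValCohorts)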
To train a model, the trainModel
method is used. This function abstracts away
the implementation of model training, allowing end-users to focus on applying
the SurvivalModel
to make predictions without a need to understand the model
internals. We hope this will make the package useful for those unfamiliar or
uninterested in the details of survival prediction methods.
For training a PCOSP model there are two parameters. First, numModels is the
number of models to train for use in the ensemble classifier to predict PCOSP
scores. To keep computation brief, we are only training 15 models here;
however, a minimum of 1000 is recommended for real-world applications. The
second parameter is minAccuracy, which is the minimum model accuracy for a
trained model to be included in the final model ensemble. Paradoxically,
setting this too high can actually decrease the overall performance of the
PCOSP model. We recommend 0.6 as a sweet spot between random chance and
over-fitting, but encourage experimentation to see what works best with your
data.
trainedPCOSPmodel <- trainModel(PCOSPmodel, numModels=15, minAccuracy=0.6)
metadata(trainedPCOSPmodel)$modelParams
## $randomSeed
## [1] 1987
##
## $RNGkind
## [1] "Mersenne-Twister" "Inversion" "Rejection"
##
## $minDaysSurvived
## [1] 365
##
## $numModels
## [1] 15
##
## $minAccuracy
## [1] 0.6
We can see that after training, the additional model parameters are added to
the modelParams item in the model metadata. The goal is to ensure that your
model training, prediction and validation are fully reproducible by capturing
the parameters relevant to a specific model.
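For example, the captured parameters can be read back and reused; this is a
minimal sketch, and the params variable name is illustrative rather than part
of the PDATK API.
# retrieve the parameters recorded during training
params <- metadata(trainedPCOSPmodel)$modelParams
# restore the random seed that produced this model
set.seed(params$randomSeed)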
After training, a model can be used with new data to make risk predictions
and classify samples into ‘good’ or ‘bad’ survival groups. To do this, the
standard predictClasses method is used. As with trainModel, we have
abstracted away the implementation details to provide users with a simple,
consistent interface for using SurvivalModel sub-classes to make patient risk
predictions.
PCOSPpredValCohorts <- predictClasses(validationCohortList,
model=trainedPCOSPmodel)
The returned CohortList object now indicates that each of the cohorts has
predictions. This information is available in the elementMetadata slot of
the cohort list and can be accessed with the mcols function from S4Vectors.
mcols(PCOSPpredValCohorts)
## DataFrame with 10 rows and 2 columns
## mDataType hasPredictions
## <character> <logical>
## ICGCMICRO rna_micro TRUE
## PCSI rna_seq TRUE
## TCGA rna_seq TRUE
## KIRBY rna_seq TRUE
## UNC rna_micro TRUE
## CHEN rna_micro TRUE
## COLLISON rna_micro TRUE
## ZHANG rna_micro TRUE
## OUH rna_micro TRUE
## WINTER rna_micro TRUE
Predicting risk with a specific model adds a corresponding metadata
column to the object's colData. In the case of a PCOSP model, the new column
is called PCOSP_prob_good and represents the proportion of models in the
ensemble which predicted good survival for a given sample.
knitr::kable(head(colData(PCOSPpredValCohorts[[1]])))
| | unique_patient_ID | grade | sex | age | T | N | M | sample_type | sample_name | survival_time | event_occurred | prognosis | PCOSP_prob_good |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ICGC_0001 | ICGC_0001 | G1 | M | 59 | T2 | N0 | M0 | tumour | ICGC_0001 | 522 | 1 | good | 0.6666667 |
| ICGC_0002 | ICGC_0002 | G2 | F | 77 | T2 | N1a | MX | tumour | ICGC_0002 | 2848 | 1 | good | 0.4666667 |
| ICGC_0003 | ICGC_0003 | G2 | F | 58 | T3 | N1a | MX | tumour | ICGC_0003 | 1778 | 0 | good | 0.9333333 |
| ICGC_0004 | ICGC_0004 | G2 | F | 78 | T2 | N0 | MX | tumour | ICGC_0004 | 1874 | 1 | good | 1.0000000 |
| ICGC_0005 | ICGC_0005 | G3 | M | 63 | T3 | N1b | MX | tumour | ICGC_0005 | 641 | 1 | good | 0.5333333 |
| ICGC_0008 | ICGC_0008 | G2 | M | 72 | T3 | N1b | MX | tumour | ICGC_0008 | 1209 | 1 | good | 0.8000000 |
Additionally, binary predictions of good or bad survival can be found in the
PCOSPpredictions item of each SurvivalExperiment's metadata. This contains
the raw predictions from the model for each classifier in the ensemble, ordered
by classifier accuracy. This data is not important for end users, but is used
internally when calculating validation statistics for the model. For users
wishing to classify samples rather than estimate risks, we recommend a
PCOSP cut-off of >0.5 for a good survival prognosis, as shown in the sketch
after the table below.
knitr::kable(metadata(PCOSPpredValCohorts[[1]])$PCOSPpredictions[1:5, 1:5])
| | ICGC_0001 | ICGC_0002 | ICGC_0003 | ICGC_0004 | ICGC_0005 |
|---|---|---|---|---|---|
| rank1 | bad | bad | good | good | good |
| rank2 | good | good | good | good | bad |
| rank3 | good | good | good | good | good |
| rank4 | bad | good | good | good | good |
| rank5 | good | bad | good | good | bad |
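A minimal sketch of applying the recommended >0.5 cut-off to the first
validation cohort (the probs and riskGroup names are illustrative, not part of
the PDATK API):
# per-sample probabilities of good survival added by predictClasses
probs <- colData(PCOSPpredValCohorts[[1]])$PCOSP_prob_good
# classify samples using the recommended cut-off
riskGroup <- ifelse(probs > 0.5, 'good', 'bad')
table(riskGroup)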
The final step in the standard SurvivalModel work-flow is to compute
model performance statistics for the model on the validation data. This can be
accomplished using the validateModel method, which will add statistics to the
validationStats slot of a SurvivalModel object and the data to the
validationData slot.
validatedPCOSPmodel <- validateModel(trainedPCOSPmodel,
valData=PCOSPpredValCohorts)
knitr::kable(head(validationStats(validatedPCOSPmodel)))
| statistic | estimate | se | lower | upper | p_value | n | cohort | isSummary | mDataType |
|---|---|---|---|---|---|---|---|---|---|
| AUC | 0.7070000 | 0.0470000 | 0.6150000 | 0.7990000 | 0.0000142 | 158 | ICGCMICRO | FALSE | rna_micro |
| log_D_index | 0.5956475 | 0.1592665 | 0.4882238 | 0.7030711 | 0.0001668 | 158 | ICGCMICRO | FALSE | rna_micro |
| concordance_index | 0.6434057 | 0.0321350 | 0.5804223 | 0.7063892 | 0.0000081 | 158 | ICGCMICRO | FALSE | rna_micro |
| AUC | 0.6400000 | 0.0570000 | 0.5280000 | 0.7530000 | 0.0088634 | 107 | PCSI | FALSE | rna_seq |
| log_D_index | 0.4697159 | 0.1838308 | 0.3457239 | 0.5937078 | 0.0105421 | 107 | PCSI | FALSE | rna_seq |
| concordance_index | 0.6212825 | 0.0387126 | 0.5454073 | 0.6971577 | 0.0017309 | 107 | PCSI | FALSE | rna_seq |
Examining the data.table from the validationStats slot, we can see that three
model performance statistics have been calculated for each of the validation
cohorts. Additionally, aggregate statistics have been calculated by molecular
data type and for all cohorts combined. This table can be used to generate
model performance plots, and we have included several functions for examining
model performance.
PCOSPdIndexForestPlot <- forestPlot(validatedPCOSPmodel, stat='log_D_index')
PCOSPdIndexForestPlot
PCOSPconcIndexForestPlot <- forestPlot(validatedPCOSPmodel, stat='concordance_index')
PCOSPconcIndexForestPlot
cohortROCplots <- plotROC(validatedPCOSPmodel, alpha=0.05)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the PDATK package.
## Please report the issue at <https://github.com/bhklab/PDATK/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
cohortROCplots