mslp
(Mutation specific Synthetic Lethal Partners) is a comprehensive pipeline to identify consensus SLPs for loss of function mutations in a cancer context-specific manner through integrating genomic and transcriptomic data from patients, as well as genetic screen data in cell line 1. It is an unsupervised method which does not relies on training sets of curated SLPs. Compared to other approaches, mslp has the advantage of explicitly and stringently integrating available genetic screen data.
In the pipeline, we first infer the primary SLPs based on two simple yet complementary assumptions: 1) expression of mutations correlate with SLPs in wide type patients, 2) over-expression of SLPs compensate for the loss of function of the mutation. Two computational modules, correlationSLP and compensationSLP, are derived from state-of-art statistical methods to realize the assumptions with the ever-increasing patient omics data. Further, in spite of complex gene-gene interaction, we hypothesize that genetic perturbations targeting mutation specific SLPs would likely reduce cell viability. Thus, for the recurrent mutations detected in both patient tumors and cancer cell lines, we develope the idea of consensus SLPs under two constraints, 1) primary SLPs are screen hits, 2) consistent impact on cell viability in different cell lines. When applied to real datasets, we found that perturbation of predicted consensus SLPs has a significant impact on cell viability compared to other hits. Thus, mslp provides a novel approach to study mutation specific SLPs and explore the possibility of personalized medicine with great flexibilities. More on mslp could be found in the biorxiv paper.
mslp
has been submitted to Bioconductor for easy access, better documentation and user support.
Install the package from Bioconductor.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mslp")
BiocManager::install("data.table")
The raw data could be downloaded from public databases, like cBioPortal [2], where the following necessary files are needed:
It is possible to use customized datasets, simply following the format and column names of above files, e.g., “data_mutations_extended.txt” is a gene by sample matrix, with “Hugo_Symbol”, “Entrez_Gene_Id” as the first two columns. Three columns are mandatory in mutation profiles: “Tumor_Sample_Barcode”, “Gene” and “Variant_Classification”, while “Gene” contains the Ensembl gene ids. pp_tcga
could be used to preprocess the data.
#- Preprocessing the data.
#- Path to input files.
library(mslp)
P_mut <- "data_mutations_extended.txt"
P_cna <- "data_CNA.txt"
P_expr <- "data_RNA_Seq_v2_expression_median.txt"
P_z <- "data_RNA_Seq_v2_mRNA_median_Zscores.txt"
res <- pp_tcga(P_mut, P_cna, P_expr, P_z)
saveRDS(res$mut_data, "mut_data.rds")
saveRDS(res$expr_data, "expr_data.rds")
saveRDS(res$zscore_data, "zscore_data.rds")
Here we load toy datasets showing the proper data formats.
require(future)
#> Loading required package: future
require(doFuture)
#> Loading required package: doFuture
#> Loading required package: foreach
library(magrittr)
library(mslp)
library(data.table)
#- mutation from TCGA datasets to computate SLP via comp_slp
data(example_comp_mut)
#- mutation from TCGA datasets to computate SLP via corr_slp.
data(example_corr_mut)
data(example_expr)
data(example_z)
We use comp_slp
to predict SLPs compensated for loss of functions due to mutations. Briefly, we identify overexpressed genes in patients with interested mutations via the rank products algorithm [3], while co-occurring mutations are removed beforehand. mslp
uses Future
as the parallel backend, and it is recommended to run corr_slp
, comp_slp
, cons_slp
and est_im
in parallel.
plan(multisession, workers = 2)
res_comp <- comp_slp(example_z, example_comp_mut)
#> (==) Number of mutations: 5.
#> Warning: executing %dopar% sequentially: no parallel backend registered
#>
saveRDS(res_comp, file = "compSLP_res.rds")
plan(sequential)
We use corr_slp
to predict SLPs correlated with mutations in wide type patients. Internally, GENIE3 is used to select potential SLPs [4]. The im_thresh of 0.0016 was a rather robust threshold identified from various TCGA datasets with 100 random selected mutations and 50 repetitions.
plan(multisession, workers = 2)
res_corr <- corr_slp(example_expr, example_corr_mut)
#> (II) Number of mutations: 5.
saveRDS(res_corr, "corrSLP_res.rds")
#- Filter res by importance threshold to reduce false positives.
im_thresh <- 0.0016
res_f <- res_corr[im >= im_thresh]
plan(sequential)
It is recommended to compute the im_thresh
for each cancer type separately. We derived an approach estimating the threshold by running corr_slp
for randomly selected mutations repeatedly. SLPs of high relevance are picked as “true” SLPs for each mutation using the rank products algorithm. We then calculated the best threshold of receiver operating characteristic curve (ROC) of each repetition, and took the mean value across repetition. The average value among mutations is the final threshold.
plan(multisession, workers = 2)
#- Random mutations and runs
mutgene <- sample(intersect(example_corr_mut$mut_entrez, rownames(example_expr)), 5)
nperm <- 3
res_random <- lapply(seq_len(nperm), function(x) corr_slp(example_expr, example_corr_mut, mutgene = mutgene))
#> (II) Number of mutations: 5.
#> (II) Number of mutations: 5.
#> (II) Number of mutations: 5.
im_res <- est_im(res_random)
res_f_2 <- res_corr[im >= mean(im_res$roc_thresh)]
plan(sequential)
The mutation profiles and genetic screen data of cancer cell lines are required for this step. The Cancer Cell Line Encyclopedia (CCLE) is a great place to find mutation data [5]; and genetic screen results could be found in datasets such as Project Drive [6] and DepMap [7].
For example, the following codes show how to extract mutation data from CCLE.
library(readxl)
#- nature11003-s3.xls is available in the supplementary data of CCLE publication.
ccle <- readxl::read_excel("nature11003-s3.xls", sheet = "Table S1", skip = 2) %>%
as.data.frame %>%
set_names(gsub(" ", "_", names(.))) %>%
as.data.table %>%
.[, CCLE_name := toupper(CCLE_name)] %>%
unique
#- Only use the Nonsynonymous Mutations. CCLE_DepMap_18Q1_maf_20180207.txt can be downloaded from the CCLE website.
#- Only need the columns of cell_line and mut_entrez.
mut_type <- c("Missense_Mutation", "Nonsense_Mutation", "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins", "Nonstop_Mutation")
ccle_mut <- fread("CCLE_DepMap_18Q1_maf_20180207.txt") %>%
.[Variant_Classification %in% mut_type] %>%
.[, Tumor_Sample_Barcode := toupper(Tumor_Sample_Barcode)] %>%
.[, Entrez_Gene_Id := as.character(Entrez_Gene_Id)] %>%
.[, .(Tumor_Sample_Barcode, Entrez_Gene_Id)] %>%
unique %>%
setnames(c("cell_line", "mut_entrez"))
#- Select brca cell lines
brca_ccle_mut <- ccle_mut[cell_line %in% unique(ccle[CCLE_tumor_type == "breast"])]
Now we are ready to uncover the consensus SLPs, which are 1) hits in genetic screens, 2) consistent for the same mutations among cell lines evaluated by Cohen’s kappa coefficient. scr_slp
and cons_slp
are used for these two steps, respectively.
plan(multisession, workers = 2)
#- Merge data.
#- Load previous results.
res_comp <- readRDS("compSLP_res.rds")
res_corr <- readRDS("corrSLP_res.rds")
merged_res <- merge_slp(res_comp, res_corr)
#- Toy hits data.
screen_1 <- merged_res[, .(slp_entrez, slp_symbol)] %>%
unique %>%
.[sample(nrow(.), round(.8 * nrow(.)))] %>%
setnames(c(1, 2), c("screen_entrez", "screen_symbol")) %>%
.[, cell_line := "cell_1"]
screen_2 <- merged_res[, .(slp_entrez, slp_symbol)] %>%
unique %>%
.[sample(nrow(.), round(.8 * nrow(.)))] %>%
setnames(c(1, 2), c("screen_entrez", "screen_symbol")) %>%
.[, cell_line := "cell_2"]
screen_hit <- rbind(screen_1, screen_2)
#- Toy mutation data.
mut_1 <- merged_res[, .(mut_entrez)] %>%
unique %>%
.[sample(nrow(.), round(.8 * nrow(.)))] %>%
.[, cell_line := "cell_1"]
mut_2 <- merged_res[, .(mut_entrez)] %>%
unique %>%
.[sample(nrow(.), round(.8 * nrow(.)))] %>%
.[, cell_line := "cell_2"]
mut_info <- rbind(mut_1, mut_2)
#- Hits that are identified as SLPs.
scr_res <- lapply(c("cell_1", "cell_2"), scr_slp, screen_hit, mut_info, merged_res)
scr_res[lengths(scr_res) == 0] <- NULL
scr_res <- rbindlist(scr_res)
k_res <- cons_slp(scr_res, merged_res)
#- Filter results, e.g., by kappa_value and padj.
k_res_f <- k_res[kappa_value >= 0.6 & padj <= 0.1]
plan(sequential)
sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] doRNG_1.8.6 rngtools_1.5.2 data.table_1.15.4 mslp_1.6.0
#> [5] magrittr_2.0.3 doFuture_1.0.1 foreach_1.5.2 future_1.33.2
#> [9] BiocStyle_2.32.0
#>
#> loaded via a namespace (and not attached):
#> [1] gmp_0.7-4 sass_0.4.9 RSQLite_2.3.6
#> [4] pROC_1.18.5 listenv_0.9.1 digest_0.6.35
#> [7] evaluate_0.23 bookdown_0.39 iterators_1.0.14
#> [10] fastmap_1.1.1 blob_1.2.4 plyr_1.8.9
#> [13] jsonlite_1.8.8 AnnotationDbi_1.66.0 GenomeInfoDb_1.40.0
#> [16] DBI_1.2.2 BiocManager_1.30.22 httr_1.4.7
#> [19] fmsb_0.7.6 UCSC.utils_1.0.0 Biostrings_2.72.0
#> [22] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.2
#> [25] Rmpfr_0.9-5 crayon_1.5.2 rlang_1.1.3
#> [28] XVector_0.44.0 Biobase_2.64.0 parallelly_1.37.1
#> [31] future.apply_1.11.2 bit64_4.0.5 cachem_1.0.8
#> [34] yaml_2.3.8 tools_4.4.0 parallel_4.4.0
#> [37] memoise_2.0.1 GenomeInfoDbData_1.2.12 globals_0.16.3
#> [40] BiocGenerics_0.50.0 png_0.1-8 vctrs_0.6.5
#> [43] org.Hs.eg.db_3.19.1 R6_2.5.1 stats4_4.4.0
#> [46] lifecycle_1.0.4 randomForest_4.7-1.1 zlibbioc_1.50.0
#> [49] KEGGREST_1.44.0 S4Vectors_0.42.0 IRanges_2.38.0
#> [52] bit_4.0.5 pkgconfig_2.0.3 bslib_0.7.0
#> [55] Rcpp_1.0.12 RankProd_3.30.0 xfun_0.43
#> [58] knitr_1.46 htmltools_0.5.8.1 rmarkdown_2.26
#> [61] compiler_4.4.0
[2]: Cerami, E. et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery 2, 401–404 (2012).
[3]: Breitling, R., Armengaud, P., Amtmann, A. & Herzyk, P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters 573, 83–92 (2004).
[4]: Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLoS ONE 5, e12776 (2010).
[5]: Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modeling of anticancer drug sensitivity. Nature 483, 603 (2012).
[6]: McDonald, E. R. et al. Project DRIVE: A Compendium of Cancer Dependencies and Synthetic Lethal Relationships Uncovered by Large-Scale, Deep RNAi Screening. Cell 170, 577-592.e10 (2017).
[7]: Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564-576.e16 (2017).