DepInfeR 1.2.0
set.seed(123)
library(DepInfeR)
library(tibble)
library(tidyr)
library(ggplot2)
library(BiocParallel)
library(dplyr)
The DepInfeR package is available at bioconductor.org and can be downloaded via Bioc Manager:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("DepInfeR")
DepInfeR is an R package for a computational method that integrates two experimentally accessible input data matrices: the drug sensitivity profiles of cancer cell lines or primary tumors ex-vivo (X), and the drug affinities of a set of proteins (Y), to infer a matrix of molecular protein dependencies of the cancers (ß). DepInfeR deconvolutes the protein inhibition effect on the viability phenotype by using regularized multivariate linear regression. It assigns a “dependence coefficient” to each protein and each sample, and therefore could be used to gain a causal and accurate understanding of functional consequences of genomic aberrations in a heterogeneous disease, as well as to guide the choice of pharmacological intervention for a specific cancer type, sub-type, or an individual patient.
This document provides a walk-through of using the DepInfeR package to infer sample-specific protein dependencies from drug-protein affinity profiling and ex-vivo drug response data.
For more information, please read out preprint on bioRxiv: https://doi.org/10.1101/2022.01.11.475864.
As inputs, DepInfeR requires two types of data: 1) a drug-protein affinity profiling dataset containing affinity values between a set of drugs and a set of proteins, and 2) a drug response dataset containing phenotypic data (e.g. viability measurements) of samples in response to a set of drugs.
In this exemplary walk-through analysis we use 1) drug-protein affinity data from Klaeger et al. 2017, which can be found in the supplementary file of the paper (Table_S1 & Table_S2): https://science.sciencemag.org/content/358/6367/eaan4368/tab-figures-data, and 2) drug response data from the Genomics of Drug Sensitivity in Cancer (GDSC) cancer cell line screening dataset: https://www.cancerrxgene.org/.
A subset of leukemia and breast cancer cell lines was chosen for this analysis. The analyzed cancer types were.
The Her2 status of the breast cancer cell lines was annotated manually.
Genetic background of the cancer cell lines was annotated with their mutational background. The mutational background of the cell lines was retrieved from https://www.cancerrxgene.org/downloads/genetic_features.
This analysis starts with the two data tables that contain the common drugs in both drug-target affinity and drug sensitivity datasets.
data(targetsGDSC, drug_response_GDSC)
A glance at the drug-target affinity table
head(targetsGDSC)
## # A tibble: 6 × 6
## drugName drugID targetClassification EC50 Kd targetName
## <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 afatinib 1032 High confidence 3493. 3493. ADK
## 2 afatinib 1032 High confidence 3.01 2.39 EGFR
## 3 afatinib 1032 High confidence 3154. 1771. GAK
## 4 afatinib 1032 High confidence 3828. 2388. GRB2
## 5 afatinib 1032 High confidence 3884. 2811. MAPK11
## 6 afatinib 1032 High confidence 2044. 1047. MAPK14
A glance at the drug sensitivity table
head(drug_response_GDSC)
## # A tibble: 6 × 14
## Drug `Drug Id` Cell …¹ Cosmi…² TCGA …³ Tissue Tissu…⁴ IC50 AUC Max c…⁵
## <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 erlotinib 1 CCRF-C… 905952 ALL blood lympho… 2.73 0.983 2
## 2 erlotinib 1 MOLT-4 905958 ALL blood lympho… 2.36 0.987 2
## 3 erlotinib 1 BE-13 906763 ALL blood lympho… 2.60 0.975 2
## 4 erlotinib 1 GR-ST 906877 ALL blood lympho… 3.78 0.976 2
## 5 erlotinib 1 KARPAS… 907272 ALL blood lympho… 2.55 0.988 2
## 6 erlotinib 1 KE-37 907277 ALL blood lympho… 1.97 0.990 2
## # … with 4 more variables: RMSE <dbl>, `Z score` <dbl>,
## # `Dataset version` <chr>, synonyms <chr>, and abbreviated variable names
## # ¹`Cell line name`, ²`Cosmic sample Id`, ³`TCGA classification`,
## # ⁴`Tissue sub-type`, ⁵`Max conc`
We first need to do some pre-processing of both datasets, to turn them into numeric matrices that can be used as inputs for DepInfeR.
Rename BCR to BCR/ABL to avoid confusion with B-cell receptor (BCR)
targetsGDSC <- mutate(targetsGDSC,
targetName = ifelse(targetName %in% "BCR", "BCR/ABL", targetName)
)
Turn target table into drug-protein affinity matrix
targetMatrix <- filter(
targetsGDSC,
targetClassification == "High confidence"
) %>%
select(drugID, targetName, Kd) %>%
pivot_wider(names_from = "targetName", values_from = "Kd") %>%
remove_rownames() %>%
column_to_rownames("drugID") %>%
as.matrix()
We provide a function, ProcessTargetResults, for the pre-processing of the drug-protein affinity matrix with Kd values (or optionally other affinity measurement values at roughly normal distribution). This function can perform the following steps:
All steps within this function are optional depending on input data. The transformation steps should be performed if the affinity matrix consists of Kd values. If there are highly correlated features within the affinity matrix, they can be removed using the provided function. The users can also use their own method to preprocess the input drug-target affinity matrix, as long as the distribution of the affinity values is roughly normal and do not have too many highly correlated features.
ProcessTargetResults <- processTarget(targetMatrix,
KdAsInput = TRUE,
removeCorrelated = TRUE
)
The z-score was chosen as a suitable measurement value for our drug screening response matrix as it acts as normalization for each drug over all cell lines. When working with AUC or IC50 values, a suitable normalization of the values is recommended. In this analysis we used the z-score of the AUC values as the drug response input for DepInfeR.
responseMatrix <- filter(drug_response_GDSC, `Drug Id` %in% targetsGDSC$drugID) %>%
select(`Drug Id`, `Cell line name`, AUC) %>%
pivot_wider(names_from = `Cell line name`, values_from = AUC) %>%
remove_rownames() %>%
column_to_rownames("Drug Id") %>%
as.matrix()
Currently, DepInfeR does not support input data with missing values. Therefore, the missing values in the input datasets need to be properly handled. The entries with missing values can either be removed or imputed.
We firstly check the distribution of our missing values across all cell lines. Below is the plot of remaining cell lines depending on cutoff for the maximum number of missing values per column (cell line).
missTab <- data.frame(
NA_cutoff = 0,
remain_celllines = 0, stringsAsFactors = FALSE
)
for (i in 0:138) {
a <- dim(responseMatrix[, colSums(is.na(responseMatrix)) <= i])[2]
missTab[i, ] <- c(i, a)
}
ggplot(missTab, aes(x=NA_cutoff, y=remain_celllines)) +
geom_line() + theme_bw() +
xlab("Allowed NA values") +
ylab("Number of remaining cell lines") +
geom_vline(xintercept = 24, color = "red", linetype = "dashed")
In order to keep as many cell lines as possible while reducing the missing data points, we chose one of the elbow points (x=24, shown as the red dashed line) in the plot above as the cut-off for allowed missing values. We will impute the remaining missing values by using the MissForest imputation method.
responseMatrix <- responseMatrix[, colSums(is.na(responseMatrix)) <= 24]
library(missForest)
impRes <- missForest(t(responseMatrix))
imp_missforest <- impRes$ximp
responseMatrix_imputed <- t(imp_missforest)
responseMatrix_scaled <- t(scale(t(responseMatrix_imputed)))
In this step, we will also subset the drug-target affinity matrix and the drug response matrix to only keep the drugs present in both matrices.
targetInput <- ProcessTargetResults$targetMatrix
overlappedDrugs <- intersect(
rownames(responseMatrix_scaled),
rownames(targetInput)
)
targetInput <- targetInput[overlappedDrugs, ]
responseInput <- responseMatrix_scaled[overlappedDrugs, ]
Perform multivariate LASSO regression based on a drug-protein affinity matrix and a drug response matrix.
In this walk-through, we only use 20 repetitions to save time. In a real application, the number of repetitions can be larger, e.g. repeats = 100, to gain more robust estimations. We will also set a seed for the random number generator to ensure reproducibility.
Note that the global random number stream by set.seed() function is ignored by the runLASSORegression() function.
#set up BiocParallel back-end
param <- MulticoreParam(workers = 2, RNGseed = 333)
result <- runLASSORegression(
TargetMatrix = targetInput,
ResponseMatrix = responseInput, repeats = 3,
BPPARAM = param
)
Remove targets that were never selected
useTar <- rowSums(result$coefMat != 0) > 0
result$coefMat <- result$coefMat[useTar, ]
Number of selected targets
nrow(result$coefMat)
## [1] 21
The protein dependence matrix can be visualized in a heatmap. High positive coefficients imply strong reliance of a certain sample on this protein for survival. Proteins with coefficients close to zero are less essential for the survival of cells. Negative coefficients indicate that the viability phenotype benefits from inhibition of the protein.
library(RColorBrewer)
library(pheatmap)
# firstly, we need to load the processed mutation annotation of the cell lines
data("mutation_GDSC")
#setup column annotation and color scheme
annoColor <- list(
H2O2 = c(`-1` = "red", `0` = "black", `1` = "green"),
IL.1 = c(`-1` = "red", `0` = "black", `1` = "green"),
JAK.STAT = c(`-1` = "red", `0` = "black", `1` = "green"),
MAPK.only = c(`-1` = "red", `0` = "black", `1` = "green"),
MAPK.PI3K = c(`-1` = "red", `0` = "black"),
TLR = c(`-1` = "red", `0` = "black", `1` = "green"),
Wnt = c(`-1` = "red", `0` = "black", `1` = "green"),
VEGF = c(`-1` = "red", `0` = "black", `1` = "green"),
PI3K.only = c(`-1` = "red", `0` = "black", `1` = "green"),
TCGA.classification =
c(
ALL = "#BC3C29FF", AML = "#E18727FF",
DLBC = "#20854EFF", "BRCAHer-" = "#0072B5FF",
"BRCAHer+" = "#7876B1FF"
),
ARID1A_mut = c(`1` = "black", `0` = "grey80"),
EP300_mut = c(`1` = "black", `0` = "grey80"),
PTEN_mut = c(`1` = "black", `0` = "grey80"),
TP53_mut = c(`1` = "black", `0` = "grey80"),
PIK3CA_mut = c(`1` = "black", `0` = "grey80"),
BRCA2_mut = c(`1` = "black", `0` = "grey80"),
BRCA1_mut = c(`1` = "black", `0` = "grey80"),
CDH1_mut = c(`1` = "black", `0` = "grey80"),
FBXW7_mut = c(`1` = "black", `0` = "grey80"),
NRAS_mut = c(`1` = "black", `0` = "grey80"),
ASXL1_mut = c(`1` = "black", `0` = "grey80"),
MLL2_mut = c(`1` = "black", `0` = "grey80"),
ABL1_trans = c(`1` = "black", `0` = "grey80"),
missing_value_perc = c(`0` = "white", `25` = "red")
)
importanceTab <- result$coefMat
#change "-" to "." in the column name to match cell line annotation
colnames(importanceTab) <- gsub("-",".", colnames(importanceTab),)
plotTab <- importanceTab
# Row normalization while keeping sign
plotTab_scaled <- scale(t(plotTab), center = FALSE, scale = TRUE)
plotTab <- plotTab_scaled
mutation_GDSC$TCGA.classification[mutation_GDSC$TCGA.classification == "BRCA"] <- "BRCAHer-"
mutation_GDSC$TCGA.classification <-
factor(mutation_GDSC$TCGA.classification,
levels = c("ALL", "AML", "DLBC", "BRCAHer-", "BRCAHer+")
)
pheatmap(plotTab,
color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")),
bias = 1.8
)(100),
annotation_row = mutation_GDSC,
annotation_colors = annoColor,
clustering_method = "ward.D2", scale = "none",
main = "", fontsize = 9,
fontsize_row = 7, fontsize_col = 10
)
In this part, we will test whether the sample-specific protein dependence is associated with cancer types or certain genotypes of those samples.
Prepare genomic background table
cell_anno_final <- mutation_GDSC %>%
select(-missing_value_perc) %>%
rename(cancer_type = TCGA.classification) %>%
filter(rownames(mutation_GDSC) %in% colnames(importanceTab))
colnames(cell_anno_final) <- gsub("_mut", "", colnames(cell_anno_final))
colnames(cell_anno_final) <- gsub("_trans", "_translocation",
colnames(cell_anno_final))
A function to perform association test between target importance and cancer type or genomic features. If a feature has only two levels, Student’s t-test will be use. If a feature has more than two levels, ANOVA test will be used.
diffImportance <- function(coefMat, Annotation) {
# process genetic background table
geneBack <- Annotation
geneBack <- geneBack[colnames(coefMat), ]
keepCols <- apply(geneBack, 2, function(x) {
length(unique(na.omit(x))) >= 2 & all(table(x) > 6)
})
geneBack <- geneBack[, keepCols]
pTab <- lapply(rownames(coefMat), function(targetName) {
lapply(colnames(geneBack), function(mutName) {
impVec <- coefMat[targetName, ]
genoVec <- geneBack[, mutName]
resTab <- data.frame(
targetName = targetName, mutName = mutName,
stringsAsFactors = FALSE
)
if (length(unique(na.omit(genoVec))) == 2) {
res <- t.test(impVec ~ genoVec, var.equal = TRUE, na.action = na.omit)
resTab$p <- res$p.value
resTab$FC <- (res$estimate[[2]] - res$estimate[[1]]) / abs(res$estimate[[1]])
} else if (length(unique(na.omit(genoVec))) >= 3) {
res <- anova(lm(impVec ~ genoVec, na.action = na.omit))
diffTab <- data.frame(val = impVec, gr = genoVec) %>%
filter(!is.na(gr)) %>%
group_by(gr) %>%
summarise(meanVal = mean(val))
resTab$p <- res$`Pr(>F)`[1]
resTab$FC <- max(diffTab$meanVal) - min(diffTab$meanVal)
}
resTab
}) %>% bind_rows()
}) %>%
bind_rows() %>%
arrange(p) %>%
mutate(p.adj = p.adjust(p, method = "BH"))
pTab
}
Perform association test between protein dependence and cancer type or mutation.
testRes <- diffImportance(importanceTab, cell_anno_final)
Prepare data for beeswarm plot visualization.
CancerType <- testRes %>%
filter(mutName == "cancer_type") %>%
filter(p.adj < 0.05, FC > 0.1)
plotTab <- t(scale(t(importanceTab))) %>%
data.frame() %>%
rownames_to_column("target") %>%
gather(key = "CellLine", value = "coef", -target) %>%
mutate(Cancer_Type = mutation_GDSC[CellLine, ]$TCGA.classification) %>%
group_by(target, Cancer_Type) %>%
mutate(meanCoef = mean(coef)) %>%
arrange(meanCoef) %>%
ungroup() %>%
mutate(target = factor(target, levels = unique(target)))
plotTab <- plotTab %>% filter(target %in% CancerType$targetName)
plotTab$Cancer_Type <- factor(plotTab$Cancer_Type,
levels = c(
"ALL", "AML", "DLBC",
"BRCAHer-", "BRCAHer+"
)
)
Create beeswarm plot for visualizing target importance versus cancer type associations
ggplot(plotTab, aes(x = target, y = coef, group = Cancer_Type)) +
geom_jitter(aes(color = Cancer_Type),
position = position_jitterdodge(
jitter.width = 0.2,
dodge.width = 0.8
),
size = 1.2
) +
stat_summary(
fun = mean, fun.min = mean, fun.max = mean, colour = "grey25",
geom = "crossbar", size = 0.8,
position = position_dodge(0.8)
) +
scale_color_manual(
values = c(
"#BC3C29FF", "#E18727FF",
"#20854EFF", "#0072B5FF", "#7876B1FF"
),
guide = guide_legend(override.aes = list(size = 3))
) +
ggtitle("Protein dependence associated with cancer type") +
ylab("Protein dependence coefficient") +
xlab("Protein") +
theme_bw() +
geom_vline(xintercept = seq(from = 1.5, to = 8.5, by = 1), color = "darkgrey") +
labs(color = "Cancer Type")
Prepare tables and color scheme for visualization
colList2 <- c(
`not significant` = "grey80", mutated = "#BC3C29FF",
unmutated = "#0072B5FF"
)
pos <- position_jitter(width = 0.15, seed = 10)
plotTab <- testRes %>%
filter(mutName != "cancer_type") %>%
mutate(type = ifelse(p.adj > 0.1, "not significant",
ifelse(FC > 0, "mutated", "unmutated")
)) %>%
mutate(varName = ifelse(type == "not significant", "", targetName)) %>%
mutate(p.adj = ifelse(p.adj < 1e-5, 1e-5, p.adj))
# subset for mutation with at least one significant association
plotMut <- unique(filter(testRes, p.adj <= 0.1)$mutName)
plotTab <- plotTab %>% filter(mutName %in% plotMut)
plotTab$type <- factor(plotTab$type,
levels = c("mutated", "unmutated", "not significant")
)
Create dot plot visualization for the significant associations between mutational background and protein dependence of GDSC cancer cell lines.
library(ggrepel)
p <- ggplot(data = plotTab, aes(
x = mutName, y = -log10(p.adj),
col = type, label = varName
)) +
geom_text_repel(position = pos, color = "black", size = 6, force = 3) +
geom_hline(yintercept = -log10(0.1), linetype = "dotted", color = "grey20") +
geom_point(size = 3, position = pos) +
ylab(expression(-log[10] * "(" * adjusted ~ italic("P") ~ value * ")")) +
xlab("Mutation") +
scale_color_manual(values = colList2) +
scale_y_continuous(trans = "exp", limits = c(0, 2.5), breaks = c(0, 1, 1.5, 2)) +
coord_flip() +
labs(col = "Higher dependence in") +
theme_bw() +
theme(
legend.position = c(0.80, 0.6),
legend.background = element_rect(fill = NA),
legend.text = element_text(size = 14),
legend.title = element_text(size = 16),
axis.title = element_text(size = 18),
axis.text = element_text(size = 18)
)
plot(p)
## Warning: Removed 1 rows containing missing values (geom_text_repel).
## Warning: Removed 1 rows containing missing values (geom_point).
Function for boxplot, given a t-test result table, coef/frequency matrix and cell line annotation object
library(ggbeeswarm)
# Function to format floats
formatNum <- function(i, limit = 0.01, digits = 1, format = "e") {
r <- sapply(i, function(n) {
if (n < limit) {
formatC(n, digits = digits, format = format)
} else {
format(n, digits = digits)
}
})
return(r)
}
#function for plot boxplot
plotDiffBox <- function(pTab, coefMat, cellAnno, fdrCut = 0.05) {
pTab.sig <- filter(pTab, p.adj <= fdrCut)
geneBack <- cellAnno
geneBack <- geneBack[colnames(coefMat), ]
pList <- lapply(seq(nrow(pTab.sig)), function(i) {
geno <- pTab.sig[i, ]$mutName
target <- pTab.sig[i, ]$targetName
pval <- pTab.sig[i, ]$p
plotTab <- tibble(
id = colnames(coefMat),
mut = geneBack[, geno], val = coefMat[target, ]
) %>% filter(!is.na(mut))
plotTab <- mutate(plotTab, mut = ifelse(mut %in% c("M", "1", 1),
"Mutated", "Unmutated"
)) %>% mutate(mut = factor(mut, levels = c("Unmutated", "Mutated")))
numTab <- group_by(plotTab, mut) %>% summarise(n = length(id))
plotTab <- left_join(plotTab, numTab, by = "mut") %>%
mutate(mutNum = sprintf("%s\n(n=%s)", mut, n)) %>%
arrange(mut) %>%
mutate(mutNum = factor(mutNum, levels = unique(mutNum)))
titleText <- sprintf("%s ~ %s %s", target, geno, "mutations")
pval <- formatNum(pval, digits = 1, format = "e")
titleText <- bquote(atop(.(titleText), "(" ~ italic("P") ~ "=" ~ .(pval) ~ ")"))
ggplot(plotTab, aes(x = mutNum, y = val)) +
stat_boxplot(geom = "errorbar", width = 0.3) +
geom_boxplot(outlier.shape = NA, col = "black", width = 0.4) +
geom_beeswarm(cex = 2, size = 2, aes(col = mutNum)) +
theme_classic() +
xlab("") +
ylab("Protein dependence") +
ggtitle(titleText) +
xlab("") +
scale_color_manual(values = c("#0072B5FF", "#BC3C29FF")) +
theme(
axis.line.x = element_blank(), axis.ticks.x = element_blank(),
axis.title = element_text(size = 18), axis.text = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
legend.position = "none"
)
})
names(pList) <- paste0(pTab.sig$targetName, "_", pTab.sig$mutName)
return(pList)
}
Visualization of exemplary association between NRAS mutation status and MAP2K2 dependence using a beeswarm plot.
pList <- plotDiffBox(testRes, importanceTab, cell_anno_final, 0.05)
pList$MAP2K2_NRAS
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggbeeswarm_0.6.0 ggrepel_0.9.1 pheatmap_1.0.12
## [4] RColorBrewer_1.1-3 missForest_1.5 dplyr_1.0.10
## [7] BiocParallel_1.32.0 ggplot2_3.3.6 tidyr_1.2.1
## [10] tibble_3.1.8 DepInfeR_1.2.0 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 lattice_0.20-45 assertthat_0.2.1
## [4] glmnet_4.1-4 digest_0.6.30 foreach_1.5.2
## [7] utf8_1.2.2 R6_2.5.1 evaluate_0.17
## [10] highr_0.9 pillar_1.8.1 itertools_0.1-3
## [13] rlang_1.0.6 jquerylib_0.1.4 magick_2.7.3
## [16] Matrix_1.5-1 rmarkdown_2.17 labeling_0.4.2
## [19] splines_4.2.1 stringr_1.4.1 munsell_0.5.0
## [22] vipor_0.4.5 compiler_4.2.1 xfun_0.34
## [25] pkgconfig_2.0.3 shape_1.4.6 htmltools_0.5.3
## [28] tidyselect_1.2.0 bookdown_0.29 codetools_0.2-18
## [31] matrixStats_0.62.0 randomForest_4.7-1.1 fansi_1.0.3
## [34] withr_2.5.0 grid_4.2.1 jsonlite_1.8.3
## [37] gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3
## [40] magrittr_2.0.3 scales_1.2.1 cli_3.4.1
## [43] stringi_1.7.8 cachem_1.0.6 farver_2.1.1
## [46] doRNG_1.8.2 bslib_0.4.0 ellipsis_0.3.2
## [49] generics_0.1.3 vctrs_0.5.0 iterators_1.0.14
## [52] tools_4.2.1 beeswarm_0.4.0 glue_1.6.2
## [55] purrr_0.3.5 rngtools_1.5.2 parallel_4.2.1
## [58] fastmap_1.1.0 survival_3.4-0 yaml_2.3.6
## [61] colorspace_2.0-3 BiocManager_1.30.19 knitr_1.40
## [64] sass_0.4.2