Contents

1 Introduction


Why do we need group markers

In biological and clinical research, the identification of biomarkers specific to a disease, tissue, or cell type is a critical step in advancing our understanding and application of this knowledge.

Specific biomarkers can be identified through various methods, such as transcriptomics, proteomics, or metabolomics, which can provide a global view of the molecular landscape of the system being studied.

To denote these biomarkers, we use the term “group markers” in a generic sense, referring to any molecular feature that is specific to a particular group or condition of interest.

Such markers can provide insight into disease diagnosis, prognosis, and treatment options, and can help to differentiate between diseased and healthy states. Thus the identification of the group markers is crucial in various biological and medical applications, as it allows us to distinguish between different cell types or disease states.

For example, in cancer research, identifying marker genes that are differentially expressed between cancer cells and normal cells can help diagnose and monitor the progression of the disease, as well as identify potential therapeutic targets. Similarly, in developmental biology, identifying marker genes that are specific to certain cell types or stages can help us understand the underlying mechanisms of differentiation and development.

Furthermore, marker genes can also be used in diagnostic assays to detect specific diseases or monitor treatment responses. For instance, the presence of certain marker genes in a patient’s blood or tissue sample can indicate the presence or progression of a disease.

Importance of immune cells signature in tumor micro-environment (TME)

In this demo, we will be using the example of immune cell signature in the context of the TME as a practical case study to showcase the application of our approach.

TME is made up of a diverse range of cell types (fibroblasts, epithelial cells, endothelial cells, and immune cells) as well as various extracellular components (collagens, growth factors, hormones, cytokines, etc.). Tumor immune micro-environment (TIME) is reported to be highly associated with prognosis and various treatment response to many kinds of cancers.

Recent studies have highlighted the role of immune components in the TME in modulating tumor progression, making them attractive therapeutic targets. These components make up TIME, which is a subset of the TME. Subsequently, it’s proved that tumor infiltrating lymphocytes (TILs) play an essential role in tumor progression, metastasis and treatment response.

This drives TILs to be a strong prognostic indicator for better precision therapy of cancer patients. The identification of these TILs are the subject of extensive research interest due to the roles of specific subset of immune cells acting on different tissues types.

Identification of these cell types are based on flow cytometry in the past, which has limited granularity. But now with the advance of single cell RNA-seq and spatial transcriptomics technologies, more detailed and novel cell types or subtypes can be identified. One of the key advantages of scRNA-seq and spatial technologies is the ability to investigate cell types in the TIME and understand cellular heterogeneity in the TME.

To quantify TILs infiltration (in bulk), estimate cellular composition (in bulk), annotate single cell types (in scRNA) or identify sample states/activities, many computational methods like cell deconvolution (CIBERSORTx), marker based annotation (CelliD) and sample scoring (singscore) are developed. But to distinguish between closely related cell types, estimate cell composition and states, a refined signature is required which typically requires manual curation by domain expert. With the increasing large amount of data and cell types, this is gradually getting to be untenable and will require a different approach to automatically screen such signatures.

mastR is a software package specifically designed to automatically screen signatures of interest for particular research questions. This automated process saves significant time and effort that would otherwise be required for manual labor.

What this package does

mastR, Markers Automated Screening Tool in R, is a package to automatically derive signature for specific group of interest in specific tissue.

With mastR, users can simply input their expression data containing the groups of interest, along with group labels, to obtain a list of marker genes (signature) for the target group.

While there are many tools available for generating cell type-specific markers, they are primarily designed for scRNA-seq data and often rely on machine learning algorithms to select relevant features for classification. However, these methods may lack robustness and consistency across datasets when compared to using statistical methods like empirical Bayesian in limma. Furthermore, some of these tools may return a signature even when the data does not contain any, leading to potential inaccuracies.

Although differential expression (DE) analysis can also be done on scRNA data, like Seurat::FindMarkers(), it’s reported that DE analysis done on pseudo-bulk scRNA data is more robust and reliable than directly done on scRNA data.

Thus, mastR is designed to generate a more refined list of signature genes from multiple group comparisons based on the results from edgeR and limma DE analysis workflow. The final signature is selected by rank product score test for picking up genes with high ranks in the most of comparisons. The rank can be ordered by any gene statistic generated by limma analysis. Signature can be further refined by keeping the top n DEGs in the specified comparison(s), which can help to improve the discrimination between fairly similar cell types.

Another unique advantage of mastR is that it takes into account the background expression of tissue-specificity, which is often ignored by other marker generation tools. Unlike most tools that only consider the genes’ contribution to the classification, mastR also offers a background expression removal function. This function is designed to remove parts of the marker genes that are highly expressed in specific tissues or cancer cells. These signals from the background, regarded as background expression, can cause potential ambiguity when applying the markers to specific tissues due to background or sample purity issues. This feature makes mastR apart and enhances the accuracy and specificity of the generated markers.

Furthermore, mastR allows users to build a markers pool before signature screening, which might contain the potential markers of interest to the users. The final signature will be integrated with this pool (by intersection), which can help to constrain the final signature within the interested pathway-related genes or functional gene-sets. People can borrow the published knowledge to build this.

The motivation of this package arises from the importance and necessity of identifying specific genes that are differentially expressed in different groups or tissues, as these genes can serve as biomarkers for diagnosis, prognosis, and therapeutic targeting. However, identifying marker genes can be a challenging and time-consuming task, and the presence of background expression can lead to erroneous results. Our package simplifies and streamlines this process, allowing researchers to focus on their analyses and interpretations without the burden of manual marker gene selection and background expression removal.

This report demonstrates the main functions and applications of mastR 1.2.3.

This report demonstrates the signature screening workflow of NK cells in colorectal cancer (CRC), assessing the results by using in-built visualization functions.

mastR screen the signature using the following 3 key steps:

step 1. build markers pool
step 2. identify signature from the pool
step 3. refine signature by background expression
step 4. visualize signature performance

Applications

2 Installation


mastR R package can be installed from Bioconductor or GitHub.

The most updated version of mastR is hosted on GitHub and can be installed using devtools::install_github() function provided by devtools.

# if (!requireNamespace("devtools", quietly = TRUE)) {
#   install.packages("devtools")
# }
# if (!requireNamespace("mastR", quietly = TRUE)) {
#   devtools::install_github("DavisLaboratory/mastR")
# }

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("mastR", quietly = TRUE)) {
  BiocManager::install("mastR")
}
packages <- c(
  "BiocStyle",
  "clusterProfiler",
  "ComplexHeatmap",
  "depmap",
  "enrichplot",
  "ggrepel",
  "gridExtra",
  "jsonlite",
  "knitr",
  "rmarkdown",
  "RobustRankAggreg",
  "rvest",
  "singscore",
  "UpSetR"
)
for (i in packages) {
  if (!requireNamespace(i, quietly = TRUE)) {
    install.packages(i)
  }
  if (!requireNamespace(i, quietly = TRUE)) {
    BiocManager::install(i)
  }
}
library(mastR)
library(edgeR)
library(ggplot2)
library(GSEABase)

3 Step 1. Build Markers Pool


The first step is to define the original markers pool this analysis will be based on.

The final signature will only be the intersected genes with this markers pool. The whole gene list of the data will be regarded as the markers pool if no preliminary result or interested genes are provided.

Note: markers = NULL won’t keep any special genes if they fail the filtration by edgeR.

If users have any preliminary knowledge about the target group type (cell type), they can build the markers pool of interest from the available datasets or build from MSigDB, PanglaoDB or LM7/LM22 using our in-built functions. All genes in the pool will be reserved for DE analysis even even if they are filtered out during the filtration of edgeR.

The standard pool building process involves the following:

  1. generate from sources
    1. LM7/22 signature matrix for CIBERSORT
    2. MSigDB
    3. PanglaoDB
    4. Customized gene list
  2. merge gene-sets together

mastR allows the following markers to be conveniently loaded:

All markers generation functions will return GeneSet or GeneSetCollection object.

3.1 Generate Markers from Sources

To generate the immune cell signatures, firstly users need to define a pool of markers of interest. mastR allows users to build markers pool from multiple resources.

In this demo, we will load some some publicly available example datasets.

3.1.1 i) Leukocyte gene signature Matrix (LM)

lm7/lm22 are immune cells signature matrices from CIBERSORT, contains 7 or 22 immune cell types.

Users can use function get_lm_sig() to generate immune cells markers from LM7 and/or lm22.

data("lm7", "lm22")

## collect both LM7 and LM22
LM <- get_lm_sig(lm7.pattern = "^NK", lm22.pattern = "NK cells")
LM
#> GeneSetCollection
#>   names: LM7, LM22 (2 total)
#>   unique identifiers: CD244, FASLG, ..., ZNF135 (92 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: NullCollection (1 total)

Function gsc_plot() allows visualization of an UpSetR plot across all gene-sets.

## show upset diagram
gsc_plot(LM)

3.1.2 ii) MSigDB

The Molecular Signatures Database (MSigDB) is a database of annotated gene sets, typically used for pathway analysis.

Users can use get_gsc_sig() to generate a collection of gene sets for the biological subjects of interest.

In this case, NK cell relevant gene-sets from msigdb_gobp_nk are collected.

data("msigdb_gobp_nk")
MSig <- get_gsc_sig(
  gsc = msigdb_gobp_nk,
  pattern = "NATURAL_KILLER_CELL_MEDIATED"
)
MSig
#> GeneSetCollection
#>   names: GOBP_NATURAL_KILLER_CELL_MEDIATED_IMMUNITY, GOBP_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY_DIRECTED_AGAINST_TUMOR_CELL_TARGET, ..., GOBP_POSITIVE_REGULATION_OF_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY (18 total)
#>   unique identifiers: AP1G1, ARRB2, ..., KLRC4-KLRK1 (67 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)

Note: data = "msigdb" allows searching of MsigDB without loading into the environment. species or version are optional parameters.

Similarly, using gsc_plot() users can visualize the overlapping gene sets.

As the gene-set names are too long, for better visualization the gene-set names are replaced with letters and shown below.

## cut gene set name within 11 characters
gsn <- setNames(names(MSig), LETTERS[seq_along(MSig)])
for (i in seq_along(MSig)) {
  setName(MSig[[i]]) <- LETTERS[i]
}

## show upset diagram of collected gene-sets
gsc_plot(MSig)

gsn[c("A", "M", "D")] ## show gene-set names of top 3
#>                                                          A 
#>               "GOBP_NATURAL_KILLER_CELL_MEDIATED_IMMUNITY" 
#>                                                          M 
#>           "GOBP_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY" 
#>                                                          D 
#> "GOBP_REGULATION_OF_NATURAL_KILLER_CELL_MEDIATED_IMMUNITY"

There are 18 gene-sets in MSig, which is too many for visualization. Thus, we use function merge_markers() to merge all gene-sets into one GeneSet object. (Note: Users can directly use the un-merged object for subsequent analyses.)

## merge all gene sets into one
MSig <- merge_markers(MSig)
setName(MSig) <- "MSigDB"

3.1.3 iii) PanglaoDB

PanglaoDB is a database of single cell RNA sequencing experiments with cell type markers.

Users can use get_panglao_sig() to generate markers based on the required organs and cell types.

Functions list_panglao_organs() and list_panglao_types() show all available organs and cell types in PanglaoDB.

Note: This requires real-time connection to PanglaoDB, thus not run in this demo.

## show available organs on PanglaoDB
list_panglao_organs()

## show available cell types of interest organ on PanglaoDB
list_panglao_types(organ = "Immune system")

## collect all "NK cells" markers from PanglaoDB website
Panglao <- get_panglao_sig(type = "NK cells")

Panglao

3.1.4 iv) Customized gene list

Customized markers can be imported as a GeneSet object.

Here we demonstrate this by loading the in-built nk_markers markers dataset and convert it into GeneSet object.

## show what nk_markers looks like:
data("nk_markers")
nk_markers
#> # A tibble: 114 × 4
#>    HGNC_Symbol LM22  LM7   Huntington
#>    <chr>       <chr> <chr> <chr>     
#>  1 APOBEC3G    TRUE  -     -         
#>  2 APOL6       TRUE  -     -         
#>  3 AZU1        TRUE  -     -         
#>  4 BPI         TRUE  -     -         
#>  5 CAMP        TRUE  -     -         
#>  6 CCL4        TRUE  -     -         
#>  7 CCL5        TRUE  -     TRUE      
#>  8 CCND2       TRUE  -     -         
#>  9 CD160       TRUE  -     -         
#> 10 CD2         TRUE  -     -         
#> # ℹ 104 more rows

## convert NK markers into `GeneSet` object
nk_m <- GeneSet(nk_markers$HGNC_Symbol,
  geneIdType = SymbolIdentifier(),
  setName = "NK_markers"
)

3.2 Pool Markers from Sources

With multiple lists of markers from different sources, use merge_markers() to merge them into one GeneSet object.

gsc <- GeneSetCollection(c(nk_m, LM, MSig)) ## add Panglao if you run it
Markers <- merge_markers(gsc)

## upset plot
gsc_plot(gsc)


Markers
#> setName: merged_markers_pool 
#> geneIds: APOBEC3G, APOL6, ..., KLRC4-KLRK1 (total: 167)
#> geneIdType: Symbol
#> collectionType: Computed 
#> details: use 'details(object)'

The summary of the merged list is saved as longDescription in the output.

## to show the table summary of merged list
head(jsonlite::fromJSON(GSEABase::longDescription(Markers)))
#>       Gene NK_markers LM7 LM22 MSigDB
#> 1    AP1G1          -   -    -   TRUE
#> 2 APOBEC3G       TRUE   - TRUE      -
#> 3    APOL6       TRUE   - TRUE      -
#> 4    ARRB2          -   -    -   TRUE
#> 5     AZU1       TRUE   - TRUE      -
#> 6      BPI       TRUE   - TRUE      -

Now with a merged markers pool, we refine the signature genes for group specificity and tissue background removal.

4 Step 2. Signature Identification for Target Group


For group specificity, there are 3 main steps:

  1. differential expression (DE) analysis: filtration, normalization, sample weighting and linear model fit;
  2. feature selection: select differentially expressed genes based on rank product score;
  3. constrain selected genes within the markers pool.

Note: External data is accepted in different formats (e.g. DGEList, eSet, matrix). Input data must be raw counts or log-transformed expression data.

In this demo, we use the imported example data im_data_6 from GSE60424 (Download using GEOquery::getGEO()), consisting of immune cells from healthy individuals.

im_data_6 is a eSet object, containing RNA-seq TMM normalized counts data of 6 sorted immune cell types each with 4 samples. More details in ?mastR::im_data_6.

data("im_data_6")
im_data_6
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 50045 features, 24 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: GSM1479438 GSM1479439 ... GSM1479525 (24 total)
#>   varLabels: title geo_accession ... years since diagnosis:ch1 (66
#>     total)
#>   varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#>   pubMedIds: 25314013 
#> Annotation: GPL15456

4.1 a) data processing

To screen signature for the group specificity, the data needs to be pre-processed.

process_data() in mastR does the ‘End-to-End’ differential expression analysis using edgeR and limma pipeline as a single function call.

Processes under the hood:

  • Filter data by the given cutoff, genes with low expression will be removed by edgeR.

  • If data is raw counts (normalize = TRUE), normalize data by ‘TMM’ and fit it using limma::voom(). Otherwise apply trend of limma on normalized data.

  • Fit linear model.

  • Compute gene statistic for differential expression analysis using limma::treat().

im_data_6 has 6 immune cell types:

table(im_data_6$`celltype:ch1`)
#> 
#>     B-cells         CD4         CD8   Monocytes          NK Neutrophils 
#>           4           4           4           4           4           4

process_data() requires an expression matrix and group labels of the samples. This returns a DGEList object with processed data.

To keep consistent with Markers, use param gene_id to convert ENSEMBL IDs of im_data_6 to SYMBOLs for intersection.

Refer to help(proc_data) for optional parameters.

proc_data <- process_data(
  data = im_data_6,
  group_col = "celltype:ch1",
  target_group = "NK",
  markers = geneIds(Markers),
  gene_id = "ENSEMBL" ## rownames of im_data_6 is ENSEMBL ID
)
#> 'select()' returned 1:many mapping between keys and columns
#>        NK-Neutrophils NK-Monocytes NK-B.cells NK-CD4 NK-CD8
#> Down             4012         3949       3146   2698   2155
#> NotSig           1492         2694       4429   5001   6201
#> Up               4945         3806       2874   2750   2093

attributes(proc_data)
#> $names
#> [1] "counts"          "samples"         "original_counts" "vfit"           
#> [5] "tfit"           
#> 
#> $class
#> [1] "DGEList"
#> attr(,"package")
#> [1] "edgeR"

For ease of process later in this demo, add the expression matrix E fitted by limma::voom() as ‘voomE’ into proc_data.

## add voom fitted expression as a new list of proc_data for use
proc_data$voomE <- proc_data$vfit$E

Note:

mastR provides visualization functions to compare the data before and after process_data() and assess the quality control (QC) by using plot_diagnostics() and plot_mean_var().

To assess the removal of the low quality genes, use plot_diagnostics() to show the expression distribution, RLE and MDS plots.

4.2 b) signature selection based on differential expression

For selection of group specific signature, pass proc_data to select_sig() function. Genes with high rank product score in DE results are selected.

mastR automatically determines if feature selection is required, the default approach (feature_selection = "auto") performs rank product scoring to select genes. But if the numbers of the result are < 5, no feature selection will be conducted (switch to feature_selection = "none").

## get the same result as there's permutation test for rank product
set.seed(123)
sig_ct <- select_sig(proc_data, feature_selection = "auto")

head(sig_ct)
#> GeneSetCollection
#>   names: UP, DOWN (2 total)
#>   unique identifiers: ENSG00000149294, ENSG00000173068, ..., ENSG00000162591 (98 total)
#>   types in collection:
#>     geneIdType: NullIdentifier (1 total)
#>     collectionType: NullCollection (1 total)

Note:

mastR also implements a feature to further optimize the discriminative power of the signature between fairly similar groups by using params keep.top and keep.group. More details in help(select_sig).

Tips:

All above steps from 4.1 to 4.2 for refining group signature can be done using an integrated function get_degs().

It will return a list of a processed data proc_data and a GeneSetCollection DEGs with UP and DOWN regulated genes.

QC plots can also be shown by setting optional param plot = TRUE.

4.3 c) constrain signature within markers pool

To constrain the final signature within the interested gene list, intersect the signature genes sig_ct with the markers pool Markers. In this demo, only the UP regulated genes are kept.

For consistency with Markers type, convert ENSEMBL IDs of im_data_6 to gene SYMBOLs.

## convert ensembl IDs into symbols to match markers pool
deg_up <- mapIds(
  org.Hs.eg.db::org.Hs.eg.db,
  geneIds(sig_ct[["UP"]]),
  "SYMBOL", "ENSEMBL"
)
#> 'select()' returned 1:1 mapping between keys and columns
deg_up <- na.omit(deg_up)
## markers specific for NK cells
m_ct <- intersect(geneIds(Markers), deg_up)
names(m_ct) <- names(deg_up)[match(m_ct, deg_up)] ## set ensembl ID as names for downstream visualization

head(m_ct)
#> ENSG00000122223 ENSG00000198821 ENSG00000172543 ENSG00000145649 ENSG00000197540 
#>         "CD244"         "CD247"          "CTSW"          "GZMA"          "GZMM" 
#> ENSG00000100385 
#>         "IL2RB"

Users can use pca_matrix_plot() to assess the group separation.

Looking at the variance of PC1 shown below, it’s clear the intersected genes explain more variance in PC1 compared with all UP DEGs.

## PCA shows clear separation of NK cells
## after intersection
pca_matrix_plot(proc_data,
  features = m_ct,
  group_by = "celltype.ch1",
  slot = "voomE",
  n = 3,
  gene_id = "ENSEMBL"
) +
  patchwork::plot_annotation("Intersected UP DEGs")
#> 'select()' returned 1:many mapping between keys and columns

## before intersection
pca_matrix_plot(proc_data,
  features = as.vector(deg_up),
  group_by = "celltype.ch1",
  slot = "voomE",
  n = 3,
  gene_id = "ENSEMBL"
) +
  patchwork::plot_annotation("All UP DEGs")
#> 'select()' returned 1:many mapping between keys and columns

Note:

Wrapper function filter_subset_sig() can complete all processes in Step 2 using a single function. The result would be the same.

filter_subset_sig() is helpful when users have multiple datasets (more details in Section 7.1). It can output the final signature after aggregating signature lists from all datasets.

5 Step 3. Signature Refinement by Background Expression in Tissue


Further refinement can be achieved by eliminating genes with strong signal in the cancer cells or tissues, regarded as background expression.

mastR utilizes a signal-to-noise ratio (SNR) approach to filter out genes with low SNR values that have low discriminative power between the group of interest and the background. This removes tumor purity as a factor in the identified signature markers.

Here the signal is the expression of the signature in signal data and the noise is the expression of the signature in background data.

Two input datasets are required:

Note: Both of signal data and background data must be log-normalized.

The signal data should be the result generated by process_data() function.

The refinement consists of:

  1. data subset
  2. data filtration
  3. markers removal
  4. combination with the signature

5.1 I) data subsetting

To remove genes with background expression, subset the samples based on the context. In this demo, we are looking to remove CRC-specific signal.

Here we choose to load in-built ccle_crc_5 as the background data. The CRC adherent cell lines data is extracted (bg_mat).

The signal data is in voomE of proc_data and NK cells data is extracted (sig_mat).

data("ccle_crc_5")

## subset CRC cell lines of bg data
bg_mat <- ccle_crc_5$counts[, ccle_crc_5$samples$cancer == "CRC"]
## subset all NK cells of sig data
sig_mat <- proc_data$voomE[, proc_data$samples$celltype.ch1 == "NK"]

Note:

ccle_crc_5 is a DGEList object contains 5 CRC cell line samples from CCLE. More details for help(ccle_crc_5).

DepMap CCLE is a large database containing RSEM quantified TPM data of more than 1,000 cancer cell lines.

5.2 II) data filtration

Because SNR is computed using scaled expression across genes, low-expression genes must be removed.

In this case, sig_mat from proc_data has already been filtered, therefore background data (bg_mat) needs to be filtered.

bg_mat has only 5 samples, genes with logTPM > 1 within more than 2 samples are kept.

keep <- rowSums(bg_mat > 1, na.rm = TRUE) > 2
bg_mat <- bg_mat[keep, ]

5.3 III) markers pool refinement

To refine the markers pool, use remove_bg_exp_mat() to keep genes from Markers with high expression in sig_mat and low expression in bg_mat.

This ensures the specificity of the markers minimizing the context effect.

Similarly, use gene_id to convert gene IDs of input to the same type (SYMBOL). As sig_mat uses ENSEMBL IDs, gene_id types have to be defined in the order: sig_mat, bg_mat.

m_ccl <- remove_bg_exp_mat(
  sig_mat = sig_mat,
  bg_mat = bg_mat,
  markers = geneIds(Markers),
  gene_id = c("ENSEMBL", "SYMBOL")
)
#> 'select()' returned 1:many mapping between keys and columns
#> Warning in remove_bg_exp_mat(sig_mat = sig_mat, bg_mat = bg_mat, markers = geneIds(Markers), : Gene CCL4 is not in the sig_mat! So remove it!
#> Gene CCL5 is not in the sig_mat! So remove it!
#> Gene KIR2DL2 is not in the sig_mat! So remove it!
#> Gene KIR2DL5A is not in the sig_mat! So remove it!
#> Gene KIR2DS2 is not in the sig_mat! So remove it!
#> Gene KIR2DS5 is not in the sig_mat! So remove it!
#> Gene KLRA1P is not in the sig_mat! So remove it!
#> Gene LOC653757 is not in the sig_mat! So remove it!
#> Gene MGC24103 is not in the sig_mat! So remove it!
#> Gene TRDC is not in the sig_mat! So remove it!
#> Gene MGC61571 is not in the sig_mat! So remove it!
#> Gene PIK3R6 is not in the sig_mat! So remove it!

head(m_ccl)
#> [1] "IL32"   "RAB27B" "STYK1"  "RAB27A" "TXK"    "TTC38"

5.4 use CCLE as background data (optional)

Users can also use CCLE downloaded by depmap::depmap_TPM() for the refinement.

The entire CCLE data is quite large, thus use ccle_crc_5 to construct a small pseudo-CCLE data ccle with similar format.

ccle <- data.frame(ccle_crc_5$counts,
  gene_name = rownames(ccle_crc_5),
  primary_disease = "CRC"
) |>
  tidyr::pivot_longer(-c(gene_name, primary_disease),
    names_to = "depmap_id",
    values_to = "rna_expression"
  )

ccle
#> # A tibble: 95,885 × 4
#>    gene_name primary_disease depmap_id                 rna_expression
#>    <chr>     <chr>           <chr>                              <dbl>
#>  1 TSPAN6    CRC             SNU283_LARGE_INTESTINE             4.91 
#>  2 TSPAN6    CRC             TGBC18TKB_LARGE_INTESTINE          6.05 
#>  3 TSPAN6    CRC             SW837_LARGE_INTESTINE              5.26 
#>  4 TSPAN6    CRC             SNU1040_LARGE_INTESTINE            5.16 
#>  5 TSPAN6    CRC             HT29_LARGE_INTESTINE               4.40 
#>  6 TNMD      CRC             SNU283_LARGE_INTESTINE             0.176
#>  7 TNMD      CRC             TGBC18TKB_LARGE_INTESTINE          0    
#>  8 TNMD      CRC             SW837_LARGE_INTESTINE              0.333
#>  9 TNMD      CRC             SNU1040_LARGE_INTESTINE            0.465
#> 10 TNMD      CRC             HT29_LARGE_INTESTINE               0    
#> # ℹ 95,875 more rows

Because ccle is a long data, subsetting needs to be done before converting it into matrix.

In this demo, we still extract CRC adherent cell lines data from it.

## subset CRC cell lines of bg data
ccle <- ccle[ccle$primary_disease == "CRC", ]

As remove_bg_exp_mat() can only accept matrix in wide format, users can use ccle_2_wide() to convert ccle into wide matrix.

ccle <- ccle_2_wide(ccle = ccle)

ccle[1:3, 1:3]
#>        SNU283_LARGE_INTESTINE TGBC18TKB_LARGE_INTESTINE SW837_LARGE_INTESTINE
#> TSPAN6              4.9140861                  6.045923             5.2570106
#> TNMD                0.1763228                  0.000000             0.3334237
#> DPM1                6.9468478                  6.724105             6.9242187

The same filtration cutoff used in bg_mat above is used for ccle.

keep <- rowSums(ccle > 1, na.rm = TRUE) > 2
ccle <- ccle[keep, ]

Use remove_bg_exp_mat() to refine the markers pool based on ccle. As we use ccle_crc_5 to construct ccle, the result should be the same.

m_ccl <- remove_bg_exp_mat(
  sig_mat = sig_mat,
  bg_mat = ccle,
  markers = geneIds(Markers),
  gene_id = c("ENSEMBL", "SYMBOL")
)
#> 'select()' returned 1:many mapping between keys and columns
#> Warning in remove_bg_exp_mat(sig_mat = sig_mat, bg_mat = ccle, markers = geneIds(Markers), : Gene CCL4 is not in the sig_mat! So remove it!
#> Gene CCL5 is not in the sig_mat! So remove it!
#> Gene KIR2DL2 is not in the sig_mat! So remove it!
#> Gene KIR2DL5A is not in the sig_mat! So remove it!
#> Gene KIR2DS2 is not in the sig_mat! So remove it!
#> Gene KIR2DS5 is not in the sig_mat! So remove it!
#> Gene KLRA1P is not in the sig_mat! So remove it!
#> Gene LOC653757 is not in the sig_mat! So remove it!
#> Gene MGC24103 is not in the sig_mat! So remove it!
#> Gene TRDC is not in the sig_mat! So remove it!
#> Gene MGC61571 is not in the sig_mat! So remove it!
#> Gene PIK3R6 is not in the sig_mat! So remove it!

head(m_ccl)
#> [1] "IL32"   "RAB27B" "STYK1"  "RAB27A" "TXK"    "TTC38"

Note:

Wrapper function remove_bg_exp() can complete Step 3. I) II) III) all using one single function.

Users can use the entire CCLE data from DepMap without loading it by using param bg_data = 'CCLE'. More details in help(remove_bg_exp).

5.5 IV) combination with Signature

To get the final signature with high group specificity and low background expression, intersect the result m_ccl with the signature m_ct we get from Step 2.

sig_NK_CRC <- intersect(m_ct, m_ccl)
head(sig_NK_CRC)
#> [1] "CD244" "CD247" "CTSW"  "GZMA"  "GZMM"  "IL2RB"

Here, if we directly use m_ct instead of Markers as input markers for remove_bg_exp_mat() or remove_bg_exp(), m_ccl will already be the intersection result of the original m_ccl and m_ct, then we can skip this.

6 Step 4. Visualization of Final Results


To assess how well the refined signature can discriminate our target group from others, mastR provides the following visualization functions for different purposes:

6.1 Heatmap

Use sig_heatmap() function to compare the expression pattern between the original markers pool and the final signature in proc_data.

  • sigs: final signature (or list of multiple signatures).
  • markers: original markers pool (optional).
sig_heatmap(
  data = proc_data,
  sigs = sig_NK_CRC,
  group_col = "celltype.ch1",
  markers = geneIds(Markers),
  gene_id = "ENSEMBL",
  slot = "voomE",
  scale = "row",
  show_column_den = FALSE,
  show_row_den = FALSE,
  show_column_names = FALSE,
  show_row_names = FALSE
)
#> 'select()' returned 1:many mapping between keys and columns
#> Gene CCL4 is not in data.
#> Gene CCL5 is not in data.
#> Gene KIR2DL2 is not in data.
#> Gene KIR2DL5A is not in data.
#> Gene KIR2DS2 is not in data.
#> Gene KIR2DS5 is not in data.
#> Gene KLRA1P is not in data.
#> Gene LOC653757 is not in data.
#> Gene MGC24103 is not in data.
#> Gene TRDC is not in data.
#> Gene MGC61571 is not in data.
#> Gene PIK3R6 is not in data.
#> 'select()' returned 1:many mapping between keys and columns