Spatial transcriptomics allows the spatial profiling of complex tissue architectures. The spatial arrangement and interactions between cells can aid in understanding of complex functions and regulatory mechanisms in various tissue micro environments. Commercially available image-based spatial technologies such as Xenium, CosMX, and MERSCOPSE take advantage of fluorescence-based microscopy to quantify transcripts and focus on a pre-designed panel of genes.
One crucial step in the analysis of spatial transcriptomics data is cell type annotation. There are a number of ways to perform cell type annotation, and marker analysis is one of them. Marker gene analysis to identify genes highly expressed in each cluster compared to the remaining clusters. The identified marker genes are used to annotate clusters with cell types. Computationally tools originally developed for single cell data are used for spatial transcriptomics studies. However, those methods ignore the spatial information for the cells and gene. There is limited literature on developing marker gene detection methods that account for the spatial distribution of gene expression. jazzPanda provides two novel approaches to detect marker genes with transformed spatial information. Our first approach is based on correlation and the second linear modelling approach can account for technical noise and work for studies with multiple samples.
We assume a marker gene will show a significant linear relationship with the target cluster over the tissue space. This suggests that the transcript detection of a marker gene will show similar patterns with the cells from the cluster over tissue space. Given the cluster label of every cell, there are two steps to obtain marker genes for every cluster. We first compute spatial vectors from the spatial coordinates of the genes and the clusters. After that, we can measure the linear relationship between the genes and clusters based on spatial vectors. We develop two approaches for detecting genes that show strong linear relationship with the cluster.
Step 1: Create spatial vectors for every gene and every cluster
get_vectors()
can be used to convert the transcript detection and the cell centroids to spatial vectors. You can specify the tile shape and length based on you dataset. The hex bins will generally take longer than the square/rectangle bins to compute. In practice, we find that we can choose the length tiles such that the average cell per tile (\# cells per cluster\# tiles) is close to one for each cluster.
Step 2: Detect marker genes
Correlation approach: compute_permp()
This approach can detect marker genes for one sample study. We calculate a correlation coefficient between every pair of gene and cluster vector. We perform permutation testing to assess the statistical significance of the calculated correlation and followed by multiple testing adjustment to control the false discovery rate. We keep the genes with significant adjusted p-value and large correlation as marker genes. In practice, we recommend to calculate a correlation threshold value for every cluster based on the data distribution. During our analysis, we use 75% quantile value of all correlations to the given cluster as the cutoff and manage to keep meaningful marker genes.
Linear modelling approach: lasso_markers()
In this approach, we treat the gene vector as the response variable and the cluster vectors as the explanatory variables. We select the important cluster vectors by lasso regularization first, and fit a linear model to find the cluster that show minimum p-value and largest model coefficient. This approach can account for multi-sample studies and the technical background noise (such as the non-specific binding). We recommend to set a model coefficient cutoff value based on your data. A large cutoff value will result in fewer marker genes whereas a small cutoff value will detect more marker genes. We find a cutoff value at 0.1 or 0.2 work well for our analysis. Marker genes with less than 0.1 model coefficient are generally weak markers.
Two data frames will be returned from this approach:
We record the most significant cluster for each gene by calling the get_top_mg()
function. This table provides unique marker genes for every cluster. Genes whose top cluster shows a model coefficient smaller than the specified cutoff value will not be labeled as marker genes.
We record all the significant cluster for each gene by using the function get_full_mg()
. This table can be used to investigate shared marker genes for different clusters.
The dataset used in this vignette is a selected subset from two replicates of Xenium human breast cancer tissue Sample 1. We select 20 genes for package illustration. This subset was extracted from the raw dataset as described in the R script located at system.file("script","generate_vignette_data.R", package="jazzPanda")
.
This subset data is used for package illustration purpose only. The resulting marker genes may not be strong markers for annotating clusters. Please see the full analysis for this dataset for marker genes.
library(jazzPanda)
library(SpatialExperiment)
library(ggplot2)
library(grid)
library(data.table)
library(dplyr)
library(glmnet)
library(caret)
library(corrplot)
library(igraph)
library(ggraph)
library(ggrepel)
library(gridExtra)
library(utils)
library(spatstat)
library(tidyr)
library(ggpubr)
# ggplot style
theme(strip.text = element_text(size = rel(1)),
defined_theme <-strip.background = element_rect(fill = NA,
colour = "black"),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="none",
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
plot.background=element_blank())
data(rep1_sub, rep1_clusters, rep1_neg)
$cluster<-factor(rep1_clusters$cluster,
rep1_clusterslevels=paste("c",1:8, sep=""))
# record the transcript coordinates for rep1
::unsplitAsDataFrame(molecules(rep1_sub))
rep1_transcripts <-BumpyMatrix as.data.frame(rep1_transcripts)
rep1_transcripts <-colnames(rep1_transcripts) <- c("feature_name","cell_id","x","y")
# record the negative control transcript coordinates for rep1
BumpyMatrix::unsplitAsDataFrame(molecules(rep1_neg))
rep1_nc_data <- as.data.frame(rep1_nc_data)
rep1_nc_data <-colnames(rep1_nc_data) <- c("feature_name","cell_id","x","y","category")
# record all real genes in the data
unique(as.character(rep1_transcripts$feature_name))
all_real_genes <- unique(rep1_clusters[,c("anno","cluster")]) all_celltypes <-
We can plot the cells coordinates for each cluster of Replicate 1 subset
ggplot(data = rep1_clusters,
p1<-aes(x = x, y = y, color=cluster))+
geom_point(position=position_jitterdodge(jitter.width=0,
jitter.height=0), size=0.1)+
scale_y_reverse()+
theme_classic()+
facet_wrap(~sample)+
scale_color_manual(values = c("#FC8D62","#66C2A5" ,"#8DA0CB","#E78AC3",
"#A6D854","skyblue","purple3","#E5C498"))+
guides(color=guide_legend(title="cluster", nrow = 2,
override.aes=list(alpha=1, size=2)))+
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
panel.spacing = grid::unit(0.5, "lines"),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="none",
strip.text = element_text(size = rel(1)))+
xlab("")+
ylab("")
ggplot(data = rep1_clusters,
p2<-aes(x = x, y = y, color=cluster))+
geom_point(position=position_jitterdodge(jitter.width=0,
jitter.height=0), size=0.1)+
facet_wrap(~cluster, nrow = 2)+
scale_y_reverse()+
theme_classic()+
scale_color_manual(values = c("#FC8D62","#66C2A5" ,"#8DA0CB","#E78AC3",
"#A6D854","skyblue","purple3","#E5C498"))+
guides(color=guide_legend(title="cluster", nrow = 1,
override.aes=list(alpha=1, size=4)))+
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.spacing = grid::unit(0.5, "lines"),
legend.text = element_text(size=10),
legend.position="none",
legend.title = element_text(size=10),
strip.text = element_text(size = rel(1)))+
xlab("")+
ylab("")
patchwork::plot_spacer()
spacer <- (p1 / spacer) | p2
layout_design <-
layout_design +
layout_design <- patchwork::plot_layout(widths = c(1, 4), heights = c(1, 1))
print(layout_design)
We can visualize the spatial vectors for clusters and genes as follows. As an example for creating spatial vectors for genes, we plot the transcript detections for the gene EPCAM over tissue space, along with the square and hex binning result. Similarly, we plot the cell coordinates in cluster c1, as well as the square and hex bin values over the space as an example. We can see that with the square and hex bins capture the key patterns of the original coordinates. Hex bins can capture more details than square bins.
c(min(floor(min(rep1_transcripts$x)),
w_x <-floor(min(rep1_clusters$x))),
max(ceiling(max(rep1_transcripts$x)),
ceiling(max(rep1_clusters$x))))
c(min(floor(min(rep1_transcripts$y)),
w_y <-floor(min(rep1_clusters$y))),
max(ceiling(max(rep1_transcripts$y)),
ceiling(max(rep1_clusters$y))))
# plot transcript detection coordinates
rep1_transcripts$feature_name == "EPCAM"
selected_genes <- as.data.frame(rep1_transcripts[selected_genes,
loc_mt <-c("x","y","feature_name")]%>%distinct())
colnames(loc_mt)<-c("x","y","feature_name")
layout(matrix(c(1, 2, 3), 1, 3, byrow = TRUE))
par(mar=c(5,3,6,3))
plot(loc_mt$x, loc_mt$y, main = "", xlab = "", ylab = "",
pch = 20, col = "maroon4", cex = 0.1,xaxt='n', yaxt='n')
title(main = "EPCAM transcript detection", line = 3)
box()
# plot square binning
"feature_name"]=="EPCAM",c("x","y")] %>% distinct()
curr<-loc_mt[loc_mt[, ppp(curr$x,curr$y,w_x, w_y)
curr_ppp <- quadratcount(curr_ppp, 10,10)
vec_quadrat <- intensity(vec_quadrat, image=TRUE)
vec_its <-par(mar=c(0.01,1, 1, 2))
plot(vec_its, main = "")
title(main = "square binning", line = -2)
# plot hex binning
owin(xrange=w_x, yrange=w_y)
w <- hextess(W=w, 20)
H <- length(H$tiles)
bin_length <-"feature_name"]=="EPCAM",c("x","y")] %>% distinct()
curr<-loc_mt[loc_mt[, ppp(curr$x,curr$y,w_x, w_y)
curr_ppp <- quadratcount(curr_ppp, tess=H)
vec_quadrat <- intensity(vec_quadrat, image=TRUE)
vec_its <-par(mar=c(0.1,1, 1, 2))
plot(vec_its, main = "")
title(main = "hex binning", line = -2)
c(min(floor(min(rep1_transcripts$x)),
w_x <-floor(min(rep1_clusters$x))),
max(ceiling(max(rep1_transcripts$x)),
ceiling(max(rep1_clusters$x))))
c(min(floor(min(rep1_transcripts$y)),
w_y <-floor(min(rep1_clusters$y))),
max(ceiling(max(rep1_transcripts$y)),
ceiling(max(rep1_clusters$y))))
# plot cell coordinates
as.data.frame(rep1_clusters[rep1_clusters$cluster=="c1",
loc_mt <-c("x","y","cluster")])
colnames(loc_mt)=c("x","y","cluster")
layout(matrix(c(1, 2, 3), 1, 3, byrow = TRUE))
par(mar=c(5,3,6,3))
plot(loc_mt$x, loc_mt$y, main = "", xlab = "", ylab = "",
pch = 20, col = "maroon4", cex = 0.1,xaxt='n', yaxt='n')
title(main = "cell coordinates in cluster c1", line = 3)
box()
# plot square binning
"cluster"]=="c1", c("x","y")]%>%distinct()
curr<-loc_mt[loc_mt[, ppp(curr$x,curr$y,w_x, w_y)
curr_ppp <- quadratcount(curr_ppp, 10,10)
vec_quadrat <- intensity(vec_quadrat, image=TRUE)
vec_its <-par(mar=c(0.1,1, 1, 2))
plot(vec_its, main = "")
title(main = "square binning", line = -2)
# plot hex binning
owin(xrange=w_x, yrange=w_y)
w <- hextess(W=w, 20)
H <- length(H$tiles)
bin_length <-"cluster"]=="c1",c("x","y")] %>%distinct()
curr<-loc_mt[loc_mt[, ppp(curr$x,curr$y,w_x, w_y)
curr_ppp <- quadratcount(curr_ppp, tess=H)
vec_quadrat <- intensity(vec_quadrat, image=TRUE)
vec_its <-par(mar=c(0.1,1, 1, 2))
plot(vec_its, main = "")
title(main = "hex binning", line = -2)
The function get_vectors()
can be used to create spatial vectors for all the genes and clusters. These spatial vectors may take the form of squares, rectangles, or hexagons specified by the bin_type
parameter.
589
seed_number<-
c(min(floor(min(rep1_transcripts$x)),
w_x <-floor(min(rep1_clusters$x))),
max(ceiling(max(rep1_transcripts$x)),
ceiling(max(rep1_clusters$x))))
c(min(floor(min(rep1_transcripts$y)),
w_y <-floor(min(rep1_clusters$y))),
max(ceiling(max(rep1_transcripts$y)),
ceiling(max(rep1_clusters$y))))
10
grid_length <-# get spatial vectors
get_vectors(x= rep1_sub,
rep1_sq10_vectors <-sample_names = "rep1",
cluster_info = rep1_clusters,
bin_type="square",
bin_param=c(grid_length,grid_length),
test_genes = all_real_genes ,
w_x=w_x, w_y=w_y)
The constructed spatial vectors can be used to quantify cluster-cluster and gene-gene correlation.
paste("c", 1:8, sep="")
exp_ord <-$cluster_mt <- rep1_sq10_vectors$cluster_mt[,exp_ord]
rep1_sq10_vectors cor(rep1_sq10_vectors$cluster_mt,
cor_cluster_mt <-$cluster_mt, method = "pearson")
rep1_sq10_vectors# Calculate pairwise correlations
cor(rep1_sq10_vectors$gene_mt, rep1_sq10_vectors$gene_mt,
cor_gene_mt <-method = "pearson")
grDevices::colorRampPalette(c("#4477AA", "#77AADD",
col <-"#FFFFFF","#EE9988", "#BB4444"))
::corrplot(cor_cluster_mt, method="color", col=col(200), diag=TRUE,
corrplotaddCoef.col = "black",type="upper",
tl.col="black", tl.srt=45, mar=c(0,0,5,0),sig.level = 0.05,
insig = "blank",
title = "cluster-cluster correlation (square bin = 40x40)"
)
cor(x=rep1_sq10_vectors$gene_mt,
cor_genecluster_mt <-y=rep1_sq10_vectors$cluster_mt, method = "pearson")
as.data.frame(cbind(apply(cor_gene_mt, MARGIN=1,
gg_correlation <-FUN = mean, na.rm=TRUE),
apply(cor_genecluster_mt, MARGIN=1,
FUN = mean, na.rm=TRUE)))
colnames(gg_correlation) <- c("mean_correlation","mean_cluster")
$gene<-row.names(gg_correlation)
gg_correlation
plot(ggplot(data = gg_correlation,
aes(x= mean_correlation, y=mean_cluster))+
geom_point()+
geom_text_repel(aes(label=gg_correlation$gene), size=1.8, hjust=1)+
theme_bw()+
theme(legend.title=element_blank(),
axis.text.y = element_text(size=20),
axis.text.x = element_text(size=20),
axis.title.x=element_text(size=20),
axis.title.y=element_text(size=20),
panel.spacing = grid::unit(0.5, "lines"),
legend.position="none",
legend.text=element_blank())+
xlab("Average gene-gene correlation")+
ylab("Average gene-cluster correlation"))
We can also construct a gene network based on the spatial vector for the genes.
igraph::graph_from_adjacency_matrix(cor_gene_mt,
vector_graph<-mode = "undirected",
weighted = TRUE,
diag = FALSE)
::simplify(igraph::delete_edges(vector_graph,
vector_graph<-igraphE(vector_graph)[abs(E(vector_graph)$weight) <= 0.7]))
::layout_with_kk(vector_graph)
layout<-igraph
# Plot the graph
::ggraph(vector_graph, layout = layout) +
ggraph geom_edge_link(aes(edge_alpha = weight), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name),
vjust = 1, hjust = 1,size=2,color="orange", repel = TRUE)
The main function lasso_markers()
requires spatial vectors for each cluster and gene. These vectors can be conveniently generated using the get_vectors()
function. Currently, the get_vectors() function supports inputs of type list
, SingleCellExperiment
, SpatialExperiment
, or SpatialFeatureExperiment
. The following sections will illustrate how to create spatial vectors for genes and clusters given different data objects you have.
A named list can be effectively utilized to store the transcript detection data for each sample. Specifically, we will create a named list where each name corresponds to a different sample within your data. Each element of this list is a dataframe containing columns: “feature_name” (gene name), “x” (x-coordinate), and “y” (y-coordinate). It is crucial that the list names match the sample identifiers in the cluster_info. This approach is highly recommended when difficulties arise in defining a SpatialExperiment object for your data.
589
seed_number<- c(min(floor(min(rep1_transcripts$x)),
w_x <-floor(min(rep1_clusters$x))),
max(ceiling(max(rep1_transcripts$x)),
ceiling(max(rep1_clusters$x))))
c(min(floor(min(rep1_transcripts$y)),
w_y <-floor(min(rep1_clusters$y))),
max(ceiling(max(rep1_transcripts$y)),
ceiling(max(rep1_clusters$y))))
10
grid_length <-# get spatial vectors
get_vectors(x= list("rep1" = rep1_transcripts),
rep1_sq10_vectors_lst <-sample_names = "rep1",
cluster_info = rep1_clusters,
bin_type="square",
bin_param=c(grid_length,grid_length),
test_genes = all_real_genes ,
w_x=w_x, w_y=w_y)
# the created spatial vectors will be the same as from other input structure
table(rep1_sq10_vectors$gene_mt[,"DST"]==rep1_sq10_vectors_lst$gene_mt[,"DST"])
##
## TRUE
## 100
# spatial vector for every cluster
head(rep1_sq10_vectors_lst$cluster_mt)
## c5 c8 c3 c2 c1 c4 c6 c7
## [1,] 0 0 0 0 4 1 0 1
## [2,] 0 1 2 0 7 0 0 0
## [3,] 0 0 1 0 10 0 0 0
## [4,] 0 0 6 0 5 2 1 1
## [5,] 0 0 1 0 0 8 7 2
## [6,] 0 0 4 0 0 3 3 2
# spatial vector for every gene
head(rep1_sq10_vectors_lst$gene_mt)
## CD68 DST EPCAM ERBB2 FOXA1 KRT7 LUM MYLK PECAM1 PTPRC SERPINA3 AQP1 VWF
## [1,] 9 9 34 76 34 68 13 2 7 3 6 5 2
## [2,] 4 21 101 170 72 107 34 7 10 2 2 10 10
## [3,] 6 23 146 249 113 147 25 2 0 0 3 3 5
## [4,] 14 6 50 153 71 103 81 0 3 16 0 1 1
## [5,] 28 7 10 45 16 21 20 0 8 52 2 1 1
## [6,] 21 3 7 46 24 5 39 3 12 21 13 4 0
## CCDC80 ITM2C POSTN FGL2 MZB1 IL7R LYZ
## [1,] 13 20 30 4 4 1 12
## [2,] 21 4 38 2 2 7 6
## [3,] 12 4 15 0 1 1 3
## [4,] 23 17 67 5 7 12 9
## [5,] 6 26 18 41 11 33 57
## [6,] 36 21 52 27 13 34 30
If the transcript coordinates are available, you can use either transcript coordinates or the count matrix to define spatial vectors for genes. The defined example_vectors_cm/example_vectors_tr can be passed to lasso_markers
to identify marker genes.
library(SpatialFeatureExperiment)
library(SingleCellExperiment)
library(TENxXeniumData)
library(ExperimentHub)
library(scran)
library(scater)
ExperimentHub()
eh <- query(eh, "TENxXenium")
q <- q[["EH8547"]]
spe_example <-colData(spe_example)$cell_id <- colnames(spe_example)
set.seed(123)
# use a subset data
sample(colnames(spe_example), size = 100, replace = FALSE)
subset_cells <- spe_example[, colnames(spe_example) %in% subset_cells]
spe_sub <-# -----------------------------------------------------------------------------
# calculate logcounts and store in object
logNormCounts(spe_sub)
spe_sub <-rm(spe_example)
# compute PCA
set.seed(123)
runPCA(spe_sub, subset_row = row.names(spe_sub))
spe_sub <-
# example Non-spatial clustering (for illustration purpose only)
set.seed(123)
10
k <- buildSNNGraph(spe_sub, k = k, use.dimred = "PCA")
g <- igraph::cluster_walktrap(g)
g_walk <- paste("c",g_walk$membership,sep="")
clusters <- as.data.frame(cbind(cluster = clusters,
scran_clusters <-cell_id = colnames(spe_sub)))
# -----------------------------------------------------------------------------
# combine cluster labels and the coordinates
# make sure the cluster information contains column names:
# cluster, x, y, sample and cell_id
as.data.frame(spatialCoords(spe_sub))
clusters_info <-colnames(clusters_info) <- c("x","y")
$cell_id <- row.names(clusters_info)
clusters_info$sample <- spe_sub$sample_id
clusters_info merge(clusters_info, scran_clusters, by="cell_id")
clusters_info <- c(floor(min(clusters_info$x)), ceiling(max(clusters_info$x)))
w_x <- c(floor(min(clusters_info$y)), ceiling(max(clusters_info$y)))
w_y <-# -----------------------------------------------------------------------------
# build spatial vectors from count matrix and cluster coordinates
get_vectors(x= spe_sub,
example_vectors_cm <-sample_names = "sample01",
cluster_info = clusters_info,
bin_type="square",
bin_param=c(5,5),
test_genes = row.names(spe_sub)[1:5],
use_cm=TRUE,
w_x=w_x, w_y=w_y)
# spatial vector for every cluster
head(example_vectors_cm$cluster_mt)
## c1 c2 c3
## [1,] 2 3 0
## [2,] 2 2 0
## [3,] 2 1 0
## [4,] 2 0 0
## [5,] 6 6 0
## [6,] 1 3 0
# spatial vector for every gene
head(example_vectors_cm$gene_mt)
## 2010300C02Rik Acsbg1 Acta2 Acvrl1 Adamts2
## [1,] 21 20 3 2 0
## [2,] 9 2 0 0 0
## [3,] 1 7 0 1 0
## [4,] 24 17 0 5 0
## [5,] 9 18 1 4 0
## [6,] 12 20 0 1 0
# -----------------------------------------------------------------------------
# build spatial vectors from transcript coordinates and cluster coordinates
get_vectors(x= spe_sub,
example_vectors_tr <-sample_names = "sample01",
cluster_info = clusters_info,
bin_type="square",
bin_param=c(5,5),
test_genes = row.names(spe_sub)[1:5],
use_cm=FALSE,
w_x=w_x, w_y=w_y)
# spatial vector for every cluster
# example_vectors_tr$cluster_mt
# spatial vector for every gene
# example_vectors_tr$gene_mt
SpatialFeatureExperiment(
sfe_example <-list(counts = spe_sub@assays@data$counts),
colData = spe_sub@colData,
spatialCoords = spatialCoords(spe_sub),
spatialCoordsNames = c("x_centroid", "y_centroid"))
# -----------------------------------------------------------------------------
# build spatial vectors from count matrix and cluster coordinates
# make sure the cluster information contains column names:
# cluster, x, y, sample and cell_id
get_vectors(x= sfe_example,
example_vectors_cm <-sample_names = "sample01",
cluster_info = clusters_info,
bin_type="square",
bin_param=c(5,5),
test_genes = row.names(spe_sub)[1:3],
w_x=w_x, w_y=w_y,
use_cm = TRUE)
# spatial vector for every cluster
head(example_vectors_cm$cluster_mt)
# spatial vector for every gene
head(example_vectors_cm$gene_mt)
# -----------------------------------------------------------------------------
# build spatial vectors from transcript coordinates and cluster coordinates
get_vectors(x= sfe_example,
example_vectors_tr <-sample_names = "sample01",
cluster_info = clusters_info,
bin_type="square",
bin_param=c(5,5),
test_genes = row.names(spe_sub)[1:3],
w_x=w_x, w_y=w_y,
use_cm = FALSE)
# spatial vector for every cluster
# example_vectors_tr$cluster_mt
# spatial vector for every gene
# example_vectors_tr$gene_mt
If the input is a SingleCellExperiment object, the spatial vectors for the genes can only be computed using the count matrix and the cell coordinates The defined example_vectors_cm can be passed to lasso_markers
to identify marker genes. \ If transcript coordinate information is available, you can alternatively create a SpatialExperiment object and compute the spatial vectors using the transcript coordinates.
SingleCellExperiment(list(sample01 = cm))
sce_example <-# -----------------------------------------------------------------------------
# build spatial vectors from count matrix and cluster coordinates
# make sure the cluster information contains column names:
# cluster, x, y, sample and cell_id
get_vectors(x= sce_example,
example_vectors_cm <-sample_names = "sample01",
cluster_info = clusters_info,
bin_type="square",
bin_param=c(5,5),
test_genes = row.names(spe_sub)[1:3],
w_x=w_x, w_y=w_y,
use_cm=TRUE)
# spatial vector for every cluster
head(example_vectors_cm$cluster_mt)
# spatial vector for every gene
head(example_vectors_cm$gene_mt)
# -----------------------------------------------------------------------------
# If the transcript coordinate information is available
# make sure the transcript information contains column names:
# feature_name, x, y
SpatialExperiment(
spe <-assays = list(molecules = molecules(sfe_example)),
sample_id ="sample01")
# build spatial vectors from transcript coordinates and cluster coordinates
get_vectors(x= spe, sample_names = "sample01",
example_vectors_tr <-cluster_info = clusters_info,
bin_type="square",
bin_param=c(5,5),
test_genes = row.names(spe_sub)[1:3],
w_x=w_x, w_y=w_y)
# spatial vector for every cluster
# example_vectors_tr$cluster_mt
# spatial vector for every gene
# example_vectors_tr$gene_mt
If you have a Seurat object, the spatial vectors for the genes can only be computed using the count matrix and the cell coordinates The defined example_vectors_cm can be passed to lasso_markers
to identify marker genes. \ If transcript coordinate information is available, you can alternatively create a SpatialExperiment object and compute the spatial vectors using the transcript coordinates.
library(Seurat)
# suppose cm is the count matrix
Seurat::CreateSeuratObject(counts = cm)
seu_obj <- SingleCellExperiment(list(sample01 =seu_obj@assays$RNA$counts ))
sce <-# we will use the previously defined cluster_info for illustration here
# make sure the clusters information contains column names:
# cluster, x, y and sample
clusters_info
clusters_info =
c(floor(min(clusters_info$x)), ceiling(max(clusters_info$x)))
w_x <- c(floor(min(clusters_info$y)), ceiling(max(clusters_info$y)))
w_y <-# -----------------------------------------------------------------------------
# build spatial vectors from count matrix and cluster coordinates
# make sure the cluster information contains column names:
# cluster, x, y, sample and cell_id
get_vectors(x= sce, sample_names = "sample01",
example_vectors_cm <-cluster_info = clusters_info,
bin_type="square",
bin_param=c(10,10),
test_genes = test_genes,
w_x=w_x, w_y=w_y,
use_cm=TRUE)
# spatial vector for every cluster
$cluster_mt
example_vectors_cm
# spatial vector for every gene
$gene_mt
example_vectors_cm
# -----------------------------------------------------------------------------
# If the transcript coordinate information is available
# make sure the transcript information contains column names:
# feature_name, x, y
SpatialExperiment(
spe <-assays = list(
molecules = molecules(sfe_example)),
sample_id ="sample01" )
# build spatial vectors from transcript coordinates and cluster coordinates
get_vectors(x= spe, sample_names = "sample01",
example_vectors_tr <-cluster_info = clusters_info,
bin_type="square",
bin_param=c(10,10),
test_genes = test_genes,
w_x=w_x, w_y=w_y)
# spatial vector for every cluster
$cluster_mt
example_vectors_tr
# spatial vector for every gene
$gene_mt example_vectors_tr
We assume that the relationship between a marker gene vector its cluster spatial vector is linear.
Here are several genes and their annotation from the panel.
Gene | Annotation |
ERBB2 | Breast cancer cells |
IL7R | T cells |
MZB1 | B cells |
AQP1 | Endothelial |
LUM | Fibroblasts |
c("ERBB2","AQP1","LUM","IL7R","MZB1")
genes_lst <-
for (i_cluster in c("c1","c8","c3","c6","c7")){
$cluster_mt[,i_cluster]
cluster_vector<-rep1_sq10_vectors
as.data.frame(cbind("cluster", cluster_vector,
data_vis<-$gene_mt[, genes_lst]))
rep1_sq10_vectors
colnames(data_vis)<-c("cluster","cluster_vector",genes_lst)
::melt(data_vis,variable.name = "genes",
data_vis<-reshape2value.name = "gene_vector",
id= c("cluster","cluster_vector" ))
$cluster_vector<-as.numeric(data_vis$cluster_vector)
data_vis$genes<-factor(data_vis$genes)
data_vis$gene_vector<-as.numeric(data_vis$gene_vector)
data_vis
plot(ggplot(data = data_vis,
aes(x= cluster_vector, y=gene_vector))+
geom_point(size=0.1)+
facet_wrap(~genes,scales = "free_y", ncol=10)+
theme_bw()+
theme(legend.title=element_blank(),
axis.text.y = element_text(size=6),
axis.text.x = element_text(size=6,angle=0),
axis.title.x=element_text(size=10),
axis.title.y=element_text(size=10),
panel.spacing = grid::unit(0.5, "lines"),
legend.position="none",
legend.text=element_blank(),
strip.text = element_text(size = rel(1)))+
xlab(paste(i_cluster," - cluster vector", sep=""))+
ylab("gene vector"))
}
A straightforward approach to identifying genes that exhibit a linear correlation with cluster vectors involves computing the Pearson correlation for each gene with every cluster. To assess the statistical significance of these correlations, the compute_permp()
function can be used to perform permutation testing, generating a p-value for every pair of gene cluster and cluster vector.
c(min(floor(min(rep1_transcripts$x)),
w_x <-floor(min(rep1_clusters$x))),
max(ceiling(max(rep1_transcripts$x)),
ceiling(max(rep1_clusters$x))))
c(min(floor(min(rep1_transcripts$y)),
w_y <-floor(min(rep1_clusters$y))),
max(ceiling(max(rep1_transcripts$y)),
ceiling(max(rep1_clusters$y))))
set.seed(seed_number)
compute_permp(x=rep1_sub,
perm_p <-cluster_info=rep1_clusters,
perm.size=1000,
bin_type="square",
bin_param=c(10,10),
test_genes= all_real_genes,
correlation_method = "pearson",
n_cores=1,
correction_method="BH",
w_x=w_x ,
w_y=w_y)
# observed correlation for every pair of gene and cluster vector
get_cor(perm_p)
obs_corr <-head(obs_corr)
## c5 c8 c3 c2 c1 c4
## CD68 -0.26037887 0.06259766 0.2360819 -0.4126090 -0.02885842 0.5951329
## DST 0.47226365 -0.13560127 -0.4095928 0.4282446 0.51097133 -0.3263375
## EPCAM -0.11473423 -0.35287011 -0.5724549 0.3454557 0.86438353 -0.3496045
## ERBB2 0.08159437 -0.34437593 -0.5542681 0.6259111 0.69027586 -0.3613350
## FOXA1 -0.12850033 -0.33199825 -0.5574530 0.2897598 0.90412737 -0.3504568
## KRT7 -0.15416917 -0.29951312 -0.5314679 0.1429028 0.93356172 -0.2936468
## c6 c7
## CD68 0.2343318 0.1504930
## DST -0.3835009 -0.4094809
## EPCAM -0.4591613 -0.4444322
## ERBB2 -0.4587071 -0.4455121
## FOXA1 -0.4489792 -0.4284400
## KRT7 -0.4350973 -0.4261451
# permutation adjusted p-value for every pair of gene and cluster vector
get_perm_adjp(perm_p)
perm_res <-head(perm_res)
## c5 c8 c3 c2 c1 c4
## CD68 1.00000000 0.7142857 0.1958042 1.000000000 1.000000000 0.003330003
## DST 0.01332001 1.0000000 1.0000000 0.006660007 0.003996004 1.000000000
## EPCAM 1.00000000 1.0000000 1.0000000 0.115884116 0.003996004 1.000000000
## ERBB2 1.00000000 1.0000000 1.0000000 0.006660007 0.003996004 1.000000000
## FOXA1 1.00000000 1.0000000 1.0000000 0.406260406 0.003996004 1.000000000
## KRT7 1.00000000 1.0000000 1.0000000 1.000000000 0.003996004 1.000000000
## c6 c7
## CD68 0.06608776 0.2942512
## DST 1.00000000 1.0000000
## EPCAM 1.00000000 1.0000000
## ERBB2 1.00000000 1.0000000
## FOXA1 1.00000000 1.0000000
## KRT7 1.00000000 1.0000000
Genes with a significant adjusted p-value are considered as marker genes for the corresponding cluster. We can rank the marker genes by the observed correlationand plot the transcript detection coordinates for the top three marker genes for every cluster.
1000<-as.data.frame(perm_p$perm.pval.adj)
res_df_1000$gene<-row.names(res_df_1000)
res_df_ unique(as.character(rep1_clusters$cluster))
cluster_names <-for (cl in cluster_names){
res_df_1000[res_df_1000[,cl]<0.05,]
perm_sig <-# define a cutoff value based on 75% quantile
quantile(obs_corr[, cl], 0.75)
obs_cutoff <-intersect(row.names(perm_res[perm_res[,cl]<0.05,]),
perm_cl<-row.names(obs_corr[obs_corr[, cl]>obs_cutoff,]))
inters<-perm_clsignif(as.numeric(obs_corr[inters,cl]), digits = 3)
rounded_val<- as.data.frame(cbind(gene=inters, value=rounded_val))
inters_df<-$value<- as.numeric(inters_df$value)
inters_dforder(inters_df$value, decreasing = TRUE),]
inters_df<-inters_df[1:min(nrow(inters_df),2),]
inters_df<-inters_df[$text<- paste(inters_df$gene,inters_df$value,sep=": ")
inters_df rep1_transcripts$feature_name %in% inters_df$gene
curr_genes <- rep1_transcripts[curr_genes, c("x","y","feature_name")]
data_vis <-$text <- inters_df[match(data_vis$feature_name,inters_df$gene),
data_vis"text"]
$text <- factor(data_vis$text, levels=inters_df$text)
data_visggplot(data = data_vis,
p1<-aes(x = x, y = y))+
geom_point(size=0.01,color="maroon4")+
facet_wrap(~text,ncol=10, scales="free")+
scale_y_reverse()+
guides(fill = guide_colorbar(height= grid::unit(5, "cm")))+
defined_theme
ggplot(data = rep1_clusters[rep1_clusters$cluster==cl, ],
cl_pt<-aes(x = x, y = y, color=cluster))+
geom_point(position=position_jitterdodge(jitter.width=0,
jitter.height=0), size=0.2)+
facet_wrap(~cluster)+
scale_y_reverse()+
theme_classic()+
scale_color_manual(values = "black")+
defined_theme
cl_pt | p1
lyt <- lyt + patchwork::plot_layout(widths = c(1,3))
layout_design <-print(layout_design)
}
To check the linear relationship between the cluster vector and the marker gene vectors, we can plot the cluster vector on x-axis, and the marker gene vector on y-axis. The figure below shows the relationship between the cluster vector and the top marker gene vectors detected by correlation approach.
paste("c", 1:8, sep="")
cluster_names <-list()
plot_lst<-for (cl in cluster_names){
res_df_1000[res_df_1000[,cl]<0.05,]
perm_sig<- all_celltypes[all_celltypes$cluster==cl,"anno"]
curr_cell_type <- quantile(obs_corr[, cl], 0.75)
obs_cutoff <-intersect(row.names(perm_res[perm_res[,cl]<0.05,]),
perm_cl<-row.names(obs_corr[obs_corr[, cl]>obs_cutoff,]))
inters<-perm_clsignif(as.numeric(obs_corr[inters,cl]), digits = 3)
rounded_val<- as.data.frame(cbind(gene=inters, value=rounded_val))
inters_df <-$value<- as.numeric(inters_df$value)
inters_dforder(inters_df$value, decreasing = TRUE),]
inters_df<-inters_df[$text<- paste(inters_df$gene,inters_df$value,sep=": ")
inters_df inters_df[1:min(2, nrow(inters_df)),"gene"]
mk_gene<-if (length(mk_gene > 0)){
as.data.frame(cbind(rep1_sq10_vectors$cluster_mt[,cl],
dff <-$gene_mt[,mk_gene]))
rep1_sq10_vectorscolnames(dff) <- c("cluster", mk_gene)
$vector_id <- c(1:(grid_length * grid_length))
dff dff %>%
long_df <- pivot_longer(cols = -c(cluster, vector_id), names_to = "gene",
values_to = "vector_count")
$gene <- factor(long_df$gene, levels=mk_gene)
long_df
ggplot(long_df, aes(x = cluster, y = vector_count )) +
p<- geom_point( size=0.01) +
facet_wrap(~gene, scales = "free_y", nrow=1) +
labs(x = paste("cluster vector ", curr_cell_type, sep=""),
y = "marker gene vectors") +
theme_minimal()+
guides(color=guide_legend(nrow = 1,
override.aes=list(alpha=1, size=2)))+
theme(panel.grid = element_blank(),legend.position = "none",
strip.text = element_text(size = rel(1)),
axis.line=element_blank(),
legend.title = element_blank(),
legend.key.size = grid::unit(0.5, "cm"),
legend.text = element_text(size=10),
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_text(size = 10),
panel.border =element_rect(colour = "black",
fill=NA, linewidth=0.5)
) p
plot_lst[[cl]] =
}
} ggarrange(plotlist = plot_lst,
combined_plot <-ncol = 2, nrow = 4,
common.legend = FALSE, legend = "none")
combined_plot
The other method to identify linearly correlated genes for each cluster is to construct a linear model for each gene. We can use the
lasso_markers
function to get the most relevant cluster label for every gene.
We can create spatial vectors for negative control genes and include them as background noise “clusters”.
unique(rep1_nc_data[rep1_nc_data$category=="probe","feature_name"])
probe_nm <- unique(rep1_nc_data[rep1_nc_data$category=="codeword",
codeword_nm <-"feature_name"])
create_genesets(x=rep1_neg,sample_names="rep1",
rep1_nc_vectors <-name_lst=list(probe=probe_nm,
codeword=codeword_nm),
bin_type="square",
bin_param=c(10, 10),
w_x=w_x, w_y=w_y,
cluster_info = NULL)
set.seed(seed_number)
lasso_markers(gene_mt=rep1_sq10_vectors$gene_mt,
rep1_lasso_with_nc <-cluster_mt = rep1_sq10_vectors$cluster_mt,
sample_names=c("rep1"),
keep_positive=TRUE,
background=rep1_nc_vectors)
get_top_mg(rep1_lasso_with_nc, coef_cutoff=0.2)
rep1_top_df_nc <-# the top result table
head(rep1_top_df_nc)
## gene top_cluster glm_coef pearson max_gg_corr max_gc_corr
## CD68 CD68 c4 4.468619 0.5951329 0.5143408 0.5951329
## DST DST c1 2.035571 0.5109713 0.6765032 0.5109713
## EPCAM EPCAM c1 10.590911 0.8643835 0.9547929 0.8643835
## ERBB2 ERBB2 c1 14.508203 0.6902759 0.8864028 0.6902759
## FOXA1 FOXA1 c1 10.148961 0.9041274 0.9547929 0.9041274
## KRT7 KRT7 c1 14.177664 0.9335617 0.9540827 0.9335617
# the full result table
get_full_mg(rep1_lasso_with_nc)
rep1_full_df <-head(rep1_full_df)
## gene cluster glm_coef p_value pearson max_gg_corr max_gc_corr
## 31 AQP1 c8 6.998951 3.650999e-25 0.8270796 0.8442294 0.8270796
## 32 AQP1 c6 1.702569 9.317924e-05 0.3209543 0.8442294 0.8270796
## 33 AQP1 c3 1.013530 2.956703e-03 0.2099349 0.8442294 0.8270796
## 36 CCDC80 c3 3.872311 3.842116e-20 0.7163637 0.8086714 0.7163637
## 37 CCDC80 c7 1.512364 4.072797e-02 0.2349379 0.8086714 0.7163637
## 1 CD68 c4 4.468619 2.344267e-11 0.5951329 0.5143408 0.5951329
We can rank the marker genes by its linear model coefficient to the cluster ans plot the transcript detection coordinates for the top three marker genes for every cluster.
paste("c", 1:8, sep="")
cluster_names <-for (cl in setdiff(cluster_names,"NoSig")){
$top_cluster==cl,"gene"]
inters<-rep1_top_df_nc[rep1_top_df_ncsignif(as.numeric(rep1_top_df_nc[inters,"glm_coef"]),
rounded_val<-digits = 3)
as.data.frame(cbind(gene=inters, value=rounded_val))
inters_df <-$value <- as.numeric(inters_df$value)
inters_dforder(inters_df$value, decreasing = TRUE),]
inters_df<-inters_df[$text<- paste(inters_df$gene,inters_df$value,sep=": ")
inters_df
if (length(inters > 0)){
1:min(2, nrow(inters_df)),]
inters_df<-inters_df[$gene
inters <-inters_df$feature_name %in% inters
iters_rep1<-rep1_transcripts
vis_r1<-rep1_transcripts[iters_rep1,c("x","y","feature_name")]
$value<-inters_df[match(vis_r1$feature_name,inters_df$gene),
vis_r1"value"]
#vis_r1=vis_r1[order(vis_r1$value,decreasing = TRUE),]
$text_label<- paste(vis_r1$feature_name,
vis_r1$value,sep=": ")
vis_r1$text_label<-factor(vis_r1$text_label, levels = inters_df$text)
vis_r1$sample<-"rep1"
vis_r1 ggplot(data = vis_r1,
p1<-aes(x = x, y = y))+
geom_point(size=0.01,color="maroon4")+
facet_wrap(~text_label,ncol=10, scales="free")+
scale_y_reverse()+
guides(fill = guide_colorbar(height= grid::unit(5, "cm")))+
defined_theme
ggplot(data = rep1_clusters[rep1_clusters$cluster==cl, ],
cl_pt<-aes(x = x, y = y, color=cluster))+
geom_point(position=position_jitterdodge(jitter.width=0,
jitter.height=0), size=0.2)+
facet_wrap(~cluster)+
scale_y_reverse()+
theme_classic()+
scale_color_manual(values = "black")+
defined_theme
cl_pt | p1
lyt <- lyt + patchwork::plot_layout(widths = c(1,3))
layout_design <-print(layout_design)
}}
We can plot the cluster vector on x-axis, and the marker gene vectors (detected by the linear modelling approach) on y-axis to validate the linear relationship assumption between the cluster vector and the marker gene vectors.
paste("c", 1:8, sep="")
cluster_names <-list()
plot_lst=for (cl in cluster_names){
$cluster==cl,"anno"]
curr_cell_type<-all_celltypes[all_celltypes$top_cluster==cl,"gene"]
inters<-rep1_top_df_nc[rep1_top_df_ncif (length(inters > 0)){
signif(as.numeric(rep1_top_df_nc[inters,"glm_coef"]),
rounded_val<-digits = 3)
as.data.frame(cbind(gene=inters, value=rounded_val))
inters_df<-$value<-as.numeric(inters_df$value)
inters_dforder(inters_df$value, decreasing = TRUE),]
inters_df<-inters_df[$text<-paste(inters_df$gene,inters_df$value,sep=": ")
inters_df
1:min(2, nrow(inters_df)),]
inters_df<-inters_df[$gene
mk_gene<-inters_df
as.data.frame(cbind(rep1_sq10_vectors$cluster_mt[,cl],
dff<-$gene_mt[,mk_gene]))
rep1_sq10_vectorscolnames(dff)<-c("cluster", mk_gene)
$vector_id<-c(1:(grid_length * grid_length))
dff dff %>%
long_df <- pivot_longer(cols = -c(cluster, vector_id), names_to = "gene",
values_to = "vector_count")
$gene<-factor(long_df$gene, levels=mk_gene)
long_dfggplot(long_df, aes(x = cluster, y = vector_count )) +
p<- geom_point( size=0.01) +
facet_wrap(~gene, scales = "free_y", nrow=1) +
labs(x = paste("cluster vector ", curr_cell_type, sep=""),
y = "marker gene vectors") +
theme_minimal()+
guides(color=guide_legend(nrow = 1,
override.aes=list(alpha=1, size=2)))+
theme(panel.grid = element_blank(),legend.position = "none",
strip.text = element_text(size = rel(1)),
axis.line=element_blank(),
legend.title = element_blank(),
legend.key.size = grid::unit(0.5, "cm"),
legend.text = element_text(size=10),
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_text(size = 10),
panel.border =element_rect(colour = "black",
fill=NA, linewidth=0.5)
) p
plot_lst[[cl]] =
}
} ggarrange(plotlist = plot_lst,
combined_plot <-ncol = 2, nrow = 4,
common.legend = FALSE, legend = "none")
combined_plot
Load the replicate 2 from sample 1.
data(rep2_sub, rep2_clusters, rep2_neg)
$cluster<-factor(rep2_clusters$cluster,
rep2_clusterslevels=paste("c",1:8, sep=""))
$cells<-paste(row.names(rep1_clusters),"_1",sep="")
rep1_clusters$cells<-paste(row.names(rep2_clusters),"_2",sep="")
rep2_clustersrbind(rep1_clusters,rep2_clusters)
rep_clusters<-$cluster<-factor(rep_clusters$cluster,
rep_clusterslevels=paste("c",1:8, sep=""))
table(rep_clusters$sample, rep_clusters$cluster)
##
## c1 c2 c3 c4 c5 c6 c7 c8
## rep1 745 205 221 127 87 176 45 99
## rep2 27 476 313 190 360 256 99 94
# record the transcript coordinates for rep2
::unsplitAsDataFrame(molecules(rep2_sub))
rep2_transcripts <-BumpyMatrix as.data.frame(rep2_transcripts)
rep2_transcripts <-colnames(rep2_transcripts) <- c("feature_name","cell_id","x","y")
# record the negative control transcript coordinates for rep2
::unsplitAsDataFrame(molecules(rep2_neg))
rep2_nc_data <-BumpyMatrix as.data.frame(rep2_nc_data)
rep2_nc_data <-colnames(rep2_nc_data) <- c("feature_name","cell_id","x","y","category")
We can plot the coordinates of cells for every cluster in every replicate
ggplot(data = rep_clusters,
aes(x = x, y = y, color=cluster))+
geom_point(position=position_jitterdodge(jitter.width=0,
jitter.height=0),size=0.1)+
facet_grid(sample~cluster)+
scale_y_reverse()+
theme_classic()+
scale_color_manual(values = c("#FC8D62","#66C2A5" ,"#8DA0CB","#E78AC3",
"#A6D854","skyblue","purple3","#E5C498"))+
guides(color=guide_legend(title="cluster", nrow = 1,
override.aes=list(alpha=1, size=7)))+
theme(
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
plot.background=element_blank(),
legend.text = element_text(size=10),
legend.position="none",
legend.title = element_text(size=10),
strip.text = element_text(size = rel(1)))+
xlab("")+
ylab("")
When we have multiple replicates in the dataset, we can find marker genes by providing additional sample information as the input for the function lasso_markers
.
c(min(floor(min(rep1_transcripts$x)),
w_x <-floor(min(rep2_transcripts$x)),
floor(min(rep_clusters$x))),
max(ceiling(max(rep1_transcripts$x)),
ceiling(max(rep2_transcripts$x)),
ceiling(max(rep_clusters$x))))
c(min(floor(min(rep1_transcripts$y)),
w_y <-floor(min(rep2_transcripts$y)),
floor(min(rep_clusters$y))),
max(ceiling(max(rep1_transcripts$y)),
ceiling(max(rep2_transcripts$y)),
ceiling(max(rep_clusters$y))))
10
grid_length<-cbind(rep1_sub, rep2_sub)
twosample_spe<-# get spatial vectors
get_vectors(x= twosample_spe,
two_rep_vectors<-sample_names=c("rep1","rep2"),
cluster_info = rep_clusters, bin_type="square",
bin_param=c(grid_length, grid_length),
test_genes = all_real_genes ,
w_x=w_x, w_y=w_y)
cbind(rep1_neg, rep2_neg)
twosample_neg_spe<-
unique(c(rep1_nc_data[rep1_nc_data$category=="probe",
probe_nm<-"feature_name"],
$category=="probe","feature_name"]))
rep2_nc_data[rep2_nc_dataunique(c(rep1_nc_data[rep1_nc_data$category=="codeword",
codeword_nm<-"feature_name"],
$category=="codeword","feature_name"]))
rep2_nc_data[rep2_nc_data
create_genesets(x=twosample_neg_spe,
two_rep_nc_vectors<-sample_names=c("rep1","rep2"),
name_lst=list(probe=probe_nm,
codeword=codeword_nm),
bin_type="square",
bin_param=c(10,10),
w_x=w_x, w_y=w_y,
cluster_info = NULL)
set.seed(seed_number)
lasso_markers(gene_mt=two_rep_vectors$gene_mt,
two_rep_lasso_with_nc<-cluster_mt = two_rep_vectors$cluster_mt,
sample_names=c("rep1","rep2"),
keep_positive=TRUE,
background=two_rep_nc_vectors,n_fold = 5)
get_top_mg(two_rep_lasso_with_nc, coef_cutoff=0.2)
tworep_res<-$celltype<-rep_clusters[match(tworep_res$top_cluster,
tworep_res$cluster),"anno"]
rep_clusterstable(tworep_res$top_cluster)
##
## c1 c2 c3 c4 c5 c6 c7 c8
## 3 2 3 3 2 2 2 3
head(tworep_res)
## gene top_cluster glm_coef pearson max_gg_corr max_gc_corr
## CD68 CD68 c4 2.822240 0.6902781 0.6540004 0.6902781
## DST DST c5 7.431511 0.8603213 0.8203238 0.8603213
## EPCAM EPCAM c1 9.306909 0.6247635 0.9398823 0.6247635
## ERBB2 ERBB2 c2 21.682425 0.7811584 0.9072989 0.7811584
## FOXA1 FOXA1 c1 8.666483 0.5577500 0.9398823 0.6031995
## KRT7 KRT7 c1 12.191969 0.6529267 0.9155480 0.6529267
## celltype
## CD68 Macrophages
## DST Myoepithelial
## EPCAM Tumor
## ERBB2 DCIS
## FOXA1 Tumor
## KRT7 Tumor
for (cl in all_celltypes$anno){
$celltype==cl,"gene"]
inters<-tworep_res[tworep_ressignif(as.numeric(tworep_res[inters,"glm_coef"]),
rounded_val<-digits = 3)
as.data.frame(cbind(gene=inters, value=rounded_val))
inters_df<-$value<-as.numeric(inters_df$value)
inters_dforder(inters_df$value, decreasing = TRUE),]
inters_df<-inters_df[$text<-paste(inters_df$gene,inters_df$value,sep=": ")
inters_dfif (length(inters > 0)){
1:min(2, nrow(inters_df)),]
inters_df<-inters_df[$gene
inters<-inters_df$feature_name %in% inters
iters_rep1<-rep1_transcripts
vis_r1<-rep1_transcripts[iters_rep1,c("x","y","feature_name")]
$value<-inters_df[match(vis_r1$feature_name,inters_df$gene),
vis_r1"value"]
order(vis_r1$value,decreasing = TRUE),]
vis_r1<-vis_r1[$text_label<- paste(vis_r1$feature_name,
vis_r1$value,sep=": ")
vis_r1$text_label<-factor(vis_r1$text_label)
vis_r1$sample<-"rep1"
vis_r1 rep2_transcripts$feature_name %in% inters
iters_rep2<-
vis_r2<-rep2_transcripts[iters_rep2,c("x","y","feature_name")]
$value<-inters_df[match(vis_r2$feature_name,inters_df$gene),
vis_r2"value"]
order(vis_r2$value, decreasing = TRUE),]
vis_r2<-vis_r2[$text_label<-paste(vis_r2$feature_name,
vis_r2$value,sep=": ")
vis_r2$text_label<-factor(vis_r2$text_label)
vis_r2$sample<-"rep2"
vis_r2 ggplot(data = vis_r1,
p1<-aes(x = x, y = y))+
geom_point(size=0.01,color="maroon4")+
facet_grid(sample~text_label, scales="free")+
scale_y_reverse()+
guides(fill = guide_colorbar(height= grid::unit(5, "cm")))+
defined_theme
ggplot(data = vis_r2,
p2<-aes(x = x, y = y))+
geom_point(size=0.01,color="maroon4")+
facet_grid(sample~text_label,scales="free")+
scale_y_reverse()+
guides(fill = guide_colorbar(height= grid::unit(5, "cm")))+
defined_theme
ggplot(data = rep_clusters[rep_clusters$anno==cl, ],
cl_pt<-aes(x = x, y = y, color=cluster))+
geom_point(position=position_jitterdodge(jitter.width=0,
jitter.height=0), size=0.2)+
facet_grid(sample~cluster)+
scale_y_reverse()+
theme_classic()+
scale_color_manual(values = "black")+
defined_theme
cl_pt | (p1 / p2)
lyt <-# if (cl %in% c("c2","c5","c6","c7")){
# lyt <- cl_pt | ((p1 / p2) | patchwork::plot_spacer())
# }
lyt + patchwork::plot_layout(widths = c(1,2))
layout_design <-
print(layout_design)
}}
The figure below shows the relationship between the cluster vector and the top marker gene vectors detected by linear modelling approach by accounting for multiple samples and background noise
paste("c", 1:8, sep="")
cluster_names<-list()
plot_lst<-for (cl in cluster_names){
$top_cluster==cl,"gene"]
inters<-tworep_res[tworep_res all_celltypes[all_celltypes$cluster==cl,"anno"]
curr_cell_type<-signif(as.numeric(tworep_res[inters,"glm_coef"]),
rounded_val<-digits = 3)
as.data.frame(cbind(gene=inters, value=rounded_val))
inters_df<-$value<-as.numeric(inters_df$value)
inters_dforder(inters_df$value, decreasing = TRUE),]
inters_df<-inters_df[$text<-paste(inters_df$gene,inters_df$value,sep=": ")
inters_df
1:min(2, nrow(inters_df)),"gene"]
mk_gene<-inters_df[if (length(inters > 0)){
as.data.frame(cbind(two_rep_vectors$cluster_mt[,cl],
dff<-$gene_mt[,mk_gene]))
two_rep_vectorscolnames(dff) <- c("cluster", mk_gene)
grid_length * grid_length
total_tiles <-$vector_id <- c(1:total_tiles)
dff$sample<- "Replicate1"
dff+1):(total_tiles*2),"sample"] <- "Replicate2"
dff[(total_tiles$vector_id <- c(1:total_tiles, 1:total_tiles)
dff dff %>%
long_df <- pivot_longer(cols = -c(cluster, sample, vector_id), names_to = "gene",
values_to = "vector_count")
$gene <- factor(long_df$gene, levels=mk_gene)
long_dfggplot(long_df,
p<-aes(x = cluster, y = vector_count, color =sample)) +
geom_point( size=0.01) +
facet_wrap(~gene, scales = "free_y", nrow=1) +
labs(x = paste("cluster vector ", curr_cell_type, sep=""),
y = "marker gene vectors") +
theme_minimal()+
guides(color=guide_legend(nrow = 1,
override.aes=list(alpha=1, size=2)))+
theme(panel.grid = element_blank(),legend.position = "bottom",
strip.text = element_text(size = rel(1)),
axis.line=element_blank(),
legend.title = element_blank(),
legend.key.size = grid::unit(0.5, "cm"),
legend.text = element_text(size=10),
axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_text(size = 10),
panel.border =element_rect(colour = "black",
fill=NA, linewidth=0.5)
) p
plot_lst[[cl]] <-
}
} ggarrange(plotlist = plot_lst,
combined_plot <-ncol = 2, nrow = 4,
common.legend = TRUE, legend = "top")
combined_plot
sessionInfo()
## R Under development (unstable) (2025-02-19 r87757)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scater_1.35.4 scran_1.35.0
## [3] scuttle_1.17.0 TENxXeniumData_1.3.0
## [5] ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [7] BiocFileCache_2.15.1 dbplyr_2.5.0
## [9] SpatialFeatureExperiment_1.9.9 ggpubr_0.6.0
## [11] tidyr_1.3.1 spatstat_3.3-1
## [13] spatstat.linnet_3.2-5 spatstat.model_3.3-4
## [15] rpart_4.1.24 spatstat.explore_3.3-4
## [17] nlme_3.1-167 spatstat.random_3.3-2
## [19] spatstat.geom_3.3-5 spatstat.univar_3.1-2
## [21] spatstat.data_3.1-4 gridExtra_2.3
## [23] ggrepel_0.9.6 ggraph_2.2.1
## [25] igraph_2.1.4 corrplot_0.95
## [27] caret_7.0-1 lattice_0.22-6
## [29] glmnet_4.1-8 Matrix_1.7-2
## [31] dplyr_1.1.4 data.table_1.17.0
## [33] ggplot2_3.5.1 SpatialExperiment_1.17.0
## [35] SingleCellExperiment_1.29.2 SummarizedExperiment_1.37.0
## [37] Biobase_2.67.0 GenomicRanges_1.59.1
## [39] GenomeInfoDb_1.43.4 IRanges_2.41.3
## [41] S4Vectors_0.45.4 BiocGenerics_0.53.6
## [43] generics_0.1.3 MatrixGenerics_1.19.1
## [45] matrixStats_1.5.0 jazzPanda_0.99.6
##
## loaded via a namespace (and not attached):
## [1] spatialreg_1.3-6 spatstat.sparse_3.1-0
## [3] bitops_1.0-9 sf_1.0-19
## [5] EBImage_4.49.0 lubridate_1.9.4
## [7] doParallel_1.0.17 httr_1.4.7
## [9] tools_4.5.0 backports_1.5.0
## [11] R6_2.6.1 HDF5Array_1.35.15
## [13] mgcv_1.9-1 rhdf5filters_1.19.2
## [15] withr_3.0.2 sp_2.2-0
## [17] cli_3.6.4 sandwich_3.1-1
## [19] labeling_0.4.3 sass_0.4.9
## [21] mvtnorm_1.3-3 proxy_0.4-27
## [23] R.utils_2.13.0 parallelly_1.42.0
## [25] limma_3.63.8 RSQLite_2.3.9
## [27] shape_1.4.6.1 car_3.1-3
## [29] spdep_1.3-10 ggbeeswarm_0.7.2
## [31] abind_1.4-8 R.methodsS3_1.8.2
## [33] terra_1.8-29 lifecycle_1.0.4
## [35] multcomp_1.4-28 yaml_2.3.10
## [37] edgeR_4.5.8 carData_3.0-5
## [39] rhdf5_2.51.2 recipes_1.1.1
## [41] SparseArray_1.7.6 blob_1.2.4
## [43] dqrng_0.4.1 crayon_1.5.3
## [45] cowplot_1.1.3 beachmat_2.23.6
## [47] KEGGREST_1.47.0 magick_2.8.5
## [49] metapod_1.15.0 zeallot_0.1.0
## [51] pillar_1.10.1 knitr_1.49
## [53] rjson_0.2.23 boot_1.3-31
## [55] future.apply_1.11.3 codetools_0.2-20
## [57] wk_0.9.4 glue_1.8.0
## [59] vctrs_0.6.5 png_0.1-8
## [61] gtable_0.3.6 cachem_1.1.0
## [63] gower_1.0.2 xfun_0.51
## [65] mime_0.12 S4Arrays_1.7.3
## [67] prodlim_2024.06.25 DropletUtils_1.27.2
## [69] tidygraph_1.3.1 coda_0.19-4.1
## [71] survival_3.8-3 timeDate_4041.110
## [73] sfheaders_0.4.4 iterators_1.0.14
## [75] hardhat_1.4.1 units_0.8-6
## [77] lava_1.8.1 bluster_1.17.0
## [79] statmod_1.5.0 TH.data_1.1-3
## [81] ipred_0.9-15 bit64_4.6.0-1
## [83] filelock_1.0.3 BumpyMatrix_1.15.0
## [85] bslib_0.9.0 irlba_2.3.5.1
## [87] vipor_0.4.7 KernSmooth_2.23-26
## [89] colorspace_2.1-1 spData_2.3.4
## [91] DBI_1.2.3 nnet_7.3-20
## [93] tidyselect_1.2.1 curl_6.2.1
## [95] bit_4.6.0 compiler_4.5.0
## [97] BiocNeighbors_2.1.3 h5mread_0.99.4
## [99] DelayedArray_0.33.6 scales_1.3.0
## [101] classInt_0.4-11 rappdirs_0.3.3
## [103] tiff_0.1-12 stringr_1.5.1
## [105] digest_0.6.37 goftest_1.2-3
## [107] fftwtools_0.9-11 spatstat.utils_3.1-2
## [109] rmarkdown_2.29 XVector_0.47.2
## [111] htmltools_0.5.8.1 pkgconfig_2.0.3
## [113] jpeg_0.1-10 sparseMatrixStats_1.19.0
## [115] fastmap_1.2.0 rlang_1.1.5
## [117] htmlwidgets_1.6.4 UCSC.utils_1.3.1
## [119] DelayedMatrixStats_1.29.1 farver_2.1.2
## [121] jquerylib_0.1.4 zoo_1.8-13
## [123] jsonlite_1.9.1 BiocParallel_1.41.2
## [125] ModelMetrics_1.2.2.2 R.oo_1.27.0
## [127] BiocSingular_1.23.0 RCurl_1.98-1.16
## [129] magrittr_2.0.3 Formula_1.2-5
## [131] GenomeInfoDbData_1.2.13 s2_1.1.7
## [133] patchwork_1.3.0 Rhdf5lib_1.29.1
## [135] munsell_0.5.1 Rcpp_1.0.14
## [137] viridis_0.6.5 stringi_1.8.4
## [139] pROC_1.18.5 MASS_7.3-65
## [141] plyr_1.8.9 parallel_4.5.0
## [143] listenv_0.9.1 deldir_2.0-4
## [145] Biostrings_2.75.4 graphlayouts_1.2.2
## [147] splines_4.5.0 tensor_1.5
## [149] locfit_1.5-9.12 ggsignif_0.6.4
## [151] ScaledMatrix_1.15.0 reshape2_1.4.4
## [153] LearnBayes_2.15.1 BiocVersion_3.21.1
## [155] evaluate_1.0.3 BiocManager_1.30.25
## [157] foreach_1.5.2 tweenr_2.0.3
## [159] purrr_1.0.4 polyclip_1.10-7
## [161] future_1.34.0 ggforce_0.4.2
## [163] rsvd_1.0.5 broom_1.0.7
## [165] e1071_1.7-16 rstatix_0.7.2
## [167] viridisLite_0.4.2 class_7.3-23
## [169] tibble_3.2.1 beeswarm_0.4.0
## [171] AnnotationDbi_1.69.0 memoise_2.0.1
## [173] cluster_2.1.8 timechange_0.3.0
## [175] globals_0.16.3