Here we walk through an end-to-end GeoMx-NGS gene expression analysis workflow. We start with raw gene expression count files. Using a combination of NanoString-developed (GeoMxTools & NanoStringNCTools) and open source R packages, we evaluate samples and expression targets and prepare gene-level count data for downstream analysis. To understand our spatial data, we perform unsupervised clustering, dimension reduction, and differential gene expression analyses and visually explore the results.
GeomxTools 3.4.0
R version: 4.3.0 RC (2023-04-13 r84269)
Bioconductor version: 3.17
Package version: 3.4.0
In this vignette, we will introduce a data analysis workflow for GeoMx-NGS mRNA expression data.
The GeoMx Digital Spatial Profiler (DSP) is a platform for capturing spatially resolved high-plex gene (or protein) expression data from tissue (Merritt et al., 2020). In particular, formalin-fixed paraffin-embedded (FFPE) or fresh-frozen (FF) tissue sections are stained with barcoded in-situ hybridization probes that bind to endogenous mRNA transcripts. The user then selects regions of interest (ROI) to profile; if desired, each ROI segment can be further sub-divided into areas of illumination (AOI) based on tissue morphology. The GeoMx then photo-cleaves and collects expression barcodes for each AOI segment separately for downstream sequencing and data processing.
The final results are spatially resolved unique expression datasets for every protein-coding gene (>18,000 genes) from every individual segment profiled from tissue.
The motivation for this vignette is to enable scientists to work with GeoMx-NGS gene expression data and understand a standard data analysis workflow.
Our specific objectives: load the raw count data into a GeoMx data object, perform segment and probe QC, filter and normalize gene-level counts, and explore the processed data with dimension reduction, clustering, and differential expression analyses.
Let’s install and load the GeoMx packages we need:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes the most up-to-date version of Bioconductor
BiocManager::install(version = "3.17")

BiocManager::install("NanoStringNCTools")
BiocManager::install("GeomxTools")
BiocManager::install("GeoMxWorkflows")
library(NanoStringNCTools)
library(GeomxTools)
library(GeoMxWorkflows)
if(packageVersion("GeomxTools") < "2.1" &
   packageVersion("GeoMxWorkflows") >= "1.0.1"){
    stop("GeomxTools and Workflow versions do not match. Please use the same version.
    This workflow is meant to be used with most current version of packages.
    If you are using an older version of Bioconductor please reinstall GeoMxWorkflows and use vignette(GeoMxWorkflows) instead")
}

if(packageVersion("GeomxTools") > "2.1" &
   packageVersion("GeoMxWorkflows") <= "1.0.1"){
    stop("GeomxTools and Workflow versions do not match.
         Please use the same version, see install instructions above.")

    # to remove current package version
    # remove.packages("GeomxTools")
    # remove.packages("GeoMxWorkflows")
    # see install instructions above
}
In this vignette, we will analyze a GeoMx kidney dataset created with the human whole transcriptome atlas (WTA) assay. The dataset includes 4 diabetic kidney disease (DKD) and 3 healthy kidney tissue samples. Regions of interest (ROI) were spatially profiled to focus on two different kidney structures: tubules or glomeruli. One glomerular ROI contains the entirety of a single glomerulus. Each tubular ROI contains multiple tubules that were segmented into distal (PanCK+) and proximal (PanCK-) tubule areas of illumination (AOI).
Download and unzip the kidney dataset from the NanoString website.
The key data files are: DCC files (one per segment, containing the expression count data), PKC files (probe assay metadata describing the gene targets present in the data), and an annotation file (segment-level metadata such as tissue type, ROI/AOI identity, and segment area/nuclei counts).
We first locate the downloaded files:
# Reference the main folder 'file.path' containing the sub-folders with each
# data file type:
datadir <- system.file("extdata", "WTA_NGS_Example",
                       package = "GeoMxWorkflows")

# to locate a specific file path replace the above line with
# datadir <- file.path("~/Folder/SubFolder/DataLocation")
# replace the Folder, SubFolder, DataLocation as needed

# the DataLocation folder should contain a dccs, pkcs, and annotation folder
# with each set of files present as needed

# automatically list files in each directory for use
DCCFiles <- dir(file.path(datadir, "dccs"), pattern = ".dcc$",
                full.names = TRUE, recursive = TRUE)
PKCFiles <- unzip(zipfile = dir(file.path(datadir, "pkcs"), pattern = ".zip$",
                                full.names = TRUE, recursive = TRUE))
SampleAnnotationFile <-
    dir(file.path(datadir, "annotation"), pattern = ".xlsx$",
        full.names = TRUE, recursive = TRUE)
We then load the data to create a data object using the readNanoStringGeoMxSet function.
# load data
demoData <-
    readNanoStringGeoMxSet(dccFiles = DCCFiles,
                           pkcFiles = PKCFiles,
                           phenoDataFile = SampleAnnotationFile,
                           phenoDataSheet = "Template",
                           phenoDataDccColName = "Sample_ID",
                           protocolDataColNames = c("aoi", "roi"),
                           experimentDataColNames = c("panel"))
All of the expression, annotation, and probe information is now linked and stored together in a single data object.
For more details on this object’s structure and accessors, please refer to the “GeoMxSet Object Overview” section at the end of this vignette.
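For a quick orientation, these are the standard accessors used throughout this vignette to pull each component from the object (a minimal sketch; the row/column indices are arbitrary):
# Common accessors for a NanoStringGeoMxSet object
exprs(demoData)[1:3, 1:2]  # probe-level count matrix (features x segments)
pData(demoData)[1:2, ]     # segment (sample) annotations
fData(demoData)[1:2, ]     # feature (probe/target) annotations
sData(demoData)[1:2, 1:4]  # segment annotations combined with protocol data
annotation(demoData)       # PKC file(s) associated with the object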
First let’s access the PKC files to ensure that the expected PKCs have been loaded for this study. For the demo data we are using the file Hsa_WTA_v1.0.pkc.
library(knitr)
pkcs <- annotation(demoData)
modules <- gsub(".pkc", "", pkcs)
kable(data.frame(PKCs = pkcs, modules = modules))
PKCs | modules |
---|---|
Hsa_WTA_v1.0.pkc | Hsa_WTA_v1.0 |
Now that we have loaded the data, we can visually summarize the experimental design for our dataset to look at the different types of samples and ROI/AOI segments that have been profiled. We present this information in a Sankey diagram.
library(dplyr)
library(ggforce)

# select the annotations we want to show, use `` to surround column names with
# spaces or special symbols
count_mat <- count(pData(demoData), `slide name`, class, region, segment)

# simplify the slide names
count_mat$`slide name` <- gsub("disease", "d",
                               gsub("normal", "n", count_mat$`slide name`))

# gather the data and plot in order: class, slide name, region, segment
test_gr <- gather_set_data(count_mat, 1:4)
test_gr$x <- factor(test_gr$x,
                    levels = c("class", "slide name", "region", "segment"))

# plot Sankey
ggplot(test_gr, aes(x, id = id, split = y, value = n)) +
    geom_parallel_sets(aes(fill = region), alpha = 0.5, axis.width = 0.1) +
    geom_parallel_sets_axes(axis.width = 0.2) +
    geom_parallel_sets_labels(color = "white", size = 5) +
    theme_classic(base_size = 17) +
    theme(legend.position = "bottom",
          axis.ticks.y = element_blank(),
          axis.line = element_blank(),
          axis.text.y = element_blank()) +
    scale_y_continuous(expand = expansion(0)) +
    scale_x_discrete(expand = expansion(0)) +
    labs(x = "", y = "") +
    annotate(geom = "segment", x = 4.25, xend = 4.25,
             y = 20, yend = 120, lwd = 2) +
    annotate(geom = "text", x = 4.19, y = 70, angle = 90, size = 5,
             hjust = 0.5, label = "100 segments")
The following steps encompass the standard pre-processing workflow for GeoMx data: the selection of ROI/AOI segments and genes based on quality control (QC) or limit of quantification (LOQ) metrics, followed by data normalization.
Before we begin, we will shift any expression counts with a value of 0 to 1 to enable the log-scale transformations used downstream.
# Shift counts to one
demoData <- shiftCountsOne(demoData, useDALogic = TRUE)
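This shift matters because several later steps (the Q3 vs. background comparison, CV calculation, and UMAP/tSNE) work with log2-transformed counts, and the log of zero is undefined. A minimal illustration:
# log2 is undefined at zero; shifting zero counts to one avoids -Inf downstream
log2(0)      # -Inf
log2(0 + 1)  # 0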
We first assess sequencing quality and adequate tissue sampling for every ROI/AOI segment.
Every ROI/AOI segment will be tested for: raw sequencing reads, percentages of reads trimmed, stitched, and aligned, sequencing saturation, negative control probe counts, no-template control (NTC) counts, estimated nuclei count, and segment area, matching the QC parameters set in the code below.
First, we select the QC parameter cutoffs against which our ROI/AOI segments will be tested and flagged appropriately. Here we use study-specific values chosen after reviewing the QC visualizations that follow. Note: the default QC values, shown in parentheses next to each parameter in the code below, are advised when surveying a new dataset for the first time.
# Default QC cutoffs are commented in () adjacent to the respective parameters
# study-specific values were selected after visualizing the QC results in more
# detail below
QC_params <-
    list(minSegmentReads = 1000,   # Minimum number of reads (1000)
         percentTrimmed = 80,      # Minimum % of reads trimmed (80%)
         percentStitched = 80,     # Minimum % of reads stitched (80%)
         percentAligned = 75,      # Minimum % of reads aligned (80%)
         percentSaturation = 50,   # Minimum sequencing saturation (50%)
         minNegativeCount = 1,     # Minimum negative control counts (10)
         maxNTCCount = 9000,       # Maximum counts observed in NTC well (1000)
         minNuclei = 20,           # Minimum # of nuclei estimated (100)
         minArea = 1000)           # Minimum segment area (5000)
demoData <-
    setSegmentQCFlags(demoData, qcCutoffs = QC_params)
# Collate QC Results
QCResults <- protocolData(demoData)[["QCFlags"]]
flag_columns <- colnames(QCResults)
QC_Summary <- data.frame(Pass = colSums(!QCResults[, flag_columns]),
                         Warning = colSums(QCResults[, flag_columns]))
QCResults$QCStatus <- apply(QCResults, 1L, function(x) {
    ifelse(sum(x) == 0L, "PASS", "WARNING")
})
QC_Summary["TOTAL FLAGS", ] <-
    c(sum(QCResults[, "QCStatus"] == "PASS"),
      sum(QCResults[, "QCStatus"] == "WARNING"))
Before excluding any low-performing ROI/AOI segments, we visualize the distributions of the data for the different QC parameters. Note that the “Select Segment QC” and “Visualize Segment QC” sections are performed in parallel to fully understand low-performing segments for a given study. Iteration may follow to select the study-specific QC cutoffs.
For QC visualization, we write a quick function to draw histograms of our data.
library(ggplot2)

col_by <- "segment"

# Graphical summaries of QC statistics plot function
QC_histogram <- function(assay_data = NULL,
                         annotation = NULL,
                         fill_by = NULL,
                         thr = NULL,
                         scale_trans = NULL) {
    plt <- ggplot(assay_data,
                  aes_string(x = paste0("unlist(`", annotation, "`)"),
                             fill = fill_by)) +
        geom_histogram(bins = 50) +
        geom_vline(xintercept = thr, lty = "dashed", color = "black") +
        theme_bw() + guides(fill = "none") +
        facet_wrap(as.formula(paste("~", fill_by)), nrow = 4) +
        labs(x = annotation, y = "Segments, #", title = annotation)
    if(!is.null(scale_trans)) {
        plt <- plt +
            scale_x_continuous(trans = scale_trans)
    }
    plt
}
Now we explore each of the QC metrics for the segments.
QC_histogram(sData(demoData), "Trimmed (%)", col_by, 80)
QC_histogram(sData(demoData), "Stitched (%)", col_by, 80)
QC_histogram(sData(demoData), "Aligned (%)", col_by, 75)
QC_histogram(sData(demoData), "Saturated (%)", col_by, 50) +
labs(title = "Sequencing Saturation (%)",
x = "Sequencing Saturation (%)")
QC_histogram(sData(demoData), "area", col_by, 1000, scale_trans = "log10")
QC_histogram(sData(demoData), "nuclei", col_by, 20)
# calculate the negative geometric means for each module
negativeGeoMeans <-
    esBy(negativeControlSubset(demoData),
         GROUP = "Module",
         FUN = function(x) {
             assayDataApply(x, MARGIN = 2, FUN = ngeoMean, elt = "exprs")
         })
protocolData(demoData)[["NegGeoMean"]] <- negativeGeoMeans

# explicitly copy the Negative geoMeans from sData to pData
negCols <- paste0("NegGeoMean_", modules)
pData(demoData)[, negCols] <- sData(demoData)[["NegGeoMean"]]
for(ann in negCols) {
    plt <- QC_histogram(pData(demoData), ann, col_by, 2, scale_trans = "log10")
    print(plt)
}

# detach neg_geomean columns ahead of aggregateCounts call
pData(demoData) <- pData(demoData)[, !colnames(pData(demoData)) %in% negCols]

# show all NTC values, Freq = # of Segments with a given NTC count:
kable(table(NTC_Count = sData(demoData)$NTC),
      col.names = c("NTC Count", "# of Segments"))
NTC Count | # of Segments |
---|---|
3 | 36 |
113 | 71 |
397 | 34 |
8704 | 94 |
Note that the 94 segments sharing an NTC well with 8704 counts are why the study-specific maxNTCCount cutoff was raised to 9000; under the default cutoff of 1000, they would all have been flagged. Finally, we review all of the QC summary information in a table.
kable(QC_Summary, caption = "QC Summary Table for each Segment")
Pass | Warning | |
---|---|---|
LowReads | 231 | 4 |
LowTrimmed | 235 | 0 |
LowStitched | 235 | 0 |
LowAligned | 229 | 6 |
LowSaturation | 231 | 4 |
LowNegatives | 235 | 0 |
HighNTC | 235 | 0 |
LowNuclei | 235 | 0 |
LowArea | 235 | 0 |
TOTAL FLAGS | 229 | 6 |
As the final step in Segment QC, we remove flagged segments that do not meet our QC cutoffs.
demoData <- demoData[, QCResults$QCStatus == "PASS"]
# Subsetting our dataset has removed samples which did not pass QC
dim(demoData)
#> Features Samples
#> 18642 229
Before we summarize our data into gene-level count data, we will remove low-performing probes. In short, this QC is an outlier removal process, whereby probes are either removed entirely from the study (global) or from specific segments (local). The QC applies to gene targets for which there are multiple distinct probes representing the count for a gene per segment. In WTA data, one specific probe exists per target gene; thus, Probe QC does not apply to the endogenous genes in the panel. Rather, it is performed on the negative control probes; there are multiple probes representing our negative controls, which do not target any sequence in the genome. These probes enable calculation of the background per segment and will be important for determining gene detection downstream.
After Probe QC, there will always remain at least one probe representing every gene target. In other words, Probe QC never removes genes from your data.
A probe is removed globally from the dataset if either of the following is true: the geometric mean of that probe’s counts from all segments divided by the geometric mean of all probe counts representing the target from all segments is less than 0.1, or the probe is an outlier according to the Grubbs test in at least 20% of the segments. These thresholds correspond to the minProbeRatio and percentFailGrubbs cutoffs in the code below.
A probe is removed locally (from a given segment) if the probe is an outlier according to the Grubbs test in that segment.
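For intuition, the Grubbs test flags at most one value at a time by comparing the largest absolute deviation from the mean against the sample standard deviation. A minimal sketch of the statistic on hypothetical log2-scale probe counts (setBioProbeQCFlags performs the formal test internally):
# Grubbs test statistic, G = max|x - mean(x)| / sd(x) (illustration only)
x <- log2(c(10, 12, 9, 11, 85))  # hypothetical counts; 85 is the suspect value
G <- max(abs(x - mean(x))) / sd(x)
G  # compared against a critical value derived from the t-distribution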
We do not typically adjust these QC parameters.
# Generally keep the qcCutoffs parameters unchanged. Set removeLocalOutliers to
# FALSE if you do not want to remove local outliers
demoData <- setBioProbeQCFlags(demoData,
                               qcCutoffs = list(minProbeRatio = 0.1,
                                                percentFailGrubbs = 20),
                               removeLocalOutliers = TRUE)

ProbeQCResults <- fData(demoData)[["QCFlags"]]

# Define QC table for Probe QC
qc_df <- data.frame(Passed = sum(rowSums(ProbeQCResults[, -1]) == 0),
                    Global = sum(ProbeQCResults$GlobalGrubbsOutlier),
                    Local = sum(rowSums(ProbeQCResults[, -2:-1]) > 0
                                & !ProbeQCResults$GlobalGrubbsOutlier))
We report the number of global and local outlier probes.
Passed | Global | Local |
---|---|---|
18619 | 1 | 22 |
# Subset object to exclude all that did not pass Ratio & Global testing
ProbeQCPassed <-
    subset(demoData,
           fData(demoData)[["QCFlags"]][, c("LowProbeRatio")] == FALSE &
               fData(demoData)[["QCFlags"]][, c("GlobalGrubbsOutlier")] == FALSE)
dim(ProbeQCPassed)
#> Features  Samples
#>    18641      229
demoData <- ProbeQCPassed
With our Probe QC steps complete, we will generate a gene-level count matrix. The count for any gene with multiple probes per segment is calculated as the geometric mean of those probes.
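To make the collapse concrete, here is a minimal worked example of the geometric mean (the counts are hypothetical; ngeoMean from NanoStringNCTools, used earlier for the negative controls, computes this for us):
# geometric mean = exp(mean(log(x))), equivalently prod(x)^(1/length(x))
probe_counts <- c(12, 18, 9)  # hypothetical counts for one target's probes
exp(mean(log(probe_counts)))  # ~12.49
ngeoMean(probe_counts)        # same result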
# Check how many unique targets the object has
length(unique(featureData(demoData)[["TargetName"]]))
#> [1] 18504
# collapse to targets
target_demoData <- aggregateCounts(demoData)
dim(target_demoData)
#> Features Samples
#> 18504 229
exprs(target_demoData)[1:5, 1:2]
#> DSP-1001250007851-H-A02.dcc DSP-1001250007851-H-A03.dcc
#> A2M 485 262
#> NAT2 15 18
#> ACADM 31 15
#> ACADS 27 17
#> ACAT1 29 24
In addition to Segment and Probe QC, we also determine the limit of quantification (LOQ) per segment. The LOQ is calculated based on the distribution of negative control probes and is intended to approximate the quantifiable limit of gene expression per segment. Please note that this process is more stable in larger segments. Likewise, the LOQ may not be as accurately reflective of true signal detection rates in segments with low negative probe counts (ex: <2). The formula for calculating the LOQ in the \(i^{th}\) segment is:
\[LOQ_{i} = geomean(NegProbe_{i}) * geoSD(NegProbe_{i})^{n}\]
We typically use 2 geometric standard deviations (\(n = 2\)) above the geometric mean as the LOQ, which is reasonable for most studies. We also recommend that a minimum LOQ of 2 be used if the LOQ calculated in a segment is below this threshold.
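As a worked example with hypothetical values: a segment whose negative probes have a geometric mean of 3 and a geometric SD of 1.5 gets \(LOQ = 3 \times 1.5^{2} = 6.75\), which exceeds the minimum LOQ of 2 and is used as-is:
# Worked LOQ example (hypothetical negative-probe summary values)
negGeoMean <- 3
negGeoSD <- 1.5
max(2, negGeoMean * negGeoSD^2)  # 6.75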
# Define LOQ SD threshold and minimum value
cutoff <- 2
minLOQ <- 2

# Calculate LOQ per module tested
LOQ <- data.frame(row.names = colnames(target_demoData))
for(module in modules) {
    vars <- paste0(c("NegGeoMean_", "NegGeoSD_"),
                   module)
    if(all(vars[1:2] %in% colnames(pData(target_demoData)))) {
        LOQ[, module] <-
            pmax(minLOQ,
                 pData(target_demoData)[, vars[1]] *
                     pData(target_demoData)[, vars[2]] ^ cutoff)
    }
}
pData(target_demoData)$LOQ <- LOQ
After determining the limit of quantification (LOQ) per segment, we recommend filtering out either segments and/or genes with abnormally low signal. Filtering is an important step to focus on the true biological data of interest.
We determine the number of genes detected in each segment across the dataset.
LOQ_Mat <- c()
for(module in modules) {
    ind <- fData(target_demoData)$Module == module
    Mat_i <- t(esApply(target_demoData[ind, ], MARGIN = 1,
                       FUN = function(x) {
                           x > LOQ[, module]
                       }))
    LOQ_Mat <- rbind(LOQ_Mat, Mat_i)
}
# ensure ordering since this is stored outside of the geomxSet
LOQ_Mat <- LOQ_Mat[fData(target_demoData)$TargetName, ]
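The result is a genes-by-segments logical matrix in which TRUE indicates that the gene was detected above that segment's LOQ. A quick peek at the corner of the matrix (sketch):
# TRUE = gene detected above the segment-specific LOQ
LOQ_Mat[1:3, 1:2]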
We first filter out segments with exceptionally low signal. These segments will have a small fraction of panel genes detected above the LOQ relative to the other segments in the study. Let’s visualize the distribution of segments with respect to their % genes detected:
# Save detection rate information to pheno data
pData(target_demoData)$GenesDetected <-
    colSums(LOQ_Mat, na.rm = TRUE)
pData(target_demoData)$GeneDetectionRate <-
    pData(target_demoData)$GenesDetected / nrow(target_demoData)

# Determine detection thresholds: 1%, 5%, 10%, 15%, >15%
pData(target_demoData)$DetectionThreshold <-
    cut(pData(target_demoData)$GeneDetectionRate,
        breaks = c(0, 0.01, 0.05, 0.1, 0.15, 1),
        labels = c("<1%", "1-5%", "5-10%", "10-15%", ">15%"))

# stacked bar plot of different cut points (1%, 5%, 10%, 15%)
ggplot(pData(target_demoData),
       aes(x = DetectionThreshold)) +
    geom_bar(aes(fill = region)) +
    geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
    theme_bw() +
    scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(x = "Gene Detection Rate",
         y = "Segments, #",
         fill = "Segment Type")
We can also create a table to review what kidney tissue type (DKD vs normal) is going to be impacted by each threshold:
# cut percent genes detected at 1, 5, 10, 15
kable(table(pData(target_demoData)$DetectionThreshold,
            pData(target_demoData)$class))
DKD | normal | |
---|---|---|
<1% | 0 | 1 |
1-5% | 0 | 0 |
5-10% | 6 | 1 |
10-15% | 21 | 4 |
>15% | 102 | 94 |
In this example, we choose to remove segments with less than 10% of the genes detected. Generally, 5-10% detection is a reasonable segment filtering threshold. However, based on the experimental design (e.g. segment types, size, nuclei) and tissue characteristics (e.g. type, age), these guidelines may require adjustment.
target_demoData <-
    target_demoData[, pData(target_demoData)$GeneDetectionRate >= .1]
dim(target_demoData)
#> Features  Samples
#>    18504      221
Let’s re-plot the Sankey diagram showing our current working dataset. This is now a dataset that no longer contains segments flagged by Segment QC or that have low gene detection rates.
# select the annotations we want to show, use `` to surround column names with
# spaces or special symbols
count_mat <- count(pData(target_demoData), `slide name`, class, region, segment)

# simplify the slide names
count_mat$`slide name` <-
    gsub("disease", "d",
         gsub("normal", "n", count_mat$`slide name`))

# gather the data and plot in order: class, slide name, region, segment
test_gr <- gather_set_data(count_mat, 1:4)
test_gr$x <-
    factor(test_gr$x,
           levels = c("class", "slide name", "region", "segment"))

# plot Sankey
ggplot(test_gr, aes(x, id = id, split = y, value = n)) +
    geom_parallel_sets(aes(fill = region), alpha = 0.5, axis.width = 0.1) +
    geom_parallel_sets_axes(axis.width = 0.2) +
    geom_parallel_sets_labels(color = "white", size = 5) +
    theme_classic(base_size = 17) +
    theme(legend.position = "bottom",
          axis.ticks.y = element_blank(),
          axis.line = element_blank(),
          axis.text.y = element_blank()) +
    scale_y_continuous(expand = expansion(0)) +
    scale_x_discrete(expand = expansion(0)) +
    labs(x = "", y = "") +
    annotate(geom = "segment", x = 4.25, xend = 4.25, y = 20,
             yend = 120, lwd = 2) +
    annotate(geom = "text", x = 4.19, y = 70, angle = 90, size = 5,
             hjust = 0.5, label = "100 segments")
Next, we determine the detection rate for genes across the study. To illustrate this idea, we create a small gene list (goi) to review.
library(scales) # for percent

# Calculate detection rate:
LOQ_Mat <- LOQ_Mat[, colnames(target_demoData)]
fData(target_demoData)$DetectedSegments <- rowSums(LOQ_Mat, na.rm = TRUE)
fData(target_demoData)$DetectionRate <-
    fData(target_demoData)$DetectedSegments / nrow(pData(target_demoData))

# Gene of interest detection table
goi <- c("PDCD1", "CD274", "IFNG", "CD8A", "CD68", "EPCAM",
         "KRT18", "NPHS1", "NPHS2", "CALB1", "CLDN8")
goi_df <- data.frame(
    Gene = goi,
    Number = fData(target_demoData)[goi, "DetectedSegments"],
    DetectionRate = percent(fData(target_demoData)[goi, "DetectionRate"]))
Gene | Detection, # Segments | Detection Rate, % of Segments |
---|---|---|
PDCD1 | 1 | 0.5% |
CD274 | 75 | 33.9% |
IFNG | 9 | 4.1% |
CD8A | 33 | 14.9% |
CD68 | 160 | 72.4% |
EPCAM | 64 | 29.0% |
KRT18 | 217 | 98.2% |
NPHS1 | 142 | 64.3% |
NPHS2 | 142 | 64.3% |
CALB1 | 41 | 18.6% |
CLDN8 | 47 | 21.3% |
We can see that individual genes are detected to varying degrees in the segments, which leads us to the next QC we will perform across the dataset.
We will graph the total number of genes detected in different percentages of segments. Based on the visualization below, we can better understand global gene detection in our study and select how many low detected genes to filter out of the dataset. Gene filtering increases performance of downstream statistical tests and improves interpretation of true biological signal.
# Plot detection rate:
plot_detect <- data.frame(Freq = c(1, 5, 10, 20, 30, 50))
plot_detect$Number <-
    unlist(lapply(c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5),
                  function(x) {sum(fData(target_demoData)$DetectionRate >= x)}))
plot_detect$Rate <- plot_detect$Number / nrow(fData(target_demoData))
rownames(plot_detect) <- plot_detect$Freq

ggplot(plot_detect, aes(x = as.factor(Freq), y = Rate, fill = Rate)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = formatC(Number, format = "d", big.mark = ",")),
              vjust = 1.6, color = "black", size = 4) +
    scale_fill_gradient2(low = "orange2", mid = "lightblue",
                         high = "dodgerblue3", midpoint = 0.65,
                         limits = c(0, 1),
                         labels = scales::percent) +
    theme_bw() +
    scale_y_continuous(labels = scales::percent, limits = c(0, 1),
                       expand = expansion(mult = c(0, 0))) +
    labs(x = "% of Segments",
         y = "Genes Detected, % of Panel > LOQ")
We typically set a % Segment cutoff ranging from 5-20% based on the biological diversity of our dataset. For this study, we will select 10% as our cutoff. In other words, we will focus on the genes detected in at least 10% of our segments; we filter out the remainder of the targets.
Note: if we know that a key gene is represented in only a small number of segments (<10%) due to biological diversity, we may select a different cutoff or keep the target gene by manually selecting it for inclusion in the data object.
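For instance, here is a sketch of how the subsetting call below could be amended to force-retain a named gene; PDCD1 (detected in <1% of segments in this study) is used purely as an illustration, and in practice you would combine this with the negative-probe retention shown below:
# Variant of the detection-rate filter that manually keeps a gene of interest
keep_genes <- c("PDCD1")  # hypothetical choice; detected in only 1 segment here
target_demoData <-
    target_demoData[fData(target_demoData)$DetectionRate >= 0.1 |
                        fData(target_demoData)$TargetName %in% keep_genes, ]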
# Subset to target genes detected in at least 10% of the samples.
# Also manually include the negative control probe, for downstream use
negativeProbefData <- subset(fData(target_demoData), CodeClass == "Negative")
neg_probes <- unique(negativeProbefData$TargetName)
target_demoData <-
    target_demoData[fData(target_demoData)$DetectionRate >= 0.1 |
                        fData(target_demoData)$TargetName %in% neg_probes, ]
dim(target_demoData)
#> Features  Samples
#>    10131      221

# retain only detected genes of interest
goi <- goi[goi %in% rownames(target_demoData)]
We will now normalize the GeoMx data for downstream visualizations and differential expression. The two common methods for normalization of DSP-NGS RNA data are i) third quartile (Q3) normalization and ii) background normalization.
Both of these normalization methods estimate a normalization factor per segment to bring the segment data distributions together. More advanced methods for normalization and modeling are under active development. However, for most studies, these methods are sufficient for understanding differences between biological classes of segments and samples.
Q3 normalization is typically the preferred normalization strategy for most DSP-NGS RNA studies. Given the low negative probe counts in this particular dataset as shown during Segment QC, we would further avoid background normalization as it may be less stable.
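Conceptually, Q3 normalization rescales each segment by its 75th-percentile count relative to all segments. A minimal sketch of the idea, assuming the per-segment factor is the segment's Q3 divided by the geometric mean of all segments' Q3 values (the normalize() call later in this section computes the factors for us):
# Sketch of per-segment Q3 scaling factors (concept illustration only)
q3 <- apply(exprs(target_demoData), 2, quantile, 0.75, na.rm = TRUE)
q3_factors <- q3 / ngeoMean(q3)  # >1 indicates a brighter-than-typical segment
q3_sketch <- sweep(exprs(target_demoData), 2, q3_factors, "/")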
Before normalization, we will explore the relationship between the upper quartile (Q3) of the counts in each segment and the geometric mean of the negative control probes in the data. Ideally, there should be a separation between these two values to ensure we have a stable measure of Q3 signal. If you do not see sufficient separation between these values, you may consider more aggressive filtering of low-signal segments/genes.
library(reshape2) # for melt
library(cowplot)  # for plot_grid

# Graph Q3 value vs negGeoMean of Negatives
ann_of_interest <- "region"
Stat_data <-
    data.frame(row.names = colnames(exprs(target_demoData)),
               Segment = colnames(exprs(target_demoData)),
               Annotation = pData(target_demoData)[, ann_of_interest],
               Q3 = unlist(apply(exprs(target_demoData), 2,
                                 quantile, 0.75, na.rm = TRUE)),
               NegProbe = exprs(target_demoData)[neg_probes, ])
Stat_data_m <- melt(Stat_data, measure.vars = c("Q3", "NegProbe"),
                    variable.name = "Statistic", value.name = "Value")

plt1 <- ggplot(Stat_data_m,
               aes(x = Value, fill = Statistic)) +
    geom_histogram(bins = 40) + theme_bw() +
    scale_x_continuous(trans = "log2") +
    facet_wrap(~Annotation, nrow = 1) +
    scale_fill_brewer(palette = 3, type = "qual") +
    labs(x = "Counts", y = "Segments, #")

plt2 <- ggplot(Stat_data,
               aes(x = NegProbe, y = Q3, color = Annotation)) +
    geom_abline(intercept = 0, slope = 1, lty = "dashed", color = "darkgray") +
    geom_point() + guides(color = "none") + theme_bw() +
    scale_x_continuous(trans = "log2") +
    scale_y_continuous(trans = "log2") +
    theme(aspect.ratio = 1) +
    labs(x = "Negative Probe GeoMean, Counts", y = "Q3 Value, Counts")

plt3 <- ggplot(Stat_data,
               aes(x = NegProbe, y = Q3 / NegProbe, color = Annotation)) +
    geom_hline(yintercept = 1, lty = "dashed", color = "darkgray") +
    geom_point() + theme_bw() +
    scale_x_continuous(trans = "log2") +
    scale_y_continuous(trans = "log2") +
    theme(aspect.ratio = 1) +
    labs(x = "Negative Probe GeoMean, Counts", y = "Q3/NegProbe Value, Counts")

btm_row <- plot_grid(plt2, plt3, nrow = 1, labels = c("B", ""),
                     rel_widths = c(0.43, 0.57))
plot_grid(plt1, btm_row, ncol = 1, labels = c("A", ""))
As expected, we see separation of the Q3 and negative probe counts at both the distribution (A) and per segment (B) levels. For additional conceptual guidance, please refer to our Data Analysis White Paper for DSP-NGS Assays.
Next, we normalize our data; we will use Q3 normalized data moving forward. We use the normalize function from NanoStringNCTools to create normalization factors reflecting each data type. Upper quartile (Q3) normalization is performed with norm_method = "quant", setting the desiredQuantile flag to 0.75; other quantiles could be specified by changing that value. We save the normalized data to a specific slot using toElt = "q_norm". Similarly, background normalization is performed by setting norm_method = "neg" and toElt = "neg_norm".
# Q3 norm (75th percentile) for WTA/CTA with or without custom spike-ins
target_demoData <- normalize(target_demoData,
                             norm_method = "quant",
                             desiredQuantile = .75,
                             toElt = "q_norm")

# Background normalization for WTA/CTA without custom spike-in
target_demoData <- normalize(target_demoData,
                             norm_method = "neg",
                             fromElt = "exprs",
                             toElt = "neg_norm")
To demonstrate the effects of normalization, we graph representative box plots of the data for individual segments before and after normalization.
# visualize the first 10 segments with each normalization method
boxplot(exprs(target_demoData)[, 1:10],
        col = "#9EDAE5", main = "Raw Counts",
        log = "y", names = 1:10, xlab = "Segment",
        ylab = "Counts, Raw")

boxplot(assayDataElement(target_demoData[, 1:10], elt = "q_norm"),
        col = "#2CA02C", main = "Q3 Norm Counts",
        log = "y", names = 1:10, xlab = "Segment",
        ylab = "Counts, Q3 Normalized")

boxplot(assayDataElement(target_demoData[, 1:10], elt = "neg_norm"),
        col = "#FF7F0E", main = "Neg Norm Counts",
        log = "y", names = 1:10, xlab = "Segment",
        ylab = "Counts, Neg. Normalized")
One common approach to understanding high-plex data is dimension reduction. Two common methods are UMAP and tSNE, which are non-orthogonally constrained projections that cluster samples based on overall gene expression. In this study, we see by either UMAP (from the umap package) or tSNE (from the Rtsne package) clusters of segments related to structure (glomeruli or tubules) and disease status (normal or diabetic kidney disease).
library(umap)
library(Rtsne)

# update defaults for umap to contain a stable random_state (seed)
custom_umap <- umap::umap.defaults
custom_umap$random_state <- 42

# run UMAP
umap_out <-
    umap(t(log2(assayDataElement(target_demoData, elt = "q_norm"))),
         config = custom_umap)
pData(target_demoData)[, c("UMAP1", "UMAP2")] <- umap_out$layout[, c(1, 2)]
ggplot(pData(target_demoData),
       aes(x = UMAP1, y = UMAP2, color = region, shape = class)) +
    geom_point(size = 3) +
    theme_bw()

# run tSNE
set.seed(42) # set the seed for tSNE as well
tsne_out <-
    Rtsne(t(log2(assayDataElement(target_demoData, elt = "q_norm"))),
          perplexity = ncol(target_demoData)*.15)
pData(target_demoData)[, c("tSNE1", "tSNE2")] <- tsne_out$Y[, c(1, 2)]
ggplot(pData(target_demoData),
       aes(x = tSNE1, y = tSNE2, color = region, shape = class)) +
    geom_point(size = 3) +
    theme_bw()
Another approach to explore the data is to calculate the coefficient of variation (CV) for each gene (\(g\)) using the formula \(CV_g = SD_g/mean_g\). We then identify genes with high CVs that should have large differences across the various profiled segments. This unbiased approach can reveal highly variable genes across the study.
We plot the results using unsupervised hierarchical clustering, displayed as a heatmap.
library(pheatmap) # for pheatmap

# create a log2 transform of the data for analysis
assayDataElement(object = target_demoData, elt = "log_q") <-
    assayDataApply(target_demoData, 2, FUN = log, base = 2, elt = "q_norm")

# create CV function
calc_CV <- function(x) {sd(x) / mean(x)}
CV_dat <- assayDataApply(target_demoData,
                         elt = "log_q", MARGIN = 1, calc_CV)

# show the highest CV genes and their CV values
sort(CV_dat, decreasing = TRUE)[1:5]
#>   CAMK2N1    AKR1C1      AQP2     GDF15       REN
#> 0.5886006 0.5114973 0.4607206 0.4196469 0.4193216

# Identify genes in the top 20% of CV values
GOI <- names(CV_dat)[CV_dat > quantile(CV_dat, 0.8)]
pheatmap(assayDataElement(target_demoData[GOI, ], elt = "log_q"),
         scale = "row",
         show_rownames = FALSE, show_colnames = FALSE,
         border_color = NA,
         clustering_method = "average",
         clustering_distance_rows = "correlation",
         clustering_distance_cols = "correlation",
         breaks = seq(-3, 3, 0.05),
         color = colorRampPalette(c("purple3", "black", "yellow2"))(120),
         annotation_col =
             pData(target_demoData)[, c("class", "segment", "region")])