Overview

This tutorial demonstrates how to use GeomxTools for preprocessing protein or proteogenomics data.

Data Processing

Data processing is very similar to what is shown in the Developer_Introduction_to_the_NanoStringGeoMxSet and GeoMx Workflow vignettes with a couple of protein specific functions.

library(GeomxTools)

GeoMxSet objects can only read in one analyte at a time. With protein or proteogenomics data, the desired analyte must be added to the call to read in the object. RNA is the default analyte.

datadir <- system.file("extdata","DSP_Proteogenomics_Example_Data",
                       package = "GeomxTools")

DCCFiles <- unzip(zipfile = file.path(datadir,  "/DCCs.zip"))
PKCFiles <- unzip(zipfile = file.path(datadir,  "/pkcs.zip"))
SampleAnnotationFile <- file.path(datadir, "Annotation.xlsx")


RNAData <- suppressWarnings(readNanoStringGeoMxSet(dccFiles = DCCFiles,
                                                   pkcFiles = PKCFiles,
                                                   phenoDataFile = SampleAnnotationFile,
                                                   phenoDataSheet = "Annotations",
                                                   phenoDataDccColName = "Sample_ID",
                                                   protocolDataColNames = c("Tissue", 
                                                                            "Segment_Type", 
                                                                            "ROI.Size"),
                                                   configFile = NULL,
                                                   analyte = "RNA",
                                                   phenoDataColPrefix = "",
                                                   experimentDataColNames = NULL))

proteinData <- suppressWarnings(readNanoStringGeoMxSet(dccFiles = DCCFiles,
                                                       pkcFiles = PKCFiles,
                                                       phenoDataFile = SampleAnnotationFile,
                                                       phenoDataSheet = "Annotations",
                                                       phenoDataDccColName = "Sample_ID",
                                                       protocolDataColNames = c("Tissue", 
                                                                                "Segment_Type", 
                                                                                "ROI.Size"),
                                                       configFile = NULL,
                                                       analyte = "protein",
                                                       phenoDataColPrefix = "",
                                                       experimentDataColNames = NULL))

RNAData <- aggregateCounts(RNAData)
RNAData

## NanoStringGeoMxSet (storageMode: lockedEnvironment)
## assayData: 18677 features, 84 samples 
##   element names: exprs 
## protocolData
##   sampleNames: DSP-1001900002618-G-A02.dcc DSP-1001900002618-G-A03.dcc
##     ... DSP-1001900002618-G-H01.dcc (84 total)
##   varLabels: FileVersion SoftwareVersion ... ROI.Size (17 total)
##   varMetadata: labelDescription
## phenoData
##   sampleNames: DSP-1001900002618-G-A02.dcc DSP-1001900002618-G-A03.dcc
##     ... DSP-1001900002618-G-H01.dcc (84 total)
##   varLabels: Plate Well ... NegGeoSD_Hs_R_NGS_WTA_v1.0 (13 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: A2M NAT2 ... CST2 (18677 total)
##   fvarLabels: TargetName Module ... Negative (6 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Hs_R_NGS_WTA_v1.0.pkc 
## signature: none
## feature: Target
## analyte: RNA

#Protein Data is aggregated automatically on readin
proteinData

## NanoStringGeoMxSet (storageMode: lockedEnvironment)
## assayData: 147 features, 84 samples 
##   element names: exprs 
## protocolData
##   sampleNames: DSP-1001900002618-G-A02.dcc DSP-1001900002618-G-A03.dcc
##     ... DSP-1001900002618-G-H01.dcc (84 total)
##   varLabels: FileVersion SoftwareVersion ... ROI.Size (17 total)
##   varMetadata: labelDescription
## phenoData
##   sampleNames: DSP-1001900002618-G-A02.dcc DSP-1001900002618-G-A03.dcc
##     ... DSP-1001900002618-G-H01.dcc (84 total)
##   varLabels: Plate Well ... Y (11 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: Ms IgG1 CD45 ... ADAM10 (147 total)
##   fvarLabels: RTS_ID TargetName ... Negative (8 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Hs_P_NGS_ADPath_Ext_v1.0.pkc Hs_P_NGS_ADPath_v1.0.pkc Hs_P_NGS_Autophagy_v1.0.pkc Hs_P_NGS_CellDeath_v1.0.pkc Hs_P_NGS_Core_v1.0.pkc Hs_P_NGS_GlialSubtype_v1.0.pkc Hs_P_NGS_IODrugTarget_v1.0.pkc Hs_P_NGS_ImmuneActivation_v1.0.pkc Hs_P_NGS_ImmuneCellTyping_v1.0.pkc Hs_P_NGS_MAPK_v1.1.pkc Hs_P_NGS_Myeloid_v1.0.pkc Hs_P_NGS_NeuralCellTyping_v1.0.pkc Hs_P_NGS_PDPath_v1.0.pkc Hs_P_NGS_PI3K_AKT_v1.0.pkc Hs_P_NGS_PanTumor_v1.0.pkc 
## signature: none
## feature: Target
## analyte: Protein

By having the datasets split by analyte, each object can go through the typical QC and normalization steps specific to that analyte.

RNA

For RNA please refer to the introduction or GeoMx Workflow vignettes.

Protein

Segment QC

After reading in the object, we will do one QC step: flag and remove low quality ROIs

proteinData <- setSegmentQCFlags(proteinData, qcCutoffs = list(percentSaturation = 45,
                                                               minSegmentReads=1000, 
                                                               percentAligned=80, 
                                                               minNegativeCount=10, 
                                                               maxNTCCount=60, 
                                                               minNuclei=16000, 
                                                               minArea=20))

# low sequenced ROIs
lowSaturation <- which(as.data.frame(protocolData(proteinData)[["QCFlags"]])["LowSaturation"] == TRUE)

# remove low quality ROIs
passedQC <- proteinData[, -lowSaturation]
dim(proteinData)

## Features  Samples 
##      147       84

dim(passedQC)

## Features  Samples 
##      147       82

Target QC

Housekeepers and negative controls (IgGs) can easily be pulled out of the dataset.

hk.names <- hkNames(proteinData)
hk.names

## [1] "Histone H3" "GAPDH"      "S6"

igg.names <- iggNames(proteinData)
igg.names

## [1] "Ms IgG1"  "Ms IgG2a" "Rb IgG"

For the target QC step, we identify proteins with potentially little useful signal using this figure.

fig <- qcProteinSignal(object = proteinData, neg.names = igg.names)

proteinOrder <- qcProteinSignalNames(object = proteinData, neg.names = igg.names)
genesOfInterest <- c(which(proteinOrder == "Tyrosine Hydroxylase"),
                     which(proteinOrder == "ApoA-I"),
                     which(proteinOrder == "EpCAM"))

fig()
rect(xleft = 0, xright = 4, 
     ybottom = -2, ytop = 2, density = 0, col = "#1B9E77", lwd = 2)
rect(xleft = genesOfInterest[1]-1, xright = genesOfInterest[1]+1, 
     ybottom = -2, ytop = 1.25, density = 0, col = "#D95F02", lwd = 2)
rect(xleft = genesOfInterest[2]-1, xright = genesOfInterest[2]+1, 
     ybottom = -1, ytop = 3, density = 0, col = "#66A61E", lwd = 2)
rect(xleft = genesOfInterest[3]-1, xright = genesOfInterest[3]+1, 
     ybottom = -3, ytop = 6.5, density = 0, col = "#E7298A", lwd = 2)

Negative controls (IgGs) (cyan) are plotted on the far left of the plot.
Tyrosine Hydroxylase (orange) hovers around background in all segments and might need to be excluded from analysis.
ApoA-I (green) is mostly near-background, but it has meaningfully high signal in a handful of segments.
EpCAM (pink) seems to have lower background than the negative controls. But its long range, and especially the existence of points well above background, suggests this protein has interpretable data.

The highlighted proteins may require further investigation after differential expression analysis but can typically be kept in the study.

proteinOrder <- qcProteinSignalNames(object = proteinData, neg.names = igg.names)

P62 <- which(proteinOrder == "P62")

fig()
rect(xleft = 3.5, xright = P62, ybottom = -6, ytop = 10, density = 2, col = "red", lty = 3)

However, here is example code if you choose to remove them.

In bulk:

proteinOrder <- qcProteinSignalNames(object = proteinData, neg.names = igg.names)
length(proteinOrder)

P62 <- which(proteinOrder == "P62")

fig()
rect(xleft = 3.5, xright = P62, ybottom = -6, ytop = 10, density = 2, col = "red", lty = 3)

#Right most protein where all proteins to the left will get removed
#start at 4 to keep the 3 IgG targets
proteinOrder <- proteinOrder[-c(4:P62)]
length(proteinOrder)

#replot with fewer targets
fig <- qcProteinSignal(object = proteinData[proteinOrder,], neg.names = igg.names)
fig()

Or by specific proteins:

proteinOrder <- qcProteinSignalNames(object = proteinData[proteinOrder,], neg.names = igg.names)
#which proteins to remove from analysis
lowTargets <- c("pan-RAS", "Neprilysin", "Olig2", "P2ry12", "p53", "NY-ESO-1", "INPP4B", "CD31", "Phospho-Alpha-synuclein (S129)", "Bcl-2")
proteinOrder <- proteinOrder[-c(which(proteinOrder %in% lowTargets))]
length(proteinOrder)

fig <- qcProteinSignal(object = proteinData[proteinOrder,], neg.names = igg.names)
fig()

Normalization

For more information on protein normalization please refer to our whitepaper.

After filtering targets, we move onto normalization. There are many types of normalization and we have two built in figure types to help decide what is the best method for the dataset.

The first is a concordance plot of a list of targets, normally the IgGs or HK, colored by ROI factors like tissue or segment type. The upper panels are the concordance plots and the lower panels are the standard deviation of the log2-ratios between the targets. This figure does not show correlations because that calculation is increased with the large range that these values can take (198-165497 in this example). SD(log2(ratios)) measures essentially the same thing but is invariant to that range. However the metrics are inversed, high correlation = low SDs.

Our motivating theory is simple: if several targets all accurately measure signal strength, they should be highly correlated with each other. More precisely, the log-ratios between them should have low SDs.

plotConcordance(object = proteinData, targetList = igg.names, plotFactor = "Tissue")

Above we see good concordance amongst the IgGs, confirming they all can be used. Numbers in the top-right panels show the SD of the log2-ratios between IgGs. Importantly, we do not see a tendency for one IgG to be offset from the others, suggesting there’s no between-slide bias in calculation of background.

The second plot helps show the concordance of normalization factors. The factors are calculated on the IgG and HK targets and the area or nuclei count if provided. The lower panels are the concordance plots and the upper panels are the standard deviation of the log2-ratios between the normalization factors.

normfactors <- computeNormalizationFactors(object = proteinData,
                                           area = "AOI.Size.um2",
                                           nuclei = "Nuclei.Counts")

plotNormFactorConcordance(object = proteinData, plotFactor = "Tissue",
                          normfactors = normfactors)

From this plot we can conclude that:

The IgGs and the housekeepers agree nicely, suggesting that if we normalize using one of them, the other will leave little artifactual signal in the data. If these factors diverged strongly, we would know that normalization with one of them would fail to account for the other, leaving an artifact in the data that must be accounted for in downstream analysis.
Area and nuclei are consistent with each other (SD log2 ratio of 0.61).
Area and nuclei diverge somewhat from the target-based normalization factors Neg geomean and HK geomean. This suggests that signal strength is not purely a result of area/cell count, or alternatively, that the neg and HK geomeans are noisy metrics.
The concordance of Negs/HKs suggests their performance is adequate, leading to the conclusion that area/nuclei are noisy measurements of signal strength in this data.

This divergence of area and nuclei vs IgGs and HKs is common which is why Background or HK normalization is recommended. The area and nuclei plots are good QC metrics to look for outliers or additionally can help you potentially ID some preferential bias in a study design.

After choosing a normalization technique from these plots, we normalize the data. Area and nuclei normalization are not native functions in GeomxTools, if you decide on normalizing by those factors you will need to do that separately. Quantile normalization is also available if HK or background normalization are not preferred.

#HK normalization
proteinData <- normalize(proteinData, norm_method="hk", toElt = "hk_norm")

#Background normalization
proteinData <- normalize(proteinData, norm_method="neg", toElt = "neg_norm")

#Quantile normalization
proteinData <- normalize(proteinData, norm_method="quant", desiredQuantile = .75, toElt = "q_norm")

names(proteinData@assayData)

## [1] "neg_norm" "q_norm"   "exprs"    "hk_norm"

This dataset is now ready for downstream analysis.

sessionInfo()

## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SpatialExperiment_1.10.0    SingleCellExperiment_1.22.0
##  [3] SummarizedExperiment_1.30.0 GenomicRanges_1.52.0       
##  [5] GenomeInfoDb_1.36.0         IRanges_2.34.0             
##  [7] MatrixGenerics_1.12.0       matrixStats_0.63.0         
##  [9] patchwork_1.1.2             SpatialDecon_1.10.0        
## [11] SeuratObject_4.1.3          Seurat_4.3.0               
## [13] ggiraph_0.8.7               EnvStats_2.7.0             
## [15] GeomxTools_3.4.0            NanoStringNCTools_1.8.0    
## [17] ggplot2_3.4.2               S4Vectors_0.38.0           
## [19] Biobase_2.60.0              BiocGenerics_0.46.0        
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.20          splines_4.3.0            
##   [3] later_1.3.0               bitops_1.0-7             
##   [5] tibble_3.2.1              R.oo_1.25.0              
##   [7] cellranger_1.1.0          polyclip_1.10-4          
##   [9] lifecycle_1.0.3           edgeR_3.42.0             
##  [11] globals_0.16.2            lattice_0.21-8           
##  [13] MASS_7.3-59               magrittr_2.0.3           
##  [15] limma_3.56.0              plotly_4.10.1            
##  [17] sass_0.4.5                rmarkdown_2.21           
##  [19] jquerylib_0.1.4           yaml_2.3.7               
##  [21] httpuv_1.6.9              sctransform_0.3.5        
##  [23] sp_1.6-0                  spatstat.sparse_3.0-1    
##  [25] reticulate_1.28           cowplot_1.1.1            
##  [27] pbapply_1.7-0             minqa_1.2.5              
##  [29] RColorBrewer_1.1-3        abind_1.4-5              
##  [31] zlibbioc_1.46.0           Rtsne_0.16               
##  [33] R.cache_0.16.0            purrr_1.0.1              
##  [35] R.utils_2.12.2            RCurl_1.98-1.12          
##  [37] GenomeInfoDbData_1.2.10   ggrepel_0.9.3            
##  [39] irlba_2.3.5.1             listenv_0.9.0            
##  [41] spatstat.utils_3.0-2      pheatmap_1.0.12          
##  [43] goftest_1.2-3             dqrng_0.3.0              
##  [45] spatstat.random_3.1-4     fitdistrplus_1.1-11      
##  [47] parallelly_1.35.0         DelayedMatrixStats_1.22.0
##  [49] DropletUtils_1.20.0       DelayedArray_0.26.0      
##  [51] leiden_0.4.3              codetools_0.2-19         
##  [53] scuttle_1.10.0            tidyselect_1.2.0         
##  [55] farver_2.1.1              lme4_1.1-33              
##  [57] spatstat.explore_3.1-0    jsonlite_1.8.4           
##  [59] ellipsis_0.3.2            progressr_0.13.0         
##  [61] ggridges_0.5.4            survival_3.5-5           
##  [63] systemfonts_1.0.4         tools_4.3.0              
##  [65] ica_1.0-3                 Rcpp_1.0.10              
##  [67] glue_1.6.2                gridExtra_2.3            
##  [69] mgcv_1.8-42               xfun_0.39                
##  [71] ggthemes_4.2.4            HDF5Array_1.28.0         
##  [73] dplyr_1.1.2               withr_2.5.0              
##  [75] numDeriv_2016.8-1.1       fastmap_1.1.1            
##  [77] GGally_2.1.2              rhdf5filters_1.12.0      
##  [79] repmis_0.5                boot_1.3-28.1            
##  [81] fansi_1.0.4               digest_0.6.31            
##  [83] R6_2.5.1                  mime_0.12                
##  [85] colorspace_2.1-0          scattermore_0.8          
##  [87] tensor_1.5                spatstat.data_3.0-1      
##  [89] R.methodsS3_1.8.2         utf8_1.2.3               
##  [91] tidyr_1.3.0               generics_0.1.3           
##  [93] data.table_1.14.8         httr_1.4.5               
##  [95] htmlwidgets_1.6.2         uwot_0.1.14              
##  [97] pkgconfig_2.0.3           gtable_0.3.3             
##  [99] lmtest_0.9-40             XVector_0.40.0           
## [101] htmltools_0.5.5           scales_1.2.1             
## [103] png_0.1-8                 logNormReg_0.5-0         
## [105] knitr_1.42                reshape2_1.4.4           
## [107] rjson_0.2.21              uuid_1.1-0               
## [109] nlme_3.1-162              nloptr_2.0.3             
## [111] rhdf5_2.44.0              cachem_1.0.7             
## [113] zoo_1.8-12                stringr_1.5.0            
## [115] KernSmooth_2.23-20        parallel_4.3.0           
## [117] miniUI_0.1.1.1            vipor_0.4.5              
## [119] ggrastr_1.0.1             pillar_1.9.0             
## [121] grid_4.3.0                reshape_0.8.9            
## [123] vctrs_0.6.2               RANN_2.6.1               
## [125] promises_1.2.0.1          beachmat_2.16.0          
## [127] xtable_1.8-4              cluster_2.1.4            
## [129] beeswarm_0.4.0            evaluate_0.20            
## [131] magick_2.7.4              locfit_1.5-9.7           
## [133] cli_3.6.1                 compiler_4.3.0           
## [135] rlang_1.1.0               crayon_1.5.2             
## [137] future.apply_1.10.0       labeling_0.4.2           
## [139] plyr_1.8.8                ggbeeswarm_0.7.1         
## [141] stringi_1.7.12            BiocParallel_1.34.0      
## [143] viridisLite_0.4.1         deldir_1.0-6             
## [145] lmerTest_3.1-3            munsell_0.5.0            
## [147] Biostrings_2.68.0         lazyeval_0.2.2           
## [149] spatstat.geom_3.1-0       Matrix_1.5-4             
## [151] sparseMatrixStats_1.12.0  future_1.32.0            
## [153] Rhdf5lib_1.22.0           shiny_1.7.4              
## [155] highr_0.10                ROCR_1.0-11              
## [157] igraph_1.4.2              bslib_0.4.2              
## [159] readxl_1.4.2

Protein data using GeomxTools