Example: finding E. coli outstanding genomic zones in response to nickel stress

Hua Zhong and Joe Song

Updated August 9, 2019; Created August 9, 2019

We illustrate the GenomicOZone package with an example. The Escherichia coli transcriptome data set (Gault and Rodrigue 2016) contains transcriptome profiles of E. coli K-12 under nickel stress, with three samples under exposure to nickel and three normal samples.

To run this example, two other packages GEOquery (Davis and Meltzer 2007) for data download and readxl for Excel file reading are required and should have been installed, in addition to GenomicOZone.

The following code chunk downloads a data transcriptome set, prepares genome annotation, and creates parameters before outstanding zone analysis. Specifically, an Escherichia coli transcriptome data set of series number GSE76167 is downloaded from GEO database (Edgar, Domrachev, and Lash 2002). We extracted the RPKM values from six samples including three stressed samples and three wild-type samples. The genome annotation is prepared as a GRanges (Lawrence et al. 2013) object.

# Obtain data matrix from GSE76167 supplementary file invisible(getGEOSuppFiles("GSE76167")) data <- read_excel("./GSE76167/GSE76167_GeneFPKM_AllSamples.xlsx") # Adjust the input data data.info <- data[,1:5] data <- data[,-c(1:5)] data <- data[,substr(colnames(data), 1, 4) == "FPKM"] data <- data.matrix(data[,c(1,5,6,3,4,2)]) colnames(data) <- c(paste(rep("WT",3), "_", c(1,2,3), sep=""), paste(rep("Ni",3), "_", c(1,2,3), sep="")) rownames(data) <- data.info$tracking_id # Obtain genes data.genes <- data.info$gene_short_name data.genes[data.genes == "-"] <- data.info$tracking_id[data.genes == "-"] # Create colData colData <- data.frame(Sample_name = colnames(data), Condition = factor(rep(c("WT", "Ni"), each = 3), levels = c("WT", "Ni"))) # Create design design <- ~ Condition # Create rowData.GRanges pattern <- "(.[^\\:]*)\\:([0-9]+)\\-([0-9]+)" matched <- regexec(pattern, as.character(data.info$locus)) values <- regmatches(as.character(data.info$locus), matched) data.gene.coor <- data.frame(chr = as.character(sapply(values, function(x){x[[2]]})), start = as.numeric(sapply(values, function(x){x[[3]]})), end = as.numeric(sapply(values, function(x){x[[4]]}))) rownames(data.gene.coor) <- as.character(data.info$tracking_id) rowData.GRanges <- GRanges(seqnames = data.gene.coor$chr, IRanges(start = data.gene.coor$start, end = data.gene.coor$end), Gene.name = data.genes) names(rowData.GRanges) <- data.info$tracking_id chr.size <- 4646332 names(chr.size) <- "NC_007779" seqlevels(rowData.GRanges) <- names(chr.size) seqlengths(rowData.GRanges) <- chr.size

With the formatted data, parameters, and annotation, we run the outstanding zone analysis as below:

# Create an input object also checking for data format, consistency, and completeness GOZ.ds <- GOZDataSet(data = data, colData = colData, design = design, rowData.GRanges = rowData.GRanges) # Run the outstanding zone analysis GOZ.ds <- GenomicOZone(GOZ.ds)

The following four auxiliary functions extract gene annotation, zone annotation, outstanding zone annotation, and zone expression matrix, respectively:

# Extract gene/zone GRanges object Gene.GRanges <- extract_genes(GOZ.ds) head(Gene.GRanges) #> GRanges object with 6 ranges and 2 metadata columns: #> seqnames ranges strand | Gene.name zone #> <Rle> <IRanges> <Rle> | <character> <factor> #> gene0 NC_007779 189-255 * | thrL NC_007779_1 #> gene1 NC_007779 336-2799 * | thrA NC_007779_1 #> gene2 NC_007779 2800-3733 * | thrB NC_007779_1 #> gene3 NC_007779 3733-5020 * | thrC NC_007779_1 #> gene4 NC_007779 5233-5530 * | yaaX NC_007779_1 #> gene5 NC_007779 5682-6459 * | yaaA NC_007779_1 #> ------- #> seqinfo: 1 sequence from an unspecified genome Zone.GRanges <- extract_zones(GOZ.ds) head(Zone.GRanges) #> GRanges object with 6 ranges and 3 metadata columns: #> seqnames ranges strand | zone #> <Rle> <IRanges> <Rle> | <factor> #> NC_007779_1 NC_007779 1-10641 * | NC_007779_1 #> NC_007779_2 NC_007779 10642-37896 * | NC_007779_2 #> NC_007779_3 NC_007779 37897-72227 * | NC_007779_3 #> NC_007779_4 NC_007779 72228-96000 * | NC_007779_4 #> NC_007779_5 NC_007779 96001-117750 * | NC_007779_5 #> NC_007779_6 NC_007779 117751-147942 * | NC_007779_6 #> p.value.adj effect.size #> <numeric> <numeric> #> NC_007779_1 0.0512661950077081 0.0746666666666668 #> NC_007779_2 0.718168991290051 0.000940623162845382 #> NC_007779_3 0.0143495591325428 0.0408649173955296 #> NC_007779_4 1.72429037000748e-06 0.186783284742468 #> NC_007779_5 3.22434278587209e-05 0.152380952380952 #> NC_007779_6 0.922932027413127 6.61375661375664e-05 #> ------- #> seqinfo: 1 sequence from an unspecified genome # min.effect.size = 0.36 is chosen from the # minumum of top 5% effect size values OZone.GRanges <- extract_outstanding_zones( GOZ.ds, alpha = 0.05, min.effect.size = 0.36) head(OZone.GRanges) #> GRanges object with 6 ranges and 3 metadata columns: #> seqnames ranges strand | zone #> <Rle> <IRanges> <Rle> | <factor> #> NC_007779_9 NC_007779 194902-211875 * | NC_007779_9 #> NC_007779_20 NC_007779 505826-538369 * | NC_007779_20 #> NC_007779_40 NC_007779 1027532-1044650 * | NC_007779_40 #> NC_007779_127 NC_007779 3256940-3288667 * | NC_007779_127 #> NC_007779_140 NC_007779 3542408-3569464 * | NC_007779_140 #> NC_007779_164 NC_007779 4190532-4193530 * | NC_007779_164 #> p.value.adj effect.size #> <numeric> <numeric> #> NC_007779_9 2.24272378014001e-08 0.369737954353339 #> NC_007779_20 7.00190140709185e-18 0.367495219885278 #> NC_007779_40 3.67992619550741e-11 0.368724279835391 #> NC_007779_127 4.06310885925749e-20 0.384378843788437 #> NC_007779_140 1.33423830245937e-28 0.591375238095239 #> NC_007779_164 3.55346011365623e-13 0.771428571428571 #> ------- #> seqinfo: 1 sequence from an unspecified genome Zone.exp.mat <- extract_zone_expression(GOZ.ds) head(Zone.exp.mat) #> WT_1 WT_2 WT_3 Ni_1 Ni_2 Ni_3 #> NC_007779_1 62254.279 84002.857 73827.669 48028.196 55300.012 49693.918 #> NC_007779_2 14081.429 30395.490 34327.503 31066.653 31408.564 35835.076 #> NC_007779_3 3969.418 4348.122 4297.210 5082.839 6690.894 6168.065 #> NC_007779_4 10769.938 10361.703 11069.645 13117.399 14347.793 16482.038 #> NC_007779_5 5118.158 4878.421 4847.403 4985.081 4368.440 4238.027 #> NC_007779_6 8668.344 8609.166 9161.151 9130.383 8987.596 8840.630

Three types of plot can be generated from the returned object by GenomicOZone(), including genome-wide overview, within-chromosome heatmap, and within-zone expression. The plots are generated using R package ggplot2 (Wickham 2016) and ggbio (Yin, Cook, and Lawrence 2012). The value of min.effect.size = 0.36 is chosen from the minumum of top 5% effect size values. The effect size for ANOVA is calculated by R package sjstats (Lüdecke 2019).

# Genome-wide overview plot_genome(GOZ.ds, plot.file = "E_coli_genome.pdf", plot.width = 15, plot.height = 4, alpha = 0.05, min.effect.size = 0.36) #> Scale for 'x' is already present. Adding another scale for 'x', which #> will replace the existing scale. #> Scale for 'x' is already present. Adding another scale for 'x', which #> will replace the existing scale. #> NULL knitr::include_graphics("E_coli_genome.pdf")

# Within-chromosome heatmap plot_chromosomes(GOZ.ds, plot.file = "E_coli_chromosome.pdf", plot.width = 20, plot.height = 4, alpha = 0.05, min.effect.size = 0.36) #> png #> 2 knitr::include_graphics("E_coli_chromosome.pdf")

# Within-zone expression plot_zones(GOZ.ds, plot.file = "E_coli_zone.pdf", plot.all.zones = FALSE, alpha = 0.05, min.effect.size = 0.36) #> png #> 2 knitr::include_graphics("E_coli_zone.pdf")

References

Davis, Sean, and Paul S Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (Geo) and Bioconductor.” Bioinformatics 23 (14). Oxford University Press:1846–7.

Edgar, Ron, Michael Domrachev, and Alex E Lash. 2002. “Gene Expression Omnibus: NCBI Gene Expression and Hybridization Array Data Repository.” Nucleic Acids Research 30 (1). Oxford University Press:207–10.

Gault, Manon, and Agnès Rodrigue. 2016. “Data Set for Transcriptome Analysis of Escherichia Coli Exposed to Nickel.” Data in Brief 9:314–17.

Lawrence, Michael, Wolfgang Huber, Hervé Pages, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology 9 (8). Public Library of Science:e1003118.

Lüdecke, Daniel. 2019. Sjstats: Statistical Functions for Regression Models (Version 0.17.5). https://doi.org/10.5281/zenodo.1284472.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Yin, Tengfei, Dianne Cook, and Michael Lawrence. 2012. “Ggbio: An R Package for Extending the Grammar of Graphics for Genomic Data.” Genome Biology 13 (8). BioMed Central:R77.