Example: finding E. coli outstanding genomic zones in response to nickel stress

Hua Zhong and Joe Song

Updated August 9, 2019; Created August 9, 2019

We illustrate the GenomicOZone package with an example. The Escherichia coli transcriptome data set (Gault and Rodrigue 2016) contains transcriptome profiles of E. coli K-12 under nickel stress, with three samples under exposure to nickel and three normal samples.

To run this example, two other packages GEOquery (Davis and Meltzer 2007) for data download and readxl for Excel file reading are required and should have been installed, in addition to GenomicOZone.

require(GenomicOZone)
require(GEOquery)
require(readxl)

The following code chunk downloads a data transcriptome set, prepares genome annotation, and creates parameters before outstanding zone analysis. Specifically, an Escherichia coli transcriptome data set of series number GSE76167 is downloaded from GEO database (Edgar, Domrachev, and Lash 2002). We extracted the RPKM values from six samples including three stressed samples and three wild-type samples. The genome annotation is prepared as a GRanges (Lawrence et al. 2013) object.

# Obtain data matrix from GSE76167 supplementary file
invisible(getGEOSuppFiles("GSE76167"))
data <- read_excel("./GSE76167/GSE76167_GeneFPKM_AllSamples.xlsx")


# Adjust the input data
data.info <- data[,1:5]
data <- data[,-c(1:5)]
data <- data[,substr(colnames(data), 1, 4) == "FPKM"]
data <- data.matrix(data[,c(1,5,6,3,4,2)])
colnames(data) <- c(paste(rep("WT",3), "_", c(1,2,3), sep=""), 
                    paste(rep("Ni",3), "_", c(1,2,3), sep=""))
rownames(data) <- data.info$tracking_id


# Obtain genes
data.genes <- data.info$gene_short_name
data.genes[data.genes == "-"] <- data.info$tracking_id[data.genes == "-"]


# Create colData
colData <- data.frame(Sample_name = colnames(data),
                      Condition = factor(rep(c("WT", "Ni"), each = 3), 
                                         levels = c("WT", "Ni")))


# Create design
design <- ~ Condition


# Create rowData.GRanges
pattern <- "(.[^\\:]*)\\:([0-9]+)\\-([0-9]+)"
matched <- regexec(pattern, as.character(data.info$locus))
values <- regmatches(as.character(data.info$locus), matched)
data.gene.coor <-  data.frame(chr = as.character(sapply(values, function(x){x[[2]]})),
                              start = as.numeric(sapply(values, function(x){x[[3]]})),
                              end = as.numeric(sapply(values, function(x){x[[4]]})))
rownames(data.gene.coor) <- as.character(data.info$tracking_id)
rowData.GRanges <- GRanges(seqnames = data.gene.coor$chr,
                           IRanges(start = data.gene.coor$start, 
                                   end = data.gene.coor$end),
                           Gene.name = data.genes)
names(rowData.GRanges) <- data.info$tracking_id

chr.size <- 4646332
names(chr.size) <- "NC_007779"
seqlevels(rowData.GRanges) <- names(chr.size)
seqlengths(rowData.GRanges) <- chr.size

With the formatted data, parameters, and annotation, we run the outstanding zone analysis as below:

# Create an input object also checking for data format, consistency, and completeness
GOZ.ds <- GOZDataSet(data = data,
                     colData = colData,
                     design = design,
                     rowData.GRanges = rowData.GRanges)


# Run the outstanding zone analysis
GOZ.ds <- GenomicOZone(GOZ.ds)

The following four auxiliary functions extract gene annotation, zone annotation, outstanding zone annotation, and zone expression matrix, respectively:

# Extract gene/zone GRanges object
Gene.GRanges <- extract_genes(GOZ.ds)
head(Gene.GRanges)
#> GRanges object with 6 ranges and 2 metadata columns:
#>          seqnames    ranges strand |   Gene.name        zone
#>             <Rle> <IRanges>  <Rle> | <character>    <factor>
#>   gene0 NC_007779   189-255      * |        thrL NC_007779_1
#>   gene1 NC_007779  336-2799      * |        thrA NC_007779_1
#>   gene2 NC_007779 2800-3733      * |        thrB NC_007779_1
#>   gene3 NC_007779 3733-5020      * |        thrC NC_007779_1
#>   gene4 NC_007779 5233-5530      * |        yaaX NC_007779_1
#>   gene5 NC_007779 5682-6459      * |        yaaA NC_007779_1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome

Zone.GRanges <- extract_zones(GOZ.ds)
head(Zone.GRanges)
#> GRanges object with 6 ranges and 3 metadata columns:
#>                seqnames        ranges strand |        zone
#>                   <Rle>     <IRanges>  <Rle> |    <factor>
#>   NC_007779_1 NC_007779       1-10641      * | NC_007779_1
#>   NC_007779_2 NC_007779   10642-37896      * | NC_007779_2
#>   NC_007779_3 NC_007779   37897-72227      * | NC_007779_3
#>   NC_007779_4 NC_007779   72228-96000      * | NC_007779_4
#>   NC_007779_5 NC_007779  96001-117750      * | NC_007779_5
#>   NC_007779_6 NC_007779 117751-147942      * | NC_007779_6
#>                        p.value.adj          effect.size
#>                          <numeric>            <numeric>
#>   NC_007779_1   0.0512661950077081   0.0746666666666668
#>   NC_007779_2    0.718168991290051 0.000940623162845382
#>   NC_007779_3   0.0143495591325428   0.0408649173955296
#>   NC_007779_4 1.72429037000748e-06    0.186783284742468
#>   NC_007779_5 3.22434278587209e-05    0.152380952380952
#>   NC_007779_6    0.922932027413127 6.61375661375664e-05
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome

# min.effect.size = 0.36 is chosen from the 
#     minumum of top 5% effect size values
OZone.GRanges <- extract_outstanding_zones(
                              GOZ.ds,
                              alpha = 0.05, 
                              min.effect.size = 0.36)
head(OZone.GRanges)
#> GRanges object with 6 ranges and 3 metadata columns:
#>                  seqnames          ranges strand |          zone
#>                     <Rle>       <IRanges>  <Rle> |      <factor>
#>     NC_007779_9 NC_007779   194902-211875      * |   NC_007779_9
#>    NC_007779_20 NC_007779   505826-538369      * |  NC_007779_20
#>    NC_007779_40 NC_007779 1027532-1044650      * |  NC_007779_40
#>   NC_007779_127 NC_007779 3256940-3288667      * | NC_007779_127
#>   NC_007779_140 NC_007779 3542408-3569464      * | NC_007779_140
#>   NC_007779_164 NC_007779 4190532-4193530      * | NC_007779_164
#>                          p.value.adj       effect.size
#>                            <numeric>         <numeric>
#>     NC_007779_9 2.24272378014001e-08 0.369737954353339
#>    NC_007779_20 7.00190140709185e-18 0.367495219885278
#>    NC_007779_40 3.67992619550741e-11 0.368724279835391
#>   NC_007779_127 4.06310885925749e-20 0.384378843788437
#>   NC_007779_140 1.33423830245937e-28 0.591375238095239
#>   NC_007779_164 3.55346011365623e-13 0.771428571428571
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome


Zone.exp.mat <- extract_zone_expression(GOZ.ds)
head(Zone.exp.mat)
#>                  WT_1      WT_2      WT_3      Ni_1      Ni_2      Ni_3
#> NC_007779_1 62254.279 84002.857 73827.669 48028.196 55300.012 49693.918
#> NC_007779_2 14081.429 30395.490 34327.503 31066.653 31408.564 35835.076
#> NC_007779_3  3969.418  4348.122  4297.210  5082.839  6690.894  6168.065
#> NC_007779_4 10769.938 10361.703 11069.645 13117.399 14347.793 16482.038
#> NC_007779_5  5118.158  4878.421  4847.403  4985.081  4368.440  4238.027
#> NC_007779_6  8668.344  8609.166  9161.151  9130.383  8987.596  8840.630

Three types of plot can be generated from the returned object by GenomicOZone(), including genome-wide overview, within-chromosome heatmap, and within-zone expression. The plots are generated using R package ggplot2 (Wickham 2016) and ggbio (Yin, Cook, and Lawrence 2012). The value of min.effect.size = 0.36 is chosen from the minumum of top 5% effect size values. The effect size for ANOVA is calculated by R package sjstats (Lüdecke 2019).

# Genome-wide overview
plot_genome(GOZ.ds, plot.file = "E_coli_genome.pdf",
            plot.width = 15, plot.height = 4,
            alpha = 0.05, min.effect.size = 0.36)
#> Scale for 'x' is already present. Adding another scale for 'x', which
#> will replace the existing scale.
#> Scale for 'x' is already present. Adding another scale for 'x', which
#> will replace the existing scale.
#> NULL
knitr::include_graphics("E_coli_genome.pdf")
# Within-chromosome heatmap
plot_chromosomes(GOZ.ds, plot.file = "E_coli_chromosome.pdf",
                 plot.width = 20, plot.height = 4,
                 alpha = 0.05, min.effect.size = 0.36)
#> png 
#>   2
knitr::include_graphics("E_coli_chromosome.pdf")
# Within-zone expression
plot_zones(GOZ.ds, plot.file = "E_coli_zone.pdf", 
           plot.all.zones = FALSE,
           alpha = 0.05, min.effect.size = 0.36)
#> png 
#>   2
knitr::include_graphics("E_coli_zone.pdf")

References

Davis, Sean, and Paul S Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (Geo) and Bioconductor.” Bioinformatics 23 (14). Oxford University Press:1846–7.

Edgar, Ron, Michael Domrachev, and Alex E Lash. 2002. “Gene Expression Omnibus: NCBI Gene Expression and Hybridization Array Data Repository.” Nucleic Acids Research 30 (1). Oxford University Press:207–10.

Gault, Manon, and Agnès Rodrigue. 2016. “Data Set for Transcriptome Analysis of Escherichia Coli Exposed to Nickel.” Data in Brief 9:314–17.

Lawrence, Michael, Wolfgang Huber, Hervé Pages, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology 9 (8). Public Library of Science:e1003118.

Lüdecke, Daniel. 2019. Sjstats: Statistical Functions for Regression Models (Version 0.17.5). https://doi.org/10.5281/zenodo.1284472.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Yin, Tengfei, Dianne Cook, and Michael Lawrence. 2012. “Ggbio: An R Package for Extending the Grammar of Graphics for Genomic Data.” Genome Biology 13 (8). BioMed Central:R77.