1 Overview of GEOmetadb

The NCBI Gene Expression Omnibus (GEO) represents the largest repository of microarray data in existence. One difficulty in dealing with GEO is finding the microarray data that is of interest. As part of the NCBI Entrez search system, GEO can be searched online via web pages or using NCBI Eutils. However, the web search is not as full-featured as it could be, particularly for programmatic access. NCBI Eutils offers another option for finding data within the vast stores of GEO, but it is cumbersome to use, often requiring multiple complicated Eutils calls to get at the relevant information. We have found it absolutely critical to have ready access not just to the microarray data, but to the metadata describing the microarray experiments. To this end we have created GEOmetadb.

1.1 Citing GEOmetadb

If you use GEOmetadb or the associated SQLite database, please cite as follows:

citation("GEOmetadb")
## 
## Please cite the following if utilizing the GEOmetadb software:
## 
##   Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y. GEOmetadb:
##   powerful alternative search engine for the Gene Expression
##   Omnibus. Bioinformatics. 2008 Dec 1;24(23):2798-800. doi:
##   10.1093/bioinformatics/btn520. Epub 2008 Oct 7. PubMed PMID:
##   18842599; PubMed Central PMCID: PMC2639278.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     author = {Yuelin Zhu and Sean Davis and Robert Stephens and Paul S. Meltzer and Yidong Chen},
##     title = {GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus.},
##     journal = {Bioinformatics (Oxford, England)},
##     year = {2008},
##     month = {Dec},
##     day = {01},
##     volume = {24},
##     number = {23},
##     pages = {2798--2800},
##     abstract = {The NCBI Gene Expression Omnibus (GEO) represents the largest public repository of microarray data. However, finding data in GEO can be challenging. We have developed GEOmetadb in an attempt to make querying the GEO metadata both easier and more powerful. All GEO metadata records as well as the relationships between them are parsed and stored in a local MySQL database. A powerful, flexible web search interface with several convenient utilities provides query capabilities not available via NCBI tools. In addition, a Bioconductor package, GEOmetadb that utilizes a SQLite export of the entire GEOmetadb database is also available, rendering the entire GEO database accessible with full power of SQL-based queries from within R.},
##     issn = {1367-4811},
##     doi = {10.1093/bioinformatics/btn520},
##     url = {http://www.ncbi.nlm.nih.gov/pubmed/18842599},
##     language = {eng},
##   }

1.2 What is GEOmetadb?

The GEOmetadb package is an attempt to make querying the metadata describing microarray experiments, platforms, and datasets both easier and more powerful. At the heart of GEOmetadb is a SQLite database that stores nearly all the metadata associated with all GEO data types including GEO samples (GSM), GEO platforms (GPL), GEO data series (GSE), and curated GEO datasets (GDS), as well as the relationships between these data types. This database is generated by our server by parsing all the records in GEO and needs to be downloaded via a simple helper function to the user’s local machine before GEOmetadb is useful. Once this is done, the entire GEO database is accessible with simple SQL-based queries. With the GEOmetadb database, queries that are simply not possible using NCBI tools or web pages are often quite simple.

The relationships between the tables in the GEOmetadb SQLite database can be seen in the following entity-relationship diagram.

1.3 Conversion capabilities

A very typical problem for large-scale consumers of GEO data is to determine the relationships between various GEO accession types. As examples, consider the following questions:

  • What samples are associated with GEO platform “GPL96”, which represents the Affymetrix hgu133a array?}
  • What GEO Series were performed using GPL96?
  • What samples are in my favorite three GEO Series records?
  • How many samples are associated with the ten most popular GEO platforms?

Because these types of questions are common, GEOmetadb contains the function geoConvert that addresses these questions directly and efficiently.

1.4 What GEOmetadb is not

We have faithfully parsed and maintained in GEO when creating GEOmetadb. This means that limitations inherent to GEO are also inherent in GEOmetadb. We have made no attempt to curate, semantically recode, or otherwise “clean up”" GEO; to do so would require significant resources, which we do not have.

GEOmetadb does not contain any microarray data. For access to microarray data from within R/Bioconductor, please look at the GEOquery package. In fact, we would expect that many users will find that the combination of GEOmetadb and GEOquery is quite powerful.

2 Getting Started

Once GEOmetadb is installed (see the Bioconductor website for full installation instructions), we are ready to begin.

2.1 Getting the GEOmetadb database

This package does not come with a pre-installed version of the database. This has the advantage that the user will get the most up-to-date version of the database to start; the database can be re-downloaded using the same command as often as desired. First, load the library.

library(GEOmetadb)

The download and uncompress steps are done automatically with a single command, getSQLiteFile.

if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
## Unzipping...
## Metadata associate with downloaded file:
##                 name               value
## 1     schema version                 1.0
## 2 creation timestamp 2015-04-11 12:35:51
## [1] "/tmp/Rtmpfyn3Gv/Rbuild1f9b25a8e235/GEOmetadb/vignettes/GEOmetadb.sqlite"

The default storage location is in the current working directory and the default filename is “GEOmetadb.sqlite”; it is best to leave the name unchanged unless there is a pressing reason to change it.

Since this SQLite file is of key importance in GEOmetadb, it is perhaps of some interest to know some details about the file itself.

file.info('GEOmetadb.sqlite')
##                        size isdir mode
## GEOmetadb.sqlite 3623028736 FALSE  644
##                                mtime
## GEOmetadb.sqlite 2015-04-16 19:36:23
##                                ctime
## GEOmetadb.sqlite 2015-04-16 19:36:23
##                                atime uid gid     uname
## GEOmetadb.sqlite 2015-04-16 19:36:23 691 692 biocbuild
##                       grname
## GEOmetadb.sqlite phs_compbio

Now, the SQLite file is available for connection. The standard DBI functionality as implemented in RSQLite function dbConnect makes the connection to the database. The dbDisconnect function disconnects the connection.

con <- dbConnect(SQLite(),'GEOmetadb.sqlite')
dbDisconnect(con)
## [1] TRUE

The variable con is an RSQLite connection object.

2.2 A word about SQL

The Structured Query Language, or SQL, is a very powerful and standard way of working with relational data. GEO is composed of several data types, all of which are related to each other; in fact, NCBI uses a relational SQL database for metadata storage and querying. SQL databases and SQL itself are designed specifically to work efficiently with just such data. While the goal of many programming projects and programmers is to hide the details of SQL from the user, we are of the opinion that such efforts may be counterproductive, particularly with complex data and the need for ad hoc queries, both of which are characteristics with GEO metadata. We have taken the view that exposing the power of SQL will enable users to maximally utilize the vast data repository that is GEO. We understand that many users are not accustomed to working with SQL and, therefore, have devoted a large section of the vignette to working examples. Our goal is not to teach SQL, so a quick tutorial of SQL is likely to be beneficial to those who have not used it before. Many such tutorials are available online and can be completed in 30 minutes or less.

3 Examples

3.1 Interacting with the database

The functionality covered in this section is covered in much more detail in the DBI and RSQLite package documentation. We cover enough here only to be useful.

Again, we connect to the database.

con <- dbConnect(SQLite(),'GEOmetadb.sqlite')

The dbListTables function lists all the tables in the SQLite database handled by the connection object con.

geo_tables <- dbListTables(con)
geo_tables
##  [1] "gds"               "gds_subset"       
##  [3] "geoConvert"        "geodb_column_desc"
##  [5] "gpl"               "gse"              
##  [7] "gse_gpl"           "gse_gsm"          
##  [9] "gsm"               "metaInfo"         
## [11] "sMatrix"

There is also the dbListFields function that can list database fields associated with a table.

dbListFields(con,'gse')
##  [1] "ID"                   "title"               
##  [3] "gse"                  "status"              
##  [5] "submission_date"      "last_update_date"    
##  [7] "pubmed_id"            "summary"             
##  [9] "type"                 "contributor"         
## [11] "web_link"             "overall_design"      
## [13] "repeats"              "repeats_sample_list" 
## [15] "variable"             "variable_description"
## [17] "contact"              "supplementary_file"

Sometimes it is useful to get the actual SQL schema associated with a table. Here, we get the table schema for the gpl table.

dbGetQuery(con,'PRAGMA TABLE_INFO(gpl)')
##    cid                 name type notnull dflt_value pk
## 1    0                   ID REAL       0       <NA>  0
## 2    1                title TEXT       0       <NA>  0
## 3    2                  gpl TEXT       0       <NA>  0
## 4    3               status TEXT       0       <NA>  0
## 5    4      submission_date TEXT       0       <NA>  0
## 6    5     last_update_date TEXT       0       <NA>  0
## 7    6           technology TEXT       0       <NA>  0
## 8    7         distribution TEXT       0       <NA>  0
## 9    8             organism TEXT       0       <NA>  0
## 10   9         manufacturer TEXT       0       <NA>  0
## 11  10 manufacture_protocol TEXT       0       <NA>  0
## 12  11              coating TEXT       0       <NA>  0
## 13  12       catalog_number TEXT       0       <NA>  0
## 14  13              support TEXT       0       <NA>  0
## 15  14          description TEXT       0       <NA>  0
## 16  15             web_link TEXT       0       <NA>  0
## 17  16              contact TEXT       0       <NA>  0
## 18  17       data_row_count REAL       0       <NA>  0
## 19  18   supplementary_file TEXT       0       <NA>  0
## 20  19         bioc_package TEXT       0       <NA>  0

3.2 Writing SQL queries and getting results

Select 5 records from the gse table and show the first 7 columns.

rs <- dbGetQuery(con,'select * from gse limit 5')
rs[,1:7]
##   ID
## 1  1
## 2  2
## 3  3
## 4  4
## 5  5
##                                                      title
## 1                                     NHGRI_Melanoma_class
## 2                                   Cerebellar development
## 3             Renal Cell Carcinoma Differential Expression
## 4     Diurnal and Circadian-Regulated Genes in Arabidopsis
## 5 Global profile of germline gene expression in C. elegans
##    gse                status submission_date
## 1 GSE1 Public on Jan 22 2001      2001-01-22
## 2 GSE2 Public on Apr 26 2001      2001-04-19
## 3 GSE3 Public on Jul 19 2001      2001-07-19
## 4 GSE4 Public on Jul 20 2001      2001-07-20
## 5 GSE5 Public on Jul 24 2001      2001-07-24
##   last_update_date pubmed_id
## 1       2012-02-23  10952317
## 2       2012-02-23        NA
## 3       2012-02-23  11691851
## 4       2012-02-23  11158533
## 5       2012-02-23  11030340

Get the GEO series accession and title from GEO series that were submitted by “Sean Davis”. The “%” sign is used in combination with the “like” operator to do a “wildcard” search for the name “Sean Davis” with any number of characters before or after or between “Sean” and “Davis”.

rs <- dbGetQuery(con,paste("select gse,title from gse where",
                           "contributor like '%Sean%Davis%'",sep=" "))
rs
##         gse
## 1   GSE2553
## 2   GSE4406
## 3   GSE5357
## 4   GSE7376
## 5   GSE7882
## 6   GSE8486
## 7   GSE9328
## 8  GSE14543
## 9  GSE15621
## 10 GSE16087
## 11 GSE16088
## 12 GSE16091
## 13 GSE16102
## 14 GSE18544
## 15 GSE19063
## 16 GSE20016
## 17 GSE22520
## 18 GSE25127
## 19 GSE25164
## 20 GSE29428
## 21 GSE41795
## 22  GSE5481
## 23 GSE33249
##                                                                                                                      title
## 1                                                                                                      NHGRI_Sarcoma_Baird
## 2                                           Gene expression profiling of CD4+ T-cells and GM6990 lymphoblastoid cell lines
## 3                                                                                                    NHGRI Menin ChIP-Chip
## 4                                                                Detection of novel amplification units in prostate cancer
## 5                          Gene Expression and Comparative Genomic Hybridization of Ductal Carcinoma In Situ of the Breast
## 6                                                                Whole genome DNAse hypersensitivity in human CD4+ T-cells
## 7                                                                                              ATF2 knockout in papillomas
## 8                                                                              A molecular function map of Ewing’s Sarcoma
## 9                                                                  Acute Lymphocytic Leukemia versus associated xenografts
## 10                                                                         Gene expression profiles of canine osteosarcoma
## 11                                                                          Gene expression profiles of human osteosarcoma
## 12                                                                    Gene expression profiles of human osteosarcoma, set2
## 13                                                               Gene expression profiles of canine and human osteosarcoma
## 14     Expression Profiling of a Mouse Xenograft Model of “Triple-Negative” Breast Cancer Brain Metastases With Vorinostat
## 15                                                          Genome-wide map of PAX3-FKHR binding sites in rhabdomyosarcoma
## 16 Analyses of Human Brain Metastases of Breast Cancer Reveal the Association between HK2 Up-Regulation and Poor Prognosis
## 17                                             Mouse Models of Alveolar/Embryonal Rhabdomyosarcoma & Spindle Cell Sarcomas
## 18                                                                       Ewing Sarcoma cell lines treated with mithramycin
## 19                                                                                         UV effects in mouse melanocytes
## 20       A methyl-deviator epigenotype of estrogen-receptor-positive breast carcinoma is associated with malignant biology
## 21                               Accelerated high-yield generation of limb-innervating motor neurons from human stem cells
## 22                                                                                             BRAF siRNA NHGRI_Liang_BRAF
## 23                    High resolution comparative genomic hybridisation analysis of ductal carcinoma in situ of the breast

As another example, GEOmetadb can find all samples on GPL96 (Affymetrix hgu133a) that have .CEL files available for download.

rs <- dbGetQuery(con,paste("select gsm,supplementary_file",
                           "from gsm where gpl='GPL96'",
                           "and supplementary_file like '%CEL.gz'"))
dim(rs)
## [1] 26075     2

But why limit to only GPL96? Why not look for all Affymetrix arrays that have .CEL files? And list those with their associated GPL information, as well as the Bioconductor annotation package name?

rs <- dbGetQuery(con,paste("select gpl.bioc_package,gsm.gpl,",
                           "gsm,gsm.supplementary_file",
                           "from gsm join gpl on gsm.gpl=gpl.gpl",
                           "where gpl.manufacturer='Affymetrix'",
                           "and gsm.supplementary_file like '%CEL.gz' "))
rs[1:5,]
##   bioc_package   gpl    gsm
## 1       hu6800 GPL80 GSM575
## 2       hu6800 GPL80 GSM576
## 3       hu6800 GPL80 GSM577
## 4       hu6800 GPL80 GSM578
## 5       hu6800 GPL80 GSM579
##                                                         supplementary_file
## 1 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM575/suppl/GSM575.cel.gz
## 2 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM576/suppl/GSM576.cel.gz
## 3 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM577/suppl/GSM577.cel.gz
## 4 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM578/suppl/GSM578.cel.gz
## 5 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM579/suppl/GSM579.cel.gz

Of course, we can combine programming and data access. A simple sapply example shows how to query each of the tables for number of records.

getTableCounts <- function(tableName,conn) {
  sql <- sprintf("select count(*) from %s",tableName)
  return(dbGetQuery(conn,sql)[1,1])
}
do.call(rbind,sapply(geo_tables,getTableCounts,con,simplify=FALSE))
##                      [,1]
## gds                  3867
## gds_subset          23357
## geoConvert        6419200
## geodb_column_desc     104
## gpl                 14221
## gse                 56657
## gse_gpl             68772
## gse_gsm           1668212
## gsm               1366560
## metaInfo                2
## sMatrix             49161

3.3 Conversion of GEO entity types

Large-scale consumers of GEO data might want to convert GEO entity type from one to others, e.g. finding all GSM and GSE associated with ‘GPL96’. Function goeConvert does the conversion with a very fast mapping between entity types.

Covert ‘GPL96’ to other possible types in the GEOmetadb.sqlite.

conversion <- geoConvert('GPL96')

Check what GEO types and how many entities in each type in the conversion.

lapply(conversion, dim)
## $gse
## [1] 1024    2
## 
## $gsm
## [1] 35554     2
## 
## $gds
## [1] 359   2
## 
## $sMatrix
## [1] 1015    2
conversion$gse[1:5,]
##   from_acc   to_acc
## 1    GPL96  GSE1000
## 2    GPL96 GSE10024
## 3    GPL96 GSE10043
## 4    GPL96 GSE10072
## 5    GPL96 GSE10089
conversion$gsm[1:5,]
##   from_acc     to_acc
## 1    GPL96 GSM1003058
## 2    GPL96 GSM1003059
## 3    GPL96 GSM1003060
## 4    GPL96 GSM1003061
## 5    GPL96 GSM1003062
conversion$gds[1:5,]
##   from_acc  to_acc
## 1    GPL96 GDS1023
## 2    GPL96 GDS1036
## 3    GPL96 GDS1050
## 4    GPL96 GDS1062
## 5    GPL96 GDS1063
conversion$sMatrix[1:5,]
##   from_acc                        to_acc
## 1    GPL96  GSE1000_series_matrix.txt.gz
## 2    GPL96 GSE10024_series_matrix.txt.gz
## 3    GPL96 GSE10043_series_matrix.txt.gz
## 4    GPL96 GSE10072_series_matrix.txt.gz
## 5    GPL96 GSE10089_series_matrix.txt.gz

3.4 Mappings between GPL and Bioconductor microarry annotation packages

The function getBiocPlatformMap is to get GPL information of a given list of Bioconductor microarry annotation packages. Note currently the GEOmetadb does not contains all the mappings, but we are trying to construct a relative complete list.

Get GPL information of ‘hgu133a’ and ‘hgu95av2’:

getBiocPlatformMap(con, bioc=c('hgu133a','hgu95av2'))
##                                                     title
## 1          [HG-U133A] Affymetrix Human Genome U133A Array
## 2            [HG_U95A] Affymetrix Human Genome U95A Array
## 3 [HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array
##       gpl bioc_package manufacturer     organism
## 1   GPL96      hgu133a   Affymetrix Homo sapiens
## 2   GPL91     hgu95av2   Affymetrix Homo sapiens
## 3 GPL8300     hgu95av2   Affymetrix Homo sapiens
##   data_row_count
## 1          22283
## 2          12626
## 3          12625

3.5 More advanced queries

Now, for something a bit more complicated, we would like to find all the human breast cancer-related Affymetrix gene expression GEO series.

sql <- paste("SELECT DISTINCT gse.title,gse.gse",
             "FROM",
             "  gsm JOIN gse_gsm ON gsm.gsm=gse_gsm.gsm",
             "  JOIN gse ON gse_gsm.gse=gse.gse",
             "  JOIN gse_gpl ON gse_gpl.gse=gse.gse",
             "  JOIN gpl ON gse_gpl.gpl=gpl.gpl",
             "WHERE",
             "  gsm.molecule_ch1 like '%total RNA%' AND",
             "  gse.title LIKE '%breast cancer%' AND",
             "  gpl.organism LIKE '%Homo sapiens%'",sep=" ")
rs <- dbGetQuery(con,sql)
dim(rs)
## [1] 781   2
print(rs[1:5,],right=FALSE)
##   title                                                                     
## 1 Human mammary epithelium and breast cancer                                
## 2 Cell-type specific responses to chemotherapeutics in breast cancer        
## 3 E2 and SERM Treatment of MCF-7 Breast Cancer Cells                        
## 4 Estradiol Treated Breast Cancer Cells Expressing Mutant Estrogen Receptors
## 5 Breast Cancer Cell Line Experiment                                        
##   gse    
## 1 GSE53  
## 2 GSE763 
## 3 GSE848 
## 4 GSE1045
## 5 GSE1299

3.6 A wordcloud of GSE titles

Since we have all the data from NCBI GEO available, why not look around a bit. The next code block creates a wordcloud using the tm and wordcloud packages based on the titles of all GSE records.

library(tm)
library(wordcloud)
gseTitles = dbGetQuery(con,"select title from gse")
corp = VCorpus(VectorSource(gseTitles))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, function(x)removeWords(x,stopwords()))
term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
v <- sort(rowSums(term.matrix),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
n = 100
wordcloud(d[1:n,]$word,d[1:n,]$freq)

3.7 Using dplyr with GEOmetadb

The dplyr package provides a very streamlined functional approach to accessing SQL databases.

Load the dplyr library and connect to the GEOmetadb.sqlite database. The dplyr package is written around manipulation of tabular data, so we will work with only one GEOmetadb table, the gse table. Instead of using SQL, we can use dplyr verbs to get information from tables of interest.

library(dplyr)
db = src_sqlite('GEOmetadb.sqlite')
gse = tbl(db,'gse')
filter(gse,gse=='GSE2553')
## Source: sqlite 3.8.6 [GEOmetadb.sqlite]
## From: gse [1 x 18]
## Filter: gse == "GSE2553" 
## 
##     ID               title     gse
## 1 2151 NHGRI_Sarcoma_Baird GSE2553
## Variables not shown: status (chr), submission_date
##   (chr), last_update_date (chr), pubmed_id (int),
##   summary (chr), type (chr), contributor (chr),
##   web_link (chr), overall_design (chr), repeats (chr),
##   repeats_sample_list (chr), variable (chr),
##   variable_description (chr), contact (chr),
##   supplementary_file (chr)

3.8 Cleanup

Finally, it is probably a good idea to close the connection. Please see the DBI package for details on this point.

dbDisconnect(con)
## [1] TRUE

If you want to remove old GEOmetadb.sqlite file before retrieve a new version from the server, execute the following codes:

file.remove('GEOmetadb.sqlite')

4 sessionInfo()

sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8      
##  [2] LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8   
##  [6] LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8      
##  [8] LC_NAME=C                 
##  [9] LC_ADDRESS=C              
## [10] LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8
## [12] LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils    
## [6] datasets  methods   base     
## 
## other attached packages:
##  [1] dplyr_0.4.1         wordcloud_2.5      
##  [3] RColorBrewer_1.1-2  tm_0.6             
##  [5] NLP_0.1-6           GEOmetadb_1.28.0   
##  [7] RSQLite_1.0.0       DBI_0.3.1          
##  [9] GEOquery_2.34.0     Biobase_2.28.0     
## [11] BiocGenerics_0.14.0 knitr_1.9          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.5      magrittr_1.5     R6_2.0.1        
##  [4] stringr_0.6.2    tools_3.2.0      htmltools_0.2.6 
##  [7] lazyeval_0.1.10  assertthat_0.1   yaml_2.1.13     
## [10] digest_0.6.8     formatR_1.1      bitops_1.0-6    
## [13] codetools_0.2-11 RCurl_1.95-4.5   evaluate_0.6    
## [16] slam_0.1-32      rmarkdown_0.5.1  XML_3.98-1.1