BgeeDB: an R package for datasets retrieval from Bgee database

Andrea Komljenovic, Julien Roux, Marc Robinson-Rechavi, Frederic Bastian

2016-09-03

BgeeDB is a collection of functions to import data from the Bgee database (http://bgee.org/) directly into R, and to facilitate downstream analyses, such as gene set enrichment test based on expression of genes in anatomical structures. Bgee provides annotated and processed expression data and expression calls from curated wild-type healthy samples, from humans and many animals.

The package retrieves the annotation of RNA-seq or Affymetrix experiments integrated into the Bgee database, and downloads into R the data reprocessed by the Bgee pipeline. It works for all the species in Bgee. The package also allows to run GO-like enrichment analyses based on anatomical terms, where genes are mapped to anatomical terms by expression patterns. This is the same as the TopAnat web-service available at (http://bgee.org/?page=top_anat#/), but with more flexibility in the choice of parameters and developmental stages, and is based on the topGO package.

This package allows: 1. Listing annotation files gene of expression data available in the current version of Bgee database 2. Downloading the processed gene expression data available in the current version of Bgee database 3. Downloading the gene expression calls and annotations and using them to perform TopAnat analyses

How to use BgeeDB package

Running example for Mus musculus

List available species in current version of Bgee

library(BgeeDB)
listBgeeSpecies()
## Last update: Jul11:54
## $`Species in the Database`
##       ID           GENUS      SPECIES     COMMON
## 1   9606            Homo      sapiens      human
## 2  10090             Mus     musculus      mouse
## 3   7955           Danio        rerio  zebrafish
## 4   7227      Drosophila melanogaster   fruitfly
## 5   6239  Caenorhabditis      elegans  c.elegans
## 6   9598             Pan  troglodytes chimpanzee
## 7   9597             Pan     paniscus     bonobo
## 8   9593         Gorilla      gorilla    gorilla
## 9   9544          Macaca      mulatta   macacque
## 10 10116          Rattus   norvegicus        rat
## 11  9913             Bos       taurus        cow
## 12  9823             Sus       scrofa        pig
## 13 13616     Monodelphis    domestica    opossum
## 14  9258 Ornithorhynchus     anatinus   platypus
## 15  9031          Gallus       gallus    chicken
## 16 28377          Anolis carolinensis     anolis
## 17  8364         Xenopus   tropicalis    xenopus
## 18  9600           Pongo     pygmaeus  orangutan
## 19 99883       Tetraodon nigroviridis  tetraodon
## 
## $RNAseq
##  [1] "Anolis_carolinensis"      "Bos_taurus"              
##  [3] "Caenorhabditis_elegans"   "Gallus_gallus"           
##  [5] "Gorilla_gorilla"          "Homo_sapiens"            
##  [7] "Macaca_mulatta"           "Monodelphis_domestica"   
##  [9] "Mus_musculus"             "Ornithorhynchus_anatinus"
## [11] "Pan_paniscus"             "Pan_troglodytes"         
## [13] "Rattus_norvegicus"        "Sus_scrofa"              
## [15] "Xenopus_tropicalis"      
## 
## $Affymetrix
## [1] "Caenorhabditis_elegans"  "Danio_rerio"            
## [3] "Drosophila_melanogaster" "Homo_sapiens"           
## [5] "Mus_musculus"

Choose the species and create a new Bgee species object

From the list above, please choose one species (e.g., “Homo_sapiens”, “Mus_musculus”,…) and platform (“rna_seq” or “affymetrix”).

bgee <- Bgee$new(species = "Mus_musculus", datatype = "rna_seq")

Retrieve annotation for Mus musculus RNA-seq data

The bgee$get_annotation() will output the list of experiments and libraries currently available in Bgee for RNA-seq of Mus musculus. The bgee$get_annotation() loads the annotation in R, but also creates the Mus musculus folder in your current path, where it saves the downloaded annotation locally, so you can use the annotation for later as well.

# check the path where your folder with annotation will be saved. The folder is named after your chosen species.
getwd()
annotation_bgee_mouse <- bgee$get_annotation()
## Downloading annotation files...
## Saved annotation files in Mus_musculus folder.
# head the experiments and libraries
lapply(annotation_bgee_mouse, head)
## $sample_annotation
##   Experiment ID Library ID Library secondary ID Anatomical entity ID
## 1      GSE30617  GSM759583            ERX012363       UBERON:0000948
## 2      GSE30617  GSM759584            ERX012348       UBERON:0000948
## 3      GSE30617  GSM759585            ERX012344       UBERON:0000948
## 4      GSE30617  GSM759586            ERX012362       UBERON:0000948
## 5      GSE30617  GSM759587            ERX012378       UBERON:0000948
## 6      GSE30617  GSM759588            ERX012374       UBERON:0000948
##   Anatomical entity name       Stage ID      Stage name
## 1                  heart MmusDv:0000052 8 weeks (mouse)
## 2                  heart MmusDv:0000052 8 weeks (mouse)
## 3                  heart MmusDv:0000052 8 weeks (mouse)
## 4                  heart MmusDv:0000052 8 weeks (mouse)
## 5                  heart MmusDv:0000052 8 weeks (mouse)
## 6                  heart MmusDv:0000052 8 weeks (mouse)
##                   Platform ID Library type Library orientation Read count
## 1 Illumina Genome Analyzer II       paired          unstranded   31000737
## 2 Illumina Genome Analyzer II       paired          unstranded    8605668
## 3 Illumina Genome Analyzer II       paired          unstranded   30075234
## 4 Illumina Genome Analyzer II       paired          unstranded   29498377
## 5 Illumina Genome Analyzer II       paired          unstranded   27824366
## 6 Illumina Genome Analyzer II       paired          unstranded   31160789
##   Left part mapped read count/Mapped read count
## 1                                      23674754
## 2                                       7099912
## 3                                      24308959
## 4                                      24792064
## 5                                      19632932
## 6                                      21342634
##   Right part mapped read count Min. read length Max. read length
## 1                     23809225               76               76
## 2                      7083684               76               76
## 3                     23884114               76               76
## 4                     24122931               76               76
## 5                     20578789               76               76
## 6                     23529915               76               76
##   All genes percent present Protein coding genes percent present
## 1                     50.73                                31.84
## 2                     41.94                                25.69
## 3                     50.70                                31.66
## 4                     49.12                                30.51
## 5                     50.24                                31.51
## 6                     50.32                                31.72
##   Intergenic regions percent present   Run IDs Discarded run IDs
## 1                               0.98 ERR032227                NA
## 2                               1.17 ERR032228                NA
## 3                               0.95 ERR032229                NA
## 4                               0.87 ERR032238                NA
## 5                               0.97 ERR032230                NA
## 6                               1.04 ERR032231                NA
##   Data source                                             Data source URL
## 1         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759583
## 2         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759584
## 3         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759585
## 4         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759586
## 5         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759587
## 6         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759588
##                                                                                                                                             Bgee normalized data URL
## 1 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 2 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 3 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 4 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 5 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 6 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
##                                                Raw file URL
## 1 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759583
## 2 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759584
## 3 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759585
## 4 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759586
## 5 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759587
## 6 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759588
## 
## $experiment_annotation
##   Experiment ID Library count Condition count Organ count Stage count
## 1      GSE30617            36               6           6           1
## 2      GSE41637            26               9           9           1
## 3      GSE30352            17               6           6           1
## 4      GSE36026            12              12          12           3
## 5      GSE43520             9               5           4           2
## 6      GSE41338             6               5           5           1
##   Data source                                            Data source URL
## 1         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30617
## 2         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41637
## 3         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30352
## 4         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36026
## 5         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43520
## 6         GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41338
##                                                                                                                                             Bgee normalized data URL
## 1 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 2 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE41637.tsv.zip
## 3 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30352.tsv.zip
## 4 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE36026.tsv.zip
## 5 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE43520.tsv.zip
## 6 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE41338.tsv.zip
##                                                             Experiment name
## 1                                          [E-MTAB-599] Mouse Transcriptome
## 2 Evolutionary dynamics of gene and isoform regulation in mammalian tissues
## 3               The evolution of gene expression levels in mammalian organs
## 4                                                  RNA-seq from ENCODE/LICR
## 5  The evolution of lncRNA repertoires and expression patterns in tetrapods
## 6  The evolutionary landscape of alternative splicing in vertebrate species
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Experiment description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Sequencing the transcriptome of DBAxC57BL/6J mice. To study the regulation of transcription, splicing and RNA turnover we have sequenced the transcriptomes of tissues collected DBAxC57BL/6J mice.
## 2                                                                                                                                                                                                                                                                                                                                                      Most mammalian genes produce multiple distinct mRNAs through alternative splicing, but the extent of splicing conservation is not clear. To assess tissue-specific transcriptome variation across mammals, we sequenced cDNA from 9 tissues from 4 mammals and one bird in biological triplicate, at unprecedented depth. We find that while tissue-specific gene expression programs are largely conserved, alternative splicing is well conserved in only a subset of tissues and is frequently lineage-specific. Thousands of novel, lineage-specific and conserved alternative exons were identified; widely conserved alternative exons had signatures of binding by MBNL, PTB, RBFOX, STAR and TIA family splicing factors, implicating them as ancestral mammalian splicing regulators. Our data also indicates that alternative splicing is often used to alter protein phosphorylatability, delimiting the scope of kinase signaling.
## 3 Changes in gene expression are thought to underlie many of the phenotypic differences between species. However, large-scale analyses of gene expression evolution were until recently prevented by technological limitations. Here we report the sequencing of polyadenylated RNA from six organs across ten species that represent all major mammalian lineages (placentals, marsupials and monotremes) and birds (the evolutionary outgroup), with the goal of understanding the dynamics of mammalian transcriptome evolution. We show that the rate of gene expression evolution varies among organs, lineages and chromosomes, owing to differences in selective pressures: transcriptome change was slow in nervous tissues and rapid in testes, slower in rodents than in apes and monotremes, and rapid for the X chromosome right after its formation. Although gene expression evolution in mammals was strongly shaped by purifying selection, we identify numerous potentially selectively driven expression switches, which occurred at different rates across lineages and tissues and which probably contributed to the specific organ biology of various mammals. Our transcriptome data provide a valuable resource for functional and evolutionary analyses of mammalian genomes.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Using RNA-Seq (Mortazavi et al., 2008), high-resolution genome-wide maps of the mouse transcriptome across multiple mouse (C57Bl/6) tissues and primary cells were generated.
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                To broaden our understanding of lncRNA evolution, we used an extensive RNA-seq dataset to establish lncRNA repertoires and homologous gene families in 11 tetrapod species. We analyzed the poly- adenylated transcriptomes of 8 organs (cortex/whole brain without cerebellum, cerebellum, heart, kidney, liver, placenta, ovary and testis) and 11 species (human, chimpanzee, bonobo, gorilla, orangutan, macaque, mouse, opossum, platypus, chicken and the frog Xenopus tropicalis), which shared a common ancestor ~370 millions of years (MY) ago. Our dataset included 47 strand-specific samples, which allowed us to confirm the orientation of gene predictions and to address the evolution of sense-antisense transcripts. See also GSE43721 (Soumillon et al, Cell Reports, 2013) for three strand-specific samples for mouse brain, liver and testis.
## 6                                                                                       How species with similar repertoires of protein coding genes differ so dramatically at the phenotypic level is poorly understood. From comparing the transcriptomes of multiple organs from vertebrate species spanning ~350 million years of evolution, we observe significant differences in alternative splicing complexity between the main vertebrate lineages, with the highest complexity in the primate lineage. Moreover, within as little as six million years, the splicing profiles of physiologically-equivalent organs have diverged to the extent that they are more strongly related to the identity of a species than they are to organ type. Most vertebrate species-specific splicing patterns are governed by the highly variable use of a largely conserved cis-regulatory code. However, a smaller number of pronounced species-dependent splicing changes are predicted to remodel interactions involving factors acting at multiple steps in gene regulation. These events are expected to further contribute to the dramatic diversification of alternative splicing as well as to other gene regulatory changes that contribute to phenotypic differences among vertebrate species.

Download the processed RNA-seq data for Mus musculus

The bgee$get_data() will download read counts and RPKMs for Mus musculus from all available experiments in Bgee database as a list (see below). In case of downloaded data from all experiments for Mus musculus, bgee$get_data() will save the downloaded data in your current folder for later usage.

# download all RPKMs and counts for Mus musculus
data_bgee_mouse <- bgee$get_data()
## The experiment is not defined. Hence taking all rna_seq available for Mus_musculus .
## Downloading expression data...
## Saved expression data file in Mus_musculus folder.
## Unzipping file...
## 
Read 61.7% of 1410444 rows
Read 1410444 rows and 13 (of 13) columns from 0.210 GB file in 00:00:03
## Saving all data in .rds file...
# the number of experiments downloaded from Bgee
length(data_bgee_mouse)
## [1] 7
# see your first experiment
data_bgee_experiment1 <- data_bgee_mouse[[1]]
head(data_bgee_experiment1)
##   Experiment ID Library ID Library type            Gene ID
## 1      GSE30352  GSM752614       single ENSMUSG00000000001
## 2      GSE30352  GSM752614       single ENSMUSG00000000003
## 3      GSE30352  GSM752614       single ENSMUSG00000000028
## 4      GSE30352  GSM752614       single ENSMUSG00000000031
## 5      GSE30352  GSM752614       single ENSMUSG00000000037
## 6      GSE30352  GSM752614       single ENSMUSG00000000049
##   Anatomical entity ID Anatomical entity name       Stage ID
## 1       UBERON:0000955                  brain UBERON:0000113
## 2       UBERON:0000955                  brain UBERON:0000113
## 3       UBERON:0000955                  brain UBERON:0000113
## 4       UBERON:0000955                  brain UBERON:0000113
## 5       UBERON:0000955                  brain UBERON:0000113
## 6       UBERON:0000955                  brain UBERON:0000113
##                  Stage name Read count     RPKM Detection flag
## 1 post-juvenile adult stage        550 11.92921        present
## 2 post-juvenile adult stage          0  0.00000         absent
## 3 post-juvenile adult stage         12  0.53941         absent
## 4 post-juvenile adult stage          2  0.11157         absent
## 5 post-juvenile adult stage         13  0.27897         absent
## 6 post-juvenile adult stage          7  0.74416         absent
##   Detection quality                                  State in Bgee
## 1      high quality                        Used in expression call
## 2      high quality         Result excluded, reason: pre-filtering
## 3      high quality                        Used in expression call
## 4      high quality Result excluded, reason: noExpression conflict
## 5      high quality                     Used in no-expression call
## 6      high quality                        Used in expression call

Alternatively, you can choose to download only one experiment from Bgee, as in the example below. The data is then saved as a .tsv file in your current folder.

# download RPKMs and counts only for GSE30617 for Mus musculus
data_bgee_mouse_gse30617 <- bgee$get_data(experiment.id = "GSE30617")
## Downloading expression data for the experiment GSE30617 
## Downloading expression data...
## Saved expression data file in Mus_musculus folder.
## Unzipping file...
## 
Read 85.1% of 1410444 rows
Read 1410444 rows and 13 (of 13) columns from 0.210 GB file in 00:00:03
## Saving all data in .rds file...

The data from different samples will be listed in rows, one after the other. It is sometimes easier to work with data organized as a matrix, where different columns represent different samples. To transform the data into a matrix with genes in rows and samples in columns, you can use the bgee$format_data() function that separates data according to anatomical term. This function also allows to filter out genes that are not called present in a given sample (giving them NA values).

# only present calls and rpkm values for GSE30617
gene.expression.mouse.rpkm <- bgee$format_data(data_bgee_mouse_gse30617, calltype = "expressed", stats = "rpkm")
## keeping only expressed genes...
## Extracting features...
## Done...
## Extracting pheno...
## Done...
names(gene.expression.mouse.rpkm)
## NULL
head(gene.expression.mouse.rpkm$"Ammon's horn")
## NULL

Download the data to perform GO-like enrichment test for anatomical terms

The loadTopAnatData() function loads a mapping from genes to anatomical structures based on calls of expression in anatomical structures. It also loads the structure of the anatomical ontology and the names of anatomical structures.

# Loading calls of expression based on RNA-seq data only
myTopAnatData <- loadTopAnatData(species=10090, datatype="rna_seq")
## 
## Building URLs to retrieve organ relationships from Bgee.........
##    URL successfully built (http://r.bgee.org/?page=dao&action=org.bgee.model.dao.api.ontologycommon.RelationDAO.getAnatEntityRelations&display_type=tsv&species_list=10090&attr_list=SOURCE_ID&attr_list=TARGET_ID)
##    Submitting URL to Bgee webservice (can be long)
##      Got results from Bgee webservice. Files are written in "/tmp/RtmpvM7P2H/Rbuild28744f1dbf67/BgeeDB/vignettes/"
## 
## Building URLs to retrieve organ names from Bgee.................
##    URL successfully built (http://r.bgee.org/?page=dao&action=org.bgee.model.dao.api.anatdev.AnatEntityDAO.getAnatEntities&display_type=tsv&species_list=10090&attr_list=ID&attr_list=NAME)
##    Submitting URL to Bgee webservice (can be long)
##      Got results from Bgee webservice. Files are written in "/tmp/RtmpvM7P2H/Rbuild28744f1dbf67/BgeeDB/vignettes/"
## 
## Building URLs to retrieve mapping of gene to organs from Bgee...
##    URL successfully built (http://r.bgee.org/?page=dao&action=org.bgee.model.dao.api.expressiondata.ExpressionCallDAO.getExpressionCalls&display_type=tsv&species_list=10090&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&data_type=RNA_SEQ)
##    Submitting URL to Bgee webservice (can be long)
##      Got results from Bgee webservice. Files are written in "/tmp/RtmpvM7P2H/Rbuild28744f1dbf67/BgeeDB/vignettes/"
## 
## Parsing the results...............................................
## 
## Adding BGEE:0 as unique root of all terms of the ontology.........
## 
## Done.
# Loading calls observed in embryonic stages only
myTopAnatData <- loadTopAnatData(species=10090, datatype="rna_seq", stage="UBERON:0000068")
## 
## Warning: an organ relationships file was found in the working directory and will be used as is. Please delete and rerun the function if you want the data to be updated.
## 
## Warning: an organ names file was found in the working directory and will be used as is. Please delete and rerun the function if you want the data to be updated.
## 
## Building URLs to retrieve mapping of gene to organs from Bgee...
##    URL successfully built (http://r.bgee.org/?page=dao&action=org.bgee.model.dao.api.expressiondata.ExpressionCallDAO.getExpressionCalls&display_type=tsv&species_list=10090&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&data_type=RNA_SEQ&stage_id=UBERON:0000068)
##    Submitting URL to Bgee webservice (can be long)
##      Got results from Bgee webservice. Files are written in "/tmp/RtmpvM7P2H/Rbuild28744f1dbf67/BgeeDB/vignettes/"
## 
## Parsing the results...............................................
## 
## Adding BGEE:0 as unique root of all terms of the ontology.........
## 
## Done.
# Look at the data
lapply(myTopAnatData, head)
## $gene2anatomy
## ENSMUSG00000000001 ENSMUSG00000000028 ENSMUSG00000000031 
##       "CL:0002322"       "CL:0002322"       "CL:0002322" 
## ENSMUSG00000000037 ENSMUSG00000000056 ENSMUSG00000000058 
##       "CL:0002322"       "CL:0002322"       "CL:0002322" 
## 
## $organ.relationships
## $organ.relationships$`AEO:0000013`
## [1] "UBERON:0000479"
## 
## $organ.relationships$`AEO:0000127`
## [1] "UBERON:0005423"
## 
## $organ.relationships$`AEO:0000129`
## [1] "UBERON:0005423"
## 
## $organ.relationships$`AEO:0000131`
## [1] "UBERON:0005423"
## 
## $organ.relationships$`AEO:0000153`
## [1] "AEO:0000187"
## 
## $organ.relationships$`AEO:0000162`
## [1] "UBERON:0006984"
## 
## 
## $organ.names
##       organId                               organName
## 1 AEO:0001009           proliferating neuroepithelium
## 2 AEO:0001010         differentiating neuroepithelium
## 3 AEO:0001013                         neuronal column
## 4  CL:0000005         fibroblast neural crest derived
## 5  CL:0000027 smooth muscle cell neural crest derived
## 6  CL:0000041                       mature eosinophil

Note: the results are stored in files (see the pathToData arguments). To save time, if you query again with the exact same parameters, these cache files will be read instead of querying the web-service. So do not delete the files in the working folder if you plan to perform additional queries.

Prepare a topGO object allowing to perform GO-like enrichment test for anatomical terms, for Mus musculus

First we need to prepare a list of genes in the foreground and in the background. The input format is the same as the gene list required to build a topGOdata object in the topGO package: a vector with background genes as names, and 0 or 1 values depending if a gene is in the foreground or not. In this example we will look at genes, annotated with “spermatogenesis” in the Gene Ontology (using the biomaRt package).

source("https://bioconductor.org/biocLite.R")
## Bioconductor version 3.3 (BiocInstaller 1.22.3), ?biocLite for help
biocLite("biomaRt")
## BioC_mirror: https://bioconductor.org
## Using Bioconductor 3.3 (BiocInstaller 1.22.3), R 3.3.1 (2016-06-21).
## Installing package(s) 'biomaRt'
## Old packages: 'EnrichmentBrowser', 'reutils', 'scran'
library(biomaRt)
ensembl <- useMart("ensembl")
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)

# Foreground genes are those with a GO annotation "spermatogenesis"
myGenes <- getBM(attributes= "ensembl_gene_id", filters=c("go_id"), values=list(c("GO:0007283")), mart=ensembl)

# Background are all genes with a GO annotation
universe <- getBM(attributes= "ensembl_gene_id", filters=c("with_go_go"), values=list(c(TRUE)), mart=ensembl)

# Prepares the gene list vector 
geneList <- factor(as.integer(universe[,1] %in% myGenes[,1]))
names(geneList) <- universe[,1]
head(geneList)
## ENSMUSG00000064370 ENSMUSG00000064368 ENSMUSG00000064367 
##                  0                  0                  0 
## ENSMUSG00000064363 ENSMUSG00000065947 ENSMUSG00000064360 
##                  0                  0                  0 
## Levels: 0 1
summary(geneList == 1)
##    Mode   FALSE    TRUE    NA's 
## logical   21239     335       0
# Prepares the topGO object
myTopAnatObject <-  topAnat(myTopAnatData, geneList)
## 
## Checking topAnatData object.........
## 
## Checking gene list..................
## Warning: Some genes in your gene list have no expression data in Bgee, and will not be included in the analysis.  11803  genes in background will be kept.
## 
## Building 'most specific' Terms......  ( 1 Terms found. )
## 
## Build DAG topology..................  ( 13 terms and 15 relations. )
## 
## Annotating nodes (Can be long)......  ( 11803 genes annotated to the nodes. )

Warning: This can be long, especially if the gene list is large, since the anatomical ontology is large and expression calls will be propagated through the whole ontology (e.g., expression in the forebrain will also be counted as expression in parent structures such as the brain, nervous system, etc). Consider running a script in batch mode if you have multiple analyses to do.

Launch an enrichment test for anatomical terms

For this step, see the vignette of the topGO package for more details, as you have to directly use the tests implemented in the topGO package, as shown in this example:

results <- runTest(myTopAnatObject, algorithm = 'classic', statistic = 'fisher')
## 
##           -- Classic Algorithm -- 
## 
##       the algorithm is scoring 13 nontrivial nodes
##       parameters: 
##           test statistic: fisher
# You can also use the topGO decorrelation methods, for example the "weight" method to get less redundant results
results <- runTest(myTopAnatObject, algorithm = 'weight', statistic = 'fisher')
## 
##           -- Weight Algorithm -- 
## 
##       The algorithm is scoring 13 nontrivial nodes
##       parameters: 
##           test statistic: fisher : ratio
## 
##   Level 9:   1 nodes to be scored.
## 
##   Level 8:   1 nodes to be scored.
## 
##   Level 7:   1 nodes to be scored.
## 
##   Level 6:   2 nodes to be scored.
## 
##   Level 5:   1 nodes to be scored.
## 
##   Level 4:   1 nodes to be scored.
## 
##   Level 3:   3 nodes to be scored.
## 
##   Level 2:   2 nodes to be scored.
## 
##   Level 1:   1 nodes to be scored.

Warning: This can be long because of the size of the ontology. Consider running a script in batch mode if you have multiple analyses to do.

Format the table of results after an enrichment test for anatomical terms

We built the makeTable function to filter and format the test results. Results are sorted by p-value, and FDR values are calculated.

# Display results sigificant at a 1% FDR threshold
tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, 0.01)
## 
## Warning: there was no significant term at a FDR threshold of 0.01
# Display all results
tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, 1)
## 
## Building the results table for the 13 significant terms at FDR threshold of 1... Done

Warning: it is debated whether FDR correction is appropriate on enrichment test results, since tests on different terms of the ontologies are not independent. A nice discussion can be found in the vignette of the topGO package.