Contents

1 Introduction

In recent years a wealth of biological data has become available in public data repositories. Easy access to these valuable data resources and firm integration with data analysis is needed for comprehensive bioinformatics data analysis. The biomaRt package, provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, Uniprot and HapMap. These major databases give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R.

2 Selecting a BioMart database and dataset

Every analysis with biomaRt starts with selecting a BioMart database to use. A first step is to check which BioMart web services are available. The function listMarts() will display all available BioMart web services

library("biomaRt")
listMarts()
##                biomart                version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 101
## 2   ENSEMBL_MART_MOUSE      Mouse strains 101
## 3     ENSEMBL_MART_SNP  Ensembl Variation 101
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 101

The useMart() function can now be used to connect to a specified BioMart database, this must be a valid name given by listMarts(). In the next example we choose to query the Ensembl BioMart database.

ensembl <- useMart("ensembl")

BioMart databases can contain several datasets, for Ensembl every species is a different dataset. In a next step we look at which datasets are available in the selected BioMart by using the function listDatasets(). Note: use the function head() to display only the first 5 entries as the complete list has 203 entries.

datasets <- listDatasets(ensembl)
head(datasets)
##                      dataset                                  description                version
## 1   acalliptera_gene_ensembl             Eastern happy genes (fAstCal1.2)             fAstCal1.2
## 2 acarolinensis_gene_ensembl               Anole lizard genes (AnoCar2.0)              AnoCar2.0
## 3  acchrysaetos_gene_ensembl              Golden eagle genes (bAquChr1.2)             bAquChr1.2
## 4  acitrinellus_gene_ensembl               Midas cichlid genes (Midas_v5)               Midas_v5
## 5  amelanoleuca_gene_ensembl                        Panda genes (ailMel1)                ailMel1
## 6    amexicanus_gene_ensembl Mexican tetra genes (Astyanax_mexicanus-2.0) Astyanax_mexicanus-2.0

To select a dataset we can update the Mart object using the function useDataset(). In the example below we choose to use the hsapiens dataset.

ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)

Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by:

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

3 How to build a biomaRt query

The getBM() function has three arguments that need to be introduced: filters, attributes and values. Filters define a restriction on the query. For example you want to restrict the output to all genes located on the human X chromosome then the filter chromosome_name can be used with value ‘X’. The listFilters() function shows you all available filters in the selected dataset.

filters = listFilters(ensembl)
filters[1:5,]
##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End

Attributes define the values we are interested in to retrieve. For example we want to retrieve the gene symbols or chromosomal coordinates. The listAttributes() function displays all available attributes in the selected dataset.

attributes = listAttributes(ensembl)
attributes[1:5,]
##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page

The getBM() function is the main query function in biomaRt. It has four main arguments:

Note: for some frequently used queries to Ensembl, wrapper functions are available: getGene() and getSequence(). These functions call the getBM() function with hard coded filter and attribute names.

Now that we selected a BioMart database and dataset, and know about attributes, filters, and the values for filters; we can build a biomaRt query. Let’s make an easy query for the following problem: We have a list of Affymetrix identifiers from the u133plus2 platform and we want to retrieve the corresponding EntrezGene identifiers using the Ensembl mappings.

The u133plus2 platform will be the filter for this query and as values for this filter we use our list of Affymetrix identifiers. As output (attributes) for the query we want to retrieve the EntrezGene and u133plus2 identifiers so we get a mapping of these two identifiers as a result. The exact names that we will have to use to specify the attributes and filters can be retrieved with the listAttributes() and listFilters() function respectively. Let’s now run the query:

affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene_id'), 
      filters = 'affy_hg_u133_plus_2', 
      values = affyids, 
      mart = ensembl)
##   affy_hg_u133_plus_2 entrezgene_id
## 1         209310_s_at           837
## 2           207500_at           838
## 3           202763_at           836

3.1 Searching for datasets, filters and attributes

The functions listDatasets(), listAttributes(), and listFilters() will return every available option for their respective types. However, this can be unwieldy when the list of results is long, involving much scrolling to find the entry you are interested in. biomaRt also provides the functions searchDatasets(), searchAttributes(), and searchFilters() which will try to find any entries matching a specific term or pattern. For example, if we want to find the details of any datasets in our ensembl mart that contain the term ‘hsapiens’ we could do the following:

searchDatasets(mart = ensembl, pattern = "hsapiens")
##                  dataset              description    version
## 80 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13

You can search in a simlar fashion to find available attributes and filters that you may be interested in. The example below returns the details for all attributes that contain the pattern ‘hgnc’.

searchAttributes(mart = ensembl, pattern = "hgnc")
##               name        description         page
## 63         hgnc_id            HGNC ID feature_page
## 64     hgnc_symbol        HGNC symbol feature_page
## 94 hgnc_trans_name Transcript name ID feature_page

For advanced use, note that the pattern argument takes a regular expression. This means you can create more complex queries if required. Imagine, for example, that we have the string ENST00000577249.1, which we know is an Ensembl ID of some kind, but we aren’t sure what the appropriate filter term is. The example shown next uses a pattern that will find all filters that contain both the terms ‘ensembl’ and ‘id’. This allows use to reduced the list of filters to only those that might be appropriate for our example.

searchFilters(mart = ensembl, pattern = "ensembl.*id")
##                             name                                                    description
## 56               ensembl_gene_id                       Gene stable ID(s) [e.g. ENSG00000000003]
## 57       ensembl_gene_id_version       Gene stable ID(s) with version [e.g. ENSG00000000003.15]
## 58         ensembl_transcript_id                 Transcript stable ID(s) [e.g. ENST00000000233]
## 59 ensembl_transcript_id_version Transcript stable ID(s) with version [e.g. ENST00000000233.10]
## 60            ensembl_peptide_id                    Protein stable ID(s) [e.g. ENSP00000000233]
## 61    ensembl_peptide_id_version     Protein stable ID(s) with version [e.g. ENSP00000000233.5]
## 62               ensembl_exon_id                              Exon ID(s) [e.g. ENSE00000000003]

From this we can see that ENST00000577249.1 is a Transcript ID with version, and the appropriate filter value to use with it is ensembl_transcript_id_version.

3.2 Using predefined filter values

Many filters have a predefined list of values that are known to be in the dataset they are associated with. An common example would be the names of chromosomes when searching a dataset at Ensembl. In this online interface to BioMart these available options are displayed as a list as shown in Figure 1.

The options available to the Chromosome/Scaffold field are limited to a pretermined list based on the values in this dataset.

Figure 1: The options available to the Chromosome/Scaffold field are limited to a pretermined list based on the values in this dataset

You can list this in an R session with the function listFilterValues(), passing it a mart object and the name of the filter. For example, to list the possible chromosome names you could run the following:

listFilterValues(mart = ensembl, filter = "chromosome_name")

It is also possible to search the list of available values via searchFilterValues(). In the examples below, the first returns all chromosome names starting with “GL”, which the second will find any phenotype descriptions that contain the string “Crohn”.

searchFilterValues(mart = ensembl, filter = "chromosome_name", pattern = "^GL")
searchFilterValues(mart = ensembl, filter = "phenotype_description", pattern = "Crohn")

3.3 Attribute Pages

For large BioMart databases such as Ensembl, the number of attributes displayed by the listAttributes() function can be very large. In BioMart databases, attributes are put together in pages, such as sequences, features, homologs for Ensembl. An overview of the attributes pages present in the respective BioMart dataset can be obtained with the attributePages() function.

pages = attributePages(ensembl)
pages
## [1] "feature_page" "structure"    "homologs"     "snp"          "snp_somatic"  "sequences"

To show us a smaller list of attributes which belong to a specific page, we can now specify this in the listAttributes() function. The set of attributes is still quite long, so we use head() to show only the first few items here.

head(listAttributes(ensembl, page="feature_page"))
##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

We now get a short list of attributes all of which are related to the region where the genes are located.

4 Result Caching

To save time and computing resources biomaRt will attempt to identify when you are re-running a query you have executed before. Each time a new query is run, the results will be saved to a cache on your computer. If a query is identified as having been run previously, rather than submitting the query to the server, the results will be loaded from the cache.

You can get some information on the size and location of the cache using the function biomartCacheInfo():

biomartCacheInfo()
## biomaRt cache
## - Location: /home/biocbuild/.cache/biomaRt
## - No. of files: 702
## - Total size: 88.8 Mb

The cache can be deleted using the command biomartCacheClear(). This will remove all cached files.

5 Using a BioMart other than Ensembl

There are a small number of non-Ensembl databases that offer a BioMart interface to their data. The biomaRt package can be used to access these in a very similar fashion to Ensembl. The majority of biomaRt functions will work in the same manner, but the construction of the initial Mart object requires slightly more setup. In this section we demonstrate the setting requires to query Wormbase ParaSite and Phytozome

5.1 Wormbase

To demonstrate the use of the biomaRt package with non-Ensembl databases the next query is performed using the Wormbase ParaSite BioMart. In this example, we use the listMarts() function to find the name of the available marts, given the URL of Wormbase. We use this to connect to Wormbase BioMart using the useMart() function. Note that we use the https address and must provide the port as 443. Queries to WormBase will fail without these options.

listMarts(host = "parasite.wormbase.org")
##         biomart       version
## 1 parasite_mart ParaSite Mart
wormbase <- useMart(biomart = "parasite_mart", 
                   host = "https://parasite.wormbase.org", 
                   port = 443)

We can then use functions described earlier in this vignette to find and select the gene dataset, and print the first 6 available attributes and filters. Then we use a list of gene names as filter and retrieve associated transcript IDs and the transcript biotype.

listDatasets(wormbase)
##     dataset          description version
## 1 wbps_gene All Species (WBPS14)      14
wormbase <- useDataset(mart = wormbase, dataset = "wbps_gene")
head(listFilters(wormbase))
##                  name     description
## 1     species_id_1010          Genome
## 2 nematode_clade_1010  Nematode Clade
## 3     chromosome_name Chromosome name
## 4               start           Start
## 5                 end             End
## 6              strand          Strand
head(listAttributes(wormbase))
##                      name        description         page
## 1          species_id_key      Internal Name feature_page
## 2    production_name_1010     Genome project feature_page
## 3       display_name_1010        Genome name feature_page
## 4        taxonomy_id_1010        Taxonomy ID feature_page
## 5 assembly_accession_1010 Assembly accession feature_page
## 6     nematode_clade_1010     Nematode clade feature_page
getBM(attributes = c("external_gene_id", "wbps_transcript_id", "transcript_biotype"), 
      filters = "gene_name", 
      values = c("unc-26","his-33"), 
      mart = wormbase)
##   external_gene_id wbps_transcript_id transcript_biotype
## 1           his-33         F17E9.13.1     protein_coding
## 2           unc-26          JC8.10a.1     protein_coding
## 3           unc-26          JC8.10b.1     protein_coding
## 4           unc-26          JC8.10c.1     protein_coding
## 5           unc-26          JC8.10c.2     protein_coding
## 6           unc-26          JC8.10d.1     protein_coding

5.2 Phytozome

5.2.1 Version 12

The Phytozome BioMart can be accessed with the following settings. Note that we use the https address. Queries to Phytozome will fail without specifying this.

listMarts(host = "https://phytozome.jgi.doe.gov")
##                     biomart                  version
## 1            phytozome_mart V12 Genomes and Families
## 2 phytozome_diversity__mart     V12 Genome Diversity
## 3    phytozome_mart_archive           Genome Archive
phytozome <- useMart(biomart = "phytozome_mart", 
                host = "https://phytozome.jgi.doe.gov")
listDatasets(phytozome)
##              dataset           description version
## 1          phytozome  Phytozome 12 Genomes        
## 2 phytozome_clusters Phytozome 12 Families
wormbase <- useDataset(mart = phytozome, dataset = "phytozome")

Once this is set up the usual biomaRt functions can be used to interogate the database options and run queries.

getBM(attributes = c("organism_name", "gene_name1"), 
      filters = "gene_name_filter", 
      values = "82092", 
      mart = phytozome)
## Error in martCheck(mart): No dataset selected, please select a dataset first.  You can see the available datasets by using the listDatasets function see ?listDatasets for more information.  Then you should create the Mart object by using the useMart function.  See ?useMart for more information

5.2.2 Version 13

Version 13 of Phyotozome can be found at https://phytozome-next.jgi.doe.gov/ and if you wish to query that version the URL used to create the Mart object must reflect that.

phytozome_v13 <- useMart(biomart = "phytozome_mart", 
                dataset = "phytozome", 
                host = "https://phytozome-next.jgi.doe.gov")

6 Local BioMart databases

The biomaRt package can be used with a local install of a public BioMart database or a locally developed BioMart database and web service. In order for biomaRt to recognize the database as a BioMart, make sure that the local database you create has a name conform with database_mart_version where database is the name of the database and version is a version number. No more underscores than the ones showed should be present in this name. A possible name is for example ensemblLocal_mart_46. ## Minimum requirements for local database installation More information on installing a local copy of a BioMart database or develop your own BioMart database and webservice can be found on http://www.biomart.org Once the local database is installed you can use biomaRt on this database by:

listMarts(host="www.myLocalHost.org", path="/myPathToWebservice/martservice")
mart=useMart("nameOfMyMart",dataset="nameOfMyDataset",host="www.myLocalHost.org", path="/myPathToWebservice/martservice")

For more information on how to install a public BioMart database see: http://www.biomart.org/install.html and follow link databases.

7 Using select()

In order to provide a more consistent interface to all annotations in Bioconductor the select(), columns(), keytypes() and keys() have been implemented to wrap some of the existing functionality above. These methods can be called in the same manner that they are used in other parts of the project except that instead of taking a AnnotationDb derived class they take instead a Mart derived class as their 1st argument. Otherwise usage should be essentially the same. You still use columns() to discover things that can be extracted from a Mart, and keytypes() to discover which things can be used as keys with select().

mart <- useMart(dataset="hsapiens_gene_ensembl",biomart='ensembl')
head(keytypes(mart), n=3)
## [1] "affy_hc_g110"        "affy_hg_focus"       "affy_hg_u133_plus_2"
head(columns(mart), n=3)
## [1] "3_utr_end"   "3_utr_end"   "3_utr_start"

And you still can use keys() to extract potential keys, for a particular key type.

k = keys(mart, keytype="chromosome_name")
head(k, n=3)
## [1] "1" "2" "3"

When using keys(), you can even take advantage of the extra arguments that are available for others keys methods.

k = keys(mart, keytype="chromosome_name", pattern="LRG")
head(k, n=3)
## character(0)

Unfortunately the keys() method will not work with all key types because they are not all supported.

But you can still use select() here to extract columns of data that match a particular set of keys (this is basically a wrapper for getBM()).

affy=c("202763_at","209310_s_at","207500_at")
select(mart, keys=affy, columns=c('affy_hg_u133_plus_2','entrezgene_id'),
  keytype='affy_hg_u133_plus_2')
##   affy_hg_u133_plus_2 entrezgene_id
## 1         209310_s_at           837
## 2           207500_at           838
## 3           202763_at           836

So why would we want to do this when we already have functions like getBM()? For two reasons: 1) for people who are familiar with select and it’s helper methods, they can now proceed to use biomaRt making the same kinds of calls that are already familiar to them and 2) because the select method is implemented in many places elsewhere, the fact that these methods are shared allows for more convenient programmatic access of all these resources. An example of a package that takes advantage of this is the OrganismDbi package. Where several packages can be accessed as if they were one resource.

8 Connection troubleshooting

Note: if the function useMart() runs into proxy problems you should set your proxy first before calling any biomaRt functions.
You can do this using the Sys.putenv command:

Sys.setenv("http_proxy" = "http://my.proxy.org:9999")

Some users have reported that the workaround above does not work, in this case an alternative proxy solution below can be tried:

options(RCurlOptions = list(proxy="uscache.kcc.com:80",proxyuserpwd="------:-------"))

9 Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] biomaRt_2.44.4   BiocStyle_2.16.1
## 
## loaded via a namespace (and not attached):
##  [1] progress_1.2.2       tidyselect_1.1.0     xfun_0.18            purrr_0.3.4         
##  [5] vctrs_0.3.4          generics_0.0.2       htmltools_0.5.0      stats4_4.0.3        
##  [9] BiocFileCache_1.12.1 yaml_2.2.1           blob_1.2.1           XML_3.99-0.5        
## [13] rlang_0.4.8          pillar_1.4.6         glue_1.4.2           DBI_1.1.0           
## [17] rappdirs_0.3.1       BiocGenerics_0.34.0  bit64_4.0.5          dbplyr_1.4.4        
## [21] lifecycle_0.2.0      stringr_1.4.0        codetools_0.2-16     memoise_1.1.0       
## [25] evaluate_0.14        Biobase_2.48.0       knitr_1.30           IRanges_2.22.2      
## [29] parallel_4.0.3       curl_4.3             AnnotationDbi_1.50.3 highr_0.8           
## [33] Rcpp_1.0.5           openssl_1.4.3        BiocManager_1.30.10  S4Vectors_0.26.1    
## [37] bit_4.0.4            hms_0.5.3            askpass_1.1          png_0.1-7           
## [41] digest_0.6.25        stringi_1.5.3        bookdown_0.21        dplyr_1.0.2         
## [45] tools_4.0.3          magrittr_1.5         RSQLite_2.2.1        tibble_3.0.4        
## [49] crayon_1.3.4         pkgconfig_2.0.3      ellipsis_0.3.1       xml2_1.3.2          
## [53] prettyunits_1.1.1    assertthat_0.2.1     rmarkdown_2.4        httr_1.4.2          
## [57] R6_2.4.1             compiler_4.0.3
warnings()