Contents

1 Introduction

InterMine is a powerful open source data warehouse system integrating diverse biological data sets (e.g. genomic, expression and protein data) for various organisms. Integrating data makes it possible to run sophisticated data mining queries that span domains of biological knowledge. A selected list of databases powered by InterMine is shown in Table 1:

Database Organism Data
FlyMine Drosophila Genes, homology, proteins, interactions, gene ontology, expression, regulation, phenotypes, pathways, diseases, resources, publications
HumanMine H. sapiens Genomics, SNPs, GWAS, proteins, gene ontology, pathways, gene expression, interactions, publications, disease, orthologues, alleles
MouseMine M. musculus Genomics, proteins, gene ontology, expression, interactions, pathways, phenotypes, diseases, homology, publications
RatMine R. norvegicus Disease, gene ontology, genomics, interactions, phenotype, pathway, proteins, publication QTL, SNP
WormMine C. elegans Genes, alleles, homology, go annotation, phenotypes, strains
YeastMine S. cerevisiae Genomics, proteins, gene ontology, comparative genomics, phenotypes, interactions, literature, pathways, gene expression
ZebrafishMine D. rerio Genes, constructs, disease, gene ontology, genotypes, homology, morpholinos, phenotypes
TargetMine H. sapiens, M. musculus Genes, protein structures, chemical compounds, protein domains, gene function, pathways, interactions, disease, drug targets
MitoMiner H. sapiens, M. musculus, R. norvegicus, D. rerio, S. cerevisiae, S. pombe Genes, homology, localisation evidence, Mitochondrial reference gene lists, phenotypes, diseases, expression, interactions, pathways, exome
IndigoMine Archae Genomics
ThaleMine A. thaliana Genomics, proteins, domains, homology, gene ontology, interactions, expression, publications, pathways, GeneRIF, stocks, phenotypes, alleles, insertions, TAIR
MedicMine Medicago truncatula Genomics, pathways, gene ontology, publications, proteins, homology
PhytoMine over 50 plant genomes Genes, proteins, expression, transcripts, homology

Please see the InterMine home page for a full list of available InterMines.

InterMine includes an attractive, user-friendly web interface that works ‘out of the box’ and a powerful, scriptable web-service API to allow programmatic access to your data. This R package provides an interface with the InterMine-powered databases through Web services.

2 Jumpstart: How to build queries using InterMineR

Let’s start with a simple task - find the homologues of gene ABO.

2.1 Select a database

First, we look at what databases are available.

library(InterMineR)
listMines()
##                                            BeanMine 
##             "https://mines.legumeinfo.org/beanmine" 
##                                          BovineMine 
##                "http://bovinegenome.org/bovinemine" 
##                                             CHOmine 
##                "https://chomine.boku.ac.at/chomine" 
##                                        ChickpeaMine 
##         "https://mines.legumeinfo.org/chickpeamine" 
##                                          CowpeaMine 
##           "https://mines.legumeinfo.org/cowpeamine" 
##                                             FlyMine 
##                    "http://www.flymine.org/flymine" 
##                                           GrapeMine 
##          "http://urgi.versailles.inra.fr/GrapeMine" 
##                                           HumanMine 
##                "http://www.humanmine.org/humanmine" 
##                                     HymenopteraMine 
##      "http://hymenopteragenome.org/hymenopteramine" 
##                                          IndigoMine 
##               "http://www.cbrc.kaust.edu.sa/indigo" 
##                                          LegumeMine 
## "https://intermine.legumefederation.org/legumemine" 
##                                           MedicMine 
##               "http://medicmine.jcvi.org/medicmine" 
##                                           MitoMiner 
##    "http://mitominer.mrc-mbu.cam.ac.uk/release-4.0" 
##                                             ModMine 
##         "http://intermine.modencode.org/release-33" 
##                                           MouseMine 
##                "http://www.mousemine.org/mousemine" 
##                                          PeanutMine 
##           "https://mines.legumeinfo.org/peanutmine" 
##                                           PhytoMine 
##          "https://phytozome.jgi.doe.gov/phytomine/" 
##                                            PlanMine 
##               "http://planmine.mpi-cbg.de/planmine" 
##                                             RatMine 
##                    "http://ratmine.mcw.edu/ratmine" 
##                                             RepetDB 
##            "http://urgi.versailles.inra.fr/repetdb" 
##                                              Shaare 
##              "http://www.shaare.org.uk/release-1.0" 
##                                             SoyMine 
##              "https://mines.legumeinfo.org/soymine" 
##                                          TargetMine 
##     "http://targetmine.mizuguchilab.org/targetmine" 
##                                           ThaleMine 
##                "https://apps.araport.org/thalemine" 
##                                         Wheat3BMine 
##        "http://urgi.versailles.inra.fr/Wheat3BMine" 
##                                            WormMine 
##                     "http://intermine.wormbase.org" 
##                                             XenMine 
##                    "http://www.xenmine.org/xenmine" 
##                                           YeastMine 
##        "http://yeastmine.yeastgenome.org/yeastmine" 
##                                       ZebrafishMine 
##                          "http://zebrafishmine.org"

Since we would like to query human genes, we select HumanMine.

# load HumaMine
im <- initInterMine(mine=listMines()["HumanMine"])
im
## $mine
##                            HumanMine 
## "http://www.humanmine.org/humanmine" 
## 
## $token
## [1] ""

2.2 Obtain a prebuilt query

In InterMine you are able to build custom queries, but using R you are only allowed to run pre-built queries – called templates. Templates are queries that have already been created with a fixed set of output columns and one or more constraints.

# Get template (collection of pre-defined queries)
template = getTemplates(im)
head(template)
##                          name
## 1        Gene_Alleles_Disease
## 2            Gene_Identifiers
## 3                PathwayGenes
## 4               Gene_Location
## 5                 GeneExpress
## 6 Gene_particularGoannotation
##                                                                  title
## 1                                         Gene --> Alleles and Disease
## 2                                            Gene --> All identifiers.
## 3                                                    Pathway --> Genes
## 4                                       Gene --> Chromosomal location.
## 5 Gene --> Gene Expression (Tissue, Disease; Array Express, E-MTAB-62)
## 6                                  Gene + GO term --> Genes by GO term

We would like to find templates involving genes.

# Get gene-related templates
template[grep("gene", template$name, ignore.case=TRUE),]
##                               name
## 1             Gene_Alleles_Disease
## 2                 Gene_Identifiers
## 3                     PathwayGenes
## 4                    Gene_Location
## 5                      GeneExpress
## 6      Gene_particularGoannotation
## 8              Pathway_ProteinGene
## 10     Gene_proteinAtlasExpression
## 11         Term_inGWASoptionalGene
## 12            Gene_Expression_GTex
## 13              Gene_proteindomain
## 15            Gene_To_Publications
## 17      ChromRegion_GenesTransExon
## 18                       Gene_Orth
## 19                    Gene_Protein
## 20               ChromRegion_Genes
## 21                     Gene_inGWAS
## 22                  GeneOrthAllele
## 25   Gene_TissueExpressionIllumina
## 27       GeneInteractorsExpression
## 29                    Gene_Pathway
## 31                   PhenotypeGene
## 32       GenePathway_interactions2
## 33                humDisGeneOrthol
## 36             domain_protein_gene
## 37 Gene_Interactions_forReportPage
## 38                        Dis_Gene
## 39                     GOterm_Gene
## 41    Protein_GeneChromosomeLength
## 42             geneInteractiongene
## 43               geneGWAS_reportPg
## 44            gene_complex_details
## 46                     disExprGene
## 47                         Gene_GO
## 49                 Gene_AllelePhen
## 50              Protein_Gene_Ortho
## 52              Gene_Interactions2
## 54                        Gene_Dis
## 57          Gene_OverlapppingGenes
##                                                                          title
## 1                                                 Gene --> Alleles and Disease
## 2                                                    Gene --> All identifiers.
## 3                                                            Pathway --> Genes
## 4                                               Gene --> Chromosomal location.
## 5         Gene --> Gene Expression (Tissue, Disease; Array Express, E-MTAB-62)
## 6                                          Gene + GO term --> Genes by GO term
## 8                                                 Pathway --> Protein and Gene
## 10                                        Gene --> Protein tissue Localisation
## 11                            GWAS term --> SNP + Associated gene if available
## 12                                      Gene --> Tissue Expression (GTex data)
## 13                                                  Gene --> Protein + Domains
## 15                                                      Gene --> Publications.
## 17                    Chromosomal Location --> All Genes + Transcripts + Exons
## 18                                                        Gene --> Orthologues
## 19                                                          Gene --> Proteins.
## 20                                                            Region --> Genes
## 21                                                           Gene --> GWAS hit
## 22                              Gene (Hum OR Rat) --> Mouse Allele (Phenotype)
## 25                              Gene --> Tissue Expression (Illumina body map)
## 27 Gene + Tissue Expression  --> Interactors that are expressed in that tissue
## 29                                                            Gene --> Pathway
## 31                        Mouse Phenotype -->  Mouse Genes + Orthologous genes
## 32                                             Gene + Pathway --> Interactions
## 33                              Human Disease --> [Human +] Orthologue Gene(s)
## 36                                        Protein Domain --> Protein and Genes
## 37                                  Gene --> Physical and Genetic Interactions
## 38                                                         Disease --> Gene(s)
## 39                                                           GO term --> Genes
## 41                                                           Protein --> Gene.
## 42                                           Gene A --> Interaction <-- Gene B
## 43                                                    Gene Report --> GWAS hit
## 44                                                    Gene --> protein complex
## 46                                                Disease Expression --> Genes
## 47                                                          Gene --> GO terms.
## 49                                           Mouse Gene --> Allele [Phenotype]
## 50                                            Protein --> Gene and Orthologues
## 52                                                       Gene --> Interactions
## 54                                                     Gene --> Disease (OMIM)
## 57                                                 Gene --> Overlapping genes.

The template Gene_Orth seems to be what we want. Let’s look at this template in more detail.

# Query for gene orthologs
queryGeneOrth = getTemplateQuery(
  im = im, 
  name = "Gene_Orth"
)
queryGeneOrth
## $model
##      name 
## "genomic" 
## 
## $title
## [1] "Gene --> Orthologues"
## 
## $description
## [1] "For a given Gene (or List of Genes) in named organism (default: Human) returns the orthologues in a different organisms. [keywords: homologue, homolog, paralogue, paralogue, ortholog]"
## 
## $select
## [1] "Gene.primaryIdentifier"                      
## [2] "Gene.symbol"                                 
## [3] "Gene.homologues.homologue.primaryIdentifier" 
## [4] "Gene.homologues.homologue.symbol"            
## [5] "Gene.homologues.homologue.organism.shortName"
## 
## $name
## [1] "Gene_Orth"
## 
## $comment
## [1] ""
## 
## $orderBy
## $orderBy[[1]]
## Gene.symbol 
##       "ASC" 
## 
## 
## $where
## $where[[1]]
## $where[[1]]$path
## [1] "Gene"
## 
## $where[[1]]$op
## [1] "LOOKUP"
## 
## $where[[1]]$code
## [1] "A"
## 
## $where[[1]]$editable
## [1] TRUE
## 
## $where[[1]]$switchable
## [1] FALSE
## 
## $where[[1]]$switched
## [1] "LOCKED"
## 
## $where[[1]]$value
## [1] "PPARG"
## 
## $where[[1]]$extraValue
## [1] "H. sapiens"

There are three essential members in a query - SELECT, WHERE and constraintLogic.

  1. SELECT
    1. The SELECT (or view) represents the output columns in the query output.
    2. Columns of a view are usually of the form “A.B”, where B is the child of A. For example in the column Gene.symbol, symbol is the child of Gene. Columns could also be in cascade form “A.B.C”. For example, in the column Gene.locations.start, locations is the child of Gene and start is the child of locations.
  2. WHERE
    1. The WHERE statement is a collection of constraints.
    2. Query constraints include a list of the following columns:
      1. path
        1. in the same format as view columns
      2. op
        1. the constraint operator
        2. Valid values: “=”, “!=”, “LOOKUP”, “ONE OF”, “NONE OF”, “>”, “<”, “>=”, “<=”, “LIKE”
      3. value
        1. the constraint value
      4. code
        1. Ignore
        2. The logic code for the constraint (e.g. A, B or C).
        3. Only used in the constrainLogic (discussed below
      5. extraValue
        1. optional, required for LOOKUP constraints
        2. Short name of organism, e.g. H. sapiens
        1. Editable
          1. Ignore
          2. Used to determine if user is allowed to edit this constraint. Only for the UI.
      6. Switchable 1. Ignore 2. Used to determine if user is allowed to disable this constraint.
        Only for the UI.
      7. Switched 1. Ignore 2. Used to determine if user has enabled this constraint. Only for the UI.
  3. constraintLogic
    1. Constraint Logic, if not explicitly given, is “AND” operation, e.g., “A and B”, where A and B are the codes in the constraints.

2.2.1 Look at the data model

What does ‘Gene.symbol’ mean? What is ‘Gene.homologues.homologue.symbol’?

Let’s take a look at the data model.

model <- getModel(im)
head(model)
##     type           child_name child_type
## 1 Allele            alternate           
## 2 Allele clinicalSignificance           
## 3 Allele                   id           
## 4 Allele                 name           
## 5 Allele    primaryIdentifier           
## 6 Allele            reference

Let’s look at the children of the Gene data type.

model[which(model$type=="Gene"),]
##     type                 child_name             child_type
## 599 Gene           briefDescription                       
## 600 Gene               cytoLocation                       
## 601 Gene                description                       
## 602 Gene                         id                       
## 603 Gene                     length                       
## 604 Gene                       name                       
## 605 Gene          primaryIdentifier                       
## 606 Gene                      score                       
## 607 Gene                  scoreType                       
## 608 Gene        secondaryIdentifier                       
## 609 Gene                     symbol                       
## 610 Gene                    alleles                 Allele
## 611 Gene            atlasExpression        AtlasExpression
## 612 Gene                       CDSs                    CDS
## 613 Gene                 chromosome             Chromosome
## 614 Gene            crossReferences         CrossReference
## 615 Gene                   dataSets                DataSet
## 616 Gene                   diseases                Disease
## 617 Gene                      exons                   Exon
## 618 Gene               goAnnotation           GOAnnotation
## 619 Gene            flankingRegions     GeneFlankingRegion
## 620 Gene                 homologues              Homologue
## 621 Gene               interactions            Interaction
## 622 Gene downstreamIntergenicRegion       IntergenicRegion
## 623 Gene   upstreamIntergenicRegion       IntergenicRegion
## 624 Gene                    introns                 Intron
## 625 Gene         chromosomeLocation               Location
## 626 Gene            locatedFeatures               Location
## 627 Gene                  locations               Location
## 628 Gene        ontologyAnnotations     OntologyAnnotation
## 629 Gene                   organism               Organism
## 630 Gene                   pathways                Pathway
## 631 Gene                  probeSets               ProbeSet
## 632 Gene                   proteins                Protein
## 633 Gene     proteinAtlasExpression ProteinAtlasExpression
## 634 Gene               publications            Publication
## 635 Gene              rnaSeqResults           RNASeqResult
## 636 Gene          regulatoryRegions       RegulatoryRegion
## 637 Gene                       SNPs                    SNP
## 638 Gene       sequenceOntologyTerm                 SOTerm
## 639 Gene                   sequence               Sequence
## 640 Gene              childFeatures        SequenceFeature
## 641 Gene        overlappingFeatures        SequenceFeature
## 642 Gene                   synonyms                Synonym
## 643 Gene                transcripts             Transcript
## 644 Gene                       UTRs                    UTR

Gene has a field called ‘symbol’ (hence the column Gene.symbol). Gene also has a child called homologues, which is of the Homologue data type.

model[which(model$type=="Homologue"),]
##          type      child_name         child_type
## 732 Homologue              id                   
## 733 Homologue            type                   
## 734 Homologue crossReferences     CrossReference
## 735 Homologue        dataSets            DataSet
## 736 Homologue            gene               Gene
## 737 Homologue       homologue               Gene
## 738 Homologue        evidence OrthologueEvidence

Homologue has a child called ‘gene’ which is of the type ‘Gene’, which we saw above has a field called ‘symbol’ (hence the column Gene.homologues.homologue.symbol).

2.3 Run a Query

Let’s now run our template.

resGeneOrth <- runQuery(im, queryGeneOrth)
head(resGeneOrth)
##   Gene.primaryIdentifier Gene.symbol
## 1                   5468       PPARG
## 2                   5468       PPARG
## 3                   5468       PPARG
## 4                   5468       PPARG
## 5                   5468       PPARG
## 6                   5468       PPARG
##   Gene.homologues.homologue.primaryIdentifier
## 1                                       10062
## 2                                        5465
## 3                                        5467
## 4                                        5914
## 5                                        5915
## 6                                        5916
##   Gene.homologues.homologue.symbol
## 1                            NR1H3
## 2                            PPARA
## 3                            PPARD
## 4                             RARA
## 5                             RARB
## 6                             RARG
##   Gene.homologues.homologue.organism.shortName
## 1                                   H. sapiens
## 2                                   H. sapiens
## 3                                   H. sapiens
## 4                                   H. sapiens
## 5                                   H. sapiens
## 6                                   H. sapiens

3 Modify a Query

3.1 Edit a constraint

Let’s modify the query to find the orthologues of the gene ABO. We want to change the ‘value’ attribute from PPARG to ABO.

There are two ways to build a query in InterMineR.

  1. We can either build a query as a list object with newQuery function, and assign all input values (selection of retrieved data type, constraints, etc.) as items of that list,

  2. rr we can build the query as an InterMineR-class object with the functions setConstraint, which allows us to generate a new or modify an existing list of constraints, and setQuery, which generates the query as a InterMineR-class object.

setConstraints and setQuery functions are designed to facilitate the generation of queries for InterMine instances and avoid using multiple iterative loops, especially when it is required to include multiple constraints or constraint values (e.g. genes, organisms) in your query.

# modify directly the value of the first constraint from the list query
queryGeneOrth$where[[1]][["value"]] <- "ABO"

# or modify the value of the first constraint from the list query with setConstraints
queryGeneOrth$where = setConstraints(
  modifyQueryConstraints = queryGeneOrth,
  m.index = 1,
  values = list("ABO")
)

queryGeneOrth$where
## [[1]]
## [[1]]$path
## [1] "Gene"
## 
## [[1]]$op
## [1] "LOOKUP"
## 
## [[1]]$code
## [1] "A"
## 
## [[1]]$editable
## [1] TRUE
## 
## [[1]]$switchable
## [1] FALSE
## 
## [[1]]$switched
## [1] "LOCKED"
## 
## [[1]]$value
## [1] "ABO"
## 
## [[1]]$extraValue
## [1] "H. sapiens"

Note the value is now equal to ‘ABO’. Let’s rerun our query with the new constraint.

resGeneOrth <- runQuery(im, queryGeneOrth)
head(resGeneOrth)
##   Gene.primaryIdentifier Gene.symbol
## 1                     28         ABO
## 2                     28         ABO
## 3                     28         ABO
## 4                     28         ABO
## 5                     28         ABO
## 6                     28         ABO
##   Gene.homologues.homologue.primaryIdentifier
## 1                                      127550
## 2                                       26301
## 3                                      360203
## 4                                 MGI:2135738
## 5                                 RGD:2307241
## 6                                  RGD:628609
##   Gene.homologues.homologue.symbol
## 1                          A3GALT2
## 2                            GBGT1
## 3                           GLT6D1
## 4                              Abo
## 5                              Abo
## 6                             Abo3
##   Gene.homologues.homologue.organism.shortName
## 1                                   H. sapiens
## 2                                   H. sapiens
## 3                                   H. sapiens
## 4                                  M. musculus
## 5                                R. norvegicus
## 6                                R. norvegicus

Now we are seeing orthologues for the ABO gene. Let’s add the organism to the view to make sure we are looking at the desired gene.

3.2 Add a new constraint

You can also add additional filters. Let’s exclude all homologues where organism is H. sapiens.

There are four parts of a constraint to add:

  1. path
    1. I got the path from the output columns but I could have figured out it from the data model.
  2. op
    1. Valid values: “=”, “!=”, “LOOKUP”, “ONE OF”, “NONE OF”, “>”, “<”, “>=”, “<=”, “LIKE”
  3. value
    1. What value I am filtering on.
  4. code
    1. Must be a letter not in use by the query already. Looking at the query output above we can see we only have one constraint, labelled ‘A’. Let’s use ‘B’ for our code.
newConstraint <- list(
  path=c("Gene.homologues.homologue.organism.shortName"),
  op=c("!="), 
  value=c("H. sapiens"), 
  code=c("B")
)

queryGeneOrth$where[[2]] <- newConstraint
queryGeneOrth$where
## [[1]]
## [[1]]$path
## [1] "Gene"
## 
## [[1]]$op
## [1] "LOOKUP"
## 
## [[1]]$code
## [1] "A"
## 
## [[1]]$editable
## [1] TRUE
## 
## [[1]]$switchable
## [1] FALSE
## 
## [[1]]$switched
## [1] "LOCKED"
## 
## [[1]]$value
## [1] "ABO"
## 
## [[1]]$extraValue
## [1] "H. sapiens"
## 
## 
## [[2]]
## [[2]]$path
## [1] "Gene.homologues.homologue.organism.shortName"
## 
## [[2]]$op
## [1] "!="
## 
## [[2]]$value
## [1] "H. sapiens"
## 
## [[2]]$code
## [1] "B"

Our new filter has been added successfully. Rerun the query and you see you only have non-Homo sapiens orthologues.

resGeneOrth <- runQuery(im, queryGeneOrth)
resGeneOrth
##    Gene.primaryIdentifier Gene.symbol
## 1                      28         ABO
## 2                      28         ABO
## 3                      28         ABO
## 4                      28         ABO
## 5                      28         ABO
## 6                      28         ABO
## 7                      28         ABO
## 8                      28         ABO
## 9                      28         ABO
## 10                     28         ABO
## 11                     28         ABO
##    Gene.homologues.homologue.primaryIdentifier
## 1                                  MGI:2135738
## 2                                  RGD:2307241
## 3                                   RGD:628609
## 4                            ZDB-GENE-031204-4
## 5                         ZDB-GENE-040426-1117
## 6                           ZDB-GENE-040912-46
## 7                           ZDB-GENE-060531-15
## 8                           ZDB-GENE-060531-59
## 9                           ZDB-GENE-060531-70
## 10                          ZDB-GENE-060531-71
## 11                          ZDB-GENE-081104-23
##    Gene.homologues.homologue.symbol
## 1                               Abo
## 2                               Abo
## 3                              Abo3
## 4                                  
## 5                                  
## 6                                  
## 7                                  
## 8                                  
## 9                                  
## 10                                 
## 11                                 
##    Gene.homologues.homologue.organism.shortName
## 1                                   M. musculus
## 2                                 R. norvegicus
## 3                                 R. norvegicus
## 4                                      D. rerio
## 5                                      D. rerio
## 6                                      D. rerio
## 7                                      D. rerio
## 8                                      D. rerio
## 9                                      D. rerio
## 10                                     D. rerio
## 11                                     D. rerio

3.3 Add a column

You can also add additional columns to the output. For instance, where do these homologues come from? Let’s add this information.

Let’s see what we know about homologues.

model[which(model$type=="Homologue"),]
##          type      child_name         child_type
## 732 Homologue              id                   
## 733 Homologue            type                   
## 734 Homologue crossReferences     CrossReference
## 735 Homologue        dataSets            DataSet
## 736 Homologue            gene               Gene
## 737 Homologue       homologue               Gene
## 738 Homologue        evidence OrthologueEvidence

The Homologue data type has an ‘dataSets’ reference of type ‘DataSet’.

model[which(model$type=="DataSet"),]
##        type  child_name  child_type
## 403 DataSet description            
## 404 DataSet          id            
## 405 DataSet        name            
## 406 DataSet         url            
## 407 DataSet     version            
## 408 DataSet bioEntities   BioEntity
## 409 DataSet  dataSource  DataSource
## 410 DataSet publication Publication

DataSet has a child called name. Add Gene.homologues.dataSets.name to the view. We’ll add it as the last column, we can see from above there are 5 other columns already so we’ll put it as #6:

# use setQuery function which will create an InterMineR-class query
queryGeneOrth.InterMineR = setQuery(
  inheritQuery = queryGeneOrth,
  select = c(queryGeneOrth$select, 
             "Gene.homologues.dataSets.name")
  )

getSelect(queryGeneOrth.InterMineR)
## [1] "Gene.primaryIdentifier"                      
## [2] "Gene.symbol"                                 
## [3] "Gene.homologues.homologue.primaryIdentifier" 
## [4] "Gene.homologues.homologue.symbol"            
## [5] "Gene.homologues.homologue.organism.shortName"
## [6] "Gene.homologues.dataSets.name"
#queryGeneOrth.InterMineR@select

# or assign new column directly to the existing list query
queryGeneOrth$select[[6]] <- "Gene.homologues.dataSets.name"
queryGeneOrth$select
## [1] "Gene.primaryIdentifier"                      
## [2] "Gene.symbol"                                 
## [3] "Gene.homologues.homologue.primaryIdentifier" 
## [4] "Gene.homologues.homologue.symbol"            
## [5] "Gene.homologues.homologue.organism.shortName"
## [6] "Gene.homologues.dataSets.name"
# run queries
resGeneOrth.InterMineR <- runQuery(im, queryGeneOrth.InterMineR)
resGeneOrth <- runQuery(im, queryGeneOrth)

all(resGeneOrth == resGeneOrth.InterMineR)
## [1] TRUE
head(resGeneOrth, 3)
##   Gene.primaryIdentifier Gene.symbol
## 1                     28         ABO
## 2                     28         ABO
## 3                     28         ABO
##   Gene.homologues.homologue.primaryIdentifier
## 1                                 MGI:2135738
## 2                                 RGD:2307241
## 3                                  RGD:628609
##   Gene.homologues.homologue.symbol
## 1                              Abo
## 2                              Abo
## 3                             Abo3
##   Gene.homologues.homologue.organism.shortName Gene.homologues.dataSets.name
## 1                                  M. musculus              Panther data set
## 2                                R. norvegicus              Panther data set
## 3                                R. norvegicus              Panther data set

NB: adding columns can result in changing the row count.

3.4 Change constraint logic

The constraintLogic, if not given, is ‘A and B’. We would now try to explicitly specify the constraintLogic. A and B corresponds to the ‘code’ for each constraint.

queryGeneOrth$constraintLogic <- "A and B"
queryGeneOrth$constraintLogic
## [1] "A and B"

Run the query again and see no change:

resGeneOrth <- runQuery(im, queryGeneOrth)
resGeneOrth
##    Gene.primaryIdentifier Gene.symbol
## 1                      28         ABO
## 2                      28         ABO
## 3                      28         ABO
## 4                      28         ABO
## 5                      28         ABO
## 6                      28         ABO
## 7                      28         ABO
## 8                      28         ABO
## 9                      28         ABO
## 10                     28         ABO
## 11                     28         ABO
##    Gene.homologues.homologue.primaryIdentifier
## 1                                  MGI:2135738
## 2                                  RGD:2307241
## 3                                   RGD:628609
## 4                            ZDB-GENE-031204-4
## 5                         ZDB-GENE-040426-1117
## 6                           ZDB-GENE-040912-46
## 7                           ZDB-GENE-060531-15
## 8                           ZDB-GENE-060531-59
## 9                           ZDB-GENE-060531-70
## 10                          ZDB-GENE-060531-71
## 11                          ZDB-GENE-081104-23
##    Gene.homologues.homologue.symbol
## 1                               Abo
## 2                               Abo
## 3                              Abo3
## 4                                  
## 5                                  
## 6                                  
## 7                                  
## 8                                  
## 9                                  
## 10                                 
## 11                                 
##    Gene.homologues.homologue.organism.shortName
## 1                                   M. musculus
## 2                                 R. norvegicus
## 3                                 R. norvegicus
## 4                                      D. rerio
## 5                                      D. rerio
## 6                                      D. rerio
## 7                                      D. rerio
## 8                                      D. rerio
## 9                                      D. rerio
## 10                                     D. rerio
## 11                                     D. rerio
##    Gene.homologues.dataSets.name
## 1               Panther data set
## 2               Panther data set
## 3               Panther data set
## 4               Panther data set
## 5               Panther data set
## 6               Panther data set
## 7               Panther data set
## 8               Panther data set
## 9               Panther data set
## 10              Panther data set
## 11              Panther data set

Change to be ‘A or B’ and see how the results change.

4 Recipes

4.1 Obtain the gene ontology (GO) terms associated with gene ABO

  • Start with the template Gene GO
queryGeneGO <- getTemplateQuery(im, "Gene_GO")
queryGeneGO
## $model
##      name 
## "genomic" 
## 
## $title
## [1] "Gene --> GO terms."
## 
## $description
## [1] "Search for GO annotations for a particular gene (or List of Genes)."
## 
## $select
## [1] "Gene.primaryIdentifier"                           
## [2] "Gene.symbol"                                      
## [3] "Gene.goAnnotation.ontologyTerm.identifier"        
## [4] "Gene.goAnnotation.ontologyTerm.name"              
## [5] "Gene.goAnnotation.ontologyTerm.namespace"         
## [6] "Gene.goAnnotation.evidence.code.code"             
## [7] "Gene.goAnnotation.ontologyTerm.parents.identifier"
## [8] "Gene.goAnnotation.ontologyTerm.parents.name"      
## [9] "Gene.goAnnotation.qualifier"                      
## 
## $name
## [1] "Gene_GO"
## 
## $comment
## [1] "Added 15NOV2010: ML"
## 
## $orderBy
## $orderBy[[1]]
## Gene.primaryIdentifier 
##                  "ASC" 
## 
## 
## $where
## $where[[1]]
## $where[[1]]$path
## [1] "Gene"
## 
## $where[[1]]$op
## [1] "LOOKUP"
## 
## $where[[1]]$code
## [1] "A"
## 
## $where[[1]]$editable
## [1] TRUE
## 
## $where[[1]]$switchable
## [1] FALSE
## 
## $where[[1]]$switched
## [1] "LOCKED"
## 
## $where[[1]]$value
## [1] "PPARG"
## 
## $where[[1]]$extraValue
## [1] "H. sapiens"
  • Modify the view to display a compact view
queryGeneGO$select <- queryGeneGO$select[2:5]
queryGeneGO$select
## [1] "Gene.symbol"                              
## [2] "Gene.goAnnotation.ontologyTerm.identifier"
## [3] "Gene.goAnnotation.ontologyTerm.name"      
## [4] "Gene.goAnnotation.ontologyTerm.namespace"
  • Modify the constraints to look for gene ABO.
queryGeneGO$where[[1]][["value"]] <- "ABO"
queryGeneGO$where
## [[1]]
## [[1]]$path
## [1] "Gene"
## 
## [[1]]$op
## [1] "LOOKUP"
## 
## [[1]]$code
## [1] "A"
## 
## [[1]]$editable
## [1] TRUE
## 
## [[1]]$switchable
## [1] FALSE
## 
## [[1]]$switched
## [1] "LOCKED"
## 
## [[1]]$value
## [1] "ABO"
## 
## [[1]]$extraValue
## [1] "H. sapiens"
  • Run the query
resGeneGO <- runQuery(im, queryGeneGO )
head(resGeneGO)
##   Gene.symbol Gene.goAnnotation.ontologyTerm.identifier
## 1         ABO                                GO:0000166
## 2         ABO                                GO:0003823
## 3         ABO                                GO:0004380
## 4         ABO                                GO:0004381
## 5         ABO                                GO:0005576
## 6         ABO                                GO:0005794
##                                                Gene.goAnnotation.ontologyTerm.name
## 1                                                               nucleotide binding
## 2                                                                  antigen binding
## 3 glycoprotein-fucosylgalactoside alpha-N-acetylgalactosaminyltransferase activity
## 4                        fucosylgalactoside 3-alpha-galactosyltransferase activity
## 5                                                             extracellular region
## 6                                                                  Golgi apparatus
##   Gene.goAnnotation.ontologyTerm.namespace
## 1                       molecular_function
## 2                       molecular_function
## 3                       molecular_function
## 4                       molecular_function
## 5                       cellular_component
## 6                       cellular_component

4.2 Obtain the genes associated with gene ontology (GO) term ‘metal ion binding’

  • Start with the template Gene GO
queryGOGene <- getTemplateQuery(im, "GOterm_Gene")
queryGOGene
## $model
##      name 
## "genomic" 
## 
## $title
## [1] "GO term --> Genes"
## 
## $description
## [1] "Search for Genes in a specified organism that are associated with a particular Gene Ontology (GO) annotation."
## 
## $select
## [1] "Gene.primaryIdentifier"                   
## [2] "Gene.symbol"                              
## [3] "Gene.name"                                
## [4] "Gene.goAnnotation.ontologyTerm.identifier"
## [5] "Gene.goAnnotation.ontologyTerm.name"      
## [6] "Gene.organism.shortName"                  
## 
## $constraintLogic
## [1] "A and B"
## 
## $name
## [1] "GOterm_Gene"
## 
## $comment
## [1] "Added 26OCT2010: ML"
## 
## $orderBy
## $orderBy[[1]]
## Gene.symbol 
##       "ASC" 
## 
## 
## $where
## $where[[1]]
## $where[[1]]$path
## [1] "Gene.goAnnotation.ontologyTerm.name"
## 
## $where[[1]]$op
## [1] "LIKE"
## 
## $where[[1]]$code
## [1] "A"
## 
## $where[[1]]$editable
## [1] TRUE
## 
## $where[[1]]$switchable
## [1] FALSE
## 
## $where[[1]]$switched
## [1] "LOCKED"
## 
## $where[[1]]$value
## [1] "DNA binding"
## 
## 
## $where[[2]]
## $where[[2]]$path
## [1] "Gene.organism.shortName"
## 
## $where[[2]]$op
## [1] "="
## 
## $where[[2]]$code
## [1] "B"
## 
## $where[[2]]$editable
## [1] TRUE
## 
## $where[[2]]$switchable
## [1] FALSE
## 
## $where[[2]]$switched
## [1] "LOCKED"
## 
## $where[[2]]$value
## [1] "H. sapiens"
  • Modify the view to display a compact view
queryGOGene$select <- queryGOGene$select[2:5]
queryGOGene$select
## [1] "Gene.symbol"                              
## [2] "Gene.name"                                
## [3] "Gene.goAnnotation.ontologyTerm.identifier"
## [4] "Gene.goAnnotation.ontologyTerm.name"
  • Modify the constraints to look for GO term ‘metal ion binding’
queryGOGene$where[[1]]$value = "metal ion binding"
queryGOGene$where
## [[1]]
## [[1]]$path
## [1] "Gene.goAnnotation.ontologyTerm.name"
## 
## [[1]]$op
## [1] "LIKE"
## 
## [[1]]$code
## [1] "A"
## 
## [[1]]$editable
## [1] TRUE
## 
## [[1]]$switchable
## [1] FALSE
## 
## [[1]]$switched
## [1] "LOCKED"
## 
## [[1]]$value
## [1] "metal ion binding"
## 
## 
## [[2]]
## [[2]]$path
## [1] "Gene.organism.shortName"
## 
## [[2]]$op
## [1] "="
## 
## [[2]]$code
## [1] "B"
## 
## [[2]]$editable
## [1] TRUE
## 
## [[2]]$switchable
## [1] FALSE
## 
## [[2]]$switched
## [1] "LOCKED"
## 
## [[2]]$value
## [1] "H. sapiens"
  • Run the query
resGOGene <- runQuery(im, queryGOGene )
head(resGOGene)
##   Gene.symbol                                  Gene.name
## 1     A3GALT2          alpha 1,3-galactosyltransferase 2
## 2        AARS                     alanyl-tRNA synthetase
## 3       AARS2    alanyl-tRNA synthetase 2, mitochondrial
## 4      AARSD1 alanyl-tRNA synthetase domain containing 1
## 5        ABAT           4-aminobutyrate aminotransferase
## 6      ABLIM1                actin binding LIM protein 1
##   Gene.goAnnotation.ontologyTerm.identifier
## 1                                GO:0046872
## 2                                GO:0046872
## 3                                GO:0046872
## 4                                GO:0046872
## 5                                GO:0046872
## 6                                GO:0046872
##   Gene.goAnnotation.ontologyTerm.name
## 1                   metal ion binding
## 2                   metal ion binding
## 3                   metal ion binding
## 4                   metal ion binding
## 5                   metal ion binding
## 6                   metal ion binding

4.3 Find and plot the genes within 50000 base pairs of gene ABCA6

  • Start with the Gene_Location template, update to search for ABCA6 gene.
queryGeneLoc = getTemplateQuery(im, "Gene_Location")
queryGeneLoc$where[[2]][["value"]] = "ABCA6"
resGeneLoc= runQuery(im, queryGeneLoc)

resGeneLoc
##   Gene.primaryIdentifier Gene.secondaryIdentifier Gene.symbol
## 1                  23460          ENSG00000154262       ABCA6
##                                   Gene.name
## 1 ATP binding cassette subfamily A member 6
##   Gene.chromosome.primaryIdentifier Gene.locations.start Gene.locations.end
## 1                                17             69062044           69142188
##   Gene.locations.strand
## 1                    -1

We’re going to use the output (gene location) as input for the next query.

  • Define a new query
# set constraints
constraints = setConstraints(
  paths = c(
    "Gene.chromosome.primaryIdentifier",
    "Gene.locations.start",
    "Gene.locations.end",
    "Gene.organism.name"
  ),
  operators = c(
    "=",
    ">=",
    "<=",
    "="
  ),
  values = list(
    resGeneLoc[1, "Gene.chromosome.primaryIdentifier"],
    as.character(as.numeric(resGeneLoc[1, "Gene.locations.start"])-50000),
    as.character(as.numeric(resGeneLoc[1, "Gene.locations.end"])+50000),
    "Homo sapiens"
  )
)

# set InterMineR-class query
queryNeighborGene = setQuery(
  select = c("Gene.primaryIdentifier", 
             "Gene.symbol",
             "Gene.chromosome.primaryIdentifier",
             "Gene.locations.start", 
             "Gene.locations.end", 
             "Gene.locations.strand"),
  where = constraints
)

summary(queryNeighborGene)
##                                path op        value code
## 1 Gene.chromosome.primaryIdentifier  =           17    A
## 2              Gene.locations.start >=     69012044    B
## 3                Gene.locations.end <=     69192188    C
## 4                Gene.organism.name  = Homo sapiens    D
  • Run the query
resNeighborGene <- runQuery(im, queryNeighborGene)
resNeighborGene
##   Gene.primaryIdentifier  Gene.symbol Gene.chromosome.primaryIdentifier
## 1              100421166 LOC100421166                                17
## 2              100616316     MIR4524A                                17
## 3              100847008     MIR4524B                                17
## 4                  23460        ABCA6                                17
##   Gene.locations.start Gene.locations.end Gene.locations.strand
## 1             69094100           69095506                    -1
## 2             69099564           69099632                    -1
## 3             69099542           69099656                     1
## 4             69062044           69142188                    -1
  • Plot the genes
resNeighborGene$Gene.locations.strand[which(resNeighborGene$Gene.locations.strand==1)]="+"

resNeighborGene$Gene.locations.strand[which(resNeighborGene$Gene.locations.strand==-1)]="-"

gene.idx = which(nchar(resNeighborGene$Gene.symbol)==0)

resNeighborGene$Gene.symbol[gene.idx]=resNeighborGene$Gene.primaryIdentifier[gene.idx]
require(Gviz)
annTrack = AnnotationTrack(
  start=resNeighborGene$Gene.locations.start,
  end=resNeighborGene$Gene.locations.end,
  strand=resNeighborGene$Gene.locations.strand,
  chromosome=resNeighborGene$Gene.chromosome.primaryIdentifier[1],
  genome="GRCh38", 
  name="around ABCA6",
  id=resNeighborGene$Gene.symbol)

gtr <- GenomeAxisTrack()
itr <- IdeogramTrack(genome="hg38", chromosome="chr17")

plotTracks(list(gtr, itr, annTrack), shape="box", showFeatureId=TRUE, fontcolor="black")

5 System info

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=C                  LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Gviz_1.22.0          GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 
##  [4] org.Hs.eg.db_3.4.2   GO.db_3.4.2          GeneAnswers_2.20.0  
##  [7] RColorBrewer_1.1-2   Heatplus_2.24.0      MASS_7.3-47         
## [10] annotate_1.56.0      XML_3.98-1.9         AnnotationDbi_1.40.0
## [13] IRanges_2.12.0       S4Vectors_0.16.0     Biobase_2.38.0      
## [16] BiocGenerics_0.24.0  RCurl_1.95-4.8       bitops_1.0-6        
## [19] igraph_1.1.2         RSQLite_2.0          InterMineR_1.0.0    
## [22] BiocStyle_2.6.0     
## 
## loaded via a namespace (and not attached):
##  [1] ProtGenerics_1.10.0           matrixStats_0.52.2           
##  [3] bit64_0.9-7                   progress_1.1.2               
##  [5] httr_1.3.1                    rprojroot_1.2                
##  [7] tools_3.4.2                   backports_1.1.1              
##  [9] R6_2.2.2                      rpart_4.1-11                 
## [11] Hmisc_4.0-3                   DBI_0.7                      
## [13] lazyeval_0.2.1                colorspace_1.3-2             
## [15] nnet_7.3-12                   RMySQL_0.10.13               
## [17] gridExtra_2.3                 prettyunits_1.0.2            
## [19] bit_1.1-12                    curl_3.0                     
## [21] compiler_3.4.2                chron_2.3-51                 
## [23] graph_1.56.0                  htmlTable_1.9                
## [25] xml2_1.1.1                    DelayedArray_0.4.0           
## [27] rtracklayer_1.38.0            bookdown_0.5                 
## [29] checkmate_1.8.5               scales_0.5.0                 
## [31] RBGL_1.54.0                   stringr_1.2.0                
## [33] digest_0.6.12                 Rsamtools_1.30.0             
## [35] foreign_0.8-69                rmarkdown_1.6                
## [37] XVector_0.18.0                dichromat_2.0-0              
## [39] base64enc_0.1-3               pkgconfig_2.0.1              
## [41] htmltools_0.3.6               ensembldb_2.2.0              
## [43] BSgenome_1.46.0               htmlwidgets_0.9              
## [45] rlang_0.1.2                   BiocInstaller_1.28.0         
## [47] shiny_1.0.5                   jsonlite_1.5                 
## [49] BiocParallel_1.12.0           acepack_1.4.1                
## [51] VariantAnnotation_1.24.0      magrittr_1.5                 
## [53] GenomeInfoDbData_0.99.1       Formula_1.2-2                
## [55] Matrix_1.2-11                 Rcpp_0.12.13                 
## [57] munsell_0.4.3                 proto_1.0.0                  
## [59] sqldf_0.4-11                  stringi_1.1.5                
## [61] yaml_2.1.14                   RJSONIO_1.3-0                
## [63] SummarizedExperiment_1.8.0    zlibbioc_1.24.0              
## [65] AnnotationHub_2.10.0          plyr_1.8.4                   
## [67] blob_1.1.0                    lattice_0.20-35              
## [69] Biostrings_2.46.0             splines_3.4.2                
## [71] GenomicFeatures_1.30.0        knitr_1.17                   
## [73] tcltk_3.4.2                   biomaRt_2.34.0               
## [75] evaluate_0.10.1               downloader_0.4               
## [77] biovizBase_1.26.0             latticeExtra_0.6-28          
## [79] data.table_1.10.4-3           httpuv_1.3.5                 
## [81] gtable_0.2.0                  assertthat_0.2.0             
## [83] gsubfn_0.6-6                  ggplot2_2.2.1                
## [85] mime_0.5                      xtable_1.8-2                 
## [87] AnnotationFilter_1.2.0        survival_2.41-3              
## [89] tibble_1.3.4                  GenomicAlignments_1.14.0     
## [91] memoise_1.1.0                 cluster_2.0.6                
## [93] interactiveDisplayBase_1.16.0

6 Appendix

6.1 Visual way to derive the column name of a query view or the path name in a query constraint from the database webpage


The InterMine model could be accessed from the mine homepage by clicking the tab “QueryBuilder” and selecting the appropriate data type under “Select a Data Type to Begin a Query”:


Here we select Gene as the data type:


We could select Symbol and Chromosome->Primary Identifier by clicking Show on the right of them. Then click “Export XML” at the bottom right corner of the webpage:


The column names Gene.symbol and Gene.chromosome.primaryIdentifier are contained in the XML output: