1 Introduction

BiocSet is a package that represents element sets in a tibble format with the BiocSet class. Element sets are read in and converted into a tibble format. From here, typical dplyr operations can be performed on the element set. BiocSet also provides functionality for mapping different ID types and providing reference urls for elements and sets.

2 Installation

Install the most recent version from Bioconductor:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BiocSet")

The development version can also be installed from GitHub:

BiocManager::install("Kayla-Morrell/BiocSet")

Then load BiocSet:

library(BiocSet)

3 BiocSet

3.1 Input and Output

A BiocSet object can be created in two ways. The first is to pass named character vectors of element sets to BiocSet(). The resulting object holds three tibbles: es_element, which contains the elements; es_set, which contains the sets; and es_elementset, which pairs each element with its set.

tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 x 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 49 more rows
#> 
#> es_set():
#> # A tibble: 2 x 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 x 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # … with 49 more rows
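
Each of the three tibbles can also be pulled out on its own. The lines below are a minimal sketch that assumes the accessors es_element(), es_set() and es_elementset() each return an ordinary tibble; the object name es is only illustrative.

```r
library(BiocSet)

# Build the same two-set object as above
es <- BiocSet(set1 = letters, set2 = LETTERS)

# Extract each tibble individually
elements   <- es_element(es)      # one row per unique element
sets       <- es_set(es)          # one row per set
membership <- es_elementset(es)   # one row per element/set pair

nrow(elements)
nrow(sets)
nrow(membership)
```

Because the accessors return plain tibbles, the results can be passed to any function that works on a data frame.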

The second way to create a BiocSet object is to read in a .gmt file. import() takes the path to a downloaded .gmt file and returns a BiocSet object. The example below uses a hallmark element set downloaded from GSEA, which is also included with this package. This BiocSet includes a source column within the es_set tibble that records where each set came from.

gmtFile <- system.file(package = "BiocSet",
                        "extdata",
                        "hallmark.gene.symbol.gmt")
tbl2 <- import(gmtFile)
tbl2
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 4,386 x 1
#>   element
#>   <chr>  
#> 1 JUNB   
#> 2 CXCL2  
#> 3 ATF3   
#> # … with 4,383 more rows
#> 
#> es_set():
#> # A tibble: 50 x 2
#>   set                     source                                             
#>   <chr>                   <chr>                                              
#> 1 HALLMARK_TNFA_SIGNALIN… http://www.broadinstitute.org/gsea/msigdb/cards/HA…
#> 2 HALLMARK_HYPOXIA        http://www.broadinstitute.org/gsea/msigdb/cards/HA…
#> 3 HALLMARK_CHOLESTEROL_H… http://www.broadinstitute.org/gsea/msigdb/cards/HA…
#> # … with 47 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 7,324 x 2
#>   element set                             
#>   <chr>   <chr>                           
#> 1 JUNB    HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> 2 CXCL2   HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> 3 ATF3    HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> # … with 7,321 more rows
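
To make the file layout concrete, here is a small sketch that writes a two-line .gmt file by hand and imports it. A GMT file is tab-delimited with one set per line: the set name, a description field (shown in the source column above), and then the elements. The set names, URLs and elements used here are made up for illustration.

```r
library(BiocSet)

# Two hypothetical sets written in GMT layout:
# <set name> TAB <description/source> TAB <element> TAB <element> ...
toy_gmt <- tempfile(fileext = ".gmt")
writeLines(
    c("set1\thttp://example.org/set1\ta\tb\tc",
      "set2\thttp://example.org/set2\tB\tC"),
    toy_gmt
)

# import() picks the GMT reader from the file extension
toy <- import(toy_gmt)
es_set(toy)
es_elementset(toy)
```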

export() writes a BiocSet object to a file with the extension .gmt; here we use a temporary file.

fl <- tempfile(fileext = ".gmt")
gmt <- export(tbl2, fl)
gmt
#> GMTFile object
#> resource: /tmp/Rtmp4yTTLP/file43f04ed59372.gmt
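
One way to check an export is to read the file back in and compare it with the original. This is only a sketch; the names original and roundtrip are illustrative, and the comparison is limited to set names and the number of element/set pairs.

```r
library(BiocSet)

gmtFile <- system.file(package = "BiocSet",
                       "extdata",
                       "hallmark.gene.symbol.gmt")
original <- import(gmtFile)

# Write to a temporary .gmt file, then read it back
out <- tempfile(fileext = ".gmt")
export(original, out)
roundtrip <- import(out)

# Compare the set names and the number of element/set pairs
setequal(es_set(original)$set, es_set(roundtrip)$set)
nrow(es_elementset(original)) == nrow(es_elementset(roundtrip))
```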

3.2 Implemented functions

BiocSet lets the user activate a particular tibble and apply functions to it. When a BiocSet is created, the es_elementset tibble is activated by default and all functions are applied to it. BiocSet adopts many common dplyr functions such as filter(), select(), mutate(), summarise(), and arrange(). With each of these functions the user can choose a different tibble to work on by using the ‘verb_tibble’ form, for example filter_element(). After the function has executed, the ‘active’ tibble reverts to the one that was active before the call. Some examples of how these functions work are shown below.

tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 x 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 49 more rows
#> 
#> es_set():
#> # A tibble: 2 x 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 x 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # … with 49 more rows
tbl %>% filter_element(element == "a" | element == "A")
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 2 x 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 A      
#> 
#> es_set():
#> # A tibble: 2 x 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 2 x 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 A       set2
tbl %>% mutate_set(pval = rnorm(1:2))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 x 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 49 more rows
#> 
#> es_set():
#> # A tibble: 2 x 2
#>   set     pval
#>   <chr>  <dbl>
#> 1 set1   0.201
#> 2 set2  -1.01 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 x 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # … with 49 more rows
tbl %>% arrange_elementset(desc(element))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 52 x 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 49 more rows
#> 
#> es_set():
#> # A tibble: 2 x 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 52 x 2
#>   element set  
#>   <chr>   <chr>
#> 1 z       set1 
#> 2 y       set1 
#> 3 x       set1 
#> # … with 49 more rows
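
The ‘verb_tibble’ form can be read as shorthand for activating a tibble and then applying the plain verb. The sketch below compares the two spellings on a toy object; it assumes es_activate() accepts the unquoted tibble name (as it is used later in this vignette) and that dplyr is attached for %>% and filter().

```r
library(BiocSet)
library(dplyr)

es <- BiocSet(set1 = letters, set2 = LETTERS)

# Explicit activation followed by the plain dplyr verb
via_activate <- es %>%
    es_activate(element) %>%
    filter(element %in% c("a", "e"))

# The verb_tibble shorthand
via_shorthand <- es %>%
    filter_element(element %in% c("a", "e"))

es_element(via_activate)
es_element(via_shorthand)
```

The shorthand restores the previously active tibble after the call, whereas after es_activate() the newly activated tibble presumably stays active for the verbs that follow.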

3.3 Set operations

BiocSet also supports common set operations on BiocSet objects. Two are available: union() and intersect(). We demonstrate how to take the union of two BiocSet objects and the union of the sets within a single BiocSet object; intersection is used in the same way.

# union of two BiocSet objects
es1 <- BiocSet(set1 = letters[c(1:3)], set2 = LETTERS[c(1:3)])
es2 <- BiocSet(set1 = letters[c(2:4)], set2 = LETTERS[c(2:4)])
union(es1, es2)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 8 x 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 5 more rows
#> 
#> es_set():
#> # A tibble: 2 x 1
#>   set  
#>   <chr>
#> 1 set1 
#> 2 set2 
#> 
#> es_elementset() <active>:
#> # A tibble: 8 x 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       set1 
#> 2 b       set1 
#> 3 c       set1 
#> # … with 5 more rows

# union within a single BiocSet object
es3 <- BiocSet(set1 = letters[c(1:10)], set2 = letters[c(4:20)])
union_single(es3)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 20 x 1
#>   element
#>   <chr>  
#> 1 a      
#> 2 b      
#> 3 c      
#> # … with 17 more rows
#> 
#> es_set():
#> # A tibble: 1 x 1
#>   set  
#>   <chr>
#> 1 union
#> 
#> es_elementset() <active>:
#> # A tibble: 20 x 2
#>   element set  
#>   <chr>   <chr>
#> 1 a       union
#> 2 b       union
#> 3 c       union
#> # … with 17 more rows
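
For completeness, here is the intersection counterpart on the same toy objects. It is a sketch that assumes the functions are named intersect() and intersect_single(), mirroring union() and union_single(); check the package reference if either name is not found.

```r
library(BiocSet)
library(dplyr)

# intersection of two BiocSet objects
es1 <- BiocSet(set1 = letters[c(1:3)], set2 = LETTERS[c(1:3)])
es2 <- BiocSet(set1 = letters[c(2:4)], set2 = LETTERS[c(2:4)])
between <- intersect(es1, es2)
es_element(between)

# intersection within a single BiocSet object
es3 <- BiocSet(set1 = letters[c(1:10)], set2 = letters[c(4:20)])
within <- intersect_single(es3)
es_elementset(within)
```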

4 Case study I

Next, we demonstrate a couple of uses of BiocSet with the experiment dataset airway from the package airway. This data is from an RNA-Seq experiment on airway smooth muscle (ASM) cell lines.

The first step is to load the library and the necessary data.

library(airway)
data("airway")
se <- airway

The function go_sets() discovers the keys from the org object and uses AnnotationDbi::select to create a mapping to GO ids. go_sets() also lets the user indicate which evidence types or ontology types to keep when selecting the GO ids. The default is to use all evidence types and all ontology types. We represent these identifiers as a BiocSet object. Using go_sets() we map the Ensembl ids to GO ids from the genome-wide annotation for Human in the org.Hs.eg.db package. The Ensembl ids are treated as elements while the GO ids are treated as sets.

library(org.Hs.eg.db)
go <- go_sets(org.Hs.eg.db, "ENSEMBL")
go
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 22,512 x 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 2.251e+04 more rows
#> 
#> es_set():
#> # A tibble: 18,175 x 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000010
#> # … with 1.817e+04 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 318,556 x 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 3.186e+05 more rows

# an example of subsetting by evidence type
go_sets(org.Hs.eg.db, "ENSEMBL", evidence = c("IPI", "TAS"))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 16,011 x 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000168685
#> 3 ENSG00000114030
#> # … with 1.601e+04 more rows
#> 
#> es_set():
#> # A tibble: 4,318 x 2
#>   set        evidence    
#>   <chr>      <named list>
#> 1 GO:0000002 <chr [1]>   
#> 2 GO:0000018 <chr [1]>   
#> 3 GO:0000019 <chr [1]>   
#> # … with 4,315 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 68,112 x 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000168685 GO:0000018
#> 3 ENSG00000114030 GO:0000018
#> # … with 6.811e+04 more rows

Some users may not be interested in reporting the non-descriptive elements. We demonstrate subsetting the airway data to the genes with non-zero total counts and then filtering the BiocSet down to the elements that remain.

se1 <- se[rowSums(assay(se)) != 0,]
go %>% filter_element(element %in% rownames(se1))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 16,479 x 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 1.648e+04 more rows
#> 
#> es_set():
#> # A tibble: 18,175 x 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000010
#> # … with 1.817e+04 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 250,234 x 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 2.502e+05 more rows

It may also be of interest to users to know how many elements are in each set. Using the count function we are able to calculate the elements per set.

go %>% group_by(set) %>% dplyr::count()
#> # A tibble: 18,175 x 2
#> # Groups:   set [18,175]
#>    set            n
#>    <chr>      <int>
#>  1 GO:0000002    13
#>  2 GO:0000003     4
#>  3 GO:0000010     2
#>  4 GO:0000012     7
#>  5 GO:0000014     9
#>  6 GO:0000015     4
#>  7 GO:0000016     1
#>  8 GO:0000018     6
#>  9 GO:0000019     3
#> 10 GO:0000022     2
#> # … with 18,165 more rows
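
The per-set counts can then drive a filter, for example to keep only sets above a minimum size. The following sketch uses a toy object and a hypothetical threshold of 5 elements; it counts directly on the tibble returned by es_elementset().

```r
library(BiocSet)
library(dplyr)

es <- BiocSet(small = letters[1:2], large = letters[1:10])

# Number of elements per set, computed on the plain tibble
sizes <- es_elementset(es) %>% dplyr::count(set)

# Names of the sets that meet the (hypothetical) minimum size
keep <- sizes %>% filter(n >= 5) %>% pull(set)

# Keep only those sets in the BiocSet
big <- es %>% filter_set(set %in% keep)
es_set(big)
```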

It may also be helpful to remove sets that are empty. Since we have shown how to calculate the number of elements per set, we know that this data set does not contain any empty sets. We decide to demonstrate regardless for those users that may need this functionality.

drop <- es_activate(go, elementset) %>% group_by(set) %>%
    dplyr::count() %>% filter(n == 0) %>% pull(set)
go %>% filter_set(!(set %in% drop))
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 22,512 x 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 2.251e+04 more rows
#> 
#> es_set():
#> # A tibble: 18,175 x 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000010
#> # … with 1.817e+04 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 318,556 x 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 3.186e+05 more rows
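
A set with no elements has no rows in es_elementset, so counting rows per set there cannot produce n == 0. An alternative sketch is to compare the set names in es_set with those that still appear in es_elementset. The toy object below gets an empty set by filtering away all of its elements; the name empty is illustrative.

```r
library(BiocSet)
library(dplyr)

# After this filter set2 has no elements left but is still listed in es_set
es <- BiocSet(set1 = letters[1:3], set2 = LETTERS[1:3]) %>%
    filter_element(element %in% letters)

# Sets that are listed in es_set but have no rows in es_elementset
empty <- setdiff(es_set(es)$set, es_elementset(es)$set)
empty

# Drop the empty sets
es %>% filter_set(!(set %in% empty))
```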

To simplify mapping we provide two map functions. map_unique() is used when the mapping is known to be 1:1. It takes four arguments: a BiocSet object, an AnnotationDbi object, the id type to map from, and the id type to map to. map_multiple() takes a fifth argument that indicates how an element should be treated when it maps to multiple ids. Both functions use mapIds() from AnnotationDbi and return a BiocSet object. In the example below we use map_unique() to map the elements of go from Ensembl ids to gene symbols.

go %>% map_unique(org.Hs.eg.db, "ENSEMBL", "SYMBOL")
#> 'select()' returned 1:many mapping between keys and columns
#> Joining, by = "element"
#> `mutate_if()` ignored the following grouping variables:
#> Column `element`
#> Joining, by = "element"
#> Joining, by = "element"
#> Joining, by = "set"
#> Joining, by = c("element", "set")
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 19,821 x 1
#>   element
#>   <chr>  
#> 1 AKT3   
#> 2 LONP1  
#> 3 MEF2A  
#> # … with 1.982e+04 more rows
#> 
#> es_set():
#> # A tibble: 18,175 x 1
#>   set       
#>   <chr>     
#> 1 GO:0000002
#> 2 GO:0000003
#> 3 GO:0000010
#> # … with 1.817e+04 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 281,598 x 2
#>   element set       
#>   <chr>   <chr>     
#> 1 AKT3    GO:0000002
#> 2 LONP1   GO:0000002
#> 3 MEF2A   GO:0000002
#> # … with 2.816e+05 more rows
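
Since both map functions rely on mapIds(), it can help to see how mapIds() itself treats keys with several matches. The sketch below calls AnnotationDbi::mapIds() directly with two values of its multiVals argument; it illustrates the underlying behaviour rather than the map_multiple() interface, whose exact argument name is not shown here.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# Three Ensembl ids taken from the go object above
keys <- c("ENSG00000151729", "ENSG00000025708", "ENSG00000068305")

# "first" keeps one symbol per key and returns a named character vector
first_hit <- mapIds(org.Hs.eg.db, keys, "SYMBOL", "ENSEMBL",
                    multiVals = "first")

# "list" keeps every symbol per key and returns a named list
all_hits <- mapIds(org.Hs.eg.db, keys, "SYMBOL", "ENSEMBL",
                   multiVals = "list")

first_hit
lengths(all_hits)
```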

BiocSet can also add information to the tibbles. Using the GO.db library we can map definitions to the GO ids. We then add the mapping to the set tibble using map_add_set() and mutate_set().

library(GO.db)
map <- map_add_set(go, GO.db, "GOID", "DEFINITION")
go %>% mutate_set(definition = map)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 22,512 x 1
#>   element        
#>   <chr>          
#> 1 ENSG00000151729
#> 2 ENSG00000025708
#> 3 ENSG00000068305
#> # … with 2.251e+04 more rows
#> 
#> es_set():
#> # A tibble: 18,175 x 2
#>   set        definition                                                      
#>   <chr>      <chr>                                                           
#> 1 GO:0000002 The maintenance of the structure and integrity of the mitochond…
#> 2 GO:0000003 The production of new individuals that contain some portion of …
#> 3 GO:0000010 Catalysis of the reaction: all-trans-hexaprenyl diphosphate + i…
#> # … with 1.817e+04 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 318,556 x 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 3.186e+05 more rows

The library KEGGREST is a client interface to the KEGG REST server. KEGG contains pathway maps that represent interaction, reaction and relation networks for various biological processes and diseases. BiocSet has a function that utilizes KEGGREST to develop a BiocSet object that contains the elements for every pathway map in KEGG.

Due to limitations of the KEGGREST package, kegg_sets() can take some time to run, depending on the number of pathways for the species of interest. Therefore we demonstrate using BiocFileCache to store the result so that it only has to be retrieved once.

library(BiocFileCache)
rname <- "kegg_hsa"
# Check whether the KEGG sets have already been cached
exists <- NROW(bfcquery(query=rname, field="rname")) != 0L
if (!exists)
{
    # Retrieve the sets once, then write them to a new cache entry
    kegg <- kegg_sets("hsa")
    fl <- bfcnew(rname = rname, ext = ".gmt")
    export(kegg, fl)
}
kegg <- import(bfcrpath(rname=rname))

Within the kegg_sets() function we remove pathways that do not contain any elements. We then use map_add_element() and mutate_element() so that the element tibble contains both Entrez and Ensembl ids.

map <- map_add_element(kegg, org.Hs.eg.db, "ENTREZID", "ENSEMBL")
#> 'select()' returned 1:many mapping between keys and columns
kegg <- kegg %>% mutate_element(ensembl = map)

Since we are working with ASM data we thought we would subset the airway data to contain only the elements in the asthma pathway. This filter is performed on the KEGG id, which for asthma is “hsa05310”.

asthma <- kegg %>% filter_set(set == "hsa05310")

se <- se[rownames(se) %in% es_element(asthma)$ensembl,]

se
#> class: RangedSummarizedExperiment 
#> dim: 7683 8 
#> metadata(1): ''
#> assays(1): counts
#> rownames(7683): ENSG00000000419 ENSG00000000938 ... ENSG00000273079
#>   ENSG00000273085
#> rowData names(0):
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(9): SampleName cell ... Sample BioSample

The filtering can also be done for multiple pathways.

pathways <- c("hsa05310", "hsa04110", "hsa05224", "hsa04970")
multipaths <- kegg %>% filter_set(set %in% pathways)

multipaths
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 7,911 x 2
#>   element ensembl        
#>   <chr>   <chr>          
#> 1 3101    ENSG00000160883
#> 2 3098    ENSG00000156515
#> 3 3099    ENSG00000159399
#> # … with 7,908 more rows
#> 
#> es_set():
#> # A tibble: 4 x 2
#>   set      source
#>   <chr>    <chr> 
#> 1 hsa04110 <NA>  
#> 2 hsa04970 <NA>  
#> 3 hsa05224 <NA>  
#> # … with 1 more row
#> 
#> es_elementset() <active>:
#> # A tibble: 392 x 2
#>   element set     
#>   <chr>   <chr>   
#> 1 595     hsa04110
#> 2 894     hsa04110
#> 3 896     hsa04110
#> # … with 389 more rows

5 Case study II

This example starts out the same way as Case Study I, by loading the airway dataset, but we also do some reformatting of the data. The end goal is to perform a Gene Set Enrichment Analysis on the data and return a BiocSet object of the gene sets.

data("airway")
airway$dex <- relevel(airway$dex, "untrt")

Similar to other analyses we perform a differential expression analysis on the airway data using the library DESeq2 and then store the results into a tibble.

library(DESeq2)
library(tibble)
des <- DESeqDataSet(airway, design = ~ cell + dex)
des <- DESeq(des)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
res <- results(des)

tbl <- res %>% 
    as.data.frame() %>%
    as_tibble(rownames = "ENSEMBL")

Since we want to use limma::goana() to perform the GSEA we will need to have ENTREZ identifiers in the data, as well as filter out some NA information. Later on this will be considered our ‘element’ tibble.

tbl <- tbl %>% 
    mutate(
        ENTREZID = mapIds(
            org.Hs.eg.db, ENSEMBL, "ENTREZID", "ENSEMBL"
        ) %>% unname()
    )
#> 'select()' returned 1:many mapping between keys and columns

tbl <- tbl %>% filter(!is.na(padj), !is.na(ENTREZID))
tbl
#> # A tibble: 14,589 x 8
#>    ENSEMBL    baseMean log2FoldChange  lfcSE   stat   pvalue    padj ENTREZID
#>    <chr>         <dbl>          <dbl>  <dbl>  <dbl>    <dbl>   <dbl> <chr>   
#>  1 ENSG00000…    709.         -0.381  0.101  -3.79   1.52e-4 1.28e-3 7105    
#>  2 ENSG00000…    520.          0.207  0.112   1.84   6.53e-2 1.97e-1 8813    
#>  3 ENSG00000…    237.          0.0379 0.143   0.264  7.92e-1 9.11e-1 57147   
#>  4 ENSG00000…     57.9        -0.0882 0.287  -0.307  7.59e-1 8.95e-1 55732   
#>  5 ENSG00000…   5817.          0.426  0.0883  4.83   1.38e-6 1.82e-5 3075    
#>  6 ENSG00000…   1282.         -0.241  0.0887 -2.72   6.58e-3 3.28e-2 2519    
#>  7 ENSG00000…    610.         -0.0476 0.167  -0.286  7.75e-1 9.03e-1 2729    
#>  8 ENSG00000…    369.         -0.500  0.121  -4.14   3.48e-5 3.42e-4 4800    
#>  9 ENSG00000…    183.         -0.124  0.180  -0.689  4.91e-1 7.24e-1 90529   
#> 10 ENSG00000…   2814.         -0.0411 0.103  -0.400  6.89e-1 8.57e-1 57185   
#> # … with 14,579 more rows

Now that the data is ready for GSEA we can go ahead and use goana() and make the results into a tibble. This tibble will be considered our ‘set’ tibble.

library(limma)
#> 
#> Attaching package: 'limma'
#> The following object is masked from 'package:DESeq2':
#> 
#>     plotMA
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     plotMA
go_ids <- goana(tbl$ENTREZID[tbl$padj < 0.05], tbl$ENTREZID, "Hs") %>%
    as.data.frame() %>%
    as_tibble(rownames = "GOALL")
go_ids
#> # A tibble: 21,307 x 6
#>    GOALL     Term                                  Ont       N    DE     P.DE
#>    <chr>     <chr>                                 <chr> <dbl> <dbl>    <dbl>
#>  1 GO:00017… cell activation                       BP      944   334 8.24e-12
#>  2 GO:00022… immune effector process               BP      819   256 1.64e- 4
#>  3 GO:00022… cell activation involved in immune r… BP      500   148 2.73e- 2
#>  4 GO:00022… myeloid leukocyte activation          BP      465   149 1.22e- 3
#>  5 GO:00022… myeloid cell activation involved in … BP      399   118 4.55e- 2
#>  6 GO:00022… neutrophil activation involved in im… BP      361   108 4.03e- 2
#>  7 GO:00023… leukocyte activation involved in imm… BP      496   148 2.10e- 2
#>  8 GO:00023… immune system process                 BP     1993   662 7.62e-16
#>  9 GO:00024… leukocyte mediated immunity           BP      544   167 4.79e- 3
#> 10 GO:00024… myeloid leukocyte mediated immunity   BP      409   124 2.01e- 2
#> # … with 21,297 more rows

The last thing we need to do is create a tibble that we will consider our ‘elementset’ tibble. This tibble will be a mapping of all the elements and sets.

foo <- AnnotationDbi::select(
    org.Hs.eg.db,
    tbl$ENTREZID,
    "GOALL",
    "ENTREZID") %>% as_tibble()
#> 'select()' returned many:many mapping between keys and columns
foo <- foo %>% dplyr::select(-EVIDENCEALL) %>% distinct()
foo <- foo %>% filter(ONTOLOGYALL == "BP") %>% dplyr::select(-ONTOLOGYALL)
foo
#> # A tibble: 1,159,245 x 2
#>    ENTREZID GOALL     
#>    <chr>    <chr>     
#>  1 7105     GO:0001816
#>  2 7105     GO:0001817
#>  3 7105     GO:0001819
#>  4 7105     GO:0002218
#>  5 7105     GO:0002221
#>  6 7105     GO:0002252
#>  7 7105     GO:0002253
#>  8 7105     GO:0002376
#>  9 7105     GO:0002682
#> 10 7105     GO:0002683
#> # … with 1,159,235 more rows

The function BiocSet_from_elementset() lets users create a BiocSet object from tibbles. This is helpful when the tibbles contain metadata that should be carried into the BiocSet object. For the function to work properly, the columns that are joined on must be named correctly: the id column of the ‘element’ tibble must be called ‘element’, the id column of the ‘set’ tibble must be called ‘set’, and the ‘elementset’ tibble must contain both an ‘element’ and a ‘set’ column. We rename the columns below and then create the BiocSet object with a single call.

foo <- foo %>% dplyr::rename(element = ENTREZID, set = GOALL)
tbl <- tbl %>% dplyr::rename(element = ENTREZID)
go_ids <- go_ids %>% dplyr::rename(set = GOALL)
es <- BiocSet_from_elementset(foo, tbl, go_ids)
#> Joining, by = "element"
#> Joining, by = "set"
#> Joining, by = c("element", "set")
#> more elements in 'element' than in 'elementset'
#> more elements in 'set' than in 'elementset'
es
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 12,270 x 8
#>   element ENSEMBL     baseMean log2FoldChange  lfcSE   stat   pvalue     padj
#>   <chr>   <chr>          <dbl>          <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
#> 1 3980    ENSG000000…     206.        -0.393  0.172  -2.29  2.23e- 2  8.66e-2
#> 2 1890    ENSG000000…     145.         1.23   0.199   6.18  6.58e-10  1.49e-8
#> 3 50484   ENSG000000…    1407.        -0.0701 0.0902 -0.778 4.37e- 1  6.80e-1
#> # … with 1.227e+04 more rows
#> 
#> es_set():
#> # A tibble: 15,079 x 6
#>   set        Term                             Ont       N    DE       P.DE
#>   <chr>      <chr>                            <chr> <dbl> <dbl>      <dbl>
#> 1 GO:0000002 mitochondrial genome maintenance BP       29     6 0.796     
#> 2 GO:0000003 reproduction                     BP      948   308 0.00000102
#> 3 GO:0000012 single strand break repair       BP        8     1 0.908     
#> # … with 1.508e+04 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 1,159,245 x 2
#>   element set       
#>   <chr>   <chr>     
#> 1 3980    GO:0000002
#> 2 1890    GO:0000002
#> 3 50484   GO:0000002
#> # … with 1.159e+06 more rows
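
The naming requirement is easier to see on a small example. The sketch below builds three toy tibbles with made-up metadata columns (score and label) and passes them in the same argument order as above: elementset, element, set.

```r
library(BiocSet)
library(tibble)

# Membership: which element belongs to which set
toy_elementset <- tibble(element = c("a", "b", "c"),
                         set     = c("s1", "s1", "s2"))

# Element metadata, keyed by a column named 'element'
toy_element <- tibble(element = c("a", "b", "c"),
                      score   = c(0.1, 0.5, 0.9))

# Set metadata, keyed by a column named 'set'
toy_set <- tibble(set   = c("s1", "s2"),
                  label = c("first", "second"))

toy_es <- BiocSet_from_elementset(toy_elementset, toy_element, toy_set)
es_element(toy_es)
es_set(toy_es)
```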

For users who need to put this metadata-filled BiocSet object back into an object similar to GRanges or SummarizedExperiment, we provide functions that convert the BiocSet object into a tibble or data.frame.

tibble_from_element(es)
#> Joining, by = "set"
#> Joining, by = "element"
#> # A tibble: 12,262 x 14
#>    element set   Term  Ont   N     DE    P.DE  ENSEMBL baseMean
#>    <chr>   <lis> <lis> <lis> <lis> <lis> <lis> <list>  <list>  
#>  1 1       <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [2…
#>  2 100     <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [4…
#>  3 1000    <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [2…
#>  4 10000   <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1…
#>  5 10001   <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [7…
#>  6 10003   <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [3…
#>  7 10004   <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1…
#>  8 100048… <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [9…
#>  9 10005   <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1…
#> 10 10006   <chr… <chr… <chr… <dbl… <dbl… <dbl… <chr [… <dbl [1…
#> # … with 12,252 more rows, and 5 more variables: log2FoldChange <list>,
#> #   lfcSE <list>, stat <list>, pvalue <list>, padj <list>

head(data.frame_from_elementset(es))
#> Joining, by = "set"
#> Joining, by = "element"
#>   element        set                             Term Ont  N DE      P.DE
#> 1    3980 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7959496
#> 2    1890 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7959496
#> 3   50484 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7959496
#> 4    4205 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7959496
#> 5    9093 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7959496
#> 6    6742 GO:0000002 mitochondrial genome maintenance  BP 29  6 0.7959496
#>           ENSEMBL  baseMean log2FoldChange      lfcSE       stat
#> 1 ENSG00000005156  205.6627    -0.39272067 0.17180717 -2.2858223
#> 2 ENSG00000025708  144.6965     1.23104228 0.19933257  6.1758212
#> 3 ENSG00000048392 1406.6344    -0.07014829 0.09020226 -0.7776778
#> 4 ENSG00000068305 1565.5675     0.52905471 0.09186326  5.7591547
#> 5 ENSG00000103423  670.6312    -0.04006479 0.09900566 -0.4046718
#> 6 ENSG00000106028  680.9599     0.02649354 0.10141139  0.2612482
#>         pvalue         padj
#> 1 2.226466e-02 8.661787e-02
#> 2 6.582046e-10 1.494473e-08
#> 3 4.367590e-01 6.803673e-01
#> 4 8.453618e-09 1.636969e-07
#> 5 6.857188e-01 8.544470e-01
#> 6 7.939011e-01 9.121070e-01
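
On a toy object the shape of the flattened result is easier to inspect. This sketch reuses data.frame_from_elementset() as called above; the tibbles and their score and label columns are made up, and the exact column order of the result is not assumed.

```r
library(BiocSet)
library(tibble)

toy_es <- BiocSet_from_elementset(
    tibble(element = c("a", "b", "c"), set = c("s1", "s1", "s2")),
    tibble(element = c("a", "b", "c"), score = c(0.1, 0.5, 0.9)),
    tibble(set = c("s1", "s2"), label = c("first", "second"))
)

# One row per element/set pair, with the metadata joined on
flat <- data.frame_from_elementset(toy_es)
flat
```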

6 Information look up

A final feature of BiocSet is the ability to add reference information about the elements and sets. The function url_ref() adds reference urls to the BiocSet object. If a user has a question about a specific id, they can follow the reference url to get more information. Below is an example of using url_ref() to add reference urls to the go data set we worked with above.

url_ref(go)
#> class: BiocSet
#> 
#> es_element():
#> # A tibble: 22,512 x 2
#>   element        url                                                         
#>   <chr>          <chr>                                                       
#> 1 ENSG000001517… https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g…
#> 2 ENSG000000257… https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g…
#> 3 ENSG000000683… https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g…
#> # … with 2.251e+04 more rows
#> 
#> es_set():
#> # A tibble: 18,175 x 2
#>   set        url                                                           
#>   <chr>      <chr>                                                         
#> 1 GO:0000002 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000002
#> 2 GO:0000003 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000003
#> 3 GO:0000010 http://amigo.geneontology.org/amigo/medial_search?q=GO:0000010
#> # … with 1.817e+04 more rows
#> 
#> es_elementset() <active>:
#> # A tibble: 318,556 x 2
#>   element         set       
#>   <chr>           <chr>     
#> 1 ENSG00000151729 GO:0000002
#> 2 ENSG00000025708 GO:0000002
#> 3 ENSG00000068305 GO:0000002
#> # … with 3.186e+05 more rows

7 Session info

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] limma_3.42.0                tibble_2.1.3               
#>  [3] DESeq2_1.26.0               BiocFileCache_1.10.0       
#>  [5] dbplyr_1.4.2                GO.db_3.10.0               
#>  [7] org.Hs.eg.db_3.10.0         AnnotationDbi_1.48.0       
#>  [9] airway_1.6.0                SummarizedExperiment_1.16.0
#> [11] DelayedArray_0.12.0         BiocParallel_1.20.0        
#> [13] matrixStats_0.55.0          Biobase_2.46.0             
#> [15] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
#> [17] IRanges_2.20.0              S4Vectors_0.24.0           
#> [19] BiocGenerics_0.32.0         BiocSet_1.0.1              
#> [21] dplyr_0.8.3                 BiocStyle_2.14.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-6             bit64_0.9-7             
#>  [3] RColorBrewer_1.1-2       httr_1.4.1              
#>  [5] tools_3.6.1              backports_1.1.5         
#>  [7] utf8_1.1.4               R6_2.4.0                
#>  [9] rpart_4.1-15             Hmisc_4.2-0             
#> [11] DBI_1.0.0                lazyeval_0.2.2          
#> [13] colorspace_1.4-1         nnet_7.3-12             
#> [15] gridExtra_2.3            tidyselect_0.2.5        
#> [17] bit_1.1-14               curl_4.2                
#> [19] compiler_3.6.1           cli_1.1.0               
#> [21] htmlTable_1.13.2         formatR_1.7             
#> [23] rtracklayer_1.46.0       bookdown_0.14           
#> [25] checkmate_1.9.4          scales_1.0.0            
#> [27] genefilter_1.68.0        rappdirs_0.3.1          
#> [29] stringr_1.4.0            digest_0.6.22           
#> [31] Rsamtools_2.2.1          foreign_0.8-72          
#> [33] rmarkdown_1.16           XVector_0.26.0          
#> [35] base64enc_0.1-3          pkgconfig_2.0.3         
#> [37] htmltools_0.4.0          htmlwidgets_1.5.1       
#> [39] rlang_0.4.1              rstudioapi_0.10         
#> [41] RSQLite_2.1.2            acepack_1.4.1           
#> [43] RCurl_1.95-4.12          magrittr_1.5            
#> [45] GenomeInfoDbData_1.2.2   Formula_1.2-3           
#> [47] Matrix_1.2-17            Rcpp_1.0.2              
#> [49] munsell_0.5.0            fansi_0.4.0             
#> [51] stringi_1.4.3            yaml_2.2.0              
#> [53] zlibbioc_1.32.0          plyr_1.8.4              
#> [55] grid_3.6.1               blob_1.2.0              
#> [57] crayon_1.3.4             lattice_0.20-38         
#> [59] Biostrings_2.54.0        splines_3.6.1           
#> [61] annotate_1.64.0          KEGGREST_1.26.1         
#> [63] locfit_1.5-9.1           zeallot_0.1.0           
#> [65] knitr_1.25               pillar_1.4.2            
#> [67] geneplotter_1.64.0       XML_3.98-1.20           
#> [69] glue_1.3.1               evaluate_0.14           
#> [71] latticeExtra_0.6-28      data.table_1.12.6       
#> [73] BiocManager_1.30.9       png_0.1-7               
#> [75] vctrs_0.2.0              gtable_0.3.0            
#> [77] purrr_0.3.3              assertthat_0.2.1        
#> [79] ggplot2_3.2.1            xfun_0.10               
#> [81] xtable_1.8-4             survival_3.1-6          
#> [83] GenomicAlignments_1.22.0 memoise_1.1.0           
#> [85] cluster_2.1.0

BiocSet: Representing Element Sets in the Tidyverse

2019-11-07