1 Introduction

Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R in a tidy data format. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages.

Functionality includes access to build reports, download statistics, package details, and package dependency graphs.

2 Build reports

The Bioconductor build reports are available online as HTML pages. However, in that form they are hard to process programmatically. The biocBuildReport function does some heroic parsing of the HTML to produce a tidy data.frame for further processing in R.

library(BiocPkgTools)
head(biocBuildReport())
## # A tibble: 6 x 9
##   pkg   version author commit last_changed_date   node  stage result
##   <chr> <chr>   <chr>  <chr>  <dttm>              <chr> <chr> <chr> 
## 1 a4     1.30.0 Tobia…  771e… 2018-10-30 00:00:00 malb… inst… OK    
## 2 a4     1.30.0 Tobia…  771e… 2018-10-30 00:00:00 malb… buil… OK    
## 3 a4     1.30.0 Tobia…  771e… 2018-10-30 00:00:00 malb… chec… OK    
## 4 a4     1.30.0 Tobia…  771e… 2018-10-30 00:00:00 toka… inst… OK    
## 5 a4     1.30.0 Tobia…  771e… 2018-10-30 00:00:00 toka… buil… OK    
## 6 a4     1.30.0 Tobia…  771e… 2018-10-30 00:00:00 toka… chec… OK    
## # … with 1 more variable: bioc_version <chr>
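Once the report is a data frame, ordinary R summaries apply. The following is a minimal sketch using a small hand-made stand-in rather than a live call to biocBuildReport. The column names follow the output above, but every value (including the package name examplePkg, the node name, and the stage labels) is invented for illustration.

```r
# Toy stand-in for biocBuildReport() output; all values are invented.
report <- data.frame(
  pkg    = c("a4", "a4", "a4", "examplePkg", "examplePkg", "examplePkg"),
  node   = rep("node1", 6),                         # placeholder node name
  stage  = rep(c("install", "build", "check"), 2),  # placeholder stage labels
  result = c("OK", "OK", "OK", "OK", "OK", "ERROR"),
  stringsAsFactors = FALSE
)

# Cross-tabulate build stage against result.
stage_by_result <- table(report$stage, report$result)
print(stage_by_result)

# Packages with at least one non-OK result.
failing <- unique(report$pkg[report$result != "OK"])
print(failing)
```

With real data, the same two lines of summary code should work unchanged as long as the stage and result columns are present.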

2.1 Personal build report

Because developers may be interested in a quick view of their own packages, there is a simple function, problemPage, to produce an HTML report of the build status of packages matching a given author regex. The default is to report only “problem” build statuses (ERROR, WARNING).

problemPage()

When run in an interactive environment, the problemPage function will open a browser window for user interaction. Note that if you want to include all your package results, not just the broken ones, simply specify includeOK = TRUE.
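To make the selection logic concrete, here is a rough sketch of the kind of filtering described above: match an author regex, then keep only ERROR and WARNING rows unless all results are requested. The helper filter_by_author and its arguments are hypothetical and are not part of BiocPkgTools; the data are invented.

```r
# Hypothetical helper illustrating the filtering idea; not the package's code.
filter_by_author <- function(report, author_pattern, include_ok = FALSE) {
  keep <- grepl(author_pattern, report$author)
  if (!include_ok) {
    # Default: only "problem" statuses, as described in the text.
    keep <- keep & report$result %in% c("ERROR", "WARNING")
  }
  report[keep, , drop = FALSE]
}

# Invented build results for three packages by two authors.
report <- data.frame(
  pkg    = c("pkgA", "pkgA", "pkgB", "pkgC"),
  author = c("Jane Doe", "Jane Doe", "Jane Doe", "John Smith"),
  result = c("OK", "WARNING", "ERROR", "ERROR"),
  stringsAsFactors = FALSE
)

problems <- filter_by_author(report, "J.*Doe")
all_rows <- filter_by_author(report, "J.*Doe", include_ok = TRUE)
print(problems)
```

The include_ok argument here plays the same role that includeOK = TRUE plays for problemPage.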

3 Download statistics

Bioconductor supplies download stats for all packages. The biocDownloadStats function retrieves all available download stats for all Software, Annotation Data, and Experiment Data packages. The results are returned as a tidy data.frame for further analysis.

head(biocDownloadStats())
## # A tibble: 6 x 6
##   Package  Year Month Nb_of_distinct_IPs Nb_of_downloads repo    
##   <fct>   <int> <fct>              <int>           <int> <chr>   
## 1 ABarray  2019 Jan                   53             102 Software
## 2 ABarray  2019 Feb                    0               0 Software
## 3 ABarray  2019 Mar                    0               0 Software
## 4 ABarray  2019 Apr                    0               0 Software
## 5 ABarray  2019 May                    0               0 Software
## 6 ABarray  2019 Jun                    0               0 Software

The download statistics reported are for all available versions of a package. There are no separate, publicly available statistics broken down by version.
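Because the statistics arrive as one row per package and month, yearly totals are a simple aggregation. The sketch below runs on an invented table that reuses the column names shown above; the package names and counts are made up.

```r
# Toy stand-in for biocDownloadStats() output; all values are invented.
stats <- data.frame(
  Package            = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  Year               = c(2018L, 2018L, 2019L, 2019L, 2019L),
  Month              = c("Nov", "Dec", "Jan", "Jan", "Feb"),
  Nb_of_distinct_IPs = c(10L, 12L, 53L, 5L, 7L),
  Nb_of_downloads    = c(20L, 25L, 102L, 9L, 11L),
  stringsAsFactors = FALSE
)

# Sum monthly downloads into one row per package and year.
yearly <- aggregate(Nb_of_downloads ~ Package + Year, data = stats, FUN = sum)
yearly <- yearly[order(yearly$Package, yearly$Year), ]
print(yearly)
```

Only Nb_of_downloads is summed here. Distinct IP counts are not additive across months, since the same address can appear in more than one month.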

4 Package details

The R DESCRIPTION file contains a plethora of information regarding package authors, dependencies, versions, etc. In a repository such as Bioconductor, these details are available in bulk for all included packages. The biocPkgList function returns a data.frame with a row for each package. A great deal of information is available, as evidenced by the column names of the results.

bpi = biocPkgList()
colnames(bpi)
##  [1] "Package"                   "Version"                  
##  [3] "Depends"                   "Suggests"                 
##  [5] "License"                   "MD5sum"                   
##  [7] "NeedsCompilation"          "Title"                    
##  [9] "Description"               "biocViews"                
## [11] "Author"                    "Maintainer"               
## [13] "git_url"                   "git_branch"               
## [15] "git_last_commit"           "git_last_commit_date"     
## [17] "Date/Publication"          "source.ver"               
## [19] "win.binary.ver"            "mac.binary.el-capitan.ver"
## [21] "vignettes"                 "vignetteTitles"           
## [23] "hasREADME"                 "hasNEWS"                  
## [25] "hasINSTALL"                "hasLICENSE"               
## [27] "Rfiles"                    "Enhances"                 
## [29] "dependsOnMe"               "Imports"                  
## [31] "importsMe"                 "suggestsMe"               
## [33] "LinkingTo"                 "Archs"                    
## [35] "VignetteBuilder"           "URL"                      
## [37] "SystemRequirements"        "BugReports"               
## [39] "PackageStatus"             "Video"                    
## [41] "linksToMe"                 "OS_type"                  
## [43] "License_restricts_use"     "License_is_FOSS"          
## [45] "organism"

Some of the variables are parsed to produce list columns.

head(bpi)
## # A tibble: 6 x 45
##   Package Version Depends Suggests License MD5sum NeedsCompilation Title
##   <chr>   <chr>   <list>  <list>   <chr>   <chr>  <chr>            <chr>
## 1 a4      1.30.0  <chr [… <chr [4… GPL-3   4e44a… no               Auto…
## 2 a4Base  1.30.0  <chr [… <chr [2… GPL-3   a8238… no               Auto…
## 3 a4Clas… 1.30.0  <chr [… <chr [1… GPL-3   f3c57… no               Auto…
## 4 a4Core  1.30.0  <chr [… <chr [1… GPL-3   1c18e… no               Auto…
## 5 a4Prep… 1.30.0  <chr [… <chr [2… GPL-3   79f08… no               Auto…
## 6 a4Repo… 1.30.0  <chr [… <chr [1… GPL-3   4a7eb… no               Auto…
## # … with 37 more variables: Description <chr>, biocViews <list>,
## #   Author <list>, Maintainer <list>, git_url <chr>, git_branch <chr>,
## #   git_last_commit <chr>, git_last_commit_date <chr>,
## #   `Date/Publication` <chr>, source.ver <chr>, win.binary.ver <chr>,
## #   `mac.binary.el-capitan.ver` <chr>, vignettes <list>,
## #   vignetteTitles <list>, hasREADME <chr>, hasNEWS <chr>, hasINSTALL <chr>,
## #   hasLICENSE <chr>, Rfiles <chr>, Enhances <list>, dependsOnMe <list>,
## #   Imports <list>, importsMe <list>, suggestsMe <list>, LinkingTo <chr>,
## #   Archs <chr>, VignetteBuilder <chr>, URL <chr>, SystemRequirements <chr>,
## #   BugReports <chr>, PackageStatus <chr>, Video <chr>, linksToMe <chr>,
## #   OS_type <chr>, License_restricts_use <chr>, License_is_FOSS <chr>,
## #   organism <chr>

As a simple example of how these columns can be used, extract the importsMe column to find the packages that import the GEOquery package.

library(dplyr)
bpi = biocPkgList()
bpi %>% 
    filter(Package=="GEOquery") %>%
    pull(importsMe) %>%
    unlist()
##  [1] "bigmelon"        "ChIPXpress"      "coexnet"         "crossmeta"      
##  [5] "EGAD"            "GSEABenchmarkeR" "MACPET"          "minfi"          
##  [9] "MoonlightR"      "phantasus"       "recount"         "SRAdb"
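List columns can also be summarized across all packages at once. As a small sketch, the code below counts importers per package on an invented three-row table with an importsMe list column; the package names are placeholders.

```r
# Toy table with a list column, mimicking the shape of biocPkgList() output.
bpi_toy <- data.frame(Package = c("pkgA", "pkgB", "pkgC"),
                      stringsAsFactors = FALSE)
bpi_toy$importsMe <- list(c("pkgB", "pkgC"), character(0), "pkgA")

# lengths() returns the number of elements in each list entry.
bpi_toy$n_importers <- lengths(bpi_toy$importsMe)

# Rank packages by how many others import them.
ranked <- bpi_toy[order(-bpi_toy$n_importers), c("Package", "n_importers")]
print(ranked)
```

In real data a package with no importers may be stored as NA rather than as an empty vector, in which case the NA entries would need to be dropped before counting.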

5 Package Explorer

For the end user of Bioconductor, an analysis often starts with finding a package or set of packages that perform required tasks or are tailored to a specific operation or data type. The biocExplore() function implements an interactive bubble visualization with filtering based on biocViews terms. Bubbles are sized based on download statistics. Tooltip and detail-on-click capabilities are included. To start a local session:

biocExplore()

6 Dependency graphs

The Bioconductor ecosystem is built around the concept of interoperability and dependencies. These interdependencies are available as part of the biocPkgList() output. The BiocPkgTools package provides some convenience functions to convert package dependencies to R graphs. A modular approach leads to the following workflow.

  1. Create a data.frame of dependencies using buildPkgDependencyDataFrame.
  2. Create an igraph object from the dependency data frame using buildPkgDependencyIgraph.
  3. Use native igraph functionality to perform arbitrary network operations. The convenience functions inducedSubgraphByPkgs and subgraphByDegree are also available.
  4. Visualize with packages such as visNetwork.
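The first three steps can be sketched without network access by building a graph from a small hand-made edge table using plain igraph. The column layout (Package, dependency, edgetype) is an assumption about what a dependency data frame looks like, and the package names are invented.

```r
library(igraph)

# Toy dependency table; the column layout is assumed, the values are invented.
dep_df <- data.frame(
  Package    = c("pkgA", "pkgA", "pkgB", "pkgC"),
  dependency = c("pkgB", "pkgC", "pkgC", "pkgD"),
  edgetype   = c("Imports", "Depends", "Suggests", "Imports"),
  stringsAsFactors = FALSE
)

# The first two columns become edge endpoints; edgetype becomes an edge attribute.
g <- graph_from_data_frame(dep_df, directed = TRUE)

# In-degree = number of packages that depend directly on each vertex.
in_deg <- degree(g, mode = "in")
print(sort(in_deg, decreasing = TRUE))
```

Any igraph function can be applied from this point on, which is the main benefit of keeping the steps separate.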

6.1 Working with dependency graphs

A dependency graph for all of Bioconductor is a starting place.

library(BiocPkgTools)
dep_df = buildPkgDependencyDataFrame()
g = buildPkgDependencyIgraph(dep_df)
g
## IGRAPH 112396e DN-- 2950 23775 -- 
## + attr: name (v/c), edgetype (e/c)
## + edges from 112396e (vertex names):
##  [1] a4       ->a4Base        a4       ->a4Preproc    
##  [3] a4       ->a4Classif     a4       ->a4Core       
##  [5] a4       ->a4Reporting   a4Base   ->methods      
##  [7] a4Base   ->graphics      a4Base   ->grid         
##  [9] a4Base   ->Biobase       a4Base   ->AnnotationDbi
## [11] a4Base   ->annaffy       a4Base   ->mpm          
## [13] a4Base   ->genefilter    a4Base   ->limma        
## [15] a4Base   ->multtest      a4Base   ->glmnet       
## + ... omitted several edges
library(igraph)
head(V(g))
## + 6/2950 vertices, named, from 112396e:
## [1] a4          a4Base      a4Classif   a4Core      a4Preproc   a4Reporting
head(E(g))
## + 6/23775 edges from 112396e (vertex names):
## [1] a4    ->a4Base      a4    ->a4Preproc   a4    ->a4Classif  
## [4] a4    ->a4Core      a4    ->a4Reporting a4Base->methods

See inducedSubgraphByPkgs and subgraphByDegree to produce subgraphs based on a subset of packages.

See the igraph documentation for more detail on graph analytics, setting vertex and edge attributes, and advanced subsetting.

6.2 Graph visualization

The visNetwork package is a nice interactive visualization tool that implements graph plotting in a browser. It can be integrated into shiny applications. Interactive graphs can also be included in R Markdown documents (see vignette).

igraph_network = buildPkgDependencyIgraph(buildPkgDependencyDataFrame())

The full dependency graph is really not that informative to look at, though doing so is possible. A common use case is to visualize the graph of dependencies “centered” on a package of interest. In this case, I will focus on the GEOquery package.

igraph_geoquery_network = subgraphByDegree(igraph_network, "GEOquery")

The subgraphByDegree() function returns all nodes and connections within a given degree (number of steps) of the named package; the default degree is 1.
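To illustrate what a neighborhood of a given degree means, the sketch below uses igraph's make_ego_graph on an invented chain of five packages. This is plain igraph and not necessarily how subgraphByDegree is implemented; in particular, following edges in both directions (mode = "all") is an assumption.

```r
library(igraph)

# Invented chain: pkgA -> pkgB -> pkgC -> pkgD -> pkgE
edges <- data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgD"),
  to   = c("pkgB", "pkgC", "pkgD", "pkgE"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = TRUE)

# Neighborhoods of pkgC, ignoring edge direction.
sub1 <- make_ego_graph(g, order = 1, nodes = "pkgC", mode = "all")[[1]]
sub2 <- make_ego_graph(g, order = 2, nodes = "pkgC", mode = "all")[[1]]

print(sort(V(sub1)$name))  # vertices within one step of pkgC
print(sort(V(sub2)$name))  # vertices within two steps of pkgC
```

Raising the degree grows the neighborhood quickly on a densely connected graph, so small values tend to give the most readable plots.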

The visNetwork package can plot igraph objects directly, but more flexibility is offered by first converting the graph to visNetwork form.

library(visNetwork)
data <- toVisNetworkData(igraph_geoquery_network)

The next few code chunks highlight just a few examples of the visNetwork capabilities, starting with a basic plot.

visNetwork(nodes = data$nodes, edges = data$edges, height = "500px")

For fun, we can watch the graph stabilize during drawing, best viewed interactively.

visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visPhysics(stabilization=FALSE)

Add arrows and colors to better capture dependencies.

data$edges$color='lightblue'
data$edges[data$edges$edgetype=='Imports','color']= 'red'
data$edges[data$edges$edgetype=='Depends','color']= 'green'

visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visEdges(arrows='from') 

Add a legend.

ledges <- data.frame(color = c("green", "lightblue", "red"),
  label = c("Depends", "Suggests", "Imports"), arrows =c("from", "from", "from"))
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
  visEdges(arrows='from') %>%
  visLegend(addEdges=ledges)
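One way to keep edge colors and the legend in sync is to derive both from a single named vector. The sketch below is a base R alternative to the separate assignments above, run on an invented edge table; the name edge_colors is introduced here for illustration only.

```r
# Invented edge table with the same edgetype values used above.
edges <- data.frame(
  from     = c("pkgA", "pkgA", "pkgB"),
  to       = c("pkgB", "pkgC", "pkgC"),
  edgetype = c("Imports", "Depends", "Suggests"),
  stringsAsFactors = FALSE
)

# Single source of truth for the edge-type color scheme.
edge_colors <- c(Depends = "green", Suggests = "lightblue", Imports = "red")

# Look up each edge's color by type; unknown types fall back to lightblue.
edges$color <- unname(edge_colors[edges$edgetype])
edges$color[is.na(edges$color)] <- "lightblue"

# Legend table built from the same vector.
ledges <- data.frame(
  color  = unname(edge_colors),
  label  = names(edge_colors),
  arrows = "from",
  stringsAsFactors = FALSE
)
print(edges)
print(ledges)
```

The resulting edges and ledges tables have the same columns as those passed to visNetwork and visLegend above.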

6.3 Integration with BiocViews

[Work in progress]

The biocViews package provides a small ontology of terms describing Bioconductor packages. Here is a small example of plotting the biocViews graph.

library(biocViews)
data(biocViewsVocab)
biocViewsVocab
## A graphNEL graph with directed edges
## Number of Nodes = 476 
## Number of Edges = 475
library(igraph)
g = igraph.from.graphNEL(biocViewsVocab)
library(visNetwork)
gv = toVisNetworkData(g)
visNetwork(gv$nodes, gv$edges, width="100%") %>%
    visIgraphLayout(layout = "layout_as_tree", circular=TRUE) %>%
    visNodes(size=20) %>%
    visPhysics(stabilization=FALSE)

7 Provenance

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] biocViews_1.50.10  visNetwork_2.0.5   igraph_1.2.2      
## [4] bindrcpp_0.2.2     dplyr_0.7.8        BiocPkgTools_1.0.3
## [7] htmlwidgets_1.3    knitr_1.21         BiocStyle_2.10.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0          compiler_3.5.2      pillar_1.3.1       
##  [4] BiocManager_1.30.4  bindr_0.1.1         bitops_1.0-6       
##  [7] tools_3.5.2         digest_0.6.18       jsonlite_1.6       
## [10] evaluate_0.12       tibble_2.0.1        pkgconfig_2.0.2    
## [13] rlang_0.3.1         graph_1.60.0        rex_1.1.2          
## [16] cli_1.0.1           parallel_3.5.2      curl_3.3           
## [19] yaml_2.2.0          xfun_0.4            stringr_1.3.1      
## [22] httr_1.4.0          xml2_1.2.0          hms_0.4.2          
## [25] stats4_3.5.2        DT_0.5              tidyselect_0.2.5   
## [28] Biobase_2.42.0      glue_1.3.0          R6_2.3.0           
## [31] fansi_0.4.0         XML_3.98-1.16       RBGL_1.58.1        
## [34] rmarkdown_1.11      bookdown_0.9        purrr_0.2.5        
## [37] readr_1.3.1         tidyr_0.8.2         magrittr_1.5       
## [40] BiocGenerics_0.28.0 htmltools_0.3.6     RUnit_0.4.32       
## [43] assertthat_0.2.0    rvest_0.3.2         utf8_1.1.4         
## [46] stringi_1.2.4       RCurl_1.95-4.11     lazyeval_0.2.1     
## [49] crayon_1.3.4