1 Introduction

Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R in a tidy data format. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages.

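Functionality includes access to build reports, download statistics, package details, and package dependency information, each covered in the sections below.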

2 Build reports

The Bioconductor build reports are available online as HTML pages. However, they are not very computable. The biocBuildReport function does some heroic parsing of the HTML to produce a tidy data.frame for further processing in R.

library(BiocPkgTools)
head(biocBuildReport())
## # A tibble: 6 × 11
##   pkg     author version git_l…¹ git_last_commit_d…² Depre…³ Packa…⁴ node  stage
##   <chr>   <chr>  <chr>   <chr>   <dttm>              <lgl>   <chr>   <chr> <chr>
## 1 ABAEnr… Steff… 1.25.1  aa43e0a 2021-11-15 13:30:05 TRUE    Deprec… nebb… inst…
## 2 ABAEnr… Steff… 1.25.1  aa43e0a 2021-11-15 13:30:05 TRUE    Deprec… nebb… buil…
## 3 ABAEnr… Steff… 1.25.1  aa43e0a 2021-11-15 13:30:05 TRUE    Deprec… nebb… chec…
## 4 ABAEnr… Steff… 1.25.1  aa43e0a 2021-11-15 13:30:05 TRUE    Deprec… palo… inst…
## 5 ABAEnr… Steff… 1.25.1  aa43e0a 2021-11-15 13:30:05 TRUE    Deprec… palo… buil…
## 6 ABAEnr… Steff… 1.25.1  aa43e0a 2021-11-15 13:30:05 TRUE    Deprec… palo… chec…
## # … with 2 more variables: result <chr>, bioc_version <chr>, and abbreviated
## #   variable names ¹​git_last_commit, ²​git_last_commit_date, ³​Deprecated,
## #   ⁴​PackageStatus
## # ℹ Use `colnames()` to see all variable names
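
Once the report is a data frame, ordinary subsetting is enough to pull out the failing builds. The following is a minimal sketch on a hand-made stand-in for the report: the package names, node names, and result values are invented for illustration, and only the column names pkg, node, and result are taken from the output above.

```r
# Toy stand-in for a few columns of the build report (all values invented).
report <- data.frame(
  pkg    = c("pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  node   = c("node1", "node2", "node1", "node2", "node1"),
  result = c("OK", "OK", "ERROR", "OK", "WARNING"),
  stringsAsFactors = FALSE
)

# Keep only rows whose result is one of the assumed "problem" statuses.
problems <- report[report$result %in% c("ERROR", "WARNING"), ]

# Cross-tabulate problem statuses by package.
problem_counts <- table(problems$pkg, problems$result)
problem_counts
```

The same subsetting should carry over to the real output of biocBuildReport(), provided its result column uses status labels like the ones assumed here.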

2.1 Personal build report

Because developers may be interested in a quick view of their own packages, there is a simple function, problemPage, to produce an HTML report of the build status of packages matching a given author regex supplied to the authorPattern argument. The default is to report only “problem” build statuses (ERROR, WARNING).

problemPage(authorPattern = "V.*Carey")
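
To get a feel for what a pattern such as "V.*Carey" will match, it can be tried against a few strings with base R's grepl(). This is only a sketch: the author strings are made up, and it assumes that authorPattern is applied as an ordinary regular expression.

```r
# Made-up author strings, used only to illustrate the regular expression.
authors <- c("V. Carey", "A. Author", "Valerie J. Carey-Smith")
pattern <- "V.*Carey"

# TRUE wherever the pattern matches somewhere in the string.
matched <- grepl(pattern, authors)
authors[matched]
```

Because ".*" matches any run of characters, the pattern also catches the third string. A more specific pattern narrows the report to the intended maintainer.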

In similar fashion, maintainers of packages that have many downstream packages that depend on them may wish to check that a change they introduced hasn’t suddenly broken a large number of these. You can use the dependsOn argument to produce the summary report of those packages that “depend on” the given package.

problemPage(dependsOn = "limma")

When run in an interactive environment, the problemPage function will open a browser window for user interaction. Note that if you want to include all your package results, not just the broken ones, simply specify includeOK = TRUE.

3 Download statistics

Bioconductor supplies download stats for all packages. The biocDownloadStats function retrieves all available download statistics for every Software, Annotation Data, and Experiment Data package. The results are returned as a tidy data.frame for further analysis.

head(biocDownloadStats())
## # A tibble: 6 × 7
##   pkgType  Package  Year Month Nb_of_distinct_IPs Nb_of_downloads Date      
##   <chr>    <chr>   <int> <chr>              <int>           <int> <date>    
## 1 software ABarray  2022 Jan                   51              78 2022-01-01
## 2 software ABarray  2022 Feb                   33              68 2022-02-01
## 3 software ABarray  2022 Mar                   62              99 2022-03-01
## 4 software ABarray  2022 Apr                   16              24 2022-04-01
## 5 software ABarray  2022 May                    0               0 2022-05-01
## 6 software ABarray  2022 Jun                    0               0 2022-06-01

The download statistics reported are for all available versions of a package. There are no separate, publicly available statistics broken down by version.
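
Because the statistics are monthly, a common first step is to roll them up per package and year. The sketch below does this with base R's aggregate() on the six ABarray rows printed above, retyped by hand so that the example runs without network access.

```r
# The six ABarray rows shown above, retyped by hand (subset of columns).
stats <- data.frame(
  Package         = "ABarray",
  Year            = 2022L,
  Month           = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"),
  Nb_of_downloads = c(78L, 68L, 99L, 24L, 0L, 0L)
)

# Sum the monthly downloads within each Package/Year combination.
yearly <- aggregate(Nb_of_downloads ~ Package + Year, data = stats, FUN = sum)
yearly
```

The same call applied to the full result of biocDownloadStats() would give one row per package and year.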

The majority of Bioconductor Software packages are also available through other channels such as Anaconda, which also provides download statistics for packages installed from its repositories. Access to these counts is provided by the anacondaDownloadStats function:

head(anacondaDownloadStats())
## # A tibble: 6 × 7
##   Package Year  Month Nb_of_distinct_IPs Nb_of_downloads repo     Date      
##   <chr>   <chr> <chr>              <int>           <dbl> <chr>    <date>    
## 1 ABAData 2018  Apr                   NA               8 Anaconda 2018-04-01
## 2 ABAData 2018  Aug                   NA               5 Anaconda 2018-08-01
## 3 ABAData 2018  Dec                   NA             133 Anaconda 2018-12-01
## 4 ABAData 2018  Jul                   NA               6 Anaconda 2018-07-01
## 5 ABAData 2018  Jun                   NA              18 Anaconda 2018-06-01
## 6 ABAData 2018  Mar                   NA              13 Anaconda 2018-03-01

Note that Anaconda does not provide counts for distinct IP addresses, but this column is included for compatibility with the Bioconductor count tables.
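
The shared columns make it possible to stack the two tables. The sketch below uses one row from each of the outputs above, retyped by hand. The column types follow the printed headers (Year is integer in one table and character in the other), and the "Bioconductor" label added to the repo column is an assumption made for this example.

```r
# One row from each table above, retyped by hand.
bioc <- data.frame(
  pkgType = "software", Package = "ABarray", Year = 2022L, Month = "Jan",
  Nb_of_distinct_IPs = 51L, Nb_of_downloads = 78L,
  Date = as.Date("2022-01-01")
)
conda <- data.frame(
  Package = "ABAData", Year = "2018", Month = "Apr",
  Nb_of_distinct_IPs = NA_integer_, Nb_of_downloads = 8,
  repo = "Anaconda", Date = as.Date("2018-04-01")
)

# Harmonise before stacking: label the source (assumed label) and align the
# type of Year, which differs between the two tables.
bioc$repo <- "Bioconductor"
bioc$Year <- as.character(bioc$Year)

# Keep only the columns both tables share, then stack them.
shared   <- intersect(names(bioc), names(conda))
combined <- rbind(bioc[shared], conda[shared])
combined
```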

4 Package details

The R DESCRIPTION file contains a plethora of information regarding package authors, dependencies, versions, etc. In a repository such as Bioconductor, these details are available in bulk for all included packages. The biocPkgList function returns a data.frame with a row for each package. A great deal of information is available, as evidenced by the column names of the results.

bpi = biocPkgList()
colnames(bpi)
##  [1] "Package"               "Version"               "Depends"              
##  [4] "Suggests"              "License"               "MD5sum"               
##  [7] "NeedsCompilation"      "Title"                 "Description"          
## [10] "biocViews"             "Author"                "Maintainer"           
## [13] "git_url"               "git_branch"            "git_last_commit"      
## [16] "git_last_commit_date"  "Date/Publication"      "source.ver"           
## [19] "win.binary.ver"        "mac.binary.ver"        "vignettes"            
## [22] "vignetteTitles"        "hasREADME"             "hasNEWS"              
## [25] "hasINSTALL"            "hasLICENSE"            "Rfiles"               
## [28] "dependencyCount"       "Imports"               "Enhances"             
## [31] "dependsOnMe"           "suggestsMe"            "VignetteBuilder"      
## [34] "URL"                   "SystemRequirements"    "BugReports"           
## [37] "importsMe"             "Archs"                 "LinkingTo"            
## [40] "Video"                 "linksToMe"             "PackageStatus"        
## [43] "License_restricts_use" "OS_type"               "organism"             
## [46] "License_is_FOSS"

Some of the variables are parsed to produce list columns.

head(bpi)
## # A tibble: 6 × 46
##   Package   Version Depends Sugge…¹ License MD5sum Needs…² Title Descr…³ biocV…⁴
##   <chr>     <chr>   <list>  <list>  <chr>   <chr>  <chr>   <chr> <chr>   <list> 
## 1 a4        1.44.0  <chr>   <chr>   GPL-3   cc696… no      Auto… "Umbre… <chr>  
## 2 a4Base    1.44.0  <chr>   <chr>   GPL-3   094c0… no      Auto… "Base … <chr>  
## 3 a4Classif 1.44.0  <chr>   <chr>   GPL-3   1f6e6… no      Auto… "Funct… <chr>  
## 4 a4Core    1.44.0  <chr>   <chr>   GPL-3   9413a… no      Auto… "Utili… <chr>  
## 5 a4Preproc 1.44.0  <chr>   <chr>   GPL-3   b5b29… no      Auto… "Utili… <chr>  
## 6 a4Report… 1.44.0  <chr>   <chr>   GPL-3   6bd64… no      Auto… "Utili… <chr>  
## # … with 36 more variables: Author <list>, Maintainer <list>, git_url <chr>,
## #   git_branch <chr>, git_last_commit <chr>, git_last_commit_date <chr>,
## #   `Date/Publication` <chr>, source.ver <chr>, win.binary.ver <chr>,
## #   mac.binary.ver <chr>, vignettes <list>, vignetteTitles <list>,
## #   hasREADME <chr>, hasNEWS <chr>, hasINSTALL <chr>, hasLICENSE <chr>,
## #   Rfiles <list>, dependencyCount <chr>, Imports <list>, Enhances <list>,
## #   dependsOnMe <list>, suggestsMe <list>, VignetteBuilder <chr>, URL <chr>, …
## # ℹ Use `colnames()` to see all variable names

As a simple example of how these columns can be used, we extract the importsMe column to find the packages that import the GEOquery package.

library(dplyr)
bpi = biocPkgList()
bpi %>% 
    filter(Package=="GEOquery") %>%
    pull(importsMe) %>%
    unlist()
##  [1] "bigmelon"                         "BioPlex"                         
##  [3] "ChIPXpress"                       "conclus"                         
##  [5] "crossmeta"                        "DExMA"                           
##  [7] "EGAD"                             "GAPGOM"                          
##  [9] "GEOexplorer"                      "MACPET"                          
## [11] "minfi"                            "MoonlightR"                      
## [13] "phantasus"                        "recount"                         
## [15] "SRAdb"                            "BeadArrayUseCases"               
## [17] "GSE13015"                         "healthyControlsPresenceChecker"  
## [19] "easyDifferentialGeneCoexpression" "geneExpressionFromGEO"           
## [21] "MetaIntegrator"                   "seeker"
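
List columns can also be queried in the other direction, for example to find every package whose Imports entry contains a given dependency. The following sketch uses a small hand-made table; the package names and their imports are hypothetical, and only the idea of a list column named Imports comes from the output above.

```r
# Hypothetical packages with a list column of imports (values invented).
pkgs <- data.frame(Package = c("pkgA", "pkgB", "pkgC"))
pkgs$Imports <- list(c("dplyr", "httr"), character(0), "httr")

# For each row, test whether "httr" is among the imported packages.
uses_httr <- vapply(pkgs$Imports, function(x) "httr" %in% x, logical(1))
pkgs$Package[uses_httr]
```

vapply() is used rather than sapply() so that the result is guaranteed to be a logical vector even when a list element is empty.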

5 Package Explorer

For the end user of Bioconductor, an analysis often starts with finding a package or set of packages that perform required tasks or are tailored to a specific operation or data type. The biocExplore() function implements an interactive bubble visualization with filtering based on biocViews terms. Bubbles are sized based on download statistics. Tooltip and detail-on-click capabilities are included. To start a local session:

biocExplore()

6 Dependency graphs

The Bioconductor ecosystem is built around the concept of interoperability and dependencies. These interdependencies are available as part of the biocPkgList() output. The BiocPkgTools package provides some convenience functions to convert package dependencies to R graphs. A modular approach leads to the following workflow.

  1. Create a data.frame of dependencies using buildPkgDependencyDataFrame.
  2. Create an igraph object from the dependency data frame using buildPkgDependencyIgraph.
  3. Use native igraph functionality to perform arbitrary network operations. The convenience functions inducedSubgraphByPkgs and subgraphByDegree are available.
  4. Visualize with packages such as visNetwork.

6.1 Working with dependency graphs

A dependency graph for all of Bioconductor is a starting place.

library(BiocPkgTools)
dep_df = buildPkgDependencyDataFrame()
g = buildPkgDependencyIgraph(dep_df)
g
## IGRAPH 4d7c691 DN-- 3846 38307 -- 
## + attr: name (v/c), edgetype (e/c)
## + edges from 4d7c691 (vertex names):
##  [1] a4       ->a4Base       a4       ->a4Preproc    a4       ->a4Classif   
##  [4] a4       ->a4Core       a4       ->a4Reporting  a4Base   ->a4Preproc   
##  [7] a4Base   ->a4Core       a4Classif->a4Core       a4Classif->a4Preproc   
## [10] abseqR   ->R            ABSSeq   ->R            ABSSeq   ->methods     
## [13] acde     ->R            acde     ->boot         ACE      ->R           
## [16] aCGH     ->R            aCGH     ->cluster      aCGH     ->survival    
## [19] aCGH     ->multtest     ACME     ->R            ACME     ->Biobase     
## [22] ACME     ->methods      ACME     ->BiocGenerics ADaCGH2  ->R           
## + ... omitted several edges
library(igraph)
head(V(g))
## + 6/3846 vertices, named, from 4d7c691:
## [1] a4        a4Base    a4Classif abseqR    ABSSeq    acde
head(E(g))
## + 6/38307 edges from 4d7c691 (vertex names):
## [1] a4    ->a4Base      a4    ->a4Preproc   a4    ->a4Classif  
## [4] a4    ->a4Core      a4    ->a4Reporting a4Base->a4Preproc

See inducedSubgraphByPkgs and subgraphByDegree to produce subgraphs based on a subset of packages.

See the igraph documentation for more detail on graph analytics, setting vertex and edge attributes, and advanced subsetting.
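
As one example of native igraph functionality, the in-degree of a vertex counts how many packages depend on it directly, since edges point from a package to its dependency. The sketch below builds a tiny graph from a hand-made data frame with the same three columns as the dependency data frame. The package names and edges are hypothetical, and graph_from_data_frame() is used here as a stand-in rather than as a description of what buildPkgDependencyIgraph does internally.

```r
library(igraph)

# Hypothetical dependency table: Package -> dependency, plus the edge type.
dep_df <- data.frame(
  Package    = c("pkgA", "pkgA", "pkgB", "pkgC"),
  dependency = c("pkgB", "pkgC", "pkgC", "pkgD"),
  edgetype   = c("Depends", "Imports", "Imports", "Suggests")
)

# The first two columns become the edges; remaining columns become edge attributes.
g <- graph_from_data_frame(dep_df, directed = TRUE)

# Number of incoming edges per vertex, i.e. number of direct reverse dependencies.
in_deg <- degree(g, mode = "in")
sort(in_deg, decreasing = TRUE)

# Edge attributes are kept and can be used for filtering or styling.
E(g)$edgetype
```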

6.2 Graph visualization

The visNetwork package is a nice interactive visualization tool that implements graph plotting in a browser. It can be integrated into shiny applications. Interactive graphs can also be included in R Markdown documents (see vignette).

igraph_network = buildPkgDependencyIgraph(buildPkgDependencyDataFrame())

The full dependency graph is really not that informative to look at, though doing so is possible. A common use case is to visualize the graph of dependencies “centered” on a package of interest. In this case, I will focus on the GEOquery package.

igraph_geoquery_network = subgraphByDegree(igraph_network, "GEOquery")

The subgraphByDegree() function returns all nodes and connections within the given degree of the named package; the default degree is 1.
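
To make the idea of a degree-1 neighbourhood concrete, here is a base R sketch on a hand-made edge list. The package names are hypothetical, and the sketch assumes that edges are followed in both directions. Whether subgraphByDegree() treats direction in exactly this way is not stated here, so take this as an illustration of the concept only.

```r
# Hypothetical chain of dependencies: pkgA -> pkgB -> pkgC -> pkgD -> pkgE.
edges <- data.frame(
  Package    = c("pkgA", "pkgB", "pkgC", "pkgD"),
  dependency = c("pkgB", "pkgC", "pkgD", "pkgE")
)
center <- "pkgC"

# Edges that touch the package of interest, in either direction.
touching <- edges$Package == center | edges$dependency == center

# The centre plus every package one step away from it.
neighbours <- unique(c(center,
                       edges$Package[touching],
                       edges$dependency[touching]))

# Keep only edges whose two endpoints are both in the neighbourhood.
sub_edges <- edges[edges$Package %in% neighbours &
                   edges$dependency %in% neighbours, ]
sub_edges
```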

The visNetwork package can plot igraph objects directly, but more flexibility is offered by first converting the graph to visNetwork form.

library(visNetwork)
data <- toVisNetworkData(igraph_geoquery_network)

The next few code chunks highlight just a few examples of the visNetwork capabilities, starting with a basic plot.

visNetwork(nodes = data$nodes, edges = data$edges, height = "500px")

For fun, we can watch the graph stabilize during drawing, best viewed interactively.

visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visPhysics(stabilization=FALSE)

Add arrows and colors to better capture dependencies.

data$edges$color='lightblue'
data$edges[data$edges$edgetype=='Imports','color']= 'red'
data$edges[data$edges$edgetype=='Depends','color']= 'green'

visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visEdges(arrows='from') 
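
The colour assignments above can also be expressed as a lookup from edge type to colour, which scales better when more edge types are styled. This is a sketch on a hand-made edge table: the from and to values are hypothetical, and the name edge_colors is introduced only for this example.

```r
# Hypothetical edge table with the same edgetype column as data$edges.
edges <- data.frame(
  from     = c("pkgA", "pkgA", "pkgB"),
  to       = c("pkgB", "pkgC", "pkgC"),
  edgetype = c("Depends", "Imports", "Suggests")
)

# Named vector mapping edge types to colours; anything else gets the default.
edge_colors <- c(Depends = "green", Imports = "red")
edges$color <- ifelse(edges$edgetype %in% names(edge_colors),
                      edge_colors[edges$edgetype],
                      "lightblue")
edges
```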

Add a legend.

ledges <- data.frame(color = c("green", "lightblue", "red"),
  label = c("Depends", "Suggests", "Imports"), arrows =c("from", "from", "from"))
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
  visEdges(arrows='from') %>%
  visLegend(addEdges=ledges)

6.3 Integration with BiocViews

[Work in progress]

The biocViews package provides a small ontology of terms describing Bioconductor packages. Here is a small example of plotting the biocViews graph.

library(biocViews)
data(biocViewsVocab)
biocViewsVocab
## A graphNEL graph with directed edges
## Number of Nodes = 494 
## Number of Edges = 493
library(igraph)
g = graph_from_graphnel(biocViewsVocab)
library(visNetwork)
gv = toVisNetworkData(g)
visNetwork(gv$nodes, gv$edges, width="100%") %>%
    visIgraphLayout(layout = "layout_as_tree", circular=TRUE) %>%
    visNodes(size=20) %>%
    visPhysics(stabilization=FALSE)

7 Dependency burden

The dependency burden of a package, namely the amount of functionality that a given package imports, is an important parameter to take into account during package development. A package may break because one or more of its dependencies have changed, or broken, the part of the API that the package imports. For this reason, it may be useful for package developers to quantify the dependency burden of a given package. To do that, we first gather all dependency information using the function buildPkgDependencyDataFrame(), setting its arguments to cover packages in both Bioconductor and CRAN and to include only dependencies categorised as Depends or Imports, which are the ones installed by default for a given package.

library(BiocPkgTools)

depdf <- buildPkgDependencyDataFrame(repo=c("BioCsoft", "CRAN"),
                                     dependencies=c("Depends", "Imports"))
depdf
## # A tibble: 134,394 × 3
##    Package   dependency  edgetype
##    <chr>     <chr>       <chr>   
##  1 a4        a4Base      Depends 
##  2 a4        a4Preproc   Depends 
##  3 a4        a4Classif   Depends 
##  4 a4        a4Core      Depends 
##  5 a4        a4Reporting Depends 
##  6 a4Base    a4Preproc   Depends 
##  7 a4Base    a4Core      Depends 
##  8 a4Classif a4Core      Depends 
##  9 a4Classif a4Preproc   Depends 
## 10 abseqR    R           Depends 
## # … with 134,384 more rows
## # ℹ Use `print(n = ...)` to see more rows
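
A quick first look at such a table is to count the direct dependencies recorded for each package. The sketch below uses only the ten rows printed above, retyped by hand, so the counts are partial and serve only to show the mechanics.

```r
# The ten rows printed above, retyped by hand.
depdf_head <- data.frame(
  Package    = c("a4", "a4", "a4", "a4", "a4",
                 "a4Base", "a4Base", "a4Classif", "a4Classif", "abseqR"),
  dependency = c("a4Base", "a4Preproc", "a4Classif", "a4Core", "a4Reporting",
                 "a4Preproc", "a4Core", "a4Core", "a4Preproc", "R"),
  edgetype   = "Depends"
)

# Number of rows, i.e. direct dependencies, per package.
n_direct <- table(depdf_head$Package)
sort(n_direct, decreasing = TRUE)
```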

Finally, we call the function pkgDepMetrics() to obtain different metrics on the dependency burden of a package we want to analyze, in the case below, the package BiocPkgTools itself:

pkgDepMetrics("BiocPkgTools", depdf)
##               ImportedAndUsed Exported Usage DepOverlap DepGainIfExcluded
## utils                       1      227  0.44       0.01                 0
## graph                       1      116  0.86       0.07                 0
## rlang                       4      436  0.92       0.02                 0
## igraph                      9      809  1.11       0.13                 4
## RBGL                        1       77  1.30       0.08                 0
## htmltools                   1       75  1.33       0.08                 0
## xml2                        1       66  1.52       0.01                 0
## tidyr                       1       65  1.54       0.23                 1
## tools                       2      118  1.69       0.01                 0
## stringr                     1       49  2.04       0.08                 0
## magrittr                    1       42  2.38       0.01                 0
## DT                          1       42  2.38       0.23                 6
## dplyr                       9      287  3.14       0.22                 0
## tidyselect                  1       25  4.00       0.10                 0
## tibble                      2       46  4.35       0.15                 0
## httr                        5       91  5.49       0.10                 0
## htmlwidgets                 1       14  7.14       0.12                 0
## rvest                       3       40  7.50       0.33                 2
## gh                          1       10 10.00       0.17                 3
## jsonlite                    2       17 11.76       0.01                 0
## BiocFileCache               4       29 13.79       0.51                10
## BiocManager                 1        6 16.67       0.02                 0
## biocViews                  NA       31    NA       0.17                 6
## readr                      NA      115    NA       0.33                 6

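In this resulting table, rows correspond to dependencies and columns provide the following information: ImportedAndUsed is the number of functionality calls that the analyzed package imports from the dependency and actually uses; Exported is the number of functionality calls that the dependency exports; Usage is the percentage of the exported functionality that is used, i.e., 100 × ImportedAndUsed / Exported; DepOverlap is the overlap between the dependency graph of the dependency and that of the analyzed package, on a scale from 0 to 1; and DepGainIfExcluded is the number of dependencies the analyzed package would shed if that dependency were excluded.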

The reported information is ordered by the Usage column to facilitate the identification of dependencies of which the analyzed package uses only a small fraction of the functionality and which could therefore be easier to remove. To aid in that decision, the column DepOverlap reports the overlap of the dependency graph of each dependency with that of the analyzed package. Here a value above, e.g., 0.5 could, albeit not necessarily, imply that removing that dependency could substantially lighten the dependency burden of the analyzed package.
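
The Usage values printed above are consistent with a percentage computed as 100 × ImportedAndUsed / Exported, which the sketch below checks for three rows of the table. The second half illustrates one common way of measuring the overlap between two dependency sets, the Jaccard index. Treating DepOverlap as a Jaccard index is an assumption made for illustration, and the dependency names d1 to d4 are hypothetical.

```r
# Three rows retyped from the table above.
imported_and_used <- c(utils = 1, igraph = 9, BiocManager = 1)
exported          <- c(utils = 227, igraph = 809, BiocManager = 6)

# Percentage of the exported functionality that is actually used.
usage <- round(100 * imported_and_used / exported, 2)
usage

# Jaccard index: size of the intersection divided by size of the union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypothetical dependency sets of a package and of one of its dependencies.
deps_of_pkg        <- c("d1", "d2", "d3", "d4")
deps_of_dependency <- c("d3", "d4")
overlap <- jaccard(deps_of_pkg, deps_of_dependency)
overlap
```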

An NA value in the ImportedAndUsed column indicates that the function pkgDepMetrics() could not identify which functionality calls the analyzed package makes to the dependency. This may happen because pkgDepMetrics() has failed to identify the corresponding calls, as happens with imported built-in constants such as DNA_BASES from Biostrings, or because, although the given package imports that dependency, none of its functionality is actually being used. In the latter case, the dependency could be safely removed without any further change in the analyzed package.
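
Dependencies with an NA in ImportedAndUsed are natural candidates for a manual review. The sketch below picks them out of a hand-made table holding three rows from the output above. It assumes the metrics behave like a data frame with dependency names as row names, which is how they are printed.

```r
# Three rows retyped from the table above (two columns only).
metrics <- data.frame(
  ImportedAndUsed = c(1, NA, NA),
  Exported        = c(42, 31, 115),
  row.names       = c("DT", "biocViews", "readr")
)

# Dependencies for which no used functionality could be identified.
to_review <- rownames(metrics)[is.na(metrics$ImportedAndUsed)]
to_review
```

Each name returned this way still needs a manual check, since an NA may reflect a missed call rather than a genuinely unused dependency.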

We can find out which functionality calls we are actually importing as follows:

imp <- pkgDepImports("BiocPkgTools")
imp %>% filter(pkg == "DT")
## # A tibble: 1 × 2
##   pkg   fun      
##   <chr> <chr>    
## 1 DT    datatable

8 Provenance

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] biocViews_1.64.1    visNetwork_2.1.0    igraph_1.3.4       
## [4] dplyr_1.0.9         BiocPkgTools_1.14.1 htmlwidgets_1.5.4  
## [7] knitr_1.39          BiocStyle_2.24.0   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2    xfun_0.32           bslib_0.4.0        
##  [4] purrr_0.3.4         vctrs_0.4.1         generics_0.1.3     
##  [7] htmltools_0.5.3     stats4_4.2.1        BiocFileCache_2.4.0
## [10] yaml_2.3.5          utf8_1.2.2          blob_1.2.3         
## [13] RBGL_1.72.0         XML_3.99-0.10       rlang_1.0.4        
## [16] jquerylib_0.1.4     pillar_1.8.0        glue_1.6.2         
## [19] DBI_1.1.3           rappdirs_0.3.3      BiocGenerics_0.42.0
## [22] bit64_4.0.5         dbplyr_2.2.1        lifecycle_1.0.1    
## [25] stringr_1.4.0       rvest_1.0.2         evaluate_0.16      
## [28] memoise_2.0.1       Biobase_2.56.0      tzdb_0.3.0         
## [31] fastmap_1.1.0       curl_4.3.2          RUnit_0.4.32       
## [34] fansi_1.0.3         Rcpp_1.0.9          readr_2.1.2        
## [37] DT_0.24             filelock_1.0.2      BiocManager_1.30.18
## [40] cachem_1.0.6        graph_1.74.0        jsonlite_1.8.0     
## [43] bit_4.0.4           hms_1.1.1           digest_0.6.29      
## [46] stringi_1.7.8       gh_1.3.0            bookdown_0.28      
## [49] cli_3.3.0           tools_4.2.1         bitops_1.0-7       
## [52] magrittr_2.0.3      sass_0.4.2          RCurl_1.98-1.8     
## [55] RSQLite_2.2.15      tibble_3.1.8        crayon_1.5.1       
## [58] tidyr_1.2.0         pkgconfig_2.0.3     ellipsis_0.3.2     
## [61] xml2_1.3.3          httr_1.4.3          assertthat_0.2.1   
## [64] rmarkdown_2.15      R6_2.5.1            compiler_4.2.1