Introduction to the AnVIL package

Nitesh Turaga1, Vincent Carey, BJ Stubbs, Marcel Ramos and Martin Morgan1*

1Roswell Park Comprehensive Cancer Center

7 May 2020

Abstract

The AnVIL is cloud computing resource developed in part by the National Human Genome Research Institute. The AnVIL package provides end-user and devloper functionality. For the end-user, AnVIL provides faast binary package installation, utitlities for working with Terra / AnVIL table and data resources, and convenient functions for file movement to and from Google cloud storage. For developers, AnVIL provides programatic access to the Terra, Leonardo, Dockstore, and Gen3 RESTful programming interface, including helper functions to transform JSON responses to more formats more amenable to manipulation in R.

Package

AnVIL 1.0.3

1 Installation

Install the AnVIL package with

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager", repos = "https://cran.r-project.org")
BiocManager::install("AnVIL")

Once installed, load the package with

library(AnVIL)

2 Quick start

2.1 Up to speed with AnVIL

The AnVIL project is an analysis, visualization, and informatics cloud-based space for data access, sharing and computing across large genomic-related data sets.

The AnVIL project supports use of R through Jupyter notebooks and RStudio. Support for RStudio is preliminary as of April 2020.

This package provides access to AnVIL resources from within the AnVIL cloud, and also from stand-alone computing resources such as a user’s laptop.

Use of this package requires AnVIL and Google cloud computing billing accounts. Consult AnVIL training guides for details on establishing these accounts.

The remainder of this vignette assumes that an AnVIL account has been established and successfully linked to a Google cloud computing billing account.

2.2 Use in the AnVIL cloud

In the AnVIL cloud environment, click on the RUNTIME button illustrated below and choose the ‘Bioconductor’ runtime. When creating a Jupyter notebook, choose R as the engine. For RStudio use, choose Custom environment and enter the current anvil-rstudio-bioconductor image and version tag. A time of writing, the image is

us.gcr.io/anvil-gcr-public/anvil-rstudio-bioconductor:0.0.3

Images are available at Google Cloud Platform container registry.

2.3 Local use

Local use requires that the gcloud SDK is installed, and that the billing account used by AnVIL can be authenticated with the user. These requirements are satisfied when using the AnVIL compute cloud. For local use, one must

Install the gcloud sdk

Define an environment variable or option() named GCLOUD_SDK_PATH pointing to the root of the SDK installation, e.g,

dir(file.path(Sys.getenv("GCLOUD_SDK_PATH"), "bin"), "^(gcloud|gsutil)$")
## [1] "gcloud" "gsutil"

Test the installation with gcloud_exists()

## the code chunks in this vignette are fully evaluated when
## gcloud_exists() returns TRUE
gcloud_exists()
## [1] FALSE

3 For end users

3.1 Fast binary package installation

The AnVIL cloud compute environment makes use of docker containers with defined installations of binary system software. It is thus possible to archive pre-built ‘binary’ R packages, and to install these without requiring compilation. The AnVIL function install() arranges to install binary packages (when these are available) and current; it defaults to installing packages from source using standard BiocManager::install() facilities.

AnVIL::install("GenomicFeatures")

Thus AnVIL::install() can be used as an improved method for installing CRAN and Bioconductor packages.

Because package installation is fast, it can be convenient to install packages into libraries on a project-specific basis, e.g., to create a ‘snapshot’ of packages for reproducible analysis. Use

add_libpaths("~/my/project")

as a convenient way to prepend a project-specific library path to .libPaths(). New packages will be installed into this library.

3.2 Working with Google cloud-based resources

The AnVIL package implements functions to facilitate access to Google cloud resources.

Using `gcloud_*()` for account management

The gcloud_*() family of functions provide access to Google cloud functions implemented by the gcloud binary. gcloud_project() returns the current billing account.

gcloud_account() # authentication account
gcloud_project() # billing project information

A convenient way to access any gcloud SDK command is to use gcloud_cmd(), e.g.,

gcloud_cmd("projects", "list") %>%
    readr::read_table() %>%
    filter(startsWith(PROJECT_ID, "anvil"))

This translates into the command line gcloud projects list. Help is also available within R, e.g.,

gcloud_help("projects")

Use gcloud_help() (with no arguments) for an overview of available commands.

Using `gsutil_*()` for file and bucket management

The gsutil_*() family of functions provides an interface to google bucket manipulation. The following refers to publicly available 1000 genomes data available in Google Cloud Storage.

src <- "gs://genomics-public-data/1000-genomes/"

gsutil_ls() lists bucket content; gsutil_stat() additional detail about fully-specified buckets.

gsutil_ls(src)

other <- paste0(src, "other")
gsutil_ls(other, recursive = TRUE)

sample_info <- paste0(src, "other/sample_info/sample_info.csv")
gsutil_stat(sample_info)

gsutil_cp() copies buckets from or to Google cloud storage; copying to cloud storage requires write permission, of course. One or both of the arguments can be cloud endpoints.

fl <- tempfile()
gsutil_cp(sample_info, fl)

csv <- readr::read_csv(fl, guess_max = 5000L)
csv

gsutil_pipe() provides a streaming interface that does not require intermediate disk storage.

pipe <- gsutil_pipe(fl)
readr::read_csv(pipe, guess_max = 5000L) %>%
    dplyr::select("Sample", "Family_ID", "Population", "Gender")

gsutil_rsync() synchronizes a local file hierarchy with a remote bucket. This can be a powerful operation when delete = TRUE (removing local or remote files), and has default option dry = TRUE to indicate the consequences of the sync.

destination <- tempfile()
stopifnot(dir.create(destination))
source <- paste0(src, "other/sample_info")

## dry run
gsutil_rsync(source, destination)

gsutil_rsync(source, destination, dry = FALSE)
dir(destination, recursive = TRUE)

## nothing to synchronize
gsutil_rsync(source, destination, dry = FALSE)

## one file requires synchronization
unlink(file.path(destination, "README"))
gsutil_rsync(source, destination, dry = FALSE)

localize() and delocalize() provide ‘one-way’ synchronization. localize() moves the content of the gs:// source to the local file system. localize() could be used at the start of an analysis to retrieve data stored in the google cloud to the local compute instance. delocalize() performs the complementary operation, copying local files to a gs:// destination. The unlink = TRUE option to delocalize() unlinks local source files recursively. It could be used at the end of an analysis to move results to the cloud for long-term persistent storage.

3.3 Using `av*()` to work with AnVIL tables and data

Tables, reference data, and persistent files

AnVIL organizes data and analysis environments into ‘workspaces’. AnVIL-provided Data resources in a workspace are managed in as ‘TABLES’, ‘REFERENCE DATA’, and ‘Workspace Data’, as illustrated in the figure below.

Data tables in a workspace are available by specifying the namespace (billing account) and name (workspace name) of the workspace. When on the AnVIL in a Jupyter notebook or RStudio, this information can be discovered with

avworkspace_namespace()
avworkspace_name()

It is also possible to specify, when not in the AnVIL compute environment, the data resource to work with.

## N.B.: IT IS NOT NECESSARY TO SET THESE WHEN ON ANVIL
avworkspace_namespace("pathogen-genomic-surveillance")
avworkspace_name("COVID-19")

Using `avtable*()` for accessing tables

Accessing data tables use the av*() functions. Use avtables() to discover available tables, and avtable() to retrieve a particular table

avtables()
sample <- avtable("sample")
sample

The data in the table can then be manipulated using standard R commands, e.g., to identify SRA samples for which a final assembly fasta file is available.

sample %>%
    select(name, contains("fasta")) %>%
    filter(!is.na(final_assembly_fasta))

Users can easily add tables to their own workspace using avtable_import(), perhaps as the final stage of a pipe

mtcars %>%
    mutate(cyl = factor(cyl)) %>%
    avtable_import()

The TABLES data in a workspace are usually provided as curated results from AnVIL. Nonetheless, it can sometimes be useful to delete individual rows from a table. Use avtable_delete_values().

Using `avdata()` for accessing Workspace Data

The ‘Workspace Data’ is accessible through avdata() (the example below shows that some additional parsing may be necessary).

avdata()

Using `avbucket()` and workspace files

Each workspace is associated with a google bucket, with the content summarized in the ‘Files’ portion of the workspace. The location of the files is

files <- avbucket()
files

and the content can be discovered and retrieved using gsutil_*() functionality; if the workspace is owned by the user, then persistent data can be written to the bucket.

gsutil_ls(files)

## requires workspace ownership
uri <- avbucket()                             # discover bucket
bucket <- file.path(uri, "mtcars.tab")
write.table(mtcars, gsutil_pipe(bucket, "w")) # write to bucket

3.4 Example work flows

Example work flows will be developed as experience with the AnVIL cloud increases.

4 For developers

4.1 Set-up

4.2 Service APIs

AnVIL applications are exposed to the developer through RESTful API services. Each service is represented in R as an object. The object is created by invoking a constructor, sometimes with arguments. We illustrate basic functionality with the Terra() service.

Currently, APIs using the OpenAPI Specification (OAS) Version 2 (formerly known as Swagger) are supported. AnVIL makes use of the rapiclient codebase to provide a unified representation of the API protocol.

Construction

Create an instance of the service. This consults a Swagger / OpenAPI schema corresponding to the service to create an object that knows about available endpoints. Terra / AnVIL project services usually have Swagger / OpenApi-generated documentation, e.g., for the Terra service.

terra <- Terra()

Printing the return object displays a brief summary of endpoints

terra

The schema for the service groups endpoints based on tag values, providing some level of organization when exploring the service. Tags display consists of endpoints (available as a tibble with tags(terra)).

terra %>% tags("Status")

Invoke endpoints

Access an endpoint with $; without parentheses () this generates a brief documentation string (derived from the schema specification. Including parentheses (and necessary arguments) invokes the endpoint.

terra$status
terra$status()

operations() and schemas() return a named list of endpoints, and of argument and return value schemas. operations(terra)$XXX() can be used an alternative to direct invocation terra$XXX(). schemas() can be used to construct function arguments with complex structure.

empty_object() is a convenience function to construct an ‘empty’ object (named list without content) required by some endpoints.

Process responses

Endpoints return objects of class response, defined in the httr package

status <- terra$status()
class(status)

Several convenience functions are available to help developers transform return values into representations that are more directly useful.

str() is invoked for the side-effect of displaying the list-like structure of the response. Note that this is not the literal structure of the response object (use utils::str(status) for that), but rather the structure of the JSON response received from the service.

str(status)

as.list() returns the JSON response as a list, and flatten() attempts to transform the list into a tibble. flatten() is effective when the response is in fact a JSON row-wise representation of tibble-like data.

lst <- status %>% as.list()
lengths(lst)
lengths(lst$systems)
str(lst$systems)

4.3 Service implementations

The AnVIL package implements and has made extensive use of the following services:

Terra (https://api.firecloud.org/; Terra()) provides access to terra account and workspace management.

Leonardo (https://leonardo.dev.anvilproject.org/; Leonardo()) implements an interface to the AnVIL container deployment service, useful for management Jupyter notebook and RStudio sessions running in the AnVIL compute cloud.

The Dockstore service (https://dockstore.org/swagger.json, Dockstore()) is available but has recieved limited testing. Dockstore is used to run CWL- or WDL-based work flows, including workflows using R / Bioconductor. See the separate vignette ‘Dockstore and Bioconductor for AnVIL’ for initial documentation.

Gen3 services (https://raw.githubusercontent.com/uc-cdis) can be created, but functionality is untested. The services are Gen3Fence() (authentication), Gen3Indexd() (indexing service), Gen3Peregrine() (graphQL queries), and Gen3Sheepdog() (submission services).

4.4 Extending the `Service` class to implement your own RESTful interface

The AnVIL package provides useful functionality for exposing other RESTful services represented in Swagger. To use this in other packages,

Add to the package DESCRIPTION file
```
Imports: AnVIL
```

Arrange (e.g., via roxygen2 @importFrom, etc.) for the NAMESPACE file to contain

importFrom AnVIL, Service
importMethodsFrom AnVIL, "$"   # pehaps also `tags()`, etc
importClassesFrom AnVIL, Service

Implement your own class definition and constructor. Use ?Service to provide guidance on argument specification. For instance, to re-implement the terra service.

.MyService <- setClass("MyService", contains = "Service")

MyService <-
    function()
{
    .MyService(Service(
        "myservice",
        host = "api.firecloud.org",
        api_url = "https://api.firecloud.org/api-docs.yaml",
        authenticate = FALSE
    ))
}

5 Support, bug reports, and source code availability

For user support, please ask for help on the Bioconductor support site. Remember to tag your question with ‘AnVIL’, so that the maintainer is notified. Ask for developer support on the bioc-devel mailing list.

Please report bugs as ‘issues’ on GitHub.

Retrieve the source code for this package from it’s cannonical location.

git clone https://git.bioconductor.org/packages/AnVIL

The package source code is also available on GitHub

Appendix

Acknowledgments

Research reported in this software package was supported by the US National Human Genomics Research Institute of the National Institutes of Health under award number U24HG010263. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Introduction to the AnVIL package

7 May 2020

Abstract

Package

Contents

1 Installation

2 Quick start

2.1 Up to speed with AnVIL

2.2 Use in the AnVIL cloud

2.3 Local use

3 For end users

3.1 Fast binary package installation

3.2 Working with Google cloud-based resources

Using `gcloud_*()` for account management

Using `gsutil_*()` for file and bucket management

3.3 Using `av*()` to work with AnVIL tables and data

Tables, reference data, and persistent files

Using `avtable*()` for accessing tables

Using `avdata()` for accessing Workspace Data

Using `avbucket()` and workspace files

3.4 Example work flows

4 For developers

4.1 Set-up

4.2 Service APIs

Construction

Invoke endpoints

Process responses

4.3 Service implementations

4.4 Extending the `Service` class to implement your own RESTful interface

5 Support, bug reports, and source code availability

Appendix

Acknowledgments

Session info

Introduction to the AnVIL package

7 May 2020

Abstract

Package

Contents

1 Installation

2 Quick start

2.1 Up to speed with AnVIL

2.2 Use in the AnVIL cloud

2.3 Local use

3 For end users

3.1 Fast binary package installation

3.2 Working with Google cloud-based resources

Using gcloud_*() for account management

Using gsutil_*() for file and bucket management

3.3 Using av*() to work with AnVIL tables and data

Tables, reference data, and persistent files

Using avtable*() for accessing tables

Using avdata() for accessing Workspace Data

Using avbucket() and workspace files

3.4 Example work flows

4 For developers

4.1 Set-up

4.2 Service APIs

Construction

Invoke endpoints

Process responses

4.3 Service implementations

4.4 Extending the Service class to implement your own RESTful interface

5 Support, bug reports, and source code availability

Appendix

Acknowledgments

Session info

Using `gcloud_*()` for account management

Using `gsutil_*()` for file and bucket management

3.3 Using `av*()` to work with AnVIL tables and data

Using `avtable*()` for accessing tables

Using `avdata()` for accessing Workspace Data

Using `avbucket()` and workspace files

4.4 Extending the `Service` class to implement your own RESTful interface