ExperimentHubData
provides tools to add or modify resources in
Bioconductor’s ExperimentHub
. This ‘hub’ houses curated data from courses,
publications or experiments. The resources are generally not files of raw data
(as can be the case in AnnotationHub
) but instead are R
/ Bioconductor
objects such as GRanges, SummarizedExperiment, data.frame etc. Each resource
has associated metadata that can be searched through the ExperimentHub
client interface.
Resources are contributed to ExperimentHub
in the form of a package. The
package contains the resource metadata, man pages, vignette and any supporting
R
functions the author wants to provide. This is a similar design to the
existing Bioconductor
experimental data packages except the data are
stored in AWS S3 buckets instead of the data/ directory of the package.
Below are the steps required for adding new resources.
Bioconductor
team memberThe man page and vignette examples in the data experiment package will not work until
the data are available in ExperimentHub
. Adding the data to AWS S3 and the
metadata to the production database involves assistance from a Bioconductor
team member. Please read the section “Uploading Data to S3”.
When a resource is downloaded from ExperimentHub
the associated data experiment
package is loaded in the workspace making the man pages and vignettes readily
available. Because documentation plays an important role in understanding these
curated resources please take the time to develop clear man pages and a
detailed vignette. These documents provide essential background to the user and
guide appropriate use the of resources.
Below is an outline of package organization. The files listed are required unless otherwise stated.
inst/extdata/
metadata.csv
:
This file contains the metadata in the format of one row per resource
to be added to the ExperimentHub
database. The file should be generated
from the code in inst/scripts/make-metadata.R where the final data are
written out with write.csv(..., row.names=FALSE)
. The required column
names and data types are specified in
ExperimentHubData::makeExperimentHubMetadata
See
?ExperimentHubData::makeExperimentHubMetadata
for details.
An example data experiment package metadata.csv file can be found here
inst/scripts/
make-data.R
:
A script describing the steps involved in making the data object(s). This
includes where the original data were downloaded from, pre-processing,
and how the final R object was made. Include a description of any
steps performed outside of R
with third party software. It is encouraged to
serialize Data objects with save()
with the .rda extension on the filename
but not strictly necessary. If the data is provided in another format an
appropriate loading method may need to be implemented. Please advise when
reaching out for “Uploading Data to S3”.
make-metadata.R
:
A script to make the metadata.csv file located in inst/extdata of the
package. See ?ExperimentHubData::makeExperimentHubMetadata
for a description
of expected fields and data types.
ExperimentHubData::makeExperimentHubMetadata()
can be used to validate the
metadata.csv file before submitting the package.
vignettes/
R/
zzz.R
: Optional. You can include a .onLoad()
function in a zzz.R file that
exports each resource name (i.e., title) into a function. This allows the data
to be loaded by name, e.g., resouce123()
.
.onLoad <- function(libname, pkgname) {
fl <- system.file("extdata", "metadata.csv", package=pkgname)
titles <- read.csv(fl, stringsAsFactors=FALSE)$Title
createHubAccessors(pkgname, titles)
}
ExperimentHub::createHubAccessors()
and
ExperimentHub:::.hubAccessorFactory()
provide internal
detail. The resource-named function has a single ‘metadata’
argument. When metadata=TRUE, the metadata are loaded (equivalent
to single-bracket method on an ExperimentHub object) and when
FALSE the full resource is loaded (equivalent to double-bracket
method).
R/*.R
: Optional. Functions to enhance data exploration.
man/
package man page: The package man page serves as a landing point and should briefly describe all resources associated with the package. There should be an entry for each resource title either on the package man page or individual man pages.
resource man pages: Resources can be documented on the same page, grouped by common type or have their own dedicated man pages.
document how data are loaded: Data can be accessed via the standard ExperimentHub interface with single and double-bracket methods, e.g.,
library(ExperimentHub)
eh <- ExperimentHub()
myfiles <- query(eh, "PACKAGENAME")
myfiles[[1]] ## load the first resource in the list
myfiles[["EH123"]] ## load by EH id
If a .onLoad()
function is used to export each resource as a function
also document that method of loading, e.g.,
resourceA(metadata = FALSE) ## data are loaded
resourceA(metadata = TRUE) ## metadata are displayed
DESCRIPTION
/ NAMESPACE
The package should depend on and fully import ExperimentHub. If using the
suggested .onLoad()
function, import the utils package in the DESCRIPTION
file and selectively importFrom(utils, read.csv) in the NAMESPACE.
Package authors are encouraged to use the ExperimentHub::listResources()
and
ExperimentHub::loadResource()
functions in their man pages and vignette.
These helpers are designed to facilitate data discovery within a specific
package vs within all of ExperimentHub.
Data are not formally part of the software package and are stored separately in AWS S3 buckets. The author should follow instructions in the section “Uploading Data to S3”.
When you are satisfied with the representation of your resources in
make-metadata.R (which produces metadata.csv) the Bioconductor
team
member will add the metadata to the production database.
Once the data are in AWS S3 and the metadata have been added to the production database the man pages and vignette can be finalized. When the package passes R CMD build and check it can be submitted to the package tracker for review. The package should be submitted without any of the data that is now located on S3; This keeps the package light weight and minimual size while still providing access to key large data files now stored on S3. If the data files were added to the github repository please see removing large data files and clean git tree to remove the large files and reduce package size.
Many times these data package are created as a suppliment to a software package. There is a process for submitting mulitple package under the same issue.
Metadata for new versions of the data can be added to the same package as they become available.
The titles for the new versions must be unique and not match the title of any resource currently in ExperimentHub. Good practice would be to include the version and / or genome build in the title.
Make data available: see section on “Uploading Data to S3”
Update make-metadata.R with the new metadata information
Generate a new metadata.csv file. The package should contain metadata for all versions of the data in ExperimentHub so the old file should remain. When adding a new version it might be helpful to write a new csv file named by version, e.g., metadata_v84.csv, metadata_85.csv etc.
Bump package version and commit to git
Notify Lori.Shepherd@Roswellpark.org that an update is ready and a team member will add the new metadata to the production database; new resources will not be visible in ExperimentHub until the metadata are added to the database.
Contact Lori.Shepherd@roswellpark.org or maintainer@bioconductor.org with any questions.
A bug fix may involve a change to the metadata, data resource or both.
The replacement resource must have the same name as the original and be at the same location (path)
Notify Lori.Shepherd@roswellpark.org that you want to replace the data and make the files available: see section “Uploading Data to S3”.
New metadata records can be added for new resources but modifying existing records is discouraged. Record modification will only be done in the case of bug fixes.
Notify Lori.Shepherd@roswellpark.org that you want to change the metadata
Update make-metadata.R with modified information and regenerate the metadata.csv file if necessary
Bump the package version and commit to git
Removing resources should be done with caution. The intent is that ExperimentHub be a ‘reproducible’ resource by providing a stable snapshot of the data. Data made available in Bioconductor version x.y.z should be available for all versions greater than x.y.z. Unfortunately this is not always possible. If you find it necessary to remove data from ExperimentHub please contact Lori.shepherd@roswellpark.org or maintainer@bioconductor.org for assistance.
When a resource is removed from ExperimentHub the ‘status’ field in the
metadata is modified to explain why they are no longer available. Once
this status is changed the ExperimentHub()
constructor will not list the
resource among the available ids. An attempt to extract the resource with
‘[[’ and the EH id will return an error along with the status message.
Instead of providing the data files via dropbox, ftp, etc. we will grant temporary access to an S3 bucket where you can upload your data. Please email Lori.Shepherd@roswellpark.org for access.
You will be given access to the ‘AnnotationContributor’ user. Ensure that the
AWS CLI
is installed on your machine. See instructions for installing AWS CLI
here. Once you have requested access you
will be emailed a set of keys. There are two options to set the profile up for
AnnotationContributor
.aws/config
file to include the following updating the keys
accordingly:[profile AnnotationContributor]
output = text
region = us-east-1
aws_access_key_id = ****
aws_secret_access_key = ****
.aws/config
file, Run the following command entering
appropriate information from aboveaws configure --profile AnnotationContributor
After the configuration is set you should be able to upload resources using
aws --profile AnnotationContributor s3 cp test_file.txt s3://annotation-contributor/test_file.txt --acl public-read
# to upload directory
aws --profile AnnotationContributor s3 cp test_dir s3://annotation-contributor/teset_dir --recursive --acl public-read
Please upload the data with the appropriate directory structure, including subdirectories as necessary (i.e. top directory must be software package name, then if applicable, subdirectories of versions, …)
Once the upload is complete, email Lori.Shepherd@roswellpark.org to continue the process. To add the data officially the data will need to be uploaded and the metadata.csv file will need to be created in the github repository.
The best way to validate record metadata is to read inst/extdata/metadata.csv
with ExperimentHubData::makeExperimentHubMetadata()
. If that is successful the
metadata are ready to go.
As described above the metadata.csv file (or multiple metadata.csv files) will
need to be created before the data can be added to the database. To ensure
proper formatting one should run AnnotationHubData::makeAnnotationHubMetadata
on the package with any/all metadata files, and address any ERRORs that
occur. Each object uploaded to S3 should have an entry in the metadata
file. Briefly, a description of the metadata columns required:
Any additional columns in the metadata.csv file will be ignored but could be included for internal reference.
This is a dummy example but hopefully it will give you an idea of the format. Let’s say I have a package myExperimentPackage and I upload two files one a SummarizedExperiments of expression data saved as a .rda and the other a sqlite database both considered simulated data. You would want the following saved as a csv (comma seperated output) but for easier view we show in a table:
Title | Description | BiocVersion | Genome | SourceType | SourceUrl | SourceVersion | Species | TaxonomyId | Coordinate_1_based | DataProvider | Maintainer | RDataClass | DispatchClass | RDataPath |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Simulated Expression Data | Simulated Expression values for 12 samples and 12000 probles | 3.9 | NA | Simulated | http://mylabshomepage | v1 | NA | NA | NA | http://bioconductor.org/packages/myExperimentPackage | Bioconductor Maintainer maintainer@bioconductor.org | SummarizedExperiment | Rda | myExperimentPackage/SEobject.rda |
Simulated Database | Simulated Database containing gene mappings | 3.9 | hg19 | Simulated | http://bioconductor.org/packages/myExperimentPackage | v2 | Home sapiens | 9606 | NA | http://bioconductor.org/packages/myExperimentPackage | Bioconductor Maintainer maintainer@bioconductor.org | SQLiteConnection | SQLiteFile | myExperimentPackage/mydatabase.sqlite |