scpdata 1.12.0
Welcome to the scpdata package, and thank you for your interest in contributing!
The scpdata data package is a repository of curated mass spectrometry-based single-cell proteomics (SCP) datasets. The purpose of scpdata is to provide users with streamlined access to high-quality SCP data, alleviating the need for time-consuming data wrangling. We currently provide data at the peptide-to-spectrum match (PSM) level, the peptide level and/or the protein level. The package also encompasses a large diversity of technologies, including DDA and DIA, label-free and multiplexed experiments from various laboratories such as the Slavov Lab, the Kelly Lab, and the Schoof Lab.
Contributions are very much welcome. We happily accept major contributions, such as adding a new dataset, as well as minor contributions, such as fixing typos or improving the current documentation.
To facilitate our collaboration, this vignette will guide you through the process of adding a new dataset to the package. We will first get you started with some basic guidelines on how to contribute using GitHub. We’ll proceed with a description of the data structure and the data pieces we expect. Next, we will provide an overview of the package’s folder structure to help you navigate through the project. Finally, we’ll explain the workflow you should follow to add your dataset to the repository.
Fork the scpdata GitHub repository, then clone your fork using git:
git clone git@github.com:YOUR_USER_NAME/scpdata
If you have any questions or face any hurdles, do not hesitate to open a new issue and we’ll be happy to provide additional guidance.
QFeatures object
All datasets in scpdata are stored in a QFeatures object (see the intro vignette). The object is created following the scp data framework, as described in this short demo.
We refer to feature data as the data generated by MS data identification and quantification tools. Depending on the tool, features may represent PSMs, peptides and/or proteins. For instance, MaxQuant provides an evidence.txt file with PSM-level information, a peptides.txt file with peptide-level information and a proteinGroups.txt file with protein-level information. We encourage adding as many of the three feature levels as possible when contributing a dataset to scpdata.
For each feature, the tools provide quantification data as well as feature annotations. These two pieces of information should be separated in a SingleCellExperiment object. Feature annotations are stored in the rowData and the quantitative values are stored in the assay.
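The separation described above can be sketched as follows. This is a minimal illustration and not the procedure used for any dataset in the package: the toy table and its column names (Sequence, Protein, Intensity_cell1, Intensity_cell2) are assumptions made up for the example.

```r
library(SingleCellExperiment)

## Toy feature table standing in for a tool output such as peptides.txt.
## All column names and values are hypothetical.
feature_table <- data.frame(
    Sequence = c("AAGLK", "ELVISK", "PEPTIDEK"),
    Protein = c("P1", "P2", "P2"),
    Intensity_cell1 = c(1200, NA, 530),
    Intensity_cell2 = c(980, 410, NA)
)

## Columns holding quantitative values; everything else is annotation
quant_cols <- grep("^Intensity_", colnames(feature_table), value = TRUE)
annot_cols <- setdiff(colnames(feature_table), quant_cols)

## Quantitative values: a features x samples numeric matrix
quant <- as.matrix(feature_table[, quant_cols])
rownames(quant) <- feature_table$Sequence

## Feature annotations go to the rowData, quantitative values to the assay
sce <- SingleCellExperiment(
    assays = list(intensity = quant),
    rowData = DataFrame(
        feature_table[, annot_cols],
        row.names = feature_table$Sequence
    )
)
sce
```

Calling rowData(sce) then returns the annotations and assay(sce) the quantitative matrix.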
Sample annotations contain information about each sample (single cell) in the dataset. This information is generated by the experimenter and should contain biological descriptors, such as the cell line or the treatment applied, and technical descriptors, such as the day of acquisition, the acquisition batch, the LC batch, etc. The sample annotations are stored in the colData of the QFeatures object.
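As a rough sketch of where sample annotations end up, the example below builds a QFeatures object from a single toy assay and a sample annotation table. The descriptor names (CellLine, AcquisitionDate, LcBatch) and all values are hypothetical; real datasets will carry their own descriptors.

```r
library(SingleCellExperiment)
library(QFeatures)

## Toy peptide-level data for two cells (values are made up)
quant <- matrix(
    c(1200, NA, 980, 410),
    nrow = 2,
    dimnames = list(c("AAGLK", "ELVISK"), c("cell1", "cell2"))
)
sce <- SingleCellExperiment(assays = list(intensity = quant))

## Sample annotations: one row per sample, with row names matching the
## column names of the assay. Descriptor names are hypothetical.
sample_annotations <- DataFrame(
    CellLine = c("HeLa", "U937"),
    AcquisitionDate = c("2021-03-01", "2021-03-02"),
    LcBatch = c("A", "A"),
    row.names = c("cell1", "cell2")
)

## The sample annotations are stored once, at the QFeatures level
qf <- QFeatures(list(peptides = sce), colData = sample_annotations)
colData(qf)
```

Keeping the annotations in the colData of the QFeatures object means they are shared by every assay that contains the same samples.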
If you want to contribute to scpdata with a dataset you generated yourself, we suggest you read the last section of the initial recommendations for SCP experiments, which provides a comprehensive discussion of the descriptors of interest you should collect:
Gatto, Laurent, Ruedi Aebersold, Juergen Cox, Vadim Demichev, Jason Derks, Edward Emmott, Alexander M. Franks, et al. 2023. “Initial Recommendations for Performing, Benchmarking and Reporting Single-Cell Proteomics Experiments.” Nature Methods 20 (3): 375–86.
We also require the collection of experimental data that describes the dataset. This information is commonly retrieved from the publication associated with the dataset and provides a scientific context to the dataset. This information is used for building the dataset documentation.
Finally, the ExperimentHub project, on which scpdata relies, requires a thorough description of the data sources for every dataset.
We here provide an overview of the key folders and files relevant when contributing a new dataset. The current files may provide a source of inspiration when preparing a new dataset.
The inst/scripts folder contains all R scripts used to generate the QFeatures objects from the source files, one script for each dataset. Each script is named as follows: make-data_ + DATASET_NAME + .R.
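For illustration, the naming rule can be written out in a couple of lines. The dataset name used here is a hypothetical placeholder, and the inst/scripts location is taken from the workflow described later in this vignette.

```r
## Hypothetical dataset name, used for illustration only
dataset_name <- "smith2024"

## Script name: make-data_ + DATASET_NAME + .R
script_path <- file.path(
    "inst", "scripts",
    paste0("make-data_", dataset_name, ".R")
)
script_path
```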
Note the file called make-metadata.R. It generates a CSV table required by ExperimentHub where each line corresponds to a dataset and the columns contain the data source information. The table is stored in inst/extdata/metadata.csv, which should never be changed manually.
The R folder contains 3 R scripts, but new contributions should only consider data.R and can safely ignore the other two. The data.R script contains the documentation for each dataset, formatted using roxygen2 markup.
The man folder contains the compiled documentation manuals, one for each dataset. These were automatically generated by roxygen2 and should never be changed manually.
In practice, contributing a new dataset involves 6 steps.
If you want to contribute an already published dataset, identify the data sources for all feature data and the sample annotations. This is generally provided in the article, but you may need to request additional information from the authors.
If you want to contribute your own dataset, make sure that all feature data and the sample annotation table are available from a public repository (e.g. PRIDE, MassIVE or Zenodo).
QFeatures object
Create a new R script, inst/scripts/make-data_DATASET_NAME.R, which contains all the code to convert the source data into the QFeatures object. Here are some tips and tricks for generating a high-quality dataset:
Creating QFeatures or SingleCellExperiment objects can be streamlined using scp::readSCP() and scp::readSingleCellExperiment(), respectively.
Assays can be added to an existing QFeatures object using QFeatures::addAssay(). You should then add links between the assays. This is streamlined using QFeatures::addAssayLink().
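The last tip can be sketched with two toy assays. This is a minimal example under stated assumptions: the data are made up, and the shared Sequence column used to link PSMs to peptides is a hypothetical choice that depends on what your identification tool exports.

```r
library(SingleCellExperiment)
library(QFeatures)

## Toy PSM-level data: 3 PSMs measured in 2 cells (values are made up)
psm_quant <- matrix(
    c(10, 12, 30, 11, NA, 28),
    nrow = 3,
    dimnames = list(paste0("psm", 1:3), c("cell1", "cell2"))
)
psms <- SingleCellExperiment(
    assays = list(intensity = psm_quant),
    rowData = DataFrame(
        Sequence = c("AAGLK", "AAGLK", "ELVISK"),
        row.names = rownames(psm_quant)
    )
)

## Toy peptide-level data, as if exported by the same tool
pep_quant <- matrix(
    c(22, 30, 11, 28),
    nrow = 2,
    dimnames = list(c("AAGLK", "ELVISK"), c("cell1", "cell2"))
)
peptides <- SingleCellExperiment(
    assays = list(intensity = pep_quant),
    rowData = DataFrame(
        Sequence = c("AAGLK", "ELVISK"),
        row.names = rownames(pep_quant)
    )
)

## Start from the PSM assay, then add the peptide assay
qf <- QFeatures(list(psms = psms))
qf <- addAssay(qf, peptides, name = "peptides")

## Link each peptide to its PSMs through the shared Sequence column
qf <- addAssayLink(
    qf, from = "psms", to = "peptides",
    varFrom = "Sequence", varTo = "Sequence"
)
qf
```

The same pattern would apply when linking a peptide assay to a protein assay, using whichever rowData column identifies the protein in both tables.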
Add the data documentation and the data collection procedure in scpdata/R/data.R. Use roxygen2 markup language. The documentation is structured as follows, but you can best use the documentation of an existing dataset as a template:
The content of the QFeatures object. Describe each assay, namely what level of features it contains, the number of features and the number of cells/samples.
How you generated the QFeatures object, and where to find the script you created.
An example and the keywords:
##' @examples
##' \donttest{
##' dataset_name()
##' }
##' @keywords datasets
"dataset_name": end the documentation with the name of your dataset, ensuring your dataset is correctly exported.
Add the data source information in the inst/scripts/make-metadata.R script and run the complete script, which will update inst/extdata/metadata.csv. You can use a previous dataset as a template. All fields are mandatory: Title, Description, BiocVersion, Genome, SourceType, SourceUrl, SourceVersion, Species, TaxonomyId, Coordinate_1_based, DataProvider, Maintainer, RDataClass, DispatchClass, PublicationDate, NumberAssays, PreprocessingSoftware, LabelingProtocol, PsmsAvailable, PeptidesAvailable, ProteinsAvailable, ContainsSingleCells, Notes. See ?ExperimentHubData::makeExperimentHubMetadata for a comprehensive description of the fields.
Next, ensure that your updated metadata.csv file is valid by running ExperimentHubData::makeExperimentHubMetadata("scpdata").
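Before running the validation, a quick check that no mandatory column is missing can save a round trip. The sketch below is only a complement to the ExperimentHubData validation, not a substitute for it. The helper missing_metadata_fields() is hypothetical and not part of scpdata or ExperimentHubData; the field names are those listed in this vignette.

```r
## Mandatory metadata fields, as listed in this vignette
mandatory_fields <- c(
    "Title", "Description", "BiocVersion", "Genome", "SourceType",
    "SourceUrl", "SourceVersion", "Species", "TaxonomyId",
    "Coordinate_1_based", "DataProvider", "Maintainer", "RDataClass",
    "DispatchClass", "PublicationDate", "NumberAssays",
    "PreprocessingSoftware", "LabelingProtocol", "PsmsAvailable",
    "PeptidesAvailable", "ProteinsAvailable", "ContainsSingleCells",
    "Notes"
)

## Hypothetical helper: return the mandatory fields that are absent
## from the columns of a metadata table
missing_metadata_fields <- function(metadata) {
    setdiff(mandatory_fields, colnames(metadata))
}

## Toy table with only two of the fields, to show what gets reported
toy <- data.frame(Title = "toy dataset", Description = "Only two fields")
missing_metadata_fields(toy)

## Typical use, run from the root of the package repository:
## metadata <- read.csv("inst/extdata/metadata.csv")
## missing_metadata_fields(metadata)
```

The helper only checks that the columns exist; whether the values themselves are acceptable is left to makeExperimentHubMetadata().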
Push any change you made to GitHub and open a pull request to notify us of your contribution. The pull request should include all the commits related to the dataset you want to contribute. Provide in the description where we can retrieve your QFeatures object, e.g. through Zenodo.
Once your pull request is submitted, we will take over and proceed with the following steps:
metadata.csv on their server. See the help page for more information.