CuratedAtlasQuery is a query interface that allows the programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at cell, sample, or dataset levels based on filtering criteria.
Harmonised data is stored in the ARDC Nectar Research Cloud, and most CuratedAtlasQuery functions interact with Nectar via web requests, so a network connection is required for most functionality.
# Note: in real applications you should use the default value of remote_url
metadata <- get_metadata(remote_url = METADATA_URL)
metadata
#> # Source: SQL [?? x 56]
#> # Database: DuckDB v0.10.1 [biocbuild@Linux 5.15.0-105-generic:R 4.4.0/:memory:]
#> cell_ sample_ cell_type cell_type_harmonised confidence_class
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 TTATGCTAGGGTGTTG_12 039c558c… mature N… immune_unclassified 5
#> 2 GCTTGAACATGGTCTA_12 039c558c… mature N… cd8 tem 3
#> 3 GTCATTTGTGCACCAC_12 039c558c… mature N… immune_unclassified 5
#> 4 AAGGAGCGTATTCTCT_12 039c558c… mature N… immune_unclassified 5
#> 5 ATTGGACTCCGTACAA_12 039c558c… mature N… immune_unclassified 5
#> 6 CTCGTCAAGTACGTTC_12 039c558c… mature N… immune_unclassified 5
#> 7 CGTCTACTCTCTAAGG_11 07e64957… mature N… immune_unclassified 5
#> 8 ACGATCAAGGTGGTTG_15 17640030… mature N… immune_unclassified 5
#> 9 TCATACTTCCGCACGA_15 17640030… mature N… immune_unclassified 5
#> 10 GAGACCCTCCCTTCCC_15 17640030… mature N… immune_unclassified 5
#> # ℹ more rows
#> # ℹ 51 more variables: cell_annotation_azimuth_l2 <chr>,
#> # cell_annotation_blueprint_singler <chr>,
#> # cell_annotation_monaco_singler <chr>, sample_id_db <chr>,
#> # `_sample_name` <chr>, assay <chr>, assay_ontology_term_id <chr>,
#> # file_id_db <chr>, cell_type_ontology_term_id <chr>,
#> # development_stage <chr>, development_stage_ontology_term_id <chr>, …
The metadata variable can then be re-used for all subsequent queries.
metadata |>
  dplyr::distinct(tissue, file_id)
#> # Source: SQL [?? x 2]
#> # Database: DuckDB v0.10.1 [biocbuild@Linux 5.15.0-105-generic:R 4.4.0/:memory:]
#> tissue file_id
#> <chr> <chr>
#> 1 heart left ventricle 5775c8d8-e37e-40cd-94f4-8e78b05ca331
#> 2 kidney 5ef8b993-4a02-42ee-9202-a595f6e9a758
#> 3 thymus 5c1cc788-2645-45fb-b1d9-2f43d368bba8
#> 4 respiratory airway fe1bbb3e-8c3b-4dfd-ae20-9d288b8a7699
#> 5 blood 79d07078-90fd-43c3-b705-46c9b4d9d8d3
#> 6 mesenteric lymph node 59dfc135-19c1-4380-a9e8-958908273756
#> 7 kidney blood vessel f7e94dbb-8638-4616-aaf9-16e2212c369f
#> 8 kidney blood vessel 8fee7b82-178b-4c04-bf23-04689415690d
#> 9 respiratory airway 6661ab3a-792a-4682-b58c-4afb98b2c016
#> 10 retina 94039710-0387-40e1-9667-dbbac4c469c1
#> # ℹ more rows
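As a small illustration of what can be done with this table, the sketch below counts how many files cover each tissue. It uses a local data frame (the hypothetical tissue_files, filled with four rows copied from the output above) rather than the remote table, so it runs offline. The same dplyr::count() call is expected to work on the lazy metadata table as well, but that is an assumption rather than something shown here.

```r
# Hypothetical local stand-in for `metadata |> dplyr::distinct(tissue, file_id)`;
# the rows are copied from the output shown above.
tissue_files <- data.frame(
  tissue = c("kidney", "kidney blood vessel", "kidney blood vessel", "retina"),
  file_id = c(
    "5ef8b993-4a02-42ee-9202-a595f6e9a758",
    "f7e94dbb-8638-4616-aaf9-16e2212c369f",
    "8fee7b82-178b-4c04-bf23-04689415690d",
    "94039710-0387-40e1-9667-dbbac4c469c1"
  )
)

# Number of distinct files per tissue
files_per_tissue <- tissue_files |>
  dplyr::distinct(tissue, file_id) |>
  dplyr::count(tissue, name = "n_files")

files_per_tissue
```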
single_cell_counts =
  metadata |>
  dplyr::filter(
    ethnicity == "African" &
      stringr::str_like(assay, "%10x%") &
      tissue == "lung parenchyma" &
      stringr::str_like(cell_type, "%CD4%")
  ) |>
  get_single_cell_experiment()
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts
#> class: SingleCellExperiment
#> dim: 36229 1571
#> metadata(0):
#> assays(1): counts
#> rownames(36229): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
#> rowData names(0):
#> colnames(1571): ACACCAAAGCCACCTG_SC18_1 TCAGCTCCAGACAAGC_SC18_1 ...
#> CAGCATAAGCTAACAA_F02607_1 AAGGAGCGTATAATGG_F02607_1
#> colData names(56): sample_ cell_type ... updated_at_y original_cell_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
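The filter above relies on stringr::str_like(), which follows SQL LIKE conventions: % matches any run of characters. The sketch below applies the same two patterns to a small, made-up data frame (toy_cells and its values are illustrative assumptions, not rows from the atlas), so the matching behaviour can be checked without a network connection.

```r
# Made-up rows for illustration only; not taken from the atlas.
toy_cells <- data.frame(
  assay = c("10x 3' v3", "Smart-seq2", "10x 5' v1"),
  cell_type = c(
    "CD4-positive, alpha-beta T cell",
    "CD4-positive helper T cell",
    "B cell"
  )
)

# `%` matches any sequence of characters, as in SQL LIKE.
is_10x <- stringr::str_like(toy_cells$assay, "%10x%")
is_cd4 <- stringr::str_like(toy_cells$cell_type, "%CD4%")

# Keep rows that satisfy both conditions, mirroring the `&` in the filter.
toy_cells[is_10x & is_cd4, ]
```

Only the first row passes both conditions: the second row is not a 10x assay and the third is not a CD4 cell type.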
Querying counts scaled per million (the cpm assay) is helpful when only a few genes are of interest, because the values can be compared across samples.
single_cell_counts =
  metadata |>
  dplyr::filter(
    ethnicity == "African" &
      stringr::str_like(assay, "%10x%") &
      tissue == "lung parenchyma" &
      stringr::str_like(cell_type, "%CD4%")
  ) |>
  get_single_cell_experiment(assays = "cpm")
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts
#> class: SingleCellExperiment
#> dim: 36229 1571
#> metadata(0):
#> assays(1): cpm
#> rownames(36229): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
#> rowData names(0):
#> colnames(1571): ACACCAAAGCCACCTG_SC18_1 TCAGCTCCAGACAAGC_SC18_1 ...
#> CAGCATAAGCTAACAA_F02607_1 AAGGAGCGTATAATGG_F02607_1
#> colData names(56): sample_ cell_type ... updated_at_y original_cell_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
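For intuition, counts per million are conventionally obtained by dividing each count by the total count of its cell and multiplying by one million. The sketch below applies that conventional definition to a made-up matrix (raw_counts is hypothetical). Whether the package computes its cpm assay in exactly this way is an assumption, as the text does not describe the computation.

```r
# Made-up raw counts: 3 genes (rows) x 2 cells (columns).
# matrix() fills column by column, so cell1 = (10, 30, 60), cell2 = (5, 5, 90).
raw_counts <- matrix(
  c(10, 30, 60,
    5, 5, 90),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

# Total counts per cell (library size).
library_size <- colSums(raw_counts)

# Divide each column by its library size and scale to one million.
cpm <- sweep(raw_counts, 2, library_size, "/") * 1e6

cpm
```

After scaling, every column sums to one million, which is what makes values comparable between cells with different sequencing depth.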
single_cell_counts =
  metadata |>
  dplyr::filter(
    ethnicity == "African" &
      stringr::str_like(assay, "%10x%") &
      tissue == "lung parenchyma" &
      stringr::str_like(cell_type, "%CD4%")
  ) |>
  get_single_cell_experiment(assays = "cpm", features = "PUM1")
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts
#> class: SingleCellExperiment
#> dim: 1 1571
#> metadata(0):
#> assays(1): cpm
#> rownames(1): PUM1
#> rowData names(0):
#> colnames(1571): ACACCAAAGCCACCTG_SC18_1 TCAGCTCCAGACAAGC_SC18_1 ...
#> CAGCATAAGCTAACAA_F02607_1 AAGGAGCGTATAATGG_F02607_1
#> colData names(56): sample_ cell_type ... updated_at_y original_cell_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
get_seurat() converts the H5 SingleCellExperiment to Seurat, so it might take a long time and occupy a lot of memory, depending on how many cells you are requesting.
single_cell_counts_seurat =
  metadata |>
  dplyr::filter(
    ethnicity == "African" &
      stringr::str_like(assay, "%10x%") &
      tissue == "lung parenchyma" &
      stringr::str_like(cell_type, "%CD4%")
  ) |>
  get_seurat()
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts_seurat
#> An object of class Seurat
#> 36229 features across 1571 samples within 1 assay
#> Active assay: originalexp (36229 features, 0 variable features)
#> 2 layers present: counts, data
The returned SingleCellExperiment can be saved with two modalities, as .rds or as HDF5.
Saving as .rds has the advantage of being fast, and the .rds file occupies very little disk space as it only stores the links to the files in your cache.
However, it has the disadvantage that for big SingleCellExperiment objects, which merge a lot of HDF5 files from your get_single_cell_experiment() call, display and manipulation are going to be slow. In addition, an .rds saved in this way is not portable: you will not be able to share it with other users.
Saving as .hdf5 executes any computation on the SingleCellExperiment and writes it to disk as a monolithic HDF5. Once this is done, operations on the SingleCellExperiment will be comparatively very fast. The resulting .hdf5 file will also be totally portable and shareable.
However, this .hdf5 has the disadvantage of being larger than the corresponding .rds, as it includes a copy of the count information, and the saving process is going to be slow for large objects.
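The two modalities can be sketched as follows. This is a minimal illustration on a tiny, made-up in-memory SingleCellExperiment (the hypothetical toy_sce) written to temporary paths, assuming the SingleCellExperiment and HDF5Array packages are installed. With a real query result, single_cell_counts would take the place of toy_sce and the paths would be your own.

```r
# Tiny made-up object standing in for the result of get_single_cell_experiment().
toy_counts_matrix <- matrix(
  1:6,
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:2))
)
toy_sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(counts = toy_counts_matrix)
)

# Modality 1: .rds (fast to write; for atlas queries it only stores
# links to the files in the local cache).
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_sce, rds_path)
from_rds <- readRDS(rds_path)

# Modality 2: HDF5 (writes the assay data into a self-contained directory).
h5_dir <- tempfile("single_cell_counts_")
HDF5Array::saveHDF5SummarizedExperiment(toy_sce, dir = h5_dir, replace = TRUE)
from_h5 <- HDF5Array::loadHDF5SummarizedExperiment(h5_dir)

dim(from_rds)
dim(from_h5)
```

The HDF5 directory can be copied to another machine and reloaded with loadHDF5SummarizedExperiment(), whereas the .rds of an atlas query depends on the cache it was created from.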
We can gather all CD14 monocytes and plot the distribution of HLA-A across all tissues.
suppressPackageStartupMessages({
  library(ggplot2)
})
# Plots with styling
counts <- metadata |>
  # Filter and subset
  dplyr::filter(cell_type_harmonised == "cd14 mono") |>
  dplyr::filter(file_id_db != "c5a05f23f9784a3be3bfa651198a48eb") |>
  # Get counts per million for the HLA-A gene
  get_single_cell_experiment(assays = "cpm", features = "HLA-A") |>
  suppressMessages() |>
  # Add feature to table
  tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |>
  # Convert to a tibble for plotting
  tibble::as_tibble()
# Plot by disease
counts |>
  # Median per disease, used to rank the x axis (note: na.rm, not rm.na)
  dplyr::with_groups(
    disease,
    ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, na.rm = TRUE))
  ) |>
  # Plot
  ggplot(aes(
    forcats::fct_reorder(disease, median_count, .desc = TRUE),
    `HLA.A`,
    color = file_id
  )) +
  geom_jitter(shape = ".") +
  # Style
  guides(color = "none") +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) +
  xlab("Disease") +
  ggtitle("HLA-A in CD14 monocytes by disease")
#> Warning in scale_y_log10(): log-10 transformation introduced infinite values.
# Plot by tissue
counts |>
  # Median per tissue, used to rank the x axis (note: na.rm, not rm.na)
  dplyr::with_groups(
    tissue_harmonised,
    ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, na.rm = TRUE))
  ) |>
  # Plot
  ggplot(aes(
    forcats::fct_reorder(tissue_harmonised, median_count, .desc = TRUE),
    `HLA.A`,
    color = file_id
  )) +
  geom_jitter(shape = ".") +
  # Style
  guides(color = "none") +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) +
  xlab("Tissue") +
  ggtitle("HLA-A in CD14 monocytes by tissue") +
  theme(legend.position = "none")
#> Warning in scale_y_log10(): log-10 transformation introduced infinite values.
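The ordering of the x axis in these plots relies on a per-group median. The base R sketch below reproduces that step on a made-up table (toy_counts and its values are illustrative) and shows why the argument must be spelled na.rm: a misspelled argument such as rm.na is silently absorbed by median(), so missing values are not removed.

```r
# Made-up expression values with one missing entry.
toy_counts <- data.frame(
  disease = c("normal", "normal", "normal", "COVID-19", "COVID-19"),
  HLA.A = c(10, NA, 30, 50, 70)
)

# A misspelled argument is ignored, so the NA propagates to the result.
misspelled_result <- median(toy_counts$HLA.A, rm.na = TRUE)

# Per-group median with missing values removed, repeated on every row
# (the same shape that mutate() inside with_groups() produces).
toy_counts$median_count <- ave(
  toy_counts$HLA.A,
  toy_counts$disease,
  FUN = function(x) median(x, na.rm = TRUE)
)

misspelled_result
toy_counts
```

With na.rm = TRUE the "normal" group gets the median of 10 and 30, while the misspelled call returns NA.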
metadata |>
  # Filter and subset
  dplyr::filter(cell_type_harmonised == "nk") |>
  # Get counts per million for the HLA-A gene
  get_single_cell_experiment(assays = "cpm", features = "HLA-A") |>
  suppressMessages() |>
  # Plot
  tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |>
  ggplot(aes(tissue_harmonised, `HLA.A`, color = file_id)) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
    legend.position = "none"
  ) +
  geom_jitter(shape = ".") +
  xlab("Tissue") +
  ggtitle("HLA-A in nk cells by tissue")
Various metadata fields are not common between datasets, so it does not make sense for these to live in the main metadata table. However, we can obtain them using the get_unharmonised_metadata() function. This function returns a data frame with one row per dataset, including the unharmonised column, which contains unharmonised metadata as a nested data frame.
harmonised <- metadata |> dplyr::filter(tissue == "kidney blood vessel")
unharmonised <- get_unharmonised_metadata(harmonised)
unharmonised
#> # A tibble: 4 × 2
#> file_id unharmonised
#> <chr> <list>
#> 1 63523aa3-0d04-4fc6-ac59-5cadd3e73a14 <tbl_dck_[,17]>
#> 2 8fee7b82-178b-4c04-bf23-04689415690d <tbl_dck_[,12]>
#> 3 dc9d8cdd-29ee-4c44-830c-6559cb3d0af6 <tbl_dck_[,14]>
#> 4 f7e94dbb-8638-4616-aaf9-16e2212c369f <tbl_dck_[,14]>
Notice that the columns differ between each dataset’s data frame:
dplyr::pull(unharmonised) |> head(2)
#> [[1]]
#> # Source: SQL [?? x 17]
#> # Database: DuckDB v0.10.1 [biocbuild@Linux 5.15.0-105-generic:R 4.4.0/:memory:]
#> cell_ file_id donor_age donor_uuid library_uuid mapped_reference_ann…¹
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 2 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 3 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 4 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 5 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 6 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 7 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 8 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> 9 4602STDY709… 63523a… 19 months 46318131-… 67178571-39… GENCODE 24
#> 10 4602STDY701… 63523a… 27 months a8536b6a-… 5ddaeac7-0f… GENCODE 24
#> # ℹ more rows
#> # ℹ abbreviated name: ¹mapped_reference_annotation
#> # ℹ 11 more variables: sample_uuid <chr>, suspension_type <chr>,
#> # suspension_uuid <chr>, author_cell_type <chr>, cell_state <chr>,
#> # reported_diseases <chr>, Short_Sample <chr>, Project <chr>,
#> # Experiment <chr>, compartment <chr>, broad_celltype <chr>
#>
#> [[2]]
#> # Source: SQL [?? x 12]
#> # Database: DuckDB v0.10.1 [biocbuild@Linux 5.15.0-105-generic:R 4.4.0/:memory:]
#> cell_ file_id orig.ident nCount_RNA nFeature_RNA seurat_clusters Project
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 1069 8fee7b82-17… 4602STDY7… 16082 3997 25 Experi…
#> 2 1214 8fee7b82-17… 4602STDY7… 1037 606 25 Experi…
#> 3 2583 8fee7b82-17… 4602STDY7… 3028 1361 25 Experi…
#> 4 2655 8fee7b82-17… 4602STDY7… 1605 859 25 Experi…
#> 5 3609 8fee7b82-17… 4602STDY7… 1144 682 25 Experi…
#> 6 3624 8fee7b82-17… 4602STDY7… 1874 963 25 Experi…
#> 7 3946 8fee7b82-17… 4602STDY7… 1296 755 25 Experi…
#> 8 5163 8fee7b82-17… 4602STDY7… 11417 3255 25 Experi…
#> 9 5446 8fee7b82-17… 4602STDY7… 1769 946 19 Experi…
#> 10 6275 8fee7b82-17… 4602STDY7… 3750 1559 25 Experi…
#> # ℹ more rows
#> # ℹ 5 more variables: donor_id <chr>, compartment <chr>, broad_celltype <chr>,
#> # author_cell_type <chr>, Sample <chr>
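Because each nested table has its own columns, it can help to check which columns the datasets share before combining them. The sketch below does this in base R on two made-up data frames (dataset_1 and dataset_2 are hypothetical; only their column names are a subset of those printed above). With real data, the list of tables would come from dplyr::pull(unharmonised).

```r
# Made-up tables; only the column names mirror the output above.
dataset_1 <- data.frame(
  cell_ = "a", file_id = "f1", donor_age = "27 months",
  compartment = "x", broad_celltype = "y"
)
dataset_2 <- data.frame(
  cell_ = "b", file_id = "f2", nCount_RNA = 16082,
  compartment = "x", broad_celltype = "y"
)
tables <- list(dataset_1, dataset_2)

# Columns present in every table.
shared_columns <- Reduce(intersect, lapply(tables, colnames))

# Columns found in some tables only.
all_columns <- Reduce(union, lapply(tables, colnames))
dataset_specific <- setdiff(all_columns, shared_columns)

shared_columns
dataset_specific
```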
Dataset-specific columns (definitions available at cellxgene.cziscience.com)
cell_count, collection_id, created_at.x, created_at.y, dataset_deployments, dataset_id, file_id, filename, filetype, is_primary_data.y, is_valid, linked_genesets, mean_genes_per_cell, name, published, published_at, revised_at, revision, s3_uri, schema_version, tombstone, updated_at.x, updated_at.y, user_submitted, x_normalization
Sample-specific columns (definitions available at cellxgene.cziscience.com)
sample_, sample_name, age_days, assay, assay_ontology_term_id, development_stage, development_stage_ontology_term_id, ethnicity, ethnicity_ontology_term_id, experiment___, organism, organism_ontology_term_id, sample_placeholder, sex, sex_ontology_term_id, tissue, tissue_harmonised, tissue_ontology_term_id, disease, disease_ontology_term_id, is_primary_data.x
Cell-specific columns (definitions available at cellxgene.cziscience.com)
cell_, cell_type, cell_type_ontology_term_id, cell_type_harmonised, confidence_class, cell_annotation_azimuth_l2, cell_annotation_blueprint_singler
Through harmonisation and curation we introduced custom columns, not present in the original CELLxGENE metadata:
tissue_harmonised: a coarser tissue name for better filtering
age_days: the number of days corresponding to the age
cell_type_harmonised: the consensus call identity (for immune cells) using the original and three novel annotations using Seurat Azimuth and SingleR
confidence_class: an ordinal class of how confident cell_type_harmonised is. 1 is complete consensus, 2 is three out of four, and so on.
cell_annotation_azimuth_l2: Azimuth cell annotation
cell_annotation_blueprint_singler: SingleR cell annotation using Blueprint reference
cell_annotation_monaco_singler: SingleR cell annotation using Monaco reference
sample_id_db: Sample subdivision for internal use
file_id_db: File subdivision for internal use
sample_: Sample ID
sample_name: How samples were defined
The raw assay includes RNA abundance in the positive real scale (not transformed with non-linear functions, e.g. log or sqrt). Originally, CELLxGENE included a mix of scales and transformations, specified in the x_normalization column.
The cpm assay includes counts per million.
sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 CuratedAtlasQueryR_1.2.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.8
#> [3] magrittr_2.0.3 spatstat.utils_3.0-4
#> [5] farver_2.1.1 rmarkdown_2.26
#> [7] zlibbioc_1.50.0 vctrs_0.6.5
#> [9] ROCR_1.0-11 spatstat.explore_3.2-7
#> [11] forcats_1.0.0 htmltools_0.5.8.1
#> [13] S4Arrays_1.4.0 ttservice_0.4.0
#> [15] Rhdf5lib_1.26.0 SparseArray_1.4.0
#> [17] rhdf5_2.48.0 sass_0.4.9
#> [19] sctransform_0.4.1 parallelly_1.37.1
#> [21] KernSmooth_2.23-22 bslib_0.7.0
#> [23] htmlwidgets_1.6.4 ica_1.0-3
#> [25] plyr_1.8.9 plotly_4.10.4
#> [27] zoo_1.8-12 cachem_1.0.8
#> [29] igraph_2.0.3 mime_0.12
#> [31] lifecycle_1.0.4 pkgconfig_2.0.3
#> [33] Matrix_1.7-0 R6_2.5.1
#> [35] fastmap_1.1.1 GenomeInfoDbData_1.2.12
#> [37] MatrixGenerics_1.16.0 fitdistrplus_1.1-11
#> [39] future_1.33.2 shiny_1.8.1.1
#> [41] digest_0.6.35 colorspace_2.1-0
#> [43] patchwork_1.2.0 S4Vectors_0.42.0
#> [45] rprojroot_2.0.4 Seurat_5.0.3
#> [47] tensor_1.5 RSpectra_0.16-1
#> [49] irlba_2.3.5.1 GenomicRanges_1.56.0
#> [51] labeling_0.4.3 progressr_0.14.0
#> [53] fansi_1.0.6 spatstat.sparse_3.0-3
#> [55] httr_1.4.7 polyclip_1.10-6
#> [57] abind_1.4-5 compiler_4.4.0
#> [59] withr_3.0.0 DBI_1.2.2
#> [61] fastDummies_1.7.3 highr_0.10
#> [63] HDF5Array_1.32.0 duckdb_0.10.1
#> [65] MASS_7.3-60.2 DelayedArray_0.30.0
#> [67] tools_4.4.0 lmtest_0.9-40
#> [69] httpuv_1.6.15 future.apply_1.11.2
#> [71] goftest_1.2-3 glue_1.7.0
#> [73] nlme_3.1-164 rhdf5filters_1.16.0
#> [75] promises_1.3.0 grid_4.4.0
#> [77] Rtsne_0.17 cluster_2.1.6
#> [79] reshape2_1.4.4 generics_0.1.3
#> [81] gtable_0.3.5 spatstat.data_3.0-4
#> [83] tidyr_1.3.1 data.table_1.15.4
#> [85] sp_2.1-4 utf8_1.2.4
#> [87] XVector_0.44.0 BiocGenerics_0.50.0
#> [89] spatstat.geom_3.2-9 RcppAnnoy_0.0.22
#> [91] ggrepel_0.9.5 RANN_2.6.1
#> [93] pillar_1.9.0 stringr_1.5.1
#> [95] spam_2.10-0 RcppHNSW_0.6.0
#> [97] later_1.3.2 splines_4.4.0
#> [99] dplyr_1.1.4 lattice_0.22-6
#> [101] survival_3.6-4 deldir_2.0-4
#> [103] tidyselect_1.2.1 SingleCellExperiment_1.26.0
#> [105] miniUI_0.1.1.1 pbapply_1.7-2
#> [107] knitr_1.46 gridExtra_2.3
#> [109] IRanges_2.38.0 SummarizedExperiment_1.34.0
#> [111] scattermore_1.2 stats4_4.4.0
#> [113] xfun_0.43 Biobase_2.64.0
#> [115] matrixStats_1.3.0 UCSC.utils_1.0.0
#> [117] stringi_1.8.3 lazyeval_0.2.2
#> [119] yaml_2.3.8 evaluate_0.23
#> [121] codetools_0.2-20 tibble_3.2.1
#> [123] cli_3.6.2 uwot_0.2.2
#> [125] xtable_1.8-4 reticulate_1.36.1
#> [127] munsell_0.5.1 jquerylib_0.1.4
#> [129] GenomeInfoDb_1.40.0 Rcpp_1.0.12
#> [131] spatstat.random_3.2-3 globals_0.16.3
#> [133] dbplyr_2.5.0 png_0.1-8
#> [135] parallel_4.4.0 ellipsis_0.3.2
#> [137] blob_1.2.4 assertthat_0.2.1
#> [139] dotCall64_1.1-1 tidySingleCellExperiment_1.14.0
#> [141] listenv_0.9.1 viridisLite_0.4.2
#> [143] scales_1.3.0 ggridges_0.5.6
#> [145] SeuratObject_5.0.1 leiden_0.4.3.1
#> [147] purrr_1.0.2 crayon_1.5.2
#> [149] rlang_1.1.3 cowplot_1.1.3