The purpose of this package is to make it easy to query the Human Cell Atlas Data Portal via their data browser API. Visit the Human Cell Atlas for more information on the project.
Evaluate the following code chunk to install packages required for this vignette.
## install from Bioconductor if you haven't already
pkgs <- c("httr", "dplyr", "LoomExperiment", "hca")
pkgs_needed <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(pkgs_needed)
Load the packages into your R session.
library(httr)
library(dplyr)
library(LoomExperiment)
library(hca)
To illustrate use of this package, consider the task of downloading a ‘loom’ file summarizing single-cell gene expression observed in an HCA research project. This could be accomplished by visiting the HCA data portal (at https://data.humancellatlas.org/explore) in a web browser and selecting projects interactively, but it is valuable to accomplish the same goal in a reproducible, flexible, programmatic way. We will (1) discover projects available in the HCA Data Coordinating Center that have loom files; and (2) retrieve the file from the HCA and import the data into R as a ‘LoomExperiment’ object. For illustration, we focus on the ‘Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns’ project.
Use projects() to retrieve the first 200 projects in the HCA’s default catalog.
projects(size = 200)
## # A tibble: 200 × 14
## projectId proje…¹ genus…² sampl…³ speci…⁴ speci…⁵ selec…⁶ libra…⁷ nucle…⁸
## <chr> <chr> <list> <list> <list> <list> <list> <list> <list>
## 1 74b6d569-3b1… 1.3 Mi… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 2 53c53cd4-812… A Cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 3 7027adc6-c9c… A Cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 4 94e4ee09-9b4… A Huma… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 5 60ea42e1-af4… A Prot… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 6 ef1e3497-515… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 7 9ac53858-606… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 8 923d3231-729… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 9 f86f1ab4-1fb… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 10 602628d7-c03… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## # … with 190 more rows, 5 more variables: pairedEnd <list>, workflow <list>,
## # specimenDisease <list>, donorDisease <list>, developmentStage <list>, and
## # abbreviated variable names ¹projectTitle, ²genusSpecies, ³sampleEntityType,
## # ⁴specimenOrgan, ⁵specimenOrganPart, ⁶selectedCellType,
## # ⁷libraryConstructionApproach, ⁸nucleicAcidSource
Use filters() to restrict the projects to just those that contain at least one ‘loom’ file.
project_filter <- filters(fileFormat = list(is = "loom"))
project_tibble <- projects(project_filter)
project_tibble
## # A tibble: 79 × 14
## projectId proje…¹ genus…² sampl…³ speci…⁴ speci…⁵ selec…⁶ libra…⁷ nucle…⁸
## <chr> <chr> <list> <list> <list> <list> <list> <list> <list>
## 1 53c53cd4-812… A Cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 2 7027adc6-c9c… A Cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 3 c1810dbc-16d… A cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 4 a9301beb-e9f… A huma… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 5 996120f9-e84… A huma… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 6 842605c7-375… A sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 7 a004b150-1c3… A sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 8 4a95101c-9ff… A sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 9 1cd1f41f-f81… A sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 10 8185730f-411… A sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## # … with 69 more rows, 5 more variables: pairedEnd <list>, workflow <list>,
## # specimenDisease <list>, donorDisease <list>, developmentStage <list>, and
## # abbreviated variable names ¹projectTitle, ²genusSpecies, ³sampleEntityType,
## # ⁴specimenOrgan, ⁵specimenOrganPart, ⁶selectedCellType,
## # ⁷libraryConstructionApproach, ⁸nucleicAcidSource
Use standard R commands to further filter projects to the one we are interested in, with title starting with “Single…”. Extract the unique projectId for the first project with this title.
project_tibble |>
filter(startsWith(projectTitle, "Single")) |>
head(1) |>
t()
## [,1]
## projectId "0c3b7785-f74d-4091-8616-a68757e4c2a8"
## projectTitle "Single cell RNA sequencing of multiple myeloma II"
## genusSpecies "Homo sapiens"
## sampleEntityType "specimens"
## specimenOrgan "hematopoietic system"
## specimenOrganPart character,2
## selectedCellType "plasma cell"
## libraryConstructionApproach character,2
## nucleicAcidSource "single cell"
## pairedEnd logical,2
## workflow character,4
## specimenDisease "plasma cell myeloma"
## donorDisease "plasma cell myeloma"
## developmentStage "human adult stage"
projectIds <-
project_tibble |>
filter(startsWith(projectTitle, "Single")) |>
dplyr::pull(projectId)
projectId <- projectIds[1]
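Because startsWith() matches every title that begins with the given prefix, a short prefix such as "Single" can match several projects, and projectIds[1] then simply takes the first of them. The following is a minimal sketch of how a longer prefix narrows the match to one project. The toy_projects table, its shortened second title, and the id-1 / id-2 identifiers are hypothetical stand-ins for project_tibble, used so that the example runs without querying the HCA.

```r
# Hypothetical stand-in for project_tibble; the identifiers are made up
# and the second title is shortened for illustration.
toy_projects <- data.frame(
    projectId = c("id-1", "id-2"),
    projectTitle = c(
        "Single cell RNA sequencing of multiple myeloma II",
        "Single cell transcriptome analysis of human pancreas"
    ),
    stringsAsFactors = FALSE
)

# A short prefix matches both titles ...
short_match <- startsWith(toy_projects$projectTitle, "Single")
sum(short_match)

# ... whereas a longer prefix identifies exactly one project.
long_match <- startsWith(
    toy_projects$projectTitle,
    "Single cell transcriptome analysis"
)
toy_id <- toy_projects$projectId[long_match]

# Fail early if the prefix is ambiguous or matches nothing.
stopifnot(length(toy_id) == 1L)
toy_id
```

The same stopifnot() guard could be placed after dplyr::pull(projectId) in the chunk above to detect a prefix that matches more than one project.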
files() retrieves (the first 1000) files from the Human Cell Atlas data portal. Construct a filter to restrict the files to loom files from the project we are interested in.
file_filter <- filters(
projectId = list(is = projectId),
fileFormat = list(is = "loom")
)
# only the two smallest files
file_tibble <- files(file_filter, size = 2, sort = "fileSize", order = "asc")
file_tibble
## # A tibble: 2 × 8
## fileId name fileF…¹ size version proje…² proje…³ url
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
## 1 fe214fea-cc68-56d0-a6d2-fe… bone… loom 4.04e7 2021-0… Single… 0c3b77… http…
## 2 3014ec47-1399-57ca-ab74-c8… Bone… loom 6.35e7 2021-1… Single… 0c3b77… http…
## # … with abbreviated variable names ¹fileFormat, ²projectTitle, ³projectId
files_download() will download one or more files (one for each row) in file_tibble. The download is more complicated than simply following the url column of file_tibble, so it is not possible to simply copy the url into a browser. We’ll download the file and then immediately import it into R.
file_locations <- file_tibble |> files_download()
LoomExperiment::import(unname(file_locations[1]),
type ="SingleCellLoomExperiment")
## class: SingleCellLoomExperiment
## dim: 58347 3762
## metadata(15): last_modified CreationDate ...
## project.provenance.document_id specimen_from_organism.organ
## assays(1): matrix
## rownames: NULL
## rowData names(29): Gene antisense_reads ... reads_per_molecule
## spliced_reads
## colnames: NULL
## colData names(43): CellID antisense_reads ... reads_unmapped
## spliced_reads
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowGraphs(0): NULL
## colGraphs(0): NULL
Note that files_download() uses BiocFileCache (https://bioconductor.org/packages/BiocFileCache), so individual files are only downloaded once.
h5ad files
This example walks through the process of file discovery and retrieval in a little more detail, using h5ad files created by the Python AnnData analysis software and available for some experiments in the default catalog.
The first challenge is to understand what file formats are available from the HCA. Obtain a tibble describing the ‘facets’ of the data, the number of terms used in each facet, and the number of distinct values used to describe projects.
projects_facets()
## # A tibble: 34 × 3
## facet n_terms n_values
## <chr> <int> <int>
## 1 assayType 1 288
## 2 biologicalSex 5 474
## 3 cellLineType 6 302
## 4 contactName 3009 3750
## 5 contentDescription 50 938
## 6 developmentStage 147 643
## 7 donorDisease 251 610
## 8 effectiveOrgan 135 518
## 9 fileFormat 58 910
## 10 fileSource 10 624
## # … with 24 more rows
Note the fileFormat facet, and repeat projects_facets() to discover details about the available file formats.
projects_facets("fileFormat")
## # A tibble: 58 × 3
## facet term count
## <chr> <chr> <int>
## 1 fileFormat fastq.gz 239
## 2 fileFormat loom 79
## 3 fileFormat bam 77
## 4 fileFormat tar 60
## 5 fileFormat txt.gz 49
## 6 fileFormat tsv.gz 47
## 7 fileFormat csv.gz 41
## 8 fileFormat mtx.gz 40
## 9 fileFormat csv 38
## 10 fileFormat bai 26
## # … with 48 more rows
The h5ad file format is one of the terms of this facet, although its row falls outside the 10 printed above. Use this as a filter to discover relevant projects.
filters <- filters(fileFormat = list(is = "h5ad"))
projects(filters)
## # A tibble: 21 × 14
## projectId proje…¹ genus…² sampl…³ speci…⁴ speci…⁵ selec…⁶ libra…⁷ nucle…⁸
## <chr> <chr> <list> <list> <list> <list> <list> <list> <list>
## 1 c1810dbc-16d… A cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 2 c0518445-3b3… A cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 3 2fe3c60b-ac1… A huma… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 4 73769e0a-5fc… A prox… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 5 957261f7-2bd… A spat… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 6 1dddae6e-375… Cell T… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 7 ad98d3cd-26f… Cells … <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 8 fde199d2-a84… Cells … <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 9 83f5188e-3bf… Distin… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 10 2f676143-80c… Integr… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## # … with 11 more rows, 5 more variables: pairedEnd <list>, workflow <list>,
## # specimenDisease <list>, donorDisease <list>, developmentStage <list>, and
## # abbreviated variable names ¹projectTitle, ²genusSpecies, ³sampleEntityType,
## # ⁴specimenOrgan, ⁵specimenOrganPart, ⁶selectedCellType,
## # ⁷libraryConstructionApproach, ⁸nucleicAcidSource
The default tibble produced by projects() contains only some of the information available; the information is much richer.
To obtain a tibble with an expanded set of columns, set the as parameter to "tibble_expanded".
# an expanded set of columns for all or the first 4 projects
projects(as = 'tibble_expanded', size = 4)
## # A tibble: 4 × 123
## projectId cellS…¹ cellS…² cellS…³ cellS…⁴ dates…⁵ dates…⁶ dates…⁷ dates…⁸
## <chr> <list> <chr> <list> <int> <chr> <chr> <chr> <chr>
## 1 74b6d569-3b11… <chr> brain <chr> 1330000 2021-1… 2019-0… 2021-1… 2021-1…
## 2 53c53cd4-8127… <chr> prosta… <chr> 108701 2021-0… 2021-0… 2021-0… 2021-0…
## 3 7027adc6-c9c9… <chr> heart <chr> NA 2021-0… 2020-0… 2021-0… 2021-0…
## 4 94e4ee09-9b4b… <chr> liver <chr> NA 2022-0… 2022-0… 2022-0… 2022-0…
## # … with 114 more variables: dates.submissionDate <chr>,
## # dates.updateDate <chr>, donorOrganisms.biologicalSex <chr>,
## # donorOrganisms.developmentStage <list>, donorOrganisms.disease <list>,
## # donorOrganisms.donorCount <int>, donorOrganisms.genusSpecies <chr>,
## # donorOrganisms.id <list>, `donorOrganisms[*].organismAgeRange` <list>,
## # `donorOrganisms[*].organismAgeRange[*][*]` <list>,
## # donorOrganisms.organismAge <list>, …
In the next sections, we’ll cover other options for the as parameter, and the data formats they return.
projects() as an R list
Instead of retrieving the result of projects() as a tibble, retrieve it as a ‘list-of-lists’.
projects_list <- projects(size = 200, as = "list")
This is a complicated structure. We will use lengths(), names(), and standard R list selection operations to navigate this a bit. At the top level there are three elements.
lengths(projects_list)
## pagination termFacets hits
## 8 35 200
hits represents each project as a list, e.g.,
lengths(projects_list$hits[[1]])
## protocols entryId sources projects
## 2 1 1 1
## samples specimens cellLines donorOrganisms
## 1 1 0 1
## organoids cellSuspensions dates fileTypeSummaries
## 0 1 1 2
shows that there are 12 different ways in which the first project is described. Each component is itself a list-of-lists, e.g.,
lengths(projects_list$hits[[1]]$projects[[1]])
## projectId projectTitle projectShortname laboratory
## 1 1 1 1
## estimatedCellCount projectDescription contributors publications
## 1 1 6 1
## supplementaryLinks matrices contributedAnalyses accessions
## 1 0 1 3
## accessible
## 1
projects_list$hits[[1]]$projects[[1]]$projectTitle
## [1] "1.3 Million Brain Cells from E18 Mice"
One can use standard R commands to navigate this data structure, and to, e.g., extract the projectTitle of each project.
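As a minimal sketch of such navigation, vapply() can visit each element of hits and pull out its projectTitle. The toy_list object below is a hypothetical, much-reduced stand-in for projects_list: only the fields used here are present, and just two hits are included, so that the example runs without a network connection.

```r
# Hypothetical, reduced stand-in for projects(size = 200, as = "list").
toy_list <- list(
    pagination = list(),
    termFacets = list(),
    hits = list(
        list(projects = list(list(
            projectTitle = "1.3 Million Brain Cells from E18 Mice"
        ))),
        list(projects = list(list(
            projectTitle = "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"
        )))
    )
)

# Visit every hit and extract the title of its (single) project entry;
# character(1) makes vapply() fail if a title is missing or not scalar.
toy_titles <- vapply(
    toy_list$hits,
    function(hit) hit$projects[[1]]$projectTitle,
    character(1)
)
toy_titles
```

The same vapply() call should apply to projects_list$hits, provided that every hit has exactly one entry in projects with a single projectTitle, as the lengths() output above suggests for the first hit.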
projects() as an lol
Use as = "lol" to create a more convenient way to select, filter and extract elements from the list-of-lists returned by projects().
lol <- projects(size = 200, as = "lol")
lol
## # class: lol_hca lol
## # number of distinct paths: 13715
## # total number of elements: 125410
## # number of leaf paths: 10361
## # number of leaf elements: 95504
## # lol_path():
## # A tibble: 13,715 × 3
## path n is_leaf
## <chr> <int> <lgl>
## 1 hits 1 FALSE
## 2 hits[*] 200 FALSE
## 3 hits[*].cellLines 200 FALSE
## 4 hits[*].cellLines[*] 42 FALSE
## 5 hits[*].cellLines[*].cellLineType 42 FALSE
## 6 hits[*].cellLines[*].cellLineType[*] 54 TRUE
## 7 hits[*].cellLines[*].id 42 FALSE
## 8 hits[*].cellLines[*].id[*] 485 TRUE
## 9 hits[*].cellLines[*].modelOrgan 42 FALSE
## 10 hits[*].cellLines[*].modelOrgan[*] 56 TRUE
## # … with 13,705 more rows
Use lol_select() to restrict the lol to particular paths, and lol_filter() to filter results to paths that are leaves, or with specific numbers of entries.
lol_select(lol, "hits[*].projects[*]")
## # class: lol_hca lol
## # number of distinct paths: 13592
## # total number of elements: 78211
## # number of leaf paths: 10304
## # number of leaf elements: 63302
## # lol_path():
## # A tibble: 13,592 × 3
## path n is_leaf
## <chr> <int> <lgl>
## 1 hits[*].projects[*] 200 FALSE
## 2 hits[*].projects[*].accessible 200 TRUE
## 3 hits[*].projects[*].accessions 200 FALSE
## 4 hits[*].projects[*].accessions[*] 580 FALSE
## 5 hits[*].projects[*].accessions[*].accession 580 TRUE
## 6 hits[*].projects[*].accessions[*].namespace 580 TRUE
## 7 hits[*].projects[*].contributedAnalyses 200 FALSE
## 8 hits[*].projects[*].contributedAnalyses.developmentStage 2 FALSE
## 9 hits[*].projects[*].contributedAnalyses.developmentStage.adult 2 FALSE
## 10 hits[*].projects[*].contributedAnalyses.developmentStage.adult… 2 FALSE
## # … with 13,582 more rows
lol_select(lol, "hits[*].projects[*]") |>
lol_filter(n == 44, is_leaf)
## # class: lol_hca lol
## # number of distinct paths: 0
## # total number of elements: 0
## # number of leaf paths: 0
## # number of leaf elements: 0
## # lol_path():
## # A tibble: 0 × 3
## # … with 3 variables: path <chr>, n <int>, is_leaf <lgl>
lol_pull() extracts a path from the lol as a vector; lol_lpull() extracts paths as lists.
titles <- lol_pull(lol, "hits[*].projects[*].projectTitle")
length(titles)
## [1] 200
head(titles, 2)
## [1] "1.3 Million Brain Cells from E18 Mice"
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"
projects() tibbles with specific columns
The path or its abbreviation can be used to specify the columns of the tibble to be returned by the projects() query.
Here we retrieve additional details of donor count and total cells by adding appropriate path abbreviations to a named character vector. Names on the character vector can be used to rename the path more concisely, but the paths must uniquely identify elements in the list-of-lists.
columns <- c(
projectId = "hits[*].entryId",
projectTitle = "hits[*].projects[*].projectTitle",
genusSpecies = "hits[*].donorOrganisms[*].genusSpecies[*]",
donorCount = "hits[*].donorOrganisms[*].donorCount",
cellSuspensions.organ = "hits[*].cellSuspensions[*].organ[*]",
totalCells = "hits[*].cellSuspensions[*].totalCells"
)
projects <- projects(filters, columns = columns)
projects
## # A tibble: 21 × 6
## projectId projec…¹ genus…² donor…³ cellS…⁴ total…⁵
## <chr> <chr> <list> <int> <list> <list>
## 1 c1810dbc-16d2-45c3-b45e-3e675f88d87b A cell … <chr> 24 <chr> <int>
## 2 c0518445-3b3b-49c6-b8fc-c41daa4eacba A cellu… <chr> 17 <chr> <int>
## 3 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 A human… <chr> 38 <chr> <int>
## 4 73769e0a-5fcd-41f4-9083-41ae08bfa4c1 A proxi… <chr> 3 <chr> <int>
## 5 957261f7-2bd6-4358-a6ed-24ee080d5cfc A spati… <chr> 13 <chr> <int>
## 6 1dddae6e-3753-48af-b20e-fa22abad125d Cell Ty… <chr> 6 <chr> <int>
## 7 ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 Cells o… <chr> 14 <chr> <int>
## 8 fde199d2-a841-4ed1-aa65-b9e0af8969b1 Cells o… <chr> 12 <chr> <int>
## 9 83f5188e-3bf7-4956-9544-cea4f8997756 Distinc… <chr> 6 <chr> <int>
## 10 2f676143-80c2-4bc6-b7b4-2613fe0fadf0 Integra… <chr> 15 <chr> <int>
## # … with 11 more rows, and abbreviated variable names ¹projectTitle,
## # ²genusSpecies, ³donorCount, ⁴cellSuspensions.organ, ⁵totalCells
Note that the cellSuspensions.organ and totalCells columns are list-columns that can hold more than one entry per project (or, for totalCells, none at all).
projects |>
select(projectId, cellSuspensions.organ, totalCells)
## # A tibble: 21 × 3
## projectId cellSuspensions.organ totalCells
## <chr> <list> <list>
## 1 c1810dbc-16d2-45c3-b45e-3e675f88d87b <chr [2]> <int [2]>
## 2 c0518445-3b3b-49c6-b8fc-c41daa4eacba <chr [2]> <int [2]>
## 3 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 <chr [2]> <int [1]>
## 4 73769e0a-5fcd-41f4-9083-41ae08bfa4c1 <chr [2]> <int [0]>
## 5 957261f7-2bd6-4358-a6ed-24ee080d5cfc <chr [1]> <int [0]>
## 6 1dddae6e-3753-48af-b20e-fa22abad125d <chr [1]> <int [0]>
## 7 ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 <chr [1]> <int [1]>
## 8 fde199d2-a841-4ed1-aa65-b9e0af8969b1 <chr [4]> <int [0]>
## 9 83f5188e-3bf7-4956-9544-cea4f8997756 <chr [2]> <int [2]>
## 10 2f676143-80c2-4bc6-b7b4-2613fe0fadf0 <chr [1]> <int [0]>
## # … with 11 more rows
In this case, the mapping between cellSuspensions.organ and totalCells is clear, but in general more refined navigation of the lol structure may be necessary.
projects |>
select(projectId, cellSuspensions.organ, totalCells) |>
filter(lengths(totalCells) > 0) |>
tidyr::unnest(c("cellSuspensions.organ", "totalCells"))
## # A tibble: 20 × 3
## projectId cellSuspensions.organ totalCells
## <chr> <chr> <int>
## 1 c1810dbc-16d2-45c3-b45e-3e675f88d87b thymus 456000
## 2 c1810dbc-16d2-45c3-b45e-3e675f88d87b colon 16000
## 3 c0518445-3b3b-49c6-b8fc-c41daa4eacba lung 40200
## 4 c0518445-3b3b-49c6-b8fc-c41daa4eacba nose 7087
## 5 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 lung 78901
## 6 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 heart 78901
## 7 ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 heart 791000
## 8 83f5188e-3bf7-4956-9544-cea4f8997756 immune organ 381
## 9 83f5188e-3bf7-4956-9544-cea4f8997756 large intestine 1141
## 10 c4077b3c-5c98-4d26-a614-246d12c2e5d7 esophagus 93267
## 11 c4077b3c-5c98-4d26-a614-246d12c2e5d7 spleen 66553
## 12 c4077b3c-5c98-4d26-a614-246d12c2e5d7 lung 40025
## 13 091cf39b-01bc-42e5-9437-f419a66c8a45 bone marrow 1480000
## 14 b963bd4b-4bc1-4404-8425-69d74bc636b8 blood 800000
## 15 4e6f083b-5b9a-4393-9890-2a83da8188f1 embryo 27083
## 16 4e6f083b-5b9a-4393-9890-2a83da8188f1 endoderm 54166
## 17 4e6f083b-5b9a-4393-9890-2a83da8188f1 presumptive gut 30952
## 18 455b46e6-d8ea-4611-861e-de720a562ada bone marrow 87532
## 19 455b46e6-d8ea-4611-861e-de720a562ada spleen 60043
## 20 455b46e6-d8ea-4611-861e-de720a562ada blood 25016
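Before unnesting two list-columns together, it can help to compare their lengths row by row. The sketch below does this with base R on a small hand-built example. toy_organ and toy_total are hypothetical stand-ins for the cellSuspensions.organ and totalCells columns: the first three entries are copied from the output above, and the last one (whose organ names are placeholders) imitates a project with no cell counts.

```r
# Hand-built stand-ins for the two list-columns.
toy_organ <- list(
    c("thymus", "colon"),
    c("lung", "heart"),
    "heart",
    c("organ A", "organ B")   # placeholder names
)
toy_total <- list(
    c(456000L, 16000L),
    78901L,
    791000L,
    integer(0)
)

n_organ <- lengths(toy_organ)
n_total <- lengths(toy_total)

# Classify how the two columns pair up in each row.
pairing <- ifelse(
    n_total == 0L, "no counts",
    ifelse(
        n_organ == n_total, "one-to-one",
        ifelse(n_total == 1L, "single count shared", "ambiguous")
    )
)
data.frame(n_organ, n_total, pairing)
```

The second entry mirrors project 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 in the output above, where one totalCells value is paired with two organs and is repeated for both after unnesting. Whether that value describes each organ or the project as a whole cannot be told from the tibble alone.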
Select the following entry, augment the filter, and query the available files.
projects |>
filter(startsWith(projectTitle, "Reconstruct")) |>
t()
## [,1]
## projectId "f83165c5-e2ea-4d15-a5cf-33f3550bffde"
## projectTitle "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics"
## genusSpecies "Homo sapiens"
## donorCount 16
## cellSuspensions.organ character,3
## totalCells integer,0
This approach can be used to customize the tibbles returned by the other main functions in the package, files(), samples(), and bundles().
The relevant file can be selected and downloaded using the technique in the first example.
filters <- filters(
projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"),
fileFormat = list(is = "h5ad")
)
files <-
files(filters) |>
head(1) # only first file, for demonstration
files |> t()
## [,1]
## fileId "6d4fedcf-857d-5fbb-9928-8b9605500a69"
## name "vento18_ss2.processed.h5ad"
## fileFormat "h5ad"
## size "82121633"
## version "2021-02-10T16:56:40.419579Z"
## projectTitle "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics"
## projectId "f83165c5-e2ea-4d15-a5cf-33f3550bffde"
## url "https://service.azul.data.humancellatlas.org/repository/files/6d4fedcf-857d-5fbb-9928-8b9605500a69?catalog=dcp20&version=2021-02-10T16%3A56%3A40.419579Z"
file_path <- files_download(files)
h5ad files can be read as SingleCellExperiment objects using the zellkonverter package.
## don't want large amount of data read from disk
sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
sce
project_filter <- filters(fileFormat = list(is = "csv"))
project_tibble <- projects(project_filter)
project_tibble |>
filter(
startsWith(
projectTitle,
"Reconstructing the human first trimester"
)
)
## # A tibble: 1 × 14
## projectId proje…¹ genus…² sampl…³ speci…⁴ speci…⁵ selec…⁶ libra…⁷ nucle…⁸
## <chr> <chr> <list> <list> <list> <list> <list> <list> <list>
## 1 f83165c5-e2ea… Recons… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## # … with 5 more variables: pairedEnd <list>, workflow <list>,
## # specimenDisease <list>, donorDisease <list>, developmentStage <list>, and
## # abbreviated variable names ¹projectTitle, ²genusSpecies, ³sampleEntityType,
## # ⁴specimenOrgan, ⁵specimenOrganPart, ⁶selectedCellType,
## # ⁷libraryConstructionApproach, ⁸nucleicAcidSource
projectId <-
project_tibble |>
filter(
startsWith(
projectTitle,
"Reconstructing the human first trimester"
)
) |>
pull(projectId)
file_filter <- filters(
projectId = list(is = projectId),
fileFormat = list(is = "csv")
)
## first 4 files will be returned
file_tibble <- files(file_filter, size = 4)
file_tibble |>
files_download()
## 7f9a181e-24c5-5462-b308-7fef5b1bda2a-2021-02-10T16:56:40.419579Z
## "/home/biocbuild/.cache/R/hca/372fcc60a648de_372fcc60a648de.csv"
## d04c6e3c-b740-5586-8420-4480a1b5706c-2021-02-10T16:56:40.419579Z
## "/home/biocbuild/.cache/R/hca/372fcc22f276fe_372fcc22f276fe.csv"
## d30ffc0b-7d6e-5b85-aff9-21ec69663a81-2021-02-10T16:56:40.419579Z
## "/home/biocbuild/.cache/R/hca/372fcc2643c2b7_372fcc2643c2b7.csv"
## e1517725-01b0-5346-9788-afca63e9993a-2021-02-10T16:56:40.419579Z
## "/home/biocbuild/.cache/R/hca/372fcc758ae5_372fcc758ae5.csv"
The files(), bundles(), and samples() functions can all return thousands of results. It is necessary to ‘page’ through these to see all of them. We illustrate pagination with projects(), retrieving only 30 projects.
Pagination works for the default tibble output:
page_1_tbl <- projects(size = 30)
page_1_tbl
## # A tibble: 30 × 14
## projectId proje…¹ genus…² sampl…³ speci…⁴ speci…⁵ selec…⁶ libra…⁷ nucle…⁸
## <chr> <chr> <list> <list> <list> <list> <list> <list> <list>
## 1 74b6d569-3b1… 1.3 Mi… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 2 53c53cd4-812… A Cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 3 7027adc6-c9c… A Cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 4 94e4ee09-9b4… A Huma… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 5 60ea42e1-af4… A Prot… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 6 ef1e3497-515… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 7 9ac53858-606… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 8 923d3231-729… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 9 f86f1ab4-1fb… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 10 602628d7-c03… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## # … with 20 more rows, 5 more variables: pairedEnd <lgl>, workflow <list>,
## # specimenDisease <list>, donorDisease <list>, developmentStage <list>, and
## # abbreviated variable names ¹projectTitle, ²genusSpecies, ³sampleEntityType,
## # ⁴specimenOrgan, ⁵specimenOrganPart, ⁶selectedCellType,
## # ⁷libraryConstructionApproach, ⁸nucleicAcidSource
page_2_tbl <- page_1_tbl |> hca_next()
page_2_tbl
## # A tibble: 30 × 14
## projectId proje…¹ genus…² sampl…³ speci…⁴ speci…⁵ selec…⁶ libra…⁷ nucle…⁸
## <chr> <chr> <chr> <list> <list> <list> <list> <list> <list>
## 1 1cd1f41f-f81… A sing… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## 2 e255b1c6-114… A sing… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## 3 8185730f-411… A sing… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## 4 957261f7-2bd… A spat… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## 5 52b29aa4-c8d… A surv… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## 6 f0f89c14-746… AIDA p… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## 7 38449aea-70b… Altere… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## 8 0fd8f918-62d… An Atl… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## 9 8ab8726d-81b… An org… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## 10 005d611a-14d… Assess… Homo s… <chr> <chr> <chr> <chr> <chr> <chr>
## # … with 20 more rows, 5 more variables: pairedEnd <list>, workflow <list>,
## # specimenDisease <list>, donorDisease <list>, developmentStage <list>, and
## # abbreviated variable names ¹projectTitle, ²genusSpecies, ³sampleEntityType,
## # ⁴specimenOrgan, ⁵specimenOrganPart, ⁶selectedCellType,
## # ⁷libraryConstructionApproach, ⁸nucleicAcidSource
## should be identical to page_1_tbl
page_2_tbl |> hca_prev()
## # A tibble: 30 × 14
## projectId proje…¹ genus…² sampl…³ speci…⁴ speci…⁵ selec…⁶ libra…⁷ nucle…⁸
## <chr> <chr> <list> <list> <list> <list> <list> <list> <list>
## 1 74b6d569-3b1… 1.3 Mi… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 2 53c53cd4-812… A Cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 3 7027adc6-c9c… A Cell… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 4 94e4ee09-9b4… A Huma… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 5 60ea42e1-af4… A Prot… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 6 ef1e3497-515… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 7 9ac53858-606… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 8 923d3231-729… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 9 f86f1ab4-1fb… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 10 602628d7-c03… A Sing… <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## # … with 20 more rows, 5 more variables: pairedEnd <lgl>, workflow <list>,
## # specimenDisease <list>, donorDisease <list>, developmentStage <list>, and
## # abbreviated variable names ¹projectTitle, ²genusSpecies, ³sampleEntityType,
## # ⁴specimenOrgan, ⁵specimenOrganPart, ⁶selectedCellType,
## # ⁷libraryConstructionApproach, ⁸nucleicAcidSource
Pagination also works for the lol objects:
page_1_lol <- projects(size = 5, as = "lol")
page_1_lol |>
lol_pull("hits[*].projects[*].projectTitle")
## [1] "1.3 Million Brain Cells from E18 Mice"
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"
## [3] "A Cellular Atlas of Pitx2-Dependent Cardiac Development."
## [4] "A Human Liver Cell Atlas reveals Heterogeneity and Epithelial Progenitors"
## [5] "A Protocol for Revealing Oral Neutrophil Heterogeneity by Single-Cell Immune Profiling in Human Saliva"
page_2_lol <-
page_1_lol |>
hca_next()
page_2_lol |>
lol_pull("hits[*].projects[*].projectTitle")
## [1] "A Single-Cell Atlas of the Human Healthy Airways."
## [2] "A Single-Cell Characterization of Human Post-implantation Embryos Cultured In Vitro Delineates Morphogenesis in Primary Syncytialization"
## [3] "A Single-Cell Transcriptomic Atlas of Human Skin Aging."
## [4] "A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure"
## [5] "A Single-cell Transcriptomic Atlas of Human Intervertebral Disc"
## should be identical to page_1_lol
page_2_lol |>
hca_prev() |>
lol_pull("hits[*].projects[*].projectTitle")
## [1] "1.3 Million Brain Cells from E18 Mice"
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"
## [3] "A Cellular Atlas of Pitx2-Dependent Cardiac Development."
## [4] "A Human Liver Cell Atlas reveals Heterogeneity and Epithelial Progenitors"
## [5] "A Protocol for Revealing Oral Neutrophil Heterogeneity by Single-Cell Immune Profiling in Human Saliva"
Much like projects() and files(), samples() and bundles() allow you to provide a filter object and additional criteria to retrieve data in the form of samples and bundles, respectively.
heart_filters <- filters(organ = list(is = "heart"))
heart_samples <- samples(filters = heart_filters, size = 4)
heart_samples
## # A tibble: 4 × 6
## entryId projectTitle genus…¹ disease format count
## <chr> <chr> <chr> <chr> <list> <lis>
## 1 012c52ff-4770-4c0c-8c2e-c348da72cfc3 A Cellular … Mus mu… normal <chr> <int>
## 2 035db5b9-a219-4df8-bfc9-117cd07c45ba A Cellular … Mus mu… normal <chr> <int>
## 3 09e425f7-22d7-487e-b78b-78b449578181 A Cellular … Mus mu… normal <chr> <int>
## 4 2273e44d-9fbc-4c13-8cb3-3caf8a9ac1f7 A Cellular … Mus mu… normal <chr> <int>
## # … with abbreviated variable name ¹genusSpecies
heart_bundles <- bundles(filters = heart_filters, size = 4)
heart_bundles
## # A tibble: 4 × 6
## projectTitle genus…¹ samples files bundl…² bundl…³
## <chr> <chr> <list> <lis> <chr> <chr>
## 1 A Cellular Atlas of Pitx2-Dependent Car… Mus mu… <chr> <chr> 0d391b… 2021-0…
## 2 A Cellular Atlas of Pitx2-Dependent Car… Mus mu… <chr> <chr> 165a2d… 2021-0…
## 3 A Cellular Atlas of Pitx2-Dependent Car… Mus mu… <chr> <chr> 166c1b… 2020-0…
## 4 A Cellular Atlas of Pitx2-Dependent Car… Mus mu… <chr> <chr> 18bad6… 2021-0…
## # … with abbreviated variable names ¹genusSpecies, ²bundleUuid, ³bundleVersion
HCA experiments are organized into catalogs, each of which can be summarized with the summary() function.
heart_filters <- filters(organ = list(is = "heart"))
summary(filters = heart_filters, type = "fileTypeSummaries")
## # A tibble: 17 × 3
## format count totalSize
## <chr> <int> <dbl>
## 1 fastq.gz 28515 1.36e13
## 2 h5 180 1.75e10
## 3 loom 169 3.56e11
## 4 bam 164 3.28e12
## 5 zip 148 9.46e 9
## 6 csv 88 6.66e 7
## 7 tsv.gz 84 6.07e10
## 8 rds 28 1.23e10
## 9 mtx.gz 24 4.80e 8
## 10 bed.gz 6 3.33e 9
## 11 h5ad 5 7.17e 9
## 12 .txt 4 8.02e 7
## 13 RDS 3 3.01e 8
## 14 Rds 3 2.64e 9
## 15 Robj.gz 3 2.36e10
## 16 csv.gz 1 4.87e 8
## 17 h5ad.zip 1 1.55e10
first_catalog <- catalogs()[1]
summary(type = "overview", catalog = first_catalog)
## # A tibble: 7 × 2
## name value
## <chr> <dbl>
## 1 projectCount 2.88e 2
## 2 specimenCount 1.33e 4
## 3 speciesCount 2 e 0
## 4 fileCount 4.81e 5
## 5 totalFileSize 2.01e14
## 6 donorCount 4.50e 3
## 7 labCount 4.68e 2
Each project, file, sample, and bundle has its own unique ID which, in conjunction with its catalog, can be used to uniquely identify it.
heart_filters <- filters(organ = list(is = "heart"))
heart_projects <- projects(filters = heart_filters, size = 4)
heart_projects
## # A tibble: 4 × 14
## projectId proje…¹ genus…² sampl…³ speci…⁴ speci…⁵ selec…⁶ libra…⁷ nucle…⁸
## <chr> <chr> <chr> <list> <list> <list> <lgl> <list> <list>
## 1 7027adc6-c9c9… A Cell… Mus mu… <chr> <chr> <chr> NA <chr> <chr>
## 2 a9301beb-e9fa… A huma… Homo s… <chr> <chr> <chr> NA <chr> <chr>
## 3 2fe3c60b-ac1a… A huma… Homo s… <chr> <chr> <chr> NA <chr> <chr>
## 4 c31fa434-c9ed… A sing… Homo s… <chr> <chr> <chr> NA <chr> <chr>
## # … with 5 more variables: pairedEnd <lgl>, workflow <list>,
## # specimenDisease <chr>, donorDisease <chr>, developmentStage <list>, and
## # abbreviated variable names ¹projectTitle, ²genusSpecies, ³sampleEntityType,
## # ⁴specimenOrgan, ⁵specimenOrganPart, ⁶selectedCellType,
## # ⁷libraryConstructionApproach, ⁸nucleicAcidSource
projectId <-
heart_projects |>
filter(
startsWith(
projectTitle,
"Cells of the adult human"
)
) |>
dplyr::pull(projectId)
result <- projects_detail(uuid = projectId)
The result is a list containing three elements: information for navigating to the next or previous project, alphabetical by default (pagination); the filters available (termFacets); and details of the project (hits).
names(result)
## [1] "pagination" "termFacets" "hits"
As mentioned above, the hits are a complicated list-of-lists structure. A very convenient way to explore this structure visually is with listviewer::jsonedit(result). Selecting individual elements is possible using the lol interface; an alternative is cellxgenedp::jmespath().
lol(result)
## # class: lol
## # number of distinct paths: 710
## # total number of elements: 28719
## # number of leaf paths: 427
## # number of leaf elements: 19068
## # lol_path():
## # A tibble: 710 × 3
## path n is_leaf
## <chr> <int> <lgl>
## 1 hits 1 FALSE
## 2 hits[*] 10 FALSE
## 3 hits[*].cellLines 10 FALSE
## 4 hits[*].cellSuspensions 10 FALSE
## 5 hits[*].cellSuspensions[*] 14 FALSE
## 6 hits[*].cellSuspensions[*].organ 14 FALSE
## 7 hits[*].cellSuspensions[*].organPart 14 FALSE
## 8 hits[*].cellSuspensions[*].organPart[*] 16 TRUE
## 9 hits[*].cellSuspensions[*].organ[*] 14 TRUE
## 10 hits[*].cellSuspensions[*].selectedCellType 14 FALSE
## # … with 700 more rows
See the accompanying “Human Cell Atlas Manifests” vignette for details pertaining to the use of the manifest endpoint and further annotation of .loom files.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] httr_1.4.4 hca_1.6.0
## [3] LoomExperiment_1.16.0 BiocIO_1.8.0
## [5] rhdf5_2.42.0 SingleCellExperiment_1.20.0
## [7] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [9] GenomicRanges_1.50.0 GenomeInfoDb_1.34.0
## [11] IRanges_2.32.0 S4Vectors_0.36.0
## [13] BiocGenerics_0.44.0 MatrixGenerics_1.10.0
## [15] matrixStats_0.62.0 dplyr_1.0.10
## [17] BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] tidyr_1.2.1 sass_0.4.2 vroom_1.6.0
## [4] bit64_4.0.5 jsonlite_1.8.3 bslib_0.4.0
## [7] shiny_1.7.3 assertthat_0.2.1 BiocManager_1.30.19
## [10] BiocFileCache_2.6.0 blob_1.2.3 GenomeInfoDbData_1.2.9
## [13] yaml_2.3.6 pillar_1.8.1 RSQLite_2.2.18
## [16] lattice_0.20-45 glue_1.6.2 digest_0.6.30
## [19] promises_1.2.0.1 XVector_0.38.0 htmltools_0.5.3
## [22] httpuv_1.6.6 Matrix_1.5-1 pkgconfig_2.0.3
## [25] bookdown_0.29 zlibbioc_1.44.0 purrr_0.3.5
## [28] xtable_1.8-4 archive_1.1.5 HDF5Array_1.26.0
## [31] later_1.3.0 tzdb_0.3.0 tibble_3.1.8
## [34] generics_0.1.3 ellipsis_0.3.2 DT_0.26
## [37] withr_2.5.0 cachem_1.0.6 cli_3.4.1
## [40] crayon_1.5.2 mime_0.12 magrittr_2.0.3
## [43] memoise_2.0.1 evaluate_0.17 fansi_1.0.3
## [46] tools_4.2.1 hms_1.1.2 lifecycle_1.0.3
## [49] stringr_1.4.1 Rhdf5lib_1.20.0 DelayedArray_0.24.0
## [52] compiler_4.2.1 jquerylib_0.1.4 rlang_1.0.6
## [55] grid_4.2.1 RCurl_1.98-1.9 rhdf5filters_1.10.0
## [58] rappdirs_0.3.3 htmlwidgets_1.5.4 miniUI_0.1.1.1
## [61] bitops_1.0-7 rmarkdown_2.17 DBI_1.1.3
## [64] curl_4.3.3 R6_2.5.1 knitr_1.40
## [67] fastmap_1.1.0 bit_4.0.4 utf8_1.2.2
## [70] filelock_1.0.2 readr_2.1.3 stringi_1.7.8
## [73] parallel_4.2.1 Rcpp_1.0.9 vctrs_0.5.0
## [76] dbplyr_2.2.1 tidyselect_1.2.0 xfun_0.34