1 Motivation & Introduction

This package makes it easy to query the Human Cell Atlas (HCA) Data Portal through its data browser API. Visit the Human Cell Atlas website for more information on the project.

1.1 Installation and getting started

Evaluate the following code chunk to install packages required for this vignette.

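## install from Bioconductor if you haven't already
## (BiocManager itself comes from CRAN)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
pkgs <- c("httr", "dplyr", "LoomExperiment", "hca")
pkgs_needed <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(pkgs_needed)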

Load the packages into your R session.

library(httr)
library(dplyr)
library(LoomExperiment)
library(hca)

2 Example: Discover and download a ‘loom’ file

To illustrate use of this package, consider the task of downloading a ‘loom’ file summarizing single-cell gene expression observed in an HCA research project. This could be accomplished by visiting the HCA data portal (at https://data.humancellatlas.org/explore) in a web browser and selecting projects interactively, but it is valuable to accomplish the same goal in a reproducible, flexible, programmatic way. We will (1) discover projects available in the HCA Data Coordinating Center that have loom files; and (2) retrieve the file from the HCA and import the data into R as a ‘LoomExperiment’ object. For illustration, we focus on the ‘Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns’ project.

2.1 Discover projects with loom files

Use projects() to retrieve all projects in the HCA’s default catalog.

projects()
## # A tibble: 261 × 5
##    projectId                            projectTitle     genus…¹ sampl…² speci…³
##    <chr>                                <chr>            <list>  <list>  <list> 
##  1 74b6d569-3b11-42ef-b6b1-a0454522b4a0 1.3 Million Bra… <chr>   <chr>   <chr>  
##  2 53c53cd4-8127-4e12-bc7f-8fe1610a715c A Cellular Anat… <chr>   <chr>   <chr>  
##  3 7027adc6-c9c9-46f3-84ee-9badc3a4f53b A Cellular Atla… <chr>   <chr>   <chr>  
##  4 94e4ee09-9b4b-410a-84dc-a751ad36d0df A Human Liver C… <chr>   <chr>   <chr>  
##  5 60ea42e1-af49-42f5-8164-d641fdb696bc A Protocol for … <chr>   <chr>   <chr>  
##  6 ef1e3497-515e-4bbe-8d4c-10161854b699 A Single-Cell A… <chr>   <chr>   <chr>  
##  7 f86f1ab4-1fbb-4510-ae35-3ffd752d4dfc A Single-Cell T… <chr>   <chr>   <chr>  
##  8 602628d7-c038-48a8-aa97-ffbb2cb44c9d A Single-cell T… <chr>   <chr>   <chr>  
##  9 c1810dbc-16d2-45c3-b45e-3e675f88d87b A cell atlas of… <chr>   <chr>   <chr>  
## 10 c0518445-3b3b-49c6-b8fc-c41daa4eacba A cellular cens… <chr>   <chr>   <chr>  
## # … with 251 more rows, and abbreviated variable names ¹​genusSpecies,
## #   ²​samples.organ, ³​specimens.organ
## # ℹ Use `print(n = ...)` to see more rows

Use filters() to restrict the projects to just those that contain at least one ‘loom’ file.

project_filter <- filters(fileFormat = list(is = "loom"))
project_tibble <- projects(project_filter)
project_tibble
## # A tibble: 79 × 5
##    projectId                            projectTitle     genus…¹ sampl…² speci…³
##    <chr>                                <chr>            <list>  <list>  <list> 
##  1 53c53cd4-8127-4e12-bc7f-8fe1610a715c A Cellular Anat… <chr>   <chr>   <chr>  
##  2 7027adc6-c9c9-46f3-84ee-9badc3a4f53b A Cellular Atla… <chr>   <chr>   <chr>  
##  3 c1810dbc-16d2-45c3-b45e-3e675f88d87b A cell atlas of… <chr>   <chr>   <chr>  
##  4 a9301beb-e9fa-42fe-b75c-84e8a460c733 A human cell at… <chr>   <chr>   <chr>  
##  5 996120f9-e84f-409f-a01e-732ab58ca8b9 A human single … <chr>   <chr>   <chr>  
##  6 842605c7-375a-47c5-9e2c-a71c2c00fcad A single cell a… <chr>   <chr>   <chr>  
##  7 a004b150-1c36-4af6-9bbd-070c06dbc17d A single-cell a… <chr>   <chr>   <chr>  
##  8 4a95101c-9ffc-4f30-a809-f04518a23803 A single-cell r… <chr>   <chr>   <chr>  
##  9 1cd1f41f-f81a-486b-a05b-66ec60f81dcf A single-cell s… <chr>   <chr>   <chr>  
## 10 8185730f-4113-40d3-9cc3-929271784c2b A single-cell t… <chr>   <chr>   <chr>  
## # … with 69 more rows, and abbreviated variable names ¹​genusSpecies,
## #   ²​samples.organ, ³​specimens.organ
## # ℹ Use `print(n = ...)` to see more rows

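Use standard dplyr verbs to narrow the projects to those whose title starts with “Single”, and extract the projectId of the first match. Several titles can share this prefix, so the first match depends on the catalog contents; in the output shown here it is ‘Single cell RNA sequencing of multiple myeloma II’.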

project_tibble %>%
    filter(startsWith(projectTitle, "Single")) %>%
    head(1) %>%
    t()
##                 [,1]                                               
## projectId       "0c3b7785-f74d-4091-8616-a68757e4c2a8"             
## projectTitle    "Single cell RNA sequencing of multiple myeloma II"
## genusSpecies    "Homo sapiens"                                     
## samples.organ   "hematopoietic system"                             
## specimens.organ "hematopoietic system"

projectIds <-
    project_tibble %>%
    filter(startsWith(projectTitle, "Single")) %>%
    dplyr::pull(projectId)

projectId <- projectIds[1]
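
Because a short prefix such as “Single” can match several projects, it may be worth checking how many titles match before taking the first one. The following is a minimal sketch on a toy data frame, not a query against the HCA: toy_projects, its projectId values, and the shortened titles are made up for illustration.

```r
# Toy stand-in for project_tibble; ids and (shortened) titles are illustrative
toy_projects <- data.frame(
    projectId = c("id-1", "id-2", "id-3"),
    projectTitle = c(
        "Single cell RNA sequencing of multiple myeloma II",
        "Single cell transcriptome analysis of human pancreas",
        "A Cellular Anatomy of the Normal Adult Human Prostate"
    ),
    stringsAsFactors = FALSE
)

# A short prefix matches more than one project ...
short_prefix <- startsWith(toy_projects$projectTitle, "Single")
sum(short_prefix)

# ... whereas a longer prefix identifies exactly one
long_prefix <- startsWith(
    toy_projects$projectTitle,
    "Single cell transcriptome analysis"
)
stopifnot(sum(long_prefix) == 1L)
pancreas_id <- toy_projects$projectId[long_prefix]
pancreas_id
```

The same check applies to project_tibble: a stopifnot() on the number of matching rows guards against silently picking an unintended project.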

2.2 Discover and download the loom file of interest

files() retrieves (the first 1000) files from the Human Cell Atlas data portal. Construct a filter to restrict the files to loom files from the project we are interested in.

file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "loom")
)

# only the two smallest files
file_tibble <- files(file_filter, size = 2, sort = "fileSize", order = "asc")

file_tibble
## # A tibble: 2 × 8
##   fileId                      name  fileF…¹   size version proje…² proje…³ url  
##   <chr>                       <chr> <chr>    <int> <chr>   <chr>   <chr>   <chr>
## 1 fe214fea-cc68-56d0-a6d2-fe… bone… loom    4.04e7 2021-0… Single… 0c3b77… http…
## 2 3014ec47-1399-57ca-ab74-c8… Bone… loom    6.35e7 2021-1… Single… 0c3b77… http…
## # … with abbreviated variable names ¹​fileFormat, ²​projectTitle, ³​projectId

files_download() downloads one or more files (one for each row) in file_tibble. The download involves more than following the url column of file_tibble, so simply copying the url into a browser does not work. We’ll download the files and then immediately import the first one into R.

file_locations <- file_tibble %>% files_download()

LoomExperiment::import(unname(file_locations[1]),
                       type = "SingleCellLoomExperiment")
## class: SingleCellLoomExperiment 
## dim: 58347 3762 
## metadata(15): last_modified CreationDate ...
##   project.provenance.document_id specimen_from_organism.organ
## assays(1): matrix
## rownames: NULL
## rowData names(29): Gene antisense_reads ... reads_per_molecule
##   spliced_reads
## colnames: NULL
## colData names(43): CellID antisense_reads ... reads_unmapped
##   spliced_reads
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowGraphs(0): NULL
## colGraphs(0): NULL

Note that files_download() uses BiocFileCache (https://bioconductor.org/packages/BiocFileCache), so individual files are only downloaded once.

3 Example: Illustrating access to h5ad files

This example walks through the process of file discovery and retrieval in a little more detail, using h5ad files created by the Python AnnData analysis software and available for some experiments in the default catalog.

3.1 Projects facets and terms

The first challenge is to understand what file formats are available from the HCA. Obtain a tibble describing the ‘facets’ of the data, the number of terms used in each facet, and the number of distinct values used to describe projects.

projects_facets()
## # A tibble: 35 × 3
##    facet              n_terms n_values
##    <chr>                <int>    <int>
##  1 assayType                1      261
##  2 biologicalSex            5      430
##  3 cellLineType             6      273
##  4 contactName           2555     3187
##  5 contentDescription      46      855
##  6 developmentStage       139      547
##  7 donorDisease           229      548
##  8 effectiveOrgan         118      459
##  9 fileFormat              56      844
## 10 fileSource              10      575
## # … with 25 more rows
## # ℹ Use `print(n = ...)` to see more rows

Note the fileFormat facet, and call projects_facets() again with the facet name to see the available file formats.

projects_facets("fileFormat")
## # A tibble: 56 × 3
##    facet      term     count
##    <chr>      <chr>    <int>
##  1 fileFormat fastq.gz   218
##  2 fileFormat loom        79
##  3 fileFormat bam         77
##  4 fileFormat tar         55
##  5 fileFormat txt.gz      44
##  6 fileFormat tsv.gz      43
##  7 fileFormat mtx.gz      37
##  8 fileFormat csv         34
##  9 fileFormat csv.gz      32
## 10 fileFormat bai         26
## # … with 46 more rows
## # ℹ Use `print(n = ...)` to see more rows
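
Only the first ten of the 56 terms are printed, so a term of interest may not be visible. One way to look up a single term is to subset on the term column. The sketch below uses a hand-copied subset of the table above (facet_tbl is a hypothetical name, not an object created by hca); with the real result, something like dplyr::filter(projects_facets("fileFormat"), term == "h5ad") would presumably do the same.

```r
# Hand-copied subset of the fileFormat table shown above (toy data frame)
facet_tbl <- data.frame(
    facet = "fileFormat",
    term = c("fastq.gz", "loom", "bam", "tar", "txt.gz"),
    count = c(218L, 79L, 77L, 55L, 44L),
    stringsAsFactors = FALSE
)

# Look up one term by name rather than scanning the printed rows
loom_row <- facet_tbl[facet_tbl$term == "loom", ]
loom_row$count
```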

The h5ad file format is among these terms. Use it as a filter to discover relevant projects; the query below returns 18 of them.

filters <- filters(fileFormat = list(is = "h5ad"))
projects(filters)
## # A tibble: 18 × 5
##    projectId                            projectTitle     genus…¹ sampl…² speci…³
##    <chr>                                <chr>            <list>  <list>  <list> 
##  1 c1810dbc-16d2-45c3-b45e-3e675f88d87b A cell atlas of… <chr>   <chr>   <chr>  
##  2 c0518445-3b3b-49c6-b8fc-c41daa4eacba A cellular cens… <chr>   <chr>   <chr>  
##  3 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 A human fetal l… <chr>   <chr>   <chr>  
##  4 73769e0a-5fcd-41f4-9083-41ae08bfa4c1 A proximal-to-d… <chr>   <chr>   <chr>  
##  5 957261f7-2bd6-4358-a6ed-24ee080d5cfc A spatial multi… <chr>   <chr>   <chr>  
##  6 1dddae6e-3753-48af-b20e-fa22abad125d Cell Types of t… <chr>   <chr>   <chr>  
##  7 ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 Cells of the ad… <chr>   <chr>   <chr>  
##  8 fde199d2-a841-4ed1-aa65-b9e0af8969b1 Cells of the hu… <chr>   <chr>   <chr>  
##  9 83f5188e-3bf7-4956-9544-cea4f8997756 Distinct microb… <chr>   <chr>   <chr>  
## 10 2f676143-80c2-4bc6-b7b4-2613fe0fadf0 Integrative ana… <chr>   <chr>   <chr>  
## 11 c4077b3c-5c98-4d26-a614-246d12c2e5d7 Ischaemic sensi… <chr>   <chr>   <chr>  
## 12 091cf39b-01bc-42e5-9437-f419a66c8a45 Profiling of CD… <chr>   <chr>   <chr>  
## 13 f83165c5-e2ea-4d15-a5cf-33f3550bffde Reconstructing … <chr>   <chr>   <chr>  
## 14 31887183-a72c-4308-9eac-c6140313f39c Single-nucleus … <chr>   <chr>   <chr>  
## 15 abe1a013-af7a-45ed-8c26-f3793c24a1f4 Spatio-temporal… <chr>   <chr>   <chr>  
## 16 b963bd4b-4bc1-4404-8425-69d74bc636b8 The cellular im… <chr>   <chr>   <chr>  
## 17 4e6f083b-5b9a-4393-9890-2a83da8188f1 The emergent la… <chr>   <chr>   <chr>  
## 18 455b46e6-d8ea-4611-861e-de720a562ada Transcriptomic … <chr>   <chr>   <chr>  
## # … with abbreviated variable names ¹​genusSpecies, ²​samples.organ,
## #   ³​specimens.organ

3.2 Projects columns

The default tibble produced by projects() contains only some of the information available; the information is much richer.

To obtain a tibble with an expanded set of columns, set the as parameter to "tibble_expanded".

# an expanded set of columns, here for the first 4 projects only
projects(as = 'tibble_expanded', size = 4)
## # A tibble: 4 × 124
##   projectId      cellS…¹ cellS…² cellS…³ cellS…⁴ dates…⁵ dates…⁶ dates…⁷ dates…⁸
##   <chr>          <list>  <chr>   <list>    <int> <chr>   <chr>   <chr>   <chr>  
## 1 74b6d569-3b11… <chr>   brain   <chr>   1330000 2021-1… 2019-0… 2021-1… 2021-1…
## 2 53c53cd4-8127… <chr>   prosta… <chr>    108701 2021-0… 2021-0… 2021-0… 2021-0…
## 3 7027adc6-c9c9… <chr>   heart   <chr>        NA 2021-0… 2020-0… 2021-0… 2021-0…
## 4 94e4ee09-9b4b… <chr>   liver   <chr>        NA 2022-0… 2022-0… 2022-0… 2022-0…
## # … with 115 more variables: dates.submissionDate <chr>,
## #   dates.updateDate <chr>, donorOrganisms.biologicalSex <chr>,
## #   donorOrganisms.developmentStage <list>, donorOrganisms.disease <list>,
## #   donorOrganisms.donorCount <int>, donorOrganisms.genusSpecies <chr>,
## #   donorOrganisms.id <list>, donorOrganisms.organismAgeRange <list>,
## #   donorOrganisms.organismAgeRange.gte <list>,
## #   donorOrganisms.organismAgeRange.lte <list>, …
## # ℹ Use `colnames()` to see all variable names

In the next sections, we’ll cover other options for the as parameter, and the data formats they return.

3.2.1 projects() as an R list

Instead of retrieving the result of projects() as a tibble, retrieve it as a ‘list-of-lists’.

projects_list <- projects(as = "list")

This is a complicated structure. We will use lengths(), names(), and standard R list selection operations to navigate this a bit. At the top level there are three elements.

lengths(projects_list)
## pagination termFacets       hits 
##          8         36        261

hits represents each project as a list, e.g.,

lengths(projects_list$hits[[1]])
##         protocols           entryId           sources          projects 
##                 2                 1                 1                 1 
##           samples         specimens         cellLines    donorOrganisms 
##                 1                 1                 0                 1 
##         organoids   cellSuspensions             dates fileTypeSummaries 
##                 0                 1                 1                 2

shows that there are 12 different ways in which the first project is described. Each component is itself a list-of-lists, e.g.,

lengths(projects_list$hits[[1]]$projects[[1]])
##           projectId        projectTitle    projectShortname          laboratory 
##                   1                   1                   1                   1 
##  estimatedCellCount  projectDescription        contributors        publications 
##                   1                   1                   6                   1 
##  supplementaryLinks            matrices contributedAnalyses          accessions 
##                   1                   0                   1                   3 
##          accessible 
##                   1
projects_list$hits[[1]]$projects[[1]]$projectTitle
## [1] "1.3 Million Brain Cells from E18 Mice"

One can use standard R commands to navigate this data structure, and to, e.g., extract the projectTitle of each project.
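
As a minimal sketch of such navigation, the code below builds a small list with the same hits / projects / projectTitle nesting and extracts every title with vapply(). toy_list and its entryId values are made up for illustration; the same vapply() call should work on projects_list$hits, assuming every hit has at least one entry under projects.

```r
# Toy list-of-lists mimicking the nesting of projects(as = "list")
toy_list <- list(
    hits = list(
        list(
            entryId = "id-1",
            projects = list(
                list(projectTitle = "1.3 Million Brain Cells from E18 Mice")
            )
        ),
        list(
            entryId = "id-2",
            projects = list(
                list(projectTitle = "A Human Liver Cell Atlas (toy title)")
            )
        )
    )
)

# One title per hit; vapply() enforces a length-1 character result
toy_titles <- vapply(
    toy_list$hits,
    function(hit) hit$projects[[1]]$projectTitle,
    character(1)
)
toy_titles
```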

3.2.2 projects() as an lol

Use as = "lol" to create a more convenient way to select, filter and extract elements from the list-of-lists returned by projects().

lol <- projects(as = "lol")
lol
## # class: lol_hca lol
## # number of distinct paths: 17471
## # total number of elements: 175618
## # number of leaf paths: 13279
## # number of leaf elements: 136224
## # lol_path():
## # A tibble: 17,471 × 3
##    path                                     n is_leaf
##    <chr>                                <int> <lgl>  
##  1 hits                                     1 FALSE  
##  2 hits[*]                                261 FALSE  
##  3 hits[*].cellLines                      261 FALSE  
##  4 hits[*].cellLines[*]                    43 FALSE  
##  5 hits[*].cellLines[*].cellLineType       43 FALSE  
##  6 hits[*].cellLines[*].cellLineType[*]    55 TRUE   
##  7 hits[*].cellLines[*].id                 43 FALSE  
##  8 hits[*].cellLines[*].id[*]             459 TRUE   
##  9 hits[*].cellLines[*].modelOrgan         43 FALSE  
## 10 hits[*].cellLines[*].modelOrgan[*]      57 TRUE   
## # … with 17,461 more rows
## # ℹ Use `print(n = ...)` to see more rows

Use lol_select() to restrict the lol to particular paths, and lol_filter() to filter results to paths that are leaves, or that have a specific number of entries.

lol_select(lol, "hits[*].projects[*]")
## # class: lol_hca lol
## # number of distinct paths: 17347
## # total number of elements: 113014
## # number of leaf paths: 13220
## # number of leaf elements: 91944
## # lol_path():
## # A tibble: 17,347 × 3
##    path                                                                n is_leaf
##    <chr>                                                           <int> <lgl>  
##  1 hits[*].projects[*]                                               261 FALSE  
##  2 hits[*].projects[*].accessible                                    261 TRUE   
##  3 hits[*].projects[*].accessions                                    261 FALSE  
##  4 hits[*].projects[*].accessions[*]                                 722 FALSE  
##  5 hits[*].projects[*].accessions[*].accession                       722 TRUE   
##  6 hits[*].projects[*].accessions[*].namespace                       722 TRUE   
##  7 hits[*].projects[*].contributedAnalyses                           261 FALSE  
##  8 hits[*].projects[*].contributedAnalyses.developmentStage            4 FALSE  
##  9 hits[*].projects[*].contributedAnalyses.developmentStage.adult      4 FALSE  
## 10 hits[*].projects[*].contributedAnalyses.developmentStage.adult…     2 FALSE  
## # … with 17,337 more rows
## # ℹ Use `print(n = ...)` to see more rows
lol_select(lol, "hits[*].projects[*]") |>
    lol_filter(n == 44, is_leaf)
## # class: lol_hca lol
## # number of distinct paths: 12
## # total number of elements: 528
## # number of leaf paths: 12
## # number of leaf elements: 528
## # lol_path():
## # A tibble: 12 × 3
##    path                                                                n is_leaf
##    <chr>                                                           <int> <lgl>  
##  1 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
##  2 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
##  3 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
##  4 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
##  5 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
##  6 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
##  7 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
##  8 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
##  9 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
## 10 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
## 11 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE   
## 12 hits[*].projects[*].contributedAnalyses.organ.liver.genusSpeci…    44 TRUE

lol_pull() extracts a path from the lol as a vector; lol_lpull() extracts paths as lists.

titles <- lol_pull(lol, "hits[*].projects[*].projectTitle")
length(titles)
## [1] 261
head(titles, 2)
## [1] "1.3 Million Brain Cells from E18 Mice"                                      
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"

3.2.3 Creating projects() tibbles with specific columns

The path or its abbreviation can be used to specify the columns of the tibble to be returned by the projects() query.

Here we retrieve additional details of donor count and total cells by adding appropriate path abbreviations to a named character vector. Names on the character vector can be used to rename the path more concisely, but the paths must uniquely identify elements in the list-of-lists.

columns <- c(
    projectId = "hits[*].entryId",
    projectTitle = "hits[*].projects[*].projectTitle",
    genusSpecies = "hits[*].donorOrganisms[*].genusSpecies[*]",
    donorCount = "hits[*].donorOrganisms[*].donorCount",
    cellSuspensions.organ = "hits[*].cellSuspensions[*].organ[*]",
    totalCells = "hits[*].cellSuspensions[*].totalCells"
)
projects <- projects(filters, columns = columns)
projects
## # A tibble: 18 × 6
##    projectId                            projec…¹ genus…² donor…³ cellS…⁴ total…⁵
##    <chr>                                <chr>    <list>    <int> <list>  <list> 
##  1 c1810dbc-16d2-45c3-b45e-3e675f88d87b A cell … <chr>        24 <chr>   <int>  
##  2 c0518445-3b3b-49c6-b8fc-c41daa4eacba A cellu… <chr>        17 <chr>   <int>  
##  3 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 A human… <chr>        38 <chr>   <int>  
##  4 73769e0a-5fcd-41f4-9083-41ae08bfa4c1 A proxi… <chr>         3 <chr>   <int>  
##  5 957261f7-2bd6-4358-a6ed-24ee080d5cfc A spati… <chr>        13 <chr>   <int>  
##  6 1dddae6e-3753-48af-b20e-fa22abad125d Cell Ty… <chr>         6 <chr>   <int>  
##  7 ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 Cells o… <chr>        14 <chr>   <int>  
##  8 fde199d2-a841-4ed1-aa65-b9e0af8969b1 Cells o… <chr>        12 <chr>   <int>  
##  9 83f5188e-3bf7-4956-9544-cea4f8997756 Distinc… <chr>         6 <chr>   <int>  
## 10 2f676143-80c2-4bc6-b7b4-2613fe0fadf0 Integra… <chr>        15 <chr>   <int>  
## 11 c4077b3c-5c98-4d26-a614-246d12c2e5d7 Ischaem… <chr>        14 <chr>   <int>  
## 12 091cf39b-01bc-42e5-9437-f419a66c8a45 Profili… <chr>         3 <chr>   <int>  
## 13 f83165c5-e2ea-4d15-a5cf-33f3550bffde Reconst… <chr>        16 <chr>   <int>  
## 14 31887183-a72c-4308-9eac-c6140313f39c Single-… <chr>        16 <chr>   <int>  
## 15 abe1a013-af7a-45ed-8c26-f3793c24a1f4 Spatio-… <chr>         8 <chr>   <int>  
## 16 b963bd4b-4bc1-4404-8425-69d74bc636b8 The cel… <chr>       120 <chr>   <int>  
## 17 4e6f083b-5b9a-4393-9890-2a83da8188f1 The eme… <chr>         6 <chr>   <int>  
## 18 455b46e6-d8ea-4611-861e-de720a562ada Transcr… <chr>         4 <chr>   <int>  
## # … with abbreviated variable names ¹​projectTitle, ²​genusSpecies, ³​donorCount,
## #   ⁴​cellSuspensions.organ, ⁵​totalCells

Note that the cellSuspensions.organ and totalCells columns have more than one entry per project.

projects |>
   select(projectId, cellSuspensions.organ, totalCells)
## # A tibble: 18 × 3
##    projectId                            cellSuspensions.organ totalCells
##    <chr>                                <list>                <list>    
##  1 c1810dbc-16d2-45c3-b45e-3e675f88d87b <chr [2]>             <int [2]> 
##  2 c0518445-3b3b-49c6-b8fc-c41daa4eacba <chr [2]>             <int [2]> 
##  3 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 <chr [2]>             <int [1]> 
##  4 73769e0a-5fcd-41f4-9083-41ae08bfa4c1 <chr [2]>             <int [0]> 
##  5 957261f7-2bd6-4358-a6ed-24ee080d5cfc <chr [1]>             <int [0]> 
##  6 1dddae6e-3753-48af-b20e-fa22abad125d <chr [1]>             <int [0]> 
##  7 ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 <chr [1]>             <int [1]> 
##  8 fde199d2-a841-4ed1-aa65-b9e0af8969b1 <chr [4]>             <int [0]> 
##  9 83f5188e-3bf7-4956-9544-cea4f8997756 <chr [2]>             <int [2]> 
## 10 2f676143-80c2-4bc6-b7b4-2613fe0fadf0 <chr [1]>             <int [0]> 
## 11 c4077b3c-5c98-4d26-a614-246d12c2e5d7 <chr [3]>             <int [3]> 
## 12 091cf39b-01bc-42e5-9437-f419a66c8a45 <chr [1]>             <int [1]> 
## 13 f83165c5-e2ea-4d15-a5cf-33f3550bffde <chr [3]>             <int [0]> 
## 14 31887183-a72c-4308-9eac-c6140313f39c <chr [7]>             <int [0]> 
## 15 abe1a013-af7a-45ed-8c26-f3793c24a1f4 <chr [1]>             <int [0]> 
## 16 b963bd4b-4bc1-4404-8425-69d74bc636b8 <chr [1]>             <int [1]> 
## 17 4e6f083b-5b9a-4393-9890-2a83da8188f1 <chr [3]>             <int [3]> 
## 18 455b46e6-d8ea-4611-861e-de720a562ada <chr [3]>             <int [3]>

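In this case, the mapping between cellSuspensions.organ and totalCells is mostly clear: entries pair up by position. Note, however, that when a project lists several organs but a single count (project 2fe3c60b-… in the output below), unnest() repeats that count for each organ. In general, more refined navigation of the lol structure may be necessary.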

projects |>
    select(projectId, cellSuspensions.organ, totalCells) |>
    filter(lengths(totalCells) > 0) |>
    tidyr::unnest(c("cellSuspensions.organ", "totalCells"))
## # A tibble: 20 × 3
##    projectId                            cellSuspensions.organ totalCells
##    <chr>                                <chr>                      <int>
##  1 c1810dbc-16d2-45c3-b45e-3e675f88d87b thymus                    456000
##  2 c1810dbc-16d2-45c3-b45e-3e675f88d87b colon                      16000
##  3 c0518445-3b3b-49c6-b8fc-c41daa4eacba lung                       40200
##  4 c0518445-3b3b-49c6-b8fc-c41daa4eacba nose                        7087
##  5 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 lung                       78901
##  6 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 heart                      78901
##  7 ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 heart                     791000
##  8 83f5188e-3bf7-4956-9544-cea4f8997756 immune organ                 381
##  9 83f5188e-3bf7-4956-9544-cea4f8997756 large intestine             1141
## 10 c4077b3c-5c98-4d26-a614-246d12c2e5d7 esophagus                  93267
## 11 c4077b3c-5c98-4d26-a614-246d12c2e5d7 spleen                     66553
## 12 c4077b3c-5c98-4d26-a614-246d12c2e5d7 lung                       40025
## 13 091cf39b-01bc-42e5-9437-f419a66c8a45 bone marrow              1480000
## 14 b963bd4b-4bc1-4404-8425-69d74bc636b8 blood                     800000
## 15 4e6f083b-5b9a-4393-9890-2a83da8188f1 embryo                     27083
## 16 4e6f083b-5b9a-4393-9890-2a83da8188f1 endoderm                   54166
## 17 4e6f083b-5b9a-4393-9890-2a83da8188f1 presumptive gut            30952
## 18 455b46e6-d8ea-4611-861e-de720a562ada bone marrow                87532
## 19 455b46e6-d8ea-4611-861e-de720a562ada spleen                     60043
## 20 455b46e6-d8ea-4611-861e-de720a562ada blood                      25016
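
Projects where the two list-columns differ in length (for instance two organs but a single count) cannot be paired position by position. One way to flag them before unnesting is to compare lengths(). This is a base R sketch on a toy data frame; toy, its project ids, and the third row are made up for illustration, while the first two rows mirror the shape of entries in the output above.

```r
# Toy data frame with list-columns shaped like the projects tibble
toy <- data.frame(projectId = c("p1", "p2", "p3"), stringsAsFactors = FALSE)
toy$organ <- list(c("thymus", "colon"), c("lung", "heart"), "blood")
toy$totalCells <- list(c(456000L, 16000L), 78901L, integer(0))

n_organ <- lengths(toy$organ)
n_cells <- lengths(toy$totalCells)

# Classify each project by how its organs and counts line up
toy$pairing <- ifelse(
    n_cells == 0L, "no counts",
    ifelse(n_cells == n_organ, "one-to-one", "ambiguous")
)
toy[, c("projectId", "pairing")]
```

Rows flagged as "ambiguous" are the ones where a per-organ cell count cannot be read off directly from the two columns.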

Select the following entry, augment the filter, and query available files.

projects %>%
    filter(startsWith(projectTitle, "Reconstruct")) %>%
    t()
##                       [,1]                                                                                                 
## projectId             "f83165c5-e2ea-4d15-a5cf-33f3550bffde"                                                               
## projectTitle          "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics"
## genusSpecies          "Homo sapiens"                                                                                       
## donorCount            16                                                                                                   
## cellSuspensions.organ character,3                                                                                          
## totalCells            integer,0

This approach can be used to customize the tibbles returned by the other main functions in the package, files(), samples(), and bundles().

3.3 File download

The relevant file can be selected and downloaded using the technique in the first example.

filters <- filters(
    projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"),
    fileFormat = list(is = "h5ad")
)
files <-
    files(filters) %>%
    head(1)            # only first file, for demonstration
files %>% t()
##              [,1]                                                                                                                                                      
## fileId       "6d4fedcf-857d-5fbb-9928-8b9605500a69"                                                                                                                    
## name         "vento18_ss2.processed.h5ad"                                                                                                                              
## fileFormat   "h5ad"                                                                                                                                                    
## size         "82121633"                                                                                                                                                
## version      "2021-02-10T16:56:40.419579Z"                                                                                                                             
## projectTitle "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics"                                                     
## projectId    "f83165c5-e2ea-4d15-a5cf-33f3550bffde"                                                                                                                    
## url          "https://service.azul.data.humancellatlas.org/repository/files/6d4fedcf-857d-5fbb-9928-8b9605500a69?catalog=dcp17&version=2021-02-10T16%3A56%3A40.419579Z"
file_path <- files_download(files)

"h5ad" files can be read as SingleCellExperiment objects using the zellkonverter package.

## use_hdf5 = TRUE keeps the assay data on disk rather than reading it all into memory
sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
sce

4 Example: A multiple file download

project_filter <- filters(fileFormat = list(is = "csv"))
project_tibble <- projects(project_filter)

project_tibble %>%
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    )
## # A tibble: 1 × 5
##   projectId                            projectTitle      genus…¹ sampl…² speci…³
##   <chr>                                <chr>             <list>  <list>  <list> 
## 1 f83165c5-e2ea-4d15-a5cf-33f3550bffde Reconstructing t… <chr>   <chr>   <chr>  
## # … with abbreviated variable names ¹​genusSpecies, ²​samples.organ,
## #   ³​specimens.organ

projectId <-
    project_tibble %>%
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    ) %>%
    pull(projectId)

file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "csv")
)

## first 4 files will be returned
file_tibble <- files(file_filter, size = 4)

file_tibble %>%
    files_download()
## 7f9a181e-24c5-5462-b308-7fef5b1bda2a-2021-02-10T16:56:40.419579Z 
## "/home/biocbuild/.cache/R/hca/3e264966a8b56a_3e264966a8b56a.csv" 
## d04c6e3c-b740-5586-8420-4480a1b5706c-2021-02-10T16:56:40.419579Z 
## "/home/biocbuild/.cache/R/hca/3e264969584155_3e264969584155.csv" 
## d30ffc0b-7d6e-5b85-aff9-21ec69663a81-2021-02-10T16:56:40.419579Z 
## "/home/biocbuild/.cache/R/hca/3e264972689134_3e264972689134.csv" 
## e1517725-01b0-5346-9788-afca63e9993a-2021-02-10T16:56:40.419579Z 
## "/home/biocbuild/.cache/R/hca/3e26497b0bf95a_3e26497b0bf95a.csv"

5 Example: Exploring the pagination feature

files(), bundles(), and samples() can each return thousands of results, so it is necessary to ‘page’ through them to see them all. We illustrate pagination with projects(), retrieving only 30 projects per page.
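
To make the idea of paging concrete, here is a conceptual sketch in base R that splits a local vector into fixed-size pages. It does not call the HCA API, and get_page() and results are hypothetical names introduced only for this illustration; with hca, moving between pages is done with hca_next() and hca_prev() on a returned result, as shown in the rest of this section.

```r
# Toy stand-in for a large result set (not HCA data)
results <- sprintf("project-%03d", 1:70)

# Hypothetical helper: return page `i` (1-based) holding up to `size` results
get_page <- function(x, i, size) {
    start <- (i - 1L) * size + 1L
    if (start > length(x)) {
        return(x[0])                      # past the end: empty page
    }
    x[start:min(start + size - 1L, length(x))]
}

page_1 <- get_page(results, 1L, 30L)
page_2 <- get_page(results, 2L, 30L)
page_3 <- get_page(results, 3L, 30L)      # final page is shorter
lengths(list(page_1, page_2, page_3))
```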

Pagination works for the default tibble output.

page_1_tbl <- projects(size = 30)
page_1_tbl
## # A tibble: 30 × 5
##    projectId                            projectTitle     genus…¹ sampl…² speci…³
##    <chr>                                <chr>            <list>  <list>  <list> 
##  1 74b6d569-3b11-42ef-b6b1-a0454522b4a0 1.3 Million Bra… <chr>   <chr>   <chr>  
##  2 53c53cd4-8127-4e12-bc7f-8fe1610a715c A Cellular Anat… <chr>   <chr>   <chr>  
##  3 7027adc6-c9c9-46f3-84ee-9badc3a4f53b A Cellular Atla… <chr>   <chr>   <chr>  
##  4 94e4ee09-9b4b-410a-84dc-a751ad36d0df A Human Liver C… <chr>   <chr>   <chr>  
##  5 60ea42e1-af49-42f5-8164-d641fdb696bc A Protocol for … <chr>   <chr>   <chr>  
##  6 ef1e3497-515e-4bbe-8d4c-10161854b699 A Single-Cell A… <chr>   <chr>   <chr>  
##  7 f86f1ab4-1fbb-4510-ae35-3ffd752d4dfc A Single-Cell T… <chr>   <chr>   <chr>  
##  8 602628d7-c038-48a8-aa97-ffbb2cb44c9d A Single-cell T… <chr>   <chr>   <chr>  
##  9 c1810dbc-16d2-45c3-b45e-3e675f88d87b A cell atlas of… <chr>   <chr>   <chr>  
## 10 c0518445-3b3b-49c6-b8fc-c41daa4eacba A cellular cens… <chr>   <chr>   <chr>  
## # … with 20 more rows, and abbreviated variable names ¹​genusSpecies,
## #   ²​samples.organ, ³​specimens.organ
## # ℹ Use `print(n = ...)` to see more rows

page_2_tbl <- page_1_tbl %>% hca_next()
page_2_tbl
## # A tibble: 30 × 5
##    projectId                            projectTitle     genus…¹ sampl…² speci…³
##    <chr>                                <chr>            <chr>   <list>  <list> 
##  1 957261f7-2bd6-4358-a6ed-24ee080d5cfc A spatial multi… Homo s… <chr>   <chr>  
##  2 52b29aa4-c8d6-42b4-807a-b35be94469ca A survey of hum… Homo s… <chr>   <chr>  
##  3 f0f89c14-7460-4bab-9d42-22228a91f185 AIDA pilot data  Homo s… <chr>   <chr>  
##  4 38449aea-70b5-40db-84b3-1e08f32efe34 Altered human o… Homo s… <chr>   <chr>  
##  5 0fd8f918-62d6-4b8b-ac35-4c53dd601f71 An Atlas of Imm… Homo s… <chr>   <chr>  
##  6 8ab8726d-81b9-4bd2-acc2-4d50bee786b4 An organoid and… Homo s… <chr>   <chr>  
##  7 005d611a-14d5-4fbf-846e-571a1f874f70 Assessing the r… Homo s… <chr>   <chr>  
##  8 04ad400c-58cb-40a5-bc2b-2279e13a910b Blood and immun… Homo s… <chr>   <chr>  
##  9 a29952d9-925e-40f4-8a1c-274f118f1f51 Bone marrow pla… Homo s… <chr>   <chr>  
## 10 74493e98-44fc-48b0-a58f-cc7e77268b59 CD27hiCD38hi pl… Homo s… <chr>   <chr>  
## # … with 20 more rows, and abbreviated variable names ¹​genusSpecies,
## #   ²​samples.organ, ³​specimens.organ
## # ℹ Use `print(n = ...)` to see more rows

## should be identical to page_1_tbl
page_2_tbl %>% hca_prev()
## # A tibble: 30 × 5
##    projectId                            projectTitle     genus…¹ sampl…² speci…³
##    <chr>                                <chr>            <list>  <list>  <list> 
##  1 74b6d569-3b11-42ef-b6b1-a0454522b4a0 1.3 Million Bra… <chr>   <chr>   <chr>  
##  2 53c53cd4-8127-4e12-bc7f-8fe1610a715c A Cellular Anat… <chr>   <chr>   <chr>  
##  3 7027adc6-c9c9-46f3-84ee-9badc3a4f53b A Cellular Atla… <chr>   <chr>   <chr>  
##  4 94e4ee09-9b4b-410a-84dc-a751ad36d0df A Human Liver C… <chr>   <chr>   <chr>  
##  5 60ea42e1-af49-42f5-8164-d641fdb696bc A Protocol for … <chr>   <chr>   <chr>  
##  6 ef1e3497-515e-4bbe-8d4c-10161854b699 A Single-Cell A… <chr>   <chr>   <chr>  
##  7 f86f1ab4-1fbb-4510-ae35-3ffd752d4dfc A Single-Cell T… <chr>   <chr>   <chr>  
##  8 602628d7-c038-48a8-aa97-ffbb2cb44c9d A Single-cell T… <chr>   <chr>   <chr>  
##  9 c1810dbc-16d2-45c3-b45e-3e675f88d87b A cell atlas of… <chr>   <chr>   <chr>  
## 10 c0518445-3b3b-49c6-b8fc-c41daa4eacba A cellular cens… <chr>   <chr>   <chr>  
## # … with 20 more rows, and abbreviated variable names ¹​genusSpecies,
## #   ²​samples.organ, ³​specimens.organ
## # ℹ Use `print(n = ...)` to see more rows

Pagination also works for lol objects.

page_1_lol <- projects(size = 5, as = "lol")
page_1_lol %>%
    lol_pull("hits[*].projects[*].projectTitle")
## [1] "1.3 Million Brain Cells from E18 Mice"                                                                 
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"                           
## [3] "A Cellular Atlas of Pitx2-Dependent Cardiac Development."                                              
## [4] "A Human Liver Cell Atlas reveals Heterogeneity and Epithelial Progenitors"                             
## [5] "A Protocol for Revealing Oral Neutrophil Heterogeneity by Single-Cell Immune Profiling in Human Saliva"

page_2_lol <-
    page_1_lol %>%
    hca_next()
page_2_lol %>%
    lol_pull("hits[*].projects[*].projectTitle")
## [1] "A Single-Cell Atlas of the Human Healthy Airways."                                                                  
## [2] "A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure"
## [3] "A Single-cell Transcriptomic Atlas of Human Intervertebral Disc"                                                    
## [4] "A cell atlas of human thymic development defines T cell repertoire formation"                                       
## [5] "A cellular census of human lungs identifies novel cell states in health and in asthma"

## should be identical to page_1_lol
page_2_lol %>%
    hca_prev() %>%
    lol_pull("hits[*].projects[*].projectTitle")
## [1] "1.3 Million Brain Cells from E18 Mice"                                                                 
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"                           
## [3] "A Cellular Atlas of Pitx2-Dependent Cardiac Development."                                              
## [4] "A Human Liver Cell Atlas reveals Heterogeneity and Epithelial Progenitors"                             
## [5] "A Protocol for Revealing Oral Neutrophil Heterogeneity by Single-Cell Immune Profiling in Human Saliva"
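
The paging pattern above generalizes to a loop that follows 'next' links until none remain. The following is a minimal sketch in base R that uses a mock data source rather than the HCA API. fetch_page() and collect_all() are hypothetical helpers, not hca functions, and the convention that the last page carries a NULL 'next' position is an assumption of the mock; this vignette does not show how hca signals the last page.

```r
## Mock data source: 12 made-up titles standing in for a large result set
all_titles <- sprintf("project-%02d", 1:12)

## Hypothetical stand-in for one paged request (NOT part of hca)
fetch_page <- function(start, size) {
    end <- min(start + size - 1L, length(all_titles))
    list(
        hits = all_titles[seq(start, end)],
        ## start index of the following page, or NULL on the last page
        next_start = if (end < length(all_titles)) end + 1L else NULL
    )
}

## Follow 'next' positions until the mock reports that none remain
collect_all <- function(size) {
    results <- character()
    start <- 1L
    while (!is.null(start)) {
        page <- fetch_page(start, size)
        results <- c(results, page$hits)
        start <- page$next_start
    }
    results
}

collected <- collect_all(size = 5L)
length(collected)
```

With hca, the body of such a loop would call hca_next() on the current page instead of fetch_page().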

6 Example: Obtaining other data entities

Much like projects() and files(), samples() and bundles() accept a filter object and additional criteria, and return samples and bundles, respectively.

heart_filters <- filters(organ = list(is = "heart"))
heart_samples <- samples(filters = heart_filters, size = 4)
heart_samples
## # A tibble: 4 × 6
##   entryId                              projectTitle genus…¹ disease format count
##   <chr>                                <chr>        <chr>   <chr>   <list> <lis>
## 1 012c52ff-4770-4c0c-8c2e-c348da72cfc3 A Cellular … Mus mu… normal  <chr>  <int>
## 2 035db5b9-a219-4df8-bfc9-117cd07c45ba A Cellular … Mus mu… normal  <chr>  <int>
## 3 09e425f7-22d7-487e-b78b-78b449578181 A Cellular … Mus mu… normal  <chr>  <int>
## 4 2273e44d-9fbc-4c13-8cb3-3caf8a9ac1f7 A Cellular … Mus mu… normal  <chr>  <int>
## # … with abbreviated variable name ¹​genusSpecies

heart_bundles <- bundles(filters = heart_filters, size = 4)
heart_bundles
## # A tibble: 4 × 6
##   projectTitle                             genus…¹ samples files bundl…² bundl…³
##   <chr>                                    <chr>   <list>  <lis> <chr>   <chr>  
## 1 A Cellular Atlas of Pitx2-Dependent Car… Mus mu… <chr>   <chr> 0d391b… 2021-0…
## 2 A Cellular Atlas of Pitx2-Dependent Car… Mus mu… <chr>   <chr> 165a2d… 2021-0…
## 3 A Cellular Atlas of Pitx2-Dependent Car… Mus mu… <chr>   <chr> 166c1b… 2020-0…
## 4 A Cellular Atlas of Pitx2-Dependent Car… Mus mu… <chr>   <chr> 18bad6… 2021-0…
## # … with abbreviated variable names ¹​genusSpecies, ²​bundleUuid, ³​bundleVersion
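
Several columns in these tibbles, such as format and count, are list-columns, so one row can hold several values. The following is a minimal sketch of reshaping such columns to one row per value in base R. The data are mock values with the same shape as the output above, not HCA results, and the sketch assumes that format and count are parallel within each row.

```r
## Mock rows shaped like the 'format' and 'count' list-columns shown above;
## the values are illustrative only
samples_tbl <- data.frame(
    entryId = c("sample-1", "sample-2"),
    stringsAsFactors = FALSE
)
samples_tbl$format <- list(c("fastq.gz", "loom"), "fastq.gz")
samples_tbl$count <- list(c(8L, 1L), 4L)

## One row per (entryId, format) pair; repeat each entryId once per format
long <- data.frame(
    entryId = rep(samples_tbl$entryId, lengths(samples_tbl$format)),
    format = unlist(samples_tbl$format),
    count = unlist(samples_tbl$count),
    stringsAsFactors = FALSE
)
long
```

With tidyr attached, tidyr::unnest() offers the same kind of reshaping for tibbles.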

7 Example: Obtaining summaries of project catalogs

HCA experiments are organized into catalogs, each of which can be summarized with the summary() function. A summary can also be restricted with a filter object.

heart_filters <- filters(organ = list(is = "heart"))
summary(filters = heart_filters, type = "fileTypeSummaries")
## # A tibble: 15 × 3
##    format   count     totalSize
##    <chr>    <int>         <dbl>
##  1 fastq.gz 28293 6779884300072
##  2 h5         180   17469586148
##  3 loom       169  355865926256
##  4 bam        164 3284598129743
##  5 zip        148    9458395398
##  6 csv         88      66561822
##  7 tsv.gz      84   60703654221
##  8 rds         28   12319112882
##  9 mtx.gz      24     479611217
## 10 bed.gz       6    3332206567
## 11 h5ad         5    7174452054
## 12 .txt         4      80206910
## 13 RDS          3     301293176
## 14 Rds          3    2637887113
## 15 h5ad.zip     1   15510167649

first_catalog <- catalogs()[1]
summary(type = "overview", catalog = first_catalog)
## # A tibble: 7 × 2
##   name            value
##   <chr>           <dbl>
## 1 projectCount  2.61e 2
## 2 specimenCount 1.24e 4
## 3 speciesCount  2   e 0
## 4 fileCount     4.76e 5
## 5 totalFileSize 1.74e14
## 6 donorCount    3.79e 3
## 7 labCount      4.48e 2
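
The totalSize values in the file type summary are large and hard to compare by eye. The following is a minimal sketch of converting them to gigabytes and ordering by size, using the first four rows of the output above copied into a plain data frame. It assumes that totalSize is reported in bytes, which this vignette does not state.

```r
## First four rows of the 'fileTypeSummaries' output shown above
file_types <- data.frame(
    format = c("fastq.gz", "h5", "loom", "bam"),
    count = c(28293L, 180L, 169L, 164L),
    totalSize = c(6779884300072, 17469586148, 355865926256, 3284598129743),
    stringsAsFactors = FALSE
)

## Assumption: totalSize is in bytes; convert to gigabytes (10^9 bytes)
file_types$totalSizeGB <- round(file_types$totalSize / 1e9, 1)

## Largest file types first
by_size <- file_types[order(-file_types$totalSize), ]
by_size
```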

8 Example: Obtaining details on individual projects, files, samples, and bundles

Each project, file, sample, and bundle has a unique ID which, in conjunction with its catalog, uniquely identifies it.

heart_filters <- filters(organ = list(is = "heart"))
heart_projects <- projects(filters = heart_filters, size = 4)
heart_projects
## # A tibble: 4 × 5
##   projectId                            projectTitle      genus…¹ sampl…² speci…³
##   <chr>                                <chr>             <chr>   <list>  <list> 
## 1 7027adc6-c9c9-46f3-84ee-9badc3a4f53b A Cellular Atlas… Mus mu… <chr>   <chr>  
## 2 a9301beb-e9fa-42fe-b75c-84e8a460c733 A human cell atl… Homo s… <chr>   <chr>  
## 3 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 A human fetal lu… Homo s… <chr>   <chr>  
## 4 c31fa434-c9ed-4263-a9b6-d9ffb9d44005 A single-cell at… Homo s… <chr>   <chr>  
## # … with abbreviated variable names ¹​genusSpecies, ²​samples.organ,
## #   ³​specimens.organ

projectId <-
    heart_projects %>%
    filter(
        startsWith(
            projectTitle,
            "Cells of the adult human"
        )
    ) %>%
    dplyr::pull(projectId)

result <- projects_detail(uuid = projectId)
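
The filter step above assumes that exactly one project title on the retrieved page starts with the given prefix. If no title matches, projectId is empty. The following is a minimal sketch of checking the selection before requesting details. select_one_id() is a hypothetical helper, not part of hca, and the data frame holds mock IDs and titles.

```r
## Hypothetical helper (not part of hca): fail unless exactly one ID matches
select_one_id <- function(tbl, prefix) {
    ids <- tbl$projectId[startsWith(tbl$projectTitle, prefix)]
    if (length(ids) != 1L)
        stop("expected exactly 1 matching project, found ", length(ids))
    ids
}

## Mock project table; IDs and titles are illustrative only
mock_projects <- data.frame(
    projectId = c("id-1", "id-2", "id-3"),
    projectTitle = c("Alpha study", "Beta study", "Cells of the example organ"),
    stringsAsFactors = FALSE
)

one_id <- select_one_id(mock_projects, "Cells of the")

## A prefix with no match signals an error instead of returning an empty vector
no_match <- tryCatch(
    select_one_id(mock_projects, "Gamma"),
    error = conditionMessage
)
```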

The result is a list containing three elements: information for navigating to the next or previous project, in alphabetical order by default (pagination); the available filters (termFacets); and details of the project (hits).

names(result)
## [1] "pagination" "termFacets" "hits"

As mentioned above, the hits are a complicated list-of-lists structure. A convenient way to explore this structure visually is with listviewer::jsonedit(result). Individual elements can be selected using the lol interface; an alternative is cellxgenedp::jmespath().

lol(result)
## # class: lol
## # number of distinct paths: 890
## # total number of elements: 26743
## # number of leaf paths: 565
## # number of leaf elements: 17952
## # lol_path():
## # A tibble: 890 × 3
##    path                                            n is_leaf
##    <chr>                                       <int> <lgl>  
##  1 hits                                            1 FALSE  
##  2 hits[*]                                        10 FALSE  
##  3 hits[*].cellLines                              10 FALSE  
##  4 hits[*].cellSuspensions                        10 FALSE  
##  5 hits[*].cellSuspensions[*]                     16 FALSE  
##  6 hits[*].cellSuspensions[*].organ               16 FALSE  
##  7 hits[*].cellSuspensions[*].organPart           16 FALSE  
##  8 hits[*].cellSuspensions[*].organPart[*]        22 TRUE   
##  9 hits[*].cellSuspensions[*].organ[*]            16 TRUE   
## 10 hits[*].cellSuspensions[*].selectedCellType    16 FALSE  
## # … with 880 more rows
## # ℹ Use `print(n = ...)` to see more rows
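
The path notation in this output can be read as follows: a name following '.' is a named element, and '[*]' stands for every element of an unnamed list. The following is a minimal sketch of how such paths and their counts could be enumerated for a small nested list. list_paths() is a hypothetical helper written for illustration, not the hca implementation, and the nested list is mock data.

```r
## Hypothetical helper (not the hca implementation): list every path in a
## nested list, once per occurrence
list_paths <- function(x, prefix = "") {
    if (!is.list(x))
        return(character())
    nms <- names(x)
    paths <- character()
    for (i in seq_along(x)) {
        ## unnamed elements become 'parent[*]', named ones 'parent.name'
        path <- if (is.null(nms) || !nzchar(nms[[i]])) {
            paste0(prefix, "[*]")
        } else if (nzchar(prefix)) {
            paste0(prefix, ".", nms[[i]])
        } else {
            nms[[i]]
        }
        paths <- c(paths, path, list_paths(x[[i]], path))
    }
    paths
}

## Mock structure with two hits, each holding one project
nested <- list(
    hits = list(
        list(projects = list(list(projectTitle = "Title A"))),
        list(projects = list(list(projectTitle = "Title B")))
    )
)

## Number of times each distinct path occurs
path_counts <- table(list_paths(nested))
path_counts
```

In this mock, 'hits' occurs once and 'hits[*]' twice, mirroring the n column reported by lol().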

9 Exploring manifest files

See the accompanying “Human Cell Atlas Manifests” vignette for details on using the manifest endpoint and on further annotating .loom files.

10 Session info

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] httr_1.4.3                  hca_1.4.3                  
##  [3] LoomExperiment_1.14.0       BiocIO_1.6.0               
##  [5] rhdf5_2.40.0                SingleCellExperiment_1.18.0
##  [7] SummarizedExperiment_1.26.1 Biobase_2.56.0             
##  [9] GenomicRanges_1.48.0        GenomeInfoDb_1.32.2        
## [11] IRanges_2.30.0              S4Vectors_0.34.0           
## [13] BiocGenerics_0.42.0         MatrixGenerics_1.8.1       
## [15] matrixStats_0.62.0          dplyr_1.0.9                
## [17] BiocStyle_2.24.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyr_1.2.0            sass_0.4.2             vroom_1.5.7           
##  [4] bit64_4.0.5            jsonlite_1.8.0         bslib_0.4.0           
##  [7] assertthat_0.2.1       BiocManager_1.30.18    BiocFileCache_2.4.0   
## [10] blob_1.2.3             GenomeInfoDbData_1.2.8 yaml_2.3.5            
## [13] pillar_1.8.0           RSQLite_2.2.15         lattice_0.20-45       
## [16] glue_1.6.2             digest_0.6.29          XVector_0.36.0        
## [19] htmltools_0.5.3        Matrix_1.4-1           pkgconfig_2.0.3       
## [22] bookdown_0.27          zlibbioc_1.42.0        purrr_0.3.4           
## [25] HDF5Array_1.24.1       tzdb_0.3.0             tibble_3.1.7          
## [28] generics_0.1.3         ellipsis_0.3.2         cachem_1.0.6          
## [31] cli_3.3.0              crayon_1.5.1           magrittr_2.0.3        
## [34] memoise_2.0.1          evaluate_0.15          fansi_1.0.3           
## [37] tools_4.2.1            hms_1.1.1              formatR_1.12          
## [40] lifecycle_1.0.1        stringr_1.4.0          Rhdf5lib_1.18.2       
## [43] DelayedArray_0.22.0    lambda.r_1.2.4         compiler_4.2.1        
## [46] jquerylib_0.1.4        rlang_1.0.4            futile.logger_1.4.3   
## [49] grid_4.2.1             RCurl_1.98-1.7         rhdf5filters_1.8.0    
## [52] rappdirs_0.3.3         bitops_1.0-7           rmarkdown_2.14        
## [55] DBI_1.1.3              curl_4.3.2             R6_2.5.1              
## [58] knitr_1.39             fastmap_1.1.0          bit_4.0.4             
## [61] utf8_1.2.2             filelock_1.0.2         futile.options_1.0.1  
## [64] readr_2.1.2            stringi_1.7.8          parallel_4.2.1        
## [67] Rcpp_1.0.9             vctrs_0.4.1            dbplyr_2.2.1          
## [70] tidyselect_1.1.2       xfun_0.31