1 Setup

For each case study, ensure that cellxgenedp (see the Bioconductor package landing page, or GitHub.io site) is installed (additional installation options are at https://mtmorgan.github.io/cellxgenedp/).

if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager", repos = "https://CRAN.R-project.org")
BiocManager::install("cellxgenedp")

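Load the package, together with dplyr, which provides the joins and other verbs (left_join(), count(), filter(), ...) used below.

library(cellxgenedp)
library(dplyr)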

2 Case study: authors & datasets

2.1 Challenge and solution

This case study arose from a question on the CZI Science Community Slack. A user asked:

Hi! Is it possible to search CELLxGENE and identify all datasets by a specific author or set of authors?

Unfortunately, this is not possible from the CELLxGENE web site – authors are only associated with collections, and collections can only be sorted or filtered by title (or publication / tissue / disease / organism).

A cellxgenedp solution uses authors() to discover authors and their collections, and joins this information to datasets().

author_datasets <- left_join(
    authors(),
    datasets(),
    by = "collection_id",
    relationship = "many-to-many"
)
author_datasets
#> # A tibble: 42,760 × 34
#>    collection_id  family given consortium dataset_id dataset_version_id donor_id
#>    <chr>          <chr>  <chr> <chr>      <chr>      <chr>              <list>  
#>  1 ceb895f4-ff9f… Zhu    Kaiyi <NA>       53ce2631-… 2f17c183-388a-4c0… <list>  
#>  2 ceb895f4-ff9f… Zhu    Kaiyi <NA>       1d4128f6-… 94762ee1-9f9f-49e… <list>  
#>  3 ceb895f4-ff9f… Bendl  Jaro… <NA>       53ce2631-… 2f17c183-388a-4c0… <list>  
#>  4 ceb895f4-ff9f… Bendl  Jaro… <NA>       1d4128f6-… 94762ee1-9f9f-49e… <list>  
#>  5 ceb895f4-ff9f… Rahman Samir <NA>       53ce2631-… 2f17c183-388a-4c0… <list>  
#>  6 ceb895f4-ff9f… Rahman Samir <NA>       1d4128f6-… 94762ee1-9f9f-49e… <list>  
#>  7 ceb895f4-ff9f… Vicari Jame… <NA>       53ce2631-… 2f17c183-388a-4c0… <list>  
#>  8 ceb895f4-ff9f… Vicari Jame… <NA>       1d4128f6-… 94762ee1-9f9f-49e… <list>  
#>  9 ceb895f4-ff9f… Colem… Clai… <NA>       53ce2631-… 2f17c183-388a-4c0… <list>  
#> 10 ceb895f4-ff9f… Colem… Clai… <NA>       1d4128f6-… 94762ee1-9f9f-49e… <list>  
#> # ℹ 42,750 more rows
#> # ℹ 27 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> #   cell_type <list>, citation <chr>, development_stage <list>, disease <list>,
#> #   embeddings <list>, explorer_url <chr>, feature_biotype <list>,
#> #   feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> #   mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> #   raw_data_location <chr>, schema_version <chr>, …
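To see why the join is declared many-to-many, it may help to look at a toy example. The two tables below are hypothetical stand-ins for authors() and datasets(); the identifiers and names are made up. A collection can have several authors and several datasets, so every author of a collection is paired with every dataset in that collection.

```r
library(dplyr)

## Hypothetical stand-ins for authors() and datasets(); all values are made up
toy_authors <- tibble(
    collection_id = c("c1", "c1", "c2"),
    family = c("Alpha", "Beta", "Alpha"),
    given = c("A.", "B.", "A.")
)
toy_datasets <- tibble(
    collection_id = c("c1", "c1", "c2"),
    dataset_id = c("d1", "d2", "d3")
)

## Each author of a collection is paired with each dataset in that collection:
## collection c1 contributes 2 authors x 2 datasets, c2 contributes 1 x 1
toy_author_datasets <- left_join(
    toy_authors,
    toy_datasets,
    by = "collection_id",
    relationship = "many-to-many"
)
toy_author_datasets
```

Without relationship = "many-to-many", recent versions of dplyr warn about this kind of join, because it is often a sign of a mistake; here it is the intended behavior.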

author_datasets provides a convenient point from which to make basic queries, e.g., finding the authors contributing the most datasets.

author_datasets |>
    count(family, given, sort = TRUE)
#> # A tibble: 3,946 × 3
#>    family      given        n
#>    <chr>       <chr>    <int>
#>  1 Casper      Tamara     232
#>  2 Dee         Nick       232
#>  3 Macosko     Evan Z.    230
#>  4 Chen        Fei        226
#>  5 Ding        Song-Lin   226
#>  6 Murray      Evan       226
#>  7 Hirschstein Daniel     217
#>  8 Travaglini  Kyle J.    202
#>  9 Nyhus       Julie      201
#> 10 Lein        Ed S.      198
#> # ℹ 3,936 more rows

Perhaps one is interested in the most prolific authors based on ‘collections’, rather than ‘datasets’. The five most prolific authors by collection are

prolific_authors <-
    authors() |>
    count(family, given, sort = TRUE) |>
    slice(1:5)
prolific_authors
#> # A tibble: 5 × 3
#>   family    given          n
#>   <chr>     <chr>      <int>
#> 1 Teichmann Sarah A.      23
#> 2 Regev     Aviv          13
#> 3 Haniffa   Muzlifah      12
#> 4 Meyer     Kerstin B.    12
#> 5 Polanski  Krzysztof     12

The datasets associated with these authors are

right_join(
    author_datasets,
    prolific_authors,
    by = c("family", "given")
)
#> # A tibble: 364 × 35
#>    collection_id  family given consortium dataset_id dataset_version_id donor_id
#>    <chr>          <chr>  <chr> <chr>      <chr>      <chr>              <list>  
#>  1 1d1c7275-476a… Regev  Aviv  <NA>       d319af7f-… 3c80a5bb-8c89-433… <list>  
#>  2 1d1c7275-476a… Regev  Aviv  <NA>       bc7260e0-… cab44f51-3ebf-4c9… <list>  
#>  3 1d1c7275-476a… Regev  Aviv  <NA>       8623d55f-… f7e3732a-eea9-49a… <list>  
#>  4 1d1c7275-476a… Regev  Aviv  <NA>       7b75b2c4-… 4b53f264-a62e-41a… <list>  
#>  5 1d1c7275-476a… Regev  Aviv  <NA>       5cdbb2ea-… 9944528e-f23b-4d8… <list>  
#>  6 1d1c7275-476a… Regev  Aviv  <NA>       2f6a20f1-… 129c17aa-3d9d-435… <list>  
#>  7 f7cecffa-00b4… Meyer  Kers… <NA>       eaf0c852-… ec370e4c-4ce8-435… <list>  
#>  8 f7cecffa-00b4… Meyer  Kers… <NA>       5af90777-… 4fd0f389-89ac-4a6… <list>  
#>  9 f7cecffa-00b4… Meyer  Kers… <NA>       318cb2a6-… 99d31e0c-4ccf-42c… <list>  
#> 10 f7cecffa-00b4… Teich… Sara… <NA>       eaf0c852-… ec370e4c-4ce8-435… <list>  
#> # ℹ 354 more rows
#> # ℹ 28 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> #   cell_type <list>, citation <chr>, development_stage <list>, disease <list>,
#> #   embeddings <list>, explorer_url <chr>, feature_biotype <list>,
#> #   feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> #   mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> #   raw_data_location <chr>, schema_version <chr>, …

Alternatively, one might be interested in specific authors. This is most easily accomplished with a simple filter on author_datasets, e.g.,

author_datasets |>
    filter(
        family %in% c("Teichmann", "Regev", "Haniffa")
    )
#> # A tibble: 238 × 34
#>    collection_id  family given consortium dataset_id dataset_version_id donor_id
#>    <chr>          <chr>  <chr> <chr>      <chr>      <chr>              <list>  
#>  1 1d1c7275-476a… Regev  Aviv  <NA>       d319af7f-… 3c80a5bb-8c89-433… <list>  
#>  2 1d1c7275-476a… Regev  Aviv  <NA>       bc7260e0-… cab44f51-3ebf-4c9… <list>  
#>  3 1d1c7275-476a… Regev  Aviv  <NA>       8623d55f-… f7e3732a-eea9-49a… <list>  
#>  4 1d1c7275-476a… Regev  Aviv  <NA>       7b75b2c4-… 4b53f264-a62e-41a… <list>  
#>  5 1d1c7275-476a… Regev  Aviv  <NA>       5cdbb2ea-… 9944528e-f23b-4d8… <list>  
#>  6 1d1c7275-476a… Regev  Aviv  <NA>       2f6a20f1-… 129c17aa-3d9d-435… <list>  
#>  7 f7cecffa-00b4… Teich… Sara… <NA>       eaf0c852-… ec370e4c-4ce8-435… <list>  
#>  8 f7cecffa-00b4… Teich… Sara… <NA>       5af90777-… 4fd0f389-89ac-4a6… <list>  
#>  9 f7cecffa-00b4… Teich… Sara… <NA>       318cb2a6-… 99d31e0c-4ccf-42c… <list>  
#> 10 2d2e2acd-dade… Teich… Sara… <NA>       f9846bb4-… f0b11a56-236f-4e7… <list>  
#> # ℹ 228 more rows
#> # ℹ 27 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> #   cell_type <list>, citation <chr>, development_stage <list>, disease <list>,
#> #   embeddings <list>, explorer_url <chr>, feature_biotype <list>,
#> #   feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> #   mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> #   raw_data_location <chr>, schema_version <chr>, …

or, more carefully, by constructing a data.frame of family and given names and performing a join with author_datasets

authors_of_interest <-
    tibble(
        family = c("Teichmann", "Regev", "Haniffa"),
        given = c("Sarah A.", "Aviv", "Muzlifah")
    )
right_join(
    author_datasets,
    authors_of_interest,
    by = c("family", "given")
)
#> # A tibble: 228 × 34
#>    collection_id  family given consortium dataset_id dataset_version_id donor_id
#>    <chr>          <chr>  <chr> <chr>      <chr>      <chr>              <list>  
#>  1 1d1c7275-476a… Regev  Aviv  <NA>       d319af7f-… 3c80a5bb-8c89-433… <list>  
#>  2 1d1c7275-476a… Regev  Aviv  <NA>       bc7260e0-… cab44f51-3ebf-4c9… <list>  
#>  3 1d1c7275-476a… Regev  Aviv  <NA>       8623d55f-… f7e3732a-eea9-49a… <list>  
#>  4 1d1c7275-476a… Regev  Aviv  <NA>       7b75b2c4-… 4b53f264-a62e-41a… <list>  
#>  5 1d1c7275-476a… Regev  Aviv  <NA>       5cdbb2ea-… 9944528e-f23b-4d8… <list>  
#>  6 1d1c7275-476a… Regev  Aviv  <NA>       2f6a20f1-… 129c17aa-3d9d-435… <list>  
#>  7 f7cecffa-00b4… Teich… Sara… <NA>       eaf0c852-… ec370e4c-4ce8-435… <list>  
#>  8 f7cecffa-00b4… Teich… Sara… <NA>       5af90777-… 4fd0f389-89ac-4a6… <list>  
#>  9 f7cecffa-00b4… Teich… Sara… <NA>       318cb2a6-… 99d31e0c-4ccf-42c… <list>  
#> 10 2d2e2acd-dade… Teich… Sara… <NA>       f9846bb4-… f0b11a56-236f-4e7… <list>  
#> # ℹ 218 more rows
#> # ℹ 27 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> #   cell_type <list>, citation <chr>, development_stage <list>, disease <list>,
#> #   embeddings <list>, explorer_url <chr>, feature_biotype <list>,
#> #   feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> #   mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> #   raw_data_location <chr>, schema_version <chr>, …
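The filter on family name returned 238 rows and the join on family and given name 228, presumably because the filter also matches other authors who share a family name. The sketch below illustrates the difference on a made-up table; the names and identifiers are hypothetical. It uses inner_join() rather than right_join(): right_join() keeps an author of interest even when no dataset matches (as a row of NA values), whereas inner_join() drops such authors.

```r
library(dplyr)

## Hypothetical author-dataset table; two different people share a family name
toy_author_datasets <- tibble(
    family = c("Smith", "Smith", "Jones", "Jones"),
    given = c("Ann", "Bob", "Cara", "Cara"),
    dataset_id = c("d1", "d2", "d3", "d4")
)

## Filtering on family name alone also picks up 'Bob Smith'
by_family <- toy_author_datasets |>
    filter(family %in% c("Smith", "Jones"))

## Joining on family and given name keeps only the intended individuals
toy_authors_of_interest <- tibble(
    family = c("Smith", "Jones"),
    given = c("Ann", "Cara")
)
by_full_name <- inner_join(
    toy_author_datasets,
    toy_authors_of_interest,
    by = c("family", "given")
)

nrow(by_family)     # 4
nrow(by_full_name)  # 3

## Rows matched by family name only
anti_join(by_family, by_full_name, by = c("family", "given", "dataset_id"))
```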

2.2 Areas of interest

There are several interesting questions that suggest themselves, and several areas where some additional work is required.

It might be interesting to identify authors working on similar diseases, or other areas of interest. The disease column in the author_datasets table is a list.

author_datasets |>
    select(family, given, dataset_id, disease)
#> # A tibble: 42,760 × 4
#>    family  given    dataset_id                           disease   
#>    <chr>   <chr>    <chr>                                <list>    
#>  1 Zhu     Kaiyi    53ce2631-3646-4172-bbd9-38b0a44d8214 <list [1]>
#>  2 Zhu     Kaiyi    1d4128f6-c27b-40c4-af77-b1c7e2b694e7 <list [1]>
#>  3 Bendl   Jaroslav 53ce2631-3646-4172-bbd9-38b0a44d8214 <list [1]>
#>  4 Bendl   Jaroslav 1d4128f6-c27b-40c4-af77-b1c7e2b694e7 <list [1]>
#>  5 Rahman  Samir    53ce2631-3646-4172-bbd9-38b0a44d8214 <list [1]>
#>  6 Rahman  Samir    1d4128f6-c27b-40c4-af77-b1c7e2b694e7 <list [1]>
#>  7 Vicari  James M. 53ce2631-3646-4172-bbd9-38b0a44d8214 <list [1]>
#>  8 Vicari  James M. 1d4128f6-c27b-40c4-af77-b1c7e2b694e7 <list [1]>
#>  9 Coleman Claire   53ce2631-3646-4172-bbd9-38b0a44d8214 <list [1]>
#> 10 Coleman Claire   1d4128f6-c27b-40c4-af77-b1c7e2b694e7 <list [1]>
#> # ℹ 42,750 more rows

This is because a single dataset may involve more than one disease. Furthermore, each entry in the list contains two elements, the label and ontology_term_id of the disease. There are two approaches to working with this data.

The first approach uses facilities in cellxgenedp, as outlined in an accompanying article. Start by discovering the possible diseases.

facets(db(), "disease")
#> # A tibble: 99 × 4
#>    facet   label                                          ontology_term_id     n
#>    <chr>   <chr>                                          <chr>            <int>
#>  1 disease normal                                         PATO:0000461       985
#>  2 disease COVID-19                                       MONDO:0100096       60
#>  3 disease dementia                                       MONDO:0001627       50
#>  4 disease diabetic kidney disease                        MONDO:0005016       26
#>  5 disease myocardial infarction                          MONDO:0005068       26
#>  6 disease autosomal dominant polycystic kidney disease   MONDO:0004691       24
#>  7 disease Alzheimer disease                              MONDO:0004975       15
#>  8 disease lung adenocarcinoma                            MONDO:0005061       11
#>  9 disease small cell lung carcinoma                      MONDO:0008433       11
#> 10 disease arrhythmogenic right ventricular cardiomyopat… MONDO:0016587        9
#> # ℹ 89 more rows

Focus on COVID-19, and use facets_filter() to select relevant author-dataset combinations.

author_datasets |>
    filter(facets_filter(disease, "label", "COVID-19"))
#> # A tibble: 1,766 × 34
#>    collection_id  family given consortium dataset_id dataset_version_id donor_id
#>    <chr>          <chr>  <chr> <chr>      <chr>      <chr>              <list>  
#>  1 a72afd53-ab92… Wilk   Aaro… <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#>  2 a72afd53-ab92… Rusta… Arjun <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#>  3 a72afd53-ab92… Zhao   Nanc… <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#>  4 a72afd53-ab92… Roque  Jona… <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#>  5 a72afd53-ab92… Martí… Giov… <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#>  6 a72afd53-ab92… McKec… Juli… <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#>  7 a72afd53-ab92… Ivison Geof… <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#>  8 a72afd53-ab92… Ranga… Than… <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#>  9 a72afd53-ab92… Verga… Rose… <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#> 10 a72afd53-ab92… Hollis Tayl… <NA>       456e8b9b-… a5ac4378-58f6-4ff… <list>  
#> # ℹ 1,756 more rows
#> # ℹ 27 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> #   cell_type <list>, citation <chr>, development_stage <list>, disease <list>,
#> #   embeddings <list>, explorer_url <chr>, feature_biotype <list>,
#> #   feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> #   mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> #   raw_data_location <chr>, schema_version <chr>, …

Authors contributing to these datasets are

author_datasets |>
    filter(facets_filter(disease, "label", "COVID-19")) |>
    count(family, given, sort = TRUE)
#> # A tibble: 794 × 3
#>    family       given           n
#>    <chr>        <chr>       <int>
#>  1 Farber       Donna L.       29
#>  2 Guo          Xinzheng V.    28
#>  3 Saqi         Anjali         28
#>  4 Baldwin      Matthew R.     27
#>  5 Chait        Michael        27
#>  6 Connors      Thomas J.      27
#>  7 Davis-Porada Julia          27
#>  8 Dogra        Pranay         27
#>  9 Gray         Joshua I.      27
#> 10 Idzikowski   Emma           27
#> # ℹ 784 more rows

A second approach follows the practices in R for Data Science. The disease column can be ‘unnested’ twice: the first time to expand the author_datasets table so that there is one row per disease, and the second time to separate the two fields of each disease into columns.

author_dataset_diseases <-
    author_datasets |>
    select(family, given, dataset_id, disease) |>
    tidyr::unnest_longer(disease) |>
    tidyr::unnest_wider(disease)
author_dataset_diseases
#> # A tibble: 56,419 × 5
#>    family  given    dataset_id                           label  ontology_term_id
#>    <chr>   <chr>    <chr>                                <chr>  <chr>           
#>  1 Zhu     Kaiyi    53ce2631-3646-4172-bbd9-38b0a44d8214 normal PATO:0000461    
#>  2 Zhu     Kaiyi    1d4128f6-c27b-40c4-af77-b1c7e2b694e7 normal PATO:0000461    
#>  3 Bendl   Jaroslav 53ce2631-3646-4172-bbd9-38b0a44d8214 normal PATO:0000461    
#>  4 Bendl   Jaroslav 1d4128f6-c27b-40c4-af77-b1c7e2b694e7 normal PATO:0000461    
#>  5 Rahman  Samir    53ce2631-3646-4172-bbd9-38b0a44d8214 normal PATO:0000461    
#>  6 Rahman  Samir    1d4128f6-c27b-40c4-af77-b1c7e2b694e7 normal PATO:0000461    
#>  7 Vicari  James M. 53ce2631-3646-4172-bbd9-38b0a44d8214 normal PATO:0000461    
#>  8 Vicari  James M. 1d4128f6-c27b-40c4-af77-b1c7e2b694e7 normal PATO:0000461    
#>  9 Coleman Claire   53ce2631-3646-4172-bbd9-38b0a44d8214 normal PATO:0000461    
#> 10 Coleman Claire   1d4128f6-c27b-40c4-af77-b1c7e2b694e7 normal PATO:0000461    
#> # ℹ 56,409 more rows
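The two unnesting steps are easier to follow on a small table. The sketch below builds a hypothetical two-dataset table whose disease column has the structure described above (a list of label / ontology_term_id pairs); the dataset identifiers are made up, and the labels and term identifiers are taken from the facets() output shown earlier.

```r
library(dplyr)
library(tidyr)

## Hypothetical table; dataset d2 is annotated with two diseases
toy <- tibble(
    dataset_id = c("d1", "d2"),
    disease = list(
        list(list(label = "normal", ontology_term_id = "PATO:0000461")),
        list(
            list(label = "normal", ontology_term_id = "PATO:0000461"),
            list(label = "COVID-19", ontology_term_id = "MONDO:0100096")
        )
    )
)

## First unnest: one row per dataset-disease combination
toy_long <- toy |>
    unnest_longer(disease)

## Second unnest: the two fields of each disease become columns
toy_wide <- toy_long |>
    unnest_wider(disease)
toy_wide
```

The row count grows from two to three, in the same way that author_datasets grows from 42,760 to 56,419 rows when datasets with several diseases are expanded.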

Author-dataset combinations associated with COVID-19, and contributors to these datasets, are

author_dataset_diseases |>
    filter(label == "COVID-19")

author_dataset_diseases |>
    filter(label == "COVID-19") |>
    count(family, given, sort = TRUE)

These computations give the same results as the earlier approach using functionality in cellxgenedp.

A further resource that might be of interest is the OLSr package article illustrating how the ontologies used by CELLxGENE can be manipulated to, e.g., identify studies with terms that derive from a common term (e.g., all disease terms related to ‘carcinoma’).

2.3 Collaboration

TODO.

It might be interesting to know which authors have collaborated with one another. This can be computed from the author_datasets table, following approaches developed in the grantpubcite package to identify collaborations between projects in the NIH-funded ITCR program. See the graph visualization in the ITCR collaboration section for inspiration.
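As a starting point, here is a minimal sketch of one way to count collaborations. It is not the grantpubcite approach; it assumes that ‘collaboration’ means appearing as authors on the same collection, and it uses a made-up table with the same columns as authors(). The table is joined to itself by collection_id, and each unordered pair of authors is kept once.

```r
library(dplyr)

## Hypothetical table with the columns of authors(); all values are made up
toy_authors <- tibble(
    collection_id = c("c1", "c1", "c1", "c2", "c2"),
    family = c("Alpha", "Beta", "Gamma", "Alpha", "Beta"),
    given = c("A.", "B.", "G.", "A.", "B.")
)

## One row per distinct collection-author combination, with a single name column
author_names <- toy_authors |>
    distinct() |>
    mutate(author = paste(family, given, sep = ", ")) |>
    select(collection_id, author)

## Self-join pairs up all authors of a collection; 'author_1 < author_2'
## drops self-pairs and keeps each unordered pair only once
collaborations <- inner_join(
    author_names,
    author_names,
    by = "collection_id",
    suffix = c("_1", "_2"),
    relationship = "many-to-many"
) |>
    filter(author_1 < author_2) |>
    count(author_1, author_2, name = "n_collections", sort = TRUE)
collaborations
```

The resulting table is an edge list (two authors and a weight), which is the usual input for graph visualization. Note that collections with hundreds of authors generate very many pairs, and that the ambiguity of author names discussed below applies here too.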

2.4 Duplicate collection-author combinations

Here are the authors

authors <- authors()
authors
#> # A tibble: 5,109 × 4
#>    collection_id                        family   given      consortium
#>    <chr>                                <chr>    <chr>      <chr>     
#>  1 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Zhu      Kaiyi      <NA>      
#>  2 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Bendl    Jaroslav   <NA>      
#>  3 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Rahman   Samir      <NA>      
#>  4 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Vicari   James M.   <NA>      
#>  5 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Coleman  Claire     <NA>      
#>  6 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Clarence Tereza     <NA>      
#>  7 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Latouche Ovaun      <NA>      
#>  8 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Tsankova Nadejda M. <NA>      
#>  9 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Li       Aiqun      <NA>      
#> 10 ceb895f4-ff9f-403a-b7c3-187a9657ac2c Brennand Kristen J. <NA>      
#> # ℹ 5,099 more rows

There are 5109 collection-author combinations. We expect these to be distinct (each row identifying a unique collection-author combination), but this is not the case:

nrow(authors) == nrow(distinct(authors))
#> [1] FALSE

Duplicated data are

authors |> 
    count(collection_id, family, given, consortium, sort = TRUE) |>
    filter(n > 1)
#> # A tibble: 73 × 5
#>    collection_id                        family     given        consortium     n
#>    <chr>                                <chr>      <chr>        <chr>      <int>
#>  1 e5f58829-1a66-40b5-a624-9046778e74f5 Pisco      Angela Oliv… <NA>           4
#>  2 e5f58829-1a66-40b5-a624-9046778e74f5 Crasta     Sheela       <NA>           3
#>  3 e5f58829-1a66-40b5-a624-9046778e74f5 Swift      Michael      <NA>           3
#>  4 e5f58829-1a66-40b5-a624-9046778e74f5 Travaglini Kyle J.      <NA>           3
#>  5 e5f58829-1a66-40b5-a624-9046778e74f5 de Morree  Antoine      <NA>           3
#>  6 51544e44-293b-4c2b-8c26-560678423380 Betts      Michael R.   <NA>           2
#>  7 51544e44-293b-4c2b-8c26-560678423380 Faryabi    Robert B.    <NA>           2
#>  8 51544e44-293b-4c2b-8c26-560678423380 Fasolino   Maria        <NA>           2
#>  9 51544e44-293b-4c2b-8c26-560678423380 Feldman    Michael      <NA>           2
#> 10 51544e44-293b-4c2b-8c26-560678423380 Goldman    Naomi        <NA>           2
#> # ℹ 63 more rows

Discover details of the first duplicated collection, e5f58829-1a66-40b5-a624-9046778e74f5

duplicate_authors <-
    collections() |>
    filter(collection_id == "e5f58829-1a66-40b5-a624-9046778e74f5")
duplicate_authors
#> # A tibble: 1 × 18
#>   collection_id     collection_version_id collection_url consortia contact_email
#>   <chr>             <chr>                 <chr>          <list>    <chr>        
#> 1 e5f58829-1a66-40… 0a47227f-aab6-4c3d-b… https://cellx… <list>    angela.pisco…
#> # ℹ 13 more variables: contact_name <chr>, curator_name <chr>,
#> #   description <chr>, doi <chr>, links <list>, name <chr>,
#> #   publisher_metadata <list>, revising_in <lgl>, revision_of <lgl>,
#> #   visibility <chr>, created_at <date>, published_at <date>, revised_at <date>

The author information comes from the publisher_metadata column

publisher_metadata <-
    duplicate_authors |>
    pull(publisher_metadata)

This is a ‘list-of-lists’, with relevant information as elements in the first list

names(publisher_metadata[[1]])
#> [1] "authors"         "is_preprint"     "journal"         "published_at"   
#> [5] "published_day"   "published_month" "published_year"

The authors field holds one record per author; there are 221 records

length(publisher_metadata[[1]][["authors"]])
#> [1] 221

Inspection shows that there are four authors with family name Pisco and given name Angela Oliveira: it appears that the data provided by CZI indeed includes duplicate author names.

From a pragmatic perspective, it might make sense to remove duplicate entries from authors before down-stream analysis.

deduplicated_authors <- distinct(authors)

Tools that I have found useful when working with list-of-lists style data are listviewer::jsonedit() for visualization, and rjsoncons for filtering and querying these data using JSONpointer, JSONpath, or JMESpath expressions (a more R-centric tool is the purrr package).
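As an illustration of the purrr route, the sketch below finds repeated author names in a list-of-lists. The toy_metadata object is a made-up stand-in that only mimics the shape described above (an authors element holding one list per author, with family and given fields); it is not the actual CZI data.

```r
library(purrr)

## Hypothetical stand-in for one element of publisher_metadata
toy_metadata <- list(
    authors = list(
        list(family = "Alpha", given = "A."),
        list(family = "Beta", given = "B."),
        list(family = "Alpha", given = "A.")
    ),
    journal = "Toy Journal"
)

## Pull one field out of every author record
family <- map_chr(toy_metadata$authors, "family")
given <- map_chr(toy_metadata$authors, "given")

## Names that occur more than once
full_name <- paste(family, given)
duplicated_names <- unique(full_name[duplicated(full_name)])
duplicated_names
```

map_chr() fails if a record lacks the requested field, which is a useful check in itself; real records may be less regular than this toy example.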

2.4.1 What is an ‘author’?

The combination of family and given name may refer to two (or more) different individuals (e.g., two individuals named ‘Martin Morgan’), or a single individual may be recorded under two different names (e.g., given name sometimes ‘Martin’ and sometimes ‘Martin T.’). It is not clear how this could be resolved; recording ORCID identifiers might help with disambiguation.

3 Case study: using ontology to identify datasets

This case study was developed in response to the following Slack question:

CELLxGENE’s webpage is using different ontologies and displaying them in an easy to interogate manner (choosing amongst 3 possible coarseness for cell types, tissues and age) I was wondering if this simplified tree of the 3 subgroups for cell type, tissue and age categories was available somewhere?

As indicated in the question, CELLxGENE provides some access to ontologies through a hand-curated three-tiered classification of specific facets; the tiers can be retrieved from publicly available code, but one might want to develop a more flexible or principled approach.

CELLxGENE dataset facets like ‘disease’ and ‘cell type’ use terms from ontologies. Ontologies arrange terms in directed acyclic graphs, and this structure can help identify related datasets. For instance, one might be interested in cancer-related datasets (derived from the ‘carcinoma’ term in the corresponding ontology) in general, rather than, e.g., ‘B-cell non-Hodgkins lymphoma’.
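The idea of selecting datasets by a general term can be sketched in base R. Everything in the example is hypothetical: the TOY: identifiers, the edges between them, and the dataset annotations are made up and are not drawn from any real ontology. The descendants() helper is an illustrative function, not part of cellxgenedp or OLSr; it walks the graph from a term to all terms that derive from it.

```r
## Hypothetical child -> parent edges; TOY:4 has two parents, as a DAG allows
toy_edges <- data.frame(
    child  = c("TOY:2", "TOY:3", "TOY:4", "TOY:4", "TOY:5"),
    parent = c("TOY:1", "TOY:1", "TOY:2", "TOY:3", "TOY:9")
)

## Collect every term reachable from `term` by following edges downward
descendants <- function(term, edges) {
    found <- character()
    frontier <- term
    while (length(frontier) > 0) {
        children <- unique(edges$child[edges$parent %in% frontier])
        frontier <- setdiff(children, found)  # visit each term only once
        found <- union(found, frontier)
    }
    found
}

## Hypothetical dataset annotations
toy_dataset_terms <- data.frame(
    dataset_id = c("d1", "d2", "d3"),
    term = c("TOY:4", "TOY:5", "TOY:1")
)

## Datasets annotated with TOY:1 or any term derived from it
related <- c("TOY:1", descendants("TOY:1", toy_edges))
toy_dataset_terms[toy_dataset_terms$term %in% related, ]
```

Dataset d1 is selected even though it is annotated with a specific term two levels below TOY:1, which is the kind of query that a flat filter on labels cannot express.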

In exploring this question in R, I found myself developing the OLSr package to query and process ontologies from the EMBL-EBI Ontology Lookup Service. See the ‘Case Study: CELLxGENE Ontologies’ article in the OLSr package for full details.

Session information

#> R Under development (unstable) (2024-01-16 r85808)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] cellxgenedp_1.7.2           dplyr_1.1.4                
#>  [3] SingleCellExperiment_1.25.0 SummarizedExperiment_1.33.2
#>  [5] Biobase_2.63.0              GenomicRanges_1.55.1       
#>  [7] GenomeInfoDb_1.39.5         IRanges_2.37.1             
#>  [9] S4Vectors_0.41.3            BiocGenerics_0.49.1        
#> [11] MatrixGenerics_1.15.0       matrixStats_1.2.0          
#> [13] zellkonverter_1.13.2        BiocStyle_2.31.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] dir.expiry_1.11.0       xfun_0.41               bslib_0.6.1            
#>  [4] htmlwidgets_1.6.4       rhdf5_2.47.2            lattice_0.22-5         
#>  [7] rhdf5filters_1.15.1     rjsoncons_1.1.0         vctrs_0.6.5            
#> [10] tools_4.4.0             bitops_1.0-7            generics_0.1.3         
#> [13] curl_5.2.0              parallel_4.4.0          tibble_3.2.1           
#> [16] fansi_1.0.6             pkgconfig_2.0.3         Matrix_1.6-5           
#> [19] lifecycle_1.0.4         GenomeInfoDbData_1.2.11 compiler_4.4.0         
#> [22] httpuv_1.6.13           htmltools_0.5.7         sass_0.4.8             
#> [25] RCurl_1.98-1.14         yaml_2.3.8              tidyr_1.3.0            
#> [28] later_1.3.2             pillar_1.9.0            crayon_1.5.2           
#> [31] jquerylib_0.1.4         ellipsis_0.3.2          DT_0.31                
#> [34] DelayedArray_0.29.0     cachem_1.0.8            abind_1.4-5            
#> [37] mime_0.12               basilisk_1.15.2         tidyselect_1.2.0       
#> [40] digest_0.6.34           purrr_1.0.2             bookdown_0.37          
#> [43] fastmap_1.1.1           grid_4.4.0              cli_3.6.2              
#> [46] SparseArray_1.3.3       magrittr_2.0.3          S4Arrays_1.3.2         
#> [49] utf8_1.2.4              withr_3.0.0             promises_1.2.1         
#> [52] filelock_1.0.3          httr_1.4.7              rmarkdown_2.25         
#> [55] XVector_0.43.1          reticulate_1.34.0       png_0.1-8              
#> [58] HDF5Array_1.31.1        shiny_1.8.0             evaluate_0.23          
#> [61] knitr_1.45              basilisk.utils_1.15.1   rlang_1.1.3            
#> [64] Rcpp_1.0.12             xtable_1.8-4            glue_1.7.0             
#> [67] BiocManager_1.30.22     jsonlite_1.8.8          Rhdf5lib_1.25.1        
#> [70] R6_2.5.1                zlibbioc_1.49.0