Case study: authors & datasets
Challenge and solution
This case study arose from a question on the CZI Science Community
Slack. A user asked:
“Hi! Is it possible to search CELLxGENE and identify all datasets by
a specific author or set of authors?”
Unfortunately, this is not possible from the CELLxGENE web site –
authors are only associated with collections, and collections can only
be sorted or filtered by title (or publication / tissue / disease /
organism).
A cellxgenedp solution uses authors() to discover authors and their
collections, and joins this information to datasets().
library(cellxgenedp)
library(dplyr)
## one row per author-dataset combination
author_datasets <- left_join(
    authors(),
    datasets(),
    by = "collection_id",
    relationship = "many-to-many"
)
author_datasets
#> # A tibble: 46,398 × 35
#> collection_id family given consortium dataset_id dataset_version_id donor_id
#> <chr> <chr> <chr> <chr> <chr> <chr> <list>
#> 1 59c9ecfe-c47d… Yang Andr… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 2 59c9ecfe-c47d… Yang Andr… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> 3 59c9ecfe-c47d… Kern Fabi… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 4 59c9ecfe-c47d… Kern Fabi… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> 5 59c9ecfe-c47d… Losada Patr… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 6 59c9ecfe-c47d… Losada Patr… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> 7 59c9ecfe-c47d… Agam Maay… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 8 59c9ecfe-c47d… Agam Maay… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> 9 59c9ecfe-c47d… Maat Chri… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 10 59c9ecfe-c47d… Maat Chri… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> # ℹ 46,388 more rows
#> # ℹ 28 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> # cell_type <list>, citation <chr>, default_embedding <chr>,
#> # development_stage <list>, disease <list>, embeddings <list>,
#> # explorer_url <chr>, feature_biotype <list>, feature_count <int>,
#> # feature_reference <list>, is_primary_data <list>,
#> # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>, …
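The shape of this join can be seen on a small scale. The sketch below uses invented stand-ins for authors() and datasets() (the tables toy_authors and toy_datasets and all of their values are assumptions for illustration, not CELLxGENE data). Each author row is repeated once for every dataset in the same collection.

```r
library(dplyr)

# Toy stand-ins for authors() and datasets(); all values are invented
toy_authors <- tibble(
    collection_id = c("c1", "c1", "c2"),
    family = c("Yang", "Kern", "Yang")
)
toy_datasets <- tibble(
    collection_id = c("c1", "c1", "c2"),
    dataset_id = c("d1", "d2", "d3")
)

# A collection has several authors and several datasets, so rows on
# both sides of the join can match more than once
toy_author_datasets <- left_join(
    toy_authors,
    toy_datasets,
    by = "collection_id",
    relationship = "many-to-many"
)
toy_author_datasets
```

Collection c1 contributes 2 authors x 2 datasets = 4 rows and c2 contributes 1 row. The relationship argument is available in dplyr 1.1.0 and later; without it, those versions warn when a many-to-many match is detected.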
author_datasets provides a convenient point from which to make basic
queries, e.g., finding the authors contributing the most datasets.
author_datasets |>
    count(family, given, sort = TRUE)
#> # A tibble: 4,204 × 3
#> family given n
#> <chr> <chr> <int>
#> 1 Casper Tamara 232
#> 2 Dee Nick 232
#> 3 Macosko Evan Z. 230
#> 4 Chen Fei 226
#> 5 Ding Song-Lin 226
#> 6 Murray Evan 226
#> 7 Hirschstein Daniel 217
#> 8 Travaglini Kyle J. 202
#> 9 Nyhus Julie 201
#> 10 Teichmann Sarah A. 199
#> # ℹ 4,194 more rows
Perhaps one is interested in the most prolific authors based on
‘collections’, rather than ‘datasets’. The five most prolific authors
by collection are
prolific_authors <-
    authors() |>
    count(family, given, sort = TRUE) |>
    slice(1:5)
prolific_authors
#> # A tibble: 5 × 3
#> family given n
#> <chr> <chr> <int>
#> 1 Teichmann Sarah A. 25
#> 2 Regev Aviv 14
#> 3 Haniffa Muzlifah 13
#> 4 Meyer Kerstin B. 13
#> 5 Polanski Krzysztof 13
The datasets associated with these authors are
right_join(
    author_datasets,
    prolific_authors,
    by = c("family", "given")
)
#> # A tibble: 509 × 36
#> collection_id family given consortium dataset_id dataset_version_id donor_id
#> <chr> <chr> <chr> <chr> <chr> <chr> <list>
#> 1 b52eb423-5d0d… Polan… Krzy… <NA> f75f2ff4-… 76399757-d131-4a4… <chr>
#> 2 b52eb423-5d0d… Polan… Krzy… <NA> ed852810-… e394387d-fdb3-4a1… <chr>
#> 3 b52eb423-5d0d… Polan… Krzy… <NA> d4e69e01-… fcc03458-cf50-425… <chr>
#> 4 b52eb423-5d0d… Polan… Krzy… <NA> 9d584fcb-… 0f5dba64-8621-420… <chr>
#> 5 b52eb423-5d0d… Polan… Krzy… <NA> 84f1a631-… 47d7cdd8-0895-483… <chr>
#> 6 b52eb423-5d0d… Polan… Krzy… <NA> 78fd69d2-… 98850cc8-8c09-466… <chr>
#> 7 b52eb423-5d0d… Polan… Krzy… <NA> 572f3f3e-… 54ec48d6-d115-40c… <chr>
#> 8 b52eb423-5d0d… Polan… Krzy… <NA> 1009f384-… 324c7c08-5399-493… <chr>
#> 9 b52eb423-5d0d… Teich… Sara… <NA> f75f2ff4-… 76399757-d131-4a4… <chr>
#> 10 b52eb423-5d0d… Teich… Sara… <NA> ed852810-… e394387d-fdb3-4a1… <chr>
#> # ℹ 499 more rows
#> # ℹ 29 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> # cell_type <list>, citation <chr>, default_embedding <chr>,
#> # development_stage <list>, disease <list>, embeddings <list>,
#> # explorer_url <chr>, feature_biotype <list>, feature_count <int>,
#> # feature_reference <list>, is_primary_data <list>,
#> # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>, …
Alternatively, one might be interested in specific authors. This is
most easily accomplished with a simple filter on author_datasets, e.g.,
author_datasets |>
    filter(
        family %in% c("Teichmann", "Regev", "Haniffa")
    )
#> # A tibble: 337 × 35
#> collection_id family given consortium dataset_id dataset_version_id donor_id
#> <chr> <chr> <chr> <chr> <chr> <chr> <list>
#> 1 b52eb423-5d0d… Teich… Sara… <NA> f75f2ff4-… 76399757-d131-4a4… <chr>
#> 2 b52eb423-5d0d… Teich… Sara… <NA> ed852810-… e394387d-fdb3-4a1… <chr>
#> 3 b52eb423-5d0d… Teich… Sara… <NA> d4e69e01-… fcc03458-cf50-425… <chr>
#> 4 b52eb423-5d0d… Teich… Sara… <NA> 9d584fcb-… 0f5dba64-8621-420… <chr>
#> 5 b52eb423-5d0d… Teich… Sara… <NA> 84f1a631-… 47d7cdd8-0895-483… <chr>
#> 6 b52eb423-5d0d… Teich… Sara… <NA> 78fd69d2-… 98850cc8-8c09-466… <chr>
#> 7 b52eb423-5d0d… Teich… Sara… <NA> 572f3f3e-… 54ec48d6-d115-40c… <chr>
#> 8 b52eb423-5d0d… Teich… Sara… <NA> 1009f384-… 324c7c08-5399-493… <chr>
#> 9 793fdaec-5067… Regev Aviv <NA> 86282760-… f4915942-787b-405… <chr>
#> 10 793fdaec-5067… Regev Aviv <NA> 471647b3-… 37188227-b8a7-4a7… <chr>
#> # ℹ 327 more rows
#> # ℹ 28 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> # cell_type <list>, citation <chr>, default_embedding <chr>,
#> # development_stage <list>, disease <list>, embeddings <list>,
#> # explorer_url <chr>, feature_biotype <list>, feature_count <int>,
#> # feature_reference <list>, is_primary_data <list>,
#> # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>, …
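or more carefully by constructing a data.frame of family and given
names, and performing a join with author_datasets. The join matches on
both names, so it excludes other authors who merely share a family
name (327 rows, compared with 337 for the filter on family name alone).
authors_of_interest <-
    tibble(
        family = c("Teichmann", "Regev", "Haniffa"),
        given = c("Sarah A.", "Aviv", "Muzlifah")
    )
right_join(
    author_datasets,
    authors_of_interest,
    by = c("family", "given")
)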
#> # A tibble: 327 × 35
#> collection_id family given consortium dataset_id dataset_version_id donor_id
#> <chr> <chr> <chr> <chr> <chr> <chr> <list>
#> 1 b52eb423-5d0d… Teich… Sara… <NA> f75f2ff4-… 76399757-d131-4a4… <chr>
#> 2 b52eb423-5d0d… Teich… Sara… <NA> ed852810-… e394387d-fdb3-4a1… <chr>
#> 3 b52eb423-5d0d… Teich… Sara… <NA> d4e69e01-… fcc03458-cf50-425… <chr>
#> 4 b52eb423-5d0d… Teich… Sara… <NA> 9d584fcb-… 0f5dba64-8621-420… <chr>
#> 5 b52eb423-5d0d… Teich… Sara… <NA> 84f1a631-… 47d7cdd8-0895-483… <chr>
#> 6 b52eb423-5d0d… Teich… Sara… <NA> 78fd69d2-… 98850cc8-8c09-466… <chr>
#> 7 b52eb423-5d0d… Teich… Sara… <NA> 572f3f3e-… 54ec48d6-d115-40c… <chr>
#> 8 b52eb423-5d0d… Teich… Sara… <NA> 1009f384-… 324c7c08-5399-493… <chr>
#> 9 793fdaec-5067… Regev Aviv <NA> 86282760-… f4915942-787b-405… <chr>
#> 10 793fdaec-5067… Regev Aviv <NA> 471647b3-… 37188227-b8a7-4a7… <chr>
#> # ℹ 317 more rows
#> # ℹ 28 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> # cell_type <list>, citation <chr>, default_embedding <chr>,
#> # development_stage <list>, disease <list>, embeddings <list>,
#> # explorer_url <chr>, feature_biotype <list>, feature_count <int>,
#> # feature_reference <list>, is_primary_data <list>,
#> # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>, …
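The difference between the two selections can be sketched with a toy table. The names and dataset identifiers below are invented, including a hypothetical second author who shares a family name with an author of interest.

```r
library(dplyr)

# Invented author-dataset rows; 'Another' is a hypothetical author who
# happens to share the family name 'Regev'
toy_author_datasets <- tibble(
    family = c("Regev", "Regev", "Haniffa"),
    given = c("Aviv", "Another", "Muzlifah"),
    dataset_id = c("d1", "d2", "d3")
)
authors_of_interest <- tibble(
    family = c("Regev", "Haniffa"),
    given = c("Aviv", "Muzlifah")
)

# Filtering on family name alone keeps the namesake
by_family <- toy_author_datasets |>
    filter(family %in% authors_of_interest$family)

# Joining on family and given name keeps only exact matches
by_full_name <- right_join(
    toy_author_datasets,
    authors_of_interest,
    by = c("family", "given")
)

nrow(by_family)
nrow(by_full_name)
```

Note that right_join() also keeps any author of interest who has no match in the left-hand table, with NA in the remaining columns; inner_join() would drop such authors instead.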
Areas of interest
There are several interesting questions that suggest themselves, and
several areas where some additional work is required.
It might be interesting to identify authors working on similar
diseases, or other areas of interest. The disease column in the
author_datasets table is a list.
author_datasets |>
    select(family, given, dataset_id, disease)
#> # A tibble: 46,398 × 4
#> family given dataset_id disease
#> <chr> <chr> <chr> <list>
#> 1 Yang Andrew C. 595c9010-99ec-462d-b6a1-2b2fe5407871 <list [4]>
#> 2 Yang Andrew C. 2f05ab20-a092-4bab-9276-3e0eb24e3fee <list [9]>
#> 3 Kern Fabian 595c9010-99ec-462d-b6a1-2b2fe5407871 <list [4]>
#> 4 Kern Fabian 2f05ab20-a092-4bab-9276-3e0eb24e3fee <list [9]>
#> 5 Losada Patricia M. 595c9010-99ec-462d-b6a1-2b2fe5407871 <list [4]>
#> 6 Losada Patricia M. 2f05ab20-a092-4bab-9276-3e0eb24e3fee <list [9]>
#> 7 Agam Maayan R. 595c9010-99ec-462d-b6a1-2b2fe5407871 <list [4]>
#> 8 Agam Maayan R. 2f05ab20-a092-4bab-9276-3e0eb24e3fee <list [9]>
#> 9 Maat Christina A. 595c9010-99ec-462d-b6a1-2b2fe5407871 <list [4]>
#> 10 Maat Christina A. 2f05ab20-a092-4bab-9276-3e0eb24e3fee <list [9]>
#> # ℹ 46,388 more rows
This is because a single dataset may involve more than one
disease. Furthermore, each entry in the list contains two elements,
the label and ontology_term_id of the disease. There are two
approaches to working with this data.
The first approach uses facilities in cellxgenedp, as outlined in an
accompanying article. Start by discovering the possible diseases.
facets(db(), "disease")
#> # A tibble: 119 × 4
#> facet label ontology_term_id n
#> <chr> <chr> <chr> <int>
#> 1 disease normal PATO:0000461 1139
#> 2 disease COVID-19 MONDO:0100096 62
#> 3 disease dementia MONDO:0001627 50
#> 4 disease myocardial infarction MONDO:0005068 27
#> 5 disease diabetic kidney disease MONDO:0005016 26
#> 6 disease autosomal dominant polycystic kidney disease MONDO:0004691 24
#> 7 disease Alzheimer disease MONDO:0004975 15
#> 8 disease small cell lung carcinoma MONDO:0008433 12
#> 9 disease lung adenocarcinoma MONDO:0005061 11
#> 10 disease basal cell carcinoma MONDO:0020804 10
#> # ℹ 109 more rows
Focus on COVID-19, and use facets_filter() to select relevant
author-dataset combinations.
author_datasets |>
    filter(facets_filter(disease, "label", "COVID-19"))
#> # A tibble: 1,812 × 35
#> collection_id family given consortium dataset_id dataset_version_id donor_id
#> <chr> <chr> <chr> <chr> <chr> <chr> <list>
#> 1 59c9ecfe-c47d… Yang Andr… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 2 59c9ecfe-c47d… Yang Andr… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> 3 59c9ecfe-c47d… Kern Fabi… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 4 59c9ecfe-c47d… Kern Fabi… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> 5 59c9ecfe-c47d… Losada Patr… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 6 59c9ecfe-c47d… Losada Patr… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> 7 59c9ecfe-c47d… Agam Maay… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 8 59c9ecfe-c47d… Agam Maay… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> 9 59c9ecfe-c47d… Maat Chri… <NA> 595c9010-… b4645848-e3d8-492… <chr>
#> 10 59c9ecfe-c47d… Maat Chri… <NA> 2f05ab20-… 3b715360-b0ae-4e5… <chr>
#> # ℹ 1,802 more rows
#> # ℹ 28 more variables: assay <list>, batch_condition <list>, cell_count <int>,
#> # cell_type <list>, citation <chr>, default_embedding <chr>,
#> # development_stage <list>, disease <list>, embeddings <list>,
#> # explorer_url <chr>, feature_biotype <list>, feature_count <int>,
#> # feature_reference <list>, is_primary_data <list>,
#> # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>, …
Authors contributing to these datasets are
author_datasets |>
    filter(facets_filter(disease, "label", "COVID-19")) |>
    count(family, given, sort = TRUE)
#> # A tibble: 817 × 3
#> family given n
#> <chr> <chr> <int>
#> 1 Farber Donna L. 29
#> 2 Guo Xinzheng V. 28
#> 3 Saqi Anjali 28
#> 4 Baldwin Matthew R. 27
#> 5 Chait Michael 27
#> 6 Connors Thomas J. 27
#> 7 Davis-Porada Julia 27
#> 8 Dogra Pranay 27
#> 9 Gray Joshua I. 27
#> 10 Idzikowski Emma 27
#> # ℹ 807 more rows
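The idea behind filtering on a list column can be illustrated without cellxgenedp. The following is a minimal sketch, not the implementation of facets_filter(): has_label() is a hypothetical helper written for this example, and the toy table is invented, assuming only that each dataset's disease entry is a list of label / ontology_term_id pairs as described above.

```r
library(dplyr)

# Toy table with a list column shaped like `disease`: one list per
# dataset, each element holding `label` and `ontology_term_id`
toy <- tibble(
    dataset_id = c("d1", "d2"),
    disease = list(
        list(
            list(label = "COVID-19", ontology_term_id = "MONDO:0100096"),
            list(label = "normal", ontology_term_id = "PATO:0000461")
        ),
        list(
            list(label = "normal", ontology_term_id = "PATO:0000461")
        )
    )
)

# Hypothetical helper (not part of cellxgenedp): TRUE for each dataset
# where at least one entry has the requested label
has_label <- function(x, value) {
    vapply(x, function(entries) {
        any(vapply(
            entries,
            function(entry) identical(entry$label, value),
            logical(1)
        ))
    }, logical(1))
}

toy |>
    filter(has_label(disease, "COVID-19"))
```

A dataset is kept when any of its diseases matches, so a dataset annotated with both COVID-19 and normal samples is included.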
A second approach is to follow the practices in R for Data
Science. The disease column can be ‘unnested’ twice, the first time to
expand the author_datasets table for each disease, and the second time
to separate the two columns of each disease.
author_dataset_diseases <-
    author_datasets |>
    select(family, given, dataset_id, disease) |>
    tidyr::unnest_longer(disease) |>
    tidyr::unnest_wider(disease)
author_dataset_diseases
#> # A tibble: 60,968 × 5
#> family given dataset_id label ontology_term_id
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Yang Andrew C. 595c9010-99ec-462d-b6a1-2b2fe5407871 COVID… MONDO:0100096
#> 2 Yang Andrew C. 595c9010-99ec-462d-b6a1-2b2fe5407871 aspir… MONDO:0000265
#> 3 Yang Andrew C. 595c9010-99ec-462d-b6a1-2b2fe5407871 influ… MONDO:0005812
#> 4 Yang Andrew C. 595c9010-99ec-462d-b6a1-2b2fe5407871 malig… MONDO:0009831
#> 5 Yang Andrew C. 2f05ab20-a092-4bab-9276-3e0eb24e3fee COVID… MONDO:0100096
#> 6 Yang Andrew C. 2f05ab20-a092-4bab-9276-3e0eb24e3fee breas… MONDO:0007254
#> 7 Yang Andrew C. 2f05ab20-a092-4bab-9276-3e0eb24e3fee cardi… MONDO:0004994
#> 8 Yang Andrew C. 2f05ab20-a092-4bab-9276-3e0eb24e3fee chron… MONDO:0005002
#> 9 Yang Andrew C. 2f05ab20-a092-4bab-9276-3e0eb24e3fee heart… MONDO:0005267
#> 10 Yang Andrew C. 2f05ab20-a092-4bab-9276-3e0eb24e3fee influ… MONDO:0005812
#> # ℹ 60,958 more rows
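The two unnesting steps are easier to follow on a toy table. In this sketch the rows are invented, and the list column is assumed to have the structure described earlier (a list of label / ontology_term_id pairs per dataset).

```r
library(dplyr)
library(tidyr)

# Invented rows; `disease` mimics the nested structure described above
toy <- tibble(
    family = c("Yang", "Kern"),
    dataset_id = c("d1", "d2"),
    disease = list(
        list(
            list(label = "COVID-19", ontology_term_id = "MONDO:0100096"),
            list(label = "normal", ontology_term_id = "PATO:0000461")
        ),
        list(
            list(label = "normal", ontology_term_id = "PATO:0000461")
        )
    )
)

# Step 1: one row per dataset-disease pair; `disease` is still a list,
# but each element is now a single label / ontology_term_id pair
long <- toy |>
    unnest_longer(disease)

# Step 2: spread the pair into `label` and `ontology_term_id` columns
wide <- long |>
    unnest_wider(disease)
wide
```

The first dataset has two diseases and the second has one, so the result has three rows; this is why the unnested table has more rows than author_datasets.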
Author-dataset combinations associated with COVID-19, and contributors
to these datasets, are
author_dataset_diseases |>
    filter(label == "COVID-19")
author_dataset_diseases |>
    filter(label == "COVID-19") |>
    count(family, given, sort = TRUE)
These computations give the same results as the earlier approach using
functionality in cellxgenedp.
A further resource that might be of interest is the OLSr package
article illustrating how the ontologies used by CELLxGENE can be
manipulated to, e.g., identify studies with terms that derive from a
common term (e.g., all disease terms related to ‘carcinoma’).
Collaboration
TODO.
It might be interesting to know which authors have collaborated with
one another. This can be computed from the author_datasets table,
following approaches developed in the grantpubcite package to
identify collaborations between projects in the NIH-funded ITCR
program. See the graph visualization in the ITCR collaboration
section for inspiration.
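One possible starting point, sketched here under stated assumptions rather than taken from grantpubcite, is a self-join on collection_id that pairs up authors of the same collection. The table toy_authors and its placeholder author labels are invented; with real data, the de-duplicated result of authors() could play the same role.

```r
library(dplyr)

# Invented collection-author rows with placeholder author labels
toy_authors <- tibble(
    collection_id = c("c1", "c1", "c1", "c2", "c2"),
    author = c("A", "B", "C", "A", "B")
) |>
    distinct()

# Pair every author with every other author of the same collection
coauthor_pairs <- inner_join(
    toy_authors,
    toy_authors,
    by = "collection_id",
    suffix = c("_1", "_2"),
    relationship = "many-to-many"
) |>
    # keep each unordered pair once and drop self-pairs
    filter(author_1 < author_2) |>
    # number of collections each pair has in common
    count(author_1, author_2, name = "n_collections", sort = TRUE)
coauthor_pairs
```

The resulting table is an edge list (two authors and a weight), which is the usual input for graph packages. The self-join grows quadratically with the number of authors per collection, so collections with hundreds of authors produce many pairs.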
Duplicate collection-author combinations
Here are the authors
authors <- authors()
authors
#> # A tibble: 5,465 × 4
#> collection_id family given consortium
#> <chr> <chr> <chr> <chr>
#> 1 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Yang Andrew C. <NA>
#> 2 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Kern Fabian <NA>
#> 3 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Losada Patricia M. <NA>
#> 4 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Agam Maayan R. <NA>
#> 5 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Maat Christina A. <NA>
#> 6 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Schmartz Georges P. <NA>
#> 7 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Fehlmann Tobias <NA>
#> 8 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Stein Julian A. <NA>
#> 9 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Schaum Nicholas <NA>
#> 10 59c9ecfe-c47d-4a6a-bab0-895cc0c1942b Lee Davis P. <NA>
#> # ℹ 5,455 more rows
There are 5465 collection-author combinations. We expect
these to be distinct (each row identifying a unique collection-author
combination). But this is not the case
nrow(authors) == nrow(distinct(authors))
#> [1] FALSE
Duplicated data are
authors |>
    count(collection_id, family, given, consortium, sort = TRUE) |>
    filter(n > 1)
#> # A tibble: 73 × 5
#> collection_id family given consortium n
#> <chr> <chr> <chr> <chr> <int>
#> 1 e5f58829-1a66-40b5-a624-9046778e74f5 Pisco Angela Oliv… <NA> 4
#> 2 e5f58829-1a66-40b5-a624-9046778e74f5 Crasta Sheela <NA> 3
#> 3 e5f58829-1a66-40b5-a624-9046778e74f5 Swift Michael <NA> 3
#> 4 e5f58829-1a66-40b5-a624-9046778e74f5 Travaglini Kyle J. <NA> 3
#> 5 e5f58829-1a66-40b5-a624-9046778e74f5 de Morree Antoine <NA> 3
#> 6 51544e44-293b-4c2b-8c26-560678423380 Betts Michael R. <NA> 2
#> 7 51544e44-293b-4c2b-8c26-560678423380 Faryabi Robert B. <NA> 2
#> 8 51544e44-293b-4c2b-8c26-560678423380 Fasolino Maria <NA> 2
#> 9 51544e44-293b-4c2b-8c26-560678423380 Feldman Michael <NA> 2
#> 10 51544e44-293b-4c2b-8c26-560678423380 Goldman Naomi <NA> 2
#> # ℹ 63 more rows
Discover details of the first duplicated collection,
e5f58829-1a66-40b5-a624-9046778e74f5
duplicate_authors <-
    collections() |>
    filter(collection_id == "e5f58829-1a66-40b5-a624-9046778e74f5")
duplicate_authors
#> # A tibble: 1 × 18
#> collection_id collection_version_id collection_url consortia contact_email
#> <chr> <chr> <chr> <list> <chr>
#> 1 e5f58829-1a66-40… 519f5ac5-1f84-4b48-9… https://cellx… <chr [2]> angela.pisco…
#> # ℹ 13 more variables: contact_name <chr>, curator_name <chr>,
#> # description <chr>, doi <chr>, links <list>, name <chr>,
#> # publisher_metadata <list>, revising_in <lgl>, revision_of <lgl>,
#> # visibility <chr>, created_at <date>, published_at <date>, revised_at <date>
The author information comes from the publisher_metadata column
publisher_metadata <-
    duplicate_authors |>
    pull(publisher_metadata)
This is a ‘list-of-lists’, with relevant information as elements in
the first list
names(publisher_metadata[[1]])
#> [1] "authors" "is_preprint" "journal" "published_at"
#> [5] "published_day" "published_month" "published_year"
and relevant information in the authors field, of which there are 221
length(publisher_metadata[[1]][["authors"]])
#> [1] 221
Inspection shows that there are four authors with family name Pisco
and given name Angela Oliveira: it appears that the data provided by
CZI indeed includes duplicate author names.
From a pragmatic perspective, it might make sense to remove duplicate
entries from authors before down-stream analysis.
deduplicated_authors <- distinct(authors)
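Duplicate author rows matter because they propagate through the join with datasets: an author listed twice in a collection is counted twice for every dataset in that collection. The sketch below shows the effect on invented data; toy_authors, toy_datasets, and the helper count_datasets() are all assumptions made for illustration.

```r
library(dplyr)

# Invented data: author 'A X' is listed twice in collection c1
toy_authors <- tibble(
    collection_id = c("c1", "c1", "c1", "c2"),
    family = c("A", "A", "B", "A"),
    given = c("X", "X", "Y", "X")
)
toy_datasets <- tibble(
    collection_id = c("c1", "c1", "c2"),
    dataset_id = c("d1", "d2", "d3")
)

# Hypothetical helper: join authors to datasets, then count datasets
# per author
count_datasets <- function(authors_tbl) {
    left_join(
        authors_tbl,
        toy_datasets,
        by = "collection_id",
        relationship = "many-to-many"
    ) |>
        count(family, given, name = "n_datasets")
}

raw_counts <- count_datasets(toy_authors)
deduplicated_counts <- count_datasets(distinct(toy_authors))

raw_counts
deduplicated_counts
```

In this toy example, author 'A X' contributes to three datasets, but the duplicated row inflates the count to five until distinct() is applied.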
Tools that I have found useful when working with list-of-lists style
data are listviewer::jsonedit() for visualization, and
rjsoncons for filtering and querying these data using JSONpointer,
JSONpath, or JMESpath expressions (a more R-centric tool is the
purrr package).
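As a small illustration of the purrr route, the sketch below extracts author names from a toy list shaped like one element of publisher_metadata and reports names that occur more than once. The toy values are invented, and the assumption that each author entry has family and given fields is mine; real entries may differ (e.g., consortium authors).

```r
library(purrr)

# Toy stand-in for publisher_metadata[[1]]; all values are invented
toy_metadata <- list(
    authors = list(
        list(family = "A", given = "X"),
        list(family = "B", given = "Y"),
        list(family = "A", given = "X")
    ),
    journal = "Example Journal"
)

# Build one 'given family' string per author entry
full_names <- map_chr(
    toy_metadata[["authors"]],
    \(author) paste(author[["given"]], author[["family"]])
)

# Names that appear more than once in the author list
duplicated_names <- unique(full_names[duplicated(full_names)])
duplicated_names
```

map_chr() fails if an entry lacks one of the fields, which can be a useful signal that the assumed structure does not hold for a particular collection.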
What is an ‘author’?
The combination of family and given name may refer to two (or more)
different individuals (e.g., two individuals named ‘Martin Morgan’),
or a single individual may be recorded under two different names
(e.g., given name sometimes ‘Martin’ and sometimes ‘Martin T.’). It is
not clear how this could be resolved; recording ORCID identifiers
might help with disambiguation.