A comprehensive guide to using the rsemmed package for exploring the Semantic MEDLINE database.
The rsemmed package lets users programmatically explore connections between the biological concepts present in the Semantic MEDLINE database (Kilicoglu et al. 2011).
The Semantic MEDLINE database (SemMedDB) is a collection of annotations of sentences from the abstracts of articles indexed in PubMed. These annotations take the form of subject-predicate-object triples of information. These triples are also called predications.
An example predication is “Interleukin-12 INTERACTS_WITH IFNA1”. Here, the subject is “Interleukin-12”, the object is “IFNA1” (interferon alpha-1), and the predicate linking the subject and object is “INTERACTS_WITH”. The Semantic MEDLINE database consists of tens of millions of these predications.
Semantic MEDLINE also provides information on the broad categories into which biological concepts (predication subjects and objects) fall. This information is called the semantic type of a concept. The database assigns 4-letter codes to semantic types. For example, “gngm” represents “Gene or Genome”. Every concept in the database has one or more semantic types (abbreviated as “semtypes”).
Note: The information in Semantic MEDLINE is primarily computationally derived, so some of it will seem nonsensical. For example, the semantic types reported for a concept might not quite match the concept itself. The Semantic MEDLINE resource and this package are meant to facilitate an initial window of exploration into the literature. The hope is that this package helps guide more streamlined manual investigations of the literature.
The predications in SemMedDB can be represented in graph form. Nodes represent concepts, and directed edges represent predicates (concept linkers). In particular, the Semantic MEDLINE graph is a directed multigraph because multiple predicates are often present between pairs of nodes (e.g., “A ASSOCIATED_WITH B” and “A INTERACTS_WITH B”). rsemmed relies on the igraph package for efficient graph operations.
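As a minimal sketch of this representation (using made-up predications, not rows from SemMedDB), a small directed multigraph can be built with igraph directly. The data frame and the name g_toy are assumptions introduced only for illustration.

```r
library(igraph)

# Hypothetical toy predications (not taken from SemMedDB)
predications <- data.frame(
  subject   = c("A", "A", "B"),
  object    = c("B", "B", "C"),
  predicate = c("ASSOCIATED_WITH", "INTERACTS_WITH", "INHIBITS"),
  stringsAsFactors = FALSE
)

# The first two columns define the directed edges;
# any remaining columns are stored as edge attributes
g_toy <- graph_from_data_frame(predications, directed = TRUE)

is_directed(g_toy)             # the graph is directed
any(which_multiple(g_toy))     # parallel A -> B edges make it a multigraph
edge_attr(g_toy, "predicate")  # predicates are stored on the edges
```

Each predicate between the same pair of concepts becomes its own edge, which is why parallel edges appear.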
The full data underlying the complete Semantic MEDLINE database is available from this National Library of Medicine site as SQL dump files. In particular, the PREDICATION table is the primary file that is needed to construct the database. More information about the Semantic MEDLINE database is available here.
See the inst/script folder for scripts to perform the following processing of these raw files:
The next section describes details about the processing that occurs in these scripts to generate the graph representation.
In this vignette, we will explore a much smaller subset of the full graph that suffices to show the full functionality of rsemmed.
The graph representation of SemMedDB contains a processed and summarized form of the raw database. The toy example below illustrates the summarization performed.
Subject | Subject semtype | Predicate | Object | Object semtype |
---|---|---|---|---|
A | aapp | INHIBITS | B | gngm |
A | gngm | INHIBITS | B | aapp |
The two rows show two predications that are treated as different predications because the semantic types (“semtypes”) of the subject and object vary. In the processed data, such instances have been collapsed as shown below.
Subject | Subject semtype | Predicate | Object | Object semtype | # instances |
---|---|---|---|---|---|
A | aapp,gngm | INHIBITS | B | aapp,gngm | 2 |
The different semantic types for a particular concept are collapsed into a single comma-separated string that is available via igraph::vertex_attr(g, "semtype").
The “# instances” column indicates that the “A INHIBITS B” predication was observed twice in the database. This piece of information is available as an edge attribute via igraph::edge_attr(g, "num_instances"). Similarly, predicate information is also an edge attribute accessible via igraph::edge_attr(g, "predicate").
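The collapsing in the toy example can be sketched in base R as follows. This is only an illustration of the idea, not the code in the inst/script folder; the column names and helper objects are assumptions.

```r
# Toy predications from the example tables (column names are assumptions)
raw <- data.frame(
  subject         = c("A", "A"),
  subject_semtype = c("aapp", "gngm"),
  predicate       = c("INHIBITS", "INHIBITS"),
  object          = c("B", "B"),
  object_semtype  = c("gngm", "aapp"),
  stringsAsFactors = FALSE
)

# Gather every (concept, semtype) pair seen as a subject or as an object
concepts <- rbind(
  data.frame(name = raw$subject, semtype = raw$subject_semtype),
  data.frame(name = raw$object,  semtype = raw$object_semtype)
)

# One comma-separated semtype string per concept
semtypes <- tapply(concepts$semtype, concepts$name,
                   function(x) paste(sort(unique(x)), collapse = ","))

# Count how often each subject-predicate-object triple occurs
edges <- aggregate(num_instances ~ subject + predicate + object,
                   data = transform(raw, num_instances = 1L), FUN = sum)

semtypes
edges
```

Here both concepts end up with the semtype string "aapp,gngm", and the single collapsed edge has a count of 2, matching the summarized table.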
A note of caution: Be careful when working with edge attributes in the Semantic MEDLINE graph manually. These operations can be very slow because there are over 18 million edges. Working with node/vertex attributes is much faster, but there are still a very large number of nodes (roughly 290,000).
The rest of this vignette will showcase how to use rsemmed functions to explore this graph.
To install rsemmed, start R and enter the following:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rsemmed")
Load the rsemmed package and the g_small object which contains a smaller version of the Semantic MEDLINE database.
library(rsemmed)
## Loading required package: igraph
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
data(g_small)
This loads an object of class igraph named g_small into the workspace. The SemMedDB graph object is a necessary input for most of rsemmed’s functions.
(The full processed graph representation linked above contains an object of class igraph named g.)
The starting point for an rsemmed exploration is to find nodes related to the initial ideas of interest. For example, we may wish to find connections between the ideas “sickle cell trait” and “malaria”.
The rsemmed::find_nodes() function allows you to search for nodes by name. We supply the graph and a regular expression to use in searching through the name attribute of the nodes. Finding the most relevant nodes will generally involve iteration.
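To make the idea of a case-insensitive regular expression search over node names concrete, here is a rough sketch using only igraph and base R on a made-up graph. It approximates the behavior described above and is not the implementation of find_nodes(); the graph g_toy and its edges are hypothetical.

```r
library(igraph)

# Hypothetical toy graph; the edges are invented for illustration
toy_edges <- data.frame(
  from = c("Sickle Cell Trait", "sickle trait", "Infant"),
  to   = c("Malaria", "Malaria", "Malaria"),
  stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(toy_edges, directed = TRUE)

# Case-insensitive regular expression match against the "name" vertex attribute
is_hit <- grepl("sickle", V(g_toy)$name, ignore.case = TRUE)
hits <- V(g_toy)[is_hit]

hits$name
```

The result is an ordinary igraph vertex sequence, so it can be subset with the usual R indexing.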
To find nodes related to the sickle cell trait, we can start by searching for nodes containing the word “sickle”. (Note: searches ignore capitalization.)
nodes_sickle <- find_nodes(g_small, pattern = "sickle")
## This graph was created by an old(er) igraph version.
## Call upgrade_graph() on it to use with the current igraph version
## For now we convert it on the fly...
nodes_sickle
## + 5/1038 vertices, named, from f0c97f6:
## [1] Sickle Cell Anemia Sickle Hemoglobin Sickle Cell Trait
## [4] Sickle cell retinopathy sickle trait
We may decide that only sickle cell anemia and the sickle trait are important. Conventional R subsetting allows us to keep the 3 related nodes:
nodes_sickle <- nodes_sickle[c(1,3,5)]
nodes_sickle
## + 3/1038 vertices, named, from f0c97f6:
## [1] Sickle Cell Anemia Sickle Cell Trait sickle trait
We can also search for nodes related to “malaria”:
nodes_malaria <- find_nodes(g_small, pattern = "malaria")
nodes_malaria
## + 32/1038 vertices, named, from f0c97f6:
## [1] Malaria
## [2] Malaria, Falciparum
## [3] Malaria, Cerebral
## [4] Malaria Vaccines
## [5] Antimalarials
## [6] Malaria, Vivax
## [7] Simian malaria
## [8] Malaria, Avian
## [9] Malarial parasites
## [10] Mixed malaria
## + ... omitted several vertices
There are 32 results, not all of which are printed, so we can display all results by accessing the name attribute of the returned nodes:
nodes_malaria$name
## [1] "Malaria"
## [2] "Malaria, Falciparum"
## [3] "Malaria, Cerebral"
## [4] "Malaria Vaccines"
## [5] "Antimalarials"
## [6] "Malaria, Vivax"
## [7] "Simian malaria"
## [8] "Malaria, Avian"
## [9] "Malarial parasites"
## [10] "Mixed malaria"
## [11] "Plasmodium malariae infection"
## [12] "Malaria antigen"
## [13] "Prescription of prophylactic anti-malarial"
## [14] "Induced malaria"
## [15] "Malaria serology"
## [16] "Algid malaria"
## [17] "Aminoquinoline antimalarial"
## [18] "Plasmodium malariae"
## [19] "Malaria antibody"
## [20] "Malaria screening"
## [21] "Congenital malaria"
## [22] "Ovale malaria"
## [23] "Quartan malaria"
## [24] "Malaria Antibodies Test"
## [25] "Biguanide antimalarial"
## [26] "MALARIA RELAPSE"
## [27] "KIT, TEST, MALARIA"
## [28] "Malarial hepatitis"
## [29] "Malarial pigmentation"
## [30] "Malaria antigen test"
## [31] "Malarial nephrosis"
## [32] "Malaria smear"
Perhaps we only want to keep the nodes that relate to disease. We could use direct subsetting, but another option is to use find_nodes() again with nodes_malaria as the input. Setting the match argument to FALSE allows us to prune unwanted matches from our results.
Below we iteratively prune matches to only keep disease-related results. Though this is not as concise as direct subsetting, it is more transparent about what was removed.
nodes_malaria <- nodes_malaria %>%
    find_nodes(pattern = "anti", match = FALSE) %>%
    find_nodes(pattern = "test", match = FALSE) %>%
    find_nodes(pattern = "screening", match = FALSE) %>%
    find_nodes(pattern = "pigment", match = FALSE) %>%
    find_nodes(pattern = "smear", match = FALSE) %>%
    find_nodes(pattern = "parasite", match = FALSE) %>%
    find_nodes(pattern = "serology", match = FALSE) %>%
    find_nodes(pattern = "vaccine", match = FALSE)
nodes_malaria
## + 17/1038 vertices, named, from f0c97f6:
## [1] Malaria Malaria, Falciparum
## [3] Malaria, Cerebral Malaria, Vivax
## [5] Simian malaria Malaria, Avian
## [7] Mixed malaria Plasmodium malariae infection
## [9] Induced malaria Algid malaria
## [11] Plasmodium malariae Congenital malaria
## [13] Ovale malaria Quartan malaria
## [15] MALARIA RELAPSE Malarial hepatitis
## [17] Malarial nephrosis
The find_nodes() function can also be used with the semtypes argument which allows you to specify a character vector of semantic types to search for. If both pattern and semtypes are provided, they are combined with an OR operation. If you would like them to be combined with an AND operation, nest the calls in sequence.
## malaria OR disease (dsyn)
find_nodes(g_small, pattern = "malaria", semtypes = "dsyn")
## + 317/1038 vertices, named, from f0c97f6:
## [1] Obstruction
## [2] Depressed mood
## [3] Carcinoma
## [4] HIV-1
## [5] Infection
## [6] leukemia
## [7] Neoplasm
## [8] Renal tubular disorder
## [9] Toxic effect
## [10] Vesicle
## + ... omitted several vertices
## malaria AND disease (dsyn)
find_nodes(g_small, pattern = "malaria") %>%
    find_nodes(semtypes = "dsyn")
## + 16/1038 vertices, named, from f0c97f6:
## [1] Malaria Malaria, Falciparum
## [3] Malaria, Cerebral Malaria, Vivax
## [5] Simian malaria Malaria, Avian
## [7] Mixed malaria Plasmodium malariae infection
## [9] Induced malaria Algid malaria
## [11] Congenital malaria Ovale malaria
## [13] Quartan malaria MALARIA RELAPSE
## [15] Malarial hepatitis Malarial nephrosis
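The difference between the OR and AND combinations can be illustrated with plain logical vectors. The node table below is hypothetical (its names and semtype strings are invented for illustration), and has_semtype() is a helper defined here, not an rsemmed function.

```r
# Hypothetical node table; semtypes are comma-separated strings as in the graph
nodes <- data.frame(
  name    = c("Malaria", "Malaria test kit", "Other disease", "Unrelated gene"),
  semtype = c("dsyn", "medd", "dsyn,patf", "gngm"),
  stringsAsFactors = FALSE
)

# Helper (defined here): does any of a node's semtypes appear in `wanted`?
has_semtype <- function(semtype_strings, wanted) {
  vapply(strsplit(semtype_strings, ",", fixed = TRUE),
         function(s) any(s %in% wanted), logical(1))
}

name_match <- grepl("malaria", nodes$name, ignore.case = TRUE)
type_match <- has_semtype(nodes$semtype, "dsyn")

or_result  <- nodes$name[name_match | type_match]  # pattern OR semtype
and_result <- nodes$name[name_match & type_match]  # pattern AND semtype

or_result
and_result
```

The OR result keeps every node that satisfies either condition, which is why it is much larger than the AND result in the output above.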
Finally, you can also select nodes by exact name with the names
argument. (Capitalization is ignored.)
find_nodes(g_small, names = "sickle trait")
## + 1/1038 vertex, named, from f0c97f6:
## [1] sickle trait
find_nodes(g_small, names = "SICKLE trait")
## + 1/1038 vertex, named, from f0c97f6:
## [1] sickle trait
Now that we have nodes related to the ideas of interest, we can develop further understanding by asking the following questions:
To further Aim 1, we can use the rsemmed::find_paths() function. This function takes two sets of nodes from and to (corresponding to the two different ideas of interest) and returns all shortest paths between nodes in from (“source” nodes) and nodes in to (“target” nodes). That is, for every possible combination of a single node in from and a single node in to, all shortest undirected paths between those nodes are found.
paths <- find_paths(graph = g_small, from = nodes_sickle, to = nodes_malaria)
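For intuition about what "all shortest undirected paths" means and how a per-source list of paths can arise, here is a sketch on a made-up graph using igraph's all_shortest_paths(). It is not the implementation of find_paths(); the graph, node names, and the nested-list construction are assumptions for illustration.

```r
library(igraph)

# Hypothetical directed toy graph with two sources (S1, S2) and one target (T1)
toy_edges <- data.frame(
  from = c("S1", "Y",  "X",  "T1", "S2"),
  to   = c("X",  "S1", "T1", "Y",  "X"),
  stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(toy_edges, directed = TRUE)

# For each source node, collect all shortest paths to the target,
# ignoring edge direction (mode = "all")
toy_paths <- lapply(c("S1", "S2"), function(src) {
  sp <- all_shortest_paths(g_toy, from = src, to = "T1", mode = "all")
  # Newer igraph versions expose the paths as `vpaths`; older ones as `res`
  if (!is.null(sp$vpaths)) sp$vpaths else sp$res
})

# Number of shortest paths found for each source node
lengths(toy_paths)
```

In this toy graph S1 reaches T1 through either X or Y, while S2 can only go through X, so the two list elements have different lengths.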
The result of find_paths() is a list with one element for each of the nodes in from. Each element is itself a list of paths between from and to. In igraph, paths are represented as vertex sequences (class igraph.vs).
Recall that nodes_sickle contains the nodes below:
nodes_sickle
## + 3/1038 vertices, named, from f0c97f6:
## [1] Sickle Cell Anemia Sickle Cell Trait sickle trait
Thus, paths is structured as follows:
- paths[[1]] is a list of paths originating from Sickle Cell Anemia.
- paths[[2]] is a list of paths originating from Sickle Cell Trait.
- paths[[3]] is a list of paths originating from sickle trait.
With lengths() we can show the number of shortest paths starting at each of the three source (“from”) nodes:
lengths(paths)
## [1] 956 268 1601
There are two ways to display the information contained in these paths: rsemmed::text_path() and rsemmed::plot_path().
- text_path() displays a text version of a path
- plot_path() displays a graphical version of the path
For example, to show the 100th of the shortest paths originating from the first of the sickle trait nodes (paths[[1]][[100]]), we can use text_path() and plot_path() as below:
this_path <- paths[[1]][[100]]
tp <- text_path(g_small, this_path)
## Sickle Cell Anemia --- pulmonary complications :
## # A tibble: 4 × 5
## from_semtype from via to to_semtype
## <chr> <chr> <chr> <chr> <chr>
## 1 dsyn Sickle Cell Anemia CAUSES pulmonary co… patf
## 2 patf pulmonary complications CAUSES Sickle Cell … dsyn
## 3 patf pulmonary complications COEXISTS_WITH Sickle Cell … dsyn
## 4 patf pulmonary complications MANIFESTATION_OF Sickle Cell … dsyn
##
## pulmonary complications --- Malaria, Falciparum :
## # A tibble: 1 × 5
## from_semtype from via to to_semtype
## <chr> <chr> <chr> <chr> <chr>
## 1 patf pulmonary complications COEXISTS_WITH Malaria, Falcip… dsyn
tp
## [[1]]
## # A tibble: 4 × 5
## from_semtype from via to to_semtype
## <chr> <chr> <chr> <chr> <chr>
## 1 dsyn Sickle Cell Anemia CAUSES pulmonary co… patf
## 2 patf pulmonary complications CAUSES Sickle Cell … dsyn
## 3 patf pulmonary complications COEXISTS_WITH Sickle Cell … dsyn
## 4 patf pulmonary complications MANIFESTATION_OF Sickle Cell … dsyn
##
## [[2]]
## # A tibble: 1 × 5
## from_semtype from via to to_semtype
## <chr> <chr> <chr> <chr> <chr>
## 1 patf pulmonary complications COEXISTS_WITH Malaria, Falcip… dsyn
plot_path(g_small, this_path)
plot_path() plots the subgraph defined by the nodes on the path.
text_path() sequentially shows detailed information about semantic types and predicates for the pairs of nodes on the path. It also invisibly returns a list of tibbles containing the displayed information, where each list element corresponds to a pair of nodes on the path.
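The per-pair tables can be mimicked on a made-up graph by listing, for each consecutive pair of nodes on a path, every edge between them in either direction. This is a sketch of the idea only, not the implementation of text_path(); the graph and the column selection are assumptions.

```r
library(igraph)

# Hypothetical toy graph with predicates stored as an edge attribute
toy_edges <- data.frame(
  from      = c("A", "B", "B", "B"),
  to        = c("B", "A", "A", "C"),
  predicate = c("CAUSES", "CAUSES", "COEXISTS_WITH", "COEXISTS_WITH"),
  stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(toy_edges, directed = TRUE)

# A path given as an ordered vector of node names
path <- c("A", "B", "C")

# Edge list of the graph as a data frame (columns: from, to, predicate)
edge_df <- igraph::as_data_frame(g_toy, what = "edges")

# One table per consecutive pair of nodes, keeping edges in both directions
pair_tables <- lapply(seq_len(length(path) - 1), function(i) {
  u <- path[i]
  v <- path[i + 1]
  keep <- (edge_df$from == u & edge_df$to == v) |
          (edge_df$from == v & edge_df$to == u)
  edge_df[keep, c("from", "predicate", "to")]
})

pair_tables
```

As in the text_path() output above, the first table lists all edges between the first pair of nodes regardless of direction, and the second table covers the next pair.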
Finding paths between node sets necessarily uses shortest path algorithms for computational tractability. However, when these algorithms are run without modification, the shortest paths tend to be less useful than desired.
For example, one of the shortest paths from “sickle trait” to “Malaria, Cerebral” goes through the node “Infant”:
this_path <- paths[[3]][[32]]
plot_path(g_small, this_path)
This likely isn’t the type of path we were hoping for. Why does such a path arise? For some insight, we can use the degree() function within the igraph package to look at the degree distribution for all nodes in the Semantic MEDLINE graph. We also show the degree of the “Infant” node in red.
plot(density(degree(g_small), from = 0),
     xlab = "Degree", main = "Degree distribution")
## The second node in the path is "Infant" --> this_path[2]
abline(v = degree(g_small, v = this_path[2]), col = "red", lwd = 2)
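One way to build intuition for how a highly connected node can end up on shortest paths is a star graph, where a single hub touches every other node. This toy sketch is an assumption-laden illustration and says nothing about the actual degree of “Infant” in g_small.

```r
library(igraph)

# Toy undirected star graph: vertex 1 is the hub, vertices 2..6 are leaves
star <- make_star(6, mode = "undirected")

# The hub has by far the largest degree
deg <- degree(star)
deg

# The shortest path between any two leaves must pass through the hub
vp <- shortest_paths(star, from = 2, to = 3)$vpath[[1]]
as_ids(vp)
```

In a graph like this, every leaf-to-leaf shortest path runs through the hub even though the hub may say little about how the two leaves are related.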