Contents

1 CTDquerier R package

CTDquerier is an R package that allows R users to download basic data from CTDbase about genes, chemicals and diseases. First, the input provided by user is validated in CTDbase vocabulary files. Second, the validated results are used to query CTDbase and their data are downloaded into the R session.

The package can be installed using:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("CTDquerier")

2 Case-Study aim

The case-study is based on the results obtained in the article entitled “Case-control admixture mapping in Latino populations enriches for known asthma-associated genes”, from Torgerson et. al. [1]. The genes from Table E1 of that paper have been collected and used as starting point for this document.

The aim of this case study is to illustrate how CTDquerier can be used to provide further evidences about a given hypothesis or how to provide biological insights from genetic, toxicological or environmental epidemiological studies. In our case, we are using a real example where the authors are interested in providing a set of genes associated with asthma.

By using CTDquerier we will provide information about, among others, what is the level of curated evidence, from CTDbase, of the association between those genes and asthma or which chemicals are relaqted to this list of genes and asthma according to CTDbase.

3 Case-Study description

Asthma is a common long term inflammatory disease of the airways of the lungs. It is characterized by variable and recurring symptoms, reversible airflow obstruction, and bronchospasm. Symptoms include episodes of wheezing, coughing, chest tightness, and shortness of breath. Asthma is thought to be caused by a combination of genetic and environmental factors. Environmental factors include exposure to air pollution and allergens. Diagnosis is usually based on the pattern of symptoms, response to therapy over time, and spirometry. Asthma is classified according to the frequency of symptoms, forced expiratory volume in one second, and peak expiratory flow rate. It may also be classified as atopic or non-atopic where atopy refers to a predisposition toward developing a type 1 hypersensitivity reaction.

The Genetics of Asthma in Latino Americans (GALA) study [2] includes subjects with asthma and their biological parents recruited from schools, clinics, and hospitals at 4 sites: San Francisco Bay Area, New York City, Puerto Rico, and Mexico City. Patients were contacted to participate in the study if approved by their primary physician. On the basis of interviews and questionnaire data, subjects were included in the study if they were between the ages of 8 and 40 years with physician-diagnosed mild to moderate-to-severe asthma and had experienced 2 or more symptoms in the previous 2 years at the time of recruitment (including wheezing, coughing, and/or shortness of breath) and if both parents and 4 sets of grandparents self-identified as being of either Puerto Rican or Mexican ethnicity.

After quality control [1], the total number of SNPs included in the study was 729,685, and the total number of subjects was 529 children with asthma (253 Mexican and 276 Puerto Rican subjects) and 347 control subjects (158 Mexican and 189 Puerto Rican subjects).

The authors aim to identify novel asthma-related genes in Latino populations by using case-control admixture mapping. By comparing local ancestry between cases and control subjects the authors identified 62 admixture mapping peaks (29 in Puerto Ricans and 33 in Mexicans). They also observed a significant enrichment of previously identified asthma-associated genes within the 62 admixture mapping peaks (P = .0051). The information about all the genes in those candidate regions are provided in Table E1 of the paper [1].

4 Getting data (GALA genes)

The Table E1 from the GALA Study [1] was downloaded and converted into a .csv file. This table is included in the package and can be loaded into R by:

table_e1_csv <- system.file(
  paste0( "extdata", .Platform$file.sep, "gala_table_e1.csv" ), 
  package="CTDquerier"
)
table_e1 <- read.csv( table_e1_csv, stringsAsFactors = FALSE )

The genes of interest can be retrieve from the colum Genes of this table. Let us start by removing regions without annotated genes.

dim( table_e1 )
## [1] 62 15
table_e1 <- table_e1[ table_e1$Genes != "NA ", ]
dim( table_e1 )
## [1] 45 15

We have filtered 17 regions that had no genes. Now we will get the gene list by:

gala_genes <- trimws( unlist( strsplit( table_e1$Genes, "," ) ) )
length( gala_genes )
## [1] 305
gala_genes[1:15]
##  [1] "JMJD4"        "LOC100130093" "PRSS38"       "SNAP47"      
##  [5] "CMTM6"        "CMTM7"        "CMTM8"        "CNOT10"      
##  [9] "DYNC1LI1"     "GPD1L"        "HS3ST1"       "MIR572"      
## [13] "FAM13A"       "TIGD2"        "C5orf46"

NOTE that each region can have more than one gene that is separated by ,.

We end up with a total of 305 candidate genes related to asthma as proposed by Torgerson et. al. [1]. Let us call to this list GALA genes.

5 Querying GALA genes in CTDbase

Any type of enrichment analysis of a given list of genes should start by querying CTDabse. This can be done by using the function query_ctd_gene. The function validates the list of genes into CTDbase gene-vocabulary. Then the function performs a different query per gene and download the associated information into your computer. In our example this is performed by:

library( CTDquerier )
gala <- query_ctd_gene( terms = gala_genes, verbose = TRUE )

Since this process can be high time-consuming, CTDquerier package already contains the result of the query encapsulated in an object that can be loaded into your R session with:

data( gala, package = "CTDquerier" )
gala
## Object of class 'CTDdata'
## -------------------------
##  . Type: GENE 
##  . Length: 258 
##  . Items: JMJD4, ..., MAFB 
##  . Diseases: 2884 ( 113408 / 223462 )
##  . Gene-gene interactions: 8812 ( 11464 )
##  . Gene-chemical interactions: 10519 ( 17355 )
##  . KEGG pathways: 1340 ( 1340 )
##  . GO terms: 4740 ( 4752 )

6 Questions that can be answered from a gene list

Now, let us illustrate the type of questions that can be addressed by using CTDquerier when the user wants to provide biological insights from a gene list. In our case, we are interested in providing existing literature evidences (annotated in CTDbase) that our GALA genes are enriched in chemical or genetic processes realted to asthma.

6.1 How many GALA genes are in CTDbase?

Recall that our genes of interest are stored in the vector gala_genes (305 genes) and that the result obtained from querying CTDbase with those genes is stored in the object gala (258 terms). This means that 47 genes from the GALA study were not present in CTDbase. This information can be visually inspect by using the method plot of a CTDdata object.

library( ggplot2 )
plot( gala ) + ggtitle( "Lost & Found Genes from GALA Study" )

The names of the lost genes from GALA study can be obtained using the method get_terms:

get_terms( gala )[[ "lost" ]]
##  [1] "LOC100130093" "LOC728724"    "C10ORF119"    "FAM45B"      
##  [5] "FLJ46361"     "LOC399815"    "LOC400643"    "GRIK1-AS"    
##  [9] "NCRNA00258"   "C2ORF14"      "LOC150527"    "LOC150776"   
## [13] "LOC150786"    "LOC389043"    "LOC401010"    "LOC440910"   
## [17] "LOC646743"    "C4ORF35"      "C4ORF7"       "GTF2H2D"     
## [21] "LOC100049076" "LOC100170939" "LOC100272216" "LOC647859"   
## [25] "LOC728613"    "LOC389333"    "MGC29506"     "LOC285627"   
## [29] "C7ORF60"      "LOC401397"    "C8ORF71"      "C13ORF15"    
## [33] "LOC646982"    "CSDAP1"       "KIAA0664L3"   "LOC390705"   
## [37] "MYST1"        "C17ORF54"     "BASE"         "BPIL1"       
## [41] "C20ORF114"    "C20ORF185"    "C20ORF186"    "C20ORF70"    
## [45] "C20ORF71"     "PLUNC"        "FLJ46257"

6.2 How many diseases are associated with GALA genes according to CTDbase?

The call to query_ctd_gene done in previous section has downloaded “all the information available” in CTDbase of GALA genes. We can inspect the information available just printing our object of class CTDdata:

gala
## Object of class 'CTDdata'
## -------------------------
##  . Type: GENE 
##  . Length: 258 
##  . Items: JMJD4, ..., MAFB 
##  . Diseases: 2884 ( 113408 / 223462 )
##  . Gene-gene interactions: 8812 ( 11464 )
##  . Gene-chemical interactions: 10519 ( 17355 )
##  . KEGG pathways: 1340 ( 1340 )
##  . GO terms: 4740 ( 4752 )

The line starting by #Diseases indicates the number of relationships between a gene (or chemical) and diseases. For the GALA genes, there are a total of 223,462 relations between our genes of interest and diseases. The table with the relations between genes and diseases can be obtained using the method get_table.

gala_all_diseases <- get_table( gala, index_name = "diseases" )
colnames( gala_all_diseases )
## [1] "Disease.Name"      "Disease.ID"        "Direct.Evidence"  
## [4] "Inference.Network" "Inference.Score"   "Reference.Count"  
## [7] "GeneSymbol"        "GeneID"
dim( gala_all_diseases )
## [1] 223462      8

We observe that the table with the relations between gene and diseases has close to ~220K rows. This table provides with all the relations that link GALA genes with diseases, including the curated and the inferred ones.

We can also check that the number of genes used to create this table are the correct ones:

length( unique( gala_all_diseases$GeneSymbol ) )
## [1] 246
sum( get_terms( gala )[[ "found" ]] %in% 
    unique( gala_all_diseases$GeneSymbol ) )
## [1] 247
sum( !get_terms( gala )[[ "found" ]] %in% 
    unique( gala_all_diseases$GeneSymbol ) )
## [1] 11

We can also check the name of the genes that do not contributed to create the table of associations between genes and diseases:

get_terms( gala )[[ "found" ]][ 
    !get_terms( gala )[[ "found" ]] %in% unique( gala_all_diseases$GeneSymbol )
]
##  [1] "SBF1P1"   "TPTE2P3"  "MIR4317"  "MIR4319"  "FAM123C"  "MIR4277" 
##  [7] "MIR548D2" "KIAA0564" "MIR320D1" "MIR621"   "BPIL3"

Finally, the answer to our question of interest can be obtained by:

length( unique( gala_all_diseases$Disease.Name ) )
## [1] 2884

This is the number of unique diseases associated to the GALA genes using all the relations. If we are interested in getting only the associations between the GALA genes and the diseases, understood as the curated relations:

gala_all_diseases_cu <- gala_all_diseases[ !is.na( gala_all_diseases$Direct.Evidence ), ]
gala_all_diseases_cu <- gala_all_diseases_cu[ gala_all_diseases_cu$Direct.Evidence != "", ]
dim( gala_all_diseases_cu )
## [1] 436   8
length( unique( gala_all_diseases_cu$Disease.Name ) )
## [1] 301

So, in summary: There are 301 diseases linked to GALA genes, supported by 436 curated associations.

6.3 What is the level of evidence of the association between GALA genes and asthma according to CTDbase?

In order to answer this question we first need to get "Asthma" in the column containing disease information in the table of associations between genes and diseases that we have previously created. This can be performed by executing:

gala_asthma <- gala_all_diseases[ 
    gala_all_diseases$Disease.Name == "Asthma" , 
]
dim( gala_asthma )
## [1] 214   8

The subsetting resulted in a table with 209 entries. The firts column to inspect is the Direct.Evidene that provides a clue of the relations that are curated and the ones that are inferred.

sum( gala_asthma$Direct.Evidence != "" & !is.na( gala_asthma$Direct.Evidence ) )
## [1] 2

So, only two of the 214 relations are curated associatins. The other 212 (92 + 120) are inferred.

for the ones that have no direct evidence, CTDbase informs about the level of evidence of the relation between a gene and a diseasewith the columns Inference.Score and Reference.Count. The first contains the CTDbase inference-score of gene-disease association in base of a series of criterion defined by CTDbase’s developers while the second are just the number of bibliographical references enforcing the relation.

Then, the mean of the Inference Score can be interpreted as the level of evidence of the association between GALA genes and asthma according to CTDbase:

mean( gala_asthma$Inference.Score, na.rm = TRUE )
## [1] 13.13887

This index provides evidences about the degree of similarity between CTD chemical–gene–disease networks and a similar scale-free random network. The higher the score, the more likely the inference network has atypical connectivity. In other words, this large score indicates that the network of GALA genes with asthma is real and not build by chance (http://ctdbase.org/help/chemDiseaseDetailHelp.jsp).

We can also provide the total number of papers relating GALA genes with asthma:

sum( gala_asthma$Reference.Count, na.rm = TRUE )
## [1] 3208

We can provide the Inference Score of each gene of interest with asthma by:

plot( gala, index_name = "disease", subset.disease = "Asthma", filter.score = 20 ) +
    ggtitle( "Evidence of the association between GALA genes and Asthma" )

Note that here only genes having an Inference Score higher that 20 are plotted (argument filter.score)

6.4 Which chemicals are associated with GALA genes according to CTDbase?

The object gala has a total of 17355 gene-chemical interaction registers. We can obtain this table using the get_table method:

gala_chem <- get_table( gala, index_name = "chemical interactions" )
colnames( gala_chem )
## [1] "Chemical.Name"       "Chemical.ID"         "CAS.RN"             
## [4] "Interaction"         "Interaction.Actions" "Reference.Count"    
## [7] "Organism.Count"      "GeneSymbol"          "GeneID"
length( unique( gala_chem$Chemical.Name ) )
## [1] 1892

This output is indicating that GALA genes are associated with 1892 unique chemicals according to CTDabse. The distribution of the number of associations (Reference.Count column) can be obtained by:

knitr::kable( t( table( gala_chem$Reference.Count ) ) )
1 2 3 4 5 6 7 8 9 10 11 13 18 19 20 21 22 24 28 53
16196 806 174 82 35 27 9 4 6 2 3 1 1 1 2 1 2 1 1 1

This information can also be obtained in a heat-map plot. The arguments subset.genes and subset.chemicals are used to filter the elements in X-axis (genes) and Y-axis (chemicals). The argument filter.score allows to filter by minimum number of existing papers (variable Reference.Count) providing evidences about the association between chemicals and genes.

plot( gala, index_name = "chemical interactions", filter.score = 6 )

7 Questions that can be answered from a given disease

In that case we are interested in knowing genes or chemicals that have been associated with our disease (or group of diseases) of interest

7.1 Which genes are associated with asthma according to CTDbase?

The way we can obtain all the information related to a disease from CTDbase is similar to the one we got the information of genes. In that case the function to be used is query_ctd_dise:

asthma <- query_ctd_dise( terms = "Asthma", verbose = TRUE )
## Downloading disease vocabilary from CTDbase ( 'CTD_diseases.tsv.gz' ).
## Loading disease vocabulary.
## Staring query for disease 'ASTHMA' ( MESH:D001249 )
##  . Downloading 'disease-gene interaction' table.
##  . Downloading 'disease-chemical interation' table.
##  . Downloading 'KEGG pathways' table.

As we can see from the log obtained when verbose is set to TRUE, the information downloaded from CTDabse by CTDquerier for diseases is composed by disease-gene interactions, disease-chemical interactions and pathway interactions (KEGG).

asthma
## Object of class 'CTDdata'
## -------------------------
##  . Type: DISEASE 
##  . Length: 1 
##  . Items: ASTHMA 
##  . Disease-gene interactions: 33547 ( 109 / 33547 )
##  . Gene-chemical interactions: 10126 ( 236 / 10126 )
##  . KEGG pathways: 1088 ( 1088 )
##  . GO terms: 0 (-)

To know the exact number of genes related to Asthma according to CTDbase we extract the proper table and count the unique genes that appear on it.

ctd_asthma <- get_table( asthma, index_name = "gene interactions" )
length( unique( ctd_asthma$Gene.Symbol ) )
## [1] 24605

We can also look for teh curated associations between genes and Asthma:

sum( !is.na( ctd_asthma$Direct.Evidence ) & ctd_asthma$Direct.Evidence != "" )
## [1] 109

According to CTDbase, there are 24605 genes related to asthma and 109 of them are curated associations. However the term "Asthma" may appear in different ways. Here we illustrate how to get the real number of genes related with Asthma.

library( knitr )
tt <- as.data.frame( table( ctd_asthma$Disease.Name ) )
colnames( tt ) <- c( "Disease", "Frequency" )
kable( tt[ order( tt$Frequency, decreasing = TRUE ), ] )
Disease Frequency
5 Asthma 24300
9 Asthma, Occupational 5704
11 Status Asthmaticus 2659
6 Asthma, Aspirin-Induced 765
7 Asthma, Exercise-Induced 112
8 Asthma, Nasal Polyps, And Aspirin Intolerance 2
1 ASTHMA-RELATED TRAITS, SUSCEPTIBILITY TO, 1 1
2 ASTHMA-RELATED TRAITS, SUSCEPTIBILITY TO, 2 1
3 ASTHMA-RELATED TRAITS, SUSCEPTIBILITY TO, 5 1
4 ASTHMA-RELATED TRAITS, SUSCEPTIBILITY TO, 7 1
10 Platelet-Activating Factor Acetylhydrolase Deficiency 1

Hence, the true answer is 24300 gene-diseases relations and not r, length( unique( ctd_asthma$Gene.Symbol ) ).

7.2 Which chemicals are associated with asthma according to CTDbase?

First, let us retrive the chemicals related to asthma from asthma object

ctd_asthma_chem <- get_table( asthma, index_name = "chemical interactions" )
colnames( ctd_asthma_chem )
## [1] "Chemical.Name"     "Chemical.ID"       "CAS.RN"           
## [4] "Disease.Name"      "Disease.ID"        "Direct.Evidence"  
## [7] "Inference.Network" "Inference.Score"   "Reference.Count"
length( unique( ctd_asthma_chem$Chemical.Name ) )
## [1] 5582

Hence, the answer is 5582. Now lets see how many of these relations are curated associations:

sum( !is.na( ctd_asthma_chem$Direct.Evidence ) & ctd_asthma_chem$Direct.Evidence != "" )
## [1] 236

Visualization can be perform trough Inference.Score or Reference.Count plot (for both curated and inferred relations). Let us assume that we are interested in providing information about the chemicals that are associated with GALA genes having an score larger than 30. The plot can be created by:

plot( asthma, index_name = "chemical interactions", subset.disease = "Asthma", filter.score = 30 ) +
    ggtitle( "Evidence of the association between GALA genes and chemicals" )

8 Questions relating the triad: chemicals, genes and disease of interest

In some occassions the user may be interested in determining how chemicals, genes and a given disease have been associated in the literature.

8.1 Which chemicals are associated to both GALA genes and asthma according to CTDbase?

The table gala_chem contains the relation between GALA genes and the chemicals associated to these genes according to CTDbase. The table ctd_asthma_chem contains the chemicals related to Asthma according to CTDbase. Therefore, to find the chemicals that are related to both GALA genes and asthma, according to CTDbase we get the intersection of both tables. This can be performed by

intr_chem <- intersect( gala_chem$Chemical.Name, ctd_asthma_chem$Chemical.Name )
length( intr_chem )
## [1] 1520

So, there are 1520 chemicals sharing related with GALA genes and asthma at the same time. Next code can be used to quantify the overlap.

length( intr_chem ) / nrow( gala_chem ) * 100
## [1] 8.758283
length( intr_chem ) / nrow( ctd_asthma_chem ) * 100
## [1] 15.01086

That is, the 8.76% of the total number of associations of GALA genes with chemicals are shared with the observed associations of asthma with chemicals. On the other hand, the 15.01% of the chemicals associated with asthma are shared with the chemicals associated with GALA genes.

On the other and the curated chemical associations on both sides of GALA genes and Asthma can be found as:

a <- ctd_asthma_chem$Chemical.Name[
    !is.na( ctd_asthma_chem$Direct.Evidence ) & ctd_asthma_chem$Direct.Evidence != ""
]
intr_chem <- intersect(  gala_chem$Chemical.Name, a )
length( intr_chem )
## [1] 87

It can be really useful to have a data.frame with three columns: chemical names, and the number of references in the literature associated with GALA genes and asthma. This table can be retrieve by:

gala_chem_r <- gala_chem[ gala_chem$Chemical.Name %in% intr_chem, ]
gala_chem_r <- gala_chem_r[ !duplicated( gala_chem_r$Chemical.Name ), ]
ctd_asthma_chem_r <- ctd_asthma_chem[ ctd_asthma_chem$Chemical.Name %in% intr_chem, ]
ctd_asthma_chem_r <- ctd_asthma_chem_r[ !duplicated( ctd_asthma_chem_r$Chemical.Name ), ]

dta <- merge(
    gala_chem_r[ , c( "Chemical.Name", "Reference.Count" ) ],
    ctd_asthma_chem_r[ , c( "Chemical.Name", "Reference.Count" ) ],
    by = "Chemical.Name"
)
colnames( dta ) <- c( "Chemical.Name", "Reference.Gala", "Reference.Asthma" )
dta <- dta[ 
    order( dta$Reference.Gala, dta$Reference.Asthma, decreasing = TRUE ), 
]
dta[1:5, ]
##                     Chemical.Name Reference.Gala Reference.Asthma
## 57                  Phenobarbital              2               43
## 1  2,4-Dichlorophenoxyacetic Acid              2               20
## 62                    Propranolol              2               15
## 69                    Terfenadine              2               11
## 11                     Carbamates              2                9

Having this data.frame, we can use the function leaf_plot to compare the number of references in the gene list and in asthma for the top 25 chemicals by:

leaf_plot( dta[1:25, ], label = "Chemical.Name", 
    valueLeft = "Reference.Gala", valueRight = "Reference.Asthma",
    titleLeft = "GALA", titleRight = "Asthma"
)

10 Session Info.

## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.20       ggplot2_3.1.0    bindrcpp_0.2.2   CTDquerier_1.2.0
## [5] BiocStyle_2.10.0
## 
## loaded via a namespace (and not attached):
##  [1] stringdist_0.9.5.1  tidyselect_0.2.5    xfun_0.4           
##  [4] purrr_0.2.5         colorspace_1.3-2    htmltools_0.3.6    
##  [7] stats4_3.5.1        BiocFileCache_1.6.0 yaml_2.2.0         
## [10] blob_1.1.1          rlang_0.3.0.1       pillar_1.3.0       
## [13] glue_1.3.0          withr_2.1.2         DBI_1.0.0          
## [16] rappdirs_0.3.1      BiocGenerics_0.28.0 bit64_0.9-7        
## [19] dbplyr_1.2.2        bindr_0.1.1         plyr_1.8.4         
## [22] stringr_1.3.1       munsell_0.5.0       gtable_0.2.0       
## [25] memoise_1.1.0       evaluate_0.12       labeling_0.3       
## [28] parallel_3.5.1      curl_3.2            highr_0.7          
## [31] Rcpp_0.12.19        backports_1.1.2     scales_1.0.0       
## [34] BiocManager_1.30.3  S4Vectors_0.20.0    bit_1.1-14         
## [37] gridExtra_2.3       digest_0.6.18       stringi_1.2.4      
## [40] bookdown_0.7        dplyr_0.7.7         rprojroot_1.3-2    
## [43] grid_3.5.1          tools_3.5.1         bitops_1.0-6       
## [46] magrittr_1.5        RCurl_1.95-4.11     lazyeval_0.2.1     
## [49] RSQLite_2.1.1       tibble_1.4.2        crayon_1.3.4       
## [52] pkgconfig_2.0.2     assertthat_0.2.0    rmarkdown_1.10     
## [55] httr_1.3.1          R6_2.3.0            compiler_3.5.1

Bibliography

1. Torgerson D.G. GJ Gignoux C.R. Case-control admixture mapping in latino populations enriches for known asthma-associated genes. 2012.

2. Price A.L. PN Tandon A. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. 2009.