1 Introduction

The Global Alliance for Genomics and Health (GA4GH) was formed to help accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading institutions working in healthcare, research, disease advocacy, life science, and information technology. The Data Working Group of the GA4GH developed data model schemas and application program interfaces (APIs) for genomic data. These APIs are specifically designed to allow sharing of genomics data in a standardized manner and without having to exchange complete experiments. They developed a reference implementation for these APIs providing a web server for hosting genomic data.

The Bioconductor project provides tools for the analysis and comprehension of high-throughput genomic data. It uses the R statistical programming language, and is open source and open development. The Bioconductor provides stable representations for genomics data such as biological sequences and genomic variants.

We developed GA4GHclient, a Bioconductor package that provides an easy way to access public data servers through GA4GH APIs. The requested output data are converted into tidy data and presented as data frames. Our package provides methods for converting genomic variant data into VCF data allowing the creation of complex data analysis using public genomic data and well validated Bioconductor packages. This package also provides an interactive web application for interacting with genomics data through GA4GH APIs.

Public data servers that use GA4GH Genomics API:

2 Available request methods

The table below shows all available methods in GA4GHclient package. These methods are based on the official GA4GH schemas. Let us know if some method is missing by opening an issue at https://github.com/labbcb/GA4GHclient/issues.

Method Description
searchDatasets Search for datasets
searchReferenceSets Search for reference sets (reference genomes)
searchReferences Search for references (genome sequences, e.g. chromosomes)
listReferenceBases Get the sequence bases of a reference genome by genomic range
searchVariantSets Search for for variant sets (VCF files)
searchVariants Search for variants by genomic ranges (lines of VCF files)
searchCallSets Search for call sets (sample columns of VCF files)
searchVariantAnnotationSets Search for variant annotation sets (annotated VCF files)
searchVariantAnnotations Search for annotated variants by genomic range
searchFeatureSets Search for feature sets (genomic features, e.g. GFF files)
searchFeatures Search for features (lines of genomic feature files)
searchReadGroupSets Search for read group sets (sequence alignement, e.g BAM files)
searchReads Search for reads by genomic range (bases of aligned sequences)
searchBiosamples Search for biosamples
searchIndividuals Search for individuals
searchRnaQuantificationSets Search for RNA quantification sets
searchRnaQuantifications Search for RNA quantifications
searchExpressionLevels Search for expression levels
searchPhenotypeAssociationSets Search for phenotype association sets
searchPhenotypeAssociations Search for phenotype associations
getDataset Get a dataset by its ID
getReferenceSet Get a reference set by its ID
getReference Get a reference by its ID
getVariantSet Get a variant set by its ID
getVariant Get a variant by its ID with all call sets for this variant
getCallSet Get a call set by its ID
getVariantAnnotationSet Get a variant annotation set by its ID
getVariantAnnotation Get an annotated variant by its ID
getFeatureSet Get a feature set by its ID
getFeature Get a feature by its ID
getReadGroupSet Get a read group set by its ID
getBiosample Get a biosample by its ID
getIndividual Get an individual by its ID
getRnaQuantificationSet Get an RNA quantification set by its ID
getRnaQuantification Get an RNA quantification by its ID
getExpressionLevel Get an expression level by its ID

3 Retrieving Thousand Genomes Project data through GA4GHclient package

First, load required packages for this vignette.

library(GA4GHclient)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(VariantAnnotation)

For this example we will use Hosting Thousand Genomes Project server. It contains data from the 1000 Genomes project. Set the URL of GA4GH API data server. This variable will be used by all request functions.

host <- "http://1kgenomes.ga4gh.org/"

Get basic information about data server. The dataset may contain many variant sets. A variant set represents the header of an VCF file. A collection of variants represent lines of an VCF file. The nrow attribute define the amount of entries we want to get from the server. For the dataset and variant set we want only the first entry.

The searchVariants function requests genomic variant data from the server. It uses genomic intervals to retrieve these varants. As R programming language, the genomic indice is 1-based. If you want to see what reference names (e.g. chromosomes) are available run the searchReferences function.

datasets <- searchDatasets(host, nrows = 1)
datasetId <- datasets$id
variantSets <- searchVariantSets(host, datasetId, nrows = 1)
variantSetId <- variantSets$id
variants <- searchVariants(host, variantSetId, referenceName = "1",
    start = 15000, end = 16000, nrows = 10)
variants
## class: CollapsedVCF 
## dim: 10 0 
## rowRanges(vcf):
##   GRanges with 3 metadata columns: ID, REF, ALT
## info(vcf):
##   DataFrame with 14 columns: EUR_AF, SAS_AF, AC, AA, AF, AFR_AF, AMR_AF, AN,...
## info(header(vcf)):
##                  Number Type    Description                                    
##    EUR_AF        A      Float   Allele frequency in the EUR populations calc...
##    SAS_AF        A      Float   Allele frequency in the SAS populations calc...
##    AC            A      Integer Total number of alternate alleles in called ...
##    AA            1      String  Ancestral Allele. Format: AA|REF|ALT|IndelTy...
##    AF            A      Float   Estimated allele frequency in the range (0,1)  
##    AFR_AF        A      Float   Allele frequency in the AFR populations calc...
##    AMR_AF        A      Float   Allele frequency in the AMR populations calc...
##    AN            1      Integer Total number of alleles in called genotypes    
##    VT            .      String  indicates what type of variant the line repr...
##    EAS_AF        A      Float   Allele frequency in the EAS populations calc...
##    NS            1      Integer Number of samples with data                    
##    EX_TARGET     0      Flag    indicates whether a variant is within the ex...
##    DP            1      Integer Total read depth; only low coverage data wer...
##    MULTI_ALLELIC 0      Flag    indicates whether a site is multi-allelic      
## geno(vcf):
##   List of length 0:

Almost search functions will return an DataFrame object from S4Vector package. The searchVariants and getVariant functions will return an VCF object with header from VariantAnnotation package. The getVariantSet function will return an VCFHeader object. These three functions will return DataFrame object when asVCF or asVCFHeader parameters be FALSE.

variants.df <- searchVariants(host, variantSetId, referenceName = "1",
    start = 15000, end = 16000, nrows = 10, asVCF = FALSE)
DT::datatable(as.data.frame(variants.df), options = list(scrollX = TRUE))

Due to internet connection, there is a limit of amount for response data imposed by the server (or the responseSize parameter). By default the function will make requests until get all response data. Increasing the value of the responseSize parameter will reduce the number os requests to server.

4 Search Variants by genomic location of genes

Bioconductor has annotation packages for many genomes. For example, the TxDb.Hsapiens.UCSC.hg19.knownGene pakage exposes an annotation databases generated from UCSC by exposing these as TxDb objects. Before using annotation packages, it is very important to check what version of reference genome was used by the GA4GH API-based data server. In other words, what reference genome was used to align sequencing reads and call for variants. This information can be accessed via searchReferenceSets function.

referenceSets <- searchReferenceSets(host, nrows = 1)
referenceSets$name

In this case, the version of reference genome is NCBI37, which is compatible with TxDb.Hsapiens.UCSC.hg19.knownGene.

Get name of all available genes.

head(keys(org.Hs.eg.db, keytype = "SYMBOL"))
## [1] "A1BG"  "A2M"   "A2MP1" "NAT1"  "NAT2"  "NATP"

Get genomic location of all genes.

head(genes(TxDb.Hsapiens.UCSC.hg19.knownGene))
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## GRanges object with 6 ranges and 1 metadata column:
##             seqnames              ranges strand |     gene_id
##                <Rle>           <IRanges>  <Rle> | <character>
##           1    chr19   58858172-58874214      - |           1
##          10     chr8   18248755-18258723      + |          10
##         100    chr20   43248163-43280376      - |         100
##        1000    chr18   25530930-25757445      - |        1000
##       10000     chr1 243651535-244006886      - |       10000
##   100008586     chrX   49217763-49233491      + |   100008586
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Use case: search for variants located in SCN1A gene.

df <- select(org.Hs.eg.db, keys = "SCN1A", columns = c("SYMBOL", "ENTREZID"), keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
gr <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene, filter=list(gene_id=df$ENTREZID))
seqlevelsStyle(gr) <- "NCBI"
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
variants.scn1a <- searchVariantsByGRanges(host, variantSetId, gr, asVCF = FALSE)
DT::datatable(as.data.frame(variants.scn1a[[1]]), options = list(scrollX = TRUE))
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Use case: locate annotation data for a group of variants located at SCN1A gene.

variants.scn1a.gr <- makeGRangesFromDataFrame(variants.scn1a[[1]],
    seqnames.field = "referenceName")
seqlevelsStyle(variants.scn1a.gr) <- "UCSC"
locateVariants(variants.scn1a.gr, TxDb.Hsapiens.UCSC.hg19.knownGene,
    CodingVariants())
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 184 out-of-bound ranges located on sequences
##   12221, 12222, and 12223. Note that ranges located on a sequence whose
##   length is unknown (NA) or on a circular sequence are not considered
##   out-of-bound (use seqlengths() and isCircular() to get the lengths and
##   circularity flags of the underlying sequences). You can use trim() to
##   trim these ranges. See ?`trim,GenomicRanges-method` for more
##   information.
## 'select()' returned many:1 mapping between keys and columns
## GRanges object with 308 ranges and 9 metadata columns:
##         seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID
##            <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer>
##     [1]     chr2 166847834      - |   coding      5867      5867        51
##     [2]     chr2 166847834      - |   coding      5918      5918        51
##     [3]     chr2 166847834      - |   coding      5951      5951        51
##     [4]     chr2 166847921      - |   coding      5780      5780        52
##     [5]     chr2 166847921      - |   coding      5831      5831        52
##     ...      ...       ...    ... .      ...       ...       ...       ...
##   [304]     chr2 166930064      - |   coding        68        68      2159
##   [305]     chr2 166930064      - |   coding        68        68      2159
##   [306]     chr2 166930078      - |   coding        54        54      2160
##   [307]     chr2 166930078      - |   coding        54        54      2160
##   [308]     chr2 166930078      - |   coding        54        54      2160
##                TXID         CDSID      GENEID       PRECEDEID        FOLLOWID
##         <character> <IntegerList> <character> <CharacterList> <CharacterList>
##     [1]       12221         37217        6323                                
##     [2]       12222         37217        6323                                
##     [3]       12223         37217        6323                                
##     [4]       12221         37217        6323                                
##     [5]       12222         37217        6323                                
##     ...         ...           ...         ...             ...             ...
##   [304]       12222         37244        6323                                
##   [305]       12223         37244        6323                                
##   [306]       12221         37244        6323                                
##   [307]       12222         37244        6323                                
##   [308]       12223         37244        6323                                
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

5 VariantAnnotation classes

The searchVariants and getVariant functions return VCF objects. These VCF objects contains the VCF header. This is an example of how to get variants with call data. We can get calls running the searchCallSets function. After convertion, it can be written into disk as VCF file using writeVcf.

callSetIds <- searchCallSets(host, variantSetId, nrows = 5)$id
vcf <- searchVariants(host, variantSetId, referenceName = "1",
    start = 15000, end = 16000, callSetIds = callSetIds, nrows = 10)
vcf
## class: CollapsedVCF 
## dim: 10 5 
## rowRanges(vcf):
##   GRanges with 3 metadata columns: ID, REF, ALT
## info(vcf):
##   DataFrame with 14 columns: EUR_AF, SAS_AF, AC, AA, AF, AFR_AF, AMR_AF, AN,...
## info(header(vcf)):
##                  Number Type    Description                                    
##    EUR_AF        A      Float   Allele frequency in the EUR populations calc...
##    SAS_AF        A      Float   Allele frequency in the SAS populations calc...
##    AC            A      Integer Total number of alternate alleles in called ...
##    AA            1      String  Ancestral Allele. Format: AA|REF|ALT|IndelTy...
##    AF            A      Float   Estimated allele frequency in the range (0,1)  
##    AFR_AF        A      Float   Allele frequency in the AFR populations calc...
##    AMR_AF        A      Float   Allele frequency in the AMR populations calc...
##    AN            1      Integer Total number of alleles in called genotypes    
##    VT            .      String  indicates what type of variant the line repr...
##    EAS_AF        A      Float   Allele frequency in the EAS populations calc...
##    NS            1      Integer Number of samples with data                    
##    EX_TARGET     0      Flag    indicates whether a variant is within the ex...
##    DP            1      Integer Total read depth; only low coverage data wer...
##    MULTI_ALLELIC 0      Flag    indicates whether a site is multi-allelic      
## geno(vcf):
##   List of length 1: GT
## geno(header(vcf)):
##       Number Type   Description
##    GT 1      String Genotype
header(vcf)
## class: VCFHeader 
## samples(0):
## meta(0):
## fixed(0):
## info(27): CIEND CIPOS ... EX_TARGET MULTI_ALLELIC
## geno(1): GT

The vcf object works as expected. We can get the genotype data for example. More information about working with VCF data see vignette("VariantAnnotation").

geno(vcf)$GT
##       HG00096 HG00097 HG00099 HG00100 HG00101
##  [1,] "0|0"   "0|0"   "0|1"   "0|0"   "0|0"  
##  [2,] "0|0"   "0|0"   "0|1"   "0|0"   "0|0"  
##  [3,] "0|0"   "0|0"   "0|1"   "0|0"   "0|0"  
##  [4,] "0|0"   "0|0"   "0|1"   "0|0"   "0|0"  
##  [5,] "0|0"   "0|0"   "1|0"   "0|0"   "0|0"  
##  [6,] "0|0"   "0|0"   "0|0"   "0|0"   "1|2"  
##  [7,] "0|0"   "0|0"   "0|0"   "0|0"   "2|2"  
##  [8,] "0|0"   "0|0"   "0|0"   "0|0"   "2|2"  
##  [9,] "0|0"   "0|0"   "0|0"   "0|0"   "1|2"  
## [10,] "0|0"   "0|0"   "0|0"   "0|0"   "1|2"

6 Session Information

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] VariantAnnotation_1.40.0               
##  [2] Rsamtools_2.10.0                       
##  [3] Biostrings_2.62.0                      
##  [4] XVector_0.34.0                         
##  [5] SummarizedExperiment_1.24.0            
##  [6] MatrixGenerics_1.6.0                   
##  [7] matrixStats_0.61.0                     
##  [8] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [9] GenomicFeatures_1.46.0                 
## [10] GenomicRanges_1.46.0                   
## [11] GenomeInfoDb_1.30.0                    
## [12] org.Hs.eg.db_3.14.0                    
## [13] AnnotationDbi_1.56.0                   
## [14] IRanges_2.28.0                         
## [15] Biobase_2.54.0                         
## [16] GA4GHclient_1.18.0                     
## [17] S4Vectors_0.32.0                       
## [18] BiocGenerics_0.40.0                    
## [19] BiocStyle_2.22.0                       
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7             bit64_4.0.5              filelock_1.0.2          
##  [4] progress_1.2.2           httr_1.4.2               tools_4.1.1             
##  [7] bslib_0.3.1              utf8_1.2.2               R6_2.5.1                
## [10] DT_0.19                  DBI_1.1.1                tidyselect_1.1.1        
## [13] prettyunits_1.1.1        bit_4.0.4                curl_4.3.2              
## [16] compiler_4.1.1           xml2_1.3.2               DelayedArray_0.20.0     
## [19] rtracklayer_1.54.0       bookdown_0.24            sass_0.4.0              
## [22] rappdirs_0.3.3           stringr_1.4.0            digest_0.6.28           
## [25] rmarkdown_2.11           pkgconfig_2.0.3          htmltools_0.5.2         
## [28] dbplyr_2.1.1             fastmap_1.1.0            BSgenome_1.62.0         
## [31] highr_0.9                htmlwidgets_1.5.4        rlang_0.4.12            
## [34] RSQLite_2.2.8            jquerylib_0.1.4          BiocIO_1.4.0            
## [37] generics_0.1.1           jsonlite_1.7.2           crosstalk_1.1.1         
## [40] BiocParallel_1.28.0      dplyr_1.0.7              RCurl_1.98-1.5          
## [43] magrittr_2.0.1           GenomeInfoDbData_1.2.7   Matrix_1.3-4            
## [46] Rcpp_1.0.7               fansi_0.5.0              lifecycle_1.0.1         
## [49] stringi_1.7.5            yaml_2.2.1               zlibbioc_1.40.0         
## [52] BiocFileCache_2.2.0      grid_4.1.1               blob_1.2.2              
## [55] parallel_4.1.1           crayon_1.4.1             lattice_0.20-45         
## [58] hms_1.1.1                KEGGREST_1.34.0          knitr_1.36              
## [61] pillar_1.6.4             rjson_0.2.20             biomaRt_2.50.0          
## [64] XML_3.99-0.8             glue_1.4.2               evaluate_0.14           
## [67] BiocManager_1.30.16      png_0.1-7                vctrs_0.3.8             
## [70] purrr_0.3.4              assertthat_0.2.1         cachem_1.0.6            
## [73] xfun_0.27                restfulr_0.0.13          tibble_3.1.5            
## [76] GenomicAlignments_1.30.0 memoise_2.0.0            ellipsis_0.3.2