Google Genomics implements the GA4GH reads API and this R package can retrieve data from that implementation. For more detail, see https://cloud.google.com/genomics/v1beta2/reference/reads
library(GoogleGenomics)
# This vignette is authenticated on package load from the env variable GOOGLE_API_KEY.
# When running interactively, call the authenticate method.
# ?authenticate
By default, this function retrieves reads for a small genomic region for one sample in 1,000 Genomes.
reads <- getReads()
## Fetching reads page.
## Reads are now available.
length(reads)
## [1] 18
We can see that 18 individual reads were returned and that the JSON response was deserialized into an R list object:
class(reads)
## [1] "list"
mode(reads)
## [1] "list"
The top level field names are:
names(reads[[1]])
## [1] "id" "readGroupId"
## [3] "readGroupSetId" "fragmentName"
## [5] "properPlacement" "duplicateFragment"
## [7] "fragmentLength" "readNumber"
## [9] "numberReads" "failedVendorQualityChecks"
## [11] "alignment" "secondaryAlignment"
## [13] "supplementaryAlignment" "alignedSequence"
## [15] "alignedQuality" "nextMatePosition"
## [17] "info"
And examining the alignment we see:
reads[[1]]$alignedSequence
## [1] "AACAAAAAACTGTCTAACAAGATTTTATGGTTTATAGACCATGATTCCCCGGACACATTAGATAGAAATCTGGGCAAGAGAAGAAAAAAAGGTCAGAGTT"
reads[[1]]$alignment$position$referenceName
## [1] "22"
reads[[1]]$alignment$position$position
## [1] "16051308"
This is good, but this data becomes much more useful when it is converted to Bioconductor data types. For example, we can convert reads in this list representation to GAlignments:
readsToGAlignments(reads)
## GAlignments object with 18 alignments and 1 metadata column:
## seqnames strand cigar qwidth start
## <Rle> <Rle> <character> <integer> <integer>
## ERR251039.44923356 chr22 - 100M 100 16051309
## ERR251039.28556355 chr22 + 100M 100 16051323
## ERR016214.27010769 chr22 + 75M1D6M 81 16051330
## ERR016213.15718767 chr22 + 68M 68 16051400
## ERR016213.29749886 chr22 - 68M 68 16051403
## ... ... ... ... ... ...
## ERR251040.1475552 chr22 + 100M 100 16051454
## ERR251040.3762117 chr22 + 100M 100 16051469
## ERR251040.11853979 chr22 + 100M 100 16051478
## ERR251040.34469189 chr22 + 100M 100 16051486
## ERR251040.38772006 chr22 - 100M 100 16051498
## end width njunc | flag
## <integer> <integer> <integer> | <numeric>
## ERR251039.44923356 16051408 100 0 | 147
## ERR251039.28556355 16051422 100 0 | 163
## ERR016214.27010769 16051411 82 0 | 99
## ERR016213.15718767 16051467 68 0 | 163
## ERR016213.29749886 16051470 68 0 | 153
## ... ... ... ... . ...
## ERR251040.1475552 16051553 100 0 | 163
## ERR251040.3762117 16051568 100 0 | 163
## ERR251040.11853979 16051577 100 0 | 163
## ERR251040.34469189 16051585 100 0 | 99
## ERR251040.38772006 16051597 100 0 | 83
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Let’s take a look at the reads that overlap rs9536314 for sample NA12893 within the Illumina Platinum Genomes dataset. This SNP resides on chromosome 13 at position 33628137 in 0-based coordinates.
# Change the values of 'chromosome', 'start', or 'end' here if you wish to plot
# alignments from a different portion of the genome.
alignments <- getReads(readGroupSetId="CMvnhpKTFhD3he72j4KZuyc",
chromosome="chr13",
start=33628130,
end=33628145,
converter=readsToGAlignments)
## Fetching reads page.
## Reads are now available.
alignments
## GAlignments object with 67 alignments and 1 metadata column:
## seqnames strand cigar
## <Rle> <Rle> <character>
## HSQ1008:141:D0CC8ACXX:4:2201:3978:155671 chr13 + 101M
## HSQ1008:141:D0CC8ACXX:3:1206:7108:129302 chr13 - 101M
## HSQ1008:141:D0CC8ACXX:2:1307:18541:80804 chr13 + 101M
## HSQ1008:141:D0CC8ACXX:4:2108:12961:30710 chr13 - 101M
## HSQ1008:141:D0CC8ACXX:1:2308:18484:6721 chr13 - 101M
## ... ... ... ...
## HSQ1008:141:D0CC8ACXX:2:1308:19101:126736 chr13 + 101M
## HSQ1008:141:D0CC8ACXX:2:1302:13881:79089 chr13 - 101M
## HSQ1008:141:D0CC8ACXX:2:2108:1700:58641 chr13 - 101M
## HSQ1008:141:D0CC8ACXX:1:2104:4131:31083 chr13 + 101M
## HSQ1008:141:D0CC8ACXX:4:2302:3851:18256 chr13 + 101M
## qwidth start end
## <integer> <integer> <integer>
## HSQ1008:141:D0CC8ACXX:4:2201:3978:155671 101 33628031 33628131
## HSQ1008:141:D0CC8ACXX:3:1206:7108:129302 101 33628033 33628133
## HSQ1008:141:D0CC8ACXX:2:1307:18541:80804 101 33628039 33628139
## HSQ1008:141:D0CC8ACXX:4:2108:12961:30710 101 33628042 33628142
## HSQ1008:141:D0CC8ACXX:1:2308:18484:6721 101 33628043 33628143
## ... ... ... ...
## HSQ1008:141:D0CC8ACXX:2:1308:19101:126736 101 33628133 33628233
## HSQ1008:141:D0CC8ACXX:2:1302:13881:79089 101 33628136 33628236
## HSQ1008:141:D0CC8ACXX:2:2108:1700:58641 101 33628136 33628236
## HSQ1008:141:D0CC8ACXX:1:2104:4131:31083 101 33628139 33628239
## HSQ1008:141:D0CC8ACXX:4:2302:3851:18256 101 33628141 33628241
## width njunc | flag
## <integer> <integer> | <numeric>
## HSQ1008:141:D0CC8ACXX:4:2201:3978:155671 101 0 | 163
## HSQ1008:141:D0CC8ACXX:3:1206:7108:129302 101 0 | 83
## HSQ1008:141:D0CC8ACXX:2:1307:18541:80804 101 0 | 99
## HSQ1008:141:D0CC8ACXX:4:2108:12961:30710 101 0 | 83
## HSQ1008:141:D0CC8ACXX:1:2308:18484:6721 101 0 | 147
## ... ... ... . ...
## HSQ1008:141:D0CC8ACXX:2:1308:19101:126736 101 0 | 99
## HSQ1008:141:D0CC8ACXX:2:1302:13881:79089 101 0 | 83
## HSQ1008:141:D0CC8ACXX:2:2108:1700:58641 101 0 | 147
## HSQ1008:141:D0CC8ACXX:1:2104:4131:31083 101 0 | 99
## HSQ1008:141:D0CC8ACXX:4:2302:3851:18256 101 0 | 99
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Notice that we passed the converter to the getReads method so that we’re immediately working with GAlignments which means that we can start taking advantage of other Bioconductor functionality. Also keep in mind that the parameters start
and end
are expressed in 0-based coordinates per the GA4GH specification but the Bioconductor data type converters in GoogleGenomics, by default, transform the returned data to 1-based coordinates.
library(ggbio)
We can display the basic alignments and coverage data:
alignmentPlot <- autoplot(alignments, aes(color=strand, fill=strand))
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.
coveragePlot <- ggplot(as(alignments, "GRanges")) +
stat_coverage(color="gray40", fill="skyblue")
tracks(alignmentPlot, coveragePlot,
xlab="Reads overlapping rs9536314 for NA12893")
And also display the ideogram for the corresponding location on the chromosome:
ideogramPlot <- plotIdeogram(genome="hg19", subchr="chr13")
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
ideogramPlot + xlim(as(alignments, "GRanges"))
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggbio_1.28.0
## [2] ggplot2_2.2.1
## [3] org.Hs.eg.db_3.6.0
## [4] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [5] BSgenome_1.48.0
## [6] rtracklayer_1.40.0
## [7] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [8] GenomicFeatures_1.32.0
## [9] AnnotationDbi_1.42.0
## [10] GoogleGenomics_2.2.0
## [11] VariantAnnotation_1.26.0
## [12] GenomicAlignments_1.16.0
## [13] Rsamtools_1.32.0
## [14] Biostrings_2.48.0
## [15] XVector_0.20.0
## [16] SummarizedExperiment_1.10.0
## [17] DelayedArray_0.6.0
## [18] BiocParallel_1.14.0
## [19] matrixStats_0.53.1
## [20] Biobase_2.40.0
## [21] GenomicRanges_1.32.0
## [22] GenomeInfoDb_1.16.0
## [23] IRanges_2.14.0
## [24] S4Vectors_0.18.0
## [25] BiocGenerics_0.26.0
## [26] knitr_1.20
## [27] BiocStyle_2.8.0
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.12.0 bitops_1.0-6 bit64_0.9-7
## [4] RColorBrewer_1.1-2 progress_1.1.2 httr_1.3.1
## [7] rprojroot_1.3-2 tools_3.5.0 backports_1.1.2
## [10] R6_2.2.2 rpart_4.1-13 Hmisc_4.1-1
## [13] DBI_0.8 lazyeval_0.2.1 colorspace_1.3-2
## [16] nnet_7.3-12 gridExtra_2.3 prettyunits_1.0.2
## [19] GGally_1.3.2 bit_1.1-12 curl_3.2
## [22] compiler_3.5.0 graph_1.58.0 htmlTable_1.11.2
## [25] labeling_0.3 bookdown_0.7 scales_0.5.0
## [28] checkmate_1.8.5 RBGL_1.56.0 stringr_1.3.0
## [31] digest_0.6.15 foreign_0.8-70 rmarkdown_1.9
## [34] dichromat_2.0-0 base64enc_0.1-3 pkgconfig_2.0.1
## [37] htmltools_0.3.6 ensembldb_2.4.0 htmlwidgets_1.2
## [40] rlang_0.2.0 rstudioapi_0.7 RSQLite_2.1.0
## [43] BiocInstaller_1.30.0 jsonlite_1.5 acepack_1.4.1
## [46] RCurl_1.95-4.10 magrittr_1.5 GenomeInfoDbData_1.1.0
## [49] Formula_1.2-2 Matrix_1.2-14 Rcpp_0.12.16
## [52] munsell_0.4.3 stringi_1.1.7 yaml_2.1.18
## [55] zlibbioc_1.26.0 plyr_1.8.4 grid_3.5.0
## [58] blob_1.1.1 lattice_0.20-35 splines_3.5.0
## [61] pillar_1.2.2 rjson_0.2.15 reshape2_1.4.3
## [64] biomaRt_2.36.0 XML_3.98-1.11 evaluate_0.10.1
## [67] biovizBase_1.28.0 latticeExtra_0.6-28 data.table_1.10.4-3
## [70] gtable_0.2.0 reshape_0.8.7 assertthat_0.2.0
## [73] xfun_0.1 AnnotationFilter_1.4.0 survival_2.42-3
## [76] OrganismDbi_1.22.0 tibble_1.4.2 memoise_1.1.0
## [79] cluster_2.0.7-1