Contents

0.1 Working with Reads

Google Genomics implements the GA4GH reads API and this R package can retrieve data from that implementation. For more detail, see https://cloud.google.com/genomics/v1beta2/reference/reads

library(GoogleGenomics)
# This vignette is authenticated on package load from the env variable GOOGLE_API_KEY.
# When running interactively, call the authenticate method.
# ?authenticate

By default, this function retrieves reads for a small genomic region for one sample in 1,000 Genomes.

reads <- getReads()
## Fetching reads page.
## Reads are now available.
length(reads)
## [1] 18

We can see that 18 individual reads were returned and that the JSON response was deserialized into an R list object:

class(reads)
## [1] "list"
mode(reads)
## [1] "list"

The top level field names are:

names(reads[[1]])
##  [1] "id"                        "readGroupId"              
##  [3] "readGroupSetId"            "fragmentName"             
##  [5] "properPlacement"           "duplicateFragment"        
##  [7] "fragmentLength"            "readNumber"               
##  [9] "numberReads"               "failedVendorQualityChecks"
## [11] "alignment"                 "secondaryAlignment"       
## [13] "supplementaryAlignment"    "alignedSequence"          
## [15] "alignedQuality"            "nextMatePosition"         
## [17] "info"

And examining the alignment we see:

reads[[1]]$alignedSequence
## [1] "AACAAAAAACTGTCTAACAAGATTTTATGGTTTATAGACCATGATTCCCCGGACACATTAGATAGAAATCTGGGCAAGAGAAGAAAAAAAGGTCAGAGTT"
reads[[1]]$alignment$position$referenceName
## [1] "22"
reads[[1]]$alignment$position$position
## [1] "16051308"

This is good, but this data becomes much more useful when it is converted to Bioconductor data types. For example, we can convert reads in this list representation to GAlignments:

readsToGAlignments(reads)
## GAlignments object with 18 alignments and 1 metadata column:
##                      seqnames strand       cigar    qwidth     start
##                         <Rle>  <Rle> <character> <integer> <integer>
##   ERR251039.44923356    chr22      -        100M       100  16051309
##   ERR251039.28556355    chr22      +        100M       100  16051323
##   ERR016214.27010769    chr22      +     75M1D6M        81  16051330
##   ERR016213.15718767    chr22      +         68M        68  16051400
##   ERR016213.29749886    chr22      -         68M        68  16051403
##                  ...      ...    ...         ...       ...       ...
##    ERR251040.1475552    chr22      +        100M       100  16051454
##    ERR251040.3762117    chr22      +        100M       100  16051469
##   ERR251040.11853979    chr22      +        100M       100  16051478
##   ERR251040.34469189    chr22      +        100M       100  16051486
##   ERR251040.38772006    chr22      -        100M       100  16051498
##                            end     width     njunc |      flag
##                      <integer> <integer> <integer> | <numeric>
##   ERR251039.44923356  16051408       100         0 |       147
##   ERR251039.28556355  16051422       100         0 |       163
##   ERR016214.27010769  16051411        82         0 |        99
##   ERR016213.15718767  16051467        68         0 |       163
##   ERR016213.29749886  16051470        68         0 |       153
##                  ...       ...       ...       ... .       ...
##    ERR251040.1475552  16051553       100         0 |       163
##    ERR251040.3762117  16051568       100         0 |       163
##   ERR251040.11853979  16051577       100         0 |       163
##   ERR251040.34469189  16051585       100         0 |        99
##   ERR251040.38772006  16051597       100         0 |        83
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

0.2 Plotting Alignments

Let’s take a look at the reads that overlap rs9536314 for sample NA12893 within the Illumina Platinum Genomes dataset. This SNP resides on chromosome 13 at position 33628137 in 0-based coordinates.

# Change the values of 'chromosome', 'start', or 'end' here if you wish to plot 
# alignments from a different portion of the genome.
alignments <- getReads(readGroupSetId="CMvnhpKTFhDyy__v0qfPpkw",
                       chromosome="chr13",
                       start=33628130,
                       end=33628145,
                       converter=readsToGAlignments)
## Fetching reads page.
## Reads are now available.
alignments
## GAlignments object with 66 alignments and 1 metadata column:
##                                            seqnames strand       cigar
##                                               <Rle>  <Rle> <character>
##   HSQ1009:126:D0UUYACXX:8:2213:15440:74824    chr13      +        101M
##    HSQ1009:126:D0UUYACXX:8:2207:9137:75357    chr13      -        101M
##    HSQ1009:127:C0LVVACXX:8:1312:4462:83203    chr13      -        101M
##    HSQ1009:127:C0LVVACXX:8:2207:9999:16063    chr13      -        101M
##    HSQ1009:126:D0UUYACXX:7:2312:2992:38597    chr13      +        101M
##                                        ...      ...    ...         ...
##   HSQ1009:126:D0UUYACXX:8:1313:14128:95709    chr13      +        101M
##   HSQ1009:126:D0UUYACXX:8:1314:13643:12305    chr13      +        101M
##   HSQ1009:126:D0UUYACXX:8:1209:16537:43414    chr13      +        101M
##    HSQ1009:126:D0UUYACXX:8:1214:6640:15022    chr13      +        101M
##    HSQ1009:127:C0LVVACXX:8:2216:4748:24114    chr13      -        101M
##                                               qwidth     start       end
##                                            <integer> <integer> <integer>
##   HSQ1009:126:D0UUYACXX:8:2213:15440:74824       101  33628032  33628132
##    HSQ1009:126:D0UUYACXX:8:2207:9137:75357       101  33628036  33628136
##    HSQ1009:127:C0LVVACXX:8:1312:4462:83203       101  33628037  33628137
##    HSQ1009:127:C0LVVACXX:8:2207:9999:16063       101  33628038  33628138
##    HSQ1009:126:D0UUYACXX:7:2312:2992:38597       101  33628041  33628141
##                                        ...       ...       ...       ...
##   HSQ1009:126:D0UUYACXX:8:1313:14128:95709       101  33628137  33628237
##   HSQ1009:126:D0UUYACXX:8:1314:13643:12305       101  33628139  33628239
##   HSQ1009:126:D0UUYACXX:8:1209:16537:43414       101  33628141  33628241
##    HSQ1009:126:D0UUYACXX:8:1214:6640:15022       101  33628141  33628241
##    HSQ1009:127:C0LVVACXX:8:2216:4748:24114       101  33628143  33628243
##                                                width     njunc |      flag
##                                            <integer> <integer> | <numeric>
##   HSQ1009:126:D0UUYACXX:8:2213:15440:74824       101         0 |        99
##    HSQ1009:126:D0UUYACXX:8:2207:9137:75357       101         0 |        83
##    HSQ1009:127:C0LVVACXX:8:1312:4462:83203       101         0 |       147
##    HSQ1009:127:C0LVVACXX:8:2207:9999:16063       101         0 |       147
##    HSQ1009:126:D0UUYACXX:7:2312:2992:38597       101         0 |        99
##                                        ...       ...       ... .       ...
##   HSQ1009:126:D0UUYACXX:8:1313:14128:95709       101         0 |       163
##   HSQ1009:126:D0UUYACXX:8:1314:13643:12305       101         0 |       163
##   HSQ1009:126:D0UUYACXX:8:1209:16537:43414       101         0 |       163
##    HSQ1009:126:D0UUYACXX:8:1214:6640:15022       101         0 |        99
##    HSQ1009:127:C0LVVACXX:8:2216:4748:24114       101         0 |       147
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Notice that we passed the converter to the getReads method so that we’re immediately working with GAlignments which means that we can start taking advantage of other Bioconductor functionality. Also keep in mind that the parameters start and end are expressed in 0-based coordinates per the GA4GH specification but the Bioconductor data type converters in GoogleGenomics, by default, transform the returned data to 1-based coordinates.

library(ggbio)

We can display the basic alignments and coverage data:

alignmentPlot <- autoplot(alignments, aes(color=strand, fill=strand))
## Scale for 'colour' is already present. Adding another scale for
## 'colour', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.
coveragePlot <- ggplot(as(alignments, "GRanges")) +
                stat_coverage(color="gray40", fill="skyblue")
tracks(alignmentPlot, coveragePlot,
       xlab="Reads overlapping rs9536314 for NA12893")

And also display the ideogram for the corresponding location on the chromosome:

ideogramPlot <- plotIdeogram(genome="hg19", subchr="chr13")
ideogramPlot + xlim(as(alignments, "GRanges"))

0.3 Provenance

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggbio_1.22.0                           
##  [2] ggplot2_2.1.0                          
##  [3] org.Hs.eg.db_3.4.0                     
##  [4] BSgenome.Hsapiens.UCSC.hg19_1.4.0      
##  [5] BSgenome_1.42.0                        
##  [6] rtracklayer_1.34.0                     
##  [7] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [8] GenomicFeatures_1.26.0                 
##  [9] AnnotationDbi_1.36.0                   
## [10] GoogleGenomics_1.6.0                   
## [11] VariantAnnotation_1.20.0               
## [12] GenomicAlignments_1.10.0               
## [13] Rsamtools_1.26.0                       
## [14] Biostrings_2.42.0                      
## [15] XVector_0.14.0                         
## [16] SummarizedExperiment_1.4.0             
## [17] Biobase_2.34.0                         
## [18] GenomicRanges_1.26.0                   
## [19] GenomeInfoDb_1.10.0                    
## [20] IRanges_2.8.0                          
## [21] S4Vectors_0.12.0                       
## [22] BiocGenerics_0.20.0                    
## [23] knitr_1.14                             
## [24] BiocStyle_2.2.0                        
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.2.1                    jsonlite_1.1                 
##  [3] AnnotationHub_2.6.0           splines_3.3.1                
##  [5] Formula_1.2-1                 shiny_0.14.1                 
##  [7] assertthat_0.1                interactiveDisplayBase_1.12.0
##  [9] latticeExtra_0.6-28           RBGL_1.50.0                  
## [11] yaml_2.1.13                   RSQLite_1.0.0                
## [13] lattice_0.20-34               biovizBase_1.22.0            
## [15] chron_2.3-47                  digest_0.6.10                
## [17] RColorBrewer_1.1-2            colorspace_1.2-7             
## [19] htmltools_0.3.5               httpuv_1.3.3                 
## [21] Matrix_1.2-7.1                plyr_1.8.4                   
## [23] OrganismDbi_1.16.0            XML_3.98-1.4                 
## [25] biomaRt_2.30.0                zlibbioc_1.20.0              
## [27] xtable_1.8-2                  scales_0.4.0                 
## [29] BiocParallel_1.8.0            tibble_1.2                   
## [31] nnet_7.3-12                   survival_2.39-5              
## [33] magrittr_1.5                  mime_0.5                     
## [35] evaluate_0.10                 GGally_1.2.0                 
## [37] foreign_0.8-67                graph_1.52.0                 
## [39] BiocInstaller_1.24.0          tools_3.3.1                  
## [41] data.table_1.9.6              formatR_1.4                  
## [43] stringr_1.1.0                 munsell_0.4.3                
## [45] cluster_2.0.5                 ensembldb_1.6.0              
## [47] grid_3.3.1                    RCurl_1.95-4.8               
## [49] dichromat_2.0-0               rjson_0.2.15                 
## [51] labeling_0.3                  bitops_1.0-6                 
## [53] rmarkdown_1.1                 gtable_0.2.0                 
## [55] DBI_0.5-1                     reshape_0.8.5                
## [57] curl_2.1                      reshape2_1.4.1               
## [59] R6_2.2.0                      gridExtra_2.2.1              
## [61] Hmisc_3.17-4                  stringi_1.1.2                
## [63] Rcpp_0.12.7                   rpart_4.1-10                 
## [65] acepack_1.3-3.3