1 Version Info
2 Background
3 Data Preparation
- 3.1 Human hg19
- 3.2 Mouse mm10
4 Location in and Around Genes
5 Annotate by ID
6 Annotate by Position
7 Annotating Variants
8 Exercises

1 Version Info

R version: R version 4.1.0 (2021-05-18)
Bioconductor version: 3.13
Package version: 1.16.0

2 Background

Bioconductor can import diverse sequence-related file types, including fasta, fastq, BAM, VCF, gff, bed, and wig files, among others. Packages support common and advanced sequence manipulation operations such as trimming, transformation, and alignment. Domain-specific analyses include quality assessment, ChIP-seq, differential expression, RNA-seq, and other approaches. Bioconductor includes an interface to the Sequence Read Archive (via the SRAdb package).

This workflow walks through the annotation of a generic set of ranges with Bioconductor packages. The ranges can be any user-defined region of interest or can be from a public file.

3 Data Preparation

3.1 Human hg19

As a first step, data are put into a GRanges object so we can take advantage of overlap operations and store identifiers as metadata columns.

The first set of ranges are variants from a dbSNP Variant Call Format (VCF) file. This file can be downloaded from the ftp site at NCBI ftp://ftp.ncbi.nlm.nih.gov/snp/ and imported with readVcf() from the VariantAnnotation package. Alternatively, the file is available as a pre-parsed VCF object in the AnnotationHub.

The Hub returns a VcfFile object with a reference to the file on disk.

hub <- AnnotationHub()

## snapshotDate(): 2021-05-18

Query the Hub for clinvar VCF files build GRCh37:

mcols(query(hub, "clinvar.vcf", "GRCh37"))[,"sourceurl", drop=FALSE]

## DataFrame with 8 rows and 1 column
##                      sourceurl
##                    <character>
## AH57956 ftp://ftp.ncbi.nlm.n..
## AH57957 ftp://ftp.ncbi.nlm.n..
## AH57958 ftp://ftp.ncbi.nlm.n..
## AH57959 ftp://ftp.ncbi.nlm.n..
## AH57960 ftp://ftp.ncbi.nlm.n..
## AH57961 ftp://ftp.ncbi.nlm.n..
## AH57962 ftp://ftp.ncbi.nlm.n..
## AH57963 ftp://ftp.ncbi.nlm.n..

Retrieve one of the files:

fl <- query(hub, "clinvar.vcf", "GRCh37")[[1]]

## loading from cache

Read the data into a VCF object:

vcf <- readVcf(fl, "hg19")
dim(vcf)

## [1] 109721      0

Overlap operations require that seqlevels and the genome of the objects match. Here the VCF seqlevels are modified to match the TxDb.

txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
head(seqlevels(txdb_hg19))

## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"

seqlevels(vcf)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "X"  "Y"  "MT"

seqlevels(vcf) <- paste0("chr", seqlevels(vcf))

For this example we’ll annotate chromosomes 3 and 18:

seqlevels(vcf, pruning.mode="coarse") <- c("chr3", "chr18")
seqlevels(txdb_hg19) <- c("chr3", "chr18")

Sanity check to confirm we have matching seqlevels.

intersect(seqlevels(txdb_hg19), seqlevels(vcf))

## [1] "chr3"  "chr18"

The genomes already match so no change is needed.

unique(genome(txdb_hg19))

## [1] "hg19"

unique(genome(vcf))

## [1] "hg19"

The GRanges in a VCF object is extracted with ‘rowRanges()’.

gr_hg19 <- rowRanges(vcf)

3.2 Mouse mm10

The second set of ranges is a user-defined region of chromosome 4 in mouse. The idea here is that any region, known or unknown, can be annotated with the following steps.

Load the TxDb package and keep only the standard chromosomes.

txdb_mm10 <- keepStandardChromosomes(TxDb.Mmusculus.UCSC.mm10.ensGene)

We are creating the GRanges from scratch and can specify the seqlevels (chromosome names) to match the TxDb.

head(seqlevels(txdb_mm10))

## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"

gr_mm10 <- GRanges("chr4", IRanges(c(4000000, 107889000), width=1000))

Now assign the genome.

unique(genome(txdb_mm10))

## [1] "mm10"

genome(gr_mm10) <- "mm10"

4 Location in and Around Genes

locateVariants() in the VariantAnnotation package annotates ranges with transcript, exon, cds and gene ID’s from a TxDb. Various extractions are performed on the TxDb (exonsBy(), transcripts(), cdsBy(), etc.) and the result is overlapped with the ranges. An appropriate GRangesList can also be supplied as the annotation. Different variants such as ‘coding’, ‘fiveUTR’, ‘threeUTR’, ‘spliceSite’, ‘intron’, ‘promoter’, and ‘intergenic’ can be searched for by passing the appropriate constructor as the ‘region’ argument. See ?locateVariants for details.

loc_hg19 <- locateVariants(gr_hg19, txdb_hg19, AllVariants())

## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 82536 out-of-bound ranges located on sequences
##   13067, 13068, 13069, 15251, 15252, 13091, 13092, 13093, 13094, 13095,
##   13097, 13123, 13124, 13130, 13131, 13132, 13133, 13136, 13143, 15300,
##   15299, 15301, 15304, 15305, 13177, 13199, 13201, 13205, 13206, 13207,
##   13208, 15329, 15330, 15341, 15342, 15343, 15344, 13260, 13261, 13253,
##   13254, 13255, 13258, 13268, 13264, 13265, 13266, 13267, 15363, 15365,
##   15388, 13288, 13289, 15391, 15392, 15393, 15424, 15420, 15421, 15422,
##   13316, 13317, 13318, 13319, 15449, 15486, 15487, 15488, 15489, 15507,
##   15504, 15505, 15506, 15538, 13391, 13392, 13390, 13399, 15554, 15555,
##   15556, 15557, 13432, 13433, 13437, 13439, 13450, 13447, 13448, 13449,
##   13451, 13452, 13445, 13446, 13453, 13454, 13458, 13467, 13468, 15585,
##   15584, 13513, 13514, 15589, 15590, 15591, 15592, 15593, 15594, 15595,
##   15596, 15603, 15597, 15598, 15599, 15600, 15601, 15602, 15604, 15616,
##   15617, 15625, 13541, 13564, 13565, 13582, 13583, 13584, 13585, 13586,
##   13587, 15657, 15658, 15660, 15661, 13635, 15682, 15684, 15685, 15686,
##   15687, 13713, 13716, 15714, 13717, 13718, 13719, 13720, 15720, 15721,
##   13730, 13731, 13733, 15800, 15808, 15806, 15807, 15809, 15811, 15805,
##   15834, 13775, 13777, 13781, 15843, 15850, 15852, 15863, 15864, 15875,
##   15880, 15881, 15883, 15884, 15885, 15882, 15879, 13791, 13792, 13793,
##   13794, 13795, 13796, 13797, 13798, 13799, 13800, 13801, 13802, 13803,
##   15897, 15896, 13832, 13833, 13836, 13837, 13838, 15967, 15968, 15969,
##   13892, 13893, 13894, 13895, 13896, 13897, 16011, 16013, 13910, 16023,
##   16025, 16027, 16033, 16034, 16035, 16036, 16037, 16039, 16041, 16042,
##   16070, 13951, 13952, 13953, 13950, 13954, 16089, 16090, 16091, 16092,
##   16112, 16113, 16114, 16115, 13974, 13975, 13976, 13994, 13995, 16129,
##   16130, 16131, 16132, 14041, 14037, 14038, 14039, 16174, 16199, 16202,
##   16211, 16212, 14072, 14073, 14074, 14069, 16220, 16223, 16224, 16226,
##   16227, 16228, 16229, 16230, 16240, 16241, 16258, 14091, 14092, 14093,
##   14094, 14096, 14090, 16267, 16266, 14108, 14109, 16272, 16273, 16279,
##   16280, 16281, 14120, 14122, 14123, 14121, 14124, 14125, 16329, 16330,
##   14207, 14208, 16344, 16345, 16346, 16347, 16348, 14310, 14314, 14315,
##   16422, 16423, 16438, 14350, 16498, 16499, 16514, 16515, 16517, 16518,
##   16519, 16520, 14398, 14399, 14401, 14412, 14413, 16548, 16547, 16555,
##   16556, 16557, 16558, 16559, 16560, 16561, 16562, 14440, 14444, 14443,
##   16643, 16644, 16645, 14487, 14488, 14495, 14497, 14498, 14499, 14496,
##   14500, 14512, 14517, 14511, 14513, 14514, 14515, 14516, 14518, 14520,
##   14521, 14522, 14523, 14545, 14546, 14548, 14549, 14550, 14551, 14544,
##   14547, 14552, 16701, 16714, 16716, 16718, 14592, 14594, 14596, 14595,
##   16727, 16728, 14599, 14600, 14601, 14602, 14616, 16815, 16826, 16849,
##   16850, 16851, 16852, 14714, 14715, 14718, 14722, 14723, 14724, 14725,
##   16860, 16861, 16862, 16863, 16888, 14823, 14824, 14825, 16961, 16962,
##   16963, 16964, 16965, 16983, 17007, 17008, 17009, 17010, 17011, 17012,
##   17014, 17015, 17046, 17047, 17048, 17098, 17110, 17114, 17122, 17123,
##   17125, 17126, 17127, 15016, 17151, 17153, 15047, 15048, 15049, 15050,
##   15051, 15052, 15054, 15055, 15057, 15058, 15059, 17157, 17158, 17159,
##   17160, 17156, 17161, 17162, 17163, 17164, 17165, 17166, 17167, 17168,
##   17169, 17172, 17173, 17181, 17182, 17203, 17204, 15102, 17215, 17216,
##   17223, 17221, 17222, 17224, 15132, 15133, 15134, 15135, 15136, 15137,
##   15138, 15139, 15141, 15140, 15157, 15158, 15163, 15164, 15165, 15166,
##   15167, 15168, 15169, 15170, 17304, 17308, 17309, 17310, 17311, 17313,
##   17328, 17329, 17375, 17376, 17379, 17380, 15224, 15226, 15228, 65174,
##   64612, 64613, 64615, 64616, 64611, 64614, 64618, 64619, 64620, 65217,
##   65218, 64652, 65228, 65229, 64682, 65285, 64751, 64755, 64757, 64758,
##   64759, 64760, 64761, 65343, 65344, 65342, 65368, 65384, 65383, 65385,
##   64809, 64811, 64812, 64815, 64816, 64823, 64824, 64825, 64826, 64827,
##   64828, 64829, 64830, 64831, 64832, 64833, 64834, 64835, 64836, 64843,
##   64840, 64841, 64842, 64844, 64837, 64846, 64905, 64913, 64917, 65471,
##   65475, 65476, 65481, 65484, 65485, 65490, 65491, 65492, 65493, 65488,
##   65489, 65558, 65559, 65578, 65580, 64959, 64962, 64964, 65620, 65621,
##   65622, 65623, 65624, 65625, 65626, 65627, 65628, 65629, 65630, 65631,
##   65632, 65633, 65634, 65635, 65636, 65637, 65638, 65639, 65640, 65653,
##   65654, 65655, 65661, 65666, 65032, 65033, 65034, 65035, 65036, 65716,
##   65713, 65714, 65717, 65718, 65138, and 65763. Note that ranges located
##   on a sequence whose length is unknown (NA) or on a circular sequence
##   are not considered out-of-bound (use seqlengths() and isCircular() to
##   get the lengths and circularity flags of the underlying sequences). You
##   can use trim() to trim these ranges. See ?`trim,GenomicRanges-method`
##   for more information.

## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns
## 'select()' returned many:1 mapping between keys and columns

table(loc_hg19$LOCATION)

## 
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic   promoter 
##       1520      17805       2061       1596      28014          9       2374

loc_mm10 <- locateVariants(gr_mm10, txdb_mm10, AllVariants())

## 'select()' returned 1:1 mapping between keys and columns

## 'select()' returned 1:1 mapping between keys and columns
## 'select()' returned 1:1 mapping between keys and columns

table(loc_mm10$LOCATION)

## 
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic   promoter 
##          6          3          0          0          0          0         12

5 Annotate by ID

The ID’s returned from locateVariants() can be used in select() to map to ID’s in other annotation packages.

cols <- c("UNIPROT", "PFAM")
keys <- na.omit(unique(loc_hg19$GENEID))
head(select(org.Hs.eg.db, keys, cols, keytype="ENTREZID"))

## 'select()' returned 1:many mapping between keys and columns

##   ENTREZID    UNIPROT    PFAM
## 1    27255 A0A024R2C7 PF07679
## 2    27255 A0A024R2C7 PF00041
## 3    27255     B4DGV0 PF07679
## 4    27255     B4DGV0 PF00041
## 5    27255     Q9UQ52 PF07679
## 6    27255     Q9UQ52 PF00041

The ‘keytype’ argument specifies that the mouse TxDb contains Ensembl instead of Entrez gene id’s.

keys <- unique(loc_mm10$GENEID)
head(select(org.Mm.eg.db, keys, cols, keytype="ENSEMBL"))

## 'select()' returned 1:many mapping between keys and columns

##              ENSEMBL UNIPROT    PFAM
## 1 ENSMUSG00000058006  J3QMC5 PF07728
## 2 ENSMUSG00000058006  J3QMC5 PF17865
## 3 ENSMUSG00000058006  J3QMC5 PF17867
## 4 ENSMUSG00000058006  A2ANY6 PF07728
## 5 ENSMUSG00000058006  A2ANY6 PF17865
## 6 ENSMUSG00000058006  A2ANY6 PF17867

6 Annotate by Position

Files stored in the AnnotationHub have been pre-processed into ranged-based R objects such as a GRanges, GAlignments and VCF. The positions in our GRanges can be overlapped with the ranges in the AnnotationHub files. This allows for easy subsetting of multiple files, resulting in only the ranges of interest.

Create a ‘hub’ from AnnotationHub and filter the files based on organism and genome build.

hub <- AnnotationHub()

## snapshotDate(): 2021-05-18

hub_hg19 <- subset(hub, 
                  (hub$species == "Homo sapiens") & (hub$genome == "hg19"))
length(hub_hg19)

## [1] 24208

Iterate over the first 3 files and extract ranges that overlap ‘gr_hg19’.

## loading from cache
## loading from cache
## loading from cache

ov_hg19 <- lapply(1:3, function(i) subsetByOverlaps(hub_hg19[[i]], gr_hg19))

## loading from cache
## loading from cache
## loading from cache

Inspect the results.

names(ov_hg19) <- names(hub_hg19)[1:3]
lapply(ov_hg19, head, n=3)

## $AH5012
## UCSC track 'cytoBand'
## UCSCData object with 3 ranges and 1 metadata column:
##       seqnames          ranges strand |        name
##          <Rle>       <IRanges>  <Rle> | <character>
##   [1]     chr3       1-2800000      * |       p26.3
##   [2]     chr3 2800001-4000000      * |       p26.2
##   [3]     chr3 4000001-8700000      * |       p26.1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $AH5013
## UCSC track 'stsMap'
## UCSCData object with 3 ranges and 2 metadata columns:
##       seqnames            ranges strand |        name     score
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr3   8787843-8788458      * |    BV209804      1000
##   [2]     chr3   8788295-8788424      * |     RH92828      1000
##   [3]     chr3 10183290-10183558      * |  GDB:361137      1000
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $AH5014
## UCSC track 'fishClones'
## UCSCData object with 3 ranges and 2 metadata columns:
##       seqnames          ranges strand |        name     score
##          <Rle>       <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr3 4354970-4542349      * |  RP11-91K16      1000
##   [2]     chr3 4765290-4929842      * | RP11-106B10      1000
##   [3]     chr3 8711065-8882271      * |  RP11-128A5      1000
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Annotating the mouse ranges in the same fashion is left as an exercise.

7 Annotating Variants

Amino acid coding changes

For the set of dbSNP variants that fall in coding regions, amino acid changes can be computed. The output contains one line for each variant-transcript match which can result in multiple lines for each variant.

head(predictCoding(vcf, txdb_hg19, Hsapiens), 3)

## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 92870 out-of-bound ranges located on sequences
##   13067, 13068, 13069, 15251, 15252, 13091, 13092, 13093, 13094, 13095,
##   13097, 13123, 13124, 13130, 13131, 13132, 13133, 13136, 13143, 15300,
##   15299, 15301, 15304, 15305, 13177, 13199, 13201, 13205, 13206, 13207,
##   13208, 15329, 15330, 15341, 15342, 15343, 15344, 13260, 13261, 13253,
##   13254, 13255, 13258, 13268, 13264, 13265, 13266, 13267, 15363, 15365,
##   15388, 13288, 13289, 15391, 15392, 15393, 15424, 15420, 15421, 15422,
##   13316, 13317, 13318, 13319, 15449, 15486, 15487, 15488, 15489, 15507,
##   15504, 15505, 15506, 15538, 13391, 13392, 13390, 13399, 15554, 15555,
##   15556, 15557, 13432, 13433, 13437, 13439, 13450, 13447, 13448, 13449,
##   13451, 13452, 13445, 13446, 13453, 13454, 13458, 13467, 13468, 15585,
##   15584, 13513, 13514, 15589, 15590, 15591, 15592, 15593, 15594, 15595,
##   15596, 15603, 15597, 15598, 15599, 15600, 15601, 15602, 15604, 15616,
##   15617, 15625, 13541, 13564, 13565, 13582, 13583, 13584, 13585, 13586,
##   13587, 15657, 15658, 15660, 15661, 13635, 15682, 15684, 15685, 15686,
##   15687, 13713, 13716, 15714, 13717, 13718, 13719, 13720, 15720, 15721,
##   13730, 13731, 13733, 15800, 15808, 15806, 15807, 15809, 15811, 15805,
##   15834, 13775, 13777, 13781, 15843, 15850, 15852, 15863, 15864, 15875,
##   15880, 15881, 15883, 15884, 15885, 15882, 15879, 13791, 13792, 13793,
##   13794, 13795, 13796, 13797, 13798, 13799, 13800, 13801, 13802, 13803,
##   15897, 15896, 13832, 13833, 13836, 13837, 13838, 15967, 15968, 15969,
##   13892, 13893, 13894, 13895, 13896, 13897, 16011, 16013, 13910, 16023,
##   16025, 16027, 16033, 16034, 16035, 16036, 16037, 16039, 16041, 16042,
##   16070, 13951, 13952, 13953, 13950, 13954, 16089, 16090, 16091, 16092,
##   16112, 16113, 16114, 16115, 13974, 13975, 13976, 13994, 13995, 16129,
##   16130, 16131, 16132, 14041, 14037, 14038, 14039, 16174, 16199, 16202,
##   16211, 16212, 14072, 14073, 14074, 14069, 16220, 16223, 16224, 16226,
##   16227, 16228, 16229, 16230, 16240, 16241, 16258, 14091, 14092, 14093,
##   14094, 14096, 14090, 16267, 16266, 14108, 14109, 16272, 16273, 16279,
##   16280, 16281, 14120, 14122, 14123, 14121, 14124, 14125, 16329, 16330,
##   14207, 14208, 16344, 16345, 16346, 16347, 16348, 14310, 14314, 14315,
##   16422, 16423, 16438, 14350, 16498, 16499, 16514, 16515, 16517, 16518,
##   16519, 16520, 14398, 14399, 14401, 14412, 14413, 16548, 16547, 16555,
##   16556, 16557, 16558, 16559, 16560, 16561, 16562, 14440, 14444, 14443,
##   16643, 16644, 16645, 14487, 14488, 14495, 14497, 14498, 14499, 14496,
##   14500, 14512, 14517, 14511, 14513, 14514, 14515, 14516, 14518, 14520,
##   14521, 14522, 14523, 14545, 14546, 14548, 14549, 14550, 14551, 14544,
##   14547, 14552, 16701, 16714, 16716, 16718, 14592, 14594, 14596, 14595,
##   16727, 16728, 14599, 14600, 14601, 14602, 14616, 16815, 16826, 16849,
##   16850, 16851, 16852, 14714, 14715, 14718, 14722, 14723, 14724, 14725,
##   16860, 16861, 16862, 16863, 16888, 14823, 14824, 14825, 16961, 16962,
##   16963, 16964, 16965, 16983, 17007, 17008, 17009, 17010, 17011, 17012,
##   17014, 17015, 17046, 17047, 17048, 17098, 17110, 17114, 17122, 17123,
##   17125, 17126, 17127, 15016, 17151, 17153, 15047, 15048, 15049, 15050,
##   15051, 15052, 15054, 15055, 15057, 15058, 15059, 17157, 17158, 17159,
##   17160, 17156, 17161, 17162, 17163, 17164, 17165, 17166, 17167, 17168,
##   17169, 17172, 17173, 17181, 17182, 17203, 17204, 15102, 17215, 17216,
##   17223, 17221, 17222, 17224, 15132, 15133, 15134, 15135, 15136, 15137,
##   15138, 15139, 15141, 15140, 15157, 15158, 15163, 15164, 15165, 15166,
##   15167, 15168, 15169, 15170, 17304, 17308, 17309, 17310, 17311, 17313,
##   17328, 17329, 17375, 17376, 17379, 17380, 15224, 15226, 15228, 65174,
##   64612, 64613, 64615, 64616, 64611, 64614, 64618, 64619, 64620, 65217,
##   65218, 64652, 65228, 65229, 64682, 65285, 64751, 64755, 64757, 64758,
##   64759, 64760, 64761, 65343, 65344, 65342, 65368, 65384, 65383, 65385,
##   64809, 64811, 64812, 64815, 64816, 64823, 64824, 64825, 64826, 64827,
##   64828, 64829, 64830, 64831, 64832, 64833, 64834, 64835, 64836, 64843,
##   64840, 64841, 64842, 64844, 64837, 64846, 64905, 64913, 64917, 65471,
##   65475, 65476, 65481, 65484, 65485, 65490, 65491, 65492, 65493, 65488,
##   65489, 65558, 65559, 65578, 65580, 64959, 64962, 64964, 65620, 65621,
##   65622, 65623, 65624, 65625, 65626, 65627, 65628, 65629, 65630, 65631,
##   65632, 65633, 65634, 65635, 65636, 65637, 65638, 65639, 65640, 65653,
##   65654, 65655, 65661, 65666, 65032, 65033, 65034, 65035, 65036, 65716,
##   65713, 65714, 65717, 65718, 65138, and 65763. Note that ranges located
##   on a sequence whose length is unknown (NA) or on a circular sequence
##   are not considered out-of-bound (use seqlengths() and isCircular() to
##   get the lengths and circularity flags of the underlying sequences). You
##   can use trim() to trim these ranges. See ?`trim,GenomicRanges-method`
##   for more information.

## GRanges object with 3 ranges and 17 metadata columns:
##               seqnames    ranges strand | paramRangeID            REF
##                  <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##   rs140337334     chr3   1427481      + |           NA              C
##   rs140337334     chr3   1427481      + |           NA              C
##   rs140337334     chr3   1427481      + |           NA              C
##                              ALT      QUAL      FILTER      varAllele    CDSLOC
##               <DNAStringSetList> <numeric> <character> <DNAStringSet> <IRanges>
##   rs140337334                G,T        NA           .              G      2704
##   rs140337334                G,T        NA           .              G      2488
##   rs140337334                G,T        NA           .              G      2704
##                  PROTEINLOC   QUERYID        TXID         CDSID      GENEID
##               <IntegerList> <integer> <character> <IntegerList> <character>
##   rs140337334           902         1       13067         40577       27255
##   rs140337334           830         1       13068         40577       27255
##   rs140337334           902         1       13069         40577       27255
##                 CONSEQUENCE       REFCODON       VARCODON         REFAA
##                    <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
##   rs140337334 nonsynonymous            CCT            GCT             P
##   rs140337334 nonsynonymous            CCT            GCT             P
##   rs140337334 nonsynonymous            CCT            GCT             P
##                       VARAA
##               <AAStringSet>
##   rs140337334             A
##   rs140337334             A
##   rs140337334             A
##   -------
##   seqinfo: 2 sequences from hg19 genome; no seqlengths

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] annotation_1.16.0                        
##  [2] TxDb.Athaliana.BioMart.plantsmart22_3.0.1
##  [3] biomaRt_2.48.0                           
##  [4] BSgenome.Hsapiens.UCSC.hg19_1.4.3        
##  [5] BSgenome_1.60.0                          
##  [6] rtracklayer_1.52.0                       
##  [7] Homo.sapiens_1.3.1                       
##  [8] GO.db_3.13.0                             
##  [9] OrganismDbi_1.34.0                       
## [10] org.Mm.eg.db_3.13.0                      
## [11] org.Hs.eg.db_3.13.0                      
## [12] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0   
## [13] TxDb.Hsapiens.UCSC.hg38.knownGene_3.13.0 
## [14] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  
## [15] GenomicFeatures_1.44.0                   
## [16] AnnotationDbi_1.54.0                     
## [17] Organism.dplyr_1.20.0                    
## [18] AnnotationFilter_1.16.0                  
## [19] dplyr_1.0.6                              
## [20] AnnotationHub_3.0.0                      
## [21] BiocFileCache_2.0.0                      
## [22] dbplyr_2.1.1                             
## [23] VariantAnnotation_1.38.0                 
## [24] Rsamtools_2.8.0                          
## [25] Biostrings_2.60.0                        
## [26] XVector_0.32.0                           
## [27] SummarizedExperiment_1.22.0              
## [28] Biobase_2.52.0                           
## [29] GenomicRanges_1.44.0                     
## [30] GenomeInfoDb_1.28.0                      
## [31] IRanges_2.26.0                           
## [32] S4Vectors_0.30.0                         
## [33] MatrixGenerics_1.4.0                     
## [34] matrixStats_0.58.0                       
## [35] BiocGenerics_0.38.0                      
## [36] BiocStyle_2.20.0                         
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                  bit64_4.0.5                  
##  [3] filelock_1.0.2                progress_1.2.2               
##  [5] httr_1.4.2                    tools_4.1.0                  
##  [7] bslib_0.2.5.1                 utf8_1.2.1                   
##  [9] R6_2.5.0                      lazyeval_0.2.2               
## [11] DBI_1.1.1                     withr_2.4.2                  
## [13] tidyselect_1.1.1              prettyunits_1.1.1            
## [15] bit_4.0.4                     curl_4.3.1                   
## [17] compiler_4.1.0                graph_1.70.0                 
## [19] DelayedArray_0.18.0           bookdown_0.22                
## [21] sass_0.4.0                    RBGL_1.68.0                  
## [23] rappdirs_0.3.3                stringr_1.4.0                
## [25] digest_0.6.27                 rmarkdown_2.8                
## [27] pkgconfig_2.0.3               htmltools_0.5.1.1            
## [29] fastmap_1.1.0                 rlang_0.4.11                 
## [31] RSQLite_2.2.7                 shiny_1.6.0                  
## [33] jquerylib_0.1.4               BiocIO_1.2.0                 
## [35] generics_0.1.0                jsonlite_1.7.2               
## [37] BiocParallel_1.26.0           RCurl_1.98-1.3               
## [39] magrittr_2.0.1                GenomeInfoDbData_1.2.6       
## [41] Matrix_1.3-3                  Rcpp_1.0.6                   
## [43] fansi_0.4.2                   lifecycle_1.0.0              
## [45] stringi_1.6.2                 yaml_2.2.1                   
## [47] zlibbioc_1.38.0               grid_4.1.0                   
## [49] blob_1.2.1                    promises_1.2.0.1             
## [51] crayon_1.4.1                  lattice_0.20-44              
## [53] hms_1.1.0                     KEGGREST_1.32.0              
## [55] knitr_1.33                    pillar_1.6.1                 
## [57] rjson_0.2.20                  XML_3.99-0.6                 
## [59] glue_1.4.2                    BiocVersion_3.13.1           
## [61] evaluate_0.14                 BiocManager_1.30.15          
## [63] png_0.1-7                     vctrs_0.3.8                  
## [65] httpuv_1.6.1                  purrr_0.3.4                  
## [67] assertthat_0.2.1              cachem_1.0.5                 
## [69] xfun_0.23                     mime_0.10                    
## [71] xtable_1.8-4                  restfulr_0.0.13              
## [73] later_1.2.0                   tibble_3.1.2                 
## [75] GenomicAlignments_1.28.0      memoise_2.0.0                
## [77] ellipsis_0.3.2                interactiveDisplayBase_1.30.0

8 Exercises

Exercise 1: VCF header and reading data subsets.

VCF files can be large and it’s often the case that only a subset of variables or genomic positions are of interest. The scanVcfHeader() function in the VariantAnnotation package retrieves header information from a VCF file. Based on the information returned from scanVcfHeader() a ScanVcfParam() object can be created to read in a subset of data from a VCF file. * Use scanVcfHeader() to inspect the header information in the ‘chr22.vcf.gz’ file in VariantAnnotation package. * Select a few ‘info’ or ‘geno’ variables and create a ScanVcfParam object. * Use the ScanVcfParam object as the ‘param’ argument to readVcf() to read in a subset of data. Note that the header() accessor operates on VCF objects in the R workspace. Try header(vcf) on the dbSNP file from AnnotationHub.

Exercise 2: Annotate the mouse ranges in ‘gr_mm10’ with AnnotationHub files. * Create a new ‘hub’ and filter on organism. * Isolate the files for the appropriate genome build and perform overlaps.

Exercise 3: Annotate a gene range from Saccharomyces Scerevisiae. * Load TxDb.Scerevisiae.UCSC.sacCer3.sgdGene and extract the gene ranges. (Hint: use transcriptsBy() and range()). * Isolate the range for gene “YBL086C”. * Create a new ‘hub’ from AnnotationHub and filter by organism. (You should see >= 39 files.) * Select the files for ‘sacCer3’ and perform overlaps.

[ Back to top ]

Annotating Genomic Ranges

24 April 2018

Contents