Contents

R version: R version 4.0.3 (2020-10-10)

Bioconductor version: 3.12

Package: 1.14.0

1 Abstract

Bioconductor enables the analysis and comprehension of high- throughput genomic data. We have a vast number of packages that allow rigorous statistical analysis of large data while keeping technological artifacts in mind. Bioconductor helps users place their analytic results into biological context, with rich opportunities for visualization. Reproducibility is an important goal in Bioconductor analyses. Different types of analysis can be carried out using Bioconductor, for example

For these analyses, one typically imports and works with diverse sequence-related file types, including fasta, fastq, BAM, gtf, bed, and wig files, among others. Bioconductor packages support import, common and advanced sequence manipulation operations such as trimming, transformation, and alignment including quality assessment.

2 Sequencing Resources

Here is a illustrative description elaborating the different file types at various stages in a typical analysis, with the package names (in pink boxes) that one will use for each stage.

The following packages illustrate the diversity of functionality available; all are in the release version of Bioconductor.

Bioconductor packages are organized by biocViews. Some of the entries under Sequencing and other terms, and representative packages, include:

3 Ranges Infrastructure

Many Bioconductor packages rely heavily on the IRanges / GenomicRanges infrastructure. Thus we will begin with a quick introduction to these and then cover different file types.

The GenomicRanges package allows us to associate a range of chromosome coordinates with a sequence name (e.g., chromosome) and a strand. Such genomic ranges are very useful for describing both data (e.g., the coordinates of aligned reads, called ChIP peaks, SNPs, or copy number variants) and annotations (e.g., gene models, Roadmap Epigenomics regulatory elements, known clinically relevant variants from dbSNP). GRanges is an object representing a vector of genomic locations and associated annotations. Each element in the vector is comprised of a sequence name, a range, a strand, and optional metadata (e.g. score, GC content, etc.).

library(GenomicRanges)
GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 3, 4)),
      IRanges(1:10, width=5), strand='-',
      score=101:110, GC = runif(10))

Genomic ranges can be created ‘by hand’, as above, but are often the result of importing data (e.g., via GenomicAlignments::readGAlignments()) or annotation (e.g., via GenomicFeatures::select() or rtracklayer::import() of BED, WIG, GTF, and other common file formats). Use help() to list the help pages in the GenomicRanges package, and vignettes() to view and access available vignettes.

help(package="GenomicRanges")
vignette(package="GenomicRanges")

Some of the common operations on GRanges include findOverlaps(query, subject) and nearest(query, subject), which identify the ranges in query that overlap ranges in subject, or the range in subject nearest to `query. These operations are useful both in data analysis (e.g., counting overlaps between aligned reads and gene models in RNAseq) and comprehension (e.g., annotating genes near ChIP binding sites).

4 DNA /amino acid sequence from FASTA files

Biostrings classes (e.g., DNAStringSet) are used to represent DNA or amino acid sequences. In the example below we will construct a DNAString and show some manipulations.

library(Biostrings)
d <- DNAString("TTGAAAA-CTC-N")
length(d)  #no of letters in the DNAString
## [1] 13

We will download all Homo sapiens cDNA sequences from the FASTA file ‘Homo_sapiens.GRCh38.cdna.all.fa’ from Ensembl using AnnotationHub.

library(AnnotationHub)
ah <- AnnotationHub()

This file is downloaded as a TwoBitFile

ah2 <- query(ah, c("fasta", "homo sapiens", "Ensembl", "cdna"))
dna <- ah2[["AH68262"]]
dna
## TwoBitFile object
## resource: /tmp/RtmpOr5pHX/BiocFileCache/6a812f32682b_75008

The sequences in the file can be read in using getSeq() from the Biostrings package. The sequences are returned as a DNAStringSet object.

getSeq(dna)
## DNAStringSet object of length 187626:
##           width seq                                                             names               
##      [1]     12 GGGACAGGGGGC                                                    ENST00000632684.1
##      [2]      9 CCTTCCTAC                                                       ENST00000434970.2
##      [3]     13 ACTGGGGGATACG                                                   ENST00000448914.1
##      [4]      8 GAAATAGT                                                        ENST00000415118.1
##      [5]     12 GGGACAGGGGGC                                                    ENST00000631435.1
##      ...    ... ...
## [187622]   1370 GGCTGAGTCTGGGCCCCAGGACCCGCATGC...GAAGCTTCCCAAGATGCAGCCGGGAGGTGA ENST00000639790.1
## [187623]    284 GGCGTCTACAAGAGACCTTCCTTCTCAGCT...TGAGTGATCAGCCCTAGATGACCACTGTTA ENST00000639660.1
## [187624]    105 TGCATCACTCTGGCATTGACTTCCTGGATT...CTGTCCTTCTGTGGACCCCAGAAAGTTAAT ENST00000643577.1
## [187625]    900 ATGGGATGTCACCAATCAATGGTCACAGAA...AGGGCACTGAGGAAGGAGAGGCTGATGTAA ENST00000646356.1
## [187626]    930 ATGGGAGTCAACCAATCATGGGTCACAGAA...CACAGAGCACTGCAGAGGACGCTGTCTATG ENST00000645792.1

BSgenome packages inside Bioconductor contain whole genome sequences as distributed by ENSEMBL, NCBI and others. In this next example we will load the whole genome sequence for Homo sapiens from UCSC’s hg19 build, and calculate the GC content across chromosome 14.

library(BSgenome.Hsapiens.UCSC.hg19)

chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"]))
chr14_dna <- getSeq(Hsapiens, chr14_range)
letterFrequency(chr14_dna, "GC", as.prob=TRUE)
##           G|C
## [1,] 0.336276

5 Reads from FASTQ files

ShortRead package from Bioconductor can be used for working with fastq files. Here we illustrate a quick example where one can read in multiple fasta files, collect some statistics and generate a report about the same.

BiocParallel is another package from Bioconductor which parallelizes this task and speeds up the process.

## 1. attach ShortRead and BiocParallel
library(ShortRead)
library(BiocParallel)

## 2. create a vector of file paths
fls <- dir("~/fastq", pattern="*fastq", full=TRUE)

## 3. collect statistics
stats0 <- qa(fls)

## 4. generate and browse the report
if (interactive())
    browseURL(report(stats0))

Two useful functions in ShortRead are trimTails() for processing FASTQ files, and FastqStreamer() for iterating through FASTQ files in manageable chunks (e.g., 1,000,000 records at a time).

6 Aligned Reads from BAM files

The GenomicAlignments package is used to input reads aligned to a reference genome.

In this next example, we will read in a BAM file and specifically read in reads supporting an apparent exon splice junction spanning position 19653773 of chromosome 14.

The package RNAseqData.HNRNPC.bam.chr14_BAMFILES contains 8 BAM files. We will use only the first BAM file. We will load the software packages and the data package, construct a GRanges with our region of interest, and use summarizeJunctions() to find reads in our region of interest.

## 1. load software packages
library(GenomicRanges)
library(GenomicAlignments)

## 2. load sample data
library('RNAseqData.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)

## 3. define our 'region of interest'
roi <- GRanges("chr14", IRanges(19653773, width=1)) 

## 4. alignments, junctions, overlapping our roi
paln <- readGAlignmentsList(bf)
j <- summarizeJunctions(paln, with.revmap=TRUE)
j_overlap <- j[j %over% roi]

## 5. supporting reads
paln[j_overlap$revmap[[1]]]
## GAlignmentsList object of length 8:
## [[1]]
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width     njunc
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##   [1]    chr14      -   66M120N6M        72  19653707  19653898       192         1
##   [2]    chr14      +  7M1270N65M        72  19652348  19653689      1342         1
##   -------
##   seqinfo: 93 sequences from an unspecified genome
## 
## [[2]]
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width     njunc
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##   [1]    chr14      -   66M120N6M        72  19653707  19653898       192         1
##   [2]    chr14      +         72M        72  19653686  19653757        72         0
##   -------
##   seqinfo: 93 sequences from an unspecified genome
## 
## [[3]]
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width     njunc
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##   [1]    chr14      +         72M        72  19653675  19653746        72         0
##   [2]    chr14      -   65M120N7M        72  19653708  19653899       192         1
##   -------
##   seqinfo: 93 sequences from an unspecified genome
## 
## ...
## <5 more elements>

For a detailed tutorial on working with BAM files do check out this detailed Overlap Encodings vignette of GenomicAlignments.

7 Called Variants from VCF files

VCF (Variant Call Files) describe SNP and other variants. The files contain meta-information lines, a header line with column names, and then (many!) data lines, each with information about a position in the genome, and optional genotype information on samples for each position.

Data are parsed into a VCF object with readVcf() from VariantAnnoation

library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")

An excellent workflow on working with Variants can be found here. In particular it is possible to read in specific components of the VCF file (e.g., readInfo(), readGeno()) and parts of the VCF at specific genomic locations (using GRanges and the param = ScanVcfParam() argument to input functions).

8 Genome Annotations from BED, WIG, GTF etc files

rtracklayer import and export functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser.

rtracklayer contains a ‘test’ BED file which we will read in here

library(rtracklayer)
test_path <- system.file("tests", package = "rtracklayer")
test_bed <- file.path(test_path, "test.bed")
  
test <- import(test_bed, format = "bed")
test
## UCSC track 'ItemRGBDemo'
## UCSCData object with 5 ranges and 5 metadata columns:
##       seqnames              ranges strand |        name     score     itemRgb               thick
##          <Rle>           <IRanges>  <Rle> | <character> <numeric> <character>           <IRanges>
##   [1]     chr7 127471197-127472363      + |        Pos1         0     #FF0000 127471197-127472363
##   [2]     chr7 127472364-127473530      + |        Pos2         2     #FF0000 127472364-127473530
##   [3]     chr7 127473531-127474697      - |        Neg1         0     #FF0000 127473531-127474697
##   [4]     chr9 127474698-127475864      + |        Pos3         5     #FF0000 127474698-127475864
##   [5]     chr9 127475865-127477031      - |        Neg2         5     #0000FF 127475865-127477031
##                        blocks
##                 <IRangesList>
##   [1] 1-300,501-700,1068-1167
##   [2]          1-250,668-1167
##   [3]                  1-1167
##   [4]                  1-1167
##   [5]                  1-1167
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

The file is returned to the user as a GRanges instance. A more detailed tutorial can be found here

AnnotationHub also contains a variety of genomic annotation files (eg BED, GTF, BigWig) which use import() from rtracklayer behind the scenes. For a detailed tutorial the user is referred to Annotation workflow and AnnotationHub HOW TO vignette

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RNAseqData.HNRNPC.bam.chr14_0.28.0 BSgenome.Hsapiens.UCSC.hg19_1.4.3 
##  [3] BSgenome_1.58.0                    AnnotationHub_2.22.0              
##  [5] BiocFileCache_1.14.0               dbplyr_1.4.4                      
##  [7] VariantAnnotation_1.36.0           rtracklayer_1.50.0                
##  [9] ShortRead_1.48.0                   BiocParallel_1.24.0               
## [11] GenomicAlignments_1.26.0           Rsamtools_2.6.0                   
## [13] Biostrings_2.58.0                  XVector_0.30.0                    
## [15] SummarizedExperiment_1.20.0        Biobase_2.50.0                    
## [17] MatrixGenerics_1.2.0               matrixStats_0.57.0                
## [19] GenomicRanges_1.42.0               GenomeInfoDb_1.26.0               
## [21] IRanges_2.24.0                     S4Vectors_0.28.0                  
## [23] BiocGenerics_0.36.0                BiocStyle_2.18.0                  
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_4.0.5                   RColorBrewer_1.1-2           
##  [4] progress_1.2.2                httr_1.4.2                    tools_4.0.3                  
##  [7] R6_2.5.0                      DBI_1.1.0                     tidyselect_1.1.0             
## [10] prettyunits_1.1.1             bit_4.0.4                     curl_4.3                     
## [13] compiler_4.0.3                xml2_1.3.2                    DelayedArray_0.16.0          
## [16] bookdown_0.21                 askpass_1.1                   rappdirs_0.3.1               
## [19] stringr_1.4.0                 digest_0.6.27                 rmarkdown_2.5                
## [22] jpeg_0.1-8.1                  pkgconfig_2.0.3               htmltools_0.5.0              
## [25] fastmap_1.0.1                 rlang_0.4.8                   RSQLite_2.2.1                
## [28] shiny_1.5.0                   generics_0.0.2                hwriter_1.3.2                
## [31] dplyr_1.0.2                   RCurl_1.98-1.2                magrittr_1.5                 
## [34] GenomeInfoDbData_1.2.4        Matrix_1.2-18                 Rcpp_1.0.5                   
## [37] lifecycle_0.2.0               stringi_1.5.3                 yaml_2.2.1                   
## [40] zlibbioc_1.36.0               grid_4.0.3                    blob_1.2.1                   
## [43] promises_1.1.1                crayon_1.3.4                  lattice_0.20-41              
## [46] GenomicFeatures_1.42.0        hms_0.5.3                     knitr_1.30                   
## [49] pillar_1.4.6                  biomaRt_2.46.0                XML_3.99-0.5                 
## [52] glue_1.4.2                    BiocVersion_3.12.0            evaluate_0.14                
## [55] latticeExtra_0.6-29           BiocManager_1.30.10           png_0.1-7                    
## [58] vctrs_0.3.4                   httpuv_1.5.4                  openssl_1.4.3                
## [61] purrr_0.3.4                   assertthat_0.2.1              xfun_0.18                    
## [64] mime_0.9                      xtable_1.8-4                  later_1.1.0.1                
## [67] tibble_3.0.4                  AnnotationDbi_1.52.0          memoise_1.1.0                
## [70] ellipsis_0.3.1                interactiveDisplayBase_1.28.0