1 Basic usage
2 Low level parsing
3 retaining the genomic sequence
4 rtracklayer style import
5 Retrieving and parsing GenBank information by Versioned Accession
6 Retrieving only origin sequence
7 Creating TxDb objects from genbank annotations
8 Details and caveats
9 Performance

1 Basic usage

Simply put, the genbankr package parses files in the NCBI’s GenBank (gb/gbk) format into R.

The primary workhorse is provided by the readGenBank function, which accepts either a GenBank file (via the file argument) or raw text in the GenBank format (via the text argument).

suppressPackageStartupMessages(library(genbankr))
smpfile = system.file("sample.gbk", package="genbankr")
gb = readGenBank(smpfile)

## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column

## No exons read from genbank file. Assuming sections of CDS are full exons

## No transcript features (mRNA) found, using spans of CDSs

## Warning in fill_stack_df(feats[!typs %in% c("gene", "exon", "CDS",
## "variation", : Got unexpected multi-value field(s) [ rpt_type ]. The
## resulting column(s) will be of class CharacterList, rather than vector(s).
## Please contact the maintainer if multi-valuedness is expected/meaningful for
## the listed field(s).

gb

## GenBank Annotations
## Human herpesvirus 5 strain VR1814, complete genome. 
## Accession: GU179289
## 1 Sequence(s) with total length length: 235233
## 174 genes
## 170 transcripts
## 191 exons/cds elements
## 61 variations
## 24 other features

readGenBank generates a GenBankFull object, which contains the annotations and the origin sequence. genbankr provides methods for the AnnotationDbi API functions for retrieving information from the object.

We can retreive the genes via the function of the same name:

genes(gb)

## GRanges object with 174 ranges and 5 metadata columns:
##         seqnames           ranges strand |        type        gene
##            <Rle>        <IRanges>  <Rle> | <character> <character>
##     [1]   VR1814     [1232, 2164]      + |        gene         RL1
##     [2]   VR1814     [2353, 4863]      - |        gene      RNA2.7
##     [3]   VR1814     [5252, 5539]      - |        gene        RL5A
##     [4]   VR1814     [5888, 6223]      - |        gene         RL6
##     [5]   VR1814     [6622, 7677]      - |        gene      RNA1.2
##     ...      ...              ...    ... .         ...         ...
##   [170]   VR1814 [229242, 229793]      + |        gene        US32
##   [171]   VR1814 [229995, 230168]      + |        gene       US33A
##   [172]   VR1814 [230337, 230828]      + |        gene        US34
##   [173]   VR1814 [230822, 231016]      + |        gene       US34A
##   [174]   VR1814 [231986, 234349]      - |        gene        TRS1
##             loctype                                     note     gene_id
##         <character>                              <character> <character>
##     [1]      normal                                     <NA>         RL1
##     [2]      normal                                     <NA>      RNA2.7
##     [3]      normal                                     <NA>        RL5A
##     [4]      normal                                     <NA>         RL6
##     [5]      normal                                     <NA>      RNA1.2
##     ...         ...                                      ...         ...
##   [170]      normal                                     <NA>        US32
##   [171]      normal                                     <NA>       US33A
##   [172]      normal                                     <NA>        US34
##   [173]      normal                                     <NA>       US34A
##   [174]      normal putative beta gene; immediate early gene        TRS1
##   -------
##   seqinfo: 1 sequence from GU179289.1 genome

We can also do the same for exons, cds elements, and transcripts (code run but output ommitted for brevity):

cds(gb)
exons(gb)
transcripts(gb)

We can also access elements not in the standard TxDb API, such as variants and features which don’t fit into any of the other categories:

variants(gb)
otherFeatures(gb)

Furthermore, we can access header-level information via accessors as well:

accession(gb)

## [1] "GU179289"

vers(gb)

## accession.version         GenInfoID 
##      "GU179289.1"       "270355759"

We can get the seqinfo for a GenBankAnnot/GenBankFull object:

seqinfo(gb)

## Seqinfo object with 1 sequence from GU179289.1 genome:
##   seqnames seqlengths isCircular     genome
##   VR1814       235233      FALSE GU179289.1

Finally, we can ge the sequence itself from a GenBankFull object:

getSeq(gb)

##   A DNAStringSet instance of length 1
##      width seq                                           names               
## [1] 235233 CCATTCCGGGCCGCGTGCTGG...CCGCCAGTGCGGGACAGGGCT VR1814

2 Low level parsing

While use and integration of the Bioconductor machinery is recommended, we also provide low-level parsing capabilities via the workhorse parseGenBank function. This function returns a list structure roughly corresponding to the top-level headings within the genbank format itself:

pg = parseGenBank(smpfile)
str(pg, max.level = 1)

## List of 10
##  $ LOCUS     : chr "LOCUS       GU179289              235233 bp    DNA     linear   VRL 09-MAY-2013"
##  $ FEATURES  :List of 429
##   ..- attr(*, "dim")= int 429
##   ..- attr(*, "dimnames")=List of 1
##  $ ORIGIN    :Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
##  $ ACCESSION : chr "GU179289"
##  $ COMMENT   : NULL
##  $ DEFINITION: chr "Human herpesvirus 5 strain VR1814, complete genome."
##  $ KEYWORDS  : chr NA
##  $ REFERENCE : NULL
##  $ SOURCE    :List of 3
##  $ VERSION   : Named chr [1:2] "GU179289.1" "270355759"
##   ..- attr(*, "names")= chr [1:2] "accession.version" "GenInfoID"

3 retaining the genomic sequence

If desired, readGenBank and parseGenBank can omit full sequence of the organism by specifying ret.seq=FALSE. In this case, readGenBank returns a GenBankAnnot object, rather than a GenBankFull

gbf = readGenBank(smpfile, ret.seq = FALSE)

## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column

## No exons read from genbank file. Assuming sections of CDS are full exons

## No transcript features (mRNA) found, using spans of CDSs

## Warning in fill_stack_df(feats[!typs %in% c("gene", "exon", "CDS",
## "variation", : Got unexpected multi-value field(s) [ rpt_type ]. The
## resulting column(s) will be of class CharacterList, rather than vector(s).
## Please contact the maintainer if multi-valuedness is expected/meaningful for
## the listed field(s).

gbf

## GenBank Annotations
## Human herpesvirus 5 strain VR1814, complete genome. 
## Accession: GU179289
## 1 Sequence(s) with total length length: 235233
## 174 genes
## 170 transcripts
## 191 exons/cds elements
## 61 variations
## 24 other features

All of the accessor methods discussed above work for GenBankFull objects as they do for GenBankAnnot objects.

4 rtracklayer style import

We also provide a convenience method for using rtracklayer import style mechanics for reading GenBank files:

gbkfile = GenBankFile(smpfile)
gb2 = import(gbkfile)

## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column

## No exons read from genbank file. Assuming sections of CDS are full exons

## No transcript features (mRNA) found, using spans of CDSs

## Warning in fill_stack_df(feats[!typs %in% c("gene", "exon", "CDS",
## "variation", : Got unexpected multi-value field(s) [ rpt_type ]. The
## resulting column(s) will be of class CharacterList, rather than vector(s).
## Please contact the maintainer if multi-valuedness is expected/meaningful for
## the listed field(s).

5 Retrieving and parsing GenBank information by Versioned Accession

genbankr provides the GBAccession class and constructor for representing versioned Nuccore accession numbers.

gba = GBAccession("U49845.1")
gba

## GenBank Accession Number(s):  U49845.1

These accession objects can be passed directly to readGenBank:

readGenBank(gba, partial=TRUE)

## Loading required namespace: rentrez

## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column

## genes not available for all CDS ranges, using internal grouping ids

## No exons read from genbank file. Assuming sections of CDS are full exons

## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column

## GenBank Annotations
## Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. 
## Accession: U49845
## 1 Sequence(s) with total length length: 5028
## 2 genes
## 3 transcripts
## 3 exons/cds elements
## 0 variations
## 0 other features

6 Retrieving only origin sequence

genbankr also provides a fastpath for extracting only the sequence of an organism. We can call getSeq on a GenBankFile or GBAccession object

getSeq(gbkfile)

##   A DNAStringSet instance of length 1
##      width seq                                           names               
## [1] 235233 CCATTCCGGGCCGCGTGCTGG...CCGCCAGTGCGGGACAGGGCT VR1814

Additionally, we can specify ret.anno = FALSE in parseGenBank

parseGenBank(smpfile, ret.anno=FALSE)

##   A DNAStringSet instance of length 1
##      width seq                                           names               
## [1] 235233 CCATTCCGGGCCGCGTGCTGG...CCGCCAGTGCGGGACAGGGCT VR1814

7 Creating TxDb objects from genbank annotations

genbankr provides the makeTxDbFromGenBank function, which accepts a GenBankRecord or GBAccession object and returns a TxDb of the annotations.

gbr = readGenBank(smpfile)

## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
## Annotations don't have 'locus_tag' label, using 'gene' as gene_id column

## No exons read from genbank file. Assuming sections of CDS are full exons

## No transcript features (mRNA) found, using spans of CDSs

## Warning in fill_stack_df(feats[!typs %in% c("gene", "exon", "CDS",
## "variation", : Got unexpected multi-value field(s) [ rpt_type ]. The
## resulting column(s) will be of class CharacterList, rather than vector(s).
## Please contact the maintainer if multi-valuedness is expected/meaningful for
## the listed field(s).

tx = makeTxDbFromGenBank(gbr)
tx

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # accession.version: GU179289.1
## # GenInfoID: 270355759
## # genome: GU179289.1
## # source: NCBI (GenBank)
## # definition: Human herpesvirus 5 strain VR1814, complete genome.
## # Db content generated by: genbankr package from Bioconductor
## # genbankr version at creation time: 1.6.1
## # transcript_nrow: 170
## # exon_nrow: 189
## # cds_nrow: 191
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2018-03-05 20:26:41 -0500 (Mon, 05 Mar 2018)
## # GenomicFeatures version at creation time: 1.30.3
## # RSQLite version at creation time: 2.0
## # DBSCHEMAVERSION: 1.2

8 Details and caveats

Often times, GenBank files don’t contain exhaustive annotations. For example, files including CDS annotations often do not have separate transcript features. Furthermore, chromosomes are not always named, particularly in organisms that have only one. The details of how genbankr handles such cases are as follows:

In files where CDSs are annotated but individual exons are not, ‘approximate exons’ are defined as the individual contiguous elements within each CDS. Currently, no mixing of approximate and explicitly annotated exons is performed, even in cases where, e.g., exons are not annotated for some genes with CDS annotations.

In files where transcripts are not present, ‘approximate transcripts’ defined by the ranges spanned by groups of exons are used. Currently, we do not support generating approximate transcripts from CDSs in files that contain actual transcript annotations, even if those annotations do not cover all genes with CDS/exon annotations.

Features (gene, cds, variant, etc) are assumed to be contained within the most recent previous source feature (chromosome/physical piece of DNA). Chromosome name for source features (seqnames in the resulting GRanges}/VRanges is determined as follows:

The ‘chromosome’ attribute, as is (e.g., “chr1”);
the ‘strain’ attribute, combined with auto-generated count (e.g., “VR1814:1”);
the ‘organism’ attribute, combined with auto-generated count (e.g.“Human herpesvirus 5:1”)

Some GenBank files do not include origin sequence. In these cases, variation features are not supported, as there is no self-contained way to determine reference sequence and the features themselves typically contain only alt information (if that). In the case of files containing variation features but no origin sequence, those features are ignored with a warning.

Currently some information about from the header of a GenBank file, primarily reference and author based information, is not captured and returned. Please contact the maintainer if you have a direct use-case for this type of information.

9 Performance

We have taken pains to make the genbankr parser as effcient as easily possible. On our local machines, a 19MB genbank file takes 2-3 minutes to be parsed. That said, this package is not tested and likely is not suitable for processing extremely large genbank files. We suggest obtaining the annotations in a different format in such cases.

Contents