When using the first version of metaseqR, one had to either embed the annotation to the gene/exon/3’ UTR counts, or to download and construct on-the-fly the required annotation when starting from BAM files. Although the counting and gene model construction results could (anf still can) be saved and re-used with other analysis parameters changed (e.g. statistical algorithms), one could not easily add for example new data to an existing dataset without re-running the whole pipeline and re-downloading annotation. On top of that, many times, the main Ensembl servers (when using Ensembl annotations) do not respond well to biomaRt calls, so the whole pipeline may stall until the servers are back.
Another main issue with the annotation used by metaseqR was that there was no straightforward way, provided by metaseqR, to archive and version the annotation used by a specific analysis and was up to the user to take care of reproducibility at this level. Furthermore, there was no straightforward way for a user to plugin own annotation elements (e.g. in the form of a GTF file) and use it in the same manner as standard annotations supported by metaseqR, e.g. when analyzing data from not-so-often studied organisms such as insects. Plugging-in own annotation was possible but usually a painful procedure, which has become now very easy.
The annotation database builder for metaseqR2 remedies the above situations. The
buildAnnotationDatabase
function should be run once with the organisms one
requires to have locally to work with and then that’s it! Of course you can
manage your database by adding and removing specific annotations (and you even
can play with an SQLite browser, although not advised, as the database structure
is rather simple). Furthermore, you can use the metaseqR2 annotation database
and management mechanism for any other type of analysis where you require to
have a simple tab-delimited annotation file, acquired with very little effort.
The following organisms (essentially genome versions) are supported for automatic database builds:
To install the metaseqR2 package, start R and enter:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("metaseqR2")
By default, the database file will be written in the
system.file(package="metaseqR2")
directory. You can specify another prefered
destination for it using the db
argument in the function call, but if you do
that, you will have to supply the localDb
argument pointing to the SQLite
database file you created to every metaseqr2 call you perform, otherwise, the
pipeline will download and use annotations on-the-fly.
In this vignette, we will build a minimal database comprising only the mouse
mm10 genome version from Ensembl. The database will be build in a temporary
directory inside session tempdir()
.
Important note: As the annotation build function makes use of Kent utilities for creating 3’UTR annotations from RefSeq and UCSC, the latter cannot be built in Windows. Therefore it is advised to either build the annotation database in a Linux system or use our pre-built databases.
library(metaseqR2)
buildDir <- file.path(tempdir(),"test_anndb")
dir.create(buildDir)
# The location of the custom database
myDb <- file.path(buildDir,"testann.sqlite")
# Since we are using Ensembl, we can also ask for a version
organisms <- list(mm10=100)
sources <- ifelse(.Platform$OS.type=="unix",c("ensembl","refseq"),"ensembl")
# If the example is not running in a multicore system, rc is ignored
buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,db=myDb,rc=0.5)
## Opening metaseqR2 SQLite database /tmp/Rtmpv8Tm64/test_anndb/testann.sqlite
## Retrieving genome information for mm10 from ensembl
## Retrieving gene annotation for mm10 from ensembl version 100
## Using Ensembl host https://apr2020.archive.ensembl.org
## Retrieving transcript annotation for mm10 from ensembl version 100
## Using Ensembl host https://apr2020.archive.ensembl.org
## Merging transcripts for mm10 from ensembl version 100
## Retrieving 3' UTR annotation for mm10 from ensembl version 100
## Using Ensembl host https://apr2020.archive.ensembl.org
## Merging gene 3' UTRs for mm10 from ensembl version 100
## Merging transcript 3' UTRs for mm10 from ensembl version 100
## Retrieving exon annotation for mm10 from ensembl version 100
## Using Ensembl host https://apr2020.archive.ensembl.org
## Retrieving extended exon annotation for mm10 from ensembl version 100
## Using Ensembl host https://apr2020.archive.ensembl.org
## Merging exons for mm10 from ensembl version 100
## Merging exons for mm10 from ensembl version 100
Now, that a small database is in place, let’s retrieve some data. Remember that
since the built database is not in the default location, we need to pass the
database file in each data retrieval function. The annotation is retrieved as
a GRanges
object by default.
# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="gene",
db=myDb)
genes
## GRanges object with 55364 ranges and 4 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSMUSG00000102693 chr1 3073253-3074322 + | ENSMUSG00000102693
## ENSMUSG00000064842 chr1 3102016-3102125 + | ENSMUSG00000064842
## ENSMUSG00000051951 chr1 3205901-3671498 - | ENSMUSG00000051951
## ENSMUSG00000102851 chr1 3252757-3253236 + | ENSMUSG00000102851
## ENSMUSG00000103377 chr1 3365731-3368549 - | ENSMUSG00000103377
## ... ... ... ... . ...
## ENSMUSG00000095366 chrY 90752427-90755467 - | ENSMUSG00000095366
## ENSMUSG00000095134 chrY 90753057-90763485 + | ENSMUSG00000095134
## ENSMUSG00000096768 chrY 90784738-90816465 + | ENSMUSG00000096768
## ENSMUSG00000099871 chrY 90837413-90844040 + | ENSMUSG00000099871
## ENSMUSG00000096850 chrY 90838869-90839177 - | ENSMUSG00000096850
## gc_content gene_name biotype
## <numeric> <character> <character>
## ENSMUSG00000102693 34.21 4933401J01Rik TEC
## ENSMUSG00000064842 36.36 Gm26206 snRNA
## ENSMUSG00000051951 38.51 Xkr4 protein_coding
## ENSMUSG00000102851 39.79 Gm18956 processed_pseudogene
## ENSMUSG00000103377 40.79 Gm37180 TEC
## ... ... ... ...
## ENSMUSG00000095366 41.37 Gm21860 lincRNA
## ENSMUSG00000095134 46.85 Mid1-ps1 unprocessed_pseudogene
## ENSMUSG00000096768 46.16 Gm47283 lincRNA
## ENSMUSG00000099871 43.39 Gm21742 unprocessed_pseudogene
## ENSMUSG00000096850 48.87 Gm21748 protein_coding
## -------
## seqinfo: 21 sequences from mm10 genome
# Load standard annotation based on 3' UTR coordinates
utrs <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="utr",
db=myDb)
utrs
## GRanges object with 228087 ranges and 4 metadata columns:
## seqnames ranges strand | transcript_id
## <Rle> <IRanges> <Rle> | <character>
## ENSMUST00000193812 chr1 3074323-3074571 + | ENSMUST00000193812
## ENSMUST00000082908 chr1 3102126-3102374 + | ENSMUST00000082908
## ENSMUST00000162897 chr1 3205652-3205900 - | ENSMUST00000162897
## ENSMUST00000159265 chr1 3206274-3206522 - | ENSMUST00000159265
## ENSMUST00000070533 chr1 3214233-3214481 - | ENSMUST00000070533
## ... ... ... ... . ...
## ENSMUST00000177591 chrY 90816465-90816713 + | ENSMUST00000177591
## ENSMUST00000179077 chrY 90816465-90816713 + | ENSMUST00000179077
## ENSMUST00000238471 chrY 90816466-90816714 + | ENSMUST00000238471
## ENSMUST00000179623 chrY 90838620-90838868 - | ENSMUST00000179623
## ENSMUST00000189352 chrY 90844041-90844289 + | ENSMUST00000189352
## gene_id gene_name biotype
## <character> <character> <character>
## ENSMUST00000193812 ENSMUSG00000102693 4933401J01Rik TEC
## ENSMUST00000082908 ENSMUSG00000064842 Gm26206 snRNA
## ENSMUST00000162897 ENSMUSG00000051951 Xkr4 protein_coding
## ENSMUST00000159265 ENSMUSG00000051951 Xkr4 protein_coding
## ENSMUST00000070533 ENSMUSG00000051951 Xkr4 protein_coding
## ... ... ... ...
## ENSMUST00000177591 ENSMUSG00000096768 Gm47283 lincRNA
## ENSMUST00000179077 ENSMUSG00000096768 Gm47283 lincRNA
## ENSMUST00000238471 ENSMUSG00000096768 Gm47283 lincRNA
## ENSMUST00000179623 ENSMUSG00000096850 Gm21748 protein_coding
## ENSMUST00000189352 ENSMUSG00000099871 Gm21742 unprocessed_pseudogene
## -------
## seqinfo: 21 sequences from mm10 genome
# Load summarized exon annotation based used with RNA-Seq analysis
sumEx <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="exon",
summarized=TRUE,db=myDb)
sumEx
## GRanges object with 291497 ranges and 4 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## ENSMUSG00000102693_MEX_1 chr1 3073253-3074322 + |
## ENSMUSG00000064842_MEX_1 chr1 3102016-3102125 + |
## ENSMUSG00000051951_MEX_1 chr1 3205901-3207317 - |
## ENSMUSG00000051951_MEX_2 chr1 3213439-3216968 - |
## ENSMUSG00000102851_MEX_1 chr1 3252757-3253236 + |
## ... ... ... ... .
## ENSMUSG00000099871_MEX_1 chrY 90837413-90837520 + |
## ENSMUSG00000096850_MEX_1 chrY 90838869-90839177 - |
## ENSMUSG00000099871_MEX_2 chrY 90841657-90841805 + |
## ENSMUSG00000099871_MEX_3 chrY 90842898-90843025 + |
## ENSMUSG00000099871_MEX_4 chrY 90843878-90844040 + |
## exon_id gene_id
## <character> <character>
## ENSMUSG00000102693_MEX_1 ENSMUSG00000102693_M.. ENSMUSG00000102693
## ENSMUSG00000064842_MEX_1 ENSMUSG00000064842_M.. ENSMUSG00000064842
## ENSMUSG00000051951_MEX_1 ENSMUSG00000051951_M.. ENSMUSG00000051951
## ENSMUSG00000051951_MEX_2 ENSMUSG00000051951_M.. ENSMUSG00000051951
## ENSMUSG00000102851_MEX_1 ENSMUSG00000102851_M.. ENSMUSG00000102851
## ... ... ...
## ENSMUSG00000099871_MEX_1 ENSMUSG00000099871_M.. ENSMUSG00000099871
## ENSMUSG00000096850_MEX_1 ENSMUSG00000096850_M.. ENSMUSG00000096850
## ENSMUSG00000099871_MEX_2 ENSMUSG00000099871_M.. ENSMUSG00000099871
## ENSMUSG00000099871_MEX_3 ENSMUSG00000099871_M.. ENSMUSG00000099871
## ENSMUSG00000099871_MEX_4 ENSMUSG00000099871_M.. ENSMUSG00000099871
## gene_name biotype
## <character> <character>
## ENSMUSG00000102693_MEX_1 4933401J01Rik TEC
## ENSMUSG00000064842_MEX_1 Gm26206 snRNA
## ENSMUSG00000051951_MEX_1 Xkr4 protein_coding
## ENSMUSG00000051951_MEX_2 Xkr4 protein_coding
## ENSMUSG00000102851_MEX_1 Gm18956 processed_pseudogene
## ... ... ...
## ENSMUSG00000099871_MEX_1 Gm21742 unprocessed_pseudogene
## ENSMUSG00000096850_MEX_1 Gm21748 protein_coding
## ENSMUSG00000099871_MEX_2 Gm21742 unprocessed_pseudogene
## ENSMUSG00000099871_MEX_3 Gm21742 unprocessed_pseudogene
## ENSMUSG00000099871_MEX_4 Gm21742 unprocessed_pseudogene
## -------
## seqinfo: 21 sequences from mm10 genome
# Load standard annotation based on gene body coordinates from RefSeq
if (.Platform$OS.type=="unix") {
refGenes <- loadAnnotation(genome="mm10",refdb="refseq",level="gene",
type="gene",db=myDb)
refGenes
}
## Getting latest annotation on the fly for mm10 from refseq
## Retrieving genome information for mm10 from refseq
## Retrieving latest gene annotation for mm10 from refseq
## Loading required namespace: RMySQL
## Converting annotation to GenomicRanges object...
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'rtracklayer'
## The following object is masked from 'package:BiocIO':
##
## FileForFormat
## Getting DNA sequences...
## Getting GC content...
## GRanges object with 23471 ranges and 4 metadata columns:
## seqnames ranges strand | gene_id gc_content
## <Rle> <IRanges> <Rle> | <character> <numeric>
## NM_001011874 chr1 3214481-3671498 - | NM_001011874 38.56
## NM_001370921 chr1 4119865-4360303 - | NM_001370921 38.35
## NM_011441 chr1 4490927-4497354 - | NM_011441 49.75
## NM_001177658 chr1 4773199-4785726 - | NM_001177658 42.59
## NM_008866 chr1 4807822-4846735 + | NM_008866 40.99
## ... ... ... ... . ... ...
## NM_001160135 chrY 67339048-67340047 - | NM_001160135 43.50
## NM_001037748 chrY 72554831-72581058 + | NM_001037748 35.58
## NM_001160144 chrY 77073913-77076246 - | NM_001160144 39.85
## NR_137283 chrY 90751139-90755050 - | NR_137283 41.51
## NR_137282 chrY 90785441-90816465 + | NR_137282 46.22
## gene_name biotype
## <character> <character>
## NM_001011874 Xkr4 NA
## NM_001370921 Rp1 NA
## NM_011441 Sox17 NA
## NM_001177658 Mrpl15 NA
## NM_008866 Lypla1 NA
## ... ... ...
## NM_001160135 Gm20806 NA
## NM_001037748 Gm20736 NA
## NM_001160144 Gm20816 NA
## NR_137283 G530011O06Rik NA
## NR_137282 Erdr1 NA
## -------
## seqinfo: 21 sequences from mm10 genome
Or as a data frame if you prefer using asdf=TRUE
. The data frame however does
not contain metadata like Seqinfo
to be used for any susequent validations:
# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="gene",
db=myDb,asdf=TRUE)
head(genes)
## chromosome start end gene_id gc_content strand gene_name
## 1 chr1 3073253 3074322 ENSMUSG00000102693 34.21 + 4933401J01Rik
## 2 chr1 3102016 3102125 ENSMUSG00000064842 36.36 + Gm26206
## 3 chr1 3205901 3671498 ENSMUSG00000051951 38.51 - Xkr4
## 4 chr1 3252757 3253236 ENSMUSG00000102851 39.79 + Gm18956
## 5 chr1 3365731 3368549 ENSMUSG00000103377 40.79 - Gm37180
## 6 chr1 3375556 3377788 ENSMUSG00000104017 36.99 - Gm37363
## biotype
## 1 TEC
## 2 snRNA
## 3 protein_coding
## 4 processed_pseudogene
## 5 TEC
## 6 TEC
Apart from the supported organisms and databases, you can add a custom annotation. Such an annotation can be:
This can be achieved through the usage of
GTF files, along with
some simple metadata that you have to provide for proper import to the
annotation database. This can be achieved through the usage of the
buildCustomAnnotation
function. Details on required metadata can be found
in the function’s help page.
Important note: Please note that importing a custom genome annotation
directly from UCSC (UCSC SQL database dumps) is not supported in Windows as the
process involves using the genePredToGtf
which is not available for Windows.
Let’s try a couple of exammples. The first one is a custom annotation for the Ebola virus from UCSC:
# Setup a temporary directory to download files etc.
customDir <- file.path(tempdir(),"test_custom")
dir.create(customDir)
# Convert from GenePred to GTF - Unix/Linux only!
if (.Platform$OS.type == "unix" && !grepl("^darwin",R.version$os)) {
# Download data from UCSC
goldenPath="http://hgdownload.cse.ucsc.edu/goldenPath/"
# Gene annotation dump
download.file(paste0(goldenPath,"eboVir3/database/ncbiGene.txt.gz"),
file.path(customDir,"eboVir3_ncbiGene.txt.gz"))
# Chromosome information
download.file(paste0(goldenPath,"eboVir3/database/chromInfo.txt.gz"),
file.path(customDir,"eboVir3_chromInfo.txt.gz"))
# Prepare the build
chromInfo <- read.delim(file.path(customDir,"eboVir3_chromInfo.txt.gz"),
header=FALSE)
chromInfo <- chromInfo[,1:2]
rownames(chromInfo) <- as.character(chromInfo[,1])
chromInfo <- chromInfo[,2,drop=FALSE]
# Coversion from genePred to GTF
genePredToGtfEnv <- Sys.getenv("GENEPREDTOGTF_BINARY")
if (genePredToGtfEnv == "") {
genePredToGtf <- file.path(customDir,"genePredToGtf")
} else {
genePredToGtf <- file.path(genePredToGtfEnv)
}
if (!file.exists(genePredToGtf)) {
download.file(
"http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
genePredToGtf
)
system(paste("chmod 775",genePredToGtf))
}
gtfFile <- file.path(customDir,"eboVir3.gtf")
tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
"tgtf",sep="."))
command <- paste0(
"zcat ",file.path(customDir,"eboVir3_ncbiGene.txt.gz"),
" | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
" -source=eboVir3"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
gtfFile
)
system(command)
# Build with the metadata list filled (you can also provide a version)
buildCustomAnnotation(
gtfFile=gtfFile,
metadata=list(
organism="eboVir3_test",
source="ucsc_test",
chromInfo=chromInfo
),
db=myDb
)
# Try to retrieve some data
eboGenes <- loadAnnotation(genome="eboVir3_test",refdb="ucsc_test",
level="gene",type="gene",db=myDb)
eboGenes
}
## Opening metaseqR2 SQLite database /tmp/Rtmpv8Tm64/test_anndb/testann.sqlite
## Importing GTF /tmp/Rtmpv8Tm64/test_custom/eboVir3.gtf as GTF to make id map
## Making id map
## Importing GTF /tmp/Rtmpv8Tm64/test_custom/eboVir3.gtf as TxDb
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## Retrieving gene annotation for ebovir3_test from ucsc_test version 20240501 from /tmp/Rtmpv8Tm64/test_custom/eboVir3.gtf
## Retrieving transcript annotation for ebovir3_test from ucsc_test version 20240501
## Retrieving summarized transcript annotation for ebovir3_test from ucsc_test version 20240501
## Retrieving 3' UTR annotation for ebovir3_test from ucsc_test version 20240501
## Retrieving summarized 3' UTR annotation per gene for ebovir3_test from ucsc_test version 20240501
## summarizing UTRs per gene for imported GTF
## Retrieving summarized 3' UTR annotation per transcript for ebovir3_test from ucsc_test version 20240501
## summarizing UTRs per gene for imported GTF
## Retrieving exon annotation for ebovir3_test from ucsc_test version 20240501
## Retrieving summarized exon annotation for ebovir3_test from ucsc_test version 20240501
## summarizing exons per gene for imported GTF
## GRanges object with 9 ranges and 4 metadata columns:
## seqnames ranges strand | gene_id gc_content gene_name
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## NP KM034562v1 56-3026 + | NP 50 NP
## VP35 KM034562v1 3032-4407 + | VP35 50 VP35
## VP40 KM034562v1 4390-5894 + | VP40 50 VP40
## GP KM034562v1 5900-8305 + | GP 50 GP
## sGP KM034562v1 5900-8305 + | sGP 50 sGP
## ssGP KM034562v1 5900-8305 + | ssGP 50 ssGP
## VP30 KM034562v1 8288-9740 + | VP30 50 VP30
## VP24 KM034562v1 9885-11518 + | VP24 50 VP24
## L KM034562v1 11501-18282 + | L 50 L
## biotype
## <character>
## NP gene
## VP35 gene
## VP40 gene
## GP gene
## sGP gene
## ssGP gene
## VP30 gene
## VP24 gene
## L gene
## -------
## seqinfo: 1 sequence from ebovir3_test genome
Another example, the Atlantic cod from UCSC. The same things apply for the operating system.
if (.Platform$OS.type == "unix") {
# Gene annotation dump
download.file(paste0(goldenPath,"gadMor1/database/augustusGene.txt.gz"),
file.path(customDir,"gadMori1_augustusGene.txt.gz"))
# Chromosome information
download.file(paste(goldenPath,"gadMor1/database/chromInfo.txt.gz",sep=""),
file.path(customDir,"gadMori1_chromInfo.txt.gz"))
# Prepare the build
chromInfo <- read.delim(file.path(customDir,"gadMori1_chromInfo.txt.gz"),
header=FALSE)
chromInfo <- chromInfo[,1:2]
rownames(chromInfo) <- as.character(chromInfo[,1])
chromInfo <- chromInfo[,2,drop=FALSE]
# Coversion from genePred to GTF
genePredToGtf <- file.path(customDir,"genePredToGtf")
if (!file.exists(genePredToGtf)) {
download.file(
"http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
genePredToGtf
)
system(paste("chmod 775",genePredToGtf))
}
gtfFile <- file.path(customDir,"gadMori1.gtf")
tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
"tgtf",sep="."))
command <- paste0(
"zcat ",file.path(customDir,"gadMori1_augustusGene.txt.gz"),
" | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
" -source=gadMori1"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
gtfFile
)
system(command)
# Build with the metadata list filled (you can also provide a version)
buildCustomAnnotation(
gtfFile=gtfFile,
metadata=list(
organism="gadMor1_test",
source="ucsc_test",
chromInfo=chromInfo
),
db=myDb
)
# Try to retrieve some data
gadGenes <- loadAnnotation(genome="gadMor1_test",refdb="ucsc_test",
level="gene",type="gene",db=myDb)
gadGenes
}
Another example, Armadillo from Ensembl. This should work irrespectively of operating system. We are downloading chromosomal information from UCSC.
# Gene annotation dump from Ensembl
download.file(paste0("ftp://ftp.ensembl.org/pub/release-98/gtf/",
"dasypus_novemcinctus/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"))
# Chromosome information will be provided from the following BAM file
# available from Ensembl. We have noticed that when using Windows as the OS,
# a remote BAM files cannot be opened by scanBamParam, so for this example,
# chromosome length information will not be available when running in Windows.
bamForInfo <- NULL
if (.Platform$OS.type == "unix")
bamForInfo <- paste0("ftp://ftp.ensembl.org/pub/release-98/bamcov/",
"dasypus_novemcinctus/genebuild/Dasnov3.broad.Ascending_Colon_5.1.bam")
# Build with the metadata list filled (you can also provide a version)
buildCustomAnnotation(
gtfFile=file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
metadata=list(
organism="dasNov3_test",
source="ensembl_test",
chromInfo=bamForInfo
),
db=myDb
)
# Try to retrieve some data
dasGenes <- loadAnnotation(genome="dasNov3_test",refdb="ensembl_test",
level="gene",type="gene",db=myDb)
dasGenes
A quite complete build (with latest versions of Ensembl annotations) would look like (supposing the default annotation database location):
organisms <- list(
hg18=54,
hg19=75,
hg38=110:111,
mm9=54,
mm10=110:111,
rn5=77,
rn6=110:111,
dm3=77,
dm6=110:111,
danrer7=77,
danrer10=80,
danrer11=110:111,
pantro4=80,
pantro5=110:111,
susscr3=80,
susscr11=110:111,
equcab2=110:111
)
sources <- c("ensembl","ucsc","refseq")
buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,rc=0.5)
The aforementioned complete built can be found here Complete builts will become available from time to time (e.g. with every new Ensembl relrase) for users who do not wish to create annotation databases on their own. Root access may be required (depending on the metaseqR2 library location) to place it in the default location where it can be found automatically.
If for some reason you do not want to build and use an annotation database for
metaseqR2 analyses (not recommended) or you wish to perform an analysis with an
organism that does not yet exist in the database, the loadAnnotation
function
will perform all required actions (download and create a GRanges
object)
on-the-fly as long as there is an internet connection.
However, the above function does not handle custom annotations in GTF files.
In a scenario where you want to use a custom annotation only once, you should
supply the annotation
argument to the metaseqr2
function, which is almost
the same as the metadata
argument used in buildCustomAnnotation
, actually
augmented by a list member for the GTF
file, that is:
annotation <- list(
gtf="PATH_TO_GTF",
organism="ORGANISM_NAME",
source="SOURCE_NAME",
chromInfo="CHROM_INFO"
)
The above argument can be passed to the metaseqr2 call in the respective position.
For further details about custom annotations on the fly, please check
buildCustomAnnotation
and importCustomAnnotation
functions.
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] splines stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.72.0
## [3] rtracklayer_1.64.0 BiocIO_1.14.0
## [5] Biostrings_2.72.0 XVector_0.44.0
## [7] metaseqR2_1.16.0 locfit_1.5-9.9
## [9] limma_3.60.0 DESeq2_1.44.0
## [11] SummarizedExperiment_1.34.0 Biobase_2.64.0
## [13] MatrixGenerics_1.16.0 matrixStats_1.3.0
## [15] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
## [17] IRanges_2.38.0 S4Vectors_0.42.0
## [19] BiocGenerics_0.50.0 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] survcomp_1.54.0 bitops_1.0-7
## [3] filelock_1.0.3 tibble_3.2.1
## [5] R.oo_1.26.0 preprocessCore_1.66.0
## [7] XML_3.99-0.16.1 lifecycle_1.0.4
## [9] httr2_1.0.1 pwalign_1.0.0
## [11] edgeR_4.2.0 globals_0.16.3
## [13] MASS_7.3-60.2 lattice_0.22-6
## [15] dendextend_1.17.1 magrittr_2.0.3
## [17] plotly_4.10.4 sass_0.4.9
## [19] rmarkdown_2.26 jquerylib_0.1.4
## [21] yaml_2.3.8 DBI_1.2.2
## [23] RColorBrewer_1.1-3 ABSSeq_1.58.0
## [25] harmonicmeanp_3.0.1 abind_1.4-5
## [27] ShortRead_1.62.0 zlibbioc_1.50.0
## [29] purrr_1.0.2 R.utils_2.12.3
## [31] rmeta_3.0 RCurl_1.98-1.14
## [33] rappdirs_0.3.3 lava_1.8.0
## [35] seriation_1.5.5 GenomeInfoDbData_1.2.12
## [37] survivalROC_1.0.3.1 listenv_0.9.1
## [39] genefilter_1.86.0 parallelly_1.37.1
## [41] annotate_1.82.0 permute_0.9-7
## [43] DelayedMatrixStats_1.26.0 codetools_0.2-20
## [45] DelayedArray_0.30.0 DT_0.33
## [47] xml2_1.3.6 SuppDists_1.1-9.7
## [49] tidyselect_1.2.1 futile.logger_1.4.3
## [51] UCSC.utils_1.0.0 viridis_0.6.5
## [53] TSP_1.2-4 rmdformats_1.0.4
## [55] BiocFileCache_2.12.0 webshot_0.5.5
## [57] GenomicAlignments_1.40.0 jsonlite_1.8.8
## [59] iterators_1.0.14 survival_3.6-4
## [61] foreach_1.5.2 tools_4.4.0
## [63] progress_1.2.3 Rcpp_1.0.12
## [65] glue_1.7.0 prodlim_2023.08.28
## [67] gridExtra_2.3 SparseArray_1.4.0
## [69] xfun_0.43 qvalue_2.36.0
## [71] ca_0.71.1 dplyr_1.1.4
## [73] HDF5Array_1.32.0 withr_3.0.0
## [75] NBPSeq_0.3.1 formatR_1.14
## [77] BiocManager_1.30.22 fastmap_1.1.1
## [79] latticeExtra_0.6-30 rhdf5filters_1.16.0
## [81] fansi_1.0.6 caTools_1.18.2
## [83] digest_0.6.35 R6_2.5.1
## [85] colorspace_2.1-0 RMySQL_0.10.27
## [87] gtools_3.9.5 jpeg_0.1-10
## [89] biomaRt_2.60.0 RSQLite_2.3.6
## [91] R.methodsS3_1.8.2 tidyr_1.3.1
## [93] utf8_1.2.4 generics_0.1.3
## [95] data.table_1.15.4 prettyunits_1.2.0
## [97] httr_1.4.7 htmlwidgets_1.6.4
## [99] S4Arrays_1.4.0 pkgconfig_2.0.3
## [101] gtable_0.3.5 registry_0.5-1
## [103] blob_1.2.4 hwriter_1.3.2.1
## [105] htmltools_0.5.8.1 bookdown_0.39
## [107] log4r_0.4.3 scales_1.3.0
## [109] png_0.1-8 corrplot_0.92
## [111] knitr_1.46 lambda.r_1.2.4
## [113] reshape2_1.4.4 rjson_0.2.21
## [115] curl_5.2.1 zoo_1.8-12
## [117] cachem_1.0.8 rhdf5_2.48.0
## [119] stringr_1.5.1 KernSmooth_2.23-22
## [121] parallel_4.4.0 AnnotationDbi_1.66.0
## [123] vsn_3.72.0 restfulr_0.0.15
## [125] pillar_1.9.0 grid_4.4.0
## [127] vctrs_0.6.5 gplots_3.1.3.1
## [129] dbplyr_2.5.0 xtable_1.8-4
## [131] evaluate_0.23 bsseq_1.40.0
## [133] VennDiagram_1.7.3 GenomicFeatures_1.56.0
## [135] cli_3.6.2 compiler_4.4.0
## [137] futile.options_1.0.1 Rsamtools_2.20.0
## [139] rlang_1.1.3 crayon_1.5.2
## [141] FMStable_0.1-4 future.apply_1.11.2
## [143] heatmaply_1.5.0 interp_1.1-6
## [145] aroma.light_3.34.0 affy_1.82.0
## [147] plyr_1.8.9 pander_0.6.5
## [149] stringi_1.8.3 viridisLite_0.4.2
## [151] deldir_2.0-4 BiocParallel_1.38.0
## [153] assertthat_0.2.1 txdbmaker_1.0.0
## [155] munsell_0.5.1 lazyeval_0.2.2
## [157] DSS_2.52.0 EDASeq_2.38.0
## [159] Matrix_1.7-0 hms_1.1.3
## [161] future_1.33.2 sparseMatrixStats_1.16.0
## [163] bit64_4.0.5 ggplot2_3.5.1
## [165] Rhdf5lib_1.26.0 KEGGREST_1.44.0
## [167] statmod_1.5.0 memoise_2.0.1
## [169] affyio_1.74.0 bslib_0.7.0
## [171] bootstrap_2019.6 bit_4.0.5