getGenomeAndAnnotation {ORFik}    R Documentation

Download genome (fasta), annotation (GTF) and contaminants

Description

Downloads the genome (fasta), the annotation (GTF) and optional contaminants for an organism.
It also creates an R transcript database (TxDb object) from the annotation and indexes the genome for you.
If you misspelled something or the run crashed, delete the wrong files and run again.
Set remake = TRUE to do it all over again.

Usage

getGenomeAndAnnotation(
  organism,
  output.dir,
  db = "ensembl",
  GTF = TRUE,
  genome = TRUE,
  phix = FALSE,
  ncRNA = "",
  tRNA = "",
  rRNA = "",
  gunzip = TRUE,
  remake = FALSE,
  assembly_type = "primary_assembly"
)

Arguments

organism

scientific name of organism, Homo sapiens, Danio rerio, Mus musculus, etc.

output.dir

directory to save downloaded data

db

database to use for genome and GTF, recommended default: "ensembl" (the toplevel assembly contains haplotypes and is a large file, see assembly_type). Alternatives: "refseq" (primary assembly) and "genbank" (mix)

GTF

logical, default: TRUE, download the gtf of the organism specified in the "organism" argument. If FALSE, check whether the downloaded file already exists. If you want to use a custom gtf from your hard drive, set GTF = FALSE, and assign:
annotation <- getGenomeAndAnnotation(organism, output.dir, GTF = FALSE)
annotation["gtf"] <- "path/to/gtf.gtf"
Only db = "ensembl" is allowed for GTF.

genome

logical, default: TRUE, download the genome of the organism specified in the "organism" argument. If FALSE, check whether the downloaded file already exists. If you want to use a custom genome from your hard drive, set genome = FALSE, and assign:
annotation <- getGenomeAndAnnotation(organism, output.dir, genome = FALSE)
annotation["genome"] <- "path/to/genome.fasta"
For ensembl, the primary assembly is downloaded by default (see assembly_type).

phix

logical, default FALSE. If TRUE, download the phiX sequence to use as a contaminant filter. Only use this for Illumina sequencing: phiX is used in Illumina sequencers for sequencing quality control. The genome is from refseq: Escherichia virus phiX174

ncRNA

character, default "" (no download), a contaminant genome of ncRNA sequences to filter out, downloaded from the NONCODE online server. Alternatives: "auto" or a manually assigned name like "human". If "auto", it will try to find the ncRNA file from the organism name (Homo sapiens -> human, etc.). "auto" will not work for all organisms; in that case you must specify the name used by NONCODE. If not "auto" or "", it must be a character vector with the species common name, not the scientific name (Homo sapiens is human, Rattus norvegicus is rat, etc.). If you cannot find the common name, see: http://www.noncode.org/download.php/

tRNA

character, default "" (not used). If not "", it must be a character vector with a valid path to a fasta file of mature tRNAs on your disk, to remove as contaminants. Find and download the tRNAs you want at: http://gtrnadb.ucsc.edu/, or run tRNAscan-SE on your genome.

rRNA

character, default "" (not used). If not "", it must be a character vector with a valid path to a fasta file of rRNAs on your disk, to remove as contaminants. Find and download the rRNAs you want at: https://www.arb-silva.de/

gunzip

logical, default TRUE, uncompress downloaded files that arrive compressed. This should be TRUE!

remake

logical, default: FALSE, if TRUE remake everything specified

assembly_type

a character string specifying which assembly type the genome is retrieved from (ensembl only, otherwise this argument is ignored).
Default is assembly_type = "primary_assembly". This gives you the primary assembly with one variant per chromosome, with no extra copies of any chromosome. As an example, the primary_assembly fasta genome in human is only a few GB uncompressed.
Alternative is assembly_type = "toplevel". This also gives you all multi-chromosomes (copies of the same chromosome with small variations). As an example, the toplevel fasta genome in human is over 70 GB uncompressed.
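
As a rough sketch of how assembly_type combines with the other arguments, the call below requests the toplevel Ensembl genome for human. The output directory is a hypothetical path, and the run_download switch is only an illustrative guard, not part of ORFik; it keeps the (very large) download from starting by accident.

```r
# Arguments for a toplevel (haplotype-containing) Ensembl genome.
# The output directory is a hypothetical location; pick your own.
args <- list(
  organism = "Homo sapiens",
  output.dir = file.path(tempdir(), "human_toplevel"),
  db = "ensembl",
  assembly_type = "toplevel"
)

# Illustrative guard (an assumption, not an ORFik feature):
# set to TRUE to actually start the download.
run_download <- FALSE
if (run_download && requireNamespace("ORFik", quietly = TRUE)) {
  annotation <- do.call(ORFik::getGenomeAndAnnotation, args)
}
```

Leaving assembly_type out of the list falls back to the default, "primary_assembly".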

Details

If you want to use a custom genome or gtf from your hard drive, assign it after you run this function, like this:
annotation <- getGenomeAndAnnotation(organism, output.dir, GTF = FALSE, genome = FALSE)
annotation["genome"] <- "path/to/genome.fasta"
annotation["gtf"] <- "path/to/gtf.gtf"
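
The assignment above can be wrapped in a small check so that a mistyped path fails early. This is a minimal sketch: use_local_file is a hypothetical helper, not an ORFik function, and the vector and temporary file below are stand-ins for the real return value and your own gtf.

```r
# Hypothetical helper (not part of ORFik): replace one entry of the
# path vector with a local file, after checking that the file exists.
use_local_file <- function(annotation, name, path) {
  stopifnot(is.character(annotation), is.character(path), length(path) == 1)
  if (!file.exists(path)) stop("File not found: ", path)
  annotation[name] <- normalizePath(path)
  annotation
}

# Stand-in for the vector returned by getGenomeAndAnnotation();
# the real paths depend on organism and output.dir.
annotation <- c(gtf = "", genome = "")

# Placeholder for your own gtf file on disk.
local_gtf <- tempfile(fileext = ".gtf")
file.create(local_gtf)

annotation <- use_local_file(annotation, "gtf", local_gtf)
```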

Value

a named character vector of paths to the downloaded genome and gtf, and to additional contaminants if used.
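
Before passing the returned paths on to other steps, it can help to confirm that each file is actually on disk. The sketch below assumes only that the vector is named (as the "gtf" and "genome" entries used elsewhere on this page suggest); missing_paths is a hypothetical helper, and the vector is a stand-in built from temporary files.

```r
# Hypothetical helper (not part of ORFik): report which entries of a
# named path vector do not exist on disk.
missing_paths <- function(paths) {
  stopifnot(is.character(paths), !is.null(names(paths)))
  names(paths)[!file.exists(paths)]
}

# Stand-in for the returned vector: one existing file, one absent file.
gtf_file <- tempfile(fileext = ".gtf")
file.create(gtf_file)
annotation <- c(
  gtf = gtf_file,
  genome = file.path(tempdir(), "not_downloaded.fasta")
)

# Names of the entries whose files are absent.
missing_paths(annotation)
```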

See Also

Other STAR: STAR.align.folder(), STAR.align.single(), STAR.index(), STAR.install(), STAR.multiQC(), STAR.remove.crashed.genome(), install.fastp()

Examples

output.dir <- "/Bio_data/references/zebrafish"
#getGenomeAndAnnotation("Danio rerio", output.dir)

[Package ORFik version 1.8.6 Index]