Sample output files for tximport

This data package provides a set of output files from running a number of various transcript abundance quantifiers on 6 samples from the GEUVADIS Project. The files are contained in the inst/extdata directory.

A citation for the GEUVADIS Project is:

Lappalainen, et al., “Transcriptome and genome sequencing uncovers functional variation in humans”, Nature 501, 506-511 (26 September 2013) doi:10.1038/nature12531.

The purpose of this vignette is to detail which versions of software were run, and exactly what calls were made.

Sample information and quantification files

A small file, samples.txt is included in the inst/extdata directory:

dir <- system.file("extdata", package="tximportData")
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples
##   pop center                assay    sample experiment       run
## 1 TSI  UNIGE NA20503.1.M_111124_5 ERS185497  ERX163094 ERR188297
## 2 TSI  UNIGE NA20504.1.M_111124_7 ERS185242  ERX162972 ERR188088
## 3 TSI  UNIGE NA20505.1.M_111124_6 ERS185048  ERX163009 ERR188329
## 4 TSI  UNIGE NA20507.1.M_111124_7 ERS185412  ERX163158 ERR188288
## 5 TSI  UNIGE NA20508.1.M_111124_2 ERS185362  ERX163159 ERR188021
## 6 TSI  UNIGE NA20514.1.M_111124_4 ERS185217  ERX163062 ERR188356

Further details can be found in a more extended table:

samples.ext <- read.delim(file.path(dir,"samples_extended.txt"), header=TRUE)
colnames(samples.ext)
##  [1] "Source.Name"                            
##  [2] "Comment.ENA_SAMPLE."                    
##  [3] "Characteristics.Organism."              
##  [4] "Term.Source.REF"                        
##  [5] "Term.Accession.Number"                  
##  [6] "Characteristics.Strain."                
##  [7] "Characteristics.population."            
##  [8] "Comment.1000g.Phase1.Genotypes."        
##  [9] "Protocol.REF"                           
## [10] "Protocol.REF.1"                         
## [11] "Extract.Name"                           
## [12] "Comment.LIBRARY_SELECTION."             
## [13] "Comment.LIBRARY_SOURCE."                
## [14] "Comment.SEQUENCE_LENGTH."               
## [15] "Comment.LIBRARY_STRATEGY."              
## [16] "Comment.LIBRARY_LAYOUT."                
## [17] "Comment.NOMINAL_LENGTH."                
## [18] "Comment.NOMINAL_SDEV."                  
## [19] "Protocol.REF.2"                         
## [20] "Performer"                              
## [21] "Assay.Name"                             
## [22] "Technology.Type"                        
## [23] "Comment.ENA_EXPERIMENT."                
## [24] "Comment.READ_INDEX_1_BASE_COORD."       
## [25] "Protocol.REF.3"                         
## [26] "Scan.Name"                              
## [27] "Comment.SUBMITTED_FILE_NAME."           
## [28] "Comment.ENA_RUN."                       
## [29] "Comment.FASTQ_URI."                     
## [30] "Protocol.REF.4"                         
## [31] "Derived.Array.Data.File"                
## [32] "Comment..Derived.ArrayExpress.FTP.file."
## [33] "Factor.Value.population."               
## [34] "Factor.Value.laboratory."               
## [35] "date"

The quantification outputs themselves can be found in sub-directories:

list.files(dir)
##  [1] "alevin"                  "cufflinks"              
##  [3] "kallisto"                "kallisto_boot"          
##  [5] "refseq"                  "rsem"                   
##  [7] "sailfish"                "salmon"                 
##  [9] "salmon_dm"               "salmon_ec"              
## [11] "salmon_gibbs"            "samples.txt"            
## [13] "samples_extended.txt"    "tx2gene.csv"            
## [15] "tx2gene.ensembl.v87.csv" "tx2gene.gencode.v27.csv"
## [17] "tx2gene_alevin.tsv"
list.files(file.path(dir,"cufflinks"))
## [1] "isoforms.attr_table"  "isoforms.count_table" "isoforms.fpkm_table"
list.files(file.path(dir,"rsem","ERR188021"))
## [1] "ERR188021.genes.results.gz"    "ERR188021.isoforms.results.gz"
list.files(file.path(dir,"kallisto","ERR188021"))
## [1] "abundance.h5"     "abundance.tsv.gz" "run_info.json"
list.files(file.path(dir,"salmon","ERR188021"))
## [1] "aux_info"               "cmd_info.json"          "libParams"             
## [4] "lib_format_counts.json" "logs"                   "quant.sf.gz"
list.files(file.path(dir,"sailfish","ERR188021"))
## [1] "cmd_info.json" "quant.sf"
list.files(file.path(dir,"alevin"))
## [1] "mouse1_LPS2_50"      "mouse1_unst_50"      "mouse1_unst_50_boot"

Genome and gene annotation file

Illumina iGenomes: The human genome and annotations were downloaded from Illumina iGenomes for the UCSC hg19 version. The human genome FASTA file used was in the Sequence/WholeGenomeFasta directory and the gene annotation GTF file used was the genes.gtf file in the Annotation/Genes directory. This GTF file contains RefSeq transcript IDs and UCSC gene names. The Annotation directory contained a README.txt file with the text:

The contents of the annotation directories were downloaded from UCSC on: June 02, 2014.

The genes.gtf file was filtered to include only chromosomes 1-22, X, Y, and M.

Cufflinks

Tophat2 version 2.0.11 was run with the call:

tophat -p 20 -o tophat_out/$f genome fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz;

Cufflinks version 2.2.1 was run with the call:

cuffquant -p 40 -b $GENO -o cufflinks/$f genes.gtf tophat_out/$f/accepted_hits.bam;

Cuffnorm was run with the call:

cuffnorm genes.gtf -o cufflinks/ \
cufflinks/ERR188297/abundances.cxb \
cufflinks/ERR188088/abundances.cxb \
cufflinks/ERR188329/abundances.cxb \
cufflinks/ERR188288/abundances.cxb \
cufflinks/ERR188021/abundances.cxb \
cufflinks/ERR188356/abundances.cxb 

RSEM

RSEM version 1.2.31 was run with the call:

rsem-calculate-expression --num-threads 6 --bowtie2 --paired-end <(zcat fastq/$f\_1.fastq.gz) <(zcat fastq/$f\_2.fastq.gz) index rsem/$f/$f

kallisto

kallisto version 0.43.1 was run with the call:

kallisto quant --bias -i index -t 6 -o kallisto/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz

For the files in kallisto_boot directory, kallisto version 0.43.0 was run, quantifying against the Ensembl transcripts (v87) in Homo_sapiens.GRCh38.cdna.all.fa, using the call:

kallisto quant -i index -t 6 -b 5 -o kallisto_0.43.0/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz

Salmon

Salmon version 0.8.2 was run with the call:

salmon quant -p 6 --gcBias -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon/$f

For the files in the salmon_gibbs directory, Salmon version 0.8.1 was run, quantifying against the Ensembl transcripts (v87) in Homo_sapiens.GRCh38.cdna.all.fa, using the call:

salmon quant -p 6 --numGibbsSamples 5 -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon_gibbs/$f

For the files in the salmon_dm directory (Drosophila melanogaster), Salmon version 0.10.2 was run (once with only cDNA, once combining cDNA with non-coding transcripts):

salmon quant -l A --gcBias --seqBias --posBias -i Drosophila_melanogaster.BDGP6.cdna.v92_salmon_0.10.2 -o SRR1197474 -1 SRR1197474_1.fastq.gz -2 SRR1197474_2.fastq.gz

For the files in the salmon_ec directory, Salmon version 1.1.0 was run with --dumpEq on the files from Tasic, B., Yao, Z., Graybuck, L.T. et al. “Shared and distinct transcriptomic cell types across neocortical areas” (2018) doi: 10.1038/s41586-018-0654-5 These files were generated by Jeroen Gilis. The raw data is from: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA476008&o=acc_s%3Aa

alevin

Two small examples of alevin output (50 cells each) were generated by Jeroen Gilis. The dataset is a subset from the paper, Hagai et al. “Gene expression variability across cells and species shapes innate immunity” (2018) doi: 10.1038/s41586-018-0657-2 Salmon/alevin version 1.6.0 was run, and using the tx2gene data that is included in the package under tx2gene_alevin.tsv.

Sailfish

Sailfish version 0.9.0 was run with the call:

sailfish quant -p 10 --biasCorrect -i sailfish_0.9.0/index -l IU -1 <(zcat fastq/$f\_1.fastq.gz) -2 <(zcat fastq/$f\_2.fastq.gz) -o sailfish_0.9.0/$f

Session info

sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
## [1] compiler_4.2.0 magrittr_2.0.3 tools_4.2.0    stringi_1.7.6  knitr_1.39    
## [6] stringr_1.4.0  xfun_0.30      evaluate_0.15