Contents

1 Introduction

Converting the BUS format into gene count matrix requires a file that maps transcripts to genes. This is required by both the C++ command line tool bustools and this package if you choose to use it for the gene count matrix. For bustools, the file should be of a specific format. This package implements utility functions to generate the transcript to gene data frame and save it in the format required by bustools. This vignette shows examples of generating this file with different methods

library(BUSpaRse)
library(TENxBUSData)

2 Obtaining transcript to gene information

2.1 From Ensembl

The transcript to gene data frame can be generated by directly querying Ensembl with biomart. This can query not only the vertebrate database (www.ensembl.org), but also the Ensembl databases for other organisms, such as plants (plants.ensembl.org) and fungi (fungi.ensembl.org). By default, this will use the most recent version of Ensembl, but older versions can also be used. By default, Ensembl transcript ID (with version number), gene ID (with version number), and gene symbol are downloaded, but other attributes available on Ensembl can be downloaded as well. Make sure that the Ensembl version matches the Ensembl version of transcriptome used for kallisto index.

# Specify other attributes
tr2g_mm <- tr2g_ensembl("Mus musculus", ensembl_version = 94, 
                        other_attrs = "description")
#> Querying biomart for transcript and gene IDs of Mus musculus
head(tr2g_mm)
# Plants
tr2g_at <- tr2g_ensembl("Arabidopsis thaliana", type = "plant")
#> Version is only available to vertebrates.
#> Querying biomart for transcript and gene IDs of Arabidopsis thaliana
#> Cache found
head(tr2g_at)

2.2 From FASTA file

We need a FASTA file for the transcriptome used to build kallisto index. Transcriptome FASTA files from Ensembl contains gene annotation in the sequence name of each transcript. Transcript and gene information can be extracted from the sequence name. At present, only Ensembl FASTA files or FASTA files with sequence names formatted like in Ensembl are accepted.

# Subset of a real Ensembl FASTA file
toy_fasta <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
tr2g_fa <- tr2g_fasta(file = toy_fasta)
#> Reading FASTA file.
head(tr2g_fa)

2.3 From GTF and GFF3 files

If you have GTF or GFF3 files for other purposes, these can also be used to generate the transcript to gene file.

# Subset of a reral GTF file from Ensembl
toy_gtf <- system.file("testdata/gtf_test.gtf", package = "BUSpaRse")
tr2g_tg <- tr2g_gtf(toy_gtf)
#> Reading GTF file.
head(tr2g_tg)

2.4 Sorting

Before using the transcript to gene data frame to generate the sparse matrix, it must be sorted into the same order as transcripts in the kallisto index. The equivalence classes from kallisto contain sets of numbers; these numbers are line number of the transcript, so the first transcript in the index is 0, the second is 1, and so on. Note that if the same FASTA file used for kallisto indexing is used to generate the transcript to gene data frame, then the data frame should already be in the right order. Here is an example of code used to sort the data frame, where "./out_retina" is the directory where kallisto output is stored.

# Download kallisto bus example output
TENxBUSData(".", dataset = "retina")
#> snapshotDate(): 2019-10-22
#> see ?TENxBUSData and browseVignettes('TENxBUSData') for documentation
#> loading from cache
#> The downloaded files are in /tmp/RtmpaZOu6E/Rbuild666031986928/BUSpaRse/vignettes/out_retina
#> [1] "/tmp/RtmpaZOu6E/Rbuild666031986928/BUSpaRse/vignettes/out_retina"
tr2g_mm <- sort_tr2g(tr2g_mm, kallisto_out_path = "./out_retina")
#> Sorting transcripts

2.5 Mixed species

The tr2g_* family of functions in this package only processes one species at a time. There is a function that processes multiple species at a time, but this only does Ensembl queries and Ensembl FASTA files. This function will also sort the transcripts. This function can be used for one species as well, to do the transcript and gene information extraction and the sorting in one step. Alternatively, youo can simply concatenate the tr2g data frames for single species and then sort it.

# Download example mixed species output
TENxBUSData(".", dataset = "hgmm100")
#> The dataset has already been downloaded. It is located in /tmp/RtmpaZOu6E/Rbuild666031986928/BUSpaRse/vignettes/out_hgmm100
#> [1] "/tmp/RtmpaZOu6E/Rbuild666031986928/BUSpaRse/vignettes/out_hgmm100"
tr2g_hgmm <- transcript2gene(species = c("Homo sapiens", "Mus musculus"),
                             type = "vertebrate",
                             ensembl_version = 94,
                             kallisto_out_path = "./out_hgmm100")
#> Querying biomart for transcript and gene IDs of Homo sapiens
#> Cache found
#> Querying biomart for transcript and gene IDs of Mus musculus
#> Cache found
#> Sorting transcripts

3 Save file for bustools

This package can be used to save the transcript to gene data frame to a format required by the bustools. This assumes that the data frame is already properly sorted.

save_tr2g_bustools(tr2g_at, file_save = "./tr2g_at.tsv")
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] ggplot2_3.2.1               zeallot_0.1.0              
#>  [3] DropletUtils_1.6.0          SingleCellExperiment_1.8.0 
#>  [5] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
#>  [7] BiocParallel_1.20.0         matrixStats_0.55.0         
#>  [9] Biobase_2.46.0              GenomicRanges_1.38.0       
#> [11] GenomeInfoDb_1.22.0         IRanges_2.20.0             
#> [13] S4Vectors_0.24.0            BiocGenerics_0.32.0        
#> [15] Matrix_1.2-17               BUSpaRse_1.0.0             
#> [17] TENxBUSData_0.99.3          BiocStyle_2.14.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] ProtGenerics_1.18.0           bitops_1.0-6                 
#>  [3] bit64_0.9-7                   progress_1.2.2               
#>  [5] httr_1.4.1                    tools_3.6.1                  
#>  [7] backports_1.1.5               R6_2.4.0                     
#>  [9] HDF5Array_1.14.0              colorspace_1.4-1             
#> [11] DBI_1.0.0                     lazyeval_0.2.2               
#> [13] withr_2.1.2                   tidyselect_0.2.5             
#> [15] prettyunits_1.0.2             bit_1.1-14                   
#> [17] curl_4.2                      compiler_3.6.1               
#> [19] rtracklayer_1.46.0            bookdown_0.14                
#> [21] scales_1.0.0                  askpass_1.1                  
#> [23] rappdirs_0.3.1                stringr_1.4.0                
#> [25] digest_0.6.22                 Rsamtools_2.2.0              
#> [27] rmarkdown_1.16                R.utils_2.9.0                
#> [29] XVector_0.26.0                pkgconfig_2.0.3              
#> [31] htmltools_0.4.0               limma_3.42.0                 
#> [33] dbplyr_1.4.2                  fastmap_1.0.1                
#> [35] ensembldb_2.10.0              BSgenome_1.54.0              
#> [37] rlang_0.4.1                   RSQLite_2.1.2                
#> [39] shiny_1.4.0                   jsonlite_1.6                 
#> [41] R.oo_1.22.0                   dplyr_0.8.3                  
#> [43] RCurl_1.95-4.12               magrittr_1.5                 
#> [45] GenomeInfoDbData_1.2.2        munsell_0.5.0                
#> [47] Rcpp_1.0.2                    Rhdf5lib_1.8.0               
#> [49] R.methodsS3_1.7.1             lifecycle_0.1.0              
#> [51] edgeR_3.28.0                  stringi_1.4.3                
#> [53] yaml_2.2.0                    zlibbioc_1.32.0              
#> [55] rhdf5_2.30.0                  BiocFileCache_1.10.0         
#> [57] AnnotationHub_2.18.0          grid_3.6.1                   
#> [59] blob_1.2.0                    dqrng_0.2.1                  
#> [61] promises_1.1.0                ExperimentHub_1.12.0         
#> [63] crayon_1.3.4                  lattice_0.20-38              
#> [65] Biostrings_2.54.0             GenomicFeatures_1.38.0       
#> [67] hms_0.5.1                     locfit_1.5-9.1               
#> [69] knitr_1.25                    pillar_1.4.2                 
#> [71] biomaRt_2.42.0                XML_3.98-1.20                
#> [73] glue_1.3.1                    BiocVersion_3.10.1           
#> [75] evaluate_0.14                 data.table_1.12.6            
#> [77] BiocManager_1.30.9            RcppParallel_4.4.4           
#> [79] vctrs_0.2.0                   httpuv_1.5.2                 
#> [81] gtable_0.3.0                  openssl_1.4.1                
#> [83] purrr_0.3.3                   tidyr_1.0.0                  
#> [85] assertthat_0.2.1              xfun_0.10                    
#> [87] mime_0.7                      xtable_1.8-4                 
#> [89] AnnotationFilter_1.10.0       later_1.0.0                  
#> [91] tibble_2.1.3                  GenomicAlignments_1.22.0     
#> [93] AnnotationDbi_1.48.0          plyranges_1.6.0              
#> [95] memoise_1.1.0                 interactiveDisplayBase_1.24.0