Contents

1 FLAMES

The FLAMES package provides a framework for performing single-cell and bulk read full-length analysis of mutations and splicing. FLAMES performs cell barcode and UMI assignment from nanopore reads as well as semi-supervised isoform detection and quantification. FLAMES is designed to be an easy and quick to use, powerful workflow for isoform detection and quantification, splicing analysis and mutation detection, and seeks to overcome the limitations of other tools, such as an inability to process single cell data, and a focus on cell barcode and UMI assignment (Tian et al. 2020).

This R package was created to simplify installation and execution of the FLAMES python module, which supports the article Comprehensive characterization of single cell full-length isoforms in human and mouse with long-read sequencing by Tian et al. (2020).

FLAMES pipeline workflow summary

Figure 1: FLAMES pipeline workflow summary

Input to FLAMES are fastq files generated from the long-read platform. Using the cell barcode annotation obtained from short-read data as the reference, the pipeline identifies and trims cell barcodes/UMI sequences from the long reads. After barcode assignment, all reads are aligned to the relevant genome to obtain a draft read alignment. The draft alignment is then polished and grouped to generate a consensus transcript assembly. All reads are aligned again using the transcript assembly as the reference and quantified.

Figure 1 provides a high level overview of the main steps in the FLAMES pipeline. The optional arguments on the left are colour coded to associate with the specific step that they apply to.

For read alignment and realignment, FLAMES uses minimap2, a versatile alignment program for aligning sequences against a reference database (Li 2018). This software needs to be downloaded prior to using the FLAMES pipeline, and can be found at https://github.com/lh3/minimap2.

However, this software is not available to Windows users. If a user wishes to use the FLAMES pipeline without access to minimap2, please refer to the vignette Vignette for FLAMES on Windows, which provides instructions for running the pipeline on Windows, which is also applicable to any system without minimap2 installed.

2 FLAMES Pipeline Execution

This vignette will detail the process of running the FLAMES pipeline. It details the execution of both the single cell pipeline (sc_long_pipeline()) and the bulk data pipeline (bulk_long_pipeline()).

2.1 FLAMES Single Cell Pipeline

2.1.1 Environment setup

To get started, the pipeline needs access to a gene annotation file in GFF3 or GTF format, a directory containing one or more FASTQ files (which will be merged as pre-processing), a genome FASTA file, as well as the file path to minimap2, and the file path to the directory to hold output files.

The single cell pipeline can demultiplex the input data before running, if required. This can be disabled by setting the match_barcode argument to FALSE when calling the pipeline. This example dataset has already been demultiplexed.

For this example, the required files are downloaded from GitHub using BiocFileCache (Shepherd and Morgan 2021).

temp_path <- tempfile()
bfc <- BiocFileCache::BiocFileCache(temp_path, ask=FALSE)
file_url <- 
  "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"
annot <- bfc[[names(BiocFileCache::bfcadd(bfc, "Annotation", 
                                          paste(file_url, 
                                                "gencodeshortened.gtf", 
                                                sep="/")))]]
genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 
                                              "Genomefa", 
                                              paste(file_url, 
                                                    "GRCh38shortened.fa", 
                                                    sep="/")))]]

fastq <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq", paste(file_url, "sc_align2genome.sample.fastq.gz", sep="/")))]]

# setup other environment variables
config_file <- FLAMES::get_default_config_file()
outdir <- tempfile() 
if (!dir.exists(outdir)) {
    dir.create(outdir)
}

The optional argument config_file can be given to both bulk_long_pipeline() and sc_long_pipeline() in order to customise the execution of the pipelines. It is expected to be a JSON file, and an example can be found by calling FLAMES::get_default_config_file, which returns the path to the default JSON configuration file. This JSON file is used in absence of both the argument config_file and any alteration of the other optional pipeline arguments.

If config_file is not given, the pipeline can instead be customised by altering any of the optional arguments the pipeline allows. These customisations are then stored in a JSON file saved in the specified out directory, which allows for easier reproduction of pipeline execution. More information on the optional arguments at the contents of the JSON configuration file can be found by running ?create_config and ?bulk_long_pipeline.

This vignette uses the default configuration file.

2.1.2 FLAMES execution

Once the initial variables have been setup, the pipeline can be run using:

library(FLAMES)
sce <- sc_long_pipeline(annot=annot, fastq=fastq, genome_fa=genome_fa, outdir=outdir, config_file=config_file, match_barcode=FALSE)

As stated above, the example dataset has already been demultiplexed, so match_barcode=FALSE should be set to skip the preprocessing of the barcodes.

If, however, the input fastq files need to be demultiplexed, then the reference_csv argument will need to be specified - reference_csv is the file path to a .csv of barcodes to be used as reference during demultiplexing.

2.1.3 FLAMES termination

The directory outdir now contains several output files returned from this pipeline. The output files generated by this pipeline are:

  • transcript_count.csv.gz - a transcript count matrix (also contained in the output SummarizedExperiment or SingleCellExperiment)
  • isoform_annotated.filtered.gff3 - found isoform information in gff3 format
  • transcript_assembly.fa - transcript sequence from the isoforms
  • align2genome.bam sorted BAM file with reads aligned to genome (intermediate FLAMES step)
  • realign2transcript.bam - sorted realigned BAM file using the transcript_assembly.fa as reference (intermediate FLAMES step)
  • tss_tes.bedgraph- TSS TES enrichment for all reads (for QC)

The pipeline also returns a SummarizedExperiment or SingleCellExperiment object, depending on the pipeline run, containing the data from transcript_count.csv.gzand isoform_annotated.filtered.gff3 (Amezquita et al. 2020) (Morgan et al. 2020). This SummarizedExperiment (or SingleCellExperiment) object contains the same data as present in the outdir directory, and is given to simplify the process of reading the transcript count matrix and annotation data back into R, for further analysis.

2.2 FLAMES Bulk Pipeline

A basic example of the execution of the FLAMES bulk pipeline is given below. The process for this is essentially identical to the above example for single cell data.

2.2.1 Environment setup

To get started, the pipeline needs access to a gene annotation file in GFF3 or GTF format, a directory containing one or more FASTQ files (which will be merged as pre-processing), a genome FASTA file, as well as the file path to minimap2, and the file path to the directory to hold output files.

For this example, these files are downloaded from GitHub using BiocFileCache (Shepherd and Morgan 2021).

temp_path <- tempfile()
bfc <- BiocFileCache::BiocFileCache(temp_path, ask=FALSE)
file_url <- 
  "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"
annot <- bfc[[names(BiocFileCache::bfcadd(bfc, "Annotation", 
                                          paste(file_url, 
                                                "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", 
                                                sep="/")))]]
genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 
                                              "Genomefa", 
                                              paste(file_url, 
                                                    "SIRV_isoforms_multi-fasta_170612a.fasta", 
                                                    sep="/")))]]

# download the two fastq files, move them to a folder to be merged together
fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep="/")))]]
fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep="/")))]]
fastq_dir <- paste(temp_path, "fastq_dir", sep="/") # the downloaded fastq files need to be in a directory to be merged together
dir.create(fastq_dir)
file.copy(c(fastq1, fastq2), fastq_dir)
#> [1] TRUE TRUE
unlink(c(fastq1, fastq2)) # the original files can be deleted

# setup other environment variables
config_file <- FLAMES::get_default_config_file()
outdir <- tempfile()
if (!dir.exists(outdir)) {
    dir.create(outdir)
}

2.2.2 FLAMES execution

Once the environment has been setup, the pipeline can be executed by running:

library(FLAMES)
summarizedExperiment <- bulk_long_pipeline(annot=annot, fastq=fastq_dir, outdir=temp_path, 
                                           genome_fa=genome_fa, minimap2_dir=mm2_dir,
                                           config_file = config_file)

2.2.3 FLAMES termination

After the bulk pipeline has completed, the output directory contains the same files as the single cell pipeline produces. bulk_long_pipeline also returns a SummarizedExperiment object, containing the same data as the SingleCellExperiment as above (Amezquita et al. 2020) (Morgan et al. 2020).

2.3 FLAMES on Windows

Due to FLAMES requiring minimap2 to complete the pipeline, the straight FLAMES pipeline functions bulk_long_pipeline() and sc_long_pipeline() won’t run on a Windows OS. To overcome this issue, the vignette ‘Vignette for FLAMES on Windows’ describes the alternate method of running the FLAMES pipelines, which requires acccess to minimap2 on an external system.

3 Session Info

#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] BiocStyle_2.22.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-7                matrixStats_0.61.0         
#>   [3] bit64_4.0.5                 filelock_1.0.2             
#>   [5] httr_1.4.2                  GenomeInfoDb_1.30.0        
#>   [7] tools_4.1.1                 bslib_0.3.1                
#>   [9] utf8_1.2.2                  R6_2.5.1                   
#>  [11] irlba_2.3.3                 vipor_0.4.5                
#>  [13] DBI_1.1.1                   BiocGenerics_0.40.0        
#>  [15] colorspace_2.0-2            gridExtra_2.3              
#>  [17] tidyselect_1.1.1            bit_4.0.4                  
#>  [19] curl_4.3.2                  compiler_4.1.1             
#>  [21] Biobase_2.54.0              BiocNeighbors_1.12.0       
#>  [23] basilisk.utils_1.6.0        DelayedArray_0.20.0        
#>  [25] bookdown_0.24               sass_0.4.0                 
#>  [27] scales_1.1.1                rappdirs_0.3.3             
#>  [29] stringr_1.4.0               digest_0.6.28              
#>  [31] Rsamtools_2.10.0            rmarkdown_2.11             
#>  [33] FLAMES_1.0.2                basilisk_1.6.0             
#>  [35] XVector_0.34.0              scater_1.22.0              
#>  [37] pkgconfig_2.0.3             htmltools_0.5.2            
#>  [39] sparseMatrixStats_1.6.0     MatrixGenerics_1.6.0       
#>  [41] dbplyr_2.1.1                fastmap_1.1.0              
#>  [43] highr_0.9                   rlang_0.4.12               
#>  [45] RSQLite_2.2.8               DelayedMatrixStats_1.16.0  
#>  [47] jquerylib_0.1.4             generics_0.1.1             
#>  [49] jsonlite_1.7.2              BiocParallel_1.28.0        
#>  [51] dplyr_1.0.7                 RCurl_1.98-1.5             
#>  [53] magrittr_2.0.1              BiocSingular_1.10.0        
#>  [55] GenomeInfoDbData_1.2.7      scuttle_1.4.0              
#>  [57] Matrix_1.3-4                Rcpp_1.0.7                 
#>  [59] ggbeeswarm_0.6.0            munsell_0.5.0              
#>  [61] S4Vectors_0.32.1            fansi_0.5.0                
#>  [63] viridis_0.6.2               reticulate_1.22            
#>  [65] lifecycle_1.0.1             stringi_1.7.5              
#>  [67] yaml_2.2.1                  SummarizedExperiment_1.24.0
#>  [69] zlibbioc_1.40.0             BiocFileCache_2.2.0        
#>  [71] grid_4.1.1                  blob_1.2.2                 
#>  [73] ggrepel_0.9.1               parallel_4.1.1             
#>  [75] crayon_1.4.2                dir.expiry_1.2.0           
#>  [77] lattice_0.20-45             Biostrings_2.62.0          
#>  [79] beachmat_2.10.0             knitr_1.36                 
#>  [81] pillar_1.6.4                GenomicRanges_1.46.0       
#>  [83] ScaledMatrix_1.2.0          stats4_4.1.1               
#>  [85] glue_1.4.2                  evaluate_0.14              
#>  [87] BiocManager_1.30.16         png_0.1-7                  
#>  [89] vctrs_0.3.8                 tidyr_1.1.4                
#>  [91] gtable_0.3.0                purrr_0.3.4                
#>  [93] assertthat_0.2.1            cachem_1.0.6               
#>  [95] ggplot2_3.3.5               xfun_0.28                  
#>  [97] rsvd_1.0.5                  viridisLite_0.4.0          
#>  [99] SingleCellExperiment_1.16.0 tibble_3.1.5               
#> [101] beeswarm_0.4.0              memoise_2.0.0              
#> [103] IRanges_2.28.0              ellipsis_0.3.2

References

Amezquita, Robert, Aaron Lun, Etienne Becht, Vince Carey, Lindsay Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods 17: 137–45. https://www.nature.com/articles/s41592-019-0654-x.

Li, Heng. 2018. “Minimap2: pairwise alignment for nucleotide sequences.” Bioinformatics 34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.

Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2020. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.

Shepherd, Lori, and Martin Morgan. 2021. BiocFileCache: Manage Files Across Sessions.

Tian, Luyi, Jafar S. Jabbari, Rachel Thijssen, Quentin Gouil, Shanika L. Amarasinghe, Hasaru Kariyawasam, Shian Su, et al. 2020. “Comprehensive Characterization of Single Cell Full-Length Isoforms in Human and Mouse with Long-Read Sequencing.” bioRxiv. https://doi.org/10.1101/2020.08.10.243543.