Contents

1 Windows FLAMES Pipeline

Windows is capable of running the FLAMES pipeline however, as the genome alignment in FLAMES is handled using minimap2, these functions are not able to be run on a Windows OS. This vignette provides a workflow for running the package primarily on Windows, with the required non-Windows functions being handled by a supported system - either MacOS or a Linux distrubution.

In order to run the FLAMES pipeline on Windows, ensure that the arguments do_genome_align and do_read_realign are set to FALSE. This will run the FLAMES pipeline, without the initial read alignment using Minimap2 step, and the later read realignment using Minimap2 step.

bulk_long_pipeline(..., do_genome_align=FALSE, do_read_realign=FALSE)
# OR
sc_long_pipeline(..., do_genome_align=FALSE, do_read_realign=FALSE)

Alternatively, the FLAMES pipeline can be manually executed using exported functions from the FLAMES package, and minimap2 alignment and realignment can be performed externally to complete the full pipeline.

This process for manual execution with external alignment is outlined below.

1.1 Manual execution of FLAMES pipeline

1.1.1 Environment setup

Begin by storing the required variables annot, fastq, outdir, genome_fa, in the current workspace. Ensure outdir exists, and is writable. More information on the values of these variables can be obtained by executing ?FLAMES::bulk_long_pipeline or ?FLAMES::sc_long_pipeline. The optional config_file variable can also be stored, or the default configuration file, located in the extdata/ folder in the base level of this packaged can be used.

This vignette uses BiocFileCache to download the required variables from an example data set. Below is the process for creating the BiocFileCache and downloading the required files. The code below is included for demonstration and is not required in order to run the Flames pipeline on Windows. This pipeline runs the bulk version of the FLAMES pipeline, however code is provided to run the single cell pipeline instead. Running the single cell pipeline requires little modification to this process beyond the initial pipeline setup. The downloaded files are stored in a temporary directory.

# download required files using BiocFileCache
temp_path <- tempdir()
bfc <- BiocFileCache::BiocFileCache(temp_path, ask=FALSE)
file_url <- 
  "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"
annot <- bfc[[names(BiocFileCache::bfcadd(bfc, "Annotation", 
                                          paste(file_url, 
                                                "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", 
                                                sep="/")))]] # [[ notation is used to get the local file path of the downloaded file
genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 
                                              "Genomefa", 
                                              paste(file_url, 
                                                    "SIRV_isoforms_multi-fasta_170612a.fasta", 
                                                    sep="/")))]]

# download the two fastq files, move them to a folder to be merged together
fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep="/")))]]
fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep="/")))]]
fastq_dir <- paste(temp_path, "fastq_dir", sep="/") # the downloaded fastq files need to be in a directory to be merged together
dir.create(fastq_dir)
file.copy(c(fastq1, fastq2), fastq_dir)
#> [1] TRUE TRUE
unlink(c(fastq1, fastq2)) # the original files can be deleted

# setup other environment variables
config_file <- system.file("extdata/SIRV_config_default.json", package="FLAMES") # the configuration file is included with the FLAMES package
outdir <- tempdir() # create a temporary output directory
if (!dir.exists(outdir)) {
    dir.create(outdir)
}

1.1.2 FLAMES Execution

1.1.2.1 Pipeline Setup

The FLAMES windows pipeline is broken into three steps, and three functions: bulk_windows_pipeline_setup or sc_windows_pipeline_setup, to setup the environment variables for the pipeline, and those specific for the bulk or single cell pipeline, windows_pipeline_isoforms, which is run after read alignment, in order to identify isoforms, and finally windows_pipeline_quantification, run after read realignment to quantify the results and output a SummarizedExperiemnt or SingleCellExperiment object.

bulk_windows_pipeline_setup or sc_windows_pipeline_setup must be run before the other steps are begun.

library(FLAMES)
# If running the FLAMES single cell pipeline:
# pipeline_variables <- sc_windows_pipeline_setup(annot=annot, fastq=fastq_dir, outdir=outdir,
#                                                 genome_fa=genome_fa, config_file=config_file)
# or, if running the bulk pipeline:
pipeline_variables <- bulk_windows_pipeline_setup(annot=annot, fastq=fastq_dir, outdir=outdir, 
                                                  genome_fa=genome_fa, config_file=config_file)
#> Preprocessing bulk fastqs...
#> 0x562d59bab360
#> 527662a274349_sample2.fastq.gz: 2500
#> 527665d12ecba_sample1.fastq.gz: 2500
#> Running FLAMES pipeline...
#> #### Input parameters:
#>  gene annotation: /tmp/RtmpXrqYOt/52766147bf412_SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf 
#>  genome fasta: /tmp/RtmpXrqYOt/5276644fc3569_SIRV_isoforms_multi-fasta_170612a.fasta 
#>  input fastq: /tmp/RtmpXrqYOt/merged.fastq.gz 
#>  output directory: /tmp/RtmpXrqYOt

The return list pipeline_variables contains all the required variables to run the future pipeline steps. This initial function sets up the pipeline, and validates input. After it is run, minimap2 alignment is required using the files given in pipeline_variables$return_files. This should be undertaken on a system with access to minimap2.

1.1.2.1.1 Minimap2 alignment

To run the minimap2 alignment part of the FLAMES pipeline, this section describes the minimap2 commands used in the FLAMES pipeline, which need to be undertaken on an external machine with minimap2 installed.

First, if pipeline_variables$config$alignment_parameters$use_junctions is TRUE, the following needs to be run, to produce pipeline_variables$tmp_bed.

gff3_to_bed12(pipeline_variables$annot, pipeline_variables$tmp_bed)
#> [1] "/tmp/RtmpXrqYOt/tmp_splice_anno.bed12"

After this, the files pipeline_variables$genome_fa, pipeline_variables$tmp_bed and pipeline_variables$fastq, need to be transfered to a system with minimap2 installed, where the main minimap2 alignment step will take place

After transferred, the following should be run on this new system.

{_prog} -ax splice -t 12 {_others} -k14 --secondary=no {_index} {_fq} -o {_out}

For this command:

  • {_prog} denotes the path to the minimap2 executable (eg, /~/minimap2/minimap2). If minimap2 is in PATH, {_prog} is only required to be minimap2.
  • {_index} denotes the path to the fasta index file, pipeline_variables$genome_fa
  • {_fq} denotes the input fastq file (for bulk, the merged fastq file), pipeline_variables$fastq.
  • {_out} denotes the temporary output sam file, and an approriate name should be given.
  • {_others} denotes the the command to run minimap2 with some optional arguments.
    • If pipeline_variables$config$alignment_parameters$use_junctions$ is TRUE, {_others} should include "--junc-bed {_bed} --junc-bonus 1" where {_bed} is pipeline_variables$tmp_bed.
    • If pipeline_variables$config$alignment_parameters$no_flank is TRUE, {_others} should include "--splice-flank=no".
    • If both extra command are present, they should be space separated.

After this has been completed, the resulting output file given as the {_out} argument should be returned to the original system, and stored with the same address as pipeline_variables$tmp_sam.

After this, the following code can be run on the original system, to complete the minimap2 alignment process.

samtools_as_bam(pipeline_variables$tmp_sam, pipeline_variables$tmp_bam)
samtools_sort_index(pipeline_variables$tmp_bam, pipeline_variables$genome_bam)
file.remove(pipeline_variables$tmp_sam)
file.remove(pipeline_variables$tmp_bam)

1.1.2.2 Find Isoforms

After read alignment, the pipeline can resume by calling windows_pipeline_isoforms, and passing in the pipeline_variables list returned from the previous step.

pipeline_variables <- windows_pipeline_isoforms(pipeline_variables)

Following this step, read realignment needs to be undertaken as before, on a system with access to minimap2.

1.1.2.2.1 Minimap2 read realignment

Like the original minimap2 read alignment step, this process requires the transfer of files between the starting FLAMES system and one with access to minimap2.

For this step, pipeline_variables$transcript_fa and pipeline_variables$fastq need to be transfered to the system with access to minimap2.

Once this has been done, the minimap2 realignment step can be undertaken by running:

{_prog} -ax map-ont -p 0.9 --end-bonus 10 -N 3 -t 12 {_index} {_fq} -o {_out}

Where, as before:

  • {_prog} denotes the path to the minimap2 executable (eg, /~/minimap2/minimap2). If minimap2 is in PATH, {_prog} is only required to be minimap2.
  • {_index} denotes the path to the fasta index file, pipeline_variables$genome_fa
  • {_fq} denotes the input fastq file (for bulk, the merged fastq file), pipeline_variables$fastq.
  • {_out} denotes the temporary output sam file, and an approriate name should be given.

The resulting sam file, {_out} should be transfered back to the original system, and stored as pipeline_variables$tmp_sam.

Once this is done, the following can be run to complete the minimap2 realignment process.

samtools_as_bam(pipeline_variables$tmp_sam, pipeline_variables$tmp_bam)
samtools_sort_index(pipeline_variables$tmp_bam, pipeline_variables$realign_bam)
file.remove(pipeline_variables$tmp_sam)
file.remove(pipeline_variables$tmp_bam)

1.1.2.3 Transcript Quantification

Finally, transcript quantification can be performed, as required:

se <- windows_pipeline_quantification(pipeline_variables)
#> #### Generating transcript count matrix

The directory outdir now contains several output files returned from this pipeline. The output files generated by this pipeline are:

The pipeline also returns a SummarizedExperiment or SingleCellExperiment object, depending on the pipeline run, containing the data from transcript_count.csv.gzand isoform_annotated.filtered.gff3.

se
#> class: SummarizedExperiment 
#> dim: 43 4 
#> metadata(3): Annotations AnnotationFile OutputFiles
#> assays(1): Flames Bulk
#> rownames(43): 1 2 ... 42 43
#> rowData names(0):
#> colnames(4): transcript_id gene_id sample2.fastq.gz sample1.fastq.gz
#> colData names(0):

1.1.2.4 Session Info

#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] FLAMES_1.0.2     BiocStyle_2.22.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-7                matrixStats_0.61.0         
#>   [3] bit64_4.0.5                 filelock_1.0.2             
#>   [5] httr_1.4.2                  rprojroot_2.0.2            
#>   [7] GenomeInfoDb_1.30.0         tools_4.1.1                
#>   [9] bslib_0.3.1                 utf8_1.2.2                 
#>  [11] R6_2.5.1                    irlba_2.3.3                
#>  [13] vipor_0.4.5                 DBI_1.1.1                  
#>  [15] BiocGenerics_0.40.0         colorspace_2.0-2           
#>  [17] gridExtra_2.3               tidyselect_1.1.1           
#>  [19] bit_4.0.4                   curl_4.3.2                 
#>  [21] compiler_4.1.1              Biobase_2.54.0             
#>  [23] BiocNeighbors_1.12.0        basilisk.utils_1.6.0       
#>  [25] DelayedArray_0.20.0         bookdown_0.24              
#>  [27] sass_0.4.0                  scales_1.1.1               
#>  [29] rappdirs_0.3.3              stringr_1.4.0              
#>  [31] digest_0.6.28               Rsamtools_2.10.0           
#>  [33] rmarkdown_2.11              basilisk_1.6.0             
#>  [35] XVector_0.34.0              scater_1.22.0              
#>  [37] pkgconfig_2.0.3             htmltools_0.5.2            
#>  [39] sparseMatrixStats_1.6.0     MatrixGenerics_1.6.0       
#>  [41] dbplyr_2.1.1                fastmap_1.1.0              
#>  [43] highr_0.9                   rlang_0.4.12               
#>  [45] RSQLite_2.2.8               DelayedMatrixStats_1.16.0  
#>  [47] jquerylib_0.1.4             generics_0.1.1             
#>  [49] jsonlite_1.7.2              BiocParallel_1.28.0        
#>  [51] dplyr_1.0.7                 RCurl_1.98-1.5             
#>  [53] magrittr_2.0.1              BiocSingular_1.10.0        
#>  [55] GenomeInfoDbData_1.2.7      scuttle_1.4.0              
#>  [57] Matrix_1.3-4                Rcpp_1.0.7                 
#>  [59] ggbeeswarm_0.6.0            munsell_0.5.0              
#>  [61] S4Vectors_0.32.1            fansi_0.5.0                
#>  [63] viridis_0.6.2               reticulate_1.22            
#>  [65] lifecycle_1.0.1             stringi_1.7.5              
#>  [67] yaml_2.2.1                  SummarizedExperiment_1.24.0
#>  [69] zlibbioc_1.40.0             BiocFileCache_2.2.0        
#>  [71] grid_4.1.1                  blob_1.2.2                 
#>  [73] ggrepel_0.9.1               parallel_4.1.1             
#>  [75] crayon_1.4.2                dir.expiry_1.2.0           
#>  [77] lattice_0.20-45             Biostrings_2.62.0          
#>  [79] beachmat_2.10.0             knitr_1.36                 
#>  [81] pillar_1.6.4                GenomicRanges_1.46.0       
#>  [83] ScaledMatrix_1.2.0          stats4_4.1.1               
#>  [85] glue_1.4.2                  evaluate_0.14              
#>  [87] BiocManager_1.30.16         png_0.1-7                  
#>  [89] vctrs_0.3.8                 tidyr_1.1.4                
#>  [91] gtable_0.3.0                purrr_0.3.4                
#>  [93] assertthat_0.2.1            cachem_1.0.6               
#>  [95] ggplot2_3.3.5               xfun_0.28                  
#>  [97] rsvd_1.0.5                  viridisLite_0.4.0          
#>  [99] SingleCellExperiment_1.16.0 tibble_3.1.5               
#> [101] beeswarm_0.4.0              memoise_2.0.0              
#> [103] IRanges_2.28.0              ellipsis_0.3.2             
#> [105] here_1.0.1