bulk_long_pipeline {FLAMES} | R Documentation |
Semi-supervised isofrom detection and annotation for long read data. This variant is meant for bulk samples. Specific parameters relating to analysis can be changed either through function arguments, or through a configuration JSON file.
bulk_long_pipeline( annot, fastq, in_bam = NULL, outdir, genome_fa, minimap2_dir = "", downsample_ratio = 1, config_file = NULL, do_genome_align = TRUE, do_isoform_id = TRUE, do_read_realign = TRUE, do_transcript_quanti = TRUE, gen_raw_isoform = TRUE, has_UMI = FALSE, MAX_DIST = 10, MAX_TS_DIST = 100, MAX_SPLICE_MATCH_DIST = 10, min_fl_exon_len = 40, Max_site_per_splice = 3, Min_sup_cnt = 10, Min_cnt_pct = 0.01, Min_sup_pct = 0.2, strand_specific = 1, remove_incomp_reads = 5, use_junctions = TRUE, no_flank = TRUE, use_annotation = TRUE, min_tr_coverage = 0.75, min_read_coverage = 0.75 )
annot |
The file path to gene annotations file in gff3 format |
fastq |
the path to the directory containing the fastq input files to merge into one, |
in_bam |
optional BAM file path which replaces fastq directory argument. This skips the genome alignment and realignment steps |
outdir |
The path to directory to store all output files. |
genome_fa |
The file path to genome fasta file. |
minimap2_dir |
Path to the directory containing minimap2, if it is not in PATH. Only required if either or both of
|
downsample_ratio |
Integer; downsampling ratio if performing downsampling analysis. |
config_file |
File path to the JSON configuration file. If specified, |
do_genome_align |
Boolean; specifies whether to run the genome alignment step. |
do_isoform_id |
Boolean; specifies whether to run the isoform identification step. |
do_read_realign |
Boolean; specifies whether to run the read realignment step. |
do_transcript_quanti |
Boolean; specifies whether to run the transcript quantification step. |
gen_raw_isoform |
Boolean; specifies whether a gff3 should be generated containing the raw isoform information in the isoform identification step |
has_UMI |
Boolean; specifies if the data contains UMI. |
MAX_DIST |
Real; maximum distance allowed when merging splicing sites in isoform consensus clustering. |
MAX_TS_DIST |
Real; maximum distance allowed when merging transcript start/end position in isoform consensus clustering. |
MAX_SPLICE_MATCH_DIST |
Real; maximum distance allowed when merging splice site called from the data and the reference annotation. |
min_fl_exon_len |
Real; minimum length for the first exon outside the gene body in reference annotation. This is to correct the alignment artifact |
Max_site_per_splice |
Real; maximum transcript start/end site combinations allowed per splice chain |
Min_sup_cnt |
Real; minimum number of read support an isoform. Decreasing this number will significantly increase the number of isoform detected. |
Min_cnt_pct |
Real; minimum percentage of count for an isoform relative to total count for the same gene. |
Min_sup_pct |
Real; minimum percentage of count for an splice chain that support a given transcript start/end site combination. |
strand_specific |
1, -1 or 0. 1 indicates if reads are in the same strand as mRNA, -1 indicates reads are reverse complemented, 0 indicates reads are not strand specific. |
remove_incomp_reads |
Real; determines the strength of truncated isoform filtering. Larger number means more stringent filtering. |
use_junctions |
Boolean; determiens whether to use known splice junctions to help correct the alignment results |
no_flank |
Boolean; passed to minimap2 for synthetic spike-in data. Refer to Minimap2 document for more details |
use_annotation |
Boolean; specifies whether to use reference to help annotate known isoforms |
min_tr_coverage |
Real; minimum percentage of isoform coverage for a read to be aligned to that isoform |
min_read_coverage |
Real; minimum percentage of read coverage for a read to be uniquely aligned to that isoform |
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align
), FLAMES summarizes the alignment for each read by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id
). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. Next is the read realignment step (do_read_realign
), where the sequence of each polished transcript is extracted and used as
the updated reference. The reads are realigned to this reference by minimap2. The
transcripts with only a few full-length aligned reads are discarded.
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file
).
The default parameters can be changed either through the function
arguments are through the configuration JSON file config_file
. the pipeline_parameters
section specifies which steps are to be executed in the pipeline - by default, all
steps are executed. The isoform_parameters
section affects isoform detection - key
parameters include:
Min_sup_cnt
which causes transcripts with less reads aligned than
it's value to be discarded
MAX_TS_DIST
which merges transcripts with the same intron
chain and TSS/TES distace less than MAX_TS_DIST
strand_specific
which specifies if reads are in the same strand as the mRNA (1),
or the reverse complemented (-1) or not strand specific (0), which results in
strand information being based on reference annotation.
bulk_long_pipeline
returns a SummarizedExperiment object, containing a count
matrix as an assay, gene annotations under metadata, as well as a list of the other
output files generated by the pipeline. The pipeline also outputs a number of output
files into the given outdir
directory. These output files generated by the pipeline are:
transcript_count.csv.gz - a transcript count matrix (also contained in the SummarizedExperiment)
isoform_annotated.filtered.gff3 - isoforms in gff3 format (also contained in the SummarizedExperiment)
transcript_assembly.fa - transcript sequence from the isoforms
align2genome.bam - sorted BAM file with reads aligned to genome
realign2transcript.bam - sorted realigned BAM file using the transcript_assembly.fa as reference
tss_tes.bedgraph - TSS TES enrichment for all reads (for QC)
sc_long_pipeline()
for single cell data,
SummarizedExperiment()
for how data is outputted
# download the two fastq files, move them to a folder to be merged together temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask=FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" # download the required fastq files, and move them to new folder fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep="/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep="/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep="/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted ## Not run: # run the FLAMES bulk pipeline, using the downloaded files se <- bulk_long_pipeline(annot=system.file("extdata/SIRV_anno.gtf", package="FLAMES"), fastq=fastq_dir, outdir=tempdir(), genome_fa=system.file("extdata/SIRV_genomefa.fasta", package="FLAMES"), config_file=system.file("extdata/SIRV_config_default.json", package="FLAMES")) ## End(Not run) # OR # run the FLAMES single cell pipeline #sce <- sc_long_pipeline(annot, fastq, NULL, outdir, genome_fa, match_barcode=FALSE, config+file=config)