1 Introduction

This package was created to increase the sensitivity of EML4-ALK detection in data from commercially available NGS products such as the AVENIO (Roche) pipeline.

BAM files generated from paired-end sequencing of cfDNA can be used as input to discover EML4-ALK variants. This package was developed using position-deduplicated BAM files generated with the AVENIO Oncology Analysis Software. These files were produced with the AVENIO ctDNA Surveillance Kit and Illumina NextSeq 500 sequencing. This is a targeted hybridization NGS approach that includes ALK-specific but not EML4-specific probes.

The package includes six functions.

The output of the first function, EML4_ALK_detection(), is used to determine whether EML4-ALK is detected and serves as input for the next four exploratory functions, which characterize the EML4-ALK variant. The last function, EML4_ALK_analysis(), combines the output of the exploratory functions.

To serve as examples, this package includes BAM files representing the EML4-ALK-positive cell line H3122 and the EML4-ALK-negative cell line HCC827.

2 Installation

Use Bioconductor to install the most recent version of DNAfusion:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DNAfusion")
library(DNAfusion)

3 Package data

BAM files from the cell lines H3122 and HCC827 are included in the package and can be used as examples to explore the functions.

H3122_bam <- system.file("extdata",
                         "H3122_EML4.bam",
                         package = "DNAfusion")
HCC827_bam <- system.file("extdata",
                          "HCC827_EML4.bam",
                          package = "DNAfusion")
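
Because system.file() returns an empty string when the requested file cannot be found, it can be worth checking the paths before passing them to the functions below. The following is a minimal sketch; check_bam_path() is a hypothetical helper written for illustration and is not part of DNAfusion.

```r
# Hypothetical helper (not part of DNAfusion): stop early if a path
# returned by system.file() is empty or does not point to a file.
check_bam_path <- function(path) {
    if (!nzchar(path) || !file.exists(path)) {
        stop("BAM file not found: '", path, "'")
    }
    invisible(path)
}

# system.file() returns "" for a file that does not exist
missing_path <- system.file("extdata", "no_such_file.bam", package = "base")
nzchar(missing_path)
#> [1] FALSE
```

With the package installed, check_bam_path(H3122_bam) would be expected to return the path silently.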

4 Functions

4.1 EML4_ALK_detection()

This function looks for EML4-ALK mate pair reads in the BAM file.

Input:

file

The name of the file which the data are to be read from.

genome

character representing the reference genome.
Can be either "hg38" or "hg19".
Default = "hg38".

mates

integer, the minimum number of EML4-ALK mate pairs that must be
detected in order to call a variant. Default = 2.

Output:

A GAlignments object with soft-clipped reads representing EML4-ALK is returned. If no EML4-ALK is detected, the GAlignments object is empty.

Examples:

H3122_result <- EML4_ALK_detection(file = H3122_bam,
                                   genome = "hg38",
                                   mates = 2)
head(H3122_result)
#> GAlignments object with 6 alignments and 2 metadata columns:
#>       seqnames strand       cigar    qwidth     start       end     width
#>          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
#>   [1]     chr2      +       94M2S        96  42299657  42299750        94
#>   [2]     chr2      +       94M2S        96  42299657  42299750        94
#>   [3]     chr2      +       94M2S        96  42299657  42299750        94
#>   [4]     chr2      +       94M2S        96  42299657  42299750        94
#>   [5]     chr2      +       94M2S        96  42299657  42299750        94
#>   [6]     chr2      +       94M2S        96  42299657  42299750        94
#>           njunc |      mpos                     seq
#>       <integer> | <integer>          <DNAStringSet>
#>   [1]         0 |  29223691 TTGCTTCTTT...GCAGTGGTCT
#>   [2]         0 |  29223375 TTGCTTCTTT...GCAGTGGTCT
#>   [3]         0 |  29223479 TTGCTTCTTT...GCAGTGGTCT
#>   [4]         0 |  29223686 TTGCTTCTTT...GCAGTGGTCT
#>   [5]         0 |  29223636 TTGCTTCTTT...GCAGTGGTCT
#>   [6]         0 |  29223687 TTGCTTCTTT...GCAGTGGTCT
#>   -------
#>   seqinfo: 455 sequences from an unspecified genome
HCC827_result <- EML4_ALK_detection(file = HCC827_bam,
                                    genome = "hg38",
                                    mates = 2)
HCC827_result
#> GAlignments object with 0 alignments and 0 metadata columns:
#>    seqnames strand       cigar    qwidth     start       end     width
#>       <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
#>        njunc
#>    <integer>
#>   -------
#>   seqinfo: no sequences
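
The role of the mates threshold can be illustrated with a toy example in base R. This is only a sketch of the idea and not the package's implementation: the genomic windows are invented for illustration, the first three read pairs are copied from the H3122 output above, and the last pair is made up for contrast.

```r
# Toy data: read start positions and mate positions on chr2.
# Rows 1-3 are taken from the H3122 output above; row 4 is an
# invented ordinary pair whose mate maps close to the read itself.
reads <- data.frame(
    start = c(42299657, 42299657, 42299657, 42299000),
    mpos  = c(29223691, 29223375, 29223479, 42299150)
)

# Illustrative windows only (assumed, not the coordinates used by DNAfusion)
eml4_window <- c(42299000, 42300000)
alk_window  <- c(29223000, 29224000)

in_window <- function(pos, window) pos >= window[1] & pos <= window[2]

# A pair supports the fusion if the read lies in the EML4 window
# and its mate lies in the ALK window.
supporting <- in_window(reads$start, eml4_window) &
    in_window(reads$mpos, alk_window)

# Call a variant only if at least `mates` pairs support it
mates <- 2
sum(supporting) >= mates
#> [1] TRUE
```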

4.2 EML4_sequence()

This function identifies the basepairs leading up to the EML4 breakpoint.

Input:

reads

GAlignments object returned by EML4_ALK_detection().

basepairs

integer, number of basepairs identified from the EML4-ALK fusion.
Default = 20.

Output:

If EML4-ALK is detected, a table of the identified EML4 basepairs is returned, with the number of corresponding reads for each sequence. If no EML4-ALK is detected, “No EML4-ALK was detected” is returned.

Examples:

EML4_sequence(H3122_result, basepairs = 20)
#> EML4_seq
#> CCAGGCTGGAGTGCAGTGGT GGAGTGCAGTGGTGTGATTT TCAGGCTGGAGTGCAGTGGT 
#>                  201                    1                    1
EML4_sequence(HCC827_result, basepairs = 20)
#> [1] "No EML4-ALK was detected"
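
To make the idea concrete, the sketch below takes the last few aligned bases before a soft clip and tabulates them, using base R only. It is an illustration under simplifying assumptions and not the package's implementation: last_aligned_bases() is a hypothetical helper, it only handles CIGAR strings of the form "<n>M<k>S" (such as the 94M2S reads shown in section 4.1), and the toy reads are invented.

```r
# Hypothetical helper: last `basepairs` bases of the aligned part of a read
# whose CIGAR has the simple form "<n>M<k>S" (match, then soft clip).
last_aligned_bases <- function(seq, cigar, basepairs = 20) {
    matched <- as.integer(sub("^([0-9]+)M[0-9]+S$", "\\1", cigar))
    substr(seq, matched - basepairs + 1, matched)
}

# Toy reads (invented): 10 aligned bases followed by 2 soft-clipped bases
toy_seq   <- c("AAACCCGGGTCT", "AAACCCGGGTCT", "TTACCCGGGACT")
toy_cigar <- c("10M2S", "10M2S", "10M2S")

# Count how many reads share each sequence before the soft clip:
# "CGGGT" is seen twice and "CGGGA" once.
table(EML4_seq = last_aligned_bases(toy_seq, toy_cigar, basepairs = 5))
```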

4.3 ALK_sequence()

This function identifies the basepairs following the ALK breakpoint.

Input:

reads

GAlignments object returned by EML4_ALK_detection().

basepairs

integer, number of basepairs identified from the EML4-ALK fusion.
Default = 20.

Output:

If EML4-ALK is detected, a table of the identified ALK basepairs is returned, with the number of corresponding reads for each sequence. If no EML4-ALK is detected, “No EML4-ALK was detected” is returned.

Examples:

ALK_sequence(H3122_result, basepairs = 20)
#> ALK_seq
#> CAGAATTTTAGCTTTGCAAT CGGAATTTTAGCTTTGCATT      CGGATTTTTAGCTTT 
#>                    1                    1                    1 
#> CGGATTTTTAGCTTTTCATT                   CT                  CTG 
#>                    2                    8                    3 
#>                 CTGA                CTGAA                CTGAT 
#>                   11                    1                   16 
#>             CTGATTTT            CTGATTTTT           CTGATTTTTA 
#>                    5                    6                    3 
#>          CTGATTTTTAG CTGATTTTTAGATTTGCATT         CTGATTTTTAGC 
#>                    3                    1                   14 
#>        CTGATTTTTAGCT       CTGATTTTTAGCTT      CTGATTTTTAGCTTT 
#>                   10                   10                    3 
#>     CTGATTTTTAGCTTTG    CTGATTTTTAGCTTTGC   CTGATTTTTAGCTTTGCA 
#>                    4                    7                    8 
#> CTGATTTTTAGCTTTGCAAT  CTGATTTTTAGCTTTGCAT CTGATTTTTAGCTTTGCATT 
#>                    1                    1                   71 
#>     CTGATTTTTAGCTTTT CTGATTTTTAGCTTTTCATA          CTGATTTTTAT 
#>                    1                    1                    1 
#>     CTGATTTTTATCTTTG CTGATTTTTATCTTTGCATT CTGATTTTTATCTTTTGATT 
#>                    2                    2                    1 
#> CTGTGTTTTAGATTTGCATT CTGTTTTTTATCTTTGCAAT CTTATTTTTATCTTTGCATT 
#>                    1                    1                    1 
#>            TTAGCTTTG 
#>                    1
ALK_sequence(HCC827_result, basepairs = 20)
#> [1] "No EML4-ALK was detected"
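
Many entries in the H3122 table are shorter than the requested 20 basepairs, presumably because the soft-clipped part of those reads is shorter than basepairs (the 94M2S reads in section 4.1, for instance, carry only two soft-clipped bases). One way to summarize such a table is to keep only the full-length entries and pick the best-supported one. The sketch below does this in base R on a hand-entered subset of the counts shown above; it is an illustration and not a function of the package.

```r
# Subset of the ALK_sequence() counts shown above, entered by hand
alk_counts <- c(
    "CT"                   = 8,
    "CTGAT"                = 16,
    "CTGATTTTTAGC"         = 14,
    "CTGATTTTTAGCTTTGCATT" = 71,
    "CTGATTTTTATCTTTGCATT" = 2
)

# Keep only sequences that span the full requested length
basepairs <- 20
full_length <- alk_counts[nchar(names(alk_counts)) == basepairs]

# Sequence supported by the largest number of reads
names(full_length)[which.max(full_length)]
#> [1] "CTGATTTTTAGCTTTGCATT"
```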

4.4 break_position()

This function identifies the genomic position in EML4 where the breakpoint occurred.

Input:

reads

GAlignments object returned by EML4_ALK_detection().

Output:

If EML4-ALK is detected, a table of genomic positions is returned, with the number of corresponding reads for each position. If no EML4-ALK is detected, “No EML4-ALK was detected” is returned.

Examples:

break_position(H3122_result)
#> break_pos
#> 42299750 42299757 
#>      202        1
break_position(HCC827_result)
#> [1] "No EML4-ALK was detected"
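
The dominant position in the H3122 output matches the end column of the reads shown in section 4.1. Assuming the break position corresponds to the last aligned base before the soft clip, it can be reproduced by hand from the read start and the number of matched bases in the CIGAR string:

```r
# Values taken from the first H3122 read shown in section 4.1
start   <- 42299657L
matched <- 94L   # from the CIGAR string "94M2S"

# Last aligned base before the soft-clipped part of the read
start + matched - 1L
#> [1] 42299750
```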

4.5 break_position_depth()

This function identifies the read depth at the basepair before the breakpoint in EML4.

Input:

file

The name of the file which the data are to be read from.

reads

GAlignments object returned by EML4_ALK_detection().

Output:

If EML4-ALK is detected, a single integer corresponding to the read depth at the breakpoint is returned. If no EML4-ALK is detected, “No EML4-ALK was detected” is returned.

Examples:

break_position_depth(H3122_bam, H3122_result)
#> [1] 251
break_position_depth(HCC827_bam, HCC827_result)
#> [1] "No EML4-ALK was detected"
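
Since the exploratory functions return either a result (a table or an integer) or the message string, downstream code may want to test which case it received before using the value. A minimal sketch in base R, where is_no_fusion() is a hypothetical helper and not part of DNAfusion:

```r
# Message documented above for the negative case
no_fusion_msg <- "No EML4-ALK was detected"

# Hypothetical helper (not part of DNAfusion): TRUE when a result is
# the "no fusion" message rather than a table or an integer.
is_no_fusion <- function(x) {
    is.character(x) && length(x) == 1 && x == no_fusion_msg
}

is_no_fusion(251L)
#> [1] FALSE
is_no_fusion("No EML4-ALK was detected")
#> [1] TRUE
```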

4.6 EML4_ALK_analysis()

This function collects the results from the other functions of the package.

Input:

file

The name of the file which the data are to be read from.

genome

character representing the reference genome.
Can be either "hg38" or "hg19".
Default = "hg38".

mates

integer, the minimum number of EML4-ALK mate pairs that must be detected in
order to call a variant. Default = 2.

basepairs

integer, number of basepairs identified from the EML4-ALK fusion.
Default = 20.

Output:

A list object with clipped_reads corresponding to EML4_ALK_detection(), last_EML4 corresponding to EML4_sequence(), first_ALK corresponding to ALK_sequence(), breakpoint corresponding to break_position(), and read_depth corresponding to break_position_depth(). If no EML4-ALK is detected, an empty GAlignments object is returned.

Examples:

H3122_results <- EML4_ALK_analysis(file = H3122_bam, 
                                    genome = "hg38", 
                                    mates = 2, 
                                    basepairs = 20)
HCC827_results <- EML4_ALK_analysis(file = HCC827_bam, 
                                    genome = "hg38", 
                                    mates = 2, 
                                    basepairs = 20)
head(H3122_results$clipped_reads)
#> GAlignments object with 6 alignments and 2 metadata columns:
#>       seqnames strand       cigar    qwidth     start       end     width
#>          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
#>   [1]     chr2      +       94M2S        96  42299657  42299750        94
#>   [2]     chr2      +       94M2S        96  42299657  42299750        94
#>   [3]     chr2      +       94M2S        96  42299657  42299750        94
#>   [4]     chr2      +       94M2S        96  42299657  42299750        94
#>   [5]     chr2      +       94M2S        96  42299657  42299750        94
#>   [6]     chr2      +       94M2S        96  42299657  42299750        94
#>           njunc |      mpos                     seq
#>       <integer> | <integer>          <DNAStringSet>
#>   [1]         0 |  29223691 TTGCTTCTTT...GCAGTGGTCT
#>   [2]         0 |  29223375 TTGCTTCTTT...GCAGTGGTCT
#>   [3]         0 |  29223479 TTGCTTCTTT...GCAGTGGTCT
#>   [4]         0 |  29223686 TTGCTTCTTT...GCAGTGGTCT
#>   [5]         0 |  29223636 TTGCTTCTTT...GCAGTGGTCT
#>   [6]         0 |  29223687 TTGCTTCTTT...GCAGTGGTCT
#>   -------
#>   seqinfo: 455 sequences from an unspecified genome

H3122_results$last_EML4
#> EML4_seq
#> CCAGGCTGGAGTGCAGTGGT GGAGTGCAGTGGTGTGATTT TCAGGCTGGAGTGCAGTGGT 
#>                  201                    1                    1

H3122_results$first_ALK
#> ALK_seq
#> CAGAATTTTAGCTTTGCAAT CGGAATTTTAGCTTTGCATT      CGGATTTTTAGCTTT 
#>                    1                    1                    1 
#> CGGATTTTTAGCTTTTCATT                   CT                  CTG 
#>                    2                    8                    3 
#>                 CTGA                CTGAA                CTGAT 
#>                   11                    1                   16 
#>             CTGATTTT            CTGATTTTT           CTGATTTTTA 
#>                    5                    6                    3 
#>          CTGATTTTTAG CTGATTTTTAGATTTGCATT         CTGATTTTTAGC 
#>                    3                    1                   14 
#>        CTGATTTTTAGCT       CTGATTTTTAGCTT      CTGATTTTTAGCTTT 
#>                   10                   10                    3 
#>     CTGATTTTTAGCTTTG    CTGATTTTTAGCTTTGC   CTGATTTTTAGCTTTGCA 
#>                    4                    7                    8 
#> CTGATTTTTAGCTTTGCAAT  CTGATTTTTAGCTTTGCAT CTGATTTTTAGCTTTGCATT 
#>                    1                    1                   71 
#>     CTGATTTTTAGCTTTT CTGATTTTTAGCTTTTCATA          CTGATTTTTAT 
#>                    1                    1                    1 
#>     CTGATTTTTATCTTTG CTGATTTTTATCTTTGCATT CTGATTTTTATCTTTTGATT 
#>                    2                    2                    1 
#> CTGTGTTTTAGATTTGCATT CTGTTTTTTATCTTTGCAAT CTTATTTTTATCTTTGCATT 
#>                    1                    1                    1 
#>            TTAGCTTTG 
#>                    1

H3122_results$breakpoint
#> break_pos
#> 42299750 42299757 
#>      202        1

H3122_results$read_depth
#> [1] 251

HCC827_results
#> GAlignments object with 0 alignments and 0 metadata columns:
#>    seqnames strand       cigar    qwidth     start       end     width
#>       <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
#>        njunc
#>    <integer>
#>   -------
#>   seqinfo: no sequences
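
Because the positive case returns a list and the negative case returns an empty GAlignments object, a short summary can branch on the type of the result. The sketch below uses a mock list with values copied by hand from the H3122 output above, so it runs without the package. summarise_fusion() is a hypothetical helper, and the sketch assumes that is.list() is FALSE for the empty GAlignments object returned in the negative case.

```r
# Mock of the documented list structure, with values copied by hand
# from the H3122 output above (only two of the five elements).
mock_results <- list(
    breakpoint = c("42299750" = 202L, "42299757" = 1L),
    read_depth = 251L
)

# Hypothetical helper (not part of DNAfusion): a positive result is a list;
# anything else is treated as "no fusion detected" (assumption, see above).
summarise_fusion <- function(res) {
    if (!is.list(res)) {
        return("No EML4-ALK was detected")
    }
    top <- which.max(res$breakpoint)
    sprintf("Breakpoint %s supported by %d reads (depth %d)",
            names(res$breakpoint)[top], res$breakpoint[[top]], res$read_depth)
}

summarise_fusion(mock_results)
#> [1] "Breakpoint 42299750 supported by 202 reads (depth 251)"
```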

5 Session info

#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Ventura 13.0
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2022-11-08
#>  pandoc   2.18 @ /opt/homebrew/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package   * version date (UTC) lib source
#>  BiocStyle * 2.26.0  2022-11-07 [2] Bioconductor
#>  devtools  * 2.4.3   2021-11-30 [2] CRAN (R 4.2.0)
#>  DNAfusion * 1.0.0   2022-11-07 [1] Bioconductor
#>  usethis   * 2.1.6   2022-05-25 [2] CRAN (R 4.2.0)
#> 
#>  [1] /private/tmp/RtmpMTacJG/Rinst58bb7d64ef20
#>  [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────