Introduction

This vignette outlines a workflow of detecting nuclear-mitochondrial DNA fusions from Variant Call Format (VCF) using the svaNUMT package.

Using GRanges for structural variants: a breakend-centric data structure

This package uses a breakend-centric event notation adopted from the StructuralVariantAnnotation package. In short, breakends are stored in a GRanges object with strand used to indicate breakpoint orientation. where breakpoints are represented using a partner field containing the name of the breakend at the other side of the breakend. This notation was chosen as it simplifies the annotations of RTs which are detected at breakend-level.

Workflow

Loading data from VCF

VCF data is parsed into a VCF object using the readVCF function from the Bioconductor package VariantAnnotation. Simple filters could be applied to a VCF object to remove unwanted calls. The VCF object is then converted to a GRanges object with breakend-centric notations using StructuralVariantAnnotation. More information about VCF objects and breakend-centric GRanges object can be found by consulting the vignettes in the corresponding packages with browseVignettes("VariantAnnotation") and browseVignettes("StructuralVariantAnnotation").

library(StructuralVariantAnnotation)
library(VariantAnnotation)
library(svaNUMT)

vcf <- readVcf(system.file("extdata", "chr1_numt_pe_HS25.sv.vcf", package = "svaNUMT"))
gr <- breakpointRanges(vcf)

Note that StructuralVariantAnnotation requires the GRanges object to be composed entirely of valid breakpoints. Please consult the vignette of the StructuralVariantAnnotation package for ensuring breakpoint consistency.

Identifying Nuclear-mitochondrial Genome Fusion Events

Function svaNUMT searches for NUMT events by identifying breakends supporting the fusion of nuclear chromosome and mitochondrial genome. svaNUMT returns identified breakends supporting candidate NUMTs in 2 lists of list of GRanges, grouped by chromosome and insertion sites.

library(readr)
numtS <- read_table(system.file("extdata", "numtS.txt", package = "svaNUMT"), 
    col_names = FALSE)
colnames(numtS) <- c("bin", "seqnames", "start", "end", "name", "score", "strand")
numtS <- `seqlevelsStyle<-`(GRanges(numtS), "NCBI")

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
genomeMT <- genome$chrMT
NUMT <- numtDetect(gr, numtS, genomeMT, max_ins_dist = 20)
#> There is no MT sequence from known NUMT events detected.

The breakends supporting the insertion sites and the MT sequence are arranged by the order of events. Below is an example of a detected NUMT event, where MT sequence MT:15737-15836 followed by polyadenylation is inserted between chr1:1688363-1688364.

GRangesList(NU=NUMT$MT$NU$`1`[[1]], MT=NUMT$MT$MT$`1`[[1]])
#> GRangesList object of length 2:
#> $NU
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>
#>   gridss1fb_4o        1   1688363      + |           NA           C
#>   gridss1bf_1o        1   1688364      - |           NA           C
#>                                   ALT      QUAL      FILTER     sourceId
#>                           <character> <numeric> <character>  <character>
#>   gridss1fb_4o            C[MT:15737[   3928.49        PASS gridss1fb_4o
#>   gridss1bf_1o ]MT:15836]AAAAAAAAAA..   3581.13        PASS gridss1bf_1o
#>                     partner      svtype     svLen        insSeq    insLen
#>                 <character> <character> <numeric>   <character> <integer>
#>   gridss1fb_4o gridss1fb_4h         BND        NA                       0
#>   gridss1bf_1o gridss1bf_1h         BND        NA AAAAAAAAAAAAA        13
#>                      event    HOMLEN
#>                <character> <numeric>
#>   gridss1fb_4o gridss1fb_4         0
#>   gridss1bf_1o gridss1bf_1         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $MT
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>
#>   gridss1fb_4h       MT     15737      - |           NA           G
#>   gridss1bf_1h       MT     15836      + |           NA           A
#>                                   ALT      QUAL      FILTER     sourceId
#>                           <character> <numeric> <character>  <character>
#>   gridss1fb_4h           ]1:1688363]G   3928.49        PASS gridss1fb_4h
#>   gridss1bf_1h AAAAAAAAAAAAAA[1:168..   3581.13        PASS gridss1bf_1h
#>                     partner      svtype     svLen        insSeq    insLen
#>                 <character> <character> <numeric>   <character> <integer>
#>   gridss1fb_4h gridss1fb_4o         BND        NA                       0
#>   gridss1bf_1h gridss1bf_1o         BND        NA AAAAAAAAAAAAA        13
#>                      event    HOMLEN
#>                <character> <numeric>
#>   gridss1fb_4h gridss1fb_4         0
#>   gridss1bf_1h gridss1bf_1         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome

Below is an example to subset the detected NUMTs by a genomic region given seqnames, start, and end. For region chr1:1000000-3000000, there are 3 NUMTs detected.

seqnames = 1
start = 1000000
end = 3000000
i <- sapply(NUMT$MT$NU[[seqnames]], function(x) 
  sum(countOverlaps(x, GRanges(seqnames = seqnames, IRanges(start, end))))>0)
list(NU=NUMT$MT$NU[[seqnames]][i], MT=NUMT$MT$MT[[seqnames]][i])
#> $NU
#> $NU[[1]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>
#>   gridss1fb_4o        1   1688363      + |           NA           C
#>   gridss1bf_1o        1   1688364      - |           NA           C
#>                                   ALT      QUAL      FILTER     sourceId
#>                           <character> <numeric> <character>  <character>
#>   gridss1fb_4o            C[MT:15737[   3928.49        PASS gridss1fb_4o
#>   gridss1bf_1o ]MT:15836]AAAAAAAAAA..   3581.13        PASS gridss1bf_1o
#>                     partner      svtype     svLen        insSeq    insLen
#>                 <character> <character> <numeric>   <character> <integer>
#>   gridss1fb_4o gridss1fb_4h         BND        NA                       0
#>   gridss1bf_1o gridss1bf_1h         BND        NA AAAAAAAAAAAAA        13
#>                      event    HOMLEN
#>                <character> <numeric>
#>   gridss1fb_4o gridss1fb_4         0
#>   gridss1bf_1o gridss1bf_1         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $NU[[2]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames          ranges strand | paramRangeID         REF
#>                   <Rle>       <IRanges>  <Rle> |     <factor> <character>
#>   gridss1fb_5o        1 1791082-1791083      + |           NA           G
#>   gridss1bf_2o        1         1791084      - |           NA           A
#>                                  ALT      QUAL      FILTER     sourceId
#>                          <character> <numeric> <character>  <character>
#>   gridss1fb_5o            G[MT:2592[   1929.85        PASS gridss1fb_5o
#>   gridss1bf_2o ]MT:3592]AAAAAAAAAAAA   2894.91        PASS gridss1bf_2o
#>                     partner      svtype     svLen      insSeq    insLen
#>                 <character> <character> <numeric> <character> <integer>
#>   gridss1fb_5o gridss1fb_5h         BND        NA                     0
#>   gridss1bf_2o gridss1bf_2h         BND        NA AAAAAAAAAAA        11
#>                      event    HOMLEN
#>                <character> <numeric>
#>   gridss1fb_5o gridss1fb_5         1
#>   gridss1bf_2o gridss1bf_2         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $NU[[3]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>
#>   gridss2fb_3o        1   2869079      + |           NA           G
#>   gridss2bf_2o        1   2869080      - |           NA           A
#>                                   ALT      QUAL      FILTER     sourceId
#>                           <character> <numeric> <character>  <character>
#>   gridss2fb_3o             G[MT:2786[   2472.12        PASS gridss2fb_3o
#>   gridss2bf_2o ]MT:2985]AAAAAAAAAAA..   2456.81        PASS gridss2bf_2o
#>                     partner      svtype     svLen          insSeq    insLen
#>                 <character> <character> <numeric>     <character> <integer>
#>   gridss2fb_3o gridss2fb_3h         BND        NA                         0
#>   gridss2bf_2o gridss2bf_2h         BND        NA AAAAAAAAAAAAAAA        15
#>                      event    HOMLEN
#>                <character> <numeric>
#>   gridss2fb_3o gridss2fb_3         0
#>   gridss2bf_2o gridss2bf_2         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> 
#> $MT
#> $MT[[1]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>
#>   gridss1fb_4h       MT     15737      - |           NA           G
#>   gridss1bf_1h       MT     15836      + |           NA           A
#>                                   ALT      QUAL      FILTER     sourceId
#>                           <character> <numeric> <character>  <character>
#>   gridss1fb_4h           ]1:1688363]G   3928.49        PASS gridss1fb_4h
#>   gridss1bf_1h AAAAAAAAAAAAAA[1:168..   3581.13        PASS gridss1bf_1h
#>                     partner      svtype     svLen        insSeq    insLen
#>                 <character> <character> <numeric>   <character> <integer>
#>   gridss1fb_4h gridss1fb_4o         BND        NA                       0
#>   gridss1bf_1h gridss1bf_1o         BND        NA AAAAAAAAAAAAA        13
#>                      event    HOMLEN
#>                <character> <numeric>
#>   gridss1fb_4h gridss1fb_4         0
#>   gridss1bf_1h gridss1bf_1         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $MT[[2]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>
#>   gridss1fb_5h       MT 2592-2593      - |           NA           G
#>   gridss1bf_2h       MT      3592      + |           NA           G
#>                                   ALT      QUAL      FILTER     sourceId
#>                           <character> <numeric> <character>  <character>
#>   gridss1fb_5h           ]1:1791082]G   1929.85        PASS gridss1fb_5h
#>   gridss1bf_2h GAAAAAAAAAAA[1:17910..   2894.91        PASS gridss1bf_2h
#>                     partner      svtype     svLen      insSeq    insLen
#>                 <character> <character> <numeric> <character> <integer>
#>   gridss1fb_5h gridss1fb_5o         BND        NA                     0
#>   gridss1bf_2h gridss1bf_2o         BND        NA AAAAAAAAAAA        11
#>                      event    HOMLEN
#>                <character> <numeric>
#>   gridss1fb_5h gridss1fb_5         1
#>   gridss1bf_2h gridss1bf_2         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $MT[[3]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>
#>   gridss2fb_3h       MT      2786      - |           NA           T
#>   gridss2bf_2h       MT      2985      + |           NA           C
#>                                   ALT      QUAL      FILTER     sourceId
#>                           <character> <numeric> <character>  <character>
#>   gridss2fb_3h           ]1:2869079]T   2472.12        PASS gridss2fb_3h
#>   gridss2bf_2h CAAAAAAAAAAAAAAA[1:2..   2456.81        PASS gridss2bf_2h
#>                     partner      svtype     svLen          insSeq    insLen
#>                 <character> <character> <numeric>     <character> <integer>
#>   gridss2fb_3h gridss2fb_3o         BND        NA                         0
#>   gridss2bf_2h gridss2bf_2o         BND        NA AAAAAAAAAAAAAAA        15
#>                      event    HOMLEN
#>                <character> <numeric>
#>   gridss2fb_3h gridss2fb_3         0
#>   gridss2bf_2h gridss2bf_2         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome

Visualising breakpoint pairs via circos plots

One way of visualising paired breakpoints is by circos plots. Here we use the package circlize to demonstrate breakpoint visualisation. The bedpe2circos function takes BEDPE-formatted dataframes (see breakpointgr2bedpe()) and plotting parameters for the circos.initializeWithIdeogram() and circos.genomicLink() functions from circlize.

To generate a simple circos plot of one candidate NUMT event:

library(circlize)
numt_chr_prefix <- c(NUMT$MT$NU$`1`[[2]], NUMT$MT$MT$`1`[[2]])
GenomeInfoDb::seqlevelsStyle(numt_chr_prefix) <- "UCSC"
pairs <- breakpointgr2pairs(numt_chr_prefix)
pairs

To see supporting breakpoints clearly, we generate the circos plot according to the loci of event.

circos.initializeWithIdeogram(
    data.frame(V1=c("chr1", "chrM"),
               V2=c(1791073,1),
               V3=c(1791093,16571),
               V4=c("p15.4",NA),
               V5=c("gpos50",NA)),  sector.width = c(0.2, 0.8))
#circos.initializeWithIdeogram()
circos.genomicLink(as.data.frame(S4Vectors::first(pairs)), 
                   as.data.frame(S4Vectors::second(pairs)))

circos.clear()

SessionInfo

sessionInfo()
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] circlize_0.4.15                    BSgenome.Hsapiens.UCSC.hg19_1.4.3 
#>  [3] BSgenome_1.68.0                    readr_2.1.4                       
#>  [5] svaNUMT_1.6.0                      StructuralVariantAnnotation_1.16.0
#>  [7] VariantAnnotation_1.46.0           Rsamtools_2.16.0                  
#>  [9] Biostrings_2.68.0                  XVector_0.40.0                    
#> [11] SummarizedExperiment_1.30.0        Biobase_2.60.0                    
#> [13] MatrixGenerics_1.12.0              matrixStats_0.63.0                
#> [15] rtracklayer_1.60.0                 GenomicRanges_1.52.0              
#> [17] GenomeInfoDb_1.36.0                IRanges_2.34.0                    
#> [19] S4Vectors_0.38.0                   BiocGenerics_0.46.0               
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.0         dplyr_1.1.2              blob_1.2.4              
#>  [4] filelock_1.0.2           bitops_1.0-7             fastmap_1.1.1           
#>  [7] RCurl_1.98-1.12          BiocFileCache_2.8.0      GenomicAlignments_1.36.0
#> [10] XML_3.99-0.14            digest_0.6.31            lifecycle_1.0.3         
#> [13] KEGGREST_1.40.0          RSQLite_2.3.1            magrittr_2.0.3          
#> [16] compiler_4.3.0           rlang_1.1.0              sass_0.4.5              
#> [19] progress_1.2.2           tools_4.3.0              utf8_1.2.3              
#> [22] yaml_2.3.7               knitr_1.42               prettyunits_1.1.1       
#> [25] bit_4.0.5                curl_5.0.0               DelayedArray_0.26.0     
#> [28] xml2_1.3.3               BiocParallel_1.34.0      withr_2.5.0             
#> [31] grid_4.3.0               fansi_1.0.4              colorspace_2.1-0        
#> [34] biomaRt_2.56.0           cli_3.6.1                rmarkdown_2.21          
#> [37] crayon_1.5.2             generics_0.1.3           httr_1.4.5              
#> [40] tzdb_0.3.0               rjson_0.2.21             DBI_1.1.3               
#> [43] cachem_1.0.7             stringr_1.5.0            zlibbioc_1.46.0         
#> [46] assertthat_0.2.1         parallel_4.3.0           AnnotationDbi_1.62.0    
#> [49] restfulr_0.0.15          vctrs_0.6.2              Matrix_1.5-4            
#> [52] jsonlite_1.8.4           hms_1.1.3                bit64_4.0.5             
#> [55] GenomicFeatures_1.52.0   jquerylib_0.1.4          glue_1.6.2              
#> [58] codetools_0.2-19         shape_1.4.6              stringi_1.7.12          
#> [61] BiocIO_1.10.0            tibble_3.2.1             pillar_1.9.0            
#> [64] rappdirs_0.3.3           htmltools_0.5.5          GenomeInfoDbData_1.2.10 
#> [67] R6_2.5.1                 dbplyr_2.3.2             evaluate_0.20           
#> [70] lattice_0.21-8           highr_0.10               png_0.1-8               
#> [73] memoise_2.0.1            bslib_0.4.2              xfun_0.39               
#> [76] GlobalOptions_0.1.2      pkgconfig_2.0.3