Contents

1 Introduction

This vignette outlines a workflow of detecting retrotransposed transcripts (RTs) from Variant Call Format (VCF) using the svaRetro package.

2 Installation

The svaRetro package can be installed from Bioconductor as follows:

if(!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("svaRetro")

3 Using GRanges for structural variants: a breakend-centric data structure

This package uses a breakend-centric event notation adopted from the StructuralVariantAnnotation package. In short, breakends are stored in a GRanges object with strand used to indicate breakpoint orientation, where breakpoints are represented using a partner field containing the name of the breakend at the other side of the breakend. This notation was chosen as it simplifies the annotations of RTs which are detected at breakend-level.

4 Workflow

4.1 Loading data from VCF

VCF data is parsed into a VCF object using the readVCF function from the Bioconductor package VariantAnnotation. Simple filters could be applied to a VCF object to remove unwanted calls. The VCF object is then converted to a GRanges object with breakend-centric notations using StructuralVariantAnnotation. More information about VCF objects and breakend-centric GRanges object can be found by consulting the vignettes in the corresponding packages with browseVignettes("VariantAnnotation") and browseVignettes("StructuralVariantAnnotation").

library(StructuralVariantAnnotation)
library(VariantAnnotation)
library(svaRetro)

RT_vcf <- readVcf(system.file("extdata", "diploidSV.vcf", package = "svaRetro"))
RT_gr <- StructuralVariantAnnotation::breakpointRanges(RT_vcf, 
                                                       nominalPosition=TRUE)
head(RT_gr)

Note that StructuralVariantAnnotation requires the GRanges object to be composed entirely of valid breakpoints. Please consult the vignette of the StructuralVariantAnnotation package for ensuring breakpoint consistency.

4.2 Identifying Retrotransposed Transcripts

The package provides rtDetect to identify RTs using the provided SV calls. This is achieved by detecting intronic deletions, which are breakpoints at exon-intron (and intron-exon) boundaries of a transcript. Fusions consisting of an exon boundary and a second genomic location are reported as potential insertion sites. Due to the complexity of RT events, insertion sites can be discovered on both left and right sides, only one side, or none at all.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(dplyr)
hg19.genes <- TxDb.Hsapiens.UCSC.hg19.knownGene

RT <- rtDetect(RT_gr, hg19.genes, maxgap=10, minscore=0.8)

The output is a list of GRanges object consisting of two sets of GRanges calls, insSite and junctions, containing candidate insertion sites and exon-exon junctions respectively. Candidate insertion sites are annotated by the source transcripts and whether exon-exon junctions are detected for the source transcripts. RT junction breakends are annotated by the UCSC exon IDs, corresponding transcripts, and NCBI gene symbols.

RT$SKA3
#> $junctions
#> GRanges object with 14 ranges and 17 metadata columns:
#>                                  seqnames    ranges strand | paramRangeID                    REF         ALT      QUAL      FILTER               sourceId                partner      svtype     svLen      insSeq    insLen       event    HOMLEN      exon                   txs  exons gene_symbol
#>                                     <Rle> <IRanges>  <Rle> |     <factor>            <character> <character> <numeric> <character>            <character>            <character> <character> <numeric> <character> <numeric> <character> <numeric> <integer>                <list> <list>      <list>
#>    MantaDEL:245251:6:6:0:0:0_bp2       13  21729832      - |           NA TCTGCAACAGATACAAATAA..           T       999        PASS MantaDEL:245251:6:6:.. MantaDEL:245251:6:6:..         DEL      -542                     0        <NA>         1    176912 uc001unt.3,uc001unv.3 176912        SKA3
#>    MantaDEL:245251:5:8:0:0:0_bp2       13  21732061      - |           NA                      G       <DEL>       999        PASS MantaDEL:245251:5:8:.. MantaDEL:245251:5:8:..         DEL     -2110        <NA>         0        <NA>         2    176913 uc001unt.3,uc001unv.3 176913        SKA3
#>    MantaDEL:245251:5:9:0:0:0_bp2       13  21734038      - |           NA                      A       <DEL>       525        PASS MantaDEL:245251:5:9:.. MantaDEL:245251:5:9:..         DEL     -1776        <NA>         0        <NA>         4    176914 uc001unt.3,uc001unv.3 176914        SKA3
#>   MantaDEL:245251:7:10:0:0:0_bp2       13  21735929      - |           NA                      T       <DEL>       539        PASS MantaDEL:245251:7:10.. MantaDEL:245251:7:10..         DEL     -1802        <NA>         0        <NA>         1    176915 uc001unt.3,uc001unv.3 176915        SKA3
#>   MantaDEL:245251:4:11:0:0:0_bp2       13  21742127      - |           NA                      A       <DEL>       999        PASS MantaDEL:245251:4:11.. MantaDEL:245251:4:11..         DEL     -6112        <NA>         0        <NA>         2    176916 uc001unt.3,uc001unv.3 176916        SKA3
#>                              ...      ...       ...    ... .          ...                    ...         ...       ...         ...                    ...                    ...         ...       ...         ...       ...         ...       ...       ...                   ...    ...         ...
#>    MantaDEL:245251:5:9:0:0:0_bp1       13  21732261      + |           NA                      A       <DEL>       525        PASS MantaDEL:245251:5:9:.. MantaDEL:245251:5:9:..         DEL     -1776        <NA>         0        <NA>         4    176913 uc001unt.3,uc001unv.3 176913        SKA3
#>   MantaDEL:245251:7:10:0:0:0_bp1       13  21734126      + |           NA                      T       <DEL>       539        PASS MantaDEL:245251:7:10.. MantaDEL:245251:7:10..         DEL     -1802        <NA>         0        <NA>         1    176914 uc001unt.3,uc001unv.3 176914        SKA3
#>   MantaDEL:245251:4:11:0:0:0_bp1       13  21736014      + |           NA                      A       <DEL>       999        PASS MantaDEL:245251:4:11.. MantaDEL:245251:4:11..         DEL     -6112        <NA>         0        <NA>         2    176915 uc001unt.3,uc001unv.3 176915        SKA3
#>    MantaDEL:245251:3:4:0:0:0_bp1       13  21742538      + |           NA                      A       <DEL>       999        PASS MantaDEL:245251:3:4:.. MantaDEL:245251:3:4:..         DEL     -3939        <NA>         0        <NA>         2    176916 uc001unt.3,uc001unv.3 176916        SKA3
#>    MantaDEL:245251:2:3:0:0:0_bp1       13  21746642      + |           NA                      T       <DEL>       999        PASS MantaDEL:245251:2:3:.. MantaDEL:245251:2:3:..         DEL     -3870        <NA>         0        <NA>         2    176917 uc001unt.3,uc001unv.3 176917        SKA3
#>   -------
#>   seqinfo: 25 sequences from an unspecified genome
#> 
#> $insSite
#> GRanges object with 2 ranges and 18 metadata columns:
#>                               seqnames    ranges strand | paramRangeID         REF             ALT      QUAL      FILTER               sourceId                partner      svtype     svLen      insSeq    insLen       event    HOMLEN  exons                   txs    rtFound rtFoundSum gene_symbol
#>                                  <Rle> <IRanges>  <Rle> |     <factor> <character>     <character> <numeric> <character>            <character>            <character> <character> <numeric> <character> <numeric> <character> <numeric> <list>                <list>     <list>  <logical>      <list>
#>   MantaBND:245251:0:3:0:0:0:0       13  21746762      + |           NA           T T[11:108585702[        49        PASS MantaBND:245251:0:3:.. MantaBND:245251:0:3:..         BND        NA                     0        <NA>         0 176918 uc001unt.3,uc001unu.3 TRUE,FALSE       TRUE        SKA3
#>   MantaBND:245251:0:3:0:0:0:1       11 108585702      - |           NA           T  ]13:21746762]T        49        PASS MantaBND:245251:0:3:.. MantaBND:245251:0:3:..         BND        NA                     0        <NA>         0   <NA>                  <NA>       <NA>       <NA>        <NA>
#>   -------
#>   seqinfo: 25 sequences from an unspecified genome

5 Visualising breakpoint pairs via circos plots

One way of visualising RT is by circos plots. Here we use the package circlize to demonstrate the visualisation of insertion site and exon-exon junctions.

To generate a simple circos plot of RT event with SKA3 transcript:

library(circlize)
rt_chr_prefix <- c(RT$SKA3$junctions, RT$SKA3$insSite)
seqlevelsStyle(rt_chr_prefix) <- "UCSC"
pairs <- breakpointgr2pairs(rt_chr_prefix)
pairs

To see supporting breakpoints clearly, we generate the circos plot according to the loci of event.

circos.initializeWithIdeogram(
    data.frame(V1=c("chr13", "chr11"),
               V2=c(21720000,108585000),
               V3=c(21755000,108586000),
               V4=c("q12.11","q24.3"),
               V5=c("gneg","gpos50")))
circos.genomicLink(as.data.frame(S4Vectors::first(pairs)), 
                   as.data.frame(S4Vectors::second(pairs)))

circos.clear()

6 SessionInfo

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB              LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] circlize_0.4.15                         dplyr_1.0.10                            TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.50.0                  AnnotationDbi_1.60.0                    svaRetro_1.4.0                          StructuralVariantAnnotation_1.14.0      VariantAnnotation_1.44.0                Rsamtools_2.14.0                        Biostrings_2.66.0                       XVector_0.38.0                          SummarizedExperiment_1.28.0             Biobase_2.58.0                          MatrixGenerics_1.10.0                   matrixStats_0.62.0                      rtracklayer_1.58.0                      GenomicRanges_1.50.0                    GenomeInfoDb_1.34.0                     IRanges_2.32.0                          S4Vectors_0.36.0                        BiocGenerics_0.44.0                     BiocStyle_2.26.0                       
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-7             bit64_4.0.5              filelock_1.0.2           progress_1.2.2           httr_1.4.4               tools_4.2.1              bslib_0.4.0              utf8_1.2.2               R6_2.5.1                 colorspace_2.0-3         DBI_1.1.3                tidyselect_1.2.0         prettyunits_1.1.1        bit_4.0.4                curl_4.3.3               compiler_4.2.1           cli_3.4.1                xml2_1.3.3               DelayedArray_0.24.0      bookdown_0.29            sass_0.4.2               rappdirs_0.3.3           stringr_1.4.1            digest_0.6.30            rmarkdown_2.17           pkgconfig_2.0.3          htmltools_0.5.3          highr_0.9                dbplyr_2.2.1             fastmap_1.1.0            BSgenome_1.66.0          rlang_1.0.6              GlobalOptions_0.1.2      RSQLite_2.2.18           shape_1.4.6              jquerylib_0.1.4          BiocIO_1.8.0             generics_0.1.3           jsonlite_1.8.3          
#> [40] BiocParallel_1.32.0      RCurl_1.98-1.9           magrittr_2.0.3           GenomeInfoDbData_1.2.9   Matrix_1.5-1             Rcpp_1.0.9               fansi_1.0.3              lifecycle_1.0.3          stringi_1.7.8            yaml_2.3.6               zlibbioc_1.44.0          BiocFileCache_2.6.0      grid_4.2.1               blob_1.2.3               parallel_4.2.1           crayon_1.5.2             lattice_0.20-45          hms_1.1.2                KEGGREST_1.38.0          magick_2.7.3             knitr_1.40               pillar_1.8.1             rjson_0.2.21             codetools_0.2-18         biomaRt_2.54.0           XML_3.99-0.12            glue_1.6.2               evaluate_0.17            BiocManager_1.30.19      png_0.1-7                vctrs_0.5.0              assertthat_0.2.1         cachem_1.0.6             xfun_0.34                restfulr_0.0.15          tibble_3.1.8             GenomicAlignments_1.34.0 memoise_2.0.1            ellipsis_0.3.2