1 Introduction

ISAnalytics is an R package for analyzing gene therapy vector insertion sites identified from next-generation sequencing reads in clonal tracking studies.

2 Installation and options

ISAnalytics can be installed in two ways:

  • You can install it via Bioconductor
  • You can install it via GitHub using the devtools package

Two versions of the package are always available:

  • RELEASE is the latest stable version
  • DEVEL is the development version: it is the most up-to-date version, where all new features are introduced

2.1 Installation from Bioconductor

RELEASE version:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ISAnalytics")

DEVEL version:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("ISAnalytics")

2.2 Installation from GitHub

RELEASE:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "RELEASE_3_15",
                         dependencies = TRUE,
                         build_vignettes = TRUE)

DEVEL:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "master",
                         dependencies = TRUE,
                         build_vignettes = TRUE)
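
After either installation route, it can be useful to confirm that the package is visible to the current R session. The following is a minimal sketch using only base R and utils functions; it makes no assumption about which version was installed and simply reports what it finds:

```r
# TRUE if the package can be loaded from the current library paths
installed <- requireNamespace("ISAnalytics", quietly = TRUE)

if (installed) {
  message("ISAnalytics version: ",
          as.character(utils::packageVersion("ISAnalytics")))
} else {
  message("ISAnalytics is not installed in the current library")
}
```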

2.3 Setting options

ISAnalytics has a verbose option that allows some functions to print additional information to the console while they’re executing. To enable or disable this feature:

# DISABLE
options("ISAnalytics.verbose" = FALSE)

# ENABLE
options("ISAnalytics.verbose" = TRUE)

Some functions also produce reports in a user-friendly HTML format. To enable or disable this feature:

# DISABLE HTML REPORTS
options("ISAnalytics.reports" = FALSE)

# ENABLE HTML REPORTS
options("ISAnalytics.reports" = TRUE)
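
Because these are ordinary R options, they can also be changed temporarily and restored afterwards with base R alone. The sketch below assumes only the two option names shown above; the helper `with_quiet_isa()` is hypothetical and is not part of ISAnalytics:

```r
# Hypothetical helper (not part of ISAnalytics): evaluate an expression with
# verbose output and HTML reports switched off, then restore the previous
# settings even if the expression fails.
with_quiet_isa <- function(expr) {
  # options() returns the previous values of the options it changes
  old <- options(ISAnalytics.verbose = FALSE, ISAnalytics.reports = FALSE)
  on.exit(options(old), add = TRUE)
  # expr is evaluated lazily, so it runs only after the options are changed
  force(expr)
}

# Usage: the option is FALSE only while the expression is evaluated
options(ISAnalytics.verbose = TRUE)
inside <- with_quiet_isa(getOption("ISAnalytics.verbose"))
after <- getOption("ISAnalytics.verbose")
```

Restoring through `on.exit()` keeps the user's global settings untouched when a single call needs to run quietly, for example inside a script or a loop.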

3 Setting up the workflow

The newer version of ISAnalytics introduces a “dynamic variables system” that allows more flexibility in terms of input formats. Before starting the analysis workflow, you can specify how your inputs are structured so that the package can process them. For more information on how to do this, see vignette("workflow_start", package = "ISAnalytics").

4 The first steps

The first steps of the analysis workflow involve the import and parsing of data and metadata files from disk.

  • Import metadata with import_association_file() and/or import_Vispa2_stats()
  • Import data with import_single_Vispa2Matrix() or import_parallel_Vispa2Matrices()

Refer to the vignette vignette("workflow_start", package = "ISAnalytics") for more details.

5 Data cleaning and pre-processing

ISAnalytics offers several different functions for cleaning and pre-processing your data.

  • Recalibration: identifies integration events that are near each other and condenses them into a single event whenever appropriate - compute_near_integrations()
  • Outlier identification and removal: identifies samples that are considered outliers according to user-defined logic and filters them out - outlier_filter()
  • Collision removal: identifies collision events between independent samples - remove_collisions(), see also the dedicated vignette vignette("workflow_start", package = "ISAnalytics")
  • Filter based on cell lineage purity: identifies and removes contamination between different cell types - purity_filter()
  • Data and metadata aggregation: allows the union of biological samples from single PCR replicates or other arbitrary aggregations - aggregate_values_by_key(), aggregate_metadata(), see also the dedicated vignette vignette("workflow_start", package = "ISAnalytics")
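
To make the idea of aggregation by key concrete, here is a conceptual sketch in base R on made-up toy data. It only illustrates the general technique (join the matrix with the metadata, then sum the quantification per key); it is not the implementation used by aggregate_values_by_key(), and the object names are assumptions chosen for this example:

```r
# Toy matrix in long format: one row per barcode per PCR replicate
toy_matrix <- data.frame(
  BarcodeSeq = c("AAAC", "AAAC", "TTTG", "TTTG", "GGGA"),
  ID         = c("BM_A0", "PB21_A0", "BM_A0", "BM_A1", "PB21_A1"),
  Value      = c(3, 2, 5, 1, 4)
)

# Toy metadata: maps each replicate ID to a subject
toy_af <- data.frame(
  ID        = c("BM_A0", "PB21_A0", "BM_A1", "PB21_A1"),
  SubjectID = c("A0", "A0", "A1", "A1")
)

# Step 1: attach metadata to every matrix row through the replicate ID
joined <- merge(toy_matrix, toy_af, by = "ID")

# Step 2: sum the quantification over all replicates sharing the same key
toy_agg <- aggregate(Value ~ BarcodeSeq + SubjectID, data = joined, FUN = sum)
names(toy_agg)[names(toy_agg) == "Value"] <- "Value_sum"
toy_agg
```

In this toy example the two replicates of subject A0 that contain barcode AAAC collapse into a single row whose value is the sum of the two.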

6 Answering biological questions

You can answer very different biological questions by using the provided functions with appropriate inputs.

  • Descriptive statistics: sample_statistics()
  • IS relative abundance: compute_abundance(), integration_alluvial_plot()
  • Top abundant IS: top_integrations()
  • Top targeted genes: top_targeted_genes()
  • Grubbs test for common insertion sites (CIS): CIS_grubbs(), CIS_volcano_plot()
  • Fisher’s exact test for gene frequency and IS distribution on target genome: gene_frequency_fisher(), fisher_scatterplot(), circos_genomic_density()
  • Clonal sharing analyses: is_sharing(), iss_source(), sharing_heatmap(), sharing_venn()
  • Estimate HSPCs population size: HSC_population_size_estimate(), HSC_population_plot()

For more, please refer to the full function reference.
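
As an illustration of one of these quantities, the sketch below computes relative abundance by hand in base R on made-up data. It assumes that relative abundance is each value divided by the total of its group, and that the percentage is that fraction multiplied by 100; this matches the Value_sum_RelAbundance and Value_sum_PercAbundance columns in the example of section 7, but it is not the code used by compute_abundance():

```r
# Toy aggregated data: counts per barcode and subject (made-up numbers)
toy <- data.frame(
  BarcodeSeq = c("AAAC", "TTTG", "GGGA", "AAAC", "CCCT"),
  SubjectID  = c("A0", "A0", "A0", "A1", "A1"),
  Value_sum  = c(6, 3, 1, 2, 8)
)

# Total per subject, repeated on every row of that subject
group_total <- ave(toy$Value_sum, toy$SubjectID, FUN = sum)

# Fraction of the group total, and the same quantity as a percentage
toy$RelAbundance  <- toy$Value_sum / group_total
toy$PercAbundance <- toy$RelAbundance * 100
toy
```

Within each subject the relative abundances sum to 1, which is a quick sanity check when applying the same logic to real data.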

7 Working with other kinds of data

ISAnalytics is designed to be flexible about input formats, so it can process various kinds of data, provided the correct dynamic configuration is set.

We demonstrate this with an example that uses barcode data. The matrix is publicly available on GEO under accession GSE144340 (Ferrari et al., 2020); the metadata was provided to us by the authors and is available in the package's additional files.

library(ISAnalytics)
#> Loading required package: magrittr

# Set appropriate data and metadata specs ----
metadata_specs <- tibble::tribble(
  ~names, ~types, ~transform, ~flag, ~tag,
  "ProjectID", "char", NULL, "required", "project_id",
  "SubjectID", "char", NULL, "required", "subject",
  "Tissue", "char", NULL, "required", "tissue",
  "TimePoint", "int", NULL, "required", "tp_days",
  "CellMarker", "char", NULL, "required", "cell_marker",
  "ID", "char", NULL, "required", "pcr_repl_id",
  "SourceFileName", "char", NULL, "optional", NA_character_,
  "Link", "char", NULL, "optional", NA_character_
)
set_af_columns_def(metadata_specs)
#> Warning: Warning: important tags missing
#> ℹ Some tags are required for proper execution of some functions. If these tags are not provided, execution of dependent functions might fail. Review your inputs carefully.
#> ℹ Missing tags: pool_id, fusion_id, tag_seq, vector_id, tag_id, pcr_replicate, vispa_concatenate, proj_folder
#> ℹ To see where these are involved type `inspect_tags(c('pool_id','fusion_id','tag_seq','vector_id','tag_id','pcr_replicate','vispa_concatenate','proj_folder'))`
#> Association file columns specs successfully changed

mandatory_specs <- tibble::tribble(
  ~names, ~types, ~transform, ~flag, ~tag,
  "BarcodeSeq", "char", NULL, "required", NA_character_
)
set_mandatory_IS_vars(mandatory_specs)
#> Warning: Warning: important tags missing
#> ℹ Some tags are required for proper execution of some functions. If these tags are not provided, execution of dependent functions might fail. Review your inputs carefully.
#> ℹ Missing tags: chromosome, locus, is_strand
#> ℹ To see where these are involved type `inspect_tags(c('chromosome','locus','is_strand'))`
#> Mandatory IS vars successfully changed

# Files ----
data_folder <- system.file("testdata", package = "ISAnalytics")
meta_file <- "barcodes_example_af.tsv.xz"
matrix_file <- "GSE144340_Matrix_542.tsv.xz"

# Data import ----
af <- import_association_file(fs::path(data_folder, meta_file),
                              report_path = NULL)
af
#>        ProjectID SubjectID Tissue TimePoint CellMarker      ID
#>  1: PMID32601433        A0     BM        21      Whole   BM_A0
#>  2: PMID32601433        A0     PB        21      Whole PB21_A0
#>  3: PMID32601433        A1     BM        21      Whole   BM_A1
#>  4: PMID32601433        A1     PB        21      Whole PB21_A1
#>  5: PMID32601433        A2     PB        21      Whole PB21_A2
#>  6: PMID32601433        A3     PB        21      Whole PB21_A3
#>  7: PMID32601433        A4     BM        21      Whole   BM_A4
#>  8: PMID32601433        A4     PB        21      Whole PB21_A4
#>  9: PMID32601433        C0     PB        21      Whole PB21_C0
#> 10: PMID32601433        C1     BM        21      Whole   BM_C1
#> 11: PMID32601433        C1     PB        21      Whole PB21_C1
#> 12: PMID32601433        C2     BM        21      Whole   BM_C2
#> 13: PMID32601433        C2     PB        21      Whole PB21_C2
#> 14: PMID32601433        C3     BM        21      Whole   BM_C3
#> 15: PMID32601433        C3     PB        21      Whole PB21_C3
#>                  SourceFileName
#>  1: GSE144340_Matrix_542.tsv.gz
#>  2: GSE144340_Matrix_542.tsv.gz
#>  3: GSE144340_Matrix_542.tsv.gz
#>  4: GSE144340_Matrix_542.tsv.gz
#>  5: GSE144340_Matrix_542.tsv.gz
#>  6: GSE144340_Matrix_542.tsv.gz
#>  7: GSE144340_Matrix_542.tsv.gz
#>  8: GSE144340_Matrix_542.tsv.gz
#>  9: GSE144340_Matrix_542.tsv.gz
#> 10: GSE144340_Matrix_542.tsv.gz
#> 11: GSE144340_Matrix_542.tsv.gz
#> 12: GSE144340_Matrix_542.tsv.gz
#> 13: GSE144340_Matrix_542.tsv.gz
#> 14: GSE144340_Matrix_542.tsv.gz
#> 15: GSE144340_Matrix_542.tsv.gz
#>                                                                                                      Link
#>  1: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  2: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  3: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  4: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  5: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  6: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  7: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  8: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  9: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 10: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 11: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 12: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 13: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 14: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 15: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>     TimepointMonths TimepointYears
#>  1:              01             01
#>  2:              01             01
#>  3:              01             01
#>  4:              01             01
#>  5:              01             01
#>  6:              01             01
#>  7:              01             01
#>  8:              01             01
#>  9:              01             01
#> 10:              01             01
#> 11:              01             01
#> 12:              01             01
#> 13:              01             01
#> 14:              01             01
#> 15:              01             01

matrix <- import_single_Vispa2Matrix(fs::path(data_folder, matrix_file),
                                     sample_names_to = "ID")
#> Warning: compression format not supported by fread
#> ℹ File will be read using readr
#> Reading file...
#> ℹ Mode: classic
#> Reshaping...
#> *** File info *** 
#> • --- Annotated: FALSE
#> • --- Dimensions: 31757 x 16
#> • --- Read mode: classic
#> • --- Sample count: 15
matrix
#>                    BarcodeSeq      ID Value
#>     1: AAAAAAAATTTTTAAACGTACC   BM_A0     1
#>     2: AAAAAACATATCTATAGTTACC   BM_A0     1
#>     3: AAAAAATATATAAATAGATACC   BM_A0     1
#>     4: AAAAACAACAAGGAAATTCAAT   BM_A0     1
#>     5: AAAAACAACGAGGATAGTGAAT   BM_A0     1
#>    ---                                     
#> 33772: TTTTGAGACCTTCACACCTACT PB21_C3     1
#> 33773: TTTTGCCACCTTCATACCCAAC PB21_C3     1
#> 33774: TTTTTAAACCGTTAGACCCGCA PB21_C3     1
#> 33775: TTTTTTCACGACAATAGCCAAT PB21_C3     1
#> 33776: TTTTTTCACTTGCACATCCGGC PB21_C3     1

# Descriptive stats ----
desc_stats <- sample_statistics(matrix, af,
                                sample_key = pcr_id_column(),
                                value_columns = "Value")$metadata %>%
  dplyr::rename(distinct_barcodes = "nIS")
desc_stats
#>        ProjectID SubjectID Tissue TimePoint CellMarker      ID
#>  1: PMID32601433        A0     BM        21      Whole   BM_A0
#>  2: PMID32601433        A0     PB        21      Whole PB21_A0
#>  3: PMID32601433        A1     BM        21      Whole   BM_A1
#>  4: PMID32601433        A1     PB        21      Whole PB21_A1
#>  5: PMID32601433        A2     PB        21      Whole PB21_A2
#>  6: PMID32601433        A3     PB        21      Whole PB21_A3
#>  7: PMID32601433        A4     BM        21      Whole   BM_A4
#>  8: PMID32601433        A4     PB        21      Whole PB21_A4
#>  9: PMID32601433        C0     PB        21      Whole PB21_C0
#> 10: PMID32601433        C1     BM        21      Whole   BM_C1
#> 11: PMID32601433        C1     PB        21      Whole PB21_C1
#> 12: PMID32601433        C2     BM        21      Whole   BM_C2
#> 13: PMID32601433        C2     PB        21      Whole PB21_C2
#> 14: PMID32601433        C3     BM        21      Whole   BM_C3
#> 15: PMID32601433        C3     PB        21      Whole PB21_C3
#>                  SourceFileName
#>  1: GSE144340_Matrix_542.tsv.gz
#>  2: GSE144340_Matrix_542.tsv.gz
#>  3: GSE144340_Matrix_542.tsv.gz
#>  4: GSE144340_Matrix_542.tsv.gz
#>  5: GSE144340_Matrix_542.tsv.gz
#>  6: GSE144340_Matrix_542.tsv.gz
#>  7: GSE144340_Matrix_542.tsv.gz
#>  8: GSE144340_Matrix_542.tsv.gz
#>  9: GSE144340_Matrix_542.tsv.gz
#> 10: GSE144340_Matrix_542.tsv.gz
#> 11: GSE144340_Matrix_542.tsv.gz
#> 12: GSE144340_Matrix_542.tsv.gz
#> 13: GSE144340_Matrix_542.tsv.gz
#> 14: GSE144340_Matrix_542.tsv.gz
#> 15: GSE144340_Matrix_542.tsv.gz
#>                                                                                                      Link
#>  1: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  2: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  3: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  4: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  5: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  6: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  7: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  8: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>  9: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 10: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 11: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 12: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 13: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 14: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#> 15: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE144nnn/GSE144340/suppl/GSE144340%5FMatrix%5F542%2Etsv%2Egz
#>     TimepointMonths TimepointYears Value_shannon Value_simpson Value_invsimpson
#>  1:              01             01      2.952237     0.9113968        11.286277
#>  2:              01             01      3.459660     0.9488818        19.562506
#>  3:              01             01      3.006200     0.8870954         8.857038
#>  4:              01             01      3.774526     0.9453922        18.312391
#>  5:              01             01      3.181671     0.9222549        12.862544
#>  6:              01             01      3.389893     0.9401115        16.697684
#>  7:              01             01      2.820483     0.8857002         8.748925
#>  8:              01             01      3.345492     0.9344783        15.262125
#>  9:              01             01      3.843335     0.9588449        24.298324
#> 10:              01             01      3.201240     0.8979416         9.798313
#> 11:              01             01      3.194270     0.8805337         8.370564
#> 12:              01             01      2.615324     0.8670521         7.521742
#> 13:              01             01      3.515454     0.9382745        16.200749
#> 14:              01             01      2.557280     0.8453968         6.468172
#> 15:              01             01      3.929186     0.9397792        16.605561
#>     Value_sum Value_count Value_describe_vars Value_describe_n
#>  1:    244879        2284                   1             2284
#>  2:     81588        1080                   1             1080
#>  3:    274792        2477                   1             2477
#>  4:    104195        2269                   1             2269
#>  5:    124676        1465                   1             1465
#>  6:    180497        1786                   1             1786
#>  7:    296246        2255                   1             2255
#>  8:    177010        1538                   1             1538
#>  9:     59966        2644                   1             2644
#> 10:    303345        2993                   1             2993
#> 11:     95971        2636                   1             2636
#> 12:    343490        2223                   1             2223
#> 13:    149100        2386                   1             2386
#> 14:    277048        1817                   1             1817
#> 15:     64118        3923                   1             3923
#>     Value_describe_mean Value_describe_sd Value_describe_median
#>  1:           107.21497         1521.7649                     1
#>  2:            75.54444          556.4600                     1
#>  3:           110.93742         1852.2801                     1
#>  4:            45.92111          509.2056                     1
#>  5:            85.10307          904.5528                     1
#>  6:           101.06215         1040.5982                     1
#>  7:           131.37295         2105.4943                     1
#>  8:           115.09103         1149.9733                     1
#>  9:            22.68003          235.5394                     1
#> 10:           101.35149         1768.7577                     1
#> 11:            36.40781          645.1812                     1
#> 12:           154.51642         2652.4479                     1
#> 13:            62.48952          755.9385                     1
#> 14:           152.47551         2551.7136                     1
#> 15:            16.34412          250.7138                     1
#>     Value_describe_trimmed Value_describe_mad Value_describe_min
#>  1:               1.160284                  0                  1
#>  2:               1.082176                  0                  1
#>  3:               1.234493                  0                  1
#>  4:               1.068795                  0                  1
#>  5:               1.083546                  0                  1
#>  6:               1.080420                  0                  1
#>  7:               1.248199                  0                  1
#>  8:               1.065747                  0                  1
#>  9:               1.086484                  0                  1
#> 10:               1.208768                  0                  1
#> 11:               1.136019                  0                  1
#> 12:               1.295110                  0                  1
#> 13:               1.115707                  0                  1
#> 14:               1.252234                  0                  1
#> 15:               1.083147                  0                  1
#>     Value_describe_max Value_describe_range Value_describe_skew
#>  1:              48411                48410            21.84543
#>  2:              10411                10410            11.49921
#>  3:              68661                68660            29.32449
#>  4:              18252                18251            24.92426
#>  5:              19865                19864            17.48608
#>  6:              26055                26054            16.62791
#>  7:              70946                70945            25.02978
#>  8:              32536                32535            19.00613
#>  9:               5379                 5378            15.44781
#> 10:              80197                80196            35.45585
#> 11:              28170                28169            35.63449
#> 12:              81381                81380            23.99526
#> 13:              27428                27427            24.99102
#> 14:              83741                83740            26.01411
#> 15:              10463                10462            29.78536
#>     Value_describe_kurtosis Value_describe_se distinct_barcodes
#>  1:                564.8589         31.841939              2284
#>  2:                159.6777         16.932540              1080
#>  3:                969.0394         37.217196              2477
#>  4:                788.0556         10.689956              2269
#>  5:                342.6372         23.632797              1465
#>  6:                330.3889         24.623077              1786
#>  7:                716.3839         44.338480              2255
#>  8:                457.8748         29.323081              1538
#>  9:                276.7000          4.580710              2644
#> 10:               1478.2344         32.330691              2993
#> 11:               1447.9758         12.566345              2636
#> 12:                625.3701         56.257072              2223
#> 13:                793.6929         15.475733              2386
#> 14:                757.2388         59.862447              1817
#> 15:               1045.4960          4.002848              3923

# Aggregation and new stats ----
agg_key <- c("SubjectID")
agg <- aggregate_values_by_key(matrix, af, key = agg_key,
                               group = "BarcodeSeq",
                               join_af_by = pcr_id_column())
agg
#> # A tibble: 33,267 × 3
#>    BarcodeSeq             SubjectID Value_sum
#>    <chr>                  <chr>         <dbl>
#>  1 AAAAAAAACACGGAGAACGACG C3                2
#>  2 AAAAAAAACGCGAACAACTACG C3                1
#>  3 AAAAAAAACTCAAAAAAGAAAT C3                1
#>  4 AAAAAAAATTTACACAAAGAAA A4                1
#>  5 AAAAAAAATTTTTAAACGTACC A0                1
#>  6 AAAAAACATATCTATAGTTACC A0                1
#>  7 AAAAAAGACGACGATAGGCACG C1                1
#>  8 AAAAAAGACGTTTATAGGTGTA A2                1
#>  9 AAAAAAGACTGCGACAAAAGGG A4                1
#> 10 AAAAAAGACTTTGATAACCACG C3                1
#> # … with 33,257 more rows

agg_meta_functions <- tibble::tribble(
  ~Column, ~Function, ~Args, ~Output_colname,
  "TimePoint", ~mean(.x, na.rm = TRUE), NA, "{.col}_avg",
  "CellMarker", ~length(unique(.x)), NA, "distinct_cell_marker_count",
  "ID", ~length(unique(.x)), NA, "distinct_id_count"
)
agg_meta <- aggregate_metadata(
  af, aggregating_functions = agg_meta_functions,
  grouping_keys = agg_key
)
agg_meta
#> # A tibble: 9 × 4
#>   SubjectID TimePoint_avg distinct_cell_marker_count distinct_id_count
#>   <chr>             <dbl>                      <int>             <int>
#> 1 A0                   21                          1                 2
#> 2 A1                   21                          1                 2
#> 3 A2                   21                          1                 1
#> 4 A3                   21                          1                 1
#> 5 A4                   21                          1                 2
#> 6 C0                   21                          1                 1
#> 7 C1                   21                          1                 2
#> 8 C2                   21                          1                 2
#> 9 C3                   21                          1                 2

agg_stats <- sample_statistics(agg, agg_meta,
                               sample_key = agg_key,
                               value_columns = "Value_sum")$metadata %>%
  dplyr::rename(distinct_barcodes = "nIS")
agg_stats
#> # A tibble: 9 × 23
#>   SubjectID TimePoint_…¹ disti…² disti…³ Value…⁴ Value…⁵ Value…⁶ Value…⁷ Value…⁸
#>   <chr>            <dbl>   <int>   <int>   <dbl>   <dbl>   <dbl>   <dbl>   <int>
#> 1 A0                  21       1       2    3.24   0.929   14.1   326467    3304
#> 2 A1                  21       1       2    3.47   0.927   13.7   378987    4631
#> 3 A2                  21       1       1    3.18   0.922   12.9   124676    1465
#> 4 A3                  21       1       1    3.39   0.940   16.7   180497    1786
#> 5 A4                  21       1       2    3.29   0.930   14.3   473256    3718
#> 6 C0                  21       1       1    3.84   0.959   24.3    59966    2644
#> 7 C1                  21       1       2    3.39   0.920   12.6   399316    5538
#> 8 C2                  21       1       2    3.05   0.903   10.3   492590    4526
#> 9 C3                  21       1       2    3.00   0.886    8.77  341166    5655
#> # … with 14 more variables: Value_sum_describe_vars <dbl>,
#> #   Value_sum_describe_n <dbl>, Value_sum_describe_mean <dbl>,
#> #   Value_sum_describe_sd <dbl>, Value_sum_describe_median <dbl>,
#> #   Value_sum_describe_trimmed <dbl>, Value_sum_describe_mad <dbl>,
#> #   Value_sum_describe_min <dbl>, Value_sum_describe_max <dbl>,
#> #   Value_sum_describe_range <dbl>, Value_sum_describe_skew <dbl>,
#> #   Value_sum_describe_kurtosis <dbl>, Value_sum_describe_se <dbl>, …

# Abundance ----
abundance <- compute_abundance(agg, columns = "Value_sum", key = agg_key)
abundance
#> # A tibble: 33,267 × 5
#>    BarcodeSeq             SubjectID Value_sum Value_sum_RelAbundance Value_sum…¹
#>    <chr>                  <chr>         <dbl>                  <dbl>       <dbl>
#>  1 AAAAAAAACACGGAGAACGACG C3                2             0.00000586    0.000586
#>  2 AAAAAAAACGCGAACAACTACG C3                1             0.00000293    0.000293
#>  3 AAAAAAAACTCAAAAAAGAAAT C3                1             0.00000293    0.000293
#>  4 AAAAAAAATTTACACAAAGAAA A4                1             0.00000211    0.000211
#>  5 AAAAAAAATTTTTAAACGTACC A0                1             0.00000306    0.000306
#>  6 AAAAAACATATCTATAGTTACC A0                1             0.00000306    0.000306
#>  7 AAAAAAGACGACGATAGGCACG C1                1             0.00000250    0.000250
#>  8 AAAAAAGACGTTTATAGGTGTA A2                1             0.00000802    0.000802
#>  9 AAAAAAGACTGCGACAAAAGGG A4                1             0.00000211    0.000211
#> 10 AAAAAAGACTTTGATAACCACG C3                1             0.00000293    0.000293
#> # … with 33,257 more rows, and abbreviated variable name
#> #   ¹​Value_sum_PercAbundance

reset_dyn_vars_config()
#> Mandatory IS vars reset to default
#> Annotation IS vars reset to default
#> Association file columns specs reset to default
#> ISS stats specs reset to default
#> Matrix suffixes specs reset to default
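
The Value_shannon, Value_simpson and Value_invsimpson columns in the statistics tables above are diversity indices. The sketch below shows the usual definitions on a made-up abundance vector. It assumes the natural logarithm for Shannon and the forms 1 - sum(p^2) and 1 / sum(p^2) for the two Simpson indices; the Simpson forms are consistent with the printed values (for BM_A0, 1 / (1 - 0.9113968) is about 11.286), while the logarithm base is an assumption. This is not the package's own code:

```r
# Toy abundance vector for one sample (made-up counts)
counts <- c(50, 30, 15, 5)

# Proportion of the sample occupied by each barcode / integration site
p <- counts / sum(counts)

shannon    <- -sum(p * log(p))   # Shannon index, natural logarithm
simpson    <- 1 - sum(p^2)       # Simpson index (Gini-Simpson form)
invsimpson <- 1 / sum(p^2)       # inverse Simpson index

round(c(shannon = shannon, simpson = simpson, invsimpson = invsimpson), 4)
```

All three indices grow as abundance is spread more evenly across more clones, so a sample dominated by a few barcodes scores lower than one with many barcodes of similar size.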

8 Using the Shiny interface

The package provides a simple Shiny interface for data exploration and plotting. To start the interface use:

NGSdataExplorer()

The application's main page shows a screen for loading a file. Files can also be loaded from the R environment. For example, before opening the app, we can load the included association file:

data("association_file")

Once in the application, we can choose "association_file" from the R environment loading option screen and click on “Import data”. Once the data is imported, we can click on the “Explore” tab in the upper navbar. Here we will see 2 tabs: one allows interactive exploration of the data in tabular form, and the other lets us plot the data. Several plot parameters can be customized, and the plot can be saved to file with the dedicated button at the end of the page.

The Shiny interface is still under active development and new features will be added in the near future.

9 Ensuring reproducibility of results

Several implemented functions produce static HTML reports or tabular files that can be saved on disk. Reports contain the relevant information on how the function was called, input and output statistics, and session info for reproducibility.
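
For analysis steps that do not produce a report, session information can also be recorded by hand. The following is a minimal base R sketch, independent of ISAnalytics; the output location and file name are arbitrary choices made for this example:

```r
# Write the current session information to a text file
# (a temporary directory is used here; any results folder would do)
out_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(utils::sessionInfo()), out_file)

# Read the first lines back to confirm the file was written
head(readLines(out_file), 3)
```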

10 Browse documentation online and keep updated

ISAnalytics has its own dedicated package website where you can easily browse the documentation and vignettes and keep up to date with all relevant updates. Visit the website at https://calabrialab.github.io/ISAnalytics/

11 Problems?

If you have any issues the documentation can’t solve, get in touch by opening an issue on GitHub or contacting the maintainers.

12 Bibliography

[1] S. Ferrari, A. Jacob, S. Beretta, et al. “Efficient gene editing of human long-term hematopoietic stem cells validated by clonal tracking”. In: Nat Biotechnol 38 (Nov. 2020), pp. 1298–1308. DOI: https://doi.org/10.1038/s41587-020-0551-y.