1 Introduction to `ISAnalytics` import functions family

This vignette explains in detail how the functions of the import family should be used and describes the most common workflows to follow.

1.1 How to install ISAnalytics

To install the package run the following code:

## For release version
if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }
BiocManager::install("ISAnalytics")

## For devel version
if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }
# The following initializes usage of Bioc devel
BiocManager::install(version = "devel")
BiocManager::install("ISAnalytics")

To install from GitHub:

# For release version
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
    ref = "RELEASE_3_12",
    dependencies = TRUE,
    build_vignettes = TRUE
)

## Safer option if building the vignettes causes issues
devtools::install_github("calabrialab/ISAnalytics",
    ref = "RELEASE_3_12"
)

# For devel version
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
    ref = "master",
    dependencies = TRUE,
    build_vignettes = TRUE
)

## Safer option if building the vignettes causes issues
devtools::install_github("calabrialab/ISAnalytics",
    ref = "master"
)

library(ISAnalytics)

1.2 Setting options

ISAnalytics has a verbose option that allows some functions to print additional information to the console while they’re executing. To disable or enable this feature do:

# DISABLE
options("ISAnalytics.verbose" = FALSE)

# ENABLE
options("ISAnalytics.verbose" = TRUE)

Some functions also produce reports in a user-friendly HTML format. To disable or enable this feature do:

# DISABLE HTML REPORTS
options("ISAnalytics.widgets" = FALSE)

# ENABLE HTML REPORTS
options("ISAnalytics.widgets" = TRUE)
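
As a rough sketch of how such an option can be consulted (base R only; the helper `say_if_verbose` is hypothetical and not part of ISAnalytics, and the default of TRUE when the option is unset is an assumption), a function can read the option with `getOption()`:

```r
# Hypothetical helper, not part of ISAnalytics: returns the message only
# when the verbose option is TRUE (assumed default: TRUE when unset).
say_if_verbose <- function(msg) {
    if (isTRUE(getOption("ISAnalytics.verbose", default = TRUE))) {
        return(msg)
    }
    invisible(NULL)
}

# options() returns the previous values, so they can be restored later
old <- options("ISAnalytics.verbose" = FALSE)
quiet_result <- say_if_verbose("Reading file...")

options("ISAnalytics.verbose" = TRUE)
loud_result <- say_if_verbose("Reading file...")

options(old) # restore the previous setting
```

Saving the value returned by `options()` and restoring it afterwards keeps the change local, which is the same idea behind the `withr::with_options()` calls used later in this vignette.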

1.3 Designed to work with Vispa2 pipeline

The vast majority of the functions included in this package are designed to work in combination with the Vispa2 pipeline. If you are not familiar with it, we strongly recommend taking a look at these links:

Article: VISPA2: Article
BitBucket Wiki: VISPA2 Wiki

1.4 File system structure generated

Vispa2 produces a standard file system structure starting from a folder you specify as your workbench or root. The structure always follows this schema:

root/
- Optional intermediate folders
  - Projects (PROJECTID)
    - bam
    - bcmuxall
    - bed
    - iss
    - quality
    - quantification
      - Pools (concatenatePoolIDSeqRun)
    - report

We’ve included 2 examples of this structure in our package: one is correct, the other contains errors or potential problems. Both are in .zip format, so you need to unzip them if you plan to experiment with them.
An example of how to access them:

root_correct <- system.file("extdata", "fs.zip", package = "ISAnalytics")
root_correct <- unzip_file_system(root_correct, "fs")
fs::dir_tree(root_correct)
#> /tmp/RtmpHKhRpe/fs
#> ├── PROJECT1100
#> │   ├── bam
#> │   ├── bcmuxall
#> │   ├── bed
#> │   ├── iss
#> │   │   ├── ABX-LR-PL5-POOL14-1
#> │   │   │   └── stats.sequence_PROJECT1100.ABX-LR-PL5-POOL14-1.tsv
#> │   │   └── ABX-LR-PL6-POOL15-1
#> │   │       └── stats.sequence_PROJECT1100.ABX-LR-PL6-POOL15-1.tsv
#> │   ├── quality
#> │   ├── quantification
#> │   │   ├── ABX-LR-PL5-POOL14-1
#> │   │   │   ├── PROJECT1100_ABX-LR-PL5-POOL14-1_fragmentEstimate_matrix.no0.annotated.tsv.xz
#> │   │   │   ├── PROJECT1100_ABX-LR-PL5-POOL14-1_fragmentEstimate_matrix.tsv.xz
#> │   │   │   ├── PROJECT1100_ABX-LR-PL5-POOL14-1_seqCount_matrix.no0.annotated.tsv.xz
#> │   │   │   └── PROJECT1100_ABX-LR-PL5-POOL14-1_seqCount_matrix.tsv.xz
#> │   │   └── ABX-LR-PL6-POOL15-1
#> │   │       ├── PROJECT1100_ABX-LR-PL6-POOL15-1_fragmentEstimate_matrix.no0.annotated.tsv.xz
#> │   │       ├── PROJECT1100_ABX-LR-PL6-POOL15-1_fragmentEstimate_matrix.tsv.xz
#> │   │       ├── PROJECT1100_ABX-LR-PL6-POOL15-1_seqCount_matrix.no0.annotated.tsv.xz
#> │   │       └── PROJECT1100_ABX-LR-PL6-POOL15-1_seqCount_matrix.tsv.xz
#> │   └── report
#> ├── PROJECT1101
#> │   ├── bam
#> │   ├── bcmuxall
#> │   ├── bed
#> │   ├── iss
#> │   │   └── ABY-LR-PL4-POOL54-2
#> │   │       └── stats.sequence_PROJECT1101.ABY-LR-PL4-POOL54-2.tsv
#> │   ├── quality
#> │   ├── quantification
#> │   │   └── ABY-LR-PL4-POOL54-2
#> │   │       ├── PROJECT1101_ABY-LR-PL4-POOL54-2_fragmentEstimate_matrix.no0.annotated.tsv.xz
#> │   │       ├── PROJECT1101_ABY-LR-PL4-POOL54-2_fragmentEstimate_matrix.tsv.xz
#> │   │       ├── PROJECT1101_ABY-LR-PL4-POOL54-2_seqCount_matrix.no0.annotated.tsv.xz
#> │   │       └── PROJECT1101_ABY-LR-PL4-POOL54-2_seqCount_matrix.tsv.xz
#> │   └── report
#> └── VA
#>     └── CLOEXP
#>         ├── bam
#>         ├── bcmuxall
#>         ├── bed
#>         ├── iss
#>         │   └── POOL6-1
#>         │       └── stats.sequence_VA.POOL6-1.tsv
#>         ├── quality
#>         ├── quantification
#>         │   └── POOL6-1
#>         │       ├── VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.no0.annotated.tsv.xz
#>         │       ├── VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.tsv.xz
#>         │       ├── VA-CLOEXP-POOL6-1_seqCount_matrix.no0.annotated.tsv.xz
#>         │       └── VA-CLOEXP-POOL6-1_seqCount_matrix.tsv.xz
#>         └── report
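
The file names in the tree above appear to encode the quantification type and whether the matrix is annotated. The following sketch (base R only) extracts both pieces of information with regular expressions. The naming rule is inferred from the example tree and is an assumption, not a documented contract of Vispa2 or ISAnalytics:

```r
files <- c(
    "PROJECT1100_ABX-LR-PL5-POOL14-1_fragmentEstimate_matrix.no0.annotated.tsv.xz",
    "PROJECT1100_ABX-LR-PL5-POOL14-1_fragmentEstimate_matrix.tsv.xz",
    "PROJECT1100_ABX-LR-PL5-POOL14-1_seqCount_matrix.no0.annotated.tsv.xz",
    "PROJECT1100_ABX-LR-PL5-POOL14-1_seqCount_matrix.tsv.xz"
)

# Capture the alphabetic token that sits right before "_matrix"
quant_type <- sub("^.*_([A-Za-z]+)_matrix.*$", "\\1", files)

# Flag the files whose name contains ".annotated."
annotated <- grepl("\\.annotated\\.", files)

file_info <- data.frame(
    quant_type = quant_type,
    annotated = annotated,
    stringsAsFactors = FALSE
)
file_info
```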

2 Importing a single integration matrix

To import a single integration matrix, use the import_single_Vispa2Matrix function.
This function reads the file and converts it into a tidy structure. Several different formats can be read, since you can specify the column separator. If you’re not familiar with the “tidy” concept, we recommend taking a look at this link to get the basics:

The importance of tidy data

This package is in fact based on the tidyverse and tries to follow its philosophy and guidelines as closely as possible.
The Vispa2 pipeline and the associated Create Matrix tool produce matrices with a standard structure which we’ll refer to as “messy”, because the values of different experiments are spread across different columns and there are a lot of NA values.

Table 1: A simple example of messy matrix.
chr	integration_locus	strand	GeneName	GeneStrand	exp_1	exp_2	exp_3	exp_4	exp_5
20	39085641	+	MAFB	-	NA	94925.98	NA	NA	61143.20
7	155446768	+	RBM33	+	NA	NA	68981.02	NA	NA
19	12278742	-	ZNF136	+	NA	NA	60410.18	NA	68463.19
6	1865825	-	GMDS	-	NA	NA	NA	46666.04	NA
7	21508806	+	SP4	+	NA	NA	NA	NA	NA
17	5187666	+	RABEP1	+	NA	NA	NA	NA	NA

example_matrix_path <- system.file("extdata", "ex_annotated_ISMatrix.tsv.xz",
    package = "ISAnalytics"
)
imported_im <- import_single_Vispa2Matrix(
    path = example_matrix_path,
    to_exclude = NULL,
    separator = "\t"
)
#> Warning: compression format not supported by fread
#> ℹ File will be read using readr
#> Reading file...
#> ℹ Mode: classic
#> *** File info *** 
#> * --- Matrix type: NEW
#> * --- Annotated: TRUE
#> * --- Dimensions: 50 x 10
#> * --- Read mode:  classic
#> Reshaping...
#> Done!

Table 2: Example of tidy integration matrix
chr	integration_locus	strand	GeneName	GeneStrand	CompleteAmplificationID	Value
20	39085641	+	MAFB	-	exp_2	94925.98
20	39085641	+	MAFB	-	exp_5	61143.20
7	155446768	+	RBM33	+	exp_3	68981.02
19	4049601	+	ZBTB7A	-	exp_1	99324.68
19	4049601	+	ZBTB7A	-	exp_2	83873.38
19	12278742	-	ZNF136	+	exp_3	60410.18
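
The reshaping from the messy layout of Table 1 to the tidy layout of Table 2 can be sketched in base R as follows. This is only an illustration of the idea on a few rows and columns taken from Table 1, not the code that import_single_Vispa2Matrix actually runs:

```r
# A few rows and experiment columns from Table 1
messy <- data.frame(
    chr = c("20", "7", "19"),
    integration_locus = c(39085641L, 155446768L, 12278742L),
    strand = c("+", "+", "-"),
    exp_2 = c(94925.98, NA, NA),
    exp_3 = c(NA, 68981.02, 60410.18),
    exp_5 = c(61143.20, NA, 68463.19),
    stringsAsFactors = FALSE
)

id_cols <- c("chr", "integration_locus", "strand")
sample_cols <- setdiff(names(messy), id_cols)

# For each experiment column build a long block and drop the NA values,
# then stack all blocks on top of each other
tidy <- do.call(rbind, lapply(sample_cols, function(s) {
    block <- messy[, id_cols]
    block$CompleteAmplificationID <- s
    block$Value <- messy[[s]]
    block[!is.na(block$Value), ]
}))
rownames(tidy) <- NULL
tidy
```

In a tidyverse workflow the same result is usually obtained with `tidyr::pivot_longer()` and its `values_drop_na` argument.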

We will refer to the structure generated by import_single_Vispa2Matrix as “integration matrix” for convenience.
To be considered an integration matrix the data frame must contain the mandatory variables, which are “chr” (chromosome), “integration_locus” and “strand”. It might also contain annotation variables if the matrix was annotated during the Vispa2 pipeline run.
You can access these names by using two functions:

# Returns the names of the mandatory variables. Can also be used to select
# these columns in a tibble instead of naming them individually
mandatory_IS_vars()
#> [1] "chr"               "integration_locus" "strand"

# Displays the annotation variables
annotation_IS_vars()
#> [1] "GeneName"   "GeneStrand"

You can of course operate on the integration matrices as you would on any other data frame, but some functions will check the presence of specific columns because they’re needed in that context.
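
A column check of this kind can be sketched as follows. The helper `looks_like_integration_matrix` is hypothetical and is not the check ISAnalytics performs internally; the column names are hard-coded from the output of mandatory_IS_vars() above so that the sketch runs without the package:

```r
# Names as printed by mandatory_IS_vars() above
mandatory <- c("chr", "integration_locus", "strand")

# Hypothetical helper: TRUE when df is a data frame that contains
# every mandatory column
looks_like_integration_matrix <- function(df) {
    is.data.frame(df) && all(mandatory %in% colnames(df))
}

ok <- data.frame(
    chr = "20", integration_locus = 39085641L, strand = "+",
    Value = 94925.98
)
bad <- ok[, c("chr", "Value")] # integration_locus and strand are missing

looks_like_integration_matrix(ok)
looks_like_integration_matrix(bad)
```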

3 Importing the association file

While you can import single matrices for a quick analysis, most of the time you will want to import multiple matrices at once, based on certain parameters. To do that you must first import the association file, which holds all the metadata and information about every project, pool and single experiment.
The function that imports this file does not simply read it into your R environment: it also performs an alignment check with your file system, so you have to specify the path to the root folder where your Vispa2 runs produce output (see the previous section). To import the association file do:

path_as_file <- system.file("extdata", "ex_association_file.tsv",
    package = "ISAnalytics"
)
withr::with_options(list(ISAnalytics.widgets = FALSE), {
    association_file <- import_association_file(
        path = path_as_file,
        root = root_correct,
        tp_padding = 4,
        dates_format = "dmy",
        separator = "\t"
    )
})
#> *** Association file import summary ***
#> ℹ For detailed report please set option 'ISAnalytics.widgets' to TRUE
#> * Parsing problems detected: TRUE
#> * Date parsing problems: TRUE
#> * Column problems detected: TRUE
#> * NAs found in important columns: TRUE
#> * File system alignment: no problems detected
association_file
#> # A tibble: 49 x 67
#>    ProjectID FUSIONID   PoolID TagSequence SubjectID VectorType VectorID  ExperimentID Tissue TimePoint DNAFragmentation
#>    <chr>     <chr>      <chr>  <chr>       <chr>     <chr>      <chr>     <chr>        <chr>  <chr>     <chr>           
#>  1 CLOEXP    LR-Clonal… POOL6  LTR10LC10   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  2 CLOEXP    LR-Clonal… POOL6  LTR11LC11   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  3 CLOEXP    LR-Clonal… POOL6  LTR12LC12   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  4 CLOEXP    LR-Clonal… POOL6  LTR13LC13   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  5 CLOEXP    LR-Clonal… POOL6  LTR29LC29   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  6 CLOEXP    LR-Clonal… POOL6  LTR30LC30   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  7 CLOEXP    LR-Clonal… POOL6  LTR31LC31   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  8 CLOEXP    LR-Clonal… POOL6  LTR32LC32   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  9 CLOEXP    LR-Clonal… POOL6  LTR48LC48   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#> 10 CLOEXP    LR-Clonal… POOL6  LTR49LC49   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#> # … with 39 more rows, and 56 more variables: PCRMethod <chr>, TagIDextended <chr>, Keywords <chr>, CellMarker <chr>,
#> #   TagID <chr>, NGSProvider <chr>, NGSTechnology <chr>, ConverrtedFilesDir <chr>, ConverrtedFilesName <chr>,
#> #   SourceFileFolder <chr>, SourceFileNameR1 <chr>, SourceFileNameR2 <chr>, DNAnumber <chr>, ReplicateNumber <int>,
#> #   DNAextractionDate <dttm>, DNAngUsed <dbl>, LinearPCRID <chr>, LinearPCRDate <dttm>, SonicationDate <dttm>,
#> #   LigationDate <dttm>, 1stExpoPCRID <chr>, 1stExpoPCRDate <dttm>, 2ndExpoID <chr>, 2ndExpoDate <dttm>,
#> #   FusionPrimerPCRID <chr>, FusionPrimerPCRDate <dttm>, PoolDate <dttm>, SequencingDate <dttm>, VCN <dbl>,
#> #   Genome <chr>, SequencingRound <int>, Genotype <chr>, TestGroup <chr>, MOI <chr>, Engraftment <dbl>,
#> #   Transduction <dbl>, Notes <chr>, AddedField1 <chr>, AddedField2 <chr>, AddedField3 <chr>, AddedField4 <chr>,
#> #   concatenatePoolIDSeqRun <chr>, AddedField6_RelativeBloodPercentage <chr>, AddedField7_PurityTestFeasibility <dbl>,
#> #   AddedField8_FacsSeparationPurity <dbl>, Kapa <dbl>, ulForPool <dbl>, CompleteAmplificationID <chr>, UniqueID <chr>,
#> #   StudyTestID <chr>, StudyTestGroup <int>, MouseID <int>, Tigroup <chr>, Tisource <chr>, PathToFolderProjectID <chr>,
#> #   Path <chr>

If you have the “widgets” option active, this will produce a visual HTML report of the results of the alignment check, either in RStudio or in your browser.
If projects or pools are missing you will be notified: those elements will be ignored until you fix the problems and re-import the association file.

If you’re not interested in scanning the file system you can set the ‘root’ parameter to NULL and this step will be skipped.
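
Conceptually, an alignment check of this kind compares the folders expected from the metadata with the folders that exist on disk. The sketch below (base R only) shows the idea on a throwaway directory. It is a simplification under stated assumptions: the toy association file has only two columns and projects sit directly under the root, whereas the real function handles intermediate folders and many more details:

```r
# Build a tiny throwaway file system with a single pool folder
root <- file.path(tempdir(), "fs_sketch")
dir.create(
    file.path(root, "PROJECT1100", "quantification", "ABX-LR-PL5-POOL14-1"),
    recursive = TRUE, showWarnings = FALSE
)

# Toy stand-in for an association file
af <- data.frame(
    ProjectID = c("PROJECT1100", "PROJECT1100", "PROJECT1101"),
    concatenatePoolIDSeqRun = c(
        "ABX-LR-PL5-POOL14-1", "ABX-LR-PL6-POOL15-1", "ABY-LR-PL4-POOL54-2"
    ),
    stringsAsFactors = FALSE
)

# Expected quantification folder for every row, and whether it exists
af$expected_path <- file.path(
    root, af$ProjectID, "quantification", af$concatenatePoolIDSeqRun
)
af$found <- dir.exists(af$expected_path)
af[, c("ProjectID", "concatenatePoolIDSeqRun", "found")]

unlink(root, recursive = TRUE) # clean up the throwaway folder
```

Rows with `found` equal to FALSE correspond to the pools that would be reported as missing.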

The function can read multiple file formats, including Excel files. However, since metadata are crucial for a correct workflow, we recommend using .tsv or .csv format to avoid potential parsing problems. You can also specify a filter to obtain a pre-filtered association file:

withr::with_options(list(ISAnalytics.widgets = FALSE), {
    association_file_filtered <- import_association_file(
        path = path_as_file,
        root = root_correct,
        tp_padding = 4,
        dates_format = "dmy",
        separator = "\t",
        filter_for = list(ProjectID = "CLOEXP")
    )
})
#> *** Association file import summary ***
#> ℹ For detailed report please set option 'ISAnalytics.widgets' to TRUE
#> * Parsing problems detected: TRUE
#> * Date parsing problems: TRUE
#> * Column problems detected: TRUE
#> * NAs found in important columns: TRUE
#> * File system alignment: no problems detected
association_file_filtered
#> # A tibble: 12 x 67
#>    ProjectID FUSIONID   PoolID TagSequence SubjectID VectorType VectorID  ExperimentID Tissue TimePoint DNAFragmentation
#>    <chr>     <chr>      <chr>  <chr>       <chr>     <chr>      <chr>     <chr>        <chr>  <chr>     <chr>           
#>  1 CLOEXP    LR-Clonal… POOL6  LTR10LC10   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  2 CLOEXP    LR-Clonal… POOL6  LTR11LC11   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  3 CLOEXP    LR-Clonal… POOL6  LTR12LC12   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  4 CLOEXP    LR-Clonal… POOL6  LTR13LC13   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  5 CLOEXP    LR-Clonal… POOL6  LTR29LC29   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  6 CLOEXP    LR-Clonal… POOL6  LTR30LC30   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  7 CLOEXP    LR-Clonal… POOL6  LTR31LC31   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  8 CLOEXP    LR-Clonal… POOL6  LTR32LC32   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#>  9 CLOEXP    LR-Clonal… POOL6  LTR48LC48   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#> 10 CLOEXP    LR-Clonal… POOL6  LTR49LC49   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#> 11 CLOEXP    LR-Clonal… POOL6  LTR50LC50   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#> 12 CLOEXP    LR-Clonal… POOL6  LTR51LC51   VA2020-m… lenti      737.pCCL… PLATE1       <NA>   0000      SONIC           
#> # … with 56 more variables: PCRMethod <chr>, TagIDextended <chr>, Keywords <chr>, CellMarker <chr>, TagID <chr>,
#> #   NGSProvider <chr>, NGSTechnology <chr>, ConverrtedFilesDir <chr>, ConverrtedFilesName <chr>,
#> #   SourceFileFolder <chr>, SourceFileNameR1 <chr>, SourceFileNameR2 <chr>, DNAnumber <chr>, ReplicateNumber <int>,
#> #   DNAextractionDate <dttm>, DNAngUsed <dbl>, LinearPCRID <chr>, LinearPCRDate <dttm>, SonicationDate <dttm>,
#> #   LigationDate <dttm>, 1stExpoPCRID <chr>, 1stExpoPCRDate <dttm>, 2ndExpoID <chr>, 2ndExpoDate <dttm>,
#> #   FusionPrimerPCRID <chr>, FusionPrimerPCRDate <dttm>, PoolDate <dttm>, SequencingDate <dttm>, VCN <dbl>,
#> #   Genome <chr>, SequencingRound <int>, Genotype <chr>, TestGroup <chr>, MOI <chr>, Engraftment <dbl>,
#> #   Transduction <dbl>, Notes <chr>, AddedField1 <chr>, AddedField2 <chr>, AddedField3 <chr>, AddedField4 <chr>,
#> #   concatenatePoolIDSeqRun <chr>, AddedField6_RelativeBloodPercentage <chr>, AddedField7_PurityTestFeasibility <dbl>,
#> #   AddedField8_FacsSeparationPurity <dbl>, Kapa <dbl>, ulForPool <dbl>, CompleteAmplificationID <chr>, UniqueID <chr>,
#> #   StudyTestID <chr>, StudyTestGroup <int>, MouseID <int>, Tigroup <chr>, Tisource <chr>, PathToFolderProjectID <chr>,
#> #   Path <chr>
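
The effect of a named-list filter such as the one passed to filter_for can be sketched in base R as follows. The helper `apply_filter` is hypothetical, the rows are made-up toy data, and the exact semantics of filter_for in ISAnalytics may differ; here a row is kept when, for every named column, its value is among the allowed ones:

```r
# Toy association file with made-up rows
af <- data.frame(
    ProjectID = c("CLOEXP", "CLOEXP", "PROJECT1100", "PROJECT1101"),
    PoolID = c("POOL6", "POOL6", "POOL14", "POOL54"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows that satisfy every element of the list
apply_filter <- function(df, filter_for) {
    keep <- rep(TRUE, nrow(df))
    for (col in names(filter_for)) {
        keep <- keep & df[[col]] %in% filter_for[[col]]
    }
    df[keep, , drop = FALSE]
}

filtered <- apply_filter(af, list(ProjectID = "CLOEXP"))
filtered
```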

4 Importing multiple matrices in parallel

There are 2 different functions for importing multiple matrices in parallel:

import_parallel_Vispa2Matrices_interactive
import_parallel_Vispa2Matrices_auto

The interactive version asks you to enter your choices directly in the console. The automatic version does not, but it has some limitations.
Both functions rely on the association file and some basic parameters, most notably:

quantification_type: a string or a character vector indicating which quantification types you want the function to look for. The possible values are fragmentEstimate, seqCount, barcodeCount, cellCount, ShsCount
matrix_type: tells the function whether it should consider annotated or not annotated matrices. The only possible options are “annotated” and “not_annotated”
workers: the number of parallel workers to instantiate when importing. Keep in mind that the higher the number, the faster the process, but also the higher the peak RAM usage, so be careful especially if you’re dealing with very large matrices. Set this parameter according to your needs and your hardware specifications.
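
The constraints on these parameters can be written down as a small validation sketch (base R only). The helper `check_import_args` is hypothetical and only mirrors the allowed values listed above; the package performs its own checks, which may be stricter:

```r
allowed_quant <- c(
    "fragmentEstimate", "seqCount", "barcodeCount", "cellCount", "ShsCount"
)
allowed_matrix <- c("annotated", "not_annotated")

# Hypothetical helper: stops with an error if an argument is not acceptable
check_import_args <- function(quantification_type, matrix_type, workers) {
    stopifnot(
        is.character(quantification_type),
        all(quantification_type %in% allowed_quant),
        is.character(matrix_type), length(matrix_type) == 1,
        matrix_type %in% allowed_matrix,
        is.numeric(workers), length(workers) == 1, workers >= 1
    )
    invisible(TRUE)
}

check_import_args(c("fragmentEstimate", "seqCount"), "annotated", 2)

# An unknown quantification type is rejected
bad_call <- tryCatch(
    check_import_args("readCount", "annotated", 2),
    error = function(e) "rejected"
)
bad_call
```

When choosing `workers`, `parallel::detectCores()` reports how many cores the machine has, which is a reasonable upper bound to start from.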

Both versions produce an HTML report as a summary of the import process. The report includes:

Which files were found, reporting any anomalies (missing files or duplicates)
Which files were chosen for import after interactive selection or after automatic filtering
Which files were actually imported, signaling potential errors during the import phase

By default, both functions return a multi-quantification matrix.
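
A multi-quantification matrix is a single data frame that holds one column per quantification type. How such a structure can be obtained from two single-quantification matrices is sketched below with base R `merge()`. The sample name `sample_1` is a placeholder and the join shown here is an assumption about the general idea, not the package's implementation:

```r
keys <- c("chr", "integration_locus", "strand", "CompleteAmplificationID")

frag <- data.frame(
    chr = c("10", "10"),
    integration_locus = c(102326217L, 85334019L),
    strand = c("+", "-"),
    CompleteAmplificationID = c("sample_1", "sample_1"),
    Value = c(1, 2),
    stringsAsFactors = FALSE
)
seqc <- frag
seqc$Value <- c(1, 6)

# Rename the generic Value column after the quantification type
names(frag)[names(frag) == "Value"] <- "fragmentEstimate"
names(seqc)[names(seqc) == "Value"] <- "seqCount"

# Full join on the key columns: one row per integration site and sample
multi <- merge(frag, seqc, by = keys, all = TRUE)
multi
```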

4.1 Interactive version

As stated before, with the interactive version you have more control and you can directly choose:

Which projects to import
Which pools to import
If duplicate files are found, which ones to keep

If you haven’t imported the association file yet, you can directly pass the path to the association file and the path to the root folder into the function: in this way the association file will automatically be imported.

Example:

withr::with_options(list(ISAnalytics.widgets = FALSE), {
    matrices <- import_parallel_Vispa2Matrices_interactive(
        association_file = path_as_file,
        root = root_correct,
        quantification_type = c("fragmentEstimate", "seqCount"),
        matrix_type = "annotated",
        workers = 2
    )
})

If you’ve already imported the association file you can instead call the function like this:

matrices <- import_parallel_Vispa2Matrices_interactive(
    association_file = association_file,
    root = NULL,
    quantification_type = c("fragmentEstimate", "seqCount"),
    matrix_type = "annotated",
    workers = 2
)

You can simply access the data frames by doing:

matrices$fragmentEstimate
matrices$seqCount

4.2 Automatic version

If you opt for the automatic version, keep in mind that the function automatically considers everything included in the association file. If you want to import only a subset of projects and/or pools, filter the association file according to your criteria before calling the function:

library(magrittr)
refined_af <- association_file %>% dplyr::filter(.data$ProjectID == "CLOEXP")

In the automatic version there is no way to choose among duplicates interactively, so you can specify additional patterns to look for in file names to mitigate this problem. However, if duplicates are still found after matching the additional patterns, they’re simply discarded.

There are 2 additional parameters to set:

patterns: a string or a character vector containing regular expressions to be matched on file names. If you’re not familiar with regular expressions, we suggest starting from the stringr cheatsheet
matching_opt: a single string that tells the function how to match the patterns. The possible values for this parameter are ANY, ALL, OPTIONAL:
- ANY: looks for files that match at least one of the patterns specified in patterns
- ALL: looks for files that match all the patterns specified in patterns
- OPTIONAL: looks for files that preferentially match all the patterns; if none are found, looks for files that match any of the patterns; finally, if none are found, simply looks for files that match the quantification type.
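
The three matching options can be illustrated with a small base R sketch. The helper `match_files` is hypothetical and only follows the descriptions given above; it is not the package's implementation. It assumes that `files` already contains only the candidates for one pool and one quantification type:

```r
# Hypothetical helper illustrating ANY, ALL and OPTIONAL
match_files <- function(files, patterns,
    matching_opt = c("ANY", "ALL", "OPTIONAL")) {
    matching_opt <- match.arg(matching_opt)
    if (is.null(patterns)) {
        return(files) # nothing to match: keep every candidate
    }
    # Logical matrix with one row per file and one column per pattern
    hits <- vapply(
        patterns, function(p) grepl(p, files), logical(length(files))
    )
    hits <- matrix(hits, nrow = length(files))
    all_hit <- files[rowSums(hits) == ncol(hits)]
    any_hit <- files[rowSums(hits) > 0]
    switch(matching_opt,
        ANY = any_hit,
        ALL = all_hit,
        # Prefer full matches, then partial matches, then every candidate
        OPTIONAL = if (length(all_hit) > 0) {
            all_hit
        } else if (length(any_hit) > 0) {
            any_hit
        } else {
            files
        }
    )
}

files <- c(
    "VA-CLOEXP-POOL6-1_fragmentEstimate_matrix-NoMate.no0.annotated.tsv.xz",
    "VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.no0.annotated.tsv.xz"
)
any_res <- match_files(files, "NoMate", "ANY")
all_res <- match_files(files, c("NoMate", "Mate2"), "ALL")
opt_res <- match_files(files, c("NoMate", "Mate2"), "OPTIONAL")
```

With these inputs, ANY keeps only the NoMate file, ALL keeps nothing because no file matches both patterns, and OPTIONAL falls back to the file that matches at least one pattern.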

You can call the function with patterns set to NULL if you don’t wish to match anything:

withr::with_options(list(ISAnalytics.widgets = FALSE), {
    matrices_auto <- import_parallel_Vispa2Matrices_auto(
        association_file = refined_af,
        root = NULL,
        quantification_type = c("fragmentEstimate", "seqCount"),
        matrix_type = "annotated",
        workers = 2,
        patterns = NULL,
        matching_opt = "ANY" # Same if you choose "ALL" or "OPTIONAL"
    )
})
#> [1] "--- REPORT: FILES IMPORTED ---"
#> # A tibble: 2 x 5
#>   ProjectID concatenatePoolIDSeqRun Quantification_type
#>   <chr>     <chr>                   <chr>              
#> 1 CLOEXP    POOL6-1                 fragmentEstimate   
#> 2 CLOEXP    POOL6-1                 seqCount           
#>   Files_chosen                                                                                                      
#>   <fs::path>                                                                                                        
#> 1 /tmp/RtmpHKhRpe/fs/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.no0.annotated.tsv.xz
#> 2 /tmp/RtmpHKhRpe/fs/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_seqCount_matrix.no0.annotated.tsv.xz        
#>   Imported
#>   <lgl>   
#> 1 TRUE    
#> 2 TRUE    
#> [1] "--- SUMMARY OF FILES CHOSEN FOR IMPORT ---"
#> # A tibble: 2 x 4
#>   ProjectID concatenatePoolIDSeqRun Quantification_type
#>   <chr>     <chr>                   <chr>              
#> 1 CLOEXP    POOL6-1                 fragmentEstimate   
#> 2 CLOEXP    POOL6-1                 seqCount           
#>   Files_chosen                                                                                                      
#>   <fs::path>                                                                                                        
#> 1 /tmp/RtmpHKhRpe/fs/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.no0.annotated.tsv.xz
#> 2 /tmp/RtmpHKhRpe/fs/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_seqCount_matrix.no0.annotated.tsv.xz        
#> [1] "--- INTEGRATION MATRICES FOUND REPORT ---"
#> # A tibble: 2 x 5
#>   ProjectID concatenatePoolIDSeqRun Anomalies Quantification_type
#>   <chr>     <chr>                   <lgl>     <chr>              
#> 1 CLOEXP    POOL6-1                 FALSE     fragmentEstimate   
#> 2 CLOEXP    POOL6-1                 FALSE     seqCount           
#>   Files_found                                                                                                       
#>   <fs::path>                                                                                                        
#> 1 /tmp/RtmpHKhRpe/fs/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.no0.annotated.tsv.xz
#> 2 /tmp/RtmpHKhRpe/fs/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_seqCount_matrix.no0.annotated.tsv.xz
matrices_auto
#> # A tibble: 1,476 x 8
#>    chr   integration_locus strand GeneName  GeneStrand CompleteAmplificationID                 fragmentEstimate seqCount
#>    <chr>             <int> <chr>  <chr>     <chr>      <fct>                                              <dbl>    <dbl>
#>  1 10            102326217 +      HIF1AN    +          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  2 10             20775144 +      MIR4675   +          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  3 10             24622789 +      KIAA1217  +          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  4 10             30105047 +      SVIL      -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  5 10             32656933 -      EPC1      -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  6 10             66717706 +      LOC10192… -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  7 10             76391232 -      ADK       +          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  8 10             85334019 -      LOC10537… -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             2.00        6
#>  9 11            107591208 +      SLN       -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#> 10 11            111744171 +      FDXACB1   -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#> # … with 1,466 more rows

Let’s look at an example with a file system that has issues, such as duplicates:

root_err <- system.file("extdata", "fserr.zip", package = "ISAnalytics")
root_err <- unzip_file_system(root_err, "fserr")
fs::dir_tree(root_err)
#> /tmp/RtmpHKhRpe/fserr
#> ├── PROJECT1100
#> │   ├── bam
#> │   ├── bcmuxall
#> │   ├── bed
#> │   ├── iss
#> │   │   └── ABX-LR-PL5-POOL14-1
#> │   │       └── stats.sequence_PROJECT1100.ABX-LR-PL5-POOL14-1.tsv
#> │   ├── quality
#> │   ├── quantification
#> │   │   ├── ABX-LR-PL5-POOL14-1
#> │   │   │   ├── PROJECT1100_ABX-LR-PL5-POOL14-1_fragmentEstimate_matrix-NoMate.no0.annotated.tsv.xz
#> │   │   │   ├── PROJECT1100_ABX-LR-PL5-POOL14-1_fragmentEstimate_matrix-NoMate2.no0.annotated.tsv.xz
#> │   │   │   ├── PROJECT1100_ABX-LR-PL5-POOL14-1_fragmentEstimate_matrix.no0.annotated.tsv.xz
#> │   │   │   └── PROJECT1100_ABX-LR-PL5-POOL14-1_seqCount_matrix.no0.annotated.tsv.xz
#> │   │   └── ABX-LR-PL6-POOL15-1
#> │   │       ├── PROJECT1100_ABX-LR-PL6-POOL15-1_fragmentEstimate_matrix.no0.annotated.tsv.xz
#> │   │       └── PROJECT1100_ABX-LR-PL6-POOL15-1_seqCount_matrix.no0.annotated.tsv.xz
#> │   └── report
#> └── VA
#>     └── CLOEXP
#>         ├── bam
#>         ├── bcmuxall
#>         ├── bed
#>         ├── iss
#>         │   └── POOL6-1
#>         │       └── stats.sequence_VA.POOL6-1.tsv
#>         ├── quality
#>         ├── quantification
#>         │   └── POOL6-1
#>         │       ├── VA-CLOEXP-POOL6-1_fragmentEstimate_matrix-NoMate.no0.annotated.tsv.xz
#>         │       ├── VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.no0.annotated.tsv.xz
#>         │       ├── VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.tsv.xz
#>         │       ├── VA-CLOEXP-POOL6-1_seqCount_matrix-NoMate.no0.annotated.tsv.xz
#>         │       ├── VA-CLOEXP-POOL6-1_seqCount_matrix.no0.annotated.tsv.xz
#>         │       └── VA-CLOEXP-POOL6-1_seqCount_matrix.tsv.xz
#>         └── report
withr::with_options(list(ISAnalytics.widgets = FALSE), {
    association_file_fserr <- import_association_file(path_as_file, root_err)
    refined_af_err <- association_file_fserr %>%
        dplyr::filter(.data$ProjectID == "CLOEXP")
    matrices_auto2 <- import_parallel_Vispa2Matrices_auto(
        association_file = refined_af_err,
        root = NULL,
        quantification_type = c("fragmentEstimate", "seqCount"),
        matrix_type = "annotated",
        workers = 2,
        patterns = "NoMate",
        matching_opt = "ANY" # Same if you choose "ALL" or "OPTIONAL"
    )
})
#> *** Association file import summary ***
#> ℹ For detailed report please set option 'ISAnalytics.widgets' to TRUE
#> * Parsing problems detected: TRUE
#> * Date parsing problems: TRUE
#> * Column problems detected: TRUE
#> * NAs found in important columns: TRUE
#> * File system alignment: problems detected
#> [1] "--- REPORT: FILES IMPORTED ---"
#> # A tibble: 2 x 5
#>   ProjectID concatenatePoolIDSeqRun Quantification_type
#>   <chr>     <chr>                   <chr>              
#> 1 CLOEXP    POOL6-1                 fragmentEstimate   
#> 2 CLOEXP    POOL6-1                 seqCount           
#>   Files_chosen                                                                                                          
#>   <fs::path>                                                                                                            
#> 1 /tmp/RtmpHKhRpe/fserr/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_fragmentEstimate_matrix-NoMate.no0.annotated…
#> 2 /tmp/RtmpHKhRpe/fserr/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_seqCount_matrix-NoMate.no0.annotated.tsv.xz  
#>   Imported
#>   <lgl>   
#> 1 TRUE    
#> 2 TRUE    
#> [1] "--- SUMMARY OF FILES CHOSEN FOR IMPORT ---"
#> # A tibble: 2 x 4
#>   ProjectID concatenatePoolIDSeqRun Quantification_type
#>   <chr>     <chr>                   <chr>              
#> 1 CLOEXP    POOL6-1                 fragmentEstimate   
#> 2 CLOEXP    POOL6-1                 seqCount           
#>   Files_chosen                                                                                                          
#>   <fs::path>                                                                                                            
#> 1 /tmp/RtmpHKhRpe/fserr/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_fragmentEstimate_matrix-NoMate.no0.annotated…
#> 2 /tmp/RtmpHKhRpe/fserr/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_seqCount_matrix-NoMate.no0.annotated.tsv.xz  
#> [1] "--- INTEGRATION MATRICES FOUND REPORT ---"
#> # A tibble: 2 x 5
#>   ProjectID concatenatePoolIDSeqRun Anomalies Quantification_type
#>   <chr>     <chr>                   <lgl>     <chr>              
#> 1 CLOEXP    POOL6-1                 FALSE     fragmentEstimate   
#> 2 CLOEXP    POOL6-1                 FALSE     seqCount           
#>   Files_found                                                                                                           
#>   <fs::path>                                                                                                            
#> 1 /tmp/RtmpHKhRpe/fserr/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_fragmentEstimate_matrix-NoMate.no0.annotated…
#> 2 /tmp/RtmpHKhRpe/fserr/VA/CLOEXP/quantification/POOL6-1/VA-CLOEXP-POOL6-1_seqCount_matrix-NoMate.no0.annotated.tsv.xz
matrices_auto2
#> # A tibble: 1,476 x 8
#>    chr   integration_locus strand GeneName  GeneStrand CompleteAmplificationID                 fragmentEstimate seqCount
#>    <chr>             <int> <chr>  <chr>     <chr>      <fct>                                              <dbl>    <dbl>
#>  1 10            102326217 +      HIF1AN    +          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  2 10             20775144 +      MIR4675   +          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  3 10             24622789 +      KIAA1217  +          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  4 10             30105047 +      SVIL      -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  5 10             32656933 -      EPC1      -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  6 10             66717706 +      LOC10192… -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  7 10             76391232 -      ADK       +          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#>  8 10             85334019 -      LOC10537… -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             2.00        6
#>  9 11            107591208 +      SLN       -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#> 10 11            111744171 +      FDXACB1   -          CLOEXP_POOL6_LTR10LC10_VA2020-mix10_VA…             1.00        1
#> # … with 1,466 more rows

As you can see, the file system with issues contains more than one file per quantification type: the duplicates have a “NoMate” suffix in their file names. By specifying this pattern in the function call, we import only those files.
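To make the effect of the pattern more concrete, here is a minimal base R sketch of the idea, using the file names listed above for the POOL6-1 folder. `pick_candidates()` is a hypothetical helper written only for this illustration: it is not part of ISAnalytics and does not reproduce how the package looks up files internally. We also assume that annotated matrices can be recognised by the `.no0.annotated.tsv` part of their file name, as suggested by the directory listing.

```r
# File names copied from VA/CLOEXP/quantification/POOL6-1 in the tree above
files <- c(
    "VA-CLOEXP-POOL6-1_fragmentEstimate_matrix-NoMate.no0.annotated.tsv.xz",
    "VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.no0.annotated.tsv.xz",
    "VA-CLOEXP-POOL6-1_fragmentEstimate_matrix.tsv.xz",
    "VA-CLOEXP-POOL6-1_seqCount_matrix-NoMate.no0.annotated.tsv.xz",
    "VA-CLOEXP-POOL6-1_seqCount_matrix.no0.annotated.tsv.xz",
    "VA-CLOEXP-POOL6-1_seqCount_matrix.tsv.xz"
)

# Hypothetical helper (NOT an ISAnalytics function): keep the annotated
# matrices of one quantification type, then optionally keep only the file
# names that match at least one of the regular expressions in `patterns`.
pick_candidates <- function(files, quant_type, patterns = NULL) {
    is_type <- grepl(paste0("_", quant_type, "_matrix"), files, fixed = TRUE)
    # Assumption: annotated matrices carry this suffix in their name
    is_annotated <- grepl(".no0.annotated.tsv", files, fixed = TRUE)
    candidates <- files[is_type & is_annotated]
    if (is.null(patterns)) {
        return(candidates)
    }
    # One logical vector per pattern, combined with OR
    hits <- Reduce(`|`, lapply(patterns, grepl, x = candidates))
    candidates[hits]
}

# Without a pattern the choice is ambiguous: two annotated files remain
no_pattern <- pick_candidates(files, "fragmentEstimate")

# With the "NoMate" pattern exactly one file per quantification type remains
with_pattern <- pick_candidates(files, "fragmentEstimate", patterns = "NoMate")
```

In this sketch `no_pattern` holds two file names while `with_pattern` holds only the “NoMate” one, which mirrors the single file per quantification type shown in the reports above.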

As with the interactive version, if you simply want to import everything without filtering, you can call the function with the path to the association file and the root.

5 Reproducibility

The ISAnalytics package (calabrialab, 2021) was made possible thanks to:

R (R Core Team, 2021)
BiocStyle (Oleś, Morgan, and Huber, 2021)
knitcitations (Boettiger, 2021)
knitr (Xie, 2021)
rmarkdown (Allaire, Xie, McPherson, Luraschi, et al., 2021)
sessioninfo (Csárdi, core, Wickham, Chang, et al., 2018)
testthat (Wickham, 2011)

This package was developed using biocthis.

R session information.

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.5 (2021-03-31)
#>  os       Ubuntu 18.04.5 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  C                           
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2021-04-08                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package       * version date       lib source        
#>  assertthat      0.2.1   2019-03-21 [2] CRAN (R 4.0.5)
#>  BiocManager     1.30.12 2021-03-28 [2] CRAN (R 4.0.5)
#>  BiocParallel    1.24.1  2021-04-08 [2] Bioconductor  
#>  BiocStyle     * 2.18.1  2021-04-08 [2] Bioconductor  
#>  bookdown        0.21    2020-10-13 [2] CRAN (R 4.0.5)
#>  bslib           0.2.4   2021-01-25 [2] CRAN (R 4.0.5)
#>  cellranger      1.1.0   2016-07-27 [2] CRAN (R 4.0.5)
#>  cli             2.4.0   2021-04-05 [2] CRAN (R 4.0.5)
#>  colorspace      2.0-0   2020-11-11 [2] CRAN (R 4.0.5)
#>  crayon          1.4.1   2021-02-08 [2] CRAN (R 4.0.5)
#>  data.table      1.14.0  2021-02-21 [2] CRAN (R 4.0.5)
#>  DBI             1.1.1   2021-01-15 [2] CRAN (R 4.0.5)
#>  debugme         1.1.0   2017-10-22 [2] CRAN (R 4.0.5)
#>  digest          0.6.27  2020-10-24 [2] CRAN (R 4.0.5)
#>  dplyr           1.0.5   2021-03-05 [2] CRAN (R 4.0.5)
#>  ellipsis        0.3.1   2020-05-15 [2] CRAN (R 4.0.5)
#>  evaluate        0.14    2019-05-28 [2] CRAN (R 4.0.5)
#>  fansi           0.4.2   2021-01-15 [2] CRAN (R 4.0.5)
#>  forcats         0.5.1   2021-01-27 [2] CRAN (R 4.0.5)
#>  fs              1.5.0   2020-07-31 [2] CRAN (R 4.0.5)
#>  generics        0.1.0   2020-10-31 [2] CRAN (R 4.0.5)
#>  ggplot2         3.3.3   2020-12-30 [2] CRAN (R 4.0.5)
#>  ggrepel         0.9.1   2021-01-15 [2] CRAN (R 4.0.5)
#>  glue            1.4.2   2020-08-27 [2] CRAN (R 4.0.5)
#>  gtable          0.3.0   2019-03-25 [2] CRAN (R 4.0.5)
#>  highr           0.8     2019-03-20 [2] CRAN (R 4.0.5)
#>  hms             1.0.0   2021-01-13 [2] CRAN (R 4.0.5)
#>  htmltools       0.5.1.1 2021-01-22 [2] CRAN (R 4.0.5)
#>  htmlwidgets     1.5.3   2020-12-10 [2] CRAN (R 4.0.5)
#>  httr            1.4.2   2020-07-20 [2] CRAN (R 4.0.5)
#>  ISAnalytics   * 1.0.11  2021-04-08 [1] Bioconductor  
#>  jquerylib       0.1.3   2020-12-17 [2] CRAN (R 4.0.5)
#>  jsonlite        1.7.2   2020-12-09 [2] CRAN (R 4.0.5)
#>  knitcitations * 1.0.12  2021-01-10 [2] CRAN (R 4.0.5)
#>  knitr           1.31    2021-01-27 [2] CRAN (R 4.0.5)
#>  lattice         0.20-41 2020-04-02 [2] CRAN (R 4.0.5)
#>  lifecycle       1.0.0   2021-02-15 [2] CRAN (R 4.0.5)
#>  lubridate       1.7.10  2021-02-26 [2] CRAN (R 4.0.5)
#>  magrittr      * 2.0.1   2020-11-17 [2] CRAN (R 4.0.5)
#>  mnormt          2.0.2   2020-09-01 [2] CRAN (R 4.0.5)
#>  munsell         0.5.0   2018-06-12 [2] CRAN (R 4.0.5)
#>  nlme            3.1-152 2021-02-04 [2] CRAN (R 4.0.5)
#>  pillar          1.5.1   2021-03-05 [2] CRAN (R 4.0.5)
#>  pkgconfig       2.0.3   2019-09-22 [2] CRAN (R 4.0.5)
#>  plyr            1.8.6   2020-03-03 [2] CRAN (R 4.0.5)
#>  ps              1.6.0   2021-02-28 [2] CRAN (R 4.0.5)
#>  psych           2.1.3   2021-03-27 [2] CRAN (R 4.0.5)
#>  purrr           0.3.4   2020-04-17 [2] CRAN (R 4.0.5)
#>  R6              2.5.0   2020-10-28 [2] CRAN (R 4.0.5)
#>  Rcpp            1.0.6   2021-01-15 [2] CRAN (R 4.0.5)
#>  reactable       0.2.3   2020-10-04 [2] CRAN (R 4.0.5)
#>  readr           1.4.0   2020-10-05 [2] CRAN (R 4.0.5)
#>  readxl          1.3.1   2019-03-13 [2] CRAN (R 4.0.5)
#>  RefManageR      1.3.0   2020-11-13 [2] CRAN (R 4.0.5)
#>  rlang           0.4.10  2020-12-30 [2] CRAN (R 4.0.5)
#>  rmarkdown       2.7     2021-02-19 [2] CRAN (R 4.0.5)
#>  rstudioapi      0.13    2020-11-12 [2] CRAN (R 4.0.5)
#>  sass            0.3.1   2021-01-24 [2] CRAN (R 4.0.5)
#>  scales          1.1.1   2020-05-11 [2] CRAN (R 4.0.5)
#>  sessioninfo   * 1.1.1   2018-11-05 [2] CRAN (R 4.0.5)
#>  stringi         1.5.3   2020-09-09 [2] CRAN (R 4.0.5)
#>  stringr         1.4.0   2019-02-10 [2] CRAN (R 4.0.5)
#>  tibble          3.1.0   2021-02-25 [2] CRAN (R 4.0.5)
#>  tidyr           1.1.3   2021-03-03 [2] CRAN (R 4.0.5)
#>  tidyselect      1.1.0   2020-05-11 [2] CRAN (R 4.0.5)
#>  tmvnsim         1.0-2   2016-12-15 [2] CRAN (R 4.0.5)
#>  upsetjs         1.9.0   2021-02-15 [2] CRAN (R 4.0.5)
#>  utf8            1.2.1   2021-03-12 [2] CRAN (R 4.0.5)
#>  vctrs           0.3.7   2021-03-29 [2] CRAN (R 4.0.5)
#>  withr           2.4.1   2021-01-26 [2] CRAN (R 4.0.5)
#>  xfun            0.22    2021-03-11 [2] CRAN (R 4.0.5)
#>  xml2            1.3.2   2020-04-23 [2] CRAN (R 4.0.5)
#>  yaml            2.2.1   2020-02-01 [2] CRAN (R 4.0.5)
#>  zip             2.1.1   2020-08-27 [2] CRAN (R 4.0.5)
#> 
#> [1] /tmp/RtmpuXsGTR/Rinst7cb42c7ac9b7
#> [2] /home/biocbuild/bbs-3.12-bioc/R/library

6 Bibliography

This vignette was generated using BiocStyle (Oleś, Morgan, and Huber, 2021) with knitr (Xie, 2021) and rmarkdown (Allaire, Xie, McPherson, Luraschi, et al., 2021) running behind the scenes.

Citations made with knitcitations (Boettiger, 2021).

[1] J. Allaire, Y. Xie, J. McPherson, J. Luraschi, et al. rmarkdown: Dynamic Documents for R. R package version 2.7. 2021. <URL: https://github.com/rstudio/rmarkdown>.

[2] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.12. 2021. <URL: https://CRAN.R-project.org/package=knitcitations>.

[3] G. Csárdi, R. core, H. Wickham, W. Chang, et al. sessioninfo: R Session Information. R package version 1.1.1. 2018. <URL: https://CRAN.R-project.org/package=sessioninfo>.

[4] A. Oleś, M. Morgan, and W. Huber. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.18.1. 2021. <URL: https://github.com/Bioconductor/BiocStyle>.

[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2021. <URL: https://www.R-project.org/>.

[6] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5-10. <URL: https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf>.

[7] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.31. 2021. <URL: https://yihui.org/knitr/>.

[8] calabrialab. Analyze gene therapy vector insertion sites data identified from genomics next generation sequencing reads for clonal tracking studies. https://github.com/calabrialab/ISAnalytics - R package version 1.0.11. 2021. DOI: 10.18129/B9.bioc.ISAnalytics. <URL: http://www.bioconductor.org/packages/ISAnalytics>.

How to use import functions

8 April 2021

Package

1 Introduction to `ISAnalytics` import functions family

1.1 How to install ISAnalytics

1.2 Setting options

1.3 Designed to work with Vispa2 pipeline

1.4 File system structure generated

2 Importing a single integration matrix

3 Importing the association file

4 Importing multiple matrices in parallel

4.1 Interactive version

4.2 Automatic version

5 Reproducibility

6 Bibliography

