1 Introduction to ISAnalytics import functions family

In this vignette we explain in more detail how the functions of the import family should be used, along with the most common workflows to follow.

2 Installation and options

ISAnalytics can be installed quickly in different ways:

  • You can install it via Bioconductor
  • You can install it via GitHub using the package devtools

There are always 2 versions of the package active:

  • RELEASE is the latest stable version
  • DEVEL is the development version: it is the most up-to-date version, where all new features are introduced

2.1 Installation from Bioconductor

RELEASE version:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ISAnalytics")

DEVEL version:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("ISAnalytics")

2.2 Installation from GitHub

RELEASE:

if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "RELEASE_3_14",
                         dependencies = TRUE,
                         build_vignettes = TRUE)

DEVEL:

if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
                         ref = "master",
                         dependencies = TRUE,
                         build_vignettes = TRUE)

2.3 Setting options

ISAnalytics has a verbose option that allows some functions to print additional information to the console while they’re executing. To disable or enable this feature:

# DISABLE
options("ISAnalytics.verbose" = FALSE)

# ENABLE
options("ISAnalytics.verbose" = TRUE)

Some functions also produce reports in a user-friendly HTML format. To disable or enable this feature:

# DISABLE HTML REPORTS
options("ISAnalytics.reports" = FALSE)

# ENABLE HTML REPORTS
options("ISAnalytics.reports" = TRUE)
library(ISAnalytics)
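
If you only need to change these options for a single call, base R's options() returns the previous values so they can be restored afterwards. The following is a minimal sketch: the option names are the ones shown above, while run_quietly() is a hypothetical helper that is not part of ISAnalytics.

```r
# Hypothetical helper (not part of ISAnalytics): run a function with
# console messages and HTML reports switched off, then restore the
# caller's previous settings.
run_quietly <- function(fun) {
  old <- options(ISAnalytics.verbose = FALSE, ISAnalytics.reports = FALSE)
  on.exit(options(old), add = TRUE)  # restore previous values on exit
  fun()
}

# Inside the call the option is FALSE; afterwards it is back to
# whatever it was before.
inside <- run_quietly(function() getOption("ISAnalytics.verbose"))
```

The same effect can be obtained with withr::with_options(), which is what the examples later in this vignette use.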

2.4 Designed to work with VISPA2 pipeline

The vast majority of the functions included in this package are designed to work in combination with the VISPA2 pipeline (Giulio Spinozzi Andrea Calabria, 2017). If you don’t know what it is, we strongly recommend that you take a look at these links:

2.5 File system structure generated

VISPA2 produces a standard file system structure starting from a folder you specify as your workbench or root. The structure always follows this schema:

  • root/
    • Optional intermediate folders
      • Projects (PROJECTID)
        • bam
        • bcmuxall
        • bed
        • iss
          • Pools (concatenatePoolIDSeqRun)
        • quality
        • quantification
          • Pools (concatenatePoolIDSeqRun)
        • report

Most of the functions implemented expect a standard file system structure such as the one described above.
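
To make the schema concrete, the sketch below builds an empty tree with the same layout under a temporary folder, using only base R. The project and pool names ("PJ01", "POOL01-1") are example values, and the tree contains no data files: it only illustrates where each folder sits.

```r
# Build an empty VISPA2-like folder tree under a temporary root.
# "PJ01" and "POOL01-1" are example project / pool names.
root <- file.path(tempdir(), "fs_sketch")
project <- "PJ01"
pool <- "POOL01-1"

project_dirs <- c(
  "bam", "bcmuxall", "bed", "quality", "report",
  file.path("iss", pool),            # one sub-folder per pool
  file.path("quantification", pool)  # one sub-folder per pool
)
for (d in file.path(root, project, project_dirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List the resulting folders relative to the root
list.dirs(root, full.names = FALSE, recursive = TRUE)
```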

3 Notation

We call an “integration matrix” a tabular structure characterized by:

  • 3 mandatory columns of genomic features that characterize a viral insertion site in the genome: chr, integration_locus and strand
  • 2 (optional) annotation columns: GeneName and GeneStrand
  • A variable number n of sample columns containing the quantification of the corresponding integration site
#> # A tibble: 3 × 8
#>   chr   integration_locus strand GeneName     GeneStrand  exp1  exp2  exp3
#>   <chr>             <dbl> <chr>  <chr>        <chr>      <dbl> <dbl> <dbl>
#> 1 1                 12324 +      NFATC3       +           4553  5345    NA
#> 2 6                657532 +      LOC100507487 +             76   545     5
#> 3 7                657532 +      EDIL3        -             NA    56    NA

The package uses a more compact form of these matrices, limiting the amount of NA values and optimizing time and memory consumption. For more info on this take a look at: Tidy data
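
As an illustration of what “more compact” means, the sketch below reshapes the small example matrix above from its wide form (one column per sample) to a long form with one row per integration site and sample, dropping the NA entries. It uses base R only and is not the package's own implementation; the output column names CompleteAmplificationID and Value mirror those produced by the import functions shown later.

```r
# The example matrix from above, in its wide (one column per sample) form
wide <- data.frame(
  chr = c("1", "6", "7"),
  integration_locus = c(12324, 657532, 657532),
  strand = c("+", "+", "+"),
  GeneName = c("NFATC3", "LOC100507487", "EDIL3"),
  GeneStrand = c("+", "+", "-"),
  exp1 = c(4553, 76, NA),
  exp2 = c(5345, 545, 56),
  exp3 = c(NA, 5, NA),
  stringsAsFactors = FALSE
)

id_cols <- c("chr", "integration_locus", "strand", "GeneName", "GeneStrand")
sample_cols <- setdiff(names(wide), id_cols)

# Stack the sample columns: one row per (integration site, sample) pair
long <- do.call(rbind, lapply(sample_cols, function(s) {
  cbind(wide[id_cols],
        CompleteAmplificationID = s,
        Value = wide[[s]],
        stringsAsFactors = FALSE)
}))

# Dropping missing quantifications is what makes the long form compact
long <- long[!is.na(long$Value), ]
rownames(long) <- NULL
```

The wide table stores 9 sample cells, 3 of which are NA; the long table keeps only the 6 observed values.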

While integration matrices contain the actual data, we also need associated sample metadata to perform the vast majority of the analyses. ISAnalytics expects the metadata to be contained in a so-called “association file”, which is a simple tabular file with a set of standard column headers.

To generate a blank association file you can use the function generate_blank_association_file(). You can also view the standard column names with association_file_columns().

4 Importing metadata

To import metadata we use import_association_file(). This function not only reads the file into the R environment as a data frame, it can also perform a file system alignment operation: for each project and pool contained in the file, it scans the file system starting from the provided root to check whether the corresponding folders (contained in the appropriate column) can be found. Remember that, to work properly, this operation expects a standard folder structure, such as the one provided by VISPA2. This function also produces an interactive HTML report; to know more about this feature see vignette("report_system").

fs_path <- system.file("extdata", "fs.zip", package = "ISAnalytics")
root <- unzip_file_system(fs_path, "fs")
withr::with_options(list(ISAnalytics.reports = FALSE), code = {
  af_path <- system.file("extdata", "asso.file.tsv.gz", 
                         package = "ISAnalytics")
  af <- import_association_file(af_path, root = root)
})
#> *** Association file import summary ***
#> ℹ For detailed report please set option 'ISAnalytics.reports' to TRUE
#> * Parsing problems detected: FALSE
#> * Date parsing problems: FALSE
#> * Column problems detected: FALSE
#> * NAs found in important columns: FALSE
#> * File system alignment: no problems detected
#> # A tibble: 6 × 74
#>   ProjectID FUSIONID  PoolID TagSequence SubjectID VectorType VectorID ExperimentID Tissue TimePoint DNAFragmentation
#>   <chr>     <chr>     <chr>  <chr>       <chr>     <chr>      <chr>    <chr>        <chr>  <chr>     <chr>           
#> 1 PJ01      ET#382.46 POOL01 LTR75LC38   PT001     lenti      GLOBE    <NA>         PB     0060      SONIC           
#> 2 PJ01      ET#381.40 POOL01 LTR53LC32   PT001     lenti      GLOBE    <NA>         BM     0180      SONIC           
#> 3 PJ01      ET#381.9  POOL01 LTR83LC66   PT001     lenti      GLOBE    <NA>         BM     0180      SONIC           
#> 4 PJ01      ET#381.71 POOL01 LTR27LC94   PT001     lenti      GLOBE    <NA>         BM     0180      SONIC           
#> 5 PJ01      ET#381.2  POOL01 LTR69LC52   PT001     lenti      GLOBE    <NA>         PB     0180      SONIC           
#> 6 PJ01      ET#382.28 POOL01 LTR37LC2    PT001     lenti      GLOBE    <NA>         BM     0060      SONIC           
#>   PCRMethod TagIDextended Keywords CellMarker TagID      NGSProvider NGSTechnology ConverrtedFilesDir
#>   <chr>     <chr>         <chr>    <chr>      <chr>      <chr>       <chr>         <chr>             
#> 1 SLiM      LTR75LC38     <NA>     MNC        LTR75.LC38 <NA>        HiSeq         <NA>              
#> 2 SLiM      LTR53LC32     <NA>     MNC        LTR53.LC32 <NA>        HiSeq         <NA>              
#> 3 SLiM      LTR83LC66     <NA>     MNC        LTR83.LC66 <NA>        HiSeq         <NA>              
#> 4 SLiM      LTR27LC94     <NA>     MNC        LTR27.LC94 <NA>        HiSeq         <NA>              
#> 5 SLiM      LTR69LC52     <NA>     MNC        LTR69.LC52 <NA>        HiSeq         <NA>              
#> 6 SLiM      LTR37LC2      <NA>     MNC        LTR37.LC2  <NA>        HiSeq         <NA>              
#>   ConverrtedFilesName SourceFileFolder SourceFileNameR1 SourceFileNameR2 DNAnumber ReplicateNumber DNAextractionDate
#>   <chr>               <chr>            <chr>            <chr>            <chr>               <int> <date>           
#> 1 <NA>                <NA>             <NA>             <NA>             PT001-103               3 2016-03-16       
#> 2 <NA>                <NA>             <NA>             <NA>             PT001-81                2 2016-07-15       
#> 3 <NA>                <NA>             <NA>             <NA>             PT001-81                1 2016-07-15       
#> 4 <NA>                <NA>             <NA>             <NA>             PT001-81                3 2016-07-15       
#> 5 <NA>                <NA>             <NA>             <NA>             PT001-74                1 2016-07-15       
#> 6 <NA>                <NA>             <NA>             <NA>             PT001-107               2 2016-03-16       
#>   DNAngUsed LinearPCRID LinearPCRDate SonicationDate LigationDate `1stExpoPCRID` `1stExpoPCRDate` `2ndExpoID`
#>       <dbl> <chr>       <date>        <date>         <date>       <chr>          <date>           <chr>      
#> 1      23.2 <NA>        NA            2016-11-02     2016-11-02   ET#380.46      2016-11-02       <NA>       
#> 2     181.  <NA>        NA            2016-11-02     2016-11-02   ET#379.40      2016-11-02       <NA>       
#> 3     181.  <NA>        NA            2016-11-02     2016-11-02   ET#379.9       2016-11-02       <NA>       
#> 4     181.  <NA>        NA            2016-11-02     2016-11-02   ET#379.71      2016-11-02       <NA>       
#> 5      23.1 <NA>        NA            2016-11-02     2016-11-02   ET#379.2       2016-11-02       <NA>       
#> 6     171.  <NA>        NA            2016-11-02     2016-11-02   ET#380.28      2016-11-02       <NA>       
#>   `2ndExpoDate` FusionPrimerPCRID FusionPrimerPCRDate PoolDate   SequencingDate   VCN Genome SequencingRound Genotype
#>   <date>        <chr>             <date>              <date>     <date>         <dbl> <chr>            <int> <chr>   
#> 1 NA            ET#382.46         2016-11-03          2016-11-07 2016-11-15      0.3  hg19                 1 <NA>    
#> 2 NA            ET#381.40         2016-11-03          2016-11-07 2016-11-15      0.27 hg19                 1 <NA>    
#> 3 NA            ET#381.9          2016-11-03          2016-11-07 2016-11-15      0.27 hg19                 1 <NA>    
#> 4 NA            ET#381.71         2016-11-03          2016-11-07 2016-11-15      0.27 hg19                 1 <NA>    
#> 5 NA            ET#381.2          2016-11-03          2016-11-07 2016-11-15      0.24 hg19                 1 <NA>    
#> 6 NA            ET#382.28         2016-11-03          2016-11-07 2016-11-15      0.42 hg19                 1 <NA>    
#>   TestGroup MOI   Engraftment Transduction Notes AddedField1 AddedField2 AddedField3 AddedField4 concatenatePoolIDSeqRun
#>   <chr>     <chr>       <dbl>        <dbl> <chr> <chr>       <chr>       <chr>       <chr>       <chr>                  
#> 1 <NA>      <NA>           NA           NA <NA>  <NA>        <NA>        <NA>        <NA>        POOL01-1               
#> 2 <NA>      <NA>           NA           NA <NA>  <NA>        <NA>        <NA>        <NA>        POOL01-1               
#> 3 <NA>      <NA>           NA           NA <NA>  <NA>        <NA>        <NA>        <NA>        POOL01-1               
#> 4 <NA>      <NA>           NA           NA <NA>  <NA>        <NA>        <NA>        <NA>        POOL01-1               
#> 5 <NA>      <NA>           NA           NA <NA>  <NA>        <NA>        <NA>        <NA>        POOL01-1               
#> 6 <NA>      <NA>           NA           NA <NA>  <NA>        <NA>        <NA>        <NA>        POOL01-1               
#>   AddedField6_RelativeBloodPercentage AddedField7_PurityTestFeasibility AddedField8_FacsSeparationPurity  Kapa ulForPool
#>   <chr>                                                           <dbl>                            <dbl> <dbl>     <dbl>
#> 1 <NA>                                                               NA                               NA    NA        NA
#> 2 <NA>                                                               NA                               NA    NA        NA
#> 3 <NA>                                                               NA                               NA    NA        NA
#> 4 <NA>                                                               NA                               NA    NA        NA
#> 5 <NA>                                                               NA                               NA    NA        NA
#> 6 <NA>                                                               NA                               NA    NA        NA
#>   CompleteAmplificationID                                              UniqueID               StudyTestID StudyTestGroup
#>   <chr>                                                                <chr>                  <chr>                <int>
#> 1 PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC ID00000000000000007433 <NA>                    NA
#> 2 PJ01_POOL01_LTR53LC32_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC  ID00000000000000007340 <NA>                    NA
#> 3 PJ01_POOL01_LTR83LC66_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC  ID00000000000000007310 <NA>                    NA
#> 4 PJ01_POOL01_LTR27LC94_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC  ID00000000000000007370 <NA>                    NA
#> 5 PJ01_POOL01_LTR69LC52_PT001_PT001-74_lenti_GLOBE_PB_1_SLiM_0180_MNC  ID00000000000000007303 <NA>                    NA
#> 6 PJ01_POOL01_LTR37LC2_PT001_PT001-107_lenti_GLOBE_BM_1_SLiM_0060_MNC  ID00000000000000007417 <NA>                    NA
#>   MouseID Tigroup Tisource PathToFolderProjectID SamplesNameCheck TimepointDays TimepointMonths TimepointYears
#>     <int> <chr>   <chr>    <chr>                 <chr>                    <int> <chr>           <chr>         
#> 1      NA <NA>    <NA>     /PJ01                 <NA>                        NA 02              01            
#> 2      NA <NA>    <NA>     /PJ01                 <NA>                        NA 06              01            
#> 3      NA <NA>    <NA>     /PJ01                 <NA>                        NA 06              01            
#> 4      NA <NA>    <NA>     /PJ01                 <NA>                        NA 06              01            
#> 5      NA <NA>    <NA>     /PJ01                 <NA>                        NA 06              01            
#> 6      NA <NA>    <NA>     /PJ01                 <NA>                        NA 02              01            
#>   `ng DNA corrected` Path                    Path_quant                                     
#>                <dbl> <fs::path>              <chr>                                          
#> 1               23.2 /tmp/RtmpYYhioi/fs/PJ01 /tmp/RtmpYYhioi/fs/PJ01/quantification/POOL01-1
#> 2              181.  /tmp/RtmpYYhioi/fs/PJ01 /tmp/RtmpYYhioi/fs/PJ01/quantification/POOL01-1
#> 3              181.  /tmp/RtmpYYhioi/fs/PJ01 /tmp/RtmpYYhioi/fs/PJ01/quantification/POOL01-1
#> 4              181.  /tmp/RtmpYYhioi/fs/PJ01 /tmp/RtmpYYhioi/fs/PJ01/quantification/POOL01-1
#> 5               23.1 /tmp/RtmpYYhioi/fs/PJ01 /tmp/RtmpYYhioi/fs/PJ01/quantification/POOL01-1
#> 6              171.  /tmp/RtmpYYhioi/fs/PJ01 /tmp/RtmpYYhioi/fs/PJ01/quantification/POOL01-1
#>   Path_iss                            
#>   <chr>                               
#> 1 /tmp/RtmpYYhioi/fs/PJ01/iss/POOL01-1
#> 2 /tmp/RtmpYYhioi/fs/PJ01/iss/POOL01-1
#> 3 /tmp/RtmpYYhioi/fs/PJ01/iss/POOL01-1
#> 4 /tmp/RtmpYYhioi/fs/PJ01/iss/POOL01-1
#> 5 /tmp/RtmpYYhioi/fs/PJ01/iss/POOL01-1
#> 6 /tmp/RtmpYYhioi/fs/PJ01/iss/POOL01-1

4.1 Function arguments

You can change several arguments in the function call to modify the behavior of the function.

  • root
    • Set it to NULL if you only want to import the association file without file system alignment. Beware that some of the automated import functionalities won’t work!
    • Set it to a non-empty string (path on disk): in this case, the column PathToFolderProjectID in the file should contain relative file paths, so if for example your root is set to “/home” and your project folder in the association file is set to “/PJ01”, the function will check that the directory exists under “/home/PJ01”
    • Set it to an empty string: ideal if you want to store paths in the association file as absolute file paths. In this case if your project folder is in “/home/PJ01” you should have this path in the PathToFolderProjectID column and set root = ""
  • tp_padding: this argument is used to pad the TimePoint column in the association file so that all time points have the same length
  • dates_format: a string that is useful for properly parsing dates from tabular formats
  • separator: the column separator used in the file. Defaults to “\t”, other valid separators are “,” (comma), “;” (semi-colon)
  • filter_for: you can set this argument to a named list of filters, where names are column names. For example list(ProjectID = "PJ01") will return only those rows whose attribute “ProjectID” equals “PJ01”
  • import_iss: either TRUE or FALSE. If set to TRUE, performs an internal call to import_Vispa2_stats() (see next section), and appends the imported files to metadata
  • convert_tp: either TRUE or FALSE. Converts the “TimePoint” column into months and years (with custom logic).
  • report_path
    • Set it to NULL to avoid the production of a report
    • Set it to a folder (if it doesn’t exist, it gets automatically created)
    • Set it to a file
  • ...: additional named arguments to pass to import_Vispa2_stats() if you chose to import VISPA2 stats
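
The three settings of root and the effect of tp_padding can be pictured with a few lines of base R. This is only a sketch of the behavior described in the list above: resolve_project_path() is a hypothetical helper, not the package's implementation, and the padding width of 4 is an assumption chosen to match the “0060” and “0180” values in the output shown earlier.

```r
# Hypothetical helper mimicking the three `root` settings described above.
# It is NOT the package's implementation, only an illustration.
resolve_project_path <- function(root, project_folder) {
  if (is.null(root)) {
    return(NA_character_)   # no file system alignment requested
  }
  if (identical(root, "")) {
    return(project_folder)  # paths in the file are already absolute
  }
  # Relative path: strip the leading separator and append to the root
  file.path(root, sub("^/+", "", project_folder))
}

resolve_project_path("/home", "/PJ01")  # "/home/PJ01"
resolve_project_path("", "/home/PJ01")  # "/home/PJ01"
resolve_project_path(NULL, "/PJ01")     # NA

# Zero-padding time points to a common width (width 4 is an assumption)
sprintf("%04d", c(60L, 180L))           # "0060" "0180"
```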

NOTE: the function supports files in various formats as long as the correct separator is provided. It also accepts files in *.xlsx and *.xls formats but we do not recommend using these since the report won’t include a detailed summary of potential parsing problems.

The interactive report includes useful information such as

  • General issues: parsing problems, missing columns, NA values in important columns etc. This allows you to immediately spot problems and correct them before proceeding with the analyses
  • File system alignment issues: very useful to know if all data can be imported or folders are missing
  • Info on VISPA2 stats (if import_iss was TRUE)

5 Importing VISPA2 stats files

VISPA2 automatically produces summary files for each pool. They hold information that can be useful for other analyses downstream, so it is recommended to import them in the first steps of the workflow. To do that, you can use import_Vispa2_stats():

withr::with_options(list(ISAnalytics.reports = FALSE), {
    vispa_stats <- import_Vispa2_stats(
        association_file = af,
        join_with_af = FALSE
    )
})
#>         RUN_NAME     POOL       TAG PHIX_MAPPING PLASMID_MAPPED_BYPOOL BARCODE_MUX LTR_IDENTIFIED TRIMMING_FINAL_LTRLC
#> 1: PJ01|POOL01-1 POOL01-1 LTR75LC38     43586699               2256176      645026         645026               630965
#> 2: PJ01|POOL01-1 POOL01-1 LTR53LC32     43586699               2256176      652208         652177               649044
#> 3: PJ01|POOL01-1 POOL01-1 LTR83LC66     43586699               2256176      451519         451512               449669
#> 4: PJ01|POOL01-1 POOL01-1 LTR27LC94     43586699               2256176      426500         426499               425666
#> 5: PJ01|POOL01-1 POOL01-1 LTR69LC52     43586699               2256176       18300          18300                18290
#> 6: PJ01|POOL01-1 POOL01-1  LTR37LC2     43586699               2256176      729327         729327               727219
#>    LV_MAPPED BWA_MAPPED_OVERALL ISS_MAPPED_OVERALL RAW_READS QUALITY_PASSED ISS_MAPPED_PP
#> 1:    211757             402477             219452        NA             NA            NA
#> 2:    303300             322086             222646        NA             NA            NA
#> 3:    204810             227275             149385        NA             NA            NA
#> 4:    185752             223915             143283        NA             NA            NA
#> 5:      6962              10487               5907        NA             NA            NA
#> 6:    318653             369117             235640        NA             NA            NA

The function requires as input the imported and file system aligned association file, and it scans the iss folder for files that match some known prefixes (defaults are already provided but you can change them as you see fit). You can either join the imported data frames with the input association file and obtain a single data frame, or keep them separate: just set the parameter join_with_af accordingly. At the end of the process an HTML report is produced, signaling potential problems.

You can directly call this function when you import the association file by setting the import_iss argument of import_association_file to TRUE.

6 Importing a single integration matrix

If you want to import a single integration matrix you can do so by using the import_single_Vispa2Matrix() function. This function reads the file and converts it into a tidy structure: several different formats can be read, since you can specify the column separator.

matrix_path <- fs::path(root,
                        "PJ01",
                        "quantification",
                        "POOL01-1",
                        "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz")
matrix <- import_single_Vispa2Matrix(matrix_path)
#>      chr integration_locus strand GeneName GeneStrand
#>   1:  16          68164148      +   NFATC3          +
#>   2:  16           1762026      - MAPK8IP3          +
#>   3:  16          15966129      -    FOPNL          -
#>   4:  16           1762026      - MAPK8IP3          +
#>   5:  16          29843197      -      MVP          +
#>  ---                                                 
#> 798:   X          41047794      -    USP9X          +
#> 799:   X         138822227      +   ATP11C          -
#> 800:   X          69681219      +     DLG3          +
#> 801:   X          69681219      +     DLG3          +
#> 802:   X          41047794      -    USP9X          +
#>                                                   CompleteAmplificationID Value
#>   1: PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC   182
#>   2:  PJ01_POOL01_LTR53LC32_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC   727
#>   3:  PJ01_POOL01_LTR53LC32_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC   821
#>   4:  PJ01_POOL01_LTR83LC66_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC    37
#>   5:  PJ01_POOL01_LTR83LC66_PT001_PT001-81_lenti_GLOBE_BM_1_SLiM_0180_MNC   983
#>  ---                                                                           
#> 798:   PJ01_POOL01_LTR19LC2_PT001_PT001-97_lenti_GLOBE_BM_1_SLiM_0030_MNC    32
#> 799: PJ01_POOL01_LTR57LC20_PT001_PT001-116_lenti_GLOBE_BM_1_SLiM_0090_MNC  2535
#> 800:  PJ01_POOL01_LTR5LC64_PT001_PT001-116_lenti_GLOBE_BM_1_SLiM_0090_MNC  1693
#> 801:  PJ01_POOL01_LTR85LC64_PT001_PT001-97_lenti_GLOBE_BM_1_SLiM_0030_MNC     1
#> 802:  PJ01_POOL01_LTR85LC64_PT001_PT001-97_lenti_GLOBE_BM_1_SLiM_0030_MNC   609

Other arguments you can pass to the function are

  • to_exclude: a character vector that contains the names of columns to exclude on import. This is targeted at files that contain all the columns of an integration matrix as presented in Section 3 plus other additional columns. By default this argument is set to NULL
  • keep_excluded: if set to TRUE all columns contained in to_exclude are preserved as additional annotation columns

7 Automated integration matrices import

The import of integration matrices can be automated when the association file is imported with the file system alignment option. ISAnalytics provides a function, import_parallel_Vispa2Matrices(), that allows you to do just that in a fast and efficient way.

withr::with_options(list(ISAnalytics.reports = FALSE), {
    matrices <- import_parallel_Vispa2Matrices(af,
        c("seqCount", "fragmentEstimate"),
        mode = "AUTO"
    )
})

7.1 Function arguments

Let’s see how the behavior of the function changes when we change arguments.

7.1.1 association_file argument

You can supply either a data frame imported via import_association_file() (see Section 4) or a string (the path to the association file on disk). In the first scenario the association file must already be file system aligned, since the function scans the folders contained in the column Path_quant. In the second case you should also provide an appropriate root as an additional named argument (via ...), and the function will internally call import_association_file(). If you don’t have specific needs, we recommend doing the 2 steps separately and providing the association file as a data frame.

7.1.2 quantification_type argument

For each pool there may be multiple available quantification types, that is, different matrices containing the same samples and the same genomic features but a different quantification. A typical workflow uses seqCount and fragmentEstimate; all the supported quantification types can be viewed with quantification_types().

7.1.3 matrix_type argument

As we mentioned in Section 3, annotation columns are optional and may not be included in some matrices. This argument lets you tell the function to look for only a specific type of matrix, either annotated or not_annotated. Please note that in order to do that, for now, the function needs to assume some standard file name notation: for annotated matrices, the function will look for the .no0.annotated suffix in the file name.
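
The file name convention can be illustrated with plain string matching. In the sketch below only the first file name comes from this vignette; the other two are hypothetical names invented for the example, and the matching shown is a simplification, not the package's actual lookup code.

```r
files <- c(
  "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz",         # from this vignette
  "PJ01_POOL01-1_seqCount_matrix.tsv.gz",                       # hypothetical
  "PJ01_POOL01-1_fragmentEstimate_matrix.no0.annotated.tsv.gz"  # hypothetical
)

# fixed = TRUE treats the dots literally instead of as regex wildcards
is_annotated <- grepl(".no0.annotated", files, fixed = TRUE)
is_seq_count <- grepl("_seqCount_", files, fixed = TRUE)

annotated_seq_count <- files[is_seq_count & is_annotated]
plain_seq_count <- files[is_seq_count & !is_annotated]
```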

7.1.4 workers argument

Sets the number of parallel workers to set up. This highly depends on the hardware configuration of your machine.
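
One common heuristic, which is an assumption of this sketch and not a recommendation from the package, is to leave one core free for the rest of the system. The parallel package ships with R:

```r
# detectCores() may return NA on some platforms, so fall back to 1
n_cores <- parallel::detectCores()
workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)
```

The resulting value can then be passed as the workers argument.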

7.1.5 multi_quant_matrix argument

When importing more than one quantification at once, it can be very handy to have all data in a single data frame rather than two. If set to TRUE, the function will internally call comparison_matrix() and produce a single data frame that has a dedicated column for each quantification. For example, for the matrices we’ve imported before:

#> # A tibble: 6 × 8
#>   chr   integration_locus strand GeneName     GeneStrand
#>   <chr>             <int> <chr>  <chr>        <chr>     
#> 1 16             68164148 +      NFATC3       +         
#> 2 4             129390130 +      LOC100507487 +         
#> 3 5              84009671 -      EDIL3        -         
#> 4 12             54635693 -      CBX5         -         
#> 5 2             181930711 +      UBE2E3       +         
#> 6 20             35920986 +      MANBAL       +         
#>   CompleteAmplificationID                                              fragmentEstimate seqCount
#>   <fct>                                                                           <dbl>    <int>
#> 1 PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC           103.        182
#> 2 PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC             3.01        4
#> 3 PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC             5.03        5
#> 4 PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC             9.13        9
#> 5 PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC            50.5        83
#> 6 PJ01_POOL01_LTR75LC38_PT001_PT001-103_lenti_GLOBE_PB_1_SLiM_0060_MNC            16.4        39
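
Conceptually, the result is a join of the single-quantification tables on the columns that identify an integration site and a sample. The base R sketch below shows the idea on two toy tables; it assumes a full join, uses a hypothetical sample name and rounded illustrative values, and is not the code that comparison_matrix() runs.

```r
key <- c("chr", "integration_locus", "strand", "CompleteAmplificationID")

seq_count <- data.frame(
  chr = c("16", "4"),
  integration_locus = c(68164148, 129390130),
  strand = c("+", "+"),
  CompleteAmplificationID = "sample_A",  # hypothetical sample name
  Value = c(182, 4),
  stringsAsFactors = FALSE
)
# Same sites and sample, different quantification (illustrative values)
fragment_estimate <- transform(seq_count, Value = c(103, 3.01))

# Give each Value column the name of its quantification type
names(seq_count)[names(seq_count) == "Value"] <- "seqCount"
names(fragment_estimate)[names(fragment_estimate) == "Value"] <- "fragmentEstimate"

# Full join on the key columns: sites missing in one table get NA there
multi <- merge(seq_count, fragment_estimate, by = key, all = TRUE)
```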

7.1.6 report_path argument

Like other import functions, import_parallel_Vispa2Matrices() produces an interactive report. Use this argument to set the path where the report should be saved.

7.1.7 mode argument

This argument can take one of two values, AUTO or INTERACTIVE. The INTERACTIVE workflow, as the name suggests, needs user console input but allows fine-tuning of the import process. AUTO, on the other hand, allows a fully automated workflow but has some limitations.

What do you want to import?
In fully automated mode, the function will try to import everything that is contained in the input association file. This means that if you need to import only a specific set of projects/pools, you will need to filter the association file accordingly prior to calling the function (you can easily do that via the filter_for argument as explained in Section 4). In interactive mode the function will ask you to type what you want to import.
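
The effect of such a filter can be sketched on a toy data frame. Only the column names below come from the association file shown earlier; the rows (including project "PJ02") are made up, and filter_rows() is a hypothetical helper that merely mimics the “named list of filters” idea behind filter_for.

```r
# Toy association file; only the column names come from the real file
af_toy <- data.frame(
  ProjectID = c("PJ01", "PJ01", "PJ02"),
  PoolID = c("POOL01", "POOL02", "POOL01"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where every named column matches its filter
filter_rows <- function(df, filters) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(filters)) {
    keep <- keep & df[[col]] %in% filters[[col]]
  }
  df[keep, , drop = FALSE]
}

filtered <- filter_rows(af_toy, list(ProjectID = "PJ01", PoolID = "POOL01"))
```

Only the rows that survive the filter would then be considered for import.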

How to deal with duplicates?
When scanning folders for files that match a given pattern (in our case the function looks for matrices that match the quantification type and the matrix type), it is quite possible that the same folder contains multiple files for the same quantification. This is not recommended: we suggest moving the duplicated files to a subdirectory or removing them if they’re not necessary. If it does happen, in interactive mode the function asks directly which files should be considered. This is not possible in automated mode, so you need to set two other arguments (described in the next subsections) to “help” the function discriminate between duplicates. Please note that if such discrimination is not possible, no files are imported.

7.1.8 patterns argument

This argument is relevant only if mode is set to AUTO. Providing a set of patterns (interpreted as regular expressions) helps the function to choose between duplicated files if any are found. If you’re confident your folders don’t contain any duplicates feel free to ignore this argument.

7.1.9 matching_opt argument

This argument is relevant only if mode is set to AUTO and patterns isn’t NULL. It tells the function how to match the given patterns if multiple are supplied: ALL keeps only those files whose name matches all the given patterns, ANY keeps only those files whose name matches at least one of the given patterns, and OPTIONAL expresses a preference: the function tries to find files that match the patterns and, if none are found, returns whatever files it found.
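
The difference between the three options is easiest to see on a pair of duplicated files. The sketch below is a hypothetical re-implementation in base R, written only to illustrate the semantics described above: pick_files() is not a package function, the "_NEW" file name is invented, and OPTIONAL is interpreted here as “ANY, with a fallback to all files”.

```r
# Hypothetical helper illustrating the three matching options;
# not the package's implementation.
pick_files <- function(files, patterns,
                       matching_opt = c("ANY", "ALL", "OPTIONAL")) {
  matching_opt <- match.arg(matching_opt)
  # One row per file, one column per pattern
  hits <- vapply(patterns, function(p) grepl(p, files),
                 logical(length(files)))
  hits <- matrix(hits, nrow = length(files))
  matched <- switch(matching_opt,
    ALL = apply(hits, 1, all),
    ANY = apply(hits, 1, any),
    OPTIONAL = apply(hits, 1, any)
  )
  if (matching_opt == "OPTIONAL" && !any(matched)) {
    return(files)  # nothing matched: fall back to everything found
  }
  files[matched]
}

dupes <- c(
  "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz",
  "PJ01_POOL01-1_seqCount_matrix_NEW.no0.annotated.tsv.gz"  # invented name
)

pick_files(dupes, c("NEW", "seqCount"), "ALL")  # only the "_NEW" file
pick_files(dupes, c("NEW", "OLD"), "ANY")       # only the "_NEW" file
pick_files(dupes, "OLD", "OPTIONAL")            # both files (no match)
pick_files(dupes, "OLD", "ANY")                 # character(0)
```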

7.1.10 ... argument

Additional named arguments to supply to both import_association_file() and comparison_matrix().

7.2 Notes

Earlier versions of the package featured two separate functions, import_parallel_Vispa2Matrices_auto() and import_parallel_Vispa2Matrices_interactive(). Those functions are now officially deprecated (since ISAnalytics 1.3.3) and will be defunct in the next release cycle.

8 Reproducibility

R session information.

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Ubuntu 20.04.3 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2022-01-16
#>  pandoc   2.5 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [2] CRAN (R 4.1.2)
#>  BiocManager    1.30.16 2021-06-15 [2] CRAN (R 4.1.2)
#>  BiocParallel   1.28.3  2022-01-16 [2] Bioconductor
#>  BiocStyle    * 2.22.0  2022-01-16 [2] Bioconductor
#>  bit            4.0.4   2020-08-04 [2] CRAN (R 4.1.2)
#>  bit64          4.0.5   2020-08-30 [2] CRAN (R 4.1.2)
#>  bookdown       0.24    2021-09-02 [2] CRAN (R 4.1.2)
#>  bslib          0.3.1   2021-10-06 [2] CRAN (R 4.1.2)
#>  cli            3.1.0   2021-10-27 [2] CRAN (R 4.1.2)
#>  colorspace     2.0-2   2021-06-24 [2] CRAN (R 4.1.2)
#>  crayon         1.4.2   2021-10-29 [2] CRAN (R 4.1.2)
#>  data.table     1.14.2  2021-09-27 [2] CRAN (R 4.1.2)
#>  DBI            1.1.2   2021-12-20 [2] CRAN (R 4.1.2)
#>  digest         0.6.29  2021-12-01 [2] CRAN (R 4.1.2)
#>  dplyr          1.0.7   2021-06-18 [2] CRAN (R 4.1.2)
#>  ellipsis       0.3.2   2021-04-29 [2] CRAN (R 4.1.2)
#>  evaluate       0.14    2019-05-28 [2] CRAN (R 4.1.2)
#>  fansi          1.0.2   2022-01-14 [2] CRAN (R 4.1.2)
#>  fastmap        1.1.0   2021-01-25 [2] CRAN (R 4.1.2)
#>  fs             1.5.2   2021-12-08 [2] CRAN (R 4.1.2)
#>  generics       0.1.1   2021-10-25 [2] CRAN (R 4.1.2)
#>  ggplot2        3.3.5   2021-06-25 [2] CRAN (R 4.1.2)
#>  ggrepel        0.9.1   2021-01-15 [2] CRAN (R 4.1.2)
#>  glue           1.6.0   2021-12-17 [2] CRAN (R 4.1.2)
#>  gtable         0.3.0   2019-03-25 [2] CRAN (R 4.1.2)
#>  highr          0.9     2021-04-16 [2] CRAN (R 4.1.2)
#>  hms            1.1.1   2021-09-26 [2] CRAN (R 4.1.2)
#>  htmltools      0.5.2   2021-08-25 [2] CRAN (R 4.1.2)
#>  httr           1.4.2   2020-07-20 [2] CRAN (R 4.1.2)
#>  ISAnalytics  * 1.4.3   2022-01-16 [1] Bioconductor
#>  jquerylib      0.1.4   2021-04-26 [2] CRAN (R 4.1.2)
#>  jsonlite       1.7.2   2020-12-09 [2] CRAN (R 4.1.2)
#>  knitr          1.37    2021-12-16 [2] CRAN (R 4.1.2)
#>  lattice        0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
#>  lifecycle      1.0.1   2021-09-24 [2] CRAN (R 4.1.2)
#>  lubridate      1.8.0   2021-10-07 [2] CRAN (R 4.1.2)
#>  magrittr     * 2.0.1   2020-11-17 [2] CRAN (R 4.1.2)
#>  mnormt         2.0.2   2020-09-01 [2] CRAN (R 4.1.2)
#>  munsell        0.5.0   2018-06-12 [2] CRAN (R 4.1.2)
#>  nlme           3.1-155 2022-01-13 [2] CRAN (R 4.1.2)
#>  pillar         1.6.4   2021-10-18 [2] CRAN (R 4.1.2)
#>  pkgconfig      2.0.3   2019-09-22 [2] CRAN (R 4.1.2)
#>  plyr           1.8.6   2020-03-03 [2] CRAN (R 4.1.2)
#>  psych          2.1.9   2021-09-22 [2] CRAN (R 4.1.2)
#>  purrr          0.3.4   2020-04-17 [2] CRAN (R 4.1.2)
#>  R.methodsS3    1.8.1   2020-08-26 [2] CRAN (R 4.1.2)
#>  R.oo           1.24.0  2020-08-26 [2] CRAN (R 4.1.2)
#>  R.utils        2.11.0  2021-09-26 [2] CRAN (R 4.1.2)
#>  R6             2.5.1   2021-08-19 [2] CRAN (R 4.1.2)
#>  Rcapture       1.4-3   2019-12-16 [2] CRAN (R 4.1.2)
#>  Rcpp           1.0.8   2022-01-13 [2] CRAN (R 4.1.2)
#>  readr          2.1.1   2021-11-30 [2] CRAN (R 4.1.2)
#>  RefManageR   * 1.3.0   2020-11-13 [2] CRAN (R 4.1.2)
#>  rlang          0.4.12  2021-10-18 [2] CRAN (R 4.1.2)
#>  rmarkdown      2.11    2021-09-14 [2] CRAN (R 4.1.2)
#>  sass           0.4.0   2021-05-12 [2] CRAN (R 4.1.2)
#>  scales         1.1.1   2020-05-11 [2] CRAN (R 4.1.2)
#>  sessioninfo  * 1.2.2   2021-12-06 [2] CRAN (R 4.1.2)
#>  stringi        1.7.6   2021-11-29 [2] CRAN (R 4.1.2)
#>  stringr        1.4.0   2019-02-10 [2] CRAN (R 4.1.2)
#>  tibble         3.1.6   2021-11-07 [2] CRAN (R 4.1.2)
#>  tidyr          1.1.4   2021-09-27 [2] CRAN (R 4.1.2)
#>  tidyselect     1.1.1   2021-04-30 [2] CRAN (R 4.1.2)
#>  tmvnsim        1.0-2   2016-12-15 [2] CRAN (R 4.1.2)
#>  tzdb           0.2.0   2021-10-27 [2] CRAN (R 4.1.2)
#>  utf8           1.2.2   2021-07-24 [2] CRAN (R 4.1.2)
#>  vctrs          0.3.8   2021-04-29 [2] CRAN (R 4.1.2)
#>  vroom          1.5.7   2021-11-30 [2] CRAN (R 4.1.2)
#>  withr          2.4.3   2021-11-30 [2] CRAN (R 4.1.2)
#>  xfun           0.29    2021-12-14 [2] CRAN (R 4.1.2)
#>  xml2           1.3.3   2021-11-30 [2] CRAN (R 4.1.2)
#>  yaml           2.2.1   2020-02-01 [2] CRAN (R 4.1.2)
#>  zip            2.2.0   2021-05-31 [2] CRAN (R 4.1.2)
#> 
#>  [1] /tmp/RtmplowxUB/Rinst3081361f546c34
#>  [2] /home/biocbuild/bbs-3.14-bioc/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
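
A listing in this format can be produced with session_info() from the sessioninfo package, which appears among the attached packages above. The following is a minimal sketch rather than the exact chunk used to build this vignette. The fallback to base R's sessionInfo() is an assumption added so the snippet also runs where sessioninfo is not installed.

```r
# Capture session details; prefer sessioninfo when it is installed,
# otherwise fall back to base R's utils::sessionInfo().
info <- if (requireNamespace("sessioninfo", quietly = TRUE)) {
    sessioninfo::session_info()
} else {
    utils::sessionInfo()
}

# Print the platform settings and the package versions in use.
print(info)
```

Attaching this output to a bug report makes it easier for others to reproduce the environment in which a result was obtained.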

9 Bibliography

This vignette was generated using BiocStyle (Oleś, 2022) with knitr (Xie, 2021) and rmarkdown (Allaire, Xie, McPherson, et al., 2021) running behind the scenes.

Citations made with RefManageR (McLean, 2017).

[1] J. Allaire, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 2.11. 2021. URL: https://github.com/rstudio/rmarkdown.

[2] G. Spinozzi, A. Calabria, S. Brasca, et al. “VISPA2: a scalable pipeline for high-throughput identification and annotation of vector integration sites”. In: BMC Bioinformatics (Nov. 25, 2017). DOI: 10.1186/s12859-017-1937-9.

[3] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.

[4] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.22.0. 2022. URL: https://github.com/Bioconductor/BiocStyle.

[5] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.37. 2021. URL: https://yihui.org/knitr/.