Contents

1 Important Note

Although it is not a requirement, MetaScope will fun faster and more efficiently for larger samples if the samtools package is present on your device. You can download it at https://github.com/samtools/samtools.

2 Introduction

MetaScope is a complete metagenomics profiling package that can accurately identify the composition of microbes within a sample at a strain-level resolution. MetaScope can be considered as an updated and expanded R translation of PathoScope 2.0, a Python-based metagenomic profiling package created by the Johnson lab. A few improvements made in MetaScope include using the BAM file format instead of the SAM file format for significantly less disk space usage, removing all dependencies to NCBI’s now defunct GI sequence annotations, and properly filtering reads that align to filter reference genomes. Functions to analyze host microbiome data are also planned to be added in future updates to the package.

The core functions within the package create a workflow that allows users to:

  1. Obtain reference genomes for target and filter microbes;
  2. Make aligner-specific indexes from genomes;
  3. Align reads to target genomes;
  4. Filter reads aligning to filter genomes;
  5. Reassign ambiguously mapped reads to their likely correct microbe of origin.

Due to the modular nature of the package, users can choose to either conduct all of their analysis or part of their analysis within the package. This is dependent on what data the user already has prior to using the package — for example, if the user has already aligned their reads then they could skip functions (1), (2), and (3).

There are two types of workflows that are included in the package: the Rbowtie2 and the Rsubread workflow. The Rbowtie2 and Rsubread workflows are essentially analogous to each other, with the major difference being that functions (2), (3), and (4) differ by the aligner utilized. The Rbowtie2 functions utilize the Bowtie2 aligner (Langmead 2012) whereas the Rsubread functions utilize the Rsubread aligner (Liao 2019). The nuances of how to use each function can be found by looking at each function’s help manual (R command: ?<name of function>). For reference, PathoScope2.0 uses the Bowtie2 aligner in its workflow.

In this vignette, we will analyze the mock data provided in the package via the Rbowtie2 workflow. We will utilize all the core functions in sequential order. We will make mention of the equivalent Rsubread function whenever an Rbowtie2 function is being used.

2.1 Installation

In order to install MetaScope from GitHub, run the following code:

if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools")
devtools::install_github("compbiomed/MetaScope")
suppressPackageStartupMessages({
  library(MetaScope)
  library(magrittr)
})
NCBI_key <- "01d22876be34df5c28f4aedc479a2674c809"
options("ENTREZ_KEY" = NCBI_key)

3 Data

The mock data provided in the package consists of simulated sequencing data generated from the SAMtools wgsim function (see DATA_README.txt in inst/extdata to see exact commands). The wgsim function is a tool which allows for the generation of FASTQ reads from a reference genome (FASTA). The mock data (reads.fastq) contains 1500 reads, of which 1000 reads are derived from the Staphylococcus aureus RF122 strain and 500 reads are derived from the Staphylococcus epidermidis RP62A strain. In this data set, we assume that the S. aureus RF122 reads are the reads of interest, and the S. epidermidis RP62A reads are known contaminant reads which should be removed during the analysis. Ideally, the microbial composition report (.csv) produced at the end of the analysis should contain only reads assigned to the S. aureus RF122 strain.

4 Reference Genome Library

The MetaScope genome library workflow is designed to assist with collection of sequence data from the National Center for Biotechnology Information (NCBI) nucleotide database. Prior to doing so, the potential targets and filters for the analysis should be identified. That is, what “target” species do you expect to find in your metagenomic sample that you would like to identify, and what reads would you like to “filter” out from the data that are not essential to your analysis?

Typically, the targets of the analysis are microbes (that is, viruses, bacteria, and fungi), and we wish to filter out or discard any reads from the host in addition to artificially added sequences, such as Phi X 174. Following identification of the targets and filters, we use a reference genome library to align the vast number of sample reads back to the respective regions of origin in various species.

The download_refseq() function automatically extracts custom reference genome libraries in a FASTA file for microbial or host genomes. The user must first indicate a taxon for which to download the genomes, such as ‘bacteria’ or ‘primates’. A table of possible entries can be viewed by accessing the MetaScope:::taxonomy_table object. The user may then specify whether they wish to download only the RefSeq reference genomes, or both the reference and representative genomes. The compress option then allows users to specify whether to compress the output FASTA file; this is done by default.

4.1 Downloading target genomes

Even though in this scenario we know exactly from where the reads in the mock data (reads.fastq) originate, in most cases we may only have a general idea of read origins. Therefore, in the following code, we will download the genome of the Staphylococcus aureus RF122 strain along with the genomes of a few other closely related Staphylococcus aureus strains from the NCBI RefSeq database. These genomes together will act as our target genome library.

target_ref_temp <- tempfile()
dir.create(target_ref_temp)

all_species <- c("Staphylococcus aureus RF122",
                 "Staphylococcus aureus subsp. aureus ST398",
                 "Staphylococcus aureus subsp. aureus Mu3")
sapply(all_species, download_refseq, 
       reference = FALSE, representative = FALSE, compress = TRUE,
       out_dir = target_ref_temp, caching = TRUE)
##                                                            Staphylococcus aureus RF122 
##               "/tmp/Rtmp7mLE9K/file247f77ac48787/Staphylococcus_aureus_RF122.fasta.gz" 
##                                              Staphylococcus aureus subsp. aureus ST398 
## "/tmp/Rtmp7mLE9K/file247f77ac48787/Staphylococcus_aureus_subsp._aureus_ST398.fasta.gz" 
##                                                Staphylococcus aureus subsp. aureus Mu3 
##   "/tmp/Rtmp7mLE9K/file247f77ac48787/Staphylococcus_aureus_subsp._aureus_Mu3.fasta.gz"

4.2 Downloading filter genomes

We will also download the genome of the Staphylococcus epidermidis RP62A strain from the NCBI nucleotide database, in an uncompressed FASTA format. This genome will act as our filter library.

filter_ref_temp <- tempfile()
dir.create(filter_ref_temp)

download_refseq(
  taxon = "Staphylococcus epidermidis RP62A",
  representative = FALSE, reference = FALSE,
  compress = TRUE, out_dir = filter_ref_temp,
  caching = TRUE)
## [1] "/tmp/Rtmp7mLE9K/file247f795f19e4/Staphylococcus_epidermidis_RP62A.fasta.gz"

5 Demultiplex

Sequence runs on NGS instruments are typically carried out with multiple samples pooled together. An index tag (also called a barcode) consisting of a unique sequence of between 6 and 12bp is added to each sample so that the sequence reads from different samples can be identified. For 16s experiments or sequencing conducted using an Illumina machine, the process of demultiplexing (dividing your sequence reads into separate files for each index tag/sample) and generating the FASTQ data files required for downstream analysis can be done using the MetaScope demultiplexing workflow. This consists of the demultiplex() function, which takes as arguments a matrix of sample names/barcodes, a FASTQ file of barcodes by sequence header, and a FASTQ file of reads corresponding to the barcodes. Based on the barcodes given, the function extracts all reads for the indexed barcode and writes all the reads from that barcode to separate FASTQ files.

This is an optional step in the analysis process depending on whether your reads are multiplexed. The reads which we are currently trying to analyze are not multiplexed and therefore this step is skipped in our analysis. The example shown below is using different reads that are barcoded in order to show the utility of the function.

# Get barcode, index, and read data locations
barcodePath <-
  system.file("extdata", "barcodes.txt", package = "MetaScope")
indexPath <- system.file("extdata", "virus_example_index.fastq",
                         package = "MetaScope")
readPath <-
  system.file("extdata", "virus_example.fastq", package = "MetaScope")

# Get barcode, index, and read data locations
demult <-
  demultiplex(barcodePath,
              indexPath,
              readPath,
              rcBarcodes = FALSE,
              hammingDist = 2,
              location = tempfile())
## Warning in XStringSet("DNA", x, start = start, end = end, width = width, :
## metadata columns on input DNAStringSet object were dropped
demult
##   SampleName Barcode NumberOfReads
## 1        CDV TCCACGT            25
## 2   LaCrosse ACAGGCT            25
## 3        RSV ATCGTGC            25
## 4       EboV ACTACAG            25
## 5    Measles AAGTCGC            25
## 6        VSV TCTCAGG            25

6 Alignment with Reference Libraries

After acquiring the target and filter genome libraries, we then take the sequencing reads from our sample and map them first to the target library and then to the filter library. MetaScope’s Rbowtie2 mapping function utilizes the Bowtie2 aligner (Langmead 2012) which maps reads to a reference genome using a full-text minute index based approach. Essentially, the algorithm extracts substrings which are referred to as “seeds” from the reads and aligns them to the reference genomes with the assistance from the full-text minute index. Seed alignments to the reference genomes are prioritized and then finally extended into full alignments using dynamic programming.

The MetaScope Rbowtie2 mapping workflow incorporates three functions - mk_bowtie_index(), align_target_bowtie(), and filter_host_bowtie() (analogous Rsubread mapping workflow also incorporates three functions - mk_subread_index(), align_target(), and filter_host). First, we use mk_bowtie_index(), a wrapper for the Rbowtie2::bowtie2_build function, to generate Bowtie2 compatible indexes from the reference genomes that were previously downloaded. The target and reference genome files (.fasta or .fasta.gz extension) must be placed into their own separate empty directories prior to using the function. This is due to the fact that the function will attempt to build the indexes from all the files present in the directory. The function will give an error if other files (other than .fasta or .fasta.gz) are present in the directory. Depending on the combined size of the reference genomes, the function will automatically create either small (.bt2) or large (.bt2l) Bowtie2 indexes.

The target and filter reference genomes downloaded in the previous step have been combined and renamed to target.fasta and filter.fasta respectively for convenience. These are the files from which the Bowtie2 indexes will be made from.

# Create temp directory to store the Bowtie2 indices
index_temp <- tempfile()
dir.create(index_temp)

# Create target index
mk_bowtie_index(
  ref_dir = target_ref_temp,
  lib_dir = index_temp,
  lib_name = "target",
  overwrite = TRUE
)
## arguments 'show.output.on.console', 'minimized' and 'invisible' are for Windows only
## Index building complete
## [1] "/tmp/Rtmp7mLE9K/file247f74a0d5ae4"
# Create filter index
mk_bowtie_index(
  ref_dir = filter_ref_temp,
  lib_dir = index_temp,
  lib_name = "filter",
  overwrite = TRUE
)
## arguments 'show.output.on.console', 'minimized' and 'invisible' are for Windows only
## Index building complete
## [1] "/tmp/Rtmp7mLE9K/file247f74a0d5ae4"

Following index creation, we will use the Bowtie2 aligner to map the reads to the target genomes with the align_target_bowtie() function (Rsubread equivalent: align_target()). The function takes as an input the location of the FASTQ file to align, the directory where the indexes are stored, the names of the indexes to align against, the directory where the BAM file should be written, and the basename of the output BAM file. In practice, align_target_bowtie() maps reads to each target library separately, removes the unmapped reads from each file, and finally merges and sorts by chromosome the BAM files from each library into a single output file (same with align_target). If SAMtools is installed on the machine and can be found by the Sys.which("samtools") R command, the BAM file will be directly created, otherwise an intermediate SAM file will be created prior to the creation of the BAM file which could potentially create issues if the SAM file is large and there is limited disk space. The default alignment parameters are the same as PathoScope 2.0’s default alignment parameters, but users can provide their own Bowtie 2 alignment settings if desired.

We will now align the sample reads (reads.fastq) to the target reference genomes using the Bowtie 2 indexes that we just built.

# Create a temp directory to store output bam file
output_temp <- tempfile()
dir.create(output_temp)

# Get path to example reads
readPath <-
  system.file("extdata", "reads.fastq", package = "MetaScope")

# Align reads to the target genomes
target_map <-
  align_target_bowtie(
    read1 = readPath,
    lib_dir = index_temp,
    libs = "target",
    align_dir = output_temp,
    align_file = "bowtie_target",
    overwrite = TRUE
  )
## [1] "Samtools not found on system. Using Rsamtools to create bam file"
## arguments 'show.output.on.console', 'minimized' and 'invisible' are for Windows only

The last step in the mapping workflow is to filter the output BAM file according to the reference genome for the filter/host species. Although we have already filtered out any unmapped reads, which may belong to one or more host species or otherwise, there may still remain some sort of unwelcome contamination in the data from the filter species which we wish to remove. To do this, we employ filter_host_bowtie() (Rsubread equivalent: filter_host()), which takes as an input the location of the BAM file created from align_target_bowtie(), the directory where the indexes are stored, and the names of the filter indexes to align against, to produce a sorted BAM file with any reads that match the filter libraries removed. We will then use this final BAM file downstream for further analysis.

final_map <-
  filter_host_bowtie(
    reads_bam = target_map,
    lib_dir = index_temp,
    libs = "filter",
    make_bam = TRUE, # Set to true to create BAM output
    # Default is to create simplified .csv.gz output
    # The .csv.gz output is much quicker to create!
    overwrite = TRUE,
    threads = 1
  )
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
## [1] "Samtools not found on system. Using Rsamtools to create bam file"
## arguments 'show.output.on.console', 'minimized' and 'invisible' are for Windows only

Prior to the last step in the analysis, we will look at the primary alignments of the mapped reads in the filtered BAM file that we just created using the filter_host_bowtie() function. According to the Bowtie2 manual, a primary alignment is described as the alignment that received the highest alignment score among all alignments for that read. When looking at the primary alignments of the mapped reads, we can see that the majority of reads have mapped to the correct Staphylococcus aureus RF122 strain. However, some residual reads have primary alignments to the other S. aureus strains which are incorrect. If we were to stop the analysis at this point, we could potentially be lead to believe that our sample has increased microbial diversity, when it actually does not.

bamFile <- Rsamtools::BamFile(final_map)

param <-
  Rsamtools::ScanBamParam(
    flag = Rsamtools::scanBamFlag(isSecondaryAlignment = FALSE),
    what = c("flag", "rname")
  ) #Gets info about primary alignments

aln <- Rsamtools::scanBam(bamFile, param = param)
accession_all <- aln[[1]]$rname
unique_accession_all <- unique(accession_all)
accession_all_inds <- match(accession_all, unique_accession_all)
unique_accession_genome_name <- suppressMessages(
  taxize::genbank2uid(unique_accession_all,
                      batch_size = length(unique_accession_all))) %>%
  vapply(function(x) attr(x, "name"), character(1))

genome_name_all <- unique_accession_genome_name[accession_all_inds] %>%
  gsub(',.*', '', .) %>%
  gsub("(ST398).*", "\\1", .) %>%
  gsub("(N315).*", "\\1", .) %>%
  gsub("(Newman).*", "\\1", .) %>%
  gsub("(Mu3).*", "\\1", .) %>%
  gsub("(Mu50).*", "\\1", .) %>%
  gsub("(RF122).*", "\\1", .)
read_count_table <- sort(table(genome_name_all), decreasing = TRUE)
knitr::kable(
  read_count_table,
  col.names = c("Genome Assigned", "Read Count"))
Genome Assigned Read Count
Staphylococcus aureus RF122 853
Staphylococcus aureus subsp. aureus Mu3 57
Staphylococcus aureus subsp. aureus ST398 49

We can also look at the secondary alignments of the mapped reads within our filtered BAM file. A secondary alignment occurs when a read maps to multiple different genomes. We can see that the majority of our secondary alignments are to the other Staphylococcus aureus strains, which makes sense considering that the majority of the primary alignments were to the correct Staphylococcus aureus RF122 strain.

bamFile <- Rsamtools::BamFile(final_map)

param <-
  Rsamtools::ScanBamParam(
    flag = Rsamtools::scanBamFlag(isSecondaryAlignment = TRUE),
    what = c("flag", "rname")
  ) #Gets info about secondary alignments

aln <- Rsamtools::scanBam(bamFile, param = param)
accession_all <- aln[[1]]$rname
unique_accession_all <- unique(accession_all)
accession_all_inds <- match(accession_all, unique_accession_all)
unique_accession_taxid <-
  suppressMessages(
    taxize::genbank2uid(unique_accession_all,
                        batch_size = length(unique_accession_all)))
unique_accession_genome_name <-
  vapply(unique_accession_taxid, function(x)
    attr(x, "name"), character(1))
genome_name_all <- unique_accession_genome_name[accession_all_inds]
genome_name_all <- gsub(',.*', '', genome_name_all)
genome_name_all <- gsub("(ST398).*", "\\1", genome_name_all)
genome_name_all <- gsub("(N315).*", "\\1", genome_name_all)
genome_name_all <- gsub("(Newman).*", "\\1", genome_name_all)
genome_name_all <- gsub("(Mu3).*", "\\1", genome_name_all)
genome_name_all <- gsub("(Mu50).*", "\\1", genome_name_all)
genome_name_all <- gsub("(RF122).*", "\\1", genome_name_all)
read_count_table <- sort(table(genome_name_all), decreasing = TRUE)
knitr::kable(
  read_count_table,
  col.names = c("Genome Assigned", "Read Count"))
Genome Assigned Read Count
Staphylococcus aureus subsp. aureus ST398 2550
Staphylococcus aureus subsp. aureus Mu3 1728
Staphylococcus aureus RF122 171

7 Genome Identification

Following the proper alignment of a sample to all target and filter libraries of interest, we may proceed in identifying which genomes are most likely to be represented in the sample. This identification workflow is the core of MetaScope; it features a Bayesian read reassignment model which dramatically improves specificity and sensitivity over other methods (Francis et. al 2013). This is because such a method identifies reads with unique alignments and uses them to guide the reassignment of reads with ambiguous alignments.

The identification workflow consists of a single function, MetaScope_ID(), which reads in a .bam file, annotates the taxonomy and genome names, reduces the mapping ambiguity using a mixture model, and outputs a .csv file with the results. Currently, it assumes that the genome library/.bam files use NCBI accession names for reference names.

metascope_id(
  final_map,
  input_type = "bam",
  # change input_type to "csv.gz" when not creating a BAM
  aligner = "bowtie2",
  num_species_plot = 0
)
## # A tibble: 2 × 6
##   TaxonomyID Genome                   read_count Proportion readsEM EMProportion
##   <chr>      <chr>                         <int>      <dbl>   <dbl>        <dbl>
## 1 273036     Staphylococcus aureus R…        957    0.998       957      0.998  
## 2 418127     Staphylococcus aureus s…          2    0.00209       2      0.00205

We will now look at the read reassignment results reported in the output CSV file.

relevant_col <- dirname(final_map) %>%
  file.path("bowtie_target.metascope_id.csv") %>%
  read.csv() %>% dplyr::select(2:4)

relevant_col |>
  dplyr::mutate(
    Genome = stringr::str_replace_all(Genome, ',.*', ''),
    Genome = stringr::str_replace_all(Genome, "(ST398).*", "\\1"),
    Genome = stringr::str_replace_all(Genome, "(N315).*", "\\1"),
    Genome = stringr::str_replace_all(Genome, "(Newman).*", "\\1"),
    Genome = stringr::str_replace_all(Genome, "(Mu3).*", "\\1"),
    Genome = stringr::str_replace_all(Genome, "(RF122).*", "\\1")
  ) |>
  knitr::kable()
Genome read_count Proportion
Staphylococcus aureus RF122 957 0.9979145
Staphylococcus aureus subsp. aureus Mu3 2 0.0020855
unlink(".bowtie2.cerr.txt")

We can see that the read reassignment function has reassigned the majority of the ambiguous alignments back to the Staphylococcus aureus RF122 strain, the correct strain of origin.

8 Session Info

sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] magrittr_2.0.3   MetaScope_1.0.0  BiocStyle_2.28.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0        dplyr_1.1.2             blob_1.2.4             
##  [4] urltools_1.7.3          filelock_1.0.2          Biostrings_2.68.0      
##  [7] bitops_1.0-7            fastmap_1.1.1           RCurl_1.98-1.12        
## [10] reshape_0.8.9           BiocFileCache_2.8.0     digest_0.6.31          
## [13] sparsesvd_0.2-2         lifecycle_1.0.3         RSQLite_2.3.1          
## [16] compiler_4.3.0          rlang_1.1.1             sass_0.4.5             
## [19] tools_4.3.0             utf8_1.2.3              yaml_2.3.7             
## [22] data.table_1.14.8       knitr_1.42              conditionz_0.1.0       
## [25] bit_4.0.5               curl_5.0.0              plyr_1.8.8             
## [28] xml2_1.3.4              BiocParallel_1.34.0     httpcode_0.3.0         
## [31] withr_2.5.0             purrr_1.0.1             BiocGenerics_0.46.0    
## [34] triebeard_0.4.1         grid_4.3.0              stats4_4.3.0           
## [37] qlcMatrix_0.9.7         fansi_1.0.4             iterators_1.0.14       
## [40] docopt_0.7.1            crul_1.3                taxize_0.9.100         
## [43] cli_3.6.1               rmarkdown_2.21          crayon_1.5.2           
## [46] generics_0.1.3          httr_1.4.5              DBI_1.1.3              
## [49] ape_5.7-1               cachem_1.0.8            stringr_1.5.0          
## [52] zlibbioc_1.46.0         parallel_4.3.0          BiocManager_1.30.20    
## [55] XVector_0.40.0          vctrs_0.6.2             Matrix_1.5-4           
## [58] jsonlite_1.8.4          slam_0.1-50             bookdown_0.33          
## [61] IRanges_2.34.0          S4Vectors_0.38.0        bit64_4.0.5            
## [64] foreach_1.5.2           jquerylib_0.1.4         glue_1.6.2             
## [67] Rbowtie2_2.6.0          codetools_0.2-19        stringi_1.7.12         
## [70] GenomeInfoDb_1.36.0     GenomicRanges_1.52.0    tibble_3.2.1           
## [73] pillar_1.9.0            htmltools_0.5.5         GenomeInfoDbData_1.2.10
## [76] R6_2.5.1                dbplyr_2.3.2            bold_1.2.0             
## [79] evaluate_0.20           lattice_0.21-8          Rsamtools_2.16.0       
## [82] memoise_2.0.1           bslib_0.4.2             Rcpp_1.0.10            
## [85] uuid_1.1-0              nlme_3.1-162            xfun_0.39              
## [88] zoo_1.8-12              pkgconfig_2.0.3