search_CPs {InPAS}R Documentation

Estimate the CP sites for UTRs on a given chromosome

Description

Estimate the CP sites for UTRs on a given chromosome

Usage

search_CPs(
  seqname,
  sqlite_db,
  utr3,
  background,
  z2s,
  depth.weight,
  genome,
  MINSIZE = 10,
  window_size = 100,
  search_point_START = 50,
  search_point_END = NA,
  cutStart = 10,
  cutEnd = 0,
  adjust_distal_polyA_end = TRUE,
  coverage_threshold = 5,
  long_coverage_threshold = 2,
  PolyA_PWM = NA,
  classifier = NA,
  classifier_cutoff = 0.8,
  shift_range = window_size,
  step = 1,
  two_way = FALSE,
  hugeData = TRUE,
  outdir,
  silence = FALSE
)

Arguments

seqname

A character(1) vector, specifying a chromososome/scaffold name

sqlite_db

A path to the SQLite database for InPAS, i.e. the output of setup_sqlitedb().

utr3

An object of GenomicRanges::GRanges. Output of extract_UTR3Anno() for a chromosome/scaffold

background

A character(1) vector, the range for calculating cutoff threshold of local background. It can be "same_as_long_coverage_threshold", "1K", "5K","10K", or "50K".

z2s

one element of an output of setup_CPsSearch() for Z-score cutoff values, which is the output of get_zScoreCutoff()

depth.weight

A named vector. One element of an output of setup_CPsSearch() for coverage depth weight, which is the output of get_depthWeight()

genome

A BSgenome::BSgenome object

MINSIZE

A integer(1) vector, specifying the minimal length in bp of a short/proximal 3' UTR. Default, 10

window_size

An integer(1) vector, the window size for novel distal or proximal CP site searching. default: 100.

search_point_START

A integer(1) vector, starting point relative to the 5' extremity of 3' UTRs for searching for proximal CP sites

search_point_END

A integer(1) vector, ending point relative to the 3' extremity of 3' UTRs for searching for proximal CP sites

cutStart

An integer(1) vector a numeric(1) vector. What percentage or how many nucleotides should be removed from 5' extremities before searching for CP sites? It can be a decimal between 0, and 1, or an integer greater than 1. 0.1 means 10 percent, 25 means cut first 25 bases

cutEnd

An integer(1) vector a numeric(1) vector. What percentage or how many nucleotides should be removed from 5' extremities before searching for CP sites? It can be a decimal between 0, and 1, or an integer greater than 1. 0.1 means 10 percent, 25 means cut first 25 bases

adjust_distal_polyA_end

A logical(1) vector. If true, distal CP sites are subject to adjustment by the Naive Bayes classifier from the cleanUpdTSeq::cleanUpdTSeq-package

coverage_threshold

An integer(1) vector, specifying the cutoff threshold of coverage for first 100 nucleotides. If the coverage of first 100 nucleotides is lower than coverage_threshold, that transcript will be not considered for further analysis. Default, 5.

long_coverage_threshold

An integer(1) vector, specifying the cutoff threshold of coverage for the terminal of long form 3' UTRs. If the coverage of first 100 nucleotides is lower than coverage_threshold, that transcript will be not considered for further analysis. Default, 2.

PolyA_PWM

An R object for a position weight matrix (PWM) for a hexamer polyadenylation signal (PAS), such as AAUAAA.

classifier

An R object for Naive Bayes classifier model, like the one in the cleanUpdTSeq package.

classifier_cutoff

A numeric(1) vector. A cutoff of probability that a site is classified as true CP sites. The value should be between 0.5 and 1. Default, 0.8.

shift_range

An integer(1) vector, specifying a shift range for adjusting the proximal and distal CP sites. Default, 100. It determines the range flanking the candidate CP sites to search the most likely real CP sites.

step

An integer (1) vector, specifying the step size used for adjusting the proximal or distal CP sites using the Naive Bayes classifier from the cleanUpdTSeq package. Default 1. It can be in the range of 1 to 10.

two_way

A logical (1), indicating whether the proximal CP sites are searched from both directions or not.

hugeData

A logical(1), indicating whether it is huge data

outdir

A character(1) vector, a path with write permission for storing the CP sites. If it doesn't exist, it will be created.

silence

logical(1), indicating whether progress is reported or not. By default, FALSE

Value

An object of GenomicRanges::GRanges containing distal and proximal CP site information for each 3' UTR

Author(s)

Jianhong Ou, Haibo Liu

See Also

search_proximalCPs(), adjust_proximalCPs(), adjust_proximalCPsByPWM(), adjust_proximalCPsByNBC(), get_PAscore(), get_PAscore2()

Examples

if (interactive()) {
   library(BSgenome.Mmusculus.UCSC.mm10)
   library(TxDb.Mmusculus.UCSC.mm10.knownGene)
   genome <- BSgenome.Mmusculus.UCSC.mm10
   TxDb <- TxDb.Mmusculus.UCSC.mm10.knownGene
   
   ## load UTR3 annotation and convert it into a GRangesList
   data(utr3.mm10)
   utr3 <- split(utr3.mm10, seqnames(utr3.mm10))
   
   bedgraphs <- system.file("extdata",c("Baf3.extract.bedgraph",
                                        "UM15.extract.bedgraph"), 
                           package = "InPAS")
   tags <- c("Baf3", "UM15")
   metadata <- data.frame(tag = tags, 
                          condition = c("Baf3", "UM15"),
                          bedgraph_file = bedgraphs)
   outdir = tempdir()
   write.table(metadata, file =file.path(outdir, "metadata.txt"), 
               sep = "\t", quote = FALSE, row.names = FALSE)
   
   sqlite_db <- setup_sqlitedb(metadata = file.path(outdir, 
                                         "metadata.txt"), outdir)
   coverage <- list()
   for (i in seq_along(bedgraphs)) {
   coverage[[tags[i]]] <- get_ssRleCov(bedgraph = bedgraphs[i],
                            tag = tags[i],
                            genome = genome,
                            sqlite_db = sqlite_db,
                            outdir = outdir,
                            removeScaffolds = TRUE,
                            BPPARAM = NULL)}
   coverage_files <- assemble_allCov(sqlite_db, 
                                    outdir, 
                                    genome, 
                                    removeScaffolds = TRUE)
   data4CPsSearch <- setup_CPsSearch(sqlite_db,
                                     genome,
                                     utr3,
                                     background = "10K",
                                     TxDb = TxDb,
                                     removeScaffolds = TRUE,
                                     BPPARAM = NULL,
                                     hugeData = TRUE,
                                     outdir = outdir)
   ## polyA_PWM
   load(system.file("extdata", "polyA.rda", package = "InPAS"))
   
   ## load the Naive Bayes classifier model from the cleanUpdTSeq package
   library(cleanUpdTSeq)
   data(classifier)
   
   CPs <- search_CPs(seqname = "chr6",
                     sqlite_db = sqlite_db, 
                     utr3 = utr3,
                     background = data4CPsSearch$background, 
                     z2s = data4CPsSearch$z2s,
                     depth.weight = data4CPsSearch$depth.weight,
                     genome = genome, 
                     MINSIZE = 10, 
                     window_size = 100,
                     search_point_START =50,
                     search_point_END = NA,
                     cutStart = 10, 
                     cutEnd = 0,
                     adjust_distal_polyA_end = TRUE,
                     coverage_threshold = 5,
                     long_coverage_threshold = 2,
                     PolyA_PWM = pwm, 
                     classifier = classifier,
                     classifier_cutoff = 0.8,
                     shift_range = 100,
                     step = 5,
                     two_way = FALSE,
                     hugeData = TRUE,
                     outdir = outdir)
}


[Package InPAS version 2.1.0 Index]