optimize.sd_selection {BioTIP}    R Documentation

Optimization of sd selection

Usage

optimize.sd_selection(
  df,
  samplesL,
  B = 100,
  percent = 0.8,
  times = 0.8,
  cutoff = 0.01,
  method = c("other", "reference", "previous", "itself", "longitudinal reference"),
  control_df = NULL,
  control_samplesL = NULL
)

Arguments

df

A dataframe of numerics. The rows and columns represent unique transcript IDs (geneID) and sample names, respectively.

samplesL

A list of n vectors, where n equals the number of states. Each vector gives the sample names in a state. Note that the sample names in these vectors must be among the column names of the R object 'df'.
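As a minimal sketch of the expected input shape, samplesL can be built with base R's split() and checked against the column names of df. The object names (df, meta) and the state labels below are illustrative assumptions, not part of BioTIP.

```r
# Toy expression table: 4 transcripts (rows) x 6 samples (columns).
# All names here are illustrative assumptions.
set.seed(1)
df <- as.data.frame(matrix(rnorm(24), nrow = 4,
                           dimnames = list(paste0("gene", 1:4),
                                           paste0("s", 1:6))))

# Sample annotation: which state each sample belongs to.
meta <- data.frame(sample = colnames(df),
                   state  = rep(c("state1", "state2"), each = 3),
                   stringsAsFactors = FALSE)

# One character vector of sample names per state.
samplesL <- split(meta$sample, f = meta$state)

# Every listed sample name must be a column of df.
stopifnot(all(unlist(samplesL) %in% colnames(df)))
```

Checking the names up front gives a clearer failure than an indexing error later on.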

B

An integer indicating the number of times to run this optimization. Default is 100.

percent

A numeric value indicating the percentage of samples that will be selected in each round of simulation. Default is 0.8.

times

A numeric value indicating the percentage of the B rounds in which a transcript needs to be selected in order to be considered a stable signature. Default is 0.8.

cutoff

A positive numeric value. Default is 0.01. If < 1, the top x% of transcripts are selected automatically using the chosen selection method (which is either the reference, other or previous stage); e.g. by default it will select the top 1% of transcripts.
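The way B, percent, times and cutoff interact can be illustrated with a small base R sketch. This is not the BioTIP implementation: stable_by_sd() is a hypothetical helper that works on a single state, ranks transcripts by their own standard deviation, and ignores the method argument entirely.

```r
# Illustrative sketch only: NOT the BioTIP implementation.
# stable_by_sd() is a hypothetical helper for a single state.
stable_by_sd <- function(mat, B = 100, percent = 0.8,
                         times = 0.8, cutoff = 0.01) {
  n_pick <- max(2, floor(ncol(mat) * percent))   # samples drawn per round
  n_top  <- max(1, ceiling(nrow(mat) * cutoff))  # transcripts kept per round
  hits   <- setNames(integer(nrow(mat)), rownames(mat))  # selection counter

  for (b in seq_len(B)) {
    cols <- sample(ncol(mat), n_pick)                # subsample, no replacement
    sds  <- apply(mat[, cols, drop = FALSE], 1, sd)  # per-transcript sd
    top  <- names(sort(sds, decreasing = TRUE))[seq_len(n_top)]
    hits[top] <- hits[top] + 1L
  }

  # Keep transcripts selected in at least times * B rounds.
  names(hits)[hits >= times * B]
}

# Toy data: 200 transcripts x 20 samples, with one highly variable transcript.
set.seed(42)
mat <- matrix(rnorm(200 * 20), nrow = 200,
              dimnames = list(paste0("gene", 1:200), paste0("s", 1:20)))
mat["gene1", ] <- mat["gene1", ] * 50
stable <- stable_by_sd(mat, B = 20, cutoff = 0.01)
```

In this sketch a transcript counts as stable only if it ranks in the top cutoff fraction in at least times * B of the B subsampling rounds, which is why a larger times value gives a shorter, more conservative list.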

method

Selection of methods from reference, other, previous, itself, or longitudinal reference; the default is other. Partial matching is enabled. Some specific requirements for each option:

  • reference: the reference has to be the first.

  • previous: make sure samplesL is in the right order, from benign to malignant.

  • itself: make sure the cutoff is smaller than 1.

  • longitudinal reference: make sure control_df and control_samplesL are not NULL. control_df must have the same number of rows as df, and all transcripts in df must also be in control_df.
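The requirements for longitudinal reference can be checked before calling the function. The following is a minimal sketch: check_longitudinal_inputs() is a hypothetical helper, not a BioTIP function, and it only encodes the conditions stated in the list above.

```r
# Hypothetical pre-flight check; not part of BioTIP.
check_longitudinal_inputs <- function(df, control_df, control_samplesL) {
  # Both control arguments must be supplied.
  if (is.null(control_df) || is.null(control_samplesL)) {
    stop("control_df and control_samplesL must both be supplied")
  }
  # Same number of rows in both tables.
  if (nrow(control_df) != nrow(df)) {
    stop("control_df must have the same number of rows as df")
  }
  # Every transcript in df must also appear in control_df.
  missing_ids <- setdiff(rownames(df), rownames(control_df))
  if (length(missing_ids) > 0) {
    stop("transcripts missing from control_df: ",
         paste(missing_ids, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy inputs (all names are illustrative assumptions).
df <- matrix(1:6, nrow = 2,
             dimnames = list(c("loci1", "loci2"), c("a", "b", "c")))
control_df <- matrix(6:1, nrow = 2,
                     dimnames = list(c("loci1", "loci2"), c("x", "y", "z")))
control_samplesL <- list(stage1 = c("x", "y"), stage2 = "z")

ok <- check_longitudinal_inputs(df, control_df, control_samplesL)
```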

control_df

A count matrix with unique loci as row names and names of control samples as column names; only used for method longitudinal reference.

control_samplesL

A list of characters with stages as names of control samples; required for method 'longitudinal reference'.

Value

A list of dataframes of filtered transcripts: those with the highest standard deviation are selected from df based on the assigned cutoff value. Each resulting dataframe represents a subset of the raw input df.

Description

optimize.sd_selection filters a multi-state dataset based on a cutoff value for standard deviation per state and optimizes the selection. By default, a cutoff value of 0.01 is used. Suggested if each state contains more than 10 samples.

Examples

counts = matrix(sample(1:100, 30), 2, 30)
colnames(counts) = 1:30
row.names(counts) = paste0('loci', 1:2)
cli = cbind(1:30, rep(c('state1', 'state2', 'state3'), each = 10))
colnames(cli) = c('samples', 'group')
samplesL <- split(cli[, 1], f = cli[, 'group'])
test_sd_selection <- optimize.sd_selection(counts, samplesL, B = 3, cutoff = 0.01)

See Also

sd_selection

Author(s)

Zhezhen Wang zhezhen@uchicago.edu


[Package BioTIP version 1.2.0 Index]