Contents

1 Introduction

The genoset package offers an extension of the familiar Bioconductor RangedSummarizedExperiment object for genome assays: the GenoSet class. The GenoSet class provides additional API features (e.g. indexing by genome position). The genoset package also provides a number of convenient functions for working with data associated with with genome locations.

1.1 Creating Objects

In typical Bioconductor style, GenoSet objects, and derivatives, can be created using the functions with the same name. Let’s create some fake data to experiment with. Don’t worry too much about how the fake data gets made. Notice how assays can be matrices or RleDataFrames. They can also be BigMatrix or BigMatrixFactor objects (from bigmemoryExtras).

library(genoset)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
## 
##     aperm, apply, rowsum
sample.names = LETTERS[11:13]
probe.names = paste("p", 1:1000, sep="")
num.samples = length(sample.names)
num.probes  = length(probe.names)

locs = GRanges(
  ranges= IRanges(
      start=c(seq(from=125e6,by=3e4,length=400), seq(from=1,length=400,by=3.25e4),
              seq(from=30e6,length=200,by=3e4)),width=1,names=probe.names),
  seqnames=factor(c(rep("chr8",400), rep("chr12",400),rep("chr17",200)),levels=c("chr8","chr12","chr17")))

fake.cn = matrix(c(
  c(rnorm(200,.4,0.05),rnorm(200,.2,0.05),rnorm(200,0.23,0.05),rnorm(200,.15,0.05),rnorm(200,2.,0.05)), 
  c(rnorm(200,0,0.05), rnorm(200,3,0.05),rnorm(200,14,0.05),rnorm(200,.1,0.05),rnorm(200,-0.05,0.05)),
  c(rnorm(200,.1,0.05),rnorm(200,1,0.05),rnorm(200,-.5,0.05),rnorm(200,3,0.05),rnorm(200,3,0.05))
  ),
  nrow=num.probes,ncol=num.samples,dimnames=list(probe.names,sample.names))
fake.pData=data.frame(matrix(LETTERS[1:15],nrow=3,ncol=5,dimnames=list(sample.names,letters[1:5])))

gs = GenoSet( rowRanges=locs, assays=list(cn=fake.cn), colData=fake.pData )
gs
## class: GenoSet 
## dim: 1000 3 
## metadata(0):
## assays(1): cn
## rownames(1000): p1 p2 ... p999 p1000
## rowData names(0):
## colnames(3): K L M
## colData names(5): a b c d e
rle.ds = GenoSet( rowRanges=locs,
                 assays=list(cn = fake.cn,
                             cn.segments=RleDataFrame(
                                 K=Rle(c(rep(1.5,300),rep(2.3,700))),L=Rle( c(rep(3.2,700),rep(2.1,300)) ),
                                 M=Rle(rep(1.1,1000)),row.names=rownames(fake.cn))),
                             colData=fake.pData
  )

Let’s have look at what’s inside these objects.

names(assays(rle.ds))
## [1] "cn"          "cn.segments"
head( rle.ds[,,"cn"] )
##            K            L             M
## p1 0.3511608 -0.039222318  0.0564429085
## p2 0.4306183  0.027664734  0.0722065536
## p3 0.3914544  0.066228528  0.0886733596
## p4 0.4163652  0.045551624  0.0654722401
## p5 0.4559186 -0.004197497  0.0873141267
## p6 0.3685728 -0.073295289 -0.0004783453
head( rle.ds[,,"cn.segments"] )
## RleDataFrame with 6 rows and 3 columns
## rownames: p1, p2, p3, p4, p5, p6, ...
## $K
## numeric-Rle of length 6 with 1 run
##   Lengths:   6
##   Values : 1.5
## 
## $L
## numeric-Rle of length 6 with 1 run
##   Lengths:   6
##   Values : 3.2
## 
## $M
## numeric-Rle of length 6 with 1 run
##   Lengths:   6
##   Values : 1.1

1.2 Accessing Genome Information

Now lets look at some special functions for accessing genome information from a genoset object. These functions are all defined for GenoSet and GenomicRanges objects. We can access per-feature information as well as summaries of chromosome boundaries in base-pair or row-index units.

There are a number of functions for getting portions of the locData data. chr and pos return the chromosome and position information for each feature. genoPos is like pos, but it returns the base positions counting from the first base in the genome, with the chromosomes in order by number and then alphabetically for the letter chromosomes. chrInfo returns the genoPos of the first and last feature on each chromosome in addition to the offset of the first feature from the start of the genome. chrInfo results are used for drawing chromosome boundaries on genome-scale plots. pos and genoPos are defined as the floor of the average of each features start and end positions.

head( rowRanges(gs) )
## GRanges object with 6 ranges and 0 metadata columns:
##      seqnames    ranges strand
##         <Rle> <IRanges>  <Rle>
##   p1     chr8 125000000      *
##   p2     chr8 125030000      *
##   p3     chr8 125060000      *
##   p4     chr8 125090000      *
##   p5     chr8 125120000      *
##   p6     chr8 125150000      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
chrNames(gs)
## [1] "chr8"  "chr12" "chr17"
chrOrder(c("chr12","chr12","chrX","chr8","chr7","chrY"))
## [1] "chr7"  "chr8"  "chr12" "chr12" "chrX"  "chrY"
chrInfo(gs)
##           start      stop    offset
## chr8          1 136970000         0
## chr12 136970001 149937501 136970000
## chr17 149937502 185907501 149937501
chrIndices(gs)
##       first last offset
## chr8      1  400      0
## chr12   401  800    400
## chr17   801 1000    800
head(chr(gs))
## [1] "chr8" "chr8" "chr8" "chr8" "chr8" "chr8"
head(start(gs))
## [1] 125000000 125030000 125060000 125090000 125120000 125150000
head(end(gs))
## [1] 125000000 125030000 125060000 125090000 125120000 125150000
head(pos(gs))
## [1] 125000000 125030000 125060000 125090000 125120000 125150000
head(genoPos(gs))
##      chr8      chr8      chr8      chr8      chr8      chr8 
## 125000000 125030000 125060000 125090000 125120000 125150000

1.3 Genome Order

GenoSet and GenomicRanges objects can be set to, and checked for, genome order. Weak genome order requires that features be ordered within each chromosome. Strong genome order requires a certain order of chromosomes as well. Features must be ordered so that features from the same chromosome are in contiguous blocks.

Certain methods on GenoSet objects expect the rows to be in genome order. Users are free to rearrange rows within chromosome as they please.

The proper order of chromosomes is desirable for full-genome plots and is specified by the chrOrder function. The object creation method Genoset creates objects in strict genome order.

chrOrder(chrNames(gs))
## [1] "chr8"  "chr12" "chr17"
gs = toGenomeOrder(gs, strict=TRUE)
isGenomeOrder(gs, strict=TRUE)
## [1] TRUE

1.4 Using the Subset Features

GenoSet objects can be subset using array notation. The `features'' index can be a set of ranges or the usual logical, numeric or character indices.chrIndices` with a chromosome argument is a convenient way to get the indices needed to subset by chromosome.

Subset by chromosome

chr12.ds = gs[ chrIndices(gs,"chr12"), ]
dim(chr12.ds)
## [1] 400   3
chrIndices(chr12.ds)
##       first last offset
## chr12     1  400      0

Subset by a collection of gene locations

gene.gr = GRanges(ranges=IRanges(start=c(35e6,127e6),end=c(35.5e6,129e6),                       
                       names=c("HER2","CMYC")), seqnames=c("chr17","chr8"))
gene.ds = gs[ gene.gr, ]
dim(gene.ds)
## [1] 84  3
chrIndices(gene.ds)
##       first last offset
## chr8      1   67      0
## chr17    68   84     67

GenoSet objects can also be subset by a group of samples and/or features, just like an ExpressionSet, or a matrix for that matter.

dim(gs[1:4,1:2])
## [1] 4 2

eSet-derived classes tend to have special functions to get and set specific assayDataElement members (the big data matrices). For example, ExpressionSet has the exprs function. It is common to put other optional matrices in assayData too (genotypes, quality scores, etc.). These can be get and set with the assayDataElement function, but typing that out can get old. GenoSet and derived classes use the ``k’’ argument to the matrix subsetting bracket to subset from a specific assayDataElement. In addition to saving some typing, you can directly use a set of ranges to subset the assayDataElement.

all( gs[ 1:4,1:2,"cn"] == assay(gs,"cn")[1:4,1:2] )
## [1] TRUE

2 Processing Data

2.1 Correction of Copy Number for local GC content

Copy number data generally shows a GC content effect that appears as slow `waves'' along the genome (Diskin et al., NAR, 2008). The functiongcCorrect` can be used to remove this effect resulting in much clearer data and more accurate segmentation. GC content is best measured as the gc content in windows around each feature, about 2Mb in size.

library(BSgenome.Hsapiens.UCSC.hg19)
gc = rnorm(nrow(gs))
gs[,,"cn"] = gcCorrect(gs[,,"cn"],gc)

2.2 Segmentation

Segmentation is the process of identifying blocks of the genome in each sample that have the same copy number value. It is a smoothing method that attempts to replicate the biological reality where chunks of chromosome have been deleted or amplified.

Genoset contains a convenience function for segmenting data for each sample/chr using the DNAcopy package (the CBS algorithm). Genoset adds features to split jobs among processor cores. When the library parallel is loaded, the argument n.cores can control the number of processor cores utilized.

Additionally, GenoSet stores segment values so that they can be accessed quickly at both the feature and segment level. We use a RleDataFrame object where each column is a Run-Length-Encoded Rle object. This dramatically reduces the amount of memory required to store the segments. Note how the segmented values become just another member of the assayData slot.

Try running CBS directly

library(DNAcopy)
cbs.cna = CNA(gs[,,"cn"], chr(gs), pos(gs) )
cbs.smoothed.CNA = smooth.CNA( cbs.cna )
cbs.segs = segment( cbs.cna )
## Analyzing: Sample.1 
## Analyzing: Sample.2 
## Analyzing: Sample.3

Or use the convenience function runCBS

gs[,,"cn.segs"] = runCBS(gs[,,"cn"],rowRanges(gs))
## Working on segmentation for sample number 1 : K
## Working on segmentation for sample number 2 : L
## Working on segmentation for sample number 3 : M

Try it with parallel

library(parallel)
gs[,,"cn.segs"] = runCBS(gs[, , "cn"],rowRanges(gs), n.cores=3)
gs[,,"cn.segs"][1:5,1:3]

Other segmenting methods can also be used of course.

This function makes use of the parallel package to run things in parallel, so plan ahead when picking ``n.cores’’. Memory usage can be a bit hard to predict.

2.3 Segments as tables or runs

Having segmented the data for each sample, you may want to explore different representations of the segments. Genoset describes data in genome segments two ways: 1) as a table of segments, and 2) a Run-Length-Encoded vector. Tables of segments are useful for printing, overlap queries, database storage, or for summarizing changes in a sample. Rle representations can be used like regular vectors, plotted as segments (see genoPlot), and stored efficiently. A collection of Rle objects, one for each sample, are often stored as one RleDataFrame in a GenoSet. genoset provides functions to quickly flip back and forth between table and Rle representations. You can use these functions on single samples, or the whole collection of samples.

head( gs[,,"cn.segs"] )
## RleDataFrame with 6 rows and 3 columns
## rownames: p1, p2, p3, p4, p5, p6, ...
## $K
## numeric-Rle of length 6 with 1 run
##   Lengths:      6
##   Values : 0.3961
## 
## $L
## numeric-Rle of length 6 with 1 run
##   Lengths:       6
##   Values : -0.0016
## 
## $M
## numeric-Rle of length 6 with 1 run
##   Lengths:     6
##   Values : 0.105
segs = segTable( gs[,2,"cn.segs"], rowRanges(gs) )
list.of.segs = segTable( gs[,,"cn.segs"], rowRanges(gs) )
rbind.list.of.segs = segTable( gs[,,"cn.segs"], rowRanges(gs), stack=TRUE )
two.kinds.of.segs = segPairTable( gs[,2,"cn.segs"], gs[,3,"cn.segs"], rowRanges(gs) )

rle = segs2Rle( segs, rowRanges(gs) )
rle.df = segs2RleDataFrame( list.of.segs, rowRanges(gs) )

bounds = matrix( c(1,3,4,6,7,10),ncol=2,byrow=TRUE)
cn = c(1,3,2)
rle = bounds2Rle( bounds, cn, 10 )

segPairTable summarizes two Rle objects into segments that have one unique value for each Rle. This is useful for cases where you want genome regions with one copy number state, and one LOH state, for example.

bounds2Rle is convenient if you already know the genome feature indices corresponding to the bounds of each segment.

Currently we use data.frames for tables of segments. In the near future these will have colnames that will make it easy to coerce these to GRanges. Coercion to GRanges takes a while, so we don’t do that by default.

2.4 Gene Level Summaries

Analyses usually start with SNP or probeset level data. Often it is desirable to get summaries of assayData matrices over an arbitrary set of ranges, like exons, genes or cytobands. The function rangeSampleMeans serves this purpose. Given a GenomicRanges of arbitrary genome ranges and a GenoSet object, rangeSampleMeans will return a matrix of values with a row for each range.

rangeSampleMeans uses boundingIndicesByChr to select the features bounding each range. The bounding features are the features with locations equal to or within the start and end of the range. If no feature exactly matches an end of the range, the nearest features outside the range will be used. This bounding ensures that the full extent of the range is accounted for, and more importantly, at least two features are included for each gene, even if the range falls between two features.

rangeColMeans is used to do a fast average of each of a set of such bounding indices for each sample. These functions are optimized for speed. For example, with 2.5M features and ~750 samples, it takes 0.12 seconds to find the features bounded by all Entrez Genes (one RefSeq each). Calculating the mean value for each gene and sample takes ~9 seconds for a matrix of array data and ~30 seconds for a RleDataFrame of compressed Rle objects.

Generally, you will want to summarize segmented data and will be working with a RleDataFrame, like that returned by runCBS.

As an example, let’s say you want to get the copynumber of your two favorite genes from the subsetting example:

Get the gene-level summary: <<genelevel, eval=TRUE>>= boundingIndicesByChr( gene.gr, rowRanges(gs) ) rangeSampleMeans(gene.gr, gs, “cn.segs”) ```

3 Plots

Genoset has some handy functions for plotting data along the genome. Segmented data knows'' it should be plotted as lines, rather than points. One often wants to plot just one chromosome, so a convenience argument for chromosome subsetting is provided. Like `plot`, `genoPlot` plots x against y. 'x' can be some form of location data, like a `GenoSet` or `GenomicRanges`. 'y' is some form of data along those coordinates, like a numeric vector or `Rle`. `genoPlot` marks chromosome boundaries and labels positions inbp’‘, kb'',Mb’‘, or ``Gb’’ units as appropriate.

genoPlot(gs, gs[ , 1, "cn"])
genoPlot(gs, gs[ , 1, "cn.segs"], add=TRUE, col="red")

Segmented copy number across the genome for 1st sample}

genoPlot(gs,gs[,1,"cn"],chr="chr12")
genoPlot(gs,gs[,1,"cn.segs"],chr="chr12",add=TRUE, col="red")

Segmented copy number across chromosome 12 for 1st sample

Plot data without a GenoSet object using numeric or Rle data:

chr12.ds = gs[chr(gs) == "chr12",]
genoPlot(pos(chr12.ds),chr12.ds[,1,"cn"],locs=rowRanges(chr12.ds))  # Numeric data and location
genoPlot(pos(chr12.ds),chr12.ds[,1,"cn.segs"],add=TRUE, col="red") # Rle data and numeric position

3.1 Processing B-Allele Frequency Data

B-Allele Frequency (BAF) data can be converted into the ``Modified BAF’’ or mBAF metric, introduced by Staaf, et al., 2008. mBAF folds the values around the 0.5 axis and makes the HOM positions NA. The preferred way to identify HOMs is to use genotype calls from a matched normal (AA, AC, AG, etc.), but NA’ing greater than a certain value works OK. A hom.cutoff of 0.90 is suggested for Affymetrix arrays and 0.95 for Illumina arrays, following Staaf, et al.

Return data as a matrix:

fake.baf  = sample(c(0,0.5,1), length(probe.names), replace=TRUE) + rnorm(length(probe.names),0,0.01)
fake.baf[ fake.baf > 1 ] = fake.baf[ fake.baf > 1 ] - 1
fake.baf[ fake.baf < 0 ] = fake.baf[ fake.baf < 0 ] + 1
hets = fake.baf < 0.75 & fake.baf > 0.25
fake.baf[ 101:200 ][ hets[101:200] ] = fake.baf[ 101:200 ][ hets[101:200] ] + rep(c(-0.2,0.2),50)[hets[101:200]]
fake.baf = matrix(fake.baf,nrow=num.probes,ncol=num.samples,dimnames=list(probe.names,sample.names))

baf.ds = GenoSet( rowRanges=locs, assays=list(lrr=fake.cn, baf=fake.baf), colData=fake.pData )
baf.ds[, , "mbaf"] = baf2mbaf(baf.ds[, , "baf"], hom.cutoff = 0.90)

… or use compress it to an RleDataFrame. This uses ~1/3 the space on our random test data.

mbaf.data = RleDataFrame( sapply(colnames( baf.ds),
  function(x) { Rle( baf.ds[,x, "mbaf"] ) },
  USE.NAMES=TRUE, simplify=FALSE ) )
as.numeric(object.size( baf.ds[, ,"mbaf"]))  / as.numeric( object.size(mbaf.data))
## [1] 3.508228

Using the HOM SNP calls from the matched normal works much better. A matrix of genotypes can be used to set the HOM SNPs to NA. A list of sample names matches the columns of the genotypes to the columns of your baf matrix. The names of the list should match column names in your baf matrix and the values of the list should match the column names in your genotype matrix. If this method is used and some columns in your baf matrix do not have an entry in this list, then those baf columns are cleaned of HOMs using the hom.cutoff, as above.

Both mBAF and LRR can and should be segmented. Consider storing mBAF as an RleDataFrame as only the ~1/1000 HET positions are being used and all those NA HOM positions will compress nicely.

baf.ds[,,"baf.segs"] = runCBS( baf.ds[, ,"mbaf"], rowRanges(baf.ds) )
## Working on segmentation for sample number 1 : K
## Working on segmentation for sample number 2 : L
## Working on segmentation for sample number 3 : M
baf.ds[,,"lrr.segs"] = runCBS( baf.ds[, , "lrr"], rowRanges(baf.ds) )
## Working on segmentation for sample number 1 : K
## Working on segmentation for sample number 2 : L
## Working on segmentation for sample number 3 : M

3.2 Plots

genoPlot(baf.ds,baf.ds[,1,"lrr"],chr="chr12",main="LRR of chr12")
genoPlot(baf.ds,baf.ds[,1,"lrr.segs"],chr="chr12",add=TRUE,col="red")

Segmented copy number across the genome for 1st sample`

par(mfrow=c(2,1))
genoPlot(baf.ds,baf.ds[,1,"baf"],chr="chr12", main="BAF of chr12")
genoPlot(baf.ds,baf.ds[,1,"mbaf"],chr="chr12", main="mBAF of chr12")
genoPlot(baf.ds,baf.ds[,1,"baf.segs"],chr="chr12", add=TRUE,col="red")

4 Cross-sample summaries

You can quickly calculate summaries across samples to identify regions with frequent alterations. A bit more care is necessary to work one sample at a time if your data “matrix” is an RleDataFrame.

gain.list = lapply(colnames(baf.ds),
  function(sample.name) {
    as.logical( baf.ds[, sample.name, "lrr.segs"] > 0.3 )
})
gain.mat = do.call(cbind, gain.list)
gain.freq = rowMeans(gain.mat,na.rm=TRUE)

GISTIC (by Behroukhim and Getz of the Broad Institute) is the standard method for assessing significance of such summaries. You’ll find segTable convenient for getting your data formatted for input. I find it convenient to load GISTIC output as a GRanges for intersection with gene locations.

5 Big Data and bigmemoryExtras

Genome-scale data can be huge and keeping everything in memory can get you into trouble quickly, especially if you like using parallel’s mclapply.

It is often convenient to use BigMatrix objects from the bigmemoryExtras package as assayDataElements, rather than base matrices. BigMatrix is based on the bigmemory package, which provides a matrix API to memory-mapped files of numeric data. This allows for data matrices larger than R’s maximum size with just the tiniest footprint in RAM. The bigmemoryExtras vignette has more details about using eSet-derived classes and BigMatrix objects.