derfinder basic results exploration

Project: report.

Introduction

This report is meant to help explore the results of the derfinder (Collado-Torres, Frazee, Love, Irizarry, et al., 2015) package and was generated using the regionReport (Collado-Torres, Jaffe, and Leek, 2015) package. While the report is rich, it is only meant to start the exploration of the results and to exemplify some of the code used to do so. You will most likely need a more in-depth analysis for your specific data set.

Most plots were made using ggplot2 (Wickham, 2009).

Code setup

#### Libraries needed

## Bioconductor
library('IRanges')
library('GenomicRanges')
library('GenomeInfoDb')

if(hg19) {
    library('biovizBase')
    library('TxDb.Hsapiens.UCSC.hg19.knownGene')
}

## CRAN
library('ggplot2')
library('grid')
library('gridExtra')
library('knitr')
library('RColorBrewer')
library('mgcv')
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## 
## The following object is masked from 'package:IRanges':
## 
##     collapse
## 
## This is mgcv 1.8-3. For overview type 'help("mgcv-package")'.
## GitHub
library('derfinder')

## Working behind the scenes
# library('knitcitations')
# library('rmarkdown')
# library('knitrBootstrap')

#### Code setup

## For ggplot
tmp <- fullRegions
names(tmp) <- seq_len(length(tmp))
regions.df <- as.data.frame(tmp)
regions.df$width <- width(tmp)
rm(tmp)
nulls.df <- as.data.frame(fullNullSummary)

## Special subsets: need at least 3 points for a density plot
keepChr <- table(regions.df$seqnames) > 2
regions.df.plot <- subset(regions.df, seqnames %in% names(keepChr[keepChr]))

if(hasSig) {
    ## Keep only those sig
    regions.df.sig <- regions.df[idx.sig, ]
    keepChr <- table(regions.df.sig$seqnames) > 2
    regions.df.sig <- subset(regions.df.sig, seqnames %in% names(keepChr[keepChr]))
    
    if(nrow(regions.df.sig) > 0) {
        ## If there's any sig, keep those with finite areas
        if(hasArea) {
            finite.area.sig <- which(is.finite(regions.df.sig$area))
            
            regions.df.sig.area <- regions.df.sig[finite.area.sig, ]
            keepChr <- table(regions.df.sig.area$seqnames) > 2
            regions.df.sig.area <- subset(regions.df.sig.area, seqnames %in%
                names(keepChr[keepChr]))
            
            ## Save the info
            hasArea <- (nrow(regions.df.sig.area) > 0)
        }
    } else {
        hasSig <- hasArea <- FALSE
    }
}

## Get chr lengths
if(hg19) {
    data(hg19Ideogram, package = 'biovizBase')
    seqlengths(fullRegions) <- seqlengths(hg19Ideogram)[mapSeqlevels(names(seqlengths(fullRegions)),
         'UCSC')]
}

## Find which chrs are present in the data set
chrs <- levels(seqnames(fullRegions))

## Subset the fullCoverage data in case that a subset was used
colsubset <- optionsStats$colsubset
if(!is.null(fullCov) && !is.null(colsubset)) {
    fullCov <- lapply(fullCov, function(x) { x[, colsubset] })
}

## Get region coverage for the top regions
if(nBestRegions > 0) {
    if(packageVersion('derfinder') >= '0.0.60') {
        regionCoverage <- getRegionCoverage(fullCov = fullCov, 
            regions = fullRegions[seq_len(nBestRegions)],
            chrsStyle = chrsStyle, species = species,
            currentStyle = currentStyle, verbose = FALSE)
    } else {
        regionCoverage <- getRegionCoverage(fullCov = fullCov, 
            regions = fullRegions[seq_len(nBestRegions)],
            verbose = FALSE)
    }
    
    save(regionCoverage, file=file.path(workingDir, 'regionCoverage.Rdata'))
}

## Graphical setup: transcription database
if(hg19 && is.null(txdb)) {
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
} else {
    stopifnot(!is.null(txdb))
}
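
As a small illustration of the 'Special subsets' step in the setup code above, the following sketch uses a toy data frame (the chromosome names and widths are made up for this example) to show how chromosomes with fewer than 3 regions are dropped before drawing density plots.

```r
## Toy data: chr1 has 3 regions, chr2 has only 2 (made-up values)
regions <- data.frame(
    seqnames = factor(c('chr1', 'chr1', 'chr1', 'chr2', 'chr2')),
    width = c(10, 20, 30, 40, 50)
)

## TRUE for chromosomes with at least 3 regions
keep <- table(regions$seqnames) > 2

## Keep only the rows from those chromosomes
kept <- subset(regions, seqnames %in% names(keep[keep]))
kept
```

Only the chr1 rows remain, since a density estimate with fewer than 3 points is not informative.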

Quality checks

P-values

Theoretically, the p-values should be uniformly distributed between 0 and 1.

p1 <- ggplot(regions.df.plot, aes(x=pvalues, colour=seqnames)) +
    geom_line(stat='density') + xlim(0, 1) +
    labs(title='Density of p-values') + xlab('p-values') +
    scale_colour_discrete(limits=chrs) + theme(legend.title=element_blank())
p1
## Compare the pvalues
summary(fullRegions$pvalues)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01887 0.01887 0.62260 0.40650 0.62260 0.96230

This is the numerical summary of the distribution of the p-values.

Q-values

summary(fullRegions$qvalues)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01338 0.01338 0.20570 0.13420 0.20570 0.24810

This is the numerical summary of the distribution of the q-values.

qtable <- lapply(c(1e-04, 0.001, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5,
    0.6, 0.7, 0.8, 0.9, 1), function(x) {
    data.frame('Cut' = x, 'Count' = sum(fullRegions$qvalues <= x))
})
qtable <- do.call(rbind, qtable)
kable(qtable, format = 'html', align = c('c', 'c'))
Cut Count
0.0001 0
0.0010 0
0.0100 0
0.0250 12
0.0500 12
0.1000 12
0.2000 15
0.3000 33
0.4000 33
0.5000 33
0.6000 33
0.7000 33
0.8000 33
0.9000 33
1.0000 33

This table shows the number of candidate Differentially Expressed Regions (DERs) with a q-value less than or equal to some commonly used cutoff values.
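
The counts in this table can be reproduced from the q-values listed in the region information table later in this report. The following is a minimal sketch, assuming the q-values are exactly the five distinct values shown in that table, repeated 12, 1, 2, 17 and 1 times respectively. Only a few of the cutoffs are used to keep the example short.

```r
## q-values as listed in the region information table (33 regions in total)
qvalues <- rep(
    c(0.0133769, 0.1605224, 0.1819254, 0.2056693, 0.2480801),
    times = c(12, 1, 2, 17, 1)
)

## A few of the cutoffs used in the report
cuts <- c(0.01, 0.025, 0.1, 0.2, 0.3)

## Number of regions with q-value less than or equal to each cutoff
counts <- vapply(cuts, function(x) sum(qvalues <= x), integer(1))
data.frame(Cut = cuts, Count = counts)
```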

FWER adjusted P-values

This is the numerical summary of the distribution of the FWER adjusted p-values. Skipped because there are no FWER adjusted p-values.

This table shows the number of candidate Differentially Expressed Regions (DERs) with FWER adjusted p-values less than or equal to some commonly used cutoff values. Skipped because there are no FWER adjusted p-values.

Region width

xrange <- range(log10(regions.df.plot$width))
p2a <- ggplot(regions.df.plot, aes(x=log10(width), colour=seqnames)) + 
    geom_line(stat='density') + labs(title='Density of region lengths') +
    xlab('Region width (log10)') + scale_colour_discrete(limits=chrs) +
    xlim(xrange) + theme(legend.title=element_blank())
p2b <- ggplot(regions.df.sig, aes(x=log10(width), colour=seqnames)) +
    geom_line(stat='density') +
    labs(title='Density of region lengths (significant only)') +
    xlab('Region width (log10)') + scale_colour_discrete(limits=chrs) +
    xlim(xrange) + theme(legend.title=element_blank())
grid.arrange(p2a, p2b)

This plot shows the density of the region lengths for all regions. The bottom panel is restricted to significant regions (q-value < 0.1).

Region Area

xrange <- range(log10(regions.df.plot$area[finite.area]))
if(inf.area > 0) {
    print(paste('Dropping', inf.area, 'due to Inf values.'))
}
p3a <- ggplot(regions.df[finite.area, ], aes(x=log10(area), colour=seqnames)) +
    geom_line(stat='density') + labs(title='Density of region areas') +
    xlab('Region area (log10)') + scale_colour_discrete(limits=chrs) +
    xlim(xrange) + theme(legend.title=element_blank())
p3b <- ggplot(regions.df.sig.area, aes(x=log10(area), colour=seqnames)) +
    geom_line(stat='density') +
    labs(title='Density of region areas (significant only)') +
    xlab('Region area (log10)') + scale_colour_discrete(limits=chrs) +
    xlim(xrange) + theme(legend.title=element_blank())
grid.arrange(p3a, p3b)

This plot shows the density of the region areas for all regions. The bottom panel is restricted to significant regions (q-value < 0.1).
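
In the region information table later in this report, the area of a region matches, up to rounding, its value (the mean statistic) multiplied by its width. The following sketch checks this on three rows copied from that table. It is an observation about the reported numbers rather than a statement of how derfinder defines the area internally.

```r
## value, width and area for regions ranked 1, 2 and 32 in the table
value <- c(15.053644, 7.096704, 3.997655)
width <- c(39, 63, 1)
area <- c(587.092111, 447.092322, 3.997655)

## Difference between value * width and the reported area
area_diff <- value * width - area
round(area_diff, 4)
```

The differences are tiny compared to the areas themselves, which is what rounding of the displayed values would produce.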

Null regions: width and area

p4 <- ggplot(nulls.df, aes(x=log10(width), colour=chr)) +
    geom_line(stat='density') + labs(title='Density of null region lengths') +
    xlab('Region width (log10)') + scale_colour_discrete(limits=chrs) +
    theme(legend.title=element_blank())
nulls.inf <- !is.finite(nulls.df$area)
if(sum(nulls.inf) > 0) {
    print(paste('Dropping', sum(nulls.inf), 'due to Inf values.'))
}
p5 <- ggplot(nulls.df[!nulls.inf, ], aes(x=log10(area), colour=chr)) +
    geom_line(stat='density') + labs(title='Density of null region areas') +
    xlab('Region area (log10)') + scale_colour_discrete(limits=chrs) +
    theme(legend.title=element_blank())
grid.arrange(p4, p5)

This plot shows the density of the null region lengths and areas. There were a total of 52 null regions.
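
The total of 52 null regions is consistent with the p-values reported earlier: the smallest p-value, 0.01887, equals 1/53. The following is a minimal sketch of an empirical p-value, assuming it is the fraction of null areas at least as large as the observed area, with 1 added to both the numerator and the denominator. This formula is inferred from the numbers in this report, the helper name empirical_pvalue is hypothetical, and the null areas are simulated.

```r
## Hypothetical helper: empirical p-value with a +1 correction
empirical_pvalue <- function(observed, null_areas) {
    vapply(observed, function(x) {
        (sum(null_areas >= x) + 1) / (length(null_areas) + 1)
    }, numeric(1))
}

## 52 simulated null areas (made-up values, not the ones from this analysis)
set.seed(1)
null_areas <- rexp(52, rate = 1 / 20)

## One observed area above every null area, one below every null area
observed <- c(max(null_areas) + 1, min(null_areas) - 1)
pvals <- empirical_pvalue(observed, null_areas)
pvals
```

Under this assumption, a region with an area larger than all 52 null areas gets the smallest possible p-value, 1/53, and a region with an area smaller than all of them gets a p-value of 1.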

Mean coverage

xrange <- range(log2(regions.df.plot$meanCoverage))
p6a <- ggplot(regions.df.plot, aes(x=log2(meanCoverage), colour=seqnames)) +
    geom_line(stat='density') + labs(title='Density of region mean coverage') +
    xlab('Region mean coverage (log2)') + scale_colour_discrete(limits=chrs) +
    xlim(xrange) + theme(legend.title=element_blank())
p6b <- ggplot(regions.df.sig, aes(x=log2(meanCoverage), colour=seqnames)) +
    geom_line(stat='density') +
    labs(title='Density of region mean coverage (significant only)') +
    xlab('Region mean coverage (log2)') + scale_colour_discrete(limits=chrs) +
    xlim(xrange) + theme(legend.title=element_blank())
grid.arrange(p6a, p6b)

This plot shows the density of the region mean coverage for all regions. The bottom panel is restricted to significant regions (q-value < 0.1).

Mean coverage vs fold change

The following plots are MA-style plots comparing each group vs the first one. The mean coverage is calculated using only two groups at a time and is weighted according to the number of samples in each group. Note that the mean coverage and fold change as calculated here do not take into account the library sizes.

These plots are only shown when there are two or more groups. A total of 1 plot(s) were made.

for(j in grep('log2FoldChange', colnames(values(fullRegions)))) {
    ## Identify the groups
    groups <- strsplit(gsub('log2FoldChange', '',
        colnames(values(fullRegions))[j]), 'vs')[[1]]
    
    ## Calculate the mean coverage only using the 2 groups in question
    ## match() keeps the mean columns in the same order as 'groups'
    j.mean <- match(paste0('mean', groups), colnames(values(fullRegions)))
    groups.n <- sapply(groups, function(x) { sum(optionsStats$groupInfo == x) })
    ma.mean.mat <- as.matrix(values(fullRegions)[, j.mean])
    ## Weighted means
    ma.mean <- drop(ma.mean.mat %*% groups.n) / sum(groups.n) +
        optionsStats$scalefac
    ma.fold2 <- drop(log2(ma.mean.mat + optionsStats$scalefac) %*% c(1, -1))
    
    ma <- data.frame(mean=ma.mean, log2FoldChange=ma.fold2)
    ma2 <- ma[is.finite(ma$log2FoldChange), ]
    fold.mean <- data.frame(foldMean=mean(ma2$log2FoldChange, na.rm=TRUE))
    
    p.ma <- ggplot(ma, aes(x=log2(mean), y=log2FoldChange)) +
        geom_point(size=1.5, alpha=1/5) + 
        ylab("Fold Change [log2(x + sf)]\nRed dashed line at mean; blue line is GAM fit: y ~ s(x, bs = 'cs')") +
        xlab(paste('Mean coverage [log2(x + sf)] using only groups', groups[1], 'and',
            groups[2])) + labs(title=paste('MA style plot:', groups[1], 'vs ', 
            groups[2])) + geom_hline(aes(yintercept=foldMean), data=fold.mean, 
            colour='#990000', linetype='dashed') +
        geom_smooth(aes(y=log2FoldChange, x=log2(mean)), data=subset(ma2,
            mean > 0), method = 'gam', formula = y ~ s(x, bs = 'cs'))
    print(p.ma)
}
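
As a worked example of the weighted mean and the fold change, the following sketch uses the group means of the region ranked second by area in the region information table. It assumes 21 CEU and 10 YRI samples, which is inferred from the row sums of the permutation table and is not stated directly in this report. The scaling factor (sf) is left out so that the results can be compared with the values stored in the table.

```r
## Group means for the region ranked 2nd by area
mean_cov <- c(YRI = 0.0539683, CEU = 0.8374906)

## Assumed number of samples per group (inferred, see above)
n <- c(YRI = 10, CEU = 21)

## Mean coverage weighted by the number of samples in each group
weighted_mean <- sum(mean_cov * n) / sum(n)

## log2 fold change of YRI vs CEU, without the scaling factor
log2fc <- log2(mean_cov[['YRI']]) - log2(mean_cov[['CEU']])

c(weighted_mean = weighted_mean, log2fc = log2fc)
```

Both numbers agree with the meanCoverage (0.5847414) and log2FoldChangeYRIvsCEU (-3.955890) columns for that region, up to rounding.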

Genomic overview

The following plots were made using ggbio (Yin, Cook, and Lawrence, 2012) which in turn uses ggplot2 (Wickham, 2009). For more details check plotOverview in derfinder (Collado-Torres, Frazee, Love, Irizarry, et al., 2015).

Q-values

plotOverview(regions=fullRegions, type='qval', base_size=overviewParams$base_size, areaRel=overviewParams$areaRel, legend.position=c(0.97, 0.12))

This plot shows the genomic locations of the candidate regions found in the analysis. The significant regions (q-value less than 0.1) are highlighted and the area of the regions is shown on top of each chromosome. Note that the area is on a relative scale.

Annotation

plotOverview(regions=fullRegions, annotation=fullRegions, type='annotation', base_size=overviewParams$base_size, areaRel=overviewParams$areaRel, legend.position=c(0.97, 0.12))

This genomic overview plot shows the annotation region type for the candidate regions. Note that the regions are shown only if the annotation information is available. Below is a table of the actual number of results per annotation region type.

annoReg <- table(fullRegions$region, useNA='always')
annoReg.df <- data.frame(Region=names(annoReg), Count=as.vector(annoReg))
kable(annoReg.df, format = 'html', align=rep('c', 3))
Region Count
upstream 0
promoter 0
overlaps 5' 0
inside 33
overlaps 3' 0
close to 3' 0
downstream 0
covers 0
NA 0

Annotation (significant)

plotOverview(regions=fullRegions[idx.sig], annotation=fullRegions[idx.sig], type='annotation', base_size=overviewParams$base_size, areaRel=overviewParams$areaRel, legend.position=c(0.97, 0.12))

This genomic overview plot shows the annotation region type for the candidate regions that have a q-value less than 0.1. Note that the regions are shown only if the annotation information is available.

Best regions

Plots

Below are the plots for the top 15 candidate DERs as ranked by area. For each plot, annotation is shown if the candidate DER has a minimum overlap of 20 base pairs with annotation information (strand specific). If present, exons are collapsed and shown in blue. Introns are shown in light blue. The title of each plot is composed of the name of the nearest annotation element, the distance to it, and the region of the genome the DER falls into; all three pieces of information are based on bumphunter::annotateNearest().

The annotation depends on the Genomic State used. For details on which one was used for this report check the call to mergeResults in the reproducibility details.

if(nBestRegions > 0) {
    plotRegionCoverage(regions = fullRegions, regionCoverage = regionCoverage,
        groupInfo = optionsStats$groupInfo, nearestAnnotation = regions.df,
        annotatedRegions = fullAnnotatedRegions, 
        whichRegions = seq_len(min(nBestRegions, length(fullRegions))),
        colors = NULL, scalefac = optionsStats$scalefac, ask = FALSE, 
        verbose = TRUE, txdb = txdb) 
}
## 2015-03-13 01:17:48 plotRegionCoverage: extracting Tx info
## 2015-03-13 01:17:56 plotRegionCoverage: getting Tx plot info

Below is a table summarizing the number of genomic states per region.

info <- do.call(rbind, lapply(fullAnnotatedRegions$countTable, function(x) { data.frame(table(x)) }))
colnames(info) <- c('Number of Overlapping States', 'Frequency')
info$State <- gsub('\\..*', '', rownames(info))
rownames(info) <- NULL
kable(info, format = 'html', align=rep('c', 4))
Number of Overlapping States Frequency State
0 22 exon
1 11 exon
0 33 intergenic
0 30 intron
1 3 intron

Region information

Below is an interactive table with the top 33 regions (out of 33) as ranked by area. Inf and -Inf are shown as 1e100 and -1e100 respectively.

topArea <- head(regions.df, nBestRegions * 5)
topArea <- data.frame('areaRank'=rank(-topArea$area, ties.method='first'), topArea)
## Clean up -Inf, Inf if present
## More details at https://github.com/ramnathv/rCharts/issues/259
replaceInf <- function(df, colsubset=seq_len(ncol(df))) {
    for(i in colsubset) {
        inf.idx <- !is.finite(df[, i])
        if(any(inf.idx)) {
            inf.sign <- sign(df[inf.idx, i])
            df[inf.idx, i] <- inf.sign * 1e100
        }
    }
    return(df)
}
topArea <- replaceInf(topArea, grep('log2FoldChange|value|area',
    colnames(topArea)))

## Make the table
kable(topArea, format = 'html', table.attr='id="regions_table"')
areaRank seqnames start end width strand value area indexStart indexEnd cluster clusterL meanCoverage meanCEU meanYRI log2FoldChangeYRIvsCEU pvalues significant qvalues significantQval name annotation description region distance subregion insidedistance exonnumber nexons UTR annoStrand geneL codingL
1 chr21 47409522 47409560 39 15.053644 587.092111 128 166 3 167 0.4152192 0.6129426 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside exon inside 7859 inside exon 0 10 35 inside transcription region 23300 22162
2 chr21 47411924 47411986 63 7.096704 447.092322 618 680 6 389 0.5847414 0.8374906 0.0539683 -3.955890e+00 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside exon inside 10261 inside exon 0 16 35 inside transcription region 23300 22162
3 chr21 47417352 47417397 46 8.778446 403.808491 1292 1337 10 338 0.3646564 0.5331263 0.0108696 -5.616111e+00 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside exon inside 15689 inside exon 0 21 35 inside transcription region 23300 22162
4 chr21 47412662 47412724 63 5.671216 357.286601 843 905 7 63 0.5012801 0.7399849 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside exon inside 10999 inside exon 0 19 35 inside transcription region 23300 22162
5 chr21 47408998 47409026 29 10.685252 309.872297 69 97 2 232 0.1957731 0.2889984 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside exon inside 7335 inside exon 0 9 35 inside transcription region 23300 22162
6 chr21 47414081 47414143 63 3.308408 208.429718 994 1056 8 63 0.3645673 0.5381708 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside exon inside 12418 inside exon 0 20 35 inside transcription region 23300 22162
7 chr21 47410893 47410930 38 5.452573 207.197772 407 444 5 269 0.4151104 0.6127820 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside exon inside 9230 inside exon 0 15 35 inside transcription region 23300 22162
8 chr21 47416094 47416131 38 5.109484 194.160408 1165 1202 9 222 0.0322581 0.0476190 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside intron inside 14431 inside intron 1204 21 35 inside transcription region 23300 22162
9 chr21 47415910 47415945 36 5.109484 183.941439 1129 1164 9 222 0.0322581 0.0476190 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside intron inside 14247 inside intron 1390 21 35 inside transcription region 23300 22162
10 chr21 47412088 47412131 44 3.344261 147.147470 691 734 6 389 0.4464809 0.6590909 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside exon inside 10425 inside exon 0 17 35 inside transcription region 23300 22162
11 chr21 47412277 47412312 36 3.575484 128.717432 735 770 6 389 0.5053763 0.7460317 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 inside exon inside 10614 inside exon 0 18 35 inside transcription region 23300 22162
12 chr21 47409034 47409056 23 5.557125 127.813881 105 127 2 232 0.1402525 0.2070393 0.0000000 -1.000000e+100 0.0188679 TRUE 0.0133769 TRUE COL6A1 NM_001848 NP_001839 overlaps exon downstream inside 7371 overlaps exon downstream 0 9 35 inside transcription region 23300 22162
13 chr21 47417614 47417650 37 2.849383 105.427164 1338 1374 10 338 0.4027899 0.5469755 0.1000000 -2.451476e+00 0.2452830 FALSE 0.1605224 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 15951 inside exon 0 22 35 inside transcription region 23300 22162
14 chr21 47410934 47410955 22 3.652389 80.352551 448 469 5 269 0.3079179 0.4545455 0.0000000 -1.000000e+100 0.3207547 FALSE 0.1819254 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 9271 inside exon 0 15 35 inside transcription region 23300 22162
15 chr21 47407554 47407568 15 4.704220 70.563298 18 32 1 32 0.2688172 0.3968254 0.0000000 -1.000000e+100 0.3207547 FALSE 0.1819254 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 5891 inside exon 0 8 35 inside transcription region 23300 22162
16 chr21 47408825 47408860 36 1.161740 41.822652 33 68 2 232 0.0322581 0.0476190 0.0000000 -1.000000e+100 0.5849057 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside intron inside 7162 inside intron 138 9 35 inside transcription region 23300 22162
17 chr21 47410702 47410714 13 2.378483 30.920276 368 380 5 269 0.5831266 0.8608059 0.0000000 -1.000000e+100 0.6226415 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 9039 inside exon 0 14 35 inside transcription region 23300 22162
18 chr21 47410292 47410305 14 2.166562 30.331874 227 240 4 159 0.7672811 1.1156463 0.0357143 -4.965235e+00 0.6226415 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 8629 inside exon 0 13 35 inside transcription region 23300 22162
19 chr21 47410188 47410198 11 2.569046 28.259510 216 226 4 159 0.7947214 1.1255411 0.1000000 -3.492547e+00 0.6226415 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 8525 inside exon 0 12 35 inside transcription region 23300 22162
20 chr21 47410731 47410740 10 2.698208 26.982075 397 406 5 269 0.5064516 0.7476190 0.0000000 -1.000000e+100 0.6226415 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 9068 inside exon 0 14 35 inside transcription region 23300 22162
21 chr21 47407537 47407542 6 4.061822 24.370932 1 6 1 32 0.0591398 0.0873016 0.0000000 -1.000000e+100 0.6226415 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 5874 inside exon 0 8 35 inside transcription region 23300 22162
22 chr21 47410687 47410698 12 1.773713 21.284559 353 364 5 269 0.6801075 1.0039683 0.0000000 -1.000000e+100 0.6226415 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 9024 inside exon 0 14 35 inside transcription region 23300 22162
23 chr21 47410175 47410184 10 1.857995 18.579952 203 212 4 159 0.7354839 1.0380952 0.1000000 -3.375867e+00 0.6226415 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 8512 inside exon 0 12 35 inside transcription region 23300 22162
24 chr21 47410314 47410322 9 2.063009 18.567083 249 257 4 159 0.6487455 0.9576720 0.0000000 -1.000000e+100 0.6226415 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 8651 inside exon 0 13 35 inside transcription region 23300 22162
25 chr21 47410716 47410726 11 1.631502 17.946526 382 392 5 269 0.5630499 0.8311688 0.0000000 -1.000000e+100 0.6226415 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 9053 inside exon 0 14 35 inside transcription region 23300 22162
26 chr21 47412078 47412083 6 2.293818 13.762909 681 686 6 389 0.4139785 0.6031746 0.0166667 -5.177538e+00 0.6981132 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 10415 inside exon 0 17 35 inside transcription region 23300 22162
27 chr21 47417655 47417662 8 1.570295 12.562357 1379 1386 10 338 0.2822581 0.4166667 0.0000000 -1.000000e+100 0.6981132 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 15992 inside exon 0 22 35 inside transcription region 23300 22162
28 chr21 47410327 47410333 7 1.521231 10.648620 262 268 4 159 0.7557604 1.1156463 0.0000000 -1.000000e+100 0.6981132 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 8664 inside exon 0 13 35 inside transcription region 23300 22162
29 chr21 47417666 47417672 7 1.215518 8.508629 1390 1396 10 338 0.2258065 0.3333333 0.0000000 -1.000000e+100 0.7358491 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 16003 inside exon 0 22 35 inside transcription region 23300 22162
30 chr21 47409685 47409688 4 1.215855 4.863420 192 195 3 167 0.4758065 0.6547619 0.1000000 -2.710970e+00 0.7735849 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 8022 inside exon 0 11 35 inside transcription region 23300 22162
31 chr21 47417335 47417337 3 1.591623 4.774869 1275 1277 10 338 0.1290323 0.1904762 0.0000000 -1.000000e+100 0.7735849 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 15672 inside exon 0 21 35 inside transcription region 23300 22162
32 chr21 47418068 47418068 1 3.997655 3.997655 1434 1434 11 1 0.0322581 0.0476190 0.0000000 -1.000000e+100 0.7735849 FALSE 0.2056693 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 16405 inside exon 0 23 35 inside transcription region 23300 22162
33 chr21 47410311 47410312 2 1.143626 2.287252 246 247 4 159 0.7258065 1.0714286 0.0000000 -1.000000e+100 0.9622642 FALSE 0.2480801 FALSE COL6A1 NM_001848 NP_001839 inside exon inside 8648 inside exon 0 13 35 inside transcription region 23300 22162
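
Many rows in the table above show a log2FoldChangeYRIvsCEU of -1.000000e+100. These are regions where meanYRI is 0, so the log2 fold change is -Inf before it is replaced for display. The following is a minimal sketch using the group means of the region ranked first by area.

```r
## Group means for the region ranked 1st by area
mean_ceu <- 0.6129426
mean_yri <- 0

## log2 fold change of YRI vs CEU: log2(0) is -Inf
log2fc <- log2(mean_yri / mean_ceu)

## Same replacement rule as replaceInf(): keep the sign, swap Inf for 1e100
shown <- if (is.finite(log2fc)) log2fc else sign(log2fc) * 1e100
shown
```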

Best region clusters

The following plots were made using ggbio (Yin, Cook, and Lawrence, 2012) which in turn uses ggplot2 (Wickham, 2009). For more details check plotCluster() in derfinder (Collado-Torres, Frazee, Love, Irizarry, et al., 2015).

Plots

## Select clusters by cluster area
df <- data.frame(area = fullRegions$area,
    clusterChr = paste0(as.integer(fullRegions$cluster), 
    chr = as.character(seqnames(fullRegions))))
regionClustAreas <- tapply(df$area, df$clusterChr, sum)
bestArea <- sapply(names(head(sort(regionClustAreas, decreasing=TRUE),
    nBestClusters)), function(y) { which(df$clusterChr == y)[[1]]})

## Graphical setup: ideograms 
if(hg19 && is.null(p.ideos)) {
    ## Load ideogram info
    data(hg19IdeogramCyto, package = 'biovizBase')
    ideos.set <- as.character(unique(seqnames(fullRegions[bestArea])))
    p.ideos <- lapply(ideos.set, function(xx) { 
        plotIdeogram(hg19IdeogramCyto, mapSeqlevels(xx, 'UCSC'))
    })
    names(p.ideos) <- ideos.set
} else {
    stopifnot(!is.null(p.ideos))
}

## Graphical setup: main plotting function
regionClusterPlot <- function(idx, tUse='qval') {
    ## Chr specific selections
    chr <- as.character(seqnames(fullRegions[idx]))
    p.ideo <- p.ideos[[chr]]
    covInfo <- fullCov[[chr]]
    
    ## Make the plot
    p <- plotCluster(idx, regions = fullRegions, annotation = regions.df,
        coverageInfo = covInfo, groupInfo = optionsStats$groupInfo,
        titleUse = tUse, txdb = txdb, p.ideogram = p.ideo)
    print(p)
    rm(p.ideo, covInfo)
    
    return(invisible(TRUE))
}

Below are the best 2 region clusters ordered by cluster area (sum of the area of regions inside a cluster). The region with the highest area in the cluster is shown with a red bar.

## Genome plots
for(idx in bestArea) {
    regionClusterPlot(idx, ifelse(nullExist, ifelse(fwerExist, 'fwer', 'qval'), 'none'))
}
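
To illustrate how clusters are ranked by cluster area, the following sketch repeats the selection logic on five regions copied from the region information table. Since only a subset of the regions is used, the sums are illustrative and do not match the full cluster areas of this analysis.

```r
## Area and cluster for the regions ranked 1, 2, 4, 10 and 11 by area
df <- data.frame(
    area = c(587.092111, 447.092322, 357.286601, 147.147470, 128.717432),
    clusterChr = paste0(c(3, 6, 7, 6, 6), 'chr21')
)

## Cluster area: sum of the areas of the regions in each cluster
regionClustAreas <- tapply(df$area, df$clusterChr, sum)

## Best 2 clusters, each represented by the index of its first region
nBestClusters <- 2
best <- sapply(names(head(sort(regionClustAreas, decreasing = TRUE),
    nBestClusters)), function(y) { which(df$clusterChr == y)[[1]] })
best
```

In this subset cluster 6 comes first because its three regions add up to a larger area than the single region in cluster 3, even though that region has the largest individual area.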

Permutations

Below is the information on how the samples were permuted.

Summary

## Get the permutation information
nSamples <- seq_len(length(optionsStats$groupInfo))
permuteInfo <- lapply(seeds, function(x) {
    set.seed(x)
    idx <- sample(nSamples)
    data.frame(optionsStats$groupInfo[idx])
})
permuteInfo <- cbind(data.frame(optionsStats$groupInfo), do.call(cbind, permuteInfo))
colnames(permuteInfo) <- c('original', paste0('perm', seq_len(optionsStats$nPermute)))
## The raw information
# permuteInfo

n <- names(table(permuteInfo[, 2]))
permuteDetail <- data.frame(matrix(NA, nrow=optionsStats$nPermute * length(n),
    ncol = 2 + length(n)))
permuteDetail[, 1] <- rep(seq_len(optionsStats$nPermute), each=length(n))
permuteDetail[, 2] <- rep(n, optionsStats$nPermute)
colnames(permuteDetail) <- c('permutation', 'group', as.character(n))
l <- 1
m <- 3:ncol(permuteDetail)
for(j in n) {
    k <- which(permuteInfo[, 1] == j)
    for(i in 2:(optionsStats$nPermute + 1)) {
        permuteDetail[l, m] <- table(permuteInfo[k, i])
        l <- l + 1
    }
}

## How many permutations resulted in the original grouping rearrangement
obs <- diag(length(m)) * sapply(
    permuteDetail$group[ permuteDetail$permutation == 1], function(n) {
  sum(optionsStats$groupInfo == n)
})
sameAsObs <- sapply(seq_len(length(seeds)), function(i) {
    p <- as.matrix(permuteDetail[permuteDetail$permutation == i, m])
    all((p - obs) == 0)
})

## Print the summary
summary(permuteDetail[, m])
##       CEU             YRI     
##  Min.   : 6.00   Min.   :4.0  
##  1st Qu.: 8.25   1st Qu.:4.5  
##  Median :10.50   Median :5.0  
##  Mean   :10.50   Mean   :5.0  
##  3rd Qu.:12.75   3rd Qu.:5.5  
##  Max.   :15.00   Max.   :6.0

This table shows the summary per group of how many samples were assigned to the group. It can be used for fast detection of anomalies. Also note that 0 permutations out of 1 total permutations resulted in the same grouping as in the original observed data.

Note that in derfinder the re-sampling of the samples is done without replacement. This is done to avoid singular model matrices. While the sample balance is the same across the permutations, what changes are the adjusted variables (including the column medians).
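
The following sketch illustrates re-sampling without replacement on group labels. It assumes 21 CEU and 10 YRI samples, which is inferred from the row sums of the permutation table in this report. The seed is the one shown in the analyzeChr() call, but the resulting permutation is not expected to match the one used in the analysis, since that depends on the R version and on how derfinder uses the seed.

```r
## Assumed group labels: 21 CEU and 10 YRI samples
group <- rep(c('CEU', 'YRI'), times = c(21, 10))

## Shuffle the labels without replacement
set.seed(20140330)
permuted <- group[sample(length(group))]

## Group sizes are unchanged, only the assignment to samples differs
table(permuted)

## Cross-tabulation of original vs permuted labels
cross <- table(original = group, permuted = permuted)
cross
```

Because the labels are shuffled rather than drawn with replacement, every permutation keeps 21 CEU and 10 YRI labels, so no group can end up empty.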

Interactive

The following table shows how the group labels were permuted. This can be useful to detect whether a particular permutation had so many samples of one group labeled as the other group that the permutation amounted to little more than swapping the group names.

kable(permuteDetail, format = 'html', table.attr='id="permutation_table"')
permutation group CEU YRI
1 CEU 15 6
1 YRI 6 4

Reproducibility

General information

The F-statistic cutoff used was 1 and the type of cutoff used was manual. Furthermore, the maximum region (data) gap was set to 0 and the maximum cluster gap was set to 300.

Details

The analysis on each chromosome was performed with the following call to analyzeChr() (shown for one chromosome only):

## analyzeChr(chrnum = "21", coverageInfo = genomeData, models = models, 
##     cutoffFstat = 1, cutoffType = "manual", seeds = 20140330, 
##     groupInfo = group, mc.cores = 1, writeOutput = TRUE, returnOutput = FALSE)

The results were merged using the following call to mergeResults():

## mergeResults(chrs = "chr21", prefix = "report", genomicState = genomicState$fullGenome)

This report was generated in path /Users/lcollado/Dropbox/JHSPH/Code/regionReport/vignettes/realVignettes using the following call to derfinderReport():

## derfinderReport(prefix = "report", outdir = "html", browse = FALSE, 
##     nBestRegions = 15, makeBestClusters = TRUE, fullCov = list(`21` = genomeDataRaw$coverage), 
##     optionsStats = optionsStats)

Date the report was generated.

## [1] "2015-03-13 01:18:16 EDT"

Wallclock time spent generating the report.

## Time difference of 40.57 secs

R session information.

## Session info-----------------------------------------------------------------------------------------------------------
##  setting  value                                             
##  version  R Under development (unstable) (2014-11-01 r66923)
##  system   x86_64, darwin10.8.0                              
##  ui       X11                                               
##  language (EN)                                              
##  collate  en_US.UTF-8                                       
##  tz       America/New_York
## Packages---------------------------------------------------------------------------------------------------------------
##  package                           * version  date       source                                   
##  acepack                             1.3.3.3  2013-05-03 CRAN (R 3.2.0)                           
##  AnnotationDbi                     * 1.29.17  2015-01-21 Bioconductor                             
##  bibtex                              0.3.6    2013-07-29 CRAN (R 3.2.0)                           
##  Biobase                           * 2.27.2   2015-02-28 Bioconductor                             
##  BiocGenerics                      * 0.13.6   2015-03-01 Bioconductor                             
##  BiocParallel                        1.1.14   2015-03-12 Bioconductor                             
##  biomaRt                             2.23.5   2014-11-22 Bioconductor                             
##  Biostrings                          2.35.11  2015-02-22 Bioconductor                             
##  biovizBase                        * 1.15.2   2015-01-14 Bioconductor                             
##  bitops                              1.0.6    2013-08-17 CRAN (R 3.2.0)                           
##  BSgenome                            1.35.17  2015-02-13 Bioconductor                             
##  bumphunter                          1.7.6    2015-03-13 Github (lcolladotor/bumphunter@37d10e7)  
##  Cairo                               1.5.6    2014-06-26 CRAN (R 3.2.0)                           
##  cluster                             1.15.3   2014-09-04 CRAN (R 3.2.0)                           
##  codetools                           0.2.9    2014-08-21 CRAN (R 3.2.0)                           
##  colorspace                          1.2.4    2013-09-30 CRAN (R 3.2.0)                           
##  DBI                                 0.3.1    2014-09-24 CRAN (R 3.2.0)                           
##  derfinder                         * 1.1.17   2015-03-13 Bioconductor                             
##  derfinderHelper                     1.1.5    2014-11-05 Bioconductor                             
##  derfinderPlot                       1.1.6    2015-03-13 Bioconductor                             
##  devtools                            1.6.1    2014-10-07 CRAN (R 3.2.0)                           
##  dichromat                           2.0.0    2013-01-24 CRAN (R 3.2.0)                           
##  digest                              0.6.4    2013-12-03 CRAN (R 3.2.0)                           
##  doRNG                               1.6      2014-03-07 CRAN (R 3.2.0)                           
##  evaluate                            0.5.5    2014-04-29 CRAN (R 3.2.0)                           
##  foreach                             1.4.2    2014-04-11 CRAN (R 3.2.0)                           
##  foreign                             0.8.61   2014-03-28 CRAN (R 3.2.0)                           
##  formatR                             1.0      2014-08-25 CRAN (R 3.2.0)                           
##  Formula                             1.1.2    2014-07-13 CRAN (R 3.2.0)                           
##  futile.logger                       1.3.7    2014-01-25 CRAN (R 3.2.0)                           
##  futile.options                      1.0.0    2010-04-06 CRAN (R 3.2.0)                           
##  GenomeInfoDb                      * 1.3.13   2015-02-13 Bioconductor                             
##  GenomicAlignments                   1.3.30   2015-03-04 Bioconductor                             
##  GenomicFeatures                   * 1.19.27  2015-03-12 Bioconductor                             
##  GenomicFiles                        1.3.14   2015-03-07 Bioconductor                             
##  GenomicRanges                     * 1.19.46  2015-03-12 Bioconductor                             
##  GGally                              0.4.8    2014-08-26 CRAN (R 3.2.0)                           
##  ggbio                               1.15.1   2015-01-14 Bioconductor                             
##  ggplot2                           * 1.0.0    2014-05-21 CRAN (R 3.2.0)                           
##  graph                               1.45.2   2015-03-01 Bioconductor                             
##  gridExtra                         * 0.9.1    2012-08-09 CRAN (R 3.2.0)                           
##  gtable                              0.1.2    2012-12-05 CRAN (R 3.2.0)                           
##  Hmisc                               3.14.5   2014-09-12 CRAN (R 3.2.0)                           
##  htmltools                           0.2.6    2014-09-08 CRAN (R 3.2.0)                           
##  httr                                0.5      2014-09-02 CRAN (R 3.2.0)                           
##  IRanges                           * 2.1.43   2015-03-07 Bioconductor                             
##  iterators                           1.0.7    2014-04-11 CRAN (R 3.2.0)                           
##  knitcitations                       1.0.4    2014-11-03 Github (cboettig/knitcitations@508de74)  
##  knitr                             * 1.7      2014-10-13 CRAN (R 3.2.0)                           
##  knitrBootstrap                      1.0.0    2014-11-03 Github (jimhester/knitrBootstrap@76c41f0)
##  labeling                            0.3      2014-08-23 CRAN (R 3.2.0)                           
##  lambda.r                            1.1.6    2014-01-23 CRAN (R 3.2.0)                           
##  lattice                             0.20.29  2014-04-04 CRAN (R 3.2.0)                           
##  latticeExtra                        0.6.26   2013-08-15 CRAN (R 3.2.0)                           
##  locfit                              1.5.9.1  2013-04-20 CRAN (R 3.2.0)                           
##  lubridate                           1.3.3    2013-12-31 CRAN (R 3.2.0)                           
##  markdown                            0.7.4    2014-08-24 CRAN (R 3.2.0)                           
##  MASS                                7.3.35   2014-09-30 CRAN (R 3.2.0)                           
##  Matrix                              1.1.4    2014-06-15 CRAN (R 3.2.0)                           
##  matrixStats                         0.10.3   2014-10-15 CRAN (R 3.2.0)                           
##  memoise                             0.2.1    2014-04-22 CRAN (R 3.2.0)                           
##  mgcv                              * 1.8.3    2014-08-29 CRAN (R 3.2.0)                           
##  mime                                0.2      2014-09-26 CRAN (R 3.2.0)                           
##  munsell                             0.4.2    2013-07-11 CRAN (R 3.2.0)                           
##  nlme                              * 3.1.118  2014-10-07 CRAN (R 3.2.0)                           
##  nnet                                7.3.8    2014-03-28 CRAN (R 3.2.0)                           
##  OrganismDbi                         1.9.13   2015-03-12 Bioconductor                             
##  pkgmaker                            0.22     2014-05-14 CRAN (R 3.2.0)                           
##  plyr                                1.8.1    2014-02-26 CRAN (R 3.2.0)                           
##  proto                               0.3.10   2012-12-22 CRAN (R 3.2.0)                           
##  qvalue                              1.43.0   2015-03-04 Bioconductor                             
##  R.methodsS3                         1.6.1    2014-01-05 CRAN (R 3.2.0)                           
##  RBGL                                1.43.0   2014-10-14 Bioconductor                             
##  RColorBrewer                      * 1.0.5    2011-06-17 CRAN (R 3.2.0)                           
##  Rcpp                                0.11.3   2014-09-29 CRAN (R 3.2.0)                           
##  RCurl                               1.95.4.3 2014-07-29 CRAN (R 3.2.0)                           
##  RefManageR                          0.8.40   2014-10-29 CRAN (R 3.2.0)                           
##  regionReport                      * 1.1.7    2015-02-28 Bioconductor                             
##  registry                            0.2      2012-01-24 CRAN (R 3.2.0)                           
##  reshape                             0.8.5    2014-04-23 CRAN (R 3.2.0)                           
##  reshape2                            1.4      2014-04-23 CRAN (R 3.2.0)                           
##  RJSONIO                             1.3.0    2014-07-28 CRAN (R 3.2.0)                           
##  rmarkdown                           0.3.3    2014-09-17 CRAN (R 3.2.0)                           
##  rngtools                            1.2.4    2014-03-06 CRAN (R 3.2.0)                           
##  rpart                               4.1.8    2014-03-28 CRAN (R 3.2.0)                           
##  Rsamtools                           1.19.43  2015-03-12 Bioconductor                             
##  RSQLite                             1.0.0    2014-10-25 CRAN (R 3.2.0)                           
##  rstudioapi                          0.1      2014-03-27 CRAN (R 3.2.0)                           
##  rtracklayer                         1.27.8   2015-03-02 Bioconductor                             
##  S4Vectors                         * 0.5.22   2015-03-06 Bioconductor                             
##  scales                              0.2.4    2014-04-22 CRAN (R 3.2.0)                           
##  stringr                             0.6.2    2012-12-06 CRAN (R 3.2.0)                           
##  survival                            2.37.7   2014-01-22 CRAN (R 3.2.0)                           
##  TxDb.Hsapiens.UCSC.hg19.knownGene * 3.0.0    2014-09-26 Bioconductor                             
##  VariantAnnotation                   1.13.40  2015-03-06 Bioconductor                             
##  XML                                 3.98.1.1 2013-06-20 CRAN (R 3.2.0)                           
##  xtable                              1.7.4    2014-09-12 CRAN (R 3.2.0)                           
##  XVector                           * 0.7.4    2015-02-08 Bioconductor                             
##  yaml                                2.1.13   2014-06-12 CRAN (R 3.2.0)                           
##  zlibbioc                            1.13.1   2015-02-11 Bioconductor

Bibliography

This report was created with regionReport (Collado-Torres, Jaffe, and Leek, 2015), using knitrBootstrap (Hester, 2014) to format the HTML, while knitr (Xie, 2014) and rmarkdown (Allaire, McPherson, Xie, Wickham, et al., 2014) ran behind the scenes.

Citations were made with knitcitations (Boettiger, 2015).

[1] J. Allaire, J. McPherson, Y. Xie, H. Wickham, et al. rmarkdown: Dynamic Documents for R. R package version 0.3.3. 2014. URL: http://CRAN.R-project.org/package=rmarkdown.

[2] C. Boettiger. knitcitations: Citations for knitr markdown files. R package version 1.0.4. 2015. URL: https://github.com/cboettig/knitcitations.

[3] L. Collado-Torres, A. C. Frazee, M. I. Love, R. A. Irizarry, et al. “derfinder: Software for annotation-agnostic RNA-seq differential expression analysis”. In: bioRxiv (2015). DOI: 10.1101/015370. URL: http://www.biorxiv.org/content/early/2015/02/19/015370.abstract.

[4] L. Collado-Torres, A. E. Jaffe and J. T. Leek. regionReport: Generate HTML reports for exploring a set of regions. R package version 1.1.7, https://github.com/lcolladotor/regionReport. 2015. URL: http://www.bioconductor.org/packages/release/bioc/html/regionReport.html.

[5] J. Hester. knitrBootstrap: Knitr Bootstrap framework. R package version 1.0.0. 2014. URL: https://github.com/jimhester/.

[6] H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. ISBN: 978-0-387-98140-6. URL: http://had.co.nz/ggplot2/book.

[7] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.

[8] T. Yin, D. Cook and M. Lawrence. “ggbio: an R package for extending the grammar of graphics for genomic data”. In: Genome Biology 13.8 (2012), p. R77.