1 Basics

1.1 Install derfinder

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. derfinder is a R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install derfinder by using the following commands in your R session:

install.packages("BiocManager")
BiocManager::install("derfinder")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

1.2 Required knowledge

derfinder is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. That is, packages like Rsamtools, GenomicAlignments and rtracklayer that allow you to import the data. A derfinder user is not expected to deal with those packages directly but will need to be familiar with GenomicRanges to understand the results derfinder generates. It might also prove to be highly beneficial to check the BiocParallel package for performing parallel computations.

If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.

1.3 Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the derfinder tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

We would like to highlight the derfinder user Jessica Hekman. She has used derfinder with non-human data, and in the process of doing so discovered some small bugs or sections of the documentation that were not clear.

1.4 Citing derfinder

We hope that derfinder will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation('derfinder')
## 
## Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead
## B, Irizarry RA, Leek JT, Jaffe AE (2017). "Flexible expressed region
## analysis for RNA-seq with derfinder." _Nucl. Acids Res._. doi:
## 10.1093/nar/gkw852 (URL: http://doi.org/10.1093/nar/gkw852), <URL:
## http://nar.oxfordjournals.org/content/early/2016/09/29/nar.gkw852>.
## 
## Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014).
## "Differential expression analysis of RNA-seq data at single-base
## resolution." _Biostatistics_, *15 (3)*, 413-426. doi:
## 10.1093/biostatistics/kxt053 (URL:
## http://doi.org/10.1093/biostatistics/kxt053), <URL:
## http://biostatistics.oxfordjournals.org/content/15/3/413.long>.
## 
## Collado-Torres L, Jaffe AE, Leek JT (2017). _derfinder:
## Annotation-agnostic differential expression analysis of RNA-seq data
## at base-pair resolution via the DER Finder approach_. doi:
## 10.18129/B9.bioc.derfinder (URL:
## http://doi.org/10.18129/B9.bioc.derfinder),
## https://github.com/lcolladotor/derfinder - R package version 1.16.1,
## <URL: http://www.bioconductor.org/packages/derfinder>.
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

2 Quick start to using to derfinder

Here is a very quick example of a DER Finder analysis. This analysis is explained in more detail later on in this document.

## Load libraries
library('derfinder')
library('derfinderData')
library('GenomicRanges')

## Determine the files to use and fix the names
files <- rawFiles(system.file('extdata', 'AMY', package = 'derfinderData'),
    samplepatt = 'bw', fileterm = NULL)
names(files) <- gsub('.bw', '', names(files))

## Load the data from disk -- On Windows you have to load data from Bam files
fullCov <- fullCoverage(files = files, chrs = 'chr21', verbose = FALSE)

## Get the region matrix of Expressed Regions (ERs)
regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, verbose = FALSE)

## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == 'AMY')

## Identify which ERs are differentially expressed, that is, find the DERs
library('DESeq2')

## Round matrix
counts <- round(regionMat$chr21$coverageMatrix)

## Round matrix and specify design
dse <- DESeqDataSetFromMatrix(counts, pheno, ~ group + gender)

## Perform DE analysis
dse <- DESeq(dse, test = 'LRT', reduced = ~ gender, fitType = 'local')

## Extract results
mcols(regionMat$chr21$regions) <- c(mcols(regionMat$chr21$regions), results(dse))

## Save info in an object with a shorter name
ers <- regionMat$chr21$regions
ers

3 Introduction

derfinder is an R package that implements the DER Finder approach (Frazee, Sabunciyan, Hansen, Irizarry, et al., 2014) for RNA-seq data. Briefly, this approach is an alternative to feature-counting and transcript assembly. The basic idea is to identify contiguous base-pairs in the genome that present differential expression signal. These base-pairs are grouped into _d_ifferentially _e_xpressed _r_regions (DERs). This approach is annotation-agnostic which is a feature you might be interested in. In particular, derfinder contains functions that allow you to identify DERs via two alternative methods. You can find more details on the full derfinder users guide.

4 Sample DER Finder analysis

This is a brief overview of what a DER Finder analysis looks like. In particular, here we will be identifying expressed regions (ERs) without relying on annotation. Next, we’ll identify candidate differentially expressed regions (DERs). Finally, we’ll compare the DERs with known annotation features.

We first load the required packages.

## Load libraries
library('derfinder')
library('derfinderData')
library('GenomicRanges')

Next, we need to locate the chromosome 21 coverage files for a set of 12 samples. These samples are a small subset from the BrainSpan Atlas of the Human Brain (BrainSpan, 2011) publicly available data. The function rawFiles() helps us in locating these files.

## Determine the files to use and fix the names
files <- rawFiles(system.file('extdata', 'AMY', package = 'derfinderData'),
    samplepatt = 'bw', fileterm = NULL)
names(files) <- gsub('.bw', '', names(files))

Next, we can load the full coverage data into memory using the fullCoverage() function. Note that the BrainSpan data is already normalized by the total number of mapped reads in each sample. However, that won’t be the case with most data sets in which case you might want to use the totalMapped and targetSize arguments. The function getTotalMapped() will be helpful to get this information.

## Load the data from disk
fullCov <- fullCoverage(files = files, chrs = 'chr21', verbose = FALSE,
    totalMapped = rep(1, length(files)), targetSize = 1)

Now that we have the data, we can identify expressed regions (ERs) by using a cutoff of 30 on the base-level mean coverage from these 12 samples. Once the regions have been identified, we can calculate a coverage matrix with one row per ER and one column per sample (12 in this case). For doing this calculation we need to know the length of the sequence reads, which in this study were 76 bp long.

## Get the region matrix of Expressed Regions (ERs)
regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, verbose = FALSE)

regionMatrix() returns a list of elements, each with three pieces of output. The actual ERs are arranged in a GRanges object named regions.

## regions output
regionMat$chr21$regions
## GRanges object with 45 ranges and 6 metadata columns:
##      seqnames            ranges strand |            value             area
##         <Rle>         <IRanges>  <Rle> |        <numeric>        <numeric>
##    1    chr21   9827018-9827582      * | 313.671741938907 177224.534195483
##    2    chr21 15457301-15457438      * | 215.084607424943 29681.6758246422
##    3    chr21 20230140-20230192      * | 38.8325471608144 2058.12499952316
##    4    chr21 20230445-20230505      * | 41.3245082031834 2520.79500039419
##    5    chr21 27253318-27253543      * | 34.9131305372469  7890.3675014178
##   ..      ...               ...    ... .              ...              ...
##   41    chr21 33039644-33039688      * | 34.4705371008979 1551.17416954041
##   42    chr21 33040784-33040798      * |  32.134222178989 482.013332684835
##   43    chr21          33040890      * | 30.0925002098083 30.0925002098083
##   44    chr21 33040900-33040901      * | 30.1208333174388 60.2416666348775
##   45    chr21 48019401-48019414      * | 31.1489284609755 436.084998453657
##      indexStart  indexEnd cluster clusterL
##       <integer> <integer>   <Rle>    <Rle>
##    1          1       565       1      565
##    2        566       703       2      138
##    3        704       756       3      366
##    4        757       817       3      366
##    5        818      1043       4      765
##   ..        ...       ...     ...      ...
##   41       2180      2224      17       45
##   42       2225      2239      18      118
##   43       2240      2240      18      118
##   44       2241      2242      18      118
##   45       2243      2256      19       14
##   -------
##   seqinfo: 1 sequence from an unspecified genome

bpCoverage is the base-level coverage list which can then be used for plotting.

## Base-level coverage matrices for each of the regions
## Useful for plotting
lapply(regionMat$chr21$bpCoverage[1:2], head, n = 2)
## $`1`
##   HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159 HSB178
## 1  93.20   3.32  28.22   5.62 185.17  98.34   5.88  16.71   3.52  15.71
## 2 124.76   7.25  63.68  11.32 374.85 199.28  10.39  30.53   5.83  29.35
##   HSB92 HSB97
## 1 47.40 36.54
## 2 65.04 51.42
## 
## $`2`
##     HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159 HSB178
## 566  45.59   7.94  15.92  34.75 141.61 104.21  19.87  38.61   4.97   23.2
## 567  45.59   7.94  15.92  35.15 141.64 104.30  19.87  38.65   4.97   23.2
##     HSB92 HSB97
## 566 13.95 22.21
## 567 13.95 22.21
## Check dimensions. First region is 565 long, second one is 138 bp long.
## The columns match the number of samples (12 in this case).
lapply(regionMat$chr21$bpCoverage[1:2], dim)
## $`1`
## [1] 565  12
## 
## $`2`
## [1] 138  12

The end result of the coverage matrix is shown below. Note that the coverage has been adjusted for read length. Because reads might not fully align inside a given region, the numbers are generally not integers but can be rounded if needed.

## Dimensions of the coverage matrix
dim(regionMat$chr21$coverageMatrix)
## [1] 45 12
## Coverage for each region. This matrix can then be used with limma or other pkgs
head(regionMat$chr21$coverageMatrix)
##         HSB113     HSB123      HSB126      HSB130      HSB135      HSB136
## 1 3653.1093346 277.072106 1397.068687 1106.722895 8987.460401 5570.221054
## 2  333.3740816  99.987237  463.909476  267.354342 1198.713552 1162.313418
## 3   35.3828948  20.153553   30.725394   23.483947   16.786842   17.168947
## 4   42.3398681  29.931579   41.094474   24.724736   32.634080   19.309606
## 5   77.7402631 168.939342  115.059342  171.861974  180.638684   93.503158
## 6    0.7988158   1.770263    1.473421    2.231053    1.697368    1.007895
##        HSB145       HSB153     HSB159      HSB178       HSB92        HSB97
## 1 1330.158818 1461.2986829 297.939342 1407.288552 1168.519079 1325.9622371
## 2  257.114210  313.8513139  67.940131  193.695657  127.543553  200.7834228
## 3   22.895921   52.8756585  28.145395   33.127368   23.758816   20.4623685
## 4   33.802632   51.6146040  31.244343   33.576974   29.546183   28.2011836
## 5   90.950526   36.3046051  78.069605   97.151316  100.085790   35.5428946
## 6    1.171316    0.4221053   1.000132    1.139079    1.136447    0.3956579

We can then use the coverage matrix and packages such as limma, DESeq2 or edgeR to identify which ERs are differentially expressed. Here we’ll use DESeq2 for which we need some phenotype data.

## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == 'AMY')

Now we can identify the DERs using a rounded version of the coverage matrix.

## Identify which ERs are differentially expressed, that is, find the DERs
library('DESeq2')
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## Loading required package: BiocParallel
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
## 
##     aperm, apply
## Round matrix
counts <- round(regionMat$chr21$coverageMatrix)

## Round matrix and specify design
dse <- DESeqDataSetFromMatrix(counts, pheno, ~ group + gender)
## converting counts to integer mode
## Perform DE analysis
dse <- DESeq(dse, test = 'LRT', reduced = ~ gender, fitType = 'local')
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## Extract results
mcols(regionMat$chr21$regions) <- c(mcols(regionMat$chr21$regions),
    results(dse))

## Save info in an object with a shorter name
ers <- regionMat$chr21$regions
ers
## GRanges object with 45 ranges and 12 metadata columns:
##      seqnames            ranges strand |            value             area
##         <Rle>         <IRanges>  <Rle> |        <numeric>        <numeric>
##    1    chr21   9827018-9827582      * | 313.671741938907 177224.534195483
##    2    chr21 15457301-15457438      * | 215.084607424943 29681.6758246422
##    3    chr21 20230140-20230192      * | 38.8325471608144 2058.12499952316
##    4    chr21 20230445-20230505      * | 41.3245082031834 2520.79500039419
##    5    chr21 27253318-27253543      * | 34.9131305372469  7890.3675014178
##   ..      ...               ...    ... .              ...              ...
##   41    chr21 33039644-33039688      * | 34.4705371008979 1551.17416954041
##   42    chr21 33040784-33040798      * |  32.134222178989 482.013332684835
##   43    chr21          33040890      * | 30.0925002098083 30.0925002098083
##   44    chr21 33040900-33040901      * | 30.1208333174388 60.2416666348775
##   45    chr21 48019401-48019414      * | 31.1489284609755 436.084998453657
##      indexStart  indexEnd cluster clusterL          baseMean
##       <integer> <integer>   <Rle>    <Rle>         <numeric>
##    1          1       565       1      565  2846.28716155756
##    2        566       703       2      138  451.519643002767
##    3        704       756       3      366  29.5780895864663
##    4        757       817       3      366  36.0602688380417
##    5        818      1043       4      765  101.646763062388
##   ..        ...       ...     ...      ...               ...
##   41       2180      2224      17       45  20.7820346002786
##   42       2225      2239      18      118  6.41054227728292
##   43       2240      2240      18      118 0.129716561731455
##   44       2241      2242      18      118 0.702291488002413
##   45       2243      2256      19       14  5.29329263119445
##          log2FoldChange             lfcSE               stat
##               <numeric>         <numeric>          <numeric>
##    1  -1.69031816472548 0.831958703669481  0.215261662919431
##    2  -1.16404264400518 0.757489962635837   0.87112644422632
##    3 0.0461487746156371 0.458096633387309   3.13208182520646
##    4 -0.186620018986543 0.390920460570196   2.22570758632982
##    5 -0.138737745290957 0.320166382651628   3.95798720528119
##   ..                ...               ...                ...
##   41  -0.64205595270105   0.4276611764306  0.604781356173675
##   42 -0.634320864108627 0.512261897770782  0.545403917207132
##   43 -0.859548956942508  3.11653965011299 0.0206273206873746
##   44 -0.628284603274911  2.24737800886771  0.582510548110115
##   45  -1.69456340588639  1.25228952015329   5.78959103593262
##                  pvalue              padj
##               <numeric>         <numeric>
##    1  0.642674256598179 0.997155030203294
##    2  0.350643649949685 0.997155030203294
##    3 0.0767656530295047 0.863613596581927
##    4  0.135730461463207 0.997155030203294
##    5 0.0466494518014194   0.8620396310466
##   ..                ...               ...
##   41   0.43675951534617 0.997155030203294
##   42  0.460201756145957 0.997155030203294
##   43  0.885798852352319 0.997155030203294
##   44  0.445329945344273 0.997155030203294
##   45 0.0161213391338242 0.725460261022089
##   -------
##   seqinfo: 1 sequence from an unspecified genome

We can then compare the DERs against known annotation to see which DERs overlap known exons, introns, or intergenic regions. A way to visualize this information is via a Venn diagram which we can create using vennRegions() from the derfinderPlot package as shown in Figure 1.

## Find overlaps between regions and summarized genomic annotation
annoRegs <- annotateRegions(ers, genomicState$fullGenome, verbose = FALSE)

library('derfinderPlot')
venn <- vennRegions(annoRegs, counts.col = 'blue',
    main = 'Venn diagram using TxDb.Hsapiens.UCSC.hg19.knownGene annotation')
Venn diagram showing ERs by annotation class.

Figure 1: Venn diagram showing ERs by annotation class

We can also identify the nearest annotated feature. In this case, we’ll look for the nearest known gene from the UCSC hg19 annotation.

## Load database of interest
library('TxDb.Hsapiens.UCSC.hg19.knownGene')
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, 'chr21')

## Find nearest feature
library('bumphunter')
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: locfit
## locfit 1.5-9.1    2013-03-22
genes <- annotateTranscripts(txdb)
## No annotationPackage supplied. Trying org.Hs.eg.db.
## Loading required package: org.Hs.eg.db
## 
## Getting TSS and TSE.
## Getting CSS and CSE.
## Getting exons.
## Annotating genes.
annotation <- matchGenes(ers, subject = genes)

## View annotation results
head(annotation)
##       name
## 1     <NA>
## 2     LIPI
## 3 TMPRSS15
## 4 TMPRSS15
## 5      APP
## 6      APP
##                                                                                                                                                                                                                                                                    annotation
## 1                                                                                                                                                                                                                                                                        <NA>
## 2                                                                                                             NM_001302998 NM_001302999 NM_001303000 NM_001303001 NM_145317 NM_198996 NP_001289927 NP_001289928 NP_001289929 NP_001289930 NP_945347 XM_006723965 XP_006724028
## 3                                                                                             NM_002772 NP_002763 XM_011529654 XM_011529655 XM_011529656 XM_011529657 XM_011529658 XM_011529659 XP_011527956 XP_011527957 XP_011527958 XP_011527959 XP_011527960 XP_011527961
## 4                                                                                             NM_002772 NP_002763 XM_011529654 XM_011529655 XM_011529656 XM_011529657 XM_011529658 XM_011529659 XP_011527956 XP_011527957 XP_011527958 XP_011527959 XP_011527960 XP_011527961
## 5 NM_000484 NM_001136016 NM_001136129 NM_001136130 NM_001136131 NM_001204301 NM_001204302 NM_001204303 NM_201413 NM_201414 NP_000475 NP_001129488 NP_001129601 NP_001129602 NP_001129603 NP_001191230 NP_001191231 NP_001191232 NP_958816 NP_958817 XM_024452075 XP_024307843
## 6 NM_000484 NM_001136016 NM_001136129 NM_001136130 NM_001136131 NM_001204301 NM_001204302 NM_001204303 NM_201413 NM_201414 NP_000475 NP_001129488 NP_001129601 NP_001129602 NP_001129603 NP_001191230 NP_001191231 NP_001191232 NP_958816 NP_958817 XM_024452075 XP_024307843
##   description      region distance   subregion insideDistance exonnumber
## 1 close to 3' close to 3'      815        <NA>             NA         NA
## 2  downstream  downstream   125774        <NA>             NA         NA
## 3    upstream    upstream   454170        <NA>             NA         NA
## 4    upstream    upstream   454475        <NA>             NA         NA
## 5 inside exon      inside   289903 inside exon              0         16
## 6 inside exon      inside   289899 inside exon              0         16
##   nexons   UTR strand  geneL codingL    Geneid subjectHits
## 1      1  <NA>      +     60      NA 100500815          15
## 2      8  <NA>      - 102077  101886    149998         149
## 3     25  <NA>      - 134537  133653      5651         579
## 4     25  <NA>      - 134537  133653      5651         579
## 5     16 3'UTR      - 290585  230434       351         354
## 6     16 3'UTR      - 290585  230434       351         354
## You can use derfinderPlot::plotOverview() to visualize this information

We can check the base-level coverage information for some of our DERs. In this example we do so for the first 5 ERs (Figures ??, ??, ??, ??, ??).

## Extract the region coverage
regionCov <- regionMat$chr21$bpCoverage
plotRegionCoverage(regions = ers, regionCoverage = regionCov, 
    groupInfo = pheno$group, nearestAnnotation = annotation, 
    annotatedRegions = annoRegs, whichRegions = seq_len(5), txdb = txdb,
    scalefac = 1, ask = FALSE, verbose = FALSE)