derfinder 1.16.1
R
is an open-source statistical environment which can be easily modified to enhance its functionality via packages. derfinder is a R
package available via the Bioconductor repository for packages. R
can be installed on any operating system from CRAN after which you can install derfinder by using the following commands in your R
session:
install.packages("BiocManager")
BiocManager::install("derfinder")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
derfinder is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. That is, packages like Rsamtools, GenomicAlignments and rtracklayer that allow you to import the data. A derfinder user is not expected to deal with those packages directly but will need to be familiar with GenomicRanges to understand the results derfinder generates. It might also prove to be highly beneficial to check the BiocParallel package for performing parallel computations.
If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.
As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R
and Bioconductor
have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the derfinder
tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
We would like to highlight the derfinder user Jessica Hekman. She has used derfinder with non-human data, and in the process of doing so discovered some small bugs or sections of the documentation that were not clear.
We hope that derfinder will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
## Citation info
citation('derfinder')
##
## Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead
## B, Irizarry RA, Leek JT, Jaffe AE (2017). "Flexible expressed region
## analysis for RNA-seq with derfinder." _Nucl. Acids Res._. doi:
## 10.1093/nar/gkw852 (URL: http://doi.org/10.1093/nar/gkw852), <URL:
## http://nar.oxfordjournals.org/content/early/2016/09/29/nar.gkw852>.
##
## Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014).
## "Differential expression analysis of RNA-seq data at single-base
## resolution." _Biostatistics_, *15 (3)*, 413-426. doi:
## 10.1093/biostatistics/kxt053 (URL:
## http://doi.org/10.1093/biostatistics/kxt053), <URL:
## http://biostatistics.oxfordjournals.org/content/15/3/413.long>.
##
## Collado-Torres L, Jaffe AE, Leek JT (2017). _derfinder:
## Annotation-agnostic differential expression analysis of RNA-seq data
## at base-pair resolution via the DER Finder approach_. doi:
## 10.18129/B9.bioc.derfinder (URL:
## http://doi.org/10.18129/B9.bioc.derfinder),
## https://github.com/lcolladotor/derfinder - R package version 1.16.1,
## <URL: http://www.bioconductor.org/packages/derfinder>.
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
Here is a very quick example of a DER Finder analysis. This analysis is explained in more detail later on in this document.
## Load libraries
library('derfinder')
library('derfinderData')
library('GenomicRanges')
## Determine the files to use and fix the names
files <- rawFiles(system.file('extdata', 'AMY', package = 'derfinderData'),
samplepatt = 'bw', fileterm = NULL)
names(files) <- gsub('.bw', '', names(files))
## Load the data from disk -- On Windows you have to load data from Bam files
fullCov <- fullCoverage(files = files, chrs = 'chr21', verbose = FALSE)
## Get the region matrix of Expressed Regions (ERs)
regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, verbose = FALSE)
## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == 'AMY')
## Identify which ERs are differentially expressed, that is, find the DERs
library('DESeq2')
## Round matrix
counts <- round(regionMat$chr21$coverageMatrix)
## Round matrix and specify design
dse <- DESeqDataSetFromMatrix(counts, pheno, ~ group + gender)
## Perform DE analysis
dse <- DESeq(dse, test = 'LRT', reduced = ~ gender, fitType = 'local')
## Extract results
mcols(regionMat$chr21$regions) <- c(mcols(regionMat$chr21$regions), results(dse))
## Save info in an object with a shorter name
ers <- regionMat$chr21$regions
ers
derfinder is an R package that implements the DER Finder approach (Frazee, Sabunciyan, Hansen, Irizarry, et al., 2014) for RNA-seq data. Briefly, this approach is an alternative to feature-counting and transcript assembly. The basic idea is to identify contiguous base-pairs in the genome that present differential expression signal. These base-pairs are grouped into _d_ifferentially _e_xpressed _r_regions (DERs). This approach is annotation-agnostic which is a feature you might be interested in. In particular, derfinder contains functions that allow you to identify DERs via two alternative methods. You can find more details on the full derfinder users guide.
This is a brief overview of what a DER Finder analysis looks like. In particular, here we will be identifying expressed regions (ERs) without relying on annotation. Next, we’ll identify candidate differentially expressed regions (DERs). Finally, we’ll compare the DERs with known annotation features.
We first load the required packages.
## Load libraries
library('derfinder')
library('derfinderData')
library('GenomicRanges')
Next, we need to locate the chromosome 21 coverage files for a set of 12 samples. These samples are a small subset from the BrainSpan Atlas of the Human Brain (BrainSpan, 2011) publicly available data. The function rawFiles()
helps us in locating these files.
## Determine the files to use and fix the names
files <- rawFiles(system.file('extdata', 'AMY', package = 'derfinderData'),
samplepatt = 'bw', fileterm = NULL)
names(files) <- gsub('.bw', '', names(files))
Next, we can load the full coverage data into memory using the fullCoverage()
function. Note that the BrainSpan data is already normalized by the total number of mapped reads in each sample. However, that won’t be the case with most data sets in which case you might want to use the totalMapped
and targetSize
arguments. The function getTotalMapped()
will be helpful to get this information.
## Load the data from disk
fullCov <- fullCoverage(files = files, chrs = 'chr21', verbose = FALSE,
totalMapped = rep(1, length(files)), targetSize = 1)
Now that we have the data, we can identify expressed regions (ERs) by using a cutoff of 30 on the base-level mean coverage from these 12 samples. Once the regions have been identified, we can calculate a coverage matrix with one row per ER and one column per sample (12 in this case). For doing this calculation we need to know the length of the sequence reads, which in this study were 76 bp long.
## Get the region matrix of Expressed Regions (ERs)
regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, verbose = FALSE)
regionMatrix()
returns a list of elements, each with three pieces of output. The actual ERs are arranged in a GRanges
object named regions
.
## regions output
regionMat$chr21$regions
## GRanges object with 45 ranges and 6 metadata columns:
## seqnames ranges strand | value area
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## 1 chr21 9827018-9827582 * | 313.671741938907 177224.534195483
## 2 chr21 15457301-15457438 * | 215.084607424943 29681.6758246422
## 3 chr21 20230140-20230192 * | 38.8325471608144 2058.12499952316
## 4 chr21 20230445-20230505 * | 41.3245082031834 2520.79500039419
## 5 chr21 27253318-27253543 * | 34.9131305372469 7890.3675014178
## .. ... ... ... . ... ...
## 41 chr21 33039644-33039688 * | 34.4705371008979 1551.17416954041
## 42 chr21 33040784-33040798 * | 32.134222178989 482.013332684835
## 43 chr21 33040890 * | 30.0925002098083 30.0925002098083
## 44 chr21 33040900-33040901 * | 30.1208333174388 60.2416666348775
## 45 chr21 48019401-48019414 * | 31.1489284609755 436.084998453657
## indexStart indexEnd cluster clusterL
## <integer> <integer> <Rle> <Rle>
## 1 1 565 1 565
## 2 566 703 2 138
## 3 704 756 3 366
## 4 757 817 3 366
## 5 818 1043 4 765
## .. ... ... ... ...
## 41 2180 2224 17 45
## 42 2225 2239 18 118
## 43 2240 2240 18 118
## 44 2241 2242 18 118
## 45 2243 2256 19 14
## -------
## seqinfo: 1 sequence from an unspecified genome
bpCoverage
is the base-level coverage list which can then be used for plotting.
## Base-level coverage matrices for each of the regions
## Useful for plotting
lapply(regionMat$chr21$bpCoverage[1:2], head, n = 2)
## $`1`
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159 HSB178
## 1 93.20 3.32 28.22 5.62 185.17 98.34 5.88 16.71 3.52 15.71
## 2 124.76 7.25 63.68 11.32 374.85 199.28 10.39 30.53 5.83 29.35
## HSB92 HSB97
## 1 47.40 36.54
## 2 65.04 51.42
##
## $`2`
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136 HSB145 HSB153 HSB159 HSB178
## 566 45.59 7.94 15.92 34.75 141.61 104.21 19.87 38.61 4.97 23.2
## 567 45.59 7.94 15.92 35.15 141.64 104.30 19.87 38.65 4.97 23.2
## HSB92 HSB97
## 566 13.95 22.21
## 567 13.95 22.21
## Check dimensions. First region is 565 long, second one is 138 bp long.
## The columns match the number of samples (12 in this case).
lapply(regionMat$chr21$bpCoverage[1:2], dim)
## $`1`
## [1] 565 12
##
## $`2`
## [1] 138 12
The end result of the coverage matrix is shown below. Note that the coverage has been adjusted for read length. Because reads might not fully align inside a given region, the numbers are generally not integers but can be rounded if needed.
## Dimensions of the coverage matrix
dim(regionMat$chr21$coverageMatrix)
## [1] 45 12
## Coverage for each region. This matrix can then be used with limma or other pkgs
head(regionMat$chr21$coverageMatrix)
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136
## 1 3653.1093346 277.072106 1397.068687 1106.722895 8987.460401 5570.221054
## 2 333.3740816 99.987237 463.909476 267.354342 1198.713552 1162.313418
## 3 35.3828948 20.153553 30.725394 23.483947 16.786842 17.168947
## 4 42.3398681 29.931579 41.094474 24.724736 32.634080 19.309606
## 5 77.7402631 168.939342 115.059342 171.861974 180.638684 93.503158
## 6 0.7988158 1.770263 1.473421 2.231053 1.697368 1.007895
## HSB145 HSB153 HSB159 HSB178 HSB92 HSB97
## 1 1330.158818 1461.2986829 297.939342 1407.288552 1168.519079 1325.9622371
## 2 257.114210 313.8513139 67.940131 193.695657 127.543553 200.7834228
## 3 22.895921 52.8756585 28.145395 33.127368 23.758816 20.4623685
## 4 33.802632 51.6146040 31.244343 33.576974 29.546183 28.2011836
## 5 90.950526 36.3046051 78.069605 97.151316 100.085790 35.5428946
## 6 1.171316 0.4221053 1.000132 1.139079 1.136447 0.3956579
We can then use the coverage matrix and packages such as limma, DESeq2 or edgeR to identify which ERs are differentially expressed. Here we’ll use DESeq2 for which we need some phenotype data.
## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == 'AMY')
Now we can identify the DERs using a rounded version of the coverage matrix.
## Identify which ERs are differentially expressed, that is, find the DERs
library('DESeq2')
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## Loading required package: BiocParallel
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
##
## aperm, apply
## Round matrix
counts <- round(regionMat$chr21$coverageMatrix)
## Round matrix and specify design
dse <- DESeqDataSetFromMatrix(counts, pheno, ~ group + gender)
## converting counts to integer mode
## Perform DE analysis
dse <- DESeq(dse, test = 'LRT', reduced = ~ gender, fitType = 'local')
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## Extract results
mcols(regionMat$chr21$regions) <- c(mcols(regionMat$chr21$regions),
results(dse))
## Save info in an object with a shorter name
ers <- regionMat$chr21$regions
ers
## GRanges object with 45 ranges and 12 metadata columns:
## seqnames ranges strand | value area
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## 1 chr21 9827018-9827582 * | 313.671741938907 177224.534195483
## 2 chr21 15457301-15457438 * | 215.084607424943 29681.6758246422
## 3 chr21 20230140-20230192 * | 38.8325471608144 2058.12499952316
## 4 chr21 20230445-20230505 * | 41.3245082031834 2520.79500039419
## 5 chr21 27253318-27253543 * | 34.9131305372469 7890.3675014178
## .. ... ... ... . ... ...
## 41 chr21 33039644-33039688 * | 34.4705371008979 1551.17416954041
## 42 chr21 33040784-33040798 * | 32.134222178989 482.013332684835
## 43 chr21 33040890 * | 30.0925002098083 30.0925002098083
## 44 chr21 33040900-33040901 * | 30.1208333174388 60.2416666348775
## 45 chr21 48019401-48019414 * | 31.1489284609755 436.084998453657
## indexStart indexEnd cluster clusterL baseMean
## <integer> <integer> <Rle> <Rle> <numeric>
## 1 1 565 1 565 2846.28716155756
## 2 566 703 2 138 451.519643002767
## 3 704 756 3 366 29.5780895864663
## 4 757 817 3 366 36.0602688380417
## 5 818 1043 4 765 101.646763062388
## .. ... ... ... ... ...
## 41 2180 2224 17 45 20.7820346002786
## 42 2225 2239 18 118 6.41054227728292
## 43 2240 2240 18 118 0.129716561731455
## 44 2241 2242 18 118 0.702291488002413
## 45 2243 2256 19 14 5.29329263119445
## log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric>
## 1 -1.69031816472548 0.831958703669481 0.215261662919431
## 2 -1.16404264400518 0.757489962635837 0.87112644422632
## 3 0.0461487746156371 0.458096633387309 3.13208182520646
## 4 -0.186620018986543 0.390920460570196 2.22570758632982
## 5 -0.138737745290957 0.320166382651628 3.95798720528119
## .. ... ... ...
## 41 -0.64205595270105 0.4276611764306 0.604781356173675
## 42 -0.634320864108627 0.512261897770782 0.545403917207132
## 43 -0.859548956942508 3.11653965011299 0.0206273206873746
## 44 -0.628284603274911 2.24737800886771 0.582510548110115
## 45 -1.69456340588639 1.25228952015329 5.78959103593262
## pvalue padj
## <numeric> <numeric>
## 1 0.642674256598179 0.997155030203294
## 2 0.350643649949685 0.997155030203294
## 3 0.0767656530295047 0.863613596581927
## 4 0.135730461463207 0.997155030203294
## 5 0.0466494518014194 0.8620396310466
## .. ... ...
## 41 0.43675951534617 0.997155030203294
## 42 0.460201756145957 0.997155030203294
## 43 0.885798852352319 0.997155030203294
## 44 0.445329945344273 0.997155030203294
## 45 0.0161213391338242 0.725460261022089
## -------
## seqinfo: 1 sequence from an unspecified genome
We can then compare the DERs against known annotation to see which DERs overlap known exons, introns, or intergenic regions. A way to visualize this information is via a Venn diagram which we can create using vennRegions()
from the derfinderPlot package as shown in Figure 1.
## Find overlaps between regions and summarized genomic annotation
annoRegs <- annotateRegions(ers, genomicState$fullGenome, verbose = FALSE)
library('derfinderPlot')
venn <- vennRegions(annoRegs, counts.col = 'blue',
main = 'Venn diagram using TxDb.Hsapiens.UCSC.hg19.knownGene annotation')
We can also identify the nearest annotated feature. In this case, we’ll look for the nearest known gene from the UCSC hg19 annotation.
## Load database of interest
library('TxDb.Hsapiens.UCSC.hg19.knownGene')
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, 'chr21')
## Find nearest feature
library('bumphunter')
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: locfit
## locfit 1.5-9.1 2013-03-22
genes <- annotateTranscripts(txdb)
## No annotationPackage supplied. Trying org.Hs.eg.db.
## Loading required package: org.Hs.eg.db
##
## Getting TSS and TSE.
## Getting CSS and CSE.
## Getting exons.
## Annotating genes.
annotation <- matchGenes(ers, subject = genes)
## View annotation results
head(annotation)
## name
## 1 <NA>
## 2 LIPI
## 3 TMPRSS15
## 4 TMPRSS15
## 5 APP
## 6 APP
## annotation
## 1 <NA>
## 2 NM_001302998 NM_001302999 NM_001303000 NM_001303001 NM_145317 NM_198996 NP_001289927 NP_001289928 NP_001289929 NP_001289930 NP_945347 XM_006723965 XP_006724028
## 3 NM_002772 NP_002763 XM_011529654 XM_011529655 XM_011529656 XM_011529657 XM_011529658 XM_011529659 XP_011527956 XP_011527957 XP_011527958 XP_011527959 XP_011527960 XP_011527961
## 4 NM_002772 NP_002763 XM_011529654 XM_011529655 XM_011529656 XM_011529657 XM_011529658 XM_011529659 XP_011527956 XP_011527957 XP_011527958 XP_011527959 XP_011527960 XP_011527961
## 5 NM_000484 NM_001136016 NM_001136129 NM_001136130 NM_001136131 NM_001204301 NM_001204302 NM_001204303 NM_201413 NM_201414 NP_000475 NP_001129488 NP_001129601 NP_001129602 NP_001129603 NP_001191230 NP_001191231 NP_001191232 NP_958816 NP_958817 XM_024452075 XP_024307843
## 6 NM_000484 NM_001136016 NM_001136129 NM_001136130 NM_001136131 NM_001204301 NM_001204302 NM_001204303 NM_201413 NM_201414 NP_000475 NP_001129488 NP_001129601 NP_001129602 NP_001129603 NP_001191230 NP_001191231 NP_001191232 NP_958816 NP_958817 XM_024452075 XP_024307843
## description region distance subregion insideDistance exonnumber
## 1 close to 3' close to 3' 815 <NA> NA NA
## 2 downstream downstream 125774 <NA> NA NA
## 3 upstream upstream 454170 <NA> NA NA
## 4 upstream upstream 454475 <NA> NA NA
## 5 inside exon inside 289903 inside exon 0 16
## 6 inside exon inside 289899 inside exon 0 16
## nexons UTR strand geneL codingL Geneid subjectHits
## 1 1 <NA> + 60 NA 100500815 15
## 2 8 <NA> - 102077 101886 149998 149
## 3 25 <NA> - 134537 133653 5651 579
## 4 25 <NA> - 134537 133653 5651 579
## 5 16 3'UTR - 290585 230434 351 354
## 6 16 3'UTR - 290585 230434 351 354
## You can use derfinderPlot::plotOverview() to visualize this information
We can check the base-level coverage information for some of our DERs. In this example we do so for the first 5 ERs (Figures ??, ??, ??, ??, ??).
## Extract the region coverage
regionCov <- regionMat$chr21$bpCoverage
plotRegionCoverage(regions = ers, regionCoverage = regionCov,
groupInfo = pheno$group, nearestAnnotation = annotation,
annotatedRegions = annoRegs, whichRegions = seq_len(5), txdb = txdb,
scalefac = 1, ask = FALSE, verbose = FALSE)