Contents

1 Downloading the files

We used the RNA-Seq data from the publication by Brooks et al. (Brooks et al. 2010). The experiment investigated the effect of siRNA knock-down of pasilla, a gene that is known to bind to mRNA in the spliceosome, and which is thought to be involved in the regulation of splicing. The data set contains 3 biological replicates of the knockdown as well as 4 biological replicates for the untreated control. The data files are available from the NCBI Gene Expression Omnibus under the accession GSE18508. The read sequences in FASTQ format were extracted from the NCBI short read archive file (.sra files) using the SRA toolkit.

2 Read alignment and filtering

The reads in the FASTQ files were aligned using tophat version 1.2.0 with default parameters against the reference Drosophila melanogaster genome. The following table summarizes the read number and alignment statistics. The column exon.counts refers to the number of reads that could be uniquely aligned to an exon.

dataFilesDir = system.file("extdata", package = "pasilla", mustWork=TRUE)
pasillaSampleAnno = read.csv(file.path(dataFilesDir, "pasilla_sample_annotation.csv"))
pasillaSampleAnno
##           file condition        type number.of.lanes total.number.of.reads
## 1   treated1fb   treated single-read               5              35158667
## 2   treated2fb   treated  paired-end               2         12242535 (x2)
## 3   treated3fb   treated  paired-end               2         12443664 (x2)
## 4 untreated1fb untreated single-read               2              17812866
## 5 untreated2fb untreated single-read               6              34284521
## 6 untreated3fb untreated  paired-end               2         10542625 (x2)
## 7 untreated4fb untreated  paired-end               2         12214974 (x2)
##   exon.counts
## 1    15679615
## 2    15620018
## 3    12733865
## 4    14924838
## 5    20764558
## 6    10283129
## 7    11653031

The reference genome fasta files were obtained from the Ensembl ftp server. We ran bowtie-build to index the fasta file. For more information on this procedure see the bowtie webpage. The indexed form is required by bowtie, and thus tophat.

wget ftp://ftp.ensembl.org/pub/release-62/fasta/drosophila_melanogaster/ \
dna/Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz

gunzip Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz
bowtie-build Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa \
    d_melanogaster_BDGP5.25.62

We generated the alignment BAM file using tophat. For the single-reads data:

tophat bowtie_index <filename>1.fastq,<filename>2.fastq,...,<filename>N.fastq

For the paired-end data:

tophat -r inner-fragment-size bowtie_index \
    <filename>1_1.fastq,<filename>2_1.fastq,...,<filename>N_1.fastq \
    <filename>1_2.fastq,<filename>2_2.fastq,...,<filename>N_2.fastq

The SAM alignment files from which pasilla was generated are available at this URL.

3 Exon count files

To generate the per-exon read counts, we first needed to define the exonic regions.
To this end, we downloaded the file Drosophila_melanogaster.BDGP5.25.62.gtf.gz from Ensembl. The script dexseq_prepare_annotation.py contained in the DEXSeq package was used to extract the exons of the transcripts from the file, define new non-overlapping exonic regions and reformat it to create the file Dmel.BDGP5.25.62.DEXSeq.chr.gff contained in pasilla/extdata. For example, for this file we ran:

wget ftp://ftp.ensembl.org/pub/release-62/gtf/ \
drosophila_melanogaster/Drosophila_melanogaster.BDGP5.25.62.gtf.gz

gunzip Drosophila_melanogaster.BDGP5.25.62.gtf.gz
python dexseq_prepare_annotation.py Drosophila_melanogaster.BDGP5.25.62.gtf \
    Dmel.BDGP5.25.62.DEXSeq.chr.gff

To count the reads that fell into each non-overlapping exonic part, the script dexseq_count.py, which is also contained in the DEXSeq package, was used. It took the alignment results in the form of a SAM file (sorted by position in the case of a paired end data) and the gtf file Dmel.BDGP5.25.62.DEXSeq.chr.gff and returned one file for each biological replicate with the exon counts. For example, for the file treated1.bam, which contained single-end alignments, we ran:

samtools index treated1.bam
samtools view treated1.bam > treated1.sam
python dexseq_count.py Dmel.BDGP5.25.62.DEXSeq.chr.gff \
    treated1.sam treated1fb.txt

For the file treated2.bam, which contained paired-end alignments:

samtools index treated2.bam
samtools view treated2.bam > treated2.sam
sort -k1,1 -k2,2n treated2.sam > treated2_sorted.sam
python dexseq_count.py -p yes Dmel.BDGP5.25.62.DEXSeq.chr.gff \
    treated2_sorted.sam treated2fb.txt

The output of the two HTSeq python scripts is provided in the pasilla package:

dir(system.file("extdata", package = "pasilla", mustWork=TRUE), pattern = ".txt$")
## [1] "geneIDsinsubset.txt" "treated1fb.txt"      "treated2fb.txt"     
## [4] "treated3fb.txt"      "untreated1fb.txt"    "untreated2fb.txt"   
## [7] "untreated3fb.txt"    "untreated4fb.txt"

The Python scripts are built upon the HTSeq library.

4 Creation of the DEXSeqDataSet dxd

To create a DEXSeqDataSet object, we need the count files, the sample annotations, and the GFF file with the per exon annotation.

gffFile = file.path(dataFilesDir, "Dmel.BDGP5.25.62.DEXSeq.chr.gff")

With these, we could call the function DEXSeqDataSet to construct the object dxd.

library("DEXSeq")
dxd = DEXSeqDataSetFromHTSeq(
  countfiles = file.path(dataFilesDir, paste(pasillaSampleAnno$file, "txt", sep=".")), 
  sampleData = pasillaSampleAnno,
  design= ~ sample + exon + condition:exon,
  flattenedfile = gffFile)
dxd 
## class: DEXSeqDataSet 
## dim: 70463 14 
## metadata(1): version
## assays(1): counts
## rownames(70463): FBgn0000003:E001 FBgn0000008:E001 ...
##   FBgn0261575:E001 FBgn0261575:E002
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
##   transcripts
## colnames: NULL
## colData names(8): sample file ... exon.counts exon

We only wanted to work with data from a subset of genes, which was defined in the following file.

genesforsubset = readLines(file.path(dataFilesDir, "geneIDsinsubset.txt"))
dxd = dxd[geneIDs( dxd ) %in% genesforsubset,]

We saved the R objects in the data directory of the package:

save(dxd, file = file.path("..", "data", "pasillaDEXSeqDataSet.RData"))

5 SessionInfo

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] DEXSeq_1.24.0              RColorBrewer_1.1-2        
##  [3] AnnotationDbi_1.40.0       DESeq2_1.18.0             
##  [5] SummarizedExperiment_1.8.0 DelayedArray_0.4.0        
##  [7] matrixStats_0.52.2         GenomicRanges_1.30.0      
##  [9] GenomeInfoDb_1.14.0        IRanges_2.12.0            
## [11] S4Vectors_0.16.0           Biobase_2.38.0            
## [13] BiocGenerics_0.24.0        BiocParallel_1.12.0       
## [15] BiocStyle_2.6.0           
## 
## loaded via a namespace (and not attached):
##  [1] bit64_0.9-7             splines_3.4.2           assertthat_0.2.0       
##  [4] Formula_1.2-2           statmod_1.4.30          latticeExtra_0.6-28    
##  [7] blob_1.1.0              Rsamtools_1.30.0        GenomeInfoDbData_0.99.1
## [10] progress_1.1.2          yaml_2.1.14             RSQLite_2.0            
## [13] backports_1.1.1         lattice_0.20-35         digest_0.6.12          
## [16] XVector_0.18.0          checkmate_1.8.5         colorspace_1.3-2       
## [19] htmltools_0.3.6         Matrix_1.2-11           plyr_1.8.4             
## [22] XML_3.98-1.9            biomaRt_2.34.0          genefilter_1.60.0      
## [25] bookdown_0.5            zlibbioc_1.24.0         xtable_1.8-2           
## [28] scales_0.5.0            htmlTable_1.9           tibble_1.3.4           
## [31] annotate_1.56.0         ggplot2_2.2.1           nnet_7.3-12            
## [34] lazyeval_0.2.1          survival_2.41-3         magrittr_1.5           
## [37] memoise_1.1.0           evaluate_0.10.1         hwriter_1.3.2          
## [40] foreign_0.8-69          prettyunits_1.0.2       tools_3.4.2            
## [43] data.table_1.10.4-3     stringr_1.2.0           munsell_0.4.3          
## [46] locfit_1.5-9.1          cluster_2.0.6           Biostrings_2.46.0      
## [49] compiler_3.4.2          rlang_0.1.2             grid_3.4.2             
## [52] RCurl_1.95-4.8          htmlwidgets_0.9         bitops_1.0-6           
## [55] base64enc_0.1-3         rmarkdown_1.6           gtable_0.2.0           
## [58] codetools_0.2-15        DBI_0.7                 R6_2.2.2               
## [61] gridExtra_2.3           knitr_1.17              bit_1.1-12             
## [64] Hmisc_4.0-3             rprojroot_1.2           stringi_1.1.5          
## [67] Rcpp_0.12.13            geneplotter_1.56.0      rpart_4.1-11           
## [70] acepack_1.4.1

References

Brooks, A. N., L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R. Graveley. 2010. “Conservation of an RNA regulatory map between Drosophila and mammals.” Genome Research, October, 193–202. doi:10.1101/gr.108662.110.