Airway smooth muscle cells

Here we provide the code which was used to contruct the RangedSummarizedExperiment object of the airway experiment data package. The experiment citation is:

Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.

From the abstract, a brief description of the RNA-Seq experiment on airway smooth muscle (ASM) cell lines: “Using RNA-Seq, a high-throughput sequencing method, we characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone - a potent synthetic glucocorticoid (1 micromolar for 18 hours).”

Obtaining sample information from GEO

The following code chunk obtains the sample information from the series matrix file downloaded from GEO. The columns are then parsed and new columns with shorter names and factor levels are added.

suppressPackageStartupMessages( library( "GEOquery" ) )
suppressPackageStartupMessages( library( "airway" ) )
dir <- system.file("extdata",package="airway")
geofile <- file.path(dir, "GSE52778_series_matrix.txt")
gse <- getGEO(filename=geofile)
## File stored at:
## /tmp/Rtmpasy3Nt/GPL11154.soft
pdata <- pData(gse)[,grepl("characteristics",names(pData(gse)))]
names(pdata) <- c("treatment","tissue","ercc_mix","cell","celltype")
pdataclean <- data.frame(treatment=sub("treatment: (.*)","\\1",pdata$treatment),
                         cell=sub("cell line: (.*)","\\1",pdata$cell),
                         row.names=rownames(pdata))
pdataclean$dex <- ifelse(grepl("Dex",pdataclean$treatment),"trt","untrt")
pdataclean$albut <- ifelse(grepl("Albut",pdataclean$treatment),"trt","untrt")
pdataclean$SampleName <- rownames(pdataclean)
pdataclean$treatment <- NULL

The information which connects the sample information from GEO with the SRA run id is downloaded from SRA using the Send to: File button.

srafile <- file.path(dir, "SraRunInfo_SRP033351.csv")
srp <- read.csv(srafile)
srpsmall <- srp[,c("Run","avgLength","Experiment","Sample","BioSample","SampleName")]

These two data.frames are merged and then we subset to only the samples not treated with albuterol (these samples were not included in the analysis of the publication).

coldata <- merge(pdataclean, srpsmall, by="SampleName")
rownames(coldata) <- coldata$Run
coldata <- coldata[coldata$albut == "untrt",]
coldata$albut <- NULL
coldata
##            SampleName    cell   dex        Run avgLength Experiment
## SRR1039508 GSM1275862  N61311 untrt SRR1039508       126  SRX384345
## SRR1039509 GSM1275863  N61311   trt SRR1039509       126  SRX384346
## SRR1039512 GSM1275866 N052611 untrt SRR1039512       126  SRX384349
## SRR1039513 GSM1275867 N052611   trt SRR1039513        87  SRX384350
## SRR1039516 GSM1275870 N080611 untrt SRR1039516       120  SRX384353
## SRR1039517 GSM1275871 N080611   trt SRR1039517       126  SRX384354
## SRR1039520 GSM1275874 N061011 untrt SRR1039520       101  SRX384357
## SRR1039521 GSM1275875 N061011   trt SRR1039521        98  SRX384358
##               Sample    BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677

Finally, the sample table was saved to a CSV file for future reference. This file is included in the inst/extdata directory of this package.

write.csv(coldata, file="sample_table.csv")

Downloading FASTQ files from SRA

A file containing the SRA run numbers was created: files. This file was used to download the sequenced reads from the SRA using wget. The following command was used to extract the FASTQ file from the .sra files, using the SRA Toolkit

cat files | parallel -j 7 fastq-dump --split-files {}.sra

Aligning reads

The reads were aligned using the STAR read aligner to GRCh37 using the annotations from Ensembl release 75.

for f in `cat files`; do STAR --genomeDir ../STAR/ENSEMBL.homo_sapiens.release-75 \
--readFilesIn fastq/$f\_1.fastq fastq/$f\_2.fastq \
--runThreadN 12 --outFileNamePrefix aligned/$f.; done

SAMtools was used to generate BAM files.

cat files | parallel -j 7 samtools view -bS aligned/{}.Aligned.out.sam -o aligned/{}.bam

Counting reads

A transcript database for the homo sapiens Ensembl genes was obtained from Biomart.

library( "GenomicFeatures" )
txdb <- makeTranscriptDbFromBiomart( biomart="ensembl", dataset="hsapiens_gene_ensembl")
exonsByGene <- exonsBy( txdb, by="gene" )

The BAM files were specified using the SRR id from the SRA. A yield size of 2 million reads was used to cap the memory used during read counting.

sampleTable <- read.csv( "sample_table.csv", row.names=1 )
fls <- file.path("aligned",rownames(sampleTable), ".bam")
library( "Rsamtools" )
bamLst <- BamFileList( fls, yieldSize=2000000 )

The following summarizeOverlaps call distributed the 8 paired-end BAM files to 8 workers. This used a maximum of 16 Gb per worker and the time elapsed was 50 minutes.

library( "BiocParallel" )
register( MulticoreParam( workers=8 ) )
library( "GenomicAlignments" )
airway <- summarizeOverlaps( features=exonsByGene, reads=bamLst,
                            mode="Union", singleEnd=FALSE,
                            ignore.strand=TRUE, fragments=TRUE )

The sample information was then added as column data.

colData(airway) <- DataFrame( sampleTable )

Finally, we attached the MIAME information using the Pubmed ID.

library( "annotate" )
miame <- list(pmid2MIAME("24926665"))
miame[[1]]@url <- "http://www.ncbi.nlm.nih.gov/pubmed/24926665"
# because R's CHECK doesn't like non-ASCII characters in data objects
# or in vignettes. the actual char was used in the first argument
miame[[1]]@abstract <- gsub("micro","micro",abstract(miame[[1]]))
miame[[1]]@abstract <- gsub("beta","beta",abstract(miame[[1]]))
metadata(airway) <- miame
save(airway, file="airway.RData")

Information on the RangedSummarizedExperiment

Below we print out some basic summary statistics on the airway object which is provided by this experiment data package.

library("airway")
data(airway)
airway
## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
as.data.frame(colData(airway))
##            SampleName    cell   dex albut        Run avgLength Experiment
## SRR1039508 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345
## SRR1039509 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349
## SRR1039513 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353
## SRR1039517 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357
## SRR1039521 GSM1275875 N061011   trt untrt SRR1039521        98  SRX384358
##               Sample    BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677
summary(colSums(assay(airway))/1e6)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.16   19.05   20.90   21.94   24.67   30.82
metadata(rowRanges(airway))
## $genomeInfo
## $genomeInfo$`Db type`
## [1] "TranscriptDb"
## 
## $genomeInfo$`Supporting package`
## [1] "GenomicFeatures"
## 
## $genomeInfo$`Data source`
## [1] "BioMart"
## 
## $genomeInfo$Organism
## [1] "Homo sapiens"
## 
## $genomeInfo$`Resource URL`
## [1] "www.biomart.org:80"
## 
## $genomeInfo$`BioMart database`
## [1] "ensembl"
## 
## $genomeInfo$`BioMart database version`
## [1] "ENSEMBL GENES 75 (SANGER UK)"
## 
## $genomeInfo$`BioMart dataset`
## [1] "hsapiens_gene_ensembl"
## 
## $genomeInfo$`BioMart dataset description`
## [1] "Homo sapiens genes (GRCh37.p13)"
## 
## $genomeInfo$`BioMart dataset version`
## [1] "GRCh37.p13"
## 
## $genomeInfo$`Full dataset`
## [1] "yes"
## 
## $genomeInfo$`miRBase build ID`
## [1] NA
## 
## $genomeInfo$transcript_nrow
## [1] "215647"
## 
## $genomeInfo$exon_nrow
## [1] "745593"
## 
## $genomeInfo$cds_nrow
## [1] "537555"
## 
## $genomeInfo$`Db created by`
## [1] "GenomicFeatures package from Bioconductor"
## 
## $genomeInfo$`Creation time`
## [1] "2014-07-10 14:55:55 -0400 (Thu, 10 Jul 2014)"
## 
## $genomeInfo$`GenomicFeatures version at creation time`
## [1] "1.17.9"
## 
## $genomeInfo$`RSQLite version at creation time`
## [1] "0.11.4"
## 
## $genomeInfo$DBSCHEMAVERSION
## [1] "1.0"

Session information

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] airway_0.110.0             SummarizedExperiment_1.6.0
##  [3] DelayedArray_0.2.0         matrixStats_0.52.2        
##  [5] GenomicRanges_1.28.0       GenomeInfoDb_1.12.0       
##  [7] IRanges_2.10.0             S4Vectors_0.14.0          
##  [9] GEOquery_2.42.0            Biobase_2.36.0            
## [11] BiocGenerics_0.22.0       
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.15.1            XVector_0.16.0         
##  [3] magrittr_1.5            zlibbioc_1.22.0        
##  [5] lattice_0.20-35         R6_2.2.0               
##  [7] stringr_1.2.0           httr_1.2.1             
##  [9] tools_3.4.0             grid_3.4.0             
## [11] Matrix_1.2-9            GenomeInfoDbData_0.99.0
## [13] bitops_1.0-6            RCurl_1.95-4.8         
## [15] evaluate_0.10           stringi_1.1.5          
## [17] compiler_3.4.0          XML_3.98-1.6