Overview

In this document, we introduce the purpose of the curatedAdipoRNA package, its contents and its potential use cases. This package is a curated dataset of RNA-Seq samples. The samples are MDI-induced pre-adipocytes (3T3-L1) at different time points/stages of differentiation. The package documents the data collection, pre-processing and processing. In addition to the documentation, the package contains the scripts that were used to generate the data in inst/scripts/ and the final RangedSummarizedExperiment object in data/.

Introduction

What is `curatedAdipoRNA`?

It is an R package for documenting and distributing a curated dataset. The package doesn’t contain any R functions.

What is contained in `curatedAdipoRNA`?

The package contains two different things:

Scripts for documenting/reproducing the data in inst/scripts
Final RangedSummarizedExperiment object in data/

What is `curatedAdipoRNA` for?

The RangedSummarizedExperiment object, adipo_counts, contains the gene counts, colData, rowRanges and metadata, which can be used to conduct differential expression or gene set enrichment analysis on the cell line model.

Installation

The curatedAdipoRNA package can be installed from Bioconductor using BiocManager.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("curatedAdipoRNA")

Docker image

The environment that was set up for pre-processing and processing the data is available as a Docker image. This image is also suitable for reproducing this document. The image can be obtained using the docker CLI client.

$ docker pull bcmslab/adiporeg_rna:latest

Generating `curatedAdipoRNA`

Search strategy & data collection

The term “3T3-L1” was used to search the NCBI SRA repository. The results were sent to the run selector, where 1,176 runs were viewed. The runs were faceted by Assay Type and “rna-seq” was selected, which resulted in 323 runs. Only 98 samples from 16 different studies were included after being manually reviewed to fit the following criteria:

* The raw data is available from GEO and has a GEO identifier (GSM#)
* The raw data is linked to a published, publicly available article
* The protocols for generating the data sufficiently describe the origin of the cell line, the differentiation medium and the time points when the samples were collected
* In case the experimental design included treatments other than the differentiation media, only the control (non-treated) samples were included

Note: The data quality and the platform discrepancies are not included in these criteria.
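The inclusion criteria can be pictured as a row filter on a run table. The following is only a minimal sketch: the table, its column names and its values are hypothetical, and the logical flags stand in for the outcome of the manual review, which cannot itself be automated.

```r
# hypothetical run table; column names and values are illustrative only
runs <- data.frame(
  run                = c("SRR_A", "SRR_B", "SRR_C", "SRR_D"),
  assay_type         = c("RNA-Seq", "RNA-Seq", "ChIP-Seq", "RNA-Seq"),
  has_gsm            = c(TRUE, TRUE, TRUE, FALSE),   # GEO identifier available
  has_article        = c(TRUE, TRUE, TRUE, TRUE),    # linked to a published article
  protocol_described = c(TRUE, TRUE, TRUE, TRUE),    # cell line, medium, time points
  treated            = c(FALSE, TRUE, FALSE, FALSE), # treatment other than MDI
  stringsAsFactors   = FALSE
)

# keep RNA-Seq runs that satisfy every criterion and are non-treated controls
keep <- with(runs, assay_type == "RNA-Seq" & has_gsm & has_article &
                   protocol_described & !treated)
included <- runs[keep, ]
included$run
```

In this toy table only one run passes all the criteria.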

Pre-processing

The scripts to download and process the raw data are located in inst/scripts/ and are run sequentially by the GNU make file, Makefile. The following is a description of the recipes in the Makefile with emphasis on the software versions, options, inputs and outputs.

1. Downloading data `download_fastq`

Program: wget (1.18)
Input: run.csv, the URLs column
Output: *.fastq.gz
Options: -N

2. Making a genome index `make_index`

Program: hisat2-build (2.0.5)
Input: URL for mm10 mouse genome fasta files
Output: *.ht2 HISAT2 index for the mouse genome
Options: defaults

3. Downloading annotations `get_annotation`

Program: wget (1.18)
Input: URL for mm10 gene annotation file
Output: annotation.gtf
Options: -N

4. Aligning reads `align_reads`

Program: hisat2 (2.0.5)
Input: *.fastq.gz and mm10/ HISAT2 index for the mouse genome
Output: *.sam
Options: defaults

5. Counting features `count_features`

Program: featureCounts (1.5.1)
Input: *.bam and the annotation gtf file for the mm10 mouse genome.
Output: *.txt
Options: defaults

Quality assessment `fastqc`

Program: fastqc (0.11.5)
Input: *.fastq.gz and *.sam
Output: *_fastqc.zip
Options: defaults
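To make the recipes above more concrete, the sketch below assembles one command line per step for a single, single-end run. It only builds and prints the strings and does not execute anything. The URLs and file names are placeholders, not the ones used in the Makefile, and only the options each tool needs to name its inputs and outputs are shown. The conversion from *.sam to *.bam is not described in the recipes above, so it is left out here.

```r
# placeholder inputs; the real URLs and paths are defined in the package scripts
url   <- "https://example.org/SRR_EXAMPLE.fastq.gz"
fastq <- basename(url)                       # SRR_EXAMPLE.fastq.gz
stem  <- sub("\\.fastq\\.gz$", "", fastq)    # SRR_EXAMPLE

# one command string per recipe, in the order the steps are described
cmds <- c(
  download_fastq = paste("wget -N", url),
  make_index     = paste("hisat2-build", "mm10.fa", "mm10/genome"),
  get_annotation = paste("wget -N", "https://example.org/annotation.gtf"),
  align_reads    = paste("hisat2 -x mm10/genome -U", fastq,
                         "-S", paste0(stem, ".sam")),
  count_features = paste("featureCounts -a annotation.gtf -o",
                         paste0(stem, ".txt"), paste0(stem, ".bam")),
  fastqc         = paste("fastqc", fastq)
)

# print the commands, one per line
cat(cmds, sep = "\n")
```

Paired-end runs would need the two read files passed to hisat2 with -1 and -2 instead of -U.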

Processing

The aim of this step is to construct a self-contained object with minimal manipulation of the pre-processed data, followed by a simple exploration of the data in the next section.

Making the SummarizedExperiment object `make_object`

The required steps to make this object from the pre-processed data are documented in the script and are supposed to be fully reproducible when run through this package. The output is a RangedSummarizedExperiment object containing the gene counts and the phenotype and features data and metadata.

The RangedSummarizedExperiment contains:

* The gene counts matrix gene_counts
* The phenotype data colData
* The feature data rowRanges
* The metadata metadata, which contains a data.frame of the studies from which the samples were collected
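How the four components fit together can be seen by building a miniature object of the same class. This is a toy sketch with made-up genes, samples and values. It is not the package's make_object script, and it assumes the SummarizedExperiment package is installed.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges, IRanges and S4Vectors

# toy counts: 3 genes x 2 samples (values are made up)
counts <- matrix(c(10L, 0L, 5L, 12L, 3L, 7L), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("GSM_A", "GSM_B")))

# feature data: one genomic range per gene, named like the count rows
ranges <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
                  ranges   = IRanges(start = c(100, 500, 200), width = 50),
                  strand   = c("+", "-", "+"),
                  gene_id  = rownames(counts))
names(ranges) <- rownames(counts)

# phenotype data: one row per sample, named like the count columns
pheno <- DataFrame(id = colnames(counts), time = c(0, 24), stage = c(0, 1),
                   row.names = colnames(counts))

# metadata: a list holding a table of (toy) study records
studies <- data.frame(BIBTEXKEY = "toy2020", YEAR = 2020)

toy_se <- SummarizedExperiment(assays    = list(gene_counts = counts),
                               rowRanges = ranges,
                               colData   = pheno,
                               metadata  = list(studies = studies))
toy_se
```

Supplying rowRanges is what makes the result a RangedSummarizedExperiment rather than a plain SummarizedExperiment.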

Exploring the `adipo_counts` object

In this section, we conduct a simple exploration of the data objects to show the content of the package and how they can be loaded and used.

# loading required libraries
library(curatedAdipoRNA)
library(SummarizedExperiment)
library(S4Vectors)
library(fastqcr)
library(DESeq2)
library(dplyr)
library(tidyr)
library(ggplot2)

# load data
data("adipo_counts")

# print object
adipo_counts
#> class: RangedSummarizedExperiment 
#> dim: 23916 98 
#> metadata(1): studies
#> assays(1): gene_counts
#> rownames(23916): 0610005C13Rik 0610007P14Rik ... a l7Rn6
#> rowData names(1): gene_id
#> colnames(98): GSM1224676 GSM1224677 ... GSM873963 GSM873964
#> colData names(14): id study ... instrument_model qc

The count matrix can be accessed using assay. Here we show the first five entries of the first five samples.

# print count matrix
assay(adipo_counts)[1:5, 1:5]
#>               GSM1224676 GSM1224677 GSM1224678 GSM1224679 GSM1224680
#> 0610005C13Rik         45         36         66         76         19
#> 0610007P14Rik       7327       7899       4819       4140       4884
#> 0610009B22Rik       1576       1805       2074       1669       2443
#> 0610009L18Rik        198        161        236        172         51
#> 0610009O20Rik       4195       4713       4996       4663       4133

The phenotype/sample data is a DataFrame, which can be accessed using colData. The time and stage columns encode the time point in hours and the stage of differentiation, respectively.

# names of the coldata object
names(colData(adipo_counts))
#>  [1] "id"               "study"            "pmid"             "time"            
#>  [5] "stage"            "bibtexkey"        "run"              "submission"      
#>  [9] "sample"           "experiment"       "study_name"       "library_layout"  
#> [13] "instrument_model" "qc"

# table of times column
table(colData(adipo_counts)$time)
#> 
#> -96 -48   0   1   2   4   6  10  24  28  48  96 120 144 168 192 240 
#>   1   6  22   1   3   9   1   2   8   1  12   2   1   4  13   8   4

# table of stage column
table(colData(adipo_counts)$stage)
#> 
#>  0  1  2  3 
#> 29 35  4 30

Other columns in colData are selected information about the samples/runs or identifiers to different databases. The following table provides the description of each of these columns.

col_name	description
id	The GEO sample identifier.
study	The SRA study identifier.
pmid	The PubMed ID of the article where the data were published originally.
time	The time point of the sample when collected in hours. The time is recorded from the beginning of the protocol as 0 hours.
stage	The stage of differentiation of the sample when collected. Possible values are 0 to 3; 0 for non-differentiated; 1 for differentiating; and 2/3 for maturating samples.
bibtexkey	The key of the study where the data were published originally. This maps to the studies object of the metadata which records the study information in bibtex format.
run	The SRA run identifier.
submission	The SRA study submission identifier.
sample	The SRA sample identifier.
experiment	The SRA experiment identifier.
study_name	The GEO study/series identifier.
library_layout	The type of RNA library. Possible values are SINGLE for single-end and PAIRED for paired-end runs.
instrument_model	The name of the sequencing machine that was used to obtain the sequence reads.
qc	The quality control output of fastqc on the separate files/runs.

Using the identifiers in colData along with Bioconductor packages such as GEOmetadb and/or SRAdb gives access to the sample metadata as submitted by the authors or recorded in the data repositories.

The features data are a GRanges object and can be accessed using rowRanges.

# print GRanges object
rowRanges(adipo_counts)
#> GRanges object with 23916 ranges and 1 metadata column:
#>                 seqnames              ranges strand |       gene_id
#>                    <Rle>           <IRanges>  <Rle> |   <character>
#>   0610005C13Rik     chr7   45567795-45575176      - | 0610005C13Rik
#>   0610007P14Rik    chr12   85815455-85824545      - | 0610007P14Rik
#>   0610009B22Rik    chr11   51685385-51688634      - | 0610009B22Rik
#>   0610009L18Rik    chr11 120348678-120351190      + | 0610009L18Rik
#>   0610009O20Rik    chr18   38250249-38262629      + | 0610009O20Rik
#>             ...      ...                 ...    ... .           ...
#>             Zyx     chr6   42349828-42360213      + |           Zyx
#>           Zzef1    chr11   72796226-72927120      + |         Zzef1
#>            Zzz3     chr3 152396003-152462826      + |          Zzz3
#>               a     chr2 155013570-155051012      + |             a
#>           l7Rn6     chr7   89918685-89941204      - |         l7Rn6
#>   -------
#>   seqinfo: 35 sequences from an unspecified genome; no seqlengths

qc is a column of colData. It is a list of lists: each entry in the list corresponds to one sample, and each sample has one or more objects of the qc_read class. The reason is that paired-end samples have two separate files, and the fastqc quality control was run on each of them.

# show qc data
adipo_counts$qc
#> List of length 98
#> names(98): GSM1224676 GSM1224677 GSM1224678 ... GSM873962 GSM873963 GSM873964

# show the class of the first entry in qc
class(adipo_counts$qc[[1]][[1]])
#> [1] "list"    "qc_read"

The metadata is a list of one object. studies is a data.frame containing the bibliography information of the studies from which the data were collected. Here we show the first entry in studies.

# print data of first study
metadata(adipo_counts)$studies[1,]
#> # A tibble: 1 x 37
#>   CATEGORY BIBTEXKEY ADDRESS ANNOTE AUTHOR BOOKTITLE CHAPTER CROSSREF EDITION
#>   <chr>    <chr>     <chr>   <chr>  <list> <chr>     <chr>   <chr>    <chr>  
#> 1 ARTICLE  Brier2017 <NA>    <NA>   <chr … <NA>      <NA>    <NA>     <NA>   
#> # … with 28 more variables: EDITOR <list>, HOWPUBLISHED <chr>,
#> #   INSTITUTION <chr>, JOURNAL <chr>, KEY <chr>, MONTH <chr>, NOTE <chr>,
#> #   NUMBER <chr>, ORGANIZATION <chr>, PAGES <chr>, PUBLISHER <chr>,
#> #   SCHOOL <chr>, SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>,
#> #   YEAR <dbl>, ABSTRACT <chr>, ARCHIVEPREFIX <chr>, ARXIVID <chr>, DOI <chr>,
#> #   EPRINT <chr>, ISBN <chr>, ISSN <chr>, PMID <chr>, FILE <lgl>,
#> #   KEYWORDS <chr>, URL <chr>
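Because the bibtexkey column of colData maps to the BIBTEXKEY column of studies, study-level information can be attached to each sample with a lookup. The sketch below uses two small made-up tables in place of the real ones; only the column names bibtexkey, BIBTEXKEY and YEAR are taken from the output shown above.

```r
# toy stand-ins for colData(adipo_counts) and metadata(adipo_counts)$studies
samples <- data.frame(id        = c("GSM_A", "GSM_B", "GSM_C"),
                      bibtexkey = c("key1", "key2", "key1"),
                      stringsAsFactors = FALSE)
studies <- data.frame(BIBTEXKEY = c("key1", "key2"),
                      YEAR      = c(2017, 2014),
                      stringsAsFactors = FALSE)

# position of each sample's study in the studies table
idx <- match(samples$bibtexkey, studies$BIBTEXKEY)

# attach the publication year of the study to every sample
samples$year <- studies$YEAR[idx]
samples
```

With the real object, the same lookup would presumably start from adipo_counts$bibtexkey and metadata(adipo_counts)$studies$BIBTEXKEY.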

Summary of the studies in the dataset

GEO series ID	PubMed ID	Num. of Samples	Time (hr)	Differentiation Stage	Instrument Model
GSE100056	29138456	4	-48/24	0/1	Ion Torrent Proton
GSE104508	29091029	3	192	3	NextSeq 500
GSE35724	24095730	3	192	3	Illumina Genome Analyzer II
GSE50612	25614607	8	-48/0/10/144	0/1/3	Illumina HiSeq 2000
GSE50934	24912735	6	0/168	0/3	Illumina HiSeq 2000
GSE53244	25412662	5	-48/0/48/120/240	0/1/3	Illumina HiSeq 2000
GSE57415	24857666	4	0/4	0/1	Illumina HiSeq 1500
GSE60745	26220403	12	0/24/48	0/1	Illumina HiSeq 2500
GSE64757	25596527	6	168	3	Illumina HiSeq 2000
GSE75639	27923061	6	-96/-48/0/6/48/168	0/1/3	Illumina HiSeq 2000
GSE84410	27899593	6	0/4/48/28	0/1	Illumina HiSeq 1500
GSE87113	27777310	6	0/1/2/4/48/168	0/1/3	Illumina HiSeq 2500
GSE89621	28009298	3	240	3	Illumina HiSeq 2500
GSE95029	29317436	10	0/48/96/144/192	0/1/2/3	Illumina HiSeq 2000
GSE95533	28475875	10	4/0/24/48/168	1/0/3	Illumina HiSeq 1500
GSE96764	29748257	6	0/2/4	0/1/2	Illumina HiSeq 2000

Example of using `curatedAdipoRNA`

Motivation

All the samples in this dataset come from the 3T3-L1 cell line. The MDI induction media were used to induce adipocyte differentiation. The two important variables in the dataset are time and stage, which correspond to the time point and stage of differentiation when the samples were captured. Ideally, this dataset should be treated as a time course. However, for the purposes of this example, we only used samples from two time points, 0 and 24 hours, and treated them as independent groups. The goal of this example is to show how a typical differential expression analysis can be applied to the dataset. The main focus is to explain how the data and metadata in adipo_counts fit into each main piece of the analysis. We started by filtering the low quality samples and low count genes. Then we applied the DESeq2 method with the default values.

Filtering low quality samples

First, we subset the adipo_counts object to all samples that have time points 0 or 24. The total number of samples is 30: 22 at 0 hours and 8 at 24 hours. The total number of features/genes in the set is 23916.

# subsetting counts to 0 and 24 hours
se <- adipo_counts[, adipo_counts$time %in% c(0, 24)]

# showing the numbers of features, samples and time groups
dim(se)
#> [1] 23916    30
table(se$time)
#> 
#>  0 24 
#> 22  8

Since the quality metrics are reported per run file, we need to get the SRR* ID for each of the samples. Notice that some samples have more than one file: the paired-end samples have two files each, SRR*_1 and SRR*_2.

# filtering low quality samples
# check the library layout
table(se$library_layout)
#> 
#> PAIRED SINGLE 
#>     12     18

# check the number of files in qc
qc <- se$qc
table(lengths(qc))
#> 
#>  1  2 
#> 18 12

# flattening qc list
qc <- unlist(qc, recursive = FALSE)
length(qc)
#> [1] 42
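The count of 42 follows from 18 single-end samples with one file each and 12 paired-end samples with two files each. The toy list below sketches what the flattening does to the names. It assumes that the outer list is named by GEO sample and the inner lists by run file; the identifiers and contents here are placeholders.

```r
# toy nested list: one single-end sample and one paired-end sample
qc_toy <- list(
  GSM_A = list(SRR_A   = list(per_base_sequence_quality = "placeholder")),
  GSM_B = list(SRR_B_1 = list(per_base_sequence_quality = "placeholder"),
               SRR_B_2 = list(per_base_sequence_quality = "placeholder"))
)

# number of files per sample
lengths(qc_toy)

# flatten one level; outer and inner names are joined with a dot
flat <- unlist(qc_toy, recursive = FALSE)
names(flat)

# recover the sample id from the part before the first dot
sample_id <- sub("\\..*$", "", names(flat))
sample_id
```

The dot-joined names are what later allows a run to be traced back to its sample.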

The qc object of the colData contains the output of fastqc in a qc_read class. More information on this object can be accessed by calling ?fastqcr::qc_read. Here, we only use the per_base_sequence_quality to filter out low quality samples. This is by no means enough quality control but it should drive the point home.

# extracting per_base_sequence_quality
per_base <- lapply(qc, function(x) {
  df <- x[['per_base_sequence_quality']]
  df %>%
    select(Base, Mean) %>%
    transform(Base = strsplit(as.character(Base), '-')) %>%
    unnest(Base) %>%
    mutate(Base = as.numeric(Base))
}) %>%
  bind_rows(.id = 'run')
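The part of the code above that splits Base handles positions that are reported as ranges, such as "10-14", next to single positions. A base R sketch of the same idea on a toy table (values made up) shows that a range contributes one row per endpoint, each carrying the same mean score.

```r
# toy per-base table: two single positions and one binned range
df <- data.frame(Base = c("1", "2", "10-14"),
                 Mean = c(32.1, 33.0, 36.5),
                 stringsAsFactors = FALSE)

# split every entry on "-"; single positions give one piece, ranges give two
parts <- strsplit(df$Base, "-")

# one row per piece, repeating the mean score for both ends of a range
tidy <- data.frame(Base = as.numeric(unlist(parts)),
                   Mean = rep(df$Mean, lengths(parts)))
tidy
```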

After tidying the data, we get a data.frame with three columns: run, Mean and Base, for the run ID, the mean quality score and the base position in each read. fastqc provides thorough documentation of this quality control module and others. Notice that read length varies significantly between the runs and that the average of the per base mean scores is a suitable summary of each run.

# a quick look at quality scores
summary(per_base)
#>      run                 Base             Mean      
#>  Length:3408        Min.   :  1.00   Min.   :10.74  
#>  Class :character   1st Qu.: 21.00   1st Qu.:34.45  
#>  Mode  :character   Median : 42.00   Median :36.44  
#>                     Mean   : 50.24   Mean   :35.28  
#>                     3rd Qu.: 70.00   3rd Qu.:37.90  
#>                     Max.   :368.00   Max.   :40.17

To identify the low quality samples, we categorize the runs by length and run_average, which are the read length and the average of the per base mean scores. The following figure should make it easier to see why these cutoffs were used in this case.

# find low quality runs
per_base <- per_base %>%
  group_by(run) %>%
  mutate(length = max(Base) > 150,
         run_average = mean(Mean) > 34)

# plot average per base quality
per_base %>%
  ggplot(aes(x = Base, y = Mean, group = run, color = run_average)) +
  geom_line() +
  facet_wrap(~length, scales = 'free_x')
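The same two cutoffs can also be summarised as one row per run, which makes it easy to see which runs fail. This is a base R sketch on made-up scores; only the thresholds of 150 bases and a mean score of 34 are taken from the code above.

```r
# toy per-base scores for two runs, named "<sample>.<run>"
scores <- data.frame(
  run  = rep(c("GSM_A.SRR_A", "GSM_B.SRR_B"), each = 3),
  Base = c(1, 2, 50, 1, 2, 200),
  Mean = c(36, 35, 34.5, 30, 28, 25)
)

# per-run read length (largest base position) and average of the mean scores
len <- tapply(scores$Base, scores$run, max)
avg <- tapply(scores$Mean, scores$run, mean)

run_summary <- data.frame(run         = names(len),
                          long_read   = as.vector(len) > 150,
                          run_average = as.vector(avg) > 34)
run_summary

# sample ids of the runs that fail the quality cutoff
sub("\\..*$", "", run_summary$run[!run_summary$run_average])
```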

The run IDs of the “bad” samples are then used to remove them from the dataset.

# get run ids of low quality samples
bad_samples <- data.frame(samples = unique(per_base$run[per_base$run_average == FALSE]))
bad_samples <- separate(bad_samples, col = samples, into = c('id', 'run'), sep = '\\.')

# subset the counts object
se2 <- se[, !se$id %in% bad_samples$id]
table(se2$time)
#> 
#>  0 24 
#> 19  6

Filtering low count genes

To identify the low count features/genes (possibly not expressed), we keep only the features with more than 10 reads in 2 or more samples. Then we subset the object to exclude the remaining low count genes.

# filtering low count genes
# note: despite its name, low_counts is TRUE for the genes that pass the filter
low_counts <- apply(assay(se2), 1, function(x) length(x[x>10])>=2)
table(low_counts)
#> low_counts
#> FALSE  TRUE 
#> 10254 13662

# subsetting the count object
se3 <- se2[low_counts,]
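The filter above can also be written in a vectorised form with rowSums. The toy matrix below (values made up) checks that the two forms agree, and shows that a count of exactly 10 does not pass, since the comparison is strictly greater than 10.

```r
# toy count matrix: 3 genes x 3 samples
counts <- matrix(c( 0, 11, 25,
                    3, 10, 10,
                   50,  0, 12),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("s", 1:3)))

# row-wise form, as used above
keep_apply <- apply(counts, 1, function(x) length(x[x > 10]) >= 2)

# vectorised form: count the samples above the threshold per gene
keep_rowsums <- rowSums(counts > 10) >= 2

keep_rowsums
all(keep_apply == keep_rowsums)
```

geneB is dropped because none of its counts exceeds 10, even though two of them equal 10.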

Applying differential expression using `DESeq2`

DESeq2 is a well documented and widely used R package for differential expression analysis. Here we use the default values of DESeq to find the genes which are differentially expressed between the samples at 24 hours and 0 hours.

# differential expression analysis
se3$time <- factor(se3$time)
dds <- DESeqDataSet(se3, ~time)
#> renaming the first element in assays to 'counts'
#> converting counts to integer mode
dds <- DESeq(dds)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> -- replacing outliers and refitting for 53 genes
#> -- DESeq argument 'minReplicatesForReplace' = 7 
#> -- original counts are preserved in counts(dds)
#> estimating dispersions
#> fitting model and testing
res <- results(dds)
table(res$padj < .1)
#> 
#> FALSE  TRUE 
#>  6194  7462
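The two counts in the table add up to 13656, six fewer than the 13662 genes that were tested. A likely explanation is that table() drops missing values and DESeq2 can report an adjusted p-value of NA for some genes, for instance those removed by its independent filtering or flagged for outlier counts. Which of these applies here is not shown in the output. The toy vector below (made-up values) illustrates the counting behaviour.

```r
# toy adjusted p-values with two missing entries
padj <- c(0.001, 0.04, 0.3, NA, 0.09, NA)

# default: NAs are silently dropped from the table
table(padj < 0.1)

# make the missing values visible
table(padj < 0.1, useNA = "ifany")

# number of significant genes, counted without relying on table()
n_sig <- sum(!is.na(padj) & padj < 0.1)
n_sig
```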

Next!

In this example, we didn’t attempt to correct for the between-study factors that might confound the results. To show how such confounding can arise, we use PCA plots with a few of these factors in the following graphs. The first uses the time factor, which is the factor of interest in this case. We see that the rlog transformation did a good job of separating the samples into their expected groups. However, it also seems that time is not the only factor in play. For example, we show in the second and the third graphs two other factors, library_layout and instrument_model, which might explain some of the variance between the samples. This is expected because the data were collected from different studies using slightly different protocols and different sequencing machines. Therefore, it is necessary to account for these differences to obtain reliable results. There are multiple methods to do that, such as Removing Unwanted Variation (RUV) and Surrogate Variable Analysis (SVA).

# explaining variance
plotPCA(rlog(dds), intgroup = 'time')

plotPCA(rlog(dds), intgroup = 'library_layout')

plotPCA(rlog(dds), intgroup = 'instrument_model')
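A simpler alternative to RUV or SVA, when the nuisance factor is known and recorded, is to include it as a covariate in the design formula. The sketch below is not part of the analysis above. It only uses base R and a made-up sample table to show how the design matrix changes when library_layout is added next to time.

```r
# toy sample table: two time points crossed with two library layouts
samples <- data.frame(
  time           = factor(c(0, 0, 24, 24)),
  library_layout = factor(c("SINGLE", "PAIRED", "SINGLE", "PAIRED"))
)

# design with the factor of interest only
design_time <- model.matrix(~ time, data = samples)

# design that also adjusts for the library layout
design_adjusted <- model.matrix(~ library_layout + time, data = samples)

colnames(design_time)
colnames(design_adjusted)
```

With the real object, the corresponding call would presumably be DESeqDataSet(se3, ~ library_layout + time), keeping the variable of interest last. This only works when the covariate is not completely confounded with time, which would need to be checked for this subset.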

Citing the studies in this subset of the data

Speaking of studies, as mentioned earlier the studies object contains full information of the references of the original studies in which the data were published. Please cite them when using this dataset.

# keys of the studies in this subset of the data
unique(se3$bibtexkey)
#>  [1] "Duteil2014"                "zhao_fto-dependent_2014"  
#>  [3] "siersbaek_molecular_2014"  "Lim2015"                  
#>  [5] "brunmeir_comparative_2016" "Brier2017"                
#>  [7] "park_distinct_2017"        "chen_diabetes_2018"       
#>  [9] "siersbaek_dynamic_2017"    "ryu_metabolic_2018"

Citing `curatedAdipoRNA`

For citing the package use:

# citing the package
citation("curatedAdipoRNA")
#> 
#> To cite package 'curatedAdipoRNA' in publications use:
#> 
#>   Mahmoud Ahmed (2020). curatedAdipoRNA: A Curated RNA-Seq Dataset of
#>   MDI-induced Differentiated Adipocytes (3T3-L1). R package version
#>   1.4.0. https://github.com/MahShaaban/curatedAdipoRNA
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {curatedAdipoRNA: A Curated RNA-Seq Dataset of MDI-induced Differentiated Adipocytes
#> (3T3-L1)},
#>     author = {Mahmoud Ahmed},
#>     year = {2020},
#>     note = {R package version 1.4.0},
#>     url = {https://github.com/MahShaaban/curatedAdipoRNA},
#>   }

Session Info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       Ubuntu 18.04.4 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  C                           
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2020-05-07                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package              * version  date       lib source        
#>  annotate               1.66.0   2020-05-06 [2] Bioconductor  
#>  AnnotationDbi          1.50.0   2020-05-06 [2] Bioconductor  
#>  assertthat             0.2.1    2019-03-21 [2] CRAN (R 4.0.0)
#>  backports              1.1.6    2020-04-05 [2] CRAN (R 4.0.0)
#>  Biobase              * 2.48.0   2020-05-06 [2] Bioconductor  
#>  BiocGenerics         * 0.34.0   2020-05-06 [2] Bioconductor  
#>  BiocParallel           1.22.0   2020-05-06 [2] Bioconductor  
#>  bit                    1.1-15.2 2020-02-10 [2] CRAN (R 4.0.0)
#>  bit64                  0.9-7    2017-05-08 [2] CRAN (R 4.0.0)
#>  bitops                 1.0-6    2013-08-17 [2] CRAN (R 4.0.0)
#>  blob                   1.2.1    2020-01-20 [2] CRAN (R 4.0.0)
#>  callr                  3.4.3    2020-03-28 [2] CRAN (R 4.0.0)
#>  cli                    2.0.2    2020-02-28 [2] CRAN (R 4.0.0)
#>  colorspace             1.4-1    2019-03-18 [2] CRAN (R 4.0.0)
#>  crayon                 1.3.4    2017-09-16 [2] CRAN (R 4.0.0)
#>  curatedAdipoRNA      * 1.4.0    2020-05-07 [1] Bioconductor  
#>  DBI                    1.1.0    2019-12-15 [2] CRAN (R 4.0.0)
#>  DelayedArray         * 0.14.0   2020-05-06 [2] Bioconductor  
#>  desc                   1.2.0    2018-05-01 [2] CRAN (R 4.0.0)
#>  DESeq2               * 1.28.0   2020-05-06 [2] Bioconductor  
#>  devtools               2.3.0    2020-04-10 [2] CRAN (R 4.0.0)
#>  digest                 0.6.25   2020-02-23 [2] CRAN (R 4.0.0)
#>  dplyr                * 0.8.5    2020-03-07 [2] CRAN (R 4.0.0)
#>  ellipsis               0.3.0    2019-09-20 [2] CRAN (R 4.0.0)
#>  evaluate               0.14     2019-05-28 [2] CRAN (R 4.0.0)
#>  fansi                  0.4.1    2020-01-08 [2] CRAN (R 4.0.0)
#>  farver                 2.0.3    2020-01-16 [2] CRAN (R 4.0.0)
#>  fastqcr              * 0.1.2    2019-01-03 [2] CRAN (R 4.0.0)
#>  fs                     1.4.1    2020-04-04 [2] CRAN (R 4.0.0)
#>  genefilter             1.70.0   2020-05-06 [2] Bioconductor  
#>  geneplotter            1.66.0   2020-05-06 [2] Bioconductor  
#>  GenomeInfoDb         * 1.24.0   2020-05-06 [2] Bioconductor  
#>  GenomeInfoDbData       1.2.3    2020-04-24 [2] Bioconductor  
#>  GenomicRanges        * 1.40.0   2020-05-06 [2] Bioconductor  
#>  ggplot2              * 3.3.0    2020-03-05 [2] CRAN (R 4.0.0)
#>  glue                   1.4.0    2020-04-03 [2] CRAN (R 4.0.0)
#>  gtable                 0.3.0    2019-03-25 [2] CRAN (R 4.0.0)
#>  highr                  0.8      2019-03-20 [2] CRAN (R 4.0.0)
#>  htmltools              0.4.0    2019-10-04 [2] CRAN (R 4.0.0)
#>  IRanges              * 2.22.1   2020-05-06 [2] Bioconductor  
#>  knitr                  1.28     2020-02-06 [2] CRAN (R 4.0.0)
#>  labeling               0.3      2014-08-23 [2] CRAN (R 4.0.0)
#>  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.0)
#>  lifecycle              0.2.0    2020-03-06 [2] CRAN (R 4.0.0)
#>  locfit                 1.5-9.4  2020-03-25 [2] CRAN (R 4.0.0)
#>  magrittr               1.5      2014-11-22 [2] CRAN (R 4.0.0)
#>  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.0)
#>  matrixStats          * 0.56.0   2020-03-13 [2] CRAN (R 4.0.0)
#>  memoise                1.1.0    2017-04-21 [2] CRAN (R 4.0.0)
#>  munsell                0.5.0    2018-06-12 [2] CRAN (R 4.0.0)
#>  pillar                 1.4.4    2020-05-05 [2] CRAN (R 4.0.0)
#>  pkgbuild               1.0.7    2020-04-25 [2] CRAN (R 4.0.0)
#>  pkgconfig              2.0.3    2019-09-22 [2] CRAN (R 4.0.0)
#>  pkgload                1.0.2    2018-10-29 [2] CRAN (R 4.0.0)
#>  prettyunits            1.1.1    2020-01-24 [2] CRAN (R 4.0.0)
#>  processx               3.4.2    2020-02-09 [2] CRAN (R 4.0.0)
#>  ps                     1.3.2    2020-02-13 [2] CRAN (R 4.0.0)
#>  purrr                  0.3.4    2020-04-17 [2] CRAN (R 4.0.0)
#>  R6                     2.4.1    2019-11-12 [2] CRAN (R 4.0.0)
#>  RColorBrewer           1.1-2    2014-12-07 [2] CRAN (R 4.0.0)
#>  Rcpp                   1.0.4.6  2020-04-09 [2] CRAN (R 4.0.0)
#>  RCurl                  1.98-1.2 2020-04-18 [2] CRAN (R 4.0.0)
#>  remotes                2.1.1    2020-02-15 [2] CRAN (R 4.0.0)
#>  rlang                  0.4.6    2020-05-02 [2] CRAN (R 4.0.0)
#>  rmarkdown              2.1      2020-01-20 [2] CRAN (R 4.0.0)
#>  rprojroot              1.3-2    2018-01-03 [2] CRAN (R 4.0.0)
#>  RSQLite                2.2.0    2020-01-07 [2] CRAN (R 4.0.0)
#>  S4Vectors            * 0.26.0   2020-05-06 [2] Bioconductor  
#>  scales                 1.1.0    2019-11-18 [2] CRAN (R 4.0.0)
#>  sessioninfo            1.1.1    2018-11-05 [2] CRAN (R 4.0.0)
#>  stringi                1.4.6    2020-02-17 [2] CRAN (R 4.0.0)
#>  stringr                1.4.0    2019-02-10 [2] CRAN (R 4.0.0)
#>  SummarizedExperiment * 1.18.1   2020-05-06 [2] Bioconductor  
#>  survival               3.1-12   2020-04-10 [2] CRAN (R 4.0.0)
#>  testthat               2.3.2    2020-03-02 [2] CRAN (R 4.0.0)
#>  tibble                 3.0.1    2020-04-20 [2] CRAN (R 4.0.0)
#>  tidyr                * 1.0.2    2020-01-24 [2] CRAN (R 4.0.0)
#>  tidyselect             1.0.0    2020-01-27 [2] CRAN (R 4.0.0)
#>  usethis                1.6.1    2020-04-29 [2] CRAN (R 4.0.0)
#>  utf8                   1.1.4    2018-05-24 [2] CRAN (R 4.0.0)
#>  vctrs                  0.2.4    2020-03-10 [2] CRAN (R 4.0.0)
#>  withr                  2.2.0    2020-04-20 [2] CRAN (R 4.0.0)
#>  xfun                   0.13     2020-04-13 [2] CRAN (R 4.0.0)
#>  XML                    3.99-0.3 2020-01-20 [2] CRAN (R 4.0.0)
#>  xtable                 1.8-4    2019-04-21 [2] CRAN (R 4.0.0)
#>  XVector                0.28.0   2020-05-06 [2] Bioconductor  
#>  yaml                   2.2.1    2020-02-01 [2] CRAN (R 4.0.0)
#>  zlibbioc               1.34.0   2020-05-06 [2] Bioconductor  
#> 
#> [1] /tmp/RtmpOAiXro/Rinst28f8671c0981
#> [2] /home/biocbuild/bbs-3.11-bioc/R/library

Using curatedAdipoRNA

Mahmoud Ahmed

2020-05-07
