Lifecycle: maturing

Brings SummarizedExperiment to the tidyverse!

website: stemangiola.github.io/tidySummarizedExperiment/

Introduction

tidySummarizedExperiment bridges the Bioconductor SummarizedExperiment class [@morgan2020summarized] and the tidyverse [@wickham2019welcome]. It adds an invisible layer that lets a SummarizedExperiment object be viewed as a tidyverse tibble, and it provides SummarizedExperiment-compatible dplyr, tidyr, ggplot2 and plotly functions. Users get the best of both the Bioconductor and tidyverse worlds.

Functions/utilities available

SummarizedExperiment-compatible functions   Description
all                                         After all, a tidySummarizedExperiment is a SummarizedExperiment object, just better

tidyverse packages   Description
dplyr                Almost all dplyr APIs, as for any tibble
tidyr                Almost all tidyr APIs, as for any tibble
ggplot2              ggplot, as for any tibble
plotly               plot_ly, as for any tibble

Utilities   Description
tidy        Add the tidySummarizedExperiment invisible layer over a SummarizedExperiment object
as_tibble   Convert sample- and transcript-wise information to a tbl_df

Installation

if (!requireNamespace("BiocManager", quietly=TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("tidySummarizedExperiment")

From GitHub (development version)

devtools::install_github("stemangiola/tidySummarizedExperiment")

Load libraries used in the examples.

library(ggplot2)
library(tidySummarizedExperiment)

Create tidySummarizedExperiment, the best of both worlds!

The object created below is a SummarizedExperiment, but it is displayed and manipulated as a tibble, so it is compatible with both the SummarizedExperiment and the tidyverse APIs.

pasilla_tidy <- tidySummarizedExperiment::pasilla %>%
    tidy()
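The same pattern should apply to your own data. The sketch below builds a small SummarizedExperiment from scratch and wraps it with tidy(). The matrix values, the gene and sample names, and the condition column are all made up for illustration. It assumes a release in which tidy() is the entry point, as in this README; later releases may treat the call as unnecessary.

```r
library(SummarizedExperiment)
library(tidySummarizedExperiment)

# Hypothetical toy counts: 3 genes (rows) x 2 samples (columns).
counts <- matrix(
    c(0L, 92L, 5L, 1L, 110L, 7L),
    nrow=3,
    dimnames=list(c("gene1", "gene2", "gene3"), c("s1", "s2"))
)

# Standard SummarizedExperiment constructor: one assay plus sample annotation.
se <- SummarizedExperiment(
    assays=list(counts=counts),
    colData=DataFrame(
        condition=c("untreated", "treated"),
        row.names=c("s1", "s2")
    )
)

# Add the tidy layer (assumption: tidy() behaves as in the version used here).
se_tidy <- se %>%
    tidy()

# Sample annotation columns can then be used in tidyverse verbs.
se_tidy %>%
    filter(condition == "treated")
```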

It looks like a tibble

pasilla_tidy
## # A tibble abstraction: 102,193 x 5
##    sample condition type       transcript  counts
##    <chr>  <chr>     <chr>      <chr>        <int>
##  1 untrt1 untreated single_end FBgn0000003      0
##  2 untrt1 untreated single_end FBgn0000008     92
##  3 untrt1 untreated single_end FBgn0000014      5
##  4 untrt1 untreated single_end FBgn0000015      0
##  5 untrt1 untreated single_end FBgn0000017   4664
##  6 untrt1 untreated single_end FBgn0000018    583
##  7 untrt1 untreated single_end FBgn0000022      0
##  8 untrt1 untreated single_end FBgn0000024     10
##  9 untrt1 untreated single_end FBgn0000028      0
## 10 untrt1 untreated single_end FBgn0000032   1446
## # … with 40 more rows
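The row count in the header follows from the long layout: the view has one row per sample-transcript pair, so 7 samples with 14,599 transcripts each give 7 × 14,599 = 102,193 rows. As a rough illustration of that layout, and not of the package's internal implementation, the base R sketch below unrolls a toy count matrix into the same shape. All values and the col_data table are illustrative.

```r
# Toy count matrix: 3 transcripts (rows) x 2 samples (columns).
# The values are illustrative, not taken from the pasilla data.
counts <- matrix(
    c(0L, 92L, 5L,    # sample untrt1
      1L, 110L, 7L),  # sample trt1
    nrow=3,
    dimnames=list(
        c("FBgn0000003", "FBgn0000008", "FBgn0000014"),
        c("untrt1", "trt1")
    )
)

# Sample annotation, one row per column of the matrix.
col_data <- data.frame(
    sample=c("untrt1", "trt1"),
    condition=c("untreated", "treated"),
    stringsAsFactors=FALSE
)

# Long layout: one row per sample-transcript pair.
# as.vector() walks the matrix column by column, so samples vary slowest.
long <- data.frame(
    sample=rep(colnames(counts), each=nrow(counts)),
    transcript=rep(rownames(counts), times=ncol(counts)),
    counts=as.vector(counts),
    stringsAsFactors=FALSE
)

# Attach the sample annotation to every row of that sample.
long$condition <- col_data$condition[match(long$sample, col_data$sample)]

# The number of rows is transcripts x samples.
nrow(long) == nrow(counts) * ncol(counts)
```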

But it is a SummarizedExperiment object after all

Assays(pasilla_tidy)
## An object of class "SimpleAssays"
## Slot "data":
## List of length 1

Tidyverse commands

We can use tidyverse commands to explore the tidy SummarizedExperiment object.

We can use slice to choose rows by position, for example to choose the first row.

pasilla_tidy %>%
    slice(1)
## # A tibble abstraction: 1 x 5
##   sample condition type       transcript  counts
##   <chr>  <chr>     <chr>      <chr>        <int>
## 1 untrt1 untreated single_end FBgn0000003      0

We can use filter to choose rows by criteria.

pasilla_tidy %>%
    filter(condition == "untreated")
## # A tibble abstraction: 58,396 x 5
##    sample condition type       transcript  counts
##    <chr>  <chr>     <chr>      <chr>        <int>
##  1 untrt1 untreated single_end FBgn0000003      0
##  2 untrt1 untreated single_end FBgn0000008     92
##  3 untrt1 untreated single_end FBgn0000014      5
##  4 untrt1 untreated single_end FBgn0000015      0
##  5 untrt1 untreated single_end FBgn0000017   4664
##  6 untrt1 untreated single_end FBgn0000018    583
##  7 untrt1 untreated single_end FBgn0000022      0
##  8 untrt1 untreated single_end FBgn0000024     10
##  9 untrt1 untreated single_end FBgn0000028      0
## 10 untrt1 untreated single_end FBgn0000032   1446
## # … with 40 more rows

We can use select to choose columns.

pasilla_tidy %>%
    select(sample)
## # A tibble: 102,193 x 1
##    sample
##    <chr> 
##  1 untrt1
##  2 untrt1
##  3 untrt1
##  4 untrt1
##  5 untrt1
##  6 untrt1
##  7 untrt1
##  8 untrt1
##  9 untrt1
## 10 untrt1
## # … with 102,183 more rows

We can use count to count how many rows we have for each sample.

pasilla_tidy %>%
    count(sample)
## # A tibble: 7 x 2
##   sample     n
##   <chr>  <int>
## 1 trt1   14599
## 2 trt2   14599
## 3 trt3   14599
## 4 untrt1 14599
## 5 untrt2 14599
## 6 untrt3 14599
## 7 untrt4 14599

We can use distinct to see what distinct sample information we have.

pasilla_tidy %>%
    distinct(sample, condition, type)
## # A tibble: 7 x 3
##   sample condition type      
##   <chr>  <chr>     <chr>     
## 1 untrt1 untreated single_end
## 2 untrt2 untreated single_end
## 3 untrt3 untreated paired_end
## 4 untrt4 untreated paired_end
## 5 trt1   treated   single_end
## 6 trt2   treated   paired_end
## 7 trt3   treated   paired_end

We could use rename to rename a column. For example, we could rename the type column to sequencing.

pasilla_tidy %>%
    rename(sequencing=type)
## # A tibble abstraction: 102,193 x 5
##    sample condition sequencing transcript  counts
##    <chr>  <chr>     <chr>      <chr>        <int>
##  1 untrt1 untreated single_end FBgn0000003      0
##  2 untrt1 untreated single_end FBgn0000008     92
##  3 untrt1 untreated single_end FBgn0000014      5
##  4 untrt1 untreated single_end FBgn0000015      0
##  5 untrt1 untreated single_end FBgn0000017   4664
##  6 untrt1 untreated single_end FBgn0000018    583
##  7 untrt1 untreated single_end FBgn0000022      0
##  8 untrt1 untreated single_end FBgn0000024     10
##  9 untrt1 untreated single_end FBgn0000028      0
## 10 untrt1 untreated single_end FBgn0000032   1446
## # … with 40 more rows

We could use mutate to create or modify a column. For example, we could overwrite the type column so that it contains single and paired instead of single_end and paired_end.

pasilla_tidy %>%
    mutate(type=gsub("_end", "", type))
## # A tibble abstraction: 102,193 x 5
##    sample condition type   transcript  counts
##    <chr>  <chr>     <chr>  <chr>        <int>
##  1 untrt1 untreated single FBgn0000003      0
##  2 untrt1 untreated single FBgn0000008     92
##  3 untrt1 untreated single FBgn0000014      5
##  4 untrt1 untreated single FBgn0000015      0
##  5 untrt1 untreated single FBgn0000017   4664
##  6 untrt1 untreated single FBgn0000018    583
##  7 untrt1 untreated single FBgn0000022      0
##  8 untrt1 untreated single FBgn0000024     10
##  9 untrt1 untreated single FBgn0000028      0
## 10 untrt1 untreated single FBgn0000032   1446
## # … with 40 more rows

We could use unite to combine multiple columns into a single column.

pasilla_tidy %>%
    unite("group", c(condition, type))
## # A tibble abstraction: 102,193 x 4
##    sample group                transcript  counts
##    <chr>  <chr>                <chr>        <int>
##  1 untrt1 untreated_single_end FBgn0000003      0
##  2 untrt1 untreated_single_end FBgn0000008     92
##  3 untrt1 untreated_single_end FBgn0000014      5
##  4 untrt1 untreated_single_end FBgn0000015      0
##  5 untrt1 untreated_single_end FBgn0000017   4664
##  6 untrt1 untreated_single_end FBgn0000018    583
##  7 untrt1 untreated_single_end FBgn0000022      0
##  8 untrt1 untreated_single_end FBgn0000024     10
##  9 untrt1 untreated_single_end FBgn0000028      0
## 10 untrt1 untreated_single_end FBgn0000032   1446
## # … with 40 more rows

We can also combine commands with the tidyverse pipe %>%.

For example, we could combine group_by and summarise to get the total counts for each sample.

pasilla_tidy %>%
    group_by(sample) %>%
    summarise(total_counts=sum(counts))
## # A tibble: 7 x 2
##   sample total_counts
##   <chr>         <int>
## 1 trt1       18670279
## 2 trt2        9571826
## 3 trt3       10343856
## 4 untrt1     13972512
## 5 untrt2     21911438
## 6 untrt3      8358426
## 7 untrt4      9841335
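Because the object is still a SummarizedExperiment, the totals above can be cross-checked through the Bioconductor API. This is a minimal sketch, assuming the first assay is the counts assay (the object holds a single assay) and that tidy() is available as in this README.

```r
library(SummarizedExperiment)
library(tidySummarizedExperiment)

pasilla_tidy <- tidySummarizedExperiment::pasilla %>%
    tidy()

# assay() without a second argument returns the first assay as a matrix
# (transcripts in rows, samples in columns).
totals <- colSums(assay(pasilla_tidy))

# One total per sample, comparable with the summarise() result.
totals
```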

We could combine group_by, mutate and filter to get the transcripts with mean count > 0.

pasilla_tidy %>%
    group_by(transcript) %>%
    mutate(mean_count=mean(counts)) %>%
    filter(mean_count > 0)
## # A tibble: 86,513 x 6
## # Groups:   transcript [12,359]
##    sample condition type       transcript  counts mean_count
##    <chr>  <chr>     <chr>      <chr>        <int>      <dbl>
##  1 untrt1 untreated single_end FBgn0000003      0      0.143
##  2 untrt1 untreated single_end FBgn0000008     92     99.6  
##  3 untrt1 untreated single_end FBgn0000014      5      1.43 
##  4 untrt1 untreated single_end FBgn0000015      0      0.857
##  5 untrt1 untreated single_end FBgn0000017   4664   4672.   
##  6 untrt1 untreated single_end FBgn0000018    583    461.   
##  7 untrt1 untreated single_end FBgn0000022      0      0.143
##  8 untrt1 untreated single_end FBgn0000024     10      7    
##  9 untrt1 untreated single_end FBgn0000028      0      0.429
## 10 untrt1 untreated single_end FBgn0000032   1446   1085.   
## # … with 86,503 more rows
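Note that mutate keeps one row per sample-transcript pair: the 86,513 rows above are 12,359 retained transcripts times 7 samples. To get one row per transcript instead, summarise can replace mutate. The sketch below contrasts the two on a plain toy tibble with made-up values, using only dplyr.

```r
library(dplyr)

# Toy long-format table (illustrative values): 2 samples x 3 transcripts.
toy <- tibble(
    sample=rep(c("s1", "s2"), each=3),
    transcript=rep(c("t1", "t2", "t3"), times=2),
    counts=c(0L, 4L, 10L, 0L, 6L, 30L)
)

# mutate() keeps every sample-transcript row of the retained transcripts.
kept_rows <- toy %>%
    group_by(transcript) %>%
    mutate(mean_count=mean(counts)) %>%
    filter(mean_count > 0) %>%
    ungroup()

# summarise() collapses to one row per retained transcript.
kept_transcripts <- toy %>%
    group_by(transcript) %>%
    summarise(mean_count=mean(counts)) %>%
    filter(mean_count > 0)

nrow(kept_rows)         # 4: 2 retained transcripts x 2 samples
nrow(kept_transcripts)  # 2: one row per retained transcript
```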

Plotting

my_theme <-
    list(
        scale_fill_brewer(palette="Set1"),
        scale_color_brewer(palette="Set1"),
        theme_bw() +
            theme(
                panel.border=element_blank(),
                axis.line=element_line(),
                panel.grid.major=element_line(size=0.2),
                panel.grid.minor=element_line(size=0.1),
                text=element_text(size=12),
                legend.position="bottom",
                aspect.ratio=1,
                strip.background=element_blank(),
                axis.title.x=element_text(margin=margin(t=10, r=10, b=10, l=10)),
                axis.title.y=element_text(margin=margin(t=10, r=10, b=10, l=10))
            )
    )

We can treat pasilla_tidy as a normal tibble for plotting.

Here we plot the distribution of counts per sample.

pasilla_tidy %>%
    tidySummarizedExperiment::ggplot(aes(counts + 1, group=sample, color=`type`)) +
    geom_density() +
    scale_x_log10() +
    my_theme

[Figure: density of counts + 1 for each sample, log10 x-axis, colored by type]
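An interactive plot can be built in a similar way. The sketch below first summarises total counts per sample, which yields a plain tibble, and then passes it to plotly::plot_ly as a bar chart. The choice of a bar chart and the variable name p are illustrative, and the example assumes that the plotly package is installed and that tidy() is available as in this README.

```r
library(tidySummarizedExperiment)

pasilla_tidy <- tidySummarizedExperiment::pasilla %>%
    tidy()

# Summarising returns a plain tibble with one row per sample,
# which plot_ly can consume like any other data frame.
p <- pasilla_tidy %>%
    group_by(sample) %>%
    summarise(total_counts=sum(counts)) %>%
    plotly::plot_ly(x=~sample, y=~total_counts, type="bar")

# Printing the object renders the interactive widget.
p
```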

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] tidySummarizedExperiment_1.0.0 SummarizedExperiment_1.20.0   
##  [3] Biobase_2.50.0                 GenomicRanges_1.42.0          
##  [5] GenomeInfoDb_1.26.0            IRanges_2.24.0                
##  [7] S4Vectors_0.28.0               BiocGenerics_0.36.0           
##  [9] MatrixGenerics_1.2.0           matrixStats_0.57.0            
## [11] ggplot2_3.3.2                  knitr_1.30                    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0       xfun_0.18              purrr_0.3.4           
##  [4] lattice_0.20-41        colorspace_1.4-1       vctrs_0.3.4           
##  [7] generics_0.0.2         htmltools_0.5.0        viridisLite_0.3.0     
## [10] utf8_1.1.4             plotly_4.9.2.1         rlang_0.4.8           
## [13] pillar_1.4.6           glue_1.4.2             withr_2.3.0           
## [16] RColorBrewer_1.1-2     GenomeInfoDbData_1.2.4 lifecycle_0.2.0       
## [19] stringr_1.4.0          zlibbioc_1.36.0        munsell_0.5.0         
## [22] gtable_0.3.0           htmlwidgets_1.5.2      evaluate_0.14         
## [25] labeling_0.4.2         fansi_0.4.1            highr_0.8             
## [28] scales_1.1.1           DelayedArray_0.16.0    jsonlite_1.7.1        
## [31] XVector_0.30.0         farver_2.0.3           digest_0.6.27         
## [34] stringi_1.5.3          dplyr_1.0.2            grid_4.0.3            
## [37] cli_2.1.0              tools_4.0.3            bitops_1.0-6          
## [40] magrittr_1.5           RCurl_1.98-1.2         lazyeval_0.2.2        
## [43] tibble_3.0.4           crayon_1.3.4           tidyr_1.1.2           
## [46] pkgconfig_2.0.3        ellipsis_0.3.1         Matrix_1.2-18         
## [49] data.table_1.13.2      assertthat_0.2.1       httr_1.4.2            
## [52] R6_2.4.1               compiler_4.0.3

References