Introduction to rnaEditr

Lanyu Zhang, Gabriel Odom, Lily Wang

2020-10-27

rnaEditr is an R package that identifies genomic sites and genomic regions that are differentially edited in RNA-seq datasets. rnaEditr can analyze studies with continuous, binary, or survival phenotypes, along with multiple covariates and/or interaction effects. To identify hyper-edited regions, rnaEditr first determines co-edited sub-regions without using any phenotype information. Next, rnaEditr tests association between RNA editing levels within the co-edited regions with binary, continuous or survival phenotypes.

1 Installation

The latest version can be installed by:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rnaEditr")

After installation, the rnaEditr package can be loaded into R using:

library(rnaEditr)

2 Datasets

The input of rnaEditr are: (1) an RNA editing dataset, with rows corresponding to the edited sites and columns corresponding to the samples; (2) a phenotype dataset, with rows corresponding to different samples, ordered in the same way as columns of dataset (1); (3) a character string of genomic regions that users are interested in. This is only required for region-based analysis and will be explained in section 4.1.

The first dataset includes RNA editing levels for each sample at each edited site. RNA editing levels are typically defined as the total number of edited reads (i.e. reads with G nucleotides) divided by the total number of reads covering the site (i.e. reads with A and G nucleotides) (Breen et al. 2019 PMID: 31455887). They range from 0 (un-edited site) to 1 (completely edited site).

We assume quality control and normalization of the RNA editing dataset have been performed, and that each site is edited in at least 5% of the samples. For illustration, we use a subset of the TCGA breast cancer RNA editing dataset (syn 2374375). This example dataset includes RNA editing levels for 272 edited sites mapped to genes PHACTR4, CCR5, METTL7A (using hg19 reference), along with a few randomly sampled sites, for 221 subjects.

data(rnaedit_df)
rnaedit_df[1:3, 1:3]
##               BRCA-Tumor-TCGA-AC-A2FF BRCA-Normal-TCGA-AC-A2FF
## chr1:28661572              0.00000000                0.0000000
## chr1:28661576              0.64285714                0.6307692
## chr1:28661578              0.01923077                0.0000000
##               BRCA-Normal-TCGA-BH-A0AU
## chr1:28661572                0.0000000
## chr1:28661576                0.5333333
## chr1:28661578                0.0000000

In order to link the RNA editing dataset with phenotype data, the phenotype dataset needs to have a column called sample, whose values must be a exact match to the column names in the above RNA editing dataset.

pheno_df <- readRDS(
  system.file(
    "extdata",
    "pheno_df.RDS",
    package = 'rnaEditr',
    mustWork = TRUE
  )
)
pheno_df[1:3, 1:3]
##   submitter_id                   sample         sample_type
## 1 TCGA-AC-A2FF  BRCA-Tumor-TCGA-AC-A2FF       Primary Tumor
## 2 TCGA-AC-A2FF BRCA-Normal-TCGA-AC-A2FF Solid Tissue Normal
## 3 TCGA-BH-A0AU BRCA-Normal-TCGA-BH-A0AU Solid Tissue Normal

Here we are using the command below to make sure sample variable in phenotype dataset has a exact match to the column names in the RNA editing dataset.

identical(pheno_df$sample, colnames(rnaedit_df))
## [1] TRUE

rnaEditr can perform both site-specific and region-based analysis. In Section 3, we illustrate testing associations between cancer status and RNA editing levels at individual sites. In Section 4, we illustrate identifying cancer associated co-edited regions.

3 Site-specific analysis

A quick workflow for site-specific analysis

A quick workflow for site-specific analysis

3.1 Testing all edited sites

Before testing the associations, we use function CreateEditingTable() to turn RNA editing matrix into a dataframe with special class rnaEdit_df, which is a required format of input dataset for function TestAssociations().

In this example, we will test the association between cancer status ( sample_type variable from pheno_df dataset) and all edited sites. Because the outcome variable “cancer status” is a binary variable, rnaEditr will apply a logistic regression model for each site: \[logit(Pr(cancer\;status = “yes”)) \sim RNA\;editing\;level\]

First, we use the command below to check the distribution of variable sample_type. Please note that rnaEditr will fit firth corrected logistic regression instead of regular logistic regression models for binary outcomes, when the minimum sample size per group is less than 5.

## 
##       Primary Tumor Solid Tissue Normal 
##                 182                  39
##     seqnames    start      end width  estimate   stdErr       pValue
## 225    chr12 51324639 51324639     1 -19.66360 3.158016 4.767627e-10
## 42      chr1 28662199 28662199     1 -14.83609 2.399372 6.276575e-10
## 178    chr12 51324163 51324163     1 -42.69833 7.221196 3.361021e-09
##              fdr
## 225 8.536142e-08
## 42  8.536142e-08
## 178 2.978449e-07

Here tumor_single_df is a data frame of all edited sites, with corresponding p-values and false discovery rate (FDRs) from the logistic regression model.

3.2 Annotate results

We next annotate these results by adding gene names mapped to the RNA editing sites.

##   seqnames    start      end width  symbol  estimate   stdErr       pValue
## 1    chr12 51324639 51324639     1 METTL7A -19.66360 3.158016 4.767627e-10
## 2     chr1 28662199 28662199     1   MED18 -14.83609 2.399372 6.276575e-10
## 3    chr12 51324163 51324163     1 METTL7A -42.69833 7.221196 3.361021e-09
##            fdr
## 1 8.536142e-08
## 2 8.536142e-08
## 3 2.978449e-07

4 Region-based analysis

A quick workflow for region-based analysis

A quick workflow for region-based analysis

The region-based analysis consists of several steps: (1) make GRanges object for the genomic regions, (2) determine RNA editing sites that are located closely within the genomic regions, (3) identify co-edited regions, and (4) test the association between cancer status and co-edited regions.

4.1 Input regions

First, we make GRanges object for the genomic regions that we are interested in . We saved both hg19 and hg38 gene references in the package. The following command retrieves regions associated with 20,314 pre-processed hg19 genes. To retrieve hg38 gene reference, easily change “hg19_annoGene_gr.RDS” into “hg38_annoGene_gr.RDS” in the command below.

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames              ranges strand | ensembl_gene_id      symbol
##          <Rle>           <IRanges>  <Rle> |     <character> <character>
##   [1]     chr1   45959538-45965751      * | ENSG00000236624    CCDC163P
##   [2]     chr1   44457031-44462200      * | ENSG00000159214      CCDC24
##   [3]     chr1 148555979-148596267      * | ENSG00000243452      NBPF15
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

For candidate gene studies, we are often only interested in a few specific genes. We can use the following commands to extract GRanges associated with the specific genes:

## GRanges object with 3 ranges and 1 metadata column:
##       seqnames            ranges strand |     symbols
##          <Rle>         <IRanges>  <Rle> | <character>
##   [1]     chr1 28696114-28826881      * |     PHACTR4
##   [2]     chr3 46411633-46417697      * |        CCR5
##   [3]    chr12 51317255-51326300      * |     METTL7A
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

If we’re interested in particular regions, we can use the following commands to make GRanges.

## GRanges object with 2 ranges and 1 metadata column:
##       seqnames            ranges strand |        symbol
##          <Rle>         <IRanges>  <Rle> |   <character>
##   [1]    chr22 18555686-18573797      * |         PEX26
##   [2]    chr22 36883233-36908148      * | FOXRED2;EIF3D
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

4.2 Find close-by regions

Next, within each input candidate region, we identify the sub-regions that contain closely located RNA editing sites. At default, we identify sub-regions with at least 3 edited sites, for which maximum distance between two neighboring sites is 50 bp.

## GRanges object with 8 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]     chr1 28823903-28823966      *
##   [2]     chr1 28824256-28824490      *
##   [3]     chr1 28825317-28825530      *
##   [4]     chr1 28825670-28825927      *
##   [5]     chr1 28826015-28826284      *
##   [6]     chr3 46417171-46417221      *
##   [7]    chr12 51324081-51324639      *
##   [8]    chr12 51325229-51325622      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

closeByRegions_gr is a GRanges object that includes sub-regions that contain closely located RNA editing sites (close-by regions). In this example, within the three input candidate regions, 8 close-by regions were found.

4.3 Find co-edited regions

Next, within each close-by region, we identify co-edited regions based on RNA editing levels. At default, a co-edited region needs to satisfy the following requirements: (1) with at least 3 edited sites; (2) the minimum correlation between RNA editing levels of one site and the mean RNA editing levels of the rest of the sites is at least 0.4; (3) the minimum pairwise correlation of sites within the selected cluster is at least 0.1.

## GRanges object with 14 ranges and 0 metadata columns:
##        seqnames            ranges strand
##           <Rle>         <IRanges>  <Rle>
##    [1]     chr1 28825317-28825370      *
##    [2]     chr1 28825394-28825403      *
##    [3]     chr1 28825670-28825717      *
##    [4]     chr1 28825899-28825927      *
##    [5]     chr1 28826150-28826173      *
##    ...      ...               ...    ...
##   [10]    chr12 51324308-51324436      *
##   [11]    chr12 51324447-51324467      *
##   [12]    chr12 51324545-51324585      *
##   [13]    chr12 51324604-51324627      *
##   [14]    chr12 51325324-51325383      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

closeByCoeditedRegions_gr is a GRanges that contains co-edited regions within the close-by regions we found from last step. We identified 14 co-edited regions from the 8 close-by regions.

Let’s take a look at correlations of RNA editing levels within the first co-edited region:

4.4 Summarize all regions

Next, we summarize RNA editing levels from multiple edited sites within each co-edited region using medians.

##   seqnames    start      end width BRCA-Tumor-TCGA-AC-A2FF
## 1     chr1 28825317 28825370    54              0.10618417
## 2     chr1 28825394 28825403    10              0.00000000
## 3     chr1 28825670 28825717    48              0.08602151

summarizedRegions_df is a data frame with each row corresponding to one co-edited region identified from function AllCoeditedRegions(). Since there are 14 co-edited regions, this data frame has 14 rows. In column 5 to the end, each column includes median editing levels (over all sites within each co-edited region) for one sample.

4.5 Test all regions

Next, we test the association between cancer status and co-edited regions using logistic regression model: \[logit (Pr(sample\;type = “cancer”)) \sim median\;RNA\;editing\;levels\]

##    seqnames    start      end width   estimate    stdErr       pValue
## 8     chr12 51324085 51324127    43  -51.23815  8.736146 4.489288e-09
## 14    chr12 51325324 51325383    60  -94.78676 16.738315 1.488692e-08
## 11    chr12 51324447 51324467    21 -102.71801 18.271132 1.888967e-08
##             fdr
## 8  6.285003e-08
## 14 8.815177e-08
## 11 8.815177e-08

Here tumor_region_df contains results for testing each co-edited region using logistic regression model.

4.6 Annotate results

We next annotate results by adding the corresponding input regions, close-by regions and genes mapped to these genomic regions.

##   seqnames    start      end width             inputRegion
## 1    chr12 51324085 51324127    43 chr12:51317255-51326300
## 2    chr12 51325324 51325383    60 chr12:51317255-51326300
## 3    chr12 51324447 51324467    21 chr12:51317255-51326300
##             closeByRegion  symbol   estimate    stdErr       pValue
## 1 chr12:51324081-51324639 METTL7A  -51.23815  8.736146 4.489288e-09
## 2 chr12:51325229-51325622 METTL7A  -94.78676 16.738315 1.488692e-08
## 3 chr12:51324081-51324639 METTL7A -102.71801 18.271132 1.888967e-08
##            fdr
## 1 6.285003e-08
## 2 8.815177e-08
## 3 8.815177e-08

To summarize this final dataset, values of column inputRegion are the user-specified candidate genomic regions, values of column closeByRegion are the regions that contain closely located RNA editing sites, and columns seqnames, start, end are the co-edited regions.

5 Further examples of function TestAssociations in rnaEditr

rnaEditr can analyze different types of phenotypes, including binary, continuous and survival outcomes. In the last section, we analyzed a binary phenotype sample_type. We next illustrate analysis for continuous and survival phenotypes.

For continuous outcome, as an example, to identify co-edited regions associated age, we use the following commands:

tumor_region_df <- TestAssociations(
  # an RNA editing dataframe with special class "rnaEdit_df" from function
  #   CreateEditingTable() if site-specific analysis, from function 
  #   SummarizeAllRegions() if region-based analysis.
  rnaEdit_df = summarizedRegions_df,
  # a phenotype dataset that must have variable "sample" whose values are a 
  #   exact match to the colnames of "rnaEdit_df".
  pheno_df = pheno_df,
  # name of outcome variable in phenotype dataset "pheno_df" that you want to 
  #   test.
  responses_char = "age_at_diagnosis",
  # names of covariate variables in phenotype dataset "pheno_df" that you want
  #   to add into the model.
  covariates_char = NULL,
  # type of outcome variable that you input in argument "responses_char".
  respType = "continuous",
  # order the final results by p-values or not.
  orderByPval = TRUE
)
tumor_region_df[1:3, ]
##    seqnames    start      end width  estimate   stdErr     pValue       fdr
## 7      chr3 46417171 46417221    51 -29.56179 12.64776 0.02032533 0.2845546
## 2      chr1 28825394 28825403    10 -72.25518 35.67815 0.04406090 0.3084263
## 11    chr12 51324447 51324467    21  54.28664 29.48137 0.06691720 0.3122803

For survival outcome, for example, we use the following commands:

tumor_region_df <- TestAssociations(
  # an RNA editing dataframe with special class "rnaEdit_df" from function
  #   CreateEditingTable() if site-specific analysis, from function 
  #   SummarizeAllRegions() if region-based analysis.
  rnaEdit_df = summarizedRegions_df,
  # a phenotype dataset that must have variable "sample" whose values are a 
  #   exact match to the colnames of "rnaEdit_df".
  pheno_df = pheno_df,
  # name of outcome variable in phenotype dataset "pheno_df" that you want to 
  #   test.
  responses_char = c("OS.time", "OS"),
  # names of covariate variables in phenotype dataset "pheno_df" that you want
  #   to add into the model.
  covariates_char = NULL,
  # type of outcome variable that you input in argument "responses_char".
  respType = "survival",
  # order the final results by p-values or not.
  orderByPval = TRUE
)
tumor_region_df[1:3, ]
##    seqnames    start      end width      coef     exp_coef  se_coef
## 3      chr1 28825670 28825717    48 -13.64262 1.188740e-06 3.485844
## 14    chr12 51325324 51325383    60 -21.61839 4.085568e-10 5.753652
## 4      chr1 28825899 28825927    29 -24.57176 2.131191e-11 7.063456
##          pValue         fdr
## 3  9.088515e-05 0.001202134
## 14 1.717334e-04 0.001202134
## 4  5.038231e-04 0.002111975

6 Session information

Here is the R session information for this vignette:

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rnaEditr_1.0.0
## 
## loaded via a namespace (and not attached):
##  [1] MatrixGenerics_1.2.0        Biobase_2.50.0             
##  [3] httr_1.4.2                  tidyr_1.1.2                
##  [5] bit64_4.0.5                 splines_4.0.3              
##  [7] foreach_1.5.1               bumphunter_1.32.0          
##  [9] assertthat_0.2.1            askpass_1.1                
## [11] stats4_4.0.3                BiocFileCache_1.14.0       
## [13] doRNG_1.8.2                 blob_1.2.1                 
## [15] GenomeInfoDbData_1.2.4      Rsamtools_2.6.0            
## [17] yaml_2.2.1                  progress_1.2.2             
## [19] corrplot_0.84               backports_1.1.10           
## [21] pillar_1.4.6                RSQLite_2.2.1              
## [23] lattice_0.20-41             glue_1.4.2                 
## [25] digest_0.6.27               GenomicRanges_1.42.0       
## [27] XVector_0.30.0              plyr_1.8.6                 
## [29] htmltools_0.5.0             Matrix_1.2-18              
## [31] XML_3.99-0.5                pkgconfig_2.0.3            
## [33] broom_0.7.2                 biomaRt_2.46.0             
## [35] zlibbioc_1.36.0             purrr_0.3.4                
## [37] BiocParallel_1.24.0         tibble_3.0.4               
## [39] openssl_1.4.3               mgcv_1.8-33                
## [41] generics_0.0.2              IRanges_2.24.0             
## [43] ellipsis_0.3.1              SummarizedExperiment_1.20.0
## [45] GenomicFeatures_1.42.0      BiocGenerics_0.36.0        
## [47] formula.tools_1.7.1         survival_3.2-7             
## [49] magrittr_1.5                crayon_1.3.4               
## [51] memoise_1.1.0               evaluate_0.14              
## [53] mice_3.11.0                 nlme_3.1-150               
## [55] operator.tools_1.6.3        xml2_1.3.2                 
## [57] tools_4.0.3                 prettyunits_1.1.1          
## [59] hms_0.5.3                   lifecycle_0.2.0            
## [61] matrixStats_0.57.0          stringr_1.4.0              
## [63] S4Vectors_0.28.0            locfit_1.5-9.4             
## [65] rngtools_1.5                DelayedArray_0.16.0        
## [67] AnnotationDbi_1.52.0        Biostrings_2.58.0          
## [69] compiler_4.0.3              GenomeInfoDb_1.26.0        
## [71] logistf_1.24                rlang_0.4.8                
## [73] grid_4.0.3                  RCurl_1.98-1.2             
## [75] iterators_1.0.13            rappdirs_0.3.1             
## [77] bitops_1.0-6                rmarkdown_2.5              
## [79] codetools_0.2-16            DBI_1.1.0                  
## [81] curl_4.3                    R6_2.4.1                   
## [83] GenomicAlignments_1.26.0    knitr_1.30                 
## [85] dplyr_1.0.2                 rtracklayer_1.50.0         
## [87] bit_4.0.4                   stringi_1.5.3              
## [89] parallel_4.0.3              Rcpp_1.0.5                 
## [91] vctrs_0.3.4                 dbplyr_1.4.4               
## [93] tidyselect_1.1.0            xfun_0.18