Introduction to anamiR

This document guides one through all available functions of the anamiR package. Package anamiR aims to find potential miRNA-target gene interactions from both miRNA and mRNA expression data.

Traditional miRNA analysis method is to use online databases to predict miRNA-target gene interactions. However, the false positive rate of prediction is so high that interactions are not reliable. To address this issue, anamiR integrates the whole expression analysis with expression data into workflow, including normalization, differential expression, correlation and then databases intersection, to find more reliable interactions.

0.1 Installation

anamiR is on Bioconductor and can be installed following standard installation procedure.

source("http://www.bioconductor.org/biocLite.R")
biocLite("anamiR")

To use,

library(anamiR)

## Loading required package: SummarizedExperiment

## Loading required package: GenomicRanges

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs

## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, cbind, colnames, do.call, duplicated, eval,
##     evalq, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, mapply, match, mget, order, paste, pmax, pmax.int,
##     pmin, pmin.int, rank, rbind, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:base':
## 
##     colMeans, colSums, expand.grid, rowMeans, rowSums

## Loading required package: IRanges

## Loading required package: GenomeInfoDb

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Setting options('download.file.method.GEOquery'='auto')

## Setting options('GEOquery.inmemory.gpl'=FALSE)

0.2 General Workflow

The general workflow can be summarized as follows,

Basically there are six steps, corresponding to six R functions, to complete the whole analysis:

Normalize expression data
Find differential expreesion miRNAs and genes
Convert miRNA annotation to the latest version
Calculate the correlation coefficient between each miRNA and gene
Intersect with prediction and validation of miRNA- target gene interactions databases
Functional analysis with interested genes

0.3 Data Source

As shown in the workflow, not only samples of paired miRNA and mRNA expression data ,but phenotypical information of miRNA and mRNA are also required for the analysis. Since anamiR reads data in expression matrices, data sources are platform and technology agnostic. particularly, expression data from microarray or next generation sequencing are all acceptable for anamiR. However, this also indicates the raw data should be pre-processd to the expression matrices before using anamiR.

0.3.1 mRNA expression

Columns for samples. Rows for genes

GENE  SmapleA   SamplB ...
A     0.1       0.2
B    -0.5      -0.3
C     0.4       0.1

0.3.2 miRNA expression

Columns for samples. Rows for miRNAs

miRNA  SampleA  SampleB ...
A         0.1     0.5
B        -0.5     2.1
C         0.4     0.3

0.3.3 phenotype data

Cloumns for samples. Rows for feature name, including two groups, multiple groups, continuous data.

Feature  SampleA  SampleB  SampleC ...
groupA   A        B        A
groupB   A        B        C
groupC   121.22   120.34   123.33

0.4 Usage Example

Now, we show an example using internal data for anamiR workflow.

0.4.1 Example Data Source

To demonstrate the usage of the anamiR package, the package contains 30 paired miRNA and mRNA breast cancer samples, which are selected from 101 miRNA samples and 114 mRNA samples from GSE19536. As for phenotype data (hybridization information), there are three types of information in it, including two-groups, multi-groups, continuous data.

The mRNA data was conducted by Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version) platform and the miRNA data was generated from Agilent-019118 Human miRNA Microarray 2.0 G4470B (miRNA ID version).

0.4.2 Format of Input Data

First of all, load the internal data and check the format.

data(mrna)
data(mirna)
data(pheno.mirna)
data(pheno.mrna)

Basically, the format of input data should be the same as the internal data. For mRNA expression data should be like,

mrna[1:5,1:5]

##          BC.M.014  BC.M.015  BC.M.017  BC.M.019  BC.M.023
## TRIM35   8.725157  8.699302  9.864870  8.986249  9.253038
## NR2C2AP 10.973569 11.318088 11.212974 11.803147 11.823159
## CENPO    7.404785  8.417221  8.272822  8.670029  9.392538
## ADCY8    5.785176  6.066669  6.240934  6.345486  6.037214
## RGS22    8.828413  7.844503  7.712051  7.078598  8.712837

As for miRNA expression data,

mirna[1:5,1:5]

##                BC.M.014  BC.M.015  BC.M.017 BC.M.019 BC.M.023
## hsa-miR-139-5p 3.301236  4.072743  3.476371 3.957260 6.662144
## hsa-miR-575    9.525705 10.097954 10.790396 9.365697 9.289803
## hsa-miR-33b*   4.089836  4.297618  4.051852 3.842798 3.044545
## hsa-miR-335    4.146060  5.615910  6.274007 5.655296 9.309553
## hsa-miR-16-2*  3.407055  4.023072  3.052135 4.208022 4.572399

And the phenotype data,

pheno.mrna[1:3,1:3]

##          ER                                  
## BC.M.014 "estrogen receptor status: Positive"
## BC.M.015 "estrogen receptor status: Positive"
## BC.M.017 "estrogen receptor status: Positive"
##          Subtype                        Survival          
## BC.M.014 "breast cancer subtype: Lum A" "126.611842105263"
## BC.M.015 "breast cancer subtype: Lum A" "126.513157894737"
## BC.M.017 "breast cancer subtype: Lum A" "80.9868421052632"

Actually, the phenotype data of miRNA and mRNA share the same contents,but in this case, we still make it in two data to prevent users from being confused about it.

0.4.3 Normalization (Optional)

Secondly, we normalize data. (If you use the normalized data, you can skip this step.)

se <- normalization(data = mirna, method = "quantile")

For this function, there are three methods provided, including quntile, rank.invariant, normal. For more detail, Please refer to their documentation.

Note that internal data have already been normalized, here is only for demonstration.

0.4.4 SummarizedExperiment class

Before entering the main workflow, we should put our data and phenotype information into SummarizedExperiment class first, which you can get more information from .

mrna_se <- SummarizedExperiment(
    assays = SimpleList(counts=mrna),
    colData = pheno.mrna)

mirna_se <- SummarizedExperiment(
    assays = SimpleList(counts=mirna),
    colData = pheno.mirna)

0.4.5 Differential Expression Analysis

Third, we will find differential expression genes and miRNAs. There are three statitical methods in this function. here, we use t.test for demonstration.

mrna_d <- differExp_discrete(se = mrna_se,
    class = "ER", method = "t.test",
    t_test.var = FALSE, log2 = FALSE,
    p_value.cutoff = 0.05,  foldchange = 0.5
)

mirna_d <- differExp_discrete(se = mirna_se,
   class = "ER", method = "t.test",
   t_test.var = FALSE, log2 = FALSE,
   p_value.cutoff = 0.05,  foldchange = 0.5
)

This function will delete genes and miRNAs (rows), which do not differential express, and add another three columns represent fold change (log2), p-value, adjusted p-value.

Take differential expressed mRNA data for instance,

nc <- ncol(mrna_d)
mrna_d[1:5, (nc-4):nc]

##          BC.M.512  BC.M.709 Fold-Change      P-Value     P-adjust
## NR2C2AP 11.142283 12.964909   1.0712486 2.316469e-04 5.312839e-04
## CENPO    8.450337  9.396517   0.8558411 2.682517e-04 5.748560e-04
## RGS22    6.908239  6.559562  -1.6894848 2.039121e-05 1.631297e-04
## KIF3C    8.383716  9.038231   0.5848147 4.541900e-05 2.270950e-04
## SENP5   10.864476 11.542792   0.7626540 8.630361e-07 1.917858e-05

0.4.6 Convert miRNA Annotation (Optional)

Before using collected databases for intersection with potential miRNA-target gene interactions, we have to make sure all miRNA are in the latest annotation version (miRBase 21). If not, we could use this function to do it.

mirna_21 <- miR_converter(data = mirna_d, remove_old = TRUE,
    original_version = 17, latest_version = 21)

Now, we can compare these two data,

# Before
head(row.names(mirna_d))

## [1] "hsa-miR-575"    "hsa-miR-150*"   "hsa-miR-198"    "hsa-miR-342-3p"
## [5] "hsa-miR-342-5p" "hsa-miR-134"

# After
head(row.names(mirna_21))

## [1] "hsa-miR-575"    "hsa-miR-150-3p" "hsa-miR-198"    "hsa-miR-342-3p"
## [5] "hsa-miR-342-5p" "hsa-miR-134-5p"

Note that user must put the right original version into parameter, because it is an important information for function to convert annotation.

0.4.7 Correlation Analysis

To find potential miRNA-target gene interactions, we should combine the information in two differential expressed data, which we obtained from differExp_discrete.

cor <- negative_cor(mrna_data = mrna_d, mirna_data = mirna_21,
    method = "pearson", cut.off = -0.5)

For output,

head(cor)

##      miRNA            Gene      Correlation( -0.5 )  FC(miRNA)     
## [1,] "hsa-miR-198"    "NR2C2AP" "-0.60749521052608"  "-1.434328524"
## [2,] "hsa-miR-198"    "SENP5"   "-0.54646311346452"  "-1.434328524"
## [3,] "hsa-miR-134-5p" "SENP5"   "-0.508211941696094" "-1.0180378"  
## [4,] "hsa-miR-198"    "AKAP1"   "-0.540628213927599" "-1.434328524"
## [5,] "hsa-miR-342-3p" "PIF1"    "-0.565476513859015" "-2.19495706" 
## [6,] "hsa-miR-342-5p" "PIF1"    "-0.621317621065092" "-2.115864648"
##      P-adjust(miRNA)        FC(gene)            P-adjust(gene)        
## [1,] "0.0287086140383543"   "1.071248568"       "0.000531283948592135"
## [2,] "0.0287086140383543"   "0.762654017199999" "1.91785808708134e-05"
## [3,] "0.000137286128508541" "0.762654017199999" "1.91785808708134e-05"
## [4,] "0.0287086140383543"   "1.2822530884"      "0.000659175618505387"
## [5,] "1.07192166300769e-09" "1.5501717548"      "0.00020239427080686" 
## [6,] "9.42861994530635e-11" "1.5501717548"      "0.00020239427080686"

As the showing list, each row is a potential interaction, and only the row that correlation coefficient < cut.off would be kept in list.

Note that in our assumption, miRNAs negatively regulate expression of their target genes, that is, cut.off basically should be negative decimal.

0.4.8 Heat map (optional)

There is a function for user to see the heatmaps about the miRNA-target gene interactions remaining in the correlation analysis table.

heat_vis(cor, mrna_d, mirna_21)

0.4.9 Intersect with Databases

After correlation analysis, we have some potential interactions, and then using database_support helps us to get information that whether there are databases predict or validate these interactions.

sup <- database_support(cor_data = cor,
    org = "hsa", Sum.cutoff = 3)

From output, column Sum tells us the total hits by 8 predict databases and column Validate tells us if this interaction have been validated.

head(sup)

##      miRNA_21         Gene_symbol Ensembl           Gene_ID
## [1,] "hsa-miR-342-3p" "ATF4"      "ENSG00000128272" "468"  
## [2,] "hsa-miR-342-3p" "PKP1"      "ENSG00000081277" "5317" 
## [3,] "hsa-miR-342-3p" "LRP8"      "ENSG00000157193" "7804" 
##      DIANA_microT_CDS EIMMo Microcosm miRDB miRanda PITA rna22 Targetscan
## [1,] 0                0     0         0     0       0    0     0         
## [2,] 1                1     0         0     1       0    1     0         
## [3,] 1                1     0         1     1       0    1     0         
##      Sum miRecords miRTarBase Evidence  Validate FC(miRNA)    
## [1,] 0   0         1          "Limited" "TRUE"   "-2.19495706"
## [2,] 4   0         1          "Limited" "TRUE"   "-2.19495706"
## [3,] 5   0         0          ""        "FALSE"  "-2.19495706"
##      P-adjust(miRNA)        FC(gene)       P-adjust(gene)         de novo
## [1,] "1.07192166300769e-09" "1.007603132"  "1.70115709649906e-05" 0      
## [2,] "1.07192166300769e-09" "1.49107022"   "0.00020239427080686"  0      
## [3,] "1.07192166300769e-09" "1.8819269508" "0.000574855984019839" 0

Note that we have 8 predict databases (DIANA microT CDS, EIMMo, Microcosm, miRDB, miRanda, PITA, rna22, TargetScan) and 2 validate databases (miRecords, miRTarBase).

0.4.10 Functional Analysis

The last, after finding reliable miRNA-target gene interactions, we are also interested in pathways, which may be enriched by these genes.

path <- enrichment(data_support = sup, org = "hsa", per_time = 1000)

Note that for parameter per_time, we only choose 1000 times, because it is for demonstration here. Default is 5000 times.

The output from this data not only shows P-Value generated by hypergeometric test, but Empirical P-Value, which means the value of average permutation test in each pathway.

head(path)

##      Database   Term                                    
## [1,] "KEGG_hsa" "Adrenergic signaling in cardiomyocytes"
## [2,] "KEGG_hsa" "Alcoholism"                            
## [3,] "KEGG_hsa" "Aldosterone synthesis and secretion"   
## [4,] "KEGG_hsa" "Amphetamine addiction"                 
## [5,] "KEGG_hsa" "Apoptosis"                             
## [6,] "KEGG_hsa" "cGMP-PKG signaling pathway"            
##      Total Genes of the Term Targets in the Term
## [1,] "148"                   "1"                
## [2,] "166"                   "1"                
## [3,] "64"                    "1"                
## [4,] "61"                    "1"                
## [5,] "133"                   "1"                
## [6,] "157"                   "1"                
##      Targets in Total Genes of the Term(%) Raw P-Value         
## [1,] "0.00675675675675676"                 "0.0857151676329049"
## [2,] "0.00602409638554217"                 "0.0957929254467815"
## [3,] "0.015625"                            "0.0376947588035947"
## [4,] "0.0163934426229508"                  "0.0359493445963326"
## [5,] "0.0075187969924812"                  "0.0772600588842339"
## [6,] "0.00636942675159236"                 "0.0907633548032286"
##      Empirical P-Value
## [1,] "0.001"          
## [2,] "0.002"          
## [3,] "0.037"          
## [4,] "0.038"          
## [5,] "0.004"          
## [6,] "0.003"