This vignette is about the integration of gene and miRNA pairs and their expression dataset and analysis. The sample dataset in this demonstration, which contains human miRNA:target pairs, was retrieved from miRTarBase website (Release 7.0).
data("mirtarbasegene")
head(mirtarbasegene)
#> # A tibble: 6 × 2
#> miRNA Target
#> <chr> <chr>
#> 1 hsa-miR-20a-5p HIF1A
#> 2 hsa-miR-146a-5p CXCR4
#> 3 hsa-miR-122-5p CYP7A1
#> 4 hsa-miR-222-3p STAT5A
#> 5 hsa-miR-21-5p RASGRP1
#> 6 hsa-miR-148a-3p DNMT1
NOTE if the mirna:target dataset includes miRNA genes as targets, the priming_graph()
function can fail. Because, the function define to miRNAs and targets without distinguishing between uppercase or lowercase.
The gene and mirna expression counts of patient barcoded with TCGA-E9-A1N5 is retrieved from TCGA via TCGAbiolinks
package (Colaprico et al. 2015) from Bioconductor
. The instructions of retrieving data can be found at TCGAbiolinks
manual.
For this step you don’t have to use TCGA data, any other source or package can be utilized.
data("TCGA_E9_A1N5_normal")
head(TCGA_E9_A1N5_normal)
#> # A tibble: 6 × 7
#> patient sample barcode defin…¹ ensem…² exter…³ gene_…⁴
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 TCGA-E9-A1N5 TCGA-E9-A1N5-11A TCGA-E9-A1N5-11… Solid … ENSG00… TSPAN6 2057
#> 2 TCGA-E9-A1N5 TCGA-E9-A1N5-11A TCGA-E9-A1N5-11… Solid … ENSG00… TNMD 39
#> 3 TCGA-E9-A1N5 TCGA-E9-A1N5-11A TCGA-E9-A1N5-11… Solid … ENSG00… DPM1 708
#> 4 TCGA-E9-A1N5 TCGA-E9-A1N5-11A TCGA-E9-A1N5-11… Solid … ENSG00… SCYL3 926
#> 5 TCGA-E9-A1N5 TCGA-E9-A1N5-11A TCGA-E9-A1N5-11… Solid … ENSG00… C1orf1… 284
#> 6 TCGA-E9-A1N5 TCGA-E9-A1N5-11A TCGA-E9-A1N5-11… Solid … ENSG00… FGR 282
#> # … with abbreviated variable names ¹definition, ²ensembl_gene_id,
#> # ³external_gene_name, ⁴gene_expression
data("TCGA_E9_A1N5_tumor")
head(TCGA_E9_A1N5_tumor)
#> # A tibble: 6 × 7
#> patient sample barcode defin…¹ ensem…² exter…³ gene_…⁴
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 TCGA-E9-A1N5 TCGA-E9-A1N5-01A TCGA-E9-A1N5-01… Primar… ENSG00… TSPAN6 1033
#> 2 TCGA-E9-A1N5 TCGA-E9-A1N5-01A TCGA-E9-A1N5-01… Primar… ENSG00… TNMD 12
#> 3 TCGA-E9-A1N5 TCGA-E9-A1N5-01A TCGA-E9-A1N5-01… Primar… ENSG00… DPM1 1426
#> 4 TCGA-E9-A1N5 TCGA-E9-A1N5-01A TCGA-E9-A1N5-01… Primar… ENSG00… SCYL3 952
#> 5 TCGA-E9-A1N5 TCGA-E9-A1N5-01A TCGA-E9-A1N5-01… Primar… ENSG00… C1orf1… 404
#> 6 TCGA-E9-A1N5 TCGA-E9-A1N5-01A TCGA-E9-A1N5-01… Primar… ENSG00… FGR 315
#> # … with abbreviated variable names ¹definition, ²ensembl_gene_id,
#> # ³external_gene_name, ⁴gene_expression
data("TCGA_E9_A1N5_mirnatumor")
head(TCGA_E9_A1N5_mirnatumor)
#> # A tibble: 6 × 6
#> barcode mirbase_ID miRNA Precu…¹ total…² total…³
#> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 TCGA-E9-A1N5-01A-11R-A14C-13 MIMAT0000062 hsa-let-7a-… MI0000… 45725 2.08e4
#> 2 TCGA-E9-A1N5-01A-11R-A14C-13 MIMAT0004481 hsa-let-7a-… MI0000… 100 4.55e1
#> 3 TCGA-E9-A1N5-01A-11R-A14C-13 MIMAT0010195 hsa-let-7a-… MI0000… 6 2.73e0
#> 4 TCGA-E9-A1N5-01A-11R-A14C-13 MIMAT0000063 hsa-let-7b-… MI0000… 43489 1.98e4
#> 5 TCGA-E9-A1N5-01A-11R-A14C-13 MIMAT0004482 hsa-let-7b-… MI0000… 126 5.73e1
#> 6 TCGA-E9-A1N5-01A-11R-A14C-13 MIMAT0000064 hsa-let-7c-… MI0000… 2002 9.11e2
#> # … with abbreviated variable names ¹Precusor, ²total_read, ³total_RPM
data("TCGA_E9_A1N5_mirnanormal")
head(TCGA_E9_A1N5_mirnanormal)
#> # A tibble: 6 × 6
#> barcode mirbase_ID miRNA Precu…¹ total…² total…³
#> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 TCGA-E9-A1N5-11A-41R-A14C-13 MIMAT0000062 hsa-let-7a-… MI0000… 67599 37068.
#> 2 TCGA-E9-A1N5-11A-41R-A14C-13 MIMAT0004481 hsa-let-7a-… MI0000… 132 72.4
#> 3 TCGA-E9-A1N5-11A-41R-A14C-13 MIMAT0010195 hsa-let-7a-… MI0000… 57 31.3
#> 4 TCGA-E9-A1N5-11A-41R-A14C-13 MIMAT0000063 hsa-let-7b-… MI0000… 47266 25918.
#> 5 TCGA-E9-A1N5-11A-41R-A14C-13 MIMAT0004482 hsa-let-7b-… MI0000… 126 69.1
#> 6 TCGA-E9-A1N5-11A-41R-A14C-13 MIMAT0000064 hsa-let-7c-… MI0000… 14554 7981.
#> # … with abbreviated variable names ¹Precusor, ²total_read, ³total_RPM
Here’s the summary of size of each dataset
Dataset name | Number of rows |
---|---|
mirtarbasegene |
380627 |
TCGA_E9_A1N5_normal |
56830 |
TCGA_E9_A1N5_tumor |
56830 |
TCGA_E9_A1N5_mirnanormal |
750 |
TCGA_E9_A1N5_mirnatumor |
648 |
All of these datasets are integrated using the code below resulting in miRNA:target dataset that contains miRNA and gene expression values.
TCGA_E9_A1N5_mirnagene <- TCGA_E9_A1N5_mirnanormal %>%
inner_join(mirtarbasegene, by= "miRNA") %>%
inner_join(TCGA_E9_A1N5_normal,
by = c("Target"= "external_gene_name")) %>%
select(Target, miRNA, total_read, gene_expression) %>%
distinct()
Note: Some of genes have expression values more than one because some of tissue samples were sequenced in two medium separately. So, we select maximum expression values of that genes at following:
#> # A tibble: 26 × 3
#> # Groups: Target, miRNA [26]
#> Target miRNA n
#> <chr> <chr> <int>
#> 1 COG8 hsa-miR-186-5p 2
#> 2 GOLGA8M hsa-miR-1270 2
#> 3 GOLGA8M hsa-miR-5703 2
#> 4 MATR3 hsa-let-7e-5p 2
#> 5 MATR3 hsa-miR-1-3p 2
#> 6 MATR3 hsa-miR-10b-3p 2
#> 7 MATR3 hsa-miR-125b-5p 2
#> 8 MATR3 hsa-miR-149-5p 2
#> 9 MATR3 hsa-miR-155-5p 2
#> 10 MATR3 hsa-miR-16-1-3p 2
#> # … with 16 more rows
head(TCGA_E9_A1N5_mirnagene)
#> # A tibble: 6 × 4
#> Target miRNA total_read gene_expression
#> <chr> <chr> <int> <dbl>
#> 1 CDK6 hsa-let-7a-5p 67599 4669
#> 2 MYC hsa-let-7a-5p 67599 11593
#> 3 BCL2 hsa-let-7a-5p 67599 2445
#> 4 NKIRAS2 hsa-let-7a-5p 67599 1519
#> 5 ITGB3 hsa-let-7a-5p 67599 196
#> 6 NF2 hsa-let-7a-5p 67599 1755
When we compared the two gene expression dataset of TCGA-E9A1N5 patient, and selected a gene which has 30-fold increased expression, (gene name: HIST1H3H), this gene node will be used in the example. Note that the selected node must not be isolated one. If the an isolated node is selected the change in expression will not propagate in network. (You can see commands for node selection in the vignette The auxiliary commands which can help to the users)
Optionally, you can filter the low expressed gene nodes because they are not effective elements.
The analysis is performed based on amounts of miRNAs and targets as seen. Firstly, we tried to find optimal iteration for the network when simulation start with HIST1H3H node. As an example, simulation()
function was used with cycle = 5
argument, this argument can be arranged according to network. Note that it can be appropriate that using greater number of cycle
for comprehensive network objects.
simulation_res_HIST <- TCGA_E9_A1N5_mirnagene %>%
priming_graph(competing_count = gene_expression,
miRNA_count = total_read) %>%
update_how(node_name = "HIST1H3H", how =30) %>%
simulate(5)
simulation_res_HIST%>%
find_iteration(plot=TRUE)
The graph was shown that the change in expression level of HIST1H3H results in weak perturbation efficiency, despite 30-fold change. The code shown below can be used for calculation of fold changes after simulation HIST1H3H gene to 30 fold:
simulation_res_HIST%>%
as_tibble()%>%
mutate(FC= count_current/initial_count)%>%
arrange(desc(FC))
#> # A tibble: 13,432 × 8
#> name type node_id initial_count count_pre count_curr…¹ chang…² FC
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 HIST1H3H Competing 9705 27 808. 808. Down 29.9
#> 2 KLHL15 Competing 618 579 579. 579. Up 1.00
#> 3 YWHAZ Competing 393 15845 15845. 15845. Up 1.00
#> 4 SOCS1 Competing 433 27 27.0 27.0 Up 1.00
#> 5 NNT Competing 3404 1909 1909. 1909. Down 1.00
#> 6 PPIP5K2 Competing 7084 2342 2342. 2342. Down 1.00
#> 7 LYVE1 Competing 11198 916 916. 916. Down 1.00
#> 8 IL10 Competing 1639 13 13.0 13.0 Down 1.00
#> 9 L1CAM Competing 2651 60 60.0 60.0 Up 1.00
#> 10 IFITM1 Competing 6030 1550 1550. 1550. Down 1.00
#> # … with 13,422 more rows, and abbreviated variable names ¹count_current,
#> # ²changes_variable
And then, we tried to simulate the network with the gene which has higher expression value. For this, we selected ACTB node as shown in The auxiliary commands which can help to the users
simulation_res_ACTB <- TCGA_E9_A1N5_mirnagene %>%
priming_graph(competing_count = gene_expression,
miRNA_count = total_read) %>%
update_how(node_name = "ACTB", how =1.87) %>%
simulate(5)
simulation_res_ACTB%>%
find_iteration(plot=TRUE)
Following codes are shown entire gene fold changes after simulation ACTB gene to 1.87 fold:
simulation_res_ACTB%>%
as_tibble()%>%
mutate(FC= count_current/initial_count)%>%
arrange(desc(FC))
#> # A tibble: 13,432 × 8
#> name type node_id initial_count count_pre count_current chang…¹ FC
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 ACTB Competing 84 101917 183711. 183710. Down 1.80
#> 2 ADIPOR2 Competing 319 3015 3026. 3026. Down 1.00
#> 3 RRM2 Competing 67 1479 1484. 1484. Up 1.00
#> 4 FAM105A Competing 385 569 571. 571. Down 1.00
#> 5 YOD1 Competing 472 470 472. 472. Up 1.00
#> 6 ZNF264 Competing 302 1395 1400. 1400. Up 1.00
#> 7 DNAJC28 Competing 557 60 60.2 60.2 Down 1.00
#> 8 FXN Competing 374 382 383. 383. Up 1.00
#> 9 NKIRAS2 Competing 4 1519 1524. 1524. Down 1.00
#> 10 CDKN1A Competing 42 1172 1173. 1176. Up 1.00
#> # … with 13,422 more rows, and abbreviated variable name ¹changes_variable
Note: it can be useful that you look at The auxiliary commands which can help to the users for perturbation efficiency of ACTB gene by simulation with same conditions and different expression changes.
In a real biological sample, we tested perturbation efficiencies of two genes; * one with low expression but high fold change (HIST1H3H, 30-fold increase in tumor) * another one with high expression but small change in expression level (ACTB, 1.87-fold increase in tumor)
With these two samples, it has been obtained that expression values of genes, rest of the perturbed gene, changed slightly.
Despite high fold change, former gene caused little perturbation. When the perturbation efficiencies of both of these genes are analysed, it has been oberved that HIST1H3H does not affect the other genes in given limit. On the contrary, high expressing gene with very low fold increase in tumor causes greater perturbation in the network. Additionaly, the perturbation efficiency of ACTB gene is quite high from HIST1H3H with 30-fold change, when ACTB is simulated with 30 fold-change.
Thus, if the perturbed node has lower target:total target ratio in group or groups, the efficiency of it can be weak, or vice versa. The efficiency of ACTB gene may be high for this reason, in comparison with HIST1H3H perturbation. In fact, it has been observed that ACTB has not strong perturbation efficiency too. This could be arisen from low miRNA:target ratio or ineffective target nodes which have very low expression levels.
huge_example
) which includes miRNA and gene expressions and miRNA:target interaction factorsInteractions between miRNAs and their targets can be analyzed after the integration of miRNA and targets via various datasets. As an example, we prepared the huge_example dataset. It was generated by integrating:
Below, only 6 rows from total of 26,176 rows are shown.
data("huge_example")
head(huge_example)
#> competing miRNA competing_counts mirnaexpression_normal Energy
#> 1 TSPAN6 hsa-miR-484 5404 23.058807 -21.4
#> 2 DPM1 hsa-miR-18b-5p 2472 0.256209 -10.0
#> 3 SCYL3 hsa-miR-149-5p 1483 34.844420 -5.5
#> 4 SCYL3 hsa-miR-30a-5p 1483 63031.505507 -13.0
#> 5 C1orf112 hsa-miR-1296-5p 312 1.793463 -8.1
#> 6 CFH hsa-miR-17-5p 5760 89.929349 -17.2
#> region_effect seed_type_effect
#> 1 0.42 0.43
#> 2 0.84 0.01
#> 3 0.42 0.01
#> 4 0.84 0.43
#> 5 0.42 0.01
#> 6 0.42 0.01
The node that initiates simulation can be determined according your interest or research.
The dataset, which is a data frame, can be manipulated with tidyverse packages. As an example, competing RNAs targeted by less than 5 miRNAs are eliminated to make the network manageable size.
filtered_example <- huge_example %>%
add_count(competing) %>%
filter(n > 5) %>%
select(-n)
head(filtered_example)
#> competing miRNA competing_counts mirnaexpression_normal Energy
#> 1 MAD1L1 hsa-miR-149-5p 1909 34.84442 -18.6
#> 2 MAD1L1 hsa-miR-16-5p 1909 296.68999 -19.3
#> 3 MAD1L1 hsa-miR-196a-5p 1909 55.85356 -15.4
#> 4 MAD1L1 hsa-miR-20a-5p 1909 65.84570 -13.6
#> 5 MAD1L1 hsa-miR-30c-5p 1909 602.60349 -26.7
#> 6 MAD1L1 hsa-miR-92a-3p 1909 5112.64996 -30.6
#> region_effect seed_type_effect
#> 1 0.42 0.01
#> 2 0.01 0.07
#> 3 0.01 0.07
#> 4 0.42 0.01
#> 5 0.01 0.01
#> 6 0.01 0.05
On the other hand, we chose the node GAPDH according to interaction count of the nodes. With the simulation, the graph was visualized after node GAPDH was increased to five fold.
simulation_GAPDH <- filtered_example %>%
priming_graph(competing_count = competing_counts,
miRNA_count = mirnaexpression_normal,
aff_factor = Energy) %>%
update_how("GAPDH", 5)
simulation_GAPDH%>%
vis_graph(title = "Distribution of GAPDH gene node")
Let’s visualize each step of simulation via simulate_vis()
function.
simulation_GAPDH%>%
simulate_vis(title = "GAPDH over expression in the real dataset", 3)
#> # A tbl_graph: 1526 nodes and 11384 edges
#> #
#> # A directed acyclic simple graph with 1 component
#> #
#> # Node Data: 1,526 × 7 (active)
#> name type node_id initial_count count_pre count_current changes_vari…
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <chr>
#> 1 MAD1L1 Competing 1 1909 1909. 1909. Up
#> 2 TFPI Competing 2 3377 3377. 3377. Down
#> 3 SLC7A2 Competing 3 8706 8707. 8707. Down
#> 4 FKBP4 Competing 4 20583 20584. 20584. Down
#> 5 SLC25A13 Competing 5 1637 1638. 1638. Down
#> 6 ST7 Competing 6 1962 1962. 1962. Up
#> # … with 1,520 more rows
#> #
#> # Edge Data: 11,384 × 21
#> from to Compet… miRNA_… compet… mirnae… Energy dummy afff_f… degg_f…
#> <int> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1255 MAD1L1 hsa-mi… 1909 34.8 -18.6 1 0.628 1
#> 2 1 1256 MAD1L1 hsa-mi… 1909 297. -19.3 1 0.702 1
#> 3 1 1257 MAD1L1 hsa-mi… 1909 55.9 -15.4 1 0.544 1
#> # … with 11,381 more rows, and 11 more variables: comp_count_list <list>,
#> # comp_count_pre <dbl>, comp_count_current <dbl>, mirna_count_list <list>,
#> # mirna_count_pre <dbl>, mirna_count_current <dbl>,
#> # mirna_count_per_dep <dbl>, effect_current <dbl>, effect_pre <dbl>,
#> # effect_list <list>, mirna_count_per_comp <dbl>
Now, we can track changes in expression levels at every node for 3 cycles when GAPDH is overexpressed 5-fold.
After increase in GAPDH expression level in the first graph, the responses of the other competing elements to the GAPDH distributions were calculated.
The changing regulations (up or down) were observed as a result of interactions in the second graph.
When three graphs were carefully compared to each other, it can be observed that the expression levels of nodes change continuously at each stage.
find_node_perturbation()
runs calc_perturb
on all nodes in the network in parallel with help of the future
and furrr
packages. In this vignette, the function is demonstrated on the midsamp
data. This dataset is not comparable to actual biological miRNA:target gene datasets in size and complexity. Although find_node_perturbation()
runs in parallel it might take long time to run in real huge biological datasets.
In real biological datasets, more complex interactions whether functional or non-functional could be observed. We have improved our approach with fast
argument in find_node_perturbation()
based on selection of elements that could be affected from perturbation. In this fucntion, fast
argument specifies the percentage of the competing amount that can be affected within the initial competing amount and acts as a selection parameter. For instance, in filtered example data:
entire_perturbation <- filtered_example%>%
priming_graph(competing_count = competing_counts, miRNA_count = mirnaexpression_normal)%>%
find_node_perturbation(how=5, cycle=3, fast = 15)%>%
select(name, perturbation_efficiency, perturbed_count)
#> Subsetting by edges
entire_perturbation%>%
filter(!is.na(perturbation_efficiency), !is.na(perturbed_count))%>%
select(name, perturbation_efficiency, perturbed_count)
#> # A tibble: 53 × 3
#> name perturbation_efficiency perturbed_count
#> <chr> <dbl> <dbl>
#> 1 SLC25A5 1.34 43
#> 2 MSL3 0.207 43
#> 3 SCMH1 0.369 43
#> 4 RALBP1 0.657 43
#> 5 THUMPD1 0.654 43
#> 6 ARFGEF1 0.509 43
#> 7 PABPC1 6.35 43
#> 8 CNOT4 0 0
#> 9 PPP1R13B 0.285 43
#> 10 CSNK2A1 0.570 43
#> # … with 43 more rows
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] purrr_0.3.5 png_0.1-7 ceRNAnetsim_1.10.0 tidygraph_1.2.2
#> [5] dplyr_1.0.10
#>
#> loaded via a namespace (and not attached):
#> [1] ggrepel_0.9.1 Rcpp_1.0.9 tidyr_1.2.1 listenv_0.8.0
#> [5] assertthat_0.2.1 digest_0.6.30 utf8_1.2.2 ggforce_0.4.1
#> [9] parallelly_1.32.1 R6_2.5.1 evaluate_0.17 ggplot2_3.3.6
#> [13] highr_0.9 pillar_1.8.1 rlang_1.0.6 furrr_0.3.1
#> [17] jquerylib_0.1.4 rmarkdown_2.17 labeling_0.4.2 stringr_1.4.1
#> [21] igraph_1.3.5 polyclip_1.10-4 munsell_0.5.0 compiler_4.2.1
#> [25] xfun_0.34 pkgconfig_2.0.3 globals_0.16.1 htmltools_0.5.3
#> [29] tidyselect_1.2.0 tibble_3.1.8 gridExtra_2.3 codetools_0.2-18
#> [33] graphlayouts_0.8.3 fansi_1.0.3 future_1.28.0 viridisLite_0.4.1
#> [37] withr_2.5.0 MASS_7.3-58.1 grid_4.2.1 jsonlite_1.8.3
#> [41] gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3 magrittr_2.0.3
#> [45] scales_1.2.1 cli_3.4.1 stringi_1.7.8 cachem_1.0.6
#> [49] farver_2.1.1 viridis_0.6.2 bslib_0.4.0 ellipsis_0.3.2
#> [53] generics_0.1.3 vctrs_0.5.0 tools_4.2.1 glue_1.6.2
#> [57] tweenr_2.0.2 ggraph_2.1.0 parallel_4.2.1 fastmap_1.1.0
#> [61] yaml_2.3.6 colorspace_2.0-3 knitr_1.40 sass_0.4.2
Colaprico, Antonio, Tiago C Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S Sabedot, et al. 2015. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of Tcga Data.” Nucleic Acids Research 44 (8): e71–e71.