Fig 1. BioTIP workflow with five key analytic steps. RTF: relative transcript fluctuation; PCC: Pearson correlation coefficient; MCI: Module-Criticality Index; Ic: index of critical transition.

Fig 1. BioTIP workflow with five key analytic steps. RTF: relative transcript fluctuation; PCC: Pearson correlation coefficient; MCI: Module-Criticality Index; Ic: index of critical transition.

Standard workflow

Standard workflow: An Identification of Critical Tipping Point

Data Preprocessing

An existing dataset, GSE6136, is used to demonstrate how our functions are applied. Samples were collected from transgenic mouse lymphomas and divided into five groups based on clinical presentation, pathology and flow cytometry (Lenburg 2007), thus belonging to cross-sectional profiles. Noticing these five group represent a control stage similar to non-transgenic B cells and four major periods of B-cell lymphomagenesis, Dr. Chen and coauthors used the DNB method to identify the pre-disease state exists around the normal activated period (P2), i.e., the system transitions to the disease state around the transitional lymphoma period (Figure S12 in publication (Chen 2012)). Start by installing the package ‘BioTIP’ and other dependent packages such as stringr, psych, and igraph if necessary. Below are some examples.

# load package
library(BioTIP)

Once all the required packages are installed, load using “read.table()” function as follows. Note to change the read.table(“PATH”) when running your datasets. To check the dimension of your data set use “dim(dataset)” function. Here we show a use of dim function “dim(GSE6136)”. Notice that after editing the column and row, the dimension of our dataset changes from (22690,27) to (22690, 26) because we removed the downloaded first row after assigning it to be column-name of the final numeric data matrix.

data(GSE6136_matrix)

dim(GSE6136_matrix)               #[1] 22690rows and 27 columns
#> [1] 22690    27
row.names(GSE6136_matrix) = GSE6136_matrix$ID_REF
GSE6136 = GSE6136_matrix[,-1]
dim(GSE6136)               #[1] 22690 rows and 26 columns
#> [1] 22690    26

The summary of GSE6136_matrix is GSE6136_cli shown below. These two kind of data files can be downloaded from GSE database

#requires library(stringr)
library(BioTIP)
data(GSE6136_cli)

#dim(GSE6136_cli) #check dimension
cli = t(GSE6136_cli)

library(stringr)
colnames(cli) = str_split_fixed(cli[1,],'_',2)[,2]
cli = cli[-1,]
cli = data.frame(cli)
cli[,"cell-type:ch1"] = str_split_fixed(cli$characteristics_ch1.1,": ",2)[,2]
cli[,"Ig clonality:ch1"] = str_split_fixed(cli$characteristics_ch1.3,": ",2)[,2]

colnames(cli)[colnames(cli) == "cell-type:ch1"] = "group"
cli$Row.names = cli[,1]
head(cli[,1:3])
#>    geo_accession                status submission_date
#> V2     GSM142398 Public on Oct 28 2006     Oct 25 2006
#> V3     GSM142399 Public on Oct 28 2006     Oct 25 2006
#> V4     GSM142400 Public on Oct 28 2006     Oct 25 2006
#> V5     GSM142401 Public on Oct 28 2006     Oct 25 2006
#> V6     GSM142402 Public on Oct 28 2006     Oct 25 2006
#> V7     GSM142403 Public on Oct 28 2006     Oct 25 2006

We normalized the expression of genes using log2() scale. This normalization will ensure a more accurate comparison of the variance between the expression groups (clusters).

dat <- GSE6136
df <- log2(dat+1)
head(df)
#>              GSM142398 GSM142399 GSM142400 GSM142401 GSM142402 GSM142403
#> 1415670_at   10.219169 10.290480  9.987548 10.076816  9.827343  9.871289
#> 1415671_at   10.903581 11.159934 10.776186 10.929998 11.468268 11.200408
#> 1415672_at   11.115304 10.892087 11.091303 11.040290 11.109504 11.325305
#> 1415673_at    8.990388 10.265615  8.742141  8.422065  8.963474  8.874674
#> 1415674_a_at  9.911692 10.665869  9.942661  9.793766 10.243650 10.147078
#> 1415675_at    9.524933  9.896332  9.590774  9.375474  9.392747  9.422065
#>              GSM142404 GSM142405 GSM142406 GSM142407 GSM142408 GSM142409
#> 1415670_at    9.675428  9.950702  9.848153 10.103419 10.040838  9.823367
#> 1415671_at   10.654726 10.875135 10.854245 10.898450 10.736909 10.000000
#> 1415672_at   11.639702 10.989253 11.021605 11.255088 11.278449 11.232960
#> 1415673_at    8.923327  9.214562  8.965496  9.212861  9.121793  9.634993
#> 1415674_a_at 10.133271 10.344962 10.392210 10.551131  9.750205 10.642864
#> 1415675_at    9.396605  9.894666  9.818103  9.794741  9.857981 10.175924
#>              GSM142410 GSM142411 GSM142412 GSM142413 GSM142414 GSM142415
#> 1415670_at    9.974271 10.226412  9.471878  9.483413  9.743320 10.126317
#> 1415671_at   10.341185 10.515897 10.486835 11.247394 10.813861 10.714589
#> 1415672_at   11.227135 10.945444 10.904334 11.208173 11.316168 11.251542
#> 1415673_at    9.696794  9.112440  9.050121  9.026523  8.806711  9.242221
#> 1415674_a_at 10.239360 10.351381 10.469744  9.948075  9.760886 10.089980
#> 1415675_at    9.603997  9.392532  9.250062  9.563387  9.597680  9.594325
#>              GSM142416 GSM142417 GSM142418 GSM142419 GSM142420 GSM142421
#> 1415670_at    9.654457  9.976134 10.033423  9.712699 10.093286  9.939138
#> 1415671_at   10.429407 10.590681 10.663825 10.401733 10.812819 10.574026
#> 1415672_at   10.418696 11.120108 11.484219 11.799808 11.337064 11.169048
#> 1415673_at    8.240791 10.588715 10.380136 10.484521 10.616273  9.051209
#> 1415674_a_at  9.813941 10.572984 10.664181 10.667023 10.865347  9.748696
#> 1415675_at    8.961739  9.798634  9.638436  9.446670  9.690522  9.389739
#>              GSM142422 GSM142423
#> 1415670_at   10.152792  9.838258
#> 1415671_at   10.615905 10.375908
#> 1415672_at   11.469845 11.542500
#> 1415673_at    9.899055 10.382732
#> 1415674_a_at  9.434003  9.690696
#> 1415675_at    9.111657  9.116084

Go to Top

Pre-selection Transcript

Once normalized, we can now classify different stages. The tipping point is within the “activated” state in this case. Here we see the number of samples that are classified into states “activated”, “lymphoma_aggressive”, “lymphoma_marginal”, “lymphoma_transitional” and “resting”. For instance, states “activated” and “resting” contain three and four samples, respectively. All the contents of the data set “test” can be viewed using View(test). Each clinical state’s content can be viewed using head(test["stage_name"]). For instance, head(test[“activated”]) shows contents of the activated state.

tmp <- names(table(cli$group))
samplesL <- split(cli[,1],f = cli$group)
#head(samplesL)
test <- sd_selection(df, samplesL,0.01)
head(test[["activated"]])
#>              GSM142399 GSM142422 GSM142423
#> 1415766_at    7.600656  8.778077  9.036723
#> 1415827_a_at 11.002252 13.079218 13.205503
#> 1415904_at   12.229810  5.885086  4.217231
#> 1415985_at   11.786106 10.736656 10.044940
#> 1416034_at   11.417114 13.474682 13.760252
#> 1416071_at   11.194388 10.362273  9.993646

<a name=“>Go to Top

Network Partition

A graphical represetation of genes of interest can be achieved using the functions shown below. The getNetwork function will obtain an igraph object based on a pearson correlation of test. This igraphL object is then run using the getCluster_methods function classify nodes.

library(BioTIP)
#library(igraph)
library(cluster)
igraphL <- getNetwork(test, fdr = 1)
#> activated:226 nodes
#> lymphoma_aggressive:226 nodes
#> lymphoma_marginal:226 nodes
#> lymphoma_transitional:226 nodes
#> resting:226 nodes

cluster <- getCluster_methods(igraphL)
names(cluster)
#> [1] "activated"             "lymphoma_aggressive"   "lymphoma_marginal"    
#> [4] "lymphoma_transitional" "resting"
head(cluster[[1]])
#> $`1`
#>   [1] "1415766_at"   "1415827_a_at" "1415904_at"   "1415985_at"  
#>   [5] "1416034_at"   "1416071_at"   "1416138_at"   "1416488_at"  
#>   [9] "1416629_at"   "1416632_at"   "1416892_s_at" "1416930_at"  
#>  [13] "1417185_at"   "1417244_a_at" "1417815_a_at" "1417816_s_at"
#>  [17] "1417934_at"   "1418000_a_at" "1418133_at"   "1418191_at"  
#>  [21] "1418466_at"   "1418580_at"   "1418687_at"   "1418727_at"  
#>  [25] "1419027_s_at" "1419135_at"   "1419170_at"   "1419186_a_at"
#>  [29] "1419357_at"   "1419412_at"   "1419538_at"   "1419659_s_at"
#>  [33] "1419676_at"   "1419680_a_at" "1420138_at"   "1420503_at"  
#>  [37] "1420619_a_at" "1421055_at"   "1421279_at"   "1421377_at"  
#>  [41] "1421469_a_at" "1421560_at"   "1421688_a_at" "1421830_at"  
#>  [45] "1421832_at"   "1421854_at"   "1421855_at"   "1422032_a_at"
#>  [49] "1422302_s_at" "1422709_a_at" "1422885_at"   "1423030_at"  
#>  [53] "1423166_at"   "1423200_at"   "1423419_at"   "1423493_a_at"
#>  [57] "1423591_at"   "1423747_a_at" "1423826_at"   "1423961_at"  
#>  [61] "1424193_at"   "1424522_at"   "1424573_at"   "1424775_at"  
#>  [65] "1424990_at"   "1425154_a_at" "1425356_at"   "1425593_at"  
#>  [69] "1425650_at"   "1425675_s_at" "1425745_a_at" "1425792_a_at"
#>  [73] "1426213_at"   "1426269_at"   "1426562_a_at" "1426713_s_at"
#>  [77] "1426794_at"   "1426875_s_at" "1427562_a_at" "1428040_at"  
#>  [81] "1428106_at"   "1428306_at"   "1428476_a_at" "1428616_at"  
#>  [85] "1428648_at"   "1428820_at"   "1431056_a_at" "1431591_s_at"
#>  [89] "1434045_at"   "1435476_a_at" "1435477_s_at" "1435602_at"  
#>  [93] "1435652_a_at" "1436032_at"   "1436900_x_at" "1436996_x_at"
#>  [97] "1437172_x_at" "1437502_x_at" "1437765_at"   "1437890_at"  
#> [101] "1438095_x_at" "1438385_s_at" "1438548_x_at" "1438932_at"  
#> [105] "1438933_x_at" "1448163_at"   "1448182_a_at" "1448391_at"  
#> [109] "1448403_at"   "1448724_at"   "1448731_at"   "1448957_at"  
#> [113] "1449474_a_at" "1449553_at"   "1450107_a_at" "1450300_at"  
#> [117] "1450484_a_at" "1450490_at"   "1450814_a_at" "1450844_at"  
#> [121] "1450883_a_at" "1450884_at"   "1450903_at"   "1450945_at"  
#> [125] "1451110_at"   "1451143_at"   "1451191_at"   "1451227_a_at"
#> [129] "1451256_at"   "1451321_a_at" "1451335_at"   "1451426_at"  
#> [133] "1451468_s_at" "1451489_at"   "1451924_a_at" "1452162_at"  
#> [137] "1452170_at"   "1452252_at"   "1452532_x_at" "1453360_a_at"
#> [141] "1454021_a_at" "1454626_at"   "1454963_at"   "1455332_x_at"
#> [145] "1455550_x_at" "1455873_a_at" "1455899_x_at" "1456066_a_at"
#> [149] "1456323_at"   "1460379_at"   "1460682_s_at"
#> 
#> $`2`
#>  [1] "1416274_at"   "1416573_at"   "1417862_at"   "1418103_at"  
#>  [5] "1418525_at"   "1418535_at"   "1418691_at"   "1419415_a_at"
#>  [9] "1420667_at"   "1422242_at"   "1422598_at"   "1422898_s_at"
#> [13] "1423854_a_at" "1424310_at"   "1425093_at"   "1425744_a_at"
#> [17] "1425834_a_at" "1425843_at"   "1426444_at"   "1430167_a_at"
#> [21] "1432155_at"   "1432360_a_at" "1434009_at"   "1435317_x_at"
#> [25] "1435561_at"   "1436364_x_at" "1448058_s_at" "1449266_at"  
#> [29] "1449318_at"   "1449657_at"   "1450058_at"   "1450073_at"  
#> [33] "1450552_at"   "1450817_at"   "1451030_at"   "1451530_at"  
#> [37] "1451861_at"   "1454078_a_at" "1455890_x_at" "1460653_at"  
#> 
#> $`3`
#>  [1] "1416823_a_at" "1417184_s_at" "1417498_at"   "1417714_x_at"
#>  [5] "1417863_at"   "1418113_at"   "1419028_at"   "1419734_at"  
#>  [9] "1421044_at"   "1421351_at"   "1421584_at"   "1421848_at"  
#> [13] "1421961_a_at" "1422092_at"   "1423244_at"   "1424086_at"  
#> [17] "1424690_at"   "1426225_at"   "1426614_at"   "1426973_at"  
#> [21] "1427627_at"   "1427828_at"   "1428361_x_at" "1429980_x_at"
#> [25] "1435759_at"   "1436336_at"   "1437684_at"   "1438564_at"  
#> [29] "1450198_at"   "1450227_at"   "1451999_at"   "1452757_s_at"
#> [33] "1453886_a_at" "1454629_at"   "1460667_at"

<a name=“>Go to Top

Identifying Dynamic Network Biomodule

Here, ‘module’ refers to a cluster of network nodes (e.g. transcripts) highly linked (e.g. by correlation). “Biomodule” refers to the module resenting a highest score called “Module-Criticality Index (MCI)” per state.

The following step shows a graph of classified clustered samples for five different stages. MCI score is calculated for each module using the getMCI function. The getMaxMCImember function will obtain a list of modules with highest MCI at each stage. Use "head(maxCIms)" to view the MCI scores calculated. Use plotMaxMCI function to view the line plot of highest MCI score at each stage.

membersL_noweight <- getMCI(cluster,test,adjust.size = FALSE)
plotBar_MCI(membersL_noweight,ylim = c(0,6))

maxMCIms <- getMaxMCImember(membersL_noweight[[1]],membersL_noweight[[2]],min =10)
names(maxMCIms)
#> [1] "idx"     "members"
names(maxMCIms[[1]])
#> [1] "activated"             "lymphoma_aggressive"   "lymphoma_marginal"    
#> [4] "lymphoma_transitional" "resting"
names(maxMCIms[[2]])
#> [1] "activated"             "lymphoma_aggressive"   "lymphoma_marginal"    
#> [4] "lymphoma_transitional" "resting"
head(maxMCIms[['idx']])
#>             activated   lymphoma_aggressive     lymphoma_marginal 
#>                     3                     1                     1 
#> lymphoma_transitional               resting 
#>                     5                     3
head(maxMCIms[['members']][['lymphoma_aggressive']])
#> [1] "1416001_a_at" "1416002_x_at" "1416527_at"   "1416754_at"  
#> [5] "1416771_at"   "1417133_at"

To get the selected statistics of biomodules (the module that has the highest MCI score) of each state, please run the following

biomodules = getMaxStats(membersL_noweight[['members']],maxMCIms[[1]])
maxMCI = getMaxStats(membersL_noweight[['MCI']],maxMCIms[[1]])
head(maxMCI)
#>             activated   lymphoma_aggressive     lymphoma_marginal 
#>              3.522785              3.334630              2.387766 
#> lymphoma_transitional               resting 
#>              2.448548              2.886293
maxSD = getMaxStats(membersL_noweight[['sd']],maxMCIms[[1]])
head(maxSD)
#>             activated   lymphoma_aggressive     lymphoma_marginal 
#>              2.139577              1.885470              1.295115 
#> lymphoma_transitional               resting 
#>              1.790035              1.469989

To get the biomodule with the highest MCI score among all states, as we call it CTS (Critical Transition Signals), please run the following.

CTS = getCTS(maxMCI, maxMCIms[[2]])
#> Length: 35

Run the following to visualize the trendence of every state represented by the cluster with the highest MCI scores.

par(mar = c(10,5,0,2))
plotMaxMCI(maxMCIms,membersL_noweight[[2]],states = names(samplesL),las = 2)

We then perform simulation for MCI scores based on identified signature size (length(CTS) ) using the simulationMCI function.Use plot_MCI_simulation function to visualize the result. This step usually takes 20-30mins, so here to save the time, we picked a small number 3 as the length of the CTS.

simuMCI <- simulationMCI(3,samplesL,df)
plot_MCI_Simulation(maxMCI,simuMCI)

Go to Top

Finding Tipping Point

The next step is to calculate an Index of Critical transition (Ic score) of the dataset. First, use the getIc function to calculate the Ic score based on the biomodule previously identified. We use the plotIc function to draw a line plot of the Ic score.

IC <- getIc(df,samplesL,CTS)
par(mar = c(10,5,0,2))
plotIc(IC,las = 2)

Then use the two functions to evaluate two types of empirical significance, respectively. The first function simulation_Ic calculates random Ic-scores by shuffling features (transcripts). Showing in the plot is Ic-score of the identification (red) against its corresponding size-controlled random scores (grey).

simuIC <- simulation_Ic(length(CTS),samplesL,df)
par(mar = c(10,5,0,2))
plot_Ic_Simulation(IC,simuIC,las = 2)

Another function plot_simulation_sample calculates random Ic-scores by shuffling samples and visualizes the results. Showing in the plot is observed Ic-score (red vertical line) comparing to the density of random scores (grey), at the tipping point identified.

simulated_Ic = plot_simulation_sample(df,length(samplesL[['lymphoma_aggressive']]),IC[['lymphoma_aggressive']],CTS)

<a name=“>Go to top

###Transcript Annotation and Biotype### ###################################################

Quick Start

The R function getBiotype is used to group transcripts of interest into 11 biotypes based on GENCODE annotation (Fig 2a). When a query transcript overlaps at least half with a GENCODE transcript on the same strand, this query transcript will inherit the biotype of the GENCODE transcript.

In the previous study conducted, five out of the 11 biotypes showed high protein-coding potential while the others did not (Fig 2b) [4]. We thus concluded these seven other biotypes, including protein-coding antisense RNAs, to be lncRNAs. The remaining coding biotypes in this study included canonic protein coding (CPC), ‘PC_mixed’, and ‘PC_intron’.

First start by loading the required libraries: “GenomeInfoDb,” “BioTIP,” “GenomicRanges,” “IRanges” and “BioTIP”. Next load the datasets: “gencode”, “ILEF”, “intron” and “cod”. Using these datasets, excute BioTIP functions getBiotypes and getReadthrough as follows. These steps assume you installed the “BioTIP” package. If you did not install the package, use the install.packages("BioTIP") to install in R.

Fig 2. A getBiotypes workflow and protein-coding potential in real data analysis [4]. (a) Workflow of an in-house R function (getBiotypes) to query transcripts of interests and classify into biotypes. (b) Pie-chart of eleven types of transcripts assembled from polyadenylated RNA(TARGET). (c) Empirical cumulative distribution plot comparing the transcripts across all 11 biotypes. The protein-coding potential was estimated with the Coding Potential Assessment Tool (CPAT). Line color codes biotypes. The more a line towards the right-bottom corner, the lower protein-coding potential it has.

library(BioTIP)

data(gencode)
head(gencode)
#> GRanges object with 6 ranges and 1 metadata column:
#>       seqnames          ranges strand |  biotype
#>          <Rle>       <IRanges>  <Rle> | <factor>
#>   [1]    chr21 9683191-9683272      + |    miRNA
#>   [2]    chr21 9683191-9683272      + |    miRNA
#>   [3]    chr21 9683191-9683272      + |    miRNA
#>   [4]    chr21 9825832-9826011      + |    miRNA
#>   [5]    chr21 9825832-9826011      + |    miRNA
#>   [6]    chr21 9825832-9826011      + |    miRNA
#>   -------
#>   seqinfo: 25 sequences from an unspecified genome; no seqlengths

These illustrations above assumes you have installed “BioTIP” package. If you did not install the package already, use the install.packages("BioTIP") to install in R.


Genomic Data Source

High quality human genome sequence data can be obtained from various sources. To demonstrate this package, we obtained a comprehensive gene annotation of human GRCh37 from GENCODE. For our illustrations, human GRCh37 data will be used. A standard file structure, similar to general transfer format (gtf) format, is required for this package. This gtf file organizes genomic data in rows and columns (fields). Each row contains information about specific samples. The columns are tab separated headers of the data frame. There are eight fixed columns with specific headers. An example of gtf format is shown below. For details of the gtf file format visit this link.

Chromosome 21 of human GRCh37 gtf

Chromosome 21 of human GRCh37 gtf

The table above contains a chr21 dataset which was extracted from a full genome dataset. An extraction method for filtering chr21 from gencode file is described below.

Go to Top

Extracting Summary Data

Before any further analysis, we need to summarize the content of the raw gtf data. There are two ways to get genome biotypes: a) “transcript_type” b) “gene_type”. Due to our interst in coding and noncoding regions, the transcript_type method was used to extract the regions of interest using python script shown below. Note that the "PATH_FILE" refers to the path where the downloded gtf file is located. For instance, if the gtf file is located on your desktop, replace the "PATH_FILE" Cc "Users/user/Desktop/gtf".

Python codes:

gtf = ("Your/PATH/TO/THE/FILE")
outF = open("gtf_summary_transbiotype.txt","w")

def getquote(str,f,target):
    targetLen = len(target)+2
    strInd = str.find(target)
    st = strInd + len(target)+2
    ed = st + str[st:].find("";")
    #print(st,ed)
    f.write("\t"+str[st:ed]) if strInd!= -1 else f.write("\t"+"NA.")

with open(gtf, "r") as f:
     for line in f:
        if line[0] != "#":
            chromosome = line.split("\t")[0]
            st = line.split("\t")[3]
            ed = line.split("\t")[4]
            strand = line.split("\t")[6]
            type = line.split("\t")[2]
            outF.write(chromosome+"\t"+st+"\t"+ed+"\t"+strand+"\t"+type)
            c = "transcript_id"
            g = "gene_name"
            t = "transcript_type"
            getquote(line,outF,c)
            getquote(line,outF,g)
            getquote(line,outF,t)
            outF.write("\n")
outF.close() 

Go to Top  

Loading Data

In order to load your data from a local drive, use the following format. Note that the "PATH_FILE" refers to the location of the summary data from the above section. For more details on how to load datasets click here.

loading data from local drive

data <- read.delim(“PATH_FILE”, comment.char = “#”)

Internal BioTIP package data is included in the data folder. The data can be loaded into R working console using data()function. Here we show an example of how to load a dataset gencode from the data directory. A quick view of the data can be achieved using head(gencode).

library(BioTIP)
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect,
#>     is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
#>     pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply,
#>     setdiff, sort, table, tapply, union, unique, unsplit, which,
#>     which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
data(gencode)
head(gencode)
#> GRanges object with 6 ranges and 1 metadata column:
#>       seqnames          ranges strand |  biotype
#>          <Rle>       <IRanges>  <Rle> | <factor>
#>   [1]    chr21 9683191-9683272      + |    miRNA
#>   [2]    chr21 9683191-9683272      + |    miRNA
#>   [3]    chr21 9683191-9683272      + |    miRNA
#>   [4]    chr21 9825832-9826011      + |    miRNA
#>   [5]    chr21 9825832-9826011      + |    miRNA
#>   [6]    chr21 9825832-9826011      + |    miRNA
#>   -------
#>   seqinfo: 25 sequences from an unspecified genome; no seqlengths

Go to Top

 

Prepare GRanges Object

Here we show an extraction of “gencode” dataset using R commands. Note to replace PATH_FILE with file direcotry path. gtf refers to the full genome file. A subset function was used to filter chr21 dataset as follows.

chr21 <- subset(gencode, seqnames == "chr21") #“genecode” = whole genome gtf

> gtf = read.table("PATH_FILE")
> gtf = subset(gtf, biotype == "transcript")
> colnames(gtf) = c("chr","start","end","strand","biotype")
> gr = GRanges(gtf)

Go to Top

Processing Query

query <- GRanges(c("chr1:2-10:+","chr1:6-10:-"), Row.names = c("trans1","trans2"), score = c(1,2))
head(query)
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames    ranges strand |   Row.names     score
#>          <Rle> <IRanges>  <Rle> | <character> <numeric>
#>   [1]     chr1      2-10      + |      trans1         1
#>   [2]     chr1      6-10      - |      trans2         2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
Classifying Biotypes

library(BioTIP)
gr <- GRanges(c("chr1:1-5:+","chr1:2-3:+"),biotype = c("lincRNA","CPC"))
head(gr)
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames    ranges strand |     biotype
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]     chr1       1-5      + |     lincRNA
#>   [2]     chr1       2-3      + |         CPC
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
Extracting intron coordinates

  # Intron coordinates
  
   intron <- GRanges("chr1:6-8:+")

intron <- GRanges("chr1:6-8:+")
head(intron)
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       6-8      +
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
Filtering coding transcripts

# Filtering non-coding regions using products from example 1, 2 and 3

intron_trncp <- getBiotypes(query, gr, intron)
intron_trncp
#>   seqnames start end width strand Row.names score type.fullOverlap
#> 1     chr1     2  10     9      +    trans1     1          de novo
#> 2     chr1     6  10     5      -    trans2     2          de novo
#>   type.partialOverlap type.50Overlap hasIntron type.toPlot
#> 1        lincRNA, CPC   lincRNA, CPC       yes     lincRNA
#> 2             de novo        de novo        no     de novo
# Filtering Intron and Exons 

Here we show how to obtain protein coding and non-coding from our datasets. The coding transcripts are an expressed section of the genome that is responsible for protein formation. Meanwhile the non-coding transcripts are vital in the formation regulatory elements such promoters, enhancers and silencers.

library(BioTIP)
data("intron")
data("ILEF")
data("gencode")

gencode_gr = GRanges(gencode)
ILEF_gr = GRanges(ILEF)
cod_gr = GRanges(cod)
intron_gr = GRanges(intron)

non_coding <- getBiotypes(ILEF_gr, gencode_gr, intron_gr)
#> Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
#>   suppressWarnings() to suppress this warning.)
dim(non_coding)
#> [1] 300  11
head(non_coding[,1:3])
#>   seqnames    start      end
#> 1    chr21 15608524 15710335
#> 2    chr21 43619799 43717938
#> 3    chr21 28208595 28217692
#> 4    chr21 28279059 28339668
#> 5    chr21 46493768 46646483
#> 6    chr21 45285030 45407475
coding <- getBiotypes(ILEF_gr, gencode_gr)
dim(coding)
#> [1] 300  11
head(coding[,1:3])
#>   seqnames    start      end
#> 1    chr21 15608524 15710335
#> 2    chr21 43619799 43717938
#> 3    chr21 28208595 28217692
#> 4    chr21 28279059 28339668
#> 5    chr21 46493768 46646483
#> 6    chr21 45285030 45407475
Finding overlapping transcripts

# Samples with overlapping coding regions.
library(BioTIP)

data(ILEF)
data(cod)
ILEF_gr = GRanges(ILEF)
cod_gr = GRanges(cod)

rdthrough <- getReadthrough(ILEF_gr, cod_gr)
head(rdthrough)
#>   seqnames    start      end  width strand Row.names readthrough
#> 1    chr21 15608524 15710335 101812      +    ABCC13           0
#> 2    chr21 43619799 43717938  98140      +     ABCG1           1
#> 3    chr21 28208595 28217692   9098      -   ADAMTS1           1
#> 4    chr21 28279059 28339668  60610      -   ADAMTS5           1
#> 5    chr21 46493768 46646483 152716      +    ADARB1           1
#> 6    chr21 45285030 45407475 122446      +    AGPAT3           1


Acknowledgements

The development of this package would not be possible without continous help and feedback from individuals and institutions including: The Bioconductor Core Team, Dr. Xianan H Yang, Dr. Tzintzuni Garcia, and National Institutes of Health R21LM012619.


sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0  IRanges_2.20.0      
#> [4] S4Vectors_0.24.0     BiocGenerics_0.32.0  cluster_2.1.0       
#> [7] stringr_1.4.0        BioTIP_1.0.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] igraph_1.2.4.1         Rcpp_1.0.2             XVector_0.26.0        
#>  [4] knitr_1.25             magrittr_1.5           zlibbioc_1.32.0       
#>  [7] mnormt_1.5-5           lattice_0.20-38        rlang_0.4.1           
#> [10] highr_0.8              tools_3.6.1            grid_3.6.1            
#> [13] nlme_3.1-141           xfun_0.10              psych_1.8.12          
#> [16] htmltools_0.4.0        yaml_2.2.0             digest_0.6.22         
#> [19] GenomeInfoDbData_1.2.2 bitops_1.0-6           RCurl_1.95-4.12       
#> [22] evaluate_0.14          rmarkdown_1.16         stringi_1.4.3         
#> [25] compiler_3.6.1         foreign_0.8-72         pkgconfig_2.0.3

<a name=“>Go to Top

References

  • Scheffer M, Carpenter SR, Lenton TM, Bascompte J, Brock W, Dakos V, et al. Anticipating critical transitions. Science. 2012;338(6105):344-8. doi: 10.1126/science.1225244. PubMed PMID: 23087241.
  • Chen L, Liu R, Liu ZP, Li M, Aihara K. Detecting early-warning signals for sudden deterioration of complex diseases by dynamical network biomarkers. Sci Rep. 2012;2:342. Epub 2012/03/31. doi: 10.1038/srep00342. PubMed PMID: 22461973; PubMed Central PMCID: PMC3314989.
  • Lenburg, M. E., A. Sinha, D. V. Faller and G. V. Denis (2007). “Tumor-specific and proliferation-specific gene expression typifies murine transgenic B cell lymphomagenesis.” J Biol Chem 282(7): 4803-4811.PMC2819333
  • Moris, N., C. Pina and A. M. Arias (2016). “Transition states and cell fate decisions in epigenetic landscapes.” Nat Rev Genet 17(11): 693-703. PMID: 27616569.
  • Mojtahedi M, Skupin A, Zhou J, Castano IG, Leong-Quong RY, Chang H, et al. Cell Fate Decision as High-Dimensional Critical State Transition. PLoS Biol. 2016;14(12):e2000640. doi: 10.1371/journal.pbio.2000640. PubMed PMID: 28027308; PubMed Central PMCID: PMCPMC5189937.
  • Wang, Z. Z., J. M. Cunningham and X. H. Yang (2018). “CisPi: a transcriptomic score for disclosing cis-acting disease-associated lincRNAs.” Bioinformatics34(17): 664-670"