Contents

1 Introduction

TEKRABber is used to estimate the correlations between genes and transposable elements (TEs) from RNA-seq data comparing between: (1) Two Species (2) Control vs. Experiment. In the following sections, we will use built-in data to demonstrate how to implement TEKRABber on you own analysis.

2 Installation

To use TEKRABber from your R environment, you need to install it using BiocManager:

install.packages("BiocManager")
BiocManager::install("TEKRABber")
library(TEKRABber)
library(SummarizedExperiment) # load it if you are running this tutorial

3 Examples

3.1 Comparing between two species, human and chimpanzee as an example

Gene and TE expression data are generated from randomly picked brain regions FASTQ files from 10 humans and 10 chimpanzees (Khrameeva E et al., Genome Research, 2020). The values for the first column of gene and TE count table must be Ensembl gene ID and TE name:

# load built-in data
data(speciesCounts)
hmGene <- speciesCounts$hmGene
hmTE <- speciesCounts$hmTE
chimpGene <- speciesCounts$chimpGene
chimpTE <- speciesCounts$chimpTE
# the first column must be Ensembl gene ID for gene, and TE name for TE
head(hmGene)
##            Geneid SRR8750453 SRR8750454 SRR8750455 SRR8750456 SRR8750457
## 1 ENSG00000000003        250        267        227        286        128
## 2 ENSG00000000005         13          2         15          9          5
## 3 ENSG00000000419        260        311        159        259        272
## 4 ENSG00000000457         86        131        100         94         80
## 5 ENSG00000000460         21         17         42         33         55
## 6 ENSG00000000938        162         75         95        252        195
##   SRR8750458 SRR8750459 SRR8750460 SRR8750461 SRR8750462
## 1        394        268        102        370        244
## 2          0          1          8          0          2
## 3        408        371        126        211        374
## 4        158        119         46         77         81
## 5         29         50         11         18         20
## 6        137         93        108        197         69

3.1.1 Query ortholog information and estimate scaling factor

In the first step, we use orthologScale() to get orthology information and calculate the scaling factor between two species for normalizing orthologous genes. The species name needs to be the abbreviation of scientific species name used in Ensembl. (Note: (1)This step queries information using biomaRt and it might need some time or try different mirrors due to the connections to Ensembl (2)It might take some time to calculate scaling factor based on your data size). For normalizing TEs, you need to provide a repeatmasker annotation table including four columns, (1) the name of TE (2) the class of TE (3) the average gene length of TE from your reference species (4) the average gene length from the species you want to compare. A way to download repeatmasker annotations is to query from UCSC Genome Table Browser and select the RepeatMasker track.

# You can use the code below to search for species name
ensembl <- biomaRt::useEnsembl(biomart = "genes")
biomaRt::listDatasets(ensembl)
# In order to save time, we have save the data for this tutorial.
data(fetchDataHmChimp)
fetchData <- fetchDataHmChimp

# Query the data and calculate scaling factor using orthologScale():
#' data(speciesCounts)
#' data(hg38_panTro6_rmsk)
#' hmGene <- speciesCounts$hmGene
#' chimpGene <- speciesCounts$chimpGene
#' hmTE <- speciesCounts$hmTE
#' chimpTE <- speciesCounts$chimpTE
#' 
#' ## For demonstration, here we only select 1000 rows to save time
#' set.seed(1234)
#' hmGeneSample <- hmGene[sample(nrow(hmGene), 1000), ]
#' chimpGeneSample <- chimpGene[sample(nrow(chimpGene), 1000), ]
#' 
#' fetchData <- orthologScale(
#'     speciesRef = "hsapiens",
#'     speciesCompare = "ptroglodytes",
#'     geneCountRef = hmGeneSample,
#'     geneCountCompare = chimpGeneSample,
#'     teCountRef = hmTE,
#'     teCountCompare = chimpTE,
#'     rmsk = hg38_panTro6_rmsk
#' )

3.1.2 Create inputs for differentially expressed analysis and correlation estimation

We use DECorrInputs() to return input files for downstream analysis.

inputBundle <- DECorrInputs(fetchData)

3.1.3 Differentially expressed analysis (DE analysis)

In this step, we need to generate a metadata contain species name (i.e., human and chimpanzee). The row names need to be same as the DE input table and the column name must be species (see the example below). Then we use DEgeneTE() to perform DE analysis. When you are comparing samples between two species, the parameter expDesign should be TRUE (as default).

meta <- data.frame(
    species = c(rep("human", ncol(hmGene) - 1), 
    rep("chimpanzee", ncol(chimpGene) - 1))
)

meta$species <- factor(meta$species, levels = c("human", "chimpanzee"))
rownames(meta) <- colnames(inputBundle$geneInputDESeq2)
hmchimpDE <- DEgeneTE(
    geneTable = inputBundle$geneInputDESeq2,
    teTable = inputBundle$teInputDESeq2,
    metadata = meta,
    expDesign = TRUE
)

3.1.4 Correlation analysis

Here we use corrOrthologTE() to perform correlation estimation comparing each ortholog and TE. This is the most time-consuming step if you have large data. For a quick demonstration, we use a relatively small data. You can specify the correlation method and adjusted p-value method. The default methods are Pearson’s correlation and FDR. Note: For more efficient and specific analysis, you can subset your data in this step to focus on only the orthologs and TEs that you are interested in.

# load built-in data
data(speciesCorr)
hmGeneCorrInput <- assay_tekcorrset(speciesCorr, "gene", "human")
hmTECorrInput <- assay_tekcorrset(speciesCorr, "te", "human")
chimpGeneCorrInput <- assay_tekcorrset(speciesCorr, "gene", "chimpanzee")
chimpTECorrInput <- assay_tekcorrset(speciesCorr, "te", "chimpanzee")

hmCorrResult <- corrOrthologTE(
    geneInput = hmGeneCorrInput,
    teInput = hmTECorrInput,
    corrMethod = "pearson",
    padjMethod = "fdr"
)

chimpCorrResult <- corrOrthologTE(
    geneInput = chimpGeneCorrInput,
    teInput = chimpTECorrInput,
    corrMethod = "pearson",
    padjMethod = "fdr"
)

head(hmCorrResult)
##          geneName      teName      pvalue        coef      padj
## 1 ENSG00000143125        L1MD 0.271964872  0.38497828 0.9990235
## 2 ENSG00000143125        MSTA 0.335873091  0.34036703 0.9990235
## 3 ENSG00000143125  MLT1N2-int 0.966658172  0.01524552 0.9990235
## 4 ENSG00000143125       LTR57 0.067870603  0.59794954 0.9990235
## 5 ENSG00000143125 HERVK11-int 0.001028058  0.87118988 0.3210294
## 6 ENSG00000143125        LTR5 0.855235258 -0.06647109 0.9990235

3.1.5 Explore your result using appTEKRABber():

TEKRABber provides an app function for you to quickly view your result. First, you will need to assign the differentially expressed orthologs/TEs results, correlation results and metadata as global variables: appDE, appRef, appCompare and appMeta. See the following example.

#create global variables for app-use
appDE <- hmchimpDE
appRef <- hmCorrResult
appCompare <- chimpCorrResult
appMeta <- meta # this is the same one in DE analysis

appTEKRABber()

In the Expression tab page, (1) you can specify your input gene and TE. The result will show in box plots with data points in normalized log2 expression level (2) DE analysis result will show in table including statistical information (3) Correlation result will indicate if these selected pairs are significantly correlated and the value of correlation coefficients.