Content
- 0.0 Introduction
- 1.0 Basic: Input Preparation and Prediction
- 2.0 Advanced 1: Input Matrix and Statistics
- 2.1 Advanced 2: Helping Functions
- 2.2 Advanced 3: Clustering and Multiple Datasets
- 3.0 List of Functions
- 4.0 Acknowledgment
- 5.0 References
0.0 Introduction
In the last years a class of ncRNAs were described named lincRNAs (large intergenic noncoding RNAs) as untranslated transcripts with a length of over 200 bp. [1, 2] A number of papers focused on their tissue-specific upregulation in cancer cells. [1, 3, 4] The annotation and functional classification of lincRNAs (and ncRNAs) remain challenging. Given a RNAseq or microarray experiment, one possible approach to identify candidate ncRNAs poses “guilty by association”, a principle stating that coding and noncoding genes showing a comparable expression pattern are likely to share functionality. [5] This idea can now be easily applied on arbitrary expression matrices using this R package LINC. Enriched biological terms from gene annotation resources like Reactome PA and Gene Ontology (GO) for the cluster (multiple candidate lincRNA genes) can be identified applying getbio() which will call supported functions from the clusterProfiler package. [6 - 8] This analysis will reveal which functions, pathways or compartments are associated with the lincRNA-co-expressed genes. The basic section will show how to predict the biological functions of lincRNAs based on gene expression data. The sections focusing on the advanced options will provide explanations how to control thresholds used in the computations.
1.0 Basic: Input Preparation and Prediction
This section will demonstrate the steps intended by the methods in the LINC package used to predict the biological functions of lincRNAs by co-expression. As an input a gene expression matrix is required with rows corresponding to genes and columns representing samples. The following piece of code shows an expression matrix termed GTEX_CRBL - CRBL stands for cerebellum. (Please note, that the large gene expression matrix GTEX_CRBL from http://www.gtexportal.org is not provided in this Version of the package.)
str(GTEX_CRBL)
num [1:56318, 1:117] 0.0475 10.7799 0.105 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:56318] "ENSG00000223972" "ENSG00000227232" "ENSG00000243485" "ENSG00000237613" ...
..$ : chr [1:117] "GTEX-117XS-3126-SM-5GIDP" "GTEX-1192X-3226-SM-5987D" "GTEX-11DXW-1026-SM-5H11K"
"GTEX-11DXY-3126-SM-5N9BT" ...
In order to apply the methods in the LINC package it is recommended to select for the 1000 up to 10.000 high-variance genes. It is also useful also look for the expression levels of genes. In a second step the biotypes need to be identified, for instance by biomaRt:getBM. These lines of code are an example how to choose high-variance genes and how to get the biotypes.
# STEP 1: select the high-variance genes
var_index <- order(apply(GTEX_CRBL, 1, var), decreasing = T)
GTEX_CRBL_HVAR <- GTEX_CRBL[var_index[1:5000], ]
# STEP 2: get the gene biotype
require(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
biotype <- getBM(attributes=c('gene_biotype',
'ensembl_gene_id'),
filters = 'ensembl_gene_id',
values = rownames(GTEX_CRBL_HVAR),
mart = ensembl)
# STEP 3:
index <- match(rownames(GTEX_CRBL_HVAR), biotype$ensembl_gene_id)
GTEX_CRBL_BIOTYPE <- biotype$gene_biotype[index]
The 5000 selected genes contain 4256 protein-coding genes and 74 lincRNA genes.
table(GTEX_CRBL_BIOTYPE)
3prime_overlapping_ncRNA antisense lincRNA
8 279 74
miRNA misc_RNA Mt_rRNA
3 4 2
Mt_tRNA processed_pseudogene processed_transcript
4 119 37
protein_coding rRNA sense_intronic
4256 1 11
sense_overlapping snoRNA snRNA
13 8 1
TR_C_gene transcribed_processed_pseudogene transcribed_unprocessed_pseudogene
1 8 25
unitary_pseudogene unprocessed_pseudogene
2 18
The following example uses a preprocessed gene expression matrix with 7 lincRNA genes as queries. After separating the queries from the protein-coding genes with linc(), the function clusterlinc() computes the co-expression networks and returns the significantly co-expressed genes associated with the queries. In the third step getbio() derives the enriched biological terms.
# the preprocessed expression matrix with 1000 genes
str(cerebellum)
## num [1:1000, 1:50] 73.3 101.4 328.4 244.2 24.4 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "3337" "3304" "3320" "5730" ...
## ..$ : chr [1:50] "A1" "A2" "A3" "A4" ...
# a TRUE/FALSE vector with TRUE for protein-coding genes
str(pcgenes_crbl)
## logi [1:1000] TRUE TRUE TRUE TRUE TRUE TRUE ...
# STEP 1: Separate the protine-coding genes from the queries (lincRNAs)
crbl_matrix <- linc(cerebellum, codingGenes = pcgenes_crbl)
# STEP 2: Compute the co-expression network with a fixed threshold
crbl_cluster <- clusterlinc(crbl_matrix, pvalCutOff = 0.005)
# STEP 3: Interrogate enriched biological terms for co-expressed genes
crbl_bp <- getbio(crbl_cluster)
# Show the results as a plot!
plotlinc(crbl_bp)