1 Installation

To install and load KnowSeq package in R, it is necessary the previous installation of BiocManager from Bioconductor. The next code shows how this installation can be performed:

KnowSeq is now available also on Docker by running the next command, allowing the use of KnowSeq without a previous installation:

2 Introduction

KnowSeq proposes a novel methodology that comprises the most relevant steps in the Transcriptomic gene expression analysis. KnowSeq expects to serve as an integrative tool that allows to process and extract relevant biomarkers, as well as to assess them through a Machine Learning approaches. Finally, the last objective of KnowSeq is the biological knowledge extraction from the biomarkers (Gene Ontology enrichment, Pathway listing and Visualization and Evidences related to the addressed disease). Although the package allows analyzing all the data manually, the main strenght of KnowSeq is the possibilty of carrying out an automatic and intelligent HTML report that collect all the involved steps in one document. Nowadays, there is no package that only from the information of the samples to align -included in a text file-, automatically performs the download and alignment of all of the samples. Furthermore, KnowSeq is the only package that allows applying both a machine learning and biomarkers enrichment processes just after the biomarkers extraction. It is important to highligh that the pipeline is totally modular and flexible, hence it can be started from whichever of the different steps. This pipeline has been used in our previous publications for processing raw RNA-seq data and to perform the biomarkers extraction along with the machine learning classifier design steps, also for their integration with microarray data [1,2,3,4].

The whole pipeline included in KnowSeq has been designed carefully with the purpose of achieving a great quality and robustness in each of the steps that conform the pipeline. For that, the pipeline has four fundamental processes:

  • Transcriptomic RAW data processing.
  • Biomarkers identification & assessment.
  • DEGs enrichment methodology.
  • Intelligent Automatic Report.

The first process is focused on the Transcriptomic RAW data treatment. This step has the purpose of extracting a set of count files from raw files stored in the repositories supported by our package (NCBI/GEO [5] ArrayExpress [6] and GDC-Portal). The second one englobes the Differential Expressed Genes (DEGs) identification and extraction by using a novel parameter (Specifically for multiclass studies) defined as Coverage [3], and the assessment of those DEGs by applying advanced machine learning techniques (feature selection process and supervised classification). Once the DEGs are assessed, the next step is the DEGs enrichment methodology which allows retrieving biological information from the DEGs. In this process, relevant information (such as related diseases, biological processes associated and pathways) about the DEGs is retrieved by using very well-known tools and databases. The three types of enrichment are the Gene Ontology (GO) study, the pathways visualization taking into account the gene expression, and the Evidences related to the addressed disease from the final set of DEGs. Finally, all of this information can be displayed on an automatic and intelligent HTML report that contains the results of the complete study for the faced disease or diseases.

3 Transcriptomic RAW data processing

3.1 Alignment preparation

In order to avoid version incompatibilities with hisat2 aligner and the installation of the required tools, pre-compiled versions will be used to run the R functions. Consequently, all the tools were compressed and stored in an external server to be downloaded whenever it is required (http://iwbbio.ugr.es/utils/unixUtils.tar.gz). If the tools are directly downloaded from the link, the compressed files must be decompressed in the current project folder in R or RStudio. The name of the resultant folder must be “utils”. Nevertheless, this file can be downloaded automatically by just calling the function rawAlignment, in case the folder utils is not detected in the project folder. This is all needed to run hisat2 through the function rawAlignment. It is not possible to run the alignment without the utils folder. It must be mentioned too that the different files included in the compressed .tar.gz are not only the aligner but also functions needed in the raw alignment process. The tools included are the following:

  • Bowtie2 [7].
  • Hisat2 [8].
  • Htseq-count [9].
  • Samtools [12].
  • Sratoolkit [13].
  • GDC-client.

3.2 Launching Raw Alignment step

The rawAlignment function allows running hisat2 aligner. The function takes as single input a CSV from GEO or ArrayExpress loaded in R. There is the possibility to process data from GDC-portal, but a previous authorization (token file) from this platform is required. Furthermore, there is a set of logical parameters to edit the default pipeline followed for the function. With the parameters the user can select if the BAM/SAM/Count files are created. The user can choose if wants to download the reference genome, the GTF, and which version. Even if the user has custom FASTA and GTF files, this can be specified by setting the parameter referenceGenome to “custom” and using the parameters customFA and customGTF to indicates the paths to the custom files. Other functionality is the possibility to process BAM files from the GDC Portal database by setting to TRUE the parameter fromGDC. Then the function will download the specific genome reference of GDC and process the BAM files to Count files. Furthermore, if the user has access to the controlled data, with the token and the manifest acquired from GDC Portal web platform, the samples can be downloaded automatically. An example to run the function with hisat2 aligner is showed below:

RawAlignment function creates a folder structure in the current project folder which will store all the downloaded and created files. The main folder of this structure is the folder ReferenceFiles but inside of it there are more folders that allows storing the different files used by the process in an organized way.

Another important requirement to take into account is the format of the csv file used to launch the function. It could be from three repositories, two publics (NCBI/GEO and ArrayExpress) and one controlled (GDC Portal). Each of these repositories has its own format in the csv file that contains the information to download and process the desired samples. The necessary format for each repository is explained below.

3.2.1 NCBI/GEO CSV format

Series belonging to RNA-seq have a SRA identifier. If this identifier is clicked, a list with the samples that conform this series is showed. Then, the desired samples of the series can be checked and the CSV is automatically generated by clicking the button shown in the image below:

The previous selection generates a csv files that contains a number of columns with information about the samples. However, running the rawAlignment function only needs the three columns shown below in the csv (although the rest of the columns can be kept):

Run download_path LibraryLayout
SRR2753177 sra-download.ncbi.nlm.nih.gov/traces/sra21/SRR/0026… SINGLE
SRR2753178 sra-download.ncbi.nlm.nih.gov/traces/sra21/SRR/0026… SINGLE
SRR2753179 sra-download.ncbi.nlm.nih.gov/traces/sra21/SRR/0026… SINGLE

There is another way to obtain this csv automatically by calling the function downloadPublicSeries with the NCBI/GEO GSE ID of the wanted series, but this option does not let the user to choose the wanted samples and downloads all the samples of each selected series.

3.2.2 ArrayExpress CSV format

The process for ArrayExpress is the very similar to that for NCBI/GEO. It changes the way to download the csv and the name of the columns in the file. To download the csv there is a file finished as .sdrf.txt inside the RNA-seq series in ArrayExpress, as can be seen in the example below:

As with the NCBI/GEO csv, the csv of ArrayExpress requires only three columns as is shown below:

Comment[ENA_RUN] Comment[FASTQ_URI] Comment[LIBRARY_LAYOUT]
ERR1654640 ftp.sra.ebi.ac.uk/vol1/fastq/ERR165/000/ERR16… PAIRED
ERR1654640 ftp.sra.ebi.ac.uk/vol1/fastq/ERR165/000/ERR16… PAIRED

There is another way to achieve this csv automatically by calling the function downloadPublicSeries with the ArrayExpress MTAB ID of the wanted series, but this option does not let the user to choose the wanted samples, and therefore and downloads all the samples of each selected series.

3.2.3 GDC Portal CSV format

GDC portal has the BAM files access restricted or controlled for the user who has access to them. However, the count files are open and can be used directly in this package as input of the function countsToMatrix. If there exist the possibility to download the controlled BAM files, the tsv file that this package uses to convert them into count files is the tsv file generated when the button Sample Sheet is clicked in the cart:

As in the other two repositories, there are a lot of columns inside the tsv files but this package only needs two of them. Furthermore, if the BAM download is carried out by the gdc-client or the web browser, the BAM has to be moved to the path ReferenceFiles/Samples/RNAseq/BAMFiles/Sample.ID/File.Name/ where Sample.ID and File.Name are the columns with the samples information in the tsv file. This folder is created automatically in the current project folder when the rawAlignment function is called, but it can be created manually. However, GDC portal has public access to count files that can be used in a posterior step of the KnowSeq pipeline to merge and analyze them.

3.2.4 Downloading automatically GDC Portal controlled files (GDC permission required)

It exists the possibility to download automatically the raw data from GDC portal by using the rawAlignment function. In order to carry this out, the function needs the parameters downloadSamples and fromGDC set to TRUE, the path to the token in order to obtain the authentication to download the controlled data and the path to the manifest that contains the information to download the samples. This step needs the permission of GDC portal to the controlled data.

3.3 Preparing the count files

From now on, the data that will be used for the documentation are real count files, but with a limited number of genes (around 1000). Furthermore, to reduce the computational cost of this example, only 5 samples from each of the two selected series will be taken into account. Showed in the code snippet below, two RNA-seq series from NCBI/GEO are downloaded automatically and the existing count files prepared to be merged in one matrix with the purpose of preparing the data for further steps:

However, the user can run a complete example by executing the following code:

3.4 Processing count files

After the raw alignment step, a list of count files of the samples is available at ReferenceFiles/Samples/RNAseq/CountFiles. The next step in the pipeline implemented in this package is the processing of those count files in order to obtain a gene expression matrix by merging all of them.

3.4.1 Merging all count files

After the alignment, as many count files as samples in the CSV used for the alignment have been created. In order to prepare the data for the DEGs analysis, it is important to merge all these files in one matrix that contains the genes Ensembl ID (or other IDs) in the rows and the name of the samples in the columns. To carry this out, the function countsToMatrix is available. This function reads all count files and joints them in one matrix by using edgeR package [15]. To call the function it is only necessary a CSV with the information about the count files paths. The required CSV has to have the following format:

Run Path Class
SRR2753159 ~/ReferenceFile/Count/SRR2753159/ Tumor
SRR2753162 ~/ReferenceFile/Count/SRR2753162/ Tumor
SRR2827426 ~/ReferenceFile/Count/SRR2827426/ Healthy
SRR2827427 ~/ReferenceFile/Count/SRR2827427/ Healthy

The column Run is the name of the sample without .count, the column Path is the Path to the count file and the Class column is the labels of the samples. Furthermore, an example of this function is shown below:

## 
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR2753159/SRR2753159.count
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR2753160/SRR2753160.count
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR2753161/SRR2753161.count
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR2753162/SRR2753162.count
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR2753163/SRR2753163.count
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR3541296/SRR3541296.count
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR3541297/SRR3541297.count
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR3541298/SRR3541298.count
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR3541299/SRR3541299.count
## /tmp/RtmpHTlihE/Rinst274c7b8cef51/KnowSeq/extdata/countFiles/SRR3541300/SRR3541300.count
## Merging 10 counts files...

The function returns a list that contains the matrix with the merged counts and the labels of the samples. It is very important to store the labels in a new variable because as it will be required in several functions of KnowSeq.

3.4.2 Getting the annotation of the genes

This step is only required if the user wants to get the gene names and the annotation is retrieved with the information given by the ensembl webpage [16]. Normally, the counts matrix has the Ensembl Ids as gene identifier, but with this step, the Ensembl Ids are change by the gene names. However, the user can decide to keep its own annotation or the Ensembl Ids. For example, to achieve the gene names the function needs the current Ensembl Ids, and the reference Genome used would be the number 38. If the user wants a different annotation than the human annotation, the parameter notHSapiens has to be set to TRUE and the desired specie dataset from ensembl indicated in the parameter notHumandataset (i.e. “mm129s1svimj_gene_ensembl”). An example can be seen below:

## Getting annotation of the Homo Sapiens...
## Using reference genome 38.
## Downloading annotation mm129s1svimj_gene_ensembl...
## 
## Connection error, trying again...
## 
## Connection error, trying again...

3.4.3 Converting to gene expression matrix

Finally, once both the countsMatrix and the annotation are ready, it is time to convert those counts into gene expression values. For that, the function calculateGeneExpressionValues uses the cqn package to calculates the equivalent gene expression [17]. This function performs a conversion of counts into gene expression values, and changes the Ensembl Ids by the gene names if the parameter geneNames is equal to TRUE. An example of the use of this function is showed below:

## Calculating gene expression values...
## RQ fit ..........
## SQN .

At this time of the pipeline, a function that plots the expression data and allows verifying if the data is well normalized can be used. This function has the purpose of joining all the important graphical representation of the pipeline in the same function and is called dataPlot. It is very easy to use because just by changing the parameter method many different representations can be achieved. In this case, in order to see the expression boxplot of each sample, the function has to be called with the parameter mode equal to “boxplot”. The labels are necessary to colour the different samples depending on the class of the samples. These colours can be selected by the user, by introducing in the parameter colours a vector with the name of the desired colours. The function also allows exporting the plots as PNG and PDF files.

## Creating PNG...
## Creating PDF...

4 Biomarkers identification & assessment

4.1 Batch effect removal

A crucial step in this pipeline is the batch effect treatment. It is widely known that this is a crucial step in the omics data processing due to the intrinsic deviations that the data can present due to its origin, sequencing design, etc… Besides, when working with public data it is very difficult to know if exists a real batch effect among the selected datasets. This package allows removing batch effect if the batch groups are known by calling the function batchEffectRemoval, that makes use of sva package [18], with the parameter mode equal to “combat” [19]. This step allows obtaining an expression matrix with the batch effect treated by combat method. An example to do this is below:

## Correcting batch effect by using combat method...
## Using the 'mean only' version of ComBat
## Found2batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data

There is another method in the function that removes the batch effect that uses surrogate variable analysis or sva. The only requirement to use it is to set the parameter method equal to “sva”. This methodt does not return a matrix with the batch effect corrected, instead of this, the function returns a model that has to be used as single input parameter of the function DEGsExtraction.

## Calculating sva model to batch effect correction...
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5

4.2 Differential Expressed Genes extraction and visualization

There is a long way between the raw data and the DEGs extraction, for that in this step the samples have to have had a strong pre-processing step applied. At this point of the pipeline the DEGs existing among two or more classes will be extracted by using the novel parameter coverage (cov) along with limma R-bioc package [20]. The parameter cov represents the number of different pathologies that a certain gen is able to discern. By default, the parameter is set to 1, so all genes that has the capability to discern among the comparison of two classes would be selected as DEGs. To understand better this parameter, our multiclass study applied to different leukemia sub-types introduces it, and it’s publicly available [3].

The function DEGsExtraction receives an expression matrix, the labels of the samples and the restriction imposed for considering a gene as differential expressed gene. The function returns a list containing the table with statistical values of each DEGs and the expression matrix of the DEGs instead all of the genes. The call to the function is listed below:

Furthermore, if in the batch effect step the method used was sva, this function has two parameters to indicate that the model of limma would take into account the sva model calculated previously for the expression matrix. To achieve this, svaCorrection parameter has to be set to TRUE and the sva model has to be passed in the parameter svaMod. An example of this is the following:

## Two classes detected, applying limma biclass

DEGs are genes that have a truly different expression among the studied classes, for that it is important to try to see graphically if those DEGs comply with this requirement. In order to provide a tool to perform this task, the function dataPlot encapsulate a set of graphs that allows plotting in different ways the expression of the DEGs.

dataPlot function also allows representing an ordered boxplot that internally orders the samples by class and plots a boxplot for each samples and for the first top 12 DEGs in this example. With this plot, the difference at gene expression level between the classes can be seen graphically. The code to reproduce this plot is the following:

In the previous boxplot the expression of a set of DEGs for each sample its showed, however it is interesting to see the differentiation at gene expression level for each of the top 12 genes used before separately. It is recommended to use this function with a low number of genes, because with a larger number the plot it is difficult to distinguish the information provided and R would not have enough memory to calculate the plot. For that, the function dataPlot with the mode genesBoxplot allows to do that by executing the next code:

Finally, it is possible to plot one of the most widespread visualization methods in the literature, the heatmap. By setting the parameter method to heatmap, the function calculates the heatmap for the given samples and classes. The code to do this is the same than for the previous boxplot but changing the method parameter:

4.3 Performing the machine learning processing: classifier design and assessment and gene selection

Normally, in the literature, the last step in the pipeline for differential gene expression analysis is the DEGs extraction step. However, in this package a novel machine learning step is implemented with the purpose of giving to the user an automatic tool to assess the DEGs, and evaluate their robustness in the discernment among the studied pathologies. This library has three possible classification methodologies to take into account. These options are k-NN [21], SVM [22] and Random Forest [23], three of the most popular classifiers in the literature. Furthermore, it includes two different working procedures for each of them. The first one implements a cross-validation process, in order to assess the expected accuracy with different models and samples the DEGs with a specific number of folds. These functions return a list with 4 objects that contain the confusion matrices, the accuracy, the sensitivity and the specificity.

The second one is to assess a specific test dataset by using a classifier trained using the training dataset separately. Moreover, the function featureSelection allows performing a feature selection process by using, mRMR [24], Random Forest (as feature selector instead of classifier) or Disease Association based algorithms with the purpose of finding the best DEGs order to assess the data. Da-FS is a new novel method designed in KnowSeq with the purpose of giving to the expert a biological based feature selection method. This method makes use of targetValidation webplatform to acquire an association score for each DEGs with the required genetic disease, breast cancer in the example. This score takes values between 1 and 0, meaning 1 a total association and 0 no association. Therefore, the DEGs are sorted by this score, achieving a ranking in which in the first positions those DEGs with more biological relation to breast cancer are placed.

Moreover, targetValidation webplatform allows acquiring evidences that tie a gene with a certain disease. DA-RED-FS is a novel iterative method based on DA-FS that use these evidences to calculate the redundance between genes based on biological information. This redundances take values between 1 and 0. For example, if gene A has a redundance of 1 with gene B, means that all found evidences for gene A and a certain disease are also found in gene B and the disease. Likewise, if this redundance is 0, means that there are not any gene A evidences in gene B evidences. DA-RED-FS starts with an empty set of selected genes, \(S_G\), and a set of possible genes \(G\). In the first step, the gene with the highest DA score is selected. In the following steps, genes that verify the following equation are added to selected genes set.

\[ max_{g \in G - S_G} DA(g) - \frac{\sum_{g_i \in S_g} RED(g,g_i)}{|S_G|} \times DA(g) \]

Where \(RED(g,g_i)\) is the redundance between gene \(g\) and gen \(g_i\) and \(DA(g)\) is the DA score of gene \(g\). The algorithms ends when a certain number of genes are selected, this number can be fixed by the maxGenes parameter.

To invoke these functions, it is necessary an expression matrix with the samples in the rows and the genes in the columns and the labels of the samples, the genes that will be assessed and the number of fold in the case of the cross-validation function. In the case of the test functions, it is necessary the matrix and the labels for both the training and the test datasets:

## Calculating the ranking of the most relevant genes by using mRMR algorithm...
## mRMR ranking: BCAS1 LAMC2 ATP2B4 RPS17P5 GALC SCIN ROS1 SH2D2A SNAI2 AGPS GNA15 EHD3 MAGEC2 GYG2 FSTL4 BIRC3 ARHGAP44 ADAM22 CCN5 CALCR TNC DNASE1L1 DLX3 CCDC85A FUT8 EHD2 HOXC8 MATK PRSS21 ABCB4 YBX2 VCAN COL9A2 NFIX SLC7A2 APPBP2 LY75 CCT8L1P ARSD BARX2 CNTN1 NLRP2 APBA2 MPPED2 TYMP FUZ TENM1 NEXMIF EPHA3 FAS DLEC1 CYP26B1 RAI14 CPS1 AGPAT4 RCN1 AC005747.1 PHF21B DGKA BTN3A1 CP SEZ6 ZIC2 PLXND1 ENTPD2 MSMO1 FYN MED24 RAB27B CCDC88C ITGA3 PLEKHH1 RIPOR1 CD99 PER3 SLC11A1 ISOC1 CD44 NPC1L1 SLC12A2 PARP3 ARX SELE ICA1 CACNA2D2 SLC4A1 PTGER3 HSD17B6 PKP2 DCBLD2 MGST1 MARCO KLF6 SP100 OAT MAP4K3 PHLDB1 CASP8 DNAH5 THUMPD1 CYTH3 ELOVL5 CUL7 EXTL3 VCL DSG2 HHAT LYPLA2 CAPG SOX30 CTPS2 STAG3 SPA17 MPND SYT13 SPATA20 CEACAM7 CLDN11 RIPOR3 TRAF1 COPZ2 CYFIP2 ZMYND12 IYD SLC9A3 GUCA2B CTNNA2 ITGAL CYP3A43 PLEKHG6 COL23A1 ABCC8 MKS1 NDC1 ERBB3 PLEKHB1 TNFRSF1B SNX29 DAPK2 AASS SLC6A16 SYT7 CXorf56 MLXIPL TMEM98 MTMR1 ALOX5 PRSS3 TMCC3 TRAPPC6A USP2 LTF PLAUR IL17RB KCNH2 ETV1 ARHGAP31 REXO5 TIMP2 ST3GAL6 SARM1 MTMR7 TMSB10 ZNRD1 ME1 PRKCQ GPC1 SLC7A14 MARK4 COL17A1 IL32 KCNQ1 CDH3 TSPOAP1 VIM CD84 DKK3 TIE1 CHDH USH2A NGEF PRDM1 ISL1 DPEP1 ADRB1 WAS AHRR TBXAS1 HIPK2 WWTR1 DNAH9 FAM214A FAR2 EML1 LAMC3 IGF1 MYLIP UBR7 MYO16 KIF26A TKTL1 DBF4 OSBPL5 SLC18A1 CA11 MVP SLC4A8 PPP1R3F FAM214B DBNDD1 CDH1 DCN BAIAP3 E2F2 PRSS8 PRSS22 EPN3 GRAMD1B SERPINB3 FGFR2 KITLG LAMA3 BRCA1 MCM10 ABCC2 ATP2C2 ARHGAP6 SNAP91
## Calculating the ranking of the most relevant genes by using Random Forest algorithm...
## Random Forest ranking: FAM214A MATK TIMP2 EML1 ZIC2 ME1 SP100 BIRC3 SOX30 MPND EXTL3 DLX3 ABCC8 DNAH5 CXorf56 EPN3 ICA1 PKP2 BTN3A1 EHD3 PPP1R3F MAP4K3 HIPK2 SNX29 DNAH9 ATP2C2 DAPK2 CPS1 MKS1 MARCO GALC TNC NFIX CCT8L1P E2F2 ZMYND12 RAB27B GUCA2B WWTR1 KIF26A BRCA1 FUT8 ST3GAL6 FYN SPATA20 OSBPL5 IYD PARP3 SLC12A2 CA11 GYG2 APBA2 RCN1 CEACAM7 SEZ6 SLC7A14 LAMC2 SYT13 OAT THUMPD1 ETV1 ATP2B4 CCDC85A ARSD EPHA3 CYP26B1 PHF21B CYFIP2 PRSS22 ZNRD1 ENTPD2 ITGA3 ITGAL MYO16 MGST1 CAPG STAG3 TIE1 APPBP2 NDC1 ABCC2 CCN5 COL9A2 FAS DLEC1 TMSB10 PER3 COL23A1 DCBLD2 PHLDB1 ISL1 MARK4 SPA17 EHD2 LAMA3 DPEP1 SERPINB3 CASP8 TNFRSF1B NLRP2 MPPED2 PLEKHG6 PLEKHH1 TBXAS1 KITLG RPS17P5 VIM GNA15 MAGEC2 FSTL4 COL17A1 ADAM22 CALCR DKK3 ERBB3 TENM1 CHDH RAI14 SLC9A3 NGEF MED24 AHRR CD99 COPZ2 CD44 PTGER3 TKTL1 MVP VCL FAM214B HHAT SCIN ROS1 SH2D2A PRSS3 CDH1 RIPOR3 ABCB4 SLC7A2 TYMP PRKCQ TRAF1 MSMO1 RIPOR1 SLC11A1 ISOC1 TRAPPC6A FAR2 NPC1L1 UBR7 SLC6A16 TMEM98 IL17RB ALOX5 SNAP91 LAMC3 SLC4A8 CTPS2 SNAI2 AGPS NEXMIF DBNDD1 VCAN CD84 BCAS1 ARHGAP44 DNASE1L1 PLEKHB1 HOXC8 PRSS21 YBX2 LY75 BARX2 CNTN1 DCN SARM1 FUZ PLAUR AGPAT4 AC005747.1 DGKA ARHGAP31 PRSS8 CP PLXND1 GRAMD1B CCDC88C SYT7 AASS USH2A MTMR1 KCNQ1 IL32 IGF1 ARX MYLIP SELE FGFR2 USP2 CACNA2D2 SLC4A1 CLDN11 HSD17B6 KCNH2 CTNNA2 KLF6 CYP3A43 WAS DBF4 TMCC3 REXO5 CDH3 TSPOAP1 ARHGAP6 CYTH3 ELOVL5 CUL7 MCM10 ADRB1 GPC1 MTMR7 SLC18A1 DSG2 MLXIPL LYPLA2 LTF BAIAP3 PRDM1
## Calculating ranking of biological relevant genes by using DA implementation...
## Disease Association ranking: ROS1 ETV1 BIRC3 CDH1 ERBB3 FAS CCDC88C WWTR1 PARP3 FGFR2 BRCA1 KLF6 WAS CASP8 PRDM1 TNC EPHA3 FYN KITLG ADRB1 VCL COL17A1 IGF1 PRKCQ SNX29 CTNNA2 LAMA3 VIM LAMC2 CD44 FAR2 ITGA3 USP2 DSG2 ATP2B4 BARX2 COL9A2 HIPK2 COL23A1 ITGAL LAMC3 CACNA2D2 DGKA CYFIP2 CLDN11 MTMR1 NDC1 DAPK2 KCNQ1 KCNH2 CNTN1 MCM10 PRSS8 SH2D2A PKP2 SP100 ST3GAL6 MTMR7 STAG3 VCAN MATK SELE CD99 ALOX5 CHDH CD84 EHD2 KIF26A NGEF EHD3 AGPAT4 OSBPL5 AHRR HSD17B6 MED24 ATP2C2 PRSS3 LTF NFIX SLC11A1 PTGER3 TYMP DKK3 BTN3A1 TIMP2 CDH3 PLAUR TMSB10 RIPOR3 CCN5 DCN CUL7 ABCB4 MAGEC2 FUT8 HOXC8 IL32 SNAI2 ISL1 PLEKHG6 IL17RB ADAM22 TNFRSF1B CP CALCR ZIC2 PER3 TKTL1 THUMPD1 LY75 RAI14 ABCC2 MPPED2 SPA17 SERPINB3 TRAF1 E2F2 MVP MYLIP MARK4 MLXIPL SLC4A1 SOX30 DLEC1 TBXAS1 GRAMD1B PHLDB1 BCAS1 UBR7 EXTL3 ABCC8 GPC1 ISOC1 HHAT CA11 EPN3 USH2A DPEP1 PLXND1 RAB27B GNA15 DBF4 MAP4K3 YBX2 SLC7A2 CYTH3 CAPG TIE1 RCN1 CYP3A43 AGPS MGST1 ARHGAP6 ELOVL5 CCDC85A APBA2 SARM1 SLC9A3 NPC1L1 DNAH5 DCBLD2 ICA1 DBNDD1 RPS17P5 GALC SCIN GYG2 FSTL4 ARHGAP44 DNASE1L1 DLX3 PLEKHB1 PRSS21 APPBP2 CCT8L1P ARSD CXorf56 NLRP2 FUZ TENM1 NEXMIF CYP26B1 CPS1 AC005747.1 ZMYND12 SYT13 SPATA20 CEACAM7 PHF21B ARHGAP31 SEZ6 IYD PRSS22 ME1 ZNRD1 ENTPD2 MSMO1 PLEKHH1 RIPOR1 SYT7 GUCA2B COPZ2 AASS TRAPPC6A FAM214A SLC12A2 EML1 ARX MKS1 MYO16 SLC6A16 TMEM98 MARCO TMCC3 REXO5 OAT TSPOAP1 PPP1R3F SNAP91 SLC18A1 FAM214B SLC7A14 SLC4A8 LYPLA2 DNAH9 CTPS2 BAIAP3 MPND
## Tuning the optimal K...
## Loading required package: lattice
## Loading required package: ggplot2
## Optimal K: 7 
## Training fold 1...
## Training fold 2...
## Training fold 3...
## Training fold 4...
## Training fold 5...
## Classification done successfully!
## Tuning the optimal C and G...
## Optimal cost: 0.25 
## Optimal gamma: 0.5 
## Training fold 1...
## Training fold 2...
## Training fold 3...
## Training fold 4...
## Training fold 5...
## Classification done successfully!
## Tuning the optimal mtry and ntrees...
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...

## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...
## Training fold 1...
## Training fold 2...
## Training fold 3...
## Training fold 4...
## Training fold 5...
## Classification done successfully!

It is important to show graphically the results of the classifiers and for that purpose, the function dataPlot implements some methods. Concretely, to plot the accuracy, the sensitivity or the specificity reached by the classifiers, the function dataPlot has to be run with the parameter method equal to classResults. This method generated as many random colours as folds or simulations in the rows of the matrix passed to the function but, through the parameter colours a vector of desired colours can be specified. For the legend, the function uses the rownames of the input matrix but these names can be changed with the parameter legend. An example of this method is showed below:

Furthermore, the function dataPlot counts with another similar mode to the previous but this time to represents confusion matrices. This mode is called confusionMatrix and allows creating graphically a confusion matrix with the most important statistical measures. The following code allows doing this:

## Testing with 1 variables...
## Testing with 2 variables...
## Testing with 3 variables...
## Testing with 4 variables...
## Testing with 5 variables...
## Testing with 6 variables...
## Testing with 7 variables...
## Testing with 8 variables...
## Testing with 9 variables...
## Testing with 10 variables...
## Classification done successfully!
## Testing with 1 variables...
## Testing with 2 variables...
## Testing with 3 variables...
## Testing with 4 variables...
## Testing with 5 variables...
## Testing with 6 variables...
## Testing with 7 variables...
## Testing with 8 variables...
## Testing with 9 variables...
## Testing with 10 variables...
## Classification done successfully!
## Testing with 1 variables...
## Testing with 2 variables...
## Testing with 3 variables...
## Testing with 4 variables...
## Testing with 5 variables...
## Testing with 6 variables...
## Testing with 7 variables...
## Testing with 8 variables...
## Testing with 9 variables...
## Testing with 10 variables...
## Classification done successfully!

5 DEGs enrichment methodology

The main goal of the previous pipeline is the extraction of biological relevant information from the DEGs. For that, this package provides a set of tools that allows doing it. The last step of the pipeline englobes all the available tools in KnowSeq for DEGs enrichment, where three different approaches can be taken. The gene ontology information, the pathway visualization and the relationship between the DEGs and diseases related to the studied pathologies.

5.1 Gene Ontology

Gene ontology (GO) provides information about the biological functions of the genes. In order to complete this pipeline, it is important to know if the DEGs have functions related with the studied pathologies. In this sense, this package brings the possibility to know the GOs from the three different ontologies (BP, MF and CC) by using the function geneOntologyEnrichment that internally uses information from DAVID web-platform [20]. The function returns a list that contains a matrix for each ontology and a matrix with the GOs of the three ontologies together. Moreover, the matrices have different statistical measures and the description of the functionality of each GO.

## Getting gene symbols...Getting annotation of the Homo Sapiens...
## Using reference genome 38.
## Downloading annotation hsapiens_gene_ensembl...
## Retrieving Gene Ontology terms related to the list of DEGs...

For example, in this example, the top 10 GOs from the BP ontology for the extracted DEGs are shown in the following image.

5.2 Pathways visualization

Another important step in the enrichment methodology in this pipeline is the pathway visualization. The function uses the DEGs to show graphically the expression of the samples in the pathways in which those genes appear. For that, the function makes use of a DEGsMatrix with the expression of the DEGs and the annotation of those DEGs in which appear the pathway or pathways of each DEGs. Internally, the function DEGsPathwayVisualization uses pathview package [26] to retrieve and colour the pathways, but a maximum number of 24 samples can be used, for that, if the input matrix has more than 24 samples, only the first 24 will be used by the operation. Furthermore, the function needs the expression matrix with all the genes in order to use them to colour the rest of the elements in the pathways. It is important to retrieve the annotation from Ensembl for both the DEGsMatrix and the expressionMatrix because the entrezgene IDs and the KEGG enzyme of each gene are necessary.

## Getting annotation of the Homo Sapiens...
## Using reference genome 38.
## Downloading annotation hsapiens_gene_ensembl...
## Getting annotation of the Homo Sapiens...
## Using reference genome 38.
## Downloading annotation hsapiens_gene_ensembl...
## Retrieving DEGs associated pathways...
## A total of 3 pathways have been retrieved!
## hsa00600
##  hsa01100
##  hsa04142
## This pathway cannot be processed successfully...This pathway cannot be processed successfully...This pathway cannot be processed successfully...

6 Gene Expression Intelligent Report

Most of these steps can be displayed using a single function, knowseqReport, that starting from an expression matrix (or counts matrix) and the labels of the samples, will follow part of the pipeline explained above and will display the result in a report in html format.

For the obtention of this report a DEGs extraction using limma package will be performed. Then, a feature selection process will be carried out, where the used algorithm can be set by the featureSelectionMode parameter, along with the visualizations steps that have been described previously.

Following, the machine learning process starts, where the classification algorithms and the classification metrics to be displayed can be chossen by the user by clasifAlgs and metrics parameters.

Finally DEGs enrichment is obtained, showing Gene Ontologies and found evidences for related diseases for each gene. It is possible to obtain evidences for a certain disease, specifying it by the disease parameter.

7 Session Info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] splines   stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] caret_6.0-86          ggplot2_3.3.2         lattice_0.20-41      
##  [4] KnowSeq_1.2.2         cqn_1.34.0            quantreg_5.73        
##  [7] SparseM_1.78          preprocessCore_1.50.0 nor1mix_1.3-0        
## [10] mclust_5.4.6         
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.10     Hmisc_4.4-1          plyr_1.8.6          
##   [4] BiocParallel_1.22.0  pathview_1.28.1      sva_3.36.0          
##   [7] digest_0.6.25        foreach_1.5.0        htmltools_0.5.0     
##  [10] magrittr_1.5         checkmate_2.0.0      memoise_1.1.0       
##  [13] cluster_2.1.0        limma_3.44.3         recipes_0.1.13      
##  [16] Biostrings_2.56.0    annotate_1.66.0      gower_0.2.2         
##  [19] matrixStats_0.57.0   R.utils_2.10.1       jpeg_0.1-8.1        
##  [22] colorspace_1.4-1     blob_1.2.1           xfun_0.18           
##  [25] dplyr_1.0.2          crayon_1.3.4         RCurl_1.98-1.2      
##  [28] jsonlite_1.7.1       graph_1.66.0         genefilter_1.70.0   
##  [31] survival_3.2-7       iterators_1.0.12     glue_1.4.2          
##  [34] gtable_0.3.0         ipred_0.9-9          zlibbioc_1.34.0     
##  [37] XVector_0.28.0       MatrixModels_0.4-1   kernlab_0.9-29      
##  [40] Rgraphviz_2.32.0     BiocGenerics_0.34.0  scales_1.1.1        
##  [43] DBI_1.1.0            edgeR_3.30.3         Rcpp_1.0.5          
##  [46] xtable_1.8-4         htmlTable_2.1.0      foreign_0.8-80      
##  [49] bit_4.0.4            Formula_1.2-3        stats4_4.0.2        
##  [52] lava_1.6.8           prodlim_2019.11.13   htmlwidgets_1.5.2   
##  [55] httr_1.4.2           gplots_3.1.0         RColorBrewer_1.1-2  
##  [58] ellipsis_0.3.1       pkgconfig_2.0.3      XML_3.99-0.5        
##  [61] R.methodsS3_1.8.1    farver_2.0.3         nnet_7.3-14         
##  [64] locfit_1.5-9.4       tidyselect_1.1.0     labeling_0.3        
##  [67] rlang_0.4.8          reshape2_1.4.4       AnnotationDbi_1.50.3
##  [70] munsell_0.5.0        tools_4.0.2          generics_0.0.2      
##  [73] RSQLite_2.2.1        evaluate_0.14        stringr_1.4.0       
##  [76] yaml_2.2.1           ModelMetrics_1.2.2.2 org.Hs.eg.db_3.11.4 
##  [79] knitr_1.30           bit64_4.0.5          caTools_1.18.0      
##  [82] purrr_0.3.4          randomForest_4.6-14  KEGGREST_1.28.0     
##  [85] nlme_3.1-149         praznik_8.0.0        R.oo_1.24.0         
##  [88] KEGGgraph_1.48.0     compiler_4.0.2       rstudioapi_0.11     
##  [91] curl_4.3             png_0.1-7            e1071_1.7-3         
##  [94] tibble_3.0.3         stringi_1.5.3        Matrix_1.2-18       
##  [97] multtest_2.44.0      vctrs_0.3.4          pillar_1.4.6        
## [100] lifecycle_0.2.0      data.table_1.13.0    bitops_1.0-6        
## [103] conquer_1.0.2        R6_2.4.1             latticeExtra_0.6-29 
## [106] KernSmooth_2.23-17   gridExtra_2.3        IRanges_2.22.2      
## [109] codetools_0.2-16     MASS_7.3-53          gtools_3.8.2        
## [112] withr_2.3.0          S4Vectors_0.26.1     rlist_0.4.6.1       
## [115] mgcv_1.8-33          parallel_4.0.2       grid_4.0.2          
## [118] rpart_4.1-15         timeDate_3043.102    class_7.3-17        
## [121] rmarkdown_2.4        pROC_1.16.2          Biobase_2.48.0      
## [124] lubridate_1.7.9      base64enc_0.1-3

8 References

  1. Castillo, D., Gálvez, J. M., Herrera, L. J., San Román, B., Rojas, F., & Rojas, I. (2017). Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling. BMC bioinformatics, 18(1), 506.

  2. Gálvez, J. M., Castillo, D., Herrera, L. J., San Roman, B., Valenzuela, O., Ortuno, F. M., & Rojas, I. (2018). Multiclass classification for skin cancer profiling based on the integration of heterogeneous gene expression series. PloS one, 13(5), e0196836.

  3. Castillo, D., Galvez, J. M., Herrera, L. J., Rojas, F., Valenzuela, O., Caba, O., … & Rojas, I. (2019). Leukemia multiclass assessment and classification from Microarray and RNA-seq technologies integration at gene expression level. PloS one, 14(2), e0212127.

  4. Gálvez, J. M., Castillo, D., Herrera, L. J., Valenzuela, O., Caba, O., Prados, J. C., … & Rojas, I. (2019). Towards Improving Skin Cancer Diagnosis by Integrating Microarray and RNA-seq Datasets. IEEE Journal of Biomedical and Health Informatics.

  5. Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., … & Yefanov, A. (2012). NCBI GEO: archive for functional genomics data sets—update. Nucleic acids research, 41(D1), D991-D995.

  6. Kolesnikov, N., Hastings, E., Keays, M., Melnichuk, O., Tang, Y. A., Williams, E., … & Megy, K. (2014). ArrayExpress update—simplifying data submissions. Nucleic acids research, 43(D1), D1113-D1116.

  7. Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357.

  8. Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature methods, 12(4), 357.

  9. Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), 166-169.

  10. Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature biotechnology, 34(5), 525.

  11. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14(4), 417.

  12. Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.

  13. Sherry, S., & Xiao, C. (2012, January). Ncbi sra toolkit technology for next generation sequence data. In Plant and Animal Genome XX Conference (January 14-18, 2012). Plant and Animal Genome.

  14. Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4.

  15. Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

  16. Cunningham, F., Achuthan, P., Akanni, W., Allen, J., … , & Huber, W. (2018). Ensembl 2019. Nucleic Acids Research, 47(D1), D745–D751.

  17. Hansen, K. D., Irizarry, R. A., & Wu, Z. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics, 13(2), 204-216.

  18. Leek JT, Johnson WE, Parker HS, Fertig EJ, Jaffe AE, Storey JD, Zhang Y, Torres LC (2019). sva: Surrogate Variable Analysis. R package version 3.34.0.

  19. Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research, 43(7), e47-e47.

  20. X Jiao, BT Sherman, R Stephens, MW Baseler, HC Lane, RA Lempicki. DAVID-WS: a stateful web service to facilitate gene/protein list analysis. Bioinformatics (2012) 28 (13): 1805-1806. doi: 10.1093/bioinformatics/bts251

  21. Luo, W., & Brouwer, C. (2013). Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics, 29(14), 1830-1831.

  22. Goh, W. W. B., Wang, W., & Wong, L. (2017). Why batch effects matter in omics data, and how to avoid them. Trends in biotechnology, 35(6), 498-507.

  23. Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02), 185-205.

  24. Noble, W. S. (2006). What is a support vector machine?. Nature biotechnology, 24(12), 1565.

  25. Parry, R. M., Jones, W., Stokes, T. H., Phan, J. H., Moffitt, R. A., Fang, H., … & Wang, M. D. (2010). k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction. The pharmacogenomics journal, 10(4), 292.

  26. Koscielny, G., An, P., Carvalho-Silva, D., Cham, J. A., Fumis, L., Gasparyan, R., … & Pierleoni, A. (2016). Open Targets: a platform for therapeutic target identification and validation. Nucleic acids research, 45(D1), D985-D994.