1 Introduction

Our main data source for studying protein sub-cellular localisation is high-throughput mass spectrometry-based experiments such as LOPIT, PCP and similar designs (see Gatto et al. (2010) for a general introduction). Recent optimised experiments produce high-quality data that enable the identification of over 6000 proteins and the discrimination of numerous sub-cellular and sub-organellar niches (Christoforou et al. 2016). Supervised and semi-supervised machine learning algorithms can be applied to assign thousands of proteins to annotated sub-cellular niches (Breckels et al. 2013, Gatto et al. 2014) (see also the pRoloc-tutorial vignette). These data constitute our main source for protein localisation and are hereafter termed primary data.

There are other sources of data about the sub-cellular localisation of proteins, such as the Gene Ontology (Ashburner et al. 2000) (in particular the cellular component namespace), quantitative features derived from protein sequences (such as pseudo amino acid composition) or the Human Protein Atlas (Uhlen et al. 2010), to name a few. These data, while not optimised for the specific system at hand and, in the case of annotation features, not as reliable as our experimental data, constitute an invaluable and often plentiful source of auxiliary information.

The aim of a transfer learning algorithm is to combine different sources of data to improve overall classification. In particular, the goal is to support and complement the primary target domain (experimental data) with auxiliary data (annotation) features without compromising the integrity of our primary data. In this vignette, we describe how to apply the transfer learning algorithms from the pRoloc package to the localisation of proteins, as described in:

Breckels LM, Holden S, Wojnar D, Mulvey CM, Christoforou A, Groen A, Trotter MW, Kohlbacher O, Lilley KS and Gatto L (2016). Learning from heterogeneous data sources: an application in spatial proteomics. PLoS Comput Biol 12(5):e1004920. doi: 10.1371/journal.pcbi.1004920.

Two algorithms were developed: a transfer learning algorithm based on the \(k\)-nearest neighbour classifier, coined kNN-TL hereafter, described in section 4, and one based on the support vector machine algorithm, termed SVM-TL, described in section 3.


2 Preparing the auxiliary data

2.1 The Gene Ontology

The auxiliary data is prepared from the primary data’s features. All the GO terms associated with these features are retrieved and used to create a binary matrix where a one (zero) at position \((i,j)\) indicates that term \(j\) has (not) been used to annotate feature \(i\).
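To make the structure of this matrix concrete, here is a minimal base R sketch. It does not use the pRoloc infrastructure: the feature names (P1 to P4) and their term assignments are made up for illustration, and the name go_matrix is hypothetical.

```r
# Toy annotation: each feature (protein) maps to the GO terms that annotate it.
# Feature names and term assignments are invented for illustration only.
annotations <- list(
  P1 = c("GO:0005739", "GO:0005743"),
  P2 = c("GO:0005634"),
  P3 = c("GO:0005739", "GO:0005634"),
  P4 = character(0)  # a feature without any annotation
)

# The columns are the sorted set of all terms seen across the features.
terms <- sort(unique(unlist(annotations)))

# Entry (i, j) is 1 if term j annotates feature i, and 0 otherwise.
go_matrix <- t(vapply(
  annotations,
  function(x) as.integer(terms %in% x),
  integer(length(terms))
))
colnames(go_matrix) <- terms

go_matrix
```

Rows correspond to features and columns to GO terms, so a feature without any annotation, such as P4 here, yields a row of zeros.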

The GO terms are retrieved from an appropriate repository using the biomaRt package. The specific Biomart repository and query depend on the species under study and the type of features. The first step is to prepare the annotation parameters needed to perform the query. The pRoloc package provides a dedicated infrastructure to set up the query to the annotation resource and prepare the GO data for subsequent analyses. This infrastructure consists of two steps:

  1. define the annotation parameters based on the species and feature types;
  2. query the resource defined in (1) to retrieve relevant terms and use the terms to prepare the auxiliary data.

We will demonstrate these steps using a LOPIT experiment on Human Embryonic Kidney (HEK293T) fibroblast cells (Breckels et al. 2013), available and documented in the pRolocdata experiment package as andy2011.


2.1.1 Preparing the query parameters

The query parameters are stored as AnnotationParams objects that are created with the setAnnotationParams function. The function first presents a menu listing 486 species. Once the species has been selected, a set of possible identifier types is displayed.