The Human Protein Atlas (HPA) is a systematic study of the human proteome using antibody-based proteomics. Multiple tissues and cell lines are systematically assayed using affinity-purified antibodies and confocal microscopy. The hpar package is an R interface to the HPA project. It distributes several data sets and provides functionality to query them and to access detailed information pages, including confocal microscopy images, available on the HPA web page.
hpar 1.20.0
From the Human Protein Atlas (Uhlén et al. 2005; Uhlén et al. 2010) site:
The Swedish Human Protein Atlas project, funded by the Knut and Alice Wallenberg Foundation, has been set up to allow for a systematic exploration of the human proteome using Antibody-Based Proteomics. This is accomplished by combining high-throughput generation of affinity-purified antibodies with protein profiling in a multitude of tissues and cells assembled in tissue microarrays. Confocal microscopy analysis using human cell lines is performed for more detailed protein localisation. The program hosts the Human Protein Atlas portal with expression profiles of human proteins in tissues and cells.
The hpar package provides access to HPA data from the R interface. It also distributes the following data sets:
hpaNormalTissue
Normal tissue data: Expression profiles for proteins in human tissues based on immunohistochemistry using tissue microarrays. The comma-separated file includes Ensembl gene identifier (“Gene”), tissue name (“Tissue”), annotated cell type (“Cell type”), expression value (“Level”), the type of annotation (annotated protein expression (APE), based on more than one antibody, or staining, based on one antibody only) (“Expression type”), and the reliability or validation of the expression value (“Reliability”).
hpaCancer
Cancer tumor data: Staining profiles for proteins in human tumor tissue based on immunohistochemistry using tissue microarrays. The comma-separated file includes Ensembl gene identifier (“Gene”), tumor name (“Tumor”), staining value (“Level”), the number of patients that stain for this staining value (“Count patients”), the total number of patients for this tumor type (“Total patients”) and the type of annotation staining (“Expression type”).
rnaGeneTissue
RNA gene data: RNA levels in 45 cell lines and 32 tissues based on RNA-seq. The comma-separated file includes Ensembl gene identifier (“Gene”), analysed sample (“Sample”), fragments per kilobase of transcript per million fragments mapped (“Value” and “Unit”), and abundance class (“Abundance”).
rnaGeneCellLine
RNA gene data: RNA levels in 45 cell lines and 32 tissues based on RNA-seq. The comma-separated file includes Ensembl gene identifier (“Gene”), analysed sample (“Sample”), fragments per kilobase of transcript per million fragments mapped (“Value” and “Unit”), and abundance class (“Abundance”).
hpaSubcellularLoc
Subcellular location data: Subcellular localization of proteins based on immunofluorescently stained cells. The comma-separated file includes Ensembl gene identifier (“Gene”), main subcellular location of the protein (“Main location”), other locations (“Other location”), the type of annotation (annotated protein expression (APE), based on more than one antibody, or staining, based on one antibody only) (“Expression type”), and the reliability or validation of the expression value (“Reliability”).
hpaSubcellularLoc14
Same as above, for version 14.
The use of data and images from the HPA in publications and presentations is permitted provided that the following conditions are met:
hpar is available through the Bioconductor project. Details about the package and the installation procedure can be found on its landing page. To install using the dedicated Bioconductor infrastructure, run:
source("http://bioconductor.org/biocLite.R")
## or, if you have already used the above before
library("BiocInstaller") ## and to install the package
biocLite("hpar")
After installation, hpar will have to be explicitly loaded with
library("hpar")
## This is hpar version 1.20.0,
## based on the Human Protein Atlas
## Version: 16.1
## Release data: 2017.01.31
## Ensembl build: 83.38
## See '?hpar' or 'vignette('hpar')' for details.
so that all the package’s functionality and data are available to the user.
The data sets described above can be loaded with the data function, as illustrated for hpaNormalTissue below. Each data set is a data.frame and can be easily manipulated using standard R functionality. The code chunk below illustrates some of its properties.
data(hpaNormalTissue)
dim(hpaNormalTissue)
## [1] 1031835 6
names(hpaNormalTissue)
## [1] "Gene" "Gene.name" "Tissue" "Cell.type" "Level"
## [6] "Reliability"
## Number of genes
length(unique(hpaNormalTissue$Gene))
## [1] 12983
## Number of cell types
length(unique(hpaNormalTissue$Cell.type))
## [1] 67
head(levels(hpaNormalTissue$Cell.type))
## [1] "adipocytes" "bile duct cells"
## [3] "cells in anterior" "cells in cortex/medulla"
## [5] "cells in cuticle" "cells in endometrial stroma"
## Number of tissues
length(unique(hpaNormalTissue$Tissue))
## [1] 55
head(levels(hpaNormalTissue$Tissue))
## [1] "adrenal gland" "appendix" "bone marrow" "breast"
## [5] "bronchus" "caudate"
table(hpaNormalTissue$Expression.type)
## < table of extent 0 >
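Because each data set is a plain data.frame, the usual base R tools apply. The sketch below is only an illustration: it builds a small hand-made data frame (toy, with made-up gene identifiers and values, not HPA data) that borrows the column names of hpaNormalTissue, so that it runs without the package. With hpar loaded, the same calls should work on hpaNormalTissue itself.

```r
## Toy stand-in for hpaNormalTissue: same column names, made-up rows.
toy <- data.frame(
    Gene = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
    Gene.name = c("GENEA", "GENEA", "GENEB", "GENEB", "GENEC"),
    Tissue = c("breast", "appendix", "breast", "appendix", "breast"),
    Cell.type = c("glandular cells", "glandular cells",
                  "adipocytes", "lymphoid tissue", "glandular cells"),
    Level = c("High", "Medium", "Not detected", "Not detected", "High"),
    Reliability = "Approved",
    stringsAsFactors = FALSE
)

## How many rows fall into each expression level?
level_counts <- table(toy$Level)

## Rows for one tissue with detectable expression
breast_detected <- subset(toy, Tissue == "breast" & Level != "Not detected")

## Number of distinct genes detected in that tissue
n_genes <- length(unique(breast_detected$Gene))
```

The same pattern (table, subset, unique) scales to the full data set, although on roughly one million rows it may be worth subsetting first and tabulating afterwards.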
The package provides an interface to the HPA data. The getHpa function queries the data sets described above. It takes three arguments, id, hpadata and type, that control the query, which data set to interrogate and how to report results, respectively. The HPA data uses Ensembl gene identifiers, and id must be a valid identifier. hpadata must be one of the available data sets. type can be either "data" or "details". The former is the default and returns a data.frame containing the information relevant to id. It is also possible to obtain detailed information (including cell images) as web pages, directly from the HPA web page, using "details".
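Conceptually, a "data" query amounts to selecting the rows of the chosen data set whose Gene column matches id. The function below is a hypothetical stand-in written purely for illustration: query_by_gene and the toy data frame (including its made-up levels) are assumptions and not part of hpar, and the sketch does not claim to reproduce how getHpa is implemented.

```r
## Hypothetical helper, not part of hpar: look up rows by Ensembl gene id.
query_by_gene <- function(id, data, type = c("data", "details")) {
    type <- match.arg(type)   # only "data" or "details" are accepted
    stopifnot(is.character(id), "Gene" %in% names(data))
    if (type == "details")
        stop("this sketch only covers type = \"data\"")
    ## Keep every row annotated with one of the requested identifiers
    data[data$Gene %in% id, , drop = FALSE]
}

## Made-up rows that mimic the shape of an HPA data set
toy <- data.frame(
    Gene = c("ENSG00000000003", "ENSG00000000003", "ENSG00000000005"),
    Tissue = c("appendix", "breast", "breast"),
    Level = c("Medium", "High", "Not detected"),
    stringsAsFactors = FALSE
)

res <- query_by_gene("ENSG00000000003", toy)
```

Using match.arg means that a misspelt type is rejected with an error instead of silently falling back to the default.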
We will illustrate this functionality using the TSPAN6 (tetraspanin 6) gene (ENSG00000000003) as an example.
id <- "ENSG00000000003"
head(getHpa(id, hpadata = "hpaNormalTissue"))
## Gene Gene.name Tissue Cell.type Level
## 1 ENSG00000000003 TSPAN6 adrenal gland glandular cells Not detected
## 2 ENSG00000000003 TSPAN6 appendix glandular cells Medium
## 3 ENSG00000000003 TSPAN6 appendix lymphoid tissue Not detected
## 4 ENSG00000000003 TSPAN6 bone marrow hematopoietic cells Not detected
## 5 ENSG00000000003 TSPAN6 breast adipocytes Not detected
## 6 ENSG00000000003 TSPAN6 breast glandular cells High
## Reliability
## 1 Approved
## 2 Approved
## 3 Approved
## 4 Approved
## 5 Approved
## 6 Approved
getHpa(id, hpadata = "hpaSubcellularLoc")
## Gene Gene.name Reliability Validated Supported Approved
## 1 ENSG00000000003 TSPAN6 Approved Cytosol
## Uncertain Cell.to.cell.variation.spatial Cell.to.cell.variation.intensity
## 1
## Cell.cycle.dependency GO.id
## 1 Cytosol (GO:0005829)
head(getHpa(id, hpadata = "rnaGeneCellLine"))
## Gene Gene.name Sample Value Unit
## 1 ENSG00000000003 TSPAN6 A-431 27.8 TPM
## 2 ENSG00000000003 TSPAN6 A549 37.6 TPM
## 3 ENSG00000000003 TSPAN6 AF22 108.2 TPM
## 4 ENSG00000000003 TSPAN6 AN3-CA 51.8 TPM
## 5 ENSG00000000003 TSPAN6 ASC TERT1 17.8 TPM
## 6 ENSG00000000003 TSPAN6 BEWO 42.8 TPM
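Query results are ordinary data frames, so they can be post-processed directly. As a minimal sketch, the six rnaGeneCellLine rows printed above are typed in by hand here (so the example runs without hpar) and ranked by expression. The 40 TPM cut-off is an arbitrary value chosen for illustration, not an HPA convention.

```r
## First six rows of the rnaGeneCellLine query shown above, typed in by hand.
rna <- data.frame(
    Sample = c("A-431", "A549", "AF22", "AN3-CA", "ASC TERT1", "BEWO"),
    Value = c(27.8, 37.6, 108.2, 51.8, 17.8, 42.8),
    Unit = "TPM",
    stringsAsFactors = FALSE
)

## Order cell lines by decreasing expression
ranked <- rna[order(rna$Value, decreasing = TRUE), ]

## Cell lines above an arbitrary, illustrative threshold of 40 TPM
above <- ranked$Sample[ranked$Value > 40]
```

With hpar loaded, replacing rna with the full result of getHpa(id, hpadata = "rnaGeneCellLine") should rank all cell lines in the same way.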
If we ask for "details", a browser page pointing to the relevant page is opened (see figure below).
getHpa(id, type = "details")
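To make the idea of a per-gene details page concrete, the sketch below builds a page address from an Ensembl gene identifier. Both the helper name hpa_url and the URL pattern (base address followed by the identifier) are assumptions made for illustration; they are not taken from the hpar source and may not match what getHpa does internally.

```r
## Hypothetical helper, not part of hpar; the URL pattern is an assumption.
hpa_url <- function(id, base = "https://www.proteinatlas.org/") {
    ## Ensembl human gene identifiers: "ENSG" followed by 11 digits
    stopifnot(is.character(id), all(grepl("^ENSG[0-9]{11}$", id)))
    paste0(base, id)
}

url <- hpa_url("ENSG00000000003")
## In an interactive session the page could then be opened with
## utils::browseURL(url)
```

Validating the identifier before building the address avoids opening a browser on a malformed URL.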