hpar 1.34.0
From the Human Protein Atlas (Uhlén et al. 2005; Uhlen et al. 2010) site:
The Swedish Human Protein Atlas project, funded by the Knut and Alice Wallenberg Foundation, has been set up to allow for a systematic exploration of the human proteome using Antibody-Based Proteomics. This is accomplished by combining high-throughput generation of affinity-purified antibodies with protein profiling in a multitude of tissues and cells assembled in tissue microarrays. Confocal microscopy analysis using human cell lines is performed for more detailed protein localisation. The program hosts the Human Protein Atlas portal with expression profiles of human proteins in tissues and cells.
The hpar package provides access to HPA data from the R interface. It also distributes the following data sets:
hpaNormalTissue
Normal tissue data: Expression profiles
for proteins in human tissues based on immunohistochemistry using tissue
microarrays. The tab-separated file includes Ensembl gene identifier
(“Gene”), tissue name (“Tissue”), annotated cell type (“Cell type”),
expression value (“Level”), and the gene reliability of the expression
value (“Reliability”).
hpaNormalTissue16.1
Same as above, for version 16.1.
hpaCancer
Pathology data: Staining profiles for
proteins in human tumor tissue based on immunohistochemistry using
tissue microarrays and log-rank P value for Kaplan-Meier analysis
of correlation between mRNA expression level and patient survival.
The tab-separated file includes Ensembl gene identifier (“Gene”),
gene name (“Gene name”), tumor name (“Cancer”), the number of
patients annotated for different staining levels (“High”, “Medium”,
“Low” & “Not detected”) and log-rank p values for patient survival
and mRNA correlation (“prognostic - favourable”,
“unprognostic - favourable”, “prognostic - unfavourable”,
“unprognostic - unfavourable”).
hpaCancer16.1
Same as above, for version 16.1.
rnaGeneTissue
RNA HPA tissue gene data: Transcript
expression levels summarized per gene in 37 tissues based on
RNA-seq. The tab-separated file includes Ensembl gene identifier
(“Gene”), analysed sample (“Tissue”), transcripts per million
(“TPM”), protein-coding transcripts per million (“pTPM”) and
normalized expression (“NX”).
rnaGeneCellLine
RNA HPA cell line gene data: Transcript
expression levels summarized per gene in 64 cell lines.
The tab-separated file includes Ensembl gene identifier
(“Gene”), analysed sample (“Cell line”), transcripts per
million (“TPM”), protein-coding transcripts per million (“pTPM”)
and normalized expression (“NX”).
rnaGeneCellLine16.1
Same as above, for version 16.1.
hpaSubcellularLoc
Subcellular location data: Subcellular
location of proteins based on immunofluorescently stained cells.
The tab-separated file includes the following columns: Ensembl
gene identifier (“Gene”), name of gene (“Gene name”), gene reliability
score (“Reliability”), enhanced locations (“Enhanced”), supported
locations (“Supported”), approved locations (“Approved”), uncertain
locations (“Uncertain”), locations with single-cell variation in
intensity (“Single-cell variation intensity”), locations with spatial
single-cell variation (“Single-cell variation spatial”), locations
with observed cell cycle dependency (type can be one or more of
biological definition, custom data or correlation) (“Cell cycle
dependency”), Gene Ontology Cellular Component term identifier
(“GO id”).
hpaSubcellularLoc14 and hpaSubcellularLoc16.1
Same as above, for versions 14 and 16.1.
hpaSecretome
Secretome data: The human secretome is here
defined as all Ensembl genes with at least one predicted
secreted transcript according to HPA predictions. The complete
information about the HPA Secretome data is given on the HPA website.
This dataset has 230 columns and includes the Ensembl gene identifier
(“Gene”). Information about the additional variables can also be found
on the HPA website.
The use of data and images from the HPA in publications and presentations is permitted provided that the following conditions are met:
hpar is available through the Bioconductor project. Details about the package and the installation procedure can be found on its landing page. To install using the dedicated Bioconductor infrastructure, run:
## install BiocManager only once
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
## install hpar
BiocManager::install("hpar")
After installation, hpar will have to be explicitly loaded with
library("hpar")
## This is hpar version 1.34.0,
## based on the Human Protein Atlas
## Version: 20.0
## Release data: 2020.11.19
## Ensembl build: 92.38
## See '?hpar' or 'vignette('hpar')' for details.
so that all the package’s functionality and data are available to the user.
The data sets described above can be loaded with the data function,
as illustrated for hpaNormalTissue below. Each data set is a data.frame
and can be easily manipulated using standard R functionality. The code
chunk below illustrates some of its properties.
data(hpaNormalTissue)
dim(hpaNormalTissue)
## [1] 1118517 6
names(hpaNormalTissue)
## [1] "Gene" "Gene.name" "Tissue" "Cell.type" "Level"
## [6] "Reliability"
## Number of genes
length(unique(hpaNormalTissue$Gene))
## [1] 15320
## Number of cell types
length(unique(hpaNormalTissue$Cell.type))
## [1] 120
head(levels(hpaNormalTissue$Cell.type))
## NULL
## Number of tissues
length(unique(hpaNormalTissue$Tissue))
## [1] 63
head(levels(hpaNormalTissue$Tissue))
## NULL
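The two head(levels(...)) calls above return NULL because levels() is only defined for factors, and the columns appear to be stored as character vectors. The following minimal sketch uses a small hand-typed data frame (a stand-in named toy, not the real HPA data, so that it runs without hpar) to show how unique(), table() or an explicit conversion with factor() give the intended summaries.

```r
## Toy stand-in for hpaNormalTissue: a few hand-typed rows, character columns
toy <- data.frame(
    Tissue = c("adipose tissue", "adrenal gland", "appendix", "appendix"),
    Cell.type = c("adipocytes", "glandular cells", "glandular cells",
                  "lymphoid tissue"),
    stringsAsFactors = FALSE)

## levels() is only defined for factors, so a character column gives NULL
levels(toy$Tissue)

## unique() works directly on character vectors
tissues <- sort(unique(toy$Tissue))
tissues

## table() counts the number of rows per value
cell_counts <- table(toy$Cell.type)
cell_counts

## An explicit conversion to factor makes levels() informative
tissue_levels <- levels(factor(toy$Tissue))
tissue_levels
```

The same calls should apply unchanged to the Tissue and Cell.type columns of hpaNormalTissue.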
The package provides an interface to the HPA data. The getHpa function
queries the data sets described above. It takes three arguments,
id, hpadata and type, that control the query, which data set to
interrogate and how to report results, respectively. The HPA data uses
Ensembl gene identifiers and id must be a valid identifier. hpadata
must be one of the available data sets. type can be either "data" or
"details". The former is the default and returns a data.frame
containing the information relevant to id. It is also possible to
obtain detailed information (including cell images) as web pages,
directly from the HPA web page, using "details".
We will illustrate this functionality using the TSPAN6 (tetraspanin 6) gene (ENSG00000000003) as an example.
id <- "ENSG00000000003"
head(getHpa(id, hpadata = "hpaNormalTissue"))
## Gene Gene.name Tissue Cell.type Level
## 1 ENSG00000000003 TSPAN6 adipose tissue adipocytes Not detected
## 2 ENSG00000000003 TSPAN6 adrenal gland glandular cells Not detected
## 3 ENSG00000000003 TSPAN6 appendix glandular cells Medium
## 4 ENSG00000000003 TSPAN6 appendix lymphoid tissue Not detected
## 5 ENSG00000000003 TSPAN6 bone marrow hematopoietic cells Not detected
## 6 ENSG00000000003 TSPAN6 breast adipocytes Not detected
## Reliability
## 1 Approved
## 2 Approved
## 3 Approved
## 4 Approved
## 5 Approved
## 6 Approved
getHpa(id, hpadata = "hpaSubcellularLoc")
## Gene Gene.name Reliability Main.location
## 1 ENSG00000000003 TSPAN6 Approved Cell Junctions;Cytosol
## Additional.location Extracellular.location Enhanced Supported
## 1 Nucleoli fibrillar center
## Approved Uncertain
## 1 Cell Junctions;Cytosol;Nucleoli fibrillar center
## Single.cell.variation.intensity Single.cell.variation.spatial
## 1 Cytosol
## Cell.cycle.dependency
## 1
## GO.id
## 1 Cell Junctions (GO:0030054);Cytosol (GO:0005829);Nucleoli fibrillar center (GO:0001650)
head(getHpa(id, hpadata = "rnaGeneCellLine"))
## Gene Gene.name Cell.line TPM pTPM NX
## 1 ENSG00000000003 TSPAN6 A-431 27.8 33.9 7.9
## 2 ENSG00000000003 TSPAN6 A549 37.6 45.5 10.6
## 3 ENSG00000000003 TSPAN6 AF22 108.1 134.5 28.7
## 4 ENSG00000000003 TSPAN6 AN3-CA 51.8 64.4 14.5
## 5 ENSG00000000003 TSPAN6 ASC diff 32.3 37.4 12.6
## 6 ENSG00000000003 TSPAN6 ASC TERT1 17.7 20.8 6.8
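Because getHpa returns an ordinary data.frame, its results can be post-processed with base R. The sketch below retypes by hand the six hpaNormalTissue rows printed above (so that it runs without hpar installed) and keeps only the rows where the protein was detected. It also splits a semicolon-separated location string of the kind returned for hpaSubcellularLoc. The variable names res, detected and locs are illustrative only.

```r
## First rows of the hpaNormalTissue result for ENSG00000000003, typed in by hand
res <- data.frame(
    Gene = "ENSG00000000003",
    Gene.name = "TSPAN6",
    Tissue = c("adipose tissue", "adrenal gland", "appendix", "appendix",
               "bone marrow", "breast"),
    Cell.type = c("adipocytes", "glandular cells", "glandular cells",
                  "lymphoid tissue", "hematopoietic cells", "adipocytes"),
    Level = c("Not detected", "Not detected", "Medium", "Not detected",
              "Not detected", "Not detected"),
    Reliability = "Approved",
    stringsAsFactors = FALSE)

## Keep only the tissue/cell type combinations where the protein was detected
detected <- res[res$Level != "Not detected", c("Tissue", "Cell.type", "Level")]
detected

## Locations are reported as a single semicolon-separated string
locs <- strsplit("Cell Junctions;Cytosol;Nucleoli fibrillar center", ";")[[1]]
locs
```

With hpar loaded, res could instead be the value returned by getHpa(id, hpadata = "hpaNormalTissue").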
If we ask for "details", a browser page pointing to the
relevant page is opened (see figure below).
getHpa(id, type = "details")
If a user is interested specifically in one data set, it is possible
to set hpadata globally and omit it in getHpa. This is done by
setting the hpar option hpadata with the setHparOptions function.
The current default data set can be checked with getHparOptions.
getHparOptions()
## $hpar
## $hpar$hpadata
## [1] "hpaNormalTissue"
setHparOptions(hpadata = "hpaSubcellularLoc")
getHparOptions()
## $hpar
## $hpar$hpadata
## [1] "hpaSubcellularLoc"
getHpa(id)
## Gene Gene.name Reliability Main.location
## 1 ENSG00000000003 TSPAN6 Approved Cell Junctions;Cytosol
## Additional.location Extracellular.location Enhanced Supported
## 1 Nucleoli fibrillar center
## Approved Uncertain
## 1 Cell Junctions;Cytosol;Nucleoli fibrillar center
## Single.cell.variation.intensity Single.cell.variation.spatial
## 1 Cytosol
## Cell.cycle.dependency
## 1
## GO.id
## 1 Cell Junctions (GO:0030054);Cytosol (GO:0005829);Nucleoli fibrillar center (GO:0001650)
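The general pattern of a global default that a query function falls back on can be sketched in base R with options(). The helpers below (set_default_dataset, get_default_dataset, query) and the option name toy_hpar are hypothetical and written for illustration only. They are not hpar's implementation, which may work differently.

```r
## Hypothetical helpers, NOT hpar's code: a default data set kept in options()
set_default_dataset <- function(hpadata) {
    opts <- getOption("toy_hpar", default = list())
    opts$hpadata <- hpadata
    options(toy_hpar = opts)
    invisible(opts)
}

get_default_dataset <- function() {
    ## Fall back to "hpaNormalTissue" when nothing has been set yet
    getOption("toy_hpar", default = list(hpadata = "hpaNormalTissue"))$hpadata
}

## A query function uses the global default when no data set is given
query <- function(id, hpadata = get_default_dataset()) {
    paste("querying", hpadata, "for", id)
}

first <- query("ENSG00000000003")
set_default_dataset("hpaSubcellularLoc")
second <- query("ENSG00000000003")
## An explicit argument still overrides the global default
explicit <- query("ENSG00000000003", hpadata = "rnaGeneCellLine")
```

Because the default argument is evaluated at call time, changing the option affects all later calls that omit hpadata.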
Information about the HPA release used to build the installed
hpar package can be accessed with getHpaVersion, getHpaDate and
getHpaEnsembl. Full release details can be found on the HPA release
history page.
getHpaVersion()
## version
## "20.0"
getHpaDate()
## date
## "2020.11.19"
getHpaEnsembl()
## ensembl
## "92.38"
Let’s compare the subcellular localisation annotation obtained from the HPA subcellular location data set and the information available in the Bioconductor annotation packages.
id <- "ENSG00000001460"
getHpa(id, "hpaSubcellularLoc")
## Gene Gene.name Reliability Main.location Additional.location
## 8 ENSG00000001460 STPG1 Approved Nucleoplasm
## Extracellular.location Enhanced Supported Approved Uncertain
## 8 Nucleoplasm
## Single.cell.variation.intensity Single.cell.variation.spatial
## 8
## Cell.cycle.dependency GO.id
## 8 Nucleoplasm (GO:0005654)
Below, we first extract all cellular component GO terms available for
id from the org.Hs.eg.db human annotation and then retrieve their term
definitions using the GO.db database.
library("org.Hs.eg.db")
library("GO.db")
ans <- select(org.Hs.eg.db, keys = id,
columns = c("ENSEMBL", "GO", "ONTOLOGY"),
keytype = "ENSEMBL")
## 'select()' returned 1:many mapping between keys and columns
ans <- ans[ans$ONTOLOGY == "CC", ]
ans
## ENSEMBL GO EVIDENCE ONTOLOGY
## 2 ENSG00000001460 GO:0005634 IEA CC
## 3 ENSG00000001460 GO:0005739 IEA CC
sapply(as.list(GOTERM[ans$GO]), slot, "Term")
## GO:0005634 GO:0005739
## "nucleus" "mitochondrion"
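To make the comparison explicit, the GO identifiers can be pulled out of the HPA GO.id string and compared with those from the annotation package. The sketch below copies the values printed above by hand, so it runs without hpar or the annotation packages. The variable names are illustrative, and the pattern assumes GO identifiers of the form GO: followed by seven digits.

```r
## Values copied by hand from the outputs above
hpa_go_string <- "Nucleoplasm (GO:0005654)"
annot_go <- c("GO:0005634", "GO:0005739")

## Extract every GO identifier (GO: followed by seven digits) from the HPA string
hpa_go <- regmatches(hpa_go_string,
                     gregexpr("GO:[0-9]{7}", hpa_go_string))[[1]]
hpa_go

## Identifiers reported by both sources (exact matches only)
shared <- intersect(hpa_go, annot_go)
shared

## Identifiers reported by HPA only
hpa_only <- setdiff(hpa_go, annot_go)
hpa_only
```

Exact matching of identifiers ignores the GO hierarchy. Since the nucleoplasm is a sub-compartment of the nucleus, a comparison that also considers parent terms would be needed to see where the two sources agree.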
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] hpar_1.34.0 GO.db_3.13.0 org.Hs.eg.db_3.13.0
## [4] AnnotationDbi_1.54.0 IRanges_2.26.0 S4Vectors_0.30.0
## [7] Biobase_2.52.0 BiocGenerics_0.38.0 BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 XVector_0.32.0 GenomeInfoDb_1.28.0
## [4] bslib_0.2.5.1 compiler_4.1.0 BiocManager_1.30.15
## [7] jquerylib_0.1.4 zlibbioc_1.38.0 bitops_1.0-7
## [10] tools_4.1.0 digest_0.6.27 bit_4.0.4
## [13] jsonlite_1.7.2 evaluate_0.14 RSQLite_2.2.7
## [16] memoise_2.0.0 pkgconfig_2.0.3 png_0.1-7
## [19] rlang_0.4.11 DBI_1.1.1 yaml_2.2.1
## [22] xfun_0.23 fastmap_1.1.0 GenomeInfoDbData_1.2.6
## [25] httr_1.4.2 stringr_1.4.0 knitr_1.33
## [28] Biostrings_2.60.0 sass_0.4.0 vctrs_0.3.8
## [31] bit64_4.0.5 R6_2.5.0 rmarkdown_2.8
## [34] bookdown_0.22 blob_1.2.1 magrittr_2.0.1
## [37] htmltools_0.5.1.1 KEGGREST_1.32.0 stringi_1.6.2
## [40] RCurl_1.98-1.3 cachem_1.0.5 crayon_1.4.1
Uhlén, Mathias, Erik Björling, Charlotta Agaton, Cristina Al-Khalili A. Szigyarto, Bahram Amini, Elisabet Andersen, Ann-Catrin C. Andersson, et al. 2005. “A human protein atlas for normal and cancer tissues based on antibody proteomics.” Molecular & Cellular Proteomics : MCP 4 (12): 1920–32. https://doi.org/10.1074/mcp.M500279-MCP200.
Uhlen, Mathias, Per Oksvold, Linn Fagerberg, Emma Lundberg, Kalle Jonasson, Mattias Forsberg, Martin Zwahlen, et al. 2010. “Towards a knowledge-based Human Protein Atlas.” Nature Biotechnology 28 (12): 1248–50. https://doi.org/10.1038/nbt1210-1248.