Package: Pbase
Author: Laurent Gatto
Last compiled: Mon Oct 17 19:31:38 2016
Last modified: 2016-10-17 16:10:57
The aim of this vignette is to document the mapping of proteins and the tandem mass spectrometry-derived peptides to genomic locations.
We will use a small Proteins
object from the Pbase package to illustrate how to retrieve genome coordinates and map a peptides back to genomic coordinates. See the Pbase-data
vignette for an introduction to Proteins
data. The main information needed in this vignette consists of protein UniProt identifiers and a set of peptides positions along the protein sequence.
library("Pbase")
data(p)
p
## S4 class type : Proteins
## Class version : 0.1
## Created : Sat Feb 14 12:36:40 2015
## Number of Proteins: 9
## Sequences:
## [1] A4UGR9 [2] A6H8Y1 ... [8] P04075-2 [9] P60709
## Sequence features:
## [1] DB [2] AccessionNumber ... [12] npeps [13] ENST
## Peptide features:
## [1] DB [2] AccessionNumber ... [27] acquisitionNum [28] filenames
seqnames(p)
## [1] "A4UGR9" "A6H8Y1" "O43707" "O75369" "P00558" "P02545"
## [7] "P04075" "P04075-2" "P60709"
pcols(p)[1, c("start", "end")]
## SplitDataFrameList of length 1
## $A4UGR9
## DataFrame with 36 rows and 2 columns
## start end
## <integer> <integer>
## 1 2743 2760
## 2 307 318
## 3 1858 1870
## 4 1699 1708
## 5 2622 2637
## ... ... ...
## 32 20 31
## 33 1712 1729
## 34 48 61
## 35 2082 2094
## 36 2743 2756
We will also require an identifier relating the protein feature of interest to the genome. Below, we use biomaRt to query the matching Ensembl transcript identifier. We start by create a Mart
object that stores the connection details to the latest human Ensembl biomart server.
library("biomaRt")
ens <- useMart("ensembl", "hsapiens_gene_ensembl")
ens
## Object of class 'Mart':
## Using the ENSEMBL_MART_ENSEMBL BioMart database
## Using the hsapiens_gene_ensembl dataset
bm <- select(ens, keys = seqnames(p),
keytype = "uniprot_swissprot",
columns = c(
"uniprot_swissprot",
"hgnc_symbol",
"ensembl_transcript_id"))
bm
## uniprot_swissprot hgnc_symbol ensembl_transcript_id
## 1 A4UGR9 XIRP2 ENST00000409043
## 2 A4UGR9 XIRP2 ENST00000409728
## 3 A4UGR9 XIRP2 ENST00000409195
## 4 A4UGR9 XIRP2 ENST00000409273
## 5 A4UGR9 XIRP2 ENST00000409605
## 6 A4UGR9 XIRP2 ENST00000628543
## 7 A6H8Y1 BDP1 ENST00000358731
## 8 A6H8Y1 BDP1 ENST00000617085
## 9 O43707 ACTN4 ENST00000634960
## 10 O43707 ACTN4 ENST00000634692
## 11 O43707 ACTN4 ENST00000252699
## 12 O43707 ACTN4 ENST00000390009
## 13 O75369 FLNB ENST00000490882
## 14 O75369 FLNB ENST00000295956
## 15 O75369 FLNB ENST00000358537
## 16 O75369 FLNB ENST00000429972
## 17 P00558 PGK1 ENST00000373316
## 18 P02545 LMNA ENST00000368301
## 19 P02545 LMNA ENST00000368300
## 20 P02545 LMNA ENST00000368299
## 21 P02545 LMNA ENST00000448611
## 22 P02545 LMNA ENST00000473598
## 23 P02545 LMNA ENST00000347559
## 24 P04075 ALDOA ENST00000338110
## 25 P04075 ALDOA ENST00000395248
## 26 P04075 ALDOA ENST00000566897
## 27 P04075 ALDOA ENST00000569545
## 28 P04075 ALDOA ENST00000563060
## 29 P04075 ALDOA ENST00000412304
## 30 P04075 ALDOA ENST00000564546
## 31 P04075 ALDOA ENST00000564595
## 32 P60709 ACTB ENST00000331789
As can be seen, there can be multiple transcripts for one protein accession. We have defined the transcripts of interest for our proteins in p
; they are stored as protein elements metadata:
acols(p)$ENST
## A4UGR9 A6H8Y1 O43707 O75369
## "ENST00000409195" "ENST00000358731" "ENST00000252699" "ENST00000295956"
## P00558 P02545 P04075 P04075-2
## "ENST00000373316" "ENST00000368300" "ENST00000338110" "ENST00000395248"
## P60709
## "ENST00000331789"
The etrid2grl
function takes our transcript identifiers and will query the Ensembl biomart server (note the ens
argument) and return a GRangesList
object. For each of Ensembl transcript identifiers provided as input, we have the genomic coordinates of that transcript’s exons as well as additional information such as the type of exons (protein coding or untranslated region).
grl <- etrid2grl(acols(p)$ENST, ens)
all.equal(names(grl), acols(p)$ENST,
check.attributes=FALSE)
## [1] TRUE
grl
## GRangesList object of length 9:
## $ENST00000409195
## GRanges object with 13 ranges and 7 metadata columns:
## seqnames ranges strand | feature
## <Rle> <IRanges> <Rle> | <character>
## [1] chr2 [166888487, 166888557] + | utr5
## [2] chr2 [166903465, 166903482] + | utr5
## [3] chr2 [166903483, 166903890] + | protein_coding
## [4] chr2 [167135909, 167136062] + | protein_coding
## [5] chr2 [167210735, 167210895] + | protein_coding
## ... ... ... ... . ...
## [9] chr2 [167241777, 167241910] + | protein_coding
## [10] chr2 [167242569, 167251947] + | protein_coding
## [11] chr2 [167254032, 167254126] + | protein_coding
## [12] chr2 [167254127, 167254165] + | utr3
## [13] chr2 [167257857, 167259753] + | utr3
## gene exon transcript symbol
## <character> <character> <character> <character>
## [1] ENSG00000163092 ENSE00001583437 ENST00000409195 XIRP2
## [2] ENSG00000163092 ENSE00001580025 ENST00000409195 XIRP2
## [3] ENSG00000163092 ENSE00001580025 ENST00000409195 XIRP2
## [4] ENSG00000163092 ENSE00001299302 ENST00000409195 XIRP2
## [5] ENSG00000163092 ENSE00003591481 ENST00000409195 XIRP2
## ... ... ... ... ...
## [9] ENSG00000163092 ENSE00001317162 ENST00000409195 XIRP2
## [10] ENSG00000163092 ENSE00001276753 ENST00000409195 XIRP2
## [11] ENSG00000163092 ENSE00003607800 ENST00000409195 XIRP2
## [12] ENSG00000163092 ENSE00003607800 ENST00000409195 XIRP2
## [13] ENSG00000163092 ENSE00001578286 ENST00000409195 XIRP2
## rank phase
## <numeric> <integer>
## [1] 1 -1
## [2] 2 -1
## [3] 2 -1
## [4] 3 0
## [5] 4 1
## ... ... ...
## [9] 8 1
## [10] 9 0
## [11] 10 0
## [12] 10 -1
## [13] 11 -1
##
## ...
## <8 more elements>
## -------
## seqinfo: 8 sequences from an unspecified genome; no seqlengths
We also need to retain only coding exons and discard untranslated regions for later peptide mapping, using the proteinCoding
function.
pcgrl <- proteinCoding(grl)
pcgrl
## GRangesList object of length 9:
## $ENST00000409195
## GRanges object with 9 ranges and 7 metadata columns:
## seqnames ranges strand | feature
## <Rle> <IRanges> <Rle> | <character>
## [1] chr2 [166903483, 166903890] + | protein_coding
## [2] chr2 [167135909, 167136062] + | protein_coding
## [3] chr2 [167210735, 167210895] + | protein_coding
## [4] chr2 [167218166, 167218300] + | protein_coding
## [5] chr2 [167239855, 167239965] + | protein_coding
## [6] chr2 [167240664, 167240736] + | protein_coding
## [7] chr2 [167241777, 167241910] + | protein_coding
## [8] chr2 [167242569, 167251947] + | protein_coding
## [9] chr2 [167254032, 167254126] + | protein_coding
## gene exon transcript symbol
## <character> <character> <character> <character>
## [1] ENSG00000163092 ENSE00001580025 ENST00000409195 XIRP2
## [2] ENSG00000163092 ENSE00001299302 ENST00000409195 XIRP2
## [3] ENSG00000163092 ENSE00003591481 ENST00000409195 XIRP2
## [4] ENSG00000163092 ENSE00001324449 ENST00000409195 XIRP2
## [5] ENSG00000163092 ENSE00001300610 ENST00000409195 XIRP2
## [6] ENSG00000163092 ENSE00001320657 ENST00000409195 XIRP2
## [7] ENSG00000163092 ENSE00001317162 ENST00000409195 XIRP2
## [8] ENSG00000163092 ENSE00001276753 ENST00000409195 XIRP2
## [9] ENSG00000163092 ENSE00003607800 ENST00000409195 XIRP2
## rank phase
## <numeric> <integer>
## [1] 2 -1
## [2] 3 0
## [3] 4 1
## [4] 5 0
## [5] 6 0
## [6] 7 0
## [7] 8 1
## [8] 9 0
## [9] 10 0
##
## ...
## <8 more elements>
## -------
## seqinfo: 8 sequences from an unspecified genome; no seqlengths
The peptides that have been experimentally observed are available as ranges (coordinates) along the protein sequences. For example, below, we isolate and visualise the 5 peptides () have been identified for our protein of interest P00558.
sort(pranges(p)[5])
## IRangesList of length 1
## $P00558
## IRanges object with 5 ranges and 28 metadata columns:
## start end width | DB AccessionNumber EntryName
## <integer> <integer> <integer> | <Rle> <character> <character>
## P00558 124 139 16 | sp P00558 PGK1_HUMAN
## P00558 193 206 14 | sp P00558 PGK1_HUMAN
## P00558 268 275 8 | sp P00558 PGK1_HUMAN
## P00558 333 350 18 | sp P00558 PGK1_HUMAN
## P00558 351 361 11 | sp P00558 PGK1_HUMAN
## IsoformName ProteinName
## <Rle> <character>
## P00558 <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
## P00558 <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
## P00558 <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
## P00558 <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
## P00558 <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
## OrganismName GeneName ProteinExistence SequenceVersion
## <Rle> <Rle> <Rle> <Rle>
## P00558 Homo sapiens PGK1 Evidence at protein level 3
## P00558 Homo sapiens PGK1 Evidence at protein level 3
## P00558 Homo sapiens PGK1 Evidence at protein level 3
## P00558 Homo sapiens PGK1 Evidence at protein level 3
## P00558 Homo sapiens PGK1 Evidence at protein level 3
## Comment spectrumID chargeState rank passThreshold
## <Rle> <factor> <integer> <integer> <logical>
## P00558 <NA> index=132 2 1 TRUE
## P00558 <NA> index=135 2 1 TRUE
## P00558 <NA> index=57 2 1 TRUE
## P00558 <NA> index=27 3 1 TRUE
## P00558 <NA> index=34 2 1 TRUE
## experimentalMassToCharge calculatedMassToCharge
## <numeric> <numeric>
## P00558 866.9158 866.4185
## P00558 834.4127 833.9254
## P00558 461.7346 461.2391
## P00558 702.3500 702.3583
## P00558 596.3218 595.8261
## sequence modNum isDecoy post pre
## <factor> <integer> <logical> <factor> <factor>
## P00558 FHVEEEGKGKDASGNK 0 FALSE V R
## P00558 ELNYFAKALESPER 0 FALSE P K
## P00558 DLMSKAEK 0 FALSE N K
## P00558 QIVWNGPVGVFEWEAFAR 0 FALSE G K
## P00558 GTKALMDEVVK 0 FALSE A R
## start end DatabaseAccess DBseqLength DatabaseSeq
## <integer> <integer> <factor> <integer> <factor>
## P00558 124 139 sp|P00558|PGK1_HUMAN 417
## P00558 193 206 sp|P00558|PGK1_HUMAN 417
## P00558 268 275 sp|P00558|PGK1_HUMAN 417
## P00558 333 350 sp|P00558|PGK1_HUMAN 417
## P00558 351 361 sp|P00558|PGK1_HUMAN 417
## acquisitionNum filenames
## <numeric> <Rle>
## P00558 132 Thermo_Hela_PRTC_selected.mzid
## P00558 135 Thermo_Hela_PRTC_selected.mzid
## P00558 57 Thermo_Hela_PRTC_selected.mzid
## P00558 27 Thermo_Hela_PRTC_selected.mzid
## P00558 34 Thermo_Hela_PRTC_selected.mzid
plot(p[5])
We can also plot the transcript regions inluding (grl
) or exclusing (pcgrl
) the untranslated regions.
plotAsGeneRegionTrack(grl[[5]], pcgrl[[5]])
The aim of this document is to document the mapping of peptides, i.e. ranges along a protein sequence to ranges along the genome reference. In other words, our aim is the convert protein coordinates to genome coordinates.
The first check that we want to implement is to verify that we can regenerate the protein amino acid sequence from the genome regions that we have extracted.
We also need the actual genome sequence (so far, we have only dealt with regions and features). The exons coordinates have been retrieved from the latest Ensembl release, which is based on the human genome assembly GRCh38
. We will use a genome package that is based on the same reference genome, namely BSgenome.Hsapiens.NCBI.GRCh38.
We need to make sure that the chromosomes are named the same way in the genome sequence data and our genomics ranges ("chrX"
, as seen above).
library("BSgenome.Hsapiens.NCBI.GRCh38")
## Loading required package: BSgenome
head(seqnames(BSgenome.Hsapiens.NCBI.GRCh38))
## [1] "1" "2" "3" "4" "5" "6"
if (!"chr1" %in% seqnames(BSgenome.Hsapiens.NCBI.GRCh38))
seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23] <-
paste0("chr", seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23])
seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[21:27]
## [1] "chr21" "chr22"
## [3] "chrX" "Y"
## [5] "MT" "HSCHR1_CTG1_UNLOCALIZED"
## [7] "HSCHR1_CTG2_UNLOCALIZED"
Once we have extracted the actual sequences, we must also make sure that we we reverse the sequences in case out genomic features are on the reverse strand. We the combine (unlist
) the exons (coding sequences only, pcgrl
) and translate then into a protein sequence.
s <- getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl[[5]])
s
## A DNAStringSet instance of length 11
## width seq
## [1] 65 ATGTCGCTTTCTAACAAGCTGACGCTGGACA...GGACGTTAAAGGGAAGCGGGTCGTTATGAG
## [2] 51 AGTCGACTTCAATGTTCCTATGAAGAACAACCAGATAACAAACAACCAGAG
## [3] 156 GATTAAGGCTGCTGTCCCAAGCATCAAATTC...TGCTGTAGAACTCAAATCTCTGCTGGGCAA
## [4] 145 GGATGTTCTGTTCTTGAAGGACTGTGTAGGC...GGGAAGGGAAAAGATGCTTCTGGGAACAAG
## [5] 104 GTTAAAGCCGAGCCAGCCAAAATAGAAGCTT...TGCTTTTGGCACTGCTCACAGAGCCCACAG
## ... ... ...
## [7] 115 AGCTAAAGTTGCAGACAAGATCCAGCTCATC...ACCTTCCTTAAGGTGCTCAACAACATGGAG
## [8] 180 ATTGGCACTTCTCTGTTTGATGAAGAGGGAG...GTGGCTTCTGGCATACCTGCTGGCTGGATG
## [9] 178 GGCTTGGACTGTGGTCCTGAAAGCAGCAAGA...CCACTTCTAGGGGCTGCATCACCATCATAG
## [10] 99 GTGGTGGAGACACTGCCACTTGCTGTGCCAA...GGGGTGGTGCCAGTTTGGAGCTCCTGGAAG
## [11] 41 GTAAAGTCCTTCCTGGGGTGGATGCTCTCAGCAATATTTAG
if (isReverse(pcgrl[[5]]))
s <- rev(s)
aaseq <- translate(unlist(s))
aaseq
## 418-letter "AAString" instance
## seq: MSLSNKLTLDKLDVKGKRVVMRVDFNVPMKNNQI...EDKVSHVSTGGGASLELLEGKVLPGVDALSNI*
We verify that the translated genome sequence and the protein squence we started with match by aligning them.
writePairwiseAlignments(pairwiseAlignment(aa(p[5]), aaseq))
## ########################################
## # Program: Biostrings (version 2.42.0), a Bioconductor package
## # Rundate: Mon Oct 17 19:32:07 2016
## ########################################
## #=======================================
## #
## # Aligned_sequences: 2
## # 1: P00558
## # 2: S1
## # Matrix: NA
## # Gap_penalty: 14.0
## # Extend_penalty: 4.0
## #
## # Length: 418
## # Identity: 417/418 (99.8%)
## # Similarity: NA/418 (NA%)
## # Gaps: 1/418 (0.2%)
## # Score: 1780.636
## #
## #
## #=======================================
##
## P00558 1 MSLSNKLTLDKLDVKGKRVVMRVDFNVPMKNNQITNNQRIKAAVPSIKFC 50
## ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1 1 MSLSNKLTLDKLDVKGKRVVMRVDFNVPMKNNQITNNQRIKAAVPSIKFC 50
##
## P00558 51 LDNGAKSVVLMSHLGRPDGVPMPDKYSLEPVAVELKSLLGKDVLFLKDCV 100
## ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1 51 LDNGAKSVVLMSHLGRPDGVPMPDKYSLEPVAVELKSLLGKDVLFLKDCV 100
##
## P00558 101 GPEVEKACANPAAGSVILLENLRFHVEEEGKGKDASGNKVKAEPAKIEAF 150
## ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1 101 GPEVEKACANPAAGSVILLENLRFHVEEEGKGKDASGNKVKAEPAKIEAF 150
##
## P00558 151 RASLSKLGDVYVNDAFGTAHRAHSSMVGVNLPQKAGGFLMKKELNYFAKA 200
## ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1 151 RASLSKLGDVYVNDAFGTAHRAHSSMVGVNLPQKAGGFLMKKELNYFAKA 200
##
## P00558 201 LESPERPFLAILGGAKVADKIQLINNMLDKVNEMIIGGGMAFTFLKVLNN 250
## ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1 201 LESPERPFLAILGGAKVADKIQLINNMLDKVNEMIIGGGMAFTFLKVLNN 250
##
## P00558 251 MEIGTSLFDEEGAKIVKDLMSKAEKNGVKITLPVDFVTADKFDENAKTGQ 300
## ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1 251 MEIGTSLFDEEGAKIVKDLMSKAEKNGVKITLPVDFVTADKFDENAKTGQ 300
##
## P00558 301 ATVASGIPAGWMGLDCGPESSKKYAEAVTRAKQIVWNGPVGVFEWEAFAR 350
## ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1 301 ATVASGIPAGWMGLDCGPESSKKYAEAVTRAKQIVWNGPVGVFEWEAFAR 350
##
## P00558 351 GTKALMDEVVKATSRGCITIIGGGDTATCCAKWNTEDKVSHVSTGGGASL 400
## ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1 351 GTKALMDEVVKATSRGCITIIGGGDTATCCAKWNTEDKVSHVSTGGGASL 400
##
## P00558 401 ELLEGKVLPGVDALSNI- 417
## |||||||||||||||||
## S1 401 ELLEGKVLPGVDALSNI* 418
##
##
## #---------------------------------------
## #---------------------------------------
We can now calculate the peptide coordinate along the genome using the position of the peptides along the protein (in p
) and the position of the exons of the protein’s transcript along the genome (in pcgrl
) using the mapToGenome
function.
res <- mapToGenome(p[5], pcgrl[5])
res[[1]]
## GRanges object with 5 ranges and 7 metadata columns:
## seqnames ranges strand | pepseq
## <Rle> <IRanges> <Rle> | <character>
## [1] chrX [78118106, 78118147] + | ELNYFAKALESPER
## [2] chrX [78123240, 78123263] + | DLMSKAEK
## [3] chrX [78124934, 78124987] + | QIVWNGPVGVFEWEAFAR
## [4] chrX [78114113, 78114160] + | FHVEEEGKGKDASGNK
## [5] chrX [78124988, 78125020] + | GTKALMDEVVK
## accession gene transcript symbol
## <character> <character> <character> <character>
## [1] P00558 ENSG00000102144 ENST00000373316 PGK1
## [2] P00558 ENSG00000102144 ENST00000373316 PGK1
## [3] P00558 ENSG00000102144 ENST00000373316 PGK1
## [4] P00558 ENSG00000102144 ENST00000373316 PGK1
## [5] P00558 ENSG00000102144 ENST00000373316 PGK1
## exonJunctions group
## <logical> <integer>
## [1] FALSE 1
## [2] FALSE 2
## [3] FALSE 3
## [4] FALSE 4
## [5] FALSE 5
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Based on the new peptide genomic coordinates, it is now straightforward to create a new AnnotationTrack
and add it the the track visualisation.
plotAsAnnotationTrack(res[[1]], grl[[5]])
Finally, we customise the figure by adding a track with the \(MS^2\) spectra. The raw data used to search the protein database an create p
are available as an MSnExp
object.
data(pms)
library("ggplot2")
details <- function(identifier, ...) {
p <- plot(pms[[as.numeric(identifier)]], full=TRUE, plot=FALSE) + ggtitle("")
p <- p + theme_bw() + theme(axis.text.y = element_blank(),
axis.text.x = element_blank()) +
labs(x = NULL, y = NULL)
print(p, newpage=FALSE)
}
res <- res[[1]]
deTrack <- AnnotationTrack(start = start(res),
end = end(res),
genome = "hg38", chromosom = "chrX",
id = pcols(p)[[5]]$acquisitionNum,
name = "MS2 spectra",
stacking = "squish", fun = details)
grTrack <- GeneRegionTrack(grl[[5]],
name = acols(p)$ENST[5])
ideoTrack <- IdeogramTrack(genome = "hg38",
chromosome = "chrX")
axisTrack <- GenomeAxisTrack()
plotTracks(list(ideoTrack, axisTrack, deTrack, grTrack),
add53 = TRUE, add35 = TRUE)
Proteins
objectAbove, we have demonstrated the mapToGenome
functionality on one protein only. The same operation can be performed on all the 9 of the p
object using the pmapToGenome
equivalent using the 9 ranges calculated with etrid2grl
. The pmapToGenome
will map the peptides of i-th protein to the i-th genomic location of pcgrl
.
pres <- pmapToGenome(p, pcgrl)
pres
## GRangesList object of length 9:
## $ENST00000409195
## GRanges object with 36 ranges and 7 metadata columns:
## seqnames ranges strand | pepseq
## <Rle> <IRanges> <Rle> | <character>
## chr2 [167249619, 167249672] + | QEITQNKSFFSSVKESQR
## chr2 [167239915, 167239950] + | LPVPKDVYSKQR
## chr2 [167246964, 167247002] + | EQNNDALEKSLRR
## chr2 [167246487, 167246516] + | SLKESSHRWK
## chr2 [167249256, 167249303] + | LKMVPRKQREFSGSDR
## . ... ... ... . ...
## chr2 [166903540, 166903575] + | PESGFAEDSAAR
## chr2 [167246526, 167246579] + | QPDAIPGDIEKAIECLEK
## chr2 [166903624, 166903665] + | MARYQAAVSRGDCR
## chr2 [167247636, 167247674] + | TNTSTGLKMAMER
## chr2 [167249619, 167249660] + | QEITQNKSFFSSVK
## accession gene transcript symbol exonJunctions
## <character> <character> <character> <character> <logical>
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## . ... ... ... ... ...
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## A4UGR9 ENSG00000163092 ENST00000409195 XIRP2 FALSE
## group
## <integer>
## 1
## 2
## 3
## 4
## 5
## . ...
## 32
## 33
## 34
## 35
## 36
##
## ...
## <8 more elements>
## -------
## seqinfo: 8 sequences from an unspecified genome; no seqlengths
k <- 6
seqnames(p)[k]
## [1] "P02545"
In the code chunk below, we remind ourselves that, querying the Ensembl Biomart server for P02545, we obtain several possible transcript identifiers, including the identifier of interest ENST00000368300.
sel <- bm$uniprot_swissprot == seqnames(p)[k]
bm[sel, ]
## uniprot_swissprot hgnc_symbol ensembl_transcript_id
## 18 P02545 LMNA ENST00000368301
## 19 P02545 LMNA ENST00000368300
## 20 P02545 LMNA ENST00000368299
## 21 P02545 LMNA ENST00000448611
## 22 P02545 LMNA ENST00000473598
## 23 P02545 LMNA ENST00000347559
acols(p)$ENST[k]
## P02545
## "ENST00000368300"
Let’s fetch the coordinates of all possible transcipts, making sure that the names of the Ensembl identifiers are used to name the grl
ranges (using use.names = TRUE
).
eid <- bm[sel, 3]
names(eid) <- bm[sel, 1]
eid
## P02545 P02545 P02545 P02545
## "ENST00000368301" "ENST00000368300" "ENST00000368299" "ENST00000448611"
## P02545 P02545
## "ENST00000473598" "ENST00000347559"
grl <- etrid2grl(eid, ens, use.names = TRUE)
pcgrl <- proteinCoding(grl)
plotAsGeneRegionTrack(pcgrl)
We extract the transcript sequences, translate them into protein sequences and align each to our protein sequence (originally imported from the fasta database, see ?Proteins
for the construction of p
).
lseq <- lapply(getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl),
function(s) translate(unlist(s)))
laln <- sapply(lseq, pairwiseAlignment, aa(p[k]))
sapply(laln, nmatch)/width(aa(p[k]))
## P02545 P02545 P02545 P02545 P02545 P02545
## 0.8614458 1.0000000 0.9246988 0.8298193 0.8358434 0.9548193
We see that transcript number 2, ENST00000368300, perfectly aligns with our protein sequence. This is also the transcipt that corresponds to the curated Ensembl transcript in acols(p)$ENST
.
ki <- which.max(sapply(laln, nmatch))
stopifnot(eid[ki] == acols(p)$ENST[k])
res <- pmapToGenome(p[k], pcgrl[ki])
As shown on the next figure, peptides that span over exon junctions are grouped together and, below, colour-coded.
plotAsAnnotationTrack(res[[1]], pcgrl[[ki]])
One can also apply a many-to-one mapping approach to all proteins in the p
object and all the transcripts identifiers fetched with etrid2grl
as shown below.
alleid <- bm[, 3]
names(alleid) <- bm[, 1]
grl <- etrid2grl(alleid, ens, use.names = TRUE)
pcgrl <- proteinCoding(grl)
res <- mapToGenome(p, pcgrl)
## Mapping 8 out of 9 peptide ranges.
## Warning: Mapping failed. Returning an empty range.
## Warning: Mapping failed. Returning an empty range.
## Warning: Mapping failed. Returning an empty range.
## Warning: Mapping failed. Returning an empty range.
## Warning: Mapping failed. Returning an empty range.
## Warning: Mapping failed. Returning an empty range.
## Warning: Mapping failed. Returning an empty range.
## Warning: Mapping failed. Returning an empty range.
## Warning: Mapping failed. Returning an empty range.
## Warning: Mapping failed. Returning an empty range.
length(res)
## [1] 22
The messages indicate that one protein accession number was not found in the pcgrl
ranges (no transcript was found) and several mapping failed. In total, we obtain 22 mapping for 8 protein accession numbers.
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] ggplot2_2.1.0
## [2] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
## [3] BSgenome_1.42.0
## [4] rtracklayer_1.34.0
## [5] AnnotationHub_2.6.0
## [6] biomaRt_2.30.0
## [7] mzID_1.12.0
## [8] Biostrings_2.42.0
## [9] XVector_0.14.0
## [10] Pbase_0.14.0
## [11] Gviz_1.18.0
## [12] GenomicRanges_1.26.0
## [13] GenomeInfoDb_1.10.0
## [14] IRanges_2.8.0
## [15] S4Vectors_0.12.0
## [16] Rcpp_0.12.7
## [17] BiocGenerics_0.20.0
## [18] BiocStyle_2.2.0
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.6.0 bitops_1.0-6
## [3] matrixStats_0.51.0 doParallel_1.0.10
## [5] RColorBrewer_1.1-2 httr_1.2.1
## [7] MSnbase_2.0.0 tools_3.3.1
## [9] R6_2.2.0 affyio_1.44.0
## [11] rpart_4.1-10 cleaver_1.12.0
## [13] Hmisc_3.17-4 DBI_0.5-1
## [15] colorspace_1.2-7 nnet_7.3-12
## [17] gridExtra_2.2.1 Pviz_1.8.0
## [19] curl_2.1 preprocessCore_1.36.0
## [21] chron_2.3-47 Biobase_2.34.0
## [23] formatR_1.4 labeling_0.3
## [25] scales_0.4.0 affy_1.52.0
## [27] stringr_1.1.0 digest_0.6.10
## [29] Rsamtools_1.26.0 foreign_0.8-67
## [31] rmarkdown_1.1 dichromat_2.0-0
## [33] htmltools_0.3.5 ensembldb_1.6.0
## [35] limma_3.30.0 RSQLite_1.0.0
## [37] impute_1.48.0 BiocInstaller_1.24.0
## [39] shiny_0.14.1 BiocParallel_1.8.0
## [41] acepack_1.3-3.3 VariantAnnotation_1.20.0
## [43] RCurl_1.95-4.8 magrittr_1.5
## [45] Formula_1.2-1 MALDIquant_1.15
## [47] Matrix_1.2-7.1 munsell_0.4.3
## [49] vsn_3.42.0 stringi_1.1.2
## [51] yaml_2.1.13 SummarizedExperiment_1.4.0
## [53] zlibbioc_1.20.0 plyr_1.8.4
## [55] lattice_0.20-34 splines_3.3.1
## [57] GenomicFeatures_1.26.0 mzR_2.8.0
## [59] knitr_1.14 reshape2_1.4.1
## [61] codetools_0.2-15 XML_3.98-1.4
## [63] evaluate_0.10 biovizBase_1.22.0
## [65] latticeExtra_0.6-28 pcaMethods_1.66.0
## [67] data.table_1.9.6 httpuv_1.3.3
## [69] foreach_1.4.3 gtable_0.2.0
## [71] assertthat_0.1 mime_0.5
## [73] xtable_1.8-2 survival_2.39-5
## [75] tibble_1.2 iterators_1.0.8
## [77] GenomicAlignments_1.10.0 AnnotationDbi_1.36.0
## [79] cluster_2.0.5 interactiveDisplayBase_1.12.0