Mapping experimental MS data to genomic coordinates

0.1 Introduction

The aim of this vignette is to document the mapping of proteins and the tandem mass spectrometry-derived peptides to genomic locations.

Mapping proteins to a genome reference.

0.2 Protein and genome data

We will use a small Proteins object from the Pbase package to illustrate how to retrieve genome coordinates and map a peptides back to genomic coordinates. See the Pbase-data vignette for an introduction to Proteins data. The main information needed in this vignette consists of protein UniProt identifiers and a set of peptides positions along the protein sequence.

library("Pbase")
data(p)
p

## S4 class type     : Proteins
## Class version     : 0.1
## Created           : Sat Feb 14 12:36:40 2015
## Number of Proteins: 9
## Sequences:
##   [1] A4UGR9 [2] A6H8Y1 ... [8] P04075-2 [9] P60709
## Sequence features:
##   [1] DB [2] AccessionNumber ... [12] npeps [13] ENST
## Peptide features:
##   [1] DB [2] AccessionNumber ... [27] acquisitionNum [28] filenames

seqnames(p)

## [1] "A4UGR9"   "A6H8Y1"   "O43707"   "O75369"   "P00558"   "P02545"  
## [7] "P04075"   "P04075-2" "P60709"

pcols(p)[1, c("start", "end")]

## SplitDataFrameList of length 1
## $A4UGR9
## DataFrame with 36 rows and 2 columns
##         start       end
##     <integer> <integer>
## 1        2743      2760
## 2         307       318
## 3        1858      1870
## 4        1699      1708
## 5        2622      2637
## ...       ...       ...
## 32         20        31
## 33       1712      1729
## 34         48        61
## 35       2082      2094
## 36       2743      2756

We will also require an identifier relating the protein feature of interest to the genome. Below, we use biomaRt to query the matching Ensembl transcript identifier. We start by create a Mart object that stores the connection details to the latest human Ensembl biomart server.

library("biomaRt")
ens <- useMart("ensembl", "hsapiens_gene_ensembl")
ens

## Object of class 'Mart':
##  Using the ENSEMBL_MART_ENSEMBL BioMart database
##  Using the hsapiens_gene_ensembl dataset

bm <- select(ens, keys = seqnames(p),
             keytype = "uniprot_swissprot",
             columns = c(
                 "uniprot_swissprot",
                 "hgnc_symbol",
                 "ensembl_transcript_id"))

bm

##    uniprot_swissprot hgnc_symbol ensembl_transcript_id
## 1             A4UGR9       XIRP2       ENST00000409043
## 2             A4UGR9       XIRP2       ENST00000409728
## 3             A4UGR9       XIRP2       ENST00000409195
## 4             A4UGR9       XIRP2       ENST00000409273
## 5             A4UGR9       XIRP2       ENST00000409605
## 6             A4UGR9       XIRP2       ENST00000628543
## 7             A6H8Y1        BDP1       ENST00000358731
## 8             A6H8Y1        BDP1       ENST00000617085
## 9             O43707       ACTN4       ENST00000252699
## 10            O43707       ACTN4       ENST00000390009
## 11            O75369        FLNB       ENST00000490882
## 12            O75369        FLNB       ENST00000295956
## 13            O75369        FLNB       ENST00000358537
## 14            O75369        FLNB       ENST00000429972
## 15            P00558        PGK1       ENST00000373316
## 16            P02545        LMNA       ENST00000368301
## 17            P02545        LMNA       ENST00000368300
## 18            P02545        LMNA       ENST00000368299
## 19            P02545        LMNA       ENST00000448611
## 20            P02545        LMNA       ENST00000473598
## 21            P02545        LMNA       ENST00000347559
## 22            P04075       ALDOA       ENST00000338110
## 23            P04075       ALDOA       ENST00000395248
## 24            P04075       ALDOA       ENST00000566897
## 25            P04075       ALDOA       ENST00000569545
## 26            P04075       ALDOA       ENST00000563060
## 27            P04075       ALDOA       ENST00000412304
## 28            P04075       ALDOA       ENST00000564546
## 29            P04075       ALDOA       ENST00000564595
## 30            P60709        ACTB       ENST00000331789

As can be seen, there can be multiple transcripts for one protein accession. We have defined the transcripts of interest for our proteins in p; they are stored as protein elements metadata:

acols(p)$ENST

##            A4UGR9            A6H8Y1            O43707            O75369 
## "ENST00000409195" "ENST00000358731" "ENST00000252699" "ENST00000295956" 
##            P00558            P02545            P04075          P04075-2 
## "ENST00000373316" "ENST00000368300" "ENST00000338110" "ENST00000395248" 
##            P60709 
## "ENST00000331789"

0.3 Genomic transcript coordinates

The etrid2grl function takes our transcript identifiers and will query the Ensembl biomart server (note the ens argument) and return a GRangesList object. For each of Ensembl transcript identifiers provided as input, we have the genomic coordinates of that transcript’s exons as well as additional information such as the type of exons (protein coding or untranslated region).

grl <- etrid2grl(acols(p)$ENST, ens)
all.equal(names(grl), acols(p)$ENST,
          check.attributes=FALSE)

## [1] TRUE

grl

## GRangesList object of length 9:
## $ENST00000409195 
## GRanges object with 13 ranges and 7 metadata columns:
##        seqnames                 ranges strand |        feature
##           <Rle>              <IRanges>  <Rle> |    <character>
##    [1]     chr2 [166888487, 166888557]      + |           utr5
##    [2]     chr2 [166903465, 166903482]      + |           utr5
##    [3]     chr2 [166903483, 166903890]      + | protein_coding
##    [4]     chr2 [167135909, 167136062]      + | protein_coding
##    [5]     chr2 [167210735, 167210895]      + | protein_coding
##    ...      ...                    ...    ... .            ...
##    [9]     chr2 [167241777, 167241910]      + | protein_coding
##   [10]     chr2 [167242569, 167251947]      + | protein_coding
##   [11]     chr2 [167254032, 167254126]      + | protein_coding
##   [12]     chr2 [167254127, 167254165]      + |           utr3
##   [13]     chr2 [167257857, 167259753]      + |           utr3
##                   gene            exon      transcript      symbol
##            <character>     <character>     <character> <character>
##    [1] ENSG00000163092 ENSE00001583437 ENST00000409195       XIRP2
##    [2] ENSG00000163092 ENSE00001580025 ENST00000409195       XIRP2
##    [3] ENSG00000163092 ENSE00001580025 ENST00000409195       XIRP2
##    [4] ENSG00000163092 ENSE00001299302 ENST00000409195       XIRP2
##    [5] ENSG00000163092 ENSE00003591481 ENST00000409195       XIRP2
##    ...             ...             ...             ...         ...
##    [9] ENSG00000163092 ENSE00001317162 ENST00000409195       XIRP2
##   [10] ENSG00000163092 ENSE00001276753 ENST00000409195       XIRP2
##   [11] ENSG00000163092 ENSE00003607800 ENST00000409195       XIRP2
##   [12] ENSG00000163092 ENSE00003607800 ENST00000409195       XIRP2
##   [13] ENSG00000163092 ENSE00001578286 ENST00000409195       XIRP2
##             rank     phase
##        <numeric> <integer>
##    [1]         1        -1
##    [2]         2        -1
##    [3]         2        -1
##    [4]         3         0
##    [5]         4         1
##    ...       ...       ...
##    [9]         8         1
##   [10]         9         0
##   [11]        10         0
##   [12]        10        -1
##   [13]        11        -1
## 
## ...
## <8 more elements>
## -------
## seqinfo: 8 sequences from an unspecified genome; no seqlengths

We also need to retain only coding exons and discard untranslated regions for later peptide mapping, using the proteinCoding function.

pcgrl <- proteinCoding(grl)
pcgrl

## GRangesList object of length 9:
## $ENST00000409195 
## GRanges object with 9 ranges and 7 metadata columns:
##       seqnames                 ranges strand |        feature
##          <Rle>              <IRanges>  <Rle> |    <character>
##   [1]     chr2 [166903483, 166903890]      + | protein_coding
##   [2]     chr2 [167135909, 167136062]      + | protein_coding
##   [3]     chr2 [167210735, 167210895]      + | protein_coding
##   [4]     chr2 [167218166, 167218300]      + | protein_coding
##   [5]     chr2 [167239855, 167239965]      + | protein_coding
##   [6]     chr2 [167240664, 167240736]      + | protein_coding
##   [7]     chr2 [167241777, 167241910]      + | protein_coding
##   [8]     chr2 [167242569, 167251947]      + | protein_coding
##   [9]     chr2 [167254032, 167254126]      + | protein_coding
##                  gene            exon      transcript      symbol
##           <character>     <character>     <character> <character>
##   [1] ENSG00000163092 ENSE00001580025 ENST00000409195       XIRP2
##   [2] ENSG00000163092 ENSE00001299302 ENST00000409195       XIRP2
##   [3] ENSG00000163092 ENSE00003591481 ENST00000409195       XIRP2
##   [4] ENSG00000163092 ENSE00001324449 ENST00000409195       XIRP2
##   [5] ENSG00000163092 ENSE00001300610 ENST00000409195       XIRP2
##   [6] ENSG00000163092 ENSE00001320657 ENST00000409195       XIRP2
##   [7] ENSG00000163092 ENSE00001317162 ENST00000409195       XIRP2
##   [8] ENSG00000163092 ENSE00001276753 ENST00000409195       XIRP2
##   [9] ENSG00000163092 ENSE00003607800 ENST00000409195       XIRP2
##            rank     phase
##       <numeric> <integer>
##   [1]         2        -1
##   [2]         3         0
##   [3]         4         1
##   [4]         5         0
##   [5]         6         0
##   [6]         7         0
##   [7]         8         1
##   [8]         9         0
##   [9]        10         0
## 
## ...
## <8 more elements>
## -------
## seqinfo: 8 sequences from an unspecified genome; no seqlengths

0.4 Visualisation with Gviz and Pviz

0.4.1 Peptides along proteins

The peptides that have been experimentally observed are available as ranges (coordinates) along the protein sequences. For example, below, we isolate and visualise the 5 peptides () have been identified for our protein of interest P00558.

sort(pranges(p)[5])

## IRangesList of length 1
## $P00558
## IRanges object with 5 ranges and 28 metadata columns:
##              start       end     width |    DB AccessionNumber   EntryName
##          <integer> <integer> <integer> | <Rle>     <character> <character>
##   P00558       124       139        16 |    sp          P00558  PGK1_HUMAN
##   P00558       193       206        14 |    sp          P00558  PGK1_HUMAN
##   P00558       268       275         8 |    sp          P00558  PGK1_HUMAN
##   P00558       333       350        18 |    sp          P00558  PGK1_HUMAN
##   P00558       351       361        11 |    sp          P00558  PGK1_HUMAN
##          IsoformName                                    ProteinName
##                <Rle>                                    <character>
##   P00558        <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
##   P00558        <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
##   P00558        <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
##   P00558        <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
##   P00558        <NA> sp|P00558|PGK1_HUMAN Phosphoglycerate kinase 1
##          OrganismName GeneName          ProteinExistence SequenceVersion
##                 <Rle>    <Rle>                     <Rle>           <Rle>
##   P00558 Homo sapiens     PGK1 Evidence at protein level               3
##   P00558 Homo sapiens     PGK1 Evidence at protein level               3
##   P00558 Homo sapiens     PGK1 Evidence at protein level               3
##   P00558 Homo sapiens     PGK1 Evidence at protein level               3
##   P00558 Homo sapiens     PGK1 Evidence at protein level               3
##          Comment spectrumID chargeState      rank passThreshold
##            <Rle>   <factor>   <integer> <integer>     <logical>
##   P00558    <NA>  index=132           2         1          TRUE
##   P00558    <NA>  index=135           2         1          TRUE
##   P00558    <NA>   index=57           2         1          TRUE
##   P00558    <NA>   index=27           3         1          TRUE
##   P00558    <NA>   index=34           2         1          TRUE
##          experimentalMassToCharge calculatedMassToCharge
##                         <numeric>              <numeric>
##   P00558                 866.9158               866.4185
##   P00558                 834.4127               833.9254
##   P00558                 461.7346               461.2391
##   P00558                 702.3500               702.3583
##   P00558                 596.3218               595.8261
##                    sequence    modNum   isDecoy     post      pre
##                    <factor> <integer> <logical> <factor> <factor>
##   P00558   FHVEEEGKGKDASGNK         0     FALSE        V        R
##   P00558     ELNYFAKALESPER         0     FALSE        P        K
##   P00558           DLMSKAEK         0     FALSE        N        K
##   P00558 QIVWNGPVGVFEWEAFAR         0     FALSE        G        K
##   P00558        GTKALMDEVVK         0     FALSE        A        R
##              start       end       DatabaseAccess DBseqLength DatabaseSeq
##          <integer> <integer>             <factor>   <integer>    <factor>
##   P00558       124       139 sp|P00558|PGK1_HUMAN         417            
##   P00558       193       206 sp|P00558|PGK1_HUMAN         417            
##   P00558       268       275 sp|P00558|PGK1_HUMAN         417            
##   P00558       333       350 sp|P00558|PGK1_HUMAN         417            
##   P00558       351       361 sp|P00558|PGK1_HUMAN         417            
##          acquisitionNum                      filenames
##               <numeric>                          <Rle>
##   P00558            132 Thermo_Hela_PRTC_selected.mzid
##   P00558            135 Thermo_Hela_PRTC_selected.mzid
##   P00558             57 Thermo_Hela_PRTC_selected.mzid
##   P00558             27 Thermo_Hela_PRTC_selected.mzid
##   P00558             34 Thermo_Hela_PRTC_selected.mzid

plot(p[5])

0.4.2 Exons along the genome

We can also plot the transcript regions inluding (grl) or exclusing (pcgrl) the untranslated regions.

plotAsGeneRegionTrack(grl[[5]], pcgrl[[5]])

0.5 Mapping peptides back to the genome

The aim of this document is to document the mapping of peptides, i.e. ranges along a protein sequence to ranges along the genome reference. In other words, our aim is the convert protein coordinates to genome coordinates.

0.5.1 Comparing protein and translated DNA sequences

The first check that we want to implement is to verify that we can regenerate the protein amino acid sequence from the genome regions that we have extracted.

We also need the actual genome sequence (so far, we have only dealt with regions and features). The exons coordinates have been retrieved from the latest Ensembl release, which is based on the human genome assembly GRCh38. We will use a genome package that is based on the same reference genome, namely BSgenome.Hsapiens.NCBI.GRCh38.

We need to make sure that the chromosomes are named the same way in the genome sequence data and our genomics ranges ("chrX", as seen above).

library("BSgenome.Hsapiens.NCBI.GRCh38")

## Loading required package: BSgenome

head(seqnames(BSgenome.Hsapiens.NCBI.GRCh38))

## [1] "1" "2" "3" "4" "5" "6"

if (!"chr1" %in% seqnames(BSgenome.Hsapiens.NCBI.GRCh38))
    seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23] <-
        paste0("chr", seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23])
seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[21:27]

## [1] "chr21"                   "chr22"                  
## [3] "chrX"                    "Y"                      
## [5] "MT"                      "HSCHR1_CTG1_UNLOCALIZED"
## [7] "HSCHR1_CTG2_UNLOCALIZED"

Once we have extracted the actual sequences, we must also make sure that we we reverse the sequences in case out genomic features are on the reverse strand. We the combine (unlist) the exons (coding sequences only, pcgrl) and translate then into a protein sequence.

s <- getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl[[5]])
s

##   A DNAStringSet instance of length 11
##      width seq
##  [1]    65 ATGTCGCTTTCTAACAAGCTGACGCTGGACA...GGACGTTAAAGGGAAGCGGGTCGTTATGAG
##  [2]    51 AGTCGACTTCAATGTTCCTATGAAGAACAACCAGATAACAAACAACCAGAG
##  [3]   156 GATTAAGGCTGCTGTCCCAAGCATCAAATTC...TGCTGTAGAACTCAAATCTCTGCTGGGCAA
##  [4]   145 GGATGTTCTGTTCTTGAAGGACTGTGTAGGC...GGGAAGGGAAAAGATGCTTCTGGGAACAAG
##  [5]   104 GTTAAAGCCGAGCCAGCCAAAATAGAAGCTT...TGCTTTTGGCACTGCTCACAGAGCCCACAG
##  ...   ... ...
##  [7]   115 AGCTAAAGTTGCAGACAAGATCCAGCTCATC...ACCTTCCTTAAGGTGCTCAACAACATGGAG
##  [8]   180 ATTGGCACTTCTCTGTTTGATGAAGAGGGAG...GTGGCTTCTGGCATACCTGCTGGCTGGATG
##  [9]   178 GGCTTGGACTGTGGTCCTGAAAGCAGCAAGA...CCACTTCTAGGGGCTGCATCACCATCATAG
## [10]    99 GTGGTGGAGACACTGCCACTTGCTGTGCCAA...GGGGTGGTGCCAGTTTGGAGCTCCTGGAAG
## [11]    41 GTAAAGTCCTTCCTGGGGTGGATGCTCTCAGCAATATTTAG

if (isReverse(pcgrl[[5]]))
    s <- rev(s)

aaseq <- translate(unlist(s))
aaseq

##   418-letter "AAString" instance
## seq: MSLSNKLTLDKLDVKGKRVVMRVDFNVPMKNNQI...EDKVSHVSTGGGASLELLEGKVLPGVDALSNI*

We verify that the translated genome sequence and the protein squence we started with match by aligning them.

writePairwiseAlignments(pairwiseAlignment(aa(p[5]), aaseq))

## ########################################
## # Program: Biostrings (version 2.40.0), a Bioconductor package
## # Rundate: Sun May 15 21:37:55 2016
## ########################################
## #=======================================
## #
## # Aligned_sequences: 2
## # 1: P00558
## # 2: S1
## # Matrix: NA
## # Gap_penalty: 14.0
## # Extend_penalty: 4.0
## #
## # Length: 418
## # Identity:     417/418 (99.8%)
## # Similarity:    NA/418 (NA%)
## # Gaps:           1/418 (0.2%)
## # Score: 1780.636
## #
## #
## #=======================================
## 
## P00558             1 MSLSNKLTLDKLDVKGKRVVMRVDFNVPMKNNQITNNQRIKAAVPSIKFC     50
##                      ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1                 1 MSLSNKLTLDKLDVKGKRVVMRVDFNVPMKNNQITNNQRIKAAVPSIKFC     50
## 
## P00558            51 LDNGAKSVVLMSHLGRPDGVPMPDKYSLEPVAVELKSLLGKDVLFLKDCV    100
##                      ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1                51 LDNGAKSVVLMSHLGRPDGVPMPDKYSLEPVAVELKSLLGKDVLFLKDCV    100
## 
## P00558           101 GPEVEKACANPAAGSVILLENLRFHVEEEGKGKDASGNKVKAEPAKIEAF    150
##                      ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1               101 GPEVEKACANPAAGSVILLENLRFHVEEEGKGKDASGNKVKAEPAKIEAF    150
## 
## P00558           151 RASLSKLGDVYVNDAFGTAHRAHSSMVGVNLPQKAGGFLMKKELNYFAKA    200
##                      ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1               151 RASLSKLGDVYVNDAFGTAHRAHSSMVGVNLPQKAGGFLMKKELNYFAKA    200
## 
## P00558           201 LESPERPFLAILGGAKVADKIQLINNMLDKVNEMIIGGGMAFTFLKVLNN    250
##                      ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1               201 LESPERPFLAILGGAKVADKIQLINNMLDKVNEMIIGGGMAFTFLKVLNN    250
## 
## P00558           251 MEIGTSLFDEEGAKIVKDLMSKAEKNGVKITLPVDFVTADKFDENAKTGQ    300
##                      ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1               251 MEIGTSLFDEEGAKIVKDLMSKAEKNGVKITLPVDFVTADKFDENAKTGQ    300
## 
## P00558           301 ATVASGIPAGWMGLDCGPESSKKYAEAVTRAKQIVWNGPVGVFEWEAFAR    350
##                      ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1               301 ATVASGIPAGWMGLDCGPESSKKYAEAVTRAKQIVWNGPVGVFEWEAFAR    350
## 
## P00558           351 GTKALMDEVVKATSRGCITIIGGGDTATCCAKWNTEDKVSHVSTGGGASL    400
##                      ||||||||||||||||||||||||||||||||||||||||||||||||||
## S1               351 GTKALMDEVVKATSRGCITIIGGGDTATCCAKWNTEDKVSHVSTGGGASL    400
## 
## P00558           401 ELLEGKVLPGVDALSNI-    417
##                      ||||||||||||||||| 
## S1               401 ELLEGKVLPGVDALSNI*    418
## 
## 
## #---------------------------------------
## #---------------------------------------

0.5.2 Calculating new coordinates

We can now calculate the peptide coordinate along the genome using the position of the peptides along the protein (in p) and the position of the exons of the protein’s transcript along the genome (in pcgrl) using the mapToGenome function.

res <- mapToGenome(p[5], pcgrl[5])
res[[1]]

## GRanges object with 5 ranges and 7 metadata columns:
##       seqnames               ranges strand |             pepseq
##          <Rle>            <IRanges>  <Rle> |        <character>
##   [1]     chrX [78118106, 78118147]      + |     ELNYFAKALESPER
##   [2]     chrX [78123240, 78123263]      + |           DLMSKAEK
##   [3]     chrX [78124934, 78124987]      + | QIVWNGPVGVFEWEAFAR
##   [4]     chrX [78114113, 78114160]      + |   FHVEEEGKGKDASGNK
##   [5]     chrX [78124988, 78125020]      + |        GTKALMDEVVK
##         accession            gene      transcript      symbol
##       <character>     <character>     <character> <character>
##   [1]      P00558 ENSG00000102144 ENST00000373316        PGK1
##   [2]      P00558 ENSG00000102144 ENST00000373316        PGK1
##   [3]      P00558 ENSG00000102144 ENST00000373316        PGK1
##   [4]      P00558 ENSG00000102144 ENST00000373316        PGK1
##   [5]      P00558 ENSG00000102144 ENST00000373316        PGK1
##       exonJunctions     group
##           <logical> <integer>
##   [1]         FALSE         1
##   [2]         FALSE         2
##   [3]         FALSE         3
##   [4]         FALSE         4
##   [5]         FALSE         5
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

0.5.3 Plotting

Based on the new peptide genomic coordinates, it is now straightforward to create a new AnnotationTrack and add it the the track visualisation.

plotAsAnnotationTrack(res[[1]], grl[[5]])

0.5.4 Detailed annotation track

Finally, we customise the figure by adding a track with the $MS^2$ spectra. The raw data used to search the protein database an create p are available as an MSnExp object.

data(pms)

library("ggplot2")
details <- function(identifier, ...) {
    p <- plot(pms[[as.numeric(identifier)]], full=TRUE, plot=FALSE) + ggtitle("") 
    p <- p + theme_bw() + theme(axis.text.y = element_blank(),
                                axis.text.x = element_blank()) + 
                                labs(x = NULL, y = NULL)
    print(p, newpage=FALSE)
}

res <- res[[1]]
deTrack <- AnnotationTrack(start = start(res),
                           end = end(res),
                           genome = "hg38", chromosom = "chrX",
                           id = pcols(p)[[5]]$acquisitionNum,
                           name = "MS2 spectra",
                           stacking = "squish", fun = details)

grTrack <- GeneRegionTrack(grl[[5]],
                           name = acols(p)$ENST[5])
ideoTrack <- IdeogramTrack(genome = "hg38",
                           chromosome = "chrX")
axisTrack <- GenomeAxisTrack()

plotTracks(list(ideoTrack, axisTrack, deTrack, grTrack),
           add53 = TRUE, add35 = TRUE)

0.5.5 Mapping all peptide sets in the `Proteins` object

Above, we have demonstrated the mapToGenome functionality on one protein only. The same operation can be performed on all the 9 of the p object using the pmapToGenome equivalent using the 9 ranges calculated with etrid2grl. The pmapToGenome will map the peptides of i-th protein to the i-th genomic location of pcgrl.

pres <- pmapToGenome(p, pcgrl)
pres

## GRangesList object of length 9:
## $ENST00000409195 
## GRanges object with 36 ranges and 7 metadata columns:
##     seqnames                 ranges strand |             pepseq
##        <Rle>              <IRanges>  <Rle> |        <character>
##         chr2 [167249619, 167249672]      + | QEITQNKSFFSSVKESQR
##         chr2 [167239915, 167239950]      + |       LPVPKDVYSKQR
##         chr2 [167246964, 167247002]      + |      EQNNDALEKSLRR
##         chr2 [167246487, 167246516]      + |         SLKESSHRWK
##         chr2 [167249256, 167249303]      + |   LKMVPRKQREFSGSDR
##   .      ...                    ...    ... .                ...
##         chr2 [166903540, 166903575]      + |       PESGFAEDSAAR
##         chr2 [167246526, 167246579]      + | QPDAIPGDIEKAIECLEK
##         chr2 [166903624, 166903665]      + |     MARYQAAVSRGDCR
##         chr2 [167247636, 167247674]      + |      TNTSTGLKMAMER
##         chr2 [167249619, 167249660]      + |     QEITQNKSFFSSVK
##       accession            gene      transcript      symbol exonJunctions
##     <character>     <character>     <character> <character>     <logical>
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##   .         ...             ...             ...         ...           ...
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##          A4UGR9 ENSG00000163092 ENST00000409195       XIRP2         FALSE
##         group
##     <integer>
##             1
##             2
##             3
##             4
##             5
##   .       ...
##            32
##            33
##            34
##            35
##            36
## 
## ...
## <8 more elements>
## -------
## seqinfo: 8 sequences from an unspecified genome; no seqlengths

0.6 Dealing with multiple transcipts per protein

k <- 6
seqnames(p)[k]

## [1] "P02545"

In the code chunk below, we remind ourselves that, querying the Ensembl Biomart server for P02545, we obtain several possible transcript identifiers, including the identifier of interest ENST00000368300.

sel <- bm$uniprot_swissprot == seqnames(p)[k]
bm[sel, ]

##    uniprot_swissprot hgnc_symbol ensembl_transcript_id
## 16            P02545        LMNA       ENST00000368301
## 17            P02545        LMNA       ENST00000368300
## 18            P02545        LMNA       ENST00000368299
## 19            P02545        LMNA       ENST00000448611
## 20            P02545        LMNA       ENST00000473598
## 21            P02545        LMNA       ENST00000347559

acols(p)$ENST[k]

##            P02545 
## "ENST00000368300"

Let’s fetch the coordinates of all possible transcipts, making sure that the names of the Ensembl identifiers are used to name the grl ranges (using use.names = TRUE).

eid <- bm[sel, 3]
names(eid) <- bm[sel, 1]
eid

##            P02545            P02545            P02545            P02545 
## "ENST00000368301" "ENST00000368300" "ENST00000368299" "ENST00000448611" 
##            P02545            P02545 
## "ENST00000473598" "ENST00000347559"

grl <- etrid2grl(eid, ens, use.names = TRUE)
pcgrl <- proteinCoding(grl)

plotAsGeneRegionTrack(pcgrl)

0.6.1 Descriminating transcripts

We extract the transcript sequences, translate them into protein sequences and align each to our protein sequence (originally imported from the fasta database, see ?Proteins for the construction of p).

lseq <- lapply(getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl),
               function(s) translate(unlist(s)))

laln <- sapply(lseq, pairwiseAlignment, aa(p[k]))
sapply(laln, nmatch)/width(aa(p[k]))

##    P02545    P02545    P02545    P02545    P02545    P02545 
## 0.8614458 1.0000000 0.9246988 0.8298193 0.8358434 0.9548193

We see that transcript number 2, ENST00000368300, perfectly aligns with our protein sequence. This is also the transcipt that corresponds to the curated Ensembl transcript in acols(p)$ENST.

ki <- which.max(sapply(laln, nmatch))
stopifnot(eid[ki] == acols(p)$ENST[k])

res <- pmapToGenome(p[k], pcgrl[ki])

As shown on the next figure, peptides that span over exon junctions are grouped together and, below, colour-coded.

plotAsAnnotationTrack(res[[1]], pcgrl[[ki]])

One can also apply a many-to-one mapping approach to all proteins in the p object and all the transcripts identifiers fetched with etrid2grl as shown below.

alleid <- bm[, 3]
names(alleid) <- bm[, 1]
grl <- etrid2grl(alleid, ens, use.names = TRUE)
pcgrl <- proteinCoding(grl)
res <- mapToGenome(p, pcgrl)

## Mapping 8 out of 9 peptide ranges.

## Warning: Mapping failed. Returning an empty range.

## Warning: Mapping failed. Returning an empty range.

## Warning: Mapping failed. Returning an empty range.

## Warning: Mapping failed. Returning an empty range.

## Warning: Mapping failed. Returning an empty range.

## Warning: Mapping failed. Returning an empty range.

## Warning: Mapping failed. Returning an empty range.

## Warning: Mapping failed. Returning an empty range.

## Warning: Mapping failed. Returning an empty range.

length(res)

## [1] 21

The messages indicate that one protein accession number was not found in the pcgrl ranges (no transcript was found) and several mapping failed. In total, we obtain 21 mapping for 8 protein accession numbers.

0.7 Session information

sessionInfo()

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] ggplot2_2.1.0                         
##  [2] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
##  [3] BSgenome_1.40.0                       
##  [4] rtracklayer_1.32.0                    
##  [5] AnnotationHub_2.4.2                   
##  [6] biomaRt_2.28.0                        
##  [7] mzID_1.10.2                           
##  [8] Biostrings_2.40.0                     
##  [9] XVector_0.12.0                        
## [10] Pbase_0.12.2                          
## [11] Gviz_1.16.1                           
## [12] GenomicRanges_1.24.0                  
## [13] GenomeInfoDb_1.8.2                    
## [14] IRanges_2.6.0                         
## [15] S4Vectors_0.10.0                      
## [16] Rcpp_0.12.5                           
## [17] BiocGenerics_0.18.0                   
## [18] BiocStyle_2.0.2                       
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.32.0                httr_1.1.0                   
##  [3] vsn_3.40.0                    splines_3.3.0                
##  [5] foreach_1.4.3                 Formula_1.2-1                
##  [7] shiny_0.13.2                  interactiveDisplayBase_1.10.3
##  [9] affy_1.50.0                   latticeExtra_0.6-28          
## [11] Rsamtools_1.24.0              impute_1.46.0                
## [13] yaml_2.1.13                   RSQLite_1.0.0                
## [15] lattice_0.20-33               biovizBase_1.20.0            
## [17] limma_3.28.4                  chron_2.3-47                 
## [19] digest_0.6.9                  RColorBrewer_1.1-2           
## [21] colorspace_1.2-6              preprocessCore_1.34.0        
## [23] htmltools_0.3.5               httpuv_1.3.3                 
## [25] Matrix_1.2-6                  plyr_1.8.3                   
## [27] MALDIquant_1.14               XML_3.98-1.4                 
## [29] zlibbioc_1.18.0               xtable_1.8-2                 
## [31] scales_0.4.0                  affyio_1.42.0                
## [33] cleaver_1.10.2                BiocParallel_1.6.2           
## [35] SummarizedExperiment_1.2.2    GenomicFeatures_1.24.2       
## [37] nnet_7.3-12                   survival_2.39-4              
## [39] magrittr_1.5                  mime_0.4                     
## [41] evaluate_0.9                  doParallel_1.0.10            
## [43] foreign_0.8-66                mzR_2.6.2                    
## [45] Pviz_1.6.2                    BiocInstaller_1.22.2         
## [47] tools_3.3.0                   data.table_1.9.6             
## [49] formatR_1.4                   matrixStats_0.50.2           
## [51] stringr_1.0.0                 MSnbase_1.20.5               
## [53] munsell_0.4.3                 cluster_2.0.4                
## [55] AnnotationDbi_1.34.2          ensembldb_1.4.2              
## [57] pcaMethods_1.64.0             RCurl_1.95-4.8               
## [59] dichromat_2.0-0               iterators_1.0.8              
## [61] VariantAnnotation_1.18.1      labeling_0.3                 
## [63] bitops_1.0-6                  rmarkdown_0.9.6              
## [65] gtable_0.2.0                  codetools_0.2-14             
## [67] curl_0.9.7                    DBI_0.4-1                    
## [69] reshape2_1.4.1                R6_2.1.2                     
## [71] GenomicAlignments_1.8.0       gridExtra_2.2.1              
## [73] knitr_1.13                    Hmisc_3.17-4                 
## [75] ProtGenerics_1.4.0            stringi_1.0-1                
## [77] rpart_4.1-10                  acepack_1.3-3.3