1 Introduction

Sequence logos are widely used as a graphical representation of an alignment of multiple amino acid or nucleic acid sequences. The R package seqLogo (Bembom 2006) draws DNA sequence logos, and the package motifStack (Ou 2012) draws sequence logos for amino acid, DNA, and RNA sequences. motifStack can also represent multiple motifs graphically.

iceLogo (Colaert et al. 2009) is a tool written in Java to visualize significantly conserved sequence patterns in an alignment of multiple peptide sequences against background sequences. Compared to WebLogo (Crooks et al. 2004), which relies on information theory, iceLogo builds on probability theory. iceLogo is reported to be more dynamic and to give a more correct and complete analysis of conserved sequence patterns.

However, iceLogo can only compare conserved sequences to reference sequences peptide by peptide. Some sequence patterns are conserved not as individual peptides but as groups defined by properties such as charge, chemistry, and hydrophobicity.

Here we developed an R package, dagLogo, based on iceLogo, to visualize significantly conserved sequence patterns in groups.

2 Prepare environment

You will need Ghostscript. The full path to the executable can be set with the environment variable R_GSCMD. If it is unset, a Ghostscript executable is searched for by name on your path. For example, on Unix, Linux, or Mac, “gs” is searched for; on Windows, the setting of the environment variable GSC is used, otherwise the commands “gswin64c.exe” and then “gswin32c.exe” are tried.
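The lookup order described above can be sketched in base R. This is only an illustration of that order, not R's internal code; the function name find_ghostscript is hypothetical.

```r
# Hypothetical helper mirroring the lookup order described in the text.
find_ghostscript <- function() {
    # 1. An explicit R_GSCMD setting takes precedence.
    gs <- Sys.getenv("R_GSCMD", unset = "")
    if (nzchar(gs)) return(gs)
    if (.Platform$OS.type == "windows") {
        # 2. On Windows, try GSC, then the usual executable names.
        gs <- Sys.getenv("GSC", unset = "")
        if (nzchar(gs)) return(gs)
        candidates <- c("gswin64c.exe", "gswin32c.exe")
    } else {
        # 2. Elsewhere, look for "gs" on the path.
        candidates <- "gs"
    }
    found <- Sys.which(candidates)
    found <- found[nzchar(found)]
    if (length(found) > 0) unname(found[1]) else ""
}

gs_path <- find_ghostscript()
if (nzchar(gs_path)) {
    message("Ghostscript found: ", gs_path)
} else {
    message("Ghostscript not found; consider setting R_GSCMD")
}
```

An empty string from this helper suggests that R_GSCMD should be set by hand before plotting.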

Example on Windows: assuming that gswin32c.exe is installed at C:\Program Files\gs\gs9.06\bin, open R and try:

Sys.setenv(R_GSCMD=file.path("C:", "Program Files", "gs", 
                             "gs9.06", "bin", "gswin32c.exe"))

3 Examples of using dagLogo

3.1 Sample 1, start from given peptides positions

3.2 Step 1, fetch sequences

To fetch sequences via biomaRt, you need the positions of the peptides of interest and their identifiers.

library(dagLogo)
library(biomaRt)
try({##just in case the biomaRt server does not respond
    mart <- useMart("ensembl", "dmelanogaster_gene_ensembl")
    dat <- read.csv(system.file("extdata", "dagLogoTestData.csv", 
                                package="dagLogo"))
    dat <- dat[1:5,] ##subset to speed up the example
    dat
    seq <- fetchSequence(as.character(dat$entrez_geneid), 
                         anchorPos=as.character(dat$NCBI_site), 
                         mart=mart, 
                         upstreamOffset=7, 
                         downstreamOffset=7)
    head(seq@peptides)
})

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "G"  "I"  "A"  "S"  "E"  "A"  "Q"  "K"  "Y"  "Q"   "A"   "K"   "I"  
## [2,] "A"  "S"  "K"  "V"  "A"  "L"  "S"  "K"  "F"  "D"   "S"   "D"   "V"  
## [3,] "Q"  "F"  "I"  "S"  "S"  "G"  "L"  "K"  "K"  "V"   "A"   "V"   "P"  
## [4,] "G"  "R"  "C"  "A"  "S"  "I"  "A"  "K"  "D"  "A"   "M"   "S"   "H"  
## [5,] "G"  "I"  "S"  "E"  "V"  "F"  "D"  "K"  "F"  "G"   "G"   "T"   "V"  
##      [,14] [,15]
## [1,] "L"   "S"  
## [2,] "Y"   "L"  
## [3,] "S"   "T"  
## [4,] "G"   "L"  
## [5,] "L"   "A"

Sometimes you do not have the exact positions, only the peptides of interest and their identifiers.

try({
    seq <- fetchSequence(as.character(dat$entrez_geneid), 
                         anchorAA="*",
                         anchorPos=as.character(dat$peptide), 
                         mart=mart, 
                         upstreamOffset=7, 
                         downstreamOffset=7)
    head(seq@peptides)
})

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "G"  "I"  "A"  "S"  "E"  "A"  "Q"  "K"  "Y"  "Q"   "A"   "K"   "I"  
## [2,] "A"  "S"  "K"  "V"  "A"  "L"  "S"  "K"  "F"  "D"   "S"   "D"   "V"  
## [3,] "Q"  "F"  "I"  "S"  "S"  "G"  "L"  "K"  "K"  "V"   "A"   "V"   "P"  
## [4,] "G"  "R"  "C"  "A"  "S"  "I"  "A"  "K"  "D"  "A"   "M"   "S"   "H"  
## [5,] "G"  "I"  "S"  "E"  "V"  "F"  "D"  "K"  "F"  "G"   "G"   "T"   "V"  
##      [,14] [,15]
## [1,] "L"   "S"  
## [2,] "Y"   "L"  
## [3,] "S"   "T"  
## [4,] "G"   "L"  
## [5,] "L"   "A"

In the example above, the anchor amino acid (anchorAA) is marked by an asterisk. In the following example, the anchor amino acid is marked by a lowercase letter.

if(interactive()){
    dat <- read.csv(system.file("extdata", "peptides4dagLogo.csv",
                package="dagLogo"))

    tail(dat)
    mart <- useMart("ensembl", "hsapiens_gene_ensembl")
    seq <- fetchSequence(toupper(as.character(dat$symbol)), 
                         type="hgnc_symbol",
                         anchorAA="s",
                         anchorPos=as.character(dat$peptides), 
                         mart=mart, 
                         upstreamOffset=7, 
                         downstreamOffset=7)
    head(seq@peptides)
}
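To make the two anchor conventions concrete, here is a base-R sketch that locates the anchor in a peptide string. The helper find_anchor and the two peptides are hypothetical, and the sketch assumes that the asterisk directly follows the anchor residue; check your own data for the convention it uses. It is not how fetchSequence is implemented.

```r
# Hypothetical helper, for illustration only; not part of dagLogo.
find_anchor <- function(peptide, anchorAA) {
    if (anchorAA == "*") {
        # Assumption: the asterisk directly follows the anchor residue.
        star <- as.integer(regexpr("*", peptide, fixed = TRUE))
        pos <- star - 1L
        clean <- sub("*", "", peptide, fixed = TRUE)
    } else {
        # The anchor is the lowercase letter in an uppercase peptide.
        pos <- as.integer(regexpr(anchorAA, peptide, fixed = TRUE))
        clean <- toupper(peptide)
    }
    list(position = pos,
         residue = substr(clean, pos, pos),
         peptide = clean)
}

# Toy peptides, made up for this sketch.
a1 <- find_anchor("ASEAQK*YQAK", anchorAA = "*")
a2 <- find_anchor("RRAsPVEE", anchorAA = "s")
a1$residue
a2$residue
```

Both calls return the cleaned peptide together with the anchor position, which is the information needed to cut a window of upstream and downstream residues around the anchor.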

Sometimes you may already have the aligned peptide sequences in hand. In that case, use the formatSequence function to prepare a dagPeptides object for further testing. To use formatSequence, you first need to prepare the proteome with the prepareProteome function.

dat <- unlist(read.delim(system.file("extdata", "grB.txt", package="dagLogo"), 
                         header=FALSE, as.is=TRUE))
head(dat)

##                              V11                              V12 
## "GHISVKEPTPSIASDISLPIATQELRQRLR" "EREMFDKASLKLGLDKAVLQSMSGRENATN" 
##                              V13                              V14 
## "XXXXXXMSDIVVVTDLIAVGLKRGSDELLS" "GQDQEEEEIEDILMDTEEELMRAEDTEQLK" 
##                              V15                              V16 
## "ESYATDNEKMTSTPETLLEEIEAKNRELIA" "VENKERTLKRLLLQDQENSLQDNRTSSDSP"

##prepare proteome from a fasta file
proteome <- prepareProteome(fasta=system.file("extdata", 
                                              "HUMAN.fasta",
                                              package="dagLogo"))
##prepare object of dagPeptides
seq <- formatSequence(seq=dat, proteome=proteome, 
                      upstreamOffset=14, downstreamOffset=15)
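The offsets used above can be checked against the peptide width with a few lines of base R. The sketch below uses the first three peptides printed by head(dat) and assumes that the anchor sits immediately after the upstream residues, so that each peptide has upstreamOffset + 1 + downstreamOffset residues. It only illustrates the layout and does not reproduce formatSequence.

```r
# First three aligned peptides from the output of head(dat) above.
peps <- c("GHISVKEPTPSIASDISLPIATQELRQRLR",
          "EREMFDKASLKLGLDKAVLQSMSGRENATN",
          "XXXXXXMSDIVVVTDLIAVGLKRGSDELLS")
upstreamOffset <- 14
downstreamOffset <- 15

# Every peptide should span the upstream residues, the anchor,
# and the downstream residues.
stopifnot(all(nchar(peps) == upstreamOffset + 1 + downstreamOffset))

# One row per peptide, one column per aligned position.
pep_mat <- do.call(rbind, strsplit(peps, ""))
dim(pep_mat)

# Assumption: the anchor column directly follows the upstream residues.
anchor_col <- upstreamOffset + 1
pep_mat[, anchor_col]
```

A character matrix of this shape is also what seq@peptides showed in the earlier examples, with one column per position around the anchor.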

3.3 Step 2, build background model

Once you have a dagPeptides object in hand, you can build a background model for the DAG test. The background can be random subsequences of the whole proteome or of your inputs. If the background is built from the whole proteome, or from the proteome without your inputs, a Proteome object is required.

There are two ways to prepare a proteome: from a fasta file or from the UniProt web service. The previous example shows how to prepare a proteome from a fasta file. Here we show how to prepare a proteome via the UniProt web service.

if(interactive()){
    library(UniProt.ws)
    UniProt.ws <- UniProt.ws(taxId=9606)
    proteome <- prepareProteome(UniProt.ws=UniProt.ws)
}

Then the proteome can be used for background model building.

bg <- buildBackgroundModel(seq, bg="wholeGenome", proteome=proteome)
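The idea of a background made of random subsequences can be sketched in base R. The two-protein toy proteome, the helper sample_window, and the number of windows are all assumptions made for this illustration; buildBackgroundModel has its own sampling scheme and options, which this sketch does not reproduce.

```r
set.seed(1)

# Toy proteome, made up for this sketch.
toy_proteome <- c("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                  "MSDIVVVTDLIAVGLKRGSDELLSSGQDQEEEE")
width <- 15L

# Hypothetical helper: draw one random window of fixed width.
sample_window <- function(proteome, width) {
    long_enough <- proteome[nchar(proteome) >= width]
    prot <- long_enough[sample.int(length(long_enough), 1)]
    start <- sample.int(nchar(prot) - width + 1L, 1)
    substr(prot, start, start + width - 1L)
}

background <- replicate(20, sample_window(toy_proteome, width))
bg_mat <- do.call(rbind, strsplit(background, ""))

# Residue frequencies at each position of the background windows.
aa <- sort(unique(as.vector(bg_mat)))
bg_freq <- apply(bg_mat, 2, function(column) {
    as.numeric(table(factor(column, levels = aa))) / length(column)
})
rownames(bg_freq) <- aa
round(bg_freq[, 1:3], 2)
```

Each column of bg_freq sums to one and plays the role of the expected residue usage at that position, against which the input peptides can then be compared.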

3.4 Step 3, do test

The test can be done on the residues as they are, without changing the symbol pattern, or after grouping the residues by properties such as charge, chemistry, or hydrophobicity.

t0 <- testDAU(seq, bg)
t1 <- testDAU(seq, bg, group="classic")
t2 <- testDAU(seq, bg, group="charge")
t3 <- testDAU(seq, bg, group="chemistry")
t4 <- testDAU(seq, bg, group="hydrophobicity")
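What grouping does to a single alignment position can be sketched in base R. The charge classes, the helper to_charge, and the two residue columns below are assumptions chosen for illustration; the group definitions and the statistics used by testDAU may differ.

```r
# Illustrative charge classes; dagLogo's own definitions may differ.
charge_of <- c(K = "positive", R = "positive", H = "positive",
               D = "negative", E = "negative")

# Hypothetical helper: map residues to a charge class.
to_charge <- function(residues) {
    grp <- charge_of[residues]
    grp[is.na(grp)] <- "neutral"
    unname(grp)
}

# Toy residues observed at one position (made up for this sketch).
input_col      <- c("K", "K", "R", "K", "A", "K")
background_col <- c("A", "D", "K", "L", "S", "E", "G", "T", "V", "R")

classes <- c("positive", "negative", "neutral")
p_input <- prop.table(table(factor(to_charge(input_col), levels = classes)))
p_bg    <- prop.table(table(factor(to_charge(background_col), levels = classes)))

# Difference in usage between input and background, per charge class.
delta <- p_input - p_bg
round(delta, 2)
```

After grouping, K and R count toward the same class, so a position that alternates between the two still shows a clear enrichment, which an ungrouped comparison would split across two residues.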

3.5 Step 4, graphical representation results

We can use a heatmap or a logo to show the results.

dagHeatmap(t0) ##Plot a heatmap to show the results

dagLogo(t0) ##Plot a logo to show the ungrouped results

ungrouped results

##Plot a logo to show the classic grouped results
dagLogo(t1, namehash=nameHash(t1@group), legend=TRUE)

classic grouped

##Plot a logo to show the charge grouped results
dagLogo(t2, namehash=nameHash(t2@group), legend=TRUE)

charge grouped

##Plot a logo to show the chemistry grouped results
dagLogo(t3, namehash=nameHash(t3@group), legend=TRUE)

chemistry grouped

##Plot a logo to show the hydrophobicity grouped results
dagLogo(t4, namehash=nameHash(t4@group), legend=TRUE)

hydrophobicity grouped

4 Using dagLogo to analyze Catabolite Activator Protein

CAP (Catabolite Activator Protein, also known as CRP for cAMP Receptor Protein) is a transcription promoter that binds at more than 100 sites within the E. coli genome.

The DNA-binding helix-turn-helix motif of the CAP family, drawn by motifStack, is shown in the following figure.

library(motifStack)
protein<-read.table(file.path(find.package("motifStack"),"extdata","cap.txt"))
protein<-t(protein[,1:20])
motif<-pcm2pfm(protein)
motif<-new("pfm", mat=motif, name="CAP", 
            color=colorset(alphabet="AA",colorScheme="chemistry"))
##The DNA-binding helix-turn-helix motif of the CAP family plotted by motifStack
plot(motif)

Catabolite Activator Protein Motif

If we use dagLogo to plot the motif, the result is shown in the following figure. Residues 7-13 form the first helix, 14-17 the turn and 18-26 the DNA recognition helix. The glycine at position 15 appears to be critical in forming the turn.

library(Biostrings)
cap <- as.character(readAAStringSet(system.file("extdata", 
                                                "cap.fasta", 
                                                package="dagLogo")))
data(ecoli.proteome)
seq <- formatSequence(seq=cap, proteome=ecoli.proteome)
bg <- buildBackgroundModel(seq, bg="wholeGenome", 
                           proteome=ecoli.proteome, 
                           permutationSize=10L)
##The DNA-binding helix-turn-helix motif of the CAP family plotted by dagLogo
t0 <- testDAU(seq, bg)
dagLogo(t0)

Catabolite Activator Protein Motif

If the peptides are grouped by chemistry before plotting, the result is shown in the following figure. Positions 10, 14, 16, 21 and 25 are partially or completely buried and therefore tend to be populated by hydrophobic amino acids, which becomes very clear when the peptides are grouped by chemistry.

## The DNA-binding helix-turn-helix motif of the CAP family grouped by chemistry
t1 <- testDAU(seq, bg, group="chemistry")
dagLogo(t1, namehash=nameHash(t1@group), legend=TRUE)

Catabolite Activator Protein Motif

5 Session Info

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] dagLogo_1.12.0      motifStack_1.18.0   Biostrings_2.42.0  
##  [4] XVector_0.14.0      IRanges_2.8.0       S4Vectors_0.12.0   
##  [7] ade4_1.7-4          MotIV_1.30.0        BiocGenerics_0.20.0
## [10] grImport_0.9-0      XML_3.98-1.4        biomaRt_2.30.0     
## [13] BiocStyle_2.2.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7                RColorBrewer_1.1-2        
##  [3] plyr_1.8.4                 formatR_1.4               
##  [5] GenomeInfoDb_1.10.0        bitops_1.0-6              
##  [7] tools_3.3.1                zlibbioc_1.20.0           
##  [9] digest_0.6.10              gtable_0.2.0              
## [11] RSQLite_1.0.0              evaluate_0.10             
## [13] tibble_1.2                 lattice_0.20-34           
## [15] BSgenome_1.42.0            Matrix_1.2-7.1            
## [17] DBI_0.5-1                  yaml_2.1.13               
## [19] seqLogo_1.40.0             rtracklayer_1.34.0        
## [21] stringr_1.1.0              knitr_1.14                
## [23] Biobase_2.34.0             AnnotationDbi_1.36.0      
## [25] BiocParallel_1.8.0         rGADEM_2.22.0             
## [27] rmarkdown_1.1              pheatmap_1.0.8            
## [29] magrittr_1.5               scales_0.4.0              
## [31] Rsamtools_1.26.0           htmltools_0.3.5           
## [33] GenomicRanges_1.26.0       GenomicAlignments_1.10.0  
## [35] SummarizedExperiment_1.4.0 assertthat_0.1            
## [37] colorspace_1.2-7           stringi_1.1.2             
## [39] munsell_0.4.3              RCurl_1.95-4.8

Bembom, Oliver. 2006. “SeqLogo: Sequence Logos for DNA Sequence Alignments.” R Package Version 1.5.4.

Colaert, Niklaas, Kenny Helsens, Lennart Martens, Joel Vandekerckhove, and Kris Gevaert. 2009. “Improved Visualization of Protein Consensus Sequences by IceLogo.” Nature Methods 6 (11): 786–87.

Crooks, Gavin E., Gary Hon, John-Marc Chandonia, and Steven E. Brenner. 2004. “WebLogo: A Sequence Logo Generator.” Genome Research 14: 1188–90.

Ou, Jianhong. 2012. “MotifStack: Plot Stacked Logos for Single or Multiple DNA, RNA and Amino Acid Sequence.” R Package Version 1.5.4.

dagLogo Vignette

Jianhong Ou, Lihua Julie Zhu

17 October 2016

Abstract

Package version: dagLogo 1.12.0

Contents

1 Introduction

2 Prepare environment

3 Examples of using dagLogo

3.1 Sample 1, start from given peptides positions

3.2 Step 1, fetch sequences

3.3 Step 2, build background model

3.4 Step 3, do test

3.5 Step 4, graphical representation results

4 Using dagLogo to analyze Catabolite Activator Protein

5 Session Info