FindMyFriends 1.10.0
FindMyFriends is an R package for creating pangenomes of large sets of microbial organisms, together with tools to investigate the results. The number of tools to create pangenomes is large and some of the more notable ones are Roary, OrthoMCL, PanOCT, inParanoid and Sybil. Within the R framework micropan (available from CRAN) is currently the only other entry, but the scope of FindMyFriends is much broader.
Within comparative microbiol genomics, a pangenome is defined as a collection of all the genes present in a set of related organisms and grouped together by some sort of homology. This homology measure has mainly been some sort of blast derived similarity, but as you shall see here, this is not the only possibility.
Pangenomes have many uses, both as organisational tools and as a mean to get insight into the collective genomic information present within a group of organisms. An example of the former is to ensure an equal functional annotation of all genes within a set of genomes by annotating pangenome gene groups rather than single genes. An example of the latter is to use a pangenome together with phenotypic information to discover new links between genes and phenotypes.
Usually pangenomes are calculated from the result of blasting all genes in the organisms under investigation against each other. This step is very time-consuming due to the complexity of the BLAST algorithm and the need to compare everything with everything (resulting in a quadratic scaling). FindMyFriends generally takes a different approach by ditching BLAST altogether. In its absence FindMyFriends uses alignment free sequence comparisons, based on decomposition of the sequences into K-mer feature vectors. K-mers are words of equal length derived by sliding a window of size K over the sequence and a K-mer feature vector is simply the count of each unique word, as seen below
GATTCGATTAG -> ATT: 2
CGA: 1
GAT: 2
TAG: 1
TCG: 1
TTA: 1
TTC: 1
With this decomposition the problem of calculating sequence similarity is reduced to a vector similarity problem, for which there are many different solution. The one employed in FindMyFriends is called cosine similarity, which is effectively the cosine to the angle between the two vectors. If the vector space is strictly positive then the value is bound between 0 and 1 with 1 being 100 % similarity.
Apart from the use of K-mers over BLAST, FindMyFriends also advocate a different approach to how the grouping is performed. Normally the final (or close to final) grouping is derived directly from the main similarity comparison step. In FindMyFriends this is turned around, and the main similarity step is very coarse, resulting in very large gene groups, unsuited as a final result. These groups are then investigated in more detail with regards to the sequence similarity, neighborhood similarity and sequence length. This division makes it possible to employ a very fast, but not very precise algorithm for the first step, and only use time to investigate gene pairs with a high likelihood of being equal. In the end this approach results in a linear scaling with regards to the number of genes, as well as a general speedup, and to my knowledge FindMyFriends is currently by far the fastest algorithm for creating pangenomes.
FindMyFriends is flexible and there are many ways to go about creating pangenomes - you can even specify them manually making it possible to import results from other algorithms into the framework. This vignette will only focus on the currently recommended approach.
This vignette will use a collection of 10 different Mycoplasma pneumonia and hyopneumonia strains. These organisms have an appreciable small genome making it easier to distribute and analyse.
# Unpack files
location <- tempdir()
unzip(system.file('extdata', 'Mycoplasma.zip', package='FindMyFriends'),
exdir=location)
genomeFiles <- list.files(location, full.names=TRUE, pattern='*.fasta')
The first step in performing a pangenomic analysis with FindMyFriends is to load
gene data into a Pangenome object. This is done with the aptly named pangenome
function, which takes a list of fasta files, a boolean indicating whether the
genes are translated and optionally information about the position of each gene
on the chromosome and a boolean indicating whether to read all sequences into
memory.
library(FindMyFriends)
mycoPan <- pangenome(genomeFiles[1:9],
translated = TRUE,
geneLocation = 'prodigal',
lowMem = FALSE)
mycoPan
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## Gene groups not yet defined
##
## Genes are translated
The last parameters of the function call warrants some more explanation. Some of the algorithms in FindMyFriends uses positional information and this must be supplied with the geneLocation parameter. This can be done in a number of ways: As a data.frame with one row per gene, as a function that takes the header info of each gene from the fasta file and returns a data.frame with on row per gene, or as a string identifying the annotator used to find the gene (currently only Prodigal is supported). The gene info has the following format (all columns required):
head(geneLocation(mycoPan))
## contig start end strand
## 1 gi|71851486|gb|AE017243.1| 207 395 1
## 2 gi|71851486|gb|AE017243.1| 825 1109 1
## 3 gi|71851486|gb|AE017243.1| 1131 1409 1
## 4 gi|71851486|gb|AE017243.1| 1791 2879 1
## 5 gi|71851486|gb|AE017243.1| 2983 4521 1
## 6 gi|71851486|gb|AE017243.1| 4582 4842 1
The contig column holds information on the contig/chromosome on which the gene is located, the start and stop columns gives the start and stop position of translation and the strand column identifies on which strand the gene is located. Additional columns are allowed, but is currently unused.
The last parameter (lowMem) specifies whether all sequences should be read into memory (the default) or be kept as references to the original fasta files. The latter is useful in cases where the size of the sequence data becomes prohibitively large but is not needed for our little toy example.
Metadata about the organisms can be added and queried using the addOrgInfo
method. For instance we might add some taxonomy data.
library(reutils)
orgMeta <- lapply(orgNames(mycoPan), function(name) {
uid <- esearch(name, 'assembly')
taxuid <- elink(uid, dbTo = 'taxonomy')
reutils::content(esummary(taxuid), as = 'parsed')
})
orgMeta <- lapply(lapply(orgMeta, lapply, unlist), as.data.frame)
orgMeta <- do.call(rbind, orgMeta)
mycoPan <- addOrgInfo(mycoPan, orgMeta)
head(orgInfo(mycoPan))
## nGenes Id Status Rank Division
## AE017243 1336 262719 active NA mycoplasmas
## AE017244 1351 262722 active NA mycoplasmas
## AE017332 1316 295358 active NA mycoplasmas
## AP012303 1407 1112856 active NA mycoplasmas
## CP002077 1392 722438 active NA mycoplasmas
## CP002274 1367 907287 active NA mycoplasmas
## ScientificName CommonName TaxId AkaTaxId
## AE017243 Mycoplasma hyopneumoniae J NA 262719 0
## AE017244 Mycoplasma hyopneumoniae 7448 NA 262722 0
## AE017332 Mycoplasma hyopneumoniae 232 NA 295358 0
## AP012303 Mycoplasma pneumoniae 309 NA 1112856 0
## CP002077 Mycoplasma pneumoniae FH NA 722438 0
## CP002274 Mycoplasma hyopneumoniae 168 NA 907287 0
## Genus Species Subsp ModificationDate GenBankDivision
## AE017243 Mycoplasma hyopneumoniae NA 2010/10/08 00:00 Bacteria
## AE017244 Mycoplasma hyopneumoniae NA 2005/08/28 00:00 Bacteria
## AE017332 Mycoplasma hyopneumoniae NA 2004/10/09 00:00 Bacteria
## AP012303 Mycoplasma pneumoniae NA 2011/12/02 00:00 Bacteria
## CP002077 Mycoplasma pneumoniae NA 2010/07/28 00:00 Bacteria
## CP002274 Mycoplasma hyopneumoniae NA 2010/10/09 00:00 Bacteria
The raw sequences are easily accessible as an XStringSet object, or split by organism into an XStringSetList object:
genes(mycoPan)
## A AAStringSet instance of length 12247
## width seq names
## [1] 63 MQTNKNNLKVRTQQIRQQIE...EIIIDFTDLIAKQEVISR* gi|71851486|gb|AE...
## [2] 95 MSQVDAFLFDDIQGLANKQQ...FLRIFKHKLVEEKLEKHI* gi|71851486|gb|AE...
## [3] 93 MSKHFRNSIRELEGALKSIV...IKSEKRGKNIVHARDIAI* gi|71851486|gb|AE...
## [4] 363 MQSAILNNINSPLSSFFLKL...IVSPEKPEICQLVGLVLV* gi|71851486|gb|AE...
## [5] 513 MSKKSKNSSIEFDAIVVGGG...LAKKYNLEKRISKLKLIS* gi|71851486|gb|AE...
## ... ... ...
## [12243] 391 MFSFFKQIFKSLKKFFFLLF...TFTLDTSKLSHLDKEEFD* gi|440453185|gb|C...
## [12244] 222 MTNGITNQLICDHIDLKIAP...KIVADYLNQRPKTINEIN* gi|440453185|gb|C...
## [12245] 440 MEQFSAFKLLLKKQYETTLG...KMIEKDDSLQDIITSLVI* gi|440453185|gb|C...
## [12246] 251 MAISKKKRFFFDLAQDEDDA...NIPISRILTMILDQVIEE* gi|440453185|gb|C...
## [12247] 271 MIISFVNNKGGVLKTTMATN...EYLEITKEILNLVNHGHQ* gi|440453185|gb|C...
genes(mycoPan, split = 'organism')
## AAStringSetList of length 9
## [["AE017243"]] gi|71851486|gb|AE017243.1|_1 # 207 # 395 # 1 # ID=1_1;parti...
## [["AE017244"]] gi|71913466|gb|AE017244.1|_1 # 207 # 395 # 1 # ID=1_1;parti...
## [["AE017332"]] gi|53987142|gb|AE017332.1|_1 # 1 # 189 # 1 # ID=1_1;partial...
## [["AP012303"]] gi|358640274|dbj|AP012303.1|_1 # 691 # 1833 # 1 # ID=1_1;pa...
## [["CP002077"]] gi|301633103|gb|CP002077.1|_1 # 691 # 1833 # 1 # ID=1_1;par...
## [["CP002274"]] gi|312600973|gb|CP002274.1|_1 # 1 # 189 # 1 # ID=1_1;partia...
## [["CP003131"]] gi|506957411|gb|CP003131.1|_1 # 1 # 189 # 1 # ID=1_1;partia...
## [["CP003802"]] gi|523582824|gb|CP003802.1|_1 # 207 # 395 # 1 # ID=1_1;part...
## [["CP003913"]] gi|440453185|gb|CP003913.1|_1 # 692 # 1834 # 1 # ID=1_1;par...
Apart from that the object really needs some grouping before there is anything interesting to do with it.
A lot of the methods in FindMyFriends contains similar parameters, and a lot of these. To save typing when doing the analyses, these parameters can be set on a per-object manner. At creation a set of defaults is already set, which can be queried and modified:
# Query current defaults
head(defaults(mycoPan))
## $groupPrefix
## [1] "OG"
##
## $coreThreshold
## [1] 1
##
## $kmerSize
## [1] 5
##
## $lowerLimit
## [1] 0.5
##
## $algorithm
## [1] "infomap"
##
## $flankSize
## [1] 4
# Set a new default
defaults(mycoPan)$lowerLimit <- 0.6
Values supplied as arguments to a method will always override object defaults.
Pangenome calculation is generally split into two steps; one focusing on a
coarse clustering of genes based on sequence similarity and one focusing on
refining this clustering by comparing sequences and chromosomal neighborhood
between genes in the clusters. Currently the fastest approach to the first step
is based on the CD-Hit algorithm (cdhitGrouping
), but other approaches exists
(gpcGrouping
and graphGrouping
). There is no advantages to using the slower
algorithms, so there use is generally not encouraged. cdhitGrouping
works by
repeatedly combining gene groups based on lower and lower similarity threshold.
At each step the longest member of each gene group is chosen as a representative
for the group in the following step. You generally want to set the lowest
threshold rather low to ensure that genes belonging to the same group are
definitely clustered together.
mycoPan <- cdhitGrouping(mycoPan, kmerSize = 5, cdhitIter = TRUE, nrep = 3)
Looking at our mycoPan object now reveals some additional information about the gene groups:
mycoPan
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## 3142 gene groups defined
##
## Core|
## Accessory|===========================================:
## Singleton|======
##
## Genes are translated
This is not the end of our analysis, as these gene groups needs to be refined.
The refinement step is done with the neighborhoodSplit
function (or
kmerSplit
if chromosomal position is unavailable). This function investigates
each gene group and compares the member genes based on sequence similarity,
chromosomal neighborhood similarity, sequence length and genome mebership, and
uses this information to build up a weighted graph with the member genes as
nodes. Edges reflect similarity and are only created if the two genes passes all
thresholds with regards to the different comparisons. From this graph cliques
(complete subgraphs i.e. subgraphs were all nodes are connected to each other)
are extracted sequentially, starting with the clique containing the highest
weighted edges. This ensures that all members of the resulting gene groups share
a similarity with all other members and avoids the inclusion of genes being very
similar to a few genes but not all. After this main step highly similar gene
groups lying in parallel (sharing a gene group either up- or downstream) are
merged together to create the final grouping.
mycoPan <- neighborhoodSplit(mycoPan, lowerLimit = 0.8)
As can be seen this step results in an increase in the number of gene groups:
mycoPan
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## 3399 gene groups defined
##
## Core|
## Accessory|==========================================:
## Singleton|=======:
##
## Genes are translated
In general you should expect that the use of FindMyFriends will result in a higher number of gene groups than other algorithms. This is due to its strict approach to defining similarity and in general you can expect the final result to be of very high quality.
Once the pangenome has been calculated it is possible to perform a range of different post-processing steps and analyses on results.
FindMyFriend forcefully avoids grouping genes from the same genome together. While it is generally a boon to split up paralogue gene groups, as the genes might have different functionality within the cell (and probably be under different regulation), it can be nice to maintain a link between groups with a certain similarity. Such a link can be obtained by calculating a kmer similarity between representatives from each gene group. Once paralogue links have been created they can be used when extracting raw sequences and the information will be taken into account when calculating organism statistics. Furthermore it is possible to merge paralogues into true gene groups (effectively undoing the neighborhood splitting).
mycoPan <- kmerLink(mycoPan, lowerLimit=0.8)
genes(mycoPan, split='paralogue')[[1]]
## A AAStringSet instance of length 10
## width seq names
## [1] 61 MQIQEKIATILDTFTELSAEL...LRERKKQYAFYCDYLLNPKN* gi|358640274|dbj|...
## [2] 205 MKLEAPKFTNRVFKNRPKLTH...TNGGLIGKINDYDFHGEYVT* gi|358640274|dbj|...
## [3] 75 MQGTLAEIELDFPSKKIQEKI...LRERKKQYVFYSDYLLNPKN* gi|358640274|dbj|...
## [4] 57 MQIQEKIATILDTFTELSAEL...LRERKKQYAFYRDYLLNPKN* gi|301633103|gb|C...
## [5] 174 MAEIPIDFPPLKIQEKIATIL...TNGGLIGKINDYDFHGEYVT* gi|301633103|gb|C...
## [6] 67 MQGTLAEIELDFPSKKIQEKI...LRERKKQYVFYSDYLLNPKN* gi|301633103|gb|C...
## [7] 57 MQIQEKIATILDTFTELSAEL...LRERKKQYAFYRDYLLNLKN* gi|440453185|gb|C...
## [8] 197 MKLEAPKFTNRVFKNRPKLTH...TNGGLIGKINDYDFHGEYVT* gi|440453185|gb|C...
## [9] 63 MQGTLAEIELDFTSKKIQEKI...LRERKKQYVFYSDYLLNPKN* gi|440453185|gb|C...
## [10] 198 MEAPKFVNNACPIPNLNLSRT...TNDGELGRIKDCDFDGEYIT* gi|440453185|gb|C...
mycoPan
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## 3399 gene groups defined
##
## Core|
## Accessory|==========================================:
## Singleton|=======:
##
## Genes are translated
collapseParalogues(mycoPan, combineInfo='largest')
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## 3240 gene groups defined
##
## Core|
## Accessory|============================================
## Singleton|======
##
## Genes are translated
Usually genes are detected automatically by programs suchs as Prodigal and
Glimmer, but while these programs are generally good, they are not perfect. A
pangenome analysis can potentially reveal genes that, while annotated, are no
longer functioning due to frameshifts etc. These will be apparant as members of
gene groups that have vastly different length than the majority of the members.
Removing such genes will in general improve whatever downstream analysis that
the data will be used for. Using the removeGene()
function it is possible to
remove single or sets of genes from your pangenome object.
# Remove a gene by raw index
removeGene(mycoPan, ind=60)
## An object of class pgFullLoc
##
## The pangenome consists of 12246 genes from 9 organisms
## 3399 gene groups defined
##
## Core|
## Accessory|==========================================:
## Singleton|=======:
##
## Genes are translated
# Remove the first organism by index
removeGene(mycoPan, organism=1)
## An object of class pgFullLoc
##
## The pangenome consists of 10911 genes from 8 organisms
## 3310 gene groups defined
##
## Core|
## Accessory|============================================
## Singleton|======
##
## Genes are translated
# or by name
name <- orgNames(mycoPan)[1]
removeGene(mycoPan, organism=name)
## An object of class pgFullLoc
##
## The pangenome consists of 10911 genes from 8 organisms
## 3310 gene groups defined
##
## Core|
## Accessory|============================================
## Singleton|======
##
## Genes are translated
# Remove the second member of the first gene group
removeGene(mycoPan, group=1, ind=2)
## An object of class pgFullLoc
##
## The pangenome consists of 12246 genes from 9 organisms
## 3399 gene groups defined
##
## Core|
## Accessory|==========================================:
## Singleton|=======:
##
## Genes are translated
Having a nice structure around your pangenome data is just a part of a great framework - having a set of analyses at hand for investigating the results is another part. Luckily, by being part of Bioconductor and using the standard data representations, a lot of the functionality already provided in Bioconductor is at your disposal. The two main data types that you might want to extract in order to do further analysis is the pangenome matrix and raw sequences:
# Get the pangenome matrix as an ExpressionSet object
as(mycoPan, 'ExpressionSet')
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 3399 features, 9 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: AE017243 AE017244 ... CP003913 (9 total)
## varLabels: nGenes Id ... GenBankDivision (14 total)
## varMetadata: labelDescription
## featureData
## featureNames: OG1 OG2 ... OG3399 (3399 total)
## fvarLabels: description group ... nGenes (7 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
# or as a regular matrix
as(mycoPan, 'matrix')[1:6, ]
## AE017243 AE017244 AE017332 AP012303 CP002077 CP002274 CP003131 CP003802
## OG1 8 8 2 0 0 9 9 0
## OG2 8 8 1 0 0 9 9 0
## OG3 2 2 2 0 0 2 2 2
## OG4 1 1 1 0 0 2 2 2
## OG5 2 3 2 0 0 1 1 0
## OG6 2 3 2 0 0 1 1 0
## CP003913
## OG1 0
## OG2 0
## OG3 0
## OG4 0
## OG5 0
## OG6 0
# Get all genes split into gene groups
genes(mycoPan, split='group')
## AAStringSetList of length 3399
## [["OG1"]] gi|71851486|gb|AE017243.1|_567 # 377235 # 377504 # -1 # ID=1_567...
## [["OG2"]] gi|71851486|gb|AE017243.1|_568 # 377541 # 378728 # -1 # ID=1_568...
## [["OG3"]] gi|71851486|gb|AE017243.1|_197 # 135429 # 135674 # -1 # ID=1_197...
## [["OG4"]] gi|71851486|gb|AE017243.1|_332 # 219147 # 219662 # 1 # ID=1_332;...
## [["OG5"]] gi|71851486|gb|AE017243.1|_1081 # 726907 # 727254 # -1 # ID=1_10...
## [["OG6"]] gi|71851486|gb|AE017243.1|_1078 # 725485 # 725850 # -1 # ID=1_10...
## [["OG7"]] gi|71851486|gb|AE017243.1|_1080 # 726349 # 726723 # -1 # ID=1_10...
## [["OG8"]] gi|71851486|gb|AE017243.1|_1083 # 727841 # 727978 # -1 # ID=1_10...
## [["OG9"]] gi|71851486|gb|AE017243.1|_636 # 428809 # 428922 # -1 # ID=1_636...
## [["OG10"]] gi|71851486|gb|AE017243.1|_87 # 57229 # 58629 # 1 # ID=1_87;par...
## ...
## <3389 more elements>
Apart from what is already available within Bioconductor, FindMyFriends also
comes with a set of tools to investigate the results further. Two of these
calculates different statistics on the two main gene groupings in the dataset,
namely gene groups and organisms. These functions are called groupStat()
and
orgStat()
and are very straightforward to use:
groupStat(mycoPan)[[1]]
## $maxOrg
## [1] 9
##
## $minLength
## [1] 90
##
## $maxLength
## [1] 90
##
## $sdLength
## [1] 0
##
## $genes
## [1] 2834 1552 8402 7037 8491 7127 1723 1733 1761 7272 8636 8738 567 7381
## [15] 594 1925 7428 8795 7528 8894 7559 8925 7588 8954 809 929 2281 987
## [29] 1035 1055 3765 1085 9266 7902 2600 2677
##
## $backward
## [1] "OG686" "OG845" "OG2" "OG2" "OG790" "OG790" "OG2" "OG1073"
## [9] "OG2" "OG580" "OG580" "OG2" "OG1103" "OG2378" "OG525" "OG2"
## [17] "OG3007" "OG3066" "OG744" "OG744" "OG2" "OG2" "OG2" "OG2"
## [25] "OG2" "OG2" "OG856" "OG2" "OG566" "OG297" "OG8" "OG2"
## [33] "OG8" "OG3347" "OG91" "OG166"
##
## $forward
## [1] "OG2" "OG2" "OG316" "OG316" "OG2" "OG2" "OG1757" "OG2"
## [9] "OG1167" "OG2" "OG2" "OG3337" "OG2" "OG2" "OG2" "OG3383"
## [17] "OG2" "OG2" "OG2" "OG2" "OG23" "OG23" "OG58" "OG58"
## [25] "OG920" "OG1206" "OG2" "OG3154" "OG2" "OG2" "OG2932" "OG6"
## [33] "OG2" "OG2" "OG2" "OG2"
head(orgStat(mycoPan))
## nGenes minLength maxLength sdLength nGeneGroups nParalogues
## AE017243 1336 30 1824 121.1718 1316 13
## AE017244 1351 30 1808 121.5220 1325 14
## AE017332 1316 30 1825 123.5957 1310 10
## AP012303 1407 30 1141 121.8308 1404 18
## CP002077 1392 30 1141 121.6465 1389 18
## CP002274 1367 30 1815 123.5032 1348 7
To get a very broad overview of your result plotStat()
can help you:
plotStat(mycoPan, color='Species', type='qual', palette=6)
Three other plot functions exists to let you asses general properties of the
pangenome. plotEvolution()
lets you follow the number of singleton, accessory
and core genes as the number of organisms increases. Generally this type of plot
is very biased towards the order of organisms, so while it is possible to supply
a progression of organisms, the default approach is to create bootstraped values
for the plot:
plotEvolution(mycoPan)
## `geom_smooth()` using method = 'loess'
plotSimilarity()
creates a heatplot of the similarity matrix of the organisms
in the pangenome. The similarity of two organisms is here defined as either the
percent of shared genes or the cosine similarity of their total kmer feature
vector. The ordering can be done manually or automatically according to a
hierachcal clustering:
# Pangenome matrix similarity
plotSimilarity(mycoPan)
# Kmer similarity
plotSimilarity(mycoPan, type='kmer', kmerSize=5)