1 Introduction

FindMyFriends is an R package for creating pangenomes of large sets of microbial organisms, together with tools to investigate the results. The number of tools to create pangenomes is large and some of the more notable ones are Roary, OrthoMCL, PanOCT, inParanoid and Sybil. Within the R framework micropan (available from CRAN) is currently the only other entry, but the scope of FindMyFriends is much broader.

Within comparative microbiol genomics, a pangenome is defined as a collection of all the genes present in a set of related organisms and grouped together by some sort of homology. This homology measure has mainly been some sort of blast derived similarity, but as you shall see here, this is not the only possibility.

Pangenomes have many uses, both as organisational tools and as a mean to get insight into the collective genomic information present within a group of organisms. An example of the former is to ensure an equal functional annotation of all genes within a set of genomes by annotating pangenome gene groups rather than single genes. An example of the latter is to use a pangenome together with phenotypic information to discover new links between genes and phenotypes.

Usually pangenomes are calculated from the result of blasting all genes in the organisms under investigation against each other. This step is very time-consuming due to the complexity of the BLAST algorithm and the need to compare everything with everything (resulting in a quadratic scaling). FindMyFriends generally takes a different approach by ditching BLAST altogether. In its absence FindMyFriends uses alignment free sequence comparisons, based on decomposition of the sequences into K-mer feature vectors. K-mers are words of equal length derived by sliding a window of size K over the sequence and a K-mer feature vector is simply the count of each unique word, as seen below

GATTCGATTAG  ->  ATT: 2
                 CGA: 1
                 GAT: 2
                 TAG: 1
                 TCG: 1
                 TTA: 1
                 TTC: 1

With this decomposition the problem of calculating sequence similarity is reduced to a vector similarity problem, for which there are many different solution. The one employed in FindMyFriends is called cosine similarity, which is effectively the cosine to the angle between the two vectors. If the vector space is strictly positive then the value is bound between 0 and 1 with 1 being 100 % similarity.

Apart from the use of K-mers over BLAST, FindMyFriends also advocate a different approach to how the grouping is performed. Normally the final (or close to final) grouping is derived directly from the main similarity comparison step. In FindMyFriends this is turned around, and the main similarity step is very coarse, resulting in very large gene groups, unsuited as a final result. These groups are then investigated in more detail with regards to the sequence similarity, neighborhood similarity and sequence length. This division makes it possible to employ a very fast, but not very precise algorithm for the first step, and only use time to investigate gene pairs with a high likelihood of being equal. In the end this approach results in a linear scaling with regards to the number of genes, as well as a general speedup, and to my knowledge FindMyFriends is currently by far the fastest algorithm for creating pangenomes.

FindMyFriends is flexible and there are many ways to go about creating pangenomes - you can even specify them manually making it possible to import results from other algorithms into the framework. This vignette will only focus on the currently recommended approach.

1.1 Data to use in this vignette

This vignette will use a collection of 10 different Mycoplasma pneumonia and hyopneumonia strains. These organisms have an appreciable small genome making it easier to distribute and analyse.

# Unpack files
location <- tempdir()
unzip(system.file('extdata', 'Mycoplasma.zip', package='FindMyFriends'),
      exdir=location)
genomeFiles <- list.files(location, full.names=TRUE, pattern='*.fasta')

2 Creating a Mycoplasma pangenome

2.1 Creating a Pangenome object from a set of fasta files

The first step in performing a pangenomic analysis with FindMyFriends is to load gene data into a Pangenome object. This is done with the aptly named pangenome function, which takes a list of fasta files, a boolean indicating whether the genes are translated and optionally information about the position of each gene on the chromosome and a boolean indicating whether to read all sequences into memory.

library(FindMyFriends)

mycoPan <- pangenome(genomeFiles[1:9], 
                     translated = TRUE, 
                     geneLocation = 'prodigal', 
                     lowMem = FALSE)
mycoPan
## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## Gene groups not yet defined
## 
## Genes are translated

The last parameters of the function call warrants some more explanation. Some of the algorithms in FindMyFriends uses positional information and this must be supplied with the geneLocation parameter. This can be done in a number of ways: As a data.frame with one row per gene, as a function that takes the header info of each gene from the fasta file and returns a data.frame with on row per gene, or as a string identifying the annotator used to find the gene (currently only Prodigal is supported). The gene info has the following format (all columns required):

head(geneLocation(mycoPan))
##                       contig start  end strand
## 1 gi|71851486|gb|AE017243.1|   207  395      1
## 2 gi|71851486|gb|AE017243.1|   825 1109      1
## 3 gi|71851486|gb|AE017243.1|  1131 1409      1
## 4 gi|71851486|gb|AE017243.1|  1791 2879      1
## 5 gi|71851486|gb|AE017243.1|  2983 4521      1
## 6 gi|71851486|gb|AE017243.1|  4582 4842      1

The contig column holds information on the contig/chromosome on which the gene is located, the start and stop columns gives the start and stop position of translation and the strand column identifies on which strand the gene is located. Additional columns are allowed, but is currently unused.

The last parameter (lowMem) specifies whether all sequences should be read into memory (the default) or be kept as references to the original fasta files. The latter is useful in cases where the size of the sequence data becomes prohibitively large but is not needed for our little toy example.

Metadata about the organisms can be added and queried using the addOrgInfo method. For instance we might add some taxonomy data.

library(reutils)
orgMeta <- lapply(orgNames(mycoPan), function(name) {
    uid <- esearch(name, 'assembly')
    taxuid <- elink(uid, dbTo = 'taxonomy')
    reutils::content(esummary(taxuid), as = 'parsed')
})
orgMeta <- lapply(lapply(orgMeta, lapply, unlist), as.data.frame)
orgMeta <- do.call(rbind, orgMeta)

mycoPan <- addOrgInfo(mycoPan, orgMeta)
head(orgInfo(mycoPan))
##          nGenes      Id Status Rank    Division
## AE017243   1336  262719 active   NA mycoplasmas
## AE017244   1351  262722 active   NA mycoplasmas
## AE017332   1316  295358 active   NA mycoplasmas
## AP012303   1407 1112856 active   NA mycoplasmas
## CP002077   1392  722438 active   NA mycoplasmas
## CP002274   1367  907287 active   NA mycoplasmas
##                         ScientificName CommonName   TaxId AkaTaxId
## AE017243    Mycoplasma hyopneumoniae J         NA  262719        0
## AE017244 Mycoplasma hyopneumoniae 7448         NA  262722        0
## AE017332  Mycoplasma hyopneumoniae 232         NA  295358        0
## AP012303     Mycoplasma pneumoniae 309         NA 1112856        0
## CP002077      Mycoplasma pneumoniae FH         NA  722438        0
## CP002274  Mycoplasma hyopneumoniae 168         NA  907287        0
##               Genus       Species Subsp ModificationDate GenBankDivision
## AE017243 Mycoplasma hyopneumoniae    NA 2010/10/08 00:00        Bacteria
## AE017244 Mycoplasma hyopneumoniae    NA 2005/08/28 00:00        Bacteria
## AE017332 Mycoplasma hyopneumoniae    NA 2004/10/09 00:00        Bacteria
## AP012303 Mycoplasma    pneumoniae    NA 2011/12/02 00:00        Bacteria
## CP002077 Mycoplasma    pneumoniae    NA 2010/07/28 00:00        Bacteria
## CP002274 Mycoplasma hyopneumoniae    NA 2010/10/09 00:00        Bacteria

The raw sequences are easily accessible as an XStringSet object, or split by organism into an XStringSetList object:

genes(mycoPan)
##   A AAStringSet instance of length 12247
##         width seq                                        names               
##     [1]    63 MQTNKNNLKVRTQQIRQQIE...EIIIDFTDLIAKQEVISR* gi|71851486|gb|AE...
##     [2]    95 MSQVDAFLFDDIQGLANKQQ...FLRIFKHKLVEEKLEKHI* gi|71851486|gb|AE...
##     [3]    93 MSKHFRNSIRELEGALKSIV...IKSEKRGKNIVHARDIAI* gi|71851486|gb|AE...
##     [4]   363 MQSAILNNINSPLSSFFLKL...IVSPEKPEICQLVGLVLV* gi|71851486|gb|AE...
##     [5]   513 MSKKSKNSSIEFDAIVVGGG...LAKKYNLEKRISKLKLIS* gi|71851486|gb|AE...
##     ...   ... ...
## [12243]   391 MFSFFKQIFKSLKKFFFLLF...TFTLDTSKLSHLDKEEFD* gi|440453185|gb|C...
## [12244]   222 MTNGITNQLICDHIDLKIAP...KIVADYLNQRPKTINEIN* gi|440453185|gb|C...
## [12245]   440 MEQFSAFKLLLKKQYETTLG...KMIEKDDSLQDIITSLVI* gi|440453185|gb|C...
## [12246]   251 MAISKKKRFFFDLAQDEDDA...NIPISRILTMILDQVIEE* gi|440453185|gb|C...
## [12247]   271 MIISFVNNKGGVLKTTMATN...EYLEITKEILNLVNHGHQ* gi|440453185|gb|C...
genes(mycoPan, split = 'organism')
## AAStringSetList of length 9
## [["AE017243"]] gi|71851486|gb|AE017243.1|_1 # 207 # 395 # 1 # ID=1_1;parti...
## [["AE017244"]] gi|71913466|gb|AE017244.1|_1 # 207 # 395 # 1 # ID=1_1;parti...
## [["AE017332"]] gi|53987142|gb|AE017332.1|_1 # 1 # 189 # 1 # ID=1_1;partial...
## [["AP012303"]] gi|358640274|dbj|AP012303.1|_1 # 691 # 1833 # 1 # ID=1_1;pa...
## [["CP002077"]] gi|301633103|gb|CP002077.1|_1 # 691 # 1833 # 1 # ID=1_1;par...
## [["CP002274"]] gi|312600973|gb|CP002274.1|_1 # 1 # 189 # 1 # ID=1_1;partia...
## [["CP003131"]] gi|506957411|gb|CP003131.1|_1 # 1 # 189 # 1 # ID=1_1;partia...
## [["CP003802"]] gi|523582824|gb|CP003802.1|_1 # 207 # 395 # 1 # ID=1_1;part...
## [["CP003913"]] gi|440453185|gb|CP003913.1|_1 # 692 # 1834 # 1 # ID=1_1;par...

Apart from that the object really needs some grouping before there is anything interesting to do with it.

2.2 Object defaults

A lot of the methods in FindMyFriends contains similar parameters, and a lot of these. To save typing when doing the analyses, these parameters can be set on a per-object manner. At creation a set of defaults is already set, which can be queried and modified:

# Query current defaults
head(defaults(mycoPan))
## $groupPrefix
## [1] "OG"
## 
## $coreThreshold
## [1] 1
## 
## $kmerSize
## [1] 5
## 
## $lowerLimit
## [1] 0.5
## 
## $algorithm
## [1] "infomap"
## 
## $flankSize
## [1] 4
# Set a new default
defaults(mycoPan)$lowerLimit <- 0.6

Values supplied as arguments to a method will always override object defaults.

2.3 Calculating pangenomes

Pangenome calculation is generally split into two steps; one focusing on a coarse clustering of genes based on sequence similarity and one focusing on refining this clustering by comparing sequences and chromosomal neighborhood between genes in the clusters. Currently the fastest approach to the first step is based on the CD-Hit algorithm (cdhitGrouping), but other approaches exists (gpcGrouping and graphGrouping). There is no advantages to using the slower algorithms, so there use is generally not encouraged. cdhitGrouping works by repeatedly combining gene groups based on lower and lower similarity threshold. At each step the longest member of each gene group is chosen as a representative for the group in the following step. You generally want to set the lowest threshold rather low to ensure that genes belonging to the same group are definitely clustered together.

mycoPan <- cdhitGrouping(mycoPan, kmerSize = 5, cdhitIter = TRUE, nrep = 3)

Looking at our mycoPan object now reveals some additional information about the gene groups:

mycoPan
## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 3142 gene groups defined
## 
##      Core|
## Accessory|===========================================:
## Singleton|======
## 
## Genes are translated

This is not the end of our analysis, as these gene groups needs to be refined. The refinement step is done with the neighborhoodSplit function (or kmerSplit if chromosomal position is unavailable). This function investigates each gene group and compares the member genes based on sequence similarity, chromosomal neighborhood similarity, sequence length and genome mebership, and uses this information to build up a weighted graph with the member genes as nodes. Edges reflect similarity and are only created if the two genes passes all thresholds with regards to the different comparisons. From this graph cliques (complete subgraphs i.e. subgraphs were all nodes are connected to each other) are extracted sequentially, starting with the clique containing the highest weighted edges. This ensures that all members of the resulting gene groups share a similarity with all other members and avoids the inclusion of genes being very similar to a few genes but not all. After this main step highly similar gene groups lying in parallel (sharing a gene group either up- or downstream) are merged together to create the final grouping.

mycoPan <- neighborhoodSplit(mycoPan, lowerLimit = 0.8)

As can be seen this step results in an increase in the number of gene groups:

mycoPan
## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 3399 gene groups defined
## 
##      Core|
## Accessory|==========================================:
## Singleton|=======:
## 
## Genes are translated

In general you should expect that the use of FindMyFriends will result in a higher number of gene groups than other algorithms. This is due to its strict approach to defining similarity and in general you can expect the final result to be of very high quality.

2.4 Pangenome post-processing

Once the pangenome has been calculated it is possible to perform a range of different post-processing steps and analyses on results.

2.4.1 Paralogue linking

FindMyFriend forcefully avoids grouping genes from the same genome together. While it is generally a boon to split up paralogue gene groups, as the genes might have different functionality within the cell (and probably be under different regulation), it can be nice to maintain a link between groups with a certain similarity. Such a link can be obtained by calculating a kmer similarity between representatives from each gene group. Once paralogue links have been created they can be used when extracting raw sequences and the information will be taken into account when calculating organism statistics. Furthermore it is possible to merge paralogues into true gene groups (effectively undoing the neighborhood splitting).

mycoPan <- kmerLink(mycoPan, lowerLimit=0.8)

genes(mycoPan, split='paralogue')[[1]]
##   A AAStringSet instance of length 10
##      width seq                                           names               
##  [1]    61 MQIQEKIATILDTFTELSAEL...LRERKKQYAFYCDYLLNPKN* gi|358640274|dbj|...
##  [2]   205 MKLEAPKFTNRVFKNRPKLTH...TNGGLIGKINDYDFHGEYVT* gi|358640274|dbj|...
##  [3]    75 MQGTLAEIELDFPSKKIQEKI...LRERKKQYVFYSDYLLNPKN* gi|358640274|dbj|...
##  [4]    57 MQIQEKIATILDTFTELSAEL...LRERKKQYAFYRDYLLNPKN* gi|301633103|gb|C...
##  [5]   174 MAEIPIDFPPLKIQEKIATIL...TNGGLIGKINDYDFHGEYVT* gi|301633103|gb|C...
##  [6]    67 MQGTLAEIELDFPSKKIQEKI...LRERKKQYVFYSDYLLNPKN* gi|301633103|gb|C...
##  [7]    57 MQIQEKIATILDTFTELSAEL...LRERKKQYAFYRDYLLNLKN* gi|440453185|gb|C...
##  [8]   197 MKLEAPKFTNRVFKNRPKLTH...TNGGLIGKINDYDFHGEYVT* gi|440453185|gb|C...
##  [9]    63 MQGTLAEIELDFTSKKIQEKI...LRERKKQYVFYSDYLLNPKN* gi|440453185|gb|C...
## [10]   198 MEAPKFVNNACPIPNLNLSRT...TNDGELGRIKDCDFDGEYIT* gi|440453185|gb|C...
mycoPan
## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 3399 gene groups defined
## 
##      Core|
## Accessory|==========================================:
## Singleton|=======:
## 
## Genes are translated
collapseParalogues(mycoPan, combineInfo='largest')
## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 3240 gene groups defined
## 
##      Core|
## Accessory|============================================
## Singleton|======
## 
## Genes are translated

2.4.2 Removing genes

Usually genes are detected automatically by programs suchs as Prodigal and Glimmer, but while these programs are generally good, they are not perfect. A pangenome analysis can potentially reveal genes that, while annotated, are no longer functioning due to frameshifts etc. These will be apparant as members of gene groups that have vastly different length than the majority of the members. Removing such genes will in general improve whatever downstream analysis that the data will be used for. Using the removeGene() function it is possible to remove single or sets of genes from your pangenome object.

# Remove a gene by raw index
removeGene(mycoPan, ind=60)
## An object of class pgFullLoc
## 
## The pangenome consists of 12246 genes from 9 organisms
## 3399 gene groups defined
## 
##      Core|
## Accessory|==========================================:
## Singleton|=======:
## 
## Genes are translated
# Remove the first organism by index
removeGene(mycoPan, organism=1)
## An object of class pgFullLoc
## 
## The pangenome consists of 10911 genes from 8 organisms
## 3310 gene groups defined
## 
##      Core|
## Accessory|============================================
## Singleton|======
## 
## Genes are translated
# or by name
name <- orgNames(mycoPan)[1]
removeGene(mycoPan, organism=name)
## An object of class pgFullLoc
## 
## The pangenome consists of 10911 genes from 8 organisms
## 3310 gene groups defined
## 
##      Core|
## Accessory|============================================
## Singleton|======
## 
## Genes are translated
# Remove the second member of the first gene group
removeGene(mycoPan, group=1, ind=2)
## An object of class pgFullLoc
## 
## The pangenome consists of 12246 genes from 9 organisms
## 3399 gene groups defined
## 
##      Core|
## Accessory|==========================================:
## Singleton|=======:
## 
## Genes are translated

2.5 Investigating the results

Having a nice structure around your pangenome data is just a part of a great framework - having a set of analyses at hand for investigating the results is another part. Luckily, by being part of Bioconductor and using the standard data representations, a lot of the functionality already provided in Bioconductor is at your disposal. The two main data types that you might want to extract in order to do further analysis is the pangenome matrix and raw sequences:

# Get the pangenome matrix as an ExpressionSet object
as(mycoPan, 'ExpressionSet')
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 3399 features, 9 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: AE017243 AE017244 ... CP003913 (9 total)
##   varLabels: nGenes Id ... GenBankDivision (14 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: OG1 OG2 ... OG3399 (3399 total)
##   fvarLabels: description group ... nGenes (7 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
# or as a regular matrix
as(mycoPan, 'matrix')[1:6, ]
##     AE017243 AE017244 AE017332 AP012303 CP002077 CP002274 CP003131 CP003802
## OG1        8        8        2        0        0        9        9        0
## OG2        8        8        1        0        0        9        9        0
## OG3        2        2        2        0        0        2        2        2
## OG4        1        1        1        0        0        2        2        2
## OG5        2        3        2        0        0        1        1        0
## OG6        2        3        2        0        0        1        1        0
##     CP003913
## OG1        0
## OG2        0
## OG3        0
## OG4        0
## OG5        0
## OG6        0
# Get all genes split into gene groups
genes(mycoPan, split='group')
## AAStringSetList of length 3399
## [["OG1"]] gi|71851486|gb|AE017243.1|_567 # 377235 # 377504 # -1 # ID=1_567...
## [["OG2"]] gi|71851486|gb|AE017243.1|_568 # 377541 # 378728 # -1 # ID=1_568...
## [["OG3"]] gi|71851486|gb|AE017243.1|_197 # 135429 # 135674 # -1 # ID=1_197...
## [["OG4"]] gi|71851486|gb|AE017243.1|_332 # 219147 # 219662 # 1 # ID=1_332;...
## [["OG5"]] gi|71851486|gb|AE017243.1|_1081 # 726907 # 727254 # -1 # ID=1_10...
## [["OG6"]] gi|71851486|gb|AE017243.1|_1078 # 725485 # 725850 # -1 # ID=1_10...
## [["OG7"]] gi|71851486|gb|AE017243.1|_1080 # 726349 # 726723 # -1 # ID=1_10...
## [["OG8"]] gi|71851486|gb|AE017243.1|_1083 # 727841 # 727978 # -1 # ID=1_10...
## [["OG9"]] gi|71851486|gb|AE017243.1|_636 # 428809 # 428922 # -1 # ID=1_636...
## [["OG10"]] gi|71851486|gb|AE017243.1|_87 # 57229 # 58629 # 1 # ID=1_87;par...
## ...
## <3389 more elements>

Apart from what is already available within Bioconductor, FindMyFriends also comes with a set of tools to investigate the results further. Two of these calculates different statistics on the two main gene groupings in the dataset, namely gene groups and organisms. These functions are called groupStat()and orgStat() and are very straightforward to use:

groupStat(mycoPan)[[1]]
## $maxOrg
## [1] 9
## 
## $minLength
## [1] 90
## 
## $maxLength
## [1] 90
## 
## $sdLength
## [1] 0
## 
## $genes
##  [1] 2834 1552 8402 7037 8491 7127 1723 1733 1761 7272 8636 8738  567 7381
## [15]  594 1925 7428 8795 7528 8894 7559 8925 7588 8954  809  929 2281  987
## [29] 1035 1055 3765 1085 9266 7902 2600 2677
## 
## $backward
##  [1] "OG686"  "OG845"  "OG2"    "OG2"    "OG790"  "OG790"  "OG2"    "OG1073"
##  [9] "OG2"    "OG580"  "OG580"  "OG2"    "OG1103" "OG2378" "OG525"  "OG2"   
## [17] "OG3007" "OG3066" "OG744"  "OG744"  "OG2"    "OG2"    "OG2"    "OG2"   
## [25] "OG2"    "OG2"    "OG856"  "OG2"    "OG566"  "OG297"  "OG8"    "OG2"   
## [33] "OG8"    "OG3347" "OG91"   "OG166" 
## 
## $forward
##  [1] "OG2"    "OG2"    "OG316"  "OG316"  "OG2"    "OG2"    "OG1757" "OG2"   
##  [9] "OG1167" "OG2"    "OG2"    "OG3337" "OG2"    "OG2"    "OG2"    "OG3383"
## [17] "OG2"    "OG2"    "OG2"    "OG2"    "OG23"   "OG23"   "OG58"   "OG58"  
## [25] "OG920"  "OG1206" "OG2"    "OG3154" "OG2"    "OG2"    "OG2932" "OG6"   
## [33] "OG2"    "OG2"    "OG2"    "OG2"
head(orgStat(mycoPan))
##          nGenes minLength maxLength sdLength nGeneGroups nParalogues
## AE017243   1336        30      1824 121.1718        1316          13
## AE017244   1351        30      1808 121.5220        1325          14
## AE017332   1316        30      1825 123.5957        1310          10
## AP012303   1407        30      1141 121.8308        1404          18
## CP002077   1392        30      1141 121.6465        1389          18
## CP002274   1367        30      1815 123.5032        1348           7

To get a very broad overview of your result plotStat() can help you:

plotStat(mycoPan, color='Species', type='qual', palette=6)

Three other plot functions exists to let you asses general properties of the pangenome. plotEvolution() lets you follow the number of singleton, accessory and core genes as the number of organisms increases. Generally this type of plot is very biased towards the order of organisms, so while it is possible to supply a progression of organisms, the default approach is to create bootstraped values for the plot:

plotEvolution(mycoPan)
## `geom_smooth()` using method = 'loess'

plotSimilarity() creates a heatplot of the similarity matrix of the organisms in the pangenome. The similarity of two organisms is here defined as either the percent of shared genes or the cosine similarity of their total kmer feature vector. The ordering can be done manually or automatically according to a hierachcal clustering:

# Pangenome matrix similarity
plotSimilarity(mycoPan)

# Kmer similarity
plotSimilarity(mycoPan, type='kmer', kmerSize=5)