Creating pangenomes using FindMyFriends

FindMyFriends is an R package for creating pangenomes of large sets of microbial organisms, together with tools to investigate the results. The number of tools to create pangenomes is large and some of the more notable ones are OrthoMCL, PanOCT, inParanoid and Sybil. Within the R framework micropan (available from CRAN) is currently the only other entry, but the scope of FindMyFriends is much broader.

Within comparative microbiol genomics, a pangenome is defined as a collection of all the genes present in a set of related organisms and grouped together by some sort of homology. This homology measure has mainly been some sort of blast derived similarity, but as you shall see here, this is not the only possibility.

Pangenomes have many uses, both as organisational tools and as a mean to get insight into the collective genomic information present within a group of organisms. An example of the former is to ensure an equal functional annotation of all genes within a set of genomes by annotating pangenome gene groups rather than single genes. An example of the latter is to use a pangenome together with phenotypic information to discover new links between genes and phenotypes.

Usually pangenomes are calculated from the result of blasting all genes in the organisms under investigation against each other. This step is very time-consuming due to the complexity of the BLAST algorithm. FindMyFriends is one of the few tools that take a different approach and instead relies on an alignment free methodology. In FindMyFriends gene similarities are calculated using kmer feature vectors. A kmer feature vector is a decomposition of a sequence into substrings of length k.

With this decomposition the problem of calculating sequence similarity is reduced to a vector similarity problem, for which there are many different solution. The one employed in FindMyFriends is called cosine similarity, which is effectively the cosine to the angle between the two vectors. If the vector space is strictly positive then the value is bound between 0 and 1 with 1 being 100 % similarity.

How to convert similarity measures into groupings is again a problem with many solutions. FindMyFriends opts for a for a graph theory solution, using the infomap community detection algorithm. This is similar in theory to OrthoMCL, though the algorithm differs (Infomap vs. Markov Clustering). It should be noted that the choice of community detection algorithm is open in FindMyFriends as all algorithms implemented within the igraph framework is available - Infomap is highly adviced though.

There is a lot more to FindMyFriends than this though - some of it will be discussed in the following.

Data to use in this vignette

This vignette will use a collection of 10 different Mycoplasma pneumonia and hyopneumonia strains. These organisms have an appreciable small genome making it easier to distribute and analyse.

# Unpack files
location <- tempdir()
unzip(system.file('extdata', 'Mycoplasma.zip', package='FindMyFriends'),
      exdir=location)
genomeFiles <- list.files(location, full.names=TRUE, pattern='*.fasta')

Creating a Pangenome object from a set of fasta files

The first step in performing a pangenomic analysis with FindMyFriends is to load gene data into a Pangenome object. This is done with the aptly named pangenome function, which takes a list of fasta files, a boolean indicating whether the genes are translated and optionally information about the position of each gene on the chromosome and a boolean indicating whether to read all sequences into memory.

library(FindMyFriends)
library(igraph) # some return values are igraph objects

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

mycoPan <- pangenome(genomeFiles[1:9], 
                     translated=TRUE, 
                     geneLocation='prodigal', 
                     lowMem=FALSE)

## Found more than one class "file" in cache; using the first, from namespace 'BiocGenerics'

## Found more than one class "file" in cache; using the first, from namespace 'BiocGenerics'
## Found more than one class "file" in cache; using the first, from namespace 'BiocGenerics'

mycoPan

## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## Gene groups not yet defined
## 
## Genes are translated

The last parameters of the function call warrants some more explanation. Some of the algorithms in FindMyFriends uses positional information and this must be supplied with the sequenceInfo parameter. This can be done in a number of ways: As a data.frame with one row per gene, as a function that takes the header info of each gene from the fasta file and returns a data.frame with on row per gene, or as a string identifying the annotator used to find the gene (currently only Prodigal supported). The gene info has the following format (all columns required):

head(geneLocation(mycoPan))

##                       contig start  end strand
## 1 gi|71851486|gb|AE017243.1|   207  395      1
## 2 gi|71851486|gb|AE017243.1|   825 1109      1
## 3 gi|71851486|gb|AE017243.1|  1131 1409      1
## 4 gi|71851486|gb|AE017243.1|  1791 2879      1
## 5 gi|71851486|gb|AE017243.1|  2983 4521      1
## 6 gi|71851486|gb|AE017243.1|  4582 4842      1

The contig column holds information on the contig/chromosome on which the gene is located, the start and stop columns gives the start and stop position of translation and the strand column identifies on which strand the gene is located. Additional columns are allowed, but is currently unused.

The last parameter (lowMem) specifies whether all sequences should be read into memory (the default) or be kept as references to the original fasta files. The latter is useful in cases where the size of the sequence data becomes prohibitively large but is not needed for our little toy example.

Metadata about the organisms can be added and queried using the addOrgInfo method. For instance we might add some taxonomy data.

orgMeta <- data.frame(Species = c("hyopneumoniae", "hyopneumoniae", 
                                  "hyopneumoniae", "pneumoniae", 
                                  "pneumoniae", "hyopneumoniae", 
                                  "hyopneumoniae", "hyopneumoniae", 
                                  "pneumoniae"))

mycoPan <- addOrgInfo(mycoPan, orgMeta)
head(orgInfo(mycoPan))

##          nGenes       Species
## AE017243   1336 hyopneumoniae
## AE017244   1351 hyopneumoniae
## AE017332   1316 hyopneumoniae
## AP012303   1407    pneumoniae
## CP002077   1392    pneumoniae
## CP002274   1367 hyopneumoniae

The raw sequences are easily accessible as an XStringSet object, or split by organism into an XStringSetList object:

genes(mycoPan)

##   A AAStringSet instance of length 12247
##         width seq                                      names               
##     [1]    63 MQTNKNNLKVRTQQIRQQI...IIIDFTDLIAKQEVISR* gi|71851486|gb|AE...
##     [2]    95 MSQVDAFLFDDIQGLANKQ...LRIFKHKLVEEKLEKHI* gi|71851486|gb|AE...
##     [3]    93 MSKHFRNSIRELEGALKSI...KSEKRGKNIVHARDIAI* gi|71851486|gb|AE...
##     [4]   363 MQSAILNNINSPLSSFFLK...VSPEKPEICQLVGLVLV* gi|71851486|gb|AE...
##     [5]   513 MSKKSKNSSIEFDAIVVGG...AKKYNLEKRISKLKLIS* gi|71851486|gb|AE...
##     ...   ... ...
## [12243]   391 MFSFFKQIFKSLKKFFFLL...FTLDTSKLSHLDKEEFD* gi|440453185|gb|C...
## [12244]   222 MTNGITNQLICDHIDLKIA...IVADYLNQRPKTINEIN* gi|440453185|gb|C...
## [12245]   440 MEQFSAFKLLLKKQYETTL...MIEKDDSLQDIITSLVI* gi|440453185|gb|C...
## [12246]   251 MAISKKKRFFFDLAQDEDD...IPISRILTMILDQVIEE* gi|440453185|gb|C...
## [12247]   271 MIISFVNNKGGVLKTTMAT...YLEITKEILNLVNHGHQ* gi|440453185|gb|C...

genes(mycoPan, split = 'organism')

## AAStringSetList of length 9
## [["AE017243"]] gi|71851486|gb|AE017243.1|_1 # 207 # 395 # 1 # ID=1_1;par...
## [["AE017244"]] gi|71913466|gb|AE017244.1|_1 # 207 # 395 # 1 # ID=1_1;par...
## [["AE017332"]] gi|53987142|gb|AE017332.1|_1 # 1 # 189 # 1 # ID=1_1;parti...
## [["AP012303"]] gi|358640274|dbj|AP012303.1|_1 # 691 # 1833 # 1 # ID=1_1;...
## [["CP002077"]] gi|301633103|gb|CP002077.1|_1 # 691 # 1833 # 1 # ID=1_1;p...
## [["CP002274"]] gi|312600973|gb|CP002274.1|_1 # 1 # 189 # 1 # ID=1_1;part...
## [["CP003131"]] gi|506957411|gb|CP003131.1|_1 # 1 # 189 # 1 # ID=1_1;part...
## [["CP003802"]] gi|523582824|gb|CP003802.1|_1 # 207 # 395 # 1 # ID=1_1;pa...
## [["CP003913"]] gi|440453185|gb|CP003913.1|_1 # 692 # 1834 # 1 # ID=1_1;p...

Apart from that the object really needs some grouping before there is anything interesting to do with it.

Object defaults

A lot of the methods in FindMyFriends contains similar parameters, and a lot of these. To save typing when doing the analyses, these parameters can be set on a per-object manner. At creation a set of defaults is already set, which can be queried and modified:

# Query current defaults
head(defaults(mycoPan))

## $groupPrefix
## [1] "OG"
## 
## $kmerSize
## [1] 4
## 
## $lowerLimit
## [1] 0.5
## 
## $algorithm
## [1] "infomap"
## 
## $flankSize
## [1] 4
## 
## $minFlank
## [1] 0

# Set a new default
defaults(mycoPan)$lowerLimit <- 0.6

Values supplied as arguments to a method will always override object defaults.

Calculating pangenomes

FindMyFriends provide two approaches to creating the initial pangenome. Both of these are based on comparing normalised kmer counts between sequences, rather than doing a blast between all combinations of strains. Because kmer computations are so much faster than blast, this results in an enormous speed increase compared to other current tools.

All-vs-all kmer comparison

The first approach is very akin to the idea behind blasting all genes against each other. The major difference is that instead of deriving a similarity score from blasting two sequences against each other, the cosine similarity for the kmer feature vectors of the two sequences is used.

mycoSim <- kmerSimilarity(mycoPan, lowerLimit=0.8, rescale=FALSE)

The result of kmerSimilarity() will be a similarity matrix which can then be used to construct a grouping. FindMyFriends adopts a graph based approach, where the similarity matrix is translated to an undirected, weighted graph and community detection is used to split nodes (genes) into groups. The two-pass approach makes it possible to intercept the grouping and define your own algorithm though. Furthermore it makes it possible to define a similarity matrix based on other computations than kmer cosine similarity (e.g. blast), and still use the graph grouping from FindMyFriends.

mycoPan <- graphGrouping(mycoPan, mycoSim)
mycoPan

## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 3216 gene groups defined
## 
##      Core|
## Accessory|============================================
## Singleton|======
## 
## Genes are translated

FindMyFriends support all community detection algorithms implemented in iGraph, but the Infomap algorithm in particular has recieved praise for its performance (in areas other than pangenome grouping though). The MCL algorithm is unfortunately not available in iGraph so a direct comparison to OrthoMCL is not possible, but a recent article comparing different community detection algorithms indicated better performance using infomap (Aldecoa & Marin, 2013).

Guided Pairwise Comparison

While using a kmer based approach instead of blast yields a huge speed improvement, it doesn’t solve one of the biggest problems with pangenome calculation, mainly the way it scales. When doing all-vs-all comparisons a doubling in the number of elements will result in 4 times the number of computations. This means that as you continue to add new genomes the computational time will increase at a faster and faster rate. So while time has been cut drastically by using kmers instead of blast, there will come a time when the computer will just give up.

FindMyFriends does away with this by introducing a new approach called Guided Pairwise Comparison (GPC). It works by building a tree of the organisms in the set, and recursively combine pangenomes of subtrees into new pangenomes. Each time two pangenomes are combined, representatives for each gene group in the two pangenomes are chosen at random and compared to each other using kmer similarity. The resulting grouping is then propagated to genes in the two pangenomes. As the number of comparisons are related to the number of branching points on the tree, and the number of branching point is equal to the number of organisms minus one, the algorithm is approximatly linear in scaling. True linearity would only exist in the case where the number of gene groups are constant for all generated pangenomes. As it is expected that the number of gene groups increases as more and more distant pangenomes are merged there will be a small deviation from linearity.

mycoPan <- gpcGrouping(mycoPan, lowerLimit=0.5)
mycoPan

## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 2863 gene groups defined
## 
##      Core|
## Accessory|===============================================
## Singleton|===
## 
## Genes are translated

It is possible to supply the tree yourself using the tree argument, but omitting it will result in the algorithm creating one itself based on kmer count of the combined proteins for each organism. As can be seen, due to the nature of the algorithm, grouping is done directly instead of the two-pass approach used in the standard all-vs-all algorithm.

As the algorithm scales with the number of gene groups it is generally advised to keep the grouping very liberal by providing a low lowerLimit (e.g. 0.5, the default). This will result in fewer, more heterogenous groups, but this will be resolved at a later stage (see Gene group splitting).

Manual grouping

It is also possible to define the grouping of genes manually and thus import results from other tools to use in the FindMyFriends framework. The supplied information can either be a vector of integers giving the group membership for each gene, or a list of vectors giving the indexes of genes for each group.

# Compute grouping based on hierarchical clustering - only for illustrative
# purpose. Not advisable.
distMat <- as.matrix(1/mycoSim-1)
distMat[is.infinite(distMat)] <- 1e5
htree <- hclust(as.dist(distMat), method='ward.D2')
members <- cutree(htree, k=2000)

# Add membership manually
mycoPanMan <- manualGrouping(mycoPan, members)
mycoPanMan

## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 2000 gene groups defined
## 
##      Core|
## Accessory|==================================================
## Singleton|
## 
## Genes are translated

Pangenome post-processing

The initial results of either kmerSimilarity or GPC grouping can be further processed in a number of ways. Most notably gene groups can be split based on the similarity of the neighborhood of each gene within the group. This allows for splitting of paralogues into separate groups. Apart from that it is possible to link paralogue gene groups together, and add new organisms while retaining the gene group information.

Gene group splitting

If sequence location information has been provided it is possible to split up the gene groups created above into groups with similar neighborhood. The idea of using neighborhood to split paralogues has been pursued by PanOCT too, but in a slightly different way. There the blast hit score is weighted by the similarity of the neightborhood and incorporated into the initial grouping, whereas FindMyFriends provide it as a separate and optional step. When splitting by neighborhood, FindMyFriends builds up a of all genes within each group based on a range of calculations. Genes in the graph are only connected if they are not from the same organism, have a sequence similarity above a certain threshold and a neighborhood similarity above a certain threshold. Based on this graph, maximal cliques are detected and sorted by the quality of their edges, so that cliques with high internal sequence and neighborhood similarity are prefered. Cliques are then extracted as new groups until all members of the group has been assigned.

mycoPan <- neighborhoodSplit(mycoPan, lowerLimit=0.75)
mycoPan

## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 3450 gene groups defined
## 
##      Core|
## Accessory|==========================================:
## Singleton|=======:
## 
## Genes are translated

Paralogue linking

While it is generally a boon to split up paralogue gene groups, as the genes might have different functionality within the cell (and probably be under different regulation), it can be nice to maintain a link between groups with a certain similarity. Such a link can be obtained by calculating a kmer similarity between representatives from each gene group. Once paralogue links have been created they can be used when extracting raw sequences and the information will be taken into account when calculating organism statistics. Furthermore it is possible to merge paralogues into true gene groups (effectively undoing the neighborhood splitting).

mycoPan <- kmerLink(mycoPan, lowerLimit=0.8)

genes(mycoPan, split='paralogue')[[1]]

##   A AAStringSet instance of length 40
##      width seq                                         names               
##  [1]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|71851486|gb|AE...
##  [2]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|71851486|gb|AE...
##  [3]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|71851486|gb|AE...
##  [4]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEEYQKQ* gi|71851486|gb|AE...
##  [5]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|71851486|gb|AE...
##  ...   ... ...
## [36]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...
## [37]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...
## [38]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...
## [39]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...
## [40]   396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...

mycoPan

## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 3450 gene groups defined
## 
##      Core|
## Accessory|==========================================:
## Singleton|=======:
## 
## Genes are translated

collapseParalogues(mycoPan, combineInfo='largest')

## An object of class pgFullLoc
## 
## The pangenome consists of 12247 genes from 9 organisms
## 3178 gene groups defined
## 
##      Core|
## Accessory|=============================================
## Singleton|=====
## 
## Genes are translated

Removing genes

Usually genes are detected automatically by programs suchs as Prodigal and Glimmer, but while these programs are generally good, they are not perfect. A pangenome analysis can potentially reveal genes that, while annotated, are no longer functioning due to frameshifts etc. These will be apparant as members of gene groups that have vastly different length than the majority of the members. Removing such genes will in general improve whatever downstream analysis that the data will be used for. Using the removeGene() function it is possible to remove single or sets of genes from your pangenome object.

# Remove a gene by raw index
removeGene(mycoPan, ind=60)

## An object of class pgFullLoc
## 
## The pangenome consists of 12246 genes from 9 organisms
## 3450 gene groups defined
## 
##      Core|
## Accessory|==========================================:
## Singleton|=======:
## 
## Genes are translated

# Remove the first organism by index
removeGene(mycoPan, organism=1)

## An object of class pgFullLoc
## 
## The pangenome consists of 10911 genes from 8 organisms
## 3349 gene groups defined
## 
##      Core|
## Accessory|===========================================:
## Singleton|======:
## 
## Genes are translated

# or by name
name <- orgNames(mycoPan)[1]
removeGene(mycoPan, organism=name)

## An object of class pgFullLoc
## 
## The pangenome consists of 10911 genes from 8 organisms
## 3349 gene groups defined
## 
##      Core|
## Accessory|===========================================:
## Singleton|======:
## 
## Genes are translated

# Remove the second member of the first gene group
removeGene(mycoPan, group=1, ind=2)

## An object of class pgFullLoc
## 
## The pangenome consists of 12246 genes from 9 organisms
## 3450 gene groups defined
## 
##      Core|
## Accessory|==========================================:
## Singleton|=======:
## 
## Genes are translated

Investigating the results

Having a nice structure around your pangenome data is just a part of a great framework - having a set of analyses at hand for investigating the results is another part. Luckily, by being part of Bioconductor and using the standard data representations, a lot of the functionality already provided in Bioconductor is at your disposal. The two main data types that you might want to extract in order to do further analysis is the pangenome matrix and raw sequences:

# Get the pangenome matrix as an ExpressionSet object
as(mycoPan, 'ExpressionSet')

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 3450 features, 9 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: AE017243 AE017244 ... CP003913 (9 total)
##   varLabels: nGenes Species
##   varMetadata: labelDescription
## featureData
##   featureNames: OG1 OG2 ... OG3450 (3450 total)
##   fvarLabels: description group ... nGenes (7 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

# or as a regular matrix
as(mycoPan, 'matrix')[1:6, ]

##     AE017243 AE017244 AE017332 AP012303 CP002077 CP002274 CP003131
## OG1        1        1        1        0        0        1        1
## OG2        1        1        1        0        0        1        1
## OG3        1        1        1        0        0        1        1
## OG4        1        1        1        0        0        1        1
## OG5        1        1        1        0        0        1        1
## OG6        1        1        1        0        0        1        1
##     CP003802 CP003913
## OG1        1        0
## OG2        1        0
## OG3        1        0
## OG4        1        0
## OG5        1        0
## OG6        1        0

# Get all genes split into gene groups
genes(mycoPan, split='group')

## AAStringSetList of length 3450
## [["OG1"]] gi|71851486|gb|AE017243.1|_1 # 207 # 395 # 1 # ID=1_1;partial=...
## [["OG2"]] gi|71851486|gb|AE017243.1|_2 # 825 # 1109 # 1 # ID=1_2;partial...
## [["OG3"]] gi|71851486|gb|AE017243.1|_3 # 1131 # 1409 # 1 # ID=1_3;partia...
## [["OG4"]] gi|71851486|gb|AE017243.1|_4 # 1791 # 2879 # 1 # ID=1_4;partia...
## [["OG5"]] gi|71851486|gb|AE017243.1|_5 # 2983 # 4521 # 1 # ID=1_5;partia...
## [["OG6"]] gi|71851486|gb|AE017243.1|_6 # 4582 # 4842 # 1 # ID=1_6;partia...
## [["OG7"]] gi|71851486|gb|AE017243.1|_7 # 4992 # 5300 # 1 # ID=1_7;partia...
## [["OG8"]] gi|71851486|gb|AE017243.1|_8 # 5314 # 5904 # 1 # ID=1_8;partia...
## [["OG9"]] gi|71851486|gb|AE017243.1|_9 # 5908 # 6087 # 1 # ID=1_9;partia...
## [["OG10"]] gi|71851486|gb|AE017243.1|_10 # 6103 # 6297 # 1 # ID=1_10;par...
## ...
## <3440 more elements>

Apart from what is already available within Bioconductor, FindMyFriends also comes with a set of tools to investigate the results further. Two of these calculates different statistics on the two main gene groupings in the dataset, namely gene groups and organisms. These functions are called groupStat()and orgStat() and are very straightforward to use:

groupStat(mycoPan)[[1]]

## $maxOrg
## [1] 1
## 
## $minLength
## [1] 63
## 
## $maxLength
## [1] 63
## 
## $sdLength
## [1] 0
## 
## $genes
## [1]    1 1337 2688 6803 8170 9531
## 
## $backward
## [1] NA NA NA NA NA NA
## 
## $forward
## [1] "OG2" "OG2" "OG2" "OG2" "OG2" "OG2"

head(orgStat(mycoPan))

##          nGenes minLength maxLength sdLength nGeneGroups nParalogues
## AE017243   1336        30      1824 121.1718        1336          15
## AE017244   1351        30      1808 121.5220        1351          12
## AE017332   1316        30      1825 123.5957        1316          10
## AP012303   1407        30      1141 121.8308        1407          21
## CP002077   1392        30      1141 121.6465        1392          20
## CP002274   1367        30      1815 123.5032        1367           7

To get a very broad overview of your result plotStat() can help you:

plotStat(mycoPan, color='Species', type='qual', palette=6)

Three other plot functions exists to let you asses general properties of the pangenome. plotEvolution() lets you follow the number of singleton, accessory and core genes as the number of organisms increases. Generally this type of plot is very biased towards the order of organisms, so while it is possible to supply a progression of organisms, the default approach is to create bootstraped values for the plot:

plotEvolution(mycoPan)

plotSimilarity() creates a heatplot of the similarity matrix of the organisms in the pangenome. The similarity of two organisms is here defined as either the percent of shared genes or the cosine similarity of their total kmer feature vector. The ordering can be done manually or automatically according to a hierachcal clustering:

# Pangenome matrix similarity
plotSimilarity(mycoPan)

# Kmer similarity
plotSimilarity(mycoPan, type='kmer', kmerSize=5)

# No ordering
plotSimilarity(mycoPan, ordering='none')

plotTree() Plots a dendrogram of the organisms in the pangenome based either on the pangenome matrix or kmer counts. The tree can be augmented with additional information from either orgInfo or supplied manually:

plotTree(mycoPan, clust='ward.D2', dist='minkowski')

plotTree(mycoPan, type='kmer', kmerSize=5, clust='ward.D2', 
         dist='cosine', circular=TRUE, info='Species') + 
    ggplot2::scale_color_brewer(type='qual', palette=6)

Apart from these general statistics it is possible to get the neighborhood of any given gene group as a weighted graph to further investigate how the up- and downstream genes are organised.

library(igraph)
getNeighborhood(mycoPan, group=15, vicinity=5)

## IGRAPH DNW- 13 14 -- 
## + attr: name (v/c), centerGroup (v/l), weight (e/n)
## + edges (vertex names):
##  [1] OG10  ->OG11   OG11  ->OG12   OG12  ->OG13   OG12  ->OG14  
##  [5] OG13  ->OG14   OG13  ->OG1578 OG14  ->OG15   OG15  ->OG16  
##  [9] OG1578->OG15   OG16  ->OG17   OG17  ->OG18   OG18  ->OG19  
## [13] OG19  ->OG20   OG9   ->OG10

The resulting graph object can be plotted and handled by the functionality of the igraph package or plotted directly using FindMyFriends plotNeighborhood() method, which applies appropriate styling to the plot, making it easier to interpret:

plotNeighborhood(mycoPan, group=15, vicinity=5)

Panchromosomal analysis

While pangenomes are often envisioned as presence/absence matrices, this is not the only way to organise the information. As chromosomal position is available in most circumstances, the information can also be converted into a panchromosomal graph where each gene group is a vertex and chromosomal adjacency is weighted edges. The graph will generally be very sparse, consisting of long strings of gene groups interspersed with regions of higher variability.

panchromosome <- pcGraph(mycoPan)
plot(panchromosome, 
     layout=layout.drl(panchromosome, options = igraph.drl.coarsen),
     vertex.shape='none',
     vertex.label=NA)

This graph structure can be used for other things than random line drawing though. Often local regions of high variability can point to insertion/deletion events, frameshifts, problems with the gene grouping (god forbid) or general regions of high chromosomal plasticity. These regions can of course be detected in FindMyFriends and investigated accordingly.

localVar <- variableRegions(mycoPan, flankSize=6)
localVar[[1]]

## $type
## [1] "plastic"
## 
## $members
## [1] "OG1000" "OG1679" "OG3255" "OG998"  "OG999" 
## 
## $flank
## [1] "OG1000" "OG998" 
## 
## $connectsTo
## $connectsTo[[1]]
## [1] "OG1001" "OG1516"
## 
## $connectsTo[[2]]
## [1] "OG1678" "OG997" 
## 
## 
## $graph
## IGRAPH UNW- 5 6 -- 
## + attr: name (v/c), nMembers (v/n), organisms (v/x), strands
## | (v/x), flank (v/l), weight (e/n), organisms (e/x)
## + edges (vertex names):
## [1] OG1000--OG1679 OG1000--OG3255 OG1000--OG999  OG1679--OG3255
## [5] OG1679--OG998  OG998 --OG999

plot(localVar[[1]]$graph)

Dealing with huge datasets

Progress within computational efficiency is often followed by a quest to push these new boundaries to the fullest. FindMyFriends have been designed with this in mind and tries to cater to this as much as possible. The first option comes at the time the pangenome object is created, where it is possible to specify that sequences should not be kept in memory. This lowered footprint comes at the expense of slightly longer computational time as the sequence lookup is slower. It can become a necessity as the number of organisms increases (to date the full amount of sequenced E. coli weighs in just below 7 Gb in translated form). When using the lowMem option some calculations are also repeated instead of keeping the result in memory - most notably kmer count during GPC. For the E. coli example above the collective kmer count information in sparse format would occupy just below 50 Gb (with a kmer size of 5). Allocating this amount of memory (if possible) would often take longer than repeating a couple of computations. When utilizing parallelization (see below) this becomes even more of an issue as the memory footprint should be multiplied be the number of threads used (provided you dont use a cluster but a single multicore computer).

Parallelization

At certain computationally heavy steps it is possible to use parallelization based on the BiocParallel package interface. If a subclass of BiocParallelParam is supplied to either kmerSimilarity, gpcGrouping, orgTree or kmerLink the computations will be split out according to the supplied parameter. All of these are embarrassingly parallel and thus easy to distribute evenly.

Caching

When analysing huge number of organisms the computations can easily go on for days. In that time a lot of things can go wrong. The computer can crash due to prolonged high memory pressure, the power can be accidentally turned off etc. In order to avoid having to redo everything from scratch in these instances, FindMyFriends employs a caching mechanism courtesy of the filehash package. At the moment the caching is only implemented for the GPC algorithm as this algorithm will be the natural choice for huge datasets anyway. It works by providing a path to a folder where the cached results can be kept to the cacheDB argument in gpcGrouping(). It only works if the other parameters are kept across runs!

Good strategies for long running analyses

There is a few good strategies to optimise the analysis when analysing huge sets of organisms. In no particular order:

Precalculate the tree when using GPC instead of letting the algorithm do it.
Use caching.
Use lowMem=TRUE when creating the pangenome object.
Lower the lowerLimit compared to what you would normally do, and then rely on neighborhood splitting to resolve paralogues. This will cut computational time.
Use parallelization when available.
Be patient.