Wednesday, May 8, 2013

Automatic discovery and highlighting of meaningful clades in Bio.Phylo

A 2011 paper by Mattia Prosperi et al. [1] outlines a simple but effective algorithm for automatically partitioning a phylogeny into clades with high within-clade similarity relative to the rest of the tree. Essentially, given a percentile threshold, the algorithm traverses the tree until it finds a clade in which the median pairwise distance between taxa is less than a given percentile of the pairwise distances for the entire tree.

I'm playing around with some ideas in phylogeography which require such an algorithm, so I've implemented it in Python as part of a new package of phylogeography utilities I'd like to create. My implementation uses the Bio.Phylo library of Biopython [2]. Note that computing many distances across a large tree in Biopython is currently a bit slow because you end up traversing the tree over and over to find the path from the root to each tip, and distances and paths aren't cached; I have a pull request in that improves the efficiency, and you can use my development branch where it's already implemented.

Here's an example of the algorithm in action, various percentile thresholds:

In [1]:
import Bio.Phylo as bp
import geophy.cluster as g

tree = bp.read('test.new', 'newick')

g.color_clusters(tree, threshold=10, min_clade_size=1)
In [2]:
g.color_clusters(tree, threshold=30, min_clade_size=1)

Pretty cool. It seems like it would be possible to turn this into an optimization problem: find an optimal threshold value in order to maximize some other type of similarity between members of each clade, e.g. geographical location.

[1] Prosperi MC, et al. (2011). A novel methodology for large-scale phylogeny partition. Nature communications, 2 PMID: 21610724

[2] Talevich, Eric, et al. "Bio. Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython." BMC bioinformatics 13.1 (2012): 209.

No comments:

Post a Comment