I have other things going on and I haven't decided exactly when I'm going to do this today, but it'll happen. I think there's a good chance that this is the first one I actually finish in an hour, too. When I decide on the timing, and when I complete the project, I'll update this post with links. As always, everything will be done on GitHub so you can watch me tackle this live if you have nothing better to do.
While we're on the subject of the NCBI taxonomy...
Update 1: Busy day. The plan is to get started at 7 PM Eastern. So, theoretically, I should be finished by 8.
Update 2: Started at 7:15 and finished promptly at 8:15, so this was a success. This actually required fixing a bug in the BioPython Newick writer: node labels weren't being quoted when they contained characters that are only valid inside quotes, such as spaces or parentheses. So, in addition to the 989,621-node NCBI Newick tree, I also generated a bug fix for BioPython.
The code is available at: https://github.com/bendmorris/ncbi_taxonomy
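For anyone curious what the quoting fix in Update 2 amounts to, here is a minimal sketch of the idea. It is not the actual BioPython patch. The function name `quote_newick_label` and the exact set of "unsafe" characters are my assumptions, based on the usual Newick convention that labels containing whitespace or punctuation are wrapped in single quotes, with embedded single quotes doubled.

```python
import re

# Assumed set of characters that cannot appear in an unquoted Newick label:
# whitespace, parentheses, square brackets, single quote, colon, semicolon, comma.
_UNSAFE = re.compile(r"[\s()\[\]':;,]")


def quote_newick_label(label):
    """Return the label, single-quoted if it contains unsafe characters.

    Hypothetical helper for illustration only; not part of BioPython.
    """
    if _UNSAFE.search(label):
        # Inside a quoted label, a single quote is escaped by doubling it.
        return "'" + label.replace("'", "''") + "'"
    return label


if __name__ == "__main__":
    for name in ["Homo", "Homo sapiens", "Bacteria (kingdom)"]:
        print(quote_newick_label(name))
```

Without something like this, a name such as "Bacteria (kingdom)" gets written straight into the tree string, and the parentheses are then read back as tree structure instead of as part of the label.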
looking forward to it
nice, how long does the Python script take to run?
After downloading the files, it took about a minute and 20 seconds.
awesome
so are there plans to provide this via an API? It's such a large tree, R no likey as you would guess. Tried to read it in Python:
In [1]: import Bio.Phylo as bp
In [2]: from Bio.Phylo import Newick
In [3]: tree = bp.read('path/to/ncbi_taxonomy.newick', 'newick')
but lots of errors....
Is this how you read in newick trees in Python?
You're doing everything right - the error reading in the tree should go away once BioPython accepts my pull request. Still, the whole tree is really too large to be very useful as is.
Step 2: right now I'm working on putting up a web server backed by the RDF treestore, and I just converted the Newick string into RDF (almost 2 GB). So, once we launch (hopefully within the next month or two) you'll be able to get a subtree by providing a list of taxa, and it should be very fast.
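To give a rough idea of what "get a subtree by providing a list of taxa" means, here is a small pure-Python sketch. It says nothing about how the RDF treestore actually works. The function `induced_subtree`, the child-to-parent dictionary layout, and the toy taxonomy are all assumptions made for illustration.

```python
def induced_subtree(parent, taxa):
    """Collect the child -> parent edges linking each requested taxon to the root.

    `parent` maps every non-root node to its parent (hypothetical layout).
    The result is not trimmed to the most recent common ancestor.
    """
    edges = {}
    for taxon in taxa:
        node = taxon
        # Walk upward until we hit the root or a node we have already visited.
        while node in parent and node not in edges:
            edges[node] = parent[node]
            node = parent[node]
    return edges


# Toy taxonomy for illustration only; "Primates" acts as the root here.
EXAMPLE_PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Pan": "Hominidae",
    "Hominidae": "Primates",
    "Lemur": "Lemuridae",
    "Lemuridae": "Primates",
}

if __name__ == "__main__":
    print(induced_subtree(EXAMPLE_PARENT, ["Homo sapiens", "Pan"]))
```

Because each walk stops as soon as it reaches a node already collected, the work is proportional to the size of the returned subtree, not to the size of the whole taxonomy.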