Ben Morris' notebook: Hour challenge: NCBI taxonomy tree

Friday, February 15, 2013

Hour challenge: NCBI taxonomy tree

Today I'm planning a more science-y and useful hour challenge. Over the course of one hour, I'm going to transform the NCBI taxonomy from a delimited text dump into a Newick tree that can be manipulated by phylogenetic tools. Other people have done this, and Newick strings created from the NCBI taxonomy can be downloaded from external sites, but the taxonomy is constantly updated, so it would be nice to have a reproducible process to regenerate the tree whenever necessary.
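To make the idea concrete, here's a rough sketch of what such a conversion could look like. It is not the code from this challenge: it assumes the layout of the NCBI taxdump files (`nodes.dmp` and `names.dmp`, with fields separated by tab-pipe-tab), and the function names, sample rows, and toy IDs are all made up for illustration. Label handling is deliberately simplistic (spaces become underscores).

```python
from collections import defaultdict

def parse_dmp(lines):
    """Yield the fields of each taxdump-style row ('\\t|\\t' separated, '\\t|' terminated)."""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]
        yield line.split("\t|\t")

def build_newick(node_lines, name_lines):
    """Build a Newick string from nodes.dmp-style and names.dmp-style rows."""
    # Keep only the scientific name for each taxon ID.
    names = {}
    for fields in parse_dmp(name_lines):
        tax_id, name, name_class = fields[0], fields[1], fields[3]
        if name_class == "scientific name":
            names[tax_id] = name

    # Map each parent ID to its children; the root is its own parent.
    children = defaultdict(list)
    root = None
    for fields in parse_dmp(node_lines):
        tax_id, parent_id = fields[0], fields[1]
        if tax_id == parent_id:
            root = tax_id
        else:
            children[parent_id].append(tax_id)

    def label(tax_id):
        # Simplistic: fall back to the ID, and swap spaces for underscores.
        return names.get(tax_id, tax_id).replace(" ", "_")

    # Iterative post-order walk, so very deep trees don't hit the recursion limit.
    newick = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if expanded or not kids:
            inner = ",".join(newick.pop(k) for k in kids)
            newick[node] = ("(" + inner + ")" if kids else "") + label(node)
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return newick[root] + ";"

# Hand-written toy rows in the taxdump layout (not real NCBI records).
SAMPLE_NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tgenus\t|\n",
    "11\t|\t10\t|\tspecies\t|\n",
    "12\t|\t1\t|\tgenus\t|\n",
]
SAMPLE_NAMES = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "1\t|\tall\t|\t\t|\tsynonym\t|\n",
    "10\t|\tAlpha\t|\t\t|\tscientific name\t|\n",
    "11\t|\tAlpha one\t|\t\t|\tscientific name\t|\n",
    "12\t|\tBeta\t|\t\t|\tscientific name\t|\n",
]

print(build_newick(SAMPLE_NODES, SAMPLE_NAMES))
```

With the real dump you would pass open file handles for `nodes.dmp` and `names.dmp` instead of the toy lists. Building one big string in memory is fine for a sketch, but a tree with around a million nodes would probably be better written out incrementally.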

I have other things going on and I haven't decided exactly when I'm going to do this today, but it'll happen. I think there's a good chance that this is the first one I actually finish in an hour, too. When I decide on the timing, and when I complete the project, I'll update this post with links. As always, everything will be done on GitHub so you can watch me tackle this live if you have nothing better to do.

While we're on the subject of the NCBI taxonomy...

Update 1: Busy day. The plan is to get started at 7 PM Eastern. So, theoretically, I should be finished by 8.

Update 2: Started at 7:15 and finished promptly at 8:15, so this was a success. It actually required fixing a bug in the BioPython Newick writer: node labels weren't being quoted when they contained characters that aren't valid in an unquoted label, such as spaces or parentheses. So, in addition to the 989,621-node NCBI Newick tree, I also generated a bug fix for BioPython.
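For anyone curious what the quoting rule looks like, here's a small sketch. It is not the BioPython patch itself, just an illustration of the usual Newick convention as I understand it: a label containing whitespace, parentheses, brackets, a colon, semicolon, comma, or single quote is wrapped in single quotes, and any single quote inside it is doubled. The function name `quote_label` is made up.

```python
import re

# Characters that can't appear in an unquoted Newick label.
_NEEDS_QUOTES = re.compile(r"[\s()\[\]':;,]")

def quote_label(label):
    """Return the label, single-quoted if it contains Newick-special characters."""
    if _NEEDS_QUOTES.search(label):
        # Inside a quoted label, a literal single quote is written twice.
        return "'" + label.replace("'", "''") + "'"
    return label

for name in ["Homo_sapiens", "Homo sapiens", "Alpha (group one)", "O'Brien"]:
    print(quote_label(name))
```

Without this, a name with parentheses in it would be read as extra tree structure by any parser that loads the file.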

The code is available at: https://github.com/bendmorris/ncbi_taxonomy

6 comments:

Unknown, February 15, 2013 at 10:18 AM
looking forward to it
Unknown, February 15, 2013 at 8:29 PM
nice, how long does the python script take to run?