Sunday, May 26, 2013

A simple Python phylopic.org wrapper

PhyloPic.org is an online repository of species silhouettes. I love the idea: silhouettes could be very useful for jazzing up figures, and the site has a fantastic search feature and API. This morning I wrote up a simple Python wrapper for the API: enter one or more taxon names, and it downloads a thumbnail image for each one.

In [1]:
import phylopic

phylopic.download_pics('Turdus migratorius', 'Anolis carolinensis', 'Panthera leo', img_format='thumb.png')
Downloading pic for Turdus migratorius... done!
Downloading pic for Anolis carolinensis... done!
Downloading pic for Panthera leo... done!
In [2]:
imshow(imread('Turdus migratorius.thumb.png'))
Out[2]:
<matplotlib.image.AxesImage at 0x37a6c10>
In [3]:
imshow(imread('Anolis carolinensis.thumb.png'))
Out[3]:
<matplotlib.image.AxesImage at 0x36cf250>
In [4]:
imshow(imread('Panthera leo.thumb.png'))
Out[4]:
<matplotlib.image.AxesImage at 0x36fc9d0>

Currently you can download either a low-res thumbnail or an SVG (the default, which isn't ugly and pixelated, but I couldn't find a good way to embed one in an IPython notebook).
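
For what it's worth, recent IPython versions ship IPython.display.SVG, which can render an SVG file directly in a cell; a more manual fallback is to inline the file as a base64 data URI. A minimal stdlib sketch of the latter (the helper name svg_to_html is my own, not part of the wrapper):

```python
import base64

def svg_to_html(path):
    # Read an SVG file and wrap it in an <img> tag as a base64 data URI;
    # the resulting string can be rendered with IPython.display.HTML.
    with open(path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('ascii')
    return '<img src="data:image/svg+xml;base64,%s"/>' % encoded
```

In a notebook cell, something like HTML(svg_to_html('Panthera leo.svg')) should then display the vector image inline.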

Code available on GitHub

Wednesday, May 15, 2013

Blogging with IPython: distances in higher dimensions

This is my first experiment blogging with an IPython notebook, which makes it easy to embed interpreted code and figures. (The code cells below assume IPython's pylab mode, which pulls NumPy and matplotlib functions like hist and mean into the namespace.) I've found this to be a fun and convenient way to write up blog posts that might contain some technical ideas or data, potentially turning this blog into an open lab notebook. Most of this is just me toying around and experimenting with probability distributions; it may or may not be new to you.

The question

As part of a research idea I'm playing with, I'm thinking about characterizing "average members" of populations where several traits are measured, as well as the average difference between two individuals in a population.

Let's assume for simplicity that we're dealing with normally distributed functional traits. Many traits in nature are approximately normally distributed, and the central limit theorem implies that traits arising as the sum of many small independent effects tend toward normality. If you measure the trait values for many individuals, you'll get a distribution like this:

In [1]:
import scipy.stats as s
data = s.norm(0,1).rvs(100000)
hist(data, 100)
show()

The mean of this distribution is zero. But what is the average distance from the center? It's not zero; that's the minimum distance from the center. The average distance for this data is:

In [2]:
hist(abs(data), 100)
show()
In [3]:
mean(abs(data))
Out[3]:
0.79749462465259524

...about 0.8.
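
That ~0.8 isn't arbitrary. The absolute value of a standard normal follows a half-normal distribution, whose mean has a simple closed form, sqrt(2/pi). A quick check with the standard library:

```python
import math

# Mean of the half-normal distribution: E|X| = sqrt(2/pi) for X ~ N(0, 1)
exact = math.sqrt(2 / math.pi)
print(exact)  # ~0.7979, matching the simulated value above
```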

Now, what happens as I increase the number of dimensions? Let's assume that we're measuring two independent traits in each individual, both normally distributed. This becomes a two-dimensional Gaussian distribution, with its peak in the center:

In [4]:
data = s.norm(0,1).rvs((2, 100000))

hexbin(data[0], data[1])
cbar = colorbar()
cbar.set_label('point density')

So, what is the average distance from the center now?

In [5]:
mean([(x**2 + y**2) ** 0.5 for x, y in zip(data[0], data[1])])
Out[5]:
1.2517969486248142
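
This ~1.25 also has a closed form: in two dimensions, the distance from the origin of a pair of independent standard normals follows a Rayleigh distribution, whose mean is sqrt(pi/2):

```python
import math

# Mean of the Rayleigh distribution (distance from origin in 2-D): sqrt(pi/2)
exact = math.sqrt(math.pi / 2)
print(exact)  # ~1.2533, close to the simulated ~1.2518
```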

As the number of dimensions increases, the average distance from the center also increases. This is hard to visualize directly as the number of dimensions continues to grow, but it turns out the average distance from the center increases predictably, with the square root of the number of dimensions.

In [6]:

MAX_DIMENSIONS = 200

avg_distances = []
for d in range(1, MAX_DIMENSIONS):
    data = s.norm(0,1).rvs((d, 100))
    distances = [sum([x**2 for x in point])**0.5 for point in zip(*data)]
    avg_distances.append(mean(distances))
    
xs = range(1, MAX_DIMENSIONS)
ys = avg_distances

scatter(log(xs), log(ys))

m, b, r, p, se = s.linregress(log(xs), log(ys))

print 'slope=%s, intercept=%s, r^2=%s' % (m, b, r**2)

plot([min(log(xs)), max(log(xs))], [b+min(log(xs))*m, b+max(log(xs))*m])
xlabel('no. dimensions')
ylabel('avg. distance from center')
    
show()
slope=0.516456510273, intercept=-0.0787972583541, r^2=0.998092351213
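
The fitted slope of ~0.5 on the log-log scale matches theory: the distance from the origin in d dimensions follows a chi distribution with d degrees of freedom, whose exact mean is sqrt(2)·Γ((d+1)/2)/Γ(d/2), and that approaches sqrt(d) as d grows. A stdlib-only check:

```python
import math

def chi_mean(d):
    # Exact mean of the chi distribution with d degrees of freedom:
    # E[chi_d] = sqrt(2) * Gamma((d + 1) / 2) / Gamma(d / 2)
    return math.sqrt(2) * math.gamma((d + 1) / 2.0) / math.gamma(d / 2.0)

for d in (1, 2, 10, 100):
    print(d, chi_mean(d), math.sqrt(d))  # chi_mean(d) -> sqrt(d) as d grows
```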

We've looked at the "average" member of a population. My next question is, what is the average difference between two members of the population?

In one dimension and two dimensions:

In [7]:
data = s.norm(0,1).rvs((2, 100000))

hist(abs(data[0] - data[1]), 100)
show()
In [8]:
mean(abs(data[0] - data[1]))
Out[8]:
1.1254983464319279
In [9]:
d = 2
data = s.norm(0,1).rvs((2, d, 100000))

distances = [sum([x**2 for x in point]) ** 0.5 for point in zip(*(data[0] - data[1]))]
hist(distances, 100)
show()
In [10]:
mean(distances)
Out[10]:
1.7730613234364447
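
Both of these have closed forms too. The difference of two independent N(0,1) draws is itself normal with variance 2, so every distance simply picks up a factor of sqrt(2) relative to the distance-from-center case:

```python
import math

# 1-D: sqrt(2) * sqrt(2/pi) = 2/sqrt(pi)
print(2 / math.sqrt(math.pi))  # ~1.1284, vs. simulated ~1.1255
# 2-D: sqrt(2) * sqrt(pi/2) = sqrt(pi)
print(math.sqrt(math.pi))      # ~1.7725, vs. simulated ~1.7731
```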

And finally, how does this change as the number of dimensions increases?

In [11]:
avg_distances = []

for d in range(1, MAX_DIMENSIONS):
    data = s.norm(0,1).rvs((2, d, 100))
    
    distances = [sum([x**2 for x in point]) ** 0.5 for point in zip(*(data[0] - data[1]))]
    avg_distances.append(mean(distances))

xs = range(1, MAX_DIMENSIONS)
ys = avg_distances

scatter(log(xs), log(ys))

m, b, r, p, se = s.linregress(log(xs), log(ys))

print 'slope=%s, intercept=%s, r^2=%s' % (m, b, r**2)

plot([min(log(xs)), max(log(xs))], [b+min(log(xs))*m, b+max(log(xs))*m])
xlabel('no. dimensions')
ylabel('avg. distance between two individuals')
    
show()
slope=0.516526929956, intercept=0.268035406832, r^2=0.998039284826

The mean distance between two individuals also grows with the square root of the number of dimensions, and it sits at a constant multiple (between one and two times) of the average distance from the center.
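
That constant factor can be read straight off the two regressions: the slopes agree, and the intercepts differ by about 0.347 ≈ ln sqrt(2), i.e. between-individual distances are sqrt(2) times the distances from the center (as expected, since the difference of two independent standard normals has variance 2):

```python
import math

# Difference of the two fitted log-log intercepts from the regressions above
ratio = math.exp(0.268035406832 - (-0.0787972583541))
print(ratio, math.sqrt(2))  # ~1.414 in both cases
```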

Conclusions

IPython notebooks let me write some quick Python code to explore something I'm curious about in minutes and post the whole thing to a blog instantly. Very cool, and I'm going to keep using this in the future. I wish I hadn't put off trying it for so long.