Wednesday, May 15, 2013

Blogging with IPython: distances in higher dimensions

This is my first experiment blogging with an iPython notebook, which allows easily embedding interpreted code and figures. I've found this to be a fun and convenient way to write up blog posts that might contain some technical ideas or data, potentially turning this blog into an open lab notebook. Most of this is just me toying around and experimenting with probability distributions; it may or may not be new to you.

The question

As part of a research idea I'm playing with, I'm thinking about characterizing "average members" of populations where several traits are measured, as well as the average difference between two individuals in a population.

Let's assume for simplicity that we're dealing with normally distributed functional traits. Many traits in nature happen to be approximately normally distributed; the central limit theorem also guarantees that means of many aggregate measures tend to approach normality. If you measure the trait values for many individuals, you'll get a distribution like this:

In [1]:
import scipy.stats as s
data = s.norm(0,1).rvs(100000)
hist(data, 100)
show()

The mean of this distribution is zero. But what is the average distance from the center? It's not zero; that's the minimum distance from the center. The average distance for this data is:

In [2]:
hist(abs(data), 100)
show()
In [3]:
mean(abs(data))
Out[3]:
0.79749462465259524

...about 0.8.

Now, what happens as I increase the number of dimensions? Let's assume that we're measuring two independent traits in each individual, both normally distributed. This becomes a two-dimensional Gaussian distribution, with its peak in the center:

In [4]:
data = s.norm(0,1).rvs((2, 100000))

hexbin(data[0], data[1])
cbar = colorbar()
cbar.set_label('point density')

So, what is the average distance from the center now?

In [5]:
mean([(x**2 + y**2) ** 0.5 for x, y in zip(data[0], data[1])])
Out[5]:
1.2517969486248142

As the number of dimensions increases, the average distance from the center also increases. It's hard to visualize this as the number of dimensions continue to increase, but it turns out the average distance from the center increases predictably, with the square root of the number of dimensions.

In [6]:
import random

MAX_DIMENSIONS = 200

avg_distances = []
for d in range(1, MAX_DIMENSIONS):
    data = s.norm(0,1).rvs((d, 100))
    distances = [sum([x**2 for x in point])**0.5 for point in zip(*data)]
    avg_distances.append(mean(distances))
    
xs = range(1, MAX_DIMENSIONS)
ys = avg_distances

scatter(log(xs), log(ys))

m, b, r, p, se = s.linregress(log(xs), log(ys))

print 'slope=%s, intercept=%s, r^2=%s' % (m, b, r**2)

plot([min(log(xs)), max(log(xs))], [b+min(log(xs))*m, b+max(log(xs))*m])
xlabel('no. dimensions')
ylabel('avg. distance from center')
    
show()
slope=0.516456510273, intercept=-0.0787972583541, r^2=0.998092351213

We've looked at the "average" member of a population. My next question is, what is the average difference between two members of the population?

In one dimension and two dimensions:

In [7]:
data = s.norm(0,1).rvs((2, 100000))

hist(abs(data[0] - data[1]), 100)
show()
In [8]:
mean(abs(data[0] - data[1]))
Out[8]:
1.1254983464319279
In [9]:
d = 2
data = s.norm(0,1).rvs((2, d, 100000))

distances = [sum([x**2 for x in point]) ** 0.5 for point in zip(*(data[0] - data[1]))]
hist(distances, 100)
show()
In [10]:
mean(distances)
Out[10]:
1.7730613234364447

And finally, how does this change as the number of dimensions increases?

In [11]:
avg_distances = []

for d in range(1, MAX_DIMENSIONS):
    data = s.norm(0,1).rvs((2, d, 100))
    
    distances = [sum([x**2 for x in point]) ** 0.5 for point in zip(*(data[0] - data[1]))]
    avg_distances.append(mean(distances))

xs = range(1, MAX_DIMENSIONS)
ys = avg_distances

scatter(log(xs), log(ys))

m, b, r, p, se = s.linregress(log(xs), log(ys))

print 'slope=%s, intercept=%s, r^2=%s' % (m, b, r**2)

plot([min(log(xs)), max(log(xs))], [b+min(log(xs))*m, b+max(log(xs))*m])
xlabel('no. dimensions')
ylabel('avg. distance between two individuals')
    
show()
slope=0.516526929956, intercept=0.268035406832, r^2=0.998039284826

The mean distance between two individuals also seems to increase with the square root of the number of dimensions and seems to be between one and two times the average distance from the center.

Conclusions

IPython notebooks let me write some fast Python code to explore something I'm curious about in minutes and post the whole thing to a blog instantly. Very cool, and I'm going to keep using this in the future. I wish I hadn't put off trying this out for so long.

1 comment:

  1. Ben, that is so amazing! I don't know anything practical about Python, but this post/notebook make the concepts you are illustrating very clear. Really cool.

    ReplyDelete