Thursday, September 15, 2011

Open data needs to be accessible

I'm trying to acquire two very different datasets in completely different fields right now: infectious disease incidence data from the CDC's Morbidity and Mortality Weekly Report, and CMIP5 global climate models. They both illustrate a simple truth: making data public is just the first step. If no one can access it in a reasonable way, it's essentially as closed as if you had never provided access at all.

The Morbidity and Mortality Weekly Report (MMWR) is available from the CDC's website via a web interface: choose a year (1996-2011) and week number (1-53) from lists, press submit, choose a table number (there are 10-12 tables per week per year), press submit, and you're presented with an HTML table containing data for a subset of the notifiable diseases. Okay, CDC, now suppose I'm interested in large-scale patterns: I want to download data for all diseases for a five-year period. This is going to involve hitting "submit" 6,360 times (5 years × 53 weeks × 12 tables × 2 submits each). Sure, I could write a script to do it automatically, but the output is a bunch of HTML tables, each with a slightly different format, making it difficult to "scrape" out the data.
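
For what it's worth, the scripting half is the easy part; it's the inconsistent tables that hurt. Here's a minimal sketch of the brute-force approach, assuming a hypothetical query URL and parameter names (the real CDC form uses different fields, and each table still needs its own cleanup):

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical endpoint and parameter names; the real MMWR query form
    # uses different URLs and fields. This just sketches the brute force.
    BASE_URL = "https://example.cdc.gov/mmwr/query"

    tables = []
    for year in range(2006, 2011):         # five years
        for week in range(1, 54):          # up to 53 reporting weeks
            for table_no in range(1, 13):  # 10-12 tables per week
                resp = requests.get(BASE_URL, params={
                    "year": year, "week": week, "table": table_no})
                soup = BeautifulSoup(resp.text, "html.parser")
                table = soup.find("table")
                if table is None:
                    continue
                # Every table has a slightly different layout, so this
                # naive row extraction still needs per-table cleanup.
                rows = [[cell.get_text(strip=True)
                         for cell in row.find_all(["th", "td"])]
                        for row in table.find_all("tr")]
                tables.append(((year, week, table_no), rows))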

(After toying with the idea of trying to build a scraper, I made contact with the CDC back in June to try to acquire the raw data behind the online tables. I thought this would be faster. I was mistaken. I've had a few responses, but I'm still waiting for the actual data.)

But, believe it or not, the CMIP5 data is far worse. CMIP5 stands for Coupled Model Intercomparison Project (phase 5): intercomparison! To me, the word "intercomparison" suggests that data for multiple models should be easy to download simultaneously. Not so. Again, a web interface stands in your way, only this time it uses asynchronous JavaScript to build each page, so you can't just go to a specific URL to get your data. You have to narrow things down by clicking on a Model, then an Experiment, then a Frequency (monthly, yearly, etc.), then a Realm (land, atmosphere, ocean), and finally a Variable. At this point a list of datasets is provided, with only ten results per page. You have to check a box for each dataset you want (at this point I typically want all of them), press "download all," download the WGET script they provide (seriously?), and run it.
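
Even the "run it" step is worth automating once you have more than a handful of those generated scripts. Assuming you've saved them as wget-*.sh in one directory (the actual filenames and any certificate or credential prompts depend on the portal), a few lines of Python will at least chew through them in order:

    import glob
    import subprocess

    # Assumes the portal's generated download scripts were saved as
    # wget-*.sh in the current directory; names and authentication
    # requirements will vary from node to node.
    for script in sorted(glob.glob("wget-*.sh")):
        print("Running %s ..." % script)
        subprocess.check_call(["bash", script])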

I've started writing an automated tool to do this for me using spynner, an automatic browsing module for Python that understands JavaScript. The tool is about 500 lines of code so far and climbing. The process goes like this: log in to the website, click on a model, wait ~10 seconds for the JavaScript to run, check which experiments are available on the page, open a new copy of the browser window (so I don't have to retrace my steps after each download), click the first experiment, wait 10 seconds... To give you an idea of how obnoxious this is, for their research, the Utah Climate Center wants data from 18 models, 54 experiments, 6 frequencies, 4 realms, and 31 variables. This amounts to many terabytes of actual data, and many, many iterations of "click a link and wait ~10 seconds." Once this tool is finished, it will need to run on a server for hours, probably days, to download everything.
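
Stripped of the login, error handling, and window juggling, the heart of the tool is just a stack of click-and-wait loops. The sketch below is a simplified, hypothetical version: click_and_wait() and harvest() are made-up names, and browser stands in for any object with spynner-style click() and wait() methods. It mostly exists to show where all those ten-second waits come from:

    import itertools

    def click_and_wait(browser, label, delay=10):
        # Click an element on the page, then block while the asynchronous
        # JavaScript rebuilds it; roughly ten seconds per click in practice.
        browser.click(label)
        browser.wait(delay)

    def harvest(browser, models, experiments, frequencies, realms, variables):
        for combo in itertools.product(models, experiments, frequencies,
                                       realms, variables):
            # Not every combination actually exists; the real tool reads each
            # page to see which branches are available and skips the rest,
            # but every branch it does follow costs a handful of these waits.
            for label in combo:
                click_and_wait(browser, label)
            # ...then check the boxes, grab the generated WGET script, run it.

    # The Utah Climate Center wish list, taken at face value:
    print(18 * 54 * 6 * 4 * 31, "possible facet combinations")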

It's great that we've seen such an explosion of publicly available data, but one of the key words there is "available." For some very important datasets, we still have a long way to go.