Ben Morris' notebook: 2012

Saturday, December 29, 2012

What incentives are there to maintain software in academia?

Just read an article in PLoS Comp. Bio. called "Ten Simple Rules for the Open Development of Scientific Software" by Andreas Prlić which was linked by Karthik Ram. A Twitter discussion followed, in which 140 characters was not enough to be sufficiently expressive. Let me start off by saying that I think this was a fantastic article. I'm 100% in agreement and think that these are some important points to make. I start with this caveat because I'm about to dwell on one suggestion that I had a negative reaction to.

From rule 10, "science counts:"

As scientists, the software we write is primarily a means to advance our research and, ultimately, achieve our scientific goals. Whilst the development of software for the consumption of others aligns well with other processes of scientific advancement, it is the science that ultimately counts. Scientific software development fulfils an immediate need, but maintenance of code that is no longer relevant to your own research is a serious time sink, and will rarely lead to your next paper, or secure your next grant or position.

The author hits on the unfortunate practical reality that time spent on software development that doesn't result in widely recognized deliverables such as publications or grants is essentially time wasted, and will be inversely correlated with your chances of success as an academic.

The troubling part is that this is an extraordinarily short-sighted view of the value of software. Outside of academia, large communities of developers frequently and happily contribute to open source projects for which they receive no tangible benefit. The rewards developers receive vary from education and experience to networking and recognition to simply having fun. Sometimes extrinsic rewards eventually present themselves, and beyond a certain level of growth money becomes increasingly necessary to keep a large project going (see the "Money" chapter from Producing Open Source Software). Still, popular open source projects such as Linux and Python have value that far outweighs the modest amounts of money that have been funneled into them, and they're still developed largely by unpaid (sometimes anonymous) volunteers.

Scientific software is important, and even very specialized software should be more widely available and used more often. Replication is one of the cornerstones of the scientific method. I envision a future where results and figures from papers are easily replicable upon publication and where people (reviewers especially) are in the habit of checking each other's work. This is already being done on small scales - see Weecology on GitHub for some excellent examples. The problem is this: a scientist who develops code for a single analysis and makes their code publicly available is doing it to benefit the broader scientific community. But code rots over time. Inevitably, when code makes the jump from a single user to many, problems will be discovered. Thus, the benefit provided by open source software is directly related to the effort spent responding to users and maintaining code. And for most projects, this effort has a very low probability of providing the author of the code with an additional grant or publication, so there's little incentive to do it. (There are notable counterexamples - massive projects such as DataONE for which there's already funding for long-term development and maintenance and which tend to result in multiple publications and presentations for those involved.)

So, my question is this: what can be done to provide incentives for the development and maintenance of important scientific code?

Monday, December 17, 2012

Does gun ownership (A) increase violence or (B) deter violence? C: none of the above.

In the wake of a terrible tragedy, talks about gun control are at the forefront of today's political stage. Of course, both advocates and opponents of gun control point to the Connecticut shooting as a validation of their own viewpoint. It's unfortunate that occurrences like this only seem to polarize us further. We can't rely on anecdotes or emotion to solve this problem. So, what does the data say?

Using data from the Guardian on gun ownership by country, I evaluated the hypothesis that higher rates of gun ownership either (A) lead to increased gun violence (as believed by the left) or (B) actually work to deter gun violence (as believed by the right.) Note that without an experimental manipulation (which is difficult to do due to the many factors that would need to be controlled for, not to mention very questionable ethics), we can identify correlations but it's difficult to really say anything about causation.

First, I compared the total number of civilian-owned guns in each country to the total number of gun-related homicides. The results, unsurprisingly, show a strong positive correlation, most of which can be explained by population: more populous nations will tend to have more homicides and more guns. (In these figures, the size of each data point indicates population.)

To control for population, I compared the rate of gun ownership (per 100 people) to the rate of gun-related homicides (per 100,000 people), and the results were surprising. There's a weak positive relationship between the rate of firearm ownership and the rate of firearm-related homicide, but it isn't statistically significant (p=0.25), so it doesn't strongly support either side's claims.
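For anyone who wants to repeat this kind of comparison, here's a minimal sketch of the normalization step. It is not the code from the GitHub repository: the four records are invented toy numbers (not the Guardian data), the function name `pearson_r` is just for illustration, and it stops at the correlation coefficient rather than computing a p-value.

```python
from math import sqrt

# Hypothetical toy records: (label, population, civilian guns, gun homicides).
# These numbers are invented for illustration and are NOT the Guardian data.
countries = [
    ("A", 1_000_000, 100_000, 10),
    ("B", 5_000_000, 2_000_000, 100),
    ("C", 20_000_000, 1_000_000, 1_000),
    ("D", 50_000_000, 15_000_000, 500),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Raw totals: both columns tend to grow with population.
total_guns = [guns for _, _, guns, _ in countries]
total_homicides = [hom for _, _, _, hom in countries]

# Rates: guns per 100 people, gun homicides per 100,000 people.
guns_per_100 = [100 * guns / pop for _, pop, guns, _ in countries]
homicides_per_100k = [100_000 * hom / pop for _, pop, _, hom in countries]

r_totals = pearson_r(total_guns, total_homicides)
r_rates = pearson_r(guns_per_100, homicides_per_100k)
print(f"r (totals) = {r_totals:.2f}")
print(f"r (rates)  = {r_rates:.2f}")
```

Comparing the two coefficients shows how much of the apparent relationship in the raw totals is carried by population alone.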

I suspect that there are other cultural, political, and socioeconomic factors that far outweigh gun ownership as predictors of gun violence, and that both sides in this debate potentially have valid points. In some situations, the presence of guns may deter violent crime. In others, it may enable violent crime.

We can all agree on one thing: we want there to be fewer mass shootings in America. When considering what policy changes will move us toward that goal, it is absolutely essential that we rely on evidence instead of either emotions or anecdotes.

Additionally, since "guns don't kill people (people with guns kill people)", maybe gun control is less important to curing the modern epidemic of mass-shootings than improving access to and understanding of mental healthcare.

The data and code I used to produce these figures are available on GitHub, and you're free to use them however you like. Feedback is welcome.

Edit: Someone did some additional analysis, uncovering a couple of interesting correlates of overall homicide rates (including those unrelated to guns): GDP and income inequality. See it on Reddit.

Friday, December 7, 2012

Hour challenge 12/7: zot, a command-line Zotero client

The last hour challenge was fun, but it was also a dismal failure - honestly, I had been envisioning nanote for a long time prior to developing it, and an hour was just not enough time to build in all the functionality I wanted. I've started using nanote in place of nano and have continued to build in additional functionality, and will probably continue to do so for a while. End result: I now understand how to write a program with the curses library, and my notes are much more organized than they were a week ago.

I use Zotero all the time to manage papers and books. Today I'll be developing a command line interface to Zotero. (There's not already one of these? Really?)

~~I'll be using the pygnotero library to interface with Zotero.~~ To avoid the GPL, I'm not going to use pygnotero - instead, I'll just interface directly with Zotero's sqlite database. And, for fun, I'll use SQLAlchemy, which I've never used and should really learn more about. I want my client to be able to search (by author, title, citation, tags, etc.), add notes to papers, and output bibliographies (and potentially the text from PDF articles using pdfminer? I'm going to keep thinking about this.) It'll be called zot - short, memorable names for command line tools are always a good thing. I'm going to design it to pipe output to itself, i.e. "zot search ecoinformatics | zot bibliography" to generate a bibliography for all articles on ecoinformatics.
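The search-then-bibliography pipeline described above can be sketched roughly as follows. This is an illustration under loud assumptions, not zot's actual code: it uses the standard library's sqlite3 module instead of SQLAlchemy to stay self-contained, the `items` table is a made-up, simplified schema (Zotero's real database is laid out differently), and the two records are placeholders.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only; Zotero's real
# database is laid out differently. The records are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (key TEXT PRIMARY KEY, author TEXT, title TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        ("K1", "Author One", "An example ecoinformatics title", 2001),
        ("K2", "Author Two", "Another example title", 2002),
    ],
)


def search(conn, field, term):
    """Return keys of items whose `field` contains `term`."""
    # Column names cannot be bound as SQL parameters, so whitelist them.
    if field not in {"author", "title"}:
        raise ValueError(f"unsupported field: {field}")
    rows = conn.execute(
        f"SELECT key FROM items WHERE {field} LIKE ? ORDER BY key",
        (f"%{term}%",),
    )
    return [key for (key,) in rows]


def bibliography(conn, keys):
    """Format one plain-text citation line per known key, in the order given."""
    lines = []
    for key in keys:
        row = conn.execute(
            "SELECT author, year, title FROM items WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            lines.append("{} ({}). {}.".format(*row))
    return lines


# Rough equivalent of "zot search ecoinformatics | zot bibliography":
# the first stage emits item keys, the second stage consumes them.
keys = search(conn, "title", "ecoinformatics")
for line in bibliography(conn, keys):
    print(line)
```

Passing bare item keys between the two stages is what makes the pipe design work: each subcommand only needs to agree on that one small format.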

I'll begin sketching out planned functionality while I eat lunch at 12:00 (eastern time), start coding at 1:00 and hope to be finished no later than 2:00. Code will be available at https://github.com/hourchallenge/zot. The final result will be available on the Python Package Index as soon as it's relatively functional.

Update (2:00): my command line client can currently search for articles by title, author, or publication. I'm going to go for another hour to see if I can finish up.

Update (3:00): after two hours, I'm calling this finished. To try it out:

pip install zot
zot path /path/to/your/Zotero/directory/
zot search --author Brown | zot bib

Friday, November 30, 2012

Hour challenge 11/30: nanote, a terminal note-taking app

I'm fairly stone age when it comes to productivity software - I keep track of notes, my calendar, etc. in plain text files on Dropbox and edit them with nano (yes, I use nano and sometimes gedit.) I've decided there's no excuse for this anymore, so I'm going to spend some time developing (open source) productivity tools that I can use from the terminal that are a bit more sophisticated.

Of course, I want to get something out of this, and I don't want to sink a lot of time into development. So, I'm starting a weekly "hour challenge." Every Friday afternoon, I'll be setting aside an hour to develop a specific terminal-based productivity app, and releasing the source on GitHub. I'll report on my progress here.

I was originally going to develop a terminal interface to Google calendar today, but I discovered gcalcli, which fits my needs perfectly, so there's no need to reinvent that wheel. So, this afternoon, I'll be developing a terminal-based note-taking system similar to Tomboy notes. I used to use Tomboy, but it frequently caused problems syncing with Dropbox and I constantly lost work, and it uses XML, which makes restoring notes after conflicts prohibitively difficult. My notes will use a simple markdown-based text format and allow easy linking to other notes and hierarchical note organization using directories. I'm going to borrow functionality from nano and call my app "nanote." I'll be learning how to interact with the ncurses library in Python.
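The note format described above (plain text files, links between notes, directories for hierarchy) can be sketched in a few lines. The `[[...]]` link syntax, the `.md` extension, and the function names here are assumptions made up for illustration, not nanote's actual format, and the sketch leaves out the curses interface entirely.

```python
import re
from pathlib import Path

# Hypothetical link syntax for illustration: [[folder/note name]].
LINK_PATTERN = re.compile(r"\[\[([^\[\]]+)\]\]")


def find_links(text):
    """Return the note names referenced in a note's text, in order."""
    return [name.strip() for name in LINK_PATTERN.findall(text)]


def note_path(root, name):
    """Map a note name such as 'projects/zot' to a file under the notes root."""
    return Path(root) / (name + ".md")


text = "Follow up on [[projects/zot]] and see [[reading list]]."
for name in find_links(text):
    print(name, "->", note_path("notes", name))
```

Because a note name maps directly onto a path, the directory tree doubles as the note hierarchy and the files stay readable in any text editor.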

The rules: I'm allowed to look into what libraries to use, plan, etc. for one hour prior to starting to code. I won't actually start coding until exactly 1:00 PM, and at 2:00 it's pencils up. Obviously, I'll probably keep coding for a little bit after the fact, but at the end of the hour I want to have a bare bones, functional tool.

I've created an hourchallenge GitHub organization for storing these projects, so drop me a line if you want to participate! Suggestions for future tools are also welcome.

Shortly after 2:00, I'll update this post with the results and a link to the repository.

Update (2:00): I didn't make it, partially due to not enough planning/familiarity with ncurses and partially due to suddenly being offered free pizza. I'm giving myself another hour! I can definitely finish this off by 3:00.

The code so far is live at https://github.com/hourchallenge/nanote

Update (3:00): I got close! It's more or less functional: you can edit text, save, and link to other notes. I'll probably end up spending another couple of hours filling out the missing functionality. Hey, I think this is pretty good for a two-hour development sprint.

Update (12/1): after 24 hours, it does just about everything I wanted it to, and supports Markdown formatting. Mission accomplished!

So, in summary, I utterly failed to develop something in an hour (and maybe this project was too ambitious for an hour coding sprint) but I'm still pleased with what I was able to do. Tune in next week when I take another crack at it.

If you have an idea for a tool that you'd like to see developed, or want to develop a tool in an hour yourself, let me know!

nanote is now available on the Python Package Index, so "pip install nanote && nanote" should get it running.

Thursday, July 26, 2012

Gordon Research Conference: Metabolic Basis of Ecology, 2012

Just finishing up a great week-long conference in Biddeford, ME. Here's the poster I presented (full-size image here):

Our school, and Utah in general, were very well represented, with three students, a postdoc and two faculty members attending from our lab. Here's the poster by Dan McGlinn (a postdoc) on testing MaxEnt spatial patterns (PDF), and here's Ethan's talk on testing other aspects of MaxEnt.

The GRC was excellent, and while we're not allowed to talk about specific talks due to Gordon's "off-the-record" policy (which we all waived for our own work), I highly encourage anyone interested in metabolism to attend. Talks varied from exploring mechanisms to large-scale patterns. It's really not just for ecologists. There's content that would appeal to evolutionary biologists, physiologists, environmental scientists, and even anthropologists or biomedical researchers.

Thanks to Ethan and Morgan for the advice and funding that enabled me to make it out here. I learned a lot, met some cool people, and got some great ideas to apply to my own research.

Monday, January 2, 2012

New Scientific Programming Blog

This is my personal blog where I post on a wide range of topics. I've decided to make a second blog devoted specifically to the concept of scientific programming. You can find it at http://www.sciprogblog.com. I'll be posting various tips and tricks for scientists and other professionals who haven't been formally trained as programmers but find themselves programming out of necessity and want to learn more.
