Tuesday, September 17, 2013

Using Make for reproducible scientific analyses

The two tools that I use most frequently in development are Git and Make. I've been using Git for years, but Make for only the past year or so - strangely enough, I learned about it at a Software Carpentry workshop at which I was one of the instructors. Since then, it's become a key weapon in my computational arsenal. In this post, I'll try to sell you on using Make for your own research projects.

What is Make? Make is an "automated build system." In other words, it's a software tool for automating the process of building one or more files from one or more other files they depend on. A "Makefile" contains a set of "recipes" that specify how all of your desired final products - figures, processed data files, documents, etc. - should be produced as well as their dependencies, or what files are needed to create them.

To many C or C++ programmers, Make is used for compiling code. This can often be messy, and may require other tools to get it to work correctly all the time - tools that generate very ugly looking, difficult to understand Makefiles. This has resulted in a lot of people thinking that Make itself is complicated (before learning more about it, I was one of them.) However, Make is conceptually very simple, and "how to produce a set of files from a set of other files" is a general problem that scientists run into every day.
Learning Make was simple and using it netted me three major benefits:
  1. Automation reduces cognitive load. When I come back to a project that has a Makefile, I don't need to think about which program I need to run with which file and which options to get my desired result - I simply type "make" or "make [filename]" and it handles everything for me. Previously I might've had a script like "do_everything.py" that would, well, do everything - having a Makefile instead means I don't even need to look for such a script. I know in advance how I'm going to repeat the analysis. It also means that other people can check out my code, see that there's a Makefile, and effortlessly repeat my analysis, often without even requiring instructions.
  2. Make automatically handles dependencies. When you have a complicated workflow, changing one file means that other files downstream may also need to be updated. But which ones? Without Make, I need to know which files depend on the file I changed, which files depend on those files, etc. This is another thing that Make will figure out for you, saving you the time and effort of thinking it through yourself.
  3. Makefiles provide a roadmap to your project. The best way to start a project is to map out your workflow - what input files do I start with, how do they need to be processed, what will be the end results. The more specific you can be with this roadmap, the better. This is precisely what a Makefile gives you - but in addition to simply documenting your project, it's also something that can be run.
After all, the spirit of laziness is the spirit of innovation!

An example. Here's a brief example of a hypothetical scientific workflow, and how it can be improved with Make.

My scenario: I have ten FASTA files containing DNA sequences. I want to align them using muscle, convert the alignment to another format, then produce a figure that visualizes the alignments.

What I have: ten FASTA files (homo_sapiens.fasta, pan_troglodytes.fasta, gorilla_gorilla.fasta...), a Python script (visualize_alignments.py), and muscle. My Python script requires the alignment to be in Phylip format, but muscle outputs in FASTA format.

To manually generate these figures, I would use muscle to align each FASTA file individually; use the BioPython library to convert those alignments from FASTA to Phylip; then call my python script to generate the figure. This would take 3 individual steps per FASTA file for a total of 30 steps - and as I add more DNA sequences, the number of steps will go up. Instead I'll show you how to do everything with just a single word, "make."

First, let's learn the anatomy of a Make "recipe." A Makefile is a collection of recipes, each of which specifies how to create a single "target:"

    target: dependencies...
        command

The target is the file to be produced; dependencies is a list of files it needs; command is the command that turns those dependencies into the target.

Back to our example. Working backwards, my end result is a figure visualizing the alignment for a given species. The target is the figure. To generate this figure, I first need the alignment files and the script to generate the figure. Once I have those, the command is simply "python visualize_alignments.py [alignment file]." So the first recipe would look like this:

    homo_sapiens_alignment.png: homo_sapiens.phy visualize_alignments.py
        python visualize_alignments.py homo_sapiens.phy

Note the target to the left of the colon, dependencies on the right, command below. The command needs to be indented with a tab - spaces won't work. Also, it might be strange to think of my figure "depending" on the script that produces it. What I really mean is that if my visualization script changes - if I change the figure label, or colors, or something - I need to re-build the figure. More on that later.

Since we're doing this ten (or potentially many) times, we can make this recipe even more general by changing a few parts using some ugly but useful Make syntax:

    %_alignment.png: %.phy visualize_alignments.py
        python visualize_alignments.py $<

You can see two changes here. First, in the target and dependencies, I've replaced "homo_sapiens" with %, the wildcard character. I'm telling Make that to make any file that ends in _alignment.png, I need a file of the same name that ends in .phy. I've also added $< in the command - this is short for "the first dependency," or, in this example, the .phy file. Think of the < as an arrow, pointing to the leftmost dependency.

We've written one recipe. Now I'll add two more, one to convert the alignment from FASTA (.aln) to Phylip (.phy) using biopython, and another to produce the alignment from the input file using muscle:

    %_alignment.png: %.phy visualize_alignments.py
        python visualize_alignments.py $<

    %.phy: %.aln
        python -c "import Bio.AlignIO as aio; aio.convert('$<','fasta','$@','phylip')"

    %.aln: %.fasta
        muscle -in $< -out $@

One more syntax element has been added here: $@ stands for "the name of the target."

What I have now is sufficient to generate figures individually - I can type "make homo_sapiens_alignment.png" and the entire workflow will be run, producing the figure. Here's where automatic dependency resolution becomes useful. If I run "make homo_sapiens_alignment.png" the figure will be created - if I run the same command a second time, nothing will happen. This is because Make sees that none of the dependencies have changed, and therefore nothing needs to be done. If I make a change to one file - change some of the plotting code in visualize_alignments.py, or manually edit the alignment file, for example - Make will re-run only the necessary portions of the workflow, instead of the entire thing!

Our workflow is still ten steps, one per figure - better than 30, but not yet ideal. We can make one final improvement that will allow us to simply type "make" and generate all ten figures automatically. We'll create a rule called "all" at the very top:

    all: $(patsubst %.fasta, %_alignment.png, $(wildcard *.fasta))

This "patsubst" function produces a list of files. For every file called [something].fasta (my input files), the list will contain [something]_alignment.png (the corresponding figure). By default, running "make" without the name of a target will run the recipe at the very top (in this case, "all"), and that recipe is now dependent on all of the figures we want to produce, so they'll all be created automatically if they need to be.

Conclusion.
While there are some ugly bits - the unfamiliar and alien syntax, particularly - Make is, essentially, quite simple. If you're new to Make, I would suggest avoiding the advanced syntax for now. Simply use filenames for your recipes, and you can add in more generality later by taking advantage of Make syntax elements one at a time. Remember: a Makefile is just a list of steps that need to be taken to turn your input files into the output you want. Used in this way, it is a powerful tool that can improve reproducibility and save you time, effort, and headaches.