Ben Morris' notebook: 2011

Thursday, September 15, 2011

Open data needs to be accessible

I'm trying to acquire two very different datasets in completely different fields right now: infectious disease incidence data from the CDC Morbidity and Mortality Weekly Report, and CMIP5 global climate models. They both illustrate a simple truth: making data public is just the first step. If no one can access it in a reasonable way, it's essentially as closed as if you had provided no access at all.

The Morbidity and Mortality Weekly Report (MMWR) is available from the CDC's website via a web interface: choose a year (1996-2011) and week number (1-53) from lists, press submit, choose a table number (there are 10-12 tables per week per year), press submit, and you're presented with an HTML table containing data for a subset of the notifiable diseases. Okay, CDC, now suppose I'm interested in large scale patterns: I want to download data for all diseases for a five year period. This is going to involve hitting "submit" 6,360 times (5 * 53 * 12 * 2). Sure, I could write a script to do it automatically, but the output is a bunch of HTML tables, each of which has a slightly different format, making it difficult to "scrape" out the data.
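To make that arithmetic concrete, here is a small Python sketch that enumerates the form submissions. The particular years and the assumption of 12 tables every single week are illustrative; the script only counts requests and never contacts the CDC site.

```python
from itertools import product

years = range(2007, 2012)   # any five-year window; these years are an assumption
weeks = range(1, 54)        # week numbers 1-53
tables = range(1, 13)       # assumes the maximum of 12 tables per week
submits_per_table = 2       # one submit for year/week, one for the table number

# Every (year, week, table) combination that would have to be requested.
requests = list(product(years, weeks, tables))
total_submits = len(requests) * submits_per_table

print(total_submits)  # 6360
```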

(After toying with the idea of trying to build a scraper, I made contact with the CDC back in June to try to acquire the raw data behind the online tables. I thought this would be faster. I was mistaken. I've had a few responses, but I'm still waiting for the actual data.)

But, believe it or not, the CMIP5 data is far worse. CMIP5 stands for Coupled Model Intercomparison Project - intercomparison! To me, the word "intercomparison" suggests that data for multiple models should be easy to download simultaneously. Not so. Again, a web interface stands in your way, only this time it uses asynchronous JavaScript to build each page so you can't just go to a specific URL to get your data. You have to narrow things down by clicking on a Model, then an Experiment, then a Frequency (monthly, yearly, etc.), then a Realm (land, atmosphere, ocean), and finally a variable. At this point a list of models will be provided, with only ten results per page. You have to check a box for each dataset you want (at this point I typically want all of them), press "download all," download the WGET script they provide (seriously?) and run it.

I've started writing an automated tool to do this for me using spynner, an automatic browsing module for Python that understands JavaScript. The tool is about 500 lines of code so far and climbing. The process is like this: log in to the website, click on a model, wait ~10 seconds for the JavaScript to run, check which experiments are available on the page, open a new copy of the browser window (so I don't have to retrace my steps after each download), click the first experiment, wait 10 seconds... To give you an idea of how obnoxious this is, for their research, the Utah Climate Center wants data from: 18 models, 54 experiments, 6 frequencies, 4 realms, and 31 variables. This amounts to many terabytes of actual data, and many, many iterations of "click a link and wait ~10 seconds." Once this tool is finished, it will need to run on a server for hours, probably days, to download everything.

It's great that we've seen such an explosion of publicly available data, but one of the key words there is available. For some very important datasets, in terms of availability, we still have a long way to go.

Sunday, August 7, 2011

Developing Android apps: it's really not so bad!

There's been some talk recently about how Android development can be difficult for newcomers to jump into (for one, this rant on Hacker News, claiming that you'll need "months researching Android design patterns"; the top rated response is mine.) I thought I'd elaborate and show my roughly two week journey from deciding "hey, I'd like to try building an app some time" to releasing a full blown (albeit somewhat simple) Android app to the market. I plan on following up with more stats in a week's time to show how I'm doing with revenues, downloads, etc.

Background

I'm no stranger to programming. I taught myself how to program in QBasic in 1995 (when I was 7) on a little computer running Windows 3.1 (although I worked solely from DOS.) I made dozens of little QBasic games. Had I been born 10-15 years earlier I could've had a great career making crappy looking, low-res games.

One game that I started and never finished back in the QBasic days was my version of Archon II: Adept. Disclaimer: I never actually played Archon II, I just saw it in an old magazine once and thought the idea of summoning creatures onto a chess-like board and having them duke it out seemed pretty cool. I'm not sure if that's really what Archon II is like, but I had no money and an active imagination.

I got my first Android device, an enTourage Pocket eDGe, on July 7th, exactly 1 month ago. (I can't recommend the device enough, but unfortunately enTourage has gone out of business.) After a few weeks, I decided to try my hand at developing my own Android app. For my first Android project, I thought it would be fun to take another shot at finishing the "Archon II" concept so I could play it on my eDGe.

I'd never programmed in Java before, although I have some experience with C# so the syntax isn't completely foreign.

Something that I really believe helped me get started smoothly was, frankly, my attitude. I don't waste time hating on Java, even though it's the cool thing to do. Java is what it is, it's very much ubiquitous right now, and it's just not productive to complain about it. Every language has shortcomings; Java is the standard language for Android development, so it's what I used. Also, when I first got started I hadn't yet read the plethora of blog posts about how terrible Android development is and how little money its developers make. I enjoyed the development process and, while it remains to be seen how my app will perform, I'm optimistic.

Getting Started

A simple Google search yielded some great tutorials that showed me how to get started with the Android SDK on Ubuntu. The whole process, downloading the SDK and Eclipse plugin and getting everything set up, took roughly 10 minutes, after which I grabbed a Hello World tutorial and was beyond excited when I got a custom text message to appear on my eDGe. It had begun.

I quickly became familiar with the Android development ecosystem: drawables, XML layouts, etc. I loved the fact that I could drop an image into my drawable folder and refer to it by name or store it as an integer field in a class; with an XML drawable I could specify which image to use for a button by default and which to use when it was pressed. XML layouts were also a great alternative to programmatic layouts. Settings could easily be stored in preference files and retrieved by name, eliminating the need to create my own settings file read/write system. (At this point I really can't give enough props to Reddit user Lorc for his extremely high quality set of game images that have been made freely available. With a little coloring they look magnificent on a mobile device.)

Another great thing about Android development: for just about anything you might need to do (everything I needed to do, anyway), there's a helpful tutorial by someone who's done it before, and it's only a quick Google search away.

One hurdle I ran into: testing my app on the Android emulator. While convenient to test different display sizes, the emulator is painfully slow and took upwards of 10 minutes to boot up the first time. Not helping was the fact that I do most of my development on a netbook that's not especially powerful. I got around this problem by using my actual device for testing, using USB debugging mode when necessary. Problem solved.

Polish

I was busy full time during the day, but working only nights I was able to put together a working, playable prototype of my game within 4 evenings. (Confession: I stayed up pretty late some of those nights.) That was the easy part - I spent about a week and a half making various improvements and preparing for release.

The main difficulty at this point was the fact that I'd developed solely with the eDGe in mind, and there is a huge array of potential Android devices, each with their own resolution and capabilities. I came up with several ways to address this problem: using different layouts for different size devices, programmatically shrinking/expanding the size of the board and side bars, and including different sized boards for screens with different aspect ratios in the Settings menu.

I had some family members test the game and made some improvements based on feedback I got and new ideas I had. I made the UI more responsive by making the board extend SurfaceView instead of View, resulting in smoother animations. I simplified the controls, added trophies, and implemented a simple tutorial system that would give helpful messages the first time the user did certain things. I tried to keep things simple and intuitive to appeal to mainstream Android users.

I implemented an enemy AI that's pretty basic but still manages to beat me regularly. It looks at its potential moves, evaluates the condition of the board after making each move, and decides which is best based on things like proximity to allies/enemies and relative strengths and weaknesses.
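The general shape of that kind of AI is a greedy one-move lookahead. Here is a minimal sketch in Python (the game itself is written in Java, and this is not its actual code); the board representation, the move list, and the scoring function are all hypothetical stand-ins.

```python
def choose_move(board, legal_moves, apply_move, score):
    """Greedy one-move lookahead: try every move, score the result, keep the best."""
    best_move, best_score = None, float("-inf")
    for move in legal_moves(board):
        value = score(apply_move(board, move))
        if value > best_score:
            best_move, best_score = move, value
    return best_move


# Toy stand-in for a real board: a creature's position on a line, enemy at 10.
ENEMY = 10
legal_moves = lambda pos: [-1, +1]             # step left or step right
apply_move = lambda pos, step: pos + step      # position after taking the step
score = lambda pos: -abs(ENEMY - pos)          # closer to the enemy scores higher

print(choose_move(3, legal_moves, apply_move, score))  # 1
```

A real scoring function would combine several terms (proximity to allies and enemies, relative strengths and weaknesses), but the loop around it stays the same.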

Marketing and Release

A few days ago I committed to a Sunday evening release date, as I could keep polishing the app for years without releasing if I didn't make a firm decision. I reasoned that weekend evenings would probably see lots of users checking the market. I decided to release under the pseudonym "MonsterFace Games." "Monster face" refers to the face our cat makes when she's scolded, a kind of half-scowl that just barely shows teeth.

On Friday I started letting friends know that it was coming, posting to Facebook and Twitter. Saturday I learned how to integrate Google Analytics and AdMob into the game to serve ads on the free version and track statistics like what creatures and elements were most popular, what trophies were earned, and how often players won or lost.

One more unexpected speed bump - my release was about 3 hours later than I had hoped because I ran into problems putting up a free and paid version of the same app on the Android market - something that could've been a little clearer. But it's official: two weeks from concept to release of a full Android game with minimal difficulties. Stay tuned for updates on how Summoner performs during its first week.

Links

Summoner Lite - web, mobile

Summoner - web, mobile

Follow MonsterFace Games on Twitter: @monsterfacegame

Saturday, June 25, 2011

Americans don't trust experts

Here in America we only accept people as experts if they agree with us. And when we're easily convinced that someone is an "expert" who says things that coincide with our preconceptions, we open ourselves up to easy manipulation in the "wars of the experts" that play out in the media every day on a wide range of topics.

I've taken a lot of coursework in biology, particularly focusing on ecology. While ecology often presupposes the veracity of the climate change hypothesis, I am by no means an "expert" on climate change - the closest I come is having taken a year of general chemistry. I happen to accept the consensus that human activities are contributing to it. I've never done any research in climatology, so my task becomes to weigh the evidence and the various claims made to explain it, and determine who has the most compelling view. A little research shows an overwhelming international consensus by scientific organizations. It seems highly unreasonable to disagree with such a vast group of concurring professionals, especially since I don't claim access to some secret evidence that can disprove their claims. I really don't know anything about the models they use to predict climate change. But they're the ones doing the research, and I do trust that the checks and balances inherent in refereed journal publication will weed out hypotheses that cannot be supported.

Many people find it difficult to reach the same conclusion, however, because there are "experts" on both sides. A little more searching reveals over 31,000 (over 9,000 PhD-holding) scientists who have signed a petition asserting that "there is no convincing scientific evidence" (I guess it depends on how you define "convincing") and even that "increases in atmospheric carbon dioxide produce many beneficial effects upon the natural plant and animal environments of the Earth." Many (over 40%!) of these "scientists" have only a Bachelor's degree; some have degrees in areas such as economics, some are medical doctors who treat patients and do not conduct scientific research on any topic, others are physicists who do not devote a significant part of their time to studying climate specifically but claim knowledge of "fundamental physical and molecular properties of gases, liquids, and solids." While the total number of signatories seems impressive, when you look more closely at its makeup the group hardly seems qualified to provide an opinion on this issue. In fact, only 39 (0.1% of all signers) identify themselves primarily as climatologists; by contrast, 9.7% of signers are medical doctors and a whopping 32% are engineers. Experts to be sure, but not experts on climate science.

Climate change, vaccination-induced autism, evolution, whether or not HIV causes AIDS, the effectiveness of unproven alternative medicine techniques that are taught without criticism...all debates in which people hold strong emotional attachments to their own viewpoints, and in which there are seeming "experts" arguing for the dissenting opinion. The average person is not trained to discern credibility in scientific claims and may just go with their feelings or make a decision for social, religious, political, or economic reasons. The media, by providing equal time to "both sides of the issue," tends to exacerbate the problem by creating the appearance of balance, when in reality only one side of the debate can be considered credible. A logical approach shows no reason to believe that vaccines cause autism, and studies have thoroughly debunked this idea, yet often anecdotes and a poor understanding of statistics cause parents to trust any "expert" that suggests such a link.

We live in a democratic society; the general public, not scientists, dictates policy, which makes it important to communicate and foster both a better understanding of science and more trust in scientists as people who are genuinely committed to solving important problems and are here to help. It's dangerous when emotions trump evidence in deciding truth. The scientific community struggles to really convey the strength of its positions - and "scientists" who claim expertise in a field they are not directly involved in aren't helping scientific credibility in general.

It's also important that the public learn to accept their own lack of expertise and defer decisions requiring advanced knowledge to those who have devoted themselves to the study of the particular issue. As Americans, it seems that distrust of authority is in our blood. But if we continue to hold stubbornly to emotional arguments that can be refuted by logic, and refuse to listen to those who have real knowledge, we will not be equipped to make the progress we'll need to face the big challenges of today and the future.

Tuesday, May 17, 2011

How do you make a programming language?

Without going into the philosophical details of "creating" a language (is a language "invented", or is it an abstract concept that always exists and is "discovered?"), I'm going to give a brief, high-level review of how a programming language interpreter is created. It's actually pretty easy to create a simple interpreter. There are two main steps: parsing and evaluating.

1. Parsing is the process of reading the text in a code file and breaking it down into expressions for your interpreter to evaluate. For example, a Scotch code file is read and deciphered into Haskell data structures: 1 + 2 + 3 becomes Add (Add (NumInt 1) (NumInt 2)) (NumInt 3), where "Add (expression) (expression)" symbolizes addition and "NumInt (number)" represents an integer.
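As a rough illustration of the parsing step, here is a hand-rolled Python sketch that handles only integer addition. It is not how Scotch does it; the class names simply mirror the Add and NumInt constructors above, and the parse function is an assumption made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NumInt:
    value: int


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def parse(source):
    """Parse sums like '1 + 2 + 3' into a left-nested Add tree."""
    if not re.fullmatch(r"\s*\d+(\s*\+\s*\d+)*\s*", source):
        raise ValueError(f"cannot parse: {source!r}")
    numbers = [NumInt(int(tok)) for tok in re.findall(r"\d+", source)]
    expr = numbers[0]
    for number in numbers[1:]:
        expr = Add(expr, number)   # fold to the left: (1 + 2) + 3
    return expr


print(parse("1 + 2 + 3"))
```

The printed tree has the same shape as the Haskell value above: the first addition sits nested inside the left side of the second.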

Parsing is rather complicated, but it's made much easier if you take advantage of some of the existing libraries - I use Parsec, one of the most popular parser combinator libraries. Parsing can also be very slow. To solve this problem, when I parse a code file, I save a binary representation of the result in a new file; I only re-parse if there have been changes to the file.
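The caching idea can be sketched in a few lines of Python. This is only an illustration of "re-parse only when the file changed", not Scotch's actual mechanism; the ".cache" suffix, the use of pickle, and the modification-time check are all assumptions.

```python
import os
import pickle


def load_parsed(path, parse):
    """Return the parsed form of `path`, re-parsing only if the source is newer."""
    cache_path = path + ".cache"   # hypothetical naming scheme
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(path)):
        with open(cache_path, "rb") as cached:
            return pickle.load(cached)       # cache is fresh: skip parsing

    with open(path, encoding="utf-8") as source:
        tree = parse(source.read())          # slow path: actually parse
    with open(cache_path, "wb") as cached:
        pickle.dump(tree, cached)            # save the binary form for next time
    return tree
```

Comparing timestamps is cheap but coarse; hashing the file contents would be a more robust way to detect changes.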

2. Evaluating means breaking an expression tree down until you're left with a meaningful result. The expression above can be represented as a tree like this:

    +
   / \
  +   3
 / \
1   2

The top-level expression is the addition of an expression and 3. Before I can evaluate this, I need to evaluate the nested addition expression, 1 + 2, which evaluates to 3. My tree is now simpler:

  +
 / \
3   3

This can be fully evaluated to 6 and the result can be returned.
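The same reduction can be written as a short recursive function. This Python sketch covers only the two node types used in the example, with class names that mirror the constructors above; a real evaluator has many more cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumInt:
    value: int


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def evaluate(expr):
    """Reduce an expression tree to a plain integer."""
    if isinstance(expr, NumInt):
        return expr.value                     # a leaf is already a result
    if isinstance(expr, Add):
        # Evaluate both subtrees first, then combine them.
        return evaluate(expr.left) + evaluate(expr.right)
    raise TypeError(f"unknown expression: {expr!r}")


tree = Add(Add(NumInt(1), NumInt(2)), NumInt(3))
print(evaluate(tree))  # 6
```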

Finally, there's a process known as "bootstrapping" in which a programming language becomes "self-hosting." Scotch is not self-hosting, because it's implemented in another language, Haskell. Many popular languages (Python, Ruby) are implemented on top of another language, which is often C. Haskell is an example of a self-hosting language, because its main compiler, GHC, is itself written in Haskell. There can be performance benefits to self-hosting languages, but it seems at first to present a kind of chicken-and-egg problem. So, how does a language become self-hosting?

1. First, an interpreter is created, using some existing language.
2. A compiler for the new language is written in the new language itself.
3. The interpreter is used to run the compiler on itself, effectively compiling itself.

You now have a compiled program that can compile other programs, free of any intermediate language.

If you want to explore the technical details of programming language implementation, feel free to browse the Scotch source code, which is available under the GNU General Public License.

Sunday, May 8, 2011

Science Denial

Vaccination, global warming...why do we have such a pervasive culture of distrusting scientists?

That's not to say that you should blindly accept anything. But when smart people who study something for a living all agree about something (say, that there's no link between autism and vaccination) and you can analyze the data yourself and it clearly points to the same conclusions (like a study of 537,000 Danish children that showed no difference in autism rates between vaccinated and unvaccinated children...) I just don't understand the arrogance it must take for someone to ignore that and think they somehow know better, based on the word of some politician or celebrity. We have outbreaks in Utah of both pertussis and measles because people think they know better than doctors and scientists.

I guess Galileo would argue that this isn't a new problem, though.

Thursday, March 17, 2011

The baby

Ruben James Morris, born 3/16, weighed 7 lbs, 13 oz. Happy birthday, Ruben!

Tuesday, February 15, 2011

Utah legislature wants to eliminate tenure

Sometimes it seems like Utah state legislators are in a competition to see who can pass the stupidest bill. Chris Herrod, R-Provo, recently threw his hat into the ring with a call to eliminate tenure for Utah professors.

There's a reason professors get tenure. Academics need to be free to pursue their research without fear of being fired for political reasons. In a state whose legislature frequently passes bills to make political statements on scientific issues, such as formally questioning global warming, this is absolutely essential for unbiased research to be carried out.

The optimist in me thinks this bill will find little support. Should it pass, it is going to be a huge blow to Utah higher education. Current tenured and tenure-track professors would not be affected, but Utah's two research institutions, University of Utah and Utah State University, would have a tough time recruiting new talent. Why would a professor choose a non-tenure track position in Utah when there are 49 states that are willing to offer tenure?

The bill is Utah House Bill 485, and its status can be tracked here.

2/24/11 update: the bill officially died in committee, though it was not without support; the House Education committee vote was 9-3 against, with 3 abstaining.

Herrod: "I don't understand the controversy."

Thursday, February 10, 2011

Infinite Functions

A cool recent development in Scotch: functions that never end, which can be used in combination with take to get a specific number of results and stop evaluating.

There are a few functions now that use "filter" on an infinite list, i.e.

evens = filter(even, [1..])

which returns all even numbers, which there are infinitely many of. If you try to evaluate "evens" the interpreter will hang, trying to call the "even" function on an infinite list.

Unfortunately, it's impossible for Scotch to figure out that this is an infinite list; that would be equivalent to solving the halting problem, an undecidable problem in computer science. It's impossible to even tell that the evaluation of [1..] does not terminate. Why? Well, for starters, how can we tell that [1..] is not just [1..1000]? We can only tell when the evaluation of [1..] reaches 1001. But then we don't know that it's not just [1..1002]. Basically, you can't predict future behavior of the list; you can only know that it has terminated. If it never terminates, you have no way of knowing it never will. So, Scotch will happily try to evaluate this infinite function.

In practice, infinite functions can be used like this:

take 100 from evens

or

take 1000 from primes

to get a set number of results from a function that never terminates.
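For comparison, Python gets the same effect with generators, which only produce values on demand. This is an analogy rather than a description of how Scotch works internally; the evens and take helpers are written just for this example.

```python
from itertools import count, islice


def evens():
    """An endless stream of even numbers, produced lazily."""
    return (n for n in count(1) if n % 2 == 0)


def take(n, iterable):
    """Pull the first n items and stop, leaving the rest unevaluated."""
    return list(islice(iterable, n))


print(take(5, evens()))  # [2, 4, 6, 8, 10]
```

Calling list(evens()) directly would hang for the same reason that evaluating evens does in Scotch: nothing ever tells the evaluation to stop.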

I also tweaked things so that using take with the sum of a list and something else will evaluate to just take from the list if it's long enough. So infinite functions that involve addition, like this,

infinite_range(n) = [n] + infinite_range(n + 1)

can be used in combination with take to get a set number of results:

>> take 10 from infinite_range(1)
[1,2,3,4,5,6,7,8,9,10]

This type of expression required a different approach to be evaluable. The function, when called, would evaluate like this:

[1] + infinite_range(2)
[1] + ([2] + infinite_range(3))
[1] + ([2] + ([3] + infinite_range(4)))
...

The problem here is that the never terminating (and therefore never fully evaluable) function infinite_range is always trapped inside parentheses with the elements that need to be added to the list, so this function could never be evaluated.

The solution was to automatically rewrite addition expressions associatively: a + (b + c) should always be rewritten as (a + b) + c. This allows infinite sums like infinite_range to evaluate to [n, n+1, n+2 ... n+m-1] + infinite_range(n+m), which can be used in combination with "take m" to get the list of m elements.
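Here is a toy Python model of that rewrite. It is a sketch under several assumptions, not the Scotch implementation: Call and Concat are made-up node types for an unevaluated function call and an unevaluated list addition, and the left operand of every addition is assumed to be a literal list.

```python
from dataclasses import dataclass


@dataclass
class Call:
    func: object   # function to call later
    arg: int       # its argument


@dataclass
class Concat:
    left: object   # assumed to be a plain list in this sketch
    right: object  # a list, a Call, or another Concat


def infinite_range(n):
    # [n] + infinite_range(n + 1), with the recursive call left unevaluated
    return Concat([n], Call(infinite_range, n + 1))


def take(m, expr):
    prefix, rest = [], expr          # invariant: the whole value is prefix + rest
    while len(prefix) < m and rest is not None:
        if isinstance(rest, list):
            prefix, rest = prefix + rest, None
        elif isinstance(rest, Call):
            rest = rest.func(rest.arg)            # expand the call by one step
        elif isinstance(rest, Concat) and isinstance(rest.left, list):
            # a + (b + c)  becomes  (a + b) + c
            prefix, rest = prefix + rest.left, rest.right
        else:
            raise TypeError(f"unsupported expression: {rest!r}")
    return prefix[:m]


print(take(10, infinite_range(1)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because the evaluated elements keep migrating to the left, take can stop as soon as the prefix is long enough, and the never-ending call on the right is simply dropped.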

Thursday, February 3, 2011

Cross-platform Package Manager

One of the reasons I love Ubuntu is apt-get and the huge software repositories that are set up by default. As a developer, I install a lot of software, and being able to install a library or application and all its dependencies with one terminal command is a huge time saver.

So my question - why is there no native package repository on Windows/Mac? And really, why isn't there a cross-platform packaging system yet? (I've seen some attempts to create an "apt-get for Windows" but they all seem to have died out.) Developers could release a single package that would work for any supported OS and be done. The system itself could deal with platform-specific issues so the developer didn't have to. Just a thought.

Friday, January 28, 2011

Scotch 0.3.0 is here

I've been working hard lately to hit all of my release goals so I could get a new, improved version of Scotch out the door.

You can read about the changes (or the basics) in the new documentation. Aside from a lot of new functionality, one improvement that I'm really happy about is just-in-time compilation. This results in a dramatic speedup by greatly reducing the number of times each module is parsed. Previously, parsing accounted for all of the major cost centers affecting running time. A big culprit was std.lib, which is loaded at startup and then reloaded into every other imported module as it was interpreted. JIT compilation has resulted in about a 10x speedup for the cases I tested, making Scotch once again faster (again, for specific benchmarks being tested) than object oriented languages in its class such as Python and Ruby.
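The effect on std.lib is easy to picture with a tiny Python sketch. This is only a model of "parse each module once and reuse the result", not Scotch's actual JIT code; parse_module and load_module are hypothetical names, and the parser is a stand-in that just counts calls.

```python
parse_count = {}      # how many times each module was actually parsed
_module_cache = {}    # parsed modules, keyed by name


def parse_module(name):
    """Stand-in for the expensive parsing step."""
    parse_count[name] = parse_count.get(name, 0) + 1
    return f"<parsed {name}>"


def load_module(name):
    """Parse a module the first time it is requested, then reuse the result."""
    if name not in _module_cache:
        _module_cache[name] = parse_module(name)
    return _module_cache[name]


# Three different modules each import std.lib...
for importer in ("main", "utils", "geometry"):
    load_module("std.lib")

print(parse_count["std.lib"])  # 1
```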

A Scotch website is in the works, but for now you can head to http://scotchlang.org which is a redirect to the front Wiki page of the Github project. Downloads are available here. And, check out the build instructions here.

Thursday, January 20, 2011

Support Scotch

Throw in a couple bucks and keep Scotch going!

You can now donate to support continued development of Scotch via Pledgie/PayPal using this button:

Click here to support the Scotch programming language with a donation at www.pledgie.com.

Scotch has been developed so far using a lot of my spare time, which I now have very little of. Scotch has already implemented some cool ideas and is definitely the first language of its kind, and there are still a lot of cool plans for the future. Your donation will be used to fund development and other direct costs (hosting, for example) associated with Scotch, and every dollar will be accounted for.

Our first goal is to acquire a used Mac Mini (around $350) so that we can build Mac packages and support Mac users better.

You can also support Scotch by helping with development, or by spreading the word.

Friday, January 14, 2011

Scotch : Haskell :: Python : C

A new programming language is useless if it doesn't solve a problem. With that in mind, I want to explore the main problem that Scotch solves.

Let's start with an analogy that expresses the basic motivation behind Scotch: Scotch aims to do for functional programming what Python has done for imperative. Or, in other words, Scotch : Haskell :: Python : C.

In terms of functionality, little of Scotch is new, although I do think the type system sets it apart from popular languages. Many significant functional programming languages already exist: Haskell, the Lisp family, the ML family, F#...so why do we need a new one?

Fortunately, Scotch has an important use case that none of these languages covers: it's an easy-to-use scripting language. Scotch is interpreted, its syntax is intuitive, its type system won't fight you, variables can be redefined, and types never need to be declared: all features it shares with Python and Ruby, not other functional languages. A good developer equally familiar with Scotch and Haskell or Lisp or some other functional language will be able to throw a simple project together faster in Scotch. I'd also wager that, all things equal, this developer would be able to develop faster in Scotch than in Python, just as Python enables faster development than C.

Why?

I like to compare Scotch to Python; both are basically a bundle of things that have been done before, but both emphasize readability and developer productivity as two major selling points. Python achieves better readability in part by using simple, intuitive syntax. I would say that Scotch goes even farther in this direction because its syntax is almost identical to the most common, understandable notation in existence: math.

Take this example of function definition. In Python:

def area(length, width):
    return length * width

In Haskell:

area :: (Num a) => a -> a -> a
area length width = length * width

Scotch:

area(length, width) = length * width

And in math:

A(l, w) = lw

To avoid identifier ambiguity, the Scotch version is not quite the same as math, but I'd say it's close enough. Comparing the first three, the third is the most concise; to anyone with a basic math background, it is also the most intuitive.

Computer programs express mathematics. Historically, programming languages have contorted mathematical notation to do things we wanted to do. With Scotch, the program looks like the math that it expresses. I believe that scientists, mathematicians, or anyone who uses math in their daily lives (that's everyone) will be able to read and understand a Scotch program - instead of learning a new "language," they simply have to learn how to apply a language with which they're already familiar.

I didn't develop Scotch to be popular; I invented it to scratch my own itch, creating a kind of hybrid of two of my favorite languages, Haskell and Python. At the moment, I've written more Scotch than anyone else on Earth (and I imagine it will stay that way for some time). The verdict: I've found developing the Scotch standard library to be a refreshingly painless experience compared to other languages I've worked in.

Sunday, January 9, 2011

Scotch 0.2.0 is out

School starts tomorrow, so my spare time is going to disappear for a bit. With that in mind, I'm packaging Scotch as version 0.2.0 as a release at GitHub (head to the "downloads" page.)

For help getting started, check out these code examples.

Version 0.1 was a prototype that I released just to get something up. Some significant improvements in 0.2.0 include:

Significantly more efficient - features like tail recursion and a hash table for variable/function bindings enable version 0.2.0 to run much faster
File input/output
Support for threads
Implemented hash tables
Algebraic data types
Anonymous functions
Expanded std.lib with more useful functions

Plans for the future include:

Native implementation of Scotch interpreter, in Scotch
Scotch-to-C compiler
Just-in-time compilation to an intermediate format - this will likely result in about a 50% speedup
Systems programming
Infinite lists

Monday, January 3, 2011

What's New with Scotch: 1/3/11

Here's a summary of some new features that have been recently added in Scotch:

Threads
Scotch now supports lightweight threads (lwt), which currently borrow from the Haskell lwt model. They can be used like this:

>> thread (do print 1; print 2; print 3;)
1
2
3
>> a = do print 1; print 2;
>> b = thread a;
>> execute([b] * 2)
1
2
1
2

Currently, state is not shared between threads; some sort of simple shared variables is needed, and it's on my to do list.

Algebraic Data Types
The ADT model is considerably different from other functional languages because of Scotch's type system. For object-oriented programmers, ADTs are kind of like lightweight classes that you don't need to define before you use (but without encapsulation - in functional programming, behavior is handled by functions, not methods.)

For example:

>> Apple
Apple
>> Apple 1
Apple 1
>> Apple (Banana(1,2,3))
...

So, a custom type is an Atom (WhichLooksLikeThis) followed by any number of values (if there is more than one, parentheses are needed to eliminate ambiguity.) These custom datatypes can be used in function pattern matches like so:

>> a = Apple 1
>> f(Apple a) = a
>> f(a)
1

This pattern matches Apple followed by a value, and binds 'a' to that value.
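For readers coming from Python, a rough equivalent looks like this. It is an analogy only: unlike Scotch, Python needs the Apple class declared up front, and the explicit isinstance check stands in for the pattern match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Apple:
    value: object   # the single value carried by the Apple


def f(x):
    """Mimic f(Apple a) = a: match an Apple and return what it wraps."""
    if isinstance(x, Apple):
        return x.value
    raise ValueError(f"no pattern matched: {x!r}")


a = Apple(1)
print(f(a))  # 1
```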

Anonymous Functions
Wherever a function would be passed as an argument, like in this example...

>> apply(f, x) = f(x)
>> apply(add(10), 5)
15

...an anonymous function can also be used, which looks like this:

>> apply(a -> a + 10, 5)
15

This allows you to define functions without binding them to a variable. (Incidentally, these functions are a type of value, so they can be bound to a variable if you feel like it: f = a, b -> a + b)
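The closest Python counterpart is lambda. In the sketch below, functools.partial stands in for Scotch's add(10) partial application; that mapping is my own analogy, not something built into Scotch.

```python
from functools import partial
from operator import add


def apply(f, x):
    """Call f on x, like apply(f, x) = f(x)."""
    return f(x)


print(apply(partial(add, 10), 5))   # 15, using a named function with one argument fixed
print(apply(lambda a: a + 10, 5))   # 15, using an anonymous function

# Anonymous functions are ordinary values, so they can be bound to a name too.
f = lambda a, b: a + b
print(f(2, 3))  # 5
```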
