On bringing stories to data, or, the trouble with the Cubists

Ben Klemens
Nov 24, 2015

I did badly in every statistics class.

Econometrics 301 was my only D grade as an undergrad, and grad school was only marginally better. This is an essay about how I went from not getting it at all to writing an open-source, production-quality library of statistics functions almost from the ground up. It’s an essay about why I started working on it in 2005, and the problem in quantitative social science that led me to keep pushing until the present. There’s something of a happy ending: I kept hacking despite constant pressure to just go with the flow and accept the flaws of existing methods, and the library, Apophenia, is now in a complete enough state that one can use it for its intended purpose of bringing narratives to data.

Maybe I could start my story earlier, in ninth grade science class, where we learned about an idealized Scientific Method, which begins with a hypothesis, almost always expressed in some sort of narrative (A happens, which leads to B, which, if C is present, leads to D), and then tests that narrative using data. But applied statistics today is an eerie world where there are no stories, only a modest set of forms that bear some distant relation to the narratives they ostensibly test. This essay explains the current norms, and my efforts to come back to the ninth grade ideal of bringing narratives to data.

The first realization that got me from D-grade undergrad to my current status as a professional statistician/economist: understanding the difference between probability and statistics. Probability is an objective branch of mathematics, a bundle of theorems all of the form if these conditions hold, then this distribution will result. You can’t argue with the probability textbooks unless you want to get into philosophy of science questions.

Statistics is the subjective part of the story, the human art where we impose a view on the real world that we hope is somehow useful. After encapsulating a worldview in a set of equations and assumptions, we can describe certain properties of those models using the tools of Probability. Stats textbooks, taking an air of authority, often blur the distinction between statistics and probability, implying that it’s all on equally objective footing. As an undergrad, I couldn’t see which was which, and it was all downhill from there.

Speaking of art, here are “Woman with a lute”, by Johannes Vermeer; and “Girl with a mandolin”, by Pablo Picasso.

Vermeer and Picasso via Wikimedia Commons

Vermeer treats the context in which the person is playing as worthy of every detail; Picasso seems to want us to focus on the player and instrument. Picasso’s girl and mandolin blur together in parts, which is factually incorrect but perhaps a reasonable description of a person enraptured by music. I’m not going to say that one is a “better” representation of a person playing an instrument than the other, but I am willing to say that Vermeer painted a more accurate and precise representation. Given specific questions like ¿What is the player wearing? or ¿How many fingers does she have?, I’d rather have the Vermeer as a reference than the Picasso.

“Can’t escape this line of best fit.” Here’s some background music for you to enjoy while reading this essay, by Death Cab for Cutie. It looks like it was posted to Youtube by their label.

The second thing that I didn’t understand earlier about statistics is that, despite the implications of the textbooks, the linear models that have been the core of applied statistics for decades are far closer to Picasso than Vermeer.

A typical model might posit that

income = β_0 + β_1 * sex + β_2 * years of schooling + β_3 * years of work + ε.

How this works is that you give me a set of observations, and I will use linear algebra (not even probability or statistics!) to pin down the βs defining the line of best fit to the data. Then, you give me a new person’s sex, years of schooling, and years of work, and I can use the now-specified formula to sum up the prediction for his or her income. A few assumptions will even let me use theorems from probability to say something about our confidence regarding the estimates, given the model form and its assumptions.
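
To make the mechanics concrete, here is a minimal sketch in Python with NumPy (not any particular stats package); the observations and incomes are invented purely for illustration.

import numpy as np

# Made-up observations for illustration: columns are
# [sex, years of schooling, years of work]; incomes are invented numbers.
X = np.array([[0, 12,  5],
              [1, 16,  3],
              [0, 16, 10],
              [1, 12,  8],
              [0, 18,  2]], dtype=float)
income = np.array([38_000, 52_000, 61_000, 41_000, 55_000], dtype=float)

# Prepend a column of ones so beta_0 acts as the intercept, then find the
# least-squares betas: pure linear algebra, no probability involved yet.
X1 = np.column_stack([np.ones(len(X)), X])
betas, *_ = np.linalg.lstsq(X1, income, rcond=None)

# Predict income for a new person: sex = 1, 14 years of school, 6 years of work.
new_person = np.array([1, 1, 14, 6])
print(new_person @ betas)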

Please let me remind you just how implausible this model is. Is third grade more or less important than a year of grad school? The model says there’s a global factor, β_2, that bumps up incomes in both cases. Does work experience affect a person differently from school? Sure, in that a year of the first raises income by β_3 and the second by β_2, but otherwise they’re structurally identical. Is sex relevant because women take time off to have babies, or because they’re not encouraged to take math/science classes, or because bosses are jerks? All we know is whether β_1 is positive or negative.

I know you’ve heard this one before: “essentially, all models are wrong, but some are useful.” But a model with no serious underlying narrative, which is a drawing of the world using only straight lines and the occasional curve, whose only structure is a sum of individual elements, scores especially badly on usefulness.

We could perhaps use a model with no narrative to show whether income and education are positively correlated, and sometimes that’s all the utility we need. We could address confounding factors by adding extra terms to the sum [… + β_4 * height + β_5 * parent’s income + … ] and see if the correlation still holds. Adding terms like this is typically written up in academic papers as “controlling for” other factors, drawing a (theoretically ill-founded) metaphor to controlled experiments. It is how research seems to progress in the simple linear regression world, as the thing correlated to income in a study from one year gets added as an extra term to every regression in later years.

Despite all its cons, linear regression has one very notable pro: it’s easy to implement, even in FORTRAN 77. Yes, the 77 stands for 1977 which, by no mere coincidence, is the beginning of the period when linear regressions became increasingly common. In the present day, there is a regression feature on almost anything that handles numbers: stats packages, spreadsheets, database systems, expensive hand calculators.

And so, simple linear models are prevalent in the social sciences.

In November 2014, I surveyed the fifty most recent working papers from the World Bank’s research working paper series, and the fifty most recent from the Census Bureau’s Center for Economic Studies’ working papers series. It’s not a universal survey, but I’m comfortable taking economists at the WB and Census as typical examples of working PhD economists. I divided the papers into (A) papers that are only qualitative or only theoretical, and so make no effort to fit a model to real-world data (31 out of 100); (B) papers that use only linear methods in the family of the example above (58 out of 100 papers); and (C) papers that use anything from the infinite space of models outside of linear models (11 out of 100 papers).

Let’s start building an alternative narrative, something one might find in that thinner wedge of not-linear models. We might start with an underlying factor called “human capital”, which people build over their lives, maybe quickly at first but with diminishing returns. People then use the human capital they have and some chosen hours per week of work to gain income. We’re up to a three-layer model: choice and method of human capital development, choice of work hours, and a function mapping work to income. You can’t be a doctor and thus score doctor incomes without a medical degree, so we could also divide people by fields of study, which gives us another layer, tied to but distinct from the others.

We could implement this in computer code as a series of distributions and equations linking those distributions, a hierarchy of submodels. Or, we could simulate a million individual agents, each making life choices about fields and school and hours of work, then measure the aggregate outcomes given all those individual agents’ decisions. In the 1990s that was cutting edge; now that sort of agent-based model is a textbook exercise.
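
Here is a minimal agent-based sketch of that narrative in Python with NumPy. The distributions, the diminishing-returns rule, and the income function are all placeholder assumptions chosen for illustration, not anyone’s calibrated model.

import numpy as np

rng = np.random.default_rng(seed=42)
N_AGENTS, YEARS = 1_000_000, 20

# Layer 1: each agent builds human capital year by year, with the yearly
# gain shrinking as the existing stock grows (diminishing returns).
capital = np.zeros(N_AGENTS)
for _ in range(YEARS):
    capital += rng.gamma(shape=2.0, scale=1.0, size=N_AGENTS) / (1 + 0.1 * capital)

# Layer 2: each agent chooses weekly hours of work.
hours = rng.choice([0, 20, 40, 60], size=N_AGENTS, p=[0.1, 0.2, 0.6, 0.1])

# Layer 3: a function mapping human capital and hours worked to income.
income = 500 * hours * np.log1p(capital)

# Measure the aggregate outcome of a million individual decisions.
print(np.percentile(income, [10, 50, 90]))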

With a nontrivial narrative, different possibilities take different forms. Women might choose to opt out of the work force to have babies (choice of hours layer), or might choose different lines of work (field of study layer), or maybe bosses are jerks (layer mapping work to income). Later researchers building on the model might have a different form for the choice-of-hours layer, or might want to add another layer about location choice. This sort of tweak-by-tweak improvement and comparison is actively encouraged among agent-based modelers, and you can download hundreds of models at OpenABM and make such improvements.

[FWIW, this picture of a bartender is being used for the novel purpose of illustrating the Balkanization of social science research, taking a single fuzzy still from a feature-length movie, and is really no substitute for seeing the film, a time capsule of Chicago and its music from that era.]

There’s a scene in The Blues Brothers (a film from 1980) where the band wanders into a rural bar. They ask the waitress, “what kind of music do you usually have here?” and the waitress says, “Both kinds: country and western!”

So, that’s the state of social science modeling today. The people using the cubist linear models seem blissfully unaware that more detailed narrative models can even be written. Nor are they alone in being alone: the bartendress who learned it all from a machine learning textbook will cheerfully offer you both kinds of models, support vector machines and random forests.

In the early 1900s, journals didn’t think it a big deal to publish a seventy-page treatise as just another article. In the present day, there’s a joke unit called the MPU — the minimal publishable unit — and journals are filled with papers containing exactly a single MPU. Authors want a longer résumé, which comes from lots of 1MPU articles. Journal editors will have an easier time finding peer reviewers for a 1MPU paper than something approaching a monograph. Readers want to feel that they understand the article after ten minutes of attention.

Running a somehow-novel suite of linear regressions is one MPU. It will get you funded and published.

So why push harder? In fact, a 2MPU paper with novel results using novel methods is vulnerable to criticism for using untested methods. The model underlying a linear regression is clearly implausible in almost every case, but it’s certainly familiar, to the point that it would be awkward for an editor to reject a paper because it chose the same method thousands of previous papers had used.

For a brief, shining moment, The Brookings Institution, a well-regarded and very applied think tank that has every incentive to stay within the methodological orthodoxy and pump out MPUs, had a research group with some of the pioneers of agent-based modeling. For reasons that are largely inexplicable, I was a part of this ideally-situated team of ideal members.

Working with a group that pushed new methods in a traditional context was wonderful in a hundred ways, but also had some frustrations. I wanted to talk about our work in mainstream terms to mainstream researchers but didn’t have the tools to do so. They were filling pages with precise information: “β_2 is 2.3487 with a margin of error of ±.01428” sort of things, while we had other largely qualitative results that didn’t look like those traditional forms. I tried estimating the parameters of my agent-based model of cross-border migration using R (a popular implementation of the S programming language by two guys named Ross and Robert), and R couldn’t do it. Its model syntax was clearly aimed at linear combinations, not models with millions of decision making agents. The more I tried, the more it broke my heart. I could try the agent-based modeling packages, which produced attractive Mondrian-inspired pictures, but they didn’t even contemplate accommodating traditional statistical forms.

This is when I started writing Apophenia, the library of stats tools and models that this essay is announcing.

“Apophenia” means the human tendency to see patterns in static, like faces in clouds or trends in random data streams, or swirls in the snowflake pic at the top of this essay. It’s a nice word, which is more than enough reason for using it as a name.

The MPU is a reflection of the normal course of modern science, in which a research project consists of working on some little corner of a big problem.

The big problem I’ve attempted to convey to you here is that the social sciences rely heavily on models with no serious narrative underpinnings. The little corner of it that I’ve been working on is Apophenia.

The core feature is a definition of a model: a single generic form that provides a consistent interface for any narrative, linear model, agent-based microsimulation, whatever. Then we can write down all sorts of transformation functions that take in one or two uniform models and output another uniform model, which can be input to the next transformation down. For those of you who really want the exposition, here’s my 45-page Census working paper (PDF) detailing the model object and some transformations.

For example, once you have a uniform model structure to describe a distribution of one person’s change in human capital in one year (herein D1), then it’s easy to simulate a million agents each building on to their stock of human capital with a new draw from D1 every year, and write down the resulting population distribution of human capital after several years as D2. Then we could easily apply a transformation of arbitrary complexity to D2 to produce a distribution of incomes, D3. If we later decide there isn’t enough jargon in the aggregate model, we could replace D1 with a Bayesian hierarchy estimated via Markov chain Monte Carlo, and the rest of the narrative won’t have to change.
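
To give the flavor of such a chain, here is a toy sketch in Python. It is not Apophenia’s actual interface (the library is C); the Model class and the accumulate and through transformations are hypothetical names standing in for the uniform model object and its transformation functions.

import numpy as np

rng = np.random.default_rng(seed=1)

class Model:
    """A toy uniform model: anything that can produce random draws."""
    def __init__(self, draw):
        self._draw = draw
    def draw(self, n):
        return self._draw(n)

# D1: one person's change in human capital in one year (an assumed Gamma form).
d1 = Model(lambda n: rng.gamma(shape=2.0, scale=1.0, size=n))

# Transformation: simulate a population of agents, each adding a yearly draw
# from a base model to its stock, and return the population as a new Model.
def accumulate(base, years, agents=100_000):
    stock = np.zeros(agents)
    for _ in range(years):
        stock += base.draw(agents)
    return Model(lambda n: rng.choice(stock, size=n))

# Transformation: push one model's draws through a function to get a new Model.
def through(base, fn):
    return Model(lambda n: fn(base.draw(n)))

d2 = accumulate(d1, years=20)                       # population human capital, D2
d3 = through(d2, lambda hc: 20_000 * np.log1p(hc))  # population income, D3
print(np.mean(d3.draw(10_000)))

Swapping D1 for a richer submodel changes one line; the definitions of D2 and D3 stay exactly as they are.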

Coming up with a uniform model structure may sound like a literal formality, but it is the building-block structure that lets us build chains like the above example without spending an impracticable amount of time wiring together incompatible bits and then trying to debug the mess we’ve just built.

The idea that uniformity allows us to have easily interchangeable parts is certainly not original, and on many occasions I have had a dialogue like this:

An academic: It’s completely obvious that systems should be written around models with a uniform form that can be combined and transformed in a uniform manner. Say something original.

Me: Do you have any examples of statistical and scientific computing platforms that are built around a general model object like that?

AA: Well, not exactly. But you could implement this in R in an hour.

Me: Could you point me to an R package that does this?

AA: Not really, no. But I’ll bet nobody’s done it because it’s just so obvious.

Why is the idea so obvious, and yet (to the best of my knowledge) Apophenia is the first serious implementation of it? Because implementing this concept is hard. Every step of implementing a general model building block raised another difficult computing, statistical, modeling, or user-interface question. And we don’t just want to write down these models and admire them; we want to fit them to data and make probabilistic statements about them. It’s a tall order, and I’m not sure why I thought it would be easy.

Nor do I quite know why I kept working on Apophenia for a decade, after realizing just what I’d gotten myself into. The rational me realized that the correct thing to do is to put out an MPU several times a year, and working on Apophenia was the exact opposite of that.

On the plus side, I got paid to do it. Apophenia was very appropriate for the work I did at the Census working out novel large-scale statistical processes, and every time a problem came up in a Census project, I or a coworker added a feature to Apophenia. So thanks, U.S. taxpayers, for your contribution. There are a lot of places that would never put up with an employee who fails to publish a stream of MPUs and instead puts all his efforts into writing new tools to facilitate better methods.

I don’t know if I’d do it all again, but it exists now, so nobody has to. Whatever language you’re writing in — R or Julia or Python or even Perl — has a mechanism for linking to libapophenia. And then you don’t have to work out how to get the Bayesian updating routine in this season’s favored language to accommodate agent-based models.

After one presentation on the algebraic system of models I talk about in the above-linked paper, an audience member revealed that he had received an NSF grant a few months earlier to implement something that would be a relatively easy application of Apophenia. He — and the NSF panel that funded him — didn’t know that there was a stats library that he could use as a back-end. I left that talk depressed. No matter how well we do with the technical problems, the social glitches prevail.

So, now you know it exists. Tell your friends: Apophenia version 1.0 is out! It’s part of Debian Stretch! You can download it from GitHub! You can help to make it better, or front-end it for people who aren’t as good with scientific computing as you are! You can learn from its effort to improve narrative models, throw it all away, and do something even better!

Most importantly, you don’t have to settle for describing the world as a linear combination of independent elements! There’s debate about whether social science is worth it, but I’m a believer: research into how people make decisions and how they form what we call society can be done, and can have real influence on the real world. To take all the research done in every aspect of behavior, encode it into careful, detailed models, and then fit a rough and basic simplification to the data because Stata couldn’t handle the fully structured model is heartbreaking, a vote that social science isn’t worth doing with full vigor. Let’s keep pushing to do better.


Ben Klemens

BK served as director of the FSF’s End Software Patents campaign, and is the lead author of Apophenia (http://apophenia.info), a statistics library.