One of the most annoying arguments that I see popping up again and again is the question of "Which is better, theory or data?" (A related bore-fest is "Which is better, induction or deduction?") Actually, you can't have one without the other. In a recent blog post, Paul Krugman points out that you can't have data without theory:
But you can’t be an effective fox just by letting the data speak for itself — because it never does. You use data to inform your analysis, you let it tell you that your pet hypothesis is wrong, but data are never a substitute for hard thinking. If you think the data are speaking for themselves, what you’re really doing is implicit theorizing, which is a really bad idea (because you can’t test your assumptions if you don’t even know what you’re assuming.)
True. Suppose you find a correlation between having an unfulfilling sex life and liking Charlie Kaufman movies. Does that mean that people watch Charlie Kaufman movies to ease the pain of their lame sex lives? Or did the fact that they watch Charlie Kaufman movies actually ruin their sex lives? Or are both the result of some third factor, such as being an insufferable hipster? The correlation alone can't tell you which story is right. Unless you commit to one of them and test it, you'll never understand what's really going on. Even if all you care about is predictive power - you want to be able to catch someone watching a Charlie Kaufman movie and say "I bet girls won't touch that guy with a 10-foot pole" - you still need to assume that the correlation is stable over time, and your assumption is a theory.
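To make the "third factor" story concrete, here's a toy simulation. Everything in it is made up for illustration: the `hipster` flag, the effect size of 2.0, and the noise are assumptions, not data about actual Kaufman fans. In this fake world, neither variable causes the other, yet they come out correlated anyway, and the correlation goes away once you look within each hipster group.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_confounded(n=20000, seed=0):
    """Hypothetical world: being a hipster drives BOTH scores.

    Neither score has any direct effect on the other.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hipster = rng.random() < 0.5
        kaufman_score = 2.0 * hipster + rng.gauss(0, 1)
        unfulfilled_score = 2.0 * hipster + rng.gauss(0, 1)
        rows.append((hipster, kaufman_score, unfulfilled_score))
    return rows


def correlations(rows):
    """Return the overall correlation and the correlation within each group."""
    overall = pearson([k for _, k, _ in rows], [u for _, _, u in rows])
    within = {}
    for flag in (True, False):
        group = [(k, u) for h, k, u in rows if h == flag]
        within[flag] = pearson([k for k, _ in group], [u for _, u in group])
    return overall, within


if __name__ == "__main__":
    overall, within = correlations(simulate_confounded())
    print("overall correlation:", round(overall, 3))
    print("among hipsters:", round(within[True], 3))
    print("among non-hipsters:", round(within[False], 3))
```

The catch is that you can only split by hipsterdom if you already suspected it mattered and bothered to measure it, which is to say, if you had a theory.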
It's equally true that you can't actually have theory without data. A theory is always about something that you think is going on in the world, so you can't have something to theorize about without first seeing something happen in the world (i.e. data). For example, suppose my theory - which I deduced from some sort of a priori assumptions - is that watching Charlie Kaufman movies ruins one's sex life. I couldn't have come up with that theory without observing the existence of Charlie Kaufman movies.
So just as "data with no theory" is really just an implicit vague theory, "theory with no data" is really just sparse, unsystematic data. You can't have one without the other.
But what you can do is be lazy with theory or be lazy with data. You can be an armchair philosopher, dreaming up ideas about how the world works without ever bothering to find out if your ideas are right. Then you get something like this:
Or you can be a "regression monkey", sitting there sifting for correlations without having any idea what you're looking at. Then you get something like this:
Obviously, if you're going to get good results, you shouldn't do either of these.
But which is a bigger menace to society, laziness about data or laziness about theory? Theory-laziness is seductive because it's easy - mining for correlations isn't very mentally taxing. But data-laziness is seductive because theorizing is hard - the more complicated and intricate a theory you make, the smarter it makes you feel, even if the theory sucks.
In the past, data-laziness was probably more of a threat to humanity. Since systematic data was scarce, people had a tendency to sit around and daydream about how stuff might work. But now that Big Data is getting bigger and computing power is cheap, theory-laziness seems to be becoming more of a menace. The lure of Big Data is that we can get all our ideas from mining for patterns, but A) we get a lot of false patterns that way, and B) the patterns insidiously and subtly suggest interpretations for themselves, and those interpretations are often wrong.
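Point A is easy to see for yourself. Here's a rough sketch, with all the numbers chosen by me for illustration: take 50 observations of a target that is pure noise, then test 1,000 candidate predictors that are also pure noise. The cutoff `CRITICAL_R = 0.279` is my approximation of the two-sided 5% critical value for a correlation with 50 observations. By construction there is nothing to find, yet you should expect roughly 5% of the candidates to clear the bar.

```python
import random

# Approximate two-sided 5% critical value of |r| for 50 observations
# (an assumption for this sketch; recompute it if you change n_obs).
CRITICAL_R = 0.279


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mine_noise(n_obs=50, n_candidates=1000, seed=1):
    """Correlate many pure-noise predictors with a pure-noise target.

    Returns (number of 'significant' candidates, largest |r| seen).
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = abs(pearson(candidate, target))
        if r > CRITICAL_R:
            hits += 1
        best = max(best, r)
    return hits, best


if __name__ == "__main__":
    hits, best = mine_noise()
    print("'significant' noise predictors:", hits, "out of 1000")
    print("strongest spurious |r|:", round(best, 3))
```

The strongest of those spurious correlations is the one a regression monkey would write up, and that's where point B kicks in: once it's sitting in front of you, it's very hard not to start telling a story about it.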
So anyway, I hope this post destroys all of those "data vs. theory" arguments forever and ever.