Saturday, October 17, 2015

In the MISO soup


Robin Hanson declares that thanks to Big Data, we will soon discover the SUPER FACTORS that drive all of human differences:
"In a factor analysis, one takes a large high-dimensional dataset and finds a low dimensional set of variables that can explain as much as possible of the total variation in that dataset. A big advantage of factor analysis is that it doesn’t require much theoretical knowledge about the nature of the variables in the data or their relations – factors are mostly determined directly by the data...
"[P]eople vary in far more ways than intelligence, ideology, and personality, and factor analyses have been applied to many of these other human feature categories. For example, there have been factors analyses of jobs, brands, faces, body shape, gait, accent, diet, leisure behavior, friendship networks, physical health, mortality, demography, national cultures, and zip codes.
"[F]actors found in different feature categories are often substantially correlated with one another. This suggests that if we put together a huge super-dataset describing many individual people in as many ways as possible, a factor analysis of this dataset may find important new super-factors that span many of these features domains. Such super-factors would be promising candidates to use in a wide range of social research, and social policy...
"I’d guess that the super-factors found in a super dataset of human details will instead be revolutionary. We will afterward see uncovering them as a seminal milestone in our progress in understanding human variation. A Nobel prize worthy level of seminality, or more. All it will take is lots of tedious work to collect a super dataset, and then do some straightforward number crunching."
Here's an object lesson in the perils of analyzing data without theory to guide you! Yes, it's easy to do a principal component analysis on a multidimensional data set and find some relatively small set of "factors" that "explain" most of the data. If we do what Robin says and throw everything we know about human characteristics into one massive data set and hit the PCA button, the Stata of the future will pop out our "super-factors" in short order.

One of the biggest super-factors will be income.

See, factor analysis doesn't tell you whether the factors cause all the other stuff, or are effects of the other stuff. In the world, there can be effects with multiple causes, and causes with multiple effects. In signals theory (a very different kind of signaling than the kind Robin is used to thinking about!), these might be called Multiple-Input-Single-Output and Single-Input-Multiple-Output, or MISO and SIMO.

An example of SIMO would be anxiety disorder. A penchant for severe anxiety is going to affect your working life, your interpersonal life, your hobbies, etc. in statistically predictable ways. One cause, many effects.

An example of MISO would be income. Our marvelous market economy allows people to make money using a dizzying myriad of talents, skills, and resources. Some people make money by hitting a ball with a stick and running around a field. Some people make money by making big macro bets in financial markets, getting the first one right by luck, and then taking in billions of dollars in management fees. Some people make money by being friends with the right politician. Some people make money by inventing new kinds of semiconductors. And so on, and so on. One effect, many causes.

Since money can buy a ton of stuff, everyone wants money. And since money can buy a ton of stuff, almost anything valuable can be sold for money. So if income is among the set of characteristics in Robin's ultimate data set, it will undoubtedly emerge as one of the most important factors.
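To make that concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than real data: the five independent "traits", the noise levels, the sample size, and the helper name top_component are all made up. It builds one MISO data set (income is the sum of unrelated traits plus noise) and one SIMO data set (a single hidden cause drives five outcomes), standardizes the columns, and looks at the first principal component of each.

```python
import numpy as np


def top_component(data):
    """Return (share of variance, loadings) for the first principal
    component of the standardized columns of `data`."""
    # PCA on standardized variables is an eigendecomposition of the
    # correlation matrix.
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    return eigvals[-1] / eigvals.sum(), eigvecs[:, -1]


rng = np.random.default_rng(0)
n_people, n_traits = 5000, 5  # hypothetical sizes

# MISO: many independent traits (causes) feed one outcome, "income".
traits = rng.standard_normal((n_people, n_traits))
income = traits.sum(axis=1) + rng.standard_normal(n_people)
miso = np.column_stack([traits, income])  # income is the last column
miso_share, miso_loadings = top_component(miso)

# SIMO: one cause (think of the anxiety example) drives many outcomes.
cause = rng.standard_normal(n_people)
effects = cause[:, None] + rng.standard_normal((n_people, n_traits))
simo = np.column_stack([effects, cause])  # the cause is the last column
simo_share, simo_loadings = top_component(simo)

print("MISO: first component share", round(float(miso_share), 3))
print("MISO: absolute loadings", np.round(np.abs(miso_loadings), 3))
print("SIMO: first component share", round(float(simo_share), 3))
print("SIMO: absolute loadings", np.round(np.abs(simo_loadings), 3))
```

In both runs the last column should carry the largest loading on the first component, even though it is a pure effect in the first data set and a pure cause in the second. The loadings by themselves give no hint about which way the arrows point.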

You can already see evidence of this in the media. Barely a day goes by without an announcement by Quartz or the Huffington Post that income differences predict differences in...you name it. School success, romance, self-confidence, frequency of weird eyebrow twitches. The assumption, of course, is that wealth privileges people in innumerable ways - i.e., that income is a SIMO kind of thing. But whether or not that's true, it's also likely true that income is a MISO kind of thing, where almost any positive or desirable trait can be leveraged - or is correlated with something that can be leveraged - to produce income. That, really, is why income is going to be correlated with almost any desirable human trait, no matter how little "privilege" remains in society.

So Robin's "super-factors" are quite possibly going to be very mundane things. MISO processes will cause a few desirable goals to be highly correlated with a large number of human traits that are useful in obtaining those goals.

Interesting, but hardly worthy of a Nobel. And a reminder that pure statistical analysis, without explicit theory to guide it, will be guided by implicit, simplistic theories.


P.S. - One thing Robin wrote that I didn't understand was the following:
"As many people know, intelligence is the main factor explaining variation in cognitive test performance, ideology is the main factor explaining variations in political positions, and personality types explain much of the variation in stable attitudes and temperament."
Aren't these basically just labels? "Intelligence" is our word for cognitive test performance. "Personality type" is our word for stable attitudes and temperament. Seems to me that simply isolating a principal component and labeling it is a far cry from actually understanding what you're looking at.

19 comments:

  1. And this is why I accept that Noah Smith can sometimes be a good writer and economist.
    "The assumption, of course, is that wealth privileges people in innumerable ways - i.e., that income is a SIMO kind of thing. But whether that's true, it's also likely true that income is a MISO kind of thing, where almost any positive or desirable trait can be leveraged - or is correlated with something that can be leveraged - to produce income. That, really, is why income is going to be correlated with almost any desirable human trait, no matter how little "privilege" remains in society."
    -This is truly the heart of the post and it's awesome. You won't hear that from your typical critic of The Bell Curve voluntarily.

    "Aren't these basically just labels? "Intelligence" is our word for cognitive test performance. "Personality type" is our word for stable attitudes and temperament. Seems to me that simply isolating a principal component and labeling it is a far cry from actually understanding what you're looking at."

    -No, not just that, Noah. General intelligence is about serial correlations between multiple types of test scores (e.g., math scores and reading scores). Ideology is about serial correlations between support for dozens of individual political positions (e.g., support for full abortion rights and support for confiscatory taxes on the rich). And personality type is about correlations between multiple personality traits in individuals (these I don't know much about, because I haven't researched them). It's about the correlations. Although these could, as you've pointed out, be either MISO or SIMO, I suspect they're most likely to be the latter. There's no good MISO reason I can see being good at playing ball has any correlation with playing in the stock market (and it probably doesn't). Likewise, there's no good MISO reason I can see being good at math has any correlation with being good at reading (but these do correlate!). And no good MISO reason I can see that being in favor of unrestricted abortion has anything to do with support for confiscatory taxes on the rich (I'm guessing these do correlate).

    Replies
    1. Sure, I understand that these are labels for principal components. I just don't think that labeling principal components does much to really explain things, for the reasons we've been talking about. Fundamentally, it's just a correlation.

    2. So you're trying to ask the question "where do these ideology/general intelligence/personality type single factors come from?" I presume the latter two are mostly genetic, but ideology must have more of an environmental component. Figuring out which genes actually have the greatest effect on intelligence and personality will surely be a long struggle. Ideology is more explicable, as it needn't necessarily involve any genes.

      And who's Frank?

    3. Remember that principal components don't have to be causes of the things they "explain". They can be effects. Or they can be simple incidental correlates.

      Frank is the guy in the creepy bunny suit in Donnie Darko.

    4. I already responded to the "effects" proposal in my longer comment. How either numerical or prose literacy can cause the other, I don't see. Or how one seemingly random political position can cause another. Incidental correlation is harder to disprove, but is still unlikely. The ideological poles of Left and Right seem to be pretty pervasive throughout the English-speaking world and continental Europe (though there are some notable exceptions; e.g., modern Russia, where you'd be hard-pressed to find a political faction that can't be labeled, in some respect, conservative). I'm not sure if that's the case for Latin America, although I think it's likely. Where do you think ideology comes from?

    5. Another big problem interpreting PCA is that it's not scale invariant. You can think that you understand why things happen, then you decide that income should be measured in yen instead of dollars, and everything comes out differently.

    6. Anonymous, 1:29 PM

      That is why the variables are almost always standardized in PCA (and should be if the scales are incommensurate).

    7. "The ideological poles of Left and Right"

      are not nearly as rigid as people sometimes believe. Look at the American Republican Party - it is an uneasy coalition between social conservatives, military hawks who would vote for a surveillance state if they could get away with it, Libertarians, small business interests, big business interests, people for whom balanced budgets are an over-riding goal and people for whom low taxes (and the deficit be damned) is the over-riding goal.

    8. How strong is the correlation between support for top income tax rate cuts and support for increasing the military budget?

  2. I took Robin's post to be alluding in particular to collecting massive amounts of biological data. Think fitbit on steroids. Or super-augmented-Apple Watch. Something to track everything inside and outside your body in your daily life: heartbeat, gut microbiome, muscle reactions, eye responsiveness, blood chemistry, recording and transcribing all verbal communications, everything you see and do, etc.

    This is an unexplored data set. Not just stuff we already collect like your test scores or income.

    Your point about PCA is solid: we can name a component without understanding its underlying biology and mechanism. But if this new massive biological data set finds really interesting things, such as (deliberately picking randomly) eye saccades and particular gut microbiomes correlating with voting Obama, well then we take that correlation and run with it. The PCA by itself doesn't make us understand, but finding correlations from new large data sets may head us down a very novel path about what makes people tick. Super factors as an impetus to investigation rather than understanding per se. By analogy, consider investigating why IQ tests are predictive of life outcomes. Yes, we don't understand the underlying biology now, but this correlation is pointing some people down particular paths of investigation.

  3. Anonymous, 1:28 PM

    Some statistical notes: PCA is not the same as FA. PCA just finds linear combinations of your variables to return a new coordinate space in which the new variables / coordinate axes are orthogonal and ordered by decreasing variance. There is no causal assumption. PCA is not 'for' variable reduction, although it is often used for that. FA is often similar in its output, but is not identical. It is based on the assumption that the underlying latent factors cause the manifest variables (which are a mix of the factors and measurement error). You should not use them interchangeably. EFA is never truly theoretically neutral (despite people treating it that way), because choosing to enter a variable into an EFA is an implicitly theoretical decision.

    There also seems to be some confusion between doing an EFA with a lot of variables and hierarchical EFA.

    For what it's worth, the idea of doing EFA with a lot of variables isn't that novel. In psychometrics (where EFA comes from), that research scheme lies behind the discovery of the "big five" trait theory in personality, and Spearman's g in intelligence.

    Replies
    1. "a linear combination"

      That is a huge limitation on the usefulness of any PCA / SVD analysis. Picking the way you are going to translate information into numbers in the matrix is another: for example, do you use income or log(income) or do you put both in the matrix?

    2. Yeah, sure. We add a magic assumption and then everything is causal. Advanced statistical techniques in action. ;-)

    3. Anonymous, 12:55 PM

      I certainly didn't mean that using FA with a good fit establishes a causal relation—more like the opposite of that. My main point is that PCA and FA are not the same (i.e., they are based on different assumptions and work differently); they should not be used interchangeably. Moreover, because FA assumes the factors cause the variables, it is incoherent to use a set of variables where some are causes / effects of others.

    4. Yup, it is incoherent. But, since we don't know which are causes of what, the kind of factor analyses that Robin talks about will inevitably be incoherent.

  4. Anonymous, 3:56 PM

    Claims to be a sci fi nerd. Shits all over a guy basically trying to smuggle psychohistory into economics. But hey, you got the valuable endorsement of 'biological realist' pithorn so that's something, warming up for your first zerohedge column I guess. Remember, if every third word isn't in bold, you can't stop the gold hating bankers and their zionist-reptilian allies.

    Replies
    1. Maybe Noah Smith is Tyler Durden?

    2. See the note at the bottom:

      http://noahpinionblog.blogspot.jp/2013/05/science-fiction-for-economists.html

    3. Anonymous, 3:06 PM

      Why does every description of reality have to be an absurd little screaming match of ideology? And, maybe, just maybe, we can admire all three approaches to reasoning about the world, and learn from them all? I am strongly inclined to the big data approach. How about we combine that with an acknowledgement of bio realism and the econ? Huh, how about that?

      Ehhhh, Noah Smith's support for direct objective observation of reality becomes of importance yet again.
