There is an article in The Boston Globe about an MIT project dubbed Project Gaydar.
The pair weren’t interested in the embarrassing photos or overripe profiles that attract so much consternation from parents and potential employers. Instead, they wondered whether the basic currency of interactions on a social network – the simple act of “friending” someone online – might reveal something a person might rather keep hidden.
Using data from the social network Facebook, they made a striking discovery: just by looking at a person’s online friends, they could predict whether the person was gay. They did this with a software program that looked at the gender and sexuality of a person’s friends and, using statistical analysis, made a prediction. The two students had no way of checking all of their predictions, but based on their own knowledge outside the Facebook world, their computer program appeared quite accurate for men, they said. People may be effectively “outing” themselves just by the virtual company they keep.
“When they first did it, it was absolutely striking – we said, ‘Oh my God – you can actually put some computation behind that,’ ” said Hal Abelson, a computer science professor at MIT who co-taught the course.
Of course, there is nothing unique about sexuality here, as the article ominously continues:
The work has not been published in a scientific journal, but it provides a provocative warning note about privacy. Discussions of privacy often focus on how to best keep things secret, whether it is making sure online financial transactions are secure from intruders, or telling people to think twice before opening their lives too widely on blogs or online profiles. But this work shows that people may reveal information about themselves in another way, and without knowing they are making it public. Who we are can be revealed by, and even defined by, who our friends are: if all your friends are over 45, you’re probably not a teenager; if they all belong to a particular religion, it’s a decent bet that you do, too. The ability to connect with other people who have something in common is part of the power of social networks, but also a possible pitfall. If our friends reveal who we are, that challenges a conception of privacy built on the notion that there are things we tell, and things we don’t.
So really, despite the provocative introduction, this is just a program that exploits the well-known tendency of people to cluster with others who share their traits. The same goes for just about everything: race, gender, fitness level, artistic taste, and so on. So is that it? Are we doomed to be read by increasingly complicated programs that figure out our every secret and exploit it? Uh, not quite. And this is where a more scientific inquiry would have really helped the article.
To start off, let’s look at the actual methodology of the study.
They were interested in three things people frequently fill in on their social network profile: their gender, a category called “interested in” that they took to denote sexuality, and their friend links.
Using that information, they “trained” their computer program, analyzing the friend links of 1,544 men who said they were straight, 21 who said they were bisexual, and 33 who said they were gay. Gay men had proportionally more gay friends than straight men, giving the computer program a way to infer a person’s sexuality based on their friends.
Then they did the same analysis on 947 men who did not report their sexuality. Although the researchers had no way to confirm the analysis with scientific rigor, they used their private knowledge of 10 people in the network who were gay but did not declare it on their Facebook page as a simple check. They found all 10 people were predicted to be gay by the program. The analysis seemed to work in identifying gay men, but the same technique was not as successful with bisexual men or women, or lesbians.
Since it hasn’t been published (and news articles are notoriously inaccurate in reporting methodologies), it is impossible to know whether this study design is any good. Based on the summary, though, it’s awful. Their only “validation” was personally selecting 10 people they knew were gay and seeing what the program said? That is the definition of confirmation bias, and it shows nothing about the efficacy of their program. Assuming this is an accurate description of their check, the program could have said that everyone was gay and they would have reached the same conclusion.
The proper technique for these correlative models is to have a training dataset that you use to develop the heuristics, and then a test dataset, as independent as possible, for which you know all the right answers. Then you count the true/false positives and negatives and adjust the model until it hits some threshold. So instead of checking validity on 947 men whose real answer they didn’t know (if you don’t know, why bother?), they should have taken another 1,000 or so randomly selected men (with as few links as possible to the original training dataset) and run the program on them with the settings learned during training, as sketched below. [This is exactly how the Netflix Prize was set up.]
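As a rough illustration of that workflow, here is a minimal sketch in Python. The single feature (the fraction of a user’s friends who openly identify as gay) and the simulated data are my own assumptions for illustration; the article does not describe the students’ actual features or model.

```python
# Minimal held-out evaluation sketch (assumed feature and simulated data,
# not the students' actual pipeline).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Simulated profiles: label 1 = openly gay (2% incidence), feature = fraction
# of a user's friends who are openly gay, higher on average for gay users.
n = 5000
y = rng.random(n) < 0.02
x = np.where(y, rng.beta(2, 8, n), rng.beta(1, 30, n)).reshape(-1, 1)

# Train on one slice, test on a completely separate held-out slice.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0, stratify=y)

model = LogisticRegression(class_weight="balanced").fit(x_train, y_train)
pred = model.predict(x_test)

# True/false positives and negatives on the held-out set -- the numbers you
# would tune against before trusting any individual prediction.
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f"true pos {tp}, false pos {fp}, false neg {fn}, true neg {tn}")
print(f"precision when flagged: {tp / (tp + fp):.2f}")
```

The point isn’t this particular model; it’s that the confusion counts come from data the model never saw during training.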
From this, I’m dubious about whether their model is even any good, but that’s not my main problem with the article. My main problem is that it repeats these claims without explaining that the relationships only hold at the group level, and that their practical validity depends on the incidence rate of the trait you are studying. The group-level caveat means that correlations of this type only explain the differences between the test population and a random sample. Finding that 80% of the variance in sexuality is explained by who you friend is completely different from saying that an individual can be determined to be gay or straight with 80% accuracy. It says there is a strong relationship compared to random guessing, but the probability that any given prediction is correct depends on the incidence rate.
I will give an example. Say you want to guess whether someone is male based on their links, and the social network is split 50/50 between men and women. A pretty good algorithm might catch 85% of the men (15% false negative rate) while wrongly flagging 20% of the women as men (20% false positive rate), so that roughly 80% of the people it says are men really are men. [Note: in general the false negative rate will be lower than the false positive rate because the test is biased toward what you’re searching for. In medical applications it is far better for a test to wrongly say someone has a condition (which a second test can rule out later) than to falsely tell them they are in the clear.]
If you ran this algorithm on 10,000 people split 50/50, it would flag a total of 5,250 people as men: 1,000 women falsely flagged plus 4,250 men correctly identified. So when it says a person is a man, you would be right about four times more often than you’d be wrong.
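That arithmetic is simple enough to check directly; a few lines of Python just make the bookkeeping explicit (the 85%/20% rates are the assumed ones above).

```python
# Bookkeeping for the 50/50 example: 10,000 people, half men, a classifier
# that catches 85% of men and wrongly flags 20% of women as men.
men, women = 5000, 5000
true_pos = int(men * 0.85)       # 4250 men correctly flagged
false_pos = int(women * 0.20)    # 1000 women wrongly flagged as men
flagged = true_pos + false_pos   # 5250 flagged in total
print(flagged, round(true_pos / flagged, 2))  # 5250, ~0.81 -> right ~4x more often than wrong
```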
However, let’s look at something with low incidence, like homosexuality. Approximately 2% of males self-identify as gay (I’m going to use self-identification rather than behavioral statistics). In their sample, 2.06% (33 of the 1,598 men) identified as gay, which is right in line with the general population. This surprises me, since the rates reported among internet users are normally a lot higher, but for all I know they selected their sample to match the general population.
I’m going to be very generous and say the program only falsely flags 5% of straight men as gay (5% false positive rate) while missing only 5% of the men who really are gay (5% false negative rate). Frankly, that would be an absurdly good algorithm for this type of data, but it will prove my point.
Testing on 1,000 men, we should find 20 who are gay, and the program will flag roughly 68 men in total: about 49 straight men falsely identified (5% of the 980 who are straight) plus 19 of the 20 who really are gay. So what are the results of this test that gives them every benefit of the doubt? If it says someone is gay, that person is still about 2.5 times more likely not to be (49 false positives to 19 true positives). It all comes down to incidence: the rarer something is in the population, the harder it is to identify correctly.
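The same point can be written as one small function: hold the test’s accuracy fixed and watch what happens to the meaning of a positive result as the incidence drops. The 95%/95% figures are the generous numbers assumed above.

```python
# Probability that someone the program flags really is positive, as a
# function of incidence, for a fixed sensitivity and specificity.
def precision_when_flagged(incidence, sensitivity=0.95, specificity=0.95):
    true_pos = incidence * sensitivity
    false_pos = (1 - incidence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

for rate in (0.5, 0.02):
    print(f"incidence {rate:.0%}: a flagged person is really positive "
          f"{precision_when_flagged(rate):.0%} of the time")
# incidence 50%: 95%; incidence 2%: ~28% -- the same test, a very different meaning
```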
HIV screening really shows this. Even though the test is highly accurate, because the incidence is so low, a positive result is still something like five times more likely to be a false positive than a true positive.
Which brings me to the biggest point: this technology is being used to try to identify terrorists. Terrorism has a minuscule incidence, and I read a professional counterterrorism blog that showed that even with near-perfect information, you would still get something like 1,000 times more false positives than true positives. They argued that the billions the government spends to set up these networks and spy on communications for this purpose are a complete waste, because they lead to a huge misuse of resources, and that traditional methods should instead be seen as the best approach.
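Plugging terrorism-scale numbers into the same helper sketched above reproduces that blog’s point; the one-in-a-million incidence and near-perfect accuracy below are illustrative assumptions, not figures from that post.

```python
# Even a nearly perfect screen drowns in false alarms at terrorism-scale incidence.
# Assumed figures for illustration: 1-in-a-million incidence, 99% sensitivity,
# 99.9% specificity.
p = precision_when_flagged(1e-6, sensitivity=0.99, specificity=0.999)
print(f"{p:.4%}")  # ~0.1% -> on the order of a thousand false positives per true hit
```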
In conclusion, it’s possible that their algorithm really is good and can detect who is gay... but the natural consequence is that it would also falsely flag a lot of people who aren’t. Given how imperfect this kind of data is, my guess is that when all is said and done, someone it flagged as gay would have maybe a 10-20% chance of actually being gay. That is still an enormous improvement over the 2% base rate of random selection, but it is hardly laser precision.
The moral of the story is that if you don’t want data mining to be able to guess your traits, you are better off the weirder you are.