The big tech news Monday was that Netflix announced a winner in its famous contest to improve the accuracy of its recommendation algorithm, Cinematch, by 10 percent. The $1 million Netflix Prize was offered three years ago. The race ended in a statistical tie. Under the contest’s complex rules, the winners beat the second-place team by only 23 minutes:
The Netflix contest has been widely followed because its lessons could extend well beyond improving movie picks. The researchers from around the world were grappling with a huge data set — 100 million movie ratings — and the challenges of large-scale predictive modeling, which can be applied across the fields of science, commerce and politics:
The way teams came together, especially late in the contest, and the improved results that were achieved suggest that this kind of Internet-enabled approach, known as crowdsourcing, can be applied to complex scientific and business challenges.
Netflix founder and CEO Reed Hastings crows:
“You’re getting Ph.D.’s for a dollar an hour…We strongly believe this has been a big winner for Netflix,” Mr. Hastings said.
With that, Netflix launched a second challenge:
The data set for the first contest was 100 million movie ratings, with the personally identifying information stripped off. Contestants worked with the data to try to predict what movies particular customers would prefer, and then their predictions were compared with how the customers actually did rate those movies later, on a scale of one to five stars.
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.
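For readers who want to see what "improving accuracy by 10 percent" means in practice: the first contest scored entries by root-mean-square error (RMSE) between predicted and actual star ratings. Here is a minimal sketch of that kind of scoring. The ratings and predictions below are made up for illustration, and the function names are my own, not anything from Netflix's actual scoring code.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual star ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sum(squared_errors) / len(actual))


def improvement_percent(baseline_rmse, new_rmse):
    """Percentage by which new_rmse is lower than baseline_rmse."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


if __name__ == "__main__":
    # Hypothetical data: five ratings on the one-to-five star scale.
    actual = [4, 3, 5, 1, 2]
    baseline = [3.5, 3.5, 4.0, 2.5, 3.0]   # stand-in for the existing system
    candidate = [3.8, 3.2, 4.6, 1.6, 2.4]  # stand-in for a contest entry

    base = rmse(baseline, actual)
    cand = rmse(candidate, actual)
    print(f"baseline RMSE:  {base:.4f}")
    print(f"candidate RMSE: {cand:.4f}")
    print(f"improvement:    {improvement_percent(base, cand):.1f}%")
```

A lower RMSE means the predictions land closer to what customers actually rated, so the 10 percent target amounts to cutting this error by a tenth relative to Cinematch.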
But at Princeton’s Freedom to Tinker, Paul Ohm sees an avoidable privacy blunder:
Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87% according to Latanya Sweeney’s famous study.) True, Netflix plans to release age not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of “information entropy”: even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
I have no doubt that researchers will be able to use the techniques of Narayanan and Shmatikov, together with databases revealing sex, zip code, and age, to tie many people directly to these supposedly anonymized new records.
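Ohm's "simple arithmetic" is easy to reproduce as a back-of-the-envelope estimate. The sketch below divides a population evenly across every combination of ZIP code, gender and age, then expresses what is left as bits of "information entropy." All the constants are my own round-number assumptions (roughly 307 million people, about 42,000 ZIP codes, about 80 distinct ages), not figures from Ohm or Netflix, and real populations are far from evenly spread.

```python
import math

# Round-number assumptions for illustration only.
US_POPULATION = 307_000_000  # approximate 2009 U.S. population
ZIP_CODES = 42_000           # rough count of ZIP codes
GENDERS = 2
AGES = 80                    # rough count of distinct ages among renters


def average_group_size(population, *bucket_counts):
    """People per combination if the population were spread evenly."""
    return population / math.prod(bucket_counts)


def bits_remaining(population, *bucket_counts):
    """Bits still needed to single out one person after the attributes
    in bucket_counts are known (uniform-distribution assumption)."""
    return math.log2(population) - sum(math.log2(b) for b in bucket_counts)


if __name__ == "__main__":
    nationwide = average_group_size(US_POPULATION, ZIP_CODES, GENDERS, AGES)
    print(f"average people per (ZIP, gender, age): {nationwide:.1f}")

    # A hypothetical large ZIP code of 50,000 residents.
    big_zip = average_group_size(50_000, GENDERS, AGES)
    print(f"people per (gender, age) in a large ZIP: {big_zip:.1f}")

    bits = bits_remaining(US_POPULATION, ZIP_CODES, GENDERS, AGES)
    print(f"bits of uncertainty left: {bits:.2f}")
```

Under these assumptions the average group is a few dozen people, and even a large 50,000-resident ZIP code leaves only about 300 people per gender-and-age combination, which lines up with Ohm's "at most a few hundred." Each additional known fact, such as a handful of movie ratings, shrinks the group further.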
Ohm notes that Arvind Narayanan and Vitaly Shmatikov of the University of Texas were able to reidentify some of the “anonymized” users in the first contest with ease, “proving that we are more uniquely tied to our movie rating preferences than intuition would suggest.”
His concern is not movie ratings. It’s that this ability “can be used to enable other, more terrifying privacy breaches.” Ohm believes Netflix might be breaking the law, or that the FTC could fine Netflix for violating its privacy policy:
If sued or investigated, Netflix will surely argue that its acts are immunized by the policy, because the data is disclosed “on an anonymous basis.” While this argument might have carried the day in 2006, before Narayanan and Shmatikov conducted their study, the argument is much weaker in 2009, now that Netflix has many reasons to know better, including in part, my paper [link] and the publicity surrounding it. A weak argument is made even weaker if Netflix includes the kind of data–ZIP code, age, and gender–that we have known for over a decade fails to anonymize.
The good news is Netflix has time to avoid this multi-million dollar privacy blunder. As far as I can tell, the Netflix Prize 2 has not yet been launched. Dear Netflix executives: Don’t do this to your customers, and don’t do this to your shareholders. Cancel the Netflix Prize 2, while you still have the chance.
For the moment, I’m unpersuaded. I believe that the data can be traced back, just as I believe that Google search results can be and that our ISPs can trace our packets. But I think that horse has already left the barn. The remedy we need lies somewhere else.
Meanwhile, Wired has a terrific profile of Netflix and its founder and CEO, Reed Hastings, titled Netflix Everywhere: Sorry Cable, You’re History. A snippet:
[A] full Netflix pandemic has broken out. Microsoft incorporated the service into its Windows Media Center software, meaning anyone with Vista can stream Netflix to their TV. Hastings inked deals with Sony and Samsung to put the service into Bravia TVs and Blu-ray players, respectively. The service started showing up in TVs made by Vizio, the largest seller of LCD televisions in the country. And Broadcom began baking the software into some of its flatscreen chips, making it easy for any TV maker to offer sets pre-loaded with Netflix. (As an extra incentive, Netflix pays manufacturers a bounty for any new subscribers that sign up via their products.) Investment bank Piper Jaffray estimates that 25 percent of Netflix’s 2.4 million new subscribers this year will come through one of the streaming devices.
While I came late to the Netflix party, I am now an avid reveler and a great admirer of the company.