Google's Book Search: A Disaster for Scholars

Writing in The Chronicle, Geoff Nunberg kicks off his criticism with a look at publication dates:

To take Google’s word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler’s Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams’s Culture and Society 1780-1950, and Robert Shelton’s biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf’s letters is dated 1900, when she would have been 8 years old. Tom Wolfe’s Bonfire of the Vanities is dated 1888, and an edition of Henry James’s What Maisie Knew is dated 1848.

Of course, there are bound to be occasional howlers in a corpus as extensive as Google’s book search, but these errors are endemic. A search on “Internet” in books published before 1950 produces 527 results; “Medicare” for the same period gets almost 1,600. Or you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. “Charles Dickens” turns up 182 results for publications before 1812, the vast majority of them referring to the writer. The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)

How frequent are such errors? A search on books published before 1920 mentioning “candy bar” turns up 66 hits, of which 46—70 percent—are misdated.

Then there is the “absurdist poetry” of the classification errors. Those stem from Google’s decision to use the Book Industry Standards and Communications (BISAC) codes for categorization. Nunberg speculates that the decision was made to facilitate advertising, with comical results: “a search for Leaves of Grass brings up ads for plant and sod retailers.” The choice was a bad one:

The BISAC scheme is well-suited for a chain bookstore or a small public library, where consumers or patrons browse for books on the shelves. But it’s of little use when you’re flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods. For example the BISAC Juvenile Nonfiction subject heading has almost 300 subheadings, like New Baby, Skateboarding, and Deer, Moose, and Caribou. By contrast the Poetry subject heading has just 20 subheadings. That means that Bambi and Bullwinkle get a full shelf to themselves, while Leopardi, Schiller, and Verlaine have to scrunch together in the single subheading reserved for Poetry/Continental European.

Nunberg is only the most prominent of the Google Book Search metadata critics. The concern he raises is longstanding and widely shared.

Last year he called Google’s metadata a “train wreck: a mish-mash wrapped in a muddle wrapped in a mess.” After a presentation at a conference last week (here are his slides from a presentation last year), Google quickly fixed some of the errors, and the company says it will fix the rest.

Still, Nunberg says he is “actually more optimistic than some of my colleagues” about Google’s chances of getting it right. Books are a far more complex domain than the company first realized. The learning curve is steep but “Google is a very quick study.”

RELATED: Remember that Google claim of 129,864,880 books? Probably bunk.