Wednesday, September 2, 2009

Google's Book Search: A Disaster for Scholars?

Your humble Northwest History blogger is sometimes accused of being a Google fanboy. A fair cop. But you know who is not a Google fanboy? Geoffrey Nunberg, that is who. Over at the Chronicle of Higher Education, Nunberg has a witty jeremiad, Google's Book Search: A Disaster for Scholars.

Nunberg's beef is with Google's sloppy and commercially driven metadata schemes. He demonstrates that even with such a basic item as date of publication, Google Books very frequently gets it wrong. This in turn often corrupts search results: "A search on 'Internet' in books published before 1950 produces 527 results; 'Medicare' for the same period gets almost 1,600." By comparing Google's data to that found in the catalogues of the contributing libraries, Nunberg shows that these errors do in fact belong to Google, not to its partners.
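To see how a single bad date produces the anachronisms Nunberg describes, here is a minimal sketch with two hypothetical catalogue records (the titles and dates are illustrative, not Google's actual data): the second book really dates from 1995, but a metadata error lists it as 1905, so it surfaces in a "before 1950" search.

```python
# Hypothetical records: the second book's true date is 1995, but a
# transcription error in the metadata lists it as 1905.
books = [
    {"title": "Middlemarch", "listed_year": 1872,
     "text": "a study of provincial life"},
    {"title": "Being Digital", "listed_year": 1905,
     "text": "the internet changes everything"},
]

# A date-filtered full-text search, as a scholar might run it.
pre_1950_hits = [
    b["title"] for b in books
    if b["listed_year"] < 1950 and "internet" in b["text"]
]
print(pre_1950_hits)  # the anachronistic hit appears because of the bad date
```

The search logic itself is sound; the corrupted output comes entirely from the one wrong field, which is Nunberg's point about why the 527 pre-1950 "Internet" results exist.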

Nunberg also whacks Google for classification errors that place books in the wrong categories: "H.L. Mencken's The American Language is classified as Family & Relationships. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as Antiques and Collectibles . . . An edition of Moby Dick is labeled Computers; The Cat Lover's Book of Fascinating Facts falls under Technology & Engineering."

Worst of all to Nunberg is Google's adoption of the Book Industry Standards and Communications categories for Google Books, which he describes as a modern commercial invention used to sell books, rather than a scholarly system of classification like the Library of Congress subject headings: "For example the BISAC Juvenile Nonfiction subject heading has almost 300 subheadings, like New Baby, Skateboarding, and Deer, Moose, and Caribou. By contrast the Poetry subject heading has just 20 subheadings. That means that Bambi and Bullwinkle get a full shelf to themselves, while Leopardi, Schiller, and Verlaine have to scrunch together in the single subheading reserved for Poetry/Continental European. In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore."

I think that Nunberg has a number of good points--points he gathers together to form a molehill, from which he conjures up a mountain. Google's metadata may be everything he says (and I think he is probably right), but how great a problem is that really? This scholar at least uses Google Books either 1) to locate a digital copy of a book I already know about, or 2) via a string of search terms. In the first case, it is not relevant to me that Google has classified Adventures of Huckleberry Finn under "wild plants" or whatever. I know perfectly well what it is, and just wanted to find a quote I remember.

In the second case, I might search for mentions of the Columbia River in books published before 1860. And suppose a faulty date in Google's database brings me to something written after 1860. So what? Surely when I click on the link and find myself reading Sherman Alexie instead of Lewis and Clark, I will notice the fact. (Actually I just did the search and on the first 10 pages of results I don't see any errors at all. Take that, Nunberg.)

So for which scholars exactly is Google Book Search a "disaster"? Nunberg cites "linguists and assorted wordinistas" who are "adrenalized" at the thought of data mining to "track the way happiness replaced felicity in the 17th century, quantify the rise and fall of propaganda or industrial democracy over the course of the 20th century, or pluck out all the Victorian novels that contain the phrase 'gentle reader.'" But who does this? OK, I know that people do it, but most data mining of this type has always struck me as more of a parlour trick than actual scholarship.
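For what it's worth, the kind of diachronic word-counting Nunberg has in mind is simple enough to sketch. Here is a toy version over a tiny invented corpus of (publication year, text) pairs -- the data is entirely made up for illustration -- that buckets counts of "felicity" and "happiness" by half-century. Note that it depends on exactly the publication dates Nunberg says are unreliable.

```python
from collections import Counter

# Hypothetical miniature corpus: (publication year, text snippet) pairs.
# A real project would use OCR'd full texts and library-grade dates --
# which is precisely where bad metadata would skew the counts.
corpus = [
    (1615, "felicity and fortune attend the reader"),
    (1640, "such felicity as few men know"),
    (1660, "the pursuit of happiness and of felicity"),
    (1685, "happiness is the end of government"),
    (1695, "happiness, happiness, and more happiness"),
]

def term_counts_by_period(corpus, terms, period=50):
    """Count occurrences of each term, bucketed by publication period."""
    buckets = {}
    for year, text in corpus:
        start = (year // period) * period      # e.g. 1660 -> 1650
        counts = buckets.setdefault(start, Counter())
        words = text.lower().replace(",", "").split()
        for term in terms:
            counts[term] += words.count(term)
    return buckets

trend = term_counts_by_period(corpus, ["felicity", "happiness"])
for start in sorted(trend):
    print(start, dict(trend[start]))
```

On this toy data the 1600-1649 bucket is all "felicity" and the 1650-1699 bucket is dominated by "happiness" -- the shift Nunberg's wordinistas would hope to measure, and exactly the result a pile of misdated volumes would quietly falsify.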

The other thing Nunberg ignores is that metadata is not that hard to fix. Google already provides a "feedback" button on every virtual page so readers can report unreadable or missing pages. If we howl loud enough we could easily see similar feedback mechanisms on the "More book information" page so we could correct names and dates and categories.

Nunberg is absolutely correct to recognize the monumental importance to scholars of the Google Book Search project. It is vital that scholars take a critical stance that will push Google to improve the project and make it even more useful. His article is a valuable push in that direction.

UPDATE 9/3/09: Reader Ed points out that Geoff Nunberg also posted a nicely illustrated version of his article on the blog Language Log, and got a brief response in the comments from John Orwant, who manages the metadata at Google Books.


James Stripes said...

I agree that Nunberg overstates the case. Even so, his solution of asking Google to incorporate the Library of Congress classification system into their search is not onerous, and would improve Google Books substantially.

Too often the nearest copy of a book I want to look at is either in Pullman (eight miles away), New York (thousands of miles), or London. Google is putting many of these on my desk, and its sometimes awkward search scheme is only minimally troubling. Most of my hits in most of my searches are what I should expect. Sometimes, however, I run into problems. At least one of the copies of one of the volumes of George Bancroft's History of the United States has two title pages, and some pages of front matter missing in the scanned copy. If I want a quote from that, I may need to go to the online catalog of the library whose copy was scanned to determine which edition I'm looking at.

The issues raised by Nunberg might, however, be more of a problem for younger scholars who will never spend years browsing the stacks in their college library.

Larry Cebula said...

James: I agree that LOC classification would be a nice addition to Google Books, and I have to think they will do it sooner or later. If they want to keep the current system (which might be more useful for targeting ads) and offer LOC categories as a custom setting, that would be fine with me.

ed said...

I agree that Nunberg's piece is valuable because it might help Google further improve their wonderful product.

In fact, there is probably no other organization I would trust more to be responsive to this type of criticism than Google. If the digitization project were run by, say, the government, or a consortium of universities, I'd expect progress to be much slower and hampered by politics, turf wars, budgetary constraints, and plain incompetence.

So three cheers for Google. As for the *tone* of Nunberg's piece, I'd have to say he is prone to letting the perfect be the enemy of the good.

Larry Cebula said...

Ed, we academics are actually trained to make the perfect the enemy of the good whenever possible!

ed said...

Anyone interested in the topic should really read the response from John Orwant, manager of the Google Books metadata team, in a comment over at the Language Log blog.

Bill Youngs said...


I find Google Books one of the most useful online resources in the universe, spectacularly helpful, but with this caveat: I wish it were possible to download TEXT FILES for ENTIRE BOOKS (not a page or two at a time) as you can from Project Gutenberg, eserver, and other online book sources. For me, full text downloads would easily double the usefulness of Google Books.

Katrina said...

I agree that Google should adopt some kind of library classification (LOC, Dewey, whatever, just not the Waldenbooks shelving areas!).

I find Google Books useful, despite its quirks, but the publication-year data is a mess.

The main thing that irritates me is that I can search for a word in the text and find the pages that contain it, but I cannot select and copy passages of text. (If this is just some system problem of mine and there's a workaround, please let me know!)

But Google is doing something useful, so I don't want to complain too much about it.

Larry Cebula said...

Katrina: Sure you can, at least in full text books. Look in the upper right and you'll see a "Full Text" button.