Google Books: Is It Good for History?

Robert B. Townsend, September 2007

Editor's Note: An earlier version of this article was published on the AHA Today blog in early May 2007, and received an overwhelming response. This version takes into consideration a number of the comments received, which can be found at http://blog.historians.org/articles/204/google-books-whats-not-to-like.

The Google Books project promises to open up a vast amount of older literature, but a closer look at the material on the site raises real worries about how well it can fulfill that promise and the potential costs to history scholarship and teaching.

This past spring I spent a fair amount of time delving into Google Books for a research project on the early history of the profession, and from a researcher's point of view I have to say the results were deeply disconcerting. Yes, the site offers up a number of hard-to-find works from the early 20th century with instant access to the text. And yes, for some books it offers a useful keyword search function for finding a reference that might not be in the index. But my experience suggests the project is falling short of its central promise of creating an accessible repository of the world's literature. Instead, it is piling mistake upon mistake with little evidence of basic quality control. The problems I encountered fit into three broad categories: (1) the quality of the scans is decidedly mixed; (2) the information about the books (the "metadata" in infospeak) is often inaccurate; and (3) the public domain is narrowly and erroneously construed, sadly restricting access to materials that should be freely available.

Poor Scan Quality

My reading of the materials was not scientific or comprehensive. Since almost three-fourths of the materials on the site are only shown as "snippets" of text, it is impossible to make a comprehensive assessment. But in the course of trying to use the site for my own research, I was surprised at the sheer number of mistakes I found. Of 36 books I downloaded from the site, 19 had basic scanning errors on one or more pages.

Basic scanning errors undercut an essential premise of the project, at least to the extent that digitizing the text is intended to be a means for aggregating data for mining and use. Take, for example, the Google Books version of the Report of the Committee of Ten from 1893 (the start of the great curriculum chase for the secondary schools). Many of the pages in the online version appear more than once. Page 3 appears twice, for instance, and page 147 appears in a number of other places, most annoyingly in place of page 165 (which is simply missing).1 And like a number of the other books on the site, some pages appear to have been scanned in mid-transit through the scanner, while pages and tables too large to fit within their imaging constraints were simply cut off. This makes much of the text unreadable and presumably less "discoverable."2

Even this small sample raises serious questions about the procedures for quality control at Google Books. Over the past decade I have digitized a number of the AHA's old publications and appreciate that scanners don't always work as they should, and pages can often get jammed.3 But even rudimentary quality controls should catch those problems before they go live online. After years of implementing those kinds of quality checks here—precisely because friends in the library community took me to task about their necessity—I find it strange that so many libraries are joining Google's headlong rush to digitize, apparently without similar quality requirements.

Faulty Metadata

Beyond the fundamental quality of the scanning, a more significant problem is the incredibly poor descriptive information attached to many of the books on the site (the "metadata"). This is particularly evident in the serial publications, where having the proper name and date of a publication is especially important. Take, for example, a volume of History Teacher's Magazine that is labeled as a volume of Social Studies (the name the magazine took in 1934) and dated as published in 1953 (even though it seems to be from 1917).4 In another instance, the AHA Annual Report from 1942 is dated as published in 1861 (we were established in 1884, Google, in case you didn't know).5

These problems seemed fairly pervasive among serial publications on the site, which seem to take the acquisition date from the library catalog without any further review or input from those scanning in the text. Unfortunately, this creates two significant problems for historians trying to use the site. First, the inaccurate dating makes it difficult, if not impossible, to physically locate (in an actual hard copy of the source) a particular item "discovered" by using Google Books.6 At the same time, in many instances you will be unable to inspect public domain items more closely, because the erroneous date places the information on the wrong side of the copyright line.

Truncated Public Domain

These problems are exacerbated by Google's rather peculiar views on copyright. While taking an expansive view of copyright for recent works, it has taken a very narrow view about books that actually are in the public domain. According to the U.S. Copyright Office, "works by the U.S. government are not eligible for U.S. copyright protection." But Google locks all government documents published after 1922 behind the same wall as any other copyrighted work. Among other things, that locks up works that should be in the public domain, such as the circulars from the U.S. Bureau of Education. This problem is made worse by the often inaccurate data about when these materials were published—which places these works even further beyond reach.

At the same time, Google Books also ignores the wishes of those who wanted their work placed in the public domain, and purposely chose not to claim copyright in their published works. My predecessors at the AHA made a conscious decision to publish their annual reports—which included thousands of pages of primary source materials, bibliographies, and reports about the profession—through the Government Printing Office and free of copyright. But the Google Books project locks all the volumes published after 1922 behind the wall of copyright regardless.

The Future for History

What particularly troubles me is the likelihood that these problems will just be compounded over time. From my own online publishing experience here at the AHA, I know how hard it is to go back and correct mistakes when the imperative is always to move forward, to add content and thus to inevitably pile more mistakes on top of the ones already buried one or two layers down. With Google adding in more than 3,000 new books each day, the number of mistakes is likely to grow much higher.7

The problem of quality control only deepens my most basic worry about the larger rush to digitize every scrap of information—that we are adding to the pile much faster than the technology can advance to extract the information in a useful or meaningful way. When I ask people who know a lot more about the technology than me about this problem, they tend to wave their hand and mumble about "brilliant scientists" and "technological progress." Forgive me if I remain unconvinced. Even as someone fairly proficient in Boolean search terms I find a lot of the results from Google Books (and Google more generally) just page after page of useless and irrelevant information—a point made more convincingly by Thomas Mann of the Library of Congress, in a recent report on his efforts to find information on tribute payments in the Peloponnesian War.8 Given that, I find it increasingly hard to believe that Google can add tens of thousands of additional books each month to the information pile—many containing basic mistakes in content and metadata—and the information results will actually grow better over time.

So I have to ask, what's the rush? In Google's case the answer seems clear enough. Like any large corporation with a lot of excess cash the company seems bent on scooping up as much market share as possible, driving competition off the board, and increasing the number of people seeing (and clicking on) its highly lucrative ads or "renting" copies of the books. But I am not sure why the rest of us should share the company's sense of haste. Surely the libraries providing the content, and anyone else who cares about a rich digital environment, need to worry about the potential costs of creating a "universal library" that is filled with mistakes and an increasingly impenetrable smog of (mis)information.

As historians we should ponder the costs to history if the real libraries take error-filled digital versions of particular books and bury the originals in a dark archive or the dumpster. And we should weigh the cost to historical thinking if the only substantive information one can glean from Google is precisely the kind of narrow facts and dates that earn history classes such a poor reputation. It is time, it seems, to think in a careful and systematic way about how this will affect our discipline, and the new modes of training and apparatus that will make it possible to negotiate the volume and flaws of the emerging digital landscape.

The poor digital quality of the texts raises another important concern for scholars trying to rely on these texts as sources, one that strikes particularly close to the basic requirements of historical scholarship. For example, when I tried to cite a particular page in the original version of this article, one commenter took me to task because he could not find the page I was referring to. As it turned out, the digital apparatus for the book prevents readers from getting to the page directly.9 So while this project appears to solve a critical problem for historians in rural and poor institutions that lack adequate library facilities, the quality control issue raises a fresh concern about whether a scholar could rely on it for a footnote. As Roy Rosenzweig noted in a much more extensive survey of these issues, digital media pose a fundamental challenge to the apparatus of scholarship.10

It is hard, I know, to balance the desire to post ever more content online with a commitment to ensuring that the content will stand the test of time and future research. I understand that we have to make choices and compromises to get the job done. But the high-toned rhetoric about this project, and the rather glib assurances from proponents of the project that sacrificing quality is the necessary price for opening up the content, stand in the way of a substantive discussion about what this project (and other digitization projects) will mean for the future of scholarship in history.11 Surely as scholars we are capable of subtler distinctions, and a more precise weighing of our choices and options.

—Robert B. Townsend is assistant director for research and publications, and a doctoral candidate at George Mason University. Special thanks are due to the many comments received in response to the version posted on the blog, and to Siva Vaidhyanathan, whose comment at the 2006 JSTOR Publisher's meeting prompted me to be more reflective in my efforts to use Google Books.

Notes

1. To add to the confusion, there now seem to be two versions of the report online. I refer here to the edition at http://books.google.com/books?vid=0MYkKYle1O3CDBYQbunqFY7&id=1WYWAAAAIAAJ&, but another version has been added since my first visit to the page.

2. A few other examples of volumes with a particularly large number of basic scanning errors can be found in the missing and blurred pages of A History Syllabus for Secondary Schools (1904; online at http://www.google.com/books?id=LhYAAAAAYAAJ&) or in Google's copy of volume 1 of the American Historical Review (online at http://www.google.com/books?id=N48LAAAAIAAJ&) where, among many missing and blurry pages, one catches just a glimpse of the inserted map between pages 74 and 75 passing through the scanner.

3. The results of these efforts, digital warts and all, can be found online in the AHA Archives page at http://www.historians.org/info/history.cfm.

4. Online at http://books.google.com/books?vid=0CZa2WIQxt-p-YVwvtYrlvE&id=nncVAAAAIAAJ&pgis=1.

5. Online at http://books.google.com/books?id=e3UWAAAAIAAJ&.

6. "Google Book Search Tips," a primer offered by the University of Michigan libraries, observes that "There is no single 'right' way to find the needed year or volume number. There are some general tips, though, to try to tease this information out of Google Book Search." Online at http://www.lib.umich.edu/mdp/GoogleBooks.pdf.

7. See Michael Liedtke, "Google Book-Scanning Efforts Spark Debate," Washington Post (December 20, 2006). Online at http://www.washingtonpost.com/wp-dyn/content/article/2006/12/20/AR2006122000213_pf.html.

8. Thomas Mann, "The Peloponnesian War and the Future of Reference, Cataloging, and Scholarship in Research Libraries," Prepared for AFSCME 2910 (June 13, 2007), online at http://www.guild2910.org/Pelopponesian%20War%20June%2013%202007.pdf.

9. In the Committee of Ten example cited above, if you type 147 into the Page field at the top it takes you to a page 147 that looks (mostly) fine. But if you type 145 into the same field, it takes you to a page 163 that is followed two pages later by the page 147 I identified.

10. Roy Rosenzweig, "Scarcity or Abundance? Preserving the Past in a Digital Era," The American Historical Review (June 2003). Online at http://www.jstor.org/stable/10.1086/529596.

11. See for instance Kevin Kelly, "Scan This Book!," New York Times Magazine, May 14, 2006.