Google Books

The Google Books Library Project is digitizing millions of books and magazines from major research libraries such as the University of Michigan, the New York Public Library, and Oxford University. Google Books searches the full text of these whether they are in copyright or in the public domain.

The Google Books website says that Google Books is an enhanced card catalog of the world’s books and that Google wants to create a comprehensive, searchable, virtual card catalog of all books in all languages. In March 2012, Google announced that it had digitized 20 million books. The digitization of volumes continues, although some of Google’s partner libraries in the U.S. told the Chronicle of Higher Education in 2012 that the pace of scanning had slowed.

Search

Google Books searches the full text of the digitized collection as well as the metadata (descriptions such as author and publisher). (It is possible to search magazines separately.) However, the results of searches in books that are still protected by copyright are only partially visible in Google Books. Books that were published before 1923, and are therefore thought to be in the public domain, can be viewed in their entirety and downloaded. One of the key benefits of using Google Books is the ability to do a full-text search of the collection. This means that books containing chapters of interest, or even the briefest of references, can be located. Users can quickly zoom in on information within books that would be impossible to find with traditional library catalog searches. Because a search in Google Books can return enormous numbers of results, some librarians recommend always using the Advanced Search function to reduce the number of results to be sifted. The Google Books website has a description of how Google Books works.

Google Books Ngram Viewer (see also the Conversations tab)

A search in Google Books searches the full-text of the digitized collection and the metadata (descriptions such as author, publisher etc.). A search with the Google Books Ngram Viewer does not search the same material.

Errors in Google Books

Several studies have examined errors in Google Books, especially as they relate to scholarship and preservation. For articles about these studies and links to examples of scholars using quantitative research methods with material from Google Books, click on the Conversations tab in Beyond Citation.

Locating physical books

Because Google Books links to WorldCat, a catalog of tens of thousands of libraries, when users click on “Find this book in a library,” they will be given options to locate the closest library that has the physical book.

Disappearing Google Books?

In an exchange between historian Ben Schmidt and a commenter on his blog post, it was recently pointed out that some scanned pre-1923 books that were previously available for download as files from Google Books because of their presumed public domain status are no longer available. Scroll down to read the comments in this blog post. The commenter speculated about the possibility that unscrupulous publishers might have claimed ownership of certain out-of-copyright works causing Google Books to pull the digitizations from public view.

International access

The version of Google Books that users access may differ by country. For example, a Dutch university library guide notes a difference between the .nl and the .com versions: the Dutch version puts books in the Dutch language higher in the results list, and vice versa for the .com version. This is especially the case when searching for proper names that are the same in both languages.

Date range: 1500 – 2008

Publisher: Google

Publisher About page: http://books.google.com/googlebooks/about

Object type: Books

Location of original materials: Multiple

Exportable image: Yes, for books published before 1923.

Facsimile image: Yes

Full text searchable: Yes

Titles list links: None

Original catalog: None

Digitized from microfilm: No

Original sources: Google says that it is working with 40 major libraries in the U.S. and Europe and with one library each in Japan and India, but it provides only a partial list of the libraries.

Access through HathiTrust

While Google Books is freely available on the web, researchers might also want to consider using HathiTrust, especially if their institution is a member of the Hathi partnership. HathiTrust contains most of the content of Google Books but has superior indexing through the use of Library of Congress subject headings. This type of indexing is most beneficial to researchers who are attempting to sift very large amounts of search results. See the HathiTrust entry in Beyond Citation for instructions about how to get much more precise search results by combining keyword searches in HathiTrust with Library of Congress subject headings.

Google’s guide to Using Google Books

Studies of errors related to the digitization of Google Books

Because most of the items in HathiTrust were scanned and digitized for the Google Books Project, academic studies of works in HathiTrust are also relevant to understanding the effects of digitization on Google Books. Information scientist Paul Conway undertook a large-scale study of materials from HathiTrust funded by the Andrew W. Mellon Foundation and the Institute for Museum and Library Services. All of the items studied had been digitized by Google. Conway sampled English-language books and serials published before 1923 that were scanned and processed by Google between 2004 and 2010. He writes that fully one-quarter of the volumes [sampled] contain at least one page image whose content is unreadable (or about one percent of the total content). He also writes that the findings suggest that the imperfection of digital surrogates is an obvious and nearly ubiquitous feature of Google Books and that such imperfection has become and will remain firmly ensconced in collaborative preservation repositories. Phase two of the project used “statistically valid measures” to determine patterns of error. A report of Conway’s findings was published in 2013. Preserving Imperfection: Assessing the Incidence of Digital Imaging Error in HathiTrust. Paywalled article.

Another information scientist, Ryan James, studied a sample of items taken from Google Books in an article published in 2012. James found that thirty-six percent of the books sampled in the study contained metadata errors in author, title, publisher, or publication year. Comparable rates of major errors in print library catalogs range from 1% to 12%. James says that errors in the author field occur mainly because Google Books conflates the roles of author, editor, and translator, treating them as the same intellectual entity. An Assessment of Google Books’ Metadata. Paywalled article.

In 2009, Geoffrey Nunberg published a critique of Google Books in the Chronicle of Higher Education focusing on the number of errors. The article estimates the number of Google Books in the public domain to be about 15 percent of the total. Google’s Book Search: A Disaster for Scholars.

Understanding the content of Google Books at the collection level

Although Google claims that it wants Google Books to be a virtual card catalog of all books, it isn’t possible to use Google Books in the way that one might use a database to see a listing of items by sorting or comparing categories. For example, there is neither a publicly available master titles list nor a complete list of the libraries from which the books have been digitized. HathiTrust also contains material from other sources, but because most items in HathiTrust were digitized by Google Books, studies of HathiTrust may be relevant to understanding Google Books at the collection level. For a breakdown of the contents of HathiTrust by source library and also by language, see this presentation by librarian Heather Christenson.

To gain a better understanding of the contents of Google Books, in 2005 the nonprofit Online Computer Library Center (OCLC) compared the collections of the first five scholarly libraries digitized for the Google Books Project based on their listings in WorldCat. OCLC studied where the collections overlapped, the language of the materials, copyright and other issues. They conclude that rareness is common. Sixty-one percent of unique imprints of books were held by only one collection. About half of these unique imprints are not English-language. About half of the books were published after 1974; three-quarters were published after World War II. Thirty-four percent of the books were estimated to be out of copyright. Anatomy of Aggregate Collections: The Example of Google Print for Libraries.

Discussions and examples of research in Google Books and other databases

Scholars are now widely using electronic sources for research. A recent study by Alexandra Chassanoff found that more than 60% of historians are consulting books online. Below are pointers to a selection of articles and blog posts that offer case studies of research methods and issues around using digital sources.

In their article about longue durée historiography, David Armitage and Jo Guldi discuss the role of digital databases in historical research spanning centuries. They recommend tools such as Google Book Search and Paper Machines for their ability to generate new hypotheses for the researcher and provide characterizations of long periods, aiding the researcher’s confidence in her characterizations of periods and places. The Return of the Longue Durée: An Anglo-American Perspective.

On the theoretical side, digital humanists Paul Gooding, Melissa Terras, and Claire Warwick explore the concept of the digital sublime and the theoretical debates around mass digitization that have created an almost mythological view of technological determination. The Myth of the New: Mass Digitization, Distant Reading, and the Future of the Book.

Historian Jo Guldi offers warnings and some tips about research in databases. While Guldi thinks that nineteenth-century Britain is uniquely positioned as a field for experimenting with online databases, she points out that the pitfalls of naive word searching are many. Each digital database has constraints that render historiographical interventions based upon scholars’ queries initially suspect. Guldi recommends careful periodization, selection of terms, iteration, and contextual reading to craft a word-count exercise into a nuanced series of keyword searches. The History of Walking and the Digital Turn: Stride and Lounge in London, 1808–1851. Paywalled article.

Composition and Rhetoric historian Janine Solberg suggests that scholars should reflect on how they use and are shaped by research technologies including full-text search within Google Books. She reminds us that researcher positionality is technologically mediated as well as ideologically constituted. Googling the Archive: Digital Tools and the Practice of History. Paywalled article.

For those who wish to experiment with the data in Google Books, historian David McKenzie has written a guide to stripping out unwanted text from the file of a downloaded Google Book. Short Tutorial: Cleaning up a Google Book for Text Mining.

Ben Schmidt discusses the constraints that are placed on researchers by databases, including the Google Books Ngram viewer, by virtue of their design in his blog post What historians don’t know about database design….

In another blog post, Schmidt had an exchange with a commenter about some files of Google Books that are no longer available for download. The commenter pointed out that some scanned pre-1923 books that were previously available for download from Google Books because of their presumed public domain status are no longer available. Scroll down to read the comments in this blog post. The commenter speculated about the possibility that unscrupulous publishers might have claimed ownership of certain out-of-copyright works causing Google Books to pull the digitizations from public view.

Historians Fred Gibbs and Dan Cohen explore text mining strategies for literature with the Google Books Ngram Viewer. A Conversation with Data: Prospecting Victorian Words and Ideas. Paywalled article.

Stanford Literary Lab scholars Ryan Heuser and Long Le-Khac offer their ideas for quantitative analysis of data from literary works and take issue with the methodology employed by Gibbs and Cohen. Learning to Read Data: Bringing out the Humanistic in the Digital Humanities.

The theoretical challenge of digital historiography is the subject of an article by Joshua Sternfeld. Sternfeld notes that the Google Books Ngram Viewer highlights difficult questions about the evidential value of word trajectories. Archival Theory and Digital Historiography: Selection, Search, and Metadata as Archival Processes for Assessing Historical Contextualization. Paywalled article.

Google Books Ngram Viewer

A search in Google Books covers the full text of the digitized collection and the metadata (descriptions such as author and publisher). A search with the Google Books Ngram Viewer does not search the same material. Clicking on the search results in the Ngram Viewer leads to the same query in Google Books. However, when users leave the Ngram Viewer and go to Google Books, the search rules are different. Google Books searches are not case sensitive, and they search the full text of the entire set of Google Books. They do not search the selections of data in the Ngram corpora (collections of data).

The Ngram Viewer searches a particular corpus (dataset) of books. According to the Ngram Viewer information page, all corpora of books were originally generated in 2009 or 2012 and will be periodically updated. Serials, as well as books with low-quality OCR, are excluded. Besides the default setting, it is possible to select a particular corpus to search. For example, “American English 2012” contains books predominantly in the English language that were published in the United States. The “English One Million” corpus contains books in English with dates ranging from 1500 to 2008. Of these, no more than about 6000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled.
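The per-year cap described above can be illustrated with a minimal sketch. This is not Google's actual procedure, which is not documented here. The function name `sample_capped_by_year`, the `(book_id, year)` input format, and the fixed random seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_capped_by_year(books, cap=6000, seed=0):
    """Keep every book from years with at most `cap` books;
    draw a random sample of `cap` books from years with more.

    `books` is an iterable of (book_id, year) pairs. This input
    format is hypothetical, chosen only for this illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Group book identifiers by publication year.
    by_year = defaultdict(list)
    for book_id, year in books:
        by_year[year].append(book_id)

    selected = {}
    for year, ids in by_year.items():
        if len(ids) <= cap:
            # Sparse (typically early) years: all scanned books are kept.
            selected[year] = list(ids)
        else:
            # Dense (typically later) years: a random sample is kept.
            selected[year] = rng.sample(ids, cap)
    return selected


# Toy catalog with a cap of 3 instead of 6000.
toy_books = [("a0", 1600), ("a1", 1600)] + [(f"b{i}", 1990) for i in range(10)]
toy_sample = sample_capped_by_year(toy_books, cap=3)
print({year: len(ids) for year, ids in toy_sample.items()})
```

In the toy run, both books from 1600 survive while only three of the ten books from 1990 do, which is why a capped corpus of this kind weights early and late years differently.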

Issues with research with the Google Books Ngram viewer

Historian Ben Schmidt writes about the unreliability of Google Ngrams. Starting around 2008, Google Books data include printed copies of digitized books that were originally printed before 1923. That means that the data set for the Ngram Viewer includes hundreds of thousands of books from the 19th century that have returned to print in the 21st century. These printed copies of books from earlier centuries are being produced by print-on-demand publishers. Adding this data from the enormous numbers of reprints of out-of-copyright works means that, to the extent that it was previously possible to trace the frequency of use of words with Ngrams, it is much less statistically accurate for recent years. Schmidt suggests that the quick credibility-check for anyone citing Ngrams is that they don’t use the post-2000 books as any sort of evidence. See also Schmidt’s blog post What historians don’t know about database design….
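For readers working with yearly counts, the credibility check that Schmidt suggests could be applied along the lines of the following sketch. The function name `relative_frequency`, the dictionary inputs, and the sample numbers are hypothetical; they do not reflect the format of any Google download or any real counts.

```python
def relative_frequency(term_counts, total_counts, cutoff_year=2000):
    """Return the term's share of all words per year, ignoring
    every year after `cutoff_year`.

    Both arguments map a year to a word count. These inputs are
    hypothetical and exist only for this illustration.
    """
    result = {}
    for year in sorted(term_counts):
        if year > cutoff_year:
            # Post-2000 data is skewed by print-on-demand reprints
            # of older books, so it is left out as evidence.
            continue
        total = total_counts.get(year, 0)
        if total > 0:
            result[year] = term_counts[year] / total
    return result


# Made-up counts: the 2005 value is dropped by the cutoff.
term_counts = {1990: 5, 2005: 50}
total_counts = {1990: 100, 2005: 100}
print(relative_frequency(term_counts, total_counts))
```

The cutoff is a parameter rather than a fixed constant so that a researcher can test how sensitive a trend is to the choice of end year.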

Example of research using Google Books Ngram Viewer 

Jefferson Pooley discusses his use of Google Books Ngram Viewer with JSTOR search and a search of the PsycARTICLES database to determine the historical frequency of terms related to the behavioral sciences. A “Not Particularly Felicitous” Phrase: A History of the “Behavioral Sciences” Label.

Locating bibliographic information

To cite, click on “About this book” in the left column of the Google Books page. Scroll down to the bottom to see the bibliographic information. The citation can be retrieved with bibliographic management software such as BibTeX or EndNote. Caution may be appropriate when using the Google Books automated citation feature because, according to a study by Ryan James, 36% of the Google Books sampled had errors in author, title, publisher, or publication year. Click the Conversations tab in Beyond Citation for details about the study.

MLA style

To cite online and electronic works, MLA style, see the guides at Research and Documentation Online, 5th edition, Diana Hacker.

Citing the digital

While digital resources are ubiquitous, comparatively few scholars cite them, preferring instead to cite the print version. Yet if digital sources are not cited, it is impossible for archives, university libraries, funders and other entities to know whether the collections are useful and should be expanded, or to make evidence-based arguments for future digitization of other materials. Jonathan Blaney of the Humanities Research Institute offers an egregious example of this, saying that 31 million page views of British History Online over a two-year period resulted in only 89 citations in Google Scholar and Scopus combined. Blaney asks scholars to embrace the principle—cite what you used.
