HathiTrust Digital Library

HathiTrust is a full-text searchable digital library of more than 11 million monographs, serials, and pamphlet collections from HathiTrust partner libraries, of which an estimated 3.7 million items are in the public domain. Although the majority of volumes were digitized by the Google Books project, the HathiTrust website offers its own interface and search mechanism as well as other benefits for researchers. HathiTrust is operated by a partnership of international research libraries for the purpose of preserving the cultural record.

Heather Christenson explains the benefits of the unusual structure of HathiTrust in an article for Library Resources & Technical Services. Libraries own both the search mechanism and the content on which it acts. This ownership is significant for several reasons. For end users, the selectiveness and ranking of search results are not influenced by commercial interests, and the material covered by the search is a known corpus of materials selected, cataloged, and curated by librarians with the interests of academic users in mind.

Access

Access to HathiTrust materials depends on the copyright status of the materials sought and on whether the user is affiliated with one of HathiTrust’s partners, among other factors. Click the Access tab in Beyond Citation for details. Bulk data is available for text mining, and support is offered to researchers creating computational tools to analyze the corpus of digital text.

Contents

For a breakdown of the contents of HathiTrust by source library and also by language see this presentation by librarian Heather Christenson. For information about the legibility of and errors in HathiTrust material, click on the Conversations tab.

Search

The search options are catalog search, full-text search, and searching or browsing the HathiTrust collections. Searches can also be limited to title, author, subject, ISSN/ISBN, publisher, series title, year of publication, or all fields, and an advanced search supports Boolean parameters with multiple options to refine results.

Because of the enormous number of volumes that it contains, searches in HathiTrust may return a very large number of results. To enable more precision in search, HathiTrust offers Library of Congress (LC) subject headings. Librarian Eamon Duffy describes in detail how to combine LC terms with keyword search in “Advanced Full-text Search.” His example is a keyword search that returns 46,010 results. An LC subject heading search of the same material returns 362 results. The combination of keyword and LC terms yields only 150 results, which are more precise. To further focus a search, HathiTrust also shows the number of times that the search terms appear in each work.
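The narrowing effect of combining a keyword with a subject heading can be illustrated with a minimal sketch. This is not HathiTrust's search implementation: the records, the `search` function, and its parameters are all invented for illustration. It assumes that a combined search simply requires both conditions to hold, and it orders hits by how often the keyword appears in each work.

```python
from collections import Counter

# Toy records: (title, LC subject headings, full text). All data is invented.
RECORDS = [
    ("Whaling Voyages", {"Whaling"}, "the whale ship sailed and the whale was seen"),
    ("Coastal Towns", {"Cities and towns"}, "a whale washed ashore near the town"),
    ("Ship Design", {"Whaling", "Shipbuilding"}, "hull and mast and rigging"),
]


def search(keyword=None, subject=None):
    """Return (title, keyword_count) pairs matching every condition given."""
    hits = []
    for title, subjects, text in RECORDS:
        # Count occurrences of the keyword in the full text (0 if no keyword).
        count = Counter(text.lower().split())[keyword.lower()] if keyword else 0
        if keyword and count == 0:
            continue  # keyword requested but absent from this work
        if subject and subject not in subjects:
            continue  # subject heading requested but not assigned to this work
        hits.append((title, count))
    # Works in which the keyword appears most often come first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(search(keyword="whale"))                     # keyword only
print(search(subject="Whaling"))                   # subject heading only
print(search(keyword="whale", subject="Whaling"))  # both combined
```

In this toy catalog the keyword search and the subject search each return two works, while the combined search returns only the one work that satisfies both, which mirrors the pattern in Duffy's example on a much smaller scale.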

Review

A reviewer for the Charleston Advisor gave HathiTrust four and a half stars each for content and searchability. (Paywalled article.)

Date range: pre-1500 to 2013

Publisher: HathiTrust

Publisher About page: http://www.hathitrust.org/about

Object type: Books, Journals, Ephemera

Location of original materials: Multiple

Exportable image: Yes

Facsimile image: Yes

Full text searchable: Yes

Original catalog: Multiple

Digitized from microfilm: Some items

Original sources: Multiple

Gita Gunatilleke. “HathiTrust.” The Charleston Advisor 13.4 (2012): 43–46. Paywalled article.

HathiTrust has complicated rules about the types of users and the types of access they have to HathiTrust materials. Heather Christenson of the California Digital Library has parsed the distinctions between the types of access. She writes that Google-digitized public domain volumes are available in a full PDF download to authenticated users from partner institutions; public domain volumes digitized via Internet Archive and locally by partners are available in full PDF to all. All public domain volumes can be viewed on the web in a page-turner application. Volumes that are in copyright are discoverable via large-scale search, and users may view a list of pages on which their search term appears (snippets are not yet available).
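The distinctions Christenson draws can be read as a small decision table. The following is a minimal sketch of that reading only, not HathiTrust's actual access logic or API: the `Volume` class, the `allowed_access` function, and the access labels are hypothetical names chosen for illustration. It also ignores other factors, such as the user's geographic location, and it assumes every volume is discoverable through search.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Volume:
    public_domain: bool
    digitized_by: str  # hypothetical labels: "google", "internet_archive", "partner"


def allowed_access(volume: Volume, partner_user: bool) -> Set[str]:
    """Return the kinds of access suggested by Christenson's summary."""
    access = {"search"}  # assumption: all volumes are discoverable via search
    if not volume.public_domain:
        # In-copyright volumes: only a list of pages where the term appears.
        access.add("page_list_for_search_term")
        return access
    # Every public domain volume can be read in the page-turner application.
    access.add("page_turner_view")
    if volume.digitized_by == "google":
        # Google-digitized volumes: full PDF only for authenticated partner users.
        if partner_user:
            access.add("full_pdf_download")
    else:
        # Internet Archive or locally digitized volumes: full PDF for everyone.
        access.add("full_pdf_download")
    return access


print(allowed_access(Volume(True, "google"), partner_user=False))
print(allowed_access(Volume(True, "internet_archive"), partner_user=False))
print(allowed_access(Volume(False, "google"), partner_user=True))
```

Under these assumptions, a user without a partner affiliation can read a Google-digitized public domain volume page by page but cannot download the full PDF, while the same user can download a volume digitized by the Internet Archive.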

Access outside of the U.S.

Access to certain materials also depends upon the geographic location of the user because of differing international rules about copyright. Works that were published between 1873 and 1923 are not accessible from IP addresses outside the U.S. Click on this link for details.

Legibility and errors

Information scientist Paul Conway undertook a large-scale study of materials from HathiTrust, funded by the Andrew W. Mellon Foundation and the Institute of Museum and Library Services. Conway sampled English-language books and serials published before 1923 that were digitized by Google between 2004 and 2010. He writes that fully one-quarter of the volumes [sampled] contain at least one page image whose content is unreadable (or about one percent of the total content). He also writes that the findings suggest that the imperfection of digital surrogates is an obvious and nearly ubiquitous feature of Google Books and that such imperfection has become and will remain firmly ensconced in collaborative preservation repositories. Phase two of the project used “statistically valid measures” to determine patterns of error. A report of Conway’s findings, “Preserving Imperfection: Assessing the Incidence of Digital Imaging Error in HathiTrust,” was published in 2013. Paywalled article.

MLA style

To cite online and electronic works in MLA style, see the guides at Research and Documentation Online, 5th edition, by Diana Hacker.

Citing the digital

While digital resources are ubiquitous, comparatively few scholars cite them, preferring instead to cite the print version. Yet if digital sources are not cited, archives, university libraries, funders, and other entities cannot know whether the collections are useful and should be expanded, and cannot make evidence-based arguments for digitizing other materials. Jonathan Blaney of the Humanities Research Institute offers a striking example: 31 million page views of British History Online over a two-year period resulted in only 89 citations in Google Scholar and Scopus combined, or roughly one citation for every 350,000 page views. Blaney asks scholars to embrace a simple principle: cite what you used.
