Objective

The objective of this project is to create a free-to-read, searchable collection of one million books, primarily in the English language, available to everyone over the Internet. This task is accomplished by scanning the books and indexing their full text. The text file is created, where possible, through optical character recognition. The result will be a unique resource accessible to anyone in the world around the clock, every day of the year, without regard to nationality or socioeconomic background.

Typical large high-school libraries house fewer than 30,000 volumes. One million volumes is the approximate size of the combined libraries at Carnegie Mellon University. The total number of different titles indexed in OCLC’s WorldCat is about 48 million. One million books, therefore, is more than the holdings of any high school, equivalent to the library at a substantial university, and roughly 2% of all titles indexed in WorldCat.

Executive Summary

The goal of the Million Book Digital Library Project is to create a universal, free-to-read digital library containing over one million scanned books, with optical character recognition where possible to support full-text searching. Such a resource will help democratize knowledge by making a unique library resource available on the web to scholars, students, and citizens around the world. Online search allows users to locate relevant information quickly and reliably, thus enhancing students' willingness to undertake research and their success in it. This always-available resource would also provide an excellent testbed for language processing research in areas such as machine translation, summarization, intelligent indexing, and information mining.

A portion of the content would consist of out-of-copyright, pre-1920 materials. A "best books" feature of the project would involve requesting permission to scan titles listed in the core collection development tool Books for College Libraries. A preliminary Carnegie Mellon University Libraries pilot suggests that 22% of the 80,000 titles (about 17,600) might become available. Further, when 80% of the million books are finished, scholars will be recruited to review collections in their disciplines and to select the remaining books of importance.

Mirroring the site at several locations worldwide will protect the integrity and availability of the data. Several models for sustainability are being explored and are discussed in this report. Usability studies would also be conducted to ensure that the materials are easy to locate, navigate, and use. Appropriate metadata for navigation and management would also be created.

The National Science Foundation (NSF) is providing funding for scanners, computers, servers, and software. These NSF resources are augmented by almost twenty to one, since China and India will provide the necessary manpower (2,000 man-years each over a four-year period) as their contribution to the project, assisting in the selection of documents, in software development, and in digitizing the materials. Indigenous Chinese and Indian materials would form a portion of the content scanned, as would English-language materials already resident in those countries. In addition, U.S. libraries, primarily members of the Digital Library Federation, would ship materials to be scanned and returned.