| Database Creation
When completed, the MBP will produce approximately 300 million pages, or 600 billion characters, of information. The database will house both an image file and a text file for each book, at about 50-60 megabytes per book. Creating and managing such a vast information base poses many technological challenges and provides a fertile test bed for innovative research in many areas. Mirroring the database at several sites in China and the USA will not only provide fast access, since network speeds differ from node to node, but also ensure security and long-term preservation. Research in distributed caching and active networks will be needed to ensure that the look and feel of the database is the same from any location. The Internet Archive is a project partner, providing a permanent archive for the Million Book Collection, quality control tools, and assistance with acquiring books.
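The figures above imply some per-book averages and a total storage requirement, which the following back-of-envelope sketch works out. It assumes a collection of exactly one million books (taken from the project's name, not stated as a count in the text) and decimal units (1 TB = 1,000,000 MB); the function and constant names are illustrative only.

```python
# Back-of-envelope estimate from the figures quoted in the text.
BOOKS = 1_000_000              # assumption: one million books, per the project name
TOTAL_PAGES = 300_000_000      # from the text
TOTAL_CHARS = 600_000_000_000  # from the text
MB_PER_BOOK = (50, 60)         # from the text: image + text files per book


def estimate():
    """Return implied per-book averages and total storage per mirror site."""
    low_mb, high_mb = MB_PER_BOOK
    return {
        "pages_per_book": TOTAL_PAGES / BOOKS,
        "chars_per_page": TOTAL_CHARS / TOTAL_PAGES,
        # Decimal terabytes: 1 TB = 1,000,000 MB (assumption).
        "total_tb_low": BOOKS * low_mb / 1_000_000,
        "total_tb_high": BOOKS * high_mb / 1_000_000,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions each mirror site would need to hold on the order of 50-60 terabytes, which is why replication and caching strategy matter for the project.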
The images of the MBP digital books are used for viewing, while the texts produced by OCR are used for searching. To speed up image viewing, Zhejiang University Libraries have tested and adopted DjVu as the publishing format, replacing HTML pages with TIFF images. DjVu is a highly sophisticated imaging technology based on six advanced technological breakthroughs developed at AT&T Labs. Conventional image-viewing software decompresses an image in its entirety before displaying it. This is impractical for high-resolution document images such as the MBP digital books, because the decompressed images typically exceed the memory capacity of most computers. DjVu, by contrast, keeps the image in memory in a compact form and decodes only the area displayed on the screen, in real time, as the user views the image. In addition, the initial view of a page loads very quickly, and the visual quality improves progressively as more bits arrive[6]. DjVu can achieve file size reduction ratios as high as 500:1 while preserving excellent image quality.
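The idea of keeping a page compressed in memory and decoding only the visible region can be illustrated with a minimal sketch. This is not DjVu's actual codec: it swaps in zlib-compressed square tiles purely to show the access pattern. The class name `TiledPage`, the tile size, and the `decoded` counter are all hypothetical, and the viewport is assumed to lie entirely within the page.

```python
import zlib

TILE = 64  # tile edge in pixels (hypothetical value)


class TiledPage:
    """Grayscale page held in memory as independently compressed tiles."""

    def __init__(self, pixels, width, height):
        # pixels: row-major bytes of length width * height
        self.width, self.height = width, height
        self.tiles = {}
        self.decoded = 0  # number of tiles decompressed so far
        for ty in range(0, height, TILE):
            for tx in range(0, width, TILE):
                rows = [
                    pixels[y * width + tx : y * width + min(tx + TILE, width)]
                    for y in range(ty, min(ty + TILE, height))
                ]
                self.tiles[(tx // TILE, ty // TILE)] = zlib.compress(b"".join(rows))

    def view(self, x, y, w, h):
        """Return the w-by-h viewport at (x, y), decoding only overlapping tiles."""
        out = bytearray(w * h)
        for tj in range(y // TILE, (y + h - 1) // TILE + 1):
            for ti in range(x // TILE, (x + w - 1) // TILE + 1):
                tile = zlib.decompress(self.tiles[(ti, tj)])
                self.decoded += 1
                tile_w = min(TILE, self.width - ti * TILE)
                tile_h = min(TILE, self.height - tj * TILE)
                # Horizontal overlap between this tile and the viewport.
                x0 = max(x, ti * TILE)
                x1 = min(x + w, ti * TILE + tile_w)
                for ry in range(tile_h):
                    py = tj * TILE + ry
                    if not (y <= py < y + h):
                        continue  # row lies outside the viewport
                    start = ry * tile_w + (x0 - ti * TILE)
                    src = tile[start : start + (x1 - x0)]
                    dst = (py - y) * w + (x0 - x)
                    out[dst : dst + len(src)] = src
        return bytes(out)
```

With a 256 x 256 page split into 16 tiles, a 50 x 50 viewport that falls inside a single tile triggers one decompression rather than sixteen; the full uncompressed page is never materialized by `view`.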
|