Data Production

MBP scans documents at the archival resolution of 600 dots per inch (dpi); black-and-white pages are captured as binary images (1 bit per pixel). The resulting images are stored in Tagged Image File Format (TIFF). The resolution is high enough that printed copies are as legible as the original pages, to say nothing of reading on screen.
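To give a sense of scale, the sketch below computes the pixel dimensions and uncompressed size of a 1-bit page image at 600 dpi. The 6 x 9 inch page size is purely an illustrative assumption and not an MBP figure, and the function name is hypothetical; the stored TIFF files may be smaller if compression is applied.

```python
import math

def bitonal_page_size(width_in, height_in, dpi=600):
    """Return (width_px, height_px, uncompressed_bytes) for a 1-bit scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # At 1 bit per pixel, 8 pixels fit in a byte; each row is padded
    # to a whole number of bytes.
    bytes_per_row = math.ceil(width_px / 8)
    return width_px, height_px, bytes_per_row * height_px

# Assumed example page of 6 x 9 inches (not a figure from the project).
w, h, size = bitonal_page_size(6.0, 9.0)
print(w, h, size)  # 3600 5400 2430000
```

Under this assumption a single page is 3600 x 5400 pixels and about 2.4 MB before any compression.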

Typical TIFF data are provided for most tags. A specification for the TIFF header has been produced that includes scanner technical information, the filename, and other data. It places no burden on the production service, because the data are supplied by the software's default settings [4].

Images are named in sequential order with 8.3 filenames, e.g., 00000001.tif for the first image in a volume and 00000341.tif for the 341st. Each volume provided to the project is assigned a unique 8-character identifier, and its images are stored in a directory named with that identifier (e.g., the identifier 06000001 yields a directory of the same name, in this case created by Nanjing University). Images and directories are written to DVDs according to agreed-upon specifications.
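The naming convention can be sketched in a few lines of Python. The function names are hypothetical, and the validation rules (digits only, sequence starting at 1) are assumptions inferred from the examples in the text rather than from the project's specification.

```python
from pathlib import PurePosixPath

def image_filename(sequence_number):
    """Build the 8.3 filename for the n-th image of a volume."""
    if not 1 <= sequence_number <= 99_999_999:
        raise ValueError("sequence number must fit in 8 digits")
    # Zero-pad to 8 digits and add the 3-character extension.
    return f"{sequence_number:08d}.tif"

def image_path(volume_id, sequence_number):
    """Place the image in the directory named after the volume identifier."""
    if len(volume_id) != 8 or not volume_id.isdigit():
        raise ValueError("volume identifier must be 8 digits")
    return PurePosixPath(volume_id) / image_filename(sequence_number)

print(image_path("06000001", 1))    # 06000001/00000001.tif
print(image_path("06000001", 341))  # 06000001/00000341.tif
```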

Minolta PS-7000 overhead scanners, provided to the Chinese scanning centers by CMU, are used in the project. The image-processing software performs curvature correction, de-skewing, de-speckling, and cropping, which allows thick books to be scanned either flat or in an angled cradle that reduces wear on the spine.

The English OCR program ABBYY FineReader and the Chinese program TH-OCR 2000 are tested in the project. The OCR output is not corrected manually, because the primary function of OCR is to allow searching inside the text. The OCR text is therefore not displayed directly; it is used only to create a searchable index.
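As an illustration of how uncorrected OCR text can support search without ever being displayed, the following minimal sketch builds an inverted index that maps each word to the page images containing it. It is not the project's indexing system: the function names and sample texts are hypothetical, and the simple word splitting shown here suits English only (Chinese text would need a proper segmentation step).

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page images containing it.

    pages: dict of image filename -> uncorrected OCR text.
    """
    index = defaultdict(set)
    for filename, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(filename)
    return index

def search(index, term):
    """Return the page images whose OCR text contains the term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical OCR output for two page images.
pages = {
    "00000001.tif": "Data production",
    "00000002.tif": "Scanned data",
}
index = build_index(pages)
print(search(index, "DATA"))  # ['00000001.tif', '00000002.tif']
```

A search returns page image names, so the reader is shown the scanned page itself rather than the OCR text.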