This paper describes a two-year project at Cornell University Library to develop an archiving solution to safeguard Cornell's digital image collections, consisting of two and a half million images created over a decade. The Project has inventoried Cornell's digital image collections, investigated current and emerging file formats for long-term utility, explored functional requirements for storage, and drafted recommendations for requisite preservation metadata. These efforts demonstrated the need to develop institutional policies and resources to facilitate long-term maintenance. The final goal of this Project, therefore, is a mainstreaming effort to move digital preservation from project mode to program mode. A Digital Preservation Policy Working Group has been assembled that will present its recommendations to the Library Management Team in September 2000. Its report will cover requirements for conversion and metadata, a technical strategy for storage, maintenance, and long-term care, and an assessment of the resources needed to maintain these digital assets.
Contents
Introduction
Project Goals
Work to Date
Extending Project Goals: Developing a Digital Preservation Policy
Conclusion
Introduction
Cornell University Library has been a pioneer in the use of digital image technology to convert library and archival materials to digital image form to serve both preservation reformatting and access goals. A primary emphasis of this effort has been the development of quality requirements to support the creation of replacement copies for text-based material. In a series of brittle books digitization projects in the 1990s, printed pages produced from digital images were compared to preservation photocopies, and computer output microfilm (COM) was judged against national standards for preservation microfilming. The conclusion of both investigations was that 600 dpi 1-bit scanning could produce replacements for brittle books comparable in quality to those obtained via conventional reformatting means. Throughout the 1990s, Cornell Library followed a hybrid approach in the use of digital image technology, which coupled the creation of digital surrogates for brittle books to improve access with the production of either COM or paper replacements printed on acid-free paper for preservation. This approach preserved the informational content of the original materials, but did not address the need to safeguard the digital content itself. In using digital imaging, Cornell did not decrease its overall preservation problem but transferred the concern to a new area.
Because most of the imaging efforts were project-based, by the end of the 1990s Cornell University Library had assembled an impressive mass of digital information, but it was not managing these digital assets in a manner that would ensure their long-term viability. The project described here, funded by the Institute of Museum and Library Services (IMLS), offered the means to plan and implement a coherent digital preservation strategy for Cornell's digital image collections.
Project Goals
Cornell's Project identified the following goals:
- Investigate emerging file formats for long-term utility, assess the extent to which image quality and metadata are altered during migration to those formats, and identify one or more standard file formats for digital masters maintained by Cornell.
- Develop and consistently apply requisite metadata to enable digital image files to become self-preserving objects.
- Assess alternative storage solutions and implement the best approach for amalgamating, managing, preserving, and making accessible Cornell's digital image collections.
- Place the master files at the center of the digital library system, and develop on-the-fly conversion capabilities to create alternative views and formats to meet current and future user needs.
- Report to the nation on the implementation of a preservation program for digital image materials, including an assessment of resource needs, trade-offs, and risk management factors.
Work to Date
The results of work in progress are presented on the Project's Web site (www.library.cornell.edu/imls/index.htm). In brief, these efforts fall into the following categories.
Collection inventory
Since January 1999, the Project team has taken an inventory of some thirty different Cornell digital image collections, the results of which have shown the value of creating good preservation metadata and adequate documentation at the time of image creation. Recreating this information after the fact has not only proven difficult and expensive but has also highlighted the lack of uniformity in approach over the course of a decade's work, and the value of standardization for long-term management. See the Project Web site, "Cornell University Library, Digital Image Collections Inventory."
File format investigation
The investigation into current and emerging file formats for long-term utility led Cornell to conclude that its current choice of image formats (TIFF 5.0 and 6.0 ITU-T6 compressed) was acceptable and that there was no immediate need to migrate these files to new formats. There was, however, a need to address the proprietary nature of the RDO files created using the Xerox XDOD scanning system. An RDO file contains information about the structure of an image document, as well as a file location pointer for each page image in that document. Because the XDOD system is proprietary, the structure of image documents can be displayed only through the use of the appropriate Xerox software. Since 1994, Cornell has converted RDO files to an open Cornell Digital Library (CDL) format. (The specification for CDL was released in August 1994 as Request for Comments #1691.) The Project's inventory, however, revealed that not all files had been converted, and a conversion program was instituted as part of the effort to consolidate all the image files. A risk analysis study was conducted to determine the risks involved in this migration process. (See the Project Web site, "Risks Involved in Migrating RDO Files.")
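The idea of a structure file that records page order and a file location pointer for each page image can be illustrated with a minimal sketch. The class and function names below are hypothetical and do not reproduce the RDO or CDL layouts; the sketch only shows the kind of completeness check that could be run before and after a migration, assuming each document is described by an ordered list of page pointers.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class PageRef:
    """One page of a document: its sequence number and image file pointer."""
    sequence: int
    image_path: str


@dataclass
class DocumentStructure:
    """Hypothetical stand-in for a structure file (not the RDO or CDL layout)."""
    doc_id: str
    pages: List[PageRef] = field(default_factory=list)


def find_migration_problems(doc: DocumentStructure, root: Path) -> List[str]:
    """Return readable descriptions of gaps in page order or missing images."""
    problems = []
    expected = 1
    for page in sorted(doc.pages, key=lambda p: p.sequence):
        # A jump or repeat in the sequence suggests a lost or duplicated page.
        if page.sequence != expected:
            problems.append(
                f"{doc.doc_id}: expected page {expected}, found {page.sequence}"
            )
            expected = page.sequence
        # A pointer that resolves to no file means the page image is missing.
        if not (root / page.image_path).is_file():
            problems.append(f"{doc.doc_id}: missing image {page.image_path}")
        expected += 1
    return problems
```

Running such a check against both the source and the converted structure files would show whether any page pointers were dropped or broken during conversion, which is one of the risks a migration of this kind has to control.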
Preservation metadata
A third area of work has centered on the requisite metadata to support preservation, and draft recommendations stemming from this effort are posted on the Project Web site (see "Metadata Recommendations"). This effort has shown that defining requirements is the easy part; implementing them and determining where to store such information is more problematic. A decision was made to expand and apply the descriptive metadata proposal first, because implementing the structural metadata elements will require resolving some system architecture issues beforehand. Work on those recommendations is still underway, but it will most likely result in a temporary solution until Cornell completes its transition to a new library management system (Endeavor's Voyager) and the assessment of its image database development (Encompass).
Persistent identifiers
In tandem with this project, Cornell established a repository and naming service (based on OCLC's Persistent URL [PURL]), and started creating persistent identifiers for the image files.
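The principle behind a naming service of this kind is indirection: users cite a persistent name, and a resolver maps that name to wherever the file currently lives. The following is a minimal in-memory sketch of that principle only. It is not the PURL software or Cornell's repository, and the class name, method names, and example addresses are all hypothetical.

```python
from typing import Dict, Optional


class NameResolver:
    """Toy naming service: persistent name -> current location (illustrative)."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def register(self, persistent_name: str, location: str) -> None:
        """Assign a persistent name once; reusing a name is an error."""
        if persistent_name in self._table:
            raise ValueError(f"{persistent_name} is already registered")
        self._table[persistent_name] = location

    def relocate(self, persistent_name: str, new_location: str) -> None:
        """Record that a file has moved; the persistent name stays the same."""
        if persistent_name not in self._table:
            raise KeyError(persistent_name)
        self._table[persistent_name] = new_location

    def resolve(self, persistent_name: str) -> Optional[str]:
        """Return the current location, or None if the name is unknown."""
        return self._table.get(persistent_name)


resolver = NameResolver()
resolver.register("/example/page-0001", "http://old-host.example.org/img/0001.tif")
resolver.relocate("/example/page-0001", "http://new-host.example.org/img/0001.tif")
```

After the relocation, citations of `/example/page-0001` still resolve, because only the resolver table changed. A production service would typically answer with an HTTP redirect instead of returning a string.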
Storage
Because this Project took as guiding principles the centralization of digital content and the use of digital masters in the delivery system, the storage solution had to be sufficiently capacious and robust to meet access needs. The project inventory enabled us to determine space requirements and evaluate different storage options. As part of a cost-share requirement for this grant, Cornell secured in-kind contributions and reduced costs for a SUN E3500 storage system. Work to date has focused on organizing the disk storage for the files and on installing Perl, the Apache Web server, and a C compiler to prepare the central repository. Following an analysis of alternatives for backup, Cornell chose ADSM, an IBM-based backup solution, because it allows customization of services for static directories. A report on alternatives and an assessment of the effectiveness of Cornell's choice will be prepared and posted on the Web site in summer 2000.
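The arithmetic behind a space estimate can be sketched as follows. The 600 dpi 1-bit scanning resolution and the figure of two and a half million images come from this paper. The page dimensions, the compression ratio parameter, and the simplification that every image is a bitonal page are assumptions made purely for illustration, and the function names are hypothetical. The calculation ignores file headers and row padding.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int, bits_per_pixel: int) -> int:
    """Raw pixel data size for one scanned page, rounded up to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8


def estimate_collection_bytes(image_count: int, bytes_per_image: int,
                              compression_ratio: float = 1.0) -> int:
    """Scale a per-image size to a collection.

    compression_ratio is an assumed average (uncompressed size divided by
    stored size); 1.0 means no compression and gives an upper bound.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return round(image_count * bytes_per_image / compression_ratio)


# Assumed page size of 8.5 x 11 inches, scanned at 600 dpi, 1 bit per pixel.
per_page = uncompressed_bytes(8.5, 11, 600, 1)
upper_bound = estimate_collection_bytes(2_500_000, per_page)
```

Under these assumptions a page occupies 4,207,500 bytes uncompressed, so the uncompressed upper bound for 2.5 million such pages is roughly 10.5 terabytes. Actual requirements depend on the mix of image types and on the compression achieved, which is why an inventory of the real files is the more reliable basis for sizing.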
Extending Project Goals: Developing a Digital Preservation Policy
This Project has demonstrated the need for a coherent approach to digital collection building in order to facilitate the long-term care of those collections. It has also demonstrated the downside of having secured outside funds to underwrite a myriad of projects. To date, Cornell's digital imaging initiatives have been project driven, with little standardization or thought given to what happens after a project has been completed. Our assessment of Cornell's digital image collections confirms that we have created digital assets of considerable value, which has led to a renewed commitment to maintain them over time. It has also motivated project staff to move digital preservation planning from a project orientation to a mainstreamed program, which, to succeed, must develop policies and requirements for content creation for those digital collections that will come under the Library's purview for perpetual care. Toward that end, the team developed recommendations for the creation of a policy framework for future digital collection efforts, and sought support from the Library Management Team (LMT) to establish a campus-wide committee to develop a plan for consideration by the LMT in fall 2000.
With strong support from senior administrators, a Digital Preservation Policy Working Group was charged in February 2000 with developing policies governing prospective digital imaging projects that will be mainstreamed into Cornell's Digital Library. The goal is to develop specific recommendations for image collections designed to facilitate their long-term management. These recommendations will cover such things as storage, file formats, metadata, and cataloging. We are not assuming that all projects will adhere to these requirements, but only those that will become the Library's long-term responsibility. Our charge is limited to digital imaging collections, but we anticipate that much of the work will apply as well to other kinds of digital material.
This group identified four key areas of work and divided itself into subcommittees covering the following:
- Deposit requirements covering selection and creation of digital resources that have a high probability of use and reuse over time. The motivation is to address preservation concerns from the ground up, including adequate quality capture and review, as well as recommendations on formats, compression, quality control, descriptive metadata, documentation, and persistent identifiers.
- Technical guidelines associated with storage, backup, redundancy, database management, centralization, technical metadata, and preservation options (migration, etc.). This work presumes that the master files should not be placed at risk by applying short-term solutions to short-term problems (many of today's constraints will not be tomorrow's, and we should avoid building an approach that quickly becomes outdated or superseded).
- Resource requirements necessary to mainstream digital preservation, acknowledging that digital preservation requires perpetual care, and providing for on-going sustainability, including a secure administration home, finances, personnel, hardware/software, training, and space. Without on-going resource commitment and a process for continuous review/updating of recommendations, a digital preservation strategy will fail to thrive.
- Publicity requirements, encompassing outreach, consensus building, and review. More specifically, this sub-committee is charged with devising a buy-in strategy to secure acceptance and adoption of this plan. Cornell has been a leader on a number of digital fronts. Projects have been characterized by a highly independent spirit. The committee realizes the resource implications of preservation planning and the uniformity imposed by deposit requirements. Seeking input and advice is not only a good idea but also provides the means for securing key support.
Conclusion
This project will conclude in September 2000. Its real impact will not be assessed at that time, but only in the years to come.
About the Authors
Anne R. Kenney is Associate Director of the Department of Preservation and Co-Director, Cornell Institute for Digital Collections, at Cornell University. For the past decade, she has led research and production projects focusing on the use of digital imaging to retrospectively convert library and archival materials. She has written and spoken widely on the topic, and is the co-author of the award-winning Digital Imaging for Libraries and Archives (Cornell University Library, 1996). Kenney is a Fellow and past president of the Society of American Archivists, and a commissioner of the National Historical Publications and Records Commission.
E-mail: ark3@cornell.edu
Oya Y. Rieger is the Coordinator of the Digital Imaging and Preservation Unit at the Department of Preservation, Cornell University Library. She has a diverse background in digital libraries, including managing numeric files collections, coordinating the development of the award-winning USDA Economics and Statistics System, and conducting user studies. She has written and spoken frequently on various aspects of developing digital libraries. During the last three years, her research focus has been on digital imaging and preservation. She currently serves on the joint Research Libraries Group - Digital Library Federation Task Force on Long-Term Retention of Digital Information.
E-mail: oyr1@cornell.edu
Kenney and Rieger have collaborated on numerous training and research projects, and share the editorship of RLG DigiNews. They are the principal authors and editors of a new monograph, Moving Theory into Practice: Digital Imaging for Libraries and Archives, published by the Research Libraries Group (April 2000); see http://www.rlg.org/preserv/mtip2000.html
Editorial history
Paper received 1 May 2000; accepted 10 May 2000.
Copyright ©2000, First Monday
Preserving Digital Assets: Cornell's Digital Image Collection Project by Anne R. Kenney and Oya Y. Rieger
First Monday, volume 5, number 6 (June 2000),
URL: http://firstmonday.org/issues/issue5_6/kenney/index.html