Choosing the components of a digital infrastructure by Tim DiLauro

This paper is based on a talk of the same name given at the IMLS–sponsored Web–Wise 2004 conference. The purpose of this paper, as with the talk before it, is to highlight some issues and help inform the choices associated with developing digital environments within a single institution or among many. While the bulk of this discussion focuses on digital repositories as a key component of the digital infrastructure, persistent identifiers, assumptions surrounding digital preservation, and integration of digital library services are also discussed.

Contents

Introduction
Preservation
Identifiers
Open systems and standards
Activity at Johns Hopkins
Unbundle, then integrate
Conclusion

Introduction

As digital library tools and applications mature, more institutions are implementing or exploring implementation of these systems. What I hope to do is provide information that will be useful to you as you develop digital library environments — especially their infrastructure components — within your own organization or across organizations.

There is a lot of misunderstanding in the digital library arena because of the inconsistent application of terminology. That is, many of the terms that we regularly use mean different things to different people. With that in mind, I’d like to provide the simple definition that describes what I mean when I use the term "infrastructure," acknowledging that your definition may differ:

Infrastructure is the core of general functionality upon which other applications can be built.

In mathematical terms, you might think of infrastructure not as the lowest common denominator, but as the greatest common factor: The set of systems and services that should not have to be recreated for each application in a digital library.

Repositories are an important building block of this infrastructure. Here, again, however, the term can be misleading. In its simplest form, a repository would support mechanisms to import, export, identify, store, and retrieve digital objects and metadata.
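The five core functions named above (import, export, identify, store, and retrieve) can be sketched as a small in-memory class. This is only an illustration under stated assumptions: the names `MinimalRepository` and `DigitalObject`, the JSON export format, and the UUID-based identifiers are all hypothetical and do not reflect the API of DSpace, Fedora, or any other system.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A digital object: raw content plus descriptive metadata."""
    identifier: str
    content: bytes
    metadata: dict = field(default_factory=dict)


class MinimalRepository:
    """Hypothetical in-memory repository covering only the core functions."""

    def __init__(self):
        self._objects = {}

    def identify(self):
        """Mint a new identifier (a random UUID URN here; real systems differ)."""
        return "urn:uuid:" + str(uuid.uuid4())

    def store(self, content, metadata=None):
        """Store content and metadata under a newly minted identifier."""
        obj = DigitalObject(self.identify(), content, dict(metadata or {}))
        self._objects[obj.identifier] = obj
        return obj.identifier

    def retrieve(self, identifier):
        """Return the stored object; raises KeyError if the identifier is unknown."""
        return self._objects[identifier]

    def export_object(self, identifier):
        """Serialize an object to a JSON string (content is hex-encoded)."""
        obj = self.retrieve(identifier)
        return json.dumps({
            "identifier": obj.identifier,
            "content": obj.content.hex(),
            "metadata": obj.metadata,
        })

    def import_object(self, serialized):
        """Load an object exported elsewhere, keeping its original identifier."""
        data = json.loads(serialized)
        obj = DigitalObject(
            data["identifier"], bytes.fromhex(data["content"]), data["metadata"]
        )
        self._objects[obj.identifier] = obj
        return obj.identifier
```

Anything beyond these methods (workflows, search, rendering) is functionality layered on top of the repository rather than part of it.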

Current implementations of repository and digital asset management system (DAMS) software do support these functions, but they also delve into other areas. DSpace (http://dspace.org), for example, supports item–level ingestion workflows on the provider side and searching and browsing on the consumer side, so on its surface it feels more like a publishing system than a repository. It is relatively easy for end users to ingest objects into DSpace, and DSpace has a security model that supports this. DSpace presents retrieved objects exactly as they were ingested: if you store a PowerPoint presentation, then a PowerPoint presentation is what will be returned, meaning that the document consumer will need a software application that understands this format.

Fedora (http://fedora.info), on the other hand, provides more facilities for acting on data (including data stored outside of its own local storage). These facilities are often used for presentation and interaction with the end user, but they are useful for many other, often more powerful, functions. Barriers to ingestion, however, are higher with Fedora. The Fedora security model is geared more toward repository managers ingesting collections and end users accessing them.

Both Fedora and DSpace implement an Open Archives Initiative Protocol for Metadata Harvesting (OAI–PMH) Data Provider, allowing metadata to be harvested from these systems by an OAI Service Provider. Other digital asset management products, both commercial and open source, similarly implement functionality beyond what would be considered basic to a repository.
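To make the harvesting relationship concrete, here is a minimal sketch of the Service Provider side: building an OAI–PMH `ListRecords` request and reading identifiers and Dublin Core titles out of a response. The base URL and the sample response are invented for illustration, and the sketch ignores error responses, deleted records, and other parts of the protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH data provider."""
    if resumption_token:
        # When continuing a list, resumptionToken is sent instead of metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return ([(identifier, [titles]), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        titles = [title.text for title in record.iter(DC_NS + "title")]
        records.append((identifier, titles))
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return records, (token or None)


# Invented sample response, trimmed to the elements the sketch reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""
```

Because the request and response formats are fixed by the protocol, the same harvester code can, in principle, be pointed at either a DSpace or a Fedora data provider by changing only the base URL.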

The challenge here will be evaluating the strengths and weaknesses of the various systems and selecting one that satisfies or most closely matches your goals. In some cases, however, it might be useful to implement more than one repository or to integrate complementary systems to achieve needed functionality. For example, one approach that shows promise is an implementation that combines Fedora and DSpace. Storage and end–user ingestion would be implemented in DSpace, since the barrier to ingestion is lower. Access interfaces might then be implemented using Fedora, which could then render the content stored in DSpace.

Regardless of the system you choose, simply getting digital content into an organized repository is a good first step toward being able to manage and preserve it.

Preservation

Issues of preservation are a source of great confusion for many. Creating a preservation program is a challenging process, involving strategy development, planning, documentation, and the integration of multiple services. The Digital Preservation pre–conference, presented before the main Web–Wise 2004 conference by Priscilla Caplan and Robin Dale, provided a glimpse of that. Yet there are some who believe that by simply placing their content into a repository, they have "digital preservation."

Neither Fedora nor DSpace provides digital preservation per se, but both provide facilities that could be used to support it. For example, DSpace provides a mechanism for indicating the levels of format support for bitstream types. The exact meaning of these levels must be defined for each implementation; there are no system–enforced semantics, and DSpace currently does not perform any action based on these values. The mechanism is simply a tool to communicate the level of support to users who are ingesting content.
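As a rough illustration of what "defining the meaning for each implementation" might look like, the sketch below pairs a format registry with a locally written policy statement for each support level. The level names, the policy wording, and the registry contents are assumptions made for this example; they are not taken from DSpace's code or configuration.

```python
from enum import Enum


class SupportLevel(Enum):
    """Hypothetical support levels; the labels are assumptions."""
    SUPPORTED = "supported"
    KNOWN = "known"
    UNSUPPORTED = "unsupported"


# Local policy: the meaning one institution might attach to each level.
LOCAL_POLICY = {
    SupportLevel.SUPPORTED: "We will keep this format usable, migrating it if necessary.",
    SupportLevel.KNOWN: "We recognize this format and will preserve the bits as deposited.",
    SupportLevel.UNSUPPORTED: "We will preserve the bits only; future usability is not promised.",
}

# Hypothetical registry mapping MIME types to support levels.
FORMAT_REGISTRY = {
    "application/pdf": SupportLevel.SUPPORTED,
    "application/vnd.ms-powerpoint": SupportLevel.KNOWN,
}


def ingestion_notice(mime_type):
    """Return the message shown to a depositor for a given format."""
    level = FORMAT_REGISTRY.get(mime_type, SupportLevel.UNSUPPORTED)
    return f"{mime_type}: {level.value}. {LOCAL_POLICY[level]}"
```

Note that nothing here preserves anything: the code only communicates a commitment, which the institution must then honor through its own preservation program.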

Identifiers

As a rule, digital objects need identifiers. But objects deemed worthy of preservation need an identifier that will be useful over the period that the object will be preserved. There are several schemes upon which such persistent identifiers can be built. For example, DSpace uses the Handle System (Sun et al., 2003) from the Corporation for National Research Initiatives (CNRI) and includes services that specifically depend upon and support handles as identifiers. Fedora allows any identifier that complies with the Uniform Resource Name (URN) specification (Moats, 1997), but otherwise imposes no semantics.
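Since URN compliance is purely a matter of syntax, it can be checked mechanically. The following is a simplified sketch loosely based on the grammar in RFC 2141 ("urn:", a namespace identifier, ":", and a namespace-specific string); the function name `parse_urn` is hypothetical, and the pattern should not be read as a complete or authoritative validator.

```python
import re

# Simplified pattern: "urn:" NID ":" NSS
# NID: a letter or digit followed by up to 31 letters, digits, or hyphens.
# NSS: letters, digits, a fixed set of punctuation, or %-encoded octets.
URN_PATTERN = re.compile(
    r"urn:"
    r"(?P<nid>[A-Za-z0-9][A-Za-z0-9-]{0,31}):"
    r"(?P<nss>(?:[A-Za-z0-9()+,\-.:=@;$_!*']|%[0-9A-Fa-f]{2})+)",
    re.IGNORECASE,
)


def parse_urn(candidate):
    """Return (nid, nss) if the string looks like a URN, otherwise None."""
    match = URN_PATTERN.fullmatch(candidate)
    if match is None:
        return None
    nid = match.group("nid").lower()  # the NID is case-insensitive
    if nid == "urn":
        return None  # "urn" is reserved and may not be used as a NID
    return nid, match.group("nss")
```

A check like this confirms only that an identifier is well formed. As noted above, it says nothing about what the identifier means or whether it can be resolved.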

Some identifier schemes, such as handles and Archival Resource Keys (ARKs) (Kunze, 2003), permit access to additional information about themselves. Such facilities might be useful for communicating policies associated with the identifier (e.g., how long the identifier is guaranteed to be resolvable) or policies of the associated digital object. These attributes can be both human– and machine–readable.

Unfortunately, missing at this time is a global resolver service that can process and redirect identifier queries independent of scheme. Therefore, each user or process that needs to resolve an identifier must know, a priori, which service should be used to dereference that identifier. This needs to be part of the infrastructure.
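The a priori knowledge described above amounts to a lookup table that every client must carry for itself. The sketch below shows one such table and a dispatcher that turns a scheme-prefixed identifier into a resolution URL. The "scheme:value" convention, the function name, and the ARK resolver address are assumptions for illustration; only the handle proxy address appears elsewhere in this paper.

```python
# Each client must maintain a table like this for itself.
RESOLVERS = {
    "hdl": "http://hdl.handle.net/{value}",
    # Hypothetical address: stands in for whichever service resolves ARKs locally.
    "ark": "http://resolver.example.org/ark:{value}",
}


def resolution_url(identifier, resolvers=RESOLVERS):
    """Map 'scheme:value' to the URL of the service that can dereference it."""
    scheme, separator, value = identifier.partition(":")
    if not separator or not value:
        raise ValueError(f"not a scheme-prefixed identifier: {identifier!r}")
    template = resolvers.get(scheme.lower())
    if template is None:
        # No global resolver exists to fall back on.
        raise LookupError(f"no known resolver for scheme {scheme!r}")
    return template.format(value=value)
```

A shared, scheme-independent resolver would move this table out of every client and into the infrastructure, where it could be maintained once.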

Open systems and standards

When I want to plug in my telephone in another room, or take it with me when I move, I can do so because of the standardization of telephone system interfaces. Ideally, it should be just as simple to move content and to plug in new services. In addition, it should be easy to move content from one system to another without needing to know much about the receiving system. As a survivor of a library management system migration, I can tell you that this is not always the case — even when one has a lot of information about all of the systems involved.

It should be equally simple to discover content with certain attributes (e.g., a particular bitstream format) and access it in order to perform functions like migration to new formats. In order for this to scale in useful ways, it would be ideal if the applications that perform these activities could be developed once, yet used for many different repositories and services. Standards can help in this regard. However, if a standard is too flexible and there is not agreement about implementation details, then problems remain.
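The "develop once, use with many repositories" idea can be sketched as a migration routine written against a small shared interface rather than against any particular product. Everything here is hypothetical: the interface `FormatAwareRepository`, its three methods, and the toy `DictRepository` are assumptions for illustration, not an existing standard.

```python
from typing import Callable, Iterable, List, Protocol


class FormatAwareRepository(Protocol):
    """Hypothetical minimal interface a migration tool would rely on."""

    def find_by_format(self, mime_type: str) -> Iterable[str]: ...
    def read(self, identifier: str) -> bytes: ...
    def write(self, identifier: str, content: bytes, mime_type: str) -> None: ...


def migrate(
    repo: FormatAwareRepository,
    source_format: str,
    target_format: str,
    convert: Callable[[bytes], bytes],
) -> List[str]:
    """Convert every object in source_format; return the identifiers migrated."""
    migrated = []
    # Copy the result first so writes cannot disturb the iteration.
    for identifier in list(repo.find_by_format(source_format)):
        repo.write(identifier, convert(repo.read(identifier)), target_format)
        migrated.append(identifier)
    return migrated


class DictRepository:
    """Toy repository backed by a dict: identifier -> (content, mime_type)."""

    def __init__(self, objects):
        self.objects = dict(objects)

    def find_by_format(self, mime_type):
        return [i for i, (_, fmt) in self.objects.items() if fmt == mime_type]

    def read(self, identifier):
        return self.objects[identifier][0]

    def write(self, identifier, content, mime_type):
        self.objects[identifier] = (content, mime_type)
```

The `migrate` function never needs to know which system it is talking to. That is only true, however, if the systems agree on the details of the interface, which is exactly where an overly flexible standard falls short.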

As an example, both Fedora and the not–yet–released version 1.2 of DSpace support import and export of METS, but each has a specific profile that it supports. It’s not clear that they will be compatible and it seems likely that they won’t be.

Still, open systems and standards often ease the integration of other applications with the repository. Finally, open interfaces allow you to reduce dependencies on a third party and allow migration to different systems.

Activity at Johns Hopkins

In the Sheridan Libraries at Johns Hopkins University, we are evaluating multiple systems, building some of our own, and choosing the appropriate components based on our users, collections, and service needs. We are currently (or will be in the not too distant future) developing, evaluating, or implementing the following applications: DSpace, Fedora, and WebWare (http://www.webwarecorp.com/) (repository/DAMS); EPrints (http://eprints.org), Virginia Tech ETD–db software (http://scholar.lib.vt.edu/ETD-db/), DiVA (Müller et al., 2003), and LOCKSS (http://lockss.stanford.edu/) (scholarly communication and publishing); Coursework (http://getcoursework.stanford.edu/), Claroline (http://www.claroline.net/), and FLE (http://fle.uiah.fi/) (pedagogy/virtual learning environments); and Gamera (http://dkc.mse.jhu.edu/gamera/), ANAC (Warner and Brown, 2001; DiLauro et al., 2001), and SCALE (http://dkc.jhu.edu/scale_project.html) (tools and services). The publishing and virtual learning applications will need to interface with one or more repositories to support storage and retrieval of digital objects. Tools and services will act on objects in a repository, or perhaps be part of an ingestion or retrieval workflow. Given the level of integration required, it should be clear that we have a strong interest in better interoperability with the infrastructure.

Unbundle, then integrate

Digital libraries are, at the highest level, similar to their traditional counterparts. As our Director and Dean of University Libraries, Winston Tabb, has said, they are each built through the successful marrying of collections and services through a supporting infrastructure.

Unfortunately, many of these services are not available discretely, but are bundled up in monolithic (sometimes called "integrated") systems. Sometimes interfaces to these internal or underlying services do exist, but they are implemented so differently that custom programming is necessary for each pairing of services. As the number of services grows, the number of required interfaces grows much faster: roughly with the square of the number of services, if every pair must be connected.
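A little arithmetic shows how quickly custom pairings add up. The sketch below assumes the worst case, in which every service must talk to every other service through its own custom interface, and compares it with the case in which each service exposes one consistent interface. Both function names are made up for this example.

```python
def pairwise_interfaces(service_count):
    """Custom interfaces needed if every pair of services is wired separately."""
    return service_count * (service_count - 1) // 2


def shared_interfaces(service_count):
    """Interfaces needed if each service exposes one common interface."""
    return service_count


for n in (3, 5, 10, 20):
    print(n, pairwise_interfaces(n), shared_interfaces(n))
```

With 10 services, for example, separate wiring calls for 45 custom interfaces, while a common interface calls for 10.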

To some extent it doesn’t matter if these larger systems remain intact, as long as their internal services can be exposed in a consistent and appropriately constrained manner. This paradigm might be described as "unbundle, then interface."

Combined with trusted digital repository certification, improved interoperability would enable institutions to participate in the digital arena without requiring that each of them implement all solutions locally. Service provider organizations — commercial or consortial — would be able to provide some or all digital library services for organizations that lack the desire and/or the resources to do it themselves.

Regardless of whether support is delegated, we need to impress upon our system designers (vendors, open source projects, and local developers) that we want access to these interfaces so that we can realize the true potential of our digital infrastructure.

Conclusion

Building a digital infrastructure is a complex process, involving requirements analysis, careful planning, and well-informed choices. As with developing a preservation program, it is much more complicated than simply installing some software and loading content. The issues discussed here are only a subset of those that you will encounter, but hopefully they will provide some useful guidance as you plan for your own environment. The payoff is certainly worth the effort.

About the Author

Tim DiLauro is the Deputy Director of the Digital Knowledge Center of the Sheridan Libraries at Johns Hopkins University.
E–mail: timmo@jhu.edu

References

Tim DiLauro, Golam Sayeed Choudhury, Mark Patton, James Warner, and Elizabeth Brown, 2001. "Automated Name Authority Control and Enhanced Searching in the Levy Collection," D–Lib Magazine, volume 7, number 4, at http://www.dlib.org/dlib/april01/dilauro/04dilauro.html.

John A. Kunze, 2003. "Towards electronic persistence using ARK identifiers," Third ECDL Workshop on Web Archives, Trondheim, Norway (21 August), at http://bibnum.bnf.fr/ecdl/2003/, accessed 21 April 2004.

Ryan Moats, 1997. "URN Syntax," RFC 2141 (May), at http://www.ietf.org/rfc/rfc2141.txt, accessed 14 April 2004.

Eva Müller, Uwe Klosa, Stefan Andersson, and Peter Hansson, 2003. "The DiVA Project - Development of an Electronic Publishing System," D–Lib Magazine, volume 9, number 11, at http://www.dlib.org/dlib/november03/muller/11muller.html, accessed 15 February 2004.

Sam Sun, Larry Lannom, and Brian Boesch, 2003. "Handle System Overview," RFC 3650 (November), at http://hdl.handle.net/4263537/4069, accessed 14 April 2004.

James Warner and Elizabeth Brown, 2001. "Automated Name Authority Control," Proceedings of the First ACM/IEEE Joint Conference on Digital Libraries (JCDL), Roanoke, Virginia (24–28 June), pp. 21–22.

Editorial history

Paper received 21 April 2004; accepted 25 April 2004.

Copyright ©2004, First Monday

Copyright ©2004, Tim DiLauro

Choosing the components of a digital infrastructure by Tim DiLauro
First Monday, volume 9, number 5 (May 2004),
URL: http://firstmonday.org/issues/issue9_5/dilauro/index.html