|Please note: This document is kept here for historical purposes, but it has not been kept up to date. Also, you may also like to see another document frozen in time, "Specifications for Metadata Processing Tools" (PDF)|
Roy Tennant, California Digital Library *
Last text revision 14 May 2004
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH, http://www.openarchives.org/) specifies a method for digital repositories (also called "data providers") to expose metadata about their objects for harvesting by aggregators (also called "service providers"). Metadata is exposed via "sets," or collections of metadata that data providers decide to make available for harvesting. Service providers harvest sets from data providers of interest, and provide search services for the resulting collections of metadata (for a good example of a service provider, see http://www.oaister.oclc.org/). Data providers also decide which metadata formats to expose for harvesting, beyond the one required data format of simple Dublin Core (see http://dublincore.org/).
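To make the harvesting mechanics concrete, the request side of the protocol can be sketched in a few lines of Python. Every OAI-PMH request is an ordinary HTTP GET against a repository's base URL with a "verb" argument; the repository URL and set name below are hypothetical, for illustration only.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    oai_dc (simple Dublin Core) is the one metadata format every
    data provider is required to support.
    """
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        args["set"] = set_spec  # harvest a single set, if the provider defines any
    return base_url + "?" + urlencode(args)

# Hypothetical repository and set name.
url = list_records_url("http://example.org/oai", set_spec="diaries")
```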
The OAI-PMH is relatively new, and both data and service providers are still learning the best methods for exposing metadata for harvesting and gathering that metadata into centralized search services. This paper attempts to outline some of the major harvesting issues that service providers have discovered, and outlines some suggested solutions and next steps for data providers, service providers, and those involved with revising and extending the Open Archives Initiative infrastructure. For additional issues specific to service providers, see "Service Provider Issues" by Kat Hagedorn.
The OAI-PMH was specifically designed to make participation relatively easy for data providers, so that those with useful content to share can comply with the protocol with as little pain as possible. As stated in an official OAI document, "The OAI-PMH is designed to provide a low barrier to implementation for repositories and this means that in places burden has been placed on harvesters in order to simplify repository implementation." This is why, for example, simple Dublin Core is the only metadata schema required. But this low barrier does not preclude a much higher ceiling, and the OAI-PMH specifically allows the use of much richer metadata schemes. The fact that few data providers are taking advantage of this flexibility may be at least partly due to the relatively recent development of OAI-compliant systems, the small number of OAI service providers, and the relative inexperience of most of those involved.
As service providers gain more experience in OAI harvesting, and run into some of the problems identified below, pressure may build on data providers to expose richer sets of metadata. However, as the OAI directorate intended, service providers will likely continue to shoulder a greater burden and responsibility to perform the types of normalization and transformation routines that are touched on in the final section of this paper.
There are presently no guidelines for data providers regarding set specification. In the absence of guidelines, data providers create sets that may make sense only for their particular institution, and that make no sense whatsoever to a service provider attempting to provide discovery services across a group of data providers. Several specific set difficulties have been identified.
Most data providers are not exposing the richest metadata format possible -- most expose only simple Dublin Core. This presents a number of problems:
Artifacts are anachronistic practices inherited from other metadata schemas or from earlier uses to which the metadata was put. For example, at least one data provider supplies Dublin Core <title> elements with "[electronic resource]" as part of the title. This is a hold-over from MARC, and outside the context of library catalogs (and, one could argue, even within it) this information is at best misplaced. Metadata from other providers can be found to include HTML markup such as "<br>" for a line break -- clearly a hold-over from some user display requirement.
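Service providers typically scrub such artifacts before indexing. A minimal Python sketch, assuming only the two artifact types mentioned above (the MARC-derived "[electronic resource]" note and stray HTML markup):

```python
import re

def strip_artifacts(value):
    """Scrub two artifact types: the MARC-derived '[electronic resource]'
    note in titles, and stray HTML markup such as '<br>' left over from
    display formatting."""
    value = value.replace("[electronic resource]", "")
    value = re.sub(r"<[^>]+>", " ", value)      # drop any HTML tags
    return re.sub(r"\s+", " ", value).strip()   # collapse leftover whitespace
```

For example, `strip_artifacts("A diary [electronic resource]")` yields `"A diary"`.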
Object Level -- Some data providers expose individual metadata records for the components of a single intellectual object. For example, the individual digitized pages of a diary may be represented by individual metadata records at one data provider, and by a single "set record" at another. Reconciling these differences presents particular problems for service providers.
Element Level -- Unqualified Dublin Core provides opportunities for all sorts of disasters. One such disaster is the loss of granularity. A specific example is "dumbing down" the various constituent components of a personal name into one unstructured field. Once granularity is lost, it can be extremely difficult to recover it unless the records are consistent and can be parsed by a reasonably competent algorithm.
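When records are consistent, recovering name granularity can be as simple as splitting on the "Surname, Forename" convention. A hedged Python sketch of such parsing (the output field names are illustrative, not drawn from any standard):

```python
def split_personal_name(creator):
    """Recover surname/forename granularity from an unstructured DC
    creator value -- safe only when records consistently follow the
    'Surname, Forename' convention."""
    if ", " in creator:
        surname, forename = creator.split(", ", 1)
        return {"surname": surname, "forename": forename}
    return {"name": creator}  # inconsistent record: leave unparsed
```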
The Dublin Core "date" element is a prime example of the challenges service providers face with variant encoding practices. Even among only five data providers, a wide assortment of incompatible date entries has been discovered.
Such wide variability in the method of encoding dates may render this field unusable for searching, unless service providers can create well-crafted normalization routines. So far, however, even such experienced service providers as the Core Integration team of the National Science Digital Library do only simple transformations of the date field (i.e., from CCYY, CCYY-MM, and CCYY-MM-DD to the W3C Date and Time Format).
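A normalization routine of the sort described might, under the assumption of a few known input patterns, look like the following Python sketch (the list of formats is illustrative; real harvested dates are far messier):

```python
import re
from datetime import datetime

# Illustrative input patterns only; real harvested dates vary far more.
KNOWN_FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%d %B %Y", "%m/%d/%Y"]

def to_w3cdtf(raw):
    """Normalize one harvested DC date value to the W3C Date and Time
    Format (YYYY, YYYY-MM, or YYYY-MM-DD); return None when the value
    cannot be parsed, so it can be flagged for manual review."""
    raw = raw.strip()
    if re.fullmatch(r"\d{4}", raw) or re.fullmatch(r"\d{4}-\d{2}", raw):
        return raw  # CCYY and CCYY-MM are already valid W3CDTF
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```

Returning None rather than guessing lets the service provider route unparseable values to manual review instead of corrupting the index.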
Service providers will be required in almost all cases to at least normalize the harvested metadata, if not perform more sophisticated functions such as adding fields (e.g., the source of the data), increasing the metadata granularity (e.g., splitting personal names into their constituent parts), or qualifying fields (e.g., specifying that a subject term comes from a specific vocabulary).
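As one illustration of the field-adding case, a short Python sketch that stamps each harvested record with its data provider (the `source_repository` field name is hypothetical):

```python
def add_provenance(record, provider_id):
    """Stamp a harvested record with its data provider so that search
    results can be traced back to the source repository.  The
    'source_repository' field name is hypothetical."""
    enriched = dict(record)  # leave the harvested record untouched
    enriched["source_repository"] = provider_id
    return enriched
```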
Some of these transformations can be accomplished without consulting the data provider (e.g., date normalization), but in a number of cases service providers will find it fruitful to communicate with the data provider. Even if the data provider cannot expose a richer set of metadata, the service provider may at least be able to discover information that would be useful in a metadata transformation process. Therefore, an iterative process for setting up a repository for harvesting is proposed.
The process begins with an initial prototype harvest, which leads to analysis of the metadata and specification of a harvesting profile and required metadata transformations (see Figure 1). The Core Integration Team of the National Science Digital Library has found software applications such as Microsoft Excel and SpotFire DecisionSite to be useful in visualizing the harvested metadata for spotting anomalies and characteristics.
Figure 1. Metadata analysis, creation of harvest profile and metadata transformation process
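Short of a dedicated tool such as Excel or SpotFire, even a simple tally of the distinct values of a field across a prototype harvest can expose anomalies. A plain-Python sketch of that analysis step (the sample records are hypothetical):

```python
from collections import Counter

def profile_field(records, field):
    """Tally distinct values of one metadata field across a prototype
    harvest -- useful for spotting anomalies such as variant date
    encodings or missing values."""
    return Counter(r.get(field, "(missing)") for r in records)

# Hypothetical prototype-harvest sample.
sample = [{"date": "2004"}, {"date": "May 2004"}, {"date": "2004"}, {}]
date_counts = profile_field(sample, "date")
```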
The second stage consists of subsetting activities -- operations required to select the appropriate set of records for a given search service (see Figure 2).
Figure 2. Creation of subsetting profile and procedures
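In the simplest case, subsetting reduces to filtering harvested records against a selection criterion. A Python sketch with a hypothetical "images only" criterion:

```python
def subset(records, keep):
    """Select only the records appropriate to a given search service,
    per the subsetting profile; 'keep' is the selection criterion."""
    return [r for r in records if keep(r)]

# Hypothetical criterion: an image-oriented service keeps only images.
images = subset(
    [{"type": "image"}, {"type": "text"}, {"type": "image"}],
    lambda r: r.get("type") == "image",
)
```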
The final stage is the establishment of a production harvesting service, which uses the created profile and the established subsetting and transformation processes to create the appropriate set of harvested and transformed metadata (see Figure 3).
Figure 3. The production harvesting process.
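The production stage can be viewed as a small pipeline: apply the established transformations to each record, then apply the subsetting profile. A Python sketch under that assumption (all function and field names are illustrative):

```python
def production_harvest(records, transforms, keep):
    """Run the established transformation steps over each harvested
    record, then apply the subsetting profile -- the repeatable core
    of the production harvesting stage."""
    out = []
    for rec in records:
        for transform in transforms:
            rec = transform(rec)
        if keep(rec):
            out.append(rec)
    return out

def clean_title(rec):
    """One illustrative transform: scrub an HTML artifact from titles."""
    rec = dict(rec)
    rec["title"] = rec["title"].replace("<br>", " ").strip()
    return rec

result = production_harvest(
    [{"title": "A diary<br>", "type": "image"}, {"title": "Notes", "type": "text"}],
    [clean_title],
    lambda rec: rec["type"] == "image",
)
```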
Properly accomplished, at least some aspects of the production harvesting process could be redeployed to perform metadata transformations for other metadata ingest activities. Libraries will increasingly be called upon to accept, process, and make available metadata in a variety of formats (e.g., ONIX), and the more skilled we become at ingesting and processing metadata, the better equipped we will be to serve our clientele.
 Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting -- Guidelines for Harvester Implementers <http://www.openarchives.org/OAI/2.0/guidelines-harvester.htm>.
 "NSDL 'Safe' Transforms," NSDL Metadata Primer, <http://metamanagement.comm.nsdlib.org/safeXform.html>. See < http://www.w3.org/TR/NOTE-datetime> for the W3C Date and Time Format.
Note: This paper has been placed in the public domain in honor of Michael Hart, founder of Project Gutenberg. 8 September 2011