roytennant.com
Please note: This document is kept here for historical purposes, but it has not been kept up to date. You may also like to see another document frozen in time, "Specifications for Metadata Processing Tools" (PDF)

Bitter Harvest: Problems & Suggested Solutions for
OAI-PMH Data & Service Providers

Roy Tennant, California Digital Library * roy.tennant@ucop.edu
Last text revision 14 May 2004

Background

 

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH, http://www.openarchives.org/) specifies a method for digital repositories (also called "data providers") to expose metadata about their objects for harvesting by aggregators (also called "service providers"). Metadata is exposed via "sets," or collections of metadata that data providers decide to make available for harvesting. Service providers harvest sets from data providers of interest, and provide search services for the resulting collections of metadata (for a good example of a service provider, see http://www.oaister.oclc.org/). Data providers also decide which metadata formats to expose for harvesting, beyond the one required data format of simple Dublin Core (see http://dublincore.org/).
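To make the protocol concrete, the sketch below builds an OAI-PMH ListRecords request URL and extracts simple Dublin Core titles from a response. The base URL and the trimmed-down sample response are hypothetical illustrations only; real repositories publish their own base URLs, and real responses carry record headers, resumption tokens, and much more.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

# Namespaces defined by the OAI-PMH and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def extract_titles(response_xml):
    """Pull dc:title values out of a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [t.text for t in root.iterfind(".//dc:title", NS)]

# A drastically trimmed, hypothetical response for illustration:
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>A Civil War Diary</dc:title>
    </oai_dc:dc>
  </metadata></record></ListRecords></OAI-PMH>"""
```

For example, list_records_url("http://example.org/oai", set_spec="diaries") yields a URL ending in "verb=ListRecords&metadataPrefix=oai_dc&set=diaries", and extract_titles(SAMPLE) returns ["A Civil War Diary"].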

 

The OAI-PMH is relatively new, and both data and service providers are still learning the best methods for exposing metadata for harvesting and gathering that metadata into centralized search services. This paper attempts to outline some of the major harvesting issues that service providers have discovered, and outlines some suggested solutions and next steps for data providers, service providers, and those involved with revising and extending the Open Archives Initiative infrastructure. For additional issues specific to service providers, see "Service Provider Issues" by Kat Hagedorn[1].

 

The OAI-PMH was specifically designed to make participation relatively easy for data providers, so that those with useful content to share can experience as little pain as possible to comply with the protocol. As stated in an official OAI document, "The OAI-PMH is designed to provide a low barrier to implementation for repositories and this means that in places burden is placed on harvesters in order to simplify repository implementation."[2] This is why, for example, simple Dublin Core is the only metadata schema required. But this low barrier does not preclude a much higher ceiling, and the OAI-PMH specifically allows the use of much richer metadata schemes. The fact that few data providers are taking advantage of this flexibility may be at least partly due to the relatively recent development of OAI-compliant systems, the small number of OAI service providers, and the relative inexperience of most of those involved.

 

As service providers gain more experience in OAI harvesting, and run into some of the problems identified below, pressure may build on data providers to expose richer sets of metadata. However, as the OAI directorate intended, service providers will likely continue to shoulder a greater burden and responsibility to perform the types of normalization and transformation routines that are touched on in the final section of this paper.

Harvesting Issues

Sets

There are presently no guidelines for data providers regarding set specification. In the absence of such guidelines, data providers create sets that may make sense only within their own institution, and that make no sense whatsoever to a service provider attempting to provide discovery services across a group of data providers. Some specific set difficulties are:

 

 

Suggested Solutions for Data Providers

Provide a method for service providers to create dynamic sets; see <http://www.cdlib.org/inside/diglib/repository/harvest/> for an example

Suggested Solutions for Service Providers

Build tools for post-harvest subsetting (i.e., tools to identify clusters of related objects that can be extracted for indexing and display)

Suggested Changes to the OAI Infrastructure

Provide a method for data providers to make available richer descriptions of sets for service providers

Provide methods and/or guidelines for specifying sets from different viewpoints; e.g., sets based on format (text, image, etc.), sets based on topic area, etc.

Metadata Issues

Simple DC is Too Simple

Most data providers are not exposing the richest metadata format possible -- most expose only simple Dublin Core. This presents a number of problems:

 

 

Suggested Solutions for Data Providers

Provide metadata in a variety of formats; at minimum, the required Dublin Core and the richest, most granular form of metadata available

Suggested Solutions for Service Providers

Check for richer metadata formats than Dublin Core (e.g., the Library of Congress makes MODS and MARC available)

Create and use metadata normalization and enhancement routines

Suggested Changes to the OAI Infrastructure

Encourage OAI data providers to make metadata available in schemes richer than unqualified Dublin Core

 

Metadata Artifacts

Artifacts are anachronistic practices that have been inherited from other metadata schemas or uses to which the metadata was put. For example, at least one data provider provides Dublin Core <title> elements with "[electronic resource]" as part of the title. This is a hold-over from MARC, and outside the context of library catalogs (and one could argue even within the context of library catalogs) this information is, at least, misplaced. Metadata from other providers can be found to include HTML markup such as "<br>" for a line break -- clearly a hold-over from some user display requirement.
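A minimal sketch of the kind of scrubbing routine a service provider might apply; the function name and rules are illustrative, covering only the two artifacts mentioned above:

```python
import re

def clean_title(value):
    """Strip a MARC-style qualifier and stray HTML markup from a title.

    Illustrative only: real normalization routines would handle many
    more artifacts than these two.
    """
    value = value.replace("[electronic resource]", "")
    value = re.sub(r"<[^>]+>", " ", value)      # strip HTML tags such as <br>
    return re.sub(r"\s+", " ", value).strip()   # collapse leftover whitespace
```

For example, clean_title("A Diary [electronic resource]<br>Volume 1") returns "A Diary Volume 1".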

 

Suggested Solutions for Data Providers

Provide metadata that is free of any anachronistic data elements; i.e., "wash" the metadata of elements that are useless or problematic outside of the home context

Suggested Solutions for Service Providers

Create and use metadata normalization and enhancement routines

Granularity

Object Level -- Some data providers expose individual metadata records for components of one intellectual object. For example, the individual digitized pages of a diary may be represented by individual metadata records by one data provider, and represented by one "set record" by another. Reconciling these differences presents particular problems for service providers.
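One way a service provider might collapse component records is to group them by a shared identifier stem. The "/page-N" suffix convention below is purely hypothetical; in practice each repository's identifier pattern would have to be discovered during metadata analysis.

```python
from collections import defaultdict

def collapse_components(records):
    """Group per-page records into one record per intellectual object.

    Assumes a hypothetical convention in which component records share
    an identifier stem before a "/page-N" suffix; real repositories
    vary widely, so the grouping rule must be tailored per provider.
    """
    groups = defaultdict(list)
    for rec in records:
        stem = rec["identifier"].split("/page-")[0]
        groups[stem].append(rec)
    # Keep the first member's title and note how many parts were merged.
    return [
        {"identifier": stem, "title": members[0]["title"], "parts": len(members)}
        for stem, members in groups.items()
    ]
```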

 

Suggested Solutions for Data Providers

Unless a "typical" user would be happy to discover an individual component of a larger work by itself, consider collapsing records for one intellectual object into one record

Suggested Solutions for Service Providers

Create and use metadata normalization and enhancement routines; in this case, routines to discover and rectify multiple records for the same intellectual object

Suggested Changes to the OAI Infrastructure

Work toward a common understanding of the appropriate granularity for an object-level record

 

Element Level -- Unqualified Dublin Core provides opportunities for all sorts of disasters. One such disaster is the loss of granularity. A specific example is "dumbing down" the various constituent components of a personal name into one unstructured field. Once granularity is lost, it can be extremely difficult to recover it unless the records are consistent and can be parsed by a reasonably competent algorithm.
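As a sketch of such a recovery algorithm, the function below re-parses a flattened name, assuming the common "Surname, Forename, dates" inverted form; real data is far messier, and the routine deliberately gives up rather than guess.

```python
import re

def parse_name(flat):
    """Recover name components from a flattened personal-name string.

    Assumes the inverted "Surname, Forename(s), dates" convention;
    anything that does not match is returned unparsed rather than
    mis-split.
    """
    m = re.match(r"^([^,]+),\s*([^,]+?)(?:,\s*([\d?\s.-]+))?$", flat.strip())
    if not m:
        return {"unparsed": flat}
    surname, forename, dates = m.groups()
    parts = {"surname": surname, "forename": forename}
    if dates:
        parts["dates"] = dates.strip()
    return parts
```

For example, parse_name("Twain, Mark, 1835-1910") yields separate surname, forename, and dates fields, while parse_name("Mark Twain") is passed through untouched.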

 

Suggested Solutions for Data Providers

Expose the richest, most granular form of metadata possible

Suggested Solutions for Service Providers

Request that the data provider expose the richest, most granular form of metadata possible; create and use metadata normalization and enhancement routines

 

Encoding Variances

The Dublin Core "date" element is a prime example of the challenges service providers face with variant encoding practices. Even among only five data providers, the following date entries have been discovered:

 

Such wide variability in the method of encoding dates may render this field unusable for searching, unless service providers can create well-crafted normalization routines. So far, however, even such experienced service providers as the Core Integration team of the National Science Digital Library do only simple transformations of the date field (i.e., from CCYY, CCYY-MM, and CCYY-MM-DD to the W3C Date and Time Format[3]).
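A sketch of a "safe" transform of the sort described above: dates already in the W3C CCYY, CCYY-MM, or CCYY-MM-DD patterns pass through, an obvious four-digit year is salvaged from free text, and everything else is flagged for review rather than guessed at. The salvage rule and year range are illustrative assumptions, not part of any specification.

```python
import re

# The three W3C Date and Time Format patterns mentioned above.
W3C_PATTERNS = [
    re.compile(r"^\d{4}$"),                                        # CCYY
    re.compile(r"^\d{4}-(0[1-9]|1[0-2])$"),                        # CCYY-MM
    re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$"),  # CCYY-MM-DD
]

def normalize_date(raw):
    """Return a W3C-format date, a salvaged year, or None for review."""
    value = raw.strip()
    for pattern in W3C_PATTERNS:
        if pattern.match(value):
            return value
    # Illustrative fallback: salvage a plausible four-digit year, if any.
    m = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", value)
    return m.group(1) if m else None
```

For example, "1998-05" passes through, "ca. 1920" is reduced to "1920", and "undated" returns None.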

 

Suggested Solutions for Data Providers

Consider carefully the ramifications of proprietary methods of encoding dates; adopt whenever possible a common and unambiguous date encoding scheme

Suggested Solutions for Service Providers

Request that the data provider expose the richest, most granular form of metadata possible; create and use metadata normalization and enhancement routines

Suggested Changes to the OAI Infrastructure

Work toward a common understanding of the appropriate method to encode all the various ways a date may need to be expressed

 

Best Practices for Data Providers

 

Best Practices for Service Providers

 

 

A Suggested Harvesting Model for Service Providers

 

Service providers will be required in almost all cases to at least normalize the harvested metadata, if not perform more sophisticated functions such as adding fields (e.g., the source of the data), increasing the metadata granularity (e.g., splitting out personal names into their constituent parts), or qualifying fields (e.g., specifying that a subject term comes from a specific vocabulary).

 

Some of these transformations can be accomplished without conversation with the data provider (e.g., date normalization), but in a number of cases service providers will find it fruitful to communicate with the data provider. Even if the data provider cannot expose a richer set of metadata, the service provider may at least be able to discover information that would be useful in a metadata transformation process. Therefore, an iterative process for setting up a repository for harvesting is proposed.

 

The process begins with an initial prototype harvest, which leads to analysis of the metadata and specification of a harvesting profile and required metadata transformations (see Figure 1). The Core Integration Team of the National Science Digital Library has found software applications such as Microsoft Excel and SpotFire DecisionSite to be useful in visualizing the harvested metadata for spotting anomalies and characteristics[4].

 

 

Figure 1. Metadata analysis, creation of harvest profile and metadata transformation process

 

The second stage consists of subsetting activities -- operations required to select the appropriate set of records for a given search service (see Figure 2).

 

 

Figure 2. Creation of subsetting profile and procedures

 

The final stage is the establishment of a production harvesting service, which uses the created profile and the established subsetting and transformation processes to create the appropriate set of harvested and transformed metadata (see Figure 3).

 

Figure 3. The production harvesting process.

 

Properly accomplished, at least some aspects of the production harvesting process could be deployed to perform metadata transformations for other metadata ingest activities. Libraries will increasingly be called upon to accept, process, and make available metadata in a variety of formats (e.g., ONIX), and the more skilled we become at ingesting and processing metadata, the better equipped we will be to serve our clientele.



[1] Hagedorn, Kat. Service Provider (SP) Issues, <http://www.kathagedorn.com/SP_issues.html>.

[2] Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting -- Guidelines for Harvester Implementers, <http://www.openarchives.org/OAI/2.0/guidelines-harvester.htm>.

[3] "NSDL 'Safe' Transforms," NSDL Metadata Primer, <http://metamanagement.comm.nsdlib.org/safeXform.html>. See <http://www.w3.org/TR/NOTE-datetime> for the W3C Date and Time Format.

[4] Dushay, Naomi and Diane I. Hillmann, "Analyzing Metadata for Effective Use and Re-Use," 2003 Dublin Core Conference, 28 September - 2 October 2003, Seattle, WA, <http://www.siderean.com/dc2003/501_Paper24.pdf>.

Note: This paper has been placed in the public domain in honor of Michael Hart, founder of Project Gutenberg. 8 September 2011