Please note: This document is kept here for historical purposes and has not been kept up to date. You may also like to see another document frozen in time, "Specifications for Metadata Processing Tools" (PDF).
Roy Tennant, California Digital Library *
roy.tennant@ucop.edu
Last text revision 14 May 2004
The Open Archives
Initiative Protocol for Metadata Harvesting (OAI-PMH, http://www.openarchives.org/)
specifies a method for digital repositories (also called "data providers") to
expose metadata about their objects for harvesting by aggregators (also called
"service providers"). Metadata is exposed via "sets," or collections of
metadata that data providers decide to make available for harvesting. Service
providers harvest sets from data providers of interest, and provide search
services for the resulting collections of metadata (for a good example of a
service provider, see http://www.oaister.oclc.org/).
Data providers also decide which metadata formats to expose for harvesting,
beyond the one required metadata format, simple Dublin Core (see http://dublincore.org/).
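As a rough illustration of how a service provider addresses a data provider, the sketch below builds OAI-PMH request URLs for listing sets and for harvesting one set in simple Dublin Core. The verbs (ListSets, ListRecords) and arguments (metadataPrefix, set) come from the protocol; the base URL and the set name "diaries" are hypothetical, and the helper build_request is an illustrative name, not part of any library.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    # Skip arguments that were not supplied so the query string stays minimal.
    params.update({key: value for key, value in arguments.items() if value is not None})
    return base_url + "?" + urlencode(params)


# Hypothetical data provider; substitute a real repository's base URL.
BASE_URL = "http://repository.example.org/oai"

# Ask which sets the data provider exposes.
list_sets_url = build_request(BASE_URL, "ListSets")

# Harvest one (hypothetical) set in the required simple Dublin Core format.
list_records_url = build_request(
    BASE_URL, "ListRecords", metadataPrefix="oai_dc", set="diaries"
)

print(list_sets_url)
print(list_records_url)
```

A real harvester would also need to follow resumption tokens and handle protocol errors, which this sketch leaves out.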
The OAI-PMH is relatively
new, and both data and service providers are still learning the best methods
for exposing metadata for harvesting and gathering that metadata into
centralized search services. This paper outlines some of the major
harvesting issues that service providers have discovered, and suggests
solutions and next steps for data providers, service providers, and
those involved with revising and extending the Open Archives Initiative
infrastructure. For additional issues specific to service providers, see
"Service Provider Issues" by Kat Hagedorn[1].
The
OAI-PMH was specifically designed to make participation relatively easy for
data providers, so that those with useful content to share can comply with
the protocol with as little pain as possible. As stated in an official
OAI document, "The OAI-PMH is designed to provide a
low barrier to implementation for repositories and this means that in places
burden has placed on harvesters in order to simplify repository implementation."[2] This is why, for example, simple Dublin Core is
the only metadata schema required. But this low barrier does not preclude a
much higher ceiling, and the OAI-PMH specifically allows the use of much richer
metadata schemes. The fact that few data providers are taking advantage of this
flexibility may be at least partly due to the relatively recent development of
OAI-compliant systems, the small number of OAI service providers, and the
relative inexperience of most of those involved.
As service providers gain more experience in OAI harvesting, and run
into some of the problems identified below, pressure may build on data
providers to expose richer sets of metadata. However, as the OAI directorate
intended, service providers will likely continue to shoulder a greater burden
and responsibility to perform the types of normalization and transformation
routines that are touched on in the final section of this paper.
There are presently no
guidelines for data providers regarding set specification. Since no guidelines
exist, data providers create sets that may only make sense for that particular
institution, and that make no sense whatsoever to a service provider attempting
to provide discovery services for a group of data providers. Some specific set
difficulties are:
Provide
a method for service providers to create dynamic sets; see <http://www.cdlib.org/inside/diglib/repository/harvest/>
for an example
Build
tools for post-harvest subsetting (i.e., tools to identify clusters of related
objects that can be extracted for indexing and display)
Provide a method for data providers to make available richer descriptions of sets for service providers
Provide methods and/or guidelines for specifying
sets from different viewpoints; e.g., sets based on format (text, image, etc.),
sets based on topic area, etc.
Most data providers are
not exposing the richest metadata format possible -- most expose only simple
Dublin Core. This presents a number of problems:
Provide
metadata in a variety of formats; at minimum, the required Dublin Core and the
richest, most granular form of metadata available
Check
for richer metadata formats than Dublin Core (e.g., the Library of Congress
makes MODS and MARC available)
Create
and use metadata normalization and enhancement routines
Encourage OAI data providers to make metadata
available in schemes richer than unqualified Dublin Core
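Checking for richer formats can be automated with the protocol's ListMetadataFormats verb. The sketch below parses a response and picks the most preferred format on offer. The sample response is invented for illustration, and both the prefix names other than oai_dc and the preference order are assumptions, since each repository chooses its own prefixes for optional formats.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented, abbreviated ListMetadataFormats response for illustration only.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>mods</metadataPrefix>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
""".strip()


def available_prefixes(response_xml):
    """Return the metadataPrefix values listed in a ListMetadataFormats response."""
    root = ET.fromstring(response_xml)
    return [el.text.strip() for el in root.findall(".//oai:metadataPrefix", OAI_NS)]


def richest_prefix(prefixes, preference=("mods", "marc21", "oai_dc")):
    """Pick the first prefix from an assumed richest-first preference list."""
    for prefix in preference:
        if prefix in prefixes:
            return prefix
    return None


prefixes = available_prefixes(SAMPLE_RESPONSE)
chosen = richest_prefix(prefixes)
print(prefixes, chosen)
```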
Artifacts are
anachronistic practices that have been inherited from other metadata schemas or
uses to which the metadata was put. For example, at least one data provider
provides Dublin Core <title> elements with "[electronic resource]" as
part of the title. This is a hold-over from MARC, and outside the context of
library catalogs (and one could argue even within the context of library catalogs) this information
is, at least, misplaced. Metadata from other providers can be found to include
HTML markup such as "<br>" for a line break -- clearly a hold-over from
some user display requirement.
Provide
metadata that is free of any anachronistic data elements; i.e., "wash" the
metadata of elements that are useless or problematic outside of the home
context
Create and use metadata normalization and
enhancement routines
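A "washing" routine for the two artifacts described above might look like the following sketch. It handles only the "[electronic resource]" qualifier and simple HTML tags; the function name wash and the sample titles are illustrative assumptions, and a production routine would need a longer list of patterns drawn from analysis of the harvested records.

```python
import re

# MARC-style general material designation left over in a title.
GMD_PATTERN = re.compile(r"\s*\[electronic resource\]", re.IGNORECASE)

# Simple HTML tags such as <br>, <p>, </i>; deliberately requires a letter
# after "<" so that text like "x < y" is left alone.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")


def wash(value):
    """Remove display markup and catalog qualifiers from one metadata value."""
    value = TAG_PATTERN.sub(" ", value)
    value = GMD_PATTERN.sub("", value)
    # Collapse any runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", value).strip()


title = wash("Leaves of grass [electronic resource] / Walt Whitman")
description = wash("First line<br>Second line")
print(title)
print(description)
```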
Object Level -- Some data providers expose individual metadata
records for components of one intellectual object. For example, the individual
digitized pages of a diary may be represented by individual metadata records by
one data provider, and represented by one "set record" by another. Reconciling
these differences presents particular problems for service providers.
Unless
a "typical" user would be happy to discover an individual component of a larger
work by itself, consider collapsing records for one intellectual object into
one record
Create
and use metadata normalization and enhancement routines; in this case, routines
to discover and rectify multiple records for the same intellectual object
Work toward a common understanding of the
appropriate granularity for an object-level record
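One way a service provider might rectify multiple records for the same intellectual object is to group component records by a shared parent identifier. The sketch below assumes each harvested record is a dictionary and that components carry an "is_part_of" field naming their parent; that field name, the identifiers, and the function name are all hypothetical, and in practice the parent relationship often has to be inferred from titles or identifier patterns.

```python
from collections import defaultdict


def collapse_components(records, parent_key="is_part_of"):
    """Merge component records that share a parent into one object-level record."""
    standalone = []
    components = defaultdict(list)
    for record in records:
        parent = record.get(parent_key)
        if parent:
            components[parent].append(record)
        else:
            standalone.append(record)

    collapsed = []
    for parent, parts in components.items():
        collapsed.append({
            # Assumption: the parent identifier is the only object-level
            # information available; a fuller routine would add a title.
            "identifier": parent,
            "components": [part["identifier"] for part in parts],
        })
    return standalone + collapsed


harvested = [
    {"identifier": "diary-01/page-1", "is_part_of": "diary-01"},
    {"identifier": "diary-01/page-2", "is_part_of": "diary-01"},
    {"identifier": "photo-77"},
]
object_records = collapse_components(harvested)
print(object_records)
```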
Element Level -- Unqualified Dublin Core provides opportunities
for all sorts of disasters. One such disaster is the loss of granularity. A
specific example is "dumbing down" the various constituent components of a
personal name into one unstructured field. Once granularity is lost, it can be
extremely difficult to recover it unless the records are consistent and can be
parsed by a reasonably competent algorithm.
Expose
the richest, most granular form of metadata possible
Request that the data provider expose the richest,
most granular form of metadata possible; create and use metadata normalization
and enhancement routines
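Recovering name granularity is feasible only when the records follow a consistent pattern. The sketch below assumes names arrive in the inverted form "Family, Given, dates" that is common in library data; the function name and the returned field names are illustrative, and anything that does not fit the pattern is returned as None so that it can be reviewed rather than guessed at.

```python
import re

# Life dates such as "1819-1892", "1950-", or "1819?-1892".
DATES = re.compile(r"^\d{4}\??-(\d{4}\??)?$")


def split_personal_name(value):
    """Split an inverted personal name into family name, given name, and dates."""
    parts = [part.strip() for part in value.split(",") if part.strip()]
    if len(parts) < 2:
        return None  # Not in inverted form; leave for manual review.
    name = {"family": parts[0], "given": parts[1], "dates": None}
    for extra in parts[2:]:
        candidate = extra.rstrip(".")
        if DATES.match(candidate):
            name["dates"] = candidate
    return name


print(split_personal_name("Whitman, Walt, 1819-1892"))
print(split_personal_name("California Digital Library"))
```

Corporate names, names in direct order, and names containing titles or suffixes would all need additional rules.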
The Dublin Core "date"
element is a prime example of the challenges service providers face with
variant encoding practices. Even among only five data providers, the following
date entries have been discovered:
Such wide variability in
the method of encoding dates may render this field unusable for searching,
unless service providers can create well-crafted normalization routines. So
far, however, even such experienced service providers as the Core Integration
team of the National Science Digital Library do only simple transformations of
the date field (i.e., from CCYY, CCYY-MM, and CCYY-MM-DD to the W3C Date and
Time Format[3]).
Consider
carefully the ramifications of proprietary methods of encoding dates; adopt
whenever possible a common and unambiguous date encoding scheme
Request
that the data provider expose the richest, most granular form of metadata
possible; create and use metadata normalization and enhancement routines
Work toward a common understanding of the
appropriate method to encode all the various ways a date may need to be
expressed
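A conservative date normalization routine, in the spirit of the simple transformations mentioned above, could accept only values already shaped like CCYY, CCYY-MM, or CCYY-MM-DD and check that they are real calendar dates. This is a minimal sketch rather than the NSDL routine: the function name is an assumption, and every other encoding is returned as None so that it can be flagged instead of guessed.

```python
import re
from datetime import date

# CCYY, CCYY-MM, or CCYY-MM-DD, the date-only forms of the W3C format.
W3CDTF_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def normalize_date(value):
    """Return a validated W3C-style date string, or None if not recognized."""
    match = W3CDTF_DATE.match(value.strip())
    if not match:
        return None
    year, month, day = match.groups()
    try:
        # Missing month or day defaults to 1 purely for validation.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return None
    return "-".join(part for part in (year, month, day) if part)


for raw in [" 2004-05-14 ", "2004-05", "2004", "2004-13", "May 14, 2004"]:
    print(repr(raw), "->", normalize_date(raw))
```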
Service providers will be
required in almost all cases to at least normalize the harvested metadata, if
not perform more sophisticated functions such as adding fields (e.g., the
source of the data), increasing the metadata granularity (e.g., splitting out
personal names into the constituent parts), or qualifying fields (e.g.,
specifying that a subject term comes from a specific vocabulary).
Some of these
transformations can be accomplished without conversation with the data provider
(e.g., date normalization), but in a number of cases service providers will
find it fruitful to communicate with the data provider. If a richer set of
metadata cannot be exposed by the data provider, at least the service provider
may be able to discover information that would be useful in a metadata
transformation process. Therefore, an iterative process for setting up a
repository for harvesting is proposed.
The process begins with
an initial prototype harvest, which leads to analysis of the metadata and
specification of a harvesting profile and required metadata transformations
(see Figure 1). The Core Integration Team of the National Science Digital
Library has found software applications such as Microsoft Excel and SpotFire
DecisionSite to be useful in visualizing the harvested metadata for spotting
anomalies and characteristics[4].
Figure 1. Metadata analysis, creation of harvest profile and metadata
transformation process
The second stage consists
of subsetting activities -- operations required to select the appropriate set of
records for a given search service (see Figure 2).
Figure 2. Creation of subsetting profile and procedures
The final stage is the
establishment of a production harvesting service, which uses the created
profile and the established subsetting and transformation processes to create
the appropriate set of harvested and transformed metadata (see Figure 3).
Figure 3. The production harvesting process.
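The three stages can be pictured as a profile that is worked out during analysis and then applied mechanically in production. The sketch below is one possible shape for that idea, not a description of any existing system: the HarvestProfile class, its fields, and the sample records are all hypothetical, and the records are assumed to have been harvested already.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, str]


def keep_all(record: Record) -> bool:
    """Default subsetting rule: keep every record."""
    return True


@dataclass
class HarvestProfile:
    source: str  # name of the data provider, added to every record
    keep: Callable[[Record], bool] = keep_all  # subsetting rule
    transforms: List[Callable[[Record], Record]] = field(default_factory=list)


def run_production(records: List[Record], profile: HarvestProfile) -> List[Record]:
    """Apply the subsetting rule, then each transformation, then tag the source."""
    output = []
    for record in records:
        if not profile.keep(record):
            continue
        for transform in profile.transforms:
            record = transform(record)
        output.append({**record, "source": profile.source})
    return output


profile = HarvestProfile(
    source="Example Repository",
    keep=lambda record: record.get("type") == "image",
    transforms=[lambda record: {**record, "title": record["title"].strip()}],
)
harvested = [
    {"title": " Diary, page 1 ", "type": "image"},
    {"title": "Finding aid", "type": "text"},
]
result = run_production(harvested, profile)
print(result)
```

Keeping the subsetting rule and the transformations as data in the profile means the same production code can serve many data providers, each with its own profile.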
Properly accomplished, at
least some aspects of the production harvesting process could be deployed to
perform metadata transformations for other metadata ingest activities.
Libraries will increasingly be called upon to accept, process, and make
available metadata in a variety of formats (e.g., ONIX), and the more skilled
we become at ingesting and processing metadata the better service we will be
equipped to provide our clientele.
[1] Hagedorn, Kat. Service Provider (SP) Issues, <http://www.kathagedorn.com/SP_issues.html>.
[2] Implementation Guidelines for the Open Archives Initiative Protocol for Metadata
Harvesting -- Guidelines for Harvester Implementers,
<http://www.openarchives.org/OAI/2.0/guidelines-harvester.htm>.
[3] "NSDL 'Safe' Transforms," NSDL Metadata Primer, <http://metamanagement.comm.nsdlib.org/safeXform.html>.
See <http://www.w3.org/TR/NOTE-datetime>
for the W3C Date and Time Format.
[4] Dushay, Naomi and Diane I. Hillmann, "Analyzing
Metadata for Effective Use and Re-Use," 2003 Dublin Core Conference, 28 September - 2 October, 2003, Seattle, WA,
<http://www.siderean.com/dc2003/501_Paper24.pdf>.
Note: This paper has been placed in the public domain in honor of Michael Hart, founder of Project Gutenberg. 8 September 2011