Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant

Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Metadata's Bitter Harvest

07/15/2004
   I recently conducted my first harvest. Not pulling in corn or wheat but
   bibliographic records. Before long I had nearly 100,000 of them on my
   laptop, all describing free online resources held by five different
   libraries. Using the Open Archives Initiative Protocol for Metadata
   Harvesting (OAI-PMH) it was a breeze--anyone could do it with the right
   software, and there is plenty to choose from. But I could hardly
   believe the results.
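
   To make the mechanics concrete, here is a minimal sketch of such a
   harvest in Python. It is an illustration under stated assumptions, not
   a description of any particular harvesting package: it assumes a
   repository that answers the protocol's ListRecords verb with the
   oai_dc metadata prefix and pages its results with a resumptionToken.
   The names harvest and http_get and the example.org address are
   hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", fetch=http_get):
    """Yield every <record> element, following resumption tokens.

    `fetch` is injectable so the loop can be exercised without a network.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty token means the list is complete.
        if token is None or not (token.text or "").strip():
            break
        # A resumption token replaces all arguments other than the verb.
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}


# Example (hypothetical address; needs network access to run):
# for record in harvest("http://example.org/oai"):
#     print(ET.tostring(record, encoding="unicode"))
```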

   What I had was a pile of metadata problems that in hindsight I should
   have expected. Certainly those who have created union catalogs could
   predict some of the issues. Even so, union catalog efforts typically
   deal with the same type of records (MARC) using the same set of rules
   (AACR2). What I saw occurs when there is only a very simple format
   (Dublin Core) and no application rules to speak of. It was a complete
   mess.

   This mess is neither caused nor prevented by the harvesting protocol
   (OAI-PMH) and the guidelines for its use. The OAI developers
   specifically created an infrastructure with both a low threshold (a low
   barrier to implementation and use) and a high ceiling (the
   opportunity to create much richer interactions among collaborating
   institutions). It was brilliant, but it also sets up problems if the
   collaborative community of users doesn't apply a set of common
   guidelines and practices.

   OAI developers clearly expected communities to agree on how to use the
   harvesting protocol to best effect--by concurring on a richer metadata
   format for sharing, for example. The problem is that libraries have yet
   to decide these issues, although movement is beginning. But first it
   might be helpful to review some of the current metadata problems.

   Metadata woes

   Data providers (libraries with records to share) make their metadata
   available for harvesting in segments called "sets." This allows service
   providers (those who aggregate records from a variety of repositories
   for searching) to take smaller chunks rather than the entire pile of
   records. The problem is that libraries have yet to decide how to create
   logical and useful sets, and therefore data providers create whatever
   sets they wish.

   Some create sets based on item format (text, image, etc.). Others' sets
   are based on administrative units (e.g., university departments). Still
   others devise sets based on particular collections (which are not
   always logical subject groupings). A further complication arises when
   some institutions include page images of text documents in an image
   set.
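
   A service provider typically discovers what sets a repository offers
   with the protocol's ListSets verb and then harvests one set at a time.
   The sketch below, in Python, shows one way that might look. The sample
   set names, the function names, and the example.org address are
   invented for illustration; only the verb, the setSpec and setName
   elements, and the set argument come from the protocol.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def parse_sets(list_sets_xml):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(list_sets_xml)
    sets = []
    for set_el in root.iter(OAI_NS + "set"):
        spec = set_el.findtext(OAI_NS + "setSpec", default="")
        name = set_el.findtext(OAI_NS + "setName", default="")
        sets.append((spec.strip(), name.strip()))
    return sets


def selective_harvest_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build a ListRecords request limited to a single set."""
    query = urllib.parse.urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    })
    return base_url + "?" + query


# Hypothetical response mixing a format-based set with a departmental one.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>image</setSpec><setName>Images</setName></set>
    <set><setSpec>dept:history</setSpec><setName>History Department</setName></set>
  </ListSets>
</OAI-PMH>"""

for spec, name in parse_sets(SAMPLE):
    print(name, "->", selective_harvest_url("http://example.org/oai", spec))
```

   Nothing in the response says whether a set is organized by format,
   department, or collection; the harvester sees only a name and an
   identifier, which is why inconsistent set design is hard to work
   around in software.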

   As for the metadata records, the only required format is simple
   (unqualified) Dublin Core. Yet simple Dublin Core is, for many
   purposes, too simple. For example, at least one institution offers
   three fields with the same label, but only one is the actionable URL
   with which to retrieve the object online. Without a qualifier to
   identify which of the three fields is the appropriate URL, the service
   provider must guess.
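
   The kind of guessing involved can be sketched in a few lines of
   Python. This is only an illustration: it assumes the repeated field is
   dc:identifier (the column does not name it), the sample record is
   invented, and the heuristic of taking the first value that parses as
   an http or https URL is one plausible choice among several.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

DC_NS = "{http://purl.org/dc/elements/1.1/}"


def guess_actionable_url(record_xml):
    """Return the first dc:identifier that looks like a web URL, or None."""
    root = ET.fromstring(record_xml)
    for element in root.iter(DC_NS + "identifier"):
        value = (element.text or "").strip()
        parsed = urlparse(value)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return value
    return None


# Hypothetical record with three identically labeled fields.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample photograph</dc:title>
  <dc:identifier>Box 12, folder 3</dc:identifier>
  <dc:identifier>http://example.org/objects/42</dc:identifier>
  <dc:identifier>local-call-number-0042</dc:identifier>
</oai_dc:dc>"""

print(guess_actionable_url(SAMPLE_RECORD))
```

   The heuristic breaks down as soon as a record carries two URLs, say
   one for the object and one for a thumbnail, which is exactly the
   ambiguity a qualifier would remove.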

   Even the data within the elements can be problematic. Nearly all
   institutions are dumbing down from a richer internal metadata scheme to
   simple Dublin Core. In the process, some data can end up mapped into
   the wrong Dublin Core element.

   Even when the correct data is in the correct element, there can be
   encoding issues. In the records I retrieved, there were dozens of
   different methods for encoding dates. One institution might use
   1991-10-01, while another one uses October 1, 1991.
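
   A service provider can normalize the encodings it recognizes and flag
   the rest for review. The Python sketch below handles the two forms
   mentioned above plus an abbreviated month and a bare year. The list of
   formats is an assumption chosen for illustration and nowhere near the
   dozens found in real harvested data, and month-name matching assumes
   an English locale.

```python
from datetime import datetime

# (input format, output format) pairs, tried in order. A bare year stays
# a bare year rather than gaining an invented month and day.
KNOWN_FORMATS = (
    ("%Y-%m-%d", "%Y-%m-%d"),   # 1991-10-01
    ("%B %d, %Y", "%Y-%m-%d"),  # October 1, 1991
    ("%b %d, %Y", "%Y-%m-%d"),  # Oct 1, 1991
    ("%Y", "%Y"),               # 1991
)


def normalize_date(raw):
    """Return an ISO 8601 style string, or None if the form is unknown."""
    text = raw.strip()
    for input_format, output_format in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, input_format)
        except ValueError:
            continue  # Not this format; try the next one.
        return parsed.strftime(output_format)
    return None


for sample in ("1991-10-01", "October 1, 1991", "circa 1991"):
    print(repr(sample), "->", normalize_date(sample))
```

   Returning None for unrecognized values, rather than guessing, keeps
   the uncertain cases visible so they can be handled by hand or by a
   new rule.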

   For more information on metadata problems, see the paper by Naomi
   Dushay and Diane Hillmann and "Bitter Harvest: Problems & Suggested
   Solutions for OAI-PMH Data and Service Providers" (available on the
   California Digital Library [CDL] Harvesting Project web site).

   Hopeful signs

   Not long after my harvesting epiphany, my colleagues at the CDL and I
   talked with others experienced with harvesting. The University of
   Illinois at Urbana-Champaign, University of Michigan, and Cornell are
   all experienced at OAI harvesting and their perceptions largely
   predated and paralleled ours.

   Many institutions are now working, through the sponsorship of the
   Digital Library Federation (DLF), to develop best practices and
   guidelines for both data providers and service providers. A DLF working
   group will share metadata normalization and transformation tools and
   techniques and encourage data providers to expose richer metadata
   formats (e.g., MODS) than simple Dublin Core--always an option with
   OAI-PMH.
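
   A harvester can ask a repository which formats it exposes with the
   protocol's ListMetadataFormats verb and prefer a richer one when it is
   offered. The Python sketch below shows the idea. The sample response
   is invented, the function names are hypothetical, and "mods" is
   assumed as the prefix a repository would use for MODS, since data
   providers choose their own prefixes for everything but oai_dc.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def supported_prefixes(list_formats_xml):
    """Return the metadataPrefix values from a ListMetadataFormats response."""
    root = ET.fromstring(list_formats_xml)
    return [el.text.strip()
            for el in root.iter(OAI_NS + "metadataPrefix")
            if el.text and el.text.strip()]


def choose_prefix(prefixes, preferred=("mods", "oai_dc")):
    """Pick the first preferred prefix the repository supports, else None."""
    for candidate in preferred:
        if candidate in prefixes:
            return candidate
    return None


# Hypothetical response from a repository offering both formats.
SAMPLE_FORMATS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>mods</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

print(choose_prefix(supported_prefixes(SAMPLE_FORMATS)))
```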

   Despite some problems with effective usage of OAI-PMH today, we are
   still in the early days of understanding how best to implement the
   protocol. As communities come together to specify how to use the
   protocol, our users will be much better served.
     __________________________________________________________________

   LINK LIST
   California Digital Library Harvesting Project
   www.cdlib.org/inside/projects/harvesting
   Dublin Core
   dublincore.org
   Open Archives Initiative
   www.openarchives.org
   Dushay and Hillmann Paper
   www.siderean.com/dc2003/501_Paper24.pdf