Open Archives: A Key Convergence


   More than two years ago I identified interoperability -- the capacity
   of a user to treat multiple digital library collections as one -- as a
   key digital library challenge (LJ 11/15/97, pp. 31-32). In a follow-up
   column on this "grand challenge" (LJ 7/98, p. 38ff.), I discussed
   interoperability more fully, along with some methods to achieve it.

   Unfortunately, there's been little progress since then -- except for
   one significant recent development. In October 1999, several
   organizations -- including the Digital Library Federation, Association
   of Research Libraries, and Los Alamos National Laboratory -- recruited
   a group of experts "to work towards achieving a universal service for
   author self-archived scholarly literature." Self-archiving denotes the
   process of authors depositing their own papers into an archive. A
   common practice among scientists is to make preliminary drafts of their
   papers (or "preprints") available to colleagues prior to publication.
   Preprints can subsequently undergo peer review and be published in a
   professional journal.

   One outcome of the meeting, which took place in Santa Fe, NM, was the
   establishment of the Open Archives initiative (formerly known as the
   Universal Preprint Service initiative). The initiative aims to develop
   an open architecture that supports simultaneous searching and retrieval
   of papers from disparate archives.

   This is the logical next step now that several separate projects have
   successfully archived papers of various kinds (such as technical
   reports, theses, dissertations, preprints, working papers, and
   conference papers). There are lessons to learn from each of these
   projects.

   arXiv e-Print Archive
   Although ten years is not long for a print-based archive, it's very
   long for a digital one. The e-Print Archive at the Los Alamos National
   Laboratory has been around for almost a decade and has developed into a
   large (over 100,000 papers) and busy collection. It accepts and
   provides access to preprints (papers prior to publication) in physics
   and to a lesser degree in other scientific disciplines.

   Authors self-submit their papers to the archive and can also replace or
   remove them. Submissions are not reviewed, but authors must register
   before contributing papers. This service is free, which may be readily
   apparent from the spare and sometimes obscure user interface. Since
   scientists are accustomed to sharing their work in preprint form, this
   archive model works well. It is unlikely, however, to work as well for
   humanities scholars, who tend not to work this way.

   Networked Computer Science Technical Reference Library
   The Networked Computer Science Technical Reference Library (NCSTRL,
   pronounced "ancestral") provides access to computer science technical
   reports from over 100 institutions worldwide through a single
   interface. Originally supported as a research project by a grant from
   the Defense Advanced Research Projects Agency (DARPA), NCSTRL is no
   longer merely a research project but is now a production service for
   archiving these reports.

   A search initiated at any NCSTRL site queries a central database of
   metadata. When the user selects a particular paper to view, it is
   fetched from its remote repository. The underlying NCSTRL
   infrastructure consists of Dienst, a protocol and software suite
   layered on a standard web server. See the article "The NCSTRL Approach
   to Open Architecture" for more information.
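
   The search-then-fetch pattern described above can be sketched in a few
   lines of Python. This is only an illustration under stated
   assumptions: the class and field names are hypothetical, the "remote"
   repositories are in-memory stand-ins, and none of it reflects the
   actual Dienst protocol or software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """One entry in the central index; the full text stays at the home site."""
    report_id: str
    title: str
    repository: str  # hypothetical name of the site holding the document


class RemoteRepository:
    """Stand-in for a remote site; a real one would be reached over the network."""

    def __init__(self, documents):
        self._documents = dict(documents)

    def fetch(self, report_id):
        return self._documents[report_id]


class CentralIndex:
    """Holds metadata only and knows which repository owns each report."""

    def __init__(self, records, repositories):
        self._records = list(records)
        self._repositories = dict(repositories)

    def search(self, term):
        # Searching touches only the central metadata, never the remote sites.
        term = term.lower()
        return [r for r in self._records if term in r.title.lower()]

    def retrieve(self, record):
        # Only when a paper is selected is its home repository contacted.
        return self._repositories[record.repository].fetch(record.report_id)


repositories = {
    "site-a": RemoteRepository({"A-1": "Full text of report A-1"}),
    "site-b": RemoteRepository({"B-7": "Full text of report B-7"}),
}
index = CentralIndex(
    [
        MetadataRecord("A-1", "Distributed File Systems", "site-a"),
        MetadataRecord("B-7", "Distributed Digital Libraries", "site-b"),
    ],
    repositories,
)

hits = index.search("digital")
full_text = index.retrieve(hits[0])
```

   The point of the split is that a query is answered from one place,
   while the documents themselves never have to leave the institutions
   that maintain them until someone asks for one.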

   Networked Digital Library of Theses and Dissertations
   The Networked Digital Library of Theses and Dissertations (NDLTD),
   based at Virginia Tech, archives digital theses and dissertations.
   According to the project, more than 70 institutions (mostly
   universities) have joined the effort, but so far only a relatively
   small number of digital works are available. Nonetheless, this effort
   brings to the initiative a useful and unique area of "grey" literature
   that otherwise would be available only through a commercial service or
   directly from each university.

   NASA Technical Reports Server
   The NASA Technical Reports Server (NTRS) is a gateway to 20 different
   U.S. government-based technical report servers that contain three to
   four million abstracts and more than 100,000 full-text reports. NTRS
   uses Wide Area Information Servers (WAIS), a technology developed by
   Thinking Machines, Inc., that was popular in the early 1990s and has
   now largely disappeared from the Internet. Although the technical
   infrastructure is nearly ancient history in Internet terms, the system
   nonetheless works.

   Tying it all together
   The Open Archives initiative aims to specify the methods by which these
   various individual archives can interoperate. Such interoperability
   will largely be achieved by specifying three things: first, a protocol
   for "harvesting" (gathering) metadata from participating archives;
   second, criteria that can be used to harvest metadata selectively; and
   third, a common metadata format for archives to use in responding to
   harvesting requests.

   At first, the initiative will use a modified version of the Dienst
   protocol, which comes out of the NCSTRL effort, as the harvesting
   protocol. Dienst is well established for this kind of activity, having
   supported the same kind of work on behalf of computer science technical
   reports for some years. The Santa Fe meeting attendees considered
   accession date the most important criterion for selective harvesting,
   with author affiliation, subject, and publication type also deemed
   important.
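
   To make selective harvesting concrete, here is a rough Python sketch
   of filtering one archive's records by the four criteria named above.
   It is not the Dienst request syntax; the record fields, the
   HarvestCriteria class, and the sample data are all hypothetical. A
   plausible use of the accession date, assumed here, is asking an
   archive only for what it has added since the harvester's last visit.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Record:
    identifier: str
    accession_date: date
    affiliation: str
    subject: str
    publication_type: str


@dataclass(frozen=True)
class HarvestCriteria:
    """Any field left as None places no restriction on the harvest."""
    accessioned_since: Optional[date] = None
    affiliation: Optional[str] = None
    subject: Optional[str] = None
    publication_type: Optional[str] = None

    def matches(self, record):
        if (self.accessioned_since is not None
                and record.accession_date < self.accessioned_since):
            return False
        # The remaining criteria are simple exact matches in this sketch.
        for field in ("affiliation", "subject", "publication_type"):
            wanted = getattr(self, field)
            if wanted is not None and getattr(record, field) != wanted:
                return False
        return True


def harvest(archive, criteria):
    """Return the records of one archive that satisfy the criteria."""
    return [record for record in archive if criteria.matches(record)]


# Made-up records standing in for one participating archive.
archive = [
    Record("r1", date(1999, 9, 30), "University A", "physics", "preprint"),
    Record("r2", date(1999, 11, 2), "University B", "physics", "preprint"),
    Record("r3", date(1999, 12, 15), "University A", "economics", "working paper"),
]

since_october = HarvestCriteria(accessioned_since=date(1999, 10, 1))
new_records = harvest(archive, since_october)

new_physics = harvest(
    archive,
    HarvestCriteria(accessioned_since=date(1999, 10, 1), subject="physics"),
)
```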

   For the metadata component, a minimal set of the Dublin Core elements
   will be used. An early experimental implementation of such a service is
   the Universal Preprint Service (UPS) prototype server.
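
   As a sketch of what a common metadata format buys, the Python below
   maps two differently shaped native records onto the same small set of
   Dublin Core elements. The 15 element names are those of the Dublin
   Core element set; which of them make up the initiative's minimal set
   is not stated here, so MINIMAL_SET is an assumption, as are the native
   field names and sample values.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})

# Assumed for illustration only; the initiative's actual minimal set may differ.
MINIMAL_SET = ("title", "creator", "date", "identifier")


def to_minimal_dublin_core(native_record, field_map):
    """Translate an archive's native record into the shared minimal format.

    field_map maps a Dublin Core element name to the native field name.
    """
    unknown = set(field_map) - DUBLIN_CORE_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return {
        element: native_record[field_map[element]]
        for element in MINIMAL_SET
        if field_map.get(element) in native_record
    }


# Two hypothetical archives that describe their holdings differently.
preprint_record = {
    "paper_title": "A Sample Preprint",
    "authors": "A. Author",
    "submitted": "1999-10-21",
    "paper_no": "sample/0001",
    "comments": "12 pages",
}
preprint_map = {
    "title": "paper_title",
    "creator": "authors",
    "date": "submitted",
    "identifier": "paper_no",
}

thesis_record = {
    "name": "A Sample Thesis",
    "student": "B. Student",
    "year": "1999",
    "handle": "etd-0001",
}
thesis_map = {
    "title": "name",
    "creator": "student",
    "date": "year",
    "identifier": "handle",
}

preprint_dc = to_minimal_dublin_core(preprint_record, preprint_map)
thesis_dc = to_minimal_dublin_core(thesis_record, thesis_map)
```

   Once both records carry the same element names, a harvester can index
   them side by side without knowing anything about the archive each one
   came from. Fields outside the minimal set, such as the preprint's
   comments, are simply left behind.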

   Additional participants in the Open Archives effort include the
   California Digital Library of the University of California, through its
   eScholarship initiative; CogPrints (Cognitive Sciences archive); RePEc
   (Research Papers in Economics); and EconWPA (Economics Working Papers
   Archive).

   Why this is important
   In today's world, the hapless user who simply wants to discover what
   unpublished literature exists on a particular topic faces a daunting
   task: finding all the individual servers that might possibly have the
   information by using web search engines, then searching each archive
   individually. This is a "solution" that will neither scale nor
   suffice.

   "Digital libraries have historically been islands of information," says
   Michael Nelson, the manager of the NASA Technical Reports Server, "and
   there has been no way to bind them together into one collection."
   That's why he and others are excited about the possibilities of
   achieving some measure of interoperability among archives of scholarly
   and scientific papers and preprints.

   The Open Archives initiative may be just what's required, while leading
   the way for binding together other kinds of digital library
   collections. If this initiative can specify an appropriate architecture
   for treating multiple, disparate collections as one, then those
   planning projects with other kinds of material may be able to either
   learn from its experience or use some of the same infrastructure.

