Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Open Archives: A Key Convergence
Over two years ago I identified interoperability -- the capacity of a user to treat multiple digital library collections as one -- as a key digital library challenge (LJ 11/15/97, p. 31-32). In a follow-up column on this "grand challenge" (LJ 7/98, p. 38ff.), I more fully discussed interoperability and some methods to achieve it. Unfortunately, there's been little progress since then -- except for one significant recent development.
In October 1999, several organizations -- including the Digital Library Federation, Association of Research Libraries, and Los Alamos National Laboratory -- recruited a group of experts "to work towards achieving a universal service for author self-archived scholarly literature." Self-archiving denotes the process of authors depositing their own papers into an archive. A common practice among scientists is to make preliminary drafts of their papers (or "preprints") available to colleagues prior to publication. Preprints can subsequently undergo peer review and be published in a professional journal.
One outcome of the meeting, which took place in Santa Fe, NM, was the establishment of the Open Archives initiative (formerly known as the Universal Preprint Service initiative). The initiative aims to develop an open architecture that supports simultaneous searching and retrieval of papers from disparate archives. This is the logical next step after several separate projects have successfully archived papers of various kinds (such as technical reports, theses, dissertations, preprints, working papers, and conference papers). There are lessons to learn from each of the projects.
arXiv e-Print Archive
Although ten years is not long for a print-based archive, it's very long for a digital one. The e-Print Archive at the Los Alamos National Laboratory has been around for almost a decade and has developed into a large (over 100,000 papers) and busy collection.
It accepts and provides access to preprints (papers prior to publication) in physics and to a lesser degree in other scientific disciplines. Authors self-submit their papers to the archive and can also replace or remove them. Submissions are not reviewed, but authors must register before contributing papers. This service is free, which may be readily apparent from the spare and sometimes obtuse user interface. Since scientists are accustomed to sharing their work in preprint form, this archive model works well. It is unlikely, however, to work as well for humanities scholars, who don't tend to work this way.
NCSTRL
The Networked Computer Science Technical Reports Library (or NCSTRL, pronounced "ancestral") provides access to computer science technical reports from over 100 institutions worldwide through a single interface. Originally supported as a research project by a grant from the Defense Advanced Research Projects Agency (DARPA), NCSTRL is no longer merely a research project but is now a production service for archiving these reports. A search initiated at any NCSTRL site queries a central database of metadata. When the user selects a particular paper to view, it is fetched from its remote repository. The underlying NCSTRL infrastructure consists of Dienst, a protocol and software suite overlying a standard web server. See the article "The NCSTRL Approach to Open Architecture" for more information on the underlying architecture.
NDLTD
The Networked Digital Library of Theses and Dissertations (NDLTD), based at Virginia Tech, archives digital theses and dissertations. According to the project, more than 70 institutions (mostly universities) have joined the effort, but so far only a relatively small number of digital works are available. Nonetheless, this effort brings to the initiative a useful and unique area of "grey" literature that otherwise would be available only through a commercial service or directly from each university.
NASA Technical Reports Server
NASA Technical Reports Server (NTRS) is a gateway to 20 different U.S. government-based technical report servers that contain three to four million abstracts and more than 100,000 full-text reports. NTRS uses wide area information servers (WAIS) technology developed by Thinking Machines, Inc., which was popular in the early 1990s and has now largely disappeared from the Internet. Although the technical infrastructure is nearly ancient history in Internet terms, the system nonetheless works.
Tying it all together
The Open Archives initiative aims to specify the methods by which these various individual archives can interoperate. Such interoperability will largely be achieved by specifying first a protocol for "harvesting" (gathering) metadata from participating archives; then criteria that can be used to selectively harvest metadata; and lastly, a common metadata format for archives to use in responding to harvesting requests.
At first, the initiative will use a modified version of the Dienst protocol that comes out of the NCSTRL effort as the harvesting protocol. Dienst is well established for this kind of activity, having supported the same kind of work on behalf of computer science technical reports for some years. Accession date was thought by the Santa Fe meeting attendees to be the most important criterion for selective harvesting, with author affiliation, subject, and publication type also being deemed important. For the metadata component, a minimal set of the Dublin Core elements will be used. An early experimental implementation of such a service is the Universal Preprint Service (UPS) prototype server.
Additional participants in the Open Archives effort include the California Digital Library of the University of California, through its eScholarship initiative; CogPrints (Cognitive Sciences archive); RePEc (Research Papers in Economics); and EconWPA (Economics Working Papers Archive).
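To make the harvesting idea above more concrete, here is a minimal sketch in Python of selective harvesting by accession date and subject. It is an illustration only, not the Dienst protocol or anything specified by the initiative: the Record fields, the Archive class, the harvest_all function, and the sample records are all hypothetical names and made-up data chosen for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """A minimal Dublin Core-style record (hypothetical field selection)."""
    identifier: str
    title: str
    creator: str
    subject: str
    accession_date: date  # assumed name for the date the archive accepted the paper


class Archive:
    """Stands in for one participating archive; not a real protocol client."""

    def __init__(self, name, records):
        self.name = name
        self._records = list(records)

    def harvest(self, since=None, subject=None):
        """Return the metadata records that match the selective-harvest criteria."""
        matches = []
        for record in self._records:
            if since is not None and record.accession_date < since:
                continue  # accessioned before the requested date
            if subject is not None and record.subject != subject:
                continue  # wrong subject
            matches.append(record)
        return matches


def harvest_all(archives, since=None, subject=None):
    """Gather metadata from every archive into one list, tagged by source."""
    combined = []
    for archive in archives:
        for record in archive.harvest(since=since, subject=subject):
            combined.append((archive.name, record))
    return combined


# Made-up sample data for illustration only.
physics = Archive("physics-archive", [
    Record("phys/0001", "Sample preprint A", "Author One", "physics", date(1999, 11, 2)),
    Record("phys/0002", "Sample preprint B", "Author Two", "physics", date(2000, 1, 15)),
])
cs_reports = Archive("cs-reports", [
    Record("cs/0001", "Sample report C", "Author Three", "computer science", date(2000, 2, 1)),
])

# Harvest only what was accessioned since the start of 2000.
recent = harvest_all([physics, cs_reports], since=date(2000, 1, 1))
for source, record in recent:
    print(source, record.identifier, record.title)
```

A service built this way would only need to ask each archive for records added since its last visit, which is presumably why accession date ranked first among the criteria; the combined list is what a single search interface could then index.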
Why this is important
In today's world, the hapless user who simply wants to discover what unpublished literature exists on a particular topic is faced with a dilemma: trying to find all the individual servers that might possibly have the information by searching web search engines, then searching each archive individually. This is a "solution" that will neither scale nor suffice.
"Digital libraries have historically been islands of information," says Michael Nelson, the manager of the NASA Technical Reports Server, "and there has been no way to bind them together into one collection." That's why he and others are excited about the possibilities of achieving some measure of interoperability among archives of scholarly and scientific papers and preprints.
The Open Archives initiative may be just what's required, while leading the way for binding together other kinds of digital library collections. If this initiative can specify an appropriate architecture for treating multiple, disparate collections as one, then those planning projects with other kinds of material may be able to either learn from their experience or use some of the same infrastructure.
LINK LIST
arXiv.org http://arxiv.org/
CogPrints http://cogprints.soton.ac.uk/
Dublin Core http://purl.org/dc/
EconWPA http://wuecon.wustl.edu/
eScholarship http://www.cdlib.org/eschol/
NASA Technical Reports Server http://techreports.larc.nasa.gov/cgi-bin/ntrs
NCSTRL http://www.ncstrl.org/
The NCSTRL Approach to Open Architecture http://www.dlib.org/dlib/december98/leiner/12leiner.html
NDLTD http://www.ndltd.org/
Open Archives http://www.openarchives.org
RePEc http://netec.mcc.ac.uk/RePEc/
Santa Fe Convention of the Open Archives Initiative http://www.openarchives.org/sfc/sfc_entry.htm
UPS Prototype Server http://ups.cs.odu.edu/