Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant

Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
The Importance of Being Granular

05/15/2002
   Our libraries are increasingly dependent on metadata. Besides the
   obvious (our catalogs), other uses are becoming more commonplace.
   Virtually any content we digitize and make available to our clientele
   requires metadata for discovery and access. Every interlibrary loan
   transaction is a slug of metadata that helps libraries get a book or
   journal article to a user. Libraries now license so many databases and
   collections of online content that they increasingly offer a way for
   users to search for a resource based on their topic. Such a service
   requires metadata.

   Last month, in "Metadata as if Libraries Depended on It" (LJ
   4/15/02, p. 32ff.), I discussed metadata and its various components: a
   standard container, qualification, usage guidelines, and the
   information being captured. In that overview I set aside one topic as
   being worthy of its own column: metadata granularity.

   How you chop it

   Granularity refers to how finely you chop your metadata. For example,
   in the standard for encoding the full text of books using the Text
   Encoding Initiative (TEI) schema, a book author may be recorded as:
   <docAuthor>William Shakespeare</docAuthor>. That's all well and good
   if you never need to know which string of text comprises the author's
   last name and which the first. If you do--and most library catalogs
   should have this capability--you're not going to get very far with
   information extracted from a book encoded using the TEI tag set.
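
   A minimal sketch of the problem, in Python. The Author class and the
   guess_sort_key heuristic are hypothetical, introduced only for
   illustration; they are not part of TEI or of any library system. The
   sketch contrasts guessing a surname from a flat string with reading
   it from a record that stores the name parts separately.

```python
from dataclasses import dataclass

# Flat, docAuthor-style strings: nothing marks which part is the surname.
flat_authors = ["William Shakespeare", "Gabriel García Márquez", "Harper Lee"]


def guess_sort_key(name: str) -> str:
    """Naive heuristic (an assumption): the last token is the surname."""
    return name.split()[-1]


@dataclass
class Author:
    """Hypothetical granular record with the name parts kept apart."""
    surname: str
    forename: str


granular_authors = [
    Author("Shakespeare", "William"),
    Author("García Márquez", "Gabriel"),
    Author("Lee", "Harper"),
]

# Sorting the flat strings depends on the guess, which misfiles
# "García Márquez" under "Márquez".
guessed = sorted(flat_authors, key=guess_sort_key)

# Sorting the granular records needs no guessing at all.
reliable = sorted(granular_authors, key=lambda a: (a.surname, a.forename))
```

   With the flat strings, the compound surname is filed under M rather
   than G. No heuristic fixes this in general; the information simply
   was not recorded.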

   Although TEI has been around in one form or another for 15 years, its
   focus is mainly on the recording of aspects of a work for humanities
   scholars. As such, it is not particularly well suited for library-style
   bibliographic description. Nonetheless, as more texts are digitized in
   their entirety, libraries will increasingly be using either some form
   of TEI or another similar schema (e.g., ISO 12083). Therefore, it
   behooves us to know how well or how poorly standards such as TEI and
   MARC can interoperate.

   How granularity helps

   Granularity is good. It makes it possible to distinguish one bit of
   metadata from another and can lead to all kinds of additional user
   services. For example, you can't sort records on author names if you
   can't tell the last name. Wait a minute, you're thinking, we do it all
   the time since the MARC record is sufficiently granular. And you would
   be correct.

   Generally speaking, most of the information in a MARC record is
   sufficiently granular for the purposes for which it was designed. But
   it becomes less than adequately granular should you wish to start
   loading up the MARC record with such things as book reviews. Then you
   are reduced to such questionable tactics as smashing it into a note
   field. As time goes on, in other words, we may begin to find that MARC
   isn't quite as extensible or granular as it will need to be.

   External compliance

   The issue of granularity becomes critical in the apparent slavish
   devotion we tend to have toward standards. Don't get me wrong.
   Standards are vital to sharing data with others. They are important to
   any situation in which you must interoperate with other systems. They
   are important to providing a method to layer services easily on top of
   a collection of metadata. But we sometimes confuse internal compliance
   with external compliance.

   External compliance with standards means that you can export your data
   into whatever metadata standard applies to a given situation. For
   example, some libraries are involved with the Open Archives Initiative
   (OAI), which aims to share metadata among working paper archives.
   Although internally a given archive may have a richer and more granular
   collection of metadata, OAI specifies that at minimum the archive
   should be able to make its metadata available for "harvesting"
   (collecting via software) using the Dublin Core metadata specification.
   Therefore, an OAI-compliant archive will likely "dumb down" its
   metadata (in some cases making granular metadata more homogeneous) to
   meet this minimum specification.

   Not all metadata are equal

   Internal compliance means storing the metadata in a particular standard
   even when it makes little sense to do so. Some standards are meant to
   provide interoperability among systems (such as the Dublin Core), while
   others are designed to provide a base level of standardization upon
   which software systems can be built (such as MARC).

   Therefore, not all metadata standards are created equal. They are
   sometimes inadequate for your internal needs or would prevent you from
   complying with a different standard. In the case of the OAI-compliant
   archive above, for example, being internally compliant with the Dublin
   Core would make no sense in and of itself. So long as it could "speak"
   Dublin Core when required, a richer set of internal metadata may allow
   many other additional uses of the same information (such as MARC
   records for a library catalog).
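
   One way to picture "speaking" Dublin Core on demand is an export
   function that flattens a richer internal record into simple Dublin
   Core elements. This is only a sketch: the internal field names
   (title.main, authors, subjects) are assumptions made for the
   example, and only three of the Dublin Core elements are shown.

```python
# Hypothetical rich internal record; every field name here is illustrative.
internal_record = {
    "title": {"main": "Hamlet", "subtitle": "Prince of Denmark"},
    "authors": [{"surname": "Shakespeare", "forename": "William"}],
    "subjects": ["Tragedy", "Denmark"],
}


def to_simple_dublin_core(rec: dict) -> dict:
    """Flatten the granular record into repeatable Dublin Core elements."""
    title = rec["title"]["main"]
    subtitle = rec["title"].get("subtitle")
    if subtitle:
        # Main title and subtitle collapse into a single string.
        title = f"{title}: {subtitle}"
    return {
        "title": [title],
        # Surname and forename collapse into one "creator" string.
        "creator": [f"{a['surname']}, {a['forename']}" for a in rec["authors"]],
        "subject": list(rec["subjects"]),
    }


dc_record = to_simple_dublin_core(internal_record)
```

   The export is one-way. The internal record still knows which part of
   the name is the surname; the Dublin Core output no longer does.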

   Granularity questions

   Nearly all metadata standards raise granularity issues. In the TEI
   example, for greater flexibility the author's name should be chopped up
   at least into the part of the name upon which sorting can take place
   (usually the last name). Therefore, should I decide to encode a
   digitized book using the TEI set of XML tags, I will have metadata that
   is only adequate for TEI compliance.

   On the other hand, should I create my own set of tags--perhaps the TEI
   tag set plus additional tags, for example, to identify the author's
   first and last name--to provide more granularity, then a standard such
   as TEI can be covered like a blanket. And a number of other metadata
   standards that may be important (such as MARC) can be supported as
   well. Once you have your metadata stored in a standard,
   machine-parsable container, whether in a database or an XML data
   stream, it's easy to spit out the information in various configurations
   and formats.
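
   As a rough illustration of spitting out one stored record in more
   than one configuration, the sketch below renders the same granular
   name as a TEI docAuthor element and as a simplified, human-readable
   line modeled on a MARC 100 field. The record layout and both
   function names are assumptions, and the MARC line is a display
   string, not a real MARC serialization.

```python
from xml.sax.saxutils import escape

# Hypothetical internal record; the keys are illustrative only.
record = {"surname": "Shakespeare", "forename": "William", "dates": "1564-1616"}


def as_tei_doc_author(rec: dict) -> str:
    """Render the name in display order inside a docAuthor element."""
    display_name = f"{rec['forename']} {rec['surname']}"
    return f"<docAuthor>{escape(display_name)}</docAuthor>"


def as_marc_100_display(rec: dict) -> str:
    """Render a simplified text line modeled on a MARC 100 field."""
    line = f"100 1_ $a {rec['surname']}, {rec['forename']}"
    if rec.get("dates"):
        line += f", $d {rec['dates']}"
    return line


tei_output = as_tei_doc_author(record)
marc_output = as_marc_100_display(record)
```

   Both renderings come from the same source. Going the other way, from
   the docAuthor string back to separate name parts, would require
   guessing.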

   Remember: select (or create) and use metadata containers that are
   granular enough for any purpose to which you can imagine putting them.
   If you do this, not only can you serve your own purposes, but you can
   also share your metadata with anyone you wish. If this is not
   practical, then you must decide which needs will remain unfulfilled.

   Highly granular metadata doesn't come cheap. There is a trade-off
   between all possible uses that you may wish to support and the staff
   time required to capture the metadata required to do so. In some cases,
   the benefit will not warrant the cost; in others, it will be worth it.

   Another path to granularity

   Good granularity doesn't necessarily mean that any single metadata
   standard or container must chop up every field into the smallest
   reducible part. For example, the emerging standard for digital object
   description, METS, is designed to take advantage of other, more
   granular metadata containers.

   As a wrapper, it is meant to enclose some things and link to others. It
   can refer to a metadata record for the item being described. Therefore,
   a digital object described using the METS schema may, in fact, refer to
   a MARC record for descriptive metadata.
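
   The wrapper idea can be sketched by building a small METS fragment
   whose descriptive metadata section points at an external MARC record
   instead of embedding it. The record URL and the ID value below are
   placeholders, and the fragment is deliberately incomplete: a full
   METS document needs more than this (a structural map, for one).

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets_fragment(marc_url: str) -> ET.Element:
    """Build a dmdSec that refers to a MARC record by URL."""
    mets = ET.Element(f"{{{METS_NS}}}mets")
    dmd_sec = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": "DMD1"})
    # mdRef links to metadata held elsewhere rather than wrapping it inline.
    ET.SubElement(
        dmd_sec,
        f"{{{METS_NS}}}mdRef",
        {
            "LOCTYPE": "URL",
            "MDTYPE": "MARC",
            f"{{{XLINK_NS}}}href": marc_url,
        },
    )
    return mets


# Placeholder URL; no such record is assumed to exist.
fragment = build_mets_fragment("https://example.org/catalog/record/12345.mrc")
xml_text = ET.tostring(fragment, encoding="unicode")
```

   The METS record itself stays coarse, while the granularity lives in
   the MARC record it refers to.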

   Granularity of metadata is hard-won and easily lost. Identifying and
   appropriately encoding metadata elements usually requires a person--and
   one with training. Once granularity has been achieved, it should not be
   permanently surrendered through internal compliance with an external
   standard, unless the benefits clearly outweigh the drawbacks and no
   alternatives are possible.

   The time of cataloging staff is valuable, and once granularity is lost
   it may not be practical to recover it. Our libraries depend on
   metadata. They are becoming even more dependent as we move into the
   realm of creating, managing, and preserving collections in digital
   form. Doing so well requires us to understand thoroughly what is at
   stake and the consequences of our actions.
     __________________________________________________________________

Link List

   Dublin Core
   dublincore.org

   ISO 12083
   www.xmlxperts.com/12083.htm

   MARC
   www.loc.gov/marc

   METS
   www.loc.gov/standards/mets

   Open Archives Initiative
   www.openarchives.org

   Text Encoding Initiative
   www.tei-c.org