EBIB   Standards and organisation. Article - EBIB No.4/2002

   

Marek Nahotko
Old and new standards for describing electronic documents

Energoprojekt - Kraków SA

This article was translated thanks to the grant received from the Open Society Institute

The world of wide networks, mainly the Internet, is still a dynamically evolving environment. Almost every year, newer and better tools for information exchange emerge while many others disappear. Considering this, what is the situation regarding tools for describing Internet resources? What makes this question even more important is the notion that the most valuable element of a computer system is neither the equipment nor the software, but the DATA therein contained.

Catalogers of electronic documents, especially Internet resources, are actively discussing whether traditional bibliographic formats can be replaced by new ones, known as metadata. On the one hand, the idea of moving away from traditional formats, like MARC 21, is shocking to librarians because, as they point out, millions of records already exist in this format. On the other hand, there is a pressing need to replace the traditional formats with more advanced tools for providing catalog access to online resources. Furthermore, the very high cost of cataloging electronic documents in traditional formats has shifted more interest towards new standards, such as Dublin Core.

Let us examine methods of describing online resources using traditional tools, such as the MARC format, and using newer, constantly evolving solutions, like Dublin Core.

Is the MARC format still alive?

The MARC format, Library of Congress Subject Headings (LCSH) and the WorldCat database were milestones of the 20th century that set the direction of change in the representation of bibliographic and library information. Their achievements in standardizing the structure and form of bibliographic data, as well as in the globalization of bibliographic information, can be seen as rather symbolic.

For the past 30 years, libraries around the world have used the MARC format to exchange information with each other. During that time, computer technology has seen tremendous advancements, but the format for recording bibliographic data remains the same. However, is it really the same format? If we compare its past versions, we see that MARC has evolved. The emergence of electronic documents has only given it an additional reason for change.

The modifications made to the MARC format to meet the needs of electronic document records went in two directions:

1. Adaptation of the format structure by adding new fields.
The first direction of change is the modification of the structure of the format itself. A frequently cited example is field 856 (electronic location and access), created in the 1990s, before the concept of the URL became fully established. Field 856 contains, among other things, an electronic address that enables a direct connection from the bibliographic record to the electronic document available on a remote computer. Another important field in the electronic document record is 256 (computer file characteristics), which describes the type and size of the cataloged file. In field 256, the cataloger classifies documents into three very general categories: electronic data, electronic program, and electronic data and program.

The type of the file is given in leader/06 by assigning the letter code "m" for computer files. Because the letter code in leader/06 determines the type of fixed field 008 (fixed-length data elements), choosing "m" in leader/06 means that fixed field 008 will describe a computer file. As a result, 008/26 will contain the code for the type of computer file. When the resource cannot be identified as a computer file (i.e. does not have the "m" code in leader/06), the cataloger may add fixed field 006 (computer files), containing information about the computer file aspect of the resource, especially its type (006/09). Note that fixed field 007 is required when the physical form of the resource is a computer file, since this field contains information on the physical characteristics of the cataloged document.

In the notes area, field 538 (system details note) is used to describe system requirements and the mode of access to the electronic document. Field 538 may include the computer name and model, size of RAM, operating system, software (including programming language), peripheral devices and, in the case of online documents, information on remote access.
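The interplay between leader/06, 008/26 and 006/09 described above can be sketched in a few lines of Python. This is only an illustration: the record is a hypothetical in-memory dictionary holding the leader and the fixed fields as plain strings, not the structure of any particular MARC library, and the type codes in the sample data are merely placeholders.

```python
def computer_file_type(record):
    """Return the one-letter type-of-computer-file code, or None if absent.

    `record` is a hypothetical dict: "leader" (24 characters), "008"
    (40 characters) and an optional list of 18-character "006" fields.
    """
    if record["leader"][6] == "m":
        # Leader/06 = "m": field 008 describes a computer file,
        # so the type code sits in 008/26.
        return record["008"][26]
    for field in record.get("006", []):
        # An added 006 for computer files starts with "m";
        # its type code sits in 006/09.
        if field[0] == "m":
            return field[9]
    return None


# A record cataloged as a computer file (leader/06 = "m").
primary = {
    "leader": " " * 6 + "m" + " " * 17,
    "008": " " * 26 + "d" + " " * 13,
}

# A record of another type (leader/06 = "a") with an added 006 field.
secondary = {
    "leader": " " * 6 + "a" + " " * 17,
    "008": " " * 40,
    "006": ["m" + " " * 8 + "j" + " " * 8],
}

print(computer_file_type(primary))
print(computer_file_type(secondary))
```

The function checks the leader first because, as explained above, the leader decides how field 008 is to be read; field 006 is consulted only when the leader does not mark the resource as a computer file.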

2. MARC vs. SGML and XML.
In the early 1990s, work began on applying SGML to represent the structure of bibliographic data contained in the MARC format, which resulted in the creation of the MARC DTD [1]. At the end of 1995, the Library of Congress joined the effort by appointing a special team, which, after intense discussions, developed a set of general requirements and guidelines for creating the DTD. In mid-1996, the first trial version of the DTD was developed. Two years later, special software was created to enable data conversion between MARC 21 and SGML.

If we treat SGML as a high-level programming language, DTD would be software written in this programming language. MARC DTD consists of over 20,000 tags of code and refers to 12 external elements.

Below is an example of a fragment of MARC field 245 and its MARC DTD equivalent, written in SGML, as it appears in Sally McCallum's work titled "SGML/MARC Mapping" (McCallum, 1997):

  • MARC:
    245 10 $aSGML : $b an author's guide to the standard generalized markup language / $c Martin Bryan
  • SGML:
    SGML : an author's guide to the standard generalized markup language / Martin Bryan
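The general idea of such a conversion, turning the tag, indicators and subfields of a MARC field into nested markup elements, can be sketched as follows. The sketch makes two assumptions that do not come from McCallum's paper: the input is the display form of the field shown above, and the element and attribute names (datafield, subfield, tag, ind1, ind2, code) follow the later MARCXML convention rather than the element names of the MARC DTD.

```python
import re
import xml.etree.ElementTree as ET


def parse_marc_field(line):
    """Split a display-form field such as '245 10 $aTitle $bRest' into parts."""
    tag, indicators, body = line.split(" ", 2)
    # Each subfield starts with "$" followed by a one-character code.
    subfields = [
        (code, text.strip())
        for code, text in re.findall(r"\$(\w)([^$]*)", body)
    ]
    return tag, indicators, subfields


def field_to_xml(tag, indicators, subfields):
    """Serialize one field as markup (MARCXML-style names, an assumption)."""
    field = ET.Element(
        "datafield",
        {"tag": tag, "ind1": indicators[0], "ind2": indicators[1]},
    )
    for code, text in subfields:
        ET.SubElement(field, "subfield", {"code": code}).text = text
    return ET.tostring(field, encoding="unicode")


line = ("245 10 $aSGML : $b an author's guide to the standard "
        "generalized markup language / $c Martin Bryan")
tag, indicators, subfields = parse_marc_field(line)
print(field_to_xml(tag, indicators, subfields))
```

Because the markup keeps the tag, both indicators and every subfield code, the conversion loses no information and can in principle be reversed.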

Since 1998, a number of projects have addressed the conversion of MARC to XML (Extensible Markup Language).

New metadata formats: Dublin Core

Metadata are data about data. As such, they have always been of interest to librarians, since catalog and bibliographic records are themselves examples of metadata. Nowadays, the term "metadata" is used for bibliographic records of electronic documents, especially those accessible remotely over a wide area network. Hence, the MARC records discussed above can also be called metadata.

In addition to the formats already in use, new ones were developed to describe electronic documents. Currently, there are hundreds of them, but the best known is the Dublin Core Metadata Element Set (DCMES), managed by the Dublin Core Metadata Initiative (DCMI - http://dublincore.org/). DCMES is a set of 15 data elements for describing electronic documents. The semantics of these elements were established through the consensus of an international interdisciplinary team, which includes professionals from library science, computer science, museum and archival studies, as well as other related areas. DC elements are optional, repeatable and can be entered in any order.

The aim of the DCMES developers was to create a format that would be easy to use, on the assumption that its users and catalogers would be the actual authors of electronic documents. However, it quickly became apparent that 15 elements would not be enough. Therefore, the team decided to "qualify" elements, that is, to specify their meaning more precisely by attaching special qualifiers to them. In addition, developing local sets of qualifiers, which seems to be a common practice, makes it possible to adapt DCMES to individual needs. DCMI created a standard way to "qualify" elements with various kinds of qualifiers.

There are two classes of qualifiers:

  • Element refinement (qualifiers that make the element narrower or more specific). A refined element shares the meaning of the unqualified element, but with a more restricted scope. A client (i.e. software) that does not understand the meaning of a specific element refinement term should ignore the qualifier and treat the metadata value as if it were unqualified (broader).
  • Encoding scheme (qualifiers that identify schemes which support the interpretation of an element value). These include controlled vocabularies and formal notations or rules. A value expressed with the use of the encoding scheme will thus be a symbol selected from the controlled vocabulary (for example, a term from a classification scheme or a subject heading) or a string formatted according to a formal notation (for example, "2002-03-15" as a standard expression of a date). If an encoding scheme is not understandable to the client software, it may still be clear to a human user.

The guiding rule in the qualification of Dublin Core elements is the so-called Dumb-Down principle, which says that a client should be able to ignore every qualifier and use a description as if it were unqualified. While this obviously causes a loss of some specific meaning, the remaining element value (without the qualifier) must still be correct.
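The Dumb-Down principle can be illustrated with a short Python sketch. The Statement type and its field names are hypothetical and serve only this example; the qualifier names used in the sample data ("created", "abstract", "W3CDTF", "LCSH") are given for illustration.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    """One Dublin Core statement; this type is a hypothetical illustration."""
    element: str
    value: str
    refinement: Optional[str] = None  # element refinement, e.g. "created"
    scheme: Optional[str] = None      # encoding scheme, e.g. "W3CDTF"


def dumb_down(statement):
    """Drop both kinds of qualifier and keep the element and its value."""
    return Statement(statement.element, statement.value)


qualified = [
    Statement("Date", "2002-03-15", refinement="created", scheme="W3CDTF"),
    Statement("Subject", "Metadata", scheme="LCSH"),
    Statement("Description", "A comparison of MARC and Dublin Core.",
              refinement="abstract"),
]

for statement in qualified:
    simple = dumb_down(statement)
    print(simple.element, "=", simple.value)
```

After dumbing down, "2002-03-15" is still a valid Date and "Metadata" is still a valid Subject; only the information that the date is a creation date, or that the term comes from a controlled vocabulary, is lost.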

As an Internet standard, Dublin Core was designed to be used in HTML, and efforts are currently under way to establish ways of using it in XML and RDF (Resource Description Framework) (http://www.w3.org/TR/PR-rdf-syntax/).
From the outset, the developers of Dublin Core aimed to create a format with the following characteristics:

  • Ease of creation and maintenance. The set of Dublin Core elements should be kept as small and simple as possible to enable easy and inexpensive bibliographic record creation by non-librarians, while ensuring effective retrieval of records in the networked environment.
  • Standardized record creation. Retrieval of information from within the vast Internet resources is hampered by differences in terminology and by usage of different descriptive rules in bibliographic record creation. As a commonly used standard, Dublin Core can help the so-called "digital tourist," a non-specialist in the field of information science, satisfy his or her information needs by providing access to a set of elements, whose meaning is universally known and understood.
  • International scope. DCMES was originally designed in English, but versions in many other languages have also been developed (http://dublincore.org/groups/languages/ - related), including Finnish, Norwegian, Thai, Japanese, French, Portuguese, German, Indonesian, Spanish and Polish.
  • Extensibility. In order to balance the need for simplicity in describing digital resources with the need for precision in information retrieval, Dublin Core developers recognized the importance of a mechanism that would enable extending the DC element set to facilitate searching through the diversity of Internet resources. It is expected that groups of experts from various backgrounds will develop and administer additional sets of metadata. It should be possible to link metadata elements from other developed sets to Dublin Core in order to enable its expandability (extensibility). This model allows different communities to use the Dublin Core elements to create basic descriptive records that would be usable across the Internet, while expanding them by additional elements, functional in limited areas. Special instructions for applying this model are currently under development.
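Because Dublin Core was designed for use in HTML, the simplicity of record creation can be shown with a brief sketch that writes a description as meta tags for the head of a web page. The sketch assumes the common convention of a "DC." prefix in the name attribute together with a link element declaring the schema; the function name and the sample record are hypothetical.

```python
from html import escape


def dc_to_meta(record):
    """Render (element, value) pairs as HTML meta tags.

    Elements are optional, repeatable and may appear in any order,
    so the record is a plain list of pairs rather than a fixed form.
    """
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record:
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


record = [
    ("Title", "Old and new standards for describing electronic documents"),
    ("Creator", "Nahotko, Marek"),
    ("Subject", "MARC"),
    ("Subject", "Dublin Core"),
    ("Language", "en"),
]

print(dc_to_meta(record))
```

The author of a document needs no knowledge of cataloging rules to produce such a description, which is precisely the ease of creation the developers had in mind.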

The creators of all metadata formats aim to make sure that there is a possibility of converting their formats to DCMES, a universally recognized and used metadata standard. Conversion is also possible between Dublin Core and some of the older metadata formats, mainly MARC. The concept of converting between DC and MARC led to the development of CORC.

Applying MARC and Dublin Core formats

Before introducing CORC, a project that illustrates a strong coexistence of MARC and Dublin Core, let us attempt to imagine the way libraries could apply these two formats in their functions. Libraries will continue to catalog certain Internet resources using MARC in order to provide access to them in their OPACs. Most likely, these resources will include electronic journals or databases with remote access purchased by the library. Therefore, it will be necessary to update bibliographic records with acquisitions information, which needs to be attached to the main record. This way bibliographic records of Internet resources purchased by the library will become visible in its computer catalog.

However, the situation with free Internet resources is rather different. Internet documents, such as organization or company websites, that were earlier cataloged in MARC will not require a full bibliographic record with acquisitions information. These types of documents are of great interest to catalogers who create metadata records using the Dublin Core standard. Similarly, Dublin Core metadata will be written into the source code of electronic documents created by the libraries themselves. As a result, there will no longer be a need to create "Important links" sections on library websites. A database of Dublin Core documents will perform this function. In addition, creating common dictionaries of controlled vocabulary for both databases (MARC-based OPACs and Dublin Core) seems indispensable, since it would facilitate simultaneous searching and any future merger of the databases.

The table below shows a comparison between Dublin Core and USMARC. It will help to assess the possibility for conversion between the two formats. The comparison is general and does not consider DC qualifiers or certain specific cases addressed by USMARC.

Table 1. Comparison of Dublin Core and USMARC [2].

Dublin Core Element    MARC Tag
Title                  Title statement (245 $a)
Creator                Personal name - main entry (100 $a)
Subject                Uncontrolled index term (653 $a)
                       LCSH Subject heading (650 $a)
Description            Summary, etc. note (520 $a)
Publisher              Name of publisher (260 $b)
Contributor            Personal name - added entry (700 $a)
Date                   Date of publication (260 $c)
Type                   Document type/form (655 $2)
Format                 Electronic format type (856 $q)
Identifier             Electronic resource identifier (856 $u)
Source                 Data source entry (786 $n)
Language               Language note (546 $a)
                       Language code (041 $a)
Relation               Nonspecific relationship entry note (787 $n)
                       Nonspecific relationship entry - other item identifier (787 $o)
Coverage               General note (500 $a)
Rights                 Terms governing use and reproduction (540 $a)
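A conversion based on Table 1 amounts to a simple lookup, as the following Python sketch suggests. It covers only some unqualified elements from the table, and it deliberately ignores MARC rules that a real converter would have to respect, for example merging publisher and date into a single 260 field, or moving a second Creator to field 700. All names in the code are hypothetical.

```python
# A subset of Table 1: Dublin Core element -> (MARC tag, subfield code).
DC_TO_MARC = {
    "Title": ("245", "a"),
    "Creator": ("100", "a"),
    "Subject": ("653", "a"),
    "Description": ("520", "a"),
    "Publisher": ("260", "b"),
    "Contributor": ("700", "a"),
    "Date": ("260", "c"),
    "Identifier": ("856", "u"),
    "Language": ("546", "a"),
    "Coverage": ("500", "a"),
    "Rights": ("540", "a"),
}


def dc_to_marc(record):
    """Map (element, value) pairs to (tag, subfield code, value) triples.

    Elements missing from the lookup table are skipped. The result is
    ordered by tag; indicators and fixed fields are not generated.
    """
    fields = []
    for element, value in record:
        if element in DC_TO_MARC:
            tag, code = DC_TO_MARC[element]
            fields.append((tag, code, value))
    return sorted(fields, key=lambda field: field[0])


record = [
    ("Title", "Old and new standards for describing electronic documents"),
    ("Creator", "Nahotko, Marek"),
    ("Date", "2002"),
    ("Identifier", "http://ebib.oss.wroc.pl/english/grant/nahotko.php"),
]

for tag, code, value in dc_to_marc(record):
    print(f"{tag} ${code} {value}")
```

The sketch also makes the limits of such a conversion visible: the resulting fields carry no indicators or fixed-field data, because Dublin Core has nothing from which to derive them.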

CORC - old and new metadata formats

CORC, the Cooperative Online Resource Catalog (http://corc.oclc.org/corc), is one of OCLC's most recent projects. It uses both MARC and DCMES to create databases of Internet resources selected for their high quality. Integration of CORC with the WorldCat collections will give the system a great chance of becoming a principal source of information about scholarly publications available on the World Wide Web. Even though there is currently no direct connection between CORC and WorldCat, the possibility of exporting records from WorldCat to CORC seems rather likely. Moving records from CORC to WorldCat, although technically possible, may cause problems when the records were created in Dublin Core. Since MARC is a more precise format than Dublin Core, such records might end up missing the appropriate MARC-specific subfield codes, tags and fixed-field information. Exchangeability of data is one of the key qualities of CORC. Project participants can import records in both MARC and Dublin Core. Their records are then converted to XML and can be displayed to the user in either of the two formats.

Even though converting qualified DC elements into MARC fields in CORC is technically simple, fast and error-free, the conceptual differences between the two formats can cause conflicts. This occurs, for example, when the DC element Creator is converted into MARC field 100 and the resulting author name does not correspond to the AACR2 authority form of the name. A similar problem can arise when the qualified DC element Contributor-Corporate is converted into field 710 even though the corporate name does not appear in any other field of the record, as AACR2 requires for MARC records. The majority of web resource developers who include metadata tags in their documents apply the DC standard, usually without any knowledge of MARC or AACR2. Dublin Core has many advantages because it was designed especially for cataloging Internet documents, while the first edition of AACR was written primarily for print materials and AACR2 is still highly "bibliocentric." Considering the possibilities it offers, as well as its ease of use, especially for metadata creators who are neither catalogers nor librarians, it is not surprising that Dublin Core is a preferred standard.

One of the most important features of CORC is its ability to create a preliminary record by extracting metadata from the tags of electronic documents, as well as by performing an intelligent content analysis to generate keywords and/or Dewey classification numbers.
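The first of these steps, extracting metadata from the tags of a document, can be sketched with Python's standard HTML parser. This is an illustration of the idea only and says nothing about how CORC itself is implemented; the class name and the sample page are hypothetical, and the sketch assumes that the page marks its Dublin Core metadata with a "DC." prefix in the name attribute of meta tags.

```python
from html.parser import HTMLParser


class DCMetaExtractor(HTMLParser):
    """Collect (element, value) pairs from meta tags named 'DC.*'."""

    def __init__(self):
        super().__init__()
        self.statements = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            # Strip the "DC." prefix and keep the element name.
            self.statements.append((name[3:], attributes.get("content") or ""))


page = """
<html><head>
<title>Sample page</title>
<meta name="DC.Title" content="Sample page">
<meta name="DC.Creator" content="Example Library">
<meta name="keywords" content="not Dublin Core">
</head><body></body></html>
"""

extractor = DCMetaExtractor()
extractor.feed(page)
for element, value in extractor.statements:
    print(element, "=", value)
```

A preliminary record built this way can only be as good as the tags supplied by the author of the page, which is why the text above speaks of a preliminary record that catalogers may go on to improve.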

In order to enhance the creation of bibliographic records in CORC, OCLC also developed a project to embed so-called "pathfinders" in the database of electronic document records. The term "pathfinders" refers to lists of key scholarly resources in a specific subject area. Pathfinders usually contain the most important encyclopedias, periodicals, dictionaries, sets of keywords and subject headings, and other resources available to the library user. Such lists of subject-oriented resources are the traditional equivalent of webpages that collect links accompanied by short descriptions to facilitate information retrieval on the web. Libraries can create webpages containing standard lists of links with descriptions, but they can also present the results of dynamic searches of the CORC catalog on the webpage. Search results are integrated and displayed to users in the form of links accompanied by descriptions. Despite some drawbacks resulting from a limited range of formatting options, the success of CORC rests on its ability to share the maintenance of URLs among all program participants. If one member institution corrects a URL, the links on all pathfinder pages change simultaneously in the systems of all other members.

As in the case of previous OCLC projects, CORC takes advantage of interlibrary cooperation in building its collections. One of the effects of this cooperation is a better chance of dealing with problems caused by the inevitable changes in the content and location of web resources. CORC offers prototype solutions that allow for checking and correcting URLs, as well as monitoring the content of web documents. This type of automatic control will mainly depend on the collective effort of members to keep updating the records of constantly changing documents. It can only be achieved through wide-scale cooperation, since corrections of URLs or resource records made by one library become available to all member libraries.

A demo version of the CORC system is available at the following address: http://www.oclc.org/corc/learning/demo/

Bibliography

  1. EDMUNDS, Jeff and Roger BRISSON. "Cataloging in CORC: A Work in Progress" [online].
    [Accessed March 14, 2001]. Available online at:
    <http://www.personel.psu.edu/faculty/r/o/rob1/corc/>.
  2. MCCALLUM, Sally. "Extending MARC for Bibliographic Control in the Web Environment: Challenges and Alternatives" [online].
    Washington DC: Library of Congress, December 2000 [Accessed April 4, 2001]. Available online at:
    <http://lcweb.loc.gov/catdir/bibcontrol/mccallum_paper.html>.
  3. MCCALLUM, Sally. "SGML/MARC Mapping" [online]. February 1997 [Accessed April 4, 2001]. Available online at:
    <http://www.columbia.edu/cu/libraries/inside/projects/sgml/sgmlmarc/lc.status.9702.html>.
  4. NAHOTKO, Marek. "Metadane." In EBIB [online], 2000, No. 6(14) [Accessed April 4, 2001]. Available online at:
    <http://ebib.oss.wroc.pl/arc/e014-02.html>.
  5. RZOŃCA, Irena and Krystyna SZYLHABEL. "Dokumenty elektroniczne w systemie APIN." Bibliotekarz, 2000, no. 3, p. 2-7.
  6. SANETRA, Krystyna. "Katalogowanie dokumentów elektronicznych" [online], Cracow: Jagiellonian Library, February 1999 [Accessed February 4, 2002]. Available online at:
    <http://phmm.bj.uj.edu.pl/~krystyna/kel.htm>.

Footnotes

[1] DTD (Document Type Definition) is a set of rules used to describe components of a document or a class of documents.

[2] Based on: http://www.loc.gov/marc/bibliographic/ecbdhome.html

Translated by Marta Sobieszek


Old and new standards for describing electronic documents [Electronic document] . - Access mode: http://ebib.oss.wroc.pl/english/grant/nahotko.php
Last modification: 2.01.2003