Catalogues and cataloguing in digital libraries

The assembly of materials from many different sources and in many different formats is a challenge to techniques of bibliographic control developed in the context of static, print-based collections. Digital resources can be created with no information about their provenance, can change location or disappear without warning, can have content which changes rapidly, and can lack any kind of quality control. Some more familiar media found in digital libraries, such as still and moving images, also pose problems because they are especially difficult to describe for effective retrieval. Unfortunately, providing access to information remains a sophisticated task not amenable to automation.

There are two levels at which cataloguing of digital libraries might be considered. The "micro" scale of describing individual items has seen extensions to existing cataloguing rules to cope with the vagaries of digital information outlined above. Of fundamental importance in the digital library is the information represented by a link between two resources, such as scholarly papers, and there are several proposals for maintaining these relationships against passing changes in URLs. At the "macro" scale of describing collections or archives and ways to access them, there have been new approaches to integrating catalogues and much reflection on the purpose of those catalogues. Is the cataloguer's function now the creation of bibliographies of high-quality websites rather than a simple list of resources? Should there be one super-catalogue or many catalogues searchable at once?

A theme common to many of the sites presented below is that the imposition of a single comprehensive solution is not only impractical but quite possibly undesirable. Resources likely to be combined in a digital library are so various in type and so widely distributed that it is sensible to attempt only interoperability rather than centralisation. The use of Z39.50 as a neutral intermediary between databases and the deliberate simplicity of Dublin Core are examples of the power of this approach.
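As a rough illustration of interoperability through a deliberately simple common format, the sketch below translates two differently structured records into a handful of Dublin Core elements (title, creator, subject, date, identifier). It is not taken from any of the sites reviewed: the crosswalk tables, the heavily simplified MARC-like field tags, the gateway field names, the sample records and the `to_dublin_core` helper are all assumptions made for the example.

```python
# Hypothetical crosswalks: source field name -> Dublin Core element name.
# The "marc_like" tags are a heavily simplified stand-in for MARC fields.
CROSSWALKS = {
    "marc_like": {"245a": "title", "100a": "creator",
                  "650a": "subject", "260c": "date"},
    "gateway": {"name": "title", "author": "creator",
                "keywords": "subject", "url": "identifier"},
}


def to_dublin_core(record, scheme):
    """Translate a source record into a dict of Dublin Core elements.

    Dublin Core elements are repeatable, so every value is kept as a list.
    Source fields with no mapping are silently dropped, which is the cost
    of using a simple intermediate format.
    """
    mapping = CROSSWALKS[scheme]
    dc = {}
    for field, value in record.items():
        element = mapping.get(field)
        if element is None:
            continue  # no Dublin Core equivalent in this crosswalk
        values = value if isinstance(value, list) else [value]
        dc.setdefault(element, []).extend(values)
    return dc


# Invented sample records in the two source formats.
library_record = {"245a": "An example title", "100a": "Example, Author",
                  "260c": "1999", "300a": "xii, 200 p."}
gateway_record = {"name": "An example website", "author": "Example Institute",
                  "keywords": ["metadata", "cataloguing"],
                  "url": "http://www.example.org/"}

library_dc = to_dublin_core(library_record, "marc_like")
gateway_dc = to_dublin_core(gateway_record, "gateway")
```

Once both records share the same element names they can be indexed and displayed together, although the physical description in the library record (the hypothetical field 300a) has been lost in translation.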

Finding the sites

Most of the online information about cataloguing refers to electronic resources, so it was sufficient to search for terms like "catalog(u)ing" and "bibliographic control" in the subject directories BUBL Link / 5:15 and PICK. In turn, these referred to e-journals such as Ariadne, D-lib magazine and the Journal of digital information, which were searched or scanned for relevant material. The directories also listed useful bibliographies of websites such as those provided by IFLA, UKOLN and the ALA as well as the Meta matters site. Free-text search engines were considered but dismissed because of their tendency towards unmanageably high recall.

It is in the nature of the Web that one resource leads to countless others, and in many cases it isn't possible to say how the sites below were found beyond following a trail of links from the bibliography of one article to the home page of another institution and from there to another site entirely...

Resources

  1. Metadata: cataloging by any other name ... by Jessica Milstead and Susan Feldman. Online 23(1), 1999.

    Argues that cataloguing, in the guise of metadata, remains essential because of, not despite, the dynamic and intangible nature of electronic material. Textual metadata such as subject headings and keywords allow a variety of documents to be matched both consistently and correctly, even if those documents are not themselves textual. As well as identifying the intellectual content of a work, as is familiar from book cataloguing, metadata can record a resource's level and appropriateness, or administrative information such as access and copyright conditions. The accompanying piece "Metadata projects and standards" gives a valuable guide to the multiplicity of conflicting metadata schemes under development, and there is a bibliography of metadata resources on the Web.

  2. Options for organizing electronic resources: the coexistence of metadata by Sherry L. Vellucci. Bulletin of the American Society for Information Science 24(1), 1997.

    The coexistence of a profusion of metadata schemes created for widely different datasets need not be a problem provided that broad provision is made for interoperability. The set of headings known as the Dublin Core is a compromise between the poor quality of automatic indexing and the expense and difficulty of human evaluation; its simplicity makes it a suitable intermediate framework for translating other standards. The author imagines a "metacatalog" which, like a digital library, would exploit a common protocol for communication between resources to integrate different catalogues. This would allow identification and selection of relevant sources for searching (OPACs, subject gateways, and so forth) with documents presented uniformly regardless of format or language.

  3. Dublin Core Metadata Initiative

    The home page of the Dublin Core scheme for description of electronic and other resources. Dublin Core is the most prominent of several such frameworks because it is designed to be completely general, yet capable of both refinement and extension to improve retrieval in specialised subject areas. This site provides authoritative technical information on the scheme and reports on work in progress. There is little direct guidance on applying Dublin Core; instead, guidance is given implicitly by a large collection of links to sites that have implemented it.

  4. Z39.50 for all by Paul Miller. Ariadne 21, 1999.

    An introduction to the Z39.50 protocol for standardising information retrieval. Interfaces to distributed information resources such as library catalogues, museum inventories and online databases take many forms, so searching a number of them requires subtly different approaches. An interface using Z39.50 acts as an intermediary between the user and a database, converting requests and results into formats understood by each party. An important corollary of this is that multiple sources can be queried simultaneously, so that it is no longer necessary to know in advance the best locations to search. Digital libraries can rely on the protocol to unify disparate sources without restricting each source's information retrieval capabilities.

  5. A distributed architecture for resource discovery using metadata by Michael Roszkowski and Christopher Lukas. D-lib magazine, June 1998.

    Considers existing collections of evaluated Internet resources as "third-party metadata records" which alleviate the well-known problems of Internet search engines such as low precision and low quality. Describes Project Isaac, aimed at providing an interface for searching multiple distributed directories as if they formed a single collection in a digital library. Since these directories already consist of metadata records for their sites, it is possible to index that metadata, mapped to an extended implementation of the Dublin Core framework. There are difficulties in merging metadata from different sources, such as incompatibility between different subject description schemes, which may hamper effective searching. Like UKOLN's ROADS software, but unlike Z39.50, Project Isaac's prototype interface searches the indexes rather than the resources themselves, conserving bandwidth and reducing search times.

  6. Cross-searching subject gateways: the query routing and forward knowledge approach by John Kirriemuir et al. D-lib magazine, January 1998.

    After listing the variety of subject gateways available on the Web, this paper points out some problems with the single-subject approach. In medicine, for example, there are gateways whose scope overlaps, while for interdisciplinary queries there may be no single appropriate gateway. Z39.50 and other interfaces make it possible to send a query to all relevant databases, but this has the disadvantage of overloading systems that may not contain any relevant records. Details are given of a technique for routing queries to the most suitable sources by searching previously constructed indexes. There is a discussion of some potential problems and the possibility of browsing, rather than searching, a collection of subject gateways.

  7. Cataloging Internet resources: a manual and practical guide edited by Nancy B. Olson. 2nd ed., 1997.

    A manual of interpretations of and extensions to the Anglo-American Cataloguing Rules with a wide selection of examples, preceded by a concise discussion of the rationale behind cataloguing Internet sites. The general approach is to follow the standards laid down in AACR, both for computer files and for all types of document, as far as possible, even where the rules could be made more appropriate for electronic resources on the Internet. The majority of the manual deals with descriptive cataloguing issues, with access points and classification seen as presenting few problems unique to electronic material.

  8. Metadata: cataloguing practice and Internet subject-based information gateways by Ann Chapman, Michael Day and Debra Hiom. Ariadne 18, 1998.

    Compares the effectiveness of cataloguing Internet resources using traditional methods with the collection of resources for a subject gateway (SOSIG), where cataloguing is combined with selection and evaluation. SOSIG uses a simple template for its records, similar to the Dublin Core format, and this is considered adequate for the purpose because cataloguers have the opportunity for much fuller description of content than is normally permitted in a library catalogue. This compensates for the frequent lack of information about an electronic resource, such as the institution responsible for "publishing" it or the date it was created.

  9. Distributed and part-automated cataloguing: a DESIRE issues paper by Emma Worsfold. 1998.

    DESIRE is an EU-funded project investigating Web indexing and subject cataloguing. This report suggests two possibilities for increasing the efficiency of subject gateway maintenance: distributed cataloguing (in which the laborious task of discovering and evaluating resources is shared between many people) and automatic cataloguing (where metadata is generated by computers). Procedures for part-time paid work by librarians and submissions from volunteers to various subject gateways are described and compared with keeping record creation in-house and developing cross-searching techniques to benefit from other gateways' collections. Work on the automatic "harvesting" of metadata from within websites is mentioned.

  10. The virtual union catalog: a comparative study by Karen Coyle. D-lib magazine 6(3), 2000.

    Examines the idea of a virtual union catalogue created by providing a common search interface to the several catalogues of University of California libraries, contrasted with an existing union catalogue containing all records from those catalogues and maintained in parallel to them. Equivalent searches were made of both union catalogues and individual libraries' catalogues and the results compared. There were significant differences in the number of records retrieved, suggesting that applying a common search interface to disparate systems can give misleading and unpredictable results.

  11. Mapping entry vocabulary to unfamiliar metadata vocabularies by Michael Buckland et al. D-lib magazine 5(1), 1999.

    Reports on the development of natural language indexes allowing searchers to exploit unfamiliar controlled vocabularies, an increasing problem as a wider variety of digital resources becomes available. Rather than relying on links created by human indexers, the project uses statistical techniques to form associations between preferred terms and the concepts sought by users training the system. Indexes are created only for small subject areas in order to maximise their precision. An incidental benefit is that multilingual access to metadata is made possible.

  12. Who will create the metadata for the Internet? by Charles F. Thomas and Linda S. Griffin. First Monday 3(12), 1998.

    Puts forward the provocative thesis that information creators have no economic incentive to create extensive metadata describing their work, with the implication that it will be generated only by librarians and for scholarly projects. Academic institutions and government agencies have already provided metadata for their resources but it has proved to be extremely time-consuming and, as with traditional library cataloguing, its benefits are difficult to quantify. Commercial indexes such as Yahoo! are proposed as the best source of metadata because their profits from advertising depend on widespread use, which depends in turn on effective retrieval.
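The query-routing idea summarised under item 6 can be sketched in a few lines. This is only a minimal illustration of the general principle of consulting a previously built index before contacting any gateway, not the technique as implemented in that paper: the gateway names, the sample record descriptions and the functions `build_forward_index` and `route_query` are all hypothetical.

```python
def build_forward_index(gateways):
    """Map each term to the set of gateways whose records contain it."""
    index = {}
    for name, records in gateways.items():
        for record in records:
            for term in record.lower().split():
                index.setdefault(term, set()).add(name)
    return index


def route_query(query, index):
    """Return gateways ranked by how many query terms their records hold.

    Gateways matching no term are left out altogether, so they are never
    sent the query. Ties are broken alphabetically by gateway name.
    """
    scores = {}
    for term in query.lower().split():
        for name in index.get(term, ()):
            scores[name] = scores.get(name, 0) + 1
    return sorted(scores, key=lambda name: (-scores[name], name))


# Invented gateways, each represented by a few record descriptions.
gateways = {
    "medicine_gateway": ["clinical trials register",
                         "public health statistics"],
    "social_science_gateway": ["health policy research",
                               "census statistics"],
    "engineering_gateway": ["bridge design standards"],
}

forward_index = build_forward_index(gateways)
targets = route_query("health policy", forward_index)
```

Here an interdisciplinary query for "health policy" is routed to the social science and medicine gateways only, sparing the engineering gateway a search that could not return relevant records.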

Compiled by Owen Massey for the Advanced Internet & Digital Libraries module.
Links checked 22 March 2000. 1,855 words.