Many libraries in the United States and in Japan have not yet cataloged all of their holdings, and may not have all of their existing catalog records in machine-readable format. A shortage of staff, sometimes rocky transitions from manual to automated cataloging workflows, and the "information explosion" have created arrearages, or backlogs, in cataloging departments that most institutions do not publicize. The discipline of cataloging has devised methods and policies to describe physical artifacts such as books, periodicals, microforms, sound recordings, and maps. These descriptions are largely based on the physical "container" in which the information resides, and thus are considered format-based description. (For more on bibliographic description, see AACR2.) Intellectual description, that is, data about the subject of the information in the "container" and a classification number reflecting subject analysis, is created by catalogers. The Library of Congress develops and maintains a very large thesaurus of controlled subject headings with documentation of references and related terms (see LC Subject Headings). The Library of Congress also maintains the LC Classification Schedule and the Dewey Decimal Classification Schedule. Variants of these classification schemes are being used by libraries in Japan. The library community's cooperation in development of these systems has made possible interoperability and interchange of bibliographic information on a global basis. Cooperatively built databases of bibliographic information such as OCLC and RLIN in the United States, and NACSIS in Japan, provide economies of scale as libraries collectively create and share the world's bibliography.
A discussion of "granularity," or the level at which an item is described, is a conceptual key for understanding digital information organization. "Item level" cataloging is probably most familiar to readers as they use online library catalogs to find monographs and multimedia materials. That is, one cataloging record is made for one work. Archives and special collections often catalog at the "collection level," insofar as it is not feasible to individually describe every letter in a huge archive or assign meaningful classification numbers to millions of photographs. With indexed journal articles, the "item" to be cataloged might be the title of the journal along with an accounting of the individual issues, or holdings. Article-level indexing information gives further description of the intellectual content of "pieces" of each issue of the journal. Alternatively, one catalog entry might exist only at the title level of the serial publication, without the more in-depth indexing information. Clearly, article-level indexing provides greater access and description; it is also more expensive and labor-intensive to create and maintain. The topic of granularity of description is important because the creation of cataloging data is one of the more expensive aspects of traditional library methods of providing access to materials.
The digital world requires this same cataloging data as well as information necessary to structure and present electronic documents. The library community is cooperating with professionals from the computer science, text encoding, and museum communities to develop the Dublin Core metadata standard, a fifteen-element set to describe digital resources (see Dublin Core Metadata). Generally, metadata as discussed in this context falls into three major categories.
Physical/structural metadata is information about the digital object and its relationship to other digital objects in a repository. Structural metadata might include file location on a server or in a repository; file format; file size; relationship to other files; sequence; and date of creation. For example, a sequence of 35 mm photographic negatives may have been imaged. To present the negatives in the order in which the photographer created them, information is needed to structure the images in the original sequence, in addition to the format and size of the file, date of creation, and internal numbering scheme. For books on library shelves, the analogous data would be the shelf number (physical location); the number of pages (size); the fact that the pages are numbered (sequencing); and the binding (defining the item). Recording this information provides a logical way to structure the materials as well as a means of indicating their provenance.
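The negatives example can be sketched in code. The following is a minimal illustration only: the class name, the field names, and the sample values are assumptions made for this sketch and do not come from any metadata standard or from the projects discussed here. It records the structural elements listed above for each image file and uses the frame number to restore the photographer's original order.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StructuralMetadata:
    """Hypothetical structural record for one scanned negative."""
    location: str      # file location on a server or in a repository
    file_format: str   # for example "TIFF"
    size_bytes: int    # file size
    frame_number: int  # position on the original roll (internal numbering)
    created: date      # date the digital file was created


def in_original_order(records):
    """Return the records sorted by their position on the roll."""
    return sorted(records, key=lambda record: record.frame_number)


# Illustrative values only; files are listed in the order they were scanned.
scans = [
    StructuralMetadata("roll07/c.tif", "TIFF", 5_200_000, 3, date(1996, 5, 2)),
    StructuralMetadata("roll07/a.tif", "TIFF", 5_100_000, 1, date(1996, 5, 2)),
    StructuralMetadata("roll07/b.tif", "TIFF", 5_150_000, 2, date(1996, 5, 3)),
]

for record in in_original_order(scans):
    print(record.frame_number, record.location)
```

Sorting on an explicit frame number rather than on file name or creation date reflects the point made above: the original sequence has to be recorded as data, because nothing else about the files guarantees it.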
The term intellectual metadata refers to information that provides access to the subject or content of a digital object. Intellectual metadata can be thesaurus terms associated with a file or item; indexing achieved by full text search and retrieval as described in Dr. Croft's Chapter 5; classification according to standard schema; and associations with related sources of intellectual data such as bibliographies, archival finding aids, or cataloging records. Again using the example of 35 mm negatives from a roll of film, intellectual metadata would consist of who created the images (photographer/author), controlled or uncontrolled vocabulary terms describing the images, and perhaps a classification number. The related data might be the photographer's captions or references to a work in which the images were published.
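One way to picture how subject terms provide access is a simple index from terms to items. The sketch below is an assumption-laden illustration, not a description of any library system: the function name, the item identifiers, and the subject terms are invented for the example, and real controlled vocabularies carry references and related terms that are not modeled here.

```python
from collections import defaultdict


def build_subject_index(records):
    """Map each normalized subject term to the set of item IDs that carry it."""
    index = defaultdict(set)
    for item_id, terms in records.items():
        for term in terms:
            index[term.strip().lower()].add(item_id)
    return index


# Illustrative terms only, not drawn from any real thesaurus.
negatives = {
    "frame-01": ["Bridges", "Rivers"],
    "frame-02": ["Bridges"],
    "frame-03": ["Street scenes"],
}

subject_index = build_subject_index(negatives)
print(sorted(subject_index["bridges"]))
```

A lookup on a term then returns every image described with it, which is the kind of access that uncataloged images cannot offer.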
Rights and permissions (access management) metadata encodes rights and permissions information for digital objects at the level of the individual computer file. For example, an archival collection of photographs may have been made available to researchers, one at a time, in a special collections reading room for many years. However, the literary trustee of that collection may nonetheless object to widespread dissemination of these images on the World Wide Web. Thus, rights and permissions data must be associated with each image to indicate its status for distribution.
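A per-image distribution status of this kind might be checked as follows. This is a hedged sketch under stated assumptions: the two status values, the function name, and the image identifiers are hypothetical, and an actual rights vocabulary would likely need more categories. The one design choice shown is that an image with no rights record is treated as restricted.

```python
from enum import Enum


class Distribution(Enum):
    """Hypothetical distribution statuses for a digital image."""
    READING_ROOM_ONLY = "reading_room_only"
    PUBLIC_WEB = "public_web"


def may_publish_on_web(rights_by_image, image_id):
    """Allow web display only when a record explicitly permits it.

    An image with no rights record is treated as restricted.
    """
    return rights_by_image.get(image_id) is Distribution.PUBLIC_WEB


# Illustrative identifiers and statuses only.
rights_by_image = {
    "photo-001": Distribution.PUBLIC_WEB,
    "photo-002": Distribution.READING_ROOM_ONLY,
}

for image_id in ("photo-001", "photo-002", "photo-003"):
    print(image_id, may_publish_on_web(rights_by_image, image_id))
```

Defaulting to the restrictive answer matches the scenario above, in which material long available in a reading room may still not be cleared for the Web.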
As mentioned previously, the creation of metadata has traditionally been one of the most expensive aspects of making library materials available. At the Library of Congress, one of the key decision factors when selecting collections for digitization is whether or not cataloging information already exists for a collection, especially in terms of intellectual metadata. The costs of scanning and even having text keyed and proofed are minimal in comparison with paying subject experts and professional catalogers to describe materials according to standardized methodology. Strategies for minimal level cataloging and reducing cataloging access points abound, but the activity remains very expensive.
Several aspects of current library practice in the United States and Japan affect the feasibility of a global digital library at many levels. The costs and complexity of creating the data needed to present digitally reformatted materials have been discussed. Additionally, if this information is not created accurately and with future presentation needs in mind, the digital materials can be unusable. For example, if a book is mislabeled or misshelved in library stacks, it can still be found through inspection by a deck attendant or by users. By contrast, if a digital file is not linked correctly to its related bibliographic record, finding aid, or previous pages in a sequence, it is essentially irretrievable. Thus, data must be created and checked for quality at a high level to ensure usability (see LC RFP 96-18).
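The linking problem described above lends itself to an automated check. The following sketch is illustrative only and is not taken from LC RFP 96-18: the function name, the dictionary keys "record" and "previous", and the sample identifiers are all assumptions. It reports page files whose bibliographic record or preceding page cannot be found.

```python
def find_broken_links(files, record_ids):
    """Return (file_id, problem) pairs for links that do not resolve.

    files maps a file ID to a dict with two hypothetical keys:
    "record" (ID of the related bibliographic record) and
    "previous" (ID of the preceding page file, or None for the first page).
    """
    problems = []
    for file_id, links in files.items():
        if links["record"] not in record_ids:
            problems.append((file_id, "bibliographic record not found"))
        previous = links["previous"]
        if previous is not None and previous not in files:
            problems.append((file_id, "previous page not found"))
    return problems


# Illustrative data: page-3 points at a record and a page that do not exist.
record_ids = {"bib-100"}
files = {
    "page-1": {"record": "bib-100", "previous": None},
    "page-2": {"record": "bib-100", "previous": "page-1"},
    "page-3": {"record": "bib-999", "previous": "page-0"},
}

for file_id, problem in find_broken_links(files, record_ids):
    print(file_id, problem)
```

A check like this catches only dangling references; a file linked to the wrong but existing record would still need human review.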