CHAPTER 5

TEXT

W. Bruce Croft

BACKGROUND

Much of the information in digital library or digital information organization applications is in the form of text. Even when the application focuses on multimedia objects such as images, video or audio, these objects are primarily described using text or words and phrases from controlled vocabularies such as the Library of Congress Subject Headings. Compared to the data stored in traditional database systems, text is relatively unstructured and has less well-defined semantics. Consequently, the processes of indexing (both manual and automatic) and retrieving textual information are fundamental to digital library applications. Information retrieval (IR) is an area of both computer and information science that studies these processes.

Research on effective and efficient automatic techniques for IR has been underway in the United States and Europe since the 1960s (Sparck Jones and Willett 1997). Dramatic improvements in disk storage technology and then the growth of Internet information access led to a significant increase in IR research activity in the 1990s. The U.S. government has also provided substantial funding in this area through efforts such as the DARPA TIPSTER project and the NSF Digital Library Initiative. The TIPSTER project focused on the development of new techniques for text retrieval, filtering and extraction. The TREC annual evaluation conference was started by DARPA in conjunction with TIPSTER and is now run by the National Institute of Standards and Technology. TREC has resulted in standard text collections and evaluation methodologies that have increased both the size of the text-related research community and the number of new results.

In recent years, the IR research community in the United States has broadened its interests beyond retrieval to include new areas such as filtering, distributed and scalable IR, cross-lingual IR, information extraction, summarization, visualization, text data mining, and event detection.

In Japan, information retrieval research was not a major focus of computer and information scientists until much more recently. Because of the relative difficulty of indexing Japanese text, much of the earlier work in Japan focused on hardware and software techniques to support exact matching of Boolean combinations of text strings. The first studies on the effectiveness of ranking techniques with Japanese text databases were not done until the 1990s (e.g., Ogawa et al. 1993), and some of those were done in the United States as part of the TIPSTER project (Fujii and Croft 1993). The advent of the Internet, the visibility of the U.S. government initiatives, and the general commercial interest in text search have changed this situation considerably. Many industry laboratories and university groups are now working on a range of technologies for text. In addition, an increasing number of papers from Japan are submitted to the major IR conferences, new conferences focusing on Asian language issues have been started (e.g., IRAL 1997), and government initiatives exist in areas such as digital libraries.


Published: February 1999; WTEC Hyper-Librarian