Much of the information in digital library or digital information organization applications is in the form of text. Even when the application focuses on multimedia objects such as images, video or audio, these objects are primarily described using text or words and phrases from controlled vocabularies such as the Library of Congress Subject Headings. Compared to the data stored in traditional database systems, text is relatively unstructured and has less well-defined semantics. Consequently, the processes of indexing (both manual and automatic) and retrieving textual information are fundamental to digital library applications. Information retrieval (IR) is an area of both computer and information science that studies these processes.
Research on effective and efficient automatic techniques for IR has been underway in the United States and Europe since the 1960s (Sparck Jones and Willett 1997). Dramatic improvements in disk storage technology, followed by the growth of Internet information access, led to a significant increase in IR research activity in the 1990s. The U.S. government has also provided substantial funding in this area through efforts such as the DARPA TIPSTER project and the NSF Digital Library Initiative. The TIPSTER project focused on the development of new techniques for text retrieval, filtering and extraction. The TREC annual evaluation conference was started by DARPA in conjunction with TIPSTER and is now run by the National Institute of Standards and Technology. TREC has resulted in standard text collections and evaluation methodologies that have increased both the size of the text-related research community and the number of new results.
In recent years, the IR research community in the United States has broadened its interests beyond retrieval to include new areas such as filtering, distributed and scalable IR, cross-lingual IR, information extraction, summarization, visualization, text data mining, and event detection.
In Japan, information retrieval research did not become a major focus of computer and information scientists until much more recently. Because of the relative difficulty of indexing Japanese text, much of the earlier work in Japan focused on hardware and software techniques to support exact matching of Boolean combinations of text strings. The first studies on the effectiveness of ranking techniques with Japanese text databases were not done until the 1990s (e.g., Ogawa et al. 1993), and some of those were done in the United States as part of the TIPSTER project (Fujii and Croft 1993). The advent of the Internet, the visibility of the U.S. government initiatives, and the general commercial interest in text search have changed this situation considerably. Many industry laboratories and university groups are now working on a range of technologies for text. In addition, an increasing number of papers from Japan are being submitted to the major IR conferences, new conferences focusing on Asian language issues have been started (e.g., IRAL 1997), and government initiatives exist in areas such as digital libraries.