IR RESEARCH IN JAPAN

Japanese IR research covers essentially the same areas as research in the United States, with more emphasis on indexing techniques suited to Japanese and other major Asian languages (especially Chinese and Korean). Common indexing concerns include the following:

Dealing With Different Character Encodings

There are a number of standard two-byte character encodings for Japanese and Chinese (e.g., JIS and GB). Unicode is a developing standard that addresses most of the problems in this area.
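As a rough illustration of the encoding problem, the sketch below encodes the same short phrase under several legacy encodings and under one Unicode encoding, then decodes each back to a single Unicode string. The sample phrases, the choice of codecs (Shift_JIS, EUC-JP and ISO-2022-JP for the JIS character sets, GB2312 for GB) and the helper names are assumptions made for this example, not details taken from the systems discussed here.

```python
# Sketch only: sample strings, codec list and helper names are illustrative assumptions.
JAPANESE = "情報検索"  # "information retrieval" in Japanese
CHINESE = "信息检索"   # the same phrase in simplified Chinese


def encoded_forms(text, codecs):
    """Return {codec name: bytes} for one string under several encodings."""
    return {codec: text.encode(codec) for codec in codecs}


def to_unicode(raw, codec):
    """Decode legacy bytes into a Unicode string, the common internal form."""
    return raw.decode(codec)


japanese_forms = encoded_forms(
    JAPANESE, ["shift_jis", "euc_jp", "iso2022_jp", "utf-8"]
)
chinese_forms = encoded_forms(CHINESE, ["gb2312", "utf-8"])

# The byte sequences differ by encoding, but all decode to the same text.
for forms in (japanese_forms, chinese_forms):
    for codec, raw in forms.items():
        print(f"{codec:12} {len(raw):2d} bytes  {raw.hex()}")
```

An indexer that converts every incoming document to Unicode at the boundary only has to deal with the legacy encodings in one place.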

Input Methods for Japanese and Other Languages

Segmentation of Japanese into "Words"

A number of techniques based on dictionaries have been proposed to identify the words and phrases in Japanese text for indexing. Some of these techniques are now available commercially and achieve high accuracy rates.
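One simple dictionary-based approach is greedy longest-match segmentation, sketched below. It is shown only to make the idea concrete: the tiny dictionary, the sample sentence and the function name are assumptions for this example, and the commercial systems mentioned above may work quite differently.

```python
# Sketch only: a toy dictionary and greedy forward longest-match segmentation.
TOY_DICTIONARY = {"情報", "検索", "情報検索", "システム", "の", "研究"}


def segment(text, dictionary):
    """Split text into words, always taking the longest dictionary match.

    Characters that start no dictionary entry are emitted as
    single-character tokens so the loop always makes progress.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


print(segment("情報検索システムの研究", TOY_DICTIONARY))
```

Greedy matching is easy to implement but commits to the first long match it finds, so its accuracy depends heavily on how complete the dictionary is.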

Morphological Analysis

Identifying and normalizing inflected word forms is less important for Japanese than for some other languages, and techniques have been developed to handle it as part of segmentation.

N-gram Indexing

To avoid the segmentation step, many researchers have proposed indexing Japanese (and Chinese) text using overlapping character pairs (bigrams) or triples (trigrams). Experimental results have shown that retrieval based on n-gram indexing is approximately as effective as retrieval based on segmented words.
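A minimal sketch of character bigram indexing follows. The three sample documents, the function names and the simple Boolean AND matching are assumptions for illustration; the experiments referred to above used their own collections and ranking methods.

```python
# Sketch only: an inverted index over overlapping character n-grams.
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=2):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, query, n=2):
    """Return ids of documents containing every n-gram of the query."""
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# Hypothetical mini-collection, invented for this example.
DOCS = {1: "情報検索の研究", 2: "機械翻訳の研究", 3: "情報の検索"}
INDEX = build_index(DOCS)
print(search(INDEX, "情報検索"))
print(search(INDEX, "研究"))
```

No dictionary is needed, at the cost of a larger index. Note that matching on the set of n-grams only filters candidates: it does not check that the n-grams occur next to each other in the document.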

The Use of Controlled Vocabularies and Ontologies for Indexing

As more work is done on these problems, the language-dependent indexing issues are becoming less important. Full-text indexing for Japanese is now relatively straightforward, and Japanese IR researchers are increasingly focused on the core issues of effectiveness and efficiency. Some of the specific research areas mentioned during the WTEC industrial and university visits were as follows:

There was also significant interest in multilingual systems. The main focus was on supporting the major Asian languages in one system and, to a lesser extent, English. Some work is being done on cross-lingual retrieval (asking questions in one language and retrieving documents in many languages), but it is currently more an area of interest than an active research topic. Machine translation was mentioned frequently and is considered an important part of a multilingual digital library application. The primary focus in this area was on translation from English to Japanese, particularly for Internet applications. Translation pairs between other Asian languages and English also exist or are being developed.

A number of library-based digital library applications were discussed during the WTEC visit. These applications typically used traditional indexing approaches, with controlled vocabularies, manual indexing, and catalogs for searching. The issues involved in creating and searching online public access catalogs (OPACs) as text databases are essentially the same as those studied in the United States. As an extension of this approach, metadata standards for information on the Internet (such as the Dublin Core) are being developed and debated in the international community. Japanese information scientists are part of that debate.
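To make the metadata idea concrete, the sketch below stores records keyed by the fifteen Dublin Core element names and runs a fielded search over them, much as a catalog search restricts a query to one field. The sample records, their values and the helper names are hypothetical and invented for this example.

```python
# Sketch only: records keyed by the fifteen Dublin Core element names.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**fields):
    """Build a metadata record, rejecting names outside the element set."""
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return dict(fields)


def match_records(records, element, term):
    """Return records whose given element contains term (case-insensitive)."""
    term = term.lower()
    return [r for r in records if term in str(r.get(element, "")).lower()]


# Hypothetical records; titles and subjects are made up for illustration.
RECORDS = [
    make_record(title="Sample Report A", subject="Information retrieval",
                language="ja"),
    make_record(title="Sample Report B", subject="Machine translation",
                language="en"),
]

for record in match_records(RECORDS, "subject", "retrieval"):
    print(record["title"])
```

Every element is optional in this sketch, so a record carries only the fields a cataloger chose to fill in.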


Published: February 1999; WTEC Hyper-Librarian