Japanese IR research covers basically the same areas as in the United States, although there is more emphasis on indexing techniques appropriate for Japanese and other major Asian languages (especially Chinese and Korean). Indexing issues that are common concerns include the following:

Dealing With Different Character Encodings

There are a number of standard two-byte character encodings for Japanese and Chinese (e.g., JIS and GB). Unicode is a developing standard that addresses most of the problems in this area.
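
As an illustration of the encoding problem, the sketch below encodes short sample strings under several legacy two-byte encodings and then decodes each back into a single Unicode representation. The sample strings and the helper name `to_unicode` are assumptions introduced for illustration; the codec names are those shipped with Python's standard library.

```python
# Sample text: "Japanese language" and "Chinese writing" (illustrative only).
JAPANESE = "日本語"
CHINESE = "中文"

# Map each legacy encoding to a string it can represent.
samples = {
    "shift_jis": JAPANESE,  # JIS-family encoding
    "euc_jp": JAPANESE,     # another JIS-family encoding
    "gb2312": CHINESE,      # GB encoding for simplified Chinese
}


def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from a known legacy encoding into a Unicode string."""
    return raw.decode(source_encoding)


# The bytes as they might arrive from documents in different encodings.
legacy_bytes = {enc: text.encode(enc) for enc, text in samples.items()}

# Normalize everything to Unicode strings before indexing.
round_tripped = {enc: to_unicode(raw, enc) for enc, raw in legacy_bytes.items()}

# One common serialization (UTF-8) for storage after normalization.
utf8_bytes = {enc: text.encode("utf-8") for enc, text in round_tripped.items()}
```

The two Japanese encodings produce different byte sequences for the same text, which is why an indexer has to know the source encoding before it can normalize documents to one representation.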

Input Methods for Japanese and Other Languages

Segmentation of Japanese into "Words"

A number of techniques based on dictionaries have been proposed to identify the words and phrases in Japanese text for indexing. Some of these techniques are now available commercially and achieve high accuracy rates.
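
One simple dictionary-based approach is greedy longest-match segmentation, sketched below. This is an assumption chosen for illustration, not necessarily the method used by the commercial systems mentioned above; the function name `segment` and the toy dictionary are hypothetical.

```python
from __future__ import annotations


def segment(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation.

    At each position, take the longest dictionary entry that matches;
    if none matches, emit the single character as its own token.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Toy dictionary (hypothetical): "information", "retrieval",
# "information retrieval", "system", the particle "no", "research".
TOY_DICTIONARY = {"情報", "検索", "情報検索", "システム", "の", "研究"}

tokens = segment("情報検索システムの研究", TOY_DICTIONARY)
```

Greedy matching prefers the compound 情報検索 over its parts, which shows why the choice of dictionary directly shapes which index terms are produced.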

Morphological Analysis

The identification and normalization of inflected word forms is less important for Japanese than for some languages, and techniques have been developed to do this as part of segmentation.

N-gram Indexing

To avoid the segmentation process, many researchers have proposed indexing Japanese (and Chinese) using pairs (bigrams) or trigrams of characters. Experimental results with this technique have shown that retrieval based on n-gram indexing is approximately as effective as retrieval based on segmented words.
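
The idea can be sketched as a small inverted index over overlapping character bigrams. The function names, the sample documents, and the conjunctive (all n-grams must match) query rule are assumptions made for illustration; ranking is left out entirely.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=2):
    """Map each character n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, query, n=2):
    """Return ids of documents that contain every n-gram of the query."""
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# Hypothetical sample documents; no segmentation dictionary is needed.
docs = {
    1: "情報検索の研究",
    2: "機械翻訳の研究",
    3: "情報システム",
}
index = build_index(docs)
hits = search(index, "情報検索")
```

Because the query is matched as a set of bigrams rather than as a phrase, a document can match when it contains all the bigrams in scattered positions; a fuller system would check positions or score matches.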

The Use of Controlled Vocabularies and Ontologies for Indexing

As more work is done in this area, the language-dependent indexing issues are becoming less important. Full text indexing for Japanese is now relatively straightforward, and the focus of Japanese IR researchers is increasingly on the core issues of effectiveness and efficiency. Some of the specific research areas mentioned during the WTEC industrial and university visits were as follows:

There was also significant interest in multilingual systems. The major focus here was on supporting the major Asian languages in one system and, to a lesser extent, supporting English. There is some work being done on cross-lingual retrieval (asking questions in one language and retrieving documents in many languages), but it is currently more an area of interest than an active research topic. Machine translation was mentioned frequently and is considered an important part of a multilingual digital library application. The primary focus in this area was on translation from English to Japanese, particularly for Internet applications. Other Asian language and English translation pairs also exist or are being developed.

A number of digital library applications based on libraries were discussed during the WTEC visit, and these applications typically made use of traditional indexing approaches with controlled vocabularies, manual indexing, and catalogs for searching. The issues involved with creating and searching online public access catalogs (OPACs) as text databases are essentially the same as those studied in the United States. As an extension of this approach, metadata standards for information on the Internet (such as the Dublin Core) are being developed and debated in the international community. Japanese information scientists are part of that debate.

Published: February 1999