ELECTRONIC DICTIONARY RESEARCH (EDR) PROJECT

History and Goals

EDR was spun out from ICOT in 1986 with a nine-year charter to develop a large-scale, practical electronic dictionary system that could be used in support of a variety of natural language tasks, such as translation between English and Japanese, natural language understanding and generation, speech processing, and so forth.

EDR was established as a consortium in collaboration with eight member corporations, each of which supports a local EDR research group. About 70 percent of EDR funding comes from the Japan Key Technology Center (a government agency established by MPT and MITI); roughly 30 percent comes from the member companies. EDR was established with a nine-year budget of approximately $100 million. This money pays for a staff of about 70 at EDR headquarters and at eight laboratories located at the member companies. About 50 of the staff are at the laboratories, and they are experienced in both lexicography and computer science. EDR places orders with publishing companies for various types of raw data for the electronic dictionaries. The number of workers at these publishers peaked at about 300 and is currently less than 100.

The EDR conception of an electronic dictionary is quite distinct from the conventional online dictionaries that are now common. These latter systems are designed as online reference books for use by humans. Typically, they contain the exact textual contents of a conventional printed dictionary stored on magnetic disk or CD-ROM. Usage is analogous to a conventional dictionary, except that the electronic version may have hypertext-like features for more rapid and convenient browsing.

EDR's primary goal is to capture "all the information a computer requires for a thorough understanding of natural language" (JEDRI 1990). This includes the meanings (concepts) of words; the knowledge needed for the computer to understand the concepts; and the information needed to support morphological analysis and generation, syntactic analysis and generation, and semantic analysis. In addition, information about word co-occurrences and listings of equivalent words in other languages are necessary. In short, the system is intended to be a very large but shallow knowledge base about words and their meanings.

The goal of the EDR research is to produce a set of electronic dictionaries with broad coverage of linguistic knowledge. This information is intended to be neutral, in that it should not be biased toward any particular natural language processing theory or application; extensive, in its coverage of common general-purpose words and of words from a corpus of technical literature; and comprehensive, in its ability to support all stages of linguistic processing, such as morphological, syntactic, and semantic processing, the selection of natural wording, and the selection of equivalent words in other languages.

Accomplishments

Component Dictionaries

EDR is well on the way to completing a set of component dictionaries which collectively form the EDR product. These include word dictionaries for English and Japanese, the Concept Dictionary, co-occurrence dictionaries for English and Japanese, and two bilingual dictionaries: English to Japanese and Japanese to English. Each of these is extremely large scale, as follows:

Word Dictionaries
     General Vocabulary:     
          English              200,000 Words
          Japanese             200,000 Words

     Technical Terminology:
          English              100,000 Words
          Japanese             100,000 Words

Concept Dictionary            400,000 Concepts
          Classification
          Descriptions

Co-Occurrence Dictionaries
          English              300,000 Words
          Japanese             300,000 Words

Bilingual Dictionaries
          English-Japanese     300,000 Words
          Japanese-English     300,000 Words

The several component dictionaries serve distinct roles. All semantic information is captured in the Concept Dictionary, which is independent of any surface language. The Concept Dictionary is a very large semantic network capturing pragmatically useful concept descriptions (we return to this below). The word dictionaries capture surface syntactic information (e.g., pronunciation, inflection) peculiar to each surface language. Each word entry consists of the "headword" itself, grammatical information, and a pointer to the appropriate concept in the Concept Dictionary.

The co-occurrence dictionaries capture information on appropriate word combinations. Each entry consists of a pair of words coupled by a co-occurrence relation (there are several types of co-occurrence relations). The strength of a co-occurrence relation is indicated by a certainty factor, whose value ranges from 0 to 255; zero means that the words cannot be used together. This information is used to determine that "he drives a car" is allowable, that "he drives a unicycle" is not, and that "he rides a unicycle" is. It can also be used to determine the appropriate translation of a word. For example, the Japanese word naosu corresponds to several English words, e.g., "modify," "correct," "update," and "mend." However, if the object of naosu is "error," then the appropriate English equivalent is "correct."

Finally, the bilingual dictionaries provide surface information on the choice of equivalent words in the target language, as well as information on correspondence through entries in the Concept Dictionary.
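
The following is a minimal sketch, in Python, of how a co-occurrence dictionary with certainty factors might drive both acceptability checks and translation selection. All names and data values here are illustrative assumptions, not EDR's actual format.

    # Toy co-occurrence table: (word, relation, word) -> certainty factor.
    # Factors range from 0 (the words cannot be used together) to 255.
    COOCCURRENCE = {
        ("drive",   "object", "car"):      200,
        ("drive",   "object", "unicycle"):   0,
        ("ride",    "object", "unicycle"): 180,
        ("correct", "object", "error"):    220,
        ("modify",  "object", "error"):     30,
        ("update",  "object", "error"):     40,
        ("mend",    "object", "error"):      0,
    }

    def allowed(verb, relation, noun):
        # A combination is usable only if its certainty factor is nonzero.
        return COOCCURRENCE.get((verb, relation, noun), 0) > 0

    def best_equivalent(candidates, relation, noun):
        # Pick the target-language candidate with the strongest
        # co-occurrence with the given noun; None if none co-occur.
        cf, verb = max((COOCCURRENCE.get((v, relation, noun), 0), v)
                       for v in candidates)
        return verb if cf > 0 else None

    print(allowed("drive", "object", "car"))       # True
    print(allowed("drive", "object", "unicycle"))  # False
    # Choosing among English equivalents of the Japanese verb "naosu":
    print(best_equivalent(["modify", "correct", "update", "mend"],
                          "object", "error"))      # "correct"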

From the point of view of KBS, the Concept Dictionary is the most interesting component of the EDR project. As mentioned above, this is a very large semantic network capturing roughly 400,000 word meanings. What is most impressive about this is the sheer breadth of coverage. Other than the Cyc project at MCC, there is simply no other project anywhere in the world which has attempted to catalogue so much knowledge.

The Concept Dictionary has taken the approach of trying to capture pragmatically useful definitions of concepts. This differs from much of the work in the U.S., which attempts to capture the set of necessary and sufficient conditions under which a concept applies (e.g., KL-One and its descendants). During our visit we asked how the researchers knew they had a good definition, particularly for abstract notions like "love." Their answer was, in effect, 'we're not philosophers; we capture as much as we can.'

EDR describes the Concept Dictionary as a "hyper-semantic network," by which is meant that a chunk of the semantic network may be treated as a single node that enters into relationships with other nodes. Entries in the Concept Dictionary for binding relations are triples: two concepts (nodes) joined by a relationship (arc). For example, "<eat> -- agent --> <bird>" says that birds can be the agents of an eating action (i.e., birds eat). The relationship can be annotated with a certainty factor; a certainty factor of 0 plays the role of negation. For restrictive relations, the entry consists of a pair of a concept and an attribute (with an optional certainty factor); e.g., "walking" is represented by "<walk> -- progress /1 -->."
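
As a concrete illustration, the two entry types might be modeled as follows; the class and field names are assumptions for exposition, not EDR's representation.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Binding:
        # A binding relation: two concepts joined by a labeled arc,
        # e.g. <eat> -- agent --> <bird>.
        head: str
        relation: str
        filler: str
        certainty: int = 255   # 0..255; 0 acts as negation

    @dataclass(frozen=True)
    class Restriction:
        # A restrictive relation: a concept paired with an attribute,
        # e.g. <walk> -- progress /1 --> for "walking".
        concept: str
        attribute: str
        value: int = 1

    birds_eat = Binding("<eat>", "agent", "<bird>")
    birds_dont_talk = Binding("<talk>", "agent", "<bird>", certainty=0)
    walking = Restriction("<walk>", "progress")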

The "hyper" part of this hyper-semantic network is the ability of any piece of network to be aggregated and treated as a single node. For example, "I believe birds eat" is represented by building the network for "birds eat," then treating this as a node which enters into an object relationship with "believe." Quantification is indicated by use of the "SOME" and "ALL" attributes; the scope of the quantification is indicated by nesting of boxes.

EDR has identified a set of relations which seem to be sufficient; these are shown below. In our discussions, the EDR people seemed to be relatively convinced that they had reached closure with the current set.

Relation Labels

Relation        Description

Agent           Subject of action
Object          Object affected by action or change
Manner          Way of action or change
Implement       Tool/means of action
Material        Material or component
Time            Time event occurs
Time-from       Time event begins
Time-to         Time event ends
Duration        Duration of event
Location        Objective place of action
Place           Place where event occurs
Source          Initial position of subject or object of event
Goal            Final position of subject or object of event
Scene           Objective range of action
Condition       Conditional relation of event/fact
Co-occurrence   Simultaneous relation of event/fact
Sequence        Sequential relation of event/fact
Quantity        Quantity
Number          Number

Semantic Relation Labels

Part of         Part-whole relation
Equal           Equivalence relation
Similar         Synonymous relation
Kind of         Super-concept relation

This framework for representation is similar to semantic network formalisms used in the U.S.; it is most similar to those developed for language understanding tasks (e.g., Quillian's work in the late 1960s, OWL, SNePS, and Conceptual Dependency). However, the EDR framework seems a bit richer and more mature than most of this research, particularly since this thread of research seems to have waned in the U.S.

EDR has tried to develop a systematic approach to the construction of the Concept Dictionary. This involves a layering of the inheritance lattice of concepts as well as a layering of the network structures which define concepts.

Use of Hierarchies. As in most knowledge representation work, the use of Kind Of (IsA) hierarchies plays an important role, capturing commonalities and thereby compressing storage. In the EDR Concept Dictionary a top-level division is made between "Object" and "Non-Object" concepts; the second category is further divided into "Event" and "Feature." Under this top-level classification, EDR has tried to structure the inheritance hierarchy into three layers. In the upper section, "Event" is broken into "Movement," "Action," and "Change," which are further refined into "Physical-Movement" and "Movement-Of-Information."

The refinement of these concepts is governed by the way they bind with the subdivision of the "Object" concept. A concept is further divided if the sub-concepts can enter into different relations with subconcepts of the "Object" hierarchy. The middle section of the hierarchy is formed by multiple inheritance from concepts in the first layer. Again, concepts are subdivided to the degree necessary to correspond to distinct relationships with the subconcepts of "Object."

The third section consists of individual concepts. The subdivision of the "Object" hierarchy is constructed in the same manner, guided by the distinctions made in the "Non-Object" part of the network. The second layer is built by multiple inheritance from the first layer and bottoms out at concepts that need no further division to distinguish the binding relationships with parts of the Non-Object network (e.g., bird). The third layer consists of instances of concepts in the second section (e.g., swallow, robin).
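
The layered hierarchy described above might be sketched as follows; the concept names below the top-level division are taken from the text, while the middle-layer example (Repair) is an assumption added for illustration.

    class Concept: pass

    # Top-level division of the Concept Dictionary.
    class Object(Concept): pass
    class NonObject(Concept): pass
    class Event(NonObject): pass
    class Feature(NonObject): pass

    # Upper section: "Event" is broken into movement, action, change.
    class Movement(Event): pass
    class Action(Event): pass
    class Change(Event): pass
    class PhysicalMovement(Movement): pass
    class MovementOfInformation(Movement): pass

    # Middle section: formed by multiple inheritance from upper-layer
    # concepts (this particular concept is a hypothetical example).
    class Repair(Action, Change): pass

    # The "Object" side is subdivided only as far as needed to
    # distinguish binding relations with the Non-Object network;
    # the third layer holds instances.
    class Bird(Object): pass
    class Swallow(Bird): pass
    class Robin(Bird): pass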

Network Structures and Methods. The definition section of the network is organized into six layers (sections A through F).

EDR's minimal goal is to finish sections A through C; they believe that this is achievable within their time-scale and will also be pragmatically useful.

EDR has adopted a pragmatic, example-driven approach to construction of the whole dictionary system. Although they have constructed computer tools to help in the process, people are very much in the loop. The work proceeds roughly as follows (a toy sketch of the KWIC indexing step appears after this list):

  1. The work began with the selection and description of the dictionary contents of 170,000 vocabulary items in each language. Each of these is entered as a "headword" in the word dictionary.
  2. The corresponding word in the other language is determined and entered in the word dictionary if not present; a bilingual correspondence entry is then made in the bilingual dictionaries.
  3. A corpus of 20 million sentences was collected from newspapers, encyclopedias, textbooks, and reference books, and a keyword-in-context (KWIC) database was created indicating each occurrence of each word in the corpus.
  4. Each corpus word is checked to see whether it is already present in the word dictionaries; if not, it is entered.
  5. Automated tools perform morphological analysis; the results are output on a worksheet for human correction or approval. For each morpheme, the appropriate concept is selected from the list of those associated with the word.
  6. Syntactic and semantic analysis is performed by computer on the results of the morphological analysis and output on worksheets; the results are verified by individuals. The relations between the concepts are determined and filled out on worksheets.
  7. Once corrected and approved, the morphological information is added to the word dictionaries, and the parse trees and extracted concept relations are stored in the EDR Corpus.
  8. Co-occurrence information is extracted from the parse trees and entered into the co-occurrence dictionary.
  9. Concept information extracted in this process is compared to the existing concept hierarchy and descriptions; new descriptions are entered into the Concept Dictionary.
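
The KWIC database mentioned in step 3 is a conventional indexing structure. The following is a minimal sketch of how such an index might be built over a toy corpus; the function name and data are assumptions, not EDR's actual tooling.

    from collections import defaultdict

    def build_kwic(sentences, width=3):
        # Map each word to its occurrences, keeping `width` words of
        # context on each side of the keyword.
        index = defaultdict(list)
        for s_id, sentence in enumerate(sentences):
            words = sentence.split()
            for i, word in enumerate(words):
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                index[word.lower()].append((s_id, left, right))
        return index

    corpus = [
        "Birds eat seeds in winter",
        "He rides a unicycle to work",
    ]
    kwic = build_kwic(corpus)
    for s_id, left, right in kwic["eat"]:
        print(f"sentence {s_id}: {left} [eat] {right}")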

Milestones. In January 1991, EDR published the first edition of the Dictionary Interface. By December 1991, EDR had begun to organize an external evaluation group. In November 1992, the preliminary versions of the Word and Concept Dictionaries were distributed to twelve universities and research institutes, including two American universities, for evaluation. In March of 1993, EDR was scheduled to release for commercial use the second edition of the Word Dictionaries, the first edition of the Concept Dictionary, the first edition of the Bilingual Dictionaries, and the first edition of the Co-occurrence Dictionaries. The second edition of the Dictionary Interface was also scheduled for release at this time. EDR plans to offer the dictionaries in commercial form worldwide; the commercial terms are planned to be uniform across national borders, although price has not yet been set. Discussions concerning academic access to the EDR dictionaries in the U.S. are ongoing.

Evaluation

Strengths and Weaknesses

During our visit to EDR, we asked several times what facilities were provided for consistency maintenance or automatic indexing. Our EDR hosts answered that there were a variety of ad hoc tools developed in house, but that they had no overall approach to these problems. They told us that it was very likely that a concept might appear more than once in the dictionary, because there were no very good tools for detecting duplications. In general, their approach has been to write simple tools that can identify potential problems and then to rely on humans to resolve them. Although they seemed quite apologetic about the weakness of their tools, they also seemed reasonably confident that success can be achieved by the steady application of reasonable effort to the problem. The chief indicator of this is a feeling that closure is being reached: most words and concepts currently encountered are already found in the dictionary. In short, there is a feeling that as the mass of the dictionary grows, the mass itself helps to manage the problem.

In contrast, research in the U.S. has placed a much stronger emphasis on strong theories and tools which automatically index a concept within the semantic network, check it for consistency, identify duplication, etc. For example, a significant sector of the knowledge representation community in the U.S. has spent the last several years investigating how to limit the expressiveness of representation languages in order to guarantee that indexing services can be performed by tractable computations (the clearest examples are in the KL-One family of languages). However, no effort in the U.S. has managed to deal with "largeness" as an issue. With the exception of Cyc, knowledge bases in the U.S. are typically much smaller than the EDR dictionary system. Also, the theoretical frameworks often preclude even expressing a variety of information which practitioners find necessary.

Comparison with U.S. Research. In the U.S., we know of only three efforts that approach the scale of the EDR effort. The Cyc project is attempting to build a very large system that captures a significant segment of common sense "consensus reality." At present the number of named terms in the Cyc system is still significantly smaller than in the EDR Concept Dictionary, although the knowledge associated with each term is much deeper. A second project is the "lexicon collaborative" at the University of Arizona, which is attempting to build a realistic-scale lexicon for English natural language processing. To our knowledge this work is not yet as extensive as EDR's. Finally, NIH has sponsored the construction of a medical information representation system. A large part of the information in this system comes from previously existing medical abstract indexing terms and similar sources. This system is intended to be quite large scale, but specialized to medical information.

In short, we might characterize the large knowledge base community in the U.S. as long on theories about knowledge representation and use of deep knowledge but short on pragmatism: the number of surface terms in its knowledge bases is relatively small. EDR, on the other hand, has no particular theory of knowledge representation, but has built a practical knowledge base with several hundred thousand concepts.

The JTEC team found EDR's pragmatic approach refreshing and exciting. Although U.S. researchers have spent significantly more time on theoretical formulations, they have not yet succeeded in building any general knowledge base of significant size. EDR's pragmatic approach (and ability to enlist a large number of lexicographers in a single national project) has allowed it to amass a significant corpus of concepts with significant coverage of the terms of natural language. The organizing ideas of the EDR dictionary are not particularly innovative; they have been in play since Quillian (mid-1960s) and Schank (late 1970s). While stronger theoretical work and better tools are both necessary and desirable, there is no substitute for breadth of coverage and the hard work necessary to achieve it. EDR has accomplished this breadth, and this is an essentially unique achievement. EDR's work is among the most exciting in Japan (or anywhere else).

Next Steps: The Knowledge Archives Project

Background and Organization

Dr. Yokoi, general manager of EDR, mentioned at the time of our visit to EDR that he has been working with others to prepare a plan for a new effort to be called the Knowledge Archives Project. At the time of our visit, the only information he had available was extremely preliminary and vague. We have since been sent a more complete draft entitled A Plan for the Knowledge Archives Project (March 1992), with the following affiliations: the Economics Research Institute (ERI), the Japan Society for the Promotion of Machine Industry (JSPMI), and the Systems Research and Development Institute of Japan (SR&DI). An introductory note says that the project is called NOAH (kNOwledge ArcHives). "The knowledge archives," it states, "is the Noah's ark which will overcome the chaos of the turn of the century and build a new foundation for the 21st century" (ERI 1992).

The goal of the Knowledge Archives Project is to amass very large knowledge bases to serve as the "common base of knowledge for international and interdisciplinary exchange in research and technology communities." This will be achieved primarily through the use of textual knowledge sources (i.e., documents), (semi-)automatically processed into very large-scale semantic networks.

Achieving the goal will require the development of a variety of new technologies. As the draft plan states:

The technology in which the acquisition and collection of vast amounts of knowledge are automated (supported); the technology in which knowledge bases are self-organized so that substantial amounts of knowledge can be stored systematically; the technology which supports the creation of new knowledge by using vast amounts of existing knowledge and by developing appropriate and applicable knowledge bases which fulfill the need for various knowledge usage; and the technology which translates and transmits knowledge to promote the interchange and common use of knowledge. In addition, development of a basic knowledge base which can be shared by all applications will be necessary. (ERI 1992)

The proposal is explicitly multi-disciplinary. There is a strong emphasis on multi-media technologies, the use of advanced databases (deductive and object oriented, such as the Quixote system at ICOT), and the use of natural language processing technology (such as that developed at EDR). An even more interesting aspect of the proposal is that there is a call for collaboration with researchers in the humanities and social sciences as well as those working in the media industry.

The proposal calls for a new national project to last for eight years beginning in the Japanese fiscal year 1993 and running through fiscal year 2000. There are three phases to the proposed project: (1) A basic research stage (1993-1994); (2) initial prototype development (1995-1997); and (3) final prototype (1998-2000). The output of the project is not imagined as a completed system, but rather as a usable system and a sound starting point for continuing research. The basic organization will comprise one research center and eight to ten research sites. Staffing will consist of transferred employees from research institutes and participating corporations.

Research Themes. A variety of research themes are associated with the archive project, including:

  1. Knowledge-grasping semantics. It is important to be able to understand meanings, to have computers interact with users at levels deeper than syntax. This theme is stated several times throughout the report and is often mentioned in the context of "reorganizing information processing systems from the application side." However, the proposal also notes that "AI has limited the range of objects and tried to search for their meaning in depth. Now it is important to deal with objects in a broader range but deal with the meaning in a shallow range" (ERI 1992).
  2. The importance of being very large. "Handling of small amounts of slightly complicated knowledge would not be enough. Computers have to be capable of dealing with very large-scale knowledge and massive information." Largeness in and of itself is also inadequate: "Totally new technologies are required to automatically acquire and store massive knowledge as efficiently as possible. The technologies are exactly what the Knowledge Archive Project has to tackle" (ERI 1992). Reference is also made to memory-based (case-based) reasoning; neural network technologies may also be relevant.
  3. Necessity for diverse knowledge representation media. Humans represent knowledge in a variety of ways: textually, graphically, through images, sounds, etc. "Representation media for humans will play the leading role in knowledge representation, and those for computers will be considered as media for 'computer interface.' Knowledge and software will no longer be represented and programmed for computers, but will be represented and programmed for the combination of humans and computers" (ERI 1992).
  4. Necessity for research based on the ecology of knowledge. "Knowledge is varied, diversified and comes in many forms. Knowledge is made visible as one document represented by a representation media. The key technology of knowledge processing for the Knowledge Archives automatically generates, edits, transforms, stores, retrieves, and transmits these documents as efficiently as possible" (ERI 1992).
  5. Necessity for a shared environment for the reuse of knowledge. "The first step is to standardize knowledge representation media [Editor's note: It's significant that media is plural] and make them commonly usable. Logic and logic programming have to be considered as the basis of knowledge representation languages, that is, the medium for computers" (ERI 1992). The mechanisms for storing large bodies of knowledge and retrieving knowledge on request form a key part of this agenda. Next-generation databases (as opposed to current AI knowledge bases) are seen as the best initial technology for this goal.
  6. Progress of separate technologies. The goal of the NOAH project requires the use of a variety of technologies which the authors think have matured enough to enable an effort to begin. However, it will be a key goal to integrate these technologies and to foster their further development. The technologies identified are: natural language processing, knowledge engineering, multimedia, and software engineering.
  7. The accumulation of information and the progress of related technologies. Although Japan has lagged the West in online text processing (because of its particular language and character-set problems), a vast amount of information is now becoming available in electronic form. Much of this information is online and accessible through networks. The publishing industry, libraries, and others traditionally involved in the management of large amounts of information are seen as prospective users of the fruits of the project and also as prospective collaborators in its development.
  8. Fruitful results of various projects. The Fifth Generation Computer Project has produced new constraint logic programming languages which are seen as potential knowledge representation languages. The Parallel Inference Machines (PIMs) (or some follow-on) are expected to function as high-performance database machines. EDR is seen as having produced robust natural language processing technology and "has made natural language the kernel language of knowledge representation media" (ERI 1992). The EDR dictionary is cited as a very large knowledge base of lexical knowledge. Finally, the various machine translation projects around Japan are cited as other examples of natural language processing technology.
What Will Be Done

Although attention is paid to the notion that human knowledge has many representation media, the proposal singles out two media for special attention: natural language, in particular modern Japanese, as the medium to be used by humans; and the system knowledge representation language that will be used internally by the Knowledge Archives system.

Documents are seen as the natural unit of knowledge acquisition. Knowledge documents written in the system knowledge representation language form the knowledge base of the Knowledge Archives. Words, sentences, texts, and stories are examples of the various levels of aggregation which make up knowledge objects. Knowledge objects are mutually related to one another in a semantic network which enables them to be defined and to have attached attributes. Both surface structure and semantic objects are stored in the Knowledge Archives. Surface representations are retained because the semantic objects are not required to capture all the meanings of the surface documents.

Relationships connect knowledge objects. These include: links between surface structure objects, e.g., syntactic relationships and inclusion relationships; semantic links, e.g., case frame links between words, causality links between sentences; correspondence links between surface and semantic objects; equality and similarity links, etc. Corresponding to each type of link will be a variety of inference rules that allow deductions to be drawn from the information. Also, it is a goal to develop mechanisms for learning and self-organization.
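
The following sketch suggests one way such typed links and per-link-type inference rules might look; the class names, link names, and the equality-closure rule are illustrative assumptions, not the proposal's design.

    from collections import defaultdict

    class KnowledgeBase:
        def __init__(self):
            # link type -> set of (source, destination) pairs
            self.links = defaultdict(set)

        def link(self, kind, src, dst):
            self.links[kind].add((src, dst))

        def infer_equals(self):
            # Toy inference rule attached to one link type: equality
            # links are symmetric and transitive, so close the "equal"
            # links under those two properties.
            eq = self.links["equal"]
            changed = True
            while changed:
                new = {(b, a) for a, b in eq} | \
                      {(a, c) for a, b in eq for b2, c in eq if b == b2}
                changed = not new <= eq
                eq |= new
            return eq

    kb = KnowledgeBase()
    kb.link("correspondence", "sentence-14", "<birds-eat>")  # surface <-> semantic
    kb.link("causality", "<rain>", "<wet-ground>")           # between sentences
    kb.link("equal", "<auto>", "<car>")
    kb.link("equal", "<car>", "<automobile>")
    print(kb.infer_equals())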

Initially, much of the work will be done manually or in a semi-automated fashion. However, it is hoped that the system can take on an increasingly large part of the processing. To make the task tractable in the initial phases, documents will not be processed as they are; rather, summary documents will be prepared. These will be analyzed by the computer and attached to the complete document with synonym links.

The envisioned sources of documents for processing are at the moment quite broad, encompassing collaboration with humanities research groups such as the Opera Project (chairman, Seigou Matsuoka), newspaper articles, scientific and technical literature (primarily information processing), patents, legal documents, manuals, and more.

Summary and Analysis. Much of the archives proposal seems vague, although the general spirit is clear. It should be noted that the proposal is the initial edition of the plan. EDR is currently working on more concrete details for it. Using the technologies developed by EDR, machine translation projects, and to a lesser degree ICOT, the project will attempt to build a very large-scale knowledge base. Documents, particularly textual documents, will form the core knowledge source with natural language processing technology converting natural language text into internal form. At least initially, the work will be at best semi-automated, but as the technologies get better the degree of automation will increase. The knowledge base will support a variety of knowledge storage, retrieval and transformation tasks. Largeness is seen as the core opportunity and challenge in the effort. Cyc is the U.S. project most similar to the one proposed here, although it is important to note the difference in perspective (Lenat and Guha 1990). In Cyc, it is the job of human knowledge engineers to develop an ontology and to enter the knowledge into it. In the Knowledge Archives Project, the source of knowledge is to be existing textual material and the ontology should (at least somewhat) emerge from the self-organization of the knowledge base.

The proposal explicitly mentions the intention of collaborating with researchers outside Japan and of encouraging the formation of similar efforts in other countries. The proposal has not yet been approved, but its progress should be followed.


Published: May 1993