EDR was spun out from ICOT in 1986 with a nine-year charter to develop a large-scale, practical electronic dictionary system that could be used in support of a variety of natural language tasks, such as translation between English and Japanese, natural language understanding and generation, speech processing, and so forth.
EDR was established as a consortium in collaboration with eight member corporations, each of which supports a local EDR research group. About 70 percent of EDR funding comes from the Japan Key Technology Center (a government agency established by MPT and MITI); roughly 30 percent comes from the member companies. EDR was established with a nine-year budget of approximately $100 million. This money pays for a staff of about 70 at EDR headquarters and at eight laboratories located at the member companies. About 50 of these staff members work at the laboratories, and they are experienced in both lexicography and computer science. EDR places orders with publishing companies for various types of raw data for the electronic dictionaries. The number of workers at these publishers peaked at about 300 but is currently under 100.
The EDR conception of an electronic dictionary is quite distinct from the conventional online dictionaries that are now common. These latter systems are designed as online reference books for use by humans. Typically, they contain the exact textual contents of a conventional printed dictionary stored on magnetic disk or CD-ROM. Usage is analogous to a conventional dictionary, except that the electronic version may have hypertext-like features for more rapid and convenient browsing.
EDR's primary goal, in part, is to capture: "all the information a computer requires for a thorough understanding of natural language" (JEDRI, 1990). This includes: the meanings (concepts) of words; the knowledge needed for the computer to understand the concepts; and the information needed to support morphological analysis and generation, syntactic analysis and generation, and semantic analysis. In addition, information about word co-occurrences and listings of equivalent words in other languages are necessary. In short, the system is intended to be a very large but shallow knowledge base about words and their meanings.
The goals for the EDR research are to produce a set of electronic dictionaries with broad coverage of linguistic knowledge. This information is intended to be neutral, in that it should not be biased towards any particular natural language processing theory or application; extensive in its coverage of common general-purpose words and of words from a corpus of technical literature; and comprehensive in its ability to support all stages of linguistic processing, such as morphological, syntactic and semantic processing, the selection of natural wording, and the selection of equivalent words in other languages.
EDR is well on the way to completing a set of component dictionaries which collectively form the EDR product. These include word dictionaries for English and Japanese, the Concept Dictionary, co-occurrence dictionaries for English and Japanese, and two bilingual dictionaries: English to Japanese and Japanese to English. Each of these is extremely large scale, as follows:
Word Dictionaries
  General Vocabulary:      English 200,000 words; Japanese 200,000 words
  Technical Terminology:   English 100,000 words; Japanese 100,000 words
Concept Dictionary
  400,000 concepts (classifications and descriptions)
Co-occurrence Dictionaries
  English 300,000 words; Japanese 300,000 words
Bilingual Dictionaries
  English-Japanese 300,000 words; Japanese-English 300,000 words
The several component dictionaries serve distinct roles. All semantic information is captured in the Concept Dictionary, which is surface-language independent. The Concept Dictionary is a very large semantic network capturing pragmatically useful concept descriptions. (We will return to this later.) The word dictionaries capture surface syntactic information (e.g., pronunciation, inflection) peculiar to each surface language. Each word entry consists of the "headword" itself, grammatical information, and a pointer to the appropriate concept in the Concept Dictionary. The co-occurrence dictionaries capture information on appropriate word combinations. Each entry consists of a pair of words coupled by a co-occurrence relation (there are several types of co-occurrence relations). The strength of a co-occurrence relation is indicated by a certainty factor, whose value ranges from 0 to 255; zero means that the words cannot be used together. This information is used to determine that "he drives a car" is allowable, that "he drives a unicycle" isn't, and that "he rides a unicycle" is. It can also be used to determine the appropriate translation of a word. For example, the Japanese word naosu corresponds to several English words, e.g., "modify," "correct," "update," and "mend." However, if the object of naosu is "error," then the appropriate English equivalent is "correct." Finally, the bilingual dictionaries provide surface information on the choice of equivalent words in the target language, as well as information on correspondence through entries in the Concept Dictionary.
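The co-occurrence mechanism described above can be sketched in a few lines of Python. This is purely illustrative: the entry format, relation names, and certainty values below are assumptions, not EDR's actual data format.

```python
# Illustrative sketch of co-occurrence lookup with certainty factors.
# Entries pair two words via a relation; the certainty factor ranges
# from 0 to 255, and 0 means the combination is disallowed.
COOCCURRENCE = {
    ("drive", "object", "car"): 200,
    ("drive", "object", "unicycle"): 0,
    ("ride", "object", "unicycle"): 180,
}

def allowed(verb: str, relation: str, noun: str) -> bool:
    """A word combination is usable iff its certainty factor is nonzero."""
    return COOCCURRENCE.get((verb, relation, noun), 0) > 0

# Translation selection for Japanese "naosu": choose the English
# equivalent whose co-occurrence with the object is strongest.
# (These certainty values are invented for illustration.)
NAOSU_EQUIVALENTS = ["modify", "correct", "update", "mend"]
NAOSU_COOC = {
    ("correct", "object", "error"): 250,
    ("mend", "object", "sock"): 230,
}

def translate_naosu(obj: str) -> str:
    return max(NAOSU_EQUIVALENTS,
               key=lambda verb: NAOSU_COOC.get((verb, "object", obj), 0))
```

With this sketch, `allowed("drive", "object", "unicycle")` is false while `allowed("ride", "object", "unicycle")` is true, and `translate_naosu("error")` selects "correct", mirroring the examples in the text.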
From the point of view of KBS, the Concept Dictionary is the most interesting component of the EDR project. As mentioned above, this is a very large semantic network capturing roughly 400,000 word meanings. What is most impressive about this is the sheer breadth of coverage. Other than the Cyc project at MCC, there is simply no other project anywhere in the world which has attempted to catalogue so much knowledge.
The Concept Dictionary has taken the approach of trying to capture pragmatically useful definitions of concepts. This differs from much of the work in the U.S., which attempts to capture the set of necessary and sufficient conditions for a concept to obtain (e.g., KL-One and its descendants). During our visit we asked how the researchers knew they had a good definition, particularly for abstract notions like "love." Their answer was, in effect, 'we're not philosophers, we capture as much as we can.'
EDR describes the Concept Dictionary as a "hyper-semantic network," meaning that a chunk of the semantic network may be treated as a single node that enters into relationships with other nodes. Entries in the Concept Dictionary (for binding relations) are triples of two concepts (nodes) joined by a relationship (arc). For example, "<eat> -- agent --> <bird>" says that birds can be the agents of an eating action (i.e., birds eat). The relationship can be annotated with a certainty factor; a certainty factor of 0 plays the role of negation. For restrictive relations the entry consists of a pair of a concept and an attribute (with an optional certainty factor); e.g., "walking" is represented by "<walk> -- progress /1 -->."
The "hyper" part of this hyper-semantic network is the ability of any piece of network to be aggregated and treated as a single node. For example, "I believe birds eat" is represented by building the network for "birds eat," then treating this as a node which enters into an object relationship with "believe." Quantification is indicated by use of the "SOME" and "ALL" attributes; the scope of the quantification is indicated by nesting of boxes.
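The aggregation idea can be made concrete with a small data-structure sketch. The class names and representation below are assumptions for illustration, not EDR's internal format; the point is only that a whole subnetwork can occupy a node position in a triple.

```python
# Illustrative sketch of the "hyper-semantic network": triples of
# (head, relation, tail) with a certainty factor, where a chunk of
# network may itself be treated as a single node.
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    name: str

@dataclass(frozen=True)
class Network:
    """A chunk of the semantic network aggregated into a single node."""
    triples: tuple

@dataclass(frozen=True)
class Triple:
    head: object          # a Concept or an aggregated Network
    relation: str
    tail: object          # likewise; this is the "hyper" part
    certainty: int = 255  # 0..255; 0 plays the role of negation

# "birds eat": <eat> -- agent --> <bird>
birds_eat = Network(triples=(Triple(Concept("eat"), "agent", Concept("bird")),))

# "I believe birds eat": the aggregated subnetwork enters into an
# object relationship with <believe>.
belief = Triple(Concept("believe"), "object", birds_eat)
belief_agent = Triple(Concept("believe"), "agent", Concept("I"))
```

Here the nesting of `Network` values plays the role the text assigns to nested boxes, e.g. for delimiting the scope of quantification.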
EDR has identified a set of relations which seem to be sufficient; these are shown below. In our discussions, the EDR people seemed to be relatively convinced that they had reached closure with the current set.
Relation Labels
  Agent          Subject of action
  Object         Object affected by action or change
  Manner         Way of action or change
  Implement      Tool/means of action
  Material       Material or component
  Time           Time event occurs
  Time-from      Time event begins
  Time-to        Time event ends
  Duration       Duration of event
  Location       Objective place of action
  Place          Place where event occurs
  Source         Initial position of subject or object of event
  Goal           Final position of subject or object of event
  Scene          Objective range of action
  Condition      Conditional relation of event/fact
  Co-occurrence  Simultaneous relation of event/fact
  Sequence       Sequential relation of event/fact
  Quantity       Quantity
  Number         Number

Semantic Relation Labels
  Part-of        Part-whole relation
  Equal          Equivalence relation
  Similar        Synonymous relation
  Kind-of        Super concept relation
This framework for representation is similar to semantic network formalisms used in the U.S.; it is most similar to those which are developed for language understanding tasks (e.g., Quillian's work in the late 1960s; OWL; SNePS; and Conceptual Dependency). However, the EDR framework seems to be a bit richer and more mature than most of this research -- particularly so, since this particular thread of research seems to have waned in the U.S.
EDR has tried to develop a systematic approach to the construction of the Concept Dictionary. This involves a layering of the inheritance lattice of concepts as well as a layering of the network structures which define concepts.
Use of Hierarchies. As in most knowledge representation work, the use of Kind Of (IsA) hierarchies plays an important role, capturing commonalities and thereby compressing storage. In the EDR Concept Dictionary a top-level division is made between "Object" and "Non-Object" concepts; the second category is further divided into "Event" and "Feature." Under this top-level classification, EDR has tried to structure the inheritance hierarchy into three layers. In the upper section, "Event" is broken into "Movement," "Action," and "Change," which are further refined into "Physical-Movement" and "Movement-Of-Information."
The refinement of these concepts is governed by the way they bind with the subdivision of the "Object" concept. A concept is further divided if the sub-concepts can enter into different relations with subconcepts of the "Object" hierarchy. The middle section of the hierarchy is formed by multiple inheritance from concepts in the first layer. Again, concepts are subdivided to the degree necessary to correspond to distinct relationships with the subconcepts of "Object."
The third section consists of individual concepts. The subdivision of the "Object" hierarchy is constructed in the same manner, guided by the distinctions made in the "Non-Object" part of the network. The second layer is built by multiple inheritance from the first layer and bottoms out at concepts that need no further division to distinguish the binding relationships with parts of the Non-Object network (e.g., bird). The third layer consists of instances of concepts in the second section (e.g., swallow, robin).
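The layered Kind-Of structure described above can be sketched as a simple parent map. The concept names are taken from the text where the text supplies them; the data structure itself, and the lower-level entries, are illustrative assumptions.

```python
# A minimal sketch of the layered Kind-Of (IsA) hierarchy.  Each entry
# maps a concept to its super-concept; walking the map upward recovers
# the inheritance chain.
KIND_OF = {
    "Event": "Non-Object",
    "Feature": "Non-Object",
    "Movement": "Event",
    "Action": "Event",
    "Change": "Event",
    "Physical-Movement": "Movement",
    "Movement-Of-Information": "Movement",
    # Illustrative entries for the "Object" side and the instance layer:
    "Bird": "Object",
    "Swallow": "Bird",
    "Robin": "Bird",
}

def ancestors(concept: str) -> list:
    """Walk Kind-Of links from a concept to the top of the hierarchy."""
    chain = []
    while concept in KIND_OF:
        concept = KIND_OF[concept]
        chain.append(concept)
    return chain
```

For example, `ancestors("Physical-Movement")` yields `["Movement", "Event", "Non-Object"]`, tracing the three-layer refinement the text describes.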
Network Structures and Methods. The definition section of the network has six layers:
EDR's minimal goal is to finish sections A through C; they believe that this is achievable within their time-scale and will also be pragmatically useful.
EDR has adopted a pragmatic, example-driven approach to construction of the whole dictionary system. Although they have constructed computer tools to help in the process, people are very much in the loop. The process runs roughly as follows:

(1) The work began with the selection and description of the dictionary contents of 170,000 vocabulary items in each language. Each of these is entered as a "headword" in the word dictionary. (2) The corresponding word in the other language is determined and entered in the word dictionary if not present; a bilingual correspondence entry is then made in the bilingual dictionaries. (3) A corpus of 20 million sentences was collected from newspapers, encyclopedias, textbooks and reference books, and a keyword-in-context (KWIC) database was created indicating each occurrence of each word in the corpus. (4) Each word is checked to see whether it is already present in the word dictionaries; if not, it is entered. (5) Automated tools perform morphological analysis; the results are output on a worksheet for human correction or approval. For each morpheme, the appropriate concept is selected from the list of those associated with the word. (6) Syntactic and semantic analysis is performed by computer on the results of the morphological analysis and output on worksheets; the results are verified by individuals, and the relations between the concepts are determined and filled out on worksheets. (7) Once corrected and approved, the morphological information is added to the word dictionaries, and the parse trees and extracted concept relations are stored in the EDR Corpus. (8) Co-occurrence information is extracted from the parse trees and entered into the co-occurrence dictionary. (9) Concept information extracted in this process is compared to the existing concept hierarchy and descriptions, and new descriptions are entered into the Concept Dictionary.
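The KWIC database mentioned in step (3) is a standard construction, and a toy version conveys the idea. This sketch is an assumption about the general technique, not EDR's actual tooling; it indexes every word occurrence together with a window of surrounding context.

```python
# A toy keyword-in-context (KWIC) index: map each word to the places it
# occurs, keeping a window of left and right context for each occurrence.
from collections import defaultdict

def build_kwic(sentences, window=3):
    """Return {word: [(sentence_id, left_context, right_context), ...]}."""
    index = defaultdict(list)
    for sid, sentence in enumerate(sentences):
        words = sentence.lower().split()
        for i, word in enumerate(words):
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            index[word].append((sid, left, right))
    return index

corpus = ["Birds eat seeds", "He rides a unicycle"]
kwic = build_kwic(corpus)
# kwic["eat"] -> [(0, "birds", "seeds")]
```

An index of this shape lets a lexicographer pull up every use of a word in the corpus at once, which is what makes the worksheet-based checking in steps (4)-(6) feasible at corpus scale.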
Milestones. In January 1991, EDR published the first edition of the dictionary interface. By December 1991, EDR had begun to organize an external evaluation group. In November 1992, the preliminary versions of the Word and Concept Dictionaries were distributed to twelve universities and research institutes, including two American universities, for evaluation. In March of 1993, EDR was scheduled to release for commercial use the second edition of the Word Dictionaries, the first edition of the Concept Dictionary, the first edition of the Bilingual Dictionaries and the first edition of the Co-occurrence Dictionaries. The second edition of the Dictionary Interface was also scheduled for release at this time. EDR plans to offer the dictionaries in commercial form world-wide; the commercial terms are planned to be uniform across national borders, although price has not yet been set. Discussions concerning academic access to the EDR dictionaries in the U.S. are ongoing.
During our visit to EDR, we asked several times what facilities were provided for consistency maintenance or automatic indexing. Our EDR hosts answered that there were a variety of ad hoc tools developed in house, but that they had no overall approach to these problems. They told us that it was very likely that a concept might appear more than once in the dictionary because there weren't really very good tools for detecting duplications. In general, their approach has been to write simple tools that can identify potential problems and then to rely on humans to resolve the problems. Although they seemed quite apologetic about the weakness of their tools, they also seemed reasonably secure that success can be achieved by the steady application of reasonable effort to the problem. The chief indicator of this is a feeling that closure is being reached, that most words and concepts currently encountered are found in the dictionary. In short, there is a feeling that as the mass of the dictionary grows, the mass itself helps to manage the problem.
In contrast, research in the U.S. has placed a much stronger emphasis on strong theories and tools which automatically index a concept within the semantic network, check it for consistency, identify duplication, etc. For example, a significant sector of the knowledge representation community in the U.S. has spent the last several years investigating how to limit the expressiveness of representation languages in order to guarantee that indexing services can be performed by tractable computations (the clearest examples are in the KL-One family of languages). However, no effort in the U.S. has managed to deal with "largeness" as an issue. With the exception of Cyc, knowledge bases in the U.S. are typically much smaller than the EDR dictionary system. Also, the theoretical frameworks often preclude even expressing a variety of information which practitioners find necessary.
Comparison with U.S. Research. In the U.S., we know of only three efforts which approach the scale of the EDR effort. The Cyc project is attempting to build a very large system that captures a significant segment of common sense "consensus reality." At present the number of named terms in the Cyc system is still significantly smaller than the EDR Concept Dictionary, although the knowledge associated with each term is much deeper. A second project is the "lexicon collaborative" at the University of Arizona, which is attempting to build a realistic scale lexicon for English natural language processing. To our knowledge this work is not yet as extensive as EDR's. Finally, NIH has sponsored the construction of a medical information representation system. A large part of the information in this system comes from previously existing medical abstract indexing terms, etc. This system is intended to be quite large scale, but specialized to medical information.
In short, we might characterize the large knowledge base community in the U.S. as long on theories about knowledge representation and use of deep knowledge but short on pragmatism: the number of surface terms in its knowledge bases is relatively small. EDR, on the other hand, has no particular theory of knowledge representation, but has built a practical knowledge base with several hundred thousand concepts.
The JTEC team found EDR's pragmatic approach refreshing and exciting. Although U.S. researchers have spent significantly more time on theoretical formulations, they have not yet succeeded in building any general knowledge base of significant size. EDR's pragmatic approach (and ability to enlist a large number of lexicographers in a single national project) has allowed it to amass a significant corpus of concepts with significant coverage of the terms of natural language. The organizing ideas of the EDR dictionary are not particularly innovative; they have been in play since Quillian (mid 1960s) and Schank (late 1970s). While stronger theoretical work and better tools are both necessary and desirable, there is no substitute for breadth of coverage and the hard work necessary to achieve it. EDR has accomplished this breadth and this is an essentially unique achievement. EDR's work is among the most exciting in Japan (or anywhere else).
Dr. Yokoi, general manager of EDR, mentioned at the time of our visit that he has been working with others to prepare a plan for a new effort to be called the Knowledge Archives Project. At the time of our visit, the only information he had available was extremely preliminary and vague. We have since been sent a more complete draft entitled A Plan for the Knowledge Archives Project, March 1992, with the following affiliations: the Economics Research Institute (ERI); the Japan Society for the Promotion of Machine Industry (JSPMI); and the Systems Research and Development Institute of Japan (SR&DI). An introductory note says that the project is called NOAH (kNOwledge ArcHives). "The knowledge archives," it states, "is the Noah's ark which will overcome the chaos of the turn of the century and build a new foundation for the 21st century" (ERI 1992).
The goal of the Knowledge Archives Project is to amass very large knowledge bases to serve as the "common base of knowledge for international and interdisciplinary exchange in research and technology communities." This will be achieved primarily through the use of textual knowledge sources (i.e., documents), being (semi-)automatically processed into very large scale semantic networks.
Achieving the goal will require the development of a variety of new technologies. As the draft plan states:
The technology in which the acquisition and collection of vast amounts of knowledge are automated (supported); the technology in which knowledge bases are self-organized so that substantial amounts of knowledge can be stored systematically; the technology which supports the creation of new knowledge by using vast amounts of existing knowledge and by developing appropriate and applicable knowledge bases which fulfill the need for various knowledge usage; and the technology which translates and transmits knowledge to promote the interchange and common use of knowledge. In addition, development of a basic knowledge base which can be shared by all applications will be necessary. (ERI 1992)
The proposal is explicitly multi-disciplinary. There is a strong emphasis on multi-media technologies, the use of advanced databases (deductive and object oriented, such as the Quixote system at ICOT), and the use of natural language processing technology (such as that developed at EDR). An even more interesting aspect of the proposal is that there is a call for collaboration with researchers in the humanities and social sciences as well as those working in the media industry.
The proposal calls for a new national project to last for eight years beginning in the Japanese fiscal year 1993 and running through fiscal year 2000. There are three phases to the proposed project: (1) A basic research stage (1993-1994); (2) initial prototype development (1995-1997); and (3) final prototype (1998-2000). The output of the project is not imagined as a completed system, but rather as a usable system and a sound starting point for continuing research. The basic organization will comprise one research center and eight to ten research sites. Staffing will consist of transferred employees from research institutes and participating corporations.
Research Themes. A variety of research themes are associated with the archive project, including:
Although attention is paid to the notion that human knowledge has many representation media, the proposal singles out two media of special importance: natural language, in particular modern Japanese, as the medium to be used by humans; and the system knowledge representation language that will be used internally by the Knowledge Archives system.
Documents are seen as the natural unit of knowledge acquisition. Knowledge documents written in the system knowledge representation language form the knowledge base of the Knowledge Archives. Words, sentences, texts, and stories are examples of the various levels of aggregation which make up knowledge objects. Knowledge objects are mutually related to one another in a semantic network which enables them to be defined and to have attached attributes. Both surface structure and semantic objects are stored in the Knowledge Archives. Surface representations are retained because the semantic objects are not required to capture all the meanings of the surface documents.
Relationships connect knowledge objects. These include: links between surface structure objects, e.g., syntactic relationships and inclusion relationships; semantic links, e.g., case frame links between words, causality links between sentences; correspondence links between surface and semantic objects; equality and similarity links, etc. Corresponding to each type of link will be a variety of inference rules that allow deductions to be drawn from the information. Also, it is a goal to develop mechanisms for learning and self-organization.
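The proposal's picture of typed links with per-type inference rules can be sketched briefly. Everything below is an illustrative assumption: the link-type names echo the text, but the representation and the single sample rule (transitive chaining of causality links) are ours, not the proposal's.

```python
# Illustrative sketch of knowledge objects connected by typed links,
# with one sample inference rule attached to a link type.
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    source: str
    link_type: str   # e.g. "syntactic", "case-frame", "causality", "similarity"
    target: str

LINKS = [
    Link("sentence-1", "causality", "sentence-2"),
    Link("sentence-2", "causality", "sentence-3"),
]

def transitive_causes(start, links):
    """Sample inference rule: causality links chain transitively."""
    reached, frontier = set(), {start}
    while frontier:
        node = frontier.pop()
        for link in links:
            if link.link_type == "causality" and link.source == node:
                if link.target not in reached:
                    reached.add(link.target)
                    frontier.add(link.target)
    return reached
```

In the proposal's terms, each link type would carry its own such rules, and the learning and self-organization mechanisms would operate over the resulting network.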
Initially, much of the work will be done manually or in a semi-automated fashion. However, it is hoped that the system can take on an increasingly large part of the processing. To make the task tractable in the initial phases, documents will not be processed as they are but rather, summary documents will be prepared. These will be analyzed by the computer and attached to the complete document with synonym links.
The envisioned sources of documents for processing are at the moment quite broad, encompassing collaboration with humanities research groups such as the Opera Project (chairman, Seigou Matsuoka), newspaper articles, scientific and technical literature (primarily information processing), patents, legal documents, manuals, and more.
Summary and Analysis. Much of the archives proposal seems vague, although the general spirit is clear. It should be noted that the proposal is the initial edition of the plan. EDR is currently working on more concrete details for it. Using the technologies developed by EDR, machine translation projects, and to a lesser degree ICOT, the project will attempt to build a very large-scale knowledge base. Documents, particularly textual documents, will form the core knowledge source with natural language processing technology converting natural language text into internal form. At least initially, the work will be at best semi-automated, but as the technologies get better the degree of automation will increase. The knowledge base will support a variety of knowledge storage, retrieval and transformation tasks. Largeness is seen as the core opportunity and challenge in the effort. Cyc is the U.S. project most similar to the one proposed here, although it is important to note the difference in perspective (Lenat and Guha 1990). In Cyc, it is the job of human knowledge engineers to develop an ontology and to enter the knowledge into it. In the Knowledge Archives Project, the source of knowledge is to be existing textual material and the ontology should (at least somewhat) emerge from the self-organization of the knowledge base.
The proposal explicitly mentions the intention of collaborating with researchers outside Japan and of encouraging the formation of similar efforts in other countries. The proposal has not yet been approved, but its progress should be followed.