Site: Advanced Telecommunications Research Institute
2-2 Hikaridai Seika-cho
Kyoto 619-02, Japan
Tel.: 81 774 95 1112
Fax: 81 774 95 1109
Date Visited: May 22, 1995
Report Author: J. Foley
ATR was founded in 1986 with support from industry, academia, and government to serve as a major center of basic and creative telecommunications R&D. The government provides 70% of the annual budget; 141 companies provide the remaining 30%. Government supervision comes from the Ministry of International Trade and Industry (MITI) and the Ministry of Posts and Telecommunications (MPT); government funding comes from dividends on shares in the partially privatized Nippon Telegraph and Telephone (NTT).
Within ATR are five active research programs, each with a defined lifetime:
This research program covers teleconferencing with realistic sensations, human image processing in facial recognition, recognition and synthesis of human motion, 3-D image databases, retrieval and manipulation of information using language and gesture, hand gesture recognition, gaze detection, facial expression detection, secure communications systems, and automatic generation of communications software.
Program components include speaker-independent speech recognition, processing prosody, acoustic models of speech, example-based natural language translation, intention understanding, integration of speech and language processing, human interfaces for multimodal communications, use of massively parallel machines for speech processing, and speech and language databases.
Issues addressed by this research program are speech production mechanisms, neural information processing in speech perception, image formation process, neural information processing of visual patterns, visual information generating mechanisms, human integration of motor control and vision, and evolutionary systems.
Optical intersatellite communications, mobile communications technologies, and optical and electronic devices comprise this research program.
This lab is just being established. Its basic charter is to develop next-generation teleconferencing capabilities using technologies such as virtual environments, computer graphics, 3-D vision, 3-D audio, and tactile feedback. This lab is in some sense the successor to the Communications Systems Research Lab, the charter of which expires in 1996.
ATR's 230 staff members come primarily from member companies, which assign employees to ATR for an average stay of three years. About 50 of the researchers are from abroad. Since its founding, about 500 Japanese and 200 international researchers have worked there.
The ATR Human Information Processing Research Lab had 69 staff members as of September 1994: 41 had engineering backgrounds, and 28 had science backgrounds. The majority of those with science backgrounds were in the fields of psychology, cognitive science, and linguistics.
Each lab has a Board of Directors and President. The JTEC team's host, Dr. Habara, serves as Chairman of the Board of all of the labs, and as Executive Vice President of ATR International, which is the holding company for the labs.
Dr. Tom Ray of the Evolutionary Systems Department is studying evolution in the medium of digital computers. In his system, machine language programs (the digital creatures) obtain CPU time as an "energy" resource and memory as a "material" resource. Mutation generates new forms, and evolution proceeds by natural selection as different genotypes compete for CPU time and memory space. The system starts with a single creature carrying an 80-byte machine language program, from which diverse ecological communities quickly emerge. Parasites appear; their hosts evolve immunity; the parasites evolve to circumvent the immunity. Hosts evolve means to exploit parasites by stealing their energy resources. Social organisms evolve that can replicate only through cooperation, and other creatures evolve to take advantage of that cooperation. Dramatic optimizations result, including a nearly fourfold reduction in code size (from 80 bytes to 22 bytes) with no loss of functionality. In another case, a digital creature evolved that unrolls its copy loop, indicating that the system can discover the kind of intricate optimization technique a clever programmer might design.
Future directions include first implementing SIMD-type parallelism, then MIMD, by adding explicit fork-and-join and message-passing primitives to the instruction set that can be genetically manipulated.
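The dynamics Dr. Ray described can be illustrated with a toy sketch in Python. This is not the actual Tierra system: real Tierra executes the creatures as machine code and selects on replication speed, whereas here selection pressure toward shorter genomes stands in for that, and all constants (soup size, mutation rate) are illustrative.

```python
import random

SOUP_SIZE = 600        # total "memory" budget for all creatures (illustrative)
MUTATION_RATE = 0.05   # chance that copying a byte goes wrong

def replicate(genome):
    """Copy a genome byte by byte, with occasional copying errors."""
    child = []
    for byte in genome:
        if random.random() < MUTATION_RATE:
            if random.random() < 0.5:
                continue                   # deletion: child loses this byte
            byte = random.randrange(256)   # point mutation: byte is rewritten
        child.append(byte)
    return child

def step(population):
    """One generation: every creature replicates, then a 'reaper' keeps
    creatures only until the fixed-size soup is full. Iterating in order
    of genome length makes cheaper (shorter) genomes win memory, a crude
    stand-in for Tierra's selection on replication speed."""
    offspring = []
    for genome in population:
        offspring.append(genome)            # parent survives into the pool
        offspring.append(replicate(genome)) # plus one mutated copy
    random.shuffle(offspring)               # randomize order among equal lengths
    survivors, used = [], 0
    for genome in sorted(offspring, key=len):
        if genome and used + len(genome) <= SOUP_SIZE:
            survivors.append(genome)
            used += len(genome)
    return survivors

random.seed(0)
pop = [[random.randrange(256) for _ in range(80)]]  # one 80-byte ancestor
for _ in range(40):
    pop = step(pop)
print(min(len(g) for g in pop))  # genomes far shorter than 80 bytes emerge
```

Even this crude model reproduces the headline behavior of the visit demo: under competition for a fixed memory budget, genome sizes shrink well below the 80-byte ancestor.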
Dr. Shigeru Akamatsu heads a group of researchers studying different aspects of the human face. The research agenda is three-fold: (1) to use the face as a source of information, that is, to recognize facial features in order to identify people and to understand what they are expressing with their motions, their emotions, and their words; (2) to model the face's shape and textures; and (3) to use the facial models to convey information to viewers. The overall purpose of the research agenda is to support human-to-human communications via computer/telecommunications-based collaboration.
With regard to studying the face as a source of multimodal information, Dr. Akamatsu described studies of the "McGurk Effect." When a subject views a face whose lips are speaking one sound while simultaneously hearing a different sound, the subject perceives a third sound that is a mixture of the visually presented and acoustically presented sounds. Dr. Akamatsu did not describe how to exploit the effect for teleconferencing.
ATR hosts described a study of face recognition from different orientations, similar to the Shepard-Metzler mental rotation experiments: "In the real world we have to process faces despite variations in the viewpoint from which they are seen. Strong viewpoint dependence was observed when subjects were required to generalize from a single view, but very different patterns of generalization were observed for the three learned views. For full face, generalization fell with increasing angle of rotation; for the three-quarter view, the opposite view was readily recognized. Profile generalization was equally poor to all unlearned views. The opposite-view advantage for three-quarter but not profile may be because a virtual view can only be generated for the three-quarter, but the necessary symmetric feature points occlude each other in profile. The results suggest that, while no one viewpoint is inherently better, different views provide different information for generalization."
Under the direction of Masaomi Oda, a study was conducted of how well human judgments of face similarity correlate with a mathematical representation of faces. If two faces that are close together in the mathematical representation are also perceived as similar by viewers, then the representation could be used for face identification (Oda, Akamatsu, and Fukamachi 1994). The study concluded that the correspondence is good and that 20 to 50 degrees of freedom (dimensions) in the mathematical representation are needed.
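The K-L (Karhunen-Loeve) expansion used in the study is the technique now widely known as principal component analysis. The sketch below shows the idea under stated assumptions: random vectors stand in for real face images, the 30-dimension choice simply falls in the study's 20-to-50 range, and Euclidean distance in the subspace plays the role of the mathematical similarity measure compared against human judgment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for vectorized face images (no real face data here):
# 200 "faces", each a 1024-dimensional pixel vector.
faces = rng.normal(size=(200, 1024))

# K-L expansion: mean-center the data, then take the leading eigenvectors
# of its covariance; the SVD of the centered matrix yields them directly.
mean = faces.mean(axis=0)
centered = faces - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)

k = 30              # within the 20-50 dimensions the ATR study found sufficient
basis = vt[:k]      # top-k K-L components ("eigenfaces")

def kl_coords(face):
    """Project a face vector into the k-dimensional K-L subspace."""
    return basis @ (face - mean)

def similarity(face_a, face_b):
    """Distance in K-L space; a small distance should correspond to faces
    that viewers also judge similar, per the study's hypothesis."""
    return np.linalg.norm(kl_coords(face_a) - kl_coords(face_b))

# A face is closer to a slightly perturbed copy of itself than to another face.
probe = faces[0] + 0.01 * rng.normal(size=1024)
print(similarity(faces[0], probe) < similarity(faces[0], faces[1]))  # True
```

The design choice is dimensionality reduction: comparing 30 coefficients rather than 1024 pixels makes identification tractable while, per the study, preserving the distinctions human viewers actually make.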
The Facial Image Retrieval System (Oda and Kato 1993) offers another way to study perceived similarity in faces, using line-drawn schematic faces as shown in Figure ATR.1. In one experiment, subjects were asked to select from a database of 60,000 such faces a set of faces matching a target face; in another, subjects were asked to pick a set of their favorite faces. The study found that face shape, eyebrow tilt, and eye shape were more important in matching faces than the other features.
A system has been developed to morph a face to change its expression, to convey different emotions such as gentle and fierce (see Fig. 4.6).
Fig. ATR.1. Examples of line-drawn faces (Oda and Kato 1993).
Dr. Jun Ohya presented the Virtual Space Teleconferencing System (Ohya 1995). The objective of the system is to support computer conferencing with multiple parties in which each participant is represented as a computer-generated model of the actual participant. Each participant's position and gestures are sensed and transmitted to each conference site. At each site, a high-performance Silicon Graphics 3-D display creates a view that includes each of the participants. The participants' overall body shape and face are digitized in advance of the conference. Facial photos are texture-mapped onto the facial model to create a somewhat realistic-looking face.
The advantage of this approach is that computer-generated objects can be introduced into the conferencing environment for collaborative design activities in a virtual reality (VR) environment. In addition, transmitting the sensed positions and gestures requires far less bandwidth than transmitting video images of the participants. The disadvantage is that current motion-capture technology depends on data gloves and on video cameras tracking reflective dots of tape on the conferees' faces. Over time, the ability to track the body, arms, and face by video camera without explicit fiducial marks is likely to improve, removing this disadvantage.
Dr. Ohya showed the JTEC team videos of two-person and three-person collaborations, and had a demonstration of a VR environment for assembling pieces of a building. As a substitute for the haptic feedback experienced when objects that the subject is assembling collide, the system on impact changes the color of colliding objects and makes a sound.
Dr. Yasuhiro Yamazaki, President, ATR Interpreting Telecommunications Research Laboratories, demonstrated a system for translating spoken Japanese into synthesized spoken English and German. The system has a 1,500-word vocabulary, requires training, and accepts continuous speech. Portions of the system were developed at CMU and at Siemens R&D in Munich. The JTEC team saw an improved version of the C-Star system demonstrated in 1993, which accepted only discrete speech. The system runs on high-performance UNIX workstations with no special DSP hardware and takes 15-20 seconds to translate a sentence of 5-10 words. Figure ATR.2 shows the logical organization of the system.
Fig. ATR.2. Organization of ASURA: Advanced Speech Understanding and Rendering System of ATR.
ATR Institute International. 1995. ATR symposium on face and object recognition '95.
Oda, Masaomi, Shigeru Akamatsu, and Hideo Fukamachi. 1994. Similarity judgment of a face by K-L expansion technique based on a comparison to human judgment. In Japan-Israel binational workshop on computer vision and visual communication (CVVC '94), 25-30.
Oda, Masaomi and Takashi Kato. 1993. What kinds of facial features are used in face retrieval? In Proc., IEEE international workshop on robot and human communication, 265-270.
Oda, Masaomi. 1995. Human interface for an ambiguous image retrieval system. In Proceedings HCI '95 conference. Japan.
Ohya, J., Y. Kitamura, F. Kishino, and N. Terashima. 1995. Virtual space teleconferencing: Real-time reproduction of 3D human images. Journal of Visual Communication and Image Representation 6(1) (March):1-25.
ATR Human Information Processing Laboratories. 1994. Research Reports.