Alejandro Jaimes, Nicu Sebe, Multimodal human-computer interaction: A survey, Computer Vision and Image Understanding, 2007.

A Survey of Multimodal Human-Computer Interaction

Abstract: This paper surveys the main approaches in multimodal human-computer interaction (MMHCI), giving an overview of the field from a computer vision perspective. We focus in particular on body, gesture, gaze, and affective interaction (facial expression recognition and emotion in speech), discuss user and task modeling as well as multimodal fusion, and highlight the challenges, open issues, and emerging applications of MMHCI research.

1. Introduction
Multimodal human-computer interaction (MMHCI) lies at the intersection of several research areas, including computer vision, psychology, and artificial intelligence. We study MMHCI in order to make computer technology more usable for people, which always requires understanding at least three aspects: the user who interacts with the computer, the system (the computer technology and its usability), and the interaction between user and system. Considering these aspects, it is clear that MMHCI is a multidisciplinary subject, since the designer of an interactive system should have a range of relevant expertise: psychology and cognitive science to understand the user's perceptual, cognitive, and problem-solving skills; sociology to understand the wider interaction context; ergonomics to understand the user's physical capabilities; graphic design to produce effective interface presentations; computer science and engineering to build the necessary technology; and so on. The multidisciplinary nature of MMHCI is what prompts this survey. Rather than focusing only on the computer vision techniques used in MMHCI, we give an overview of the whole field and discuss the main approaches and topics in MMHCI from a computer vision perspective.
1.1. Motivation
Human-to-human communication inherently involves interpreting a mixture of speech and visual signals. Researchers in many fields have recognized this, and advances in unimodal techniques (speech and audio processing, computer vision, etc.) and in hardware technologies (inexpensive cameras and other kinds of sensors) have enabled significant progress in MMHCI research. Unlike traditional HCI applications (a single user facing a computer and interacting with it via a mouse or keyboard), in new applications (e.g., smart homes [105], remote collaboration, the arts), interaction is not always a matter of explicit commands and often involves multiple users. This is partly because, over the past few years, processor speed, memory, and storage capacity have improved dramatically, matched by the availability of many novel input and output devices that are making ubiquitous computing [185,67,66] a reality; the devices include phones, embedded systems, PDAs, laptops, wall-size displays, and so on. The availability of so many devices with differing computational power and input/output capabilities means that the future of computing is likely to include novel ways of interacting; some approaches include gestures [136], speech [143], haptics [9], eye blinks [58], and others. Glove-mounted devices [19] and graspable user interfaces [48], as well as tangible user interfaces, now seem ripe for exploration, and pointing devices with haptic feedback, eye tracking, and blink detection [69] have also appeared. However, just as in human-to-human communication, effective communication arises when different input devices are used in combination.

Multimodal interfaces have many advantages [34]: they can prevent errors, bring robustness to the interface, help the user correct errors or recover from them more easily, bring more bandwidth to the communication, and add alternative communication methods for different situations and environments. In many systems, using a multimodal interface to disambiguate error-prone modalities is one of the main motivations for multimodal applications: as Oviatt [123] discusses, error-prone technologies can compensate for each other rather than add redundancy to the interface, reducing the need to correct errors. It must be noted, however, that multiple modalities alone do not bring benefits to an interface: the use of multiple modalities may be ineffective or even disadvantageous. Accordingly, Oviatt [124] has identified common misconceptions, or myths, about multimodal interfaces, most of which concern the use of speech as an input modality.

In this paper, we survey the research areas that we consider essential to MMHCI, summarize the state of the art in each, and, based on our survey, identify major trends and open issues in MMHCI. We group the vision techniques according to the human body (see Fig. 1): large-scale body movement, gesture, and gaze analysis are used for tasks such as expression recognition in affective interaction and for a variety of other applications. We also discuss affective computer interaction, issues in multimodal fusion, modeling, and data collection, and several emerging MMHCI applications. Since MMHCI is a very dynamic and broad research area, we do not attempt a complete overview; the main contribution of this paper is to summarize the main computer vision techniques used in MMHCI while surveying the field's main research areas, techniques, applications, and open issues.
Fig. 1. Overview of human-centered multimodal interaction.

1.2. Related surveys
Extensive surveys have been published in several areas, such as face detection [190,63], face recognition [196], facial expression analysis [47,131], vocal emotion [119,109], gesture recognition [96,174,136], human motion analysis [65,182,56,3,46,107], audio-visual automatic speech recognition [143], and eye tracking [41,36]. Surveys of vision-based HCI are presented in [142] and [73], focusing on head tracking, face and facial expression recognition, eye tracking, and gesture recognition. Adaptive and intelligent HCI is discussed in [40], mainly as a survey of computer vision for human motion analysis together with a discussion of techniques for lower-arm movement detection, face processing, and gaze analysis. Multimodal interfaces are discussed in [125,128,144,158,135,171]. Real-time vision techniques for HCI, including human body posture, object tracking, gestures, attention, and face pose, are discussed in [84] and [77]. Here, we do not revisit work covered by those earlier surveys; we add areas they did not cover (e.g., beyond [84,40,142,126,115]) and discuss new applications in emerging areas, highlighting the main research issues. Related conferences and workshops include ACM CHI, IFIP Interact, IEEE CVPR, IEEE ICCV, ACM Multimedia, the International Workshop on Human-Centered Multimedia (HCM) in conjunction with ACM Multimedia, the International Workshops on Human-Computer Interaction in conjunction with ICCV and ECCV, the Intelligent User Interfaces (IUI) conference, and the International Conference on Multimodal Interfaces (ICMI).

2. Overview of multimodal interaction
The term "multimodal" has been used in many contexts and has acquired a variety of meanings (see [10-12] for interpretations of modality).
For our purposes, a multimodal HCI system is simply one that responds to inputs in more than one modality or communication channel (e.g., speech, gesture, writing, and others). We take a human-centered approach: by "modality" we mean a mode of communication according to the human senses, together with computer input devices activated by humans or measuring human quantities (e.g., a blood pressure monitor), as depicted in Fig. 1. The human senses are sight, touch, hearing, smell, and taste, and the input modalities of many computer input devices correspond to them: cameras (sight), haptic sensors (touch) [9], microphones (hearing), olfactory devices (smell), and even taste [92]. However, many other human-activated computer input devices correspond to a combination of human senses, or to none at all: the keyboard, mouse, writing tablet, motion input (e.g., a device moved by the user in order to interact), galvanic skin response, and other biometric sensors.

In our definition, the word "input" is crucial, since in practice most interaction with computers takes place through more than one modality. For example, when we type, we touch the keys of the keyboard to enter data into the computer, but some of us also use sight at the same time, to read what we are typing or to locate the keys to press. It is therefore important to keep in mind the difference between what the human is doing and what the system is actually receiving as input during the interaction. For example, a computer fitted with a microphone might be able to understand several languages, or merely different types of sounds (e.g., a humming interface for music retrieval). Although the term "multimodal" has often been applied to such cases (e.g., multilingual input is considered multimodal in [13]), in this paper we call multimodal only those systems that combine different modalities (i.e., communication channels), as depicted in Fig. 1.
For example, a system that responds to facial expressions and hand gestures using only cameras is not multimodal, even if the input signals come from multiple cameras; by the same token, a system with multiple keys is not multimodal, whereas one with mouse and keyboard input is. Although multimodal interaction using device combinations such as mouse and keyboard, or keyboard and pen, has been investigated, in this paper we address only human-computer interaction techniques that combine visual (camera) input with other types of input.

In HCI, multimodal techniques can be used to construct many different types of interfaces (Fig. 1). We are particularly interested in perceptual, attentive, and enactive interfaces. As defined in [177], perceptual interfaces [176] are highly interactive multimodal interfaces that enable rich, natural, and efficient interaction with computers; they seek to leverage sensing (input) and rendering (output) technologies to provide interactions not feasible with standard interfaces and common I/O devices such as the keyboard, mouse, and monitor [177], and they make computer vision a central component in many cases. Attentive interfaces [180] are context-aware interfaces [160] that rely on a person's attention as the primary input; that is, attentive interfaces [120] use the gathered information to estimate the best time and approach for communicating with the user. Since attention is manifested mainly through eye contact [160] and gestures (although measures such as mouse movement can also be indicative), computer vision plays a major role in attentive interfaces. Enactive interfaces are those that help users communicate a form of knowledge based on the active use of the hands or body for apprehension tasks. Enactive knowledge is not simply multisensory mediated knowledge; it is stored in the form of motor responses and is acquired by the act of doing.
Typical examples are the competences required by tasks such as typing, driving a car, dancing, playing a musical instrument, and modeling clay, all of which would be difficult to describe in iconic or symbolic form.

3. Human-centered vision
We classify vision techniques for MMHCI using a human-centered approach and divide them according to the human body: (1) large-scale body movements, (2) hand gestures, and (3) gaze. We make a distinction between command interfaces (actions can be used to explicitly execute commands: select menus, etc.) and non-command interfaces (actions or events used to indirectly tune the system to the user's needs) [111,23].

In general, vision-based human motion analysis systems used for MMHCI can be thought of as having mainly four stages: (1) motion segmentation, (2) object classification, (3) tracking, and (4) interpretation.
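To make the four stages concrete, the sketch below wires them together with OpenCV. It is a minimal illustration only: the choice of MOG2 background subtraction for segmentation, an area heuristic standing in for a trained object classifier, centroid accumulation as tracking, and a toy left/right interpretation rule are our assumptions, not steps prescribed by the survey.

import cv2

MIN_AREA = 500  # assumed size threshold separating noise from real objects

def interpret(track):
    # Stage 4: interpretation -- map a trajectory to an event (toy rule).
    if len(track) < 2:
        return "idle"
    dx = track[-1][0] - track[0][0]
    return "moving right" if dx > 0 else "moving left"

def run(video_path):
    cap = cv2.VideoCapture(video_path)
    bg = cv2.createBackgroundSubtractorMOG2()  # stage 1: background model
    track = []  # a single-object track, kept deliberately simple
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Stage 1: motion segmentation -- per-frame foreground mask.
        mask = bg.apply(frame)
        mask = cv2.medianBlur(mask, 5)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        # Stage 2: object classification -- keep the largest blob above a
        # size threshold; a real system would run a person/hand classifier.
        blobs = [c for c in contours if cv2.contourArea(c) > MIN_AREA]
        if blobs:
            x, y, w, h = cv2.boundingRect(max(blobs, key=cv2.contourArea))
            # Stage 3: tracking -- accumulate the blob centroid; a real
            # system would use Kalman or particle filters for association.
            track.append((x + w // 2, y + h // 2))
    cap.release()
    print(interpret(track))

if __name__ == "__main__":
    run("input.avi")  # hypothetical input video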
While some approaches use geometric primitives to model different components (e.g., cylinders used to model limbs, head, and torso for body movements, or the hand and fingers in gesture recognition), others use feature representations based on appearance (appearance-based methods). In the first approach, external markers are often used to estimate body posture and relevant parameters. While markers can be accurate, they place restrictions on clothing and require calibration, so they are not desirable in many applications. Moreover, the attempt to fit geometric shapes to body parts can be computationally expensive, and these methods are often not suitable for real-time processing. Appearance-based methods, on the other hand, do not require markers, but they do require training (e.g., with machine learning or probabilistic approaches). Since they do not require markers, they place fewer constraints on the user and are therefore more desirable.
Next, we briefly discuss some specific techniques for body, gesture, and gaze. The motion analysis steps are similar, so there is some inevitable overlap in the discussions. Some of the issues for gesture recognition, for instance, apply to body movements and gaze detection.

3.1. Large-scale body movements
Tracking of large-scale body movements (head, arms, torso, and legs) is necessary to interpret pose and motion in many MMHCI applications. However, since extensive surveys have been published in this area [182,56,107,183], we discuss the topic only briefly.

There are three important issues in articulated motion analysis [188]: representation (joint angles or motion of all the sub-parts), computational paradigms (deterministic or probabilistic), and computation reduction. Body posture analysis is important in many MMHCI applications. For example, in [172], the authors use a stereo and thermal infrared video system to estimate the driver's posture for deployment of smart air bags. The authors of [148] propose a method for recovering articulated body pose without initialization and tracking (using learning). The authors of [8] use pose and velocity vectors to recognize body parts and detect different activities, while the authors of [17] use temporal templates.
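Temporal templates of the kind used in [17] are commonly realized as motion history images, in which each pixel records how recently motion occurred there. The sketch below is a minimal NumPy/OpenCV version under our own assumptions: frame differencing as the motion cue, and the decay length TAU and threshold chosen arbitrarily.

import cv2
import numpy as np

TAU = 30          # assumed number of frames a motion trace persists
DIFF_THRESH = 25  # assumed threshold on per-pixel intensity change

def update_mhi(mhi, prev_gray, gray):
    # Decay the whole history by one step, then stamp fresh motion at TAU.
    motion = cv2.absdiff(gray, prev_gray) > DIFF_THRESH
    mhi = np.maximum(mhi - 1, 0)  # older motion fades linearly
    mhi[motion] = TAU             # newest motion gets full intensity
    return mhi

cap = cv2.VideoCapture("gesture.avi")  # hypothetical input video
ok, frame = cap.read()
if not ok:
    raise SystemExit("cannot read video")
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
mhi = np.zeros(prev.shape, dtype=np.int32)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    mhi = update_mhi(mhi, prev, gray)
    prev = gray
cap.release()
# The final MHI encodes where and how recently motion occurred; shape
# statistics of such templates can then be matched to known actions.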
In some emerging MMHCI applications, group actions and non-command actions play an important role. In [102], visual features are extracted from head and hand/forearm blobs: the head blob is represented by the vertical position of its centroid, and hand blobs are represented by their eccentricity and angle with respect to the horizontal. These features, together with audio features (e.g., energy, pitch, and speaking rate, among others), are used for segmenting meeting videos according to actions such as monologue, presentation, white-board, discussion, and note taking. The authors of [60] use only computer vision, but make a distinction between body movements, events, and behaviors, within a rule-based system framework.
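As an illustration of the blob descriptors used in [102] (the head blob's centroid height, and a hand blob's eccentricity and orientation with respect to the horizontal), the sketch below computes them from a binary mask via image moments. The moments-based formulas are standard; the skin-color segmentation that would produce the masks, and the synthetic test blob, are assumptions of ours.

import cv2
import numpy as np

def blob_features(mask):
    # mask: uint8 binary image containing a single blob.
    m = cv2.moments(mask, binaryImage=True)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # centroid
    # Orientation of the blob's principal axis w.r.t. the horizontal.
    angle = 0.5 * np.arctan2(2 * m["mu11"], m["mu20"] - m["mu02"])
    # Eccentricity from the eigenvalues of the second-moment matrix.
    cov = np.array([[m["mu20"], m["mu11"]],
                    [m["mu11"], m["mu02"]]]) / m["m00"]
    lam1, lam2 = sorted(np.linalg.eigvalsh(cov), reverse=True)
    ecc = np.sqrt(1.0 - lam2 / max(lam1, 1e-9))
    return {"centroid_y": cy, "angle": angle, "eccentricity": ecc}

# Example: a synthetic elongated blob standing in for a forearm region.
mask = np.zeros((240, 320), np.uint8)
cv2.ellipse(mask, (160, 120), (60, 15), 30, 0, 360, 255, -1)
print(blob_features(mask))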
Important issues for large-scale body tracking include whether the approach uses 2D or 3D, the desired accuracy, speed, occlusion, and other constraints. Some of the issues pertaining to gesture recognition, discussed next, can also apply to body tracking.

3.2. Hand gesture recognition
Although in human-human communication gestures are often performed using a variety of body parts (e.g., arms, eyebrows, legs, the entire body), most researchers in computer vision use the term gesture recognition to refer exclusively to hand gestures. We will use the term accordingly and focus on hand gesture recognition in this section.

Psycholinguistic studies of human-to-human communication [103] describe gestures as the critical link between our conceptualizing capacities and our linguistic abilities. Humans use a very wide variety of gestures, ranging from simple actions of using the hand to point at objects to more complex actions that express feelings and allow communication with others. Gestures should, therefore, play an essential role in MMHCI [83,186,52], as they seem intrinsic to natural interaction between the human and the computer-controlled interface in many applications, ranging from virtual environments [82] and smart surveillance [174] to remote collaboration [52].
There are several important issues to consider when designing a gesture recognition system [136]. The first phase of a recognition task is choosing a mathematical model that may consider both the spatial and the temporal characteristics of the hand and of hand gestures. The modeling approach plays a crucial role in the nature and performance of gesture interpretation. Typically, features are extracted from the images or video, and once these features are extracted, model parameters are estimated based on subsets of them until a right match is found. For example, the system might detect n points and attempt to determine whether these n points (or a subset of them) match the characteristics of points extracted from a hand in a particular pose or performing a particular action. The parameters of the model are then a description of the hand pose or trajectory and depend on the modeling approach used. Among the important problems involved in the analysis are hand localization [187], hand tracking [194], and the selection of suitable features [83]. After the parameters are computed, the gestures they represent need to be classified and interpreted based on the accepted model and on grammar rules that reflect the internal syntax of gestural commands. The grammar may also encode the interaction of gestures with other communication modes such as speech, gaze, or facial expressions. As an alternative to modeling, some authors have explored combinations of simple 2D motion-based detectors for gesture recognition [71].
In any case, to fully exploit the potential of gestures for an MMHCI application, the class of recognizable gestures should be as broad as possible, and ideally any gesture performed by the user should be unambiguously interpretable by the interface. However, most gesture-based HCI systems allow only symbolic commands based on hand posture or 3D pointing. This is due to the complexity associated with gesture analysis and the desire to build real-time interfaces. Also, most systems accommodate only single-hand gestures. Yet human gestures, especially communicative ones, naturally employ actions of both hands. However, if two-hand gestures are to be allowed, several ambiguous situations may appear (e.g., occlusion of the hands, intentional vs. unintentional movements), and the processing time will likely increase. Another increasingly considered aspect is the use of other modalities (e.g., speech) to augment the MMHCI system [127,162]. Such multimodal approaches can reduce the complexity and increase the naturalness of the interface for MMHCI [126].

3.3. Gaze detection
Gaze, defined as the direction to which the eyes are pointing in space, is a strong indicator of attention, and it has been studied extensively since as early as 1879 in psychology, and more recently in neuroscience and in computing applications [41].
While early eye-tracking research focused only on systems for in-lab experiments, many commercial and experimental systems are available today for a wide range of applications.

Eye-tracking systems can be grouped into wearable or non-wearable, and infrared-based or appearance-based. In infrared-based systems, a light shining on the subject whose gaze is to be tracked creates a red-eye effect: the difference in reflection between the cornea and the pupil is used to determine the direction of sight. In appearance-based systems, computer vision techniques are used to find the eyes in the image and then determine their orientation. While wearable systems are the most accurate (approximate errors below 1.4° vs. errors below 1.7° for non-wearable infrared systems), they are also the most intrusive. Infrared systems are more accurate than appearance-based ones, but there are concerns over the safety of prolonged exposure to infrared light. In addition, most non-wearable systems require (often cumbersome) calibration for each individual [108,121].
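A minimal sketch of the appearance-based route follows: a pretrained Haar cascade (shipped with OpenCV) localizes the eyes, and the centroid of the darkest pixels in each eye region is taken as a rough pupil position. The dark-pupil threshold is an assumption of ours, and the final mapping from pupil position to a gaze direction, which would require the per-user calibration discussed above, is not shown.

import cv2
import numpy as np

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def pupil_centers(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    centers = []
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        roi = gray[y:y + h, x:x + w]
        # Dark-pupil assumption: take the darkest 10% of the eye region.
        thresh = np.percentile(roi, 10)
        ys, xs = np.nonzero(roi <= thresh)
        if len(xs):
            centers.append((x + xs.mean(), y + ys.mean()))
    return centers

cap = cv2.VideoCapture(0)  # live webcam stream
ok, frame = cap.read()
if ok:
    print(pupil_centers(frame))
cap.release()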
Appearance-based systems usually capture both eyes using two cameras to predict gaze direction. Due to the computational cost of processing two streams simultaneously, the resolution of the image of each eye is often small. This makes such systems less accurate, although increasing computational power and lower costs mean that more computationally intensive algorithms can be run in real time. As an alternative, in [181], the authors propose using a single high-resolution image of one eye to improve accuracy. On the ot