Session 4B: SS4: Multimedia and Multimodal Interaction for Health and Basic Care Applications
This special session presents recent work and applications in multimedia analysis and multimodal interaction for health and basic care. As devices and systems become more powerful and content analytics and retrieval technologies advance, the human-computer interface often lags behind and becomes a bottleneck for efficient use in real-world applications. This matters especially in health and basic care applications, where interaction with humans is even more critical because of the special needs and urgent situations involved. Drawing on multidisciplinary expertise from research in multimedia analysis and multimodal interaction, new technologies are required that offer interactions closer to human communication patterns and allow a more “natural” communication with systems in the context of health and basic care. Recent research pursues this goal by developing knowledge-based, autonomous, human-like social agents that can analyze and retrieve information and learn from conversational spoken and multimodal interaction in order to support caregiving scenarios. In parallel, the last few years have seen a growing need for video content processing in health applications. A characteristic example is video from endoscopic procedures and surgeries, since endoscopists and surgeons are increasingly archiving the videos they use to perform the endoscopic intervention. These endoscopic videos contain valuable information that can be used for later inspection, for explanations to patients, for case investigations, and for training purposes.
There is therefore an important need for powerful multimedia systems that can effectively process huge amounts of video data with highly similar content and make them available for content exploration and retrieval.
For more details of this session, please visit: http://mklab.iti.gr/mmih/.
Deep Learning of Shot Classification in Gynecologic Surgery Videos
Alpen-Adria-Universität Klagenfurt, Austria
In the last decade, advances in endoscopic surgery have resulted in vast amounts of video data, which are used for documentation, analysis, and education. To find video scenes relevant for these purposes, physicians manually search and annotate hours of endoscopic surgery videos. This process is tedious and time-consuming, which motivates the (semi-)automatic annotation of such surgery videos. In this work, we investigate whether a single-frame model for semantic surgery shot classification is feasible and useful in practice. We approach this problem by further training AlexNet, a pre-trained CNN architecture. This lets us transfer knowledge gathered from the ImageNet database to the medical use case of shot classification in endoscopic surgery videos. We annotate hours of endoscopic surgery videos to obtain training and testing data. Our results imply that the CNN-based single-frame classification approach can provide useful suggestions to medical experts while they annotate video scenes, which improves the annotation process. Future work shall evaluate more sophisticated classification methods that incorporate the temporal video dimension, which is expected to improve on the baseline evaluation done in this work.
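As a rough illustration of how single-frame predictions might be turned into a shot-level suggestion for an annotator, the sketch below averages per-frame class probabilities over a shot and proposes the top class. The averaging rule, the class names, and the function name `suggest_shot_label` are assumptions made for illustration; the abstract does not state how frame outputs are combined.

```python
from collections import defaultdict


def suggest_shot_label(frame_probs):
    """Average per-frame class probabilities and return (label, mean_prob).

    frame_probs: list of dicts mapping class name -> probability,
    one dict per frame (e.g. the softmax output of a frame classifier).
    """
    if not frame_probs:
        raise ValueError("a shot needs at least one frame")
    totals = defaultdict(float)
    for probs in frame_probs:
        for label, p in probs.items():
            totals[label] += p
    n = len(frame_probs)
    means = {label: total / n for label, total in totals.items()}
    best = max(means, key=means.get)
    return best, means[best]


# Hypothetical class names and probabilities, not taken from the paper.
shot = [
    {"suturing": 0.8, "cutting": 0.2},
    {"suturing": 0.6, "cutting": 0.4},
    {"suturing": 0.3, "cutting": 0.7},
]
label, score = suggest_shot_label(shot)
print(label, round(score, 2))
```

Averaging keeps one misclassified frame from deciding the suggestion on its own; a majority vote over frames would be an equally simple alternative.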
Classification of sMRI for AD diagnosis with Convolutional Neuronal Networks: a pilot 2-D+ε study on ADNI
1University Bordeaux/LABRI, France; 2CNRS UMR 5287 - INCIA; 3University Ibn Zohr; 4ENSEIRB/LaBRI
In interactive health care systems, Convolutional Neural Networks (CNN) are starting to find applications, e.g. the classification of structural Magnetic Resonance Imaging (sMRI) scans for Alzheimer’s disease Computer-Aided Diagnosis (CAD). In this paper we focus on the hippocampus morphology, which is known to be affected as the illness progresses. We use a subset of the ADNI database to classify images belonging to Alzheimer’s disease (AD), mild cognitive impairment (MCI) and normal control (NC) subjects. As the number of images in such studies is rather limited relative to the needs of CNNs, we propose a data augmentation strategy adapted to the specificity of sMRI scans. We also propose a 2-D+ε approach, where only a very limited number of consecutive slices are used for training and classification. The tests, conducted on only one projection (sagittal), show that this approach provides good classification accuracies: AD/NC 82.8%, MCI/NC 66%, and AD/MCI 62.5%. These results are promising for integrating this 2-D+ε strategy into more complex multi-projection and multi-modal schemes.
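To make the 2-D+ε idea concrete, the following minimal sketch picks a small window of consecutive slices around a chosen slice index and applies one simple augmentation, a vertical shift. The window size of 2·eps+1, the shift augmentation, and the function names are assumptions for illustration only; the abstract does not specify the number of slices or the augmentation operations used.

```python
def extract_2d_eps(volume, center, eps=1):
    """Return the 2*eps+1 consecutive slices centred on index `center`.

    volume: list of 2-D slices (each a list of rows) along one projection.
    """
    lo, hi = center - eps, center + eps
    if lo < 0 or hi >= len(volume):
        raise IndexError("slice window falls outside the volume")
    return volume[lo:hi + 1]


def shift_rows(slice_2d, dy, fill=0):
    """Translate a slice by dy rows (positive = down), padding with `fill`."""
    height = len(slice_2d)
    width = len(slice_2d[0])
    out = []
    for y in range(height):
        src = y - dy
        if 0 <= src < height:
            out.append(list(slice_2d[src]))
        else:
            out.append([fill] * width)
    return out


# Toy volume: 5 slices of size 2x2, each filled with its own index.
volume = [[[i, i], [i, i]] for i in range(5)]
window = extract_2d_eps(volume, center=2, eps=1)
augmented = [shift_rows(s, dy=1) for s in window]
print(len(window), augmented[0])
```

The selected slices could then be stacked as input channels of a 2-D network, which keeps the input far smaller than a full 3-D volume.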
Description Logics and Rules for Multimodal Situational Awareness in Healthcare
Information Technologies Institute, CERTH, Greece
We present a framework for semantic situation understanding and interpretation of multimodal data using Description Logics (DL) and rules. More precisely, we use DL models to formally describe contextualised dependencies among verbal and non-verbal descriptors in multimodal natural language interfaces, while context aggregation, fusion and interpretation are supported by SPARQL rules. Both background knowledge and multimodal data, e.g. language analysis results, facial expressions and gestures recognized from multimedia streams, are captured in terms of OWL 2 ontology axioms, the de facto standard formalism of DL models on the Web, which fosters reusability, adaptability and interoperability of the framework. The framework has been applied in the field of healthcare, providing the models for the semantic enrichment and fusion of verbal and non-verbal descriptors in dialogue-based systems.
Speech Synchronized Tongue Animation by Combining Physiology Modeling and X-ray Image Fitting
University of Science and Technology of China, People's Republic of China
This paper proposes a system that generates speech-synchronized tongue animation from text or speech. Firstly, an anatomically accurate physiological tongue model is built and used to produce a large number of tongue deformation samples from randomly generated muscle activation inputs. Secondly, these input and output samples are used to train a neural network that establishes the relationship between muscle activation and tongue contour deformation. Thirdly, after the rigid tongue movement is removed, the neural network is used to estimate the non-rigid tongue movement parameters, namely the tongue muscle activations, from a collected X-ray tongue movement image database of Mandarin Chinese phonemes. The estimation results are then used to construct the tongue physeme (the sequences of the tongue muscle activations and the rigid movement) database corresponding to the Mandarin Chinese phoneme database. Finally, the physemes corresponding to the phonemes extracted from the input text or speech are blended, according to the phoneme durations, to drive the physiological tongue model and produce the speech-synchronized tongue animation. Simulation results demonstrate that the synthesized tongue animations are visually realistic and approximate the tongue medical data well.
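The final blending step can be pictured with the simplified sketch below, which interpolates linearly between one muscle-activation vector per phoneme, timed by the phoneme durations. Reducing each physeme to a single key pose, placing it at the midpoint of its phoneme, and using linear interpolation are all assumptions made for illustration; the abstract does not describe the blending function, and `blend_physemes` is a hypothetical name.

```python
def blend_physemes(key_poses, durations, t):
    """Linearly interpolate muscle activations at time t (seconds).

    key_poses: list of activation vectors, one per phoneme.
    durations: duration of each phoneme in seconds (all positive).
    """
    if not key_poses or len(key_poses) != len(durations):
        raise ValueError("need one duration per key pose")
    if any(d <= 0 for d in durations):
        raise ValueError("durations must be positive")
    # Place each key pose at the midpoint of its phoneme interval.
    mids, start = [], 0.0
    for d in durations:
        mids.append(start + d / 2.0)
        start += d
    # Hold the first or last pose outside the range of midpoints.
    if t <= mids[0]:
        return list(key_poses[0])
    if t >= mids[-1]:
        return list(key_poses[-1])
    for i in range(len(mids) - 1):
        if mids[i] <= t <= mids[i + 1]:
            w = (t - mids[i]) / (mids[i + 1] - mids[i])
            a, b = key_poses[i], key_poses[i + 1]
            return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


# Two hypothetical phonemes with two muscle activations each.
poses = [[0.0, 1.0], [1.0, 0.0]]
durations = [0.5, 0.5]
frame = blend_physemes(poses, durations, t=0.5)
print(frame)
```

Sampling `blend_physemes` at the animation frame rate would give one activation vector per frame to feed into the tongue model.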
Boredom Recognition based on Users' Spontaneous Behaviors in Multiparty Human-Robot Interactions
1Tokyo Institute of Technology; 2Honda Research Institute Japan Co., Ltd., Japan
Recognizing boredom in users who interact with machines is valuable for improving the user experience in long-term human-machine interactions, especially for intelligent tutoring systems, health-care systems, and social assistants. This paper proposes a two-stage framework and feature design for boredom recognition in multiparty human-robot interactions. In the first stage, the proposed framework detects boredom-indicating user behaviors from skeletal data obtained by motion capture. In the second stage, it recognizes boredom by combining the detection results with two types of multiparty information, i.e., gaze direction toward other participants and the arrival and departure of participants. We experimentally confirmed the effectiveness of both the proposed framework and the multiparty information. Compared with a simple baseline method, the proposed framework gained 35 percentage points in the F1 score.
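For readers unfamiliar with the metric, the short sketch below computes the F1 score from true-positive, false-positive, and false-negative counts and shows what a gain measured in percentage points means. The counts are invented purely to illustrate the arithmetic and are not the paper's results.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration only (not taken from the paper).
baseline_f1 = f1_score(tp=30, fp=40, fn=50)
proposed_f1 = f1_score(tp=60, fp=20, fn=20)

# A gain in percentage points is the absolute difference of the two
# scores expressed in percent, not a relative improvement.
gain_points = (proposed_f1 - baseline_f1) * 100
print(round(baseline_f1, 2), round(proposed_f1, 2), round(gain_points, 1))
```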