
Contextual time-continuous emotion recognition based on multimodal data

Thesis_Fedotov_kiz.p ... (26.03Mb)
Date of first publication
2022-02-15
Authors
Fedotov, Dmitrii
Referee
Minker, Wolfgang
Karpov, Alexey
Dissertation


Faculties
Fakultät für Ingenieurwissenschaften, Informatik und Psychologie
Institutions
Institut für Nachrichtentechnik
External cooperations
ITMO University, St. Petersburg, Russia
Abstract
This thesis presents novel approaches to integrating contextual information in order to improve the performance of automatic emotion recognition systems. Emotion recognition has interested researchers for a long time, and over the last two decades it has developed considerably thanks to hardware and software improvements as well as the growing demand for intelligent conversational agents. Although modern emotion recognition research targets spontaneous data (recorded outside laboratory conditions) and real-world applications, recognition is often performed in isolation: the pipeline is designed without accounting for the user's previous actions or emotional state, information about his or her interlocutor (if any), or the environment. All of these aspects and sources of information play an important role in defining or affecting the user's current mood and emotional state; hence, they should be analyzed in order to obtain a precise and comprehensive estimate.

The main objectives of this thesis are the development, application and evaluation of approaches to utilizing contextual information in emotion recognition systems. Such information may be available at three different levels; we make use of all of them, defining one aim for each level.

The first aim is to determine whether the amount of contextual information about the user, i.e. his or her previous speech or facial expressions, is related to emotion recognition performance and, if so, which amount of data is optimal and on which factors it depends. To tackle this question, we first aligned features and labels using a combination of algorithms for reaction-lag correction. We then tested two types of models: time-dependent (based on recurrent neural networks) and time-independent (based on multilayer perceptrons, linear regression with L2 regularization, support vector regression, and gradient boosted decision trees). To test the hypothesized dependency between the amount of context and system performance, we developed a flexible approach to contextual modeling that considers each stage of the recognition pipeline, and carried out extensive experiments on three corpora of spontaneous, time-continuous audio-visual data annotated in arousal and valence. The results obtained with various approaches, context lengths, models, modalities and dimensions show that there are indeed dependencies between the amount of context used and model performance. More precisely, the optimal amount does not depend on the feature set, the number of time steps for recurrent models, or the data frequency; it is, however, affected by the modality, the corpus and the model type, and more detailed conclusions on the latter aspects would require experiments on additional databases. Moreover, experiments in a cross-corpus scenario showed that contextual dependencies are often inherited from the training and target corpora.
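As a purely illustrative sketch of this kind of contextual modeling (not the pipeline actually used in the thesis), the Python snippet below groups frame-level features into context windows of a chosen length and feeds them to a small recurrent regressor for one time-continuous dimension; all names, feature sizes and the toy data are assumptions.

# Illustrative sketch only (assumed names, sizes and toy data, not the thesis code):
# build fixed-length context windows over frame-level features and regress a
# time-continuous dimension (e.g. arousal) with a small GRU.
import numpy as np
import torch
import torch.nn as nn

def make_context_windows(features, labels, context_len):
    # Each window contains `context_len` consecutive frames and predicts
    # the label of its last frame.
    X, y = [], []
    for t in range(context_len - 1, len(features)):
        X.append(features[t - context_len + 1 : t + 1])
        y.append(labels[t])
    return np.stack(X), np.array(y)

class ContextGRURegressor(nn.Module):
    def __init__(self, num_features, hidden_size=64):
        super().__init__()
        self.gru = nn.GRU(num_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)   # one continuous dimension

    def forward(self, x):                        # x: (batch, context_len, num_features)
        _, h = self.gru(x)                       # h: (1, batch, hidden_size)
        return self.head(h[-1]).squeeze(-1)      # (batch,)

# Toy stand-ins for frame-level audio-visual features and gold annotations.
frames = np.random.randn(1000, 40).astype("float32")
arousal = np.random.uniform(-1, 1, 1000).astype("float32")

for context_len in (25, 50, 100):                # vary the amount of context
    X, y = make_context_windows(frames, arousal, context_len)
    pred = ContextGRURegressor(num_features=40)(torch.from_numpy(X))
    print(context_len, X.shape, pred.shape)

In the actual experiments, each model would of course be trained and its validation performance compared per context length, modality and corpus.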
The second aim of the thesis is to develop approaches for integrating the interlocutor's (conversational partner's) data into the user's emotion recognition system in order to increase its performance, in a way that remains applicable to the time-continuous problem statement. The approaches we developed are based on feature-level and decision-level fusion and allow the context length to be varied for the user and the interlocutor either dependently or independently, i.e. using the same or different amounts of data in each sample. We conducted experiments on four corpora of spontaneous interactions with audio-visual data annotated in arousal and valence, testing four approaches to contextual emotion recognition in dyadic interaction. Comparing these approaches with a speaker-only baseline, we concluded that incorporating the interlocutor's data into an emotion recognition system can significantly improve its performance. Among the tested approaches, the fully independent one achieved the highest performance but is also the most resource-demanding; the simpler, partially independent or dependent approaches performed slightly worse on average but have fewer parameters and are therefore easier to start with.

The third aim is to determine whether information about the user's surroundings can help to build an emotion recognition system and, if so, which modalities provide the highest performance. Here we focused on a specific use case in which the environmental context strongly influences the user's emotions, namely a sightseeing tour. As no off-the-shelf corpora are available for this task, we collected our own dataset of emotionally labelled touristic behaviour using several devices and annotated it on various scales. We then trained several uni-, bi-, tri- and multimodal systems for estimating emotion, satisfaction and touristic experience quality (a novel labelling approach for the smart tourism domain), using feature sets designed to extract meaningful characteristics from the collected data. Our experiments showed that features describing head movements (tilts and turns) provide the highest performance for emotion recognition; head movements combined with eye-movement-based features work best for satisfaction estimation; and audio-visual features work best for predicting touristic experience quality. Feature-level fusion of all available modalities improved on the best unimodal systems only for satisfaction estimation, whereas decision-level fusion outperformed the unimodal systems on all three tasks and achieved markedly higher performance than the other approaches.
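To make the difference between the two fusion schemes mentioned above concrete, here is a minimal sketch under assumed data shapes, with ridge regression standing in for the models actually used: feature-level fusion concatenates the speaker's and interlocutor's feature vectors before a single regressor, while decision-level fusion trains one regressor per participant and combines their predictions.

# Minimal sketch of feature-level vs. decision-level fusion (assumed shapes;
# ridge regression is only a stand-in for the models used in the thesis).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 500
speaker_feats = rng.normal(size=(n, 32))        # user's features per time step
partner_feats = rng.normal(size=(n, 32))        # interlocutor's features per time step
valence = rng.uniform(-1, 1, size=n)            # time-continuous gold label

# Feature-level fusion: concatenate both participants' features, train one model.
fused = np.hstack([speaker_feats, partner_feats])
pred_feature_level = Ridge(alpha=1.0).fit(fused, valence).predict(fused)

# Decision-level fusion: one model per participant, weighted combination of outputs.
m_user = Ridge(alpha=1.0).fit(speaker_feats, valence)
m_partner = Ridge(alpha=1.0).fit(partner_feats, valence)
w = 0.7                                          # weight would be tuned on a development set
pred_decision_level = w * m_user.predict(speaker_feats) + (1 - w) * m_partner.predict(partner_feats)

print(pred_feature_level.shape, pred_decision_level.shape)

The same two schemes carry over to the multimodal tourist-data setting, where the individual modalities (head movements, eye movements, audio-visual features) take the place of the two participants.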
 
This dissertation was written as part of a cooperation with ITMO University in St. Petersburg.
Date created
2020
Earlier version(s)
http://fppo.ifmo.ru/dissertation/?number=225860
Subject headings
[GND]: Kontextualismus
[LCSH]: Recognition (Psychology)
[MeSH]: Voice recognition
[Free subject headings]: Contextual emotion recognition | Audio visual emotion recognition | Smart tourism | Dyadic interactions
[DDC subject group]: DDC 150 / Psychology | DDC 410 / Linguistics
License
CC BY 4.0 International
https://creativecommons.org/licenses/by/4.0/


DOI & citation

Please use this identifier to cite or link to this item: http://dx.doi.org/10.18725/OPARU-41799

Fedotov, Dmitrii (2022): Contextual time-continuous emotion recognition based on multimodal data. Open Access Repositorium der Universität Ulm und Technischen Hochschule Ulm. Dissertation. http://dx.doi.org/10.18725/OPARU-41799


