Audio-visual Russian speech recognition

First published
2022-04-05
Authors
Ivanko, Denis
Referee
Minker, Wolfgang
Karpov, Alexey
Dissertation
Faculties
Fakultät für Ingenieurwissenschaften, Informatik und Psychologie
Institutions
Institut für Nachrichtentechnik
External cooperations
ITMO University, St. Petersburg, Russia
Abstract
This thesis investigates the development of a robust and reliable automatic speech recognition system based on the processing of audio (voiced speech) and video (lip-reading) information (AVSR). Despite a large number of applications and the recent scientific interest in audio-visual speech recognition, there is still room for improvement.
We consider several main peculiarities of the problem. First of all, studies on inflectional languages are practically absent; our work therefore focuses on one of the most widespread representatives of such languages, Russian. Another important issue is the small number of publicly available, representative audio-visual speech databases; for the Russian language, none existed before the present research. In addition, the scientific literature contains no studies on the effect of high-speed video data on the recognition accuracy of automated lip-reading systems, and very little research on the influence of different noise conditions on AVSR performance.
Thus, the main objective of this thesis is to develop the first automatic audio-visual Russian speech recognition system and to increase its performance. To achieve this goal, we solved four main tasks: (1) we collected a representative database of audio-visual Russian speech with high-speed recordings; (2) we applied an advanced feature engineering approach and developed our own feature extraction method; (3) we searched for an effective method of modelling high-speed audio-visual data and built automatic speech recognition systems of three different architectures from scratch: GMM-CHMM, DNN-HMM and end-to-end; (4) we conducted experimental evaluations with different frame rates, under various acoustic conditions, and with Russian language-specific setups.
For the first task, we proposed a new methodology for collecting audio-visual speech datasets. Based on it, the AVSpeechDBRecord software was developed and a first-of-its-kind audio-visual speech corpus, HAVRUS, was collected. It consists of uncompressed video files (640×480 pixels at 200 fps) and uncompressed audio files (WAV, 44.1 kHz sampling rate); text files with temporal annotations at phrase, word, phoneme and viseme level are also included. 20 native monolingual Russian speakers (10 male, 10 female) with no language or hearing impairments participated in the recordings.
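The corpus specification above can be checked programmatically. The following Python sketch (illustrative, not part of the AVSpeechDBRecord software) probes one recording pair against the stated format; the use of OpenCV and the standard-library wave module, as well as the function and path names, are assumptions.

    import wave
    import cv2  # OpenCV, assumed here only to read video properties

    def check_recording(video_path, audio_path):
        """Sanity-check one HAVRUS-style recording pair against the
        corpus specification: 640x480 video at 200 fps, 44.1 kHz WAV."""
        cap = cv2.VideoCapture(video_path)
        video_ok = (cap.get(cv2.CAP_PROP_FRAME_WIDTH) == 640
                    and cap.get(cv2.CAP_PROP_FRAME_HEIGHT) == 480
                    and round(cap.get(cv2.CAP_PROP_FPS)) == 200)
        cap.release()
        with wave.open(audio_path, "rb") as w:
            audio_ok = w.getframerate() == 44100
        return video_ok and audio_ok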
For the second task, we investigated different informative feature extraction methods and proposed our own modification of a geometry-based visual feature representation. Its distinctive feature is the extraction of geometric information about lip movements by computing Euclidean distances between certain key points on the speaker's lips (24 distances in total between 20 lip landmarks). The 20 key points on the lips are detected using a pre-trained active appearance model (AAM). This configuration was selected experimentally and conveys the most valuable information about the uttered speech. We evaluated different types of informative features extracted from the visual modality, including state-of-the-art pixel-based features and our own modification of geometry-based features. The proposed geometric features yielded a significant improvement (up to 9%) in recognition accuracy compared with pixel-based features.
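To make the feature definition concrete, the sketch below computes such distance features for one frame of lip landmarks. The exact 24-pair configuration selected in the thesis is not reproduced here, so the example pairing is a hypothetical placeholder.

    import numpy as np

    def lip_distance_features(landmarks, pairs):
        """Euclidean distances between selected lip landmark pairs.

        landmarks: (20, 2) array of (x, y) lip key points, e.g. from a
                   fitted active appearance model (AAM).
        pairs:     list of (i, j) landmark index pairs; the thesis uses
                   24 distances, selected experimentally.
        """
        pts = np.asarray(landmarks, dtype=float)
        return np.array([np.linalg.norm(pts[i] - pts[j]) for i, j in pairs])

    # Hypothetical pairing (NOT the thesis configuration), truncated:
    # vertical openings plus symmetric contour distances.
    EXAMPLE_PAIRS = [(0, 10), (2, 8), (4, 6), (3, 7)]

    frame_landmarks = np.random.rand(20, 2)  # stand-in for one video frame
    features = lip_distance_features(frame_landmarks, EXAMPLE_PAIRS)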
For the third task, we implemented automatic audio-visual speech recognition systems of three different architectures from scratch: GMM-CHMM, DNN-HMM and purely end-to-end. The methodology, tools, step-by-step development and all necessary parameters are described in detail in our research. It is worth noting that such systems were created for Russian speech recognition for the first time. We performed a comparative study of the three architectures. To better assess their pros and cons, the systems were trained both on the collected HAVRUS corpus of Russian speech and on the GRID corpus of English speech. All experiments on the GRID dataset clearly showed that NN-based methods are superior to the traditional GMM-CHMM approach. However, on small corpora such as HAVRUS, the traditional GMM-CHMM approach achieved the best recognition results.
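For orientation only (none of the three thesis architectures is reproduced here), the following PyTorch sketch shows the general shape of a simple end-to-end audio-visual recognizer: frame-synchronous audio and visual features are concatenated, encoded by a bidirectional LSTM, and scored per frame for CTC training. All layer sizes are assumptions; the visual dimension of 24 matches the geometric features described above.

    import torch
    import torch.nn as nn

    class TinyAVSR(nn.Module):
        """Minimal end-to-end audio-visual model sketch, not the thesis
        architecture: concatenated audio/visual features -> BiLSTM -> CTC."""
        def __init__(self, audio_dim=39, visual_dim=24, hidden=128, n_tokens=40):
            super().__init__()
            self.encoder = nn.LSTM(audio_dim + visual_dim, hidden,
                                   batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, n_tokens + 1)  # +1: CTC blank

        def forward(self, audio_feats, visual_feats):
            x = torch.cat([audio_feats, visual_feats], dim=-1)  # (B, T, D)
            h, _ = self.encoder(x)
            # Log-probabilities; transpose to (T, B, C) for nn.CTCLoss.
            return self.head(h).log_softmax(dim=-1)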
For the fourth task, we evaluated the influence of the video frame rate on audio-visual speech recognition accuracy. Compared with the regular 25 fps (the recording speed of most devices), using a 200 fps high-speed video camera increases the word recognition rate (WRR) by an average of 1.48% across all speakers. Experiments with a visual-only speech recognizer show an average WRR increase of 3.10%. To the best of our knowledge, speech recognition experiments with such high-fps video data were conducted for the first time.
The developed Russian AVSR system was evaluated under several acoustically noisy conditions (SNR varying from 40 to 0 dB). We discovered that in the range from 0 to 15-20 dB SNR (where the accuracy of acoustic-only speech recognition systems degrades significantly), the use of video information gives the greatest boost in recognition accuracy. Thus, we experimentally demonstrated the importance of the video modality for automatic speech recognition.
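Such noise experiments rest on a standard mixing rule that follows from the definition SNR = 10·log10(P_speech / P_noise): the noise is rescaled so that the mixture reaches the target SNR. A generic sketch (not code from the thesis):

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Mix `noise` into `speech` at a target SNR in dB.
        Assumes `noise` is at least as long as `speech`."""
        noise = noise[:len(speech)]
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        # Solve 10*log10(p_speech / (scale**2 * p_noise)) = snr_db for scale.
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + scale * noise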
We performed an experimental search for the optimal number of recognizable viseme classes for Russian speech, ranging from 2 classes (a division into vowels and consonants) to 48 classes (one per phoneme), in steps of 2. The research demonstrated that high-speed video data (200 fps) makes it possible to expand the number of visually distinguishable viseme classes of Russian speech to 20. Experiments with this configuration improved the average word recognition accuracy across all speakers compared with the 10 viseme classes previously used for regular 25 fps video data.
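Schematically, the search is a sweep over even class counts. In the sketch below, cluster_phonemes_to_visemes and evaluate_wrr are hypothetical stand-ins for the thesis's phoneme-to-viseme grouping and its recognition evaluation.

    def cluster_phonemes_to_visemes(n_classes):
        """Stub: map 48 phoneme indices onto n_classes viseme groups.
        The actual grouping in the thesis is selected experimentally."""
        return {ph: ph % n_classes for ph in range(48)}

    def evaluate_wrr(mapping):
        """Stub standing in for training/decoding with a given viseme set."""
        return 0.0  # replace with the measured word recognition rate

    best_n, best_wrr = None, float("-inf")
    for n_classes in range(2, 49, 2):  # 2 .. 48 viseme classes, step 2
        wrr = evaluate_wrr(cluster_phonemes_to_visemes(n_classes))
        if wrr > best_wrr:
            best_n, best_wrr = n_classes, wrr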
In this work, we addressed the problem of automatic audio-visual speech recognition, highlighting the fact that visual speech information is essential for building a speech recognition system that is both robust and accurate, especially in acoustically noisy conditions. In addition, we showed that the video frame rate is crucial for correctly representing the fast dynamics of lip movements during continuous speech and has a significant impact on the resulting recognition accuracy. We collected HAVRUS, the first audio-visual speech corpus with high-speed video recordings, and developed the AVSpeechRecognition software, designed for the recognition of continuous audio-visual speech with a small or medium vocabulary (up to a thousand speech commands) based on a high-speed video camera and a digital microphone. This dissertation was written in the context of a cooperation with ITMO University in St. Petersburg.
Date created
2020
Earlier version(s)
https://dissovet.itmo.ru/dissertation/?number=231510
Subject headings
[GND]: Automatische Spracherkennung | Maschinelles Sehen | Maschinelles Lernen
[LCSH]: Audio-visual translation | Computer vision | Machine learning
[Free subject headings]: Audio-visual speech recognition | Automated lip-reading | Computer vision | Machine learning
[DDC subject group]: DDC 600 / Technology (Applied sciences) | DDC 620 / Engineering & allied operations
DOI & citation
Please use this identifier to cite or link to this item: http://dx.doi.org/10.18725/OPARU-42762
Ivanko, Denis (2022): Audio-visual Russian speech recognition. Open Access Repositorium der Universität Ulm und Technischen Hochschule Ulm. Dissertation. http://dx.doi.org/10.18725/OPARU-42762