Accession Number : ADA329886
Title : Lip Tracking for Audio-Visual Speech Recognition.
Descriptive Note : Doctoral thesis
Corporate Author : AIR FORCE INST OF TECH WRIGHT-PATTERSON AFB OH
Personal Author(s) : Kaucic, Robert A., Jr
PDF Url : ADA329886
Report Date : 30 SEP 1997
Pagination or Media Count : 168
Abstract : Human speech is conveyed through both acoustic and visual channels and is therefore inherently multi-modal. Further, the two channels are largely complementary in that the acoustic signal typically contains information about the manner of articulation while the visual signal embodies knowledge of the place of articulation. This orthogonal nature of the audio and visual components has enticed researchers to develop audio-visual speech recognition systems that have been shown to be robust to acoustic noise. A fundamental requirement of automatic audio-visual speech recognition is real-time tracking; however, this necessity has been largely ignored by the lipreading community. This work presents a new approach for tracking unadorned lips in real time (50 fields/sec). The tracking framework presented combines comprehensive shape and motion models learnt from continuous speech sequences with focused image feature detection methods. Statistical models of the grey-level appearance of the mouth are shown to enable identification of the lip boundary in poorly contrasted grey-level images. The combined armory of these modeling approaches permits robust, real-time tracking of unadorned lips. Isolated-word recognition experiments using dynamic time warping and Hidden Markov Model-based recognizers demonstrate that real-time, contour-based lip tracking can be used to provide robust recognition of degraded speech. In noisy acoustic conditions, the performance of recognizers incorporating visual shape parameters is superior to that of the acoustic-only solutions, with error rate reductions of up to 44%.
Descriptors : *SPEECH RECOGNITION, *VISUAL SIGNALS, *ACOUSTIC CHANNELS, MATHEMATICAL MODELS, REAL TIME, HUMANS, DYNAMICS, PARAMETERS, MOTION, RATES, TRACKING, SEQUENCES, REDUCTION, TIME, ERRORS, SOUND, ACOUSTIC SIGNALS, RECOGNITION, VISION, STATISTICAL ANALYSIS, AUTOMATIC, DUAL CHANNEL, NOISE(SOUND), MOUTH, AUDIOVISUAL AIDS.
Subject Categories : Voice Communications
Distribution Statement : APPROVED FOR PUBLIC RELEASE