Accession Number : ADA307187

Title : The Unsupervised Acquisition of a Lexicon from Continuous Speech.

Descriptive Note : Memorandum report

Corporate Author : MASSACHUSETTS INST OF TECH CAMBRIDGE ARTIFICIAL INTELLIGENCE LAB

Personal Author(s) : De Marcken, Carl

Report Date : NOV 1995

Pagination or Media Count : 29

Abstract : We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.

Descriptors : *SPEECH RECOGNITION, *ARTIFICIAL INTELLIGENCE, COMPUTER PROGRAMS, DATA BASES, MATHEMATICAL MODELS, ALGORITHMS, OPTIMIZATION, EFFICIENCY, LEARNING MACHINES, INPUT OUTPUT PROCESSING, DATA ACQUISITION, ACOUSTIC SIGNALS, WORDS(LANGUAGE), PATTERN RECOGNITION, SPEECH ANALYSIS, DATA COMPRESSION, VOCABULARY, HIERARCHIES, COMPUTATIONAL LINGUISTICS, SYNTAX, NATURAL LANGUAGE, TEXT PROCESSING, WORD RECOGNITION, LEXICOGRAPHY, MACHINE TRANSLATION, PHONETICS, PHRASE STRUCTURE GRAMMARS, SPEECH COMPRESSION, CONTEXT SENSITIVE GRAMMARS, PHONEMES.

Subject Categories : Cybernetics; Linguistics

Distribution Statement : APPROVED FOR PUBLIC RELEASE