When: September 15, 2004 @ 3:00pm
Where: Gerontology Auditorium (GER 124)
Abstract: As storage
costs plummet and speech recognition technology progressively improves,
it becomes feasible to think of archiving and publishing "spoken
documents" that can be accessed as easily as online text documents.
The range of potentially interesting spoken documents is vast, including
records of meetings, committee hearings, news broadcasts, and call center
data, as well as multi-media documents that include speech recordings.
Language processing technology for spoken documents is even more critical
than for text, since it is much more cumbersome to mine audio recordings
than text for useful information. A key component of both speech recognition
technology and many subsequent language processing technologies is statistical
language modeling. Language models are used to characterize word sequences
as an information source (a discrete stochastic process) that is to be
decoded from noisy observations, such as acoustic features in speech recognition
or words in another language in machine translation. Although
language is known to have long-distance structure, the most widely
used language model is the simple n-gram, an (n-1)th-order Markov process
estimated from word-sequence counts in data representative of the target
task. In addition, performance gains in language modeling in recent years have been driven as much by data collection as by advances in
representation of linguistic structure. As vast text resources are increasingly
available via the web, one might argue that this trend will continue.
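To make the n-gram idea above concrete, here is a minimal sketch of a bigram (first-order Markov) model estimated by maximum likelihood from word-sequence counts. The function name and the toy corpus are illustrative, not from the talk, and real systems would add smoothing for unseen word pairs.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram probabilities P(w2 | w1) by maximum likelihood
    from word-sequence counts, with sentence-boundary padding."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        # Count each history word and each adjacent word pair.
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

# Toy training data (illustrative only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
# "the" is followed by "cat" in one of its two occurrences,
# so P(cat | the) = 0.5.
```

In practice such counts are smoothed (e.g., with backoff or interpolation) so that unseen n-grams do not receive zero probability, which matters precisely because, as the abstract notes, training data rarely covers the target task completely.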
However, spoken language can be quite different from written language,
particularly for informal conversational speech, transcripts of which
are not as readily available as written text. Human language can vary
substantially depending on topic and register, such that the addition
of mismatched text to the training set can actually hurt language modeling
performance when using simple n-gram models. These observations argue
for a decomposition of language at several levels, in terms of factors
related to speaking style, topic, syntax and even morphology. This talk
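One common way to use mismatched text without letting it hurt performance, consistent with the observation above, is to interpolate an in-domain model with an out-of-domain one rather than pooling the counts. The sketch below is illustrative (the function and weight are assumptions, not the talk's specific method): a mixture weight tuned on held-out in-domain data can down-weight mismatched text as needed.

```python
def interpolate(p_in, p_out, lam):
    """Linearly mix an in-domain distribution with an out-of-domain one:
    P(x) = lam * P_in(x) + (1 - lam) * P_out(x).
    A lam near 1 effectively discounts mismatched out-of-domain text."""
    keys = set(p_in) | set(p_out)
    return {k: lam * p_in.get(k, 0.0) + (1 - lam) * p_out.get(k, 0.0)
            for k in keys}

# Toy bigram distributions for the history word "the" (illustrative).
p_in = {("the", "cat"): 0.5, ("the", "dog"): 0.5}
p_out = {("the", "cat"): 0.1, ("the", "mat"): 0.9}
mixed = interpolate(p_in, p_out, 0.8)
# ("the", "cat"): 0.8 * 0.5 + 0.2 * 0.1 = 0.42
```

The weight lam would typically be chosen to minimize perplexity on held-out in-domain data, so that out-of-domain text contributes only where it actually helps.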
will show that leveraging larger data resources in learning models is
synergistic with and not simply an alternative to representing structure
in language, with examples of success stories in different languages and
speech recognition tasks.
Mari Ostendorf joined the Speech Signal Processing Group at BBN Laboratories in 1985, where she worked on low-rate coding and acoustic modeling for continuous speech recognition. Two years later, she went to Boston University in the Department of Electrical and Computer Engineering, where she taught undergraduate and graduate signal processing and pattern recognition courses and ran a large speech research lab. She joined the University of Washington in 1999.

Her early work was in speech coding; more recently she has been involved in projects on both continuous speech recognition and speech synthesis, as well as some other types of signals. Current efforts include segment-based acoustic modeling for spontaneous speech recognition, dependence modeling for adaptation, use of out-of-domain data in language modeling, and stochastic models of prosody for both recognition and synthesis.

She has published over 100 papers on various problems in speech and language processing. Dr. Ostendorf has served on the Speech Processing and the DSP Education Committees of the IEEE Signal Processing Society and numerous workshop committees.