Comments on "Evidence for nonlinear production mechanisms in the vocal tract," HM Teager and SM Teager, NATO Advanced Study Institute, Speech Production and Speech Modelling, Chateau Bonas, France, July 17-29, 1989. Teager's invited lecture was given at Bonas on July 24, 1989 by JF Kaiser. The paper appears in the bound proceedings of the NATO ASI, Speech Production and Speech Modelling, WJ Hardcastle and A Marchal, Editors, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1990.
The paper is split into three parts:
1. vocal tract - describing the nonlinear flow; the flow inside is NOT linear, passive, or acoustic,
2. ear - the ear is MORE than a simple frequency analyzer,
3. nonlinear processing techniques - to overcome Fourier artifacts.
Vocal tract:
Teager observed that, even though the velocity of sound in helium is THREE times greater than that in air, the shift in the formants (fundamental frequencies in speech) is only about 1.6 times (approximately the square root of the velocity ratio), and the shift (increase) in pitch is about the same 1.6 times. This is counter-intuitive: the glottis (if considered a passive mechanical system) should not show any change at all, while a linear acoustic system should show an increase in resonance proportional to the velocity.
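A quick numeric check of these ratios (the sound-speed values below are standard textbook approximations, not figures from the paper):

```python
import math

# Approximate speeds of sound (m/s); textbook values, not from the paper.
c_air, c_helium = 343.0, 972.0

velocity_ratio = c_helium / c_air
# A linear acoustic tube model would scale the formants by the full
# velocity ratio (~3x); Teager's observed shift was only ~1.6x,
# which is close to the SQUARE ROOT of the velocity ratio.
print(round(velocity_ratio, 2))             # ~2.83
print(round(math.sqrt(velocity_ratio), 2))  # ~1.68, near the observed 1.6
```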
Observations with a hot-wire anemometer:
1. the actual flow at different locations in the oral cavity (mouth) differs; flows are location-specific across the cavity,
2. the ratio of air flow to pressure fluctuation does not match the acoustic impedance, the quantity it should represent,
3. formant flows are essentially stuck to the walls of the vocal tract.
The data are consistent with a pulsed JET whose average flow axis is close to the PALATE but whose direction is perturbed at the formant frequency. The flow at different locations in the oral cavity shows different patterns, and thus different formants. Formants existing in the mouth therefore may or may not exist outside it.
Using multiple sensors, Teager also found that the velocity outward (towards the lips and out) is always positive, and may contain large vortexes of an axial or radial type.
As the ratio of air flow to pressure does not OBEY the acoustic impedance relationship (it is smaller by a factor of 100), the sound wave cannot be considered an acoustic wave; i.e., we have a wave that is not travelling by compression.
If the wave equations describing acoustics (like Laplace's wave equation in electromagnetic theory) were used to analyze the sound signal, then the speech system would have to be considered nonlinear.
Refer to Morse and Ingard's wave equations (after adding the convection term).
As there exists little or no pressure difference across the cross-section, a change in pressure is not what causes the change in velocity. The sound wave can be believed to be propagating by losing a little of its kinetic energy.
Based on the continuity equation and an f = ma type second equation, four kinds of waves can exist: positive and negative, going with and against the flow. Each vowel sound has a distinctively different (and unique) flow pattern [which, in my (the blogger's) feeling, is obvious]. These can be combinations of separated flow, axial vortexes, radial vortexes, and a variety of interactions between them.
Teager wants the sound generation system to be termed an aerodynamic system, NOT an acoustic system. He observed five different instabilities (or modes of oscillation):
1. whistle: a jet tangentially exciting the cavity,
2. wall of the cavity: a jet along the inside wall of the cavity,
3. inside the cavity: a jet with a swirl inside the cavity,
4. a radial vortex jet,
5. the old Aeolian instability.
First Model: Whistle [an ordinary policeman's whistle, with or without a pea or ball inside]
A whistle is a relaxation oscillator. The cavity pressure oscillates (increases and decreases). When the cavity pressure is low, the jet of air (from the mouth) builds up the pressure. When the cavity pressure rises above a certain value, the jet of air blows out (and the sound is generated). The incoming jet deflects as it enters the whistle cavity, generating a vortex. This vortex amplitude-modulates the air flow, giving the whistle its typical sound.
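The charge-and-vent cycle described above can be sketched as a toy relaxation oscillator (the constants below are arbitrary illustrations, not Teager's model):

```python
import numpy as np

# A minimal relaxation-oscillator sketch (illustrative only):
# cavity pressure charges up from the jet, then vents once it crosses
# a threshold -- producing a sawtooth-like oscillation.
def relaxation_oscillator(n_steps=500, inflow=0.02, threshold=1.0, vent=0.9):
    p = 0.0
    trace = []
    for _ in range(n_steps):
        p += inflow                 # jet builds cavity pressure
        if p >= threshold:          # pressure high enough: jet blows out
            p -= vent               # rapid venting (sound pulse emitted)
        trace.append(p)
    return np.array(trace)

trace = relaxation_oscillator()
# The oscillation period is set by how long the pressure takes to recharge,
# not by a resonant frequency -- the hallmark of a relaxation oscillator.
```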
If the above experiment is done with a different gas (helium, for example) [helium is lighter than air; the speed of sound in helium is three times that in air], the pitch goes up by a factor of 1.6.
Second Model: Aeolian instability - the time behavior of the vortexes bound behind a wire.
All the above models can be represented by different sets of dynamic equations, but the commonality is that they are all some form of regenerative oscillator. Thus, the system cannot be passive.
**** the contents of this paragraph are a verbatim version of the original ****
Sound waves are assumed to be able to travel freely in any direction. Jets and vortexes cannot. A jet of air inside a cavity with an inlet and an outlet, such as the mouth, acts as a barrier to the cavity's outlet. An axial vortex in a similar cavity can also act as a barrier to that cavity's outlet, but in a different manner than the jet. The swirling axial vortex acts as a nonlinear plug. When the pressure inside the cavity is increased, the vortex is compressed, cutting off the flow. When the pressure inside the cavity is decreased, the vortex expands, allowing more flow. This is exactly the description of a positive feedback system which will oscillate under almost any circumstance, and indeed does.
Part II: Hearing
The ear is NOT a Fourier frequency analyzer. Seebeck concluded that one hears periodicities rather than multiple pure tones.
Helmholtz and von Bekesy held the premise that the ear is a tonal analyzer, with hair cells acting as resonators, resonating at different frequencies.
The fluid inside the ear has a damping effect, so Teager argues that it is NOT possible for a hair cell to respond with "resonance".
While other scientists believe that energy in the ear is carried as an acoustic wave in the bulk motion of the fluid, Teager believes that most of the energy travels as a wave along the inside of the cochlear surface.
Teager believes that outer hair cells might be setting up their own vortex, which would then act as an amplifier [a real low-noise amplifier, able to measure deflections of the order of 1/100th of an angstrom].
Teager points out that a small bird that does not have a cochlea can still sing and modulate its voice. Hence, he thinks there is no way that human ears do mechanical frequency selection.
Fourier analysis makes sense for stationary, periodic signals, which is NOT the case for speech signals (speech signals have variability and modulations). A speech signal is made up of TRANSIENTS, or repetitive transients, which, if analyzed through Fourier analysis, will not yield what we are looking for.
Teager feels that, instead of breaking the sound into individual frequency components, 1) "we need to understand the energy involved in producing that sound"; 2) implying that we are interested in both the square of the frequency and the square of the amplitude of the sound wave; 3) we are also interested in the modes of oscillation, and 4) the structure within the oscillations - the modal structure as well as the amplitude structure. This is how transients will be represented. The ear is interested in the energy modulation that generated the transient sound.
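Point 2 - energy that depends on both the squared amplitude and the squared frequency - is exactly the behavior of the discrete Teager energy operator (the TEO, later formalized by Kaiser); a minimal sketch:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[x(n)] = x(n)^2 - x(n-1)*x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*sin(omega*n), psi is the constant A^2 * sin(omega)^2,
# which for small omega approximates A^2 * omega^2 -- energy depending on
# BOTH squared amplitude and squared frequency, as the text describes.
A, omega = 2.0, 0.1
n = np.arange(1000)
psi = teager_energy(A * np.sin(omega * n))
print(psi.mean())        # ~ A^2 * sin(omega)^2 = 0.0399
print((A * omega) ** 2)  # ~ 0.04
```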
Teager postulates that "there is something that is going on in the ear and the brain": the system does the following:
1. Filter the sound (apply a filter to the sound - what type of filter needs to be applied?). The idea is that we need to focus our attention on one particular frequency band at a time.
2. Demodulate the result (does this mean demodulate the output of each filter in the filter bank?).
3. Correlate to find out what is going on (does this mean perform correlation across all the filters of the filter bank, or work within each individual filter?).
A Fourier analyzer multiplies the speech wave by a group of sine waveforms, then integrates and averages; this destroys the basic information we wanted to extract.
A) The vocal tract is a nonlinear oscillator; hence it obeys a special property of "mode-locking". That is to say, any nonlinear oscillator CANNOT generate "all possible frequencies" at the same time.
B) Energy can be transferred: the energy from high frequency components gets coupled into the lower frequency modes, after it deforms the system and attaches attributes to it; this gives rise to all the modes of oscillation.
C) Fourier analysis, looking at the signal the way it does, cannot help conclude whether the signal was generated by an active or a passive system. But Teager believes (not stated explicitly, but I infer) that the way he sees or analyzes the system, we can figure out whether the system is active or passive.
Part III: Speech, Hearing, and related signal processing:
In spite of its ability to produce infinite variations and sounds, human communication (in most cases) relies on some "small" number of sounds. Thus, we can attach some special "attributes" to these sounds.
If a set of wide bandpass filters is used (to find the frequency bands in which energy is concentrated), wide frequency bands of energy are seen; on the other hand, if narrow bandpass filters are implemented, energy bands are found all across the spectrum.
For example, in a broad-band sound spectrogram, numerous horizontal bands are defined as formants, while vertical striations correspond to the pitch periods. But a closer look reveals that a different grouping or banding of frequencies is observed. Hence, the definitions of "a formant" and "a frequency band" become hazy.
Teager, during his experimentation, concluded that "no pitch periods were ever the same twice".
Helmholtz, Peterson, and Barney either claimed or substantiated that vowels are combinations of two pure tones.
Teager shows that a couple of continuants were missed in the observations for the experiments conducted by Peterson and Barney.
Licklider claimed that each of the front vowels [e, ih, eh, ...] can be identified with one formant only. Teager found this to be true as well. This led Teager to another question: where does the information lie or come from?
Teager expects that the same word said by different speakers, or even by the same speaker at different times, sounds a little different, yet the listener is still able to perceive the meaning. Hence he concludes that the phenomenon that distinguishes the sound of a word or phoneme is not the pure tones in that word or phoneme, but rather the modulations of those tones.
Most importantly,
these modulations can be both tracked and quantified.
Teager's approach:
1) locate the modes of oscillation,
2) adaptively bandpass-filter the speech,
3) demodulate the results in order to characterize and identify the speech sounds.
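Teager's actual filters and demodulation algorithm are not specified in detail, so the following is only a hedged sketch of the three steps, using a Gaussian-shaped bandpass filter and the Teager energy operator as a stand-in demodulator (all signals, band centers, and widths below are made-up illustrations):

```python
import numpy as np

def gaussian_bandpass(x, center, sigma):
    """Apply a Gaussian-shaped magnitude response centered at `center`
    (normalized frequency, cycles/sample) via the FFT."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x))
    X *= np.exp(-0.5 * ((freqs - center) / sigma) ** 2)
    return np.fft.irfft(X, n=len(x))

def teager_energy(x):
    """Discrete Teager energy operator: x(n)^2 - x(n-1)*x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Synthetic two-mode signal standing in for speech.
n = np.arange(2000)
signal = np.sin(2 * np.pi * 0.05 * n) + 0.5 * np.sin(2 * np.pi * 0.20 * n)

energies = {}
for center in (0.05, 0.20):                               # 1) assumed mode locations
    band = gaussian_bandpass(signal, center, sigma=0.02)  # 2) bandpass filter
    energies[center] = teager_energy(band)                # 3) demodulate
# Each energy trace now isolates one mode's amplitude/frequency content.
```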
**** comments on the filters used for the purpose (no changes made to the contents):
The filters used to obtain the output were soft in the sense that they were highly damped and did not produce any long-lasting oscillations. Although these filters are linear, they are unconventional. If one uses very sharp narrow-band filters to separate the modes of oscillation prior to demodulation, then the response of those filters to a pulse of energy will be dominated by the transient ringing, or lasting oscillation, of the filters. Instead, it is best to use wide-band filters that are as narrow as possible without destroying or rearranging the energy in the original wave. Filters with Gaussian-like responses work very well and are the type of filters that were actually used.
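The ringing argument can be demonstrated numerically: a sharp high-Q resonator keeps oscillating long after an impulse, while a Gaussian-windowed (Gabor-like) kernel is compact in time. The pole radius, frequency, and widths below are arbitrary illustrations:

```python
import numpy as np

N = 400
impulse = np.zeros(N); impulse[0] = 1.0

# Sharp narrow-band filter: 2nd-order resonator with poles at radius 0.999.
r, w0 = 0.999, 2 * np.pi * 0.1
ring = np.zeros(N)
y1 = y2 = 0.0
for i in range(N):
    y = impulse[i] + 2 * r * np.cos(w0) * y1 - r * r * y2
    ring[i], y1, y2 = y, y, y1

# Gaussian-windowed sinusoid (Gabor-like FIR kernel): compact in time.
t = np.arange(N)
gabor = np.cos(w0 * (t - 50)) * np.exp(-0.5 * ((t - 50) / 10.0) ** 2)

# The resonator is still ringing strongly at the end of the window,
# while the Gabor kernel has decayed to essentially zero.
print(np.abs(ring[300:]).max())    # still order-1 ringing
print(np.abs(gabor[300:]).max())   # essentially zero
```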
A nonlinear demodulation algorithm was applied to the output of each of the bandpass filters, to understand the differences in modes of oscillation and their different modulation patterns.
**** As would be expected from linear acoustic theory, the formants would be represented primarily by damped sine waves; hence the wave pattern (the energy pattern at each of the bandpass filters) would NOT have bumps. As fig. 5 [right-side waveforms] does show bumps, or concentrations of energy, Teager believes that formants are a result of pulsatile flow interactions.
Teager's observations on the energy profiles:
The lowest bandpass filter output has a very large single pulse of energy occurring once every pitch period, indicating that this pulse is "a puff" from the glottis. Hence the lowest bandpass filter represents the energy of the glottal wave.
The second bandpass filter output has a heavily modulated mode of oscillation (indicated by the three successively decreasing pulses, nearly evenly spaced within each pitch period).
The highest filter output is mostly the rough sounds, which one could also listen to alone and still decipher the original sound.
Thus, the energy traces indicate that the formants are modulated.
Sound generation in the vocal tract is an active distributed process.
The output of each filter indicates the place where these sound "energy profiles" were, or would have been, generated; it indicates that they were generated at separate parts of the vocal tract, with the residual noise produced by the teeth and lips.
With a linear passive system, it is NOT possible to locate the source of sound generation.
The high frequency components generated in the glottis do not make it through the oral cavity to the outside of the mouth. The pulsatile sheet jet coming out of the narrow slit of the vocal folds during phonation generates a considerable amount of high frequency noise, which is inherent in the process; the pulsatile jet proceeds through the vocal tract and drives or excites everything downstream from it. Even though the sound generated by the second-order processes is what is heard, the main source of energy is the glottal jet.
*** The sounds that human beings almost universally utilize for speech are in fact completely distinguishable on the basis of the amplitude and frequency modulations of their energy envelopes. Each vowel sound has a unique modulation that is generally tied in with its high frequency second formant. This unique characteristic of the selected speech sounds, and the fact that they are not difficult to generate, might well account for their universality.
The nonlinear processes (the primary sound producing mechanisms) arise from the nonlinear interaction of the sheet jet flows and the generated flow vortexes within the confined geometry of the vocal tract, with the vortex probably playing the role of the active oscillator in effecting modulations.
=======================
questions:
1. Why should the modes [pg 12 - mode-lock] space themselves apart by at least a factor of TWO in frequency?
2. How can Teager use Fourier analysis to assist his claims, when he himself says that Fourier analysis "smears and destroys the very information we are trying to extract" [page 12/13]?
3. What would be the reason behind "no pitch periods were ever the same twice; pitch periods vary from being slightly different to being very different, but they were always different"?
4. How [pg 17] did he compute the noise residue?
5. What are these wide-band Gaussian-like filters?
6. What is this [pg 17/18] nonlinear demodulation algorithm?
7. How [figure 5, right-side energy profiles] can Teager get POSITIVE energy profiles for all the energy patterns, whereas when we implement a TEO operator, we DO GET NEGATIVE values? There should be something else in Teager's nonlinear demodulation algorithm...
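On question 7: it is easy to verify that the discrete TEO, psi[x(n)] = x(n)^2 - x(n-1)*x(n+1), is nonnegative for a single real tone but can go negative on multicomponent signals, which supports the suspicion that Teager's demodulator did something more than the plain TEO (the test signals below are made-up illustrations):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: x(n)^2 - x(n-1)*x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

n = np.arange(500)
single = np.sin(0.3 * n)
double = np.sin(0.3 * n) + np.sin(1.1 * n)

# A single tone gives the constant, nonnegative energy A^2 * sin(omega)^2 ...
print(teager_energy(single).min())   # >= 0
# ... but cross terms in a multicomponent signal can drive the TEO negative,
# consistent with the observation in question 7.
print(teager_energy(double).min())   # < 0
```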
Explain the source-filter model.
What is a linear system? State its advantages for speech.
How do you characterize speech?
What all needs to be considered (or what are the allowances) in a linear speech model?
Formants and articulators.
Characterization of airflow and airflow dynamics -
Navier-Stokes equation, continuity equation.
Electrical equivalence of the acoustic waveform.
What are planar waves / wavefronts?
How are the lips / teeth characterized?
What is the difference between pitch and fundamental frequency?
What is the effect of sampling / quantization?
Gabor filter - bandpass filter properties, and why do we require these properties?