Comments on "Evidence for nonlinear production mechanisms in the vocal tract," HM Teager and SM Teager, NATO Advanced Study Institute on Speech Production and Speech Modelling, Chateau Bonas, France, July 17-29, 1989. Teager's invited lecture was given at Bonas on July 24, 1989 by JF Kaiser. Appears in the bound proceedings of the NATO ASI, Speech Production and Speech Modelling, WJ Hardcastle and A Marchal, Editors, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1990.
The paper is split into three parts:
1. vocal tract - describing the nonlinear flow; the flow inside is NOT linear, passive, or acoustic.
2. ear - the ear is MORE than a simple frequency analyzer.
3. nonlinear processing techniques - to overcome Fourier artifacts.
Vocal tract:
Teager observed that, even though the velocity of sound in helium is THREE times greater than that in air, the shift in the formants (the resonant frequencies of speech) is only about a factor of 1.6 (roughly the square root of the velocity ratio), and the shift (increase) in pitch is about the same factor of 1.6. This is counter-intuitive: the glottis, if considered a passive mechanical system, should show no change at all, whereas a linear acoustic system should show resonances that scale directly with the sound velocity.
observations with hot-wire anemometer:
1. the actual flow at different locations in the oral cavity (mouth) differs; flows are very location-specific across the cavity,
2. the ratio of flow to pressure fluctuation does not match the acoustic impedance it is supposed to represent,
3. formant flows essentially stick to the walls of the vocal tract.
The data are consistent with a pulsed JET whose average flow axis is close to the PALATE but whose direction is perturbed at the formant frequency. The flow at different locations in the oral cavity shows different patterns, and thus different formants. Formants that exist inside the mouth therefore may or may not exist outside it.
Using multiple sensors, Teager also found that the velocity outward (towards the lips and beyond) is always positive, and that the flow may contain large vortices of an axial or radial type.
As the ratio of flow to pressure does not OBEY the acoustic impedance relationship (it is smaller by a factor of about 100), the sound wave cannot be considered an acoustic wave; i.e., we have a wave that is not travelling by compression.
If the wave equations of acoustics (analogous to Laplace's wave equation in electromagnetic theory) are used to analyze the sound signal, then the speech signal must be considered the output of a nonlinear system.
Refer to Morse and Ingard's wave equations (after adding the convection term).
As there is little or no pressure difference across the cross-section, a change in pressure is not what causes the change in velocity. The sound wave can instead be believed to propagate by losing a little of its kinetic energy.
Based on the continuity equation and an f = ma type second equation, four kinds of waves can exist: positive and negative, going with and against the flow. Each vowel sound has a distinctive and unique flow pattern [which, in my (the blogger's) view, is obvious]. These can be combinations of separated flow, axial vortices, radial vortices, and a variety of interactions between them.
Teager wants the sound-generation system to be termed an aerodynamic system, and NOT an acoustic system. Teager observed five different instabilities (or modes of oscillation):
1. whistle: jet tangentially exciting the cavity,
2. wall of the cavity: a jet running along the inside wall of the cavity,
3. inside the cavity: a jet with a swirl inside the cavity,
4. a radial vortex jet,
5. the old Aeolian instability.
First Model: Whistle [ordinary policeman's whistle, with or w/o a pea or ball inside]
A whistle is a relaxation oscillator. The cavity pressure oscillates (increases and decreases). When the cavity pressure is low, the jet of air (from the mouth) builds up the pressure. When the cavity pressure rises above a certain value, the air blows out (and the sound is generated). The jet of incoming air deflects as it enters the whistle cavity, generating a vortex. This vortex amplitude-modulates the air flow, giving the whistle its typical sound.
If the same experiment is done with a different gas (helium, for example) [helium is lighter than air; the speed of sound in helium is about three times that in air], the pitch goes up by only a factor of 1.6.
Second Model: Aeolian instability - the time behavior of the vortices bound behind a wire.
All the above models can be represented by different sets of dynamic equations, but what they have in common is that they are all some form of regenerative oscillator. Thus, the system cannot be passive.
**** the contents of this paragraph are a verbatim version of the original ****
Sound waves are assumed to be able to travel freely in any direction. Jets and vortices cannot. A jet of air inside a cavity with an inlet and an outlet, such as the mouth, acts as a barrier to the cavity's outlet. An axial vortex in a similar cavity can also act as a barrier to that cavity's outlet, but in a different manner than the jet. The swirling axial vortex acts as a nonlinear plug. When the pressure inside the cavity is increased, the vortex is compressed, cutting off the flow. When the pressure inside the cavity is decreased, the vortex expands, allowing more flow. This is exactly the description of a positive feedback system which will oscillate under almost any circumstance, and indeed does.
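The regenerative-oscillator claim can be illustrated with the simplest textbook self-excited system, the van der Pol oscillator. This is my own illustrative sketch, NOT Teager's model: it only shows what "regenerative" means, i.e. that a vanishingly small disturbance grows into a sustained limit-cycle oscillation, which no passive (purely damped) system can do.

```python
import numpy as np

def van_der_pol(mu=1.0, x0=0.01, v0=0.0, dt=0.001, steps=40000):
    """Integrate x'' - mu*(1 - x^2)*x' + x = 0 with forward Euler.

    The nonlinear damping term pumps energy in at small amplitude and
    dissipates it at large amplitude, so a tiny perturbation grows into
    a limit cycle: a regenerative (active) oscillator."""
    x, v = x0, v0
    xs = np.empty(steps)
    for i in range(steps):
        a = mu * (1.0 - x * x) * v - x   # nonlinear damping + restoring force
        v += a * dt
        x += v * dt
        xs[i] = x
    return xs

xs = van_der_pol()
early = np.max(np.abs(xs[:1000]))     # amplitude shortly after the tiny kick
late = np.max(np.abs(xs[-10000:]))    # amplitude on the limit cycle (~2.0)
```

A passive system started from x0 = 0.01 would only decay; here the oscillation grows by orders of magnitude and then saturates, which is the "cannot be passive" argument in miniature.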
Part II: Hearing
The ear is NOT a Fourier frequency analyzer. Seebeck concluded that one hears periodicities rather than multiple pure tones.
Helmholtz and von Bekesy held the premise that the ear is a tonal analyzer, with hair cells acting as resonators, resonating at different frequencies.
The fluid inside the ear has a damping effect, so Teager makes the point that it is NOT possible for a hair cell to respond with "resonance".
While other scientists believe that the energy in the ear is carried as an acoustic wave in the bulk motion of the fluid, Teager believes that most of the energy travels as a wave along the inside of the cochlear surface.
Teager believes that the outer hair cells might be setting up their own vortex, which would then act as an amplifier [a real low-noise amplifier, able to measure deflections of the order of 1/100th of an angstrom].
Teager makes the point that a small bird, which does not have a cochlea, can still sing and modulate its voice. Hence he thinks there is no way that human ears do mechanical frequency selection.
Fourier analysis makes sense for stationary, periodic signals, which speech signals are NOT (speech signals have variability and modulations). The speech signal is made up of TRANSIENTS, or repetitive transients, which, if analyzed through Fourier analysis, will not yield what we are looking for.
Teager feels that, instead of breaking the sound into individual frequency components, 1) "we need to understand the energy involved in producing that sound"; 2) implying that we are interested in both the square of the frequency and the square of the amplitude of the sound wave; 3) we are also interested in the modes of oscillation, and 4) in the structure within the oscillations - the modal structure as well as the amplitude structure. This is how transients will be represented. The ear is interested in the energy modulation that generated the transient sound.
Teager postulates that "there is something that is going on in the ear and the brain": the system does the following:
1. Filter the sound (apply a filter to the sound - what type of filter needs to be applied?). The idea is that we need to focus attention on one particular frequency band at a time.
2. Demodulate the result (does this mean demodulate the output of each filter in the filter bank?).
3. Correlate to find out what is going on (does this mean correlating across all the filters of the filter bank, or working within each individual filter?).
A Fourier analyzer multiplies the speech wave by a group of sine waveforms, then integrates and averages; this destroys the basic information we wanted to extract.
A) The vocal tract is a nonlinear oscillator, hence it obeys the special property of "mode-locking". That is to say, a nonlinear oscillator CANNOT generate "all possible frequencies" at the same time.
B) Energy can be transferred: energy from the high-frequency components gets coupled into the lower-frequency modes after it deforms the system, and this gives rise to all the modes of oscillation.
C) Fourier analysis of a signal will not help conclude whether the signal was generated by an active or a passive system. But Teager believes (not said explicitly, but I infer) that the way he analyzes the system can tell us whether it is active or passive.
Part III: Speech, Hearing, and related signal processing:
Despite its ability to produce infinite variations of sound, human communication (in most cases) relies on a small number of sounds. Thus, we can attach some special "attributes" to these sounds.
If a set of wide bandpass filters is used (to find the frequency bands in which energy is concentrated), wide frequency bands of energy are seen; on the other hand, if narrow bandpass filters are used, energy bands are found all across the spectrum.
For example, in a broad-band sound spectrogram, the numerous horizontal bands are defined as formants, while the vertical striations correspond to the pitch periods. But a closer look shows that different groupings or bandings of frequencies are observed. Hence, the definitions of "a formant" and "a frequency band" become hazy.
Teager, during his experimentation, concluded that "no pitch periods were ever the same twice".
Helmholtz, and later Peterson and Barney, either claimed or substantiated that vowels are combinations of two pure tones.
Teager shows that a couple of continuants were missed in the observations for the experiments conducted by Peterson and Barney.
Licklider claimed that each of the front vowels [e, ih, eh, ...] can be identified with one formant only. Teager found this to be true as well, which led him to another question: where does the information lie, or come from?
Teager notes that the same word said by different speakers, or even by the same speaker at different times, sounds a little different, yet the listener is still able to perceive the meaning. Hence he concludes that what distinguishes the sound of a word or phoneme is not the pure tones in that word or phoneme, but rather the modulations of those tones.
Most importantly,
these modulations can be both tracked and quantified.
Teager's approach:
1) locate the modes of oscillations,
2) adaptively bandpass filter the speech,
3) demodulate the results in order to characterize and identify the speech sounds.
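The three steps above can be sketched in code. Note the assumptions: the paper does not publish its filter design or its nonlinear demodulation algorithm, so below a Gaussian-envelope (Gabor-like) bandpass filter stands in for Teager's "soft" filters, and the discrete Teager energy operator stands in for the demodulator; all names are mine.

```python
import numpy as np

def gabor_bandpass(x, fs, fc, bw):
    """Soft, Gaussian-envelope (Gabor) FIR bandpass: heavily damped,
    with no long transient ringing."""
    sigma = fs / (2 * np.pi * bw)                       # time-domain spread
    n = np.arange(-int(4 * sigma), int(4 * sigma) + 1)
    h = np.exp(-0.5 * (n / sigma) ** 2) * np.cos(2 * np.pi * fc * n / fs)
    h /= abs(np.sum(h * np.exp(-2j * np.pi * fc * n / fs)))  # unit gain at fc
    return np.convolve(x, h, mode="same")

def teager_energy(x):
    """Discrete Teager energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# toy two-mode signal: isolate the 500 Hz mode, then demodulate it
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)
mode = gabor_bandpass(x, fs, fc=500, bw=200)
psi = teager_energy(mode)   # for a pure tone: A^2 * sin^2(Omega), a constant
```

For an unmodulated tone the energy trace is flat; any amplitude or frequency modulation of the mode shows up directly as structure in psi, which is the kind of modulation tracking Teager is after.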
**** comments on the filters used for the purpose: (no changes made to the contents)
The filters used to obtain the output were soft in the sense that they were highly damped and did not produce any long-lasting oscillations. Although these filters are linear, they are unconventional. If one uses very sharp narrow-band filters to separate the modes of oscillation prior to demodulation, then the response of those filters to a pulse of energy will be dominated by the transient ringing, or lasting oscillation, of the filters. Instead, it is best to use wide-band filters that are as narrow as possible without destroying or rearranging the energy in the original wave. Filters with Gaussian-like responses work very well and are the type of filters that were actually used.
A nonlinear demodulation algorithm was applied to the output of each of the bandpass filters to understand the differences in the modes of oscillation and their different modulation patterns.
**** As expected from linear acoustic theory, the formants would be represented primarily by damped sine waves, and hence the wave pattern (the energy pattern at each of the bandpass filters) would NOT have bumps. As fig. 5 [right-side waveforms] does show bumps, or concentrations of energy, Teager concludes that formants are a result of pulsatile flow interactions.
Teager's observations on the energy profiles:
the lowest bandpass filter output has a very large single pulse of energy occurring once every pitch period, indicating that this pulse is "a puff" from the glottis. Hence the lowest bandpass filter represents the energy of the glottal wave.
the second bandpass filter output has a heavily modulated mode of oscillation (indicated by the three successively decreasing pulses, nearly evenly spaced within each pitch period).
the highest filter output is mostly the rough sounds, which one could also listen to on their own and still decipher the original sound.
thus, energy traces indicate that the formants are modulated.
Sound generation in the vocal tract is an active distributed process.
The output at each filter indicates where these sound "energy profiles" were, or would have been, generated. They indicate that the sounds were generated at separate parts of the vocal tract, and that the residual noise is produced by the teeth and lips.
With a linear passive system, it is NOT possible to locate the source of sound generation.
The high-frequency components generated in the glottis do not make it through the oral cavity to the outside of the mouth. The pulsatile sheet jet coming out of the narrow slit of the vocal folds during phonation generates a considerable amount of high-frequency noise, which is inherent in the process; the pulsatile jet proceeds through the vocal tract and drives or excites everything downstream from it. Even though the sound generated by the second-order processes is what is heard, the main source of energy is the glottal jet.
*** the sounds that human beings almost universally utilize for speech are in fact completely distinguishable on the basis of the amplitude and frequency modulations of their energy envelopes. Each vowel sound has a unique modulation that is generally tied in with its high-frequency second formant. This unique characteristic of the selected speech sounds, and the fact that they are not difficult to generate, might well account for their universality.
The nonlinear processes (the primary sound-producing mechanisms) arise from the nonlinear interaction of the sheet-jet flows and the generated flow vortices within the confined geometry of the vocal tract, with the vortex probably playing the role of the active oscillator in effecting the modulations.
=======================
questions:
1. Why should the modes [pg 12 - mode-lock] space themselves at least a factor of TWO apart in frequency?
2. How can Teager use Fourier analysis to help support his claims, when he himself says that Fourier analysis "smears and destroys the very information we are trying to extract" [page 12/13]?
3. What would be the reason behind "no pitch periods were ever the same twice; pitch periods vary from being slightly different to being very different, but they were always different"?
4. How [pg 17] did he compute the noise residue?
5. What are these wide-band Gaussian-like filters?
6. What is this [pg 17/18] nonlinear demodulation algorithm?
7. How can Teager get POSITIVE values for all the energy profiles [figure 5, right-side energy profiles], whereas when we implement a TEO operator we DO GET NEGATIVE values? There must be something else in Teager's nonlinear demodulation algorithm...
Explain the source-filter model,
what is a linear system, state its advantages for speech,
how do you characterize speech
what needs to be considered (or what are the allowances) in the linear speech model?
formants, and articulators
characterization of airflow and airflow dynamics -
Navier-Stokes equation, continuity equation
electrical equivalence of acoustic waveform
what are planar waves / wavefronts?
how are the lips / teeth characterized?
what is difference between pitch and fundamental frequency?
what is the effect of sampling / quantization.
Gabor filter - its band-pass filter properties, and why do we require these properties?
Wednesday, February 20, 2008
Stress Detection in Computer Users based on DSP of Noninvasive Physiological Signals
Comments on " Stress Detection in Computer Users based on digital signal processing of Noninvasive physiological variables," Jing Zhai, Armando Barreto, Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, Aug 30-Sept 3, 2006. pp. 1335-1358.
Detect: mental or cognitive stress associated with computer interaction.
physiological signals:
Galvanic Skin response (GSR)
Blood Volume Pulse (BVP)
Pupil Diameter (PD)
Skin temperature (ST)
classification strategy: SVM based, to classify between "stressed" and "relaxed" response.
Dataset:
32 students (ages 21-42).
Procedure:
for the first 5 minutes, subjects were shown 30 emotionally neutral still pictures to relax.
then subjected to "Paced Stroop Test" (http://en.wikipedia.org/wiki/Stroop_task). The subjects had 3 seconds to answer with a mouse click.
Features:
sampling rate: 360 Hz.
from BVP: based on Interbeat Interval (IBI) calculations and power spectrum analysis: (4 features)
L/H ratio (low frequency: 0.05-0.15Hz, high frequency: 0.16-0.40Hz)
Mean IBI
standard deviation of IBI
amplitude of BV
from GSR: based on response detection: (5 features)
# of responses
mean value of GSR
amplitude response
rising time of response
energy of response
from ST: after low pass filtering: (1 feature)
slope of ST
from PD: based on linear interpolation of PD samples (1 feature)
mean value of PD
"to account for differences in the initial arousal levels due to individual differences, normalization of the data was needed prior to use of features, between [0, 1]"
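The BVP features and the [0, 1] normalization above might be computed as follows. This is only a sketch: the paper gives the feature definitions but no code, the 4 Hz resampling of the IBI series is my assumption (the paper does not state its spectral-estimation details), and all names are mine.

```python
import numpy as np

def bvp_features(beat_times):
    """Mean IBI, SD of IBI, and the L/H spectral ratio from beat times (s)."""
    ibi = np.diff(beat_times)                          # interbeat intervals (s)
    mean_ibi, sd_ibi = float(ibi.mean()), float(ibi.std())
    fs = 4.0                                           # resampling rate (assumed)
    t = np.arange(beat_times[1], beat_times[-1], 1 / fs)
    ibi_even = np.interp(t, beat_times[1:], ibi) - ibi.mean()
    spec = np.abs(np.fft.rfft(ibi_even)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(ibi_even), 1 / fs)
    lf = spec[(freqs >= 0.05) & (freqs <= 0.15)].sum()  # paper's low band
    hf = spec[(freqs > 0.15) & (freqs <= 0.40)].sum()   # paper's high band
    return mean_ibi, sd_ibi, lf / hf

def normalize01(v):
    """Per-subject min-max normalization to [0, 1], as the quote describes."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min())
```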
Classification: Support Vector Machines (Weka software). The classification performance was evaluated using 20-fold cross-validation: 20 samples were pulled out as test samples, and the remaining samples were used to train the classifiers.
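The paper used Weka; as an analogous sketch, here is a 20-fold cross-validated SVM in scikit-learn. The data below is synthetic stand-in data (the class separation is invented purely for illustration), not the authors' dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# stand-in data: 11 normalized features per sample (4 BVP + 5 GSR + 1 ST + 1 PD),
# with "relaxed" and "stressed" classes drawn from shifted distributions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.3, 0.1, (100, 11)),     # relaxed segments
               rng.normal(0.7, 0.1, (100, 11))])    # stressed segments
y = np.array([0] * 100 + [1] * 100)

cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv)  # one score per fold
```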
The authors also compared the SVM-based classifier with a naive Bayes classifier and a decision tree classifier.
The authors were mainly interested in determining the added recognition capability that can be achieved with pupil diameter measurements (in conjunction with the other physiological signals).
Thursday, February 7, 2008
Acoustic Sensors in the Helmet Detect Voice and Physiology
Comments on, Michael Scanlon, "Acoustic Sensors in the Helmet Detect Voice and Physiology," Proceedings of SPIE -- Volume 5071, Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Defense and Law Enforcement II, Edward M. Carapezza, Editor, September 2003, pp. 41-51.
Acoustic monitoring of first responders physiology for health and performance surveillance
Comments on, "Acoustic Monitoring of first responder's physiology for health and performance surveillance" Michael Scanlon, Proceedings of SPIE -- Volume 4708, Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Defense and Law Enforcement, Edward M. Carapezza, Editor, August 2002, pp. 342-353
The main focus is on body-worn acoustic sensors located at the neck to detect heartbeats and other physiological parameters. The author suggests that these sensors do a pretty good job, but that during rigorous activity a lot of artifacts get added, which prevents inferring conclusions. The author argues, however, that if there is a lot of activity, that itself indicates the person is "in good shape" and there is nothing to worry about.
The parameter measured is heart-rate variability (beat-to-beat timing fluctuations derived from the interval between two adjacent beats).
The technique used:
A Lomb periodogram is used to derive heart-rate variability. Simple peak detection above and below a certain threshold, or waveform-derivative parameters, can produce the timing and amplitude features necessary for the Lomb periodogram and cross-correlation techniques.
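A minimal sketch of the Lomb-periodogram step, using SciPy's `lombscargle` directly on the unevenly spaced IBI series (its natural advantage over an FFT, which would need resampling first). The function name and frequency grid are my choices, not the paper's.

```python
import numpy as np
from scipy.signal import lombscargle

def hrv_spectrum(beat_times, f_lo=0.05, f_hi=0.5, nfreq=200):
    """HRV spectrum from beat times (s) via the Lomb periodogram."""
    ibi = np.diff(beat_times)             # interbeat intervals (s)
    t = beat_times[1:]                    # timestamp each IBI by its beat
    freqs = np.linspace(f_lo, f_hi, nfreq)
    # lombscargle expects angular frequencies and a zero-mean series
    pgram = lombscargle(t, ibi - ibi.mean(), 2 * np.pi * freqs)
    return freqs, pgram
```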
The sensor - a gel-coupled sensor - has impedance properties similar to those of the skin, but a significant mismatch for airborne noises.
This technology can be used to measure heartbeats, breaths, blood pressure, motion, voice, and other indicators. Other specific events - coughs, gags, wheezes, and vomiting - can also be detected. [The author does not say so, but I feel that the sensor can help detect these events as well.]
(pg 345) Getting the resting heart rate helps in knowing HRV and is a good indicator of which personnel are ready to re-enter the hazard situation. The duration of elevated heart rates and the maximum rate achieved can also be indicators of a person's ability to safely and effectively perform his/her mission (task).
data details: (remarks only on the data of concern to me)
fs = 1500 Hz, anti-aliasing filter corner frequency = 500 Hz, 30 minutes of data.
heartbeats are clearly seen from the neck sensors (left and right).
** How the IBIs fluctuate on a beat-by-beat basis, as well as their long-term trends, is termed HRV, and gives an indication of how well the body is regulating blood pressure, breathing, and core temperature. The IBIs can also indicate mental activity related to concentration on a task: the IBIs become very regular during a task with intense concentration and precision muscle control, whereas they may vary significantly during tasks with varying mental and physical distractions. -- refer to Mulder G and Mulder LJM, "Information Processing and Cardiovascular Control," Psychophysiology, 1981, 18, pp. 392-405.
To measure blood pressure, heart-beat measurements are needed at two different locations on the body (ideally far apart from each other); the time lag between the two measurements, together with the distance, relates to the blood pressure. For the delta-time between the neck and wrist acoustic pulses: a long time delta indicates a slow wave (low systolic pressure), while a short one indicates a fast wave (high systolic pressure).
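The neck-to-wrist delta-time could be estimated, for example, by cross-correlating the two pulse signals. The paper does not specify its delay-estimation method, so this is only an illustrative sketch with my own names.

```python
import numpy as np

def pulse_transit_time(neck, wrist, fs):
    """Delay (s) by which the wrist pulse lags the neck pulse, via the peak
    of the cross-correlation. A longer delay means a slower pressure wave,
    and hence (per the paper's reasoning) lower systolic pressure."""
    neck = neck - neck.mean()
    wrist = wrist - wrist.mean()
    xcorr = np.correlate(wrist, neck, mode="full")
    lag = int(np.argmax(xcorr)) - (len(neck) - 1)   # lag in samples
    return lag / fs
```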
** Systolic pressure can also be estimated (use this method with caution) from the slope of the second heart sound, but this is not accurate. It is also possible that breath rates can be derived from acoustic pulses at the wrist by analyzing changes in amplitude that result from the lungs over- and under-pressurizing the heart.
(pg 348) The neck acoustic data clearly shows high-amplitude heartbeat pulsations in the low-frequency (0-120 Hz) region, high-amplitude harmonics of the voice structure, and medium-level broadband breath sounds in the 200-500 Hz region. The anti-aliasing filters can be seen to attenuate those sounds above 500 Hz.
(pg 350) One method to monitor the personnel is to look at the short-term energy detected at the sensors. The RMS energy from the right-neck sensor shows high levels from head turns, voice, jacket, hood, and mask movements, and muscular activity from lifting or crawling.
A decrease in RMS energy at all acoustic sensors indicates a decrease in activity.
For breath-rate detection, high-passed neck data reveals a lot of broadband high-frequency energy resulting from the airflow in the throat. Using an FFT to monitor the temporal fluctuations of the RMS energy produces a breath-rate peak in the power spectrum. If the data is clipped (at three times the median of the absolute value of the band-pass-filtered data), the advantage is that it removes the influence that high-amplitude motion artifacts have on the RMS calculations.
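The clipped-RMS-plus-FFT recipe can be sketched as below. The 0.25 s frame length and all names are my assumptions; only the 3x-median clipping rule comes from the paper.

```python
import numpy as np

def breath_rate(x, fs, frame=0.25):
    """Breath rate (Hz) from the temporal fluctuations of short-term RMS
    energy, with 3x-median clipping to tame motion artifacts."""
    x = np.abs(x - np.mean(x))
    clip = 3 * np.median(x)                    # clip at 3x median |x|
    x = np.minimum(x, clip)
    n = int(frame * fs)                        # samples per frame
    nframes = len(x) // n
    rms = np.sqrt(np.mean(x[:nframes * n].reshape(nframes, n) ** 2, axis=1))
    rms -= rms.mean()
    spec = np.abs(np.fft.rfft(rms)) ** 2       # spectrum of the RMS envelope
    freqs = np.fft.rfftfreq(nframes, frame)
    return freqs[1 + np.argmax(spec[1:])]      # peak (skipping DC) = breath rate
```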
Labels: breath rate, heart rate variability, PMIC, workload
Friday, February 1, 2008
"On a simple algorithm to calculate the 'energy' of a signal"
Comments on Jim Kaiser, "On a simple algorithm to calculate the 'energy' of a signal". IEEE ICASSP 1990, pp 381-384
The paper outlines an algorithm, or alternate way, to represent the energy of a signal, specifically a speech signal (or, for that matter, any signal generated by a mechanical assembly).
The existing procedures to compute the energy of a signal (two methods are described here):
1. average the sum of squares of the amplitudes of the signal (usually over a short segment),
2. in the frequency domain, using Parseval's theorem (I think), take the discrete Fourier transform and sum the squared magnitudes of the frequency components.
The drawback is that any mechanical system generating a signal requires more energy to produce a higher-frequency signal than a lower-frequency one. Thus, the energy term calculated by either (1) or (2) does not account for the frequency of the signal.
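The "simple algorithm" of the title is the discrete Teager energy operator, which addresses exactly this drawback: for a pure tone it returns a quantity proportional to both the squared amplitude and (for small digital frequencies) the squared frequency. A minimal sketch:

```python
import numpy as np

def teager_energy(x):
    """Kaiser's operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1).
    For x(n) = A*cos(Omega*n + phi) this gives exactly A^2 * sin^2(Omega),
    i.e. roughly (amplitude * frequency)^2 for small Omega."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# same amplitude, double the digital frequency -> roughly 4x the Teager energy
n = np.arange(1000)
lo = teager_energy(np.cos(0.05 * n)).mean()   # Omega = 0.05 rad/sample
hi = teager_energy(np.cos(0.10 * n)).mean()   # Omega = 0.10 rad/sample
```

Unlike the sum-of-squares energy, which is identical for both tones above, the Teager energy separates them by their frequency, which is the point of the paper.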
But [blogger's comment] my feeling is that adding the frequency dimension makes the computation "complicated". If a signal is a combination of several frequency components, how do you then compute the signal energy? Also, if the signal was not generated by "a linear" system, what method would you use to decompose the signal into its frequency constituents? These questions are not trivial, and do not seem to be answered by this "simple algorithm".