Wednesday, June 11, 2008
Organizing a technical paper

This blog actually belongs to Dr. John Hansen, my advisor at the University of Texas at Dallas. He took great pains to teach us all about the importance of writing a technical paper well.
During the regular lab meeting today, June 11, 2008, John felt the urge to talk about paper writing.
For a snapshot of what he talked about, please view the image on the left-hand side of this blog. [Thanks to Leo Chang for copying this via his camera phone onto my machine.]
John told us to lay out the paper first and explore how much real estate is available for your writing.
You can also lay out the different sections on the paper; remember, the actual writing comes later. Right now, we are just in the planning stage.
The very first thing that should come to your mind is:
What is the problem statement? What are you trying to resolve or solve? What is the intention behind the research you are pursuing?
This will be about 3-4 sentences long (at most).
What is new, or what is your contribution toward the solution? Are you proposing a new feature for a long-known problem, a different algorithm for a long-known problem, or is it a knowledge-discovery / probe experiment that you conducted? Be clear on this, because the more information you have on the problem statement, the more you can expand it later on.
John gave an example from dialect classification.
Outline of the paper:
Problem statement: New / Contribution - Feature, Training Algorithm, Knowledge.
Introduction: Background, Issues, New, Outline (flow of the paper).
Corpus description:
Baseline algorithm:
Proposed algorithm:
Evaluations:
Discussion and conclusion:
Figures (and tables): all titles/text should be in Arial; label the axes properly.
Tables: highlight the values important to you; specify the units for the values.
Tuesday, March 18, 2008
Nonlinear feature based classification of speech under stress

[Figure: left side, neutral and stressed speech signals; right side, nonlinear airflow structure.]
Comments on, "Nonlinear feature based classfication of speech under stress," Guojun Zhou, John HL Hansen, Jim Kaiser, IEEE transaction on speech and audio processing, vol 9, no 3, march 2001, pp. 201-216
first: the outline of the paper or comments on the contents of the paper, paragraph wise.
1. Introduction:
para 1: definition of stress, and studies related to the effect of stress on speech production
para 2: effect of speech under stress on automatic speech recognition (ASR) systems
para 3: techniques to overcome the effect of stress on ASR performance
para 4: applications of stress classification, and use of stress classification algorithms
paras 5, 6: features used for stress classification so far
para 7: comments on the research done by Teager, and the nonlinear airflow description
para 8: the Teager energy operator (TEO)
para 9: study with TEO-based features
II. Stress Classification Features
A. Background of the Teager energy operator
para 1: the discrete Teager energy operator, and the discrete energy separation algorithms (DESA I, II, IIA/B)
B. TEO-FM-Var: variation of the FM component
para 1: how the feature is extracted, and why/how it will represent stress or the variations caused by stress.
C. TEO-AutoEnv: normalized TEO autocorrelation envelope area
motivation behind this feature; why normalized, why autocorrelation, and the definition of segment size.
D. TEO-CB-AutoEnv: critical-band-based TEO autocorrelation envelope
para 1: motivation, how/why the feature is extracted, and how it differs from the previous two nonlinear features.
1. Harmonic Analysis:
para 1: harmonic analysis is necessary to understand the shift in the number of harmonics in a "critical band" (a filter of the filter bank) when speech is under stress. The study is focused on the voiced segments of the words, and on the differences in the number of harmonics across stress conditions.
2. Quantitative Analysis:
mathematical formulation of how stressed speech differs from neutral speech, and what the autocorrelation area represents for neutral / stressed speech.
3. Waveform Analysis:
describes that not only does the pitch differ (between neutral and stressed speech) but there is also SOMETHING else which cannot be quantified completely by pitch or change in pitch. This change can be attributed to the change in muscle tension, the change in airflow, and the change in the way the articulators are used for speech production under non-neutral conditions.
III. Evaluations
A. Database:
three domains: the neutral, simulated-stress, and actual-stress sections of the SUSAS database were used for the analysis and evaluation; speech sampled at 8 kHz, 16-bit data.
stress model: 5-state HMM with continuous distributions, each state a two-Gaussian mixture.
B. Traditional features:
MFCCs - for their effectiveness in representing the spectral variations of speech,
pitch - obtained from a pitch-tracking algorithm
C. Stress Classification results:
the following evaluations were carried out:
1. to evaluate the set of features (MFCCs, pitch, and the three features from the authors) and find which three of them are best, text-dependent "pairwise" stress detection is done.
2. after finding the three best features, these three feature sets are put to text-independent "pairwise" stress detection.
3. these three feature sets are then evaluated for stress classification of neutral against all the stress categories (angry, loud, Lombard combined into a single class).
4. a study of how much text-dependency plays a role in stress classification is done by evaluating the feature sets for their applicability to stress classification as well as ASR evaluation.
Detailed notes on the actual technical aspects of the paper:
Stress exists while working in noisy backgrounds (the Lombard effect), in emergency conditions, under high workload, while multitasking, under fatigue due to sustained operation, under physical environmental factors, in emotional moods, or due to chemical consumption (medicines or otherwise; prescribed or otherwise).
Stress can cause:
speech to sound slower, faster, softer, or louder, and changes in the respiration pattern and the muscle tension of the vocal tract.
Speakers at times may use a nonuniform set of speech production adjustments to convey their stress states.
How to improve the performance of ASR for speech under stress:
1. retraining the reference models (adjusting so that the trained and test conditions match),
2. training the speech models under all conditions combined together,
3. speaker dependent training,
4. using speech perturbation models within the HMM framework
The idea that should be adopted is to use a stress classification algorithm to classify speech (either into neutral vs. stress, or into a specific style of stress) and then use the models adapted to that specific stress condition. This is the utility of a stress classification algorithm.
Utility of a stress classification algorithm:
1. to improve the robustness of an ASR engine,
2. to prioritize calls based on an "emergency index" for the call; highly stressed calls need to be addressed with top priority,
3. to assess the caller's emotional state,
4. to aid psychiatrists in their objective assessment of the subject,
5. forensic speech analysis.
Research on speaker stress classification so far has used:
1. pitch and the variation index of pitch,
2. spectral features based on linear speech production models,
3. estimated vocal tract area profiles,
4. acoustic tube area coefficients,
5. MFCCs, delta MFCCs, double-delta MFCCs, autocorrelation of MFCCs,
6. phoneme duration, intensity, glottal source structure (especially spectral slope), vocal tract formant structure.
classifiers used:
distance metrics, NN-based classifiers, HMM-based structures.
Airflow through the vocal tract separates and forms concomitant vortices, which makes it nonlinear. Stress will cause changes in muscle tension, and therefore changes in the vortex-flow interaction patterns, leading to a difference in the nonlinear flow structure.
"While vocal tract articulators do move to configure the vocal tract shape, it is the resulting airflow properties which serve to excite those models (phoneme - speech production models - voice models) which a listener will perceive as a particular phoneme." Teager formulated the TEO, which also accounted for hearing: he viewed hearing as a process of detecting energy.
Background of the Teager Energy Operator:
The TEO is typically applied to a bandpass-filtered speech signal, since its intent is to reflect the energy of the nonlinear flow within the vocal tract for a single resonant frequency. Although the output of a bandpass filter still contains more than one frequency component, it can be considered an AM-FM signal.
*** Since the speech production system generates frequency-locked signals and will not generate frequencies close to each other (if close to each other, they would coalesce), the critical-band approach seems valid.
Although TEO processing is intended for a signal with a single resonant frequency, the TEO energy of a multi-frequency signal reflects not only the individual frequency components but also the interactions between them. This characteristic extends the use of the TEO to speech signals filtered with wide-bandwidth bandpass filters (BPFs).
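As a concrete reference, here is a minimal sketch of the discrete TEO, psi[x(n)] = x(n)^2 - x(n-1)*x(n+1) (my own illustration, not code from the paper):

```python
import numpy as np

def teo(x):
    """Discrete Teager energy operator: psi[x(n)] = x(n)^2 - x(n-1)*x(n+1).
    The output is two samples shorter than the input."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(w*n), the TEO is approximately A^2 * sin^2(w),
# i.e., it tracks both amplitude and frequency at once.
n = np.arange(200)
tone = 0.5 * np.cos(0.2 * np.pi * n)
profile = teo(tone)  # roughly constant at 0.25 * sin(0.2*pi)**2
```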
If the two waveforms are observed carefully, neutral and stressed look similar, though stressed has fewer peaks: the number of peaks between the two maxima is smaller in stressed speech, but the amplitude of the peaks is more prominent compared to neutral speech. The reason is probably that the glottal folds are not able to close as fast, because of 1) their own response time, 2) the tension on the glottal muscles, and 3) the resistance offered by the airflow (the air may be flowing for less time, but since the quantity of air to be passed is the same, the glottal folds' open time relative to closed time increases; the ratio of fold-open to fold-close is larger, and the total duration [the pitch period] changes as well).
TEO-FM-Var: variation of the FM component
The fine excitation variations observed in the speech signal are due to the effects of modulation, implying that a stress classification feature is needed which reflects these modulation variations.
No two pitch periods or values (even if measured consecutively) are alike [refer to Teager's paper]. The TEO-FM-Var feature is tied to pitch; it is pitch-synchronous. If speech is considered an AM-FM signal, then the signal needs a carrier, and this carrier frequency is F0. The Gabor bandpass filter (GBF) is designed with a center frequency of F0 and an (RMS) bandwidth of F0/2.
The Gabor filter has excellent sidelobe-cancellation properties.
Using the absolute magnitude difference function (AMDF), F0 is computed from the TEO profile of the signal. The TEO profile assists the AMDF algorithm because of its squaring property.
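A minimal AMDF-based F0 estimator of the kind described might look like this (a sketch under my own assumptions about the lag search range; not the authors' code):

```python
import numpy as np

def amdf_f0(profile, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 from a TEO profile with the absolute magnitude
    difference function: D(k) = mean(|x(n) - x(n+k)|), minimized over k."""
    x = np.asarray(profile, dtype=float)
    lag_min = int(fs / f0_max)          # shortest plausible pitch period
    lag_max = int(fs / f0_min)          # longest plausible pitch period
    amdf = np.array([np.mean(np.abs(x[:-k] - x[k:]))
                     for k in range(lag_min, lag_max + 1)])
    best_lag = lag_min + int(np.argmin(amdf))   # deepest AMDF valley
    return fs / best_lag

# Usage: f0 = amdf_f0(teo(speech_frame), fs=8000)
```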
A frame-based FM-variation measure is then extracted.
It is still not clear to me what exactly the FM-variation metric or index is; the DESA-style demodulation sketched below is the standard route to the FM component.
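For reference, the paper's section II-A lists the DESA algorithms; here is a sketch of DESA-2 (Maragos, Kaiser, Quatieri) for separating instantaneous frequency and amplitude, under my own implementation assumptions:

```python
import numpy as np

def teo(x):
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """DESA-2 energy separation. With y(n) = x(n+1) - x(n-1):
    Omega(n) ~ 0.5 * arccos(1 - TEO[y](n) / (2 * TEO[x](n)))
    |a(n)|   ~ 2 * TEO[x](n) / sqrt(TEO[y](n))."""
    x = np.asarray(x, dtype=float)
    y = x[2:] - x[:-2]      # y(n), defined for n = 1 .. N-2
    psi_x = teo(x)          # TEO[x], samples n = 1 .. N-2
    psi_y = teo(y)          # TEO[y], two samples shorter again
    psi_x = psi_x[1:-1]     # align both to n = 2 .. N-3
    eps = 1e-12
    arg = np.clip(1.0 - psi_y / (2.0 * psi_x + eps), -1.0, 1.0)
    omega = 0.5 * np.arccos(arg)                   # rad/sample
    amp = 2.0 * psi_x / (np.sqrt(np.abs(psi_y)) + eps)
    return omega, amp
```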
TEO-AutoEnv: normalized TEO autocorrelation envelope area
Intended to reflect instantaneous excitation variations of speech.
A formant-tracking version was NOT IMPLEMENTED, due to technical difficulties in assessing formants and in formant tracking.
If a filter bank is used to bandpass-filter voiced speech around each of its formant frequencies, the modulation pattern around each formant can be obtained using TEO AM-FM decomposition, from which variations of the modulation patterns across different frequency bands can be obtained. Instead, four fixed bandpass filters were implemented over 0-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz. The number of formants in each band will be zero to two for both neutral and stressed speech.
To obtain the TEO-AutoEnv feature:
1. filter the raw speech utterance through these four filters,
2. get the TEO profile for each filter output,
3. filter each TEO profile at the fundamental frequency, with a 3 dB bandwidth of F0/2 (I was confused as to why).
The reason is: the TEO output of a signal is roughly proportional to the square of both its amplitude and its frequency, and the AM component for a single formant exhibits periodicity similar to the fundamental frequency.
The major assumption is that there is ONE formant per filter band, which may not be true for all the bands. Furthermore, if NO formant exists in a filter band, do we even need to go through this step? That the AM component of a single formant exhibits periodicity similar to the fundamental frequency is true, since the formant acts as a carrier frequency (signal); hence its amplitude will vary with the frequency of the pitch (the fundamental frequency).
4. the signal obtained from step 3 is analyzed frame by frame (with a frame length of four times the pitch period).
The area under the autocorrelation envelope for a constant (DC) signal is a triangle with area N/2, where N is the number of samples used to compute the autocorrelation. For any other signal, the area will be less than N/2; hence (for step 5):
5. the area under the autocorrelation envelope is computed and normalized by N/2 (see the sketch after this paragraph).
This area represents the degree of variability within each band.
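A sketch of the normalized autocorrelation-area computation in steps 4-5, under my own simplification that the "envelope" is the autocorrelation magnitude itself (the paper's exact envelope extraction may differ):

```python
import numpy as np

def auto_env_area(frame):
    """Normalized area under the autocorrelation envelope of a frame.
    For a constant (DC) frame the biased autocorrelation is a triangle
    whose area is ~N/2, so the normalized area approaches 1."""
    x = np.asarray(frame, dtype=float)
    N = len(x)
    r = np.correlate(x, x, mode='full')[N - 1:]   # lags 0 .. N-1
    r = r / (r[0] + 1e-12)                        # normalize so r(0) = 1
    area = np.sum(np.abs(r))                      # area under the envelope
    return area / (N / 2.0)                       # normalize by the DC case

# Sanity check: a DC input gives ~1.0; a varying signal gives less.
print(auto_env_area(np.ones(100)))        # ~1.01
print(auto_env_area(np.random.randn(100)))  # well below 1
```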
Different types or varying degrees of stress will influence the distribution of formant characteristics, the pitch structure, and the spectral pitch harmonics relative to neutral conditions. In addition to the primary issue of formant migration into adjacent filters, additional pitch harmonics will also occur.
TEO-CB-AutoEnv: critical-band TEO autocorrelation envelope.
Instead of the uniform partitions used for the TEO-AutoEnv feature, the filterbank at the front end is a critical-band one, modeling the structure of hearing (a sketch follows below).
The filter design: the center frequency and the bandwidth of each filter are those of the corresponding critical band.
To avoid dependency on pitch information, the TEO-CB-AutoEnv feature is derived independently of pitch.
TEO-AutoEnv represents the variations around pitch caused by formant distribution variations across different frequency bands;
TEO-CB-AutoEnv represents the variations in pitch harmonics, because of its higher frequency resolution.
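A sketch of a critical-band filterbank for the 0-4 kHz range, using standard Zwicker critical-band centers and bandwidths from memory (an assumption on my part; the paper's exact band table may differ slightly):

```python
import numpy as np
from scipy.signal import butter, sosfilt

# Zwicker critical-band centers (Hz) and bandwidths (Hz) up to ~4 kHz
# (assumed values, quoted from memory).
CENTERS = [150, 250, 350, 450, 570, 700, 840, 1000,
           1170, 1370, 1600, 1850, 2150, 2500, 2900, 3400]
BANDWIDTHS = [100, 100, 100, 110, 120, 140, 150, 160,
              190, 210, 240, 280, 320, 380, 450, 550]

def critical_band_filterbank(x, fs=8000, order=2):
    """Filter x through one Butterworth bandpass filter per critical band;
    each band output would then be fed to the TEO and the area feature."""
    outputs = []
    for fc, bw in zip(CENTERS, BANDWIDTHS):
        lo = (fc - bw / 2) / (fs / 2)
        hi = (fc + bw / 2) / (fs / 2)
        sos = butter(order, [lo, hi], btype='band', output='sos')
        outputs.append(sosfilt(sos, x))
    return outputs  # 16 band-limited signals
```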
Harmonic analysis (manual computation and measurement) of 12 voiced tokens from each spoken style (neutral, angry, loud, Lombard) of the simulated-stress domain of SUSAS indicates variations across speaking styles as well as across different critical bands.
Quantitative analysis indicates that if pitch increases, the number of pitch harmonics will decrease. Thus the number of pitch harmonics in neutral speech (where the pitch is lower than in the other speaking styles considered) will be larger than in the other styles. This suggests that the autocorrelation sequence (and thus the envelope area) for stressed speech will be less variable across frames (dependent on one frequency, or none), while neutral speech will have more variability. If we assume that neutral speech contains two pitch harmonics and that under a stressed speaking style the pitch doubles, then under stress we will have one pitch harmonic. A single frequency means the TEO profile will be constant, so the autocorrelation profile will be a straight line, and thus the area under the curve for the stress condition will be MORE than the area under the curve for neutral speech.
If the cross-harmonic terms are also considered, the computation of the autocorrelation, and thus of the area under it, becomes a complex function. Also, the number of harmonics present may vary across speaking styles and across bands, and formants may migrate across bands with a change in speaking style. All these complexities can be reduced, or covered, under the umbrella of the area under the envelope.
Also, because of the autocorrelation, fast variations and minor fluctuations will be minimized, while the variations due to stress will still be accounted for. A kind of pitch dependency is thus removed by looking at the area under the autocorrelation envelope.
In a waveform analysis, the pitch of a neutral utterance was raised (in order to minimize the differences attributable to pitch change) so that the neutral speech had the "same" pitch value as the stressed speech. Under this assumption, whatever differences remain in the TEO profile, and in any features derived from it, can be attributed to the effect of stress on the speech production system.
No doubt the Gabor bandpass filter will also play a part in the computation, but barring that, the factors affecting the change in the area feature can be attributed to:
1. change in the fundamental frequency,
2. variability in the harmonics under stress, and
3. due to the nonlinear variations occurred in the airflow in the vocal tract.
DATABASE:
subset of SUSAS words: freeze, help, mark, nav, oh, zero.
stress styles: angry, loud, Lombard - simulated stress styles;
actual task: speech during a roller-coaster ride.
worked on voiced sections: vowels, diphthongs, liquids, glides, and nasals were extracted from each word.
16-bit, 8 kHz.
Baseline system: 5-state HMM-based stress classifier with continuous distributions, each state a two-Gaussian mixture (a sketch follows below).
Features: pitch, MFCCs (for their effectiveness in representing the spectral variations of speech).
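A minimal sketch of a pairwise HMM stress classifier like the baseline described above, using the hmmlearn library (my choice of tooling, not the authors'):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_stress_model(feature_seqs):
    """Fit a 5-state HMM with two Gaussian mixtures per state on a list
    of per-token feature sequences (each of shape (T_i, dim))."""
    X = np.vstack(feature_seqs)
    lengths = [len(seq) for seq in feature_seqs]
    model = GMMHMM(n_components=5, n_mix=2,
                   covariance_type='diag', n_iter=50)
    model.fit(X, lengths)
    return model

def classify_pairwise(token, neutral_model, stress_model):
    """Pairwise decision: pick the model with the higher log-likelihood."""
    if stress_model.score(token) > neutral_model.score(token):
        return 'stress'
    return 'neutral'
```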
Text-dependent Pairwise Stress Classification:
Model building: an HMM model for the voiced portion of each word; of the 18 tokens for each stressed style, 17 were used to train each stress model's HMM (in a round-robin way), and
testing was on 90 neutral tokens and the 1 held-out stressed token (per stress style).
Tests under simulated and actual stress conditions tell us that TEO-CB-AutoEnv performs best compared to pitch and MFCCs, partly because the TEO-CB-AutoEnv feature does not depend on the accuracy with which pitch information is extracted.
Text-independent Pairwise Stress Classification:
Similar (but slightly lower) performance was seen for TEO-CB-AutoEnv, pitch, and MFCCs.
The MFCC performance degradation for the out-of-vocabulary test is larger because these features depend on vocal tract spectral structure and are mainly designed for speech recognition, and thus rely on the test-sequence information.
Comparing a feature's performance across different speaking styles leads to the conclusion that some acoustic overlap does exist between speaking styles, more so between neutral vs. loud and neutral vs. Lombard. Another reason may be that some speakers might not portray a particular style correctly. There is also an overlap between loud and angry to a certain extent.
Multistyle stress classification:
A test utterance is scored against the neutral, angry, Lombard, and loud HMM models:
1. a neutral token decided as neutral is a correct decision;
2. an angry / Lombard / loud token decided as neutral is a wrong decision;
3. an angry / Lombard / loud token decided as NOT neutral is a correct decision.
This gives the confusion-matrix structure. Results indicate that the pitch and TEO-CB-AutoEnv features outperform MFCCs.
Another experiment, on getting both the speech recognition and the stress recognition correct, indicates that TEO-CB-AutoEnv CANNOT do a better job, which is understandable, as it models excitation variations while MFCCs are meant for speech recognition.
This observation hints that a two-stage strategy can be implemented for ASR, in which TEO-CB-AutoEnv detects stress and then MFCCs are used for ASR.
Questions that can be raised or need to be addressed:
1. What is the Lombard effect? Explain in detail the effect and its consequences on the speech production system and on human listening ability.
2. How is channel noise characterized? What is meant by additive noise and convolutive noise? What is the impact of each on the signal, and how do you mitigate the problem?
What is multi-style training?
What do MFCCs represent? How are they extracted?
Derive the Teager energy equation in the continuous domain and in discrete time.
Give the correlation of the TEO equation with the spring-damper system.
Derive the DESA and DESA-II algorithms.
Amplitude modulation and frequency modulation: basic equations.
The absolute magnitude difference function for pitch extraction.
What are the properties of autocorrelation?
Where is autocorrelation applied?
Derive the equation for the autocorrelation of a TEO profile; compute (derive) the area of the autocorrelation envelope.
What is the area under the curve for a DC-valued signal?
Analysis / comments on the results obtained.
Wednesday, February 20, 2008
Evidence for nonlinear production mechanisms in the vocal tract

Comments on "Evidence for nonlinear production mechanisms in the vocal tract," H. M. Teager and S. M. Teager, NATO Advanced Study Institute on Speech Production and Speech Modelling, Chateau de Bonas, France, July 17-29, 1989. Teager's invited lecture was given at Bonas on July 24, 1989 by J. F. Kaiser. It appears in the bound proceedings of the NATO ASI, Speech Production and Speech Modelling, W. J. Hardcastle and A. Marchal, editors, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1990.
The paper is split into three parts:
1. vocal tract - describing the nonlinear flow: the flow inside is NOT linear, passive, or acoustic.
2. ear - the ear is MORE than a simple frequency analyzer.
3. nonlinear processing techniques - to overcome Fourier artifacts.
Vocal tract:
Teager observed that, even though the velocity of sound in helium is THREE times greater than in air, the shift in the formants (the resonant frequencies in speech) is only about 1.6 times (approximately the square root of the velocity ratio), and the shift (increase) in pitch is about the same 1.6 times. This is counter-intuitive: the glottis (if considered a passive mechanical system) should not show any change at all, while a linear acoustic system would show resonances increasing in proportion to the velocity.
Observations with a hot-wire anemometer:
1. the actual flow at different locations in the oral cavity (mouth) differs; flows are location-specific across the cavity,
2. the ratio of airflow to pressure fluctuation does not match the acoustic impedance it is supposed to represent,
3. formant flows essentially stick to the walls of the vocal tract.
The data are consistent with a pulsed JET whose average flow axis is close to the PALATE but whose direction is perturbed at the formant frequency. The flow at different locations in the oral cavity shows different patterns, and thus different formants. Thus, formants existing in the mouth may or may not exist outside it.
Using multiple sensors, Teager also found that the velocity outward (toward the lips and out) is always positive, and may contain large vortexes of an axial or radial type.
As the ratio of airflow to pressure does not OBEY the acoustic impedance relationship (it is smaller by a factor of 100), the sound wave cannot be considered an acoustic wave; i.e., we have a wave that is not travelling by compression.
If the wave equations describing acoustics (and Laplace's wave equation from electromagnetic theory) were used to analyze the sound signal, then the speech signal would have to be treated as coming from a nonlinear system.
Refer to Morse and Ingard's wave equations (after adding the convection term).
As there exists no (or little) pressure difference across the cross-section, a change in pressure is not what causes the change in velocity. The sound wave can be believed to be propagating by losing a little of its kinetic energy.
Based on the continuity equation and an f = ma type second equation, four kinds of waves can exist: positive and negative, going with and against the flow. Each vowel sound has a distinctive (and unique) flow pattern [which, in my (the blogger's) opinion, is obvious]. These can be combinations of separated flow, axial vortexes, radial vortexes, and a variety of interactions between them.
Teager wants the sound generation system to be termed an aerodynamic system, NOT an acoustic system. Teager observed five different instabilities (or modes of oscillation):
1. whistle: a jet tangentially exciting the cavity,
2. wall of the cavity: a jet that runs along the inside wall of the cavity,
3. inside the cavity: a jet with a swirl inside the cavity,
4. a radial vortex jet,
5. the old Aeolian instability.
First Model: Whistle [ordinary policeman's whistle, with or w/o a pea or ball inside]
The whistle is a relaxation oscillator. The cavity pressure oscillates (increases and decreases). When the cavity pressure is low, the jet of air (from the mouth) builds the pressure up. When the cavity pressure rises above a certain value, the jet of air blows out (thus the sound is generated). The incoming jet of air deflects as it enters the whistle cavity, generating a vortex. This vortex amplitude-modulates the airflow, giving the whistle its typical sound.
If the above experiment is done with a different gas (helium, for example) [helium is lighter than air; the speed of sound in helium is three times that in air], the pitch goes up by a factor of only 1.6.
Second model: Aeolian instability - the time behavior of the vortexes bound behind a wire.
All the above models can be represented by different sets of dynamic equations, but the commonality is that they are all some form of regenerative oscillator. Thus, the system cannot be passive.
**** The contents of this paragraph are a verbatim version of the original:
Sound waves are assumed to be able to travel freely in any direction. Jets and vortexes cannot. A jet of air inside a cavity with an inlet and an outlet, such as the mouth, acts as a barrier to the cavity's outlet. An axial vortex in a similar cavity can also act as a barrier to that cavity's outlet, but in a different manner than the jet. The swirling axial vortex acts as a nonlinear plug. When the pressure inside the cavity is increased, the vortex is compressed, cutting off the flow. When the pressure inside the cavity is decreased, the vortex expands, allowing more flow. This is exactly the description of a positive feedback system, which will oscillate under almost any circumstance, and indeed does.
Part II: Hearing
The ear is NOT a Fourier frequency analyzer. Seebeck concluded that one hears periodicities rather than multiple pure tones.
Helmholtz and von Bekesy held the premise that the ear is a tonal analyzer, with hair cells acting as resonators, each resonating at a different frequency.
The fluid inside the ear has a damping effect, so Teager makes the point that it is NOT possible for a hair cell to respond with "resonance".
While other scientists believe that energy in the ear is carried as an acoustic wave in the bulk motion of the fluid, Teager believes that most of the energy travels as a wave along the inside of the cochlear surface.
Teager believes that the outer hair cells might be setting up their own vortex, which would then act as an amplifier [a real low-noise amplifier, able to measure deflections on the order of 1/100th of an angstrom].
Teager also points out that a small bird, which does not have a cochlea, can sing and modulate its voice. Hence he thinks there is no way that human ears do mechanical frequency selection.
Fourier analysis makes sense for stationary, periodic signals, which is NOT the case for speech signals (speech has variability and modulations). The speech signal is made up of TRANSIENTS, or repetitive transients, which, if analyzed through Fourier analysis, will not yield what we are looking for.
Teager feels that, instead of breaking the sound into individual frequency components, 1) "we need to understand the energy involved in producing that sound"; 2) this implies we are interested in both the square of the frequency and the square of the amplitude of the sound wave; 3) we are also interested in the modes of oscillation, and 4) in the structure within the oscillations - the modal structure as well as the amplitude structure. This is how transients will be represented: the ear is interested in the energy modulation that generated the transient sound.
Teager postulates that "there is something that is going on in the ear and the brain": the system does the following:
1. filters the sound (what type of filter needs to be applied?). The idea is that we need to focus our attention on a particular band of frequencies at a time.
2. demodulates the result (does this mean demodulating the output of each filter of the filter bank?)
3. performs correlation to find out what is going on (does this mean correlating across all the filters of the filter bank, or working within each individual filter?)
A Fourier analyzer multiplies the speech wave by a group of sine waveforms, then integrates and averages; this destroys the very information we wanted to extract.
A) The vocal tract is a nonlinear oscillator, hence it obeys the special property of "mode-locking". That is to say, a nonlinear oscillator CANNOT generate "all possible frequencies" at the same time.
B) Energy can be transferred: the energy from high-frequency components gets coupled into the lower-frequency modes, after deforming the system and attaching attributes to it; this gives rise to all the modes of oscillation.
C) Fourier analysis of the signal will not help us conclude whether the signal was generated by an active or a passive system. But Teager believes (this is not said explicitly; I infer it) that the way he analyzes the system, one can figure out whether the system is active or passive.
Part III: Speech, hearing, and related signal processing:
In spite of its ability to produce infinite variations and sounds, human communication (in most cases) relies on a small number of sounds. Thus, we can attach some special "attributes" to these sounds.
If a set of wide bandpass filters is used (to find the frequency bands in which energy is concentrated), wide frequency bands of energy are seen; on the other hand, if narrow bandpass filters are implemented, energy bands are found all across the spectrum.
For example, in a broad-band sound spectrogram, numerous horizontal bands are defined as formants, while the vertical striations correspond to the pitch periods. But a closer look reveals a different grouping or banding of frequencies. Hence the definitions of "a formant" and "a frequency band" become hazy.
Teager during his experimentation concluded that "no pitch periods were ever the same twice".
Helmholtz, and later Peterson and Barney, either claimed or substantiated that vowels are combinations of two pure tones.
Teager shows that a couple of continuants were missed in the observations for the experiments conducted by Peterson and Barney.
Licklider claimed that each of the front vowels [e, ih, eh, ...] can be identified by one formant only. Teager found this to be true as well. This led Teager to another question: where does the information lie, or come from?
Teager notes that the same word said by different speakers, or even by the same speaker at different times, sounds a little different, yet the listener is still able to perceive the meaning. Hence he concludes that the phenomenon that distinguishes the sound of a word or phoneme is not the pure tones in that word or phoneme, but rather the modulations of those tones.
Most importantly,
these modulations can be both tracked and quantified.
Teager's approach:
1) locate the modes of oscillations,
2) adaptively bandpass filter the speech,
3) demodulate the results in order to characterize and identify the speech sounds.
**** Comments on the filters used for the purpose (no changes made to the contents):
The filters used to obtain the output were soft in the sense that they were highly damped and did not produce any long-lasting oscillations. Although these filters are linear, they are unconventional. If one uses very sharp narrow-band filters to separate the modes of oscillation prior to demodulation, then the response of those filters to a pulse of energy will be dominated by the transient ringing, or lasting oscillation, of the filters. Instead, it is best to use wide-band filters that are as narrow as possible without destroying or rearranging the energy in the original wave. Filters with Gaussian-like responses work very well and are the type of filters that were actually used.
A nonlinear demodulation algorithm was applied to the output of each of the bandpass filters, to understand the differences in the modes of oscillation and their different modulation patterns.
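A sketch of a "soft" Gaussian-envelope (Gabor) bandpass filter of this kind, under my own parameterization (Teager does not specify his filters; the bandwidth-to-envelope relation below uses the standard Gaussian Fourier pair sigma_f = 1 / (2*pi*sigma_t)):

```python
import numpy as np

def gabor_bandpass(fs, fc, bandwidth, dur=0.05):
    """FIR Gabor kernel: a Gaussian envelope times a cosine at fc.
    A Gaussian with time-domain std sigma_t has frequency-domain std
    sigma_f = 1 / (2*pi*sigma_t), so a wider bandwidth means a shorter,
    more heavily damped kernel (no long ringing)."""
    sigma_t = 1.0 / (2.0 * np.pi * bandwidth)
    t = np.arange(-dur / 2, dur / 2, 1.0 / fs)
    h = np.exp(-0.5 * (t / sigma_t) ** 2) * np.cos(2 * np.pi * fc * t)
    return h / np.sum(np.abs(h))  # rough gain normalization

# Usage: band = np.convolve(speech, gabor_bandpass(8000, 1000, 400), 'same')
```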
**** Under linear acoustic theory, the formants would be represented primarily by damped sine waves; hence the wave pattern (the energy pattern at each of the bandpass filters) would NOT have bumps. As Fig. 5 [right-side waveforms] shows bumps, i.e., concentrations of energy, Teager believes that formants are a result of pulsatile flow interactions.
Teager's observations on the energy profiles:
The lowest bandpass filter output has a very large single pulse of energy occurring once every pitch period, indicating that this pulse is "a puff" from the glottis. Hence the lowest bandpass filter represents the energy of the glottal wave.
The second bandpass filter output has a heavily modulated mode of oscillation (indicated by the three successively decreasing pulses, nearly evenly spaced within each pitch period).
The highest filter output is mostly the rough sounds, which one could also listen to alone and still distinctly decipher the original sound.
Thus, the energy traces indicate that the formants are modulated.
Sound generation in the vocal tract is an active distributed process.
The output at each filter indicates the place where these "energy profiles" were, or would have been, generated. They indicate that the profiles were generated at separate parts of the vocal tract, and that the residual noise is produced by the teeth and lips.
With a linear passive system, it is NOT possible to locate the source of sound generation.
The high-frequency components generated in the glottis do not make it through the oral cavity to the outside of the mouth. The pulsatile sheet jet coming out of the narrow slit of the vocal folds during phonation generates a considerable amount of high-frequency noise, which is inherent in the process; the pulsatile jet proceeds through the vocal tract and drives or excites everything downstream from it. Even though the sound generated by the second-order process is what is heard, the main source of energy is the glottal jet.
*** The sounds that human beings almost universally utilize for speech are in fact completely distinguishable on the basis of the amplitude and frequency modulations of their energy envelopes. Each vowel sound has a unique modulation that is generally tied in with its high-frequency second formant. This unique characteristic of the selected speech sounds, and the fact that they are not difficult to generate, might well account for their universality.
The nonlinear processes (the primary sound-producing mechanisms) arise from the nonlinear interaction of the sheet jet flows and the generated flow vortexes within the confined geometry of the vocal tract, with the vortex probably playing the role of the active oscillator in effecting modulations.
=======================
questions:
1. Why should the modes [pg 12 - mode-lock] space themselves apart by at least a factor of TWO in frequency?
2. How can Teager use Fourier analysis to help support his claims, when he himself says that Fourier analysis "smears and destroys the very information we are trying to extract" [pages 12/13]?
3. What would be the reason behind "no pitch periods were ever the same twice; pitch periods vary from being slightly different to being very different, but they were always different"?
4. How [pg 17] did he compute the noise residue?
5. What are these wide-band Gaussian-like filters?
6. What is this [pg 17/18] nonlinear demodulation algorithm?
7. How can [figure 5, right-side energy profiles] Teager get POSITIVE energy profiles for all the energy patterns, whereas when we implement a TEO operator, we DO get NEGATIVE values? There must be something else in Teager's nonlinear demodulation algorithm...
Explain the source-filter model.
What is a linear system? State its advantages for speech.
How do you characterize speech?
What all needs to be considered (or what are the allowances) in a linear speech model?
Formants and articulators.
Characterization of airflow and airflow dynamics: the Navier-Stokes equation and the continuity equation.
Electrical equivalents of an acoustic waveform.
What are planar waves / wavefronts?
How are the lips / teeth characterized?
What is the difference between pitch and fundamental frequency?
What is the effect of sampling / quantization?
Gabor filter: bandpass filter properties, and why do we require these properties?
Stress detection in computer users based on DSP of noninvasive physiological signals
Comments on "Stress Detection in Computer Users Based on Digital Signal Processing of Noninvasive Physiological Variables," Jing Zhai, Armando Barreto, Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, Aug 30 - Sept 3, 2006, pp. 1335-1358.
Detect: mental or cognitive stress associated with computer interaction.
physiological signals:
Galvanic Skin response (GSR)
Blood Volume Pulse (BVP)
Pupil Diameter (PD)
Skin temperature (ST)
classification strategy: SVM-based, to classify between "stressed" and "relaxed" responses.
Dataset:
32 students (ages 21-42).
Procedure:
first 5 minutes, subjects were shown 30 still emotionally neutral pictures to relax.
then subjected to "Paced Stroop Test" (http://en.wikipedia.org/wiki/Stroop_task). The subjects had 3 seconds to answer with a mouse click.
Features:
sampling rate: 360 Hz.
from BVP, based on interbeat interval (IBI) calculations and power spectrum analysis (4 features):
L/H ratio (low frequency: 0.05-0.15 Hz; high frequency: 0.16-0.40 Hz)
mean IBI
standard deviation of IBI
amplitude of BVP
from GSR, based on response detection (5 features):
# of responses
mean value of GSR
amplitude response
rising time of response
energy of response
from ST, after low-pass filtering (1 feature):
slope of ST
from PD, based on linear interpolation of PD samples (1 feature):
mean value of PD
"to account for differences in the initial arousal levels due to individual differences, normalization of the data was needed prior to use of features, between [0, 1]"
Classification: Support Vector machines (weka software). the classification performance was evaluated using 20-fold cross validation, 20 samples were pulled out as test samples, and the remaining samples were sued to train the classifiers.
the authors have also compared SVM based classifier with naive-based classifier and a decision tree classifier.
The authors were mainly interested in determining the added recognition capability that can achieved with pupil diameter measurements (in junction with other physiological signals).
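The paper used WEKA; here is an equivalent sketch in Python with scikit-learn, including the [0, 1] min-max normalization the authors mention (feature matrix and labels below are placeholders, and the paper normalizes per subject, while this sketch applies a simple global min-max per fold):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X: one row per segment with the 11 features (4 BVP, 5 GSR, 1 ST, 1 PD);
# y: 1 = stressed (Stroop segment), 0 = relaxed. Placeholder data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(640, 11))
y = rng.integers(0, 2, size=640)

# Min-max scaling to [0, 1], then an SVM, evaluated with 20-fold
# cross-validation as in the paper.
clf = make_pipeline(MinMaxScaler(), SVC(kernel='rbf'))
scores = cross_val_score(clf, X, y, cv=20)
print(scores.mean())
```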
Thursday, February 7, 2008
Acoustic Sensors in the Helmet Detect Voice and Physiology
Comments on, Michael Scanlon, "Acoustic Sensors in the Helmet detect Voice and Physiology," Proceedings of SPIE -- Volume 5071,Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Defense and Law Enforcement II, Edward M. Carapezza, Editor, September 2003, pp. 41-51.
Acoustic monitoring of first responders' physiology for health and performance surveillance
Comments on, "Acoustic Monitoring of first responder's physiology for health and performance surveillance" Michael Scanlon, Proceedings of SPIE -- Volume 4708, Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Defense and Law Enforcement, Edward M. Carapezza, Editor, August 2002, pp. 342-353
The main focus is on body-worn acoustic sensors located at the neck to detect heartbeats and other physiological parameters. The author suggests that these sensors do a good job, but during rigorous activity sessions a lot of artifacts get added, which prevents inferring conclusions. The author argues, however, that if there is a lot of activity then that in itself indicates the person is "in good shape" and there is nothing to worry about.
The parameter measured is heart-rate variability (beat-to-beat timing fluctuations derived from the interval between two adjacent beats).
The technique used:
A Lomb periodogram is used to derive heart-rate variability. Simple peak detection above and below a certain threshold, or waveform derivative parameters, can produce the timing and amplitude features necessary for the Lomb periodogram and cross-correlation techniques.
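A sketch of what deriving the HRV spectrum from irregularly timed beats might look like (scipy's lombscargle assumed; the mean removal and the frequency grid are my choices, not the paper's):

import numpy as np
from scipy.signal import lombscargle

def hrv_lomb(beat_times, freqs_hz):
    """Lomb periodogram of the IBI series.
    beat_times: heartbeat times in seconds (unevenly spaced)
    freqs_hz:   strictly positive frequencies, e.g. np.linspace(0.02, 0.5, 200)"""
    beat_times = np.asarray(beat_times, dtype=float)
    ibi = np.diff(beat_times)                  # inter-beat intervals (seconds)
    t = beat_times[1:]                         # stamp each IBI at its ending beat
    ibi = ibi - ibi.mean()                     # remove the DC component first
    ang = 2.0 * np.pi * np.asarray(freqs_hz)   # lombscargle expects rad/s
    return lombscargle(t, ibi, ang)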
The sensor - a gel-coupled sensor - has impedance properties similar to those of the skin, but a significant mismatch for airborne noise.
This technology can be used to measure heartbeats, breaths, blood pressure, motion, voice, and other indicators. Other specific events - coughs, gags, wheezes, and vomiting - can also be detected. [The author does not say it, but I feel that the sensor can help detect these events as well.]
(pg 345) Knowing the resting heart rate helps establish HRV and is a good indicator of which personnel are ready to reenter the hazard situation. The duration of elevated heart rates and the maximum rate reached can also indicate a person's ability to safely and effectively perform his/her mission (task).
data details: (remarks only on the data of concern to me)
fs = 1500 Hz, anti-aliasing filter corner frequency = 500 Hz, 30 minutes of data.
heartbeats are clearly seen from the neck sensors (left and right).
** How the IBIs fluctuate on a beat-by-beat basis, as well as their long-term trends, is termed HRV - it gives an indication of how well the body is regulating blood pressure, breathing, and core temperature. The IBIs can also indicate mental activity related to concentration on a task: the IBIs become very regular during a task demanding intense concentration and precision muscle control, whereas they may vary significantly for tasks with varying mental and physical distractions. -- refer to G. Mulder and L.J.M. Mulder, "Information Processing and Cardiovascular Control," Psychophysiology, 1981, 18, pp. 392-405.
To measure blood pressure, heart-rate measurements are needed at two different locations on the body (the farther apart, the better); the time lag between the pulse arrivals at the two locations, together with the distance, relates to the blood pressure. For the delta-time between the neck and wrist acoustic pulses: a long time-delta indicates a slow wave (low systolic pressure), while a short one indicates a fast wave (high systolic pressure).
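A rough sketch of estimating that neck-to-wrist time delta by cross-correlating the two pulse signals (the cross-correlation step itself is my assumption -- the paper mentions a "cross-correlation technique" but does not spell out the procedure; fs follows the data details below):

import numpy as np

def pulse_transit_delta(neck, wrist, fs=1500.0):
    """Seconds by which the wrist pulse lags the neck pulse.
    A long delta implies a slow pressure wave (low systolic pressure),
    a short delta a fast wave (high systolic pressure)."""
    neck = neck - np.mean(neck)
    wrist = wrist - np.mean(wrist)
    xcorr = np.correlate(wrist, neck, mode="full")
    lag = np.argmax(xcorr) - (len(neck) - 1)   # samples of wrist delay
    return lag / fs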
** Systolic pressure can also be estimated (use this method with caution) from the slope of the second heart sound, but this is not accurate. It is also possible that breath rates can be derived from acoustic pulses at the wrist by analyzing changes in amplitude that result from the lungs over- and under-pressurizing the heart.
(pg 348) The neck acoustic data clearly shows high-amplitude heartbeat pulsations in the low-frequency (0-120 Hz) region, high-amplitude harmonics of the voice structure, and medium-level broadband breath sounds in the 200-500 Hz region. The anti-aliasing filters can be seen to attenuate the sounds above 500 Hz.
(pg 350) One method to monitor personnel is to look at the short-term energy detected at the sensors. The RMS energy from the right-neck sensor shows high levels from head turns, voice, jacket, hood, and mask movements, and from muscular activity such as lifting or crawling.
A decrease in RMS energy at all acoustic sensors indicates a decrease in activity.
For breath-rate detection, high-passed neck data reveals a lot of broadband high-frequency energy resulting from the airflow in the throat. Using an FFT to monitor the temporal fluctuations of the RMS energy produces a breath-rate peak in the power spectrum. If the data is clipped (at three times the median of the absolute value of the filtered data), the advantage is that it removes the influence of high-amplitude motion artifacts on the RMS calculations.
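That recipe translates into roughly the following sketch (only the 3x-median clipping and the RMS/FFT steps come from the text; the filter order, cutoff, window length, and search band are my choices):

import numpy as np
from scipy.signal import butter, sosfiltfilt

def breath_rate(neck, fs=1500.0, hp_cutoff=200.0, win=0.1):
    """Estimate breath rate (Hz) from broadband throat-airflow noise."""
    # 1. High-pass the neck data to keep the broadband airflow energy.
    sos = butter(4, hp_cutoff, btype="highpass", fs=fs, output="sos")
    x = sosfiltfilt(sos, neck)
    # 2. Clip at 3x the median absolute value to suppress motion artifacts.
    thr = 3.0 * np.median(np.abs(x))
    x = np.clip(x, -thr, thr)
    # 3. Short-term RMS envelope.
    n = int(win * fs)
    rms = np.sqrt(np.convolve(x ** 2, np.ones(n) / n, mode="valid"))
    # 4. Breath rate = dominant peak of the envelope's power spectrum.
    rms = rms - rms.mean()
    spec = np.abs(np.fft.rfft(rms)) ** 2
    freqs = np.fft.rfftfreq(len(rms), d=1.0 / fs)
    band = (freqs > 0.05) & (freqs < 1.0)      # plausible breathing range
    return freqs[band][np.argmax(spec[band])]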
Labels:
breath rate,
Heart rate variability,
PMIC,
workload
Friday, February 1, 2008
"on a simple algorithm to calculate the "energy" of a signal"
Comments on Jim Kaiser, "On a simple algorithm to calculate the 'energy' of a signal". IEEE ICASSP 1990, pp 381-384
The paper outlines an algorithm - an alternative way to represent the energy of a signal - specifically a speech signal (or, for that matter, any signal generated by a mechanical assembly).
The existing procedures to compute the energy of a signal are (two methods are described here):
1. average the sum of squares of the amplitudes of the signal (usually over a short segment)
2. using Parseval's theorem (I think), in the frequency domain: take the discrete Fourier transform and sum the squared magnitudes of the frequency components.
The drawback: a mechanical system requires more energy to generate a signal of higher frequency than one of lower frequency (at the same amplitude), yet the energy computed by either (1) or (2) does not account for the frequency of the signal at all.
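The "simple algorithm" of the title is the discrete Teager energy operator, Psi[x(n)] = x(n)^2 - x(n-1)*x(n+1), which folds frequency into the energy estimate. A minimal sketch comparing it against method (1):

import numpy as np

def conventional_energy(x):
    """Method (1): mean of squared amplitudes -- blind to frequency."""
    return np.mean(np.asarray(x, dtype=float) ** 2)

def teager_energy(x):
    """Psi[x(n)] = x(n)^2 - x(n-1)*x(n+1).
    For a tone A*cos(W*n) this gives ~A^2*sin^2(W), so the result
    grows with frequency as well as amplitude."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Two tones of equal amplitude but different frequency:
n = np.arange(1000)
lo, hi = np.cos(2 * np.pi * 0.01 * n), np.cos(2 * np.pi * 0.05 * n)
print(conventional_energy(lo), conventional_energy(hi))    # ~0.5 for both
print(teager_energy(lo).mean(), teager_energy(hi).mean())  # hi tone >> lo tone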
But [blogger's comment] my feeling is that adding the frequency dimension makes the computation "complicated": if a signal is a combination of several frequency components, how do you then compute the signal energy? Also, if the signal was not generated by a linear system, what method would you use to decompose it into its frequency constituents? These questions are not trivial and do not seem to be answered by this "simple algorithm".
Friday, January 11, 2008
Phenomenological Model for Vowel Production
Comments on:
"A Phenomenological Model for Vowel Production in the Vocal Tract," Herbert M. Teager, and Shushan M. Teager
==================================================================
"A Phenomenological Model for Vowel Production in the Vocal Tract," Herbert M. Teager, and Shushan M. Teager
==================================================================
The major motivation for an alternative - or a definitive - model for speech (in particular, for vowel production) is to remove the shackles that bind the uncodified and unexplored, yet seemingly solved, problem of speech; the technology has simply bypassed the need to understand it.
Some scientists believe that 90% of the observations for speech can be explained with source-filter theory, but the remaining 10% is not appropriately addressed. The speech models tacitly depend on the validity of models for hearing. ****
The current model for speech - the linear filter model - views the produced voice as a combination of pure tones, with the ear (an imperfect Fourier analyzer) believed to extract the magnitudes of these objective components.
The linear model cannot predict, or test / verify, the transients / noise tones existing in nature.
The observations made by Teager (to highlight the anomalies in the modeling world) are:
A) The blame for limited and constrained speech systems (and their performance) is laid on our limited understanding of the human brain. Teager counters with an example: a bird (the myna) can replicate (mimic) long passages of human speech in spite of having an anatomically small cochlea (the cochlea is assumed to be the place where humans differentiate tones into different frequencies - the hair cells in the ear respond differently, with a certain set of hair cells responding to a particular set of frequencies), a vocal tract different from that of humans, and a different cerebral cortex structure to mediate between heard and spoken speech.
B) The various clicks, whistles, snores, and other types of sounds that humans can produce cannot be modeled by the conventional linear source-filter model (as they do not have the glottis as their source).
C) In theory, formant values depend solely upon the cross-sectional areas along the center line of the supraglottal vocal tract.
Wednesday, January 9, 2008
Active Fluid Dynamics Voice Production Models
Comments on, "Active Fluid Dynamic Voice Production Models, or There is a Unicorn in the Garden," Herbert M. Teager and Shushan M. Teager.
--------------------------------------------------------------------------------------------
Above is the talk given by Herb Teager to explain the necessity of a new model for speech production (with focus on the nonlinear aspects of speech flow).
Herb Teager gives evidence that the speech flow is not at all planar and that it is "separated flow."
The following things bothered Teager the most about the way the modeling of speech has been done:
1. Less than 1% of the equivalent mechanical lung input energy is involved in the rate of change of volume velocity at the glottis. The remaining ~99.5% of the energy is "used up" somewhere, but the research community is NOT bothered about it.
2. Representing the speech apparatus as a "passive linear system" is not right, as the time-domain and frequency-domain observations aren't completely interchangeable.
3. Regarding the behaviour of the speech signal in different media (the famous helium effect): the relative shift of the formants (usually seen for formants below 200 Hz) should be proportional to the change in density, and to an extent related to the atmospheric pressure. But Teager observed the shift / changes in the speech signal in the other direction, and by a different factor (the square root of the density).
4. The data (and hence the observations) obtained using hot-wire anemometry indicate that (a) "uniform, plane acoustical waves were incompatible with the observed separated flows"; (b) the flow is separated, possesses rotation, and doesn't necessarily repeat itself from cycle to cycle; (c) flow patterns remain consistent for a given vowel, but differ radically across phonemes.
5. Flow pulses arising from the front and back of the mouth move at different speeds and attenuations. Most importantly, the pressures were uniform over cross sections within which separated flow occurred. The driving mechanism for the separated flow is still unknown.
6. Acoustic impedance, which relates pressures and flows in a sound wave, cannot account for a flow wave that moves without a corresponding pressure wave.
*** Teager assumes that solitons (and the "momentum wave," a term coined by him) can help explain the phenomenon.
7. The continuity and motion equations used to model the speech flow are one-dimensional (or can be extended to two dimensions) but cannot be applied to 3-D and, most importantly, separated flow, as the equations are over-simplified. "The basic unsimplified Navier-Stokes equations are intrinsically unstable and unsolvable."
The flow should be considered "separated flow" made up of threads and gusts which change quickly in direction, time, and space - i.e., there is asymmetry of wave propagation in separated flow.
Teager then went on to discuss the models that he thinks will help capture all or some of the above anomalies.
The basic premise is to omit the effects in the lungs and concentrate on the air-flow interactions at the glottis and in the mouth.
The major focus is on understanding the nonlinear energy feedback and modulation effects that occur in jet-cavity interactions, and the dynamics of rotating vortex flows.
1. For a flow leaving a constriction, like a jet, with the cavity (mouth/glottis) being asymmetric, the jet will differentially exhaust to one side or the other and will thus be "attracted" to the nearest wall, but the attachment may be unstable or oscillatory. This separated flow will simultaneously give rise to vortices (axial, radial, or a combination of both).
2. The glottis forms a complex system: it provides a time-varying, directional source of separated air flow, in the presence of vortices that modify those flows with their own internal and external dynamics.
3. A jet-cavity interaction can result in:
a) An oscillator, when there is regenerative addition of energy while the energy is already high, or removal when it is low. Rotating flow can store kinetic energy and then return it as pressure, with no differential flow. The interaction with the separated flows also acts as a nonlinear yielding wall. In almost all cases, because of the radial, rotating vortex within, the oscillator exhibits a strong amplitude modulation at the internal vortex precession rate.
b) The axial vortex at the converging outlet, which acts as a nonlinear plug. When the pressure in the chamber is high, outward flow is impeded by the compressed outlet vortex; when the pressure is low, the vortex expands and allows relatively more exit flow and an increase in vortex strength - causing oscillations.
If two flows modulated by different frequencies collide (with the collision taking place over part of a pitch period), then all sorts of combinations of frequencies will result.
c) A filter: if, for some reason, the regenerative coupling is small or happens over only part of the cycle, then the oscillator degenerates into a filter.
Teager suggested:
1) to perform more experimentation on humans to understand the whole gamut of jet-cavity interactions, and similar effects, within the different areas of the mouth and throat and in the coupling of the whole system.
2) to avoid computer simulations of the "simplified Navier-Stokes equations".
During the discussion that followed, Teager suggested that:
a mask-type pneumotachograph needs to be calibrated for pulsatile flows (as these flows occur during speech production) in order to assess the flow variations with instruments other than the hot-wire anemometer.
The nominal range for measurements of flows within the mouth / throat:
Frequency range: 0-4 kHz
Physiological range: 10-300 cm/sec
The frequency response of the instrumentation used by Teager is flat (in 0-5 kHz) and is independent of the flow rate.
-----------------------------------------------------------------------------------------------
*** "We have affirmed that the seven-dimensional (three linear velocities and three angular velocities, plus time) flow patterns are unique for each vowel, and have published four sets of manifestly different trajectory data to back up this postulate."
-----------------------------------------------------------------------------------------------
Labels:
experimental results,
non-linear,
Speech modelling,
Teager