
[Figure: left, neutral and stressed speech signals; right, nonlinear airflow structure]
Comments on "Nonlinear feature based classification of speech under stress," Guojun Zhou, John H. L. Hansen, and James F. Kaiser, IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, March 2001, pp. 201-216.
First: an outline of the paper, with paragraph-wise comments on its contents.
1. Introduction:
para1: definition of stress, and studies on the effect of stress on speech production
para2: effect of speech under stress on automatic speech recognition (ASR) systems
para3: techniques to overcome the effect of stress on ASR performance
para4: applications of stress classification, and uses of a stress classification algorithm
para5, 6: features used for stress classification so far
para7: comments on research done by Teager, and the nonlinear airflow description
para8: the Teager energy operator
para9: study with TEO-based features
II. Stress Classification features
A. Background of the Teager energy operator
para1: the discrete Teager energy operator, and the discrete energy separation algorithms (DESA-1, DESA-2, and variants)
B. TEO-FM-Var: variation of the FM component
para1: how the feature is extracted, and why/how it represents stress or the variations caused by stress.
C. TEO-AutoEnv: normalized TEO autocorrelation envelope area
motivation behind this feature; why normalized, why autocorrelation, and the definition of the segment size.
D. TEO-CB-AutoEnv: critical band based TEO autocorrelation envelope
para1: motivation, how/why the feature is extracted, and how it differs from the previous two nonlinear features.
1. Harmonic Analysis:
para1: harmonic analysis is necessary to understand the shift in the number of harmonics within a "critical band" (a filter of the filter bank) when speech is under stress. The study focuses on voiced segments of the words, and on differences in the number of harmonics across stress conditions.
2. Quantitative Analysis:
mathematical formulation of how stressed speech differs from neutral speech, and what the autocorrelation area represents for neutral/stressed speech.
3. Waveform analysis:
describes that not only does the pitch differ between neutral and stressed speech, but there is also SOMETHING else that cannot be quantified completely by pitch or change in pitch. That change can be attributed to changes in muscle tension, in airflow, and in the way articulators are used for speech production under non-neutral conditions.
III. Evaluations
A. Database:
three domains of the SUSAS database were used for analysis and evaluation: neutral, simulated stress, and actual stress; speech sampled at 8 kHz, 16-bit.
stress model: 5-state HMM with continuous distributions, each state a mixture of two Gaussians.
B. Traditional features:
MFCC - their effectiveness in representing the spectral variations of speech,
pitch - obtained from pitch tracking algorithm
C. Stress Classification results:
four styles of evaluation were carried out:
1. to evaluate the set of features (MFCC, pitch, and the three features from the authors) and find the best three among them, text-dependent "pairwise" stress detection is done.
2. after finding the three best features, these feature sets are put to text-independent "pairwise" stress detection.
3. these three feature sets are then evaluated for stress classification of neutral against all the stress categories (angry, loud, Lombard combined into a single class).
4. a study of how much text dependency matters in stress classification is done by evaluating the feature sets for their applicability to stress classification as well as to ASR evaluation.
Detailed notes on the actual technical aspect of the paper:
Stress arises while working in noisy backgrounds (the Lombard effect), in emergency conditions, under high workload, while multitasking, from fatigue due to sustained operation, from physical environmental factors, from emotional moods, or due to chemical consumption (medicines or otherwise; prescribed or otherwise).
stress can cause:
speech to sound slower, faster, softer, or louder, with changes in the respiration pattern and in the muscle tension of the vocal tract.
Speaker at times may use a nonuniform set of speech production adjustments to convey their stress states.
How to improve the performance of ASR for speech under stress:
1. retraining the reference models (adjusting so that the trained and test conditions match),
2. training the speech models under all conditions combined together,
3. speaker dependent training,
4. using speech perturbation models within the HMM framework
The idea to adopt: use a stress classification algorithm to classify speech (as neutral, stressed, or a specific style of stress) and then use models adapted to that specific stress condition. This is the utility of a stress classification algorithm.
utility of stress classification algorithm:
1. to improve the robustness of ASR engine,
2. to prioritize calls based on an "emergency index" for the call; highly stressed calls need to be addressed with top priority,
3. to assess the caller's emotional state,
4. to aid psychiatrists in the objective assessment of a subject,
5. forensic speech analysis.
research with speaker stress classification:
1. using pitch and variation index of pitch,
2. spectral features based on linear speech production models,
3. estimated vocal tract area profiles,
4. acoustic tube area coefficients,
5. MFCC, delta MFCC, double-delta MFCC, autocorrelation of MFCCs.
6. phoneme duration, intensity, glottal source structure (especially spectral slope), vocal tract formant structure.
classifiers used:
distance metrics, NN based classifiers, HMM-based structure.
Airflow through the vocal tract consists of separated flow and concomitant vortices, which makes it nonlinear. Stress will cause changes in muscle tension, and therefore changes in the vortex-flow interaction patterns, leading to a difference in the nonlinear flow structure.
"While vocal tract articulators do move to configure the vocal tract shape, it is the resulting airflow properties which serve to excite those models (phoneme - speech production models - voice models) which a listener will perceive as a particular phoneme." Teager formulated the TEO, which also accounted for hearing: he regarded hearing as a process of detecting energy.
Background of Teager Energy Operator:
The TEO is typically applied to a bandpass filtered speech signal, since its intent is to reflect the energy of the nonlinear flow within the vocal tract for a single resonant frequency. Although the output of a bandpass filter still contains more than one frequency component, it can be considered as an AM-FM signal.
*** As the speech production unit generates frequency-locked signals and will generate frequencies not close to each other ("if close to each other they will coalesce"), the critical band approach seems to be a valid one.
Although TEO processing is intended to be used for a signal with a single resonant frequency, the TEO energy of a multi-frequency signal does not only reflect individual frequency components but also reflects interactions between them. This characteristic extends the use of TEO to speech signals filtered with wide bandwidth band-pass filters (BPF).
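As a side note, the discrete TEO is Ψ[x(n)] = x²(n) − x(n−1)x(n+1); for a pure sinusoid A·cos(Ωn + φ) its output is exactly the constant A²sin²(Ω), which is why it tracks both amplitude and frequency. A minimal sketch (signal parameters are illustrative, not from the paper):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

fs = 8000.0                 # 8 kHz sampling, as in SUSAS
f0, amp = 200.0, 1.5        # illustrative tone
omega = 2 * np.pi * f0 / fs
n = np.arange(400)
x = amp * np.cos(omega * n + 0.3)

psi = teager_energy(x)
# For this pure tone, psi is constant and equals amp^2 * sin(omega)^2.
print(psi.mean(), amp**2 * np.sin(omega)**2)
```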
If the two waveforms are observed carefully, neutral and stressed look similar, though the stressed waveform has fewer peaks: the number of peaks between two maxima is smaller in stressed speech, but the peak amplitudes are more prominent compared to neutral speech. The probable reason is that the glottal folds cannot close as fast, because of 1) their own response time, 2) the tension on the glottal muscles, and 3) the resistance offered by the airflow (the air may flow for less time, but since the quantity of air to be passed is the same, the ratio of fold-open time to fold-closed time increases, and the total duration [the pitch period] changes as well).
TEO-FM-Var Variation of FM component
the fine excitation variations observed in the speech signal are due to the effects of modulation, implying that a stress classification feature is needed which reflects these modulation variations.
No two pitch periods or values (even if measured consecutively) are alike [refer to Teager's paper]. The TEO-FM-Var feature is tied to pitch; it is pitch-synchronous. If speech is considered an AM-FM signal, then it needs a carrier, and this modulating frequency is F0. The Gabor bandpass filter (GBF) is designed with a center frequency of F0 and an (RMS) bandwidth of F0/2.
Gabor filter has excellent sidelobe cancellation property.
Using the absolute magnitude difference function (AMDF), F0 is computed from the TEO profile of the signal. The TEO profile assists the AMDF algorithm because of its squaring property.
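A rough sketch of AMDF-based F0 estimation on a TEO profile; the synthetic two-harmonic frame, frame length, and lag search range are my assumptions, not values from the paper:

```python
import numpy as np

def teager_energy(x):
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def amdf(x, lags):
    """Absolute magnitude difference function at the given lags."""
    return np.array([np.mean(np.abs(x[:-k] - x[k:])) for k in lags])

fs = 8000                          # 8 kHz, as in SUSAS
f0_true = 100.0                    # illustrative pitch -> period of 80 samples
n = np.arange(800)
# Two-harmonic synthetic stand-in for a voiced, bandpassed frame.
x = np.cos(2*np.pi*f0_true*n/fs) + 0.5*np.cos(2*np.pi*2*f0_true*n/fs + 0.4)

psi = teager_energy(x)             # TEO profile is periodic at the pitch period
lags = np.arange(40, 121)          # assumed search range (~67-200 Hz)
pitch_lag = lags[np.argmin(amdf(psi, lags))]
print("estimated F0:", fs / pitch_lag)
```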
Frame-based FM variation (???) is extracted.
Still not clear what the FM-variation metric or index is.
TEO-AutoEnv normalized TEO Autocorrelation envelope area
to reflect instantaneous excitation variations of speech.
NOT IMPLEMENTED - DUE TO TECHNICAL DIFFICULTIES IN ASSESSING FORMANTS AND FORMANT TRACKING
if a filter bank is used to bandpass filter voiced speech around each of its formant frequencies, the modulation pattern around each formant can be obtained using TEO AM-FM decomposition, from which variations of modulation patterns across different frequency bands can be obtained. Four fixed bandpass filters were implemented: 0-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz. The number of formants in each band will be 0 to 2 for both neutral and stressed speech.
To obtain the TEO-AutoEnv feature:
1. filter the raw speech utterance through these 4 filters,
2. get the TEO profile for each filter,
3. filter each TEO profile at fundamental frequency, and 3dB bandwidth of F0/2 (CONFUSED, WHY??)
The reason is: the TEO output of a signal is roughly proportional to the square of both its amplitude and frequency, and the AM component for a single formant exhibits periodicity similar to the fundamental frequency.
The major assumption is that we will have ONE formant in each filter band, which may not be true for all bands. Furthermore, if NO formant exists in a filter band, do we even need this step? That the AM component of a single formant exhibits periodicity similar to the fundamental frequency is true because the formant acts as a carrier frequency (signal), so its amplitude varies at the rate of the fundamental frequency (the pitch).
4. this signal (obtained from step 3) is analyzed frame by frame, with a frame length of four times the pitch period.
The area under the autocorrelation envelope of a constant (DC-valued) signal is a triangle with area N/2, where N is the number of samples used to compute the autocorrelation. For any other signal, the area will be less than N/2; hence (for step 5):
5. the area under the autocorrelation envelope is computed and normalized with N/2.
This area represents the degree of variability within each band.
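The N/2 triangle claim can be checked numerically. This sketch normalizes the autocorrelation by its lag-0 value and uses |R(k)| as a simple stand-in for the envelope (an assumption; the paper's exact envelope computation may differ):

```python
import numpy as np

def norm_autocorr_area(x):
    """Area under |R(k)|/R(0) for lags k = 0..N-1."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = np.array([np.dot(x[:N - k], x[k:]) for k in range(N)])
    return np.sum(np.abs(r) / r[0])

N = 200
# Constant signal: R(k) = (N-k)*c^2, a triangle, so the area is (N+1)/2 ~ N/2.
area_dc = norm_autocorr_area(np.ones(N))
# Any other signal decorrelates with lag and yields a smaller area.
area_noise = norm_autocorr_area(np.random.default_rng(0).standard_normal(N))
print(area_dc, area_noise)
```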
Different types or varying degrees of stress will influence the distribution of formant characteristics, the pitch structure, and the spectrum-based pitch harmonics relative to neutral conditions. In addition to the primary issue of formant migration into adjacent filters, additional pitch harmonics would also occur.
TEO-CB-AutoEnv critical band TEO autocorrelation envelope.
Instead of the uniform partitions of the TEO-AutoEnv feature, the filterbank at the front end is a critical-band one, modeling the structure of hearing.
The filter design is: center frequency based on the critical band, and the bandwidth is the bandwidth of the critical band.
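For reference, the standard Zwicker critical-band edges (truncated at the 4 kHz Nyquist of 8 kHz-sampled speech) give the center frequencies and bandwidths such a filterbank would use; whether the paper used exactly these edges is my assumption:

```python
# Standard Bark-scale critical band edges (Zwicker), up to 4 kHz
# for 8 kHz-sampled speech as used in SUSAS.
edges = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
         1480, 1720, 2000, 2320, 2700, 3150, 3700]

# Each band: (center frequency, bandwidth) from consecutive edges.
bands = [((lo + hi) / 2, hi - lo) for lo, hi in zip(edges, edges[1:])]
for cf, bw in bands:
    print(f"center {cf:7.1f} Hz   bandwidth {bw:6.1f} Hz")
```

Note the bandwidth around 1 kHz is ~160 Hz and grows with frequency, which is what gives TEO-CB-AutoEnv its higher low-frequency resolution compared to the four uniform 1 kHz bands of TEO-AutoEnv.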
To avoid dependency on pitch information, the TEO-CB-AutoEnv feature is derived independent of pitch.
TEO-AutoEnv represents the variations around pitch caused by formant distribution variations across different frequency bands;
TEO-CB-AutoEnv represents the variations in pitch harmonics because of its higher frequency resolution.
Harmonic analysis (manual computation and measurement) of 12 voiced tokens from each spoken style (neutral, angry, loud, Lombard) of the simulated-stress domain of SUSAS indicates variations across speaking styles as well as across different critical bands.
Quantitative analysis indicates that if the pitch increases, the number of pitch harmonics in a band will reduce. Thus the number of pitch harmonics under neutral speech (where the pitch is lower than in the other speaking styles considered) will be greater than under the other speaking styles. This suggests that the autocorrelation sequence (and thus the envelope area) for stressed speech will be less variable across frames (dependent on one or no frequencies), while neutral speech will show more variability. If we assume that neutral speech had two pitch harmonics in a band and the pitch doubles under a stressed speaking style, then under stress we will have one pitch harmonic. A single frequency means the TEO profile will be constant, so the autocorrelation envelope will be a straight (triangular) line, and thus the area under the curve for the stress condition will be MORE than the area under the curve for neutral speech.
Once the cross-harmonic terms are considered, the computation of the autocorrelation, and thus of the area under it, becomes a complex function. Also, the number of harmonics present may vary across speaking styles and across bands, and formants might migrate across bands with a change in speaking style. All these complexities can be reduced, or covered, under the umbrella of the area under the envelope.
Also, because of the autocorrelation, fast variations and minor fluctuations will be minimized, but the variations due to stress will still be accounted for. The dependency on pitch is largely OMITTED by looking at the area under the autocorrelation envelope.
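The one-harmonic vs. two-harmonic argument above can be sketched numerically: the TEO profile of a single tone is constant (area near N/2), while an extra harmonic adds cross terms that shrink the normalized autocorrelation-envelope area. The tone frequencies and amplitudes are illustrative, and |R(k)| again stands in for the envelope:

```python
import numpy as np

def teager_energy(x):
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def norm_autocorr_area(x):
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = np.array([np.dot(x[:N - k], x[k:]) for k in range(N)])
    return np.sum(np.abs(r) / r[0])

fs, f0 = 8000, 200                    # illustrative values
n = np.arange(480)
one_harmonic = np.cos(2*np.pi*f0*n/fs)                          # "stressed" band
two_harmonics = one_harmonic + 0.8*np.cos(2*np.pi*2*f0*n/fs)    # "neutral" band

area_1 = norm_autocorr_area(teager_energy(one_harmonic))
area_2 = norm_autocorr_area(teager_energy(two_harmonics))
# Fewer harmonics -> flatter TEO profile -> larger area (closer to N/2).
print(area_1, area_2)
```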
Doing a waveform analysis and raising the pitch of neutral speech (in order to minimize the differences associated with the pitch change), the neutral speech then has the "same" pitch value as the stressed speech; hence, under a valid assumption, whatever differences remain in the TEO profile, and in any features derived from it, can be attributed to the effect of stress on the speech production unit.
No doubt the effect of the Gabor bandpass filter will also play a part in the computation, but barring that, the factors affecting the change in the area feature can be attributed to:
1. change in the fundamental frequency,
2. variability in the harmonics under stress, and
3. due to the nonlinear variations occurred in the airflow in the vocal tract.
DATABASE:
subset of SUSAS words: freeze, help, mark, nav, oh, zero.
stress styles: angry, loud, Lombard - simulated stress styles,
actual task: speech during a roller-coaster ride
worked on voiced sections: vowels, diphthongs, liquids, glides, and nasals extracted from the words.
16-bit samples at 8 kHz.
Baseline system: 5 state HMM-based stress classifier with continuous distributions, and each with two Gaussian mixtures.
Features: Pitch, MFCCs (their effectiveness in representing the spectral variations of speech).
Text-dependent Pairwise Stress Classification:
Model building: an HMM model for the voiced portion of each word, using 18 tokens for each stress style; 17 stressed tokens (for each stress model) were used to train the HMM (in a round-robin way), and
testing was on 90 neutral tokens and 1 stressed token (per stress style).
Tests with simulated and actual stress conditions show that TEO-CB-AutoEnv performs best compared to pitch and MFCCs, partly because the TEO-CB-AutoEnv feature does not depend on the accuracy with which pitch information is extracted.
Text-independent Pairwise Stress Classification:
Similar (but slightly lower) performance was seen for TEO-CB-AutoEnv, pitch, and MFCC.
The MFCC performance degradation on this out-of-vocabulary test is larger because these features depend on the vocal tract spectral structure and are mainly designed for speech recognition, so they rely on the text of the test sequence.
Comparing the performance of a feature across different speaking styles leads to the conclusion that some acoustic-information overlap does exist between the speaking styles, more so between neutral vs. loud and neutral vs. Lombard. Another reason would be that some speakers may NOT be good at portraying a particular style correctly. There is also an overlap between loud and angry to a certain extent.
Multistyle stress classification:
The test utterance is scored against four HMM models (neutral, angry, Lombard, loud):
neutral token in, decision neutral: correct decision;
angry (/Lombard/loud) token in, decision neutral: wrong;
angry (/Lombard/loud) token in, decision NOT neutral: correct.
This gives the confusion-matrix structure. Results indicate that the pitch and TEO-CB-AutoEnv features outperform MFCCs.
Another experiment on getting both speech recognition and stress recognition correct indicates that TEO-CB-AutoEnv CANNOT do a better job, which is understandable since it models excitation variations, while MFCC is meant for speech recognition.
This particular observation hints that a two-stage strategy can be implemented for ASR, in which TEO-CB-AutoEnv detects stress, and then MFCC is used for ASR.
Questions that can be raised or need to be addressed:
1. What is the Lombard effect? Explain in detail the effect and its consequences on the speech production system and on human listening ability.
2. how is channel noise characterized? what do you mean by additive noise, convolutive noise? what is the impact of each on signal, and how do you mitigate this problem?
what is multi-style training
What does MFCC represent? how is it extracted -
Derive Teager energy equation - continuous domain, discrete time.
Give the correlation of the TEO equation with the spring-mass (damper) system
Derive the DESA and DESA-II algorithms
Amplitude modulation, frequency modulation basic equations
Absolute magnitude difference function for pitch extraction
What are the properties of autocorrelation?
where is autocorrelation applied?
derive the equation for autocorrelation on TEO profile, compute the area of autocorrelation envelope (derive)
What is the area under the curve for a DC value signal
analysis / comments on the results obtained.






