Speech Emotion Recognition (SER) has emerged as a critical component of the next generation of human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DSLSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3% (a 6% improvement over current state-of-the-art unimodal models) and is comparable with multimodal models that leverage textual information as well as audio signals.
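To make the dual-level design concrete, the sketch below shows one hypothetical PyTorch-style arrangement of the two branches: a standard LSTM over MFCC frames and a second branch over the two mel-spectrograms, with the two sets of class logits averaged. This is not the authors' implementation; in particular, the Dual-Sequence LSTM cell is only approximated here by two parallel LSTMs, and all module names, feature dimensions, and the four-class output are illustrative assumptions.

```python
# Hypothetical sketch of the dual-level idea; the real DSLSTM cell is not shown.
import torch
import torch.nn as nn


class DualLevelSER(nn.Module):
    def __init__(self, n_mfcc=40, n_mels=128, hidden=128, n_classes=4):
        super().__init__()
        # Branch 1: standard LSTM over MFCC frames.
        self.mfcc_lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.mfcc_head = nn.Linear(hidden, n_classes)
        # Branch 2: stand-in for the Dual-Sequence LSTM, processing the two
        # mel-spectrograms (different time-frequency resolutions) in parallel.
        self.mel_lstm_a = nn.LSTM(n_mels, hidden, batch_first=True)
        self.mel_lstm_b = nn.LSTM(n_mels, hidden, batch_first=True)
        self.mel_head = nn.Linear(2 * hidden, n_classes)

    def forward(self, mfcc, mel_a, mel_b):
        # mfcc:  (batch, T_mfcc, n_mfcc)
        # mel_a: (batch, T_a, n_mels); mel_b: (batch, T_b, n_mels)
        _, (h_mfcc, _) = self.mfcc_lstm(mfcc)
        _, (h_a, _) = self.mel_lstm_a(mel_a)
        _, (h_b, _) = self.mel_lstm_b(mel_b)
        logits_mfcc = self.mfcc_head(h_mfcc[-1])
        logits_mel = self.mel_head(torch.cat([h_a[-1], h_b[-1]], dim=-1))
        # Average the branch outputs for the final utterance-level prediction.
        return (logits_mfcc + logits_mel) / 2


model = DualLevelSER()
mfcc = torch.randn(8, 300, 40)    # 8 utterances, 300 MFCC frames (assumed shapes)
mel_a = torch.randn(8, 200, 128)  # coarser time-frequency resolution
mel_b = torch.randn(8, 400, 128)  # finer time-frequency resolution
print(model(mfcc, mel_a, mel_b).shape)  # torch.Size([8, 4])
```

Averaging the two branches' logits is a simple late-fusion choice consistent with the abstract's statement that the outputs are averaged to produce the final classification.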
Original language: English (US)
Title of host publication: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 5
State: Published - May 2020
Event: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Barcelona, Spain
Duration: May 4, 2020 → May 8, 2020
Publication series: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Period: 5/4/20 → 5/8/20
Bibliographical note: Funding Information: This work was supported in part by Office of Naval Research Grant No. N00014-18-1-2244.
Keywords:
- Dual-Level Model
- Dual-Sequence LSTM
- Speech Emotion Recognition