Abstract
Speech Emotion Recognition (SER) has emerged as a critical component of the next generation of human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3%, a 6% improvement over current state-of-the-art unimodal models, and is comparable with multimodal models that leverage textual information as well as audio signals.
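The dual-level flow described in the abstract can be illustrated with a minimal PyTorch sketch. The layer sizes, the number of emotion classes, and the simple concatenation used to fuse the two mel-spectrogram streams are illustrative assumptions, not the authors' DS-LSTM cell; the sketch only shows one LSTM branch over MFCC features, a second branch over the two mel-spectrograms, and the averaging of the two branch predictions.

```python
# Minimal sketch of the dual-level idea (not the authors' exact DS-LSTM cell).
# Assumptions: batch-first tensors, 4 emotion classes, time-aligned mel streams,
# and concatenation as a stand-in for the DS-LSTM fusion described in the paper.
import torch
import torch.nn as nn


class DualLevelSER(nn.Module):
    def __init__(self, n_mfcc=40, n_mels=128, hidden_dim=128, n_classes=4):
        super().__init__()
        # Branch 1: standard LSTM over the MFCC feature sequence.
        self.mfcc_lstm = nn.LSTM(n_mfcc, hidden_dim, batch_first=True)
        self.mfcc_fc = nn.Linear(hidden_dim, n_classes)
        # Branch 2: stand-in for the DS-LSTM, processing the two
        # mel-spectrograms (different time-frequency resolutions) jointly.
        self.mel_lstm = nn.LSTM(2 * n_mels, hidden_dim, batch_first=True)
        self.mel_fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, mfcc, mel_a, mel_b):
        # mfcc: (batch, T_mfcc, n_mfcc); mel_a, mel_b: (batch, T_mel, n_mels)
        _, (h_mfcc, _) = self.mfcc_lstm(mfcc)
        logits_mfcc = self.mfcc_fc(h_mfcc[-1])

        mel = torch.cat([mel_a, mel_b], dim=-1)
        _, (h_mel, _) = self.mel_lstm(mel)
        logits_mel = self.mel_fc(h_mel[-1])

        # Average the two branch predictions for the final classification.
        return (logits_mfcc + logits_mel) / 2


model = DualLevelSER()
mfcc = torch.randn(8, 300, 40)     # dummy MFCC sequences
mel_a = torch.randn(8, 150, 128)   # dummy mel-spectrogram, resolution A
mel_b = torch.randn(8, 150, 128)   # dummy mel-spectrogram, resolution B
print(model(mfcc, mel_a, mel_b).shape)  # torch.Size([8, 4])
```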
| Original language | English (US) |
| --- | --- |
| Title of host publication | 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 6474-6478 |
| Number of pages | 5 |
| ISBN (Electronic) | 9781509066315 |
| DOIs | |
| State | Published - May 2020 |
| Event | 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Barcelona, Spain. Duration: May 4, 2020 → May 8, 2020 |
Publication series
| Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
| --- | --- |
| Volume | 2020-May |
| ISSN (Print) | 1520-6149 |
Conference
| Conference | 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 |
| --- | --- |
| Country/Territory | Spain |
| City | Barcelona |
| Period | 5/4/20 → 5/8/20 |
Bibliographical note
Funding Information: This work was supported in part by Office of Naval Research Grant No. N00014-18-1-2244.
Publisher Copyright:
© 2020 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
Keywords
- Dual-Level Model
- Dual-Sequence LSTM
- LSTM
- Mel-Spectrogram
- Speech Emotion Recognition