Speech emotion recognition with dual-sequence LSTM architecture

Jianyou Wang, Michael Xue, Ryan Culhane, Enmao Diao, Jie Ding, Vahid Tarokh

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Speech Emotion Recognition (SER) has emerged as a critical component of the next generation of human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DSLSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3%, a 6% improvement over current state-of-the-art unimodal models, and is comparable with multimodal models that leverage textual information as well as audio signals.
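
As a rough illustration of the final averaging step and the two reported metrics, the following Python sketch fuses the outputs of two branches and scores the predictions. It is not the authors' implementation. The function names are hypothetical, and the branch outputs are assumed to be per-class probability vectors, which the abstract does not specify. The metric definitions follow the convention common in the SER literature, also an assumption here: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls.

```python
from collections import defaultdict


def average_outputs(mfcc_probs, dslstm_probs):
    """Average two per-class score vectors element-wise (late fusion).

    Assumption: both branches emit probabilities over the same classes.
    """
    if len(mfcc_probs) != len(dslstm_probs):
        raise ValueError("both branches must score the same set of classes")
    return [(a + b) / 2.0 for a, b in zip(mfcc_probs, dslstm_probs)]


def predict(mfcc_probs, dslstm_probs):
    """Return the index of the class with the highest averaged score."""
    fused = average_outputs(mfcc_probs, dslstm_probs)
    return max(range(len(fused)), key=fused.__getitem__)


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)


if __name__ == "__main__":
    # Toy data only: three utterances of class 0 and one of class 1.
    y_true = [0, 0, 0, 1]
    y_pred = [0, 0, 1, 1]
    print(predict([0.6, 0.4], [0.2, 0.8]))
    print(weighted_accuracy(y_true, y_pred))
    print(unweighted_accuracy(y_true, y_pred))
```

The two metrics diverge when classes are imbalanced: in the toy data, one error on the majority class lowers overall accuracy to 3/4, while the mean per-class recall is (2/3 + 1)/2.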

Original language: English (US)
Title of host publication: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6474-6478
Number of pages: 5
ISBN (Electronic): 9781509066315
State: Published - May 2020
Event: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Barcelona, Spain
Duration: May 4 2020 → May 8 2020

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2020-May
ISSN (Print): 1520-6149

Conference

Conference: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Country: Spain
City: Barcelona
Period: 5/4/20 → 5/8/20

Bibliographical note

Funding Information:
This work was supported in part by Office of Naval Research Grant No. N00014-18-1-2244.

Keywords

  • Dual-Level Model
  • Dual-Sequence LSTM
  • LSTM
  • Mel-Spectrogram
  • Speech Emotion Recognition
