Year: 2021 | Volume: 12 | Issue: 4 | Page Range: 581-589 | Text Language: Turkish | DOI: 10.24012/dumf.1001914 | Index Date: 18-01-2022

Konuşmalardaki duygunun evrişimsel LSTM modeli ile tespiti

Abstract:
Speech emotion recognition (SER) is the process of recognizing emotions through speech signals. Although humans perform this task efficiently as a natural part of communication, recognizing emotions with programmable devices is still an ongoing field of study. Because machines that can perceive emotions would appear and behave more like humans, speech emotion recognition plays an important role in the development of human-computer interaction. Various SER techniques have been developed over the past decade, but the problem has not yet been fully solved. This paper proposes a speech emotion recognition technique based on the combination of two deep learning architectures, the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). CNNs have proven effective in local feature selection, while LSTMs have shown great success in the sequential processing of long texts. The proposed Convolutional LSTM (Co-LSTM) approach aims to build an effective automatic emotion detection method for human-machine communication. In the proposed method, an image-like feature matrix is first extracted from the speech signal using Mel Frequency Cepstral Coefficients (MFCC) and then reduced to one dimension. Co-LSTM is then used as the feature selection and classification method to train the model. The experimental analyses were carried out on the classification of all eight emotions of speech from the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) and TESS (Toronto Emotional Speech Set) databases. An accuracy of 86.7% was achieved with Co-LSTM using MFCC spectrogram features. Compared with previous studies and other well-known classifiers, the results convincingly demonstrate the effectiveness of the proposed algorithm.

Convolutional LSTM model for speech emotion recognition

Abstract:
Speech emotion recognition (SER) is the task of recognizing emotions from speech signals. While people are capable of performing this task efficiently as a natural aspect of speech communication, it is still a work in progress to automate it using programmable devices. Speech emotion recognition plays an important role in the development of human-computer interaction, since adding emotions to machines makes them appear and act in a human-like manner. Various SER techniques have been developed over the last few decades, but the problem has not yet been completely solved. This paper proposes a speech emotion recognition technique based on the hybrid of two deep learning architectures, namely the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). Deep CNN has demonstrated its effectiveness in local feature selection, whereas LSTM has shown great success in the sequential processing of large texts. The proposed Convolutional LSTM (Co-LSTM) approach aims to create an efficient automatic method of emotion detection in human-machine communication. In the suggested method, Mel Frequency Cepstral Coefficients (MFCC) are used to extract a matrix of spectral features from the speech signal, which is afterward converted to a 1-dimensional (1D) array. Then, Co-LSTM is employed as a feature selection and classification method to learn the model for emotion recognition. The experimental analyses were carried out on the classification of all eight emotions of speech from the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) and TESS (Toronto Emotional Speech Set) databases. An accuracy of 86.7% was achieved with Co-LSTM using MFCC spectrogram features. The obtained results convincingly prove the effectiveness of the proposed algorithm when compared to previous works and other well-known classifiers.
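The front end of the described pipeline (an MFCC matrix extracted from the speech signal, then reduced to one dimension) can be illustrated with a short sketch. This is a minimal illustration assuming librosa, a standard Python audio library; the choice of n_mfcc=40 and mean-pooling over the time axis are assumptions for illustration, not parameters stated in the abstract.

```python
import numpy as np
import librosa  # standard Python audio library; assumed here for illustration


def extract_mfcc_vector(wav_path: str, n_mfcc: int = 40) -> np.ndarray:
    """Compute an MFCC matrix from a speech file and reduce it to a 1-D vector."""
    y, sr = librosa.load(wav_path, sr=None)                  # keep the file's native sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    # The abstract only says the matrix is "converted to a 1-dimensional array";
    # averaging over the time axis is one common way to do that and is assumed here.
    return np.mean(mfcc, axis=1)                             # shape: (n_mfcc,)
```
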
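The classification stage, a hybrid of a CNN front end and an LSTM, could look roughly like the following Keras sketch. Layer counts, filter sizes, and the optimizer are illustrative assumptions; the abstract does not specify the architecture at this level of detail.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_co_lstm(n_features: int = 40, n_classes: int = 8) -> tf.keras.Model:
    """Illustrative Conv1D + LSTM ("Co-LSTM") classifier for eight emotion classes."""
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),                   # 1-D MFCC vector, one channel
        layers.Conv1D(64, kernel_size=5, activation="relu"),   # CNN part: local feature selection
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(128),                                      # LSTM part: sequential modelling
        layers.Dense(n_classes, activation="softmax"),         # one of the eight emotions
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A model like this would be trained on the pooled MFCC vectors reshaped to (samples, n_features, 1), e.g. model.fit(X[..., None], y); the 86.7% figure above refers to the authors' own configuration, not to this sketch.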

Document Type: Article | Article Type: Research Article | Access Type: Open Access
APA Öztürk, Ö., & Pashaei, E. (2021). Konuşmalardaki duygunun evrişimsel LSTM modeli ile tespiti. Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi, 12(4), 581-589. https://doi.org/10.24012/dumf.1001914
Chicago Öztürk, Ömer Faruk, and Elham Pashaei. "Konuşmalardaki duygunun evrişimsel LSTM modeli ile tespiti." Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi 12, no. 4 (2021): 581-589. https://doi.org/10.24012/dumf.1001914
MLA Öztürk, Ömer Faruk, and Elham Pashaei. "Konuşmalardaki duygunun evrişimsel LSTM modeli ile tespiti." Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi, vol. 12, no. 4, 2021, pp. 581-589. https://doi.org/10.24012/dumf.1001914
AMA Öztürk Ö, Pashaei E. Konuşmalardaki duygunun evrişimsel LSTM modeli ile tespiti. Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi. 2021;12(4):581-589. doi:10.24012/dumf.1001914
Vancouver Öztürk Ö, Pashaei E. Konuşmalardaki duygunun evrişimsel LSTM modeli ile tespiti. Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi. 2021;12(4):581-589. doi:10.24012/dumf.1001914
IEEE Ö. Öztürk and E. Pashaei, "Konuşmalardaki duygunun evrişimsel LSTM modeli ile tespiti," Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi, vol. 12, no. 4, pp. 581-589, 2021, doi: 10.24012/dumf.1001914.
ISNAD Öztürk, Ömer Faruk - Pashaei, Elham. "Konuşmalardaki duygunun evrişimsel LSTM modeli ile tespiti". Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi 12/4 (2021), 581-589. https://doi.org/10.24012/dumf.1001914