Türkçe Konuşma Sentezleme Sistemlerinin Geliştirilmesi için Dengeli Bir Veri Kümesi Hazırlama (Preparing a Balanced Corpus for the Development of Turkish Speech Synthesis Systems)

Oyucu, Saadin; Cücen, Mustafa Sami; Polat, Hüseyin

doi:10.17671/gazibtd.1159289

Saadin OYUCU (Adıyaman University, Faculty of Engineering, Department of Computer Engineering, Adıyaman, Türkiye)

Mustafa Sami CÜCEN (Ostim Technical University, Faculty of Engineering, Department of Computer Engineering, Ankara, Türkiye)

Hüseyin POLAT (Gazi University, Faculty of Engineering, Department of Computer Engineering, Ankara, Türkiye)

Bilişim Teknolojileri Dergisi

Year: 2023 Volume: 16 Issue: 3 Pages: 237-249 Language: Turkish DOI: 10.17671/gazibtd.1159289 Index Date: 13-08-2023

Abstract:

Speech synthesis (TTS: Text-to-Speech) systems are an important part of human-computer interaction. In TTS, a sequence of spectrograms corresponding to a sequence of text is predicted. The resulting spectrogram sequence is then converted into an audio waveform that humans can hear. Because development resources are scarce, TTS systems do not perform equally well across languages. Developing a TTS system efficiently requires a large, accessible speech dataset. For low-resource languages such as Turkish, the lack of speech datasets is one of the biggest obstacles to developing TTS systems. Preparing a large dataset is a time-consuming, difficult, and costly task. In this study, a dataset that can be used to develop Turkish TTS systems was prepared. Previously prepared text data was read aloud by a male speaker in Istanbul Turkish, in an emotionally neutral style. The text data contains 109,826 words. The recorded speech is approximately 12 hours 38 minutes 59 seconds long and was recorded at a sampling rate of 22,050 Hz. This Turkish dataset was compared with “The LJ Speech Dataset”, which was prepared earlier for English and has yielded successful results, and suggestions for future work are presented. The dataset was prepared to encourage academic work on Turkish TTS. To observe how the dataset performs, a GlowTTS model was trained on it, and a Turkish TTS system was built with the trained model. Comparing speech synthesized by this system with natural speech yielded a MOS-LQO score of 2.12. These initial results show that the dataset makes an effective contribution to the development of Turkish TTS systems.

Keywords: Speech synthesis; Text-to-speech systems; Turkish speech synthesis; Deep learning

Preparing A Balanced Corpus for Development of Turkish Speech Synthesis Systems

Abstract:

Speech synthesis systems are an important part of human-computer interaction. In speech synthesis, a sequence of spectrograms corresponding to the input text is predicted, and the resulting spectrogram sequence is converted into an audio waveform that people can hear. The success of speech synthesis systems is not at the same level across languages because development resources are lacking. To train a speech synthesis system efficiently, a large, accessible corpus is needed. The lack of such a corpus for low-resource languages such as Turkish is the biggest obstacle to developing Turkish speech synthesis systems. Preparing a large corpus is a time-consuming, challenging, and costly task. This study describes the process of creating an accessible corpus for developing Turkish speech synthesis systems and improving their naturalness and intelligibility, together with the difficulties encountered. The previously compiled text data for the corpus was voiced by a male speaker in Istanbul Turkish, in an emotionally neutral style. The text data contains 109,826 words. The speech data is approximately 12 hours 38 minutes 59 seconds long and was recorded at a sampling rate of 22,050 Hz. The corpus was compared with “The LJ Speech Dataset”, which was previously prepared for English and has produced successful results, and suggestions for future studies are presented. The corpus was developed to encourage academic research on Turkish speech synthesis. In this way, we hope that a major deficiency in the development of Turkish speech synthesis systems will be eliminated.

Keywords: Text-to-speech system; Speech synthesis; Turkish speech synthesis; Deep learning
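
The corpus figures quoted in the abstract (109,826 words, roughly 12 hours 38 minutes 59 seconds of speech, 22,050 Hz sampling rate) can be combined into a few derived quantities. The following is a minimal sketch, not part of the article: the function names are hypothetical, the sample count assumes a single (mono) channel, which the abstract does not state, and all results are approximate because the reported duration is itself approximate.

```python
# Figures quoted in the abstract.
WORD_COUNT = 109_826
HOURS, MINUTES, SECONDS = 12, 38, 59
SAMPLE_RATE_HZ = 22_050


def total_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s duration into whole seconds."""
    return hours * 3600 + minutes * 60 + seconds


def corpus_stats(word_count: int, duration_s: int, sample_rate_hz: int) -> dict:
    """Derive simple statistics from the reported corpus figures.

    Assumption (not stated in the abstract): one audio channel,
    so samples = seconds * sampling rate.
    """
    return {
        "duration_s": duration_s,
        "total_samples": duration_s * sample_rate_hz,
        "words_per_minute": word_count / (duration_s / 60),
    }


duration = total_seconds(HOURS, MINUTES, SECONDS)
stats = corpus_stats(WORD_COUNT, duration, SAMPLE_RATE_HZ)

for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the recording corresponds to about 45,539 seconds of audio and an average reading rate of roughly 145 words per minute. That rate includes any silence inside the recordings, so it is only a coarse indicator.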

Document Type: Article Article Type: Research Article Access Type: Open Access

APA	Oyucu, S., Cücen, M. S., & Polat, H. (2023). Türkçe Konuşma Sentezleme Sistemlerinin Geliştirilmesi için Dengeli Bir Veri Kümesi Hazırlama. Bilişim Teknolojileri Dergisi, 16(3), 237-249. 10.17671/gazibtd.1159289
Chicago	Oyucu, Saadin, Mustafa Sami Cücen, and Hüseyin Polat. "Türkçe Konuşma Sentezleme Sistemlerinin Geliştirilmesi için Dengeli Bir Veri Kümesi Hazırlama." Bilişim Teknolojileri Dergisi 16, no. 3 (2023): 237-249. 10.17671/gazibtd.1159289
MLA	Oyucu, Saadin, Mustafa Sami Cücen, and Hüseyin Polat. "Türkçe Konuşma Sentezleme Sistemlerinin Geliştirilmesi için Dengeli Bir Veri Kümesi Hazırlama." Bilişim Teknolojileri Dergisi, vol. 16, no. 3, 2023, pp. 237-249. 10.17671/gazibtd.1159289
AMA	Oyucu S, Cücen MS, Polat H. Türkçe Konuşma Sentezleme Sistemlerinin Geliştirilmesi için Dengeli Bir Veri Kümesi Hazırlama. Bilişim Teknolojileri Dergisi. 2023;16(3):237-249. 10.17671/gazibtd.1159289
Vancouver	Oyucu S, Cücen MS, Polat H. Türkçe Konuşma Sentezleme Sistemlerinin Geliştirilmesi için Dengeli Bir Veri Kümesi Hazırlama. Bilişim Teknolojileri Dergisi. 2023;16(3):237-249. 10.17671/gazibtd.1159289
IEEE	S. Oyucu, M. S. Cücen, and H. Polat, "Türkçe Konuşma Sentezleme Sistemlerinin Geliştirilmesi için Dengeli Bir Veri Kümesi Hazırlama," Bilişim Teknolojileri Dergisi, vol. 16, no. 3, pp. 237-249, 2023. 10.17671/gazibtd.1159289
ISNAD	Oyucu, Saadin et al. "Türkçe Konuşma Sentezleme Sistemlerinin Geliştirilmesi için Dengeli Bir Veri Kümesi Hazırlama". Bilişim Teknolojileri Dergisi 16/3 (2023), 237-249. https://doi.org/10.17671/gazibtd.1159289