GAN-Based End to End Text-to-Speech System for Indonesian Language

  • Moch Azhar Dhiaulhaq, Institut Teknologi Bandung
  • Rizki Rivai Ginanjar, Prosa.ai, PT Prosa Solusi Cerdas
  • Dessi Puji Lestari, Institut Teknologi Bandung

Abstract

Modern text-to-speech (TTS) technology has matured to the point where recent work focuses on optimizing existing systems and on TTS modeling for resource-scarce languages rather than on finding new model architectures. In this paper, a novel approach to building a modern end-to-end (E2E) TTS system for the Indonesian language is proposed, integrating three different generative adversarial network (GAN)-based vocoders for comparison. In the evaluation, the proposed system shows promising results, achieving a mean opinion score (MOS) of 4.60 while maintaining fast inference speed, as shown by a real-time factor (RTF) below one.
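
As context for the two evaluation metrics reported above: the real-time factor (RTF) is the ratio of wall-clock synthesis time to the duration of the generated audio (an RTF below one means faster-than-real-time synthesis), and the mean opinion score (MOS) is the average of listener naturalness ratings on a 1-to-5 scale. The minimal sketch below illustrates how these quantities are typically computed; the `synthesize` callable, the sample rate, and the example ratings are hypothetical placeholders, not taken from the paper.

```python
import time

import numpy as np


def real_time_factor(synthesize, text, sample_rate=22050):
    """RTF = wall-clock synthesis time / duration of the generated audio.

    `synthesize` is a hypothetical callable mapping text to a 1-D numpy
    array of waveform samples; it stands in for an E2E TTS pipeline
    (acoustic model followed by a GAN-based vocoder).
    """
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds  # < 1.0 means faster than real time


def mean_opinion_score(ratings):
    """MOS is the arithmetic mean of listener ratings on a 1-5 scale."""
    return float(np.mean(ratings))


# Usage sketch with assumed values (not the paper's data):
# rtf = real_time_factor(tts.synthesize, "Selamat pagi, apa kabar?")
# mos = mean_opinion_score([5, 4, 5, 5, 4])  # -> 4.6
```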

Published
2022-10-28
How to Cite
DHIAULHAQ, Moch Azhar; GINANJAR, Rizki Rivai; LESTARI, Dessi Puji. GAN-Based End to End Text-to-Speech System for Indonesian Language. Jurnal Linguistik Komputasional, [S.l.], v. 5, n. 2, p. 57 - 62, oct. 2022. ISSN 2621-9336. Available at: <http://inacl.id/journal/index.php/jlk/article/view/115>. Date accessed: 29 jan. 2023. doi: https://doi.org/10.26418/jlk.v5i2.115.
Section
Articles