Voice-Changing Detection with Convolutional Neural Network
Keywords: voice-changing, Convolutional Neural Network

Abstract
Voice-changing is a voice transformation technique that directly modifies a voice's pitch, tone, and timbre. Like other forms of voice transformation, voice-changing poses economic, social, and legal threats to society; in this study, we tackle the problem with a convolutional neural network. To the best of our knowledge, our proposed method is the first to distinguish voice-changed voices from authentic ones. We also conducted experiments that yield insights useful for further research on voice-changing and audio-related machine learning. First, we find an architecture with four convolutional layers to be the most effective. Second, we find a strong positive correlation between training-data size and performance, in terms of both accuracy and stability. Finally, we discover that different training languages yield very different performance, with Chinese substantially outperforming English as a training language. This finding motivated a supplementary experiment, from which we conclude that tone is one factor that makes a training language better suited to spotting voice-changed voices.
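The abstract does not specify the layer configuration beyond the count of four convolutional layers, so the following is only a minimal NumPy sketch of the kind of classifier described: four convolution/pool blocks over a spectrogram-like input, ending in a sigmoid that scores a clip as voice-changed vs. authentic. The channel widths, 3x3 kernels, input size, and random weights are all illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w, b):
    """Valid 2D convolution: x is (C_in, H, W), w is (C_out, C_in, kH, kW)."""
    c_out, _, kh, kw = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - kh + 1, wd - kw + 1))
    for o in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[o, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * w[o]) + b[o]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2(x):
    """2x2 max pooling, dropping any odd trailing row/column."""
    c, h, w = x.shape
    h2, w2 = h // 2, w // 2
    return x[:, :h2 * 2, :w2 * 2].reshape(c, h2, 2, w2, 2).max(axis=(2, 4))

# Four conv blocks with hypothetical channel widths 1 -> 8 -> 16 -> 32 -> 32.
channels = [1, 8, 16, 32, 32]
weights = [(rng.standard_normal((channels[k + 1], channels[k], 3, 3)) * 0.1,
            np.zeros(channels[k + 1])) for k in range(4)]
w_fc = rng.standard_normal(channels[-1]) * 0.1  # dense layer -> single logit

def predict(spectrogram):
    """Return a score in (0, 1) for a (1, H, W) spectrogram-like array."""
    x = spectrogram
    for w, b in weights:                 # four conv -> ReLU -> pool blocks
        x = maxpool2(relu(conv2d(x, w, b)))
    feat = x.mean(axis=(1, 2))           # global average pooling -> (32,)
    logit = feat @ w_fc
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid: voice-changed vs. authentic

spec = rng.standard_normal((1, 64, 64))  # stand-in for a log-mel spectrogram
prob = predict(spec)
```

With untrained random weights the score is of course meaningless; the sketch is only meant to make the "four convolutional layers" finding concrete. A real pipeline would extract log-mel spectrograms from audio and train the weights on labeled voice-changed and authentic clips.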
License
Copyright (c) 2022 Chuntung Zhuang

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The copyright holder for this article has granted JSR.org a license to display the article in perpetuity.
