Preprint / Version 1

Voice-Changing Detection with Convolutional Neural Network

Authors

  • Chuntung Zhuang

Keywords:

voice-changing, Convolutional Neural Network

Abstract

Voice-changing is a voice transformation technique that directly modifies a voice’s pitch, tone, and timbre. Like other forms of voice transformation, it poses economic, social, and legal threats; in this study, we address the problem with a convolutional neural network. To the best of our knowledge, ours is the first approach to distinguish voice-changed speech from authentic speech. We also conducted experiments whose insights can facilitate further research on voice-changing and audio-related machine learning. First, we find an architecture with four convolutional layers to be the most effective. Second, we observe a strong positive correlation between training data size and performance, in terms of both accuracy and stability. Finally, we find that the training language substantially affects performance, with Chinese far outperforming English. Motivated by this finding, we conducted a supplementary experiment and conclude that tone is one factor that makes a training language better suited to spotting voice-changed speech.
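
For readers who want a concrete picture of the kind of model the abstract describes, the following PyTorch sketch shows a binary classifier with four convolutional layers over log-mel spectrograms, separating voice-changed from authentic speech. The channel widths, kernel sizes, pooling scheme, and 64-band mel input are illustrative assumptions; the abstract states only that a four-convolutional-layer architecture performed best, not its exact hyperparameters.

    # Illustrative sketch, not the authors' exact model: four convolutional
    # layers over log-mel spectrograms, classifying authentic vs. voice-changed.
    import torch
    import torch.nn as nn

    class VoiceChangeDetector(nn.Module):
        def __init__(self, n_mels: int = 64):
            super().__init__()

            def block(c_in, c_out):
                # Conv -> BatchNorm -> ReLU -> downsample, a standard audio-CNN block.
                return nn.Sequential(
                    nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                    nn.BatchNorm2d(c_out),
                    nn.ReLU(inplace=True),
                    nn.MaxPool2d(2),
                )

            # Four convolutional layers, mirroring the best-performing depth
            # reported in the abstract (widths are assumptions).
            self.features = nn.Sequential(
                block(1, 16), block(16, 32), block(32, 64), block(64, 128)
            )
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),  # pool over time and frequency
                nn.Flatten(),
                nn.Linear(128, 2),        # logits: authentic vs. voice-changed
            )

        def forward(self, spec: torch.Tensor) -> torch.Tensor:
            # spec: (batch, 1, n_mels, n_frames) log-mel spectrogram
            return self.classifier(self.features(spec))

    if __name__ == "__main__":
        model = VoiceChangeDetector()
        dummy = torch.randn(4, 1, 64, 256)  # batch of 4 spectrogram excerpts
        print(model(dummy).shape)           # torch.Size([4, 2])

Global average pooling keeps the classifier independent of clip length, which is convenient when training excerpts vary in duration.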


Posted

03-31-2022