Voice Spoofing Detection Using Long Short-Term Memory Models with Mel-Spectrogram Analysis
DOI:
https://doi.org/10.47611/jsrhs.v14i1.8746
Keywords:
Voice Spoofing, Frequency Masking, Convolutional Neural Network
Abstract
Voice spoofing prevention has become a primary concern because of the growing number of frauds and scams that use machine learning technology. To address this issue, I propose a novel voice spoofing detection system that classifies a voice as either genuine or artificially generated. The system uses the Fourier transform to convert the voice input into a mel-spectrogram, which is then processed by convolutional neural networks and long short-term memory (LSTM) networks for classification. Comprehensive experiments demonstrate that the proposed method classifies input signals accurately. In addition, I conducted frequency masking experiments to study how specific frequency bands relate to real-versus-fake classification performance. I expect that this method will inspire further innovations to prevent voice spoofing attacks.
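The front end described in the abstract (Fourier transform, mel-spectrogram, frequency masking) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the author's implementation: the HTK-style mel formula, the Hann window, the tiny frame and filter sizes, the zero-fill masking rule, and every function name here are assumptions chosen for readability. The CNN and LSTM classifier stages are not shown.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    # Naive O(n^2) DFT kept for readability; real code would use an FFT.
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    max_mel = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8):
    # Returns a list of n_mels rows, each holding one value per time frame.
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n_fft)
              for t in range(n_fft)]  # Hann window
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(n_fft)]
        power = power_spectrum(frame)
        frames.append([sum(w * p for w, p in zip(row, power)) for row in bank])
    return [list(col) for col in zip(*frames)]


def frequency_mask(mel_spec, start_bin, width):
    # Copy of the spectrogram with mel bins [start_bin, start_bin + width) zeroed.
    return [[0.0] * len(row) if start_bin <= i < start_bin + width else list(row)
            for i, row in enumerate(mel_spec)]


# Usage: a 1 kHz test tone sampled at 8 kHz (synthetic data, not from the paper).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
spec = mel_spectrogram(tone, sample_rate)
masked = frequency_mask(spec, start_bin=3, width=2)
band_energy = [sum(row) for row in spec]
strongest_band = band_energy.index(max(band_energy))
```

In a masking experiment of the kind the abstract mentions, one would presumably feed both `spec` and `masked` to the trained classifier and compare accuracy, repeating the comparison for different values of `start_bin` to see which bands matter most. The exact masking protocol used in the paper is not stated in this section.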
Copyright (c) 2025 IAN BAEK; Sojung Min

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.


