Voice Spoofing Detection Using Long Short-Term Memory Models with Mel-Spectrogram Analysis

Authors

  • Ian Baek, Korean Minjok Leadership Academy
  • Sojung Min, Korean Minjok Leadership Academy

DOI:

https://doi.org/10.47611/jsrhs.v14i1.8746

Keywords:

Voice Spoofing, Frequency Masking, Convolutional Neural Network

Abstract

Preventing voice spoofing has become a primary concern as fraud and scams that rely on machine learning technology become more common. To address this issue, I propose a novel voice spoofing detection system that classifies a voice as either genuine or artificially generated. The system uses the Fourier transform to convert the voice input into a mel-spectrogram, which is then processed by convolutional neural networks and long short-term memory networks for classification. Comprehensive experiments demonstrate that the proposed method classifies input signals with high accuracy. In addition, I conducted frequency masking experiments to study how strongly specific frequency bands contribute to real-versus-fake classification. I expect that this method will inspire further innovations to prevent voice spoofing attacks.
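The front end described in the abstract (Fourier transform, mel-spectrogram, frequency masking) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sampling rate, FFT size, hop length, number of mel bands, the HTK-style mel formula, and all function names below are assumptions chosen for illustration, and the convolutional and long short-term memory classifier stages are omitted.

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style mel scale (an assumption; other conventions exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge points, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb, edges[1:-1]  # filters and their centre frequencies in Hz

def mel_spectrogram(signal, sr, n_fft=512, hop=128, n_mels=40):
    """Power mel-spectrogram, shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    # Short-time Fourier transform: one real FFT per windowed frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    fb, centres = mel_filterbank(sr, n_fft, n_mels)
    return fb @ power.T, centres

def mask_band(mel_spec, centres, f_lo, f_hi):
    """Zero every mel bin whose centre frequency lies in [f_lo, f_hi] Hz."""
    masked = mel_spec.copy()
    masked[(centres >= f_lo) & (centres <= f_hi), :] = 0.0
    return masked

# Usage on a synthetic 1 kHz tone (a stand-in for a real voice recording).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
spec, centres = mel_spectrogram(tone, sr)
masked = mask_band(spec, centres, 800.0, 1200.0)
```

In a masking experiment of the kind the abstract mentions, a classifier would be evaluated on `masked` instead of `spec`, and the change in accuracy would indicate how much the removed band contributes to the decision.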

Published

02-28-2025

How to Cite

Baek, I., & Min, S. (2025). Voice Spoofing Detection Using Long Short-Term Memory Models with Mel-Spectrogram Analysis. Journal of Student Research, 14(1). https://doi.org/10.47611/jsrhs.v14i1.8746

Section

HS Research Projects