Phonetic analysis of real and synthetic speech using HuBERT embeddings: Perspectives for Deepfake detection
Temmar, Dia Elhak ; Hamadene, Assia ; Nallaguntla, Vamshi ; Fursule, Aishwarya ; Allili, Mohand Saïd ; Kshirsagar, Shruti ; Avila, Anderson R.
Authors
Temmar, Dia Elhak
Hamadene, Assia
Nallaguntla, Vamshi
Fursule, Aishwarya
Allili, Mohand Saïd
Kshirsagar, Shruti
Avila, Anderson R.
Issue Date
2026-01-28
Type
Conference paper
Keywords
Audio DeepFake detection, Hu-BERT, Phoneme and word embedding, Self-supervised speech representation
Citation
D. E. Temmar et al., "Phonetic Analysis of Real and Synthetic Speech Using HuBERT Embeddings: Perspectives for Deepfake Detection," 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Vienna, Austria, 2025, pp. 86-91, doi: 10.1109/SMC58881.2025.11343334.
Abstract
The growing sophistication of speech generated by Artificial Intelligence (AI) has introduced new challenges in audio deepfake detection. Text-to-speech (TTS) and voice conversion (VC) technologies can now produce convincing synthetic speech with high quality and intelligibility. This poses a serious threat to voice biometric security systems, such as automatic speaker recognition. It also increases the risks associated with the spread of spoken disinformation, where synthetic voices can be used to disseminate malicious content. In this study, we conduct an analysis of real and synthetic speech at the phonetic and word levels. To that end, a parallel dataset comprising real and synthetic speech signals was developed based on a subset of the LibriSpeech ASR corpus. Synthetic speech samples were generated using two TTS systems and one VC system: Coqui TTS, VITS TTS, and StarGANv2 VC. We adopted HuBERT, a self-supervised speech model, to extract speech embeddings. The motivation for using this model stems from its ability to recognize sound units corresponding to so-called pseudo-phonemes. Our analysis is based on the Kullback-Leibler divergence (KLD) between the distributions of synthetic and real phonemes, which allowed us to rank synthetic phonemes by how closely they align with their real counterparts. We also trained several classifiers per phoneme to distinguish between real and synthetic samples. We then computed the correlations between KLD and classification accuracy per phoneme. Besides providing a list of the more discriminative phonemes, our findings show that vowels correlate better with the classifiers' performance, suggesting that the KLD can be an indicator of the most distinguishable phonemes for deepfake detection. © 2025 IEEE.
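The per-phoneme analysis described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the abstract does not say how the KLD is estimated, so a diagonal Gaussian is fitted to each embedding set; the embeddings are random toy vectors in place of HuBERT features; the phoneme labels and mean shifts are hypothetical; and a nearest-centroid rule stands in for the unspecified classifiers.

```python
import numpy as np

def gaussian_kl(real, synth, eps=1e-6):
    """KL(real || synth) between diagonal Gaussians fitted to two embedding sets.

    Assumption: a diagonal Gaussian fit; the paper's estimator is not stated.
    """
    mu_r, mu_s = real.mean(axis=0), synth.mean(axis=0)
    var_r = real.var(axis=0) + eps
    var_s = synth.var(axis=0) + eps
    return float(0.5 * np.sum(
        np.log(var_s / var_r) + (var_r + (mu_r - mu_s) ** 2) / var_s - 1.0
    ))

def centroid_accuracy(real, synth):
    """Hold-out accuracy of a nearest-centroid real-vs-synthetic classifier."""
    half = len(real) // 2
    c_real = real[:half].mean(axis=0)    # centroids from the first half
    c_synth = synth[:half].mean(axis=0)
    test = np.vstack([real[half:], synth[half:]])
    labels = np.r_[np.zeros(len(real) - half), np.ones(len(synth) - half)]
    d_real = np.linalg.norm(test - c_real, axis=1)
    d_synth = np.linalg.norm(test - c_synth, axis=1)
    pred = (d_synth < d_real).astype(float)  # 1.0 means "synthetic"
    return float((pred == labels).mean())

rng = np.random.default_rng(0)
dim = 16  # toy dimensionality, far smaller than real HuBERT features

# Hypothetical phoneme labels with made-up mean shifts between real and
# synthetic embeddings; larger shifts mimic more distinguishable phonemes.
shifts = {"AA": 0.8, "IY": 0.4, "S": 0.1}

kld, acc = {}, {}
for phoneme, shift in shifts.items():
    real = rng.normal(0.0, 1.0, size=(500, dim))
    synth = rng.normal(shift, 1.0, size=(500, dim))
    kld[phoneme] = gaussian_kl(real, synth)
    acc[phoneme] = centroid_accuracy(real, synth)

# Rank phonemes from most to least divergent, then correlate KLD with accuracy.
ranking = sorted(kld, key=kld.get, reverse=True)
phonemes = list(shifts)
r = float(np.corrcoef([kld[p] for p in phonemes],
                      [acc[p] for p in phonemes])[0, 1])
print(ranking, round(r, 3))
```

In this toy setting a larger divergence goes with a higher classification accuracy, which is the kind of relationship the abstract reports for vowels; the numbers themselves carry no meaning beyond the illustration.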
Description
Click on the DOI link to access this article at the publisher's website (may not be free).
Publisher
Institute of Electrical and Electronics Engineers Inc.
Journal
Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
Series
2025 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2025
5 October 2025 through 8 October 2025
Hybrid, Vienna
219622
ISSN
1062-922X
