Abstract:
This study evaluates the Whisper model’s performance on Romanian speech-to-text transcription and investigates how transcription accuracy varies across audio domains. The audio sources (audiobooks, news broadcasts, and official public speeches) were selected because verified reference texts are available for them, which allows each transcript to be aligned with its reference and evaluated reliably. Each domain has distinct linguistic and acoustic characteristics: the structured, clear narration of audiobooks, the dynamic and occasionally noisy environments of live news, and the formal rhetoric of political discourse. Performance is measured with two standard metrics, Word Error Rate (WER) and Character Error Rate (CER), which enables a consistent comparison across domains. By focusing on Romanian, a low-resource language in automatic speech recognition, this study provides novel insights into Whisper’s effectiveness and the influence of the audio domain on transcription quality, contributing to speech recognition for under-resourced languages. Results show that Whisper performs best on scripted, high-quality audio such as audiobooks, while accuracy decreases in more variable and spontaneous contexts, highlighting the model’s sensitivity to content structure and recording conditions.
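For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and CER are commonly defined: the Levenshtein edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the reference length, counted in words for WER and in characters for CER. The function names `edit_distance`, `wer`, and `cer` and the sample strings are illustrative assumptions, not the study's own evaluation code. Text normalization (casing, punctuation, diacritics) is deliberately left out, although it affects the scores in practice.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences of tokens."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: the hypothesis drops one diacritic.
    reference = "bună ziua tuturor"
    hypothesis = "buna ziua tuturor"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this example one of the three reference words differs, so the WER is 1/3, while the CER is much lower because only a single character is wrong. This gap is one reason the two metrics are usually reported together, especially for a language with diacritics such as Romanian.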