Open Speech and Language Resources



Sagalee

Identifier: SLR157

Summary: Automatic Speech Recognition Dataset for the Oromo Language

Category: Speech

License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Downloads (use a mirror closer to you):
sagalee.7z [10G]   ( speech and transcripts )   Mirrors: [US]   [EU]   [CN]  

About this resource:

This dataset is an Automatic Speech Recognition (ASR) resource for the Oromo language, a widely spoken language in Ethiopia and neighboring regions.
The dataset was collected from diverse speakers using a mobile app.

This work was accepted for presentation at ICASSP 2025, and its development was partly supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62301075. The paper is available on arXiv (https://arxiv.org/abs/2502.00421), and the training code is available at https://github.com/turinaf/sagalee

You can cite the data using the following BibTeX entry:

  @misc{abu2025sagaleeopensourceautomatic,
    title={Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language}, 
    author={Turi Abu and Ying Shi and Thomas Fang Zheng and Dong Wang},
    year={2025},
    eprint={2502.00421},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2502.00421}, 
  }