This repository contains the data for the six aligned languages of the BibleTTS corpus (Asante Twi, Akuapem Twi, Ewe, Hausa, Lingala, Yoruba).
This data has been automatically verse-aligned and filtered for TTS training.
Each .tgz file contains speech files for individual verses and the corresponding transcripts for each standardized split per language (train, dev, test). Files in each split are grouped into subdirectories by book.
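As a rough illustration of the layout described above, the following Python sketch indexes the speech files in one archive by split and book without extracting it. The exact member paths are an assumption: the sketch supposes each path ends in <split>/<book>/<file>.flac, and the function name index_archive is hypothetical. Adjust the path handling if the archives are organized differently.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def index_archive(tgz_path):
    """Group the .flac members of a .tgz archive by (split, book).

    Assumption (not confirmed by the dataset description): member paths
    end in <split>/<book>/<file>.flac.
    """
    index = defaultdict(list)
    with tarfile.open(tgz_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            # Skip transcripts and anything too shallow to carry split/book.
            if path.suffix != ".flac" or len(path.parts) < 3:
                continue
            split, book = path.parts[-3], path.parts[-2]
            index[(split, book)].append(member.name)
    return dict(index)
```

Calling index_archive on a downloaded archive returns a dictionary whose keys are (split, book) pairs, which gives a quick way to check how many verses each book contributes to train, dev, and test.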
The speech data is distributed as FLAC files in the original 48 kHz mono format; you may want to resample it for TTS training.
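One possible way to resample is to call ffmpeg from Python, as in the minimal sketch below. It assumes ffmpeg is installed and on the PATH; the 22050 Hz target is only an illustrative default (a rate often used for TTS models), not a recommendation from the corpus authors, and both function names are hypothetical.

```python
import subprocess


def ffmpeg_resample_cmd(src, dst, sample_rate=22050):
    """Build an ffmpeg command that resamples src to sample_rate Hz.

    -y overwrites dst if it exists; -ar sets the output sampling rate.
    The 22050 Hz default is an assumption chosen for illustration.
    """
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(sample_rate), str(dst)]


def resample(src, dst, sample_rate=22050):
    """Run ffmpeg and raise CalledProcessError if it fails."""
    subprocess.run(ffmpeg_resample_cmd(src, dst, sample_rate), check=True)
```

For example, resample("verse.flac", "verse_22k.flac") would write a 22050 Hz copy next to the original; the file names here are placeholders.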
For more information, see the dataset paper cited below.
Citation: If you use the BibleTTS corpus in your work, please cite the dataset paper:
@inproceedings{meyer2022bibletts,
  title={BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus},
  author={Josh Meyer and David Adelani and Edresson Casanova and Alp {\"O}ktem and Daniel Whitenack and Julian Weber and Salomon Kabongo Kabenamualu and Elizabeth Salesky and Iroro Orife and Colin Leong and Perez Ogayo and Chris Chinenye Emezue and Jonathan Mukiibi and Salomey Osei and Apelete Agbolo and Victor Akinode and Bernard Opoku and Olanrewaju Samuel and Jesujoba Alabi and Shamsuddeen Hassan Muhammad},
  booktitle={Interspeech},
  publisher={{ISCA}},
  year={2022},
  url={https://arxiv.org/pdf/2207.03546.pdf}
}