This dataset contains 17,090 audio clips, each 30 seconds long, sampled from archives collected from six Guinean radio stations. The broadcasts consist of news and various radio shows in languages including French, Guerze, Koniaka, Kissi, Kono, Maninka, Mano, Pular, Susu, and Toma. Some radio shows include phone calls, background and foreground music, and various types of noise. We collected this dataset for unsupervised speech representation learning. A validation set of 300 tagged audio clips is also included.
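As a rough sense of scale, the clip count and clip length stated above imply a little over 142 hours of audio. The sketch below only reproduces that arithmetic; the constant and function names are illustrative, and it assumes every clip is exactly 30 seconds long.

```python
# Figures taken from the dataset description above.
NUM_CLIPS = 17_090
CLIP_SECONDS = 30  # assumption: every clip is exactly this long


def total_hours(num_clips: int, clip_seconds: int = CLIP_SECONDS) -> float:
    """Return total audio duration in hours for equal-length clips."""
    return num_clips * clip_seconds / 3600


if __name__ == "__main__":
    print(f"Total audio: {total_hours(NUM_CLIPS):.1f} hours")
```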

Please see our paper for more details on this dataset. Additional resources can be found in the following git repository: https://github.com/mdoumbouya/nicolingua

You can cite our work using the following BibTeX entry.

@inproceedings{doumbouya2021usingradio,
  title={Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users},
  author={Doumbouya, Moussa and Einstein, Lisa and Piech, Chris},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={35},
  year={2021}
}