The Kazakh Speech Corpus (KSC) is a crowdsourced, open-source speech corpus for the Kazakh language. It contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants of both genders from different regions and age groups. The corpus was carefully inspected by native Kazakh speakers to ensure high quality. The dataset is primarily intended for training automatic speech recognition systems.
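
As a rough sanity check on the corpus size, the two headline figures imply an average utterance length of just under 8 seconds. The sketch below only does that arithmetic. The constants are the rounded values quoted above ("around 332 hours", "over 153,000 utterances"), so the result is an approximation, and the function name is a hypothetical helper rather than part of any dataset tooling.

```python
# Rounded figures taken from the description above; both are approximate.
TOTAL_HOURS = 332
NUM_UTTERANCES = 153_000


def mean_utterance_seconds(total_hours: float, num_utterances: int) -> float:
    """Return the average utterance duration in seconds (hypothetical helper)."""
    return total_hours * 3600 / num_utterances


avg = mean_utterance_seconds(TOTAL_HOURS, NUM_UTTERANCES)
print(f"Approximate mean utterance length: {avg:.1f} s")
```

Because the utterance count is stated as a lower bound, the true average is likely slightly below this estimate.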

You can find more information about the dataset here.

To cite the dataset, please use the following BibTeX entry:

@inproceedings{khassanov-etal-2021-crowdsourced,
  title = "A Crowdsourced Open-Source {K}azakh Speech Corpus and Initial Speech Recognition Baseline",
  author = "Yerbolat Khassanov and Saida Mussakhojayeva and Almas Mirzakhmetov and Alen Adiyev and Mukhamet Nurpeiissov and Huseyin Atakan Varol",
  booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
  month = apr,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.eacl-main.58",
  doi = "10.18653/v1/2021.eacl-main.58",
  pages = "697--706"
}