Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours. We have made the corpus freely available for download, along with an acoustic model trained on it. We also built a 3-gram KenLM language model from an open Common Crawl corpus. The main project page: Golos GitHub repository.
| Domain | Train utterances | Train hours | Test utterances | Test hours |
|---|---|---|---|---|
| Crowd | 979 796 | 1 095 | 9 994 | 11.2 |
| Farfield | 124 003 | 132.4 | 1 916 | 1.4 |
| Total | 1 103 799 | 1 227.4 | 11 910 | 12.6 |
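The Total row and the "about 1240 hours" figure can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the table above, and the variable names are illustrative rather than part of any Golos tooling.

```python
# Figures copied from the table above:
# (train utterances, train hours, test utterances, test hours).
splits = {
    "Crowd": (979_796, 1095.0, 9_994, 11.2),
    "Farfield": (124_003, 132.4, 1_916, 1.4),
}

# Sum each column across the two domains.
totals = tuple(sum(column) for column in zip(*splits.values()))
train_utts, train_hours, test_utts, test_hours = totals

# Hours are rounded to one decimal to hide floating-point noise.
print(train_utts, round(train_hours, 1), test_utts, round(test_hours, 1))

# Overall duration (train plus test) in hours.
overall_hours = round(train_hours + test_hours, 1)
print(overall_hours)
```

The sums reproduce the Total row, and the train and test hours together give the overall duration quoted in the introduction.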
The corpus is distributed as the archives listed below. The manifest files with all the training transcriptions are in the train_crowd9.tar archive.

- train_farfield.tar [15.4 GB]
- train_crowd0.tar [11 GB]
- train_crowd1.tar [14 GB]
- train_crowd2.tar [13.2 GB]
- train_crowd3.tar [11.6 GB]
- train_crowd4.tar [15.8 GB]
- train_crowd5.tar [13.1 GB]
- train_crowd6.tar [15.7 GB]
- train_crowd7.tar [12.7 GB]
- train_crowd8.tar [12.2 GB]
- train_crowd9.tar [8.08 GB]
- test.tar [1.3 GB]
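Once an archive is downloaded, its contents can be inspected without unpacking it. The sketch below uses Python's standard `tarfile` module to list files that look like manifests. The helper name `list_manifest_candidates`, the local path, and the `.jsonl`/`.json` suffixes are assumptions made for illustration; the text above does not state the manifest file names or format, so adjust the suffixes to whatever the archive actually contains.

```python
import tarfile
from pathlib import Path


def list_manifest_candidates(archive_path, suffixes=(".jsonl", ".json")):
    """Return names of regular files in a tar archive whose suffix
    suggests a manifest. The default suffixes are a guess, not a
    documented property of the Golos archives."""
    with tarfile.open(archive_path, "r") as archive:
        return [
            member.name
            for member in archive.getmembers()
            if member.isfile() and Path(member.name).suffix in suffixes
        ]


if __name__ == "__main__":
    # Hypothetical location: the archive is assumed to sit in the
    # current working directory after download.
    archive = Path("train_crowd9.tar")
    if archive.exists():
        for name in list_manifest_candidates(archive):
            print(name)
```

Listing members reads only the tar headers, so it avoids writing several gigabytes of audio to disk before you know where the transcriptions are.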
You can cite the data using the following BibTeX entry:
@article{karpov2021golos,
  title={Golos: Russian Dataset for Speech Research},
  author={Karpov, Nikolay and Denisenko, Alexander and Minkin, Fedor},
  journal={arXiv preprint arXiv:2106.10161},
  year={2021}
}
To contact us, please create an issue in the Golos GitHub repository.