Golos dataset

Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours.
We have made the corpus freely available for download, along with an acoustic model trained on it. We also built a 3-gram KenLM language model using an open Common Crawl corpus.
The main project page: Golos GitHub repository

Dataset structure

Domains    Train utterances   Train hours   Test utterances   Test hours
Crowd               979 796       1 095               9 994         11.2
Farfield            124 003         132.4             1 916          1.4
Total             1 103 799       1 227.4            11 910         12.6
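
As a quick sanity check, the per-domain figures in the table can be summed to confirm the Total row and the overall duration of about 1240 hours. This is only an illustrative sketch: the numbers are copied from the table, while the variable and function names are made up for this example.

```python
# Per-domain figures copied from the table above: (utterances, hours).
train = {"Crowd": (979_796, 1095.0), "Farfield": (124_003, 132.4)}
test = {"Crowd": (9_994, 11.2), "Farfield": (1_916, 1.4)}


def totals(split):
    """Sum utterance counts and hours over all domains of one split."""
    utterances = sum(u for u, _ in split.values())
    hours = round(sum(h for _, h in split.values()), 1)
    return utterances, hours


train_total = totals(train)
test_total = totals(test)
# Train and test hours together give the overall corpus duration.
overall_hours = round(train_total[1] + test_total[1], 1)
print(train_total, test_total, overall_hours)
```

The train and test totals match the Total row, and their combined duration is the 1240 hours quoted in the description.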

External URLs

Audio files in opus format

golos_opus.tar [20.5 GB]

Audio files in wav format

Manifest files with all the training transcription texts are in the train_crowd9.tar archive listed below:
train_farfield.tar [15.4 GB]
train_crowd0.tar [11 GB]
train_crowd1.tar [14 GB]
train_crowd2.tar [13.2 GB]
train_crowd3.tar [11.6 GB]
train_crowd4.tar [15.8 GB]
train_crowd5.tar [13.1 GB]
train_crowd6.tar [15.7 GB]
train_crowd7.tar [12.7 GB]
train_crowd8.tar [12.2 GB]
train_crowd9.tar [8.08 GB]
test.tar [1.3 GB]
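
After downloading, an archive can be unpacked and its manifest inspected with the Python standard library alone. The sketch below rests on assumptions that this page does not confirm: that a manifest is a text file with one JSON object per line, and that each object has `audio_filepath`, `text` and `duration` (in seconds) keys. The function names are hypothetical. Check the unpacked files and adjust the keys if they differ.

```python
import json
import tarfile
from pathlib import Path


def extract_archive(tar_path, dest_dir):
    """Unpack one downloaded .tar archive into dest_dir and return that path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        archive.extractall(dest)
    return dest


def parse_manifest(lines):
    """Yield one dict per non-empty line (assumed: one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def total_hours(entries, duration_key="duration"):
    """Sum the assumed per-utterance durations (seconds) and convert to hours."""
    return sum(float(entry[duration_key]) for entry in entries) / 3600.0
```

A typical use would be to call `extract_archive("train_crowd9.tar", "golos")`, open the manifest file found inside, and pass the open file object to `parse_manifest`, which accepts any iterable of lines.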

Acoustic and language models

QuartzNet15x5_golos.nemo [68 MB]
KenLMs.tar [4.8 GB]

Authors (in alphabetical order):

You can cite the data using the following BibTeX entry:

  @article{karpov2021golos,
    title={Golos: Russian Dataset for Speech Research},
    author={Karpov, Nikolay and Denisenko, Alexander and Minkin, Fedor},
    journal={arXiv preprint arXiv:2106.10161},
    year={2021}
  }

To contact us, please create an issue in the Golos GitHub repository.