Golos dataset

Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours.
We have made the corpus freely available for download, along with an acoustic model trained on this corpus. We have also built a 3-gram KenLM language model using an open Common Crawl corpus.
The main project page: Golos GitHub repository

Dataset structure

Domains	Train utterances	Train hours	Test utterances	Test hours
Crowd	979 796	1 095	9 994	11.2
Farfield	124 003	132.4	1 916	1.4
Total	1 103 799	1 227.4	11 910	12.6
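
As a quick sanity check, the totals in the table above can be recomputed from the per-domain rows. The snippet below is only an illustrative sketch: the dictionaries and the `totals` helper are hypothetical names, and the figures are copied from the table rather than read from the dataset.

```python
# Figures copied from the table above: (utterances, hours) per domain.
TRAIN = {"Crowd": (979_796, 1095.0), "Farfield": (124_003, 132.4)}
TEST = {"Crowd": (9_994, 11.2), "Farfield": (1_916, 1.4)}


def totals(split):
    """Sum utterance counts and hours (rounded to 0.1 h) over all domains."""
    utterances = sum(u for u, _ in split.values())
    hours = round(sum(h for _, h in split.values()), 1)
    return utterances, hours


train_utts, train_hours = totals(TRAIN)
test_utts, test_hours = totals(TEST)
# Train plus test hours should match the "about 1240 hours" stated above.
overall_hours = round(train_hours + test_hours, 1)

print(train_utts, train_hours)
print(test_utts, test_hours)
print(overall_hours)
```

The train and test hours together account for the roughly 1240 hours quoted in the description.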

External URLs

Audio files in opus format

golos_opus.tar [20.5 GB]

Audio files in wav format

The training audio is split across the archives listed below. The manifest files with all the training transcription texts are in the train_crowd9.tar archive:
train_farfield.tar [15.4 GB]
train_crowd0.tar [11 GB]
train_crowd1.tar [14 GB]
train_crowd2.tar [13.2 GB]
train_crowd3.tar [11.6 GB]
train_crowd4.tar [15.8 GB]
train_crowd5.tar [13.1 GB]
train_crowd6.tar [15.7 GB]
train_crowd7.tar [12.7 GB]
train_crowd8.tar [12.2 GB]
train_crowd9.tar [8.08 GB]
test.tar [1.3 GB]
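
Once downloaded, the archives listed above can be unpacked with any tar tool. The following is a minimal sketch using Python's standard `tarfile` module. The `extract_archive` helper and the `golos` destination directory are hypothetical names chosen for illustration, and the internal layout of the archives is not assumed.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar archive into dest_dir and return the member names."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r") as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse members that would be written outside dest (e.g. "../x").
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        # Use the stricter "data" extraction filter where Python provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, members=members, **kwargs)
    return [m.name for m in members]


if __name__ == "__main__":
    # Hypothetical usage: unpack whichever archives are in the current directory.
    for name in ["train_crowd9.tar", "test.tar"]:
        if Path(name).exists():
            extracted = extract_archive(name, "golos")
            print(f"{name}: {len(extracted)} entries")
```

Since the manifests are stored in train_crowd9.tar, extracting that archive first gives access to the training transcriptions before the remaining audio has finished downloading.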

Acoustic and language models

QuartzNet15x5_golos.nemo [68 MB]
KenLMs.tar [4.8 GB]

Authors (in alphabetical order):

Alexander Denisenko
Angelina Kovalenko
Fedor Minkin
Nikolay Karpov

You can cite the data using the following BibTeX entry:

  @article{karpov2021golos,
    title={Golos: Russian Dataset for Speech Research},
    author={Karpov, Nikolay and Denisenko, Alexander and Minkin, Fedor},
    journal={arXiv preprint arXiv:2106.10161},
    year={2021}
  }

To contact us, please create an issue in the Golos GitHub repository!