Golos
Identifier: SLR114
Summary: Russian ASR dataset (1240 hours) with trained acoustic and language models
Category: Speech
License: https://github.com/sberdevices/golos/blob/master/license/en_us.pdf
Downloads (use a mirror closer to you):
golos_opus.tar.gz [18G] (Opus audio files with Russian speech and transcripts) Mirrors:
[US]
[EU]
[CN]
QuartzNet15x5_golos.nemo.gz [71M] (Acoustic model trained using Golos dataset) Mirrors:
[US]
[EU]
[CN]
kenlms.tar.gz [4.7G] (KenLM language models created using Russian Common Crawl corpus) Mirrors:
[US]
[EU]
[CN]
About this resource:
Golos dataset
Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours. The corpus is freely available for download, along with an acoustic model trained on it. We also built a 3-gram KenLM language model from the open Common Crawl corpus. The main project page is the Golos GitHub repository.
Dataset structure
Domains | Train utterances | Train hours | Test utterances | Test hours |
---|---|---|---|---|
Crowd | 979 796 | 1 095 | 9 994 | 11.2 |
Farfield | 124 003 | 132.4 | 1 916 | 1.4 |
Total | 1 103 799 | 1 227.4 | 11 910 | 12.6 |
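As a quick sanity check on the table, the sketch below re-adds the Crowd and Farfield rows and derives the average utterance length for each split. The figures are copied from the table; the dictionary layout and the function names are only illustrative.

```python
# Figures copied from the "Dataset structure" table above.
SPLITS = {
    "Crowd": {"train_utts": 979_796, "train_hours": 1095.0,
              "test_utts": 9_994, "test_hours": 11.2},
    "Farfield": {"train_utts": 124_003, "train_hours": 132.4,
                 "test_utts": 1_916, "test_hours": 1.4},
}
TOTAL = {"train_utts": 1_103_799, "train_hours": 1227.4,
         "test_utts": 11_910, "test_hours": 12.6}


def column_sum(key: str) -> float:
    """Sum one column of the table over the Crowd and Farfield rows."""
    return sum(row[key] for row in SPLITS.values())


def mean_utterance_seconds(hours: float, utterances: int) -> float:
    """Average utterance length in seconds for a given split."""
    return hours * 3600.0 / utterances


if __name__ == "__main__":
    for key, expected in TOTAL.items():
        # Hours are floats, so compare with a small tolerance.
        assert abs(column_sum(key) - expected) < 1e-6, key
    train = mean_utterance_seconds(TOTAL["train_hours"], TOTAL["train_utts"])
    test = mean_utterance_seconds(TOTAL["test_hours"], TOTAL["test_utts"])
    print(f"train: {train:.2f} s per utterance")
    print(f"test:  {test:.2f} s per utterance")
```

Both splits come out at roughly four seconds per utterance, which is consistent with a corpus of short recorded phrases.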
External URLs
Audio files in opus format
- golos_opus.tar [20.5 GB]
Audio files in wav format
Manifest files with all the training transcription texts are in the train_crowd9.tar archive listed below:
- train_farfield.tar [15.4 GB]
- train_crowd0.tar [11 GB]
- train_crowd1.tar [14 GB]
- train_crowd2.tar [13.2 GB]
- train_crowd3.tar [11.6 GB]
- train_crowd4.tar [15.8 GB]
- train_crowd5.tar [13.1 GB]
- train_crowd6.tar [15.7 GB]
- train_crowd7.tar [12.7 GB]
- train_crowd8.tar [12.2 GB]
- train_crowd9.tar [8.08 GB]
- test.tar [1.3 GB]
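Because the training transcripts live in a manifest inside train_crowd9.tar, it can be handy to read them without unpacking the whole archive. The following is a minimal sketch under stated assumptions: this page does not describe the manifest layout, so the member name `manifest.jsonl` and the JSON-lines fields `audio_filepath`, `text`, and `duration` (in seconds) are assumed here, following the common NeMo manifest convention. Check the actual archive contents before relying on them.

```python
import json
import tarfile
from typing import Iterable, Iterator


def iter_manifest_from_tar(archive: str,
                           member_suffix: str = "manifest.jsonl") -> Iterator[dict]:
    """Yield records from the first archive member whose name ends with
    member_suffix. The default suffix is an assumption, not a documented name."""
    with tarfile.open(archive) as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(member_suffix):
                fh = tar.extractfile(member)
                for raw in fh:
                    line = raw.decode("utf-8").strip()
                    if line:  # skip blank lines
                        yield json.loads(line)
                return
    raise FileNotFoundError(f"no member ending with {member_suffix!r} in {archive}")


def total_hours(records: Iterable[dict], duration_key: str = "duration") -> float:
    """Sum per-utterance durations (assumed to be seconds) and convert to hours."""
    return sum(float(rec[duration_key]) for rec in records) / 3600.0


if __name__ == "__main__":
    # Path is illustrative; point it at a downloaded archive.
    records = list(iter_manifest_from_tar("train_crowd9.tar"))
    print(len(records), "utterances,", round(total_hours(records), 1), "hours")
```

If the field assumptions hold, the totals printed here can be compared with the training figures in the dataset structure table.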
Acoustic and language models
- QuartzNet15x5_golos.nemo [68 MB]
- KenLMs.tar [4.8 GB]
Authors (in alphabetic order):
- Alexander Denisenko
- Angelina Kovalenko
- Fedor Minkin
- Nikolay Karpov
You can cite the data using the following BibTeX entry:
@article{karpov2021golos,
  title={Golos: Russian Dataset for Speech Research},
  author={Karpov, Nikolay and Denisenko, Alexander and Minkin, Fedor},
  journal={arXiv preprint arXiv:2106.10161},
  year={2021}
}
To contact us, please create an issue in the Golos GitHub repository.
External URLs:
https://sc.link/JpD (Opus audio files and transcripts with Russian speech)
https://sc.link/ZMv (Acoustic model trained using Golos dataset)
https://sc.link/YL0 (KenLM language models created using Russian Common Crawl corpus)