The WenetSpeech corpus is a 10000+ hour multi-domain transcribed Mandarin speech corpus collected from YouTube and podcasts. Optical character recognition (OCR) is used to label the YouTube recordings, and automatic speech recognition (ASR) is used to label the podcast recordings. To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.
The data are divided by label confidence as follows:
- 10000+ hours of high-label data: confidence >= 0.95, for supervised training, etc.
- 2400+ hours of weak-label data: 0.6 <= confidence < 0.95, for semi-supervised or noisy training, etc.
- ~10000 hours of unlabeled data: confidence < 0.6, for unsupervised training, etc.
- 22400+ hours of audio in total: both labeled and unlabeled data, for unsupervised training or pretraining, etc.
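The confidence thresholds above can be sketched as a small helper. This is only an illustration: the function name `label_tier` and the returned tier strings are hypothetical and are not part of the WenetSpeech toolkit, and the sketch assumes the confidence is a float in the range [0, 1].

```python
def label_tier(confidence: float) -> str:
    """Map a label confidence to a data tier using the thresholds listed above.

    Hypothetical helper for illustration; not part of the WenetSpeech toolkit.
    """
    # Assumption: confidence is a score between 0 and 1 inclusive.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= 0.95:
        return "high"       # high-label data, e.g. supervised training
    if confidence >= 0.6:
        return "weak"       # weak-label data, e.g. semi-supervised or noisy training
    return "unlabeled"      # treated as unlabeled, e.g. unsupervised training


if __name__ == "__main__":
    for score in (0.99, 0.95, 0.8, 0.6, 0.3):
        print(score, label_tier(score))
```

Note that both boundaries are inclusive on the lower side, so a confidence of exactly 0.95 falls in the high-label tier and exactly 0.6 falls in the weak-label tier.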
Diversity
The high-label data of WenetSpeech falls mainly into 10 categories, according to speaking style and spoken scenario:
- drama (43.36%)
- reading (11.1%)
- interview (9.38%)
- news (8.68%)
- variety (8.27%)
- documentary (4.77%)
- talk (2.94%)
- audiobook (2.51%)
- commentary (2.48%)
- others (6.51%)
Citations
You can cite the data using the following BibTeX entry:
@inproceedings{zhang2022wenetspeech,
title={WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition},
author={Zhang, Binbin and Lv, Hang and Guo, Pengcheng and Shao, Qijie and Yang, Chao and Xie, Lei and Xu, Xin and Bu, Hui and Chen, Xiaoyu and Zeng, Chenchen and Wu, Di and Peng, Zhendong},
booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2022},
organization={IEEE}
}
Download
To download the corpus, go to https://wenet.org.cn/WenetSpeech, fill out the Google form to receive the password, and follow the instructions to download. The GitHub page provides the toolkits for downloading.