LibriSpeech-PC: A dataset based on LibriSpeech* with restored punctuation and capitalization.
- The dataset includes ONLY .json manifests and NO audio files. The audio files can be downloaded from the original LibriSpeech: https://www.openslr.org/12
- The subset structure of the original LibriSpeech is preserved.
- Some samples were dropped during punctuation and capitalization restoration; see STATISTICS for details.
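
Because the manifests ship without audio, each entry has to be paired with a file from a local copy of the original LibriSpeech. The following is a minimal sketch of one way to do that, not an official loader. It assumes that each manifest stores one JSON object per line and that each entry holds a relative audio path under a key such as "audio_filepath". Neither assumption is stated in this README, so check the actual files and adjust the key and path handling as needed. The function names and the example paths are hypothetical.

```python
import json
from pathlib import Path


def read_manifest(manifest_path):
    """Read a manifest with one JSON object per line (assumed layout)."""
    entries = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


def resolve_audio(entry, librispeech_root, key="audio_filepath"):
    """Join the entry's audio path (assumed relative) to a local LibriSpeech root.

    The default key name is an assumption; pass the key your manifests use.
    """
    return Path(librispeech_root) / entry[key]
```

For example, `entries = read_manifest("some_subset.json")` followed by `resolve_audio(entries[0], "/data/LibriSpeech")` would yield the expected location of the first sample's audio, which can then be checked with `.exists()` before training or evaluation.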
*V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964.
You can cite the data using the following BibTeX entry:
@article{meister2023librispeechpc,
title={LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models},
author={A. Meister and M. Novikov and N. Karpov and E. Bakhturina and V. Lavrukhin and B. Ginsburg},
journal={arXiv preprint arXiv:2310.02943},
year={2023},
}