LibriSpeech-PC: A dataset based on LibriSpeech* with restored punctuation and capitalization.

The dataset includes ONLY .json manifests and NO audio files. Audio files can be taken from the original LibriSpeech: https://www.openslr.org/12
The subset structure of the original LibriSpeech is preserved.
Some samples were dropped during punctuation and capitalization restoration; see STATISTICS for details.
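
Because the manifests ship without audio, each entry has to be matched to a file in a local copy of the original LibriSpeech. The sketch below shows one possible way to do that. It rests on assumptions that are NOT stated in this README: that each manifest stores one JSON object per line, and that each object has an "audio_filepath" field holding a path relative to the LibriSpeech root. The function name load_manifest and the added key "resolved_audio_path" are hypothetical; check a manifest and adjust audio_key to match the actual field name.

```python
import json
from pathlib import Path


def load_manifest(manifest_path, librispeech_root, audio_key="audio_filepath"):
    """Read a JSON-lines manifest and attach a resolved local audio path to each entry.

    ASSUMPTION: one JSON object per line, with a relative audio path stored
    under `audio_key`. Neither is confirmed by the README.
    """
    root = Path(librispeech_root)
    entries = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            # Join the (assumed relative) path with the local LibriSpeech root.
            entry["resolved_audio_path"] = str(root / entry[audio_key])
            entries.append(entry)
    return entries
```

The function only builds paths and does not check that the audio files exist, so a missing or differently laid out LibriSpeech download would need to be detected separately.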

*V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964.

You can cite the data using the following BibTeX entry:

@article{meister2023librispeechpc,
  title={LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models},
  author={A. Meister and M. Novikov and N. Karpov and E. Bakhturina and V. Lavrukhin and B. Ginsburg},
  journal={arXiv preprint arXiv:2310.02943},
  year={2023}
}