The Pansori TEDxKR Corpus is a Korean automatic speech recognition (ASR) corpus generated from Korean-language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of paired speech audio and transcripts from 41 speakers. The corpus was generated with Pansori, a new system for corpus data ingestion and processing. Please refer to this code repository and the following paper for further information on the Pansori ASR corpus generation system:
@inproceedings{choi_2018,
  title={{Pansori: ASR corpus generation from open online video contents}},
  author={Choi, Yoona and Lee, Bowon},
  booktitle={Proceedings of the IEEE Seoul Section Student Paper Contest 2018},
  pages={117-121},
  month={Nov},
  year={2018}
}

Extra care was taken to maintain the quality of the generated corpus:
Electronics Engineering, Inha University