The IISc-MILE Tamil ASR Corpus is a transcribed speech corpus for training automatic speech recognition (ASR) systems for the Tamil language. It contains ~150 hours of read speech collected from 531 speakers in a noise-free recording environment with high-quality USB microphones.

The corpus is split into train and test sets. Each of the two folders contains two subfolders named audio_files and trans_files. The folder "audio_files" contains the recordings as .wav files (16 kHz, 16-bit, mono, PCM format). The folder "trans_files" contains one UTF-8 encoded .txt transcription file for each audio file.
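
The folder layout described above can be read with a few lines of Python. This is only a minimal sketch using the standard library, not an official loader. It assumes that each transcript has the same base name as its audio file (for example audio_files/x.wav and trans_files/x.txt), which the description does not state explicitly. The function names load_split and check_format are hypothetical.

```python
import wave
from pathlib import Path


def load_split(split_dir):
    """Pair each .wav in audio_files with a same-named .txt in trans_files.

    ASSUMPTION: a transcript shares the base name of its audio file
    (audio_files/x.wav <-> trans_files/x.txt).
    """
    split_dir = Path(split_dir)
    audio_dir = split_dir / "audio_files"
    trans_dir = split_dir / "trans_files"

    pairs = []
    for wav_path in sorted(audio_dir.glob("*.wav")):
        txt_path = trans_dir / (wav_path.stem + ".txt")
        if not txt_path.exists():
            continue  # skip audio that has no matching transcript
        # Transcripts are UTF-8 text; strip the trailing newline.
        text = txt_path.read_text(encoding="utf-8").strip()
        pairs.append((wav_path, text))
    return pairs


def check_format(wav_path):
    """Return True if the file is 16 kHz, 16-bit, mono PCM."""
    with wave.open(str(wav_path), "rb") as wf:
        return (
            wf.getframerate() == 16000
            and wf.getsampwidth() == 2  # 2 bytes per sample = 16 bits
            and wf.getnchannels() == 1
        )
```

A call such as load_split("train") would return a list of (audio path, transcript text) tuples, where "train" stands for wherever the train folder was extracted. Running check_format on a few of the returned paths is a cheap sanity check before feeding the data to a training pipeline.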

This corpus is published by the Medical Intelligence and Language Engineering (MILE) Lab, Indian Institute of Science, Bangalore.

You can cite the data using the following BibTeX entries:

@misc{mile_1,
  doi = {10.48550/ARXIV.2207.13331},
  url = {https://arxiv.org/abs/2207.13331},
  author = {A, Madhavaraj and Pilar, Bharathi and G, Ramakrishnan A},
  title = {Subword Dictionary Learning and Segmentation Techniques for Automatic Speech Recognition in Tamil and Kannada},
  publisher = {arXiv},
  year = {2022},
}

@misc{mile_2,
  doi = {10.48550/ARXIV.2207.13333},
  url = {https://arxiv.org/abs/2207.13333},
  author = {A, Madhavaraj and Pilar, Bharathi and G, Ramakrishnan A},
  title = {Knowledge-driven Subword Grammar Modeling for Automatic Speech Recognition in Tamil and Kannada},
  publisher = {arXiv},
  year = {2022},
}