The Hindi speech dataset is split into train and test sets with 95.05 hours and 5.55 hours of audio, respectively. The train and test sets contain 4506 and 386 unique sentences, respectively, taken from Hindi stories, with no sentence overlap between the two sets. The train set contains utterances from 59 speakers, and the test set contains utterances from a disjoint set of 19 speakers. The audio files are sampled at 8 kHz with 16-bit encoding. The total vocabulary size of the train and test sets combined is 6542.
The Marathi speech data was collected from three user groups: college students, rural low-income workers, and urban low-income workers. The dataset is split into train and test sets with 93.89 hours and 5 hours of audio, respectively. The train and test sets contain 2543 and 200 unique sentences, respectively, and the utterances in both sets come from the same 31 speakers (100% speaker overlap). The text transcriptions of the train and test sets are disjoint. The audio files are sampled at 8 kHz with 16-bit encoding. The total vocabulary size of the train and test sets combined is 3395.
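The split properties described above (disjoint transcriptions, the degree of speaker overlap, and the combined vocabulary size) can be checked with a short script. The following is a minimal sketch, not the procedure used to build the datasets: the `Utterance` record and the `split_stats` function are hypothetical names, and whitespace tokenization is an assumption, since the text does not state how the vocabulary was counted.

```python
from typing import Dict, Iterable, NamedTuple


class Utterance(NamedTuple):
    speaker: str  # hypothetical speaker identifier
    text: str     # transcription of the utterance


def split_stats(train: Iterable[Utterance], test: Iterable[Utterance]) -> Dict[str, float]:
    """Summarise sentence overlap, speaker overlap, and vocabulary of a split."""
    train, test = list(train), list(test)
    train_sents = {u.text for u in train}
    test_sents = {u.text for u in test}
    train_spk = {u.speaker for u in train}
    test_spk = {u.speaker for u in test}

    # Assumption: words are separated by whitespace.
    vocab = {w for s in train_sents | test_sents for w in s.split()}

    # Fraction of test speakers that also appear in the train set.
    seen = len(test_spk & train_spk) / len(test_spk) if test_spk else 0.0

    return {
        "unique_train_sentences": len(train_sents),
        "unique_test_sentences": len(test_sents),
        "shared_sentences": len(train_sents & test_sents),
        "test_speakers_seen_in_train": seen,
        "vocabulary_size": len(vocab),
    }
```

For a split like the Marathi one, `shared_sentences` would be expected to be 0 and `test_speakers_seen_in_train` to be 1.0. For the Hindi split, both values would be expected to be 0.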
The Odia text data was collected from four districts (representative dialects in parentheses): Sambalpur (North Western Odia), Mayurbhanj (North Eastern Odia), Puri (Central and Standard Odia), and Koraput (Southern Odia). The focal themes of data collection were agriculture, healthcare, and finance. Data was collected in the field from farmers and agriculture officers for the agriculture domain; from nurses, doctors, and associate professionals (front desk staff, naturopathy practitioners) for the healthcare domain; and from bank employees for the finance domain. A total of 885 sentences were obtained for speech data collection and split across the train and test sets, with 94.54 hours and 5.49 hours of audio, respectively. The test set has 65 unique sentences, which do not overlap with the 820 unique sentences in the train set. The audio files are sampled at 8 kHz with 16-bit encoding. The vocabulary size is 1644.
In addition to the train and test sets, the blind test set for Subtask 1 is also provided.
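The audio format stated for each dataset (8 kHz sampling rate, 16-bit encoding) can be verified per file before training. The sketch below assumes the audio is distributed as uncompressed PCM WAV files, which the text does not state. The function name `check_wav_format` is hypothetical, and it relies only on Python's standard `wave` module.

```python
import wave
from typing import Tuple


def check_wav_format(path: str, expected_rate: int = 8000,
                     expected_bytes: int = 2) -> Tuple[bool, float]:
    """Return (format_ok, duration_in_seconds) for a PCM WAV file.

    expected_bytes=2 corresponds to 16-bit samples.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()      # samples per second
        width = wav.getsampwidth()     # bytes per sample
        duration = wav.getnframes() / rate
    return (rate == expected_rate and width == expected_bytes), duration
```

Summing the returned durations over a split and dividing by 3600 gives the total hours of audio, which can be compared against the figures reported above.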