The Hindi-English and Bengali-English datasets are extracted from spoken tutorials. These tutorials cover a range of technical topics and the code-switching predominantly arises from the technical content of the lectures. The segments file in the baseline recipe provides sentence time-stamps. These time-stamps were used to derive segments from the audio file to be aligned with the transcripts given in the text file. Hindi-English train and test datasets contain 89.86 hours and 5.18 hours, respectively, while the Bengali-English train and test datasets contain 46.11 hours and 7.02 hours of speech, respectively. All the audio files in both datasets are sampled at 16 kHz, 16 bits encoding. The vocabulary size for Hindi-English and Bengali-English are 17877 and 13656, respectively.
In addition to the train and test sets, the blind test set for subtask2 is also provided.