The dataset is structured as follows:
audio_data/
    reader_1/
        001001.mp3 (surah 1, ayah 1)
        001002.mp3
        ...
    reader_2/
        001001.mp3
        ...
        114006.mp3 (surah 114, ayah 6)
    reader_3/
        ...
Note that not all readers cover all 6236 ayat of the Quran; some do not even cover all 114 surahs.
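Because coverage varies by reader, it can help to count the files per reader before training. The following is a minimal sketch, assuming every reader directory sits directly under audio_data/ and holds one mp3 file per ayah; the function names coverage and incomplete_readers are hypothetical and not part of the dataset.

```python
from pathlib import Path

TOTAL_AYAT = 6236  # total number of ayat, as stated above


def coverage(audio_root):
    """Return {reader_name: number of mp3 files} for each reader directory.

    Assumes the layout audio_root/<reader>/<file>.mp3 described above.
    """
    root = Path(audio_root)
    return {
        d.name: sum(1 for _ in d.glob("*.mp3"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


def incomplete_readers(audio_root):
    """Return only the readers that have fewer than TOTAL_AYAT files."""
    return {
        name: count
        for name, count in coverage(audio_root).items()
        if count < TOTAL_AYAT
    }
```

For example, incomplete_readers("audio_data") would list each reader with a partial recitation together with its file count.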
The Arabic text of every surah and ayah is in the all_ayat.json file.
Each JSON key has the form "surah_ayah", for example "1_1" for surah 1, ayah 1, or "114_2" for surah 114, ayah 2. The surah and ayah numbers are each up to 3 digits long and, unlike the mp3 file names, are not zero-padded. All entries are nested under a top-level "tafsir" key:
{"tafsir":{"1_1":{"text":"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"},"1_2":{"text":"الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ"},"1_3":{"text":"الرَّحْمَٰنِ الرَّحِيمِ"},"1_4":{"text":"مَالِكِ يَوْمِ الدِّينِ"},"1_5":{"text":"إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"},"1_6":{"text":"اهْدِنَا الصِّرَاطَ الْمُسْتَقِيمَ"}, ...}
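The mp3 file names and the JSON keys encode the same surah and ayah numbers in different forms, so a small conversion is needed to look up the text for an audio file. The following is a minimal sketch, assuming every file name is exactly six ASCII digits (three for the surah, three for the ayah) plus the .mp3 extension; the helper names key_from_filename and load_ayat are hypothetical.

```python
import json
from pathlib import Path


def key_from_filename(filename):
    """Convert a name such as '001001.mp3' into the JSON key '1_1'."""
    stem = Path(filename).stem  # e.g. "001001"
    if len(stem) != 6 or not (stem.isascii() and stem.isdigit()):
        raise ValueError(f"unexpected file name: {filename!r}")
    surah, ayah = int(stem[:3]), int(stem[3:])  # int() drops the zero padding
    return f"{surah}_{ayah}"


def load_ayat(json_path):
    """Load all_ayat.json and return the mapping stored under 'tafsir'."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)["tafsir"]
```

With these helpers, load_ayat("all_ayat.json")[key_from_filename("001002.mp3")]["text"] would return the text stored for surah 1, ayah 2.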
audo_list.txt lists all mp3 files found in the audio_data directory. transcripts.tsv is a tab-separated values file that can be used as input to a machine learning program. Each row has three fields: path, duration (in seconds), and Arabic text.
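The three-field layout of transcripts.tsv can be read with Python's standard csv module. This is a minimal sketch, assuming the file is UTF-8 encoded, has no header row, and stores the fields in the order path, duration, text; none of these details is confirmed by the description above, and read_transcripts is a hypothetical name.

```python
import csv


def read_transcripts(tsv_path):
    """Return a list of (path, duration_seconds, text) tuples.

    Assumes no header row and exactly three tab-separated fields per line.
    """
    rows = []
    with open(tsv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps any quote characters in the text untouched.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:  # skip blank lines
                continue
            path, duration, text = row
            rows.append((path, float(duration), text))
    return rows
```

If the file turns out to have a header row, the float() conversion will fail on the first line, which makes the wrong assumption easy to spot.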