The Tunisian_MSA corpus was originally collected to train acoustic models for pronunciation modeling in Arabic language learning applications.

The data collection took place near Tunis, the capital of the Republic of Tunisia, in 2003.

The Tunisian_MSA corpus is divided into recited and prompted speech subcorpora. The recited speech is stored under the recordings directory, and the prompted speech is stored under the answers directory. Each of the 118 informants contributed to both subcorpora by reciting sentences and by answering prompted questions. In total, the Tunisian_MSA corpus contains 11.2 hours of speech.
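
As a quick check on the two-directory layout described above, the following minimal sketch counts the files found under the recordings and answers directories. Only those two directory names come from this description. The corpus root path, the helper name count_files_by_subcorpus, and the recursive search are assumptions made for illustration: this description does not say whether files sit directly in each directory or in per-speaker subdirectories, and no file extension is assumed.

```python
from pathlib import Path


def count_files_by_subcorpus(corpus_root):
    """Count regular files under each subcorpus directory.

    corpus_root is a hypothetical path to the unpacked corpus.
    The names "recordings" and "answers" are taken from the
    corpus description; everything else is an assumption.
    """
    root = Path(corpus_root)
    counts = {}
    for subcorpus in ("recordings", "answers"):
        directory = root / subcorpus
        if directory.is_dir():
            # Search recursively, since the internal layout is not documented here.
            counts[subcorpus] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            # A missing directory is reported as zero files rather than an error.
            counts[subcorpus] = 0
    return counts


if __name__ == "__main__":
    # "Tunisian_MSA" is a placeholder for wherever the corpus was unpacked.
    print(count_files_by_subcorpus("Tunisian_MSA"))
```

A count of zero for either directory would suggest that the root path is wrong or that the corpus was unpacked with a different layout than assumed here.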

A small corpus was collected in 2017 for testing. It consists of speech from 4 speakers: 3 Libyan men and 1 Tunisian woman.