Audiocite.net is a corpus of read French speech downloaded in November 2021 from the Audiocite.net website.
With a total duration of 6682 hours of audio recording, this corpus is the result of the voluntary work of 130 speakers.
The metadata is divided into 4 .json files (all (100%), train (80%), dev (10%) and test (10%)) to be used in NLP models.
The corpus and its metadata were uploaded using a script that records the information in a .csv file. These audio and metadata files are intended for use with pre-trained speech models.
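
As a minimal sketch of how the split metadata might be read and sanity-checked against the stated 80/10/10 proportions, the snippet below loads one metadata file and computes the share of entries in each split. The internal structure of the files is not described here, so the code assumes each file holds either a JSON list of records or a JSON object keyed by identifier. The function names `load_metadata` and `split_shares` are hypothetical helpers, not part of any released tooling. Whether the split was made over files, speakers or duration is also not stated, so the shares computed this way may not match the nominal percentages exactly.

```python
import json
from pathlib import Path


def load_metadata(path):
    """Read one metadata file and return its entries as a list.

    Assumption: the file holds either a JSON list of records or a
    JSON object mapping an identifier to a record.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        return list(data.values())
    return list(data)


def split_shares(train, dev, test):
    """Return the fraction of all entries held by each split."""
    sizes = {"train": len(train), "dev": len(dev), "test": len(test)}
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}
```
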

| Speaker Gender* | Number of Files | Number of Speakers | Total Duration | Avg. Duration | Min. Duration | Max. Duration |
|---|---|---|---|---|---|---|
| M | 19345 | 51 | 4127:34:45 | 00:12:48 | 00:15:01 | 06:39:43 |
| F | 8261 | 70 | 2272:18:23 | 00:16:30 | 00:03:49 | 02:46:06 |
| U | 879 | 9 | 282:07:09 | 00:19:15 | 00:00:09 | 03:30:16 |
| Total | 28485 | 130 | 6682:00:18 | 00:14:04 | 00:00:09 | 06:39:43 |
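
The average durations in the table can be recomputed from the total duration and the number of files in each row. The sketch below shows that arithmetic, assuming durations are written as hours:minutes:seconds and that averages are truncated to whole seconds. The helper names are illustrative only.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string (hours may exceed 99) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Format a whole number of seconds as 'HH:MM:SS'."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def average_duration(total_hms, n_files):
    """Average duration per file, truncated to whole seconds."""
    return to_hms(to_seconds(total_hms) // n_files)


print(average_duration("4127:34:45", 19345))  # 00:12:48
print(average_duration("6682:00:18", 28485))  # 00:14:04
```

Applied to the four rows, this reproduces the listed averages. Summing the three per-gender totals gives 6682:00:17, one second short of the listed total, which is presumably a rounding effect.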

@inproceedings{Felice2024,
  title={Audiocite.net: A Large Spoken Read Dataset in French},
  author={Soline Felice and Solène Evain and Solange Rossato and François Portet},
  booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year={2024}
}

Please also cite the Audiocite.net website (https://www.audiocite.net/) if you use this dataset for your own research.