Audiocite.net is a corpus of read French speech downloaded in November 2021 from the Audiocite.net website.
With a total duration of 6682 hours of audio recordings, this corpus is the result of the voluntary work of 130 speakers. The metadata is divided into four .json files (all (100%), train (80%), dev (10%) and test (10%)) for use in NLP models.
The corpus and its metadata were uploaded using a script that organizes the information in a .csv file. These audio and metadata files are intended for use with pre-trained speech models.
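
As a minimal sketch of how the four metadata files might be read, the Python snippet below loads each split and reports its size relative to the full set. The file names (all.json, train.json, dev.json, test.json) and the helper names are hypothetical, and the code assumes that each file holds a top-level JSON list or object with one entry per recording; check the archive for the actual names and layout.

```python
import json
from pathlib import Path

# Hypothetical file names: the description only states that there are four
# .json files (all, train, dev, test). Adjust to the names in the archive.
SPLIT_FILES = {
    "all": "all.json",
    "train": "train.json",
    "dev": "dev.json",
    "test": "test.json",
}


def load_splits(metadata_dir):
    """Load each split's metadata and return {split_name: parsed_json}."""
    splits = {}
    for name, filename in SPLIT_FILES.items():
        with open(Path(metadata_dir) / filename, encoding="utf-8") as handle:
            splits[name] = json.load(handle)
    return splits


def split_shares(splits):
    """Return each split's entry count as a fraction of the 'all' split.

    Assumes every split is a list or dict with one entry per recording.
    """
    total = len(splits["all"])
    return {name: len(entries) / total for name, entries in splits.items()}
```

Calling split_shares(load_splits("path/to/metadata")) should give values close to 0.8, 0.1 and 0.1 for train, dev and test if the assumed layout holds.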

Overview of the corpus:

Speaker Gender*	Number of Files	Number of Speakers	Total Duration	Avg. Duration	Min. Duration	Max. Duration
M	19345	51	4127:34:45	00:12:48	00:15:01	06:39:43
F	8261	70	2272:18:23	00:16:30	00:03:49	02:46:06
U	879	9	282:07:09	00:19:15	00:00:09	03:30:16
Total	28485	130	6682:00:18	00:14:04	00:00:09	06:39:43

*Note: speaker gender was guessed and should not be treated as ground truth (cf. the README.md in audiocite.net_0.zip).
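
The per-gender rows of the overview table can be checked against its Total row with a short script. The sketch below copies the file counts, speaker counts and total durations from the table and assumes the durations are in hh:mm:ss format; the helper names to_seconds and to_hms are illustrative only.

```python
def to_seconds(duration):
    """Convert an 'hh:mm:ss' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Format a number of seconds as 'h:mm:ss'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# (number of files, number of speakers, total duration), copied from the table.
ROWS = {
    "M": (19345, 51, "4127:34:45"),
    "F": (8261, 70, "2272:18:23"),
    "U": (879, 9, "282:07:09"),
}
TOTAL = (28485, 130, "6682:00:18")

files = sum(row[0] for row in ROWS.values())
speakers = sum(row[1] for row in ROWS.values())
seconds = sum(to_seconds(row[2]) for row in ROWS.values())

print("files match:", files == TOTAL[0])
print("speakers match:", speakers == TOTAL[1])
print("summed duration:", to_hms(seconds), "| stated total:", TOTAL[2])
print("average per file:", to_hms(round(to_seconds(TOTAL[2]) / TOTAL[0])))
```

The file and speaker counts add up exactly. The summed duration comes to 6682:00:17, one second short of the stated total, which is presumably an effect of rounding the per-row durations to whole seconds.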

You can cite the data using the following BibTeX entry:

@inproceedings{Felice2024,
    title={Audiocite.net: A Large Spoken Read Dataset in French},
    author={Soline Felice and Solène Evain and Solange Rossato and François Portet},
    booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    year={2024}
}

Please also cite the Audiocite.net website (https://www.audiocite.net/) if you use this dataset in your own research.