Audiocite.net is a corpus of read French speech downloaded in November 2021 from the Audiocite.net website.
With a total duration of 6682 hours of audio recordings, this corpus is the result of the voluntary work of 130 speakers. The metadata is divided into four .json files (all (100%), train (80%), dev (10%) and test (10%)) for use with NLP models.
The corpus and its metadata were uploaded by a script that distributes the information into a .csv file. These audio and metadata files are intended for use with pre-trained speech models.
  • Overview of the Corpus:

    Speaker Gender*   Number of Files   Number of Speakers   Total Duration   Avg. Duration   Min. Duration   Max. Duration
    M                 19345             51                   4127:34:45       00:12:48        00:15:01        06:39:43
    F                 8261              70                   2272:18:23       00:16:30        00:03:49        02:46:06
    U                 879               9                    282:07:09        00:19:15        00:00:09        03:30:16
    Total             28485             130                  6682:00:18       00:14:04        00:00:09        06:39:43

    Durations are given as hh:mm:ss.

    *Note: speaker gender was guessed and should not be treated as ground truth (cf. the README.md in audiocite.net_0.zip).
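
The average durations in the overview can be reproduced from the file counts and total durations. The following is a minimal sketch, not part of the corpus tooling: the helper names `to_seconds` and `to_hms` are hypothetical, the figures are copied by hand from the table above, and averages are assumed to be truncated to whole seconds.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'hh:mm:ss'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# (number of files, total duration) per row, copied from the overview table.
rows = {
    "M": (19345, "4127:34:45"),
    "F": (8261, "2272:18:23"),
    "U": (879, "282:07:09"),
    "Total": (28485, "6682:00:18"),
}

# Average duration per file, truncated to whole seconds (an assumption).
averages = {
    label: to_hms(to_seconds(total) // n_files)
    for label, (n_files, total) in rows.items()
}

for label, avg in averages.items():
    print(f"{label}: {avg}")
```

Under these assumptions, the computed values match the Avg. Duration column of the table.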

    You can cite the data using the following BibTeX entry:
    @inproceedings{Felice2024,
        title={Audiocite.net: A Large Spoken Read Dataset in French},
        author={Soline Felice and Solène Evain and Solange Rossato and François Portet},
        booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
        year={2024}
    }
    
    Please also cite the Audiocite.net website (https://www.audiocite.net/) if you use this dataset in your own research.