SlideSpeech is a large-scale multi-modal audio-visual corpus, collected from YouTube, with a significant amount of real-time synchronized slides. The corpus contains 1,659 videos totaling more than 1,000 hours, of which 473 hours are high-quality transcribed speech used for training (Train). The Dev set consists of 21 videos (5.07 hours) and the Test set of 25 videos (8.75 hours). Unlike the Train set, the Dev and Test sets are annotated manually. SlideSpeech covers 22 domain categories, including computer science, musical instruments, history, agriculture, animation, music, parenting, travel, education and others. The corpus is easy to expand or to adapt to specific scenarios. The slides in the videos also include relevant images and some facial information. The corpus can be applied to automatic subtitle generation in online education scenarios.

The details of SlideSpeech can be found here: SlideSpeech website link

You can cite the data using the following BibTeX entry:

  @misc{wang2023slidespeech,
    title={SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus}, 
    author={Haoxu Wang and Fan Yu and Xian Shi and Yuezhang Wang and Shiliang Zhang and Ming Li},
    year={2023},
    eprint={2309.05396},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
  }

SlideSpeech introduction paper: SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus

Download links:
train_video.tar.gz [232.43GB] (SlideSpeech Train Video Set, bcbecb539bcb61ad662f8846e7520c57)
train_wav.tar.gz [88.56GB] (SlideSpeech Train Audio Set, 52204766bad0458ce846440cff62dd1b)
dev_video.tar.gz [887.68MB] (SlideSpeech Dev Video Set, 779cb23bd41c697a5740b816051038fd)
dev_audio.tar.gz [429.96MB] (SlideSpeech Dev Audio Set, fb3d47cee6a8ea8140c2933936181a55)
test_video.tar.gz [1.45GB] (SlideSpeech Test Video Set, 5211d0a6099028ce97408f8199240ceb)
test_audio.tar.gz [737.77MB] (SlideSpeech Test Audio Set, 0c4d41adc9ae33c379f431fbbf2297c7)
related_files.tar.gz [1.37GB] (SlideSpeech segments, transcription and OCR results, bfa834a7a02ba3b13e5c3f6fed82c102)
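
Each entry above ends with a 32-character hexadecimal string, which appears to be the MD5 checksum of the archive. Since the training archives are several hundred gigabytes, it may be worth checking a download before unpacking it. The following is a minimal sketch, assuming those strings are MD5 digests and that the archives keep the file names listed above; the helper names `md5_of_file` and `verify_archive` are illustrative and not part of any SlideSpeech tooling.

```python
import hashlib
import sys
from pathlib import Path

# Digests copied from the download list above (assumed to be MD5 checksums).
EXPECTED_MD5 = {
    "train_video.tar.gz": "bcbecb539bcb61ad662f8846e7520c57",
    "train_wav.tar.gz": "52204766bad0458ce846440cff62dd1b",
    "dev_video.tar.gz": "779cb23bd41c697a5740b816051038fd",
    "dev_audio.tar.gz": "fb3d47cee6a8ea8140c2933936181a55",
    "test_video.tar.gz": "5211d0a6099028ce97408f8199240ceb",
    "test_audio.tar.gz": "0c4d41adc9ae33c379f431fbbf2297c7",
    "related_files.tar.gz": "bfa834a7a02ba3b13e5c3f6fed82c102",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat even for very large archives.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Compare a downloaded archive against the digest listed for its file name.

    Raises KeyError if the file name is not one of the listed archives.
    """
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Paths to the downloaded archives are passed on the command line.
    for archive in sys.argv[1:]:
        status = "OK" if verify_archive(archive) else "MISMATCH"
        print(f"{archive}: {status}")
```

Saved under a hypothetical name such as `verify_slidespeech.py`, the script could be run as `python verify_slidespeech.py dev_audio.tar.gz test_audio.tar.gz`; a `MISMATCH` line would suggest a truncated or corrupted download.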