openslr.org

Open Speech and Language Resources

Multilingual TEDx

Identifier: SLR100

Summary: a multilingual corpus of TEDx talks for speech recognition and translation

Category: Speech

License: Creative Commons Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)

Downloads (use a mirror closer to you):
mtedx_es.tgz [35G] ( Spanish speech and transcripts ) Mirrors: [US] [EU] [CN]
mtedx_fr.tgz [34G] ( French speech and transcripts ) Mirrors: [US] [EU] [CN]
mtedx_pt.tgz [29G] ( Portuguese speech and transcripts ) Mirrors: [US] [EU] [CN]
mtedx_it.tgz [19G] ( Italian speech and transcripts ) Mirrors: [US] [EU] [CN]
mtedx_ru.tgz [10G] ( Russian speech and transcripts ) Mirrors: [US] [EU] [CN]
mtedx_el.tgz [5.5G] ( Greek speech and transcripts ) Mirrors: [US] [EU] [CN]
mtedx_ar.tgz [3.6G] ( Arabic speech and transcripts ) Mirrors: [US] [EU] [CN]
mtedx_de.tgz [2.6G] ( German speech and transcripts ) Mirrors: [US] [EU] [CN]
mtedx_es-en.tgz [13G] ( Spanish speech and transcripts with aligned English translations ) Mirrors: [US] [EU] [CN]
mtedx_es-fr.tgz [1.9G] ( Spanish speech and transcripts with aligned French translations ) Mirrors: [US] [EU] [CN]
mtedx_es-it.tgz [1.9G] ( Spanish speech and transcripts with aligned Italian translations ) Mirrors: [US] [EU] [CN]
mtedx_es-pt.tgz [8.1G] ( Spanish speech and transcripts with aligned Portuguese translations ) Mirrors: [US] [EU] [CN]
mtedx_fr-en.tgz [9.8G] ( French speech and transcripts with aligned English translations ) Mirrors: [US] [EU] [CN]
mtedx_fr-es.tgz [7.1G] ( French speech and transcripts with aligned Spanish translations ) Mirrors: [US] [EU] [CN]
mtedx_fr-pt.tgz [4.7G] ( French speech and transcripts with aligned Portuguese translations ) Mirrors: [US] [EU] [CN]
mtedx_pt-en.tgz [10G] ( Portuguese speech and transcripts with aligned English translations ) Mirrors: [US] [EU] [CN]
mtedx_pt-es.tgz [4.5G] ( Portuguese speech and transcripts with aligned Spanish translations ) Mirrors: [US] [EU] [CN]
mtedx_it-en.tgz [9.1G] ( Italian speech and transcripts with aligned English translations ) Mirrors: [US] [EU] [CN]
mtedx_it-es.tgz [1.6G] ( Italian speech and transcripts with aligned Spanish translations ) Mirrors: [US] [EU] [CN]
mtedx_ru-en.tgz [2.3G] ( Russian speech and transcripts with aligned English translations ) Mirrors: [US] [EU] [CN]
mtedx_el-en.tgz [2.4G] ( Greek speech and transcripts with aligned English translations ) Mirrors: [US] [EU] [CN]
mtedx_iwslt2021.tgz [5.7G] ( Test sets for IWSLT'21 Multilingual task ) Mirrors: [US] [EU] [CN]
MTEDx-french-talks-gender-annotation.csv [105K] ( Gender annotations for French talks, contributed by Laurent Besacier and Marcely Zanon Boito ) Mirrors: [US] [EU] [CN]

About this resource:

Multilingual TEDx (mTEDx) is a multilingual speech recognition and translation corpus to facilitate the training of ASR and SLT models in additional languages.

The corpus comprises audio recordings and transcripts from TEDx Talks in 8 languages (Spanish, French, Portuguese, Italian, Russian, Greek, Arabic, German) with translations into up to 5 languages (English, Spanish, French, Portuguese, Italian).
The audio recordings are automatically aligned at the sentence level with their manual transcriptions and translations.
Each .tgz file contains two directories: data and docs. docs contains a README detailing the files provided in data and their structure.
Test sets for all IWSLT 2021 language pairs can be found in mtedx_iwslt2021.tgz.
For more information on the dataset, please see the dataset paper.
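
As a quick sanity check after downloading, the data/docs layout described above can be inspected without unpacking the archive. The following is a minimal sketch, not an official tool: the function name count_members_by_section is hypothetical, and it assumes only that "data" and "docs" appear somewhere in each member path, since any enclosing root folder inside the archive is not specified here.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_members_by_section(archive_path, sections=("data", "docs")):
    """Count archive members that sit under each named directory.

    A member is attributed to the outermost path component that matches
    one of `sections`, so any enclosing root folder is ignored.
    """
    counts = Counter()
    # "r:gz" reads a gzip-compressed tar; iterating avoids extracting to disk.
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            for part in PurePosixPath(member.name).parts:
                if part in sections:
                    counts[part] += 1
                    break
    return dict(counts)
```

Calling count_members_by_section("mtedx_de.tgz") on a locally downloaded archive (the local path is an assumption) returns a dictionary with one count per section found. Scanning a multi-gigabyte archive this way still reads the whole file, so it can take a while on the larger language packs.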

Contact: Elizabeth Salesky (esalesky@jhu.edu), Matthew Wiesner (wiesner@jhu.edu)

Citation: If you use the Multilingual TEDx corpus in your work, please cite the dataset paper:

  @inproceedings{salesky2021mtedx,
    title={Multilingual TEDx Corpus for Speech Recognition and Translation},
    author={Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and Roldano Cattoni and Matteo Negri and Marco Turchi and Douglas W. Oard and Matt Post},
    booktitle={Proceedings of Interspeech},
    year={2021},
  }