The TEDx Spanish Corpus is a gender-unbalanced corpus with a total duration of 24 hours.

It contains spontaneous speech from several speakers at TEDx events; most of the speakers are men.

Transcriptions are presented in lowercase with no punctuation marks.
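
To compare your own text against these transcriptions, it may help to normalize it in a similar way. The snippet below is only a minimal sketch: the function name `normalize_transcript` is hypothetical, and the exact normalization used to build the corpus is not documented here, so details such as the handling of digits or hyphens are assumptions.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Lowercase the text and drop punctuation, keeping accented letters.

    Assumption: punctuation is replaced by a space and runs of whitespace
    are collapsed. The corpus itself may have been normalized differently.
    """
    kept = []
    for ch in text.lower():
        # Unicode categories starting with "P" cover punctuation,
        # including the Spanish opening marks "¿" and "¡".
        if unicodedata.category(ch).startswith("P"):
            kept.append(" ")
        else:
            kept.append(ch)
    # Collapse the extra spaces left behind by removed punctuation.
    return " ".join("".join(kept).split())


print(normalize_transcript("¡Hola! ¿Cómo están, amigos?"))
```

Accented characters and "ñ" are left untouched because they are letters, not punctuation.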

The data collection was carried out partly by the social service program "Desarrollo de Tecnologías del Habla", which is part of the National Autonomous University of Mexico, and partly by the CIEMPIESS-UNAM project (http://www.ciempiess.org/).

Special thanks to the TED-Talks team for allowing us to share this dataset.

You can cite the data using the following BibTeX entry:


@misc{mena_2019,
	title = "{TEDx Spanish Corpus. Audio and transcripts in Spanish taken from the TEDx Talks; shared under the CC BY-NC-ND 4.0 license}",
	author = "Hernandez-Mena, Carlos D.",
	howpublished = "Web Download",
	institution = "Universidad Nacional Autonoma de Mexico",
	location = "Mexico City",
	year = "2019"
}