This dataset is part of our effort to increase the amount of data available for low-resource languages like Armenian
and Georgian.
It consists of processed audiobooks, which originally came as a single large transcript and a
tens-of-minutes-long audio recording for each chapter.
To make the data ASR/TTS friendly, we converted the raw corpus into many
short audio chunks (typically 3-15 seconds long) with corresponding texts.
We coordinated with the original authors from Grqaser.org, who agreed on the selection of new books we processed.
To make reconstruction of the books (usually with different speakers per book) harder, we encoded the audio file names
and hid the book, chapter, and author information. This is done to prevent voice cloning attempts in TTS setups (the
majority of the data was collected on a voluntary basis, and cloning the voices of those people is forbidden).
The .tgz file contains the following directories:
texts/
- Contains text transcripts in .txt format
audios/
- Contains audio files in .wav format

"Grqaser" is a non-governmental organization dedicated to promoting Armenian language preservation globally through the creation of a comprehensive library of Armenian audiobooks. Established in 2015, "Grqaser" aims to facilitate access to Armenian literature for diaspora communities and individuals with visual impairments. Their initiative provides a valuable resource for listeners to engage with Armenian culture and language through accessible audio formats, supporting educational and cultural enrichment worldwide.

Author(s) of the corpus: Ara Yeroyan,
ar23yeroyan gmail.com
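
As a rough illustration of how the texts/ and audios/ directories described above might be used together, the following Python sketch pairs each audio chunk with its transcript. It rests on assumptions that this description does not confirm: that the archive has already been unpacked, that a transcript and its audio chunk share the same base file name (for example `abc.txt` and `abc.wav`), that transcripts are UTF-8 encoded, and that the .wav files are uncompressed PCM readable by the standard `wave` module. The function names `pair_clips` and `wav_duration_seconds` are hypothetical helpers, not part of any released tooling.

```python
import wave
from pathlib import Path


def wav_duration_seconds(wav_path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(wav_path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def pair_clips(root):
    """Pair every audio chunk with its transcript.

    ASSUMPTION: audios/<name>.wav corresponds to texts/<name>.txt.
    Audio files without a matching transcript are skipped.
    Returns a list of (wav_path, text, duration_seconds) tuples.
    """
    root = Path(root)
    pairs = []
    for wav_path in sorted((root / "audios").glob("*.wav")):
        txt_path = root / "texts" / (wav_path.stem + ".txt")
        if not txt_path.exists():
            continue  # no transcript found under the assumed naming scheme
        # ASSUMPTION: transcripts are stored as UTF-8 text.
        text = txt_path.read_text(encoding="utf-8").strip()
        pairs.append((wav_path, text, wav_duration_seconds(wav_path)))
    return pairs
```

Calling `pair_clips("path/to/unpacked/archive")` yields tuples that can be written to a manifest for ASR or TTS training. The returned durations also make it easy to check that chunks fall in the typical 3-15 second range mentioned above.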