This data set contains transcribed high-quality audio of English sentences recorded by volunteers speaking different dialects of the language. The data set consists of wave files and a TSV file (line_index.tsv). Each row of this index file contains a line ID, an anonymized FileID, and the transcription of the audio in that file. The recordings from the Welsh English speakers were collected in collaboration with Cardiff University. The data set contains the following numbers of lines per dialect and speaker gender:
Irish English male: 450
Midlands English female: 246
Midlands English male: 450
Northern English female: 750
Northern English male: 2097
Scottish English female: 894
Scottish English male: 1649
Southern English female: 4161
Southern English male: 4331
Welsh English female: 1199
Welsh English male: 1650
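
As a rough illustration of how the line index might be read, here is a minimal Python sketch. It assumes each row holds exactly the three fields described above (line id, anonymized FileID, transcription) in that order, and that the fields are tab-separated. The delimiter is an assumption and is left as a parameter (pass "," if the file turns out to be comma-separated). The function name read_line_index is hypothetical and not part of the data set.

```python
def read_line_index(path, delimiter="\t"):
    """Return a list of (line_id, file_id, transcription) tuples.

    Assumes three fields per row in the order described in this README.
    The delimiter is an assumption; adjust it to match the actual file.
    """
    rows = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            # Split at most twice so the transcription may itself
            # contain the delimiter character.
            fields = line.split(delimiter, 2)
            if len(fields) < 3:
                raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
            line_id, file_id, transcription = (field.strip() for field in fields)
            rows.append((line_id, file_id, transcription))
    return rows
```

If each dialect and gender subset ships with its own index file (an assumption), len(read_line_index(path)) can be compared against the line counts listed above as a quick sanity check.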

The data set has been manually quality checked, but it might still contain errors.

Please report any issues in the issue tracker on GitHub: https://github.com/googlei18n/language-resources/issues

See LICENSE file for license information.

If you use this data in publications, please cite it as follows:

  @inproceedings{demirsahin-etal-2020-open,
    title = {{Open-source Multi-speaker Corpora of the English Accents in the British Isles}},
    author = {Demirsahin, Isin and Kjartansson, Oddur and Gutkin, Alexander and Rivera, Clara},
    booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference (LREC)},
    month = may,
    year = {2020},
    pages = {6532--6541},
    address = {Marseille, France},
    publisher = {European Language Resources Association (ELRA)},
    url = {https://www.aclweb.org/anthology/2020.lrec-1.804},
    ISBN = {979-10-95546-34-4},
  }