openslr.org

Open Speech and Language Resources

SMIIP-TV dataset

Identifier: SLR156

Summary: A short-term time-varying speaker verification dataset

Category: Speech

License: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0 US)

Downloads (use a mirror closer to you):
SMIIP-TV_v2.0.tar.gz [49G] (dataset and meta information for the SMIIP-TV dataset) Mirrors: [US] [EU] [CN]

About this resource:

This dataset is a short-term time-varying speaker verification dataset.

The SMIIP-TimeVarying dataset (SMIIP-TV) is a speaker verification dataset designed for research on short-term time-varying effects in speaker verification. The recordings are in Mandarin. The dataset contains recordings from 373 speakers collected over 90 consecutive days, with each speaker asked to record multiple utterances in varying time slots each day. To ensure that recordings span the full day without location constraints, the researchers developed an Android application that randomly assigns recording tasks in five time slots: 6:00-8:00, 9:00-11:00, 12:00-14:00, 17:00-19:00, and 20:00-22:00. In each time slot, speakers provide three utterances, including both text-dependent and text-independent speech samples.

Additional meta information was collected, such as speaker region (27 provinces of China in total), age, and cellphone type. Speakers were also asked to report their physical state (7 types: normal, sleepy, eating, sore throat, exercise, cold/fever, and others), recording environment (16 scenes), and noise level (4 levels: quiet, normal, noisy, and extremely noisy); all of these reports were manually reviewed. The gender distribution is approximately balanced (171 male, 202 female). Most recordings were made indoors, and the majority were made under normal noise and physical conditions. Speakers were also encouraged to record in a variety of scenes and under different physical conditions.

Because recording continuously for 90 days is challenging, some speakers were unable to provide recordings for the entire duration. In the end, 133 speakers recorded for the entire 90-day period; 58 of them were selected as the SMIIP-TV test set, and the data from the remaining 315 speakers is used as the training set.
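
As an illustration of the recording schedule and the speaker split described above, the following minimal Python sketch maps a time of day to one of the five recording slots and derives the training speaker count from the stated totals. The names `TIME_SLOTS` and `slot_index` are hypothetical, and treating both slot boundaries as inclusive is an assumption; none of this reflects the actual layout of the metadata shipped with the dataset.

```python
from datetime import time

# The five recording slots listed in the dataset description.
# Assumption: both the start and the end of each slot are inclusive.
TIME_SLOTS = [
    (time(6, 0), time(8, 0)),
    (time(9, 0), time(11, 0)),
    (time(12, 0), time(14, 0)),
    (time(17, 0), time(19, 0)),
    (time(20, 0), time(22, 0)),
]


def slot_index(t):
    """Return the 0-based index of the slot containing t, or None if t is outside all slots."""
    for i, (start, end) in enumerate(TIME_SLOTS):
        if start <= t <= end:
            return i
    return None


# Speaker counts taken from the description.
TOTAL_SPEAKERS = 373
TEST_SPEAKERS = 58
TRAIN_SPEAKERS = TOTAL_SPEAKERS - TEST_SPEAKERS  # remaining speakers form the training set
```

For example, `slot_index(time(6, 30))` falls in the first slot, while `slot_index(time(8, 30))` returns `None` because that time lies between two slots.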

For more details on the evaluation, please visit: https://github.com/qinxiaoyi/TimeVarying_ASV.

If you use the dataset, please cite it using the following BibTeX entry:

@ARTICLE{10599875,
  author={Qin, Xiaoyi and Li, Na and Duan, Shufei and Li, Ming},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={Investigating Long-Term and Short-Term Time-Varying Speaker Verification}, 
  year={2024},
  volume={32},
  number={},
  pages={3408-3423},
  keywords={Task analysis;Aging;Time-varying systems;Videos;Recording;Face recognition;Databases;Cross-age;reinforcement learning;speaker verification;template updating;time-varying},
  doi={10.1109/TASLP.2024.3428910}}

External URL: https://drive.google.com/file/d/1cOGX6zptbfMbdYcpWYVrt9LTXvL0Y47A/view?usp=drive_link SMIIP-TV dataset