Funding
European Language Equality (ELE) 2 Open Call for SRIA Contribution Projects
The project aims to study the usability of open data for building speech datasets that cover types of voices usually missing or underrepresented in existing speech datasets. We will conduct a case study on the Romanian language, with the possibility of applying the same methodology to any other language. We will survey existing multimedia open data and record the platforms, the types of media, the percentage of usable voices in a data sample, the types of open licenses, the types of underrepresented voices (including children, young people, older people, and women), and the percentage of underrepresented voices. To validate our methodology, we will build a pilot dataset of Romanian underrepresented voices aligned with the corresponding textual representation.
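As an illustration of the kind of statistics described above, the following minimal sketch computes the percentage of usable voices in a sample and the percentage of underrepresented voices among them. The record fields (`usable`, `voice_group`, `license`), the sample values, and the choice of denominator are assumptions made for this example and are not taken from the project.

```python
from collections import Counter

# Hypothetical sample of surveyed recordings; all field names and values
# are illustrative assumptions, not project data.
SAMPLE = [
    {"license": "CC BY", "usable": True, "voice_group": "child"},
    {"license": "CC BY", "usable": True, "voice_group": "adult_male"},
    {"license": "CC0", "usable": True, "voice_group": "older_person"},
    {"license": "CC BY-SA", "usable": False, "voice_group": "adult_male"},
]


def percent(part, total):
    """Return part/total as a percentage, or 0.0 for an empty total."""
    return 100.0 * part / total if total else 0.0


def summarize(records, underrepresented_groups):
    """Summarize a data sample.

    Assumption: the percentage of underrepresented voices is computed
    relative to the usable recordings, not to the whole sample.
    """
    usable = [r for r in records if r["usable"]]
    under = [r for r in usable if r["voice_group"] in underrepresented_groups]
    return {
        "percent_usable": percent(len(usable), len(records)),
        "percent_underrepresented": percent(len(under), len(usable)),
        "licenses": Counter(r["license"] for r in usable),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE, {"child", "older_person"}))
```

Counting licenses only over usable recordings is likewise a design choice for this sketch; a survey could just as well report them over the whole sample.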
The GitHub Repository contains scripts used for segmenting the original files, metadata creation, and internal utilities.
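For readers unfamiliar with audio segmentation, the sketch below shows one simple way to cut a time span out of a WAV file using only the Python standard library. It is not the code from the repository: the function name `cut_segment`, its parameters, and the assumption that the input is an uncompressed WAV file are all hypothetical.

```python
import wave


def cut_segment(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) to a new WAV file.

    Hypothetical helper for illustration; returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert times to frame indices and clamp them to the file bounds.
        start = min(total, max(0, int(start_s * rate)))
        end = min(total, max(start, int(end_s * rate)))
        src.setpos(start)
        frames = src.readframes(end - start)
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width, and frame rate from the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
    return end - start
```

A call such as `cut_segment("input.wav", "segment_001.wav", 12.5, 17.0)` would write the 4.5 seconds starting at 12.5 s to a new file; the file names here are placeholders.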
The USPDATRO Dataset is available on Zenodo, ELG, and RELATE:
This is the first version of the Annotation Guide, created before the annotation process started. An updated version of the Annotation Guide is available as an appendix in the final project report.
dr. Vasile Păiș