TinyVox


Marvin Lavechin
LIS Marseille
Psycholinguistics, MIT

Citation

Description

TinyVox is a corpus of more than half a million phonetically transcribed child vocalizations in English, French, Portuguese, German, and Spanish. All of these forms are taken from phonological transcriptions in PhonBank. TinyVox has been used to train BabAR, a cross-linguistic phoneme recognition system for child speech, available from https://github.com/MarvinLvn/BabAR.

The project is described in the BabAR and BabyHuBERT papers cited and linked above.

The collection of 561,312 one-item .wav audio files is available in this .zip file.

The metadata.csv, train.csv, test.csv, and val.csv files are available in this .zip file.
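
As a quick sanity check after unpacking the CSV archive, the split files can be inspected with the Python standard library. The following is only a minimal sketch: the folder name tinyvox_metadata and the helper count_split_rows are hypothetical, and the code assumes that each CSV file starts with a header row. It makes no assumption about column names, which are not documented here.

```python
import csv
from pathlib import Path

# Split file names as listed above; the directory layout is an assumption.
SPLIT_FILES = ("train.csv", "val.csv", "test.csv")


def count_split_rows(data_dir):
    """Return {file name: number of data rows} for each split file found in data_dir."""
    counts = {}
    for name in SPLIT_FILES:
        path = Path(data_dir) / name
        if not path.exists():
            continue  # skip splits that have not been unpacked
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # assumes the first row is a header
            counts[name] = sum(1 for _ in reader)
    return counts


if __name__ == "__main__":
    # "tinyvox_metadata" is a hypothetical folder name for the unpacked archive.
    print(count_split_rows("tinyvox_metadata"))
```

Comparing the sum of the three counts with the number of .wav files is one way to confirm that the download and extraction completed, assuming each row describes one audio item.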

The 238 GB collection of audio from which the items were extracted is available on request.