TinyVox

Marvin Lavechin
LIS Marseille
Psycholinguistics, MIT
marvinlavechin@gmail.com

Citation

Charlot, T., Kunze, T., Poli, M., Cristia, A., Dupoux, E., & Lavechin, M. (2025). BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings. arXiv preprint arXiv:2509.15001.

Lavechin, M., Bergelson, E., & Levy, R. (2026). BabAR: from phoneme recognition to developmental measures of young children's speech production. arXiv preprint arXiv:2603.05213.

Description

TinyVox is a corpus of more than half a million phonetically transcribed child vocalizations in English, French, Portuguese, German, and Spanish. These forms are all taken from phonological transcriptions in PhonBank. TinyVox has been used to train BabAR, a cross-linguistic phoneme recognition system for child speech available from https://github.com/MarvinLvn/BabAR .

The project is described in the BabAR and BabyHuBERT papers cited and linked above.

The collection of 561,312 one-item .wav audio files is available in this .zip file.

The metadata.csv, train.csv, test.csv, and val.csv files are available in this .zip file.

The 238 GB collection of source audio from which the items were extracted is available on request.
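
As a starting point for working with the split files listed above, the following Python sketch reads each CSV and reports how many rows it contains. It is only an illustration: the function names read_split and summarize_splits are hypothetical, the files are assumed to be UTF-8 encoded with a header row, and no column names are assumed because the column layout is not described here.

```python
import csv
from pathlib import Path


def read_split(csv_path):
    """Return (column_names, rows) for one CSV file.

    Assumes the first line is a header; each row becomes a dict keyed by column.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


def summarize_splits(folder, names=("train.csv", "val.csv", "test.csv")):
    """Map each split file that exists in `folder` to its number of data rows."""
    counts = {}
    for name in names:
        path = Path(folder) / name
        if path.exists():  # skip splits that were not unpacked
            _, rows = read_split(path)
            counts[name] = len(rows)
    return counts
```

Calling summarize_splits on the folder where the .zip file was unpacked returns a dictionary of row counts per split, and read_split on metadata.csv returns its column names, which is a quick way to inspect the layout before writing any column-specific code.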