CHILDES Bulgarian LabLing Corpus

Velka Popova
Laboratory of Applied Linguistics
University of Shumen
v.popova@shu.bg

Dimitar Popov
Laboratory of Applied Linguistics
University of Shumen
labling@shu.bg

Participants:	5, 50, 71
Type of Study:	naturalistic, narrative
Location:	Bulgaria
Media type:	audio
DOI:	doi:10.21415/PHWH-J834

Citation information

Popova, V. (2024). Колекция с българска детска реч в термините на корпусната лингвистика [A Collection of Bulgarian Child Language in Corpus Linguistics Terms]. Институт за български език „Проф. Любомир Андрейчин“, Българска академия на науките. 328-346.

Popova, V. (2021). Български корпус с детска реч на платформата CHILDES [Bulgarian Corpus of Child Speech on the CHILDES Platform] – Рогожникова, Т. М. (ред.) Теория и практика языковой коммуникации: материалы XIII Международной научно-методической конференции / Уфимск. гос. авиац. техн. ун-т; [отв. ред. Т. М. Рогожникова]. – Уфа: РИК УГАТУ, 2021, 136–147 (РИНЦ) ISBN 978-5-4221-1507-5.

Popova, V., Iglikova, R., & Kordov, K. (2021). LABLASS and the BULGARIAN LABLING CORPUS for Teaching Linguistics. Selected papers from the CLARIN Annual Conference 2020. Linköping Electronic Conference Proceedings 180, 208–213. https://doi.org/10.3384/ecp18022

Popova, V., & Popov, D. (2023). Computer-assisted Transcription and Analysis of Bulgarian Child Speech Data using CHILDES and CLAN. Journal of Computational and Applied Linguistics, 1, 66–76. https://doi.org/10.33919/JCAL.23.1.3

Popova, V., Popov, D. (2025). Bulgarian Speech Resources in the CHILDES System. In: Karpov, A., Delić, V. (eds) Speech and Computer. SPECOM 2024. Lecture Notes in Computer Science, vol 15299. Springer, Cham. https://doi.org/10.1007/978-3-031-77961-9_13

Project Description

The main focus of the LabLing research program is the creation of a Bulgarian child language corpus as part of the CHILDES database. LabLing is part of the consortium of the Bulgarian national research infrastructure for resources and technologies for linguistic, cultural and historical heritage, integrated within CLARIN EU and DARIAH EU (CLaDA-BG – https://clada-bg.eu/en). The data will be of great importance for building a national interdisciplinary electronic infrastructure as electronic resources in Bulgarian are integrated and developed. The construction of the LabLing corpus is therefore a priority task of the CLaDA-BG consortium.

The Cyrillic letters Я, Ю, Ъ, Ч, Ш, Щ, Ж, Ц, Й, and Х are assigned the following Latin correspondences: Я – ja, Ю – ju, Ъ – y, Ч – ch, Ш – sh, Щ – sht, Ж – zh, Ц – c, Й – j, Х – x.
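The letter correspondences listed above can be illustrated with a minimal Python sketch. It covers only the letters named in this description; every other character is passed through unchanged, because the full transliteration scheme is not given here. The function name `transliterate`, the handling of upper and lower case, and the reading of the final entry as Cyrillic Х are assumptions made for illustration, not part of the corpus tooling.

```python
# Correspondences stated in the corpus description (Cyrillic -> Latin).
# Treating the last entry as Cyrillic "Х" is an assumption.
LABLING_MAP = {
    "Я": "ja", "Ю": "ju", "Ъ": "y", "Ч": "ch", "Ш": "sh",
    "Щ": "sht", "Ж": "zh", "Ц": "c", "Й": "j", "Х": "x",
}


def transliterate(text: str) -> str:
    """Replace the listed letters; leave all other characters unchanged."""
    out = []
    for ch in text:
        latin = LABLING_MAP.get(ch.upper())
        if latin is None:
            # Letter not covered by the description: keep it as it is.
            out.append(ch)
        elif ch.isupper():
            # Assumption: an uppercase source letter yields a capitalized result.
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("щ ж ц"))
```

Because unlisted letters are left as they are, a full Bulgarian word run through this sketch will come out in mixed script; a complete table would be needed for real transcripts.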

Longitudinal Corpus

The LabLing corpus includes two segments: the longitudinal corpus and the narrative corpus. The longitudinal corpus contains the transcribed data of 5 Bulgarian girls – ALE, TEF, BOG, SIM, and ELI. ALE was born 29-JAN-1989, BOG was born 23-JUN-2000, ELI was born 12-APR-2004, SIM was born 19-DEC-2018, and TEF was born 29-NOV-2000.

The children were born and live in northeastern Bulgaria (Shumen and Varna). They were recorded in everyday situations (playing games, dressing, eating, going to sleep, looking through children's picture books, free play with the mother, the father, or other children, reading a book, and others) during daily interaction with their relatives. All individuals registered in the database as participants in the dialogues are monolingual native speakers of Bulgarian. The adults in the children's surroundings have secondary or higher (university) education. The audio recordings of two of the children (ALE and TEF) were made by the LabLing research team, and those of BOG, SIM, and ELI by their mothers. The material was digitized and transcribed by members of the research team.

Narrative Corpus

The narrative corpus consists of two segments. The first uses the fox and cat stories and the second uses the birds and dogs stories.

Fox-Cat Collection

The fox-cat collection contains 91 transcripts of children's narratives elicited from 50 monolingual children (native speakers of Bulgarian). The children were recorded with an audio recorder in several kindergartens in Shumen and Varna (northeastern Bulgaria) and, in a few cases, at home or in the street. They are grouped into 3 age groups:

The first group includes 21 children aged 3-4 years – 36 narratives (21 of which without audio, 15 with both audio and transcripts);
The second group includes 23 children aged 4-5 years – 43 narratives (10 of which without audio, 33 with both audio and transcripts);
The third group includes 6 children aged 5-6 years – 12 narratives (with both audio and transcripts).

The corpus is based on 2 picture stories, each consisting of 6 black-and-white illustrations without text: the Cat Story (Hickmann 2002) and the Fox Story (developed by the ZAS Berlin research team headed by D. Bittner and first published in Gülzow & Gagarina 2007). Future work will use the Baby Birds Story and the Dogs Story from the ZAS MAIN study in the CHILDES Biling folder.

Dog-Birds Collection

The second collection uses the Baby Birds Story and the Dogs Story from the ZAS MAIN study. It contains narratives from 71 children, recorded with an audio recorder in kindergartens, at home, and in the street in Shumen, Razgrad, Varna, Loznitsa, and Burgas. The children are grouped into 5 age groups:

The first group includes 6 children aged 3-4 years – 12 narratives (with both audio and transcripts);
The second group includes 4 children aged 4-5 years – 8 narratives (with both audio and transcripts);
The third group includes 21 children aged 5-6 years – 42 narratives (with both audio and transcripts);
The fourth group includes 27 children aged 6-7 years – 54 narratives (with both audio and transcripts);
The fifth group includes 13 children aged 7-8 years – 26 narratives (1 of which without audio, 25 with both audio and transcripts).
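
The age-group figures for the two narrative collections can be cross-checked with a short Python sketch. The per-group numbers are copied from the lists in this description; the dictionary names and the `totals` helper are hypothetical and only serve the illustration.

```python
# Per age group: (children, narratives, narratives without audio).
# Figures are taken from the group lists in this description.
FOX_CAT = {
    "3-4": (21, 36, 21),
    "4-5": (23, 43, 10),
    "5-6": (6, 12, 0),
}
DOG_BIRDS = {
    "3-4": (6, 12, 0),
    "4-5": (4, 8, 0),
    "5-6": (21, 42, 0),
    "6-7": (27, 54, 0),
    "7-8": (13, 26, 1),
}


def totals(groups):
    """Sum children, narratives, and audio-less narratives over all groups."""
    children = sum(c for c, _, _ in groups.values())
    narratives = sum(n for _, n, _ in groups.values())
    no_audio = sum(m for _, _, m in groups.values())
    return children, narratives, no_audio


if __name__ == "__main__":
    print("Fox-Cat:", totals(FOX_CAT))
    print("Dog-Birds:", totals(DOG_BIRDS))
```

The sums agree with the stated totals of 50 children and 91 narratives for the fox-cat collection and 71 children for the dog-birds collection. The dog-birds narrative total (142) is derived from the group figures and is not stated separately in this description.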