CHILDES German Szagun Corpus


Gisela Szagun
Institut für Psychologie
University of Oldenburg

Participants: 22
Type of Study: naturalistic
Location: Germany
Media type: audio
DOI: doi:10.21415/T5KG7T

Browsable transcripts

Download transcripts

Link to media folder

Citation information

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

This data set comprises two large corpora of German child language: (1) a corpus of 22 typically developing (TD) children with 212 data files, each containing 2 hours of spontaneous speech of child and adult, i.e. 424 hours altogether; (2) a corpus of 22 deaf children with cochlear implants (CI) with 210 data files, each containing 1½ hours, i.e. 315 hours altogether. Both studies present longitudinal data on early language development in these two groups. Besides child speech, these corpora present a comprehensive sampling of child-directed adult speech.
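The totals above follow directly from the file counts and session lengths. As a minimal check in Python (the dictionary layout and the names `corpora` and `total_hours` are illustrative and not part of the corpus; only the numbers come from the description):

```python
# Illustrative check of the recording totals stated in the description.
# Names are hypothetical; file counts and session lengths are from the text.
corpora = {
    "TD": {"files": 212, "hours_per_file": 2.0},
    "CI": {"files": 210, "hours_per_file": 1.5},
}


def total_hours(corpus):
    """Total recorded hours = number of files x hours per file."""
    return corpus["files"] * corpus["hours_per_file"]


totals = {name: total_hours(c) for name, c in corpora.items()}
```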

Naturalistic Setting

Recordings took place during free play sessions. The TD children were recorded in a large playroom at the Department of Psychology, Carl-von-Ossietzky University of Oldenburg, and the CI children in a smaller playroom at Cochlear Implant Centrum (CIC) Wilhelm Hirte, Hannover. In both locations there were varied sets of toys, e.g. cars with a garage and parking garage, zoo animals, farm animals, forest animals, a school with children and teachers, a doll’s house, picture books, puzzles, a medical kit, a fire station, a shop and other sets. A parent or investigator played with the child.

Typically developing children (TD)

All children were recorded between 1;4 and 2;10, and a subgroup between 1;4 and 3;8. All child and adult speech has been transcribed, i.e. the total of 424 hours. The TD children also served as a control group for the CI children.
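Ages here are written in the CHILDES years;months notation, so 1;4 means 1 year and 4 months. The following is a minimal sketch of converting such strings to months, assuming that an optional day component (as in 2;10.15) can be ignored. The function name `age_to_months` is hypothetical and not part of any CHILDES tool.

```python
def age_to_months(age: str) -> int:
    """Convert a 'years;months' age string (e.g. '1;4') to whole months."""
    years, _, rest = age.partition(";")
    # Drop an optional '.days' suffix; an empty months field counts as 0.
    months = rest.split(".")[0]
    return int(years) * 12 + (int(months) if months else 0)


# Length in months of the recording period shared by all TD children.
td_span = age_to_months("2;10") - age_to_months("1;4")
```

The same conversion applies to the hearing ages reported for the CI children.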

Children with cochlear implants (CI)

The 22 children with CI were deaf before the onset of language. All were implanted before 4 years of age. They were matched with the 22 TD children for initial language level using number of words and MLU (mean length of utterance). Data points for the CI children are given as hearing ages. All children were recorded between hearing ages 0;5 and 1;11. After this period, data collection continued for all 22 children, but for varying lengths of time, between hearing ages 2;4 and 3;6. Altogether, there are 210 data files, i.e. a total of 315 hours.

Complete transcriptions

Due to the (unexpected) wealth of data, it took several years to supply a complete transcription of all child and adult speech. When these corpora were initially added to TalkBank, only child speech, but not all adult speech, had been transcribed. A previous update in 2023 presented complete transcriptions for the TD children. With this final update in 2026, complete transcriptions for both groups, TD and CI, are presented. Thus, all speech has been transcribed, with the exception of a few data points in the CI corpus. These exceptions are marked at the beginning of the respective transcripts.

Audio files are available for the majority of files and have been linked.

Technical problems and missing recordings

Some of the early audio files are not available. This is mainly because digital recording equipment was not available to us at the start of data collection for the research project. However, technical problems also occurred and led to the loss of some audio recordings.

Acknowledgements

The research was funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grants Sz 41/5-1 and Sz 41/5-2. The University of Oldenburg invested considerably in making the building structures child-safe and suitable.