TalkBank Creating a Child Language Corpus

This page provides guidelines for researchers and parents interested in creating a new child language corpus. In many cases, this will be a longitudinal case study of a single child in the home. However, these guidelines can also be applied with minimal changes to a corpus collected from several children.


Recordings can use audio, video, or a mix of the two. Suggestions for equipment and methods for audio recording can be found here, and suggestions for equipment and methods for video recording can be found here.


With modern audio equipment and cheap media storage, it is possible to record up to 24 hours a day. However, you would be hard-pressed to transcribe all that material. Given this limitation, it is usually best to record regularly during periods when the child is maximally active and talkative. At the same time, it is good to have samples across activities, such as dinner time, bath time, peer play time, book reading time, and game time. The more frequently you record during these high-activity times, the better.

For example, in the dense corpora collected by the MPI in Leipzig and Manchester, children were recorded for two hours each day for a week; no recording was done for the next three weeks, and then the next dense recording week began. This gives a clear snapshot of the week of dense recording, although it leaves open what happens during the other weeks.

If you are recording with video, it may become impractical to store so much material. You can reduce the storage burden by compressing to a good H.264 format as you go, or by mixing video with audio recording. You may want to begin each recording with a spoken statement of the date and the place where the recording is being made.
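The dense sampling pattern described above (one week of daily recording followed by three weeks without recording) can be laid out as a calendar in advance. The following is a minimal sketch in Python, not a TalkBank or CLAN tool; the function name dense_schedule and its parameters are hypothetical, and the defaults of 7 recording days and 21 rest days are taken only from the example in the text.

```python
from datetime import date, timedelta


def dense_schedule(start: date, cycles: int,
                   on_days: int = 7, off_days: int = 21) -> list[date]:
    """Return the recording dates for a dense sampling plan.

    Assumption: each cycle is `on_days` consecutive recording days
    followed by `off_days` days with no recording (7 and 21 by default,
    following the one-week-on, three-weeks-off example in the text).
    """
    cycle_length = on_days + off_days
    recording_days = []
    for cycle in range(cycles):
        # First recording day of this cycle.
        first = start + timedelta(days=cycle * cycle_length)
        # Add every recording day in the dense week.
        recording_days.extend(first + timedelta(days=i) for i in range(on_days))
    return recording_days


if __name__ == "__main__":
    for day in dense_schedule(date(2024, 1, 1), cycles=2):
        print(day.isoformat())
```

With the defaults, two cycles yield 14 recording dates: seven consecutive days starting on the chosen date, then seven more starting 28 days later. Adjust on_days and off_days to match whatever schedule you can realistically transcribe.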

File naming:

It is best to keep the names of your transcript and media files simple. If you are studying a single child, then the best format uses the child's age as the identifier for each transcript, as in 20112.cha for a session during which the child was 2;1.12 (two years, one month, 12 days). The corresponding media file should have the same base name, as in 20112.wav or 20112.mp4.
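The age-based naming scheme above can be generated automatically to avoid typing errors. The following is a minimal sketch in Python, not part of CLAN; the function names are hypothetical. It assumes, based only on the single example 20112 for 2;1.12, that the name consists of the years followed by the months and days, each zero-padded to two digits.

```python
import re


def age_to_basename(age: str) -> str:
    """Convert an age written as years;months.days (e.g. '2;1.12')
    into a file base name (e.g. '20112').

    Assumption: months and days are each zero-padded to two digits,
    as inferred from the example 20112 for 2;1.12.
    """
    match = re.fullmatch(r"(\d+);(\d{1,2})\.(\d{1,2})", age.strip())
    if match is None:
        raise ValueError(f"age must look like '2;1.12', got {age!r}")
    years, months, days = (int(group) for group in match.groups())
    if months > 11 or days > 31:
        raise ValueError(f"months or days out of range in {age!r}")
    return f"{years}{months:02d}{days:02d}"


def session_filenames(age: str, media_ext: str = "wav") -> tuple[str, str]:
    """Return matching transcript and media file names for one session."""
    base = age_to_basename(age)
    return f"{base}.cha", f"{base}.{media_ext}"


if __name__ == "__main__":
    print(session_filenames("2;1.12"))
    print(session_filenames("2;1.12", media_ext="mp4"))
```

Deriving both names from the same base keeps each transcript and its media file paired, which is the point of the convention.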


To create transcriptions, you should use the CLAN editor, as described in the CLAN manual downloadable from the web.


You can rely on the CLAN programs for analysis. In some cases, these programs will help you by sending data to Excel or R for further analysis.