CHILDES English Thomas Corpus
Jeannine Goh
MPI Child Study Centre
University of Manchester
jeannine.goh@manchester.ac.uk

Elena Lieven
MPI-Leipzig
Manchester University
lieven@eva.mpg.de
| Participants: | 1 |
| Type of Study: | longitudinal, naturalistic |
| Location: | England |
| Media type: | audio |
| DOI: | doi:10.21415/T5JG64 |
Publications using these data should cite:
Lieven, E., Salomo, D. & Tomasello, M.
(2009). Two-year-old children’s production of multiword utterances:
A usage-based analysis. Cognitive Linguistics, 20, 3, 481-508.
In accordance with TalkBank rules, any use of data from this corpus
must be accompanied by at least one of the above references.
Other publications based on the use of these data include:
Maslen, R., Theakston, A., Lieven, E. & Tomasello, M. (2004). A Dense
Corpus Study of Past Tense and Plural Overregularization in English.
Journal of Speech, Language and Hearing Research, 47, 1319-1333.
Dąbrowska, E. & Lieven, E. (2005). Towards a lexically specific
grammar of children’s question constructions. Cognitive Linguistics,
16, 3, 437-474.
Lieven, E. (2006). Producing multiword utterances. In B. Kelly & E.
Clark (eds.), Constructions in Acquisition. Stanford, CA: CSLI
Publications, pp. 83-110.
Cameron-Faulkner, T., Lieven, E. & Theakston, A. (2007). What part of
no do children not understand? A usage-based account of multiword
negation. Journal of Child Language, 34, 251-282.
Chang, F., Lieven, E. & Tomasello, M. (2008). Automatic evaluation of
syntactic learners in typologically-different languages. Cognitive
Systems Research, 9 (3), 198-213.
Bannard, C. & Lieven, E. (2009). Repetition and Reuse in Child
Language Learning. In Roberta Corrigan, Edith Moravcsik, Hamid Ouali,
Kathleen Wheatley (eds.), Formulaic Language: Volume II: Acquisition,
Loss, Psychological Reality, Functional Explanations. Amsterdam:
John Benjamins, pp. 297-321.
Bannard, C. & Matthews, D. (2008). Stored word sequences in language
learning: The effect of familiarity on children's repetition of
four-word sequences. Psychological Science, 19 (3), 241-248.
Ph.D. dissertations (largely based on these data): Cameron-Faulkner,
Maslen, Kirjavainen
Project Description
This corpus contains the data from a longitudinal naturalistic study
of one child over a period of three years. The child is called Thomas.
He was born 03-APR-1997 into a middle-class family. His primary
care-giver is his mother. This large dataset is best considered in
three sections (Sections A, B, C). Section A differs from B and C in
the frequency of recordings, and Section C differs from A and B in its
use of an updated transcription and morphosyntactic coding system. More
details of these differences are given below.
THE FREQUENCY OF DATA
Section A (Thomas aged 2-00-12 to 3-02-12) A VERY INTENSIVE PERIOD
Thomas is recorded for one hour, five times a week, every week for
the entire period. One of the five recordings each week is a video.
There are 279 scripts and 49 videos.
Section B (Thomas aged 3-03-02 to 3-11-06) AN INTENSIVE PERIOD
Thomas is recorded for one hour, one week in every month. During
this week there are five recordings, one of which is a video. There are
43 scripts and 12 videos.
Section C (Thomas aged 4-00-02 to 4-11-20) AN INTENSIVE PERIOD
Thomas is recorded for one hour, one week in every month. During
this week there are five recordings, one of which is a video. There are
57 scripts and 12 videos.
Procedure
Over the three-year period the audio of a total of 379 sessions was
recorded using a standard Sony mini-disc recorder and Sennheiser
evolution radio microphones. The microphones were positioned around the
downstairs of the house, allowing Thomas to move freely during his play
whilst still capturing his speech. For 73 of these recordings a video
recording was also made using a standard video camera. These videos
are now in DVD format, but permission was not gained for submission to
the CHILDES database. All of the audio recordings took place in Thomas’s
home, where he was engaged in normal play activities with his mother. In
most of the video recordings the investigator is also present and is
engaged in play with Thomas. The videos were mainly recorded in
Thomas’s home, although a number were recorded in the laboratory at the
Max Planck Child Study Centre at the University of Manchester. Most of
the recordings are 60 minutes long.
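As a quick consistency check, the per-section counts given under THE
FREQUENCY OF DATA can be summed and compared with the totals quoted in
the Procedure paragraph. The sketch below is illustrative only: the
dictionary layout and the function name are assumptions made for this
example, not something distributed with the corpus.

```python
# Counts copied from the section descriptions in this document; the data
# structure itself is an illustrative assumption, not a corpus file.
SECTIONS = {
    "A": {"scripts": 279, "videos": 49},
    "B": {"scripts": 43, "videos": 12},
    "C": {"scripts": 57, "videos": 12},
}


def corpus_totals(sections):
    """Return (total scripts, total videos) summed over all sections."""
    scripts = sum(s["scripts"] for s in sections.values())
    videos = sum(s["videos"] for s in sections.values())
    return scripts, videos


if __name__ == "__main__":
    # Matches the 379 sessions and 73 video recordings stated above.
    print(corpus_totals(SECTIONS))  # (379, 73)
```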
Known inconsistencies in the data
The corpus was gathered over a number of years, during which time CLAN
was updated, the experience of the transcribers increased, transcribers
came and went, and problems were identified and rectified along the way.
This has inevitably led to some inconsistencies in transcription, some
of which are listed below.
- A decision was made after Thomas A that pluses (+) should only be
used with compound nouns (e.g. fire+engine, washing+machine,
fishing+rod, snip+snip@f, quack+quack@f, etc.) and NOT when
transcribing repetitions such as no+no+no+no, jumpity+jump+jump,
wait+wait+wait. Repetitions are instead coded as <no no no> [/] no,
<jumpity jump> [//] jump, <wait wait> [/] wait. During the changeover
the coding of repetition is not always consistent.
- When Thomas was two years old he omitted many words. The transcribers
were asked to mark errors where Thomas missed auxiliaries and where the
missing words made the utterance confusing. The transcribers also marked
overextensions. Some examples are provided below.
| Missing auxiliary: | Mummy 0is [*] come-ing |
| Overextension: | brokened [*] |
| Omissions: | David 0and [*] Sharon |
| | Mummy-0’s [*] watch |
| | Lots of train-0s [*] |
| Confusions: | |
- You will however find no error coding in utterances such as:
  Fallen all down
  Watch postman
  One tree blow off
  tree-s fallen on the leaves all down
  Thomas smell it
Note: The transcribers did initially struggle with error
coding and their use of codes becomes more accurate and consistent as
the study goes on.
- The marker @sc is used to mark schwas. A word is marked as a schwa
whenever the child does not fully pronounce the target word, e.g.
I@sc play. In the early files Thomas tends to use the sounds a or o in
place of prepositions, adverbs, pronouns etc. In most of these cases the
target word is not identifiable, and these are therefore coded as a@sc
and o@sc. Later in the study, as Thomas’s language becomes clearer, the
transcribers try to place the word they think Thomas is trying to say
before the @sc sign, e.g. the@sc. They may also transcribe the actual
sound they hear, e.g. pwosh@sc. The transcribers’ interpretation of @sc
varies, so searches and analyses must be undertaken with this in mind.
Moreover, the way in which the MOR program codes @sc varies, and care
must also be taken when using the MOR line. The sound files are provided
with these data, so the words coded @sc can be listened to again if
required.
- There are some inconsistencies in the codes @c (child invented
form) and @o (onomatopoeia) and @f (family invented form). For
example, miaow@c, miaow@f, miaow@f or mmm@o, mmm@c, mmmm@f.
- The transcribers vary in the way they spell Mrs. It is mostly
transcribed as Mrs, but Misses is also used. Care must be taken not to
confuse this second spelling with the third person verb misses. Mrs and
Misses are, however, always coded with capital letters, and on most
occasions a + joins the title to a name, e.g. Mrs+Platford,
Misses+Platford. Similarly, Mr may also be spelt Mister, although this
cannot be confused with a verb. Other possible spelling variations
are listed below:
• Purdy, Purdie, Purdey (cat)
• Granddad, Grandad
• Beilbie, Bilbey (name)
• Nee+naa@o, nee+naw, nee+nah (sound of police car)
• Play+doh, Play+dough
• Teletubbie, telytubby (television program)
• Incy+wincy+spider, Incey+wincey+spider
• Miaow, miaou, meeiow, meow
- Some common transcription errors:
• whose with who’s
• your with you-’re
• have with of (e.g. might of instead of might have)
• it-’s (verb) with its (poss)
• let-’us with let’s
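Because of the spelling variations listed above, a search for a single
spelling can miss tokens. The following is a minimal sketch of one way
to fold variants together before counting or searching. It is not part
of CLAN: the function name, the choice of the first-listed spelling as
the "canonical" form, and the decision to drop the @ marker are all
assumptions made for illustration.

```python
# Variant groups copied from the list above. Using the first spelling in
# each group as the canonical form is an arbitrary, illustrative choice
# (sound words are lower-cased here because they are not proper names).
VARIANT_GROUPS = [
    ["Purdy", "Purdie", "Purdey"],
    ["Granddad", "Grandad"],
    ["Beilbie", "Bilbey"],
    ["nee+naa", "nee+naw", "nee+nah"],
    ["Play+doh", "Play+dough"],
    ["Teletubbie", "telytubby"],
    ["Incy+wincy+spider", "Incey+wincey+spider"],
    ["miaow", "miaou", "meeiow", "meow"],
]

# Titles are matched case-sensitively so the verb "misses" is left alone.
TITLES = {"Misses": "Mrs", "Mister": "Mr"}

# Case-insensitive lookup from any variant to its canonical spelling.
LOOKUP = {v.lower(): group[0] for group in VARIANT_GROUPS for v in group}


def normalise(token):
    """Return a canonical spelling for a transcript token.

    Any special-form marker (@c, @o, @f, ...) is dropped first, since the
    notes above say these markers are not applied consistently.
    """
    base = token.split("@", 1)[0]
    head, plus, rest = base.partition("+")
    if head in TITLES:
        return TITLES[head] + plus + rest
    return LOOKUP.get(base.lower(), base)
```

For example, normalise("Misses+Platford") gives Mrs+Platford, while the
lower-case verb misses is returned unchanged. Dropping the @ marker
discards information, so this kind of folding is best applied to a copy
of the data used for searching, never to the transcripts themselves.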
More Notes on transcription
Phonological forms: The focus in this study is early grammatical
development and not the specific phonological forms that Thomas uses.
Therefore, unless Thomas uses what appear to be child-specific forms,
the target word is transcribed rather than an approximation of the
child’s phonological form.
Thomas’s early language
- ah+phss@o: expression that Thomas uses to refer to
sleep/sleeping/snoring etc.
- alander@c: lad (he also says “land” instead of “lad”)
- apple: Thomas’s name for Jeannine
- a@sc do: Thomas uses this expression when he wants an
action to be repeated - asking mummy to do something again
- a@sc: he uses it very often in his speech, usually in
the place of pronouns, prepositions and adverbs
- backside@c: the back garden; also used as “back
outside” or “back inside”
- backways@c: backwards
- bang+a+drum+time@c: music lesson
- bee+ba@c: a police car, ambulance or fire engine, real or toy;
sometimes used for other types of cars as well
- Beechy@c: -Dimitra- for a short while
- big splash: bath
- black juice: blackcurrant juice
- Bow: this is how he refers to their cat
- bow@c: for other cats or other animals
- Bow+Wow: a dog - one of Thomas’s toys
- choc+choc@f: chocolate
- choo+choo@f: train
- crane-ing@c: lifting things up - usually using a toy crane to do so
- done and gone: he uses them often in his speech, but it is almost
impossible to tell one from the other, and he may not yet have
distinguished them himself. When transcribing we choose between gone
and done mainly on the basis of context
- doc+doc@c: doctor
- dot+dot@f: uses this expression to refer to little scratches
- Hat: actually sounds something between shat, sat and hat; because of
the “s” sound it has also been transcribed as is@sc Hat or &s Hat.
He uses it in three different ways:
  - hat: to refer to an actual hat
  - Hat: to refer to Dipsy, the green teletubby, which usually wears a hat
  - hat@c: when he wants to say “green”
- mumm+mumm@c: car
- nap+nap@c: nappy
- Nin+Nin: this is what he calls his mother
- nip, nip-s, nip-s@c: nipples
- Noo+Noo or noo+noo@c: this is what he calls vacuum cleaners - Noo+Noo
is the vacuum cleaner on the teletubby show
- pap+pap@c: parrot
- Po: the name of the red teletubby - he uses it:
  - Po: to refer to the teletubby
  - po@c: when he wants to say “red”
  - po@c: to refer to red objects in general
- poo: poo - actual poo
- pooh: smell - either good or bad
- shining@c or shiny@c: has frequently used this expression to refer to
the sun
- snip+snip@o or snip+snip@c: uses this expression to refer to anything
that may make this sound - e.g. scissors, cutting, chopping
- snipsnip+man@c: hairdresser, barber
- o@sc: as a@sc above - he doesn’t use it as often
- quack+quack@f: ducks or birds in general
- ta@d: thanks
- ta@d much (in)deed: thank you very much indeed
- what-’is this: Thomas often says a@sc this, wo’this or wo’dis. The
way we transcribe it is:
*CHI: what-’is this [= actually says wo’dis]
The MOR programme will code “what’s this”.
- Wodar@c: it has also been transcribed in the following ways: a@sc
there, &wo there, wada@c, a@sc dar@c. Thomas tends to use this
expression when he wants to be given something. It is possible that
wodar is used as a different expression from a@sc there or &wo there,
but we do not yet have the contextual information to make any
distinctions
- wow+wow@c: dog
Error Coding
Errors that are coded during transcription are as follows (APP 3:
Error coding more guidelines)
| Missing morphemes | e.g. ‘two dog-0s’, ‘He’s go-0ing’, ‘Mummy-0’s sock’ etc. |
| Case errors | e.g. ‘Her do it’, ‘Me get it’ |
| Missing or incorrect auxiliaries and copulas | e.g. ‘It 0is going there’, ‘I 0am getting a drink’ |
| Word class errors | e.g. double determiners ‘a that one’ |
| Agreement errors | e.g. ‘a bricks’, ‘these penguin’, ‘Does she likes it?’, ‘It don’t go there’ |
| Pronominal errors | e.g. ‘Carry you’ when the child wants to be carried |
| Wrong word | e.g. ‘I put it off’ - where the context indicates ‘take’ is appropriate |
| Overgeneralisation | e.g. ‘it broke-ed’ |
Not all errors are easy to identify. In an utterance such as “what
doing trucks” it is difficult to pinpoint the type of error that has
been made. In such cases an error marker [*] is placed on the main tier
and a question mark on the error line.
When to use an error code
An error code should be used whenever what the child says is
grammatically incorrect. If there is something wrong with the sentence,
you, as the transcriber, need to flag it up using the [*] sign. You
should place the [*] sign straight after the word that is the problem.
If we do not flag up the errors then the researcher may not know what
the child intended to say, for example:
*CHI: me Mummy stopped
You may know from listening to the recording whether Mummy has stopped,
the child has stopped, or Mummy has stopped the child, and perhaps
whether there is an omitted has or had. These are all useful things for
the researcher to know.
If you know there is an error but there is ambiguity surrounding it,
then it is best to use a [?] on the error line. You can use angle
brackets to show whether it is the whole sentence or only some words in
the sentence that you are unsure about:
*CHI: <me Mummy stopped> [*]
%err: [?]
Omitted/missing words
These are generally transcribed correctly, but to revise: a ‘0’ (zero)
is used to indicate that a word has been omitted, and the omitted word
is written immediately after the 0. Words like have and has
(auxiliaries) are commonly omitted, as are parts of words, for
example:
*CHI: I 0have [*] got
%err: 0have=have
*CHI: I am go0ing [*]
%err: go0ing=going
*CHI: I want two sweet0s [*]
%err: sweet0s=sweets
What is written after the ‘0’ is taken out when we run the grammar
program, and what is left behind should read exactly as the child
actually said it. Anything after the 0 is what you have corrected.
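The rule that removing everything after a ‘0’ leaves what the child
actually said can be sketched in a few lines. This is only an
illustration of the convention described above, not the grammar program
itself: the function name and regular expressions are assumptions, and
the sketch assumes that the digit 0 appears in the transcript only as an
omission marker.

```python
import re

# A '0' (optionally preceded by a hyphen, as in "go-0ing") followed by the
# restored material, up to the next space or square bracket.
OMISSION = re.compile(r"-?0[^\s\[\]]+")
# The [*] error marker on the main tier.
ERROR_MARK = re.compile(r"\[\*\]")


def actual_speech(main_tier):
    """Strip restored material and error markers from a main-tier string."""
    text = ERROR_MARK.sub(" ", main_tier)
    text = OMISSION.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    print(actual_speech("I 0have [*] got"))         # I got
    print(actual_speech("I want two sweet0s [*]"))  # I want two sweet
```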
Additions and overextensions
The following is VERY important. If the child has wrongly added an
‘ed’ ending to a word, it should be coded like this:
*CHI: threwed [*] it .
%err: threwed = threw .
If in the next example you are sure that they mean one sweet:
*CHI: I want a sweets [*]
%err: sweets=sweet
If you are not sure if it was one sweet:
*CHI: I want <a sweets> [*]
%err: [?]
More than one error on a line
Any number of errors can be coded on a single %err line as long as
there is one [*] symbol for each error and each coding on the %err line
is separated by a semi-colon.
*CHI: I am go0ing [*] homes [*]
%err: go0ing=going; homes=home
Please note that what is on the left side of the equals sign is what is
in the transcript, and what is on the right side is what it should be.
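The %err conventions above (semi-colon separated codings, transcript
form on the left of the equals sign, target form on the right, one [*]
per coding) lend themselves to a simple automatic check. The sketch
below is a hypothetical helper written for this document, not a CLAN
command; the function names are assumptions.

```python
def parse_err_line(err_line):
    """Split a %err line into (transcribed, target) pairs.

    A coding without an equals sign, such as [?], is returned with a
    target of None.
    """
    body = err_line.split(":", 1)[1] if err_line.startswith("%err:") else err_line
    pairs = []
    for item in body.split(";"):
        # Drop surrounding spaces and any utterance-final full stop.
        item = item.strip().rstrip(".").strip()
        if not item:
            continue
        if "=" in item:
            said, target = (part.strip() for part in item.split("=", 1))
        else:
            said, target = item, None
        pairs.append((said, target))
    return pairs


def errors_match(main_tier, err_line):
    """True if the number of [*] markers equals the number of codings."""
    return main_tier.count("[*]") == len(parse_err_line(err_line))


if __name__ == "__main__":
    print(parse_err_line("%err: go0ing=going; homes=home"))
    # [('go0ing', 'going'), ('homes', 'home')]
```

A check such as errors_match("*CHI: I am go0ing [*] homes [*]",
"%err: go0ing=going; homes=home") can then flag lines where a [*] has no
corresponding coding, or the reverse.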
Using [= actually says]
We use [= actually says ] quite a lot in the transcript. It should
only be used if the child makes a mistake within a word. For example,
the following are fine:
*CHI: hitting [= actually says higging]
*CHI: spaghetti [= actually says getti]
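For researchers who do want to look at the forms Thomas actually
produced, the [= actually says ...] annotations can be pulled out of the
main tier. The following is a minimal sketch under the assumption that
the annotation always follows the target word directly; the function
name and pattern are illustrative and not part of CLAN.

```python
import re

# The target word, then "[= actually says FORM]" immediately after it.
ACTUALLY_SAYS = re.compile(r"(\S+)\s*\[=\s*actually says\s+([^\]]+?)\s*\]")


def actual_forms(main_tier):
    """Return a list of (target word, form actually said) pairs."""
    return ACTUALLY_SAYS.findall(main_tier)


if __name__ == "__main__":
    print(actual_forms("*CHI: hitting [= actually says higging]"))
    # [('hitting', 'higging')]
```

Note that only the single word before the bracket is captured, so for a
multiword target such as what-’is this the pair would contain only the
last word (this).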
Acknowledgements
Funding was supplied by these sources:
The Department of Comparative and Developmental Psychology, Max Planck
Institute for Evolutionary Anthropology, Leipzig, Germany.