Lisa Pearl, University of California, Irvine
Jon Sprouse, University of Connecticut, Storrs
If using these corpora in published materials, please cite one or more of the following:
Pearl, L. & Sprouse, J. 2013. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20, 23-68.
Pearl, L. & Sprouse, J. 2013. Computational models of acquisition for islands. In J. Sprouse & N. Hornstein (eds), Experimental syntax and islands effects. Cambridge University Press, 109-131.
Pearl, L. & Sprouse, J. 2019. Comparing solutions to the linking problem using an integrated quantitative framework of language acquisition. Language, 95(4), 583-611. lingbuzz: https://ling.auf.net/lingbuzz/003913.
CHILDES derived corpus: CHILDES Treebank files
(annotated with phrase structure tree information)
Created with funding from NSF grant BCS-0843896, “Testing the Universal Grammar Hypothesis”, and NSF grant BCS-1347028, “An integrated theory of syntactic acquisition — Realistic input, quantitatively defined target states, and computational models of the learning strategy”.
I. OVERVIEW
The CHILDES Treebank corpus is derived from several corpora from the American English section of CHILDES (MacWhinney, B. 2000. The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Lawrence Erlbaum Associates; http://childes.psy.cmu.edu/). The goal was to annotate child-directed speech utterance transcriptions with phrase structure tree information. All of the materials are available in this .zip file. For each corpus included, the following was done:
We have additionally benefited from some error-checking from the following researchers:
PLEASE NOTE: The hope was that this process would remove the errors resulting from the automatic parsing process. However, errors may remain, and we strongly suggest that users of these data review automated extraction results to make sure they are accurate. If you do find syntactic annotation errors, please email Lisa Pearl (lpearl@uci.edu) with what they are and where you found them - we'll happily update the files and release the updates in the next version.
II. CURRENT CORPORA
The corpora currently included in the CHILDES Treebank are as follows.
All child-directed speech utterances:
NOTE: +animacy+theta = includes trace, animacy, and thematic role annotation (detailed below).
Only child-directed speech utterances containing wh-words:
Original corpora references:
Bates, E., Bretherton, I., & Snyder, L. (1988). From first words to grammar: Individual differences and dissociable mechanisms. Cambridge, MA: Cambridge University Press.
Bernstein-Ratner, N. (1987). The phonology of parent child speech. In K. Nelson & A. Van Kleeck (Eds.), Children's language: Vol. 6. Hillsdale, NJ: Lawrence Erlbaum.
Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.
Snow, C.E., & Dickinson, D.K. 1990. Social sources of narrative skills at home and at school. First Language, 10, 87-103.
Soderstrom, M., Blossom, M., Foygel, R., & Morgan, J.L. (2008). Acoustical cues and grammatical units in speech to two preverbal infants. Journal of Child Language, 35(4), 869-902.
Suppes, P. (1974). The semantics of children's language. American Psychologist, 29, 103-114.
Valian, V. (1991). Syntactic subjects in the early speech of American and Italian children. Cognition, 40, 21-81.
Van Houten, L. (1986). Role of maternal input in the acquisition process: The communicative strategies of adolescent and older mothers with their language learning children. Paper presented at the Boston University Conference on Language Development, Boston.
Corpus statistics, with number of children in [] and age ranges in ()
All child-directed speech utterances:
**Note: Word counts are approximate**
Only child-directed speech utterances containing wh-words:
Total utterances for all corpora: 201,446
Total words for all corpora: ~964,212
Note: Utterance and approximate word counts retrievable using the Tregex query:
/^\w.*/ <: (/.+/ !< /.+/)
Word counts are approximate, since clitics like ’s and n’t are under their own leaf, and so get counted as separate words.
Many thanks to Avery Andrews for this.
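As a concrete illustration of the clitic caveat, here is a minimal Python sketch of the same leaf-counting idea the Tregex query expresses. The tree below is an invented toy example, not taken from the corpus:

```python
import re

# Toy tree (invented, not from the corpus): "Who's that?" with the clitic
# 's under its own leaf, as in the Treebank annotation.
tree = "(SBARQ (WHNP (WP Who)) (SQ (VBZ 's) (NP (DT that))) (. ?))"

def count_leaves(tree_str):
    """Count leaf tokens in a bracketed tree: any symbol immediately
    followed by a closing parenthesis, i.e., the terminal in '(POS word)'."""
    tokens = re.findall(r"\(|\)|[^\s()]+", tree_str)
    return sum(
        1
        for i, tok in enumerate(tokens[:-1])
        if tok not in "()" and tokens[i + 1] == ")"
    )
```

Here `count_leaves(tree)` returns 4: the clitic ’s sits under its own leaf, so the two orthographic words of “Who’s that?” (plus punctuation) come out as four leaves, which is why the word counts above are approximate.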
III. PHRASE STRUCTURE ANNOTATION
The phrase structure tree annotation is similar to the Penn Treebank II notation, described in "Bracketing Guidelines for Treebank II Style Penn Treebank Project" (see AnnotationLabels.html).
There are a few exceptions, intended to make the labeling more useful to syntactic acquisition researchers:
IV. TRACE ANNOTATION
To aid syntactic acquisition researchers, we have begun adding trace annotation. The trace notation is similar in format to the Switchboard corpus, but uses different categories of traces.
The first distinction is between A-bar traces and A traces, with each of these subdivided into additional categories.
(1) ABAR traces:
(a) WH = wh-traces (e.g., “How do you feel __?”)
(b) RC = relative clause traces (e.g., “This is the one which I want __.”)
(c) OTHER = other A-bar traces, such as topicalization (e.g., “When I laugh, I feel good __.” “Smart though Jack is __1, he often does silly things __2.”)
(2) A traces:
(a) PASS = passive (e.g., “Jack was kissed __ by Lily.”)
(b) RAISE = raising (e.g., “Lily seems __ to be happy.”)
Some examples of the specific trace notation are available in this Word file.
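Because the trace notation follows the Switchboard format, a search script can target the null-element nodes directly. The sketch below is illustrative only: it assumes traces appear as PTB/Switchboard-style (-NONE- *T*-n) leaves co-indexed with a filler phrase, which may not match the corpus's actual labels; check the notation examples for the real inventory.

```python
import re

# Invented toy tree with a wh-trace in (roughly) Switchboard format;
# the actual CHILDES Treebank trace labels may differ.
wh_tree = ("(SBARQ (WHADVP-1 (WRB How)) (SQ (VBP do) (NP (PRP you)) "
           "(VP (VB feel) (ADVP (-NONE- *T*-1)))) (. ?))")

def find_traces(tree_str):
    """Return the indices of null-element traces in a bracketed tree."""
    return re.findall(r"\(-NONE-\s+\*T\*-(\d+)\)", tree_str)

def find_fillers(tree_str):
    """Return (phrase label, index) pairs for co-indexed filler phrases."""
    return re.findall(r"\((\w+)-(\d+)\s", tree_str)
```

For this toy tree, `find_traces(wh_tree)` returns ['1'] and `find_fillers(wh_tree)` returns [('WHADVP', '1')], linking the trace site back to “How”.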
V. ANIMACY ANNOTATION
For corpora with animacy annotation, this is added onto the basic form of the phrase structure trees that already have trace annotation.
Because this is non-linguistic (i.e., conceptual) information, we surround the annotations with angle brackets <…>.
There are two distinctions:
+Animate: -
-Animate: -
Notes:
(a) This label should be attached to the NP phrase level.
(b) For both ABAR and A traces, animacy is marked on the overt position of the NP (not where the trace is).
(c) Animacy is labeled for all NPs associated with verbs. (Others may not be labeled.)
(d) Animacy is labeled with respect to the object itself (is it capable of agency or self-initiated motion, does it have feelings, etc.), no matter what verb is associated with it. So, for “The dolly likes her coat”, “the dolly” is inanimate.
Some examples of the specific animacy notation are also available in this Word file.
VI. THEMATIC ROLE ANNOTATION
Corpora with thematic role annotation have this information added on top of the trace and animacy information. Because this is non-linguistic (i.e., conceptual) information, we surround the annotations with angle brackets <…>. Each thematic role label also indicates which verb it is associated with (the Vx index in the labels below), with the verbs themselves marked correspondingly. There are 13 roles, with some approximate heuristics associated with them:
(1) Agent: -AGENT-Vx
—> Viewed as bringing about the event (and the event is action-like)
(2) Patient: -PATIENT-Vx
—> Viewed as undergoing the event passively (ex: “penguin” in “I hugged the penguin”)
(3) Theme: -THEME-Vx
—> Typically the subject of intransitive verbs, or paired with a Possessor. (ex: “The ball rolled.” “He’s coming.” “You’re going.” “He’s sitting.”; Possessor ex: “This belongs to Lisa”: Theme = “this”, Possessor = “Lisa”; “I have a cookie”: Theme = “cookie”, Possessor = “I”)
(4) Causer: -CAUSER-Vx
—> Causes a perception in an Experiencer (“The economy” in “The economy worries him.”) or is an actual causative subject (“we” in “we made him go” and “we let him go”)
(5) Causee: -CAUSEE-Vx
—> What is caused, typically an event (“him go” in “We let him go”)
(6) Experiencer: -EXPER-Vx
—> Experiences the psych/mental state (“him” in “The economy worries him.”)
(7) Subject Matter: -SUBJMATT-Vx
—> The content of the psych/mental state (“the economy” in “Jon worries about the economy”)
(8) Location: -LOC-Vx
—> Where the event occurs, with no movement (ex: “on the table” in “The book rests on the table.”)
(9) Source: -SOURCE-Vx
—> Where the event comes from (ex: “from Lisa” in “This came from Lisa.”)
(10) Goal: -GOAL-Vx
—> Where the event goes to (also recipients) (ex: “to LA” in “I sent this to LA”, “Lisa” in “I gave a cookie to Lisa”)
(11) Benefactor: -BENEF-Vx
—> Benefits from the event (ex: “Lisa” in “I washed the car for Lisa.”)
(12) Instrument: -INSTR-Vx
—> Instrument used to accomplish the event (ex: “a push” in “I closed it with a push.”)
(13) Possessor: -POSSESS-Vx
—> Typically occurs with stative verbs (ex: “Lisa” in “This belongs to Lisa.”)
Notes about thematic roles:
Some examples of the specific thematic role notation are also available in this Word file.
VII. TOOLS
A very useful tool for searching these corpora automatically is the Stanford Natural Language Processing Group’s freely available Tregex tool, available at
http://nlp.stanford.edu/software/tregex.shtml#Download
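Because the animacy and thematic role labels are conceptual rather than syntactic, they are set off in angle brackets, which makes them easy to strip or harvest with simple pattern matching before handing trees to a tool like Tregex. The sketch below is a hedged illustration: the exact surface form of the annotations (e.g., <+Animate>, <-AGENT-V1>, and <V1> on the verb) is an assumption here, so consult the annotated files and the Word file examples for the real rendering.

```python
import re

# Invented example built from the Patient illustration above; the exact
# placement and spelling of the <...> annotations is an assumption.
annotated = ("(S (NP<+Animate><-AGENT-V1> (PRP I)) "
             "(VP (VBD<V1> hugged) (NP<+Animate><-PATIENT-V1> "
             "(DT the) (NN penguin))) (. .))")

def strip_conceptual(tree_str):
    """Remove the angle-bracketed animacy/thematic-role annotations,
    leaving a plain phrase structure tree."""
    return re.sub(r"<[^<>]*>", "", tree_str)

def extract_roles(tree_str):
    """Collect (role, verb index) pairs from <-ROLE-Vn> annotations."""
    return re.findall(r"<-([A-Z]+)-V(\d+)>", tree_str)
```

On this example, `extract_roles(annotated)` returns [('AGENT', '1'), ('PATIENT', '1')], and `strip_conceptual(annotated)` yields the bare tree with no angle-bracket material left.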