Determiners Derived Corpus

Stephan Meylan
University of California -- Berkeley

Citation

If using these materials in publications, please cite this article:

Meylan, S. C., Frank, M. C., Roy, B. C., & Levy, R. (2017). The emergence of an abstract grammatical category in children’s early speech. Psychological Science, 28(2), 181-192.

Description

These materials for determiner counts used in Meylan et al. (2017) were contributed by Stephan Meylan in 2016.

75 count files are included in this .zip file. Most are derived from US and UK English longitudinal corpora in the CHILDES archive (Brown, Kuczaj, Suppes, Providence, Sachs, Bloom 1970, Thomas, and Manchester). The remainder are derived from the Speechome Corpus. Citations for the corpora are available in the above paper.

Several files are associated with each corpus, reflecting a range of extraction methods and choices regarding the aggregation of counts across morphologically-similar forms.

Extraction methods:

“standard”: determiner + noun tokens found with an automated dependency parse run on the CHILDES morphological tier
“LN”: determiner + noun tokens found with the Stanford statistical part-of-speech tagger, taking the *last* noun in cases of multiple successive nouns
“FN”: determiner + noun tokens found with the Stanford statistical part-of-speech tagger, taking the *first* noun in cases of multiple successive nouns
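
As a rough illustration of how the FN and LN choices differ, the sketch below picks a noun from a run of successive nouns following a determiner. The function name pick_noun, the (word, tag) input format, and the placeholder tags "DET", "ADJ", and "NOUN" are all assumptions made for this example. It is not the extraction code from the repository and does not reproduce the tagger's actual tag set.

```python
def pick_noun(tagged, strategy="LN"):
    """Return (determiner, noun) for the first determiner followed by a noun.

    tagged   -- list of (word, tag) pairs; the tags "DET", "ADJ" and "NOUN"
                are placeholders for this sketch, not a real tag set
    strategy -- "FN" takes the first noun in a run of successive nouns,
                "LN" takes the last
    Returns None when no determiner + noun sequence is found.
    """
    if strategy not in ("FN", "LN"):
        raise ValueError("strategy must be 'FN' or 'LN'")
    for i, (word, tag) in enumerate(tagged):
        if tag != "DET":
            continue
        # Skip adjectives standing between the determiner and the noun.
        first = i + 1
        while first < len(tagged) and tagged[first][1] == "ADJ":
            first += 1
        if first == len(tagged) or tagged[first][1] != "NOUN":
            continue
        # Extend to the end of the run of successive nouns.
        last = first
        while last + 1 < len(tagged) and tagged[last + 1][1] == "NOUN":
            last += 1
        target = first if strategy == "FN" else last
        return word, tagged[target][0]
    return None


# Toy utterance, invented for illustration: "the big dog house"
utterance = [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("house", "NOUN")]
fn_pair = pick_noun(utterance, "FN")  # pairs "the" with "dog"
ln_pair = pick_noun(utterance, "LN")  # pairs "the" with "house"
```

For a noun-noun compound such as "dog house", the two methods therefore credit the determiner to different nouns, which is why both sets of counts are provided.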

Morphology treatment:

“singulars”: only determiner + singular noun tokens are retained, e.g., “the dog” is present in the dataset while “the dogs” is discarded
“none”: noun tokens are represented as their lemma, e.g., “the dog” and “the dogs” are treated as two instances of the same type
“all”: all morphological terms are retained and tracked separately, e.g., “the dog” and “the dogs” are instances of different types
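
The three treatments can be sketched as a filter-and-relabel step over determiner + noun tokens. This is a minimal illustration under stated assumptions: the function apply_treatment and the token dictionary keys ("determiner", "noun", "lemma", "plural") are hypothetical and are not taken from the extraction code.

```python
def apply_treatment(tokens, treatment):
    """Return (determiner, noun) pairs under one morphology treatment.

    tokens -- list of dicts with hypothetical keys:
              'determiner', 'noun' (surface form), 'lemma', 'plural' (bool)
    """
    pairs = []
    for tok in tokens:
        # "an" is mapped to "a", as in the count files.
        det = "a" if tok["determiner"] == "an" else tok["determiner"]
        if treatment == "singulars":
            if tok["plural"]:
                continue  # plural nouns are discarded
            pairs.append((det, tok["noun"]))
        elif treatment == "none":
            pairs.append((det, tok["lemma"]))  # collapse forms onto the lemma
        elif treatment == "all":
            pairs.append((det, tok["noun"]))  # keep each surface form distinct
        else:
            raise ValueError(f"unknown treatment: {treatment!r}")
    return pairs


# Toy tokens, invented for illustration.
tokens = [
    {"determiner": "the", "noun": "dog", "lemma": "dog", "plural": False},
    {"determiner": "the", "noun": "dogs", "lemma": "dog", "plural": True},
    {"determiner": "an", "noun": "apple", "lemma": "apple", "plural": False},
]
singulars = apply_treatment(tokens, "singulars")
lemmatized = apply_treatment(tokens, "none")
everything = apply_treatment(tokens, "all")
```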

All corpora were processed with the LN and FN methods. All corpora except Speechome and Thomas were also processed with the standard method; these two were excluded because valid CHILDES-formatted source data were not available for them. Each corpus processed with the standard method has all three morphological treatments. Each corpus processed with the LN and FN methods has the “singulars” and “all” morphological treatments, but no lemmatized “none” dataset.
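
The combinations described in this paragraph can be written down as a small lookup. The sketch below is only a restatement of the paragraph in code: the names NO_STANDARD, TREATMENTS, and available_datasets are hypothetical, and the corpus labels are plain strings that do not necessarily match the file names in the archive.

```python
# Corpora without a standard (dependency-parse) version.
NO_STANDARD = {"Speechome", "Thomas"}

# Morphology treatments available for each extraction method.
TREATMENTS = {
    "standard": ("singulars", "none", "all"),
    "LN": ("singulars", "all"),
    "FN": ("singulars", "all"),
}


def available_datasets(corpus):
    """List the (method, treatment) pairs available for a corpus label."""
    methods = ["LN", "FN"]
    if corpus not in NO_STANDARD:
        methods.insert(0, "standard")
    return [(m, t) for m in methods for t in TREATMENTS[m]]


brown = available_datasets("Brown")    # standard, LN and FN versions
thomas = available_datasets("Thomas")  # LN and FN versions only
```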

Files whose names are prefixed with SPEECHOME, USCHILDES, or UKCHILDES are the count files used in the imputation of caregiver data. These represent counts aggregated across all caregivers in each dataset.

Details regarding the columns in each file are given below. The extraction code is available at https://github.com/smeylan/determiner_learning, and additional details are available in the supplementary material for the above-referenced paper.

****************************************

All files are UTF-8 encoded comma-separated-value (.CSV) text files.

All count files derived from CHILDES datasets have the fields:

determiner: determiner identity. “an” is mapped to “a”
noun: the noun identity, modified per the morphological treatment
pos: part of speech per !!! tags
speaker: the designation in CHILDES for the speaker
age: child age in days, computed assuming 30.5 days per month
file: the filename of the original CHILDES file
child: name of the child
numIntermediateWords: the number of words between the determiner and the noun
Utt.number: the index of the utterance in the CHILDES file
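
To make the age convention concrete, the sketch below converts an age string to days using 30.5 days per month. It assumes that the source age is written in the CHILDES "years;months.days" style (e.g., "2;03.15") and that a year is counted as 12 such months; the function age_in_days and these details are assumptions for illustration, and any rounding applied in the actual files is not reproduced here.

```python
import re

DAYS_PER_MONTH = 30.5  # stated in the field description above

# years;months with an optional .days part, e.g. "2;03.15", "2;03." or "2;03"
_AGE = re.compile(r"^(\d+);(\d+)(?:\.(\d*))?$")


def age_in_days(age):
    """Convert a 'years;months.days' string to an age in days."""
    match = _AGE.match(age.strip())
    if match is None:
        raise ValueError(f"unrecognized age string: {age!r}")
    years, months, days = match.groups()
    total_months = int(years) * 12 + int(months)
    return total_months * DAYS_PER_MONTH + int(days or 0)


example = age_in_days("2;03.15")  # (2 * 12 + 3) * 30.5 + 15
```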

Datasets from the standard preparation have the additional columns:

sent_gra: dependency parse of the sentence (following CHILDES)
sent_mor: morphology tags from the %MOR tier
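
A minimal sketch of reading one CHILDES-derived count file with the Python standard library is given below. It tallies rows by speaker and determiner and uses the presence of the sent_gra and sent_mor columns to tell whether the file comes from the standard preparation. The function summarize is hypothetical, and the sample rows (including the speaker code, pos value, file name, and child name) are invented for illustration; they are not taken from the dataset, and the column order in the real files may differ.

```python
import csv
import io
from collections import Counter

STANDARD_ONLY = {"sent_gra", "sent_mor"}

# Invented rows using the column names listed above.
SAMPLE = (
    "determiner,noun,pos,speaker,age,file,child,numIntermediateWords,Utt.number\n"
    "the,dog,n,CHI,838.5,toy.cha,ToyChild,0,12\n"
    "a,ball,n,MOT,838.5,toy.cha,ToyChild,1,13\n"
    "the,ball,n,CHI,838.5,toy.cha,ToyChild,0,14\n"
)


def summarize(handle):
    """Return (is_standard, counts) for an open count file.

    is_standard -- True if the standard-only columns are present
    counts      -- Counter keyed by (speaker, determiner)
    """
    reader = csv.DictReader(handle)
    is_standard = STANDARD_ONLY.issubset(reader.fieldnames or [])
    counts = Counter((row["speaker"], row["determiner"]) for row in reader)
    return is_standard, counts


is_standard, counts = summarize(io.StringIO(SAMPLE))
```

For a file on disk, the same function can be called on a handle opened with open(path, encoding="utf-8", newline=""), where path is the location of one of the count files.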

Datasets from Speechome have a different format:

noun: the anonymized noun. Noun types have been anonymized to protect the privacy of the family.
determiner: “a” or “the”; “an” is mapped to “a”
pos: part of speech tag
age: the child age in days
speaker: “caregiver” or “child”