Stephan Meylan, University of California, Berkeley
If using these materials in publications, please cite this article:
Meylan, S. C., Frank, M. C., Roy, B. C., & Levy, R. (2017). The emergence of an abstract grammatical category in children’s early speech. Psychological Science, 28(2), 181-192.
These materials for determiner counts used in Meylan et al. (2017) were contributed by Stephan Meylan in 2016.
This .zip file includes 75 count files. Most are derived from US and UK English longitudinal corpora in the CHILDES archive (Brown, Kuczaj, Suppes, Providence, Sachs, Bloom 1970, Thomas, and Manchester). The remainder are derived from the Speechome Corpus. Citations for the corpora are available in the above paper.
Several files are associated with each corpus, reflecting a range of extraction methods and choices regarding the aggregation of counts across morphologically similar forms.
Extraction methods:
Morphology treatment:
All corpora were processed with the LN and FN methods. All corpora except Speechome and Thomas were also processed with the standard method; these two were excluded because they lack valid CHILDES-formatted source data. Each corpus processed with the standard method has all three morphological treatments; each corpus processed with the LN and FN methods has the “singulars” and “all” morphological treatments, but no lemmatized “none” dataset.
Files prefixed with SPEECHOME, USCHILDES, and UKCHILDES are the count files used in the imputation of caregiver data. These represent counts aggregated across all caregivers in each dataset.
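As a rough illustration of the naming convention above, the following Python sketch separates the aggregated caregiver files from the remaining count files by filename prefix. The function name and the directory argument are hypothetical, and the sketch assumes that the prefixes appear in upper case at the very start of each filename and that the files carry a .csv extension in either case.

```python
from pathlib import Path

# Prefixes named in this README for the aggregated caregiver count files.
AGGREGATE_PREFIXES = ("SPEECHOME", "USCHILDES", "UKCHILDES")


def split_count_files(directory):
    """Return (aggregate_files, other_files) as sorted lists of Paths.

    Assumption: aggregated files are identified only by their filename prefix.
    """
    aggregate, other = [], []
    for path in sorted(Path(directory).iterdir()):
        # Skip anything that is not a CSV file (extension matched case-insensitively).
        if path.suffix.lower() != ".csv":
            continue
        if path.name.startswith(AGGREGATE_PREFIXES):
            aggregate.append(path)
        else:
            other.append(path)
    return aggregate, other
```

Calling split_count_files on the folder where the .zip file was unpacked would return the two groups of paths, which can then be loaded separately.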
The columns in each file are described below. The extraction code is available at https://github.com/smeylan/determiner_learning, and further details are given in the supplementary material for the paper cited above.
****************************************
All files are UTF-8 encoded comma-separated-value (.CSV) text files.
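Because the files are plain UTF-8 CSV, they can be read with standard tools. The sketch below uses Python's built-in csv module; the function name is hypothetical, and it assumes only that the first row of each file is a header naming the columns. It makes no assumption about which columns are present.

```python
import csv


def read_count_file(path):
    """Read one count file into a list of dicts keyed by the header row.

    Assumption: the first row of the file contains the column names.
    All values are returned as strings; convert counts to int as needed.
    """
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

Each returned row maps column names to string values, so numeric fields need an explicit conversion before any arithmetic.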
All count files derived from CHILDES datasets have the fields:
Datasets from the standard preparation have the additional columns:
Datasets from Speechome have a different format: