UCI_Brent_Syl Derived Corpus
Lisa Pearl
University of California, Irvine
Citations
If using these corpora in published materials, please cite one or more of the following:
Phillips, L. & Pearl, L. 2012. 'Less is More' in Bayesian word segmentation: When cognitively plausible learners outperform the ideal. In N. Miyake, D. Peebles, & R. Cooper (eds), Proceedings of the 34th Annual Conference of the Cognitive Science Society, 863-868. Austin, TX: Cognitive Science Society.
Phillips, L. & Pearl, L. 2013. "Less is More" in language acquisition: Evidence from word segmentation. Manuscript, University of California, Irvine.
CHILDES database:
MacWhinney, B. 2000. The CHILDES Project: Tools for analyzing talk. Mahwah, NJ: Lawrence Erlbaum Associates.
Brent corpus (original source of the data):
Brent, M. R. & Siskind, J. M. 2001. The role of exposure to isolated words in early vocabulary development. Cognition, 81, B33-B44.
Description
Lawrence Phillips & Lisa Pearl contributed these materials in 2013. This corpus is derived from the child-directed speech (CDS) in the Brent corpus of the CHILDES database. The goal was to provide training data for an automatic word segmenter.
The following files are available in this .zip file.
Information file:
- Klatt_IPA.pdf: A file describing the Klattese IPA encoding
Data files (in data directory):
- brent9mos.txt: A file containing the orthographic transcript of a subset of the Brent corpus in the CHILDES database, specifically all utterances directed at children 9 months or younger.
- brent9mos-text-klatt-syls: A file containing the syllabified Klattese IPA transcript of the subset of the Brent corpus directed at children 9 months and younger.
[The following three files are used in the syllabification process]
- dict-Brent.txt: Phonemic (Klattese) transcription of all words appearing in brent9mos.txt.
- mrc-call-syllabified.txt: Syllabic transcription of words in brent9mos.txt, derived from the MRC Psycholinguistic Database (Wilson 1988).
Wilson, M.D. 1988. The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavior Research Methods, Instruments, & Computers, 20, 6-11.
- ValidOnsets.txt: Contains a list of valid syllable onsets.
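One common way to put a list of valid onsets to work is the maximal-onset principle: between two vowels, the following syllable takes the longest consonant cluster that is a valid onset, and the rest stays with the preceding syllable. The Python sketch below illustrates that idea only. Whether the Perl scripts proceed exactly this way is not stated here, and the vowel set, the onset set, and the function name syllabify are toy assumptions rather than the contents of ValidOnsets.txt or the Klattese inventory. It also assumes one character per phoneme.

```python
# Toy inventories: illustrative assumptions, NOT the contents of
# ValidOnsets.txt or the actual Klattese vowel set.
VOWELS = set("aeiou")
VALID_ONSETS = {"", "b", "k", "t", "n", "s", "st", "tr", "str"}


def syllabify(phonemes, vowels=VOWELS, valid_onsets=VALID_ONSETS):
    """Split a string of one-character phonemes into syllables.

    Each vowel is treated as a syllable nucleus and receives the longest
    preceding consonant cluster that appears in valid_onsets.
    """
    nuclei = [i for i, p in enumerate(phonemes) if p in vowels]
    if not nuclei:
        return [phonemes]  # no vowel: leave the word as a single unit

    starts = [0]  # the first syllable always begins at the word start
    for prev, cur in zip(nuclei, nuclei[1:]):
        # Consonants between two nuclei occupy positions prev+1 .. cur-1.
        split = prev + 1
        # Shift the boundary rightwards until what remains is a valid onset
        # (or nothing remains, in which case the next syllable has no onset).
        while split < cur and phonemes[split:cur] not in valid_onsets:
            split += 1
        starts.append(split)

    ends = starts[1:] + [len(phonemes)]
    return [phonemes[s:e] for s, e in zip(starts, ends)]


print(syllabify("banana"))
print(syllabify("bantu"))
```

With these toy sets, "nt" is not a valid onset, so the "n" in "bantu" closes the first syllable and "t" opens the second. A dictionary lookup (as with mrc-call-syllabified.txt) could be tried first, with a rule like this as a fallback for words the dictionary lacks.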
Perl (and related) files (in perl directory):
- README_syllabic_conversion.txt: describes how to run the syllabic conversion process for use in programs that operate over single characters (e.g., code designed to process individual phonemes)
- run-syllabification-9mos: master batch script that calls Perl scripts to create the syllabified form of English orthographic text
- adding-syllabification.pl, convert-to-english-9mos.pl, convert-to-english-syls-9mos.pl, convert-to-unicode-9mos.pl, create-unicode-dict.pl, edit-dict-brent.pl, remove-end-spaces-9mos.pl: helper scripts to accomplish different parts of the syllabification process
Output files generated during syllabic conversion process (in output directory):
- brent9mos-text-klatt-syls.txt: syllabified Klattese IPA version of original orthographic text
- brent9mos-text-klatt.txt: Klattese IPA version of original orthographic text
- brent9mos-text-unicode.txt: Unicode encoding of the syllabified Klattese IPA version of original orthographic text
- brent9mos.txt: orthographic text of 9mos subsection of Brent corpus
- dict-Brent-Klatt.txt: Klattese IPA version of words in Brent corpus
- syllabified-dict.txt: syllabification of Klattese IPA version of words in Brent corpus
- unicode-dict.txt: Syllable to Unicode conversion dictionary
- unicode-word-dict.txt: Word to Unicode-syllable conversion dictionary
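
The idea behind the Unicode files can be sketched as follows: if every distinct syllable is assigned its own single character, a program written to operate over individual characters can treat syllables as its basic units without modification. The Python sketch below is a minimal illustration under stated assumptions, not the conversion the Perl scripts perform. The function names build_syllable_map and encode, the in-memory input format, and the choice to start at code point U+E000 (the Unicode Private Use Area) are all hypothetical.

```python
def build_syllable_map(utterances, first_codepoint=0xE000):
    """Assign each distinct syllable one character, in order of first appearance.

    utterances: list of utterances, each a list of words,
                each word a list of syllable strings.
    first_codepoint: assumed starting point (start of the Private Use Area).
    """
    mapping = {}
    for utterance in utterances:
        for word in utterance:
            for syllable in word:
                if syllable not in mapping:
                    mapping[syllable] = chr(first_codepoint + len(mapping))
    return mapping


def encode(utterances, mapping):
    """Rewrite each utterance as space-separated words, one character per syllable."""
    return [
        " ".join("".join(mapping[syl] for syl in word) for word in utterance)
        for utterance in utterances
    ]


# Toy data, not taken from the corpus.
toy = [[["ba", "na", "na"]], [["ba"], ["tu"]]]
syllable_map = build_syllable_map(toy)
encoded = encode(toy, syllable_map)
print(len(syllable_map), [len(line) for line in encoded])
```

The Basic Multilingual Plane's Private Use Area holds 6,400 code points (U+E000 to U+F8FF), so a larger syllable inventory would need a different range.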