UCI_Brent_Syl Derived Corpus

Lisa Pearl
University of California, Irvine

Citations

If using these corpora in published materials, please cite one or more of the following:

Phillips, L. & Pearl, L. (2012). 'Less is More' in Bayesian word segmentation: When cognitively plausible learners outperform the ideal. In N. Miyake, D. Peebles, & R. Cooper (Eds.), Proceedings of the 34th Annual Conference of the Cognitive Science Society, 863-868. Austin, TX: Cognitive Science Society.

Phillips, L. & Pearl, L. (2015). The Utility of Cognitive Plausibility in Language Acquisition Modeling: Evidence From Word Segmentation. Cognitive Science, 1-31. doi: 10.1111/cogs.12217

CHILDES database:

MacWhinney, B. (2000). The CHILDES Project: Tools for analyzing talk. Mahwah, NJ: Lawrence Erlbaum Associates.

Brent corpus (original source of the data):

Brent, M. R. & Siskind, J. M. (2001). The role of exposure to isolated words in early vocabulary development. Cognition, 81, 31-44.

Description

Lawrence Phillips & Lisa Pearl contributed these materials in 2013. This corpus is derived from the child-directed speech (CDS) in the Brent corpus of the CHILDES database. The goal was to train an automatic word segmenter.

The following files are available in this .zip file.

Information file:

Klatt_IPA.pdf: A file describing the Klattese IPA encoding

Data files (in data directory):

brent9mos.txt: A file containing the orthographic transcript of a subset of the Brent corpus in the CHILDES database, specifically all utterances directed at children 9 months or younger.
brent9mos-text-klatt-syls: A file containing the syllabified Klattese IPA transcript of the same subset of the Brent corpus (utterances directed at children 9 months or younger).

dict-Brent.txt: Phonemic (Klattese) transcription of all words appearing in brent9mos.txt.
mrc-call-syllabified.txt: Syllabic transcription of words in brent9mos.txt, derived from the MRC Psycholinguistic Database (Wilson 1988).

ValidOnsets.txt: Contains a list of valid syllable onsets.
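As an illustration of how a list of valid onsets can drive syllabification, here is a minimal Python sketch of the Maximum Onset Principle. It is not taken from the scripts in this package and may differ from what they do. The vowel set, the onset set, and the function name are hypothetical toy values, and the sketch assumes one character per phoneme.

```python
# Toy inventories (assumptions for illustration, not the contents of ValidOnsets.txt).
TOY_VOWELS = set("aeiou")
TOY_ONSETS = {"b", "k", "m", "r", "s", "t", "st", "tr", "str"}


def syllabify(phonemes, vowels, valid_onsets):
    """Split a phoneme string into syllables, giving each non-initial
    syllable the longest consonant cluster that is a valid onset."""
    nuclei = [i for i, p in enumerate(phonemes) if p in vowels]
    if not nuclei:
        # No vowel: treat the whole string as a single unit.
        return [phonemes]

    boundaries = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        # Consonants between two nuclei occupy phonemes[prev + 1:nxt].
        split = nxt
        for start in range(prev + 1, nxt + 1):
            # Try the longest candidate onset first; an empty onset always works.
            if start == nxt or phonemes[start:nxt] in valid_onsets:
                split = start
                break
        boundaries.append(split)
    boundaries.append(len(phonemes))

    return [phonemes[a:b] for a, b in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    print(syllabify("abstrakt", TOY_VOWELS, TOY_ONSETS))
```

With these toy sets, the cluster "bstr" is not a valid onset but "str" is, so the "b" stays in the coda of the first syllable.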

Perl (and related) files (in perl directory):

README_syllabic_conversion.txt: describes how to run the syllabic conversion process for use in programs that operate over single characters (e.g., code designed to process individual phonemes)
run-syllabification-9mos: master batch script that calls perl scripts to create the syllabified form of the English orthographic text
adding-syllabification.pl, convert-to-english-9mos.pl, convert-to-english-syls-9mos.pl, convert-to-unicode-9mos.pl, create-unicode-dict.pl, edit-dict-brent.pl, remove-end-spaces-9mos.pl: helper scripts to accomplish different parts of the syllabification process
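The idea behind the conversion, replacing each syllable with a single character so that character-level programs treat syllables as atomic units, can be sketched in Python as follows. This is a hypothetical illustration, not a port of the perl scripts. The function names, the nested-list input format, and the choice of the Unicode Private Use Area as the code point range are all assumptions.

```python
def build_syllable_map(utterances, first_code_point=0xE000):
    """Assign one character to each distinct syllable, in order of first
    appearance. The Private Use Area start (U+E000) is an assumed default."""
    mapping = {}
    for utterance in utterances:
        for word in utterance:
            for syllable in word:
                if syllable not in mapping:
                    mapping[syllable] = chr(first_code_point + len(mapping))
    return mapping


def encode(utterances, mapping):
    """Render each utterance as space-separated words, one character per syllable."""
    return [
        " ".join("".join(mapping[s] for s in word) for word in utterance)
        for utterance in utterances
    ]


# Toy input: a list of utterances, each a list of words, each a list of syllables.
TOY_UTTERANCES = [
    [["be", "bi"]],
    [["be"], ["bi", "be"]],
]

if __name__ == "__main__":
    syllable_map = build_syllable_map(TOY_UTTERANCES)
    for line in encode(TOY_UTTERANCES, syllable_map):
        # Print code points, since Private Use Area characters have no visible glyph.
        print(" ".join(f"U+{ord(c):04X}" if c != " " else "|" for c in line))
```

After encoding, the length of each word in characters equals its number of syllables, which is what lets a program written for single-character phonemes operate over syllables instead.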

Output files generated during syllabic conversion process (in output directory):

brent9mos-text-klatt-syls.txt: syllabified Klattese IPA version of original orthographic text

brent9mos-text-klatt.txt: Klattese IPA version of original orthographic text

brent9mos-text-unicode.txt: Unicode encoding of the syllabified Klattese IPA version of the original orthographic text

brent9mos.txt: orthographic text of 9mos subsection of Brent corpus

dict-Brent-Klatt.txt: Klattese IPA version of words in Brent corpus

syllabified-dict.txt: syllabification of Klattese IPA version of words in Brent corpus

unicode-dict.txt: Syllable to unicode conversion dictionary

unicode-word-dict.txt: Word to unicode-syllable conversion dictionary