UCI_Brent_Syl Derived Corpus


Lisa Pearl
University of California, Irvine

Citations

If using these corpora in published materials, please cite one or more of the following:

Phillips, L. & Pearl, L. 2012. 'Less is More' in Bayesian word segmentation: When cognitively plausible learners outperform the ideal. In N. Miyake, D. Peebles, & R. Cooper (eds), Proceedings of the 34th Annual Conference of the Cognitive Science Society, 863-868. Austin, TX: Cognitive Science Society.

Phillips, L. & Pearl, L. 2013. "Less is More" in language acquisition: Evidence from word segmentation. Manuscript, University of California, Irvine.

CHILDES database:
MacWhinney, B. 2000. The CHILDES Project: Tools for analyzing talk. Mahwah, NJ: Lawrence Erlbaum Associates.

Brent corpus (original source of the data):
Brent, M. R. & Siskind, J. M. 2001. The role of exposure to isolated words in early vocabulary development. Cognition, 81, 31-44.

Description

Lawrence Phillips & Lisa Pearl contributed these materials in 2013. This corpus is derived from the child-directed speech (CDS) in the Brent corpus of the CHILDES database. It was created as syllabified input for training an automatic word segmenter.

The following files are available in this .zip file.

Information file:

  1. Klatt_IPA.pdf: A file describing the Klattese IPA encoding

Data files (in data directory):

  1. brent9mos.txt: A file containing the orthographic transcript of a subset of the Brent corpus in the CHILDES database, specifically all utterances directed at children 9 months or younger.
  2. brent9mos-text-klatt-syls: A file containing the syllabified Klattese IPA transcript of the subset of the Brent corpus directed at children 9 months and younger.

  The following three files are used in the syllabification process:

  3. dict-Brent.txt: Phonemic (Klattese) transcription of all words appearing in the brent9mos.txt file.
  4. mrc-call-syllabified.txt: Syllabic transcription of words in brent9mos.txt, derived from the MRC Psycholinguistic Database (Wilson 1988).
     Wilson, M. D. (1988). The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavior Research Methods, Instruments, & Computers, 20, 6-11.
  5. ValidOnsets.txt: A list of valid syllable onsets.
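
This README does not state how the scripts use ValidOnsets.txt. One common way to apply a list of valid onsets is the Maximum Onset Principle: consonants between two vowels are assigned to the following syllable as long as they form a valid onset. The Python sketch below illustrates that idea only. It is not the Perl code in this package, and the function name, the vowel set, and the onset set are made-up placeholders, not the contents of ValidOnsets.txt or the Klattese symbol inventory. It assumes one character per phoneme.

```python
# Hypothetical illustration of maximum-onset syllabification.
# VOWELS and ONSETS are toy placeholders, NOT taken from ValidOnsets.txt.
VOWELS = set("aeiou")
ONSETS = {"b", "k", "r", "s", "t", "st", "tr", "str"}


def syllabify(word, vowels, valid_onsets):
    """Split a string of one-character phonemes into syllables.

    Each vowel is treated as its own syllable nucleus (a simplification).
    Consonants between two nuclei go to the following syllable as long as
    they form a valid onset; the rest stay with the preceding syllable.
    """
    nuclei = [i for i, ch in enumerate(word) if ch in vowels]
    if not nuclei:
        return [word]  # no vowel: leave the word as a single unit

    syllables = []
    start = 0
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = word[prev + 1:nxt]  # consonants between two vowels
        split = len(cluster)          # default: whole cluster is a coda
        for k in range(len(cluster)):
            # Try the longest candidate onset first, then shorter ones.
            if cluster[k:] in valid_onsets:
                split = k
                break
        boundary = prev + 1 + split
        syllables.append(word[start:boundary])
        start = boundary
    syllables.append(word[start:])
    return syllables


print(syllabify("estra", VOWELS, ONSETS))  # ['e', 'stra']
print(syllabify("antra", VOWELS, ONSETS))  # ['an', 'tra']
```

In the second example, "ntr" is not in the toy onset set but "tr" is, so the "n" stays with the first syllable.
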

Perl (and related) files (in perl directory):

  1. README_syllabic_conversion.txt: describes how to run the syllabic conversion process for use in programs that operate over single characters (e.g., code designed to process individual phonemes)
  2. run-syllabification-9mos: master batch script that calls the Perl scripts to create the syllabified form of the English orthographic text
  3. adding-syllabification.pl, convert-to-english-9mos.pl, convert-to-english-syls-9mos.pl, convert-to-unicode-9mos.pl, create-unicode-dict.pl, edit-dict-brent.pl, remove-end-spaces-9mos.pl: helper scripts that carry out different parts of the syllabification process

Output files generated during the syllabic conversion process (in output directory):

  • brent9mos-text-klatt-syls.txt: syllabified Klattese IPA version of original orthographic text
  • brent9mos-text-klatt.txt: Klattese IPA version of original orthographic text
  • brent9mos-text-unicode.txt: Unicode encoding of the syllabified Klattese IPA version of the original orthographic text
  • brent9mos.txt: orthographic text of 9mos subsection of Brent corpus
  • dict-Brent-Klatt.txt: Klattese IPA version of words in Brent corpus
  • syllabified-dict.txt: syllabification of Klattese IPA version of words in Brent corpus
  • unicode-dict.txt: syllable-to-Unicode conversion dictionary
  • unicode-word-dict.txt: word-to-Unicode-syllable conversion dictionary
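
The Unicode files exist so that programs written to operate over single characters can treat each syllable as one symbol. The Python sketch below illustrates that idea under stated assumptions. It is not the conversion performed by the Perl scripts, and this README does not specify the file formats. The function names, the in-memory input (a list of utterances, each a list of words, each a list of syllable strings), the toy syllables, and the starting code point 0x4E00 are all hypothetical.

```python
# Hypothetical illustration: map each distinct syllable to one Unicode
# character so that character-based code can process syllables.
# Names, input layout, and the starting code point are assumptions.

def build_syllable_codes(utterances, first_code_point=0x4E00):
    """Assign each distinct syllable one character, in order of first appearance."""
    codes = {}
    for utterance in utterances:
        for word in utterance:
            for syllable in word:
                if syllable not in codes:
                    codes[syllable] = chr(first_code_point + len(codes))
    return codes


def encode(utterance, codes):
    """Encode one utterance: one character per syllable, words separated by spaces."""
    return " ".join("".join(codes[s] for s in word) for word in utterance)


# Toy corpus: two utterances made of placeholder syllables.
corpus = [
    [["be", "bi"]],
    [["hE", "lo"], ["be", "bi"]],
]
codes = build_syllable_codes(corpus)
encoded = encode(corpus[1], codes)
print(len(codes))    # 4 distinct syllables
print(len(encoded))  # 5: four syllable characters plus one space
```

Because every syllable becomes exactly one character, the length of an encoded word equals its syllable count, which is what lets phoneme-level code run unchanged over syllables.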