Brent_Ratner Derived Corpus


Michael Brent
Washington University

Sharon Goldwater
The University of Edinburgh

Nan Bernstein Ratner
University of Maryland

Citation

If using these corpora in published materials, please use the following citations.

CHILDES database:
MacWhinney, B., & Snow, C. (1985). The child language data exchange system. Journal of Child Language, 12, 271-296.

Bernstein-Ratner corpus (original source of data):
Bernstein-Ratner, N. (1987). The phonology of parent-child speech. In K. Nelson and A. van Kleeck (Eds.), Children's Language (Vol. 6, pp. 159-174). Erlbaum, Hillsdale, NJ.

Brent version of BR corpus:
Brent, M. R., & Cartwright, T. A. (1996). Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61, 93-125.

Description

This is a derived version of the Bernstein-Ratner corpus, prepared for word segmentation research. The files were contributed by Sharon Goldwater in 2008 and are the corpora used in the Goldwater et al. word segmentation papers.

Three files are included here, originally obtained from Michael Brent, and redistributed with his permission. This .zip file contains all three files.

  1. br-text.txt: the orthographic transcript made by Brent of the Bernstein-Ratner corpus in the CHILDES database. Brent produced it by cleaning up non-standard spellings and removing partial words, utterances not directed at the children, etc.
  2. dict.txt: the phonological dictionary used to convert orthographic forms into phonological forms, resulting in br-phono.txt.
  3. br-phono.txt: the phonological transcript.
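
The relationship between the three files can be sketched as a dictionary lookup applied word by word. The following is a minimal illustration only, not Brent's actual conversion script. It assumes that each line of dict.txt holds an orthographic word followed by whitespace and its phonological form; that layout, the function names, and the toy entries (including the phoneme symbols) are hypothetical and should be checked against the real files.

```python
def load_dictionary(lines):
    """Build a word -> phonological form mapping.

    Assumed layout (not confirmed by the corpus documentation):
    one entry per line, "<orthographic word> <phonological form>".
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            mapping[parts[0]] = parts[1]
    return mapping


def to_phono(utterance, mapping):
    """Convert one orthographic utterance, word by word.

    Words missing from the dictionary are left unchanged here;
    this fallback is an assumption made for the sketch.
    """
    return " ".join(mapping.get(word, word) for word in utterance.split())


# Toy entries with made-up symbols, standing in for lines of dict.txt.
toy_dict = load_dictionary(["you yu", "want wAnt", "the D6", "book bUk"])

# A toy utterance standing in for one line of br-text.txt.
print(to_phono("you want the book", toy_dict))
```

With the real files, the same two functions could be applied to the lines of dict.txt and br-text.txt read from disk, and the result compared against br-phono.txt to check whether the assumed layout holds.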