CHILDES English Fernald-Marchman-Bang FMB_home Corpus


Anne Fernald
Department of Psychology
Stanford University

Virginia Marchman
Department of Psychology
Stanford University

Janet Bang
Child Development
San Jose State University

Participants: 61 (30, 31)
Type of Study: daylong audio
Location: USA
Media type: not available
DOI: doi:10.21415/FWXJ-P781

Browsable transcripts

Download transcripts

Citation information

Publications using this data should cite:

Bang, J.Y., Mora, A., Munévar, M., Fernald, A., & Marchman, V. (under revision). Time to talk: Multiple sources of variability in caregiver verbal engagement during everyday activities in English- and Spanish-speaking families in the U.S. https://psyarxiv.com/6jzwg/

The pre-registration and additional codebooks for this publication can be seen on Open Science Framework (https://osf.io/byjfg/). All data and code are available on GitHub (https://github.com/janetybang/TimetoTalk).

Additional publications presenting data from this sample:

Tan, A., Read, K., *Gamboa, S., Bang, J.Y., & Marchman, V. (under revision). The power of the page: Comparing richness in text and talk during book sharing with two-year old children.

Bang, J.Y., Kachergis, G., Weisleder, A., & Marchman, V. (2023). An automated classifier for periods of sleep and target-child-directed speech from LENA recordings. Language Development Research, 3(1), 211-248. https://doi.org/10.34842/xmrq-er43

Marchman, V. A., Bermúdez, V. N., Bang, J. Y., & Fernald, A. (2020). Off to a good start: Early Spanish‐language processing efficiency supports Spanish‐ and English‐language outcomes at 4½ years in sequential bilinguals. Developmental Science, 23(6), e12973. https://doi.org/10.1111/desc.12973

Fernald, A., Marchman, V. A., & Weisleder, A. (2013). SES differences in language processing skill and vocabulary are evident at 18 months. Developmental Science, 16(2), 234–248. https://doi.org/10.1111/desc.12019

Project Description

This corpus captures caregiver-child interactions during everyday home activities, sampled from LENA daylong audio recordings. Participants were socioeconomically and linguistically diverse English- and Spanish-speaking families with 2-year-old children (23-28 months) in the Western United States; recordings were collected from 2010 to 2013. The corpus includes 30 English- and 31 Spanish-speaking families for whom the primary caregiver provided consent to share anonymized transcripts (out of 45 families in each language group).

Caregiver-child interactions were sampled from daylong recordings to reflect the densest periods of speech to children wearing the LENA recorder. These interactions occurred across various activities and periods of the day. Families who requested to record on multiple days were asked to record so that all periods of the day were represented. Each ID number corresponds to one family, with 6 transcripts per family (each an interaction of about 10 minutes); the 6 transcripts therefore represent up to 1 hour of speech to each 2-year-old child. Additional details on recruitment, sampling, recording, and participants are reported elsewhere (Bang et al., under revision).
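The sampling figures above imply an upper bound on the size of the corpus. The short Python sketch below works through that arithmetic. The variable names are illustrative, and the totals assume that every family contributed all 6 transcripts and that each transcript runs the full 10 minutes, which the description gives only as an approximation.

```python
# Figures taken from the corpus description; names are illustrative.
FAMILIES_ENGLISH = 30
FAMILIES_SPANISH = 31
TRANSCRIPTS_PER_FAMILY = 6
MINUTES_PER_TRANSCRIPT = 10  # approximate length of each sampled interaction

families = FAMILIES_ENGLISH + FAMILIES_SPANISH

# Upper bounds, assuming complete data for every family.
total_transcripts = families * TRANSCRIPTS_PER_FAMILY
max_minutes_per_family = TRANSCRIPTS_PER_FAMILY * MINUTES_PER_TRANSCRIPT
max_hours_total = families * max_minutes_per_family / 60

print(f"families: {families}")
print(f"transcripts (upper bound): {total_transcripts}")
print(f"speech per family (upper bound): {max_minutes_per_family} min")
print(f"speech in corpus (upper bound): {max_hours_total:.0f} h")
```

These are ceilings rather than measured totals; actual transcript counts and durations should be taken from the downloaded files.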

Activities were coded by listening to the audio recordings in ELAN (version 5.0; Wittenburg et al., 2006). Audio recordings were first transcribed by a transcription service that employed fluent speakers of the respective language. Trained research assistants (RAs) then proofread and revised the transcripts according to the CHAT conventions relevant to our project goals.

Transcript Information

  1. CHAT headers
  2. Gems used in the transcripts - We included the following gems to denote periods of adult speech to children during various activities. Gems were marked in the transcripts by hand, using the start and end points coded in the ELAN file.
  3. Utterances - The transcribers at the transcription service and the RAs were given the following information for judging utterance breaks. RAs were trained on a set of selected files before reviewing transcripts.
  4. Precodes and postcodes used in the transcripts
  5. Special lexical forms used in the transcripts
  6. Capitalizations - The following were capitalized in the transcripts.
  7. Anonymization - All names of people and places were replaced with a general placeholder to respect the anonymity of speakers. Below are some of the frequent anonymizations used in the transcripts.
  8. Other considerations
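
Because gems delimit the activity periods, a common first step is to group utterances by the gem that encloses them. The Python sketch below shows one way to do this on raw CHAT text. It rests on several assumptions: that gems appear as standard CHAT `@Bg:`/`@Eg:` pairs or as lazy `@G:` headers (this page does not say which form the corpus uses), that the labels `books` and `meal` are hypothetical stand-ins for the actual gem labels, and that the sample utterances are invented for illustration. The function name `utterances_by_gem` is also illustrative and not part of any CHILDES tool.

```python
from collections import defaultdict


def utterances_by_gem(lines):
    """Group main-tier utterances by the gem that encloses them.

    Handles @Bg:/@Eg: pairs and lazy @G: headers. Dependent tiers (%...)
    and tab-indented continuation lines are ignored in this sketch.
    """
    grouped = defaultdict(list)
    current = None  # label of the gem currently open, if any
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("@Bg:") or line.startswith("@G:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("@Eg"):
            current = None
        elif line.startswith("*") and current is not None:
            speaker, _, text = line.partition(":")
            grouped[current].append((speaker.lstrip("*"), text.strip()))
    return dict(grouped)


# Invented sample; gem labels and utterances are hypothetical.
sample = [
    "@Begin",
    "@Bg:\tbooks",
    "*MOT:\tlook at the doggy .",
    "*CHI:\tdoggy .",
    "@Eg:\tbooks",
    "*MOT:\tall done .",
    "@Bg:\tmeal",
    "*MOT:\tmore milk ?",
    "@Eg:\tmeal",
    "@End",
]

result = utterances_by_gem(sample)
for label, utterances in result.items():
    print(label, utterances)
```

Utterances outside any gem, such as the third `*MOT:` line in the sample, are skipped. For real analyses, an established CHAT parser is likely a safer choice than hand-rolled line matching.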

Acknowledgements

We are very grateful to the children and parents who participated in this research. This work would not have been possible without the collaboration of the undergraduate students and staff of the Language Learning Lab, directed by Dr. Anne Fernald. Thank you to Darwin Mastin for previous work and thoughts on an earlier version of the activity coding. Thank you to Mónica Munévar and Arlyn Mora, who were project managers for the activity coding and transcription.

We are grateful to the research assistants who helped code and transcribe the data: Jessica Magallón, Nadia Segura, Shriya Anand, Sophia Gamboa, Marisol Rodriguez, Maria Lopez, Stephen Lopez, Jesús Esquivel-Barrientos, Laura Jonsson, Kalpana Gopalkrishnan, Maribel Mercardo, Tami Alade, Jaqueline De Paz-Romero, Lesly Leon, Alice Articia, Julia Briones-Avila, and Elizabeth Sanchez.

We greatly appreciate the patience, positivity, and dedication of everyone involved in capturing natural and spontaneous language in everyday interactions with young children.

This work was supported by grants from the National Institutes of Health (R01 HD42235, R01 DC008838, R01 HD092343, 2R01 HD069150), the Schusterman Foundation, the David and Lucile Packard Foundation, the Bezos Family Foundation, and the Stanford Maternal and Child Health Research Institute.

Usage Restrictions

Please only use these transcripts for research/educational purposes.