CHILDES English Fernald-Marchman-Bang FMB

CHILDES English Fernald-Marchman-Bang FMB_home Corpus

	Anne Fernald Department of Psychology Stanford University afernald@stanford.edu		Virginia Marchman Department of Psychology Stanford University marchman@stanford.edu
	Janet Bang Child Development San Jose State University janet.bang@sjsu.edu

Participants:	61 (30, 31)
Type of Study:	daylong audio
Location:	USA
Media type:	not available
DOI:	doi:10.21415/FWXJ-P781

Citation information

Publications using this data should cite:

Bang, J.Y., Mora, A., Munévar, M., Fernald, A., & Marchman, V. (under revision). Time to talk: Multiple sources of variability in caregiver verbal engagement during everyday activities in English- and Spanish-speaking families in the U.S. https://psyarxiv.com/6jzwg/

The pre-registration and additional codebooks for this publication can be seen on Open Science Framework (https://osf.io/byjfg/). All data and code are available on GitHub (https://github.com/janetybang/TimetoTalk).

Additional publications presenting data from this sample:

Tan, A., Read, K., *Gamboa, S., Bang, J.Y., & Marchman, V. (under revision). The power of the page: Comparing richness in text and talk during book sharing with two-year old children

Bang, J.Y., Kachergis, G., Weisleder, A., & Marchman, V. (2023). An automated classifier for periods of sleep and target-child-directed speech from LENA recordings. Language Development Research, 3(1), 211-248. https://doi.org/10.34842/xmrq-er43

Marchman, V. A., Bermúdez, V. N., Bang, J. Y., & Fernald, A. (2020). Off to a good start: Early Spanish‐language processing efficiency supports Spanish‐ and English‐language outcomes at 4½ years in sequential bilinguals. Developmental Science, 23(6), e12973. https://doi.org/10.1111/desc.12973

Fernald, A., Marchman, V. A., & Weisleder, A. (2013). SES differences in language processing skill and vocabulary are evident at 18 months. Developmental Science, 16(2), 234–248. https://doi.org/10.1111/desc.12019

Project Description

The present corpus reflects everyday home activities during caregiver-child interactions sampled using LENA daylong audio recordings in socioeconomically and linguistically diverse English- and Spanish-speaking families with 2-year-old children (23 - 28 months) in the Western United States (collected from 2010 - 2013). The corpus includes 30 English- and 31 Spanish-speaking families for which the primary caregiver provided consent to share anonymized transcripts (out of 45 families in each language group).

Caregiver-child interactions were sampled from daylong recordings to reflect the densest periods of speech to children wearing the LENA recorder. These interactions occurred across various activities and periods of the day. Families who requested to record on multiple days were asked to record so that all periods of the day were represented. Each ID number corresponds to one family, and each family has 6 transcripts of approximately 10 minutes each, so the 6 transcripts together represent up to 1 hour of speech to a 2-year-old child. Additional details on recruitment, sampling, recording, and participants are reported elsewhere (Bang et al., under revision).

Activities were coded by listening to the audio recording using ELAN software (version 5.0; Wittenburg et al., 2006). Audio recordings were initially transcribed by a transcription service that employed fluent speakers of the respective language. Trained research assistants then ‘proofread’ and revised the transcripts according to CHAT conventions relevant to our project goals.

Transcript Information

CHAT headers
- @Languages
  - English families (eng) - All families spoke in English. On occasion there were a few borrowed words from other languages, in which case the respective language was also included in the header line.
  - Spanish families (spa) - The majority of Spanish-speaking caregivers spoke in Spanish to the child wearing the LENA recorder; however, at times they used English. Other family members may have also used English.
- @ID
  - We included the language(s) spoken by the speaker, corpus name, 3-letter speaker code, sex, and role. For the child ID we also include the age and ID number.
- @Comment
  - Study-specific identifiers regarding the 10-min segment (higher vs. lower adult talk to children and LENA recording information). For questions about this naming system please email janet.bang@sjsu.edu.
Gems used in the transcripts - We included the following gems to denote periods of adult speech to children during various activities. Gems were marked in the transcripts by hand, using the start and end times coded in the ELAN file.
- Gems for child-centered activities - For play and routines, we grouped multiple gems into one category of “play” or “routines”, respectively.
  - books: ch_books
  - play: ch_play or ch_literacy
    - Please note that ch_literacy was used at times when we did not have enough information to indicate that a book was being read with the child.
  - unstructured conversation: ch_conversation
  - routines: ch_dressing, ch_grooming, or ch_potty
  - food: ch_meals
- Gems for adult-centered activities - We grouped the following activities into one category of “adult centered activities”. The “target child” refers to the child wearing the LENA recorder.
  - adult-centered activities: ad_shopping, ad_selfcare, ad_chores, ad_NTC (adult speech to the non-target child), ad_going (adult speech to target child while in movement, e.g., in the car or walking), ad_unknown (adult speech to target child during unknown activity).
- Gem for other-directed speech (overheard speech; ohs)
  - other-directed speech: ohs
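The grouping of gem codes into activity categories listed above can be written down as a lookup table. The following is a minimal sketch in Python, not part of the corpus tooling; the names GEM_TO_CATEGORY and gem_category are hypothetical and chosen only for illustration.

```python
# Gem codes mapped to the activity categories listed above.
# The dictionary and function names are illustrative assumptions.
GEM_TO_CATEGORY = {
    "ch_books": "books",
    "ch_play": "play",
    "ch_literacy": "play",
    "ch_conversation": "unstructured conversation",
    "ch_dressing": "routines",
    "ch_grooming": "routines",
    "ch_potty": "routines",
    "ch_meals": "food",
    "ad_shopping": "adult-centered activities",
    "ad_selfcare": "adult-centered activities",
    "ad_chores": "adult-centered activities",
    "ad_NTC": "adult-centered activities",
    "ad_going": "adult-centered activities",
    "ad_unknown": "adult-centered activities",
    "ohs": "other-directed speech",
}


def gem_category(gem_code):
    """Return the activity category for a gem code, or None if it is not listed."""
    return GEM_TO_CATEGORY.get(gem_code.strip())


print(gem_category("ch_literacy"))
print(gem_category("ad_going"))
```

Returning None for unlisted codes makes it easy to spot gem labels that fall outside the list above instead of silently assigning them to a category.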
Utterances - The transcribers at the transcription service and RAs were provided the following information to judge utterance breaks. RAs underwent training on a set of selected files before reviewing transcripts.
- Pause of ~ 1 - 2 seconds: Utterances were written on separate lines when there was a sufficient 1 - 2 second pause between sentences or portions of sentences. If there was no pause, and two or more utterances were syntactically separated, then they were written on different lines. For example, if MOT or CHI said what follows without a sufficient pause, their utterances were written on different lines.
  *MOT:      no te vayas.
  *MOT:      ven aquí.
  *MOT:      no te vayas.
- A change in speaker: This would mark a “conversational turn” (e.g., the mother speaks and then the child speaks). Each speaker’s utterances were transcribed on their own line. In the example below, the mother uses the vocalization “hmm” to ask the child to repeat what they said.
  *CHI:        esa la que está allá.
  *MOT:      hmm?
  *CHI:        allá.
- A new thought (i.e., a change of topic): Sometimes the speaker did not have a clear pause, but changed topics. In this case even if there was no pause, utterances were written on separate lines because of the change in topic.
  *MOT: mira la zanahoria es anaranjada.
  *MOT: no juegues con la cuchara.
- Connecting words: Any “complex” sentences conjoined by words such as and/y, but/pero, after/después, then/pues, or/o, because/porque, when/cuando, where/donde, were written as a single utterance, as long as they did not contain a pause.
  *MOT: te doy el dulce cuando termines tu pollo.
- Self-corrections: Self-corrected speech on the same topic was considered as one utterance. For example, if the speaker interrupted herself but continued on with the same basic idea, we did not break up the utterance and instead put a comma where she interrupted herself.
  *MOT: mira una manzana, perdón una pera.
- “Tag” questions: These words were considered part of an utterance and were transcribed on the same line.
  *MOT: sabe muy bueno, verdad?
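Because each utterance occupies its own main-tier line, utterances per speaker can be counted by matching the speaker code at the start of each line. The sketch below is an illustration only: it assumes every utterance fits on one line (it does not handle continuation lines), and the name count_utterances is hypothetical.

```python
import re
from collections import Counter

# A main-tier line starts with "*", a speaker code, and a colon.
MAIN_TIER = re.compile(r"^\*([A-Z0-9]+):\s+(.*)$")


def count_utterances(lines):
    """Count main-tier lines per speaker code; other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = MAIN_TIER.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Lines taken from the "change in speaker" example above.
example = [
    "*CHI:        esa la que está allá.",
    "*MOT:      hmm?",
    "*CHI:        allá.",
]
print(count_utterances(example))
```

For analyses of the full corpus, an established CHAT reader is likely a safer choice than a hand-written pattern, since it will handle headers, dependent tiers, and line continuations.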
Precodes and postcodes used in the transcripts
- Bilingual interactions - When the whole utterance was in the second language, we marked the beginning of that utterance with a precode using the three-letter language code. For a primarily Spanish-speaking family, if they spoke English:
  *MOT:      [- eng] sing!
  *MOT:      [- eng] mary had a little lamb little lamb little lamb. [+ recit]
  
  NOTE: For mixed utterances, we used the @s symbol to mark the words or phrases that were in the second language.
  *MOT:      vamos a hacer un cake@s.
  *MOT:      ahora es time@s to@s eat@s. [+ rmix]
- Cloze utterances [+ cloze] - Used if a speaker started a part of a memorized/recited phrase or a part of a word and was priming the child to fill in the blank.
  *MOT:      &+tur? [+ cloze]
  *CHI:        &+tle. [+ cloze]
  
  *MOT:      &+mari? [+ cloze]
  *CHI:        &+posa. [+ cloze]
  
  *MOT:      mary had a little? [+ recit] [+ cloze]
  %com:      mother is singing, waits for child to join her.
  *CHI:        what?
  *MOT:      lamb. [+ recit] [+ cloze]
- Elicited imitation [+ eimit] - Used if the caregiver is asking the child to explicitly repeat what the caregiver is saying.
  *MOT:      di mango.
  *CHI:        mango. [+ eimit]
  *MOT:      ahora manzana.
  *CHI:        (man)zana. [+ eimit]
  *MOT:      bravo!
  *MOT:      pollo.
  *CHI:        pollo. [+ eimit]
- Recited speech [+ recit] and [+ rmix] - Used if a speaker is reciting text [+ recit] or mixing text with their own deviations from the text [+ rmix].
- Recited utterances: [+ recit]
  - These included periods when the child or caregiver was reading a book, singing, using conventional expressions or sayings (e.g., prayers), or imitating media playing in the background. Please note that some cloze postcodes also included recited speech postcodes (see above).
  - Research assistants identified recited speech from book text using a variety of information including if the title or author names were mentioned, comparing against the book itself (found on YouTube, borrowed from the library, or purchased), and/or specific details from the pictures. In a few instances, research assistants noted speech as recited based on prosody and other cues, even if the book title could not be confirmed. More about book texts can be seen in the citation provided under the Fernald book sharing corpus (FMB_book corpus). If it was unclear whether caregivers were reading text, or if the text was limited (e.g., caregivers were naming general animals, describing pictures, or naming letters), then we did not mark these utterances as recited.
  - Acceptable words within recited speech included speech that was ‘starting’ an utterance to get the child’s attention. In the example below, “look it says” is not part of the text, but the phrase starts the utterance.
    *MOT: look it says Thomas_the_Train went round and round. [+ recit]
  - The following are still accepted as recited:
    - If the caregiver read part of the recited text and skipped around the page, but every word is said as it is written.
    - Changing “that” for “which” or vice versa was acceptable and not enough for a rmix postcode.
    - Contracting words, or expanding contractions, was acceptable as recited because this did not change the meaning (e.g., the text is “I will be back said the dog” and the caregiver says “I’ll be back said the dog”, or vice versa).
    - Changing the order of words (e.g., white fluffy feathers to fluffy white feathers).
- Recited mixed utterances: [+ rmix]
  - Any words that were omitted, added or changed within the text received the [+ rmix] postcode.
  - Omitted “so”: The text is “and King said he was so happy to see his friends”.
    *MOT:      and King said he was happy to see his friends [+ rmix]
  - Added language:
    *MOT:      and King said he was so happy to see his good friends [+ rmix] here the word "good" was added.
    
    *MOT:      ya le diste uno. [+ recit]
    *MOT:      ya le diste dos. [+ recit]
    *MOT:      ya le diste tres. [+ recit]
    *MOT:      ya le diste cuatro. [+ rmix] here, the word "cuatro" was added.
  - Changed language: this includes articles (e.g., a vs. the)
    *MOT: and King said he was so happy to see his friend [+ rmix]
    
    NOTE: We also counted translations of the recited text (this counts for ‘live’ translations, where the caregiver had a book in one language in front of them that did not have a translated text, so instead is translating it as they read). We counted this here because this language was no longer spontaneous for our purposes, since what was said was dictated by the available text.
- Overheard speech [+ ohs] - While extended periods of overheard speech (i.e., speech to other children or adults) were identified using gems, if overheard speech was present ‘in the background’ during a caregiver’s interaction with the target child, postcodes were used to tag these utterances.
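The precodes and postcodes described above follow a regular bracket pattern, so they can be separated from the utterance text with regular expressions. This is a minimal sketch under that assumption; the function name parse_utterance and the returned dictionary keys are hypothetical.

```python
import re

# "[- eng]" at the start of an utterance marks the language of the whole utterance.
PRECODE = re.compile(r"^\[-\s*([a-z]{3})\]\s*")
# "[+ recit]", "[+ rmix]", "[+ cloze]", "[+ eimit]", "[+ ohs]" follow the utterance.
POSTCODE = re.compile(r"\[\+\s*([A-Za-z]+)\]")


def parse_utterance(text):
    """Split an utterance into its language precode, postcodes, and remaining text."""
    match = PRECODE.match(text)
    language = match.group(1) if match else None
    body = text[match.end():] if match else text
    postcodes = POSTCODE.findall(body)
    body = POSTCODE.sub("", body).strip()
    return {"language": language, "postcodes": postcodes, "text": body}


print(parse_utterance("[- eng] mary had a little lamb little lamb little lamb. [+ recit]"))
print(parse_utterance("mary had a little? [+ recit] [+ cloze]"))
```

A typical use would be to set aside utterances carrying recit or rmix when the analysis targets spontaneous speech only; whether to do so depends on the research question.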
Special lexical forms used in the transcripts
- @g: child-invented, family-specific, nonsense words in books
- @g: overregularization, e.g., breaked@g
- @k: when a speaker is spelling out the word, e.g., apple is spelled out “a, p, p, l, e”
- @l: when a speaker is saying letters
- @s: when the word is in a second language noted on the @Languages tier
- @o: onomatopoeia
- @s:spa&eng: Some words were considered to be common by Spanish-English speakers in our bilingual community, and lab-specific conventions were determined for these spellings (e.g., “coras” for quarters).
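Words carrying the @s marker can be pulled out of an utterance to see how much second-language material it contains. The sketch below is illustrative only: it assumes the marker is attached directly to the word, as in the examples above, and the name second_language_words is hypothetical.

```python
import re

# A word followed by "@s", optionally with a language suffix such as ":spa&eng".
# The lookahead keeps longer markers that merely begin with "@s" from matching.
SECOND_LANGUAGE = re.compile(r"([^\s@]+)@s(?::[a-z&+]+)?(?![a-z])")


def second_language_words(utterance):
    """Return the words in an utterance that are marked with @s."""
    return SECOND_LANGUAGE.findall(utterance)


print(second_language_words("vamos a hacer un cake@s."))
print(second_language_words("ahora es time@s to@s eat@s."))
```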
Capitalizations - The following were capitalized in the transcripts.
- People (when roles were used as names, e.g., a child saying “Mommy, come”).
- State (as long as it did not identify the location of participants)
- City (as long as it did not identify the location of participants)
- Song Title
- Book Title
- Planet name
- Holidays
- Organizations (e.g., Navy, Ford for the car company)
- Months of the year
- Days of the week
- Brand names (e.g., Facebook, Cheezits)
- Racial/Ethnic groups (e.g., American, Hispanic)
- Political affiliation (e.g., Republican)
- Game names (e.g., Patty_Cake)
Anonymization - All names of people and places were replaced with a general placeholder to respect the anonymity of speakers. Below are some of the frequent anonymizations used in the transcripts.
- targetchild_name (for all variations of the child’s name)
- mother_name; father_name
- aunt_name; uncle_name
- grandma_name; grandpa_name
- adultfemale_name; adultmale_name
- sibling_name
- otherchild_name
- highway_name
- school_name
- park_name
Other considerations
- Intelligibility of 2-year-old children - Some children were less intelligible than others. When coders came across unclear vocalizations, they were asked to prioritize intent to say the word, using a variety of cues including how families responded to the child and what could be inferred from the broader context.
  - If a coder could not reasonably determine what the child was trying to say, then they were asked to consider this as a phonological fragment (using the “&+” CHAT convention). However, if they could reasonably determine what the child was saying, then they could also use the “()” shortening convention or the sound substitution convention. For example, if the child says, “I wa ou-hi”, then the proofreader may consider transcribing this as “I want outside” depending on the talk around this and how others respond. If it was not clear that the “I” was meant as a lexical item, coders were instructed to transcribe this as “&I wa(nt) ou(ts)i(de)”.
  - It was also possible to use the sound substitution, “[: word]” convention, as in ñaña [: araña].
- Overlapping speech - When there was overlapping speech, transcribers were instructed to finish the utterance on the line of the respective speaker, and then add the overlapping utterance of the other speaker on the next line.
- Repeated words
  - If there was varied prosody/intonation between each word - Words were split up across utterances.
  - If there was limited prosody/intonation between each word - At the time of transcription, we used the CHAT convention of [x N] to denote repeated words when the speaker appeared to be repeating the word in succession without pauses, with little variation in intonation. At the time of this contribution, the [x N] convention was discontinued in favor of using the [/] symbol between repeated words. We therefore changed these automatically using “kwal -d90 +d +f +t@ +t% *.cha +1 @”.
- Spelling conventions - Please note that a variety of lab-specific conventions were developed for the following.
  - Collocations and Concatenations - These were determined on a case-by-case basis, and noted in a shared lab manual as a reference. Please note that words in English that may be collocated did not necessarily make sense to collocate in Spanish (e.g., peanut_butter vs. manteca de cacahuete). We used dictionary forms as one reference (English: m-w.com, Spanish: https://dle.rae.es/), but also determined lab-specific conventions upon group discussion and consensus. Ultimately, we prioritized practical internal consistency within the study rather than linguistic or theoretical rationale.
  - Baby talk, exclamations, communicators - Lab-specific spelling was determined for a variety of English and Spanish baby talk words, exclamations, and communicators.
  - Onomatopoeias - Lab-specific spellings were used when an onomatopoeia was not present among the standard CHAT conventions in either language.
  - Simple events - Additional simple events to those seen in the CHAT manual were derived as needed.
- Comparisons to published data - Please note that in preparing these corpora, we made minor revisions or adjustments to clean the transcripts, and the prior %mor and %gra lines generated with the MOR program in CLAN were replaced with lines based on Universal Dependencies (UD). Thus, values derived from the shared corpus may not exactly match the estimates in the published data.
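
The change from [x N] to [/] described under "Repeated words" above can be pictured with a small Python sketch. This is not the CLAN command used for the corpus; it only illustrates the notational change for single repeated words, it deliberately leaves bracketed groups such as <...> [x N] untouched, and the name expand_repeats is hypothetical.

```python
import re

# A single word (no angle or square brackets) followed by "[x N]".
REPEAT = re.compile(r"([^\s<>\[\]]+)\s+\[x\s+(\d+)\]")


def expand_repeats(utterance):
    """Rewrite 'word [x N]' as the word written N times, joined by '[/]'."""
    def _expand(match):
        word = match.group(1)
        count = int(match.group(2))
        return " [/] ".join([word] * count)

    return REPEAT.sub(_expand, utterance)


print(expand_repeats("no [x 3] ."))
```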

Acknowledgements

We are very grateful to the children and parents who participated in this research. The following work would not have been possible without the collaboration of the undergraduate students and staff of the Language Learning Lab, directed by Dr. Anne Fernald. Thank you to Darwin Mastin for previous work and thoughts on an earlier version of the activity coding. Thank you to Mónica Munévar and Arlyn Mora who were project managers for the activity coding and transcription.

We are grateful to the research assistants who helped code and transcribe the data: Jessica Magallón, Nadia Segura, Shriya Anand, Sophia Gamboa, Marisol Rodriguez, Maria Lopez, Stephen Lopez, Jesús Esquivel-Barrientos, Laura Jonsson, Kalpana Gopalkrishnan, Maribel Mercardo, Tami Alade, Jaqueline De Paz-Romero, Lesly Leon, Alice Articia, Julia Briones-Avila, and Elizabeth Sanchez.

We greatly appreciate the patience, positivity, and dedication by all to capture natural and spontaneous language in everyday interactions with young children.

This work was supported by grants from the National Institutes of Health (R01 HD42235, R01 DC008838, R01 HD092343, 2R01 HD069150), the Schusterman Foundation, the David and Lucile Packard Foundation, the Bezos Family Foundation, and the Stanford Maternal and Child Health Research Institute.

Usage Restrictions

Please only use these transcripts for research/educational purposes.