Tools for Analyzing Talk
Part 1: The CHAT Transcription Format
Carnegie Mellon University
September 2, 2017
When citing the use of TalkBank and CHILDES facilities, please use this reference to the last printed version of the CHILDES manual:
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.
This allows us to systematically track usage of the programs and data through scholar.google.com.
This electronic edition of the CHAT manual is being continually revised to keep pace with the growing interests of the language research communities served by TalkBank and CHILDES. The first three editions were published in 1990, 1995, and 2000 by Lawrence Erlbaum Associates. After 2000, we switched to the current electronic publication format. However, we ask that users of this system cite the version of the manual published in 2000 when using data and programs in their published work.
In its current version, this manual tends still to focus on the use of the programs for child language data in the context of the CHILDES system (childes.talkbank.org). However, beginning in 2001 with support from NSF, we introduced the concept of TalkBank to include a wide variety of language databases. These now include:
1. CHILDES (childes.talkbank.org) for child language acquisition,
2. AphasiaBank (aphasia.talkbank.org) for aphasia,
3. PhonBank (phonbank.talkbank.org) for the study of phonological development,
4. TBIBank (talkbank.org/TBIBank) for language in traumatic brain injury,
5. RHDBank (talkbank.org/RHDBank) for language in right hemisphere damage,
6. DementiaBank (talkbank.org/DementiaBank) for language in dementia,
7. FluencyBank (fluency.talkbank.org) for the study of childhood fluency development,
8. HomeBank (homebank.talkbank.org) for daylong recordings in the home,
9. CABank for Conversation Analysis, including the large SCOTUS corpus,
10. SLABank (sla.talkbank.org) for second language acquisition,
11. ClassBank for studies of language in the classroom,
12. BilingBank for the study of bilingualism and code-switching,
13. LangBank for the study and learning of classical languages, and
14. SamtaleBank for Danish conversations.
We are continually adding corpora to each of these collections. The current size of the text database is 800MB and there is an additional 3TB of media. All of the data in TalkBank are freely open to downloading and analysis, with the exception of the data in the clinical language banks, which are open to clinical researchers using passwords. The CLAN program and the related morphosyntactic taggers are all free and open-sourced through GitHub.
Fortunately, all of these different language banks make use of the same transcription format (CHAT) and the same set of programs (CLAN). This means that, although most of the examples in this manual rely on data from the CHILDES database, the principles extend easily to data in all of the TalkBank repositories. TalkBank (http://talkbank.org) is the largest open repository of data on spoken language, and all of its data are transcribed in the CHAT format, which is compatible with the CLAN programs.
Using conversion programs available inside CLAN (see the CLAN manual for details), transcripts in CHAT format can be automatically converted into the formats required for Praat (praat.org), Phon (phonbank.talkbank.org), ELAN (tla.mpi.nl/tools/elan), CoNLL, ANVIL (anvil-software.org), EXMARaLDA (exmaralda.org), LIPP (ihsys.com), SALT (saltsoftware.com), LENA (lenafoundation.org), Transcriber (trans.sourceforge.net), and ANNIS (corpus-tools.org/ANNIS).
TalkBank databases and programs have been used widely in the research literature. CHILDES, which is the oldest and most widely recognized of these databases, has been used in over 7000 published articles. PhonBank has been used in 480 articles and AphasiaBank has been used in 212 presentations and publications. In general, the longer a database has been available to researchers, the more the use of that database has become integrated into the basic research methodology and publication history of the field.
Metadata for the transcripts and media in these various TalkBank databases have been entered into the two major systems for accessing linguistic data: OLAC and VLO (Virtual Language Observatory). Each transcript and media file has been assigned a PID (permanent ID) using the Handle System (www.handle.net), and each corpus has received an ISBN and DOI (digital object identifier) number.
For ten of the languages in the database, we provide automatic morphosyntactic analysis using a series of programs built into CLAN. These languages are Cantonese, Chinese, Dutch, English, French, German, Hebrew, Japanese, Italian, and Spanish. The codes produced by these programs could eventually be harmonized with the GOLD ontology. In addition, we can compute a dependency grammar analysis for each of these ten languages. As a result of these efforts, TalkBank has been recognized as a Center in the CLARIN network (clarin.eu) and has received the Data Seal of Approval (datasealofapproval.org). TalkBank data have also been included in the SketchEngine corpus tool (sketchengine.co.uk).
Language acquisition research thrives on data collected from spontaneous interactions in naturally occurring situations. You can turn on a tape recorder or videotape, and, before you know it, you will have accumulated a library of dozens or even hundreds of hours of naturalistic interactions. But simply collecting data is only the beginning of a much larger task, because the process of transcribing and analyzing naturalistic samples is extremely time-consuming and often unreliable. In this first volume, we will present a set of computational tools designed to increase the reliability of transcriptions, automate the process of data analysis, and facilitate the sharing of transcript data. These new computational tools have brought about revolutionary changes in the way that research is conducted in the child language field. In addition, they have equally revolutionary potential for the study of second-language learning, adult conversational interactions, sociological content analyses, and language recovery in aphasia. Although the tools are of wide applicability, this volume concentrates on their use in the child language field, in the hope that researchers from other areas can make the necessary analogies to their own topics.
Before turning to a detailed examination of the current system, it may be helpful to take a brief historical tour over some of the major highlights of earlier approaches to the collection of data on language acquisition. These earlier approaches can be grouped into five major historical periods.
The first attempt to understand the process of language development appears in a remarkable passage from The Confessions of St. Augustine (1952). In this passage, Augustine claims that he remembered how he had learned language:
This I remember; and have since observed how I learned to speak. It was not that my elders taught me words (as, soon after, other learning) in any set method; but I, longing by cries and broken accents and various motions of my limbs to express my thoughts, that so I might have my will, and yet unable to express all I willed or to whom I willed, did myself, by the understanding which Thou, my God, gavest me, practise the sounds in my memory. When they named anything, and as they spoke turned towards it, I saw and remembered that they called what they would point out by the name they uttered. And that they meant this thing, and no other, was plain from the motion of their body, the natural language, as it were, of all nations, expressed by the countenance, glances of the eye, gestures of the limbs, and tones of the voice, indicating the affections of the mind as it pursues, possesses, rejects, or shuns. And thus by constantly hearing words, as they occurred in various sentences, I collected gradually for what they stood; and, having broken in my mouth to these signs, I thereby gave utterance to my will. Thus I exchanged with those about me these current signs of our wills, and so launched deeper into the stormy intercourse of human life, yet depending on parental authority and the beck of elders.
Augustine's outline of early word learning drew attention to the role of gaze, pointing, intonation, and mutual understanding as fundamental cues to language learning. Modern research in word learning (Bloom, 2000) has supported every point of Augustine's analysis, as well as his emphasis on the role of children's intentions. In this sense, Augustine's somewhat fanciful recollection of his own language acquisition remained the high-water mark for child language studies through the Middle Ages and even the Enlightenment. Unfortunately, the method on which these insights were grounded depends on our ability to actually recall the events of early childhood – a gift granted to very few of us.
Charles Darwin provided much of the inspiration for the development of the second major technique for the study of language acquisition. Using note cards and field books to track the distribution of hundreds of species and subspecies in places like the Galapagos and Indonesia, Darwin was able to collect an impressive body of naturalistic data in support of his views on natural selection and evolution. In his study of gestural development in his son, Darwin (1877) showed how these same tools for naturalistic observation could be adapted to the study of human development. By taking detailed daily notes, Darwin showed how researchers could build diaries that could then be converted into biographies documenting virtually any aspect of human development. Following Darwin's lead, scholars such as Ament (1899), Preyer (1882), Gvozdev (1949), Szuman (1955), Stern & Stern (1907), Kenyeres (1926, 1938), and Leopold (1939, 1947, 1949a, 1949b) created monumental biographies detailing the language development of their own children.
Darwin's biographical technique also had its effects on the study of adult aphasia. Following in this tradition, studies of the language of particular patients and syndromes were presented by Low (1931), Pick (1913), Wernicke (1874), and many others.
The limits of the diary technique were always quite apparent. Even the most highly trained observer could not keep pace with the rapid flow of normal speech production. Anyone who has attempted to follow a child about with a pen and a notebook soon realizes how much detail is missed and how the note-taking process interferes with the ongoing interactions.
The introduction of the tape recorder in the late 1950s provided a way around these limitations and ushered in the third period of observational studies. The effect of the tape recorder on the field of language acquisition was very much like its effect on ethnomusicology, where researchers such as Alan Lomax (Parrish, 1996) were suddenly able to produce high-quality field recordings using this new technology. This period was characterized by projects in which groups of investigators collected large data sets of tape recordings from several subjects across a period of 2 or 3 years. Much of the excitement in the 1960s regarding new directions in child language research was fueled directly by the great increase in raw data that was possible through use of tape recordings and typed transcripts.
This increase in the amount of raw data had an additional, seldom discussed, consequence. In the period of the baby biography, the final published accounts closely resembled the original database of note cards. In this sense, there was no major gap between the observational database and the published database. In the period of typed transcripts, a wider gap emerged. The size of the transcripts produced in the 1960s and 1970s made it impossible to publish the full corpora. Instead, researchers were forced to publish only high-level analyses based on data that were not available to others. This led to a situation in which the raw empirical database for the field was kept only in private stocks, unavailable for general public examination. Comments and tallies were written into the margins of ditto master copies, and new, even less legible copies were then made by thermal production of new ditto masters. Each investigator devised a project-specific system of transcription and project-specific codes. As we began to compare hand-written and typewritten transcripts, problems in transcription methodology, coding schemes, and cross-investigator reliability became more apparent.
Recognizing this problem, Roger Brown took the lead in attempting to share his transcripts from Adam, Eve, and Sarah (Brown, 1973) with other researchers. These transcripts were typed onto stencils and mimeographed in multiple copies. The extra copies were lent to and analyzed by a wide variety of researchers. In this model, researchers took their copy of the transcript home, developed their own coding scheme, applied it (usually by making pencil markings directly on the transcript), wrote a paper about the results and, if very polite, sent a copy to Roger. Some of these reports (Moerk, 1983) even attempted to disprove the conclusions drawn from those data by Brown himself!
During this early period, the relations between the various coding schemes often remained shrouded in mystery. A fortunate consequence of the unstable nature of coding systems was that researchers were very careful not to throw away their original data, even after it had been coded. Brown himself commented on the impending transition to computers in this passage (Brown, 1973, p. 53):
It is sensible to ask and we were often asked, “Why not code the sentences for grammatically significant features and put them on a computer so that studies could readily be made by anyone?” My answer always was that I was continually discovering new kinds of information that could be mined from a transcription of conversation and never felt that I knew what the full coding should be. This was certainly the case and indeed it can be said that in the entire decade since 1962 investigators have continued to hit upon new ways of inferring grammatical and semantic knowledge or competence from free conversation. But, for myself, I must, in candor, add that there was also a factor of research style. I have little patience with prolonged “tooling up” for research. I always want to get started. A better scientist would probably have done more planning and used the computer. He can do so today, in any case, with considerable confidence that he knows what to code.
With the experience of three more decades of computerized analysis behind us, we now know that the idea of reducing child language data to a set of codes and then throwing away the original data is simply wrong. Instead, our goal must be to computerize the data in a way that allows us to continually enhance it with new codes and annotations. It is fortunate that Brown preserved his transcript data in a form that allowed us to continue to work on it. It is unfortunate, however, that the original audiotapes were not kept.
Just as these data analysis problems were coming to light, a major technological opportunity was emerging in the shape of the powerful, affordable microcomputer. Microcomputer word-processing systems and database programs allowed researchers to enter transcript data into computer files that could then be easily duplicated, edited, and analyzed by standard data-processing techniques. In 1981, when the Child Language Data Exchange System (CHILDES) Project was first conceived, researchers basically thought of computer systems as large notepads. Although researchers were aware of the ways in which databases could be searched and tabulated, the full analytic and comparative power of the computer systems themselves was not yet fully understood.
Rather than serving only as an “archive” or historical record, a shared database can lead to advances in methodology and theory. However, to achieve these additional advances, researchers first needed to move beyond the idea of a simple data repository. At first, the possibility of utilizing shared transcription formats, shared codes, and shared analysis programs shone only as a faint glimmer on the horizon, against the fog and gloom of handwritten tallies, fuzzy dittos, and idiosyncratic coding schemes. Slowly, against this backdrop, the idea of a computerized data exchange system began to emerge, and it was from this idea that CHILDES (the name uses a one-syllable pronunciation) was conceived. The origin of the system can be traced back to the summer of 1981 when Dan Slobin, Willem Levelt, Susan Ervi