Tools for Analyzing Talk
Part 1: The CHAT Transcription Format
Carnegie Mellon University
May 12, 2020
When citing the use of TalkBank and CHILDES facilities, please use this reference to the last printed version of the CHILDES manual:
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.
This allows us to track usage of the programs and data systematically through scholar.google.com.
This electronic edition of the CHAT manual is continually revised to keep pace with the growing interests of the language research communities served by TalkBank and CHILDES. The first three editions were published in 1990, 1995, and 2000 by Lawrence Erlbaum Associates. After 2000, we switched to the current electronic publication format. However, to make usage easy to track through systems such as Google Scholar, we ask that users cite the 2000 edition given above when using the data and programs in their published work.
In its earlier versions, this manual focused exclusively on the use of the programs for child language data in the context of the CHILDES system (https://childes.talkbank.org). However, beginning in 2001 with support from NSF, we introduced the concept of TalkBank (https://talkbank.org) to include a wide variety of language databases. These now include:
1. AphasiaBank (https://aphasia.talkbank.org) for language in aphasia,
2. ASD Bank (https://asd.talkbank.org) for language in autism,
3. BilingBank (https://biling.talkbank.org) for the study of bilingualism and code-switching,
4. CABank (https://ca.talkbank.org) for Conversation Analysis, including the large SCOTUS corpus,
5. CHILDES (https://childes.talkbank.org) for child language acquisition,
6. ClassBank (https://class.talkbank.org) for studies of language in the classroom,
7. DementiaBank (https://dementia.talkbank.org) for language in dementia,
8. FluencyBank (https://fluency.talkbank.org) for the study of childhood fluency development,
9. HomeBank (https://homebank.talkbank.org) for daylong recordings in the home,
10. PhonBank (https://phonbank.talkbank.org) for the study of phonological development,
11. RHDBank (https://rhd.talkbank.org) for language in right hemisphere damage,
12. SamtaleBank (https://samtalebank.talkbank.org) for Danish conversations,
13. SLABank (https://slabank.talkbank.org) for second language acquisition, and
14. TBIBank (https://tbi.talkbank.org) for language in traumatic brain injury.
The current manual maintains some of the earlier emphasis on child language, particularly in the first sections, while extending the treatment to these further areas and formats through new codes and several new sections. We are continually adding corpora to each of these separate collections. As of 2018, the text database was 800MB in size, with an additional 5TB of media. All of the data in TalkBank are freely open to downloading and analysis, with the exception of the data in the clinical language banks, which are open to clinical researchers using passwords. The CLAN program and the related morphosyntactic taggers are all free and open-sourced through GitHub.
Fortunately, all of these language banks use the same transcription format (CHAT) and the same set of programs (CLAN). This means that, although most of the examples in this manual rely on data from the CHILDES database, the principles extend easily to data in all of the TalkBank repositories. TalkBank is the largest open repository of data on spoken language.
Using conversion programs available inside CLAN (see the CLAN manual for details), transcripts in CHAT format can be automatically converted into the formats required for Praat (praat.org), Phon (phonbank.talkbank.org), ELAN (tla.mpi.nl/tools/elan), CoNLL, ANVIL (anvil-software.org), EXMARaLDA (exmaralda.org), LIPP (ihsys.com), SALT (saltsoftware.com), LENA (lenafoundation.org), Transcriber (trans.sourceforge.net), and ANNIS (corpus-tools.org/ANNIS).
TalkBank databases and programs have been used widely in the research literature. CHILDES, which is the oldest and most widely recognized of these databases, has been used in over 7000 published articles. PhonBank has been used in 480 articles and AphasiaBank has been used in 212 presentations and publications. In general, the longer a database has been available to researchers, the more the use of that database has become integrated into the basic research methodology and publication history of the field.
Metadata for the transcripts and media in these various TalkBank databases have been entered into the two major systems for accessing linguistic data: OLAC and VLO (Virtual Language Observatory). Each transcript and media file has been assigned a PID (permanent ID) using the Handle System (www.handle.net), and each corpus has received an ISBN and a DOI (digital object identifier).
For ten of the languages in the database, we provide automatic morphosyntactic analysis using a series of programs built into CLAN. These languages are Cantonese, Chinese, Dutch, English, French, German, Hebrew, Japanese, Italian, and Spanish. The codes produced by these programs could eventually be harmonized with the GOLD ontology. In addition, we can compute a dependency grammar analysis for each of these ten languages. As a result of these efforts, TalkBank has been recognized as a Center in the CLARIN network (clarin.eu) and has received the Data Seal of Approval (datasealofapproval.org). TalkBank data have also been included in the SketchEngine corpus tool (sketchengine.co.uk).
Language acquisition research thrives on data collected from spontaneous interactions in naturally occurring situations. You can turn on a tape recorder or videotape, and, before you know it, you will have accumulated a library of dozens or even hundreds of hours of naturalistic interactions. But simply collecting data is only the beginning of a much larger task, because the process of transcribing and analyzing naturalistic samples is extremely time-consuming and often unreliable. In this first volume, we will present a set of computational tools designed to increase the reliability of transcriptions, automate the process of data analysis, and facilitate the sharing of transcript data. These new computational tools have brought about revolutionary changes in the way that research is conducted in the child language field. In addition, they have equally revolutionary potential for the study of second-language learning, adult conversational interactions, sociological content analyses, and language recovery in aphasia. Although the tools are of wide applicability, this volume concentrates on their use in the child language field, in the hope that researchers from other areas can make the necessary analogies to their own topics.
Before turning to a detailed examination of the current system, it may be helpful to take a brief historical tour over some of the major highlights of earlier approaches to the collection of data on language acquisition. These earlier approaches can be grouped into five major historical periods.
The first attempt to understand the process of language development appears in a remarkable passage from The Confessions of St. Augustine (1952). In this passage, Augustine claims that he remembered how he had learned language:
This I remember; and have since observed how I learned to speak. It was not that my elders taught me words (as, soon after, other learning) in any set method; but I, longing by cries and broken accents and various motions of my limbs to express my thoughts, that so I might have my will, and yet unable to express all I willed or to whom I willed, did myself, by the understanding which Thou, my God, gavest me, practise the sounds in my memory. When they named anything, and as they spoke turned towards it, I saw and remembered that they called what they would point out by the name they uttered. And that they meant this thing, and no other, was plain from the motion of their body, the natural language, as it were, of all nations, expressed by the countenance, glances of the eye, gestures of the limbs, and tones of the voice, indicating the affections of the mind as it pursues, possesses, rejects, or shuns. And thus by constantly hearing words, as they occurred in various sentences, I collected gradually for what they stood; and, having broken in my mouth to these signs, I thereby gave utterance to my will. Thus I exchanged with those about me these current signs of our wills, and so launched deeper into the stormy intercourse of human life, yet depending on parental authority and the beck of elders.
Augustine's outline of early word learning drew attention to the role of gaze, pointing, intonation, and mutual understanding as fundamental cues to language learning. Modern research in word learning (Bloom, 2000) has supported every point of Augustine's analysis, as well as his emphasis on the role of children's intentions. In this sense, Augustine's somewhat fanciful recollection of his own language acquisition remained the high water mark for child language studies through the Middle Ages and even the Enlightenment. Unfortunately, the method on which these insights were grounded depends on our ability to actually recall the events of early childhood – a gift granted to very few of us.
Charles Darwin provided much of the inspiration for the development of the second major technique for the study of language acquisition. Using note cards and field books to track the distribution of hundreds of species and subspecies in places like the Galapagos and Indonesia, Darwin was able to collect an impressive body of naturalistic data in support of his views on natural selection and evolution. In his study of gestural development in his son, Darwin (1877) showed how these same tools for naturalistic observation could be adapted to the study of human development. By taking detailed daily notes, Darwin showed how researchers could build diaries that could then be converted into biographies documenting virtually any aspect of human development. Following Darwin's lead, scholars such as Ament (1899), Preyer (1882), Gvozdev (1949), Szuman (1955), Stern & Stern (1907), Kenyeres (1926, 1938), and Leopold (1939, 1947, 1949a, 1949b) created monumental biographies detailing the language development of their own children.
Darwin's biographical technique also had its effects on the study of adult aphasia. Following in this tradition, studies of the language of particular patients and syndromes were presented by Low (1931), Pick (1913), Wernicke (1874), and many others.
The limits of the diary technique were always quite apparent. Even the most highly trained observer could not keep pace with the rapid flow of normal speech production. Anyone who has attempted to follow a child about with a pen and a notebook soon realizes how much detail is missed and how the note-taking process interferes with the ongoing interactions.
The introduction of the tape recorder in the late 1950s provided a way around these limitations and ushered in the third period of observational studies. The effect of the tape recorder on the field of language acquisition was very much like its effect on ethnomusicology, where researchers such as Alan Lomax (Parrish, 1996) were suddenly able to produce high quality field recordings using this new technology. This period was characterized by projects in which groups of investigators collected large data sets of tape recordings from several subjects across a period of 2 or 3 years. Much of the excitement in the 1960s regarding new directions in child language research was fueled directly by the great increase in raw data that was possible through use of tape recordings and typed transcripts.
This increase in the amount of raw data had an additional, seldom discussed, consequence. In the period of the baby biography, the final published accounts closely resembled the original database of note cards. In this sense, there was no major gap between the observational database and the published database. In the period of typed transcripts, a wider gap emerged. The size of the transcripts produced in the 60s and 70s made it impossible to publish the full corpora. Instead, researchers were forced to publish only high-level analyses based on data that were not available to others. This led to a situation in which the raw empirical database for the field was kept only in private stocks, unavailable for general public examination. Comments and tallies were written into the margins of ditto master copies and new, even less legible copies, were then made by thermal production of new ditto masters. Each investigator devised a project-specific system of transcription and project-specific codes. As we began to compare hand-written and typewritten transcripts, problems in transcription methodology, coding schemes, and cross-investigator reliability became more apparent.
Recognizing this problem, Roger Brown took the lead in attempting to share his transcripts from Adam, Eve, and Sarah (Brown, 1973) with other researchers. These transcripts were typed onto stencils and mimeographed in multiple copies. The extra copies were lent to and analyzed by a wide variety of researchers. In this model, researchers took their copy of the transcript home, developed their own coding scheme, applied it (usually by making pencil markings directly on the transcript), wrote a paper about the results and, if very polite, sent a copy to Roger. Some of these reports (Moerk, 1983) even attempted to disprove the conclusions drawn from those data by Brown himself!
During this early period, the relations between the various coding schemes often remained shrouded in mystery. A fortunate consequence of the unstable nature of coding systems was that researchers were very careful not to throw away their original data, even after it had been coded. Brown himself commented on the impending transition to computers in this passage (Brown, 1973, p. 53):
It is sensible to ask and we were often asked, “Why not code the sentences for grammatically significant features and put them on a computer so that studies could readily be made by anyone?” My answer always was that I was continually discovering new kinds of information that could be mined from a transcription of conversation and never felt that I knew what the full coding should be. This was certainly the case and indeed it can be said that in the entire decade since 1962 investigators have continued to hit upon new ways of inferring grammatical and semantic knowledge or competence from free conversation. But, for myself, I must, in candor, add that there was also a factor of research style. I have little patience with prolonged “tooling up” for research. I always want to get started. A better scientist would probably have done more planning and used the computer. He can do so today, in any case, with considerable confidence that he knows what to code.
With the experience of three more decades of computerized analysis behind us, we now know that the idea of reducing child language data to a set of codes and then throwing away the original data is simply wrong. Instead, our goal must be to computerize the data in a way that allows us to continually enhance it with new codes and annotations. It is fortunate that Brown preserved his transcript data in a form that allowed us to continue to work on it. It is unfortunate, however, that the original audiotapes were not kept.
Just as these data analysis problems were coming to light, a major technological opportunity was emerging in the shape of the powerful, affordable microcomputer. Microcomputer word-processing systems and database programs allowed researchers to enter transcript data into computer files that could then be easily duplicated, edited, and analyzed by standard data-processing techniques. In 1981, when the Child Language Data Exchange System (CHILDES) Project was first conceived, researchers basically thought of computer systems as large notepads. Although researchers were aware of the ways in which databases could be searched and tabulated, the full analytic and comparative power of the computer systems themselves was not yet fully understood.
A shared database can serve as more than an “archive” or historical record; it can also lead to advances in methodology and theory. However, to achieve these additional advances, researchers first needed to move beyond the idea of a simple data repository. At first, the possibility of utilizing shared transcription formats, shared codes, and shared analysis programs shone only as a faint glimmer on the horizon, against the fog and gloom of handwritten tallies, fuzzy dittos, and idiosyncratic coding schemes. Slowly, against this backdrop, the idea of a computerized data exchange system began to emerge. It was against this conceptual background that CHILDES (the name uses a one-syllable pronunciation) was conceived. The origin of the system can be traced back to the summer of 1981, when Dan Slobin, Willem Levelt, Susan Ervin-Tripp, and Brian MacWhinney discussed the possibility of creating an archive for typed, handwritten, and computerized transcripts to be located at the Max-Planck-Institut für Psycholinguistik in Nijmegen. In 1983, the MacArthur Foundation funded meetings of developmental researchers in which Elizabeth Bates, Brian MacWhinney, Catherine Snow, and other child language researchers discussed the possibility of soliciting MacArthur funds to support a data exchange system. In January of 1984, the MacArthur Foundation awarded a two-year grant to Brian MacWhinney and Catherine Snow for the establishment of the Child Language Data Exchange System. These funds provided for the entry of data into the system and for the convening of a meeting of an advisory board. Twenty child language researchers met for three days in Concord, Massachusetts, and agreed on a basic framework for the CHILDES system, which Catherine Snow and Brian MacWhinney would then proceed to implement.
Since 1984, when the CHILDES Project began in earnest, the world of computers has gone through a series of remarkable revolutions, each introducing new opportunities and challenges. The processing power of the home computer now dwarfs the power of the mainframe of the 1980s; new machines are now shipped with built-in audiovisual capabilities; and devices such as CD-ROMs and optical disks offer enormous storage capacity at reasonable prices. This new hardware has now opened up the possibility for multimedia access to digitized audio and video from links inside the written transcripts. In effect, a transcript is now the starting point for a new exploratory reality in which the whole interaction is accessible from the transcript. Although researchers have just now begun to make use of these new tools, the current shape of the CHILDES system reflects many of these new realities. In the pages that follow, you will learn about how we are using this new technology to provide rapid access to the database and to permit the linkage of transcripts to digitized audio and video records, even over the Internet.
Beginning in 2001, with support from an NSF Infrastructure grant, we began the extension of the CHILDES database concept to a series of additional fields listed in the Introduction. These extensions have led to the need for additional features in the CHAT coding system to support CA notation, phonological analysis, and gesture coding. As we develop new tools for each of these areas and increase the interoperability between tools, the power of the system continues to grow. As a result, we can now refer to this work as the TalkBank Project.
The reasons for developing a computerized exchange system for language data are immediately obvious to anyone who has produced or analyzed transcripts. With such a system, we can:
1. automate the process of data analysis,
2. obtain better data in a consistent, fully-documented transcription system, and
3. provide more data for more children from more ages, speaking more languages.
The TalkBank Project has addressed each of these goals by developing three separate, but integrated, tools. The first tool is the CHAT transcription and coding format. The second tool is the CLAN analysis program, and the third tool is the database. These three tools are like the legs of a three-legged stool. The transcripts in the database have all been put into the CHAT transcription system. The CLAN program is designed to make full use of the CHAT format to facilitate a wide variety of searches and analyses. Many research groups are now using the CLAN programs to enter new data sets. Eventually, these new data sets will be available to other researchers as a part of the growing TalkBank databases. In this way, CHAT, CLAN, and the database function as an integrated set of tools. There are manuals for each of these TalkBank tools.
1. Part 1 of the TalkBank manual, which you are now reading, describes the conventions and principles of CHAT transcription.
2. Part 2 describes the use of the basic CLAN computer programs that you can use to transcribe, annotate, and analyze language interactions.
3. Part 3 describes the use of additional CLAN programs for morphosyntactic analysis.
4. The final section of the manuals, which describes the contents of the databases, is broken out as a collection of index and documentation files on the web. For example, if you want to survey the shape of the Dutch child language corpora, you first go to https://childes.talkbank.org. There is also a link to that site from the overall index at https://talkbank.org/. From that homepage, you click on **Index to Corpora** and then Dutch. From there, you might want to read about the contents of the CLPF corpus for early phonological development in Dutch. You then click on the CLPF link, which takes you to the fuller corpus description with photos from the contributors. From links on that page, you can browse the corpus, download the transcripts, or download the media.
In addition to these basic manual resources, there are these further facilities for learning CHAT and CLAN, all of which can be downloaded from the talkbank.org and childes.talkbank.org server sites:
1. Nan Bernstein Ratner and Shelley Brundage have contributed a manual designed specifically for clinical practitioners called the SLP’s Guide to CLAN.
2. There are versions of the manuals in Japanese and Chinese.
3. Davida Fromm has produced a series of screencasts describing how to use basic features of CLAN.
We received a great deal of extremely helpful input during the years between 1984 and 1988 when the CHAT system was being formulated. Some of the most detailed comments came from George Allen, Elizabeth Bates, Nan Bernstein Ratner, Giuseppe Cappelli, Annick De Houwer, Jane Desimone, Jane Edwards, Julia Evans, Judi Fenson, Paul Fletcher, Steven Gillis, Kristen Keefe, Mary MacWhinney, Jon Miller, Barbara Pan, Lucia Pfanner, Kim Plunkett, Kelley Sacco, Catherine Snow, Jeff Sokolov, Leonid Spektor, Joseph Stemberger, Frank Wijnen, and Antonio Zampolli. Comments developed in Edwards (1992) were useful in shaping core aspects of CHAT. George Allen (1988) helped develop the UNIBET and PHONASCII systems. The workers in the LIPPS Group (LIPPS, 2000) have developed extensions of CHAT to cover code-switching phenomena. Adaptations of CHAT to deal with data on disfluencies are developed in Bernstein-Ratner, Rooney, and MacWhinney (1996). The exercises in the CLAN manual are based on materials originally developed by Barbara Pan for Chapter 2 of Sokolov & Snow (1994).
In the period between 2001 and 2004, we converted much of the CHILDES system to work with the new XML Internet data format. This work was begun by Romeo Anghelache and completed by Franklin Chen. Support for this major reformatting and the related tightening of the CHAT format came from the NSF TalkBank Infrastructure project which involved a major collaboration with Steven Bird and Mark Liberman of the Linguistic Data Consortium.
The CLAN program is the brainchild of Leonid Spektor. Ideas for particular analysis commands came from several sources. Bill Tuthill's HUM package provided ideas about concordance analyses. The SALT system of Miller & Chapman (1983) provided guidelines regarding basic practices in transcription and analysis. Clifton Pye's PAL program provided ideas for the MODREP and PHONFREQ commands.
Darius Clynes ported CLAN to the Macintosh. Jeffrey Sokolov wrote the CHIP program. Mitzi Morris designed the MOR analyzer using specifications provided by Roland Hausser of Erlangen University. Norio Naka and Susanne Miyata developed a MOR rule system for Japanese, and Monica Sanz-Torrent helped develop the MOR system for Spanish. Julia Evans provided recommendations for the design of the audio and visual capabilities of the editor. Johannes Wagner and Spencer Hazel helped show us how we could modify CLAN to permit transcription in the Conversation Analysis framework. Steven Gillis provided suggestions for aspects of MODREP. Christophe Parisse built the POST and POSTTRAIN programs (Parisse & Le Normand, 2000). Brian Richards contributed the VOCD program (Malvern, Richards, Chipere, & Durán, 2004). Julia Evans helped specify TIMEDUR and worked on the details of DSS. Catherine Snow designed CHAINS, KEYMAP, and STATFREQ. Nan Bernstein Ratner specified aspects of PHONFREQ and plans for additional programs for phonological analysis.
The primary reason for the success of the TalkBank databases has been the generosity of over 300 researchers who have contributed their corpora. Each of these corpora represents hundreds, often thousands, of hours spent in careful collection, transcription, and checking of data. All researchers in child language should be proud of the way researchers have generously shared their valuable data with the whole research community. The growing size of the database for language impairments, adult aphasia, and second-language acquisition indicates that these related areas have also begun to understand the value of data sharing.
Many of the corpora contributed to the system were transcribed before the formulation of CHAT. In order to create a uniform database, we had to reformat these corpora into CHAT. Jane Desimone, Mary MacWhinney, Jane Morrison, Kim Roth, Kelley Sacco, Lillian Jarold, Anthony Kelly, Andrew Yankes, and Gergely Sikuta worked many long hours on this task. Steven Gillis, Helmut Feldweg, Susan Powers, and Heike Behrens supervised a parallel effort with the German and Dutch data sets.
Because of the continually changing shape of the programs and the database, keeping this manual up to date has been an ongoing activity. In this process, I received help from Mike Blackwell, Julia Evans, Kris Loh, Mary MacWhinney, Lucy Hewson, Kelley Sacco, and Gergely Sikuta. Barbara Pan, Jeff Sokolov, and Pam Rollins also provided a reading of the final draft of the 1995 version of the manual.
Since the beginning of the project, Catherine Snow has continually played a pivotal role in shaping policy, building the database, organizing workshops, and determining the shape of CHAT and CLAN. Catherine Snow collaborated with Jeffrey Sokolov, Pam Rollins, and Barbara Pan to construct a series of tutorial exercises and demonstration analyses that appeared in Sokolov & Snow (1994). Those exercises form the basis for similar tutorial sections in the current manual. Catherine Snow has contributed six major corpora to the database and has conducted CHILDES workshops in a dozen countries.
Several other colleagues have helped disseminate the CHILDES system through workshops, visits, and Internet facilities. Hidetosi Sirai established a CHILDES file server mirror at Chukyo University in Japan and Steven Gillis established a mirror at the University of Antwerp. Steven Gillis, Kim Plunkett, Johannes Wagner, and Sven Strömqvist helped propagate the CHILDES system at universities in Northern and Central Europe. Susanne Miyata has brought together a vital group of child language researchers using CHILDES to study the acquisition of Japanese and has supervised the translation of the current manual into Japanese. In Italy, Elena Pizzuto organized symposia for developing the CHILDES system and has supervised the translation of the manual into Italian. Magdalena Smoczynska in Krakow and Wolfgang Dressler in Vienna have helped new researchers who are learning to use CHILDES for languages spoken in Eastern Europe. Miquel Serra has supported a series of CHILDES workshops in Barcelona. Zhou Jing organized a workshop in Nanjing and Chien-ju Chang organized a workshop in Taipei.
The establishment and promotion of additional segments of TalkBank now relies on a wide array of inputs. Yvan Rose has spearheaded the creation of PhonBank. Nan Bernstein Ratner has led the development of FluencyBank. Audrey Holland, Davida Fromm, and Margie Forbes have worked to create AphasiaBank. Johannes Wagner has created SamtaleBank and segments of CABank. Jerry Goldman developed the SCOTUS segment of CABank. Roy Pea contributed to the development of ClassBank. Within each of these communities, scores of other scholars have helped with donations of corpora, analyses, and ideas.
From 1984 to 1988, the John D. and Catherine T. MacArthur Foundation supported the CHILDES Project. In 1988, the National Science Foundation provided an equipment grant that allowed us to put the database on the Internet and on CD-ROMs. Since 1989, the CHILDES project has been supported by an ongoing grant from the National Institutes of Health (NICHD). In 1998, the National Science Foundation Linguistics Program provided additional support to improve the programs for morphosyntactic analysis of the database. In 1999, NSF funded the TalkBank project. In 2002, NSF provided support for the development of the GRASP system for parsing of the corpora. In 2002, NIH provided additional support for the development of PhonBank for child language phonology and AphasiaBank for the study of communication in aphasia. Currently (2017), NICHD provides support for CHILDES and PhonBank; NIDCD provides support for AphasiaBank and FluencyBank; NSF provides support for HomeBank and FluencyBank; and NEH provides support for LangBank. Beginning in 2014, TalkBank also became a member of the CLARIN federation (clarin.eu), a system designed to coordinate resources for language computation in the Humanities and Social Sciences.
Each of the three parts of the TalkBank system is described in separate sections of the TalkBank manual. The CHAT manual describes the conventions and principles of CHAT transcription. The CLAN manual describes the use of the editor and the analytic commands. The database manual is a set of over a dozen smaller documents, each describing a separate segment of the database.
To learn the TalkBank system, you should begin by downloading and installing the CLAN program. Next, you should download and start to read the current manual (CHAT Manual) and the CLAN manual (Part 2 of the TalkBank manual). Before proceeding too far into the CHAT manual, you will want to walk through the tutorial section at the beginning of the CLAN manual. After finishing the tutorial, try working a bit with each of the CLAN commands to get a feel for the overall scope of the system. You can then learn more about CHAT by transcribing a small sample of your data in a short test file. Run the CHECK program at frequent intervals to verify the accuracy of your coding. Once you have finished transcribing a small segment of your data, try out the various analysis programs you plan to use, to make sure that they provide the types of results you need for your work.
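To give a rough sense of what a structural check on a short test file involves, here is a minimal Python sketch. It is not the CHECK program and does not reproduce its rules. It assumes only a few CHAT conventions that are introduced later in this manual rather than in this section: header lines begin with @, main speaker tiers begin with * followed by a speaker code, a colon, and a tab, dependent tiers begin with %, continuation lines begin with a tab, and a transcript is bracketed by @Begin and @End. The function name, the patterns, and the sample transcript are hypothetical and serve only as illustration.

```python
import re

# Assumed line shapes (simplified; the real CHAT rules are much richer):
#   *SPK:<tab>utterance      main tier
#   %xxx:<tab>annotation     dependent tier
MAIN_TIER = re.compile(r"^\*[A-Z0-9]+:\t")
DEPENDENT_TIER = re.compile(r"^%[a-z]+:\t")


def quick_chat_sanity_check(text):
    """Return a list of (line_number, message) pairs for obvious structural problems.

    Hypothetical helper for illustration only; it is not a substitute for CHECK.
    """
    lines = [(n, line) for n, line in enumerate(text.splitlines(), start=1)
             if line.strip()]
    if not lines:
        return [(0, "transcript is empty")]

    problems = []
    if not any(line == "@Begin" for _, line in lines):
        problems.append((0, "missing @Begin header"))
    if lines[-1][1] != "@End":
        problems.append((lines[-1][0], "last line is not @End"))

    seen_main_tier = False
    for n, line in lines:
        if line.startswith("@") or line.startswith("\t"):
            continue  # header line or continuation line: not inspected further here
        if line.startswith("*"):
            if MAIN_TIER.match(line):
                seen_main_tier = True
            else:
                problems.append((n, "malformed main tier"))
        elif line.startswith("%"):
            if not DEPENDENT_TIER.match(line):
                problems.append((n, "malformed dependent tier"))
            elif not seen_main_tier:
                problems.append((n, "dependent tier before any main tier"))
        else:
            problems.append((n, "line does not start with @, *, %, or a tab"))
    return problems


# Invented sample transcript, used only to exercise the sketch.
SAMPLE = (
    "@UTF8\n"
    "@Begin\n"
    "@Participants:\tCHI Target_Child, MOT Mother\n"
    "*MOT:\twhat is that ?\n"
    "*CHI:\tdoggie .\n"
    "%com:\tpoints at picture\n"
    "@End\n"
)

if __name__ == "__main__":
    for line_number, message in quick_chat_sanity_check(SAMPLE):
        print(line_number, message)
```

A sketch like this can only catch gross layout slips, such as a space typed in place of the tab after a speaker code. The CHECK program remains the authority on whether a file conforms to CHAT.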
If you are primarily interested in analyzing data already stored in TalkBank, you do not need to learn the CHAT transcription format in much detail and you will only need to use the editor to open and read files. In that case, you may wish to focus your efforts on learning to use the CLAN programs. If you plan to transcribe new data, then you also need to work with the current manual to learn to use CHAT.
Teachers will also want to pay particular attention to the sections of the CLAN manual that present a tutorial introduction. Using some of the examples given there, you can construct additional materials to encourage students to explore the database to test out particular hypotheses.
The TalkBank system was not intended to address all issues in the study of language learning, or to be used by all students of spontaneous interactions. The CHAT system is comprehensive, but it is not ideal for all purposes. The programs are powerful, but they cannot solve all analytic problems. It is not the goal of TalkBank to provide facilities for all research endeavors or to force all research into some uniform mold. On the contrary, the programs are designed to offer support for alternative analytic frameworks. For example, the editor now supports the various codes of Conversation Analysis (CA) format, as alternatives and supplements to CHAT format. Moreover, we have developed programs that convert between CHAT format and other common formats, because we know that users often need to run analyses in these other formats.
The TalkBank tools have been extensively tested for ease of application, accuracy, and reliability. However, change is fundamental to any research enterprise. Researchers are constantly pursuing better ways of coding and analyzing data. It is important that the tools keep pace with these changing requirements. For this reason, there will be revisions to CHAT, the programs, and the database as long as the TalkBank Project is active.
The CHAT system provides a standardized format for producing computerized transcripts of face-to-face conversational interactions. These interactions may involve children and parents, doctors and patients, or teachers and second-language learners. Despite the differences between these interactions, there are enough common features to allow for the creation of a single general transcription system. The system described here is designed for use with both normal and disordered populations. It can be used with learners of all types, including children, second-language learners, and adults recovering from aphasic disorders. The system provides options for basic discourse transcription as well as detailed phonological and morphological analysis. The system bears the acronym “CHAT,” which stands for Codes for the Human Analysis of Transcripts. CHAT is the standard transcription system for the TalkBank and CHILDES (Child Language Data Exchange System) Projects. All of the transcripts in the TalkBank databases are in CHAT format.
What makes CHAT particularly powerful is the fact that files transcribed in CHAT can also be analyzed by the CLAN programs that are described in the CLAN manual, which is an electronic companion piece to this manual. The CLAN programs can track a wide variety of structures, compute automatic indices, and analyze morphosyntax. Moreover, because all CHAT files can now also be translated to a highly structured form of XML (a language used for text documents on the web), they are now also compatible with a wide range of other powerful computer programs such as ELAN, Praat, EXMARaLDA, Phon, Transcriber, and so on.
The TalkBank system has had a major impact on the study of child language. At the time of the last monitoring in 2016, there were over 7000 published articles that had made use of the programs and database. In 2016, the size of the database had grown to over 110 million words, making it by far the largest database of conversational interactions available anywhere. The total number of researchers who have joined as members across the length of the project is now over 5000. Of course, not all of these people are making active use of the tools at all times. However, it is safe to say that, at any given point in time, well over 100 groups of researchers around the world are involved in new data collection and transcription using the CHAT system. Eventually the data collected in these various projects will all be contributed to the database.
Public inspection of experimental data is a crucial prerequisite for serious scientific progress. Imagine how genetics would function if every experimenter had his or her own individual strain of peas or drosophila and refused to allow them to be tested by other experimenters. What would happen in geology, if every scientist kept his or her own set of rock specimens and refused to compare them with those of other researchers? In some fields the basic phenomena in question are so clearly open to public inspection that this is not a problem. The basic facts of planetary motion are open for all to see, as are the basic facts underlying Newtonian mechanics.
Unfortunately, in language studies, a free and open sharing and exchange of data has not always been the norm. In earlier decades, researchers jealously guarded their field notes from a particular language community or subject type, refusing to share them openly with the broader community. Various justifications were given for this practice. It was sometimes claimed that other researchers would not fully appreciate the nature of the data or that they might misrepresent crucial patterns. Sometimes, it was claimed that only someone who had actually participated in the community or the interaction could understand the nature of the language and the interactions. In some cases, these limitations were real and important. However, all such restrictions on the sharing of data inevitably impede the progress of the scientific study of language learning.
Within the field of language acquisition studies it is now understood that the advantages of sharing data outweigh the potential dangers. The question is no longer whether data should be shared, but rather how they can be shared in a reliable and responsible fashion. The computerization of transcripts opens up the possibility for many types of data sharing and analysis that otherwise would have been impossible. However, the full exploitation of this opportunity requires the development of a standardized system for data transcription and analysis.
Before examining the CHAT system, we need to consider some dangers involved in computerized transcriptions. These dangers arise from the need to compress a complex set of verbal and nonverbal messages into the extremely narrow channel required for the computer. In most cases, these dangers also exist when one creates a typewritten or handwritten transcript. Let us look at some of the dangers surrounding the enterprise of transcription.
Perhaps the greatest danger facing the transcriber is the tendency to treat spoken language as if it were written language. The decision to write out stretches of vocal material using the forms of written language can trigger a variety of theoretical commitments. As Ochs (1979) showed so clearly, these decisions will inevitably turn transcription into a theoretical enterprise. The most difficult bias to overcome is the tendency to map every form spoken by a learner – be it a child, an aphasic, or a second-language learner – onto a set of standard lexical items in the adult language. Transcribers tend to assimilate nonstandard learner strings to standard forms of the adult language. For example, when a child says “put on my jamas,” the transcriber may instead enter “put on my pajamas,” reasoning unconsciously that “jamas” is simply a childish form of “pajamas.” This type of regularization of the child form to the adult lexical norm can lead to misunderstanding of the shape of the child's lexicon. For example, it could be the case that the child uses “jamas” and “pajamas” to refer to two very different things (Clark, 1987; MacWhinney, 1989).
There are two types of errors possible here. One involves mapping a learner's spoken form onto an adult form when, in fact, there was no real correspondence. This is the problem of overnormalization. The second type of error involves failing to map a learner's spoken form onto an adult form when, in fact, there is a correspondence. This is the problem of undernormalization. The goal of transcribers should be to avoid both the Scylla of overnormalization and the Charybdis of undernormalization. Steering a course between these two dangers is no easy matter. A transcription system can provide devices to aid in this process, but it cannot guarantee safe passage.
Transcribers also often tend to assimilate the shape of sounds spoken by the learner to the shapes that are dictated by morphosyntactic patterns. For example, Fletcher (1985) noted that both children and adults generally produce “have” as “uv” before main verbs. As a result, forms like “might have gone” assimilate to “mightuv gone.” Fletcher believed that younger children have not yet learned to associate the full auxiliary “have” with the contracted form. If we write the children's forms as “might have,” we then end up mischaracterizing the structure of their lexicon. To take another example, we can note that, in French, the various endings of the verb in the present tense are distinguished in spelling, whereas they are homophonous in speech. If a child says /mãʒ/ “eat,” are we to transcribe it as first person singular mange, as second person singular manges, or as the imperative mange? If the child says /mãʒe/, should we transcribe it as the infinitive manger, the participle mangé, or the second person formal mangez?
CHAT deals with these problems in three ways. First, it uses IPA as a uniform way of transcribing discourse phonetically. Second, the editor allows the user to link the digitized audio record of the interaction directly to the transcript. This is the system called “sonic CHAT.” With these sonic CHAT links, it is possible to double-click on a sentence and hear its sound immediately. Having the actual sound produced by the child directly available in the transcript takes some of the burden off of the transcription system. However, whenever computerized analyses are based not on the original audio signal but on transcribed orthographic forms, one must continue to understand the limits of transcription conventions. Third, for those who wish to avoid the work involved in IPA transcription or sonic CHAT, there is a system for using nonstandard lexical forms, so that the form “might (h)ave” would be universally recognized as the spelling of “mightof”, the contracted form of “might have.” More extreme cases of phonological variation can be annotated as in this example: popo [: hippopotamus].
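The replacement notation in the last example can be processed mechanically. The following Python fragment is a minimal sketch, not part of CLAN: the function name resolve_replacements and the regular expression are assumptions made for illustration, and the pattern only handles the simple case of one produced form followed by a bracketed target.

```python
import re

# Matches "spoken [: target]": a produced form followed by its intended word(s).
# This pattern is an illustrative assumption, not the CLAN implementation.
REPLACEMENT = re.compile(r"(\S+) \[: ([^\]]+)\]")


def resolve_replacements(main_line_text, use_target=True):
    """Keep either the target form or the form the speaker actually produced."""
    group = 2 if use_target else 1
    return REPLACEMENT.sub(lambda match: match.group(group), main_line_text)


utterance = "the popo [: hippopotamus] is big ."
print(resolve_replacements(utterance))                    # text with target forms
print(resolve_replacements(utterance, use_target=False))  # text with produced forms
```

Keeping both forms in the transcript means that an analysis of the child's lexicon can count what was actually said, while an analysis of vocabulary can count the intended word.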
Transcribers have a tendency to write out spoken language with the punctuation conventions of written language. Written language is organized into clauses and sentences delimited by commas, periods, and other marks of punctuation. Spoken language, on the other hand, is organized into tone units clustered about a tonal nucleus and delineated by pauses and tonal contours (Crystal, 1969, 1979; Halliday, 1966, 1967, 1968). Work on the discourse basis of sentence production (Chafe, 1980; Jefferson, 1984) has demonstrated a close link between tone units and ideational units. Retracings, pauses, stress, and all forms of intonational contours are crucial markers of aspects of the utterance planning process. Moreover, these features also convey important sociolinguistic information. Without special markings or conventions, there is no way to directly indicate these important aspects of interactions.
Whatever form a transcript may take, it will never contain a fully accurate record of what went on in an interaction. A transcript of an interaction can never fully replace an audiotape, because an audio recording of the interaction will always be more accurate in terms of preserving the actual details of what transpired. By the same token, an audio recording can never preserve as much detail as a video recording with a high-quality audio track. Audio recordings record none of the nonverbal interactions that often form the backbone of a conversational interaction. Hence, they systematically exclude a source of information that is crucial for a full interpretation of the interaction. Although there are biases involved even in a video recording, it is still the most accurate record of an interaction that we have available. For those who are trying to use transcription to capture the full detailed character of an interaction, it is imperative that transcription be done from a video recording which should be repeatedly consulted during all phases of analysis.
When the CLAN editor is used to link transcripts to audio recordings, we refer to this as sonic CHAT. When the system is used to link transcripts to video recordings, we refer to this as video CHAT. The CLAN manual explains how to link digital audio and video to transcripts.
Transcription and coding systems often force the user to make difficult distinctions. For example, a system might make a distinction between grammatical ellipsis and ungrammatical omission. However, it may often be the case that the user cannot decide whether an omission is grammatical or not. In that case, it may be helpful to have some way of blurring the distinction. CHAT has certain symbols that can be used when a categorization cannot be made. It is important to remember that many of the CHAT symbols are entirely optional. Whenever you feel that you are being forced to make a distinction, check the manual to see whether the particular coding choice is actually required. If it is not required, then simply omit the code altogether.
It is important to recognize the difference between transcription and coding. Transcription focuses on the production of a written record that can lead us to understand, albeit only vaguely, the flow of the original interaction. Transcription must be done directly off an audiotape or, preferably, a videotape. Coding, on the other hand, is the process of recognizing, analyzing, and taking note of phenomena in transcribed speech. Coding can often be done by referring only to a written transcript. For example, the coding of parts of speech can be done directly from a transcript without listening to the audiotape. For other types of coding, such as speech act coding, it is imperative that coding be done while watching the original videotape.
The CHAT system includes conventions for both transcription and coding. When first learning the system, it is best to focus on learning how to transcribe. The CHAT system offers the transcriber a large array of coding options. Although few transcribers will need to use all of the options, everyone needs to understand how basic transcription is done on the “main line.” Additional coding is done principally on the secondary or “dependent” tiers. As transcribers work more with their data, they will include further options from the secondary or “dependent” tiers. However, the beginning user should focus first on learning to correctly use the conventions for the main line. The manual includes several sample transcripts to help the beginner in learning the transcription system.
Like other forms of communication, transcription systems are subjected to a variety of communicative pressures. The view of language structure developed by Slobin (1977) sees structure as emerging from the pressure of three conflicting charges or goals. On the one hand, language is designed to be clear. On the other hand, it is designed to be processible by the listener and quick and easy for the speaker. Unfortunately, ease of production often comes in conflict with clarity of marking. The competition between these three motives leads to a variety of imperfect solutions that satisfy each goal only partially. Such imperfect and unstable solutions characterize the grammar and phonology of human language (Bates & MacWhinney, 1982). Only rarely does a solution succeed in fully achieving all three goals.
Slobin's view of the pressures shaping human language can be extended to analyze the pressures shaping a transcription system. In many regards, a transcription system is much like any human language. It needs to be clear in its markings of categories, and still preserve readability and ease of transcription. However, transcripts address rather different audiences. One audience is the human audience of transcribers, analysts, and readers. The other audience is the digital computer and its programs. To deal with these two audiences, a system for computerized transcription needs to achieve the following goals:
Clarity: Every symbol used in the coding system should have some clear and definable real-world referent. Symbols that mark particular words should always be spelled in a consistent manner. Symbols that mark particular conversational patterns should refer to consistently observable patterns. Codes must steer between the Scylla of overnormalization and the Charybdis of undernormalization discussed earlier. Distinctions must avoid being either too fine or too coarse. Another way of looking at clarity is through the notion of systematicity. Codes, words, and symbols must be used in a consistent manner across transcripts. Ideally, each code should always have a unique meaning independent of the presence of other codes or the particular transcript in which it is located. If interactions are necessary, as in hierarchical coding systems, these interactions need to be systematically described.
Readability: Just as human language needs to be easy to process, so transcripts need to be easy to read. This goal often runs directly counter to the first goal. In the TalkBank system, we have attempted to provide a variety of CHAT options that will allow a user to maximize the readability of a transcript. We have also provided CLAN tools that will allow a reader to suppress the less readable aspects of a transcript when the goal of readability is more important than the goal of clarity of marking.
Ease of data entry: As distinctions proliferate within a transcription system, data entry becomes increasingly difficult and error-prone. There are two ways of dealing with this problem. One method attempts to simplify the coding scheme and its categories. The problem with this approach is that it sacrifices clarity. The second method attempts to help the transcriber by providing computational aids. The CLAN programs follow this path. They provide systems for the automatic checking of transcription accuracy, methods for the automatic analysis of morphology and syntax, and tools for the semiautomatic entry of codes. However, the basic process of transcription has not been automated and remains the major task during data entry.
CHAT provides both basic and advanced formats for transcription and coding. The basic level of CHAT is called minCHAT. New users should start by learning minCHAT. This system looks much like other intuitive transcription systems that are in general use in the fields of child language and discourse analysis. However, eventually users will find that there is something they want to be able to code that goes beyond minCHAT. At that point, they should move on to learning the additional features of CHAT that are relevant for the type of work they are doing.
There are several minimum standards for the form of a minCHAT file. These standards must be followed for the CLAN commands to run successfully on CHAT files:
1. Every line must end with a carriage return.
2. The first line in the file must be an @Begin header line.
3. The second line in the file must be an @Languages header line. The languages entered here use a three-letter ISO 639-3 code, such as “eng” for English.
4. The third line must be an @Participants header line listing three-letter codes for each participant, the participant's name, and the participant's role.
5. After the @Participants header comes a set of @ID headers providing further details for each speaker. These will be inserted automatically for you when you run CHECK using escape-L.
6. The last line in the file must be an @End header line.
7. Lines beginning with * indicate what was actually said. These are called “main lines.” Each main line should code one and only one utterance. When a speaker produces several utterances in a row, code each with a new main line.
8. After the asterisk on the main line comes a three-letter code in upper case letters for the participant who was the speaker of the utterance being coded. After the three-letter code comes a colon and then a tab.
9. What was actually said is entered starting in the ninth column.
10. Lines beginning with the % symbol can contain codes and commentary regarding what was said. They are called “dependent tier” lines. The % symbol is followed by a three-letter code in lowercase letters for the dependent tier type, such as “pho” for phonology; a colon; and then a tab. The text of the dependent tier begins after the tab.
11. Continuations of main lines and dependent tier lines begin with a tab which is inserted automatically by the CLAN editor.
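The form requirements listed above are regular enough to be tested mechanically. The following Python fragment is a rough sketch of such a test, not a substitute for the CHECK program: the function name check_minchat_form is hypothetical, line breaks are assumed to be single newline characters, and only the rules stated in the list are examined (the @ID headers are not checked).

```python
import re

# Main lines: asterisk, three uppercase letters, colon, tab.
MAIN_LINE = re.compile(r"^\*[A-Z]{3}:\t")
# Dependent tiers: percent sign, three lowercase letters, colon, tab.
DEPENDENT_LINE = re.compile(r"^%[a-z]{3}:\t")
# Headers expected on the first three lines, in this order.
OPENING_HEADERS = ["@Begin", "@Languages:", "@Participants:"]


def check_minchat_form(text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    lines = text.split("\n")
    if lines[-1] == "":
        lines.pop()  # a final line break leaves one empty trailing element
    else:
        problems.append("the last line does not end with a line break")
    if not lines:
        return ["the file is empty"]
    for index, prefix in enumerate(OPENING_HEADERS):
        if index >= len(lines) or not lines[index].startswith(prefix):
            problems.append(f"line {index + 1} should start with {prefix}")
    if lines[-1] != "@End":
        problems.append("the last line should be @End")
    for number, line in enumerate(lines, start=1):
        if line.startswith("*"):
            if not MAIN_LINE.match(line):
                problems.append(f"line {number}: malformed main line")
        elif line.startswith("%"):
            if not DEPENDENT_LINE.match(line):
                problems.append(f"line {number}: malformed dependent tier")
        elif not line.startswith(("@", "\t")):
            problems.append(f"line {number}: must begin with @, *, %, or a tab")
    return problems
```

A function like this can catch gross formatting slips early, but CHECK remains the authority on whether a file is acceptable to the CLAN commands.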
In addition to these minimum requirements for the form of the file, there are certain minimum ways in which utterances and words should be written on the main line:
1. Utterances must end with an utterance terminator. The basic utterance terminators are the period, the exclamation mark, and the question mark. These can be preceded by a space, but the space is not required.
2. Commas can be used as needed to mark phrasal junctions, but they are not used by the programs and have no sharp prosodic definition.
3. Use upper case letters only for proper nouns and the word “I.” Do not use uppercase letters for the first words of sentences. This will facilitate the identification of proper nouns.
4. To facilitate recognition of proper nouns and avoid misspellings, words should not contain capital letters except at their beginning. Words should not contain numbers, unless these mark tones.
5. Unintelligible words with an unclear phonetic shape should be transcribed as xxx.
6. If you wish to note the phonological form of an incomplete or unintelligible phonological string, write it out with an ampersand, as in &guga.
7. Incomplete words can be written with the omitted material in parentheses, as in (be)cause and (a)bout.
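Several of these main line conventions can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, words are assumed to be separated by spaces, and any digit is reported even though digits that mark tones are permitted.

```python
import re

TERMINATORS = (".", "!", "?")


def ends_with_terminator(utterance):
    """True if the utterance ends with a period, exclamation mark, or question mark."""
    return utterance.rstrip().endswith(TERMINATORS)


def spoken_form(word):
    """Drop the parenthesized material: (be)cause becomes cause."""
    return re.sub(r"\([^)]*\)", "", word)


def full_form(word):
    """Restore the parenthesized material: (be)cause becomes because."""
    return word.replace("(", "").replace(")", "")


def word_problems(word):
    """Report departures from the capitalization and digit conventions."""
    if word == "xxx" or word.startswith("&"):
        return []  # unintelligible material and phonological fragments are exempt
    problems = []
    if any(character.isupper() for character in word[1:]):
        problems.append("capital letter after the first character")
    if any(character.isdigit() for character in word):
        problems.append("digit inside the word (allowed only for tones)")
    return problems


print(spoken_form("(be)cause"), full_form("(be)cause"))
print(word_problems("McDonald"))
```

The two functions for parenthesized words show why this notation is useful: one transcription yields both what the speaker produced and the standard word it corresponds to.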
Here is a sample that illustrates these principles. This file is syntactically correct and uses the minimum number of CHAT conventions while still maintaining compatibility with the CLAN commands.
@Begin
@Languages:	eng
@Participants:	CHI Ross Child, FAT Brian Father
*CHI:	why isn't Mommy coming?
%com:	Mother usually picks Ross up around 4 PM.
*FAT:	don't worry.
*FAT:	she'll be here soon.
@End
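To see how the line types fit together, it can help to read a file of this kind into a simple data structure. The following Python sketch is an illustration only: parse_chat is a hypothetical name, the transcript string is a hand-typed variant of the sample with one tier split across a continuation line, and real CHAT files contain many constructs that this fragment ignores.

```python
SAMPLE = (
    "@Begin\n"
    "@Languages:\teng\n"
    "@Participants:\tCHI Ross Child, FAT Brian Father\n"
    "*CHI:\twhy isn't Mommy coming?\n"
    "%com:\tMother usually picks Ross up\n"
    "\taround 4 PM.\n"
    "*FAT:\tdon't worry.\n"
    "*FAT:\tshe'll be here soon.\n"
    "@End\n"
)


def parse_chat(text):
    """Split a transcript into headers and utterances with their dependent tiers."""
    headers = []
    utterances = []
    last = None  # (container, key) of the most recent field, for continuation lines
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("\t"):
            if last is not None:
                container, key = last
                container[key] += " " + line.strip()
            continue
        label, _, content = line.partition(":")
        content = content.strip()
        if line.startswith("@"):
            header = {"name": label[1:], "value": content}
            headers.append(header)
            last = (header, "value")
        elif line.startswith("*"):
            utterance = {"speaker": label[1:], "text": content, "tiers": {}}
            utterances.append(utterance)
            last = (utterance, "text")
        elif line.startswith("%") and utterances:
            tiers = utterances[-1]["tiers"]
            tiers[label[1:]] = content
            last = (tiers, label[1:])
    return headers, utterances


headers, utterances = parse_chat(SAMPLE)
for utterance in utterances:
    print(utterance["speaker"], utterance["text"], utterance["tiers"])
```

Each dependent tier is attached to the main line that precedes it, which mirrors the way the tiers are read in the transcript itself.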
For researchers who are just now beginning to use CHAT and CLAN, there is one single suggestion that can potentially save literally hundreds of hours of wasted time. The suggestion is to transcribe and analyze one single small file completely and perfectly before launching a major effort in transcription and analysis. The idea is that you should learn just enough about minCHAT and minCLAN to see your path through these four crucial steps:
1. entry of a small set of your data into a CHAT file,
2. successful running of the CHECK command inside the editor to guarantee accuracy in your CHAT file,
3. development of a series of codes that will interface with the particular CLAN commands most appropriate for your analysis, and
4. running of the relevant CLAN commands, so that you can be sure that the results you will get will properly test the hypotheses you wish to develop.
If you go through these steps first, you can guarantee in advance the successful outcome of your project. You can avoid ending up in a situation in which you have transcribed hundreds of hours of data in a way that does not match correctly with the input requirements for CLAN.
After having learned minCHAT, you are ready to learn the basics of CLAN. To do this, you will want to work through the first chapters of the CLAN manual focusing in particular on the CLAN tutorial. These chapters will take you up to the level of minCLAN, which corresponds to the minCHAT level.
Once you have learned minCHAT and minCLAN, you are ready to move on to learning the rest of the system. You should next work through the chapters on words, utterances, and scoped symbols. Depending on the shape of your particular project, you may then need to study additional chapters in this manual. For people working on large projects that last many months, it is a good idea to eventually read all of the current manual, although some sections that seem less relevant to the project can be skimmed.
Each CLAN command runs a very superficial check to see if a file conforms to minCHAT. This check looks only to see that each line begins with either @, *, %, a tab or a space. This is the minimum that the CLAN commands must have to function. However, the correct functioning of many of the functions of CLAN depends on adherence to further standards for minCHAT. In order to make sure that a file matches these minimum requirements for correct analysis through CLAN, researchers should run each file through the CHECK program. The CHECK command can be run directly inside the editor, so that you can verify the accuracy of your transcription as you are producing it. CHECK will detect errors such as failure to start lines with the correct symbols, use of incorrect speaker codes, or missing @Begin and @End symbols. CHECK can also be used to find errors in CHAT coding beyond those discussed in this chapter. Using CHECK is like brushing your teeth. It may be hard at first to remember to use the command, but the more you use it the easier it becomes and the better the final results.
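The superficial check described at the start of this section can be imitated in a few lines, which also makes clear how little it guarantees. The Python fragment below is a sketch based only on that description, not on the CLAN source code; the function name is hypothetical, and treating an empty line as a failure is an assumption.

```python
ALLOWED_LINE_STARTS = ("@", "*", "%", "\t", " ")


def lines_failing_superficial_check(text):
    """Return the 1-based numbers of lines that start with an unexpected character."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.startswith(ALLOWED_LINE_STARTS)
    ]


# A stray line of plain text is caught ...
print(lines_failing_superficial_check("@Begin\n*CHI:\thello .\nhello\n@End\n"))
# ... but an undeclared speaker code and missing headers are not.
print(lines_failing_superficial_check("*ROS:\thello .\n"))
```

The second call reports nothing even though the file has no @Begin or @End line, which is exactly the kind of error that only CHECK will find.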
Each TalkBank database consists of a collection of corpora, organized into larger folders by languages and language groups. For example, there is a top-level folder called Romance in which one finds subfolders for Spanish, French, and other Romance languages. Within the Spanish folder, there are then dozens of further folders, each of which has a single corpus. Within a corpus, files may be further grouped by individual children or groups of children. For longitudinal corpora, we recommend that file names use the age of the child followed by a letter if there are several recordings from a given day. For example, the transcript from the fourth taping session on the day when the child was 2;3.22 would be called 20322d.cha. It is better to use ages for file names, rather than dates or other material.
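The file naming recommendation can be expressed as a small function. This Python sketch generalizes from the single example given above, so its details are assumptions: the name age_to_filename is hypothetical, months and days are padded to two digits while years are left unpadded, and the session number is mapped to the letters a through z.

```python
import re
import string

# Years, months, and days; either "." or ";" is accepted before the days.
AGE = re.compile(r"^(\d+);(\d{1,2})[.;](\d{1,2})$")


def age_to_filename(age, session=None):
    """Build a transcript file name from a child's age and an optional session number."""
    match = AGE.match(age)
    if match is None:
        raise ValueError(f"unrecognized age: {age!r}")
    years, months, days = (int(part) for part in match.groups())
    suffix = ""
    if session is not None:
        if not 1 <= session <= 26:
            raise ValueError("session must be between 1 and 26")
        suffix = string.ascii_lowercase[session - 1]
    return f"{years}{months:02d}{days:02d}{suffix}.cha"


print(age_to_filename("2;3.22", session=4))
```

Because the months and days are zero-padded, file names built this way sort in chronological order within a given year of age.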
Increasingly, researchers rely on Internet systems to locate and retrieve language data and resources. There are currently several systems designed to facilitate this process and we have adapted the indexing and registration of materials in the CHILDES and TalkBank systems to provide information that can be incorporated into these systems. The two systems designed specifically to deal with linguistic data are OLAC (Online Language Archives Community at www.language-archives.org) and VLO (Virtual Language Observatory at vlo.clarin.eu). These systems allow researchers to search for whole corpora or single files, using terms such as Cantonese, video, gesture, or aphasia. In order to publish or register TalkBank data within these systems, we create a 0metadata.cdc file at the top level of each corpus in TalkBank. Some of the fields in this metadata file are designed for indexing in OLAC and some are designed for the CMDI system used by VLO and the related facility called The Language Archive (tla.mpi.nl). Because of the highly specific nature of the terms and the software used for regular harvesting and publication of these data, we do not require users to create the 0metadata.cdc files. The following table explains what keywords are expected within each field of these files. The first fields listed are for OLAC and the later ones are for CMDI. For CMDI, the values unknown and unspecified are also available for most of the fields.
Set by Handle Server system
Bilingual AarsenBos Corpus