Tools for Analyzing Talk

 

Part 1:  The CHAT Transcription Format

 

 

Brian MacWhinney

Carnegie Mellon University

 

November 5, 2017

 

 

 

 

 

 

When citing the use of TalkBank and CHILDES facilities, please use this reference to the last printed version of the CHILDES manual:

 

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

 

This allows us to systematically track usage of the programs and data through scholar.google.com.


 

 

1        Introduction

2        The CHILDES Project

2.1    Impressionistic Observation

2.2    Baby Biographies

2.3    Transcripts

2.4    Computers

2.5    Connectivity

3        From CHILDES to TalkBank

3.1    Three Tools

3.2    Shaping CHAT

3.3    Building CLAN

3.4    Constructing the Database

3.5    Dissemination

3.6    Funding

3.7    How to Use These Manuals

3.8    Changes

4        Principles

4.1    Computerization

4.2    Words of Caution

4.2.1       The Dominance of the Written Word

4.2.2       The Misuse of Standard Punctuation

4.2.3       Working With Video

4.3    Problems With Forced Decisions

4.4    Transcription and Coding

4.5    Three Goals

5        minCHAT

5.1    minCHAT – the Form of Files

5.2    minCHAT – Words and Utterances

5.3    Analyzing One Small File

5.4    Next Steps

5.5    Checking Syntactic Accuracy

6        Corpus Organization

6.1    File Naming

6.2    Metadata

6.3    The Documentation File

7        File Headers

7.1    Hidden Headers

7.2    Initial Headers

7.3    Participant-Specific Headers

7.4    Constant Headers

7.5    Changeable Headers

8        Words

8.1    The Main Line

8.2    Basic Words

8.3    Special Form Markers

8.4    Unidentifiable Material

8.5    Incomplete and Omitted Words

8.6    Standardized Spellings

8.6.1       Letters

8.6.2       Compounds and Linkages

8.6.3       Capitalization

8.6.4       Acronyms

8.6.5       Numbers and Titles

8.6.6       Kinship Forms

8.6.7       Shortenings

8.6.8       Assimilations

8.6.9       Communicators and Interjections

8.6.10     Spelling Variants

8.6.11     Colloquial Forms

8.6.12     Dialectal Variations

8.6.13     Baby Talk

8.6.14     Word separation in Japanese

8.6.15     Abbreviations in Dutch

9        Utterances

9.1    One Utterance or Many?

9.2    Satellite Markers

9.3    Discourse Repetition

9.4    C-Units, sentences, utterances, and run-ons

9.5    Retracing

9.6    Basic Utterance Terminators

9.7    Separators

9.8    Tone Direction

9.9    Prosody Within Words

9.10      Local Events

9.10.1     Simple Events

9.10.2     Complex Local Events

9.10.3     Pauses

9.10.4     Long Events

9.11      Special Utterance Terminators

9.12      Utterance Linkers

10      Scoped Symbols

10.1      Audio and Video Time Marks

10.2      Paralinguistic and Duration Scoping

10.3      Explanations and Alternatives

10.4      Retracing, Overlap, and Clauses

10.5      Error Marking

10.6      Initial and Final Codes

11      Dependent Tiers

11.1      Standard Dependent Tiers

11.2      Synchrony Relations

12      CHAT-CA Transcription

13      Disfluency Transcription

14      Transcribing Aphasic Language

15      Arabic and Hebrew Transcription

16      Specific Applications

16.1      Code-Switching

16.2      Elicited Narratives and Picture Descriptions

16.3      Written Language

16.4      Nested Files for Gesture Analysis

16.5      Sign Language Transcription

16.6      Sign and Speech

17      Speech Act Codes

17.1      Interchange Types

17.2      Illocutionary Force Codes

18      Error Coding

18.1      Word level error codes summary

18.2      Word level coding – detailed explanations

18.2.1     General Considerations

18.2.2     Phonological errors

18.2.3     Semantic errors

18.2.4     Neologism errors -- transcribe using IPA and attach @u to the error

18.2.5     Dysfluency errors

18.2.6     Morphology errors

18.2.7     Formal lexical device errors

18.2.8     Missing word errors

18.3      Utterance level error coding (post-codes)

References

1       Introduction

This electronic edition of the CHAT manual is being continually revised to keep pace with the growing interests of the language research communities served by TalkBank and CHILDES. The first three editions were published in 1990, 1995, and 2000 by Lawrence Erlbaum Associates. After 2000, we switched to the current electronic publication format. However, we ask that users of this system cite the version of the manual published in 2000 when using data and programs in their published work.

In its current version, this manual still tends to focus on the use of the programs for child language data in the context of the CHILDES system (childes.talkbank.org). However, beginning in 2001 with support from NSF, we introduced the concept of TalkBank to include a wide variety of language databases. These now include:

1.     CHILDES (childes.talkbank.org) for child language acquisition,

2.     AphasiaBank (aphasia.talkbank.org) for aphasia,

3.     PhonBank (phonbank.talkbank.org) for the study of phonological development,

4.     TBIBank (talkbank.org/TBIBank) for language in traumatic brain injury,

5.     RHDBank (talkbank.org/RHDBank) for language in right hemisphere damage,

6.     DementiaBank (talkbank.org/DementiaBank) for language in dementia,

7.     FluencyBank (fluency.talkbank.org) for the study of childhood fluency development,

8.     HomeBank (homebank.talkbank.org) for daylong recordings in the home,

9.     CABank for Conversation Analysis, including the large SCOTUS corpus,

10.  SLABank (sla.talkbank.org) for second language acquisition,

11.  ClassBank for studies of language in the classroom,

12.  BilingBank for the study of bilingualism and code-switching,

13.  LangBank for the study and learning of classical languages, and

14.  SamtaleBank for Danish conversations. 

We are continually adding corpora to each of these collections. The current size of the text database is 800MB and there is an additional 3TB of media. All of the data in TalkBank are freely open to downloading and analysis, with the exception of the data in the clinical language banks, which are open to clinical researchers using passwords. The CLAN program and the related morphosyntactic taggers are all free and open-sourced through GitHub.

Fortunately, all of these different language banks make use of the same transcription format (CHAT) and the same set of programs (CLAN). This means that, although most of the examples in this manual rely on data from the CHILDES database, the principles extend easily to data in all of the TalkBank repositories. TalkBank (http://talkbank.org) is the largest open repository of data on spoken language. All of the data in TalkBank are transcribed in the CHAT format, which is compatible with the CLAN programs.

Using conversion programs available inside CLAN (see the CLAN manual for details), transcripts in CHAT format can be automatically converted into the formats required for Praat (praat.org), Phon (phonbank.talkbank.org), ELAN (tla.mpi.nl/tools/elan), CoNLL, ANVIL (anvil-software.org), EXMARaLDA (exmaralda.org), LIPP (ihsys.com), SALT (saltsoftware.com), LENA (lenafoundation.org), Transcriber (trans.sourceforge.net), and ANNIS (corpus-tools.org/ANNIS).

TalkBank databases and programs have been used widely in the research literature.  CHILDES, which is the oldest and most widely recognized of these databases, has been used in over 7000 published articles.  PhonBank has been used in 480 articles and AphasiaBank has been used in 212 presentations and publications.  In general, the longer a database has been available to researchers, the more the use of that database has become integrated into the basic research methodology and publication history of the field.

Metadata for the transcripts and media in these various TalkBank databases have been entered into the two major systems for accessing linguistic data: OLAC, and VLO (Virtual Language Observatory).  Each transcript and media file has been assigned a PID (permanent ID) using the Handle System (www.handle.net), and each corpus has received an ISBN and DOI (digital object identifier) number.

For ten of the languages in the database, we provide automatic morphosyntactic analysis using a series of programs built into CLAN.  These languages are Cantonese, Chinese, Dutch, English, French, German, Hebrew, Japanese, Italian, and Spanish.  The codes produced by these programs could eventually be harmonized with the GOLD ontology. In addition, we can compute a dependency grammar analysis for each of these 10 languages. As a result of these efforts, TalkBank has been recognized as a Center in the CLARIN network (clarin.eu) and has received the Data Seal of Approval (datasealofapproval.org).  TalkBank data have also been included in the SketchEngine corpus tool (sketchengine.co.uk).

2       The CHILDES Project

Language acquisition research thrives on data collected from spontaneous interactions in naturally occurring situations. You can turn on a tape recorder or videotape, and, before you know it, you will have accumulated a library of dozens or even hundreds of hours of naturalistic interactions. But simply collecting data is only the beginning of a much larger task, because the process of transcribing and analyzing naturalistic samples is extremely time-consuming and often unreliable. In this first volume, we will present a set of computational tools designed to increase the reliability of transcriptions, automate the process of data analysis, and facilitate the sharing of transcript data. These new computational tools have brought about revolutionary changes in the way that research is conducted in the child language field. In addition, they have equally revolutionary potential for the study of second-language learning, adult conversational interactions, sociological content analyses, and language recovery in aphasia. Although the tools are of wide applicability, this volume concentrates on their use in the child language field, in the hope that researchers from other areas can make the necessary analogies to their own topics.

Before turning to a detailed examination of the current system, it may be helpful to take a brief historical tour over some of the major highlights of earlier approaches to the collection of data on language acquisition. These earlier approaches can be grouped into five major historical periods.

2.1      Impressionistic Observation

The first attempt to understand the process of language development appears in a remarkable passage from The Confessions of St. Augustine (1952). In this passage, Augustine claims that he remembered how he had learned language:

This I remember; and have since observed how I learned to speak. It was not that my elders taught me words (as, soon after, other learning) in any set method; but I, longing by cries and broken accents and various motions of my limbs to express my thoughts, that so I might have my will, and yet unable to express all I willed or to whom I willed, did myself, by the understanding which Thou, my God, gavest me, practise the sounds in my memory. When they named anything, and as they spoke turned towards it, I saw and remembered that they called what they would point out by the name they uttered. And that they meant this thing, and no other, was plain from the motion of their body, the natural language, as it were, of all nations, expressed by the countenance, glances of the eye, gestures of the limbs, and tones of the voice, indicating the affections of the mind as it pursues, possesses, rejects, or shuns. And thus by constantly hearing words, as they occurred in various sentences, I collected gradually for what they stood; and, having broken in my mouth to these signs, I thereby gave utterance to my will. Thus I exchanged with those about me these current signs of our wills, and so launched deeper into the stormy intercourse of human life, yet depending on parental authority and the beck of elders.

Augustine's outline of early word learning drew attention to the role of gaze, pointing, intonation, and mutual understanding as fundamental cues to language learning. Modern research in word learning (Bloom, 2000) has supported every point of Augustine's analysis, as well as his emphasis on the role of children's intentions. In this sense, Augustine's somewhat fanciful recollection of his own language acquisition remained the high-water mark for child language studies through the Middle Ages and even the Enlightenment. Unfortunately, the method on which these insights were grounded depends on our ability to actually recall the events of early childhood – a gift granted to very few of us.

2.2      Baby Biographies

Charles Darwin provided much of the inspiration for the development of the second major technique for the study of language acquisition. Using note cards and field books to track the distribution of hundreds of species and subspecies in places like the Galapagos and Indonesia, Darwin was able to collect an impressive body of naturalistic data in support of his views on natural selection and evolution. In his study of gestural development in his son, Darwin (1877) showed how these same tools for naturalistic observation could be adapted to the study of human development. By taking detailed daily notes, Darwin showed how researchers could build diaries that could then be converted into biographies documenting virtually any aspect of human development. Following Darwin's lead, scholars such as Ament (1899), Preyer (1882), Gvozdev (1949), Szuman (1955), Stern & Stern (1907), Kenyeres (1926, 1938), and Leopold (1939, 1947, 1949a, 1949b) created monumental biographies detailing the language development of their own children.

Darwin's biographical technique also had its effects on the study of adult aphasia. Following in this tradition, studies of the language of particular patients and syndromes were presented by Low (1931), Pick (1913), Wernicke (1874), and many others.

2.3      Transcripts

The limits of the diary technique were always quite apparent. Even the most highly trained observer could not keep pace with the rapid flow of normal speech production. Anyone who has attempted to follow a child about with a pen and a notebook soon realizes how much detail is missed and how the note-taking process interferes with the ongoing interactions.

The introduction of the tape recorder in the late 1950s provided a way around these limitations and ushered in the third period of observational studies. The effect of the tape recorder on the field of language acquisition was very much like its effect on ethnomusicology, where researchers such as Alan Lomax (Parrish, 1996) were suddenly able to produce high quality field recordings using this new technology. This period was characterized by projects in which groups of investigators collected large data sets of tape recordings from several subjects across a period of 2 or 3 years. Much of the excitement in the 1960s regarding new directions in child language research was fueled directly by the great increase in raw data that was possible through use of tape recordings and typed transcripts.

This increase in the amount of raw data had an additional, seldom discussed, consequence. In the period of the baby biography, the final published accounts closely resembled the original database of note cards. In this sense, there was no major gap between the observational database and the published database. In the period of typed transcripts, a wider gap emerged. The size of the transcripts produced in the 1960s and 1970s made it impossible to publish the full corpora. Instead, researchers were forced to publish only high-level analyses based on data that were not available to others. This led to a situation in which the raw empirical database for the field was kept only in private stocks, unavailable for general public examination. Comments and tallies were written into the margins of ditto master copies and new, even less legible copies, were then made by thermal production of new ditto masters. Each investigator devised a project-specific system of transcription and project-specific codes. As we began to compare hand-written and typewritten transcripts, problems in transcription methodology, coding schemes, and cross-investigator reliability became more apparent.

Recognizing this problem, Roger Brown took the lead in attempting to share his transcripts from Adam, Eve, and Sarah (Brown, 1973) with other researchers. These transcripts were typed onto stencils and mimeographed in multiple copies. The extra copies were lent to and analyzed by a wide variety of researchers. In this model, researchers took their copy of the transcript home, developed their own coding scheme, applied it (usually by making pencil markings directly on the transcript), wrote a paper about the results and, if very polite, sent a copy to Roger. Some of these reports (Moerk, 1983) even attempted to disprove the conclusions drawn from those data by Brown himself!

During this early period, the relations between the various coding schemes often remained shrouded in mystery. A fortunate consequence of the unstable nature of coding systems was that researchers were very careful not to throw away their original data, even after it had been coded. Brown himself commented on the impending transition to computers in this passage (Brown, 1973, p. 53):

It is sensible to ask and we were often asked, “Why not code the sentences for grammatically significant features and put them on a computer so that studies could readily be made by anyone?”  My answer always was that I was continually discovering new kinds of information that could be mined from a transcription of conversation and never felt that I knew what the full coding should be.  This was certainly the case and indeed it can be said that in the entire decade since 1962 investigators have continued to hit upon new ways of inferring grammatical and semantic knowledge or competence from free conversation. But, for myself, I must, in candor, add that there was also a factor of research style.  I have little patience with prolonged “tooling up” for research.  I always want to get started. A better scientist would probably have done more planning and used the computer.  He can do so today, in any case, with considerable confidence that he knows what to code.

With the experience of three more decades of computerized analysis behind us, we now know that the idea of reducing child language data to a set of codes and then throwing away the original data is simply wrong.  Instead, our goal must be to computerize the data in a way that allows us to continually enhance it with new codes and annotations.  It is fortunate that Brown preserved his transcript data in a form that allowed us to continue to work on it.  It is unfortunate, however, that the original audiotapes were not kept.

2.4      Computers

Just as these data analysis problems were coming to light, a major technological opportunity was emerging in the shape of the powerful, affordable microcomputer. Microcomputer word-processing systems and database programs allowed researchers to enter transcript data into computer files that could then be easily duplicated, edited, and analyzed by standard data-processing techniques. In 1981, when the Child Language Data Exchange System (CHILDES) Project was first conceived, researchers basically thought of computer systems as large notepads. Although researchers were aware of the ways in which databases could be searched and tabulated, the full analytic and comparative power of the computer systems themselves was not yet fully understood.

Rather than serving only as an “archive” or historical record, a focus on a shared database can lead to advances in methodology and theory. However, to achieve these additional advances, researchers first needed to move beyond the idea of a simple data repository. At first, the possibility of utilizing shared transcription formats, shared codes, and shared analysis programs shone only as a faint glimmer on the horizon, against the fog and gloom of handwritten tallies, fuzzy dittos, and idiosyncratic coding schemes. Slowly, against this backdrop, the idea of a computerized data exchange system began to emerge. It was against this conceptual background that CHILDES (the name uses a one-syllable pronunciation) was conceived. The origin of the system can be traced back to the summer of 1981 when Dan Slobin, Willem Levelt, Susan Ervin-Tripp, and Brian MacWhinney discussed the possibility of creating an archive for typed, handwritten, and computerized transcripts to be located at the Max-Planck-Institut für Psycholinguistik in Nijmegen. In 1983, the MacArthur Foundation funded meetings of developmental researchers in which Elizabeth Bates, Brian MacWhinney, Catherine Snow, and other child language researchers discussed the possibility of soliciting MacArthur funds to support a data exchange system. In January of 1984, the MacArthur Foundation awarded a two-year grant to Brian MacWhinney and Catherine Snow for the establishment of the Child Language Data Exchange System. These funds provided for the entry of data into the system and for the convening of a meeting of an advisory board. Twenty child language researchers met for three days in Concord, Massachusetts and agreed on a basic framework for the CHILDES system, which Catherine Snow and Brian MacWhinney would then proceed to implement.

2.5      Connectivity

Since 1984, when the CHILDES Project began in earnest, the world of computers has gone through a series of remarkable revolutions, each introducing new opportunities and challenges. The processing power of the home computer now dwarfs the power of the mainframe of the 1980s; new machines are now shipped with built-in audiovisual capabilities; and devices such as CD-ROMs and optical disks offer enormous storage capacity at reasonable prices. This new hardware has now opened up the possibility for multimedia access to digitized audio and video from links inside the written transcripts. In effect, a transcript is now the starting point for a new exploratory reality in which the whole interaction is accessible from the transcript. Although researchers have just now begun to make use of these new tools, the current shape of the CHILDES system reflects many of these new realities. In the pages that follow, you will learn about how we are using this new technology to provide rapid access to the database and to permit the linkage of transcripts to digitized audio and video records, even over the Internet.

3       From CHILDES to TalkBank

Beginning in 2001, with support from an NSF Infrastructure grant, we began the extension of the CHILDES database concept to a series of additional fields listed in the Introduction. These extensions have led to the need for additional features in the CHAT coding system to support CA notation, phonological analysis, and gesture coding. As we develop new tools for each of these areas and increase the interoperability between tools, the power of the system continues to grow.  As a result, we can now refer to this work as the TalkBank Project.

3.1      Three Tools

The reasons for developing a computerized exchange system for language data are immediately obvious to anyone who has produced or analyzed transcripts. With such a system, we can:

1.     automate the process of data analysis,

2.     obtain better data in a consistent, fully documented transcription system, and

3.     provide more data for more children from more ages, speaking more languages.

The TalkBank Project has addressed each of these goals by developing three separate, but integrated, tools. The first tool is the CHAT transcription and coding format. The second tool is the CLAN analysis program, and the third tool is the database. These three tools are like the legs of a three-legged stool. The transcripts in the database have all been put into the CHAT transcription system. The CLAN program is designed to make full use of the CHAT format to facilitate a wide variety of searches and analyses. Many research groups are now using the CLAN programs to enter new data sets. Eventually, these new data sets will be available to other researchers as a part of the growing TalkBank databases. In this way, CHAT, CLAN, and the database function as an integrated set of tools. There are manuals for each of these TalkBank tools.

1.     Part 1 of the TalkBank manual, which you are now reading, describes the conventions and principles of CHAT transcription.

2.     Part 2 describes the use of the basic CLAN computer programs that you can use to transcribe, annotate, and analyze language interactions.

3.     Part 3 describes the use of additional CLAN programs for morphosyntactic analysis.

4.     The final section of the manuals, which describes the contents of the databases, is broken out as a collection of index and documentation files on the web. For example, if you want to survey the shape of the Dutch child language corpora, you first go to http://childes.talkbank.org (there is also a link to that site from the overall index at http://talkbank.org). From that homepage you click on Index to Corpora and then Dutch. From there, you might want to read about the contents of the CLPF corpus for early phonological development in Dutch. You then click on the CLPF link and it takes you to the fuller corpus description with photos from the contributors. From links on that page you can either browse the corpus, download the transcripts, or download the media.

In addition to these basic manual resources, there are these further facilities for learning CHAT and CLAN, all of which can be downloaded from the talkbank.org and childes.talkbank.org server sites:

1.     Nan Bernstein Ratner and Shelley Brundage have contributed a manual designed specifically for clinical practitioners called the SLP’s Guide to CLAN.

2.     There are versions of the manuals in Japanese and Chinese.

3.     Davida Fromm has produced a series of screencasts describing how to use basic features of CLAN.
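The way the three tools fit together can be illustrated with a minimal sketch in Python. It is not part of CLAN and is not how CLAN is implemented. It assumes only the basic line conventions of CHAT described later in this manual: header lines begin with @, main lines begin with * and a speaker code, and dependent tiers begin with %. The transcript below is invented for illustration and is deliberately incomplete (a real CHAT file requires further headers and would be validated with CLAN), and the function names parse_chat and word_frequencies are hypothetical. The sketch also ignores continuation lines and all special word-level codes.

```python
from collections import Counter

# A tiny, invented transcript in CHAT-like form (not taken from any corpus).
# Headers begin with "@", main lines with "*" plus a speaker code, and
# dependent tiers with "%". A tab separates the tier label from its content.
SAMPLE = "\n".join([
    "@Begin",
    "@Participants:\tCHI Target_Child, MOT Mother",
    "*MOT:\twhere is the ball ?",
    "*CHI:\tball gone .",
    "%com:\tchild points under the sofa",
    "*MOT:\tthe ball is gone ?",
    "@End",
])


def parse_chat(text):
    """Group lines into headers and utterances with their dependent tiers."""
    headers, utterances = [], []
    for line in text.splitlines():
        if line.startswith("@"):
            headers.append(line)
        elif line.startswith("*"):
            speaker, _, words = line[1:].partition(":")
            utterances.append(
                {"speaker": speaker, "text": words.strip(), "tiers": {}}
            )
        elif line.startswith("%") and utterances:
            # A dependent tier annotates the most recent main line.
            label, _, content = line[1:].partition(":")
            utterances[-1]["tiers"][label] = content.strip()
    return headers, utterances


def word_frequencies(utterances, speaker):
    """Count word tokens on one speaker's main lines, skipping punctuation."""
    counts = Counter()
    for utt in utterances:
        if utt["speaker"] == speaker:
            counts.update(w for w in utt["text"].split() if w.isalpha())
    return counts


headers, utterances = parse_chat(SAMPLE)
mot_counts = word_frequencies(utterances, "MOT")
print(mot_counts)
```

Because every transcript in the database follows the same line conventions, a tally of this kind can be run unchanged over any file or any speaker, and annotations on dependent tiers stay attached to the original utterance instead of replacing it. For real analyses, the CLAN programs described in Part 2 should be used.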

3.2      Shaping CHAT

We received a great deal of extremely helpful input during the years between 1984 and 1988 when the CHAT system was being formulated. Some of the most detailed comments came from George Allen, Elizabeth Bates, Nan Bernstein Ratner, Giuseppe Cappelli, Annick De Houwer, Jane Desimone, Jane Edwards, Julia Evans, Judi Fenson, Paul Fletcher, Steven Gillis, Kristen Keefe, Mary MacWhinney, Jon Miller, Barbara Pan, Lucia Pfanner, Kim Plunkett, Kelley Sacco, Catherine Snow, Jeff Sokolov, Leonid Spektor, Joseph Stemberger, Frank Wijnen, and Antonio Zampolli. Comments developed in Edwards (1992) were useful in shaping core aspects of CHAT. George Allen (1988) helped develop the UNIBET and PHONASCII systems. The workers in the LIPPS Group (LIPPS, 2000) have developed extensions of CHAT to cover code-switching phenomena. Adaptations of CHAT to deal with data on disfluencies are developed in Bernstein-Ratner, Rooney, and MacWhinney (1996). The exercises in the CLAN manual are based on materials originally developed by Barbara Pan for Chapter 2 of Sokolov & Snow (1994).

In the period between 2001 and 2004, we converted much of the CHILDES system to work with the new XML Internet data format.  This work was begun by Romeo Anghelache and completed by Franklin Chen. Support for this major reformatting and the related tightening of the CHAT format came from the NSF TalkBank Infrastructure project which involved a major collaboration with Steven Bird and Mark Liberman of the Linguistic Data Consortium. Ongoing work in TalkBank is documented on the web at http://talkbank.org. 

3.3      Building CLAN

The CLAN program is the brainchild of Leonid Spektor. Ideas for particular analysis commands came from several sources. Bill Tuthill's HUM package provided ideas about concordance analyses. The SALT system of Miller & Chapman (1983) provided guidelines regarding basic practices in transcription and analysis. Clifton Pye's PAL program provided ideas for the MODREP and PHONFREQ commands.

Darius Clynes ported CLAN to the Macintosh. Jeffrey Sokolov wrote the CHIP program. Mitzi Morris designed the MOR analyzer using specifications provided by Roland Hauser of Erlangen University. Norio Naka and Susanne Miyata developed a MOR rule system for Japanese; and Monica Sanz-Torrent helped develop the MOR system for Spanish. Julia Evans provided recommendations for the design of the audio and visual capabilities of the editor. Johannes Wagner and Spencer Hazel helped show us how we could modify CLAN to permit transcription in the Conversation Analysis framework. Steven Gillis provided suggestions for aspects of MODREP. Christophe Parisse built the POST and POSTTRAIN programs (Parisse & Le Normand, 2000). Brian Richards contributed the VOCD program (Malvern, Richards, Chipere, & Purán, 2004). Julia Evans helped specify TIMEDUR and worked on the details of DSS. Catherine Snow designed CHAINS, KEYMAP, and STATFREQ. Nan Bernstein Ratner specified aspects of PHONFREQ and plans for additional programs for phonological analysis.

3.4      Constructing the Database

The primary reason for the success of the TalkBank databases has been the generosity of over 300 researchers who have contributed their corpora. Each of these corpora represents hundreds, often thousands, of hours spent in careful collection, transcription, and checking of data. All researchers in child language should be proud of the way researchers have generously shared their valuable data with the whole research community. The growing size of the database for language impairments, adult aphasia, and second-language acquisition indicates that these related areas have also begun to understand the value of data sharing.

Many of the corpora contributed to the system were transcribed before the formulation of CHAT. In order to create a uniform database, we had to reformat these corpora into CHAT. Jane Desimone, Mary MacWhinney, Jane Morrison, Kim Roth, Kelley Sacco, Lillian Jarold, Anthony Kelly, Andrew Yankes, and Gergely Sikuta worked many long hours on this task. Steven Gillis, Helmut Feldweg, Susan Powers, and Heike Behrens supervised a parallel effort with the German and Dutch data sets.

Because of the continually changing shape of the programs and the database, keeping this manual up to date has been an ongoing activity. In this process, I received help from Mike Blackwell, Julia Evans, Kris Loh, Mary MacWhinney, Lucy Hewson, Kelley Sacco, and Gergely Sikuta. Barbara Pan, Jeff Sokolov, and Pam Rollins also provided a reading of the final draft of the 1995 version of the manual.

3.5      Dissemination

Since the beginning of the project, Catherine Snow has continually played a pivotal role in shaping policy, building the database, organizing workshops, and determining the shape of CHAT and CLAN. Catherine Snow collaborated with Jeffrey Sokolov, Pam Rollins, and Barbara Pan to construct a series of tutorial exercises and demonstration analyses that appeared in Sokolov & Snow (1994). Those exercises form the basis for similar tutorial sections in the current manual. Catherine Snow has contributed six major corpora to the database and has conducted CHILDES workshops in a dozen countries.

Several other colleagues have helped disseminate the CHILDES system through workshops, visits, and Internet facilities. Hidetosi Sirai established a CHILDES file server mirror at Chukyo University in Japan and Steven Gillis established a mirror at the University of Antwerp. Steven Gillis, Kim Plunkett, Johannes Wagner, and Sven Strömqvist helped propagate the CHILDES system at universities in Northern and Central Europe. Susanne Miyata has brought together a vital group of child language researchers using CHILDES to study the acquisition of Japanese and has supervised the translation of the current manual into Japanese. In Italy, Elena Pizzuto organized symposia for developing the CHILDES system and has supervised the translation of the manual into Italian. Magdalena Smoczynska in Krakow and Wolfgang Dressler in Vienna have helped new researchers who are learning to use CHILDES for languages spoken in Eastern Europe. Miquel Serra has supported a series of CHILDES workshops in Barcelona. Zhou Jing organized a workshop in Nanjing and Chien-ju Chang organized a workshop in Taipei.

The establishment and promotion of additional segments of TalkBank now rely on a wide array of inputs. Yvan Rose has spearheaded the creation of PhonBank. Nan Bernstein Ratner has led the development of FluencyBank. Audrey Holland, Davida Fromm, and Margie Forbes have worked to create AphasiaBank. Johannes Wagner has created SamtaleBank and segments of CABank. Jerry Goldman developed the SCOTUS segment of CABank. Roy Pea contributed to the development of ClassBank. Within each of these communities, scores of other scholars have helped with donations of corpora, analyses, and ideas.

3.6      Funding

From 1984 to 1988, the John D. and Catherine T. MacArthur Foundation supported the CHILDES Project. In 1988, the National Science Foundation provided an equipment grant that allowed us to put the database on the Internet and on CD-ROMs. Since 1989, the CHILDES project has been supported by an ongoing grant from the National Institutes of Health (NICHD). In 1998, the National Science Foundation Linguistics Program provided additional support to improve the programs for morphosyntactic analysis of the database. In 1999, NSF funded the TalkBank project. In 2002, NSF provided support for the development of the GRASP system for parsing of the corpora. In 2002, NIH provided additional support for the development of PhonBank for child language phonology and AphasiaBank for the study of communication in aphasia. Currently (2017), NICHD is providing support for CHILDES and PhonBank; NIDCD provides support for AphasiaBank and FluencyBank; NSF provides support for HomeBank and FluencyBank; and NEH provides support for LangBank. Beginning in 2014, TalkBank also became a member of the CLARIN federation (clarin.eu), a system designed to coordinate resources for language computation in the Humanities and Social Sciences.

3.7      How to Use These Manuals

Each of the three parts of the TalkBank system is described in separate sections of the TalkBank manual.  The CHAT manual describes the conventions and principles of CHAT transcription. The CLAN manual describes the use of the editor and the analytic commands. The database manual is a set of over a dozen smaller documents, each describing a separate segment of the database.

To learn the TalkBank system, you should begin by downloading and installing the CLAN program.  Next, you should download and start to read the current manual (CHAT Manual) and the CLAN manual (Part 2 of the TalkBank manual).  Before proceeding too far into the CHAT manual, you will want to walk through the tutorial section at the beginning of the CHAT manual.  After finishing the tutorial, try working a bit with each of the CLAN commands to get a feel for the overall scope of the system. You can then learn more about CHAT by transcribing a small sample of your data in a short test file. Run the CHECK program at frequent intervals to verify the accuracy of your coding. Once you have fin­ished transcribing a small segment of your data, try out the various analysis pro­grams you plan to use, to make sure that they provide the types of results you need for your work.

 

If you are primarily interested in analyzing data already stored in TalkBank, you do not need to learn the CHAT transcription format in much detail and you will only need to use the editor to open and read files. In that case, you may wish to focus your efforts on learning to use the CLAN programs. If you plan to transcribe new data, then you also need to work with the current manual to learn to use CHAT.

Teachers will also want to pay particular attention to the sections of the CLAN manual that present a tutorial introduction. Using some of the examples given there, you can construct additional materials to encourage students to explore the database to test out particular hypotheses.

The TalkBank system was not intended to address all issues in the study of language learning, or to be used by all students of spontaneous interactions. The CHAT system is comprehensive, but it is not ideal for all purposes. The programs are powerful, but they cannot solve all analytic problems. It is not the goal of TalkBank to provide facilities for all research endeavors or to force all research into some uniform mold. On the contrary, the programs are designed to offer support for alternative analytic frameworks. For example, the editor now supports the various codes of Conversation Analysis (CA) format, as alternatives and supplements to CHAT format. Moreover, we have developed programs that convert between CHAT format and other common formats, because we know that users often need to run analyses in these other formats.

3.8      Changes

The TalkBank tools have been extensively tested for ease of application, accuracy, and reliability. However, change is fundamental to any research enterprise. Researchers are constantly pursuing better ways of coding and analyzing data. It is important that the tools keep pace with these changing requirements. For this reason, there will be revisions to CHAT, the programs, and the database as long as the TalkBank Project is active.

4       Principles

The CHAT system provides a standardized format for producing computerized transcripts of face-to-face conversational interactions. These interactions may involve children and parents, doctors and patients, or teachers and second-language learners. Despite the differences between these interactions, there are enough common features to allow for the creation of a single general transcription system. The system described here is designed for use with both normal and disordered populations. It can be used with learners of all types, including children, second-language learners, and adults recovering from aphasic disorders. The system provides options for basic discourse transcription as well as detailed phonological and morphological analysis. The system bears the acronym “CHAT,” which stands for Codes for the Human Analysis of Transcripts. CHAT is the standard transcription system for the TalkBank and CHILDES (Child Language Data Exchange System) Projects. All of the transcripts in the TalkBank databases are in CHAT format.

What makes CHAT particularly powerful is the fact that files transcribed in CHAT can also be analyzed by the CLAN programs that are described in the CLAN manual, which is an electronic companion piece to this manual. The CLAN programs can track a wide variety of structures, compute automatic indices, and analyze morphosyntax. Moreover, because all CHAT files can now also be translated to a highly structured form of XML (a language used for text documents on the web), they are now also compatible with a wide range of other powerful computer programs such as ELAN, Praat, EXMARaLDA, Phon, Transcriber, and so on.

The TalkBank system has had a major impact on the study of child language. At the time of the last monitoring in 2016, there were over 7000 published articles that had made use of the programs and database. In 2016, the size of the database had grown to over 110 million words, making it by far the largest database of conversational interactions available anywhere. The total number of researchers who have joined as members across the length of the project is now over 5000. Of course, not all of these people are making active use of the tools at all times. However, it is safe to say that, at any given point in time, well over 100 groups of researchers around the world are involved in new data collection and transcription using the CHAT system. Eventually the data collected in these various projects will all be contributed to the database.

4.1      Computerization

Public inspection of experimental data is a crucial prerequisite for serious scientific progress. Imagine how genetics would function if every experimenter had his or her own individual strain of peas or drosophila and refused to allow them to be tested by other ex­perimenters. What would happen in geology, if every scientist kept his or her own set of rock specimens and refused to compare them with those of other researchers? In some fields the basic phenomena in question are so clearly open to public inspection that this is not a problem. The basic facts of planetary motion are open for all to see, as are the basic facts underlying Newtonian mechanics.

Unfortunately, in language studies, a free and open sharing and exchange of data has not always been the norm. In earlier decades, researchers jealously guarded their field notes from a particular language community or subject type, refusing to share them openly with the broader community. Various justifications were given for this practice. It was sometimes claimed that other researchers would not fully appreciate the nature of the data or that they might misrepresent crucial patterns. Sometimes, it was claimed that only someone who had actually participated in the community or the interaction could understand the nature of the language and the interactions. In some cases, these limitations were real and important. However, all such restrictions on the sharing of data inevitably impede the progress of the scientific study of language learning.

Within the field of language acquisition studies it is now understood that the advantages of sharing data outweigh the potential dangers. The question is no longer whether data should be shared, but rather how they can be shared in a reliable and responsible fashion. The computerization of transcripts opens up the possibility for many types of data sharing and analysis that otherwise would have been impossible. However, the full exploitation of this opportunity requires the development of a standardized system for data transcription and analysis.

4.2      Words of Caution

Before examining the CHAT system, we need to consider some dangers involved in computerized transcriptions. These dangers arise from the need to compress a complex set of verbal and nonverbal messages into the extremely narrow channel required for the computer. In most cases, these dangers also exist when one creates a typewritten or handwritten transcript. Let us look at some of the dangers surrounding the enterprise of transcription.

4.2.1     The Dominance of the Written Word

Perhaps the greatest danger facing the transcriber is the tendency to treat spoken lan­guage as if it were written language. The decision to write out stretches of vocal material using the forms of written language can trigger a variety of theoretical commitments. As Ochs (1979) showed so clearly, these decisions will inevitably turn transcription into a theoretical en­terprise. The most difficult bias to overcome is the tendency to map every form spoken by a learner – be it a child, an aphasic, or a second-language learner – onto a set of standard lexical items in the adult language. Transcribers tend to assimilate nonstandard learner strings to standard forms of the adult language. For example, when a child says “put on my jamas,” the transcriber may instead enter “put on my pajamas,” reasoning unconsciously that “jamas” is simply a childish form of “pajamas.” This type of regularization of the child form to the adult lexical norm can lead to misunderstanding of the shape of the child's lex­icon. For example, it could be the case that the child uses “jamas” and “pajamas” to refer to two very different things (Clark, 1987; MacWhinney, 1989).

There are two types of errors possible here. One involves mapping a learner's spoken form onto an adult form when, in fact, there was no real correspondence. This is the prob­lem of overnormalization. The second type of error involves failing to map a learner's spo­ken form onto an adult form when, in fact, there is a correspondence. This is the problem of undernormalization. The goal of transcribers should be to avoid both the Scylla of over­normalization and the Charybdis of undernormalization. Steering a course between these two dangers is no easy matter. A transcription system can provide devices to aid in this pro­cess, but it cannot guarantee safe passage.

 

Transcribers also often tend to assimilate the shape of sounds spoken by the learner to the shapes that are dictated by morphosyntactic patterns. For example, Fletcher (1985) noted that both children and adults generally produce “have” as “uv” before main verbs. As a result, forms like “might have gone” assimilate to “mightuv gone.” Fletcher believed that younger children have not yet learned to associate the full auxiliary “have” with the contracted form. If we write the children's forms as “might have,” we then end up mischaracterizing the structure of their lexicon. To take another example, we can note that, in French, the various endings of the verb in the present tense are distinguished in spelling, whereas they are homophonous in speech. If a child says /mãʒ/ “eat,” are we to transcribe it as first person singular mange, as second person singular manges, or as the imperative mange? If the child says /mãʒe/, should we transcribe it as the infinitive manger, the participle mangé, or the second person formal mangez?

CHAT deals with these problems in three ways. First, it uses IPA as a uniform way of transcribing discourse phonetically. Second, the editor allows the user to link the digitized audio record of the interaction directly to the transcript. This is the system called “sonic CHAT.” With these sonic CHAT links, it is possible to double-click on a sentence and hear its sound immediately. Having the actual sound produced by the child directly available in the transcript takes some of the burden off the transcription system. However, whenever computerized analyses are based not on the original audio signal but on transcribed orthographic forms, one must continue to understand the limits of transcription conventions. Third, for those who wish to avoid the work involved in IPA transcription or sonic CHAT, there is a system for using nonstandard lexical forms, so that the form “might (h)ave” would be universally recognized as the spelling of “mightof”, the contracted form of “might have.” More extreme cases of phonological variation can be annotated as in this example:  popo [: hippopotamus].
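
To make the third option concrete, the following Python sketch shows how a program might recover both the spoken form and the standard target form from the two notations mentioned above. It is an illustration only and is not part of CLAN. The function names and the regular expressions are assumptions made for this example, and the sketch handles only the simple cases shown here: a single word followed by a bracketed replacement, and parentheses around omitted sounds.

```python
import re

# Assumption: a replacement is written as "spoken [: target]",
# as in the example "popo [: hippopotamus]".
REPLACEMENT = re.compile(r"(\S+) \[: ([^\]]+)\]")

# Assumption: parentheses enclose sounds the speaker left out,
# as in "(h)ave" for a reduced "have".
OMITTED = re.compile(r"\(([^)]*)\)")


def spoken_form(text):
    """Return the text as spoken: drop targets and omitted sounds."""
    text = REPLACEMENT.sub(r"\1", text)
    return OMITTED.sub("", text)


def target_form(text):
    """Return the text mapped onto standard forms."""
    text = REPLACEMENT.sub(r"\2", text)
    return OMITTED.sub(r"\1", text)


if __name__ == "__main__":
    for sample in ["popo [: hippopotamus]", "might (h)ave gone"]:
        print(spoken_form(sample), "|", target_form(sample))
```

Keeping both forms recoverable from one line is what lets the transcriber avoid choosing between overnormalization and undernormalization at transcription time.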

4.2.2     The Misuse of Standard Punctuation

Transcribers have a tendency to write out spoken language with the punctuation conventions of written language. Written language is organized into clauses and sentences delimited by commas, periods, and other marks of punctuation. Spoken language, on the other hand, is organized into tone units clustered about a tonal nucleus and delineated by pauses and tonal contours (Crystal, 1969, 1979; Halliday, 1966, 1967, 1968). Work on the discourse basis of sentence production (Chafe, 1980; Jefferson, 1984) has demonstrated a close link between tone units and ideational units. Retracings, pauses, stress, and all forms of intonational contours are crucial markers of aspects of the utterance planning process. Moreover, these features also convey important sociolinguistic information. Without special markings or conventions, there is no way to directly indicate these important aspects of interactions.

4.2.3     Working With Video

Whatever form a transcript may take, it will never contain a fully accurate record of what went on in an interaction. A transcript of an interaction can never fully replace an au­diotape, because an audio recording of the interaction will always be more accurate in terms of preserving the actual details of what transpired. By the same token, an audio recording can never preserve as much detail as a video recording with a high-quality audio track. Au­dio recordings record none of the nonverbal interactions that often form the backbone of a conversational interaction. Hence, they systematically exclude a source of information that is crucial for a full interpretation of the interaction. Although there are biases involved even in a video recording, it is still the most accurate record of an interaction that we have avail­able. For those who are trying to use transcription to capture the full detailed character of an interaction, it is imperative that transcription be done from a video recording which should be repeatedly consulted during all phases of analysis.

When the CLAN editor is used to link transcripts to audio recordings, we refer to this as sonic CHAT. When the system is used to link transcripts to video recordings, we refer to this as video CHAT. The CLAN manual explains how to link digital audio and video to transcripts.

4.3      Problems With Forced Decisions

Transcription and coding systems often force the user to make difficult distinctions. For example, a system might make a distinction between grammatical ellipsis and ungrammatical omission. However, it may often be the case that the user cannot decide whether an omission is grammatical or not. In that case, it may be helpful to have some way of blurring the distinction. CHAT has certain symbols that can be used when a categorization cannot be made. It is important to remember that many of the CHAT symbols are entirely optional. Whenever you feel that you are being forced to make a distinction, check the manual to see whether the particular coding choice is actually required. If it is not required, then simply omit the code altogether.

4.4      Transcription and Coding

It is important to recognize the difference between transcription and coding. Transcrip­tion focuses on the production of a written record that can lead us to understand, albeit only vaguely, the flow of the original interaction. Transcription must be done directly off an au­diotape or, preferably, a videotape. Coding, on the other hand, is the process of recognizing, analyzing, and taking note of phenomena in transcribed speech. Coding can often be done by referring only to a written transcript. For example, the coding of parts of speech can be done directly from a transcript without listening to the audiotape. For other types of coding, such as speech act coding, it is imperative that coding be done while watching the original videotape.

The CHAT system includes conventions for both transcription and coding. When first learning the system, it is best to focus on learning how to transcribe. The CHAT system offers the transcriber a large array of coding options. Although few transcribers will need to use all of the options, everyone needs to understand how basic transcription is done on the “main line.” Additional coding is done principally on the secondary or “dependent” tiers. As transcribers work more with their data, they will include further options from the secondary or “dependent” tiers. However, the beginning user should focus first on learning to correctly use the conventions for the main line. The manual includes several sample transcripts to help the beginner in learning the transcription system.
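
The relation between main lines and dependent tiers can be pictured with a short Python sketch. It is not part of CLAN and is offered only as an illustration. It assumes the usual CHAT line prefixes: @ for headers, * for main lines, and % for dependent tiers. The function names and the sample transcript are hypothetical.

```python
def classify_line(line):
    """Classify a transcript line by its first character."""
    if line.startswith("@"):
        return "header"
    if line.startswith("*"):
        return "main"
    if line.startswith("%"):
        return "dependent"
    return "other"


def group_utterances(lines):
    """Attach each dependent tier to the main line that precedes it."""
    utterances = []
    for line in lines:
        kind = classify_line(line)
        if kind == "main":
            utterances.append({"main": line, "dependent": []})
        elif kind == "dependent" and utterances:
            utterances[-1]["dependent"].append(line)
    return utterances


if __name__ == "__main__":
    # Hypothetical sample, invented for this illustration.
    sample = [
        "@Begin",
        "*CHI:\tput on my jamas .",
        "%com:\tpoints at the drawer",
        "*MOT:\tokay .",
        "@End",
    ]
    for utt in group_utterances(sample):
        print(utt["main"], utt["dependent"])
```

A beginner who transcribes only main lines still produces input that this kind of grouping handles, since the list of dependent tiers for each utterance is simply empty.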

4.5      Three Goals

Like other forms of communication, transcription systems are subjected to a variety of communicative pressures. The view of language structure developed by Slobin (1977) sees structure as emerging from the pressure of three conflicting charges or goals. On the one hand, language is designed to be clear. On the other hand, it is designed to be processible by the listener and quick and easy for the speaker. Unfortunately, ease of production often comes in conflict with clarity of marking. The competition between these three motives leads to a variety of imperfect solutions that satisfy each goal only partially. Such imperfect and unstable solutions characterize the grammar and phonology of human language (Bates & MacWhinney, 1982). Only rarely does a solution succeed in fully achieving all three goals.

Slobin's view of the pressures shaping human language can be extended to analyze the pressures shaping a transcription system. In many regards, a transcription system is much like any human language. It needs to be clear in its markings of categories, and still preserve readability and ease of transcription. However, transcripts address rather different audiences. One audience is the human audience of transcribers, analysts, and readers. The other audience is the digital computer and its pro­grams. To deal with these two audiences, a system for computerized transcription needs to achieve the following goals:

Clarity: Every symbol used in the coding system should have some clear and definable real-world referent. Symbols that mark particular words should always be spelled in a consistent manner. Symbols that mark particular conversational patterns should refer to consistently observable patterns. Codes must steer between the Scylla of overnormalization and the Charybdis of undernormalization discussed earlier. Distinctions must avoid being either too fine or too coarse. Another way of looking at clarity is through the notion of systematicity. Codes, words, and symbols must be used in a consistent manner across transcripts. Ideally, each code should always have a unique meaning independent of the presence of other codes or the particular transcript in which it is located. If interactions are necessary, as in hierarchical coding systems, these interactions need to be systematically described.

Readability: Just as human language needs to be easy to process, so transcripts need to be easy to read. This goal often runs directly counter to the first goal. In the TalkBank system, we have attempted to provide a variety of CHAT options that will allow a user to maximize the readability of a transcript. We have also provided CLAN tools that will allow a reader to suppress the less readable aspects of a transcript when the goal of readability is more important than the goal of clarity of marking.

Ease of data entry: As distinctions proliferate within a transcription system, data entry becomes increasingly difficult and error-prone. There are two ways of dealing with this problem. One method attempts to simplify the coding scheme and its categories. The problem with this approach is that it sacrifices clarity. The second method attempts to help the transcriber by providing computational aids. The CLAN programs follow this path. They provide systems for the automatic checking of transcription accuracy, methods for the automatic analysis of mor­phology and syntax, and tools for the semiautomatic entry of codes. However, the basic process of transcription has not been automated and remains the major task during data entry.

5           minCHAT

CHAT provides both basic and advanced formats for transcription and coding. The basic level of CHAT is called minCHAT. New users should start by learning minCHAT. This system looks much like other intuitive transcription systems that are in general use in the fields of child language and discourse analysis. However, eventually users will find that there is something they want to be able to code that goes beyond minCHAT. At that point, they should move on to learning midCHAT.

5.1      minCHAT – the Form of Files

There are several minimum standards for the form of a minCHAT file. These standards must be followed for the CLAN commands to run successfully on CHAT files:

1.     Every line must end with a carriage return.

2.     The first line in the file must be an @Begin header line.

3.     The second line in the file must be an @Languages header line.  The languages entered here use a three-letter ISO 639-3 code, such as “eng” for English.

4.     The third line must be an @Participants header line listing three-letter codes for each participant, the participant's name, and the participant's role.

5.     After the @Participants header come a set of @ID headers providing further details for each speaker.  These will be inserted automatically for you when you run CHECK using escape-L.

6.     The last line in the file must be an @End header line.

7.     Lines beginning with * indicate what was actually said. These are called “main lines.” Each main line should code one and only one utterance. When a speaker produces several utterances in a row, code each with a new main line.

8.     After the asterisk on the main line comes a three-letter code in upper case letters for the participant who was the speaker of the utterance being coded. After the three-letter code comes a colon and then a tab.

9.     What was actually said is entered starting in the ninth column.

10.