Tools for Analyzing Talk

Part 2:  The CLAN Program

 

Brian MacWhinney

Carnegie Mellon University

December 27, 2017

 

 

When citing the use of TalkBank facilities, please use this reference to the last printed version of the CHILDES manual:

MacWhinney, B. (2000).  The CHILDES Project: Tools for Analyzing Talk. 3rd Edition.  Mahwah, NJ: Lawrence Erlbaum Associates

This allows us to systematically track usage of the programs and data through scholar.google.com.


 

1      Getting Started

1.1      Why you want to learn CLAN

1.2      Learning CLAN

1.3      Installing CLAN – Mac OS X

1.4      Installing CLAN – Windows

2      Using the Web

2.1      Community Resources

2.2      Downloading Materials

2.3      Using the Browsable Database

2.4      Downloading Transcripts and Media

3      Tutorial

3.1      The Commands Window

3.1.1     Setting the Working Directory

3.1.2     The Recall Button

3.1.3     The ? Button

3.1.4     The Progs Menu

3.1.5     The FILE IN Button

3.1.6     The TIERS Button

3.2      Typing Command Lines

3.2.1     Wildcards

3.2.2     Output Files

3.2.3     Redirection

3.3      Sample Runs

3.3.1     Sample KWAL Run

3.3.2     Sample FREQ Run

3.3.3     Sample MLU Run

3.3.4     Sample COMBO Run

3.3.5     Sample GEM and GEMFREQ Runs

3.4      Advanced Commands

3.5      Exercises

3.5.1     MLU50 Analysis

3.5.2     MLU5 Analysis

3.5.3     MLT Analysis

3.5.4     TTR Analysis

3.5.5     Generating Language Profiles

3.6      Further Exercises

4      The Editor

4.1      Screencasts

4.2      Text Mode vs. CHAT Mode

4.3      File, Edit, and Font Menus

4.4      Default Window Positioning, Size, and Font Control

4.5      CA Styles

4.6      Setting Special Colors

4.7      Searching

4.8      Hiding Tiers

4.9      Send to Sound Analyzer

4.10    Tiers Menu Items

4.11    Running CHECK Inside the Editor

4.12    Preferences and Options

4.13    Coder Mode

4.13.1       Entering Codes

4.13.2       Setting Up Your Codes File

5      Media Linkage

5.1      Sonic Mode

5.2      Transcriber Mode

5.2.1     Linking to an already existing transcript

5.2.2     To create a new transcript

5.2.3     Sparse Annotation

5.3      Video Linking

5.4      SoundWalker

5.5      Export to Partitur Editors

5.6      Playback Control

5.7      Multiple Video Playback

5.8      Manual Editing

6      Other Features

6.1      Shell Commands

6.2      Online Help

6.3      Commands Listing

6.4      Aliases

6.5      Macros

6.6      Testing CLAN

6.7      Bug Reports

6.8      Feature Requests

7      Analysis Commands

7.1      CHAINS

7.1.1     Sample Runs

7.1.2     Unique Options

7.2      CHECK

7.2.1     How CHECK works

7.2.2     CHECK in CA Mode

7.2.3     Running CHECK

7.2.4     Restrictions on Word Forms

7.2.5     Unique Options

7.3      CHIP

7.3.1     The Tier Creation System

7.3.2     The CHIP Coding System

7.3.3     Word Class Analysis

7.3.4     Summary Measures

7.3.5     Unique Options

7.4      COMBO

7.4.1     Composing Search Strings

7.4.2     Examples of Search Strings

7.4.3     Referring to Files in Search Strings

7.4.4     Cross-tier Combo

7.4.5     Cluster Sequences in COMBO

7.4.6     Tracking Final Words

7.4.7     Tracking Initial Words

7.4.8     Limiting with combo

7.4.9     Adding Codes with COMBO

7.4.10       Unique Options

7.5      COOCUR

7.5.1     Unique Options

7.6      DIST

7.6.1     Unique Options

7.7      DSS

7.7.1     CHAT File Format Requirements

7.7.2     Selection of a 50-sentence Corpus

7.7.3     Automatic Calculation of DSS

7.7.4     Sentence Points

7.7.5     DSS Output

7.7.6     DSS Summary

7.7.7     DSS for Japanese

7.7.8     How DSS works

7.7.9     Unique Options

7.8      EVAL

7.8.1     Explanation of EVAL Measures

7.8.2     EVAL Demo

7.8.3     EVAL Output

7.8.4     Comparing Multiple Transcripts

7.8.5     Unique Options

7.9      FLUCALC

7.10    FREQ

7.10.1       What FREQ ignores

7.10.2       Studying Lexical Groups using the +s@file switch

7.10.3       Searches for %mor and %gra combinations

7.10.4       Searches in Multilingual Corpora

7.10.5       Building Concordances with FREQ

7.10.6       Using Wildcards with FREQ

7.10.7       FREQ for the %mor line

7.10.8       Errors for morphological codes

7.10.9       Directing the Output of FREQ

7.10.10     Limiting in FREQ

7.10.11     Creating Crosstabulations in FREQ

7.10.12     TTR for Lemmas

7.10.13     Studying Unique Words and Shared Words

7.10.14     Grammatical Complexity Analysis through FREQ

7.10.15     Unique Options

7.10.16     Further Illustrations

7.11    FREQMERG

7.12    FREQPOS

7.12.1       Unique Options

7.13    GEM

7.13.1       Sample Runs

7.13.2       Limiting with GEM

7.13.3       Unique Options

7.14    GEMFREQ

7.14.1       Unique Options

7.15    GEMLIST

7.16    IPSYN

7.17    KEYMAP

7.17.1       Sample Runs

7.17.2       Unique Options

7.18    KIDEVAL

7.18.1       Unique Options

7.19    KWAL

7.19.1       Tier Selection in KWAL

7.19.2       KWAL with signs and speech

7.19.3       Unique Options

7.20    MAXWD

7.20.1       Unique Options

7.21    MLT

7.21.1       MLT defaults

7.21.2       Breaking Up Turns

7.21.3       Sample Runs

7.21.4       Unique Options

7.22    MLU

7.22.1       Including and Excluding in MLU and MLT

7.22.2       Unique Options

7.23    MODREP

7.23.1       Exclusions and Inclusions

7.23.2       Using a %mod Line

7.23.3       MODREP for the %mor line

7.23.4       Unique Options

7.24    MORTABLE

7.25    PHONFREQ

7.25.1       Unique Options

7.26    RELY

7.26.1       Unique Options

7.27    SCRIPT

7.28    TIMEDUR

7.29    VOCD

7.29.1       Origin of the Measure

7.29.2       Calculation of D

7.29.3       Sample Size

7.29.4       Preparation of Files

7.29.5       The Output from VOCD

7.29.6       Lemma-based Analysis

7.29.7       Unique Options

7.30    WDLEN

8      Options

8.1      +F Option

8.2      +K Option

8.3      +L Option

8.4      +P Option

8.5      +R Option

8.6      +S Option

8.7      +T Option

8.8      +U Option

8.9      +V Option

8.10    +W Option

8.11    +X Option

8.12    +Y Option

8.13    +Z Option

8.14    Metacharacters for Searching

9      Utility Commands Table

9.1      ANVIL2CHAT

9.2      CHAT2ANVIL

9.3      CHAT2CA

9.4      CHAT2CONLL

9.5      CHAT2ELAN

9.6      CHAT2PRAAT

9.7      CHAT2SRT

9.8      CHAT2XMAR

9.9      CHSTRING

9.10    CMDI

9.11    COMBINE

9.12    COMPOUND

9.13    COMBTIER

9.14    CONLL2CHAT

9.15    CP2UTF

9.16    DATACLEAN

9.17    DATES

9.18    DELIM

9.19    ELAN2CHAT

9.20    FIXBULLETS

9.21    FIXIT

9.22    FIXLANG

9.23    FIXMP3S

9.24    FLO

9.25    INDENT

9.26    INSERT

9.27    JOINITEMS

9.28    LAB2CHAT

9.29    LENA2CHAT

9.30    LIPP2CHAT

9.31    LONGTIER

9.32    LOWCASE

9.33    OLAC

9.34    ORT

9.35    PRAAT2CHAT

9.36    QUOTES

9.37    REPEAT

9.38    RETRACE

9.39    RTFIN

9.40    SALTIN

9.41    SILENCE

9.42    SPREADSHEET

9.43    SUBTITLES

9.44    SYNCODING

9.45    TEXTIN

9.46    TIERORDER

9.47    TRIM

9.48    TRNFIX

9.49    UNIQ

9.50    USEDLEX

10   References

1       Getting Started

This manual describes the use of the CLAN program, designed and written by Leonid Spektor at Carnegie Mellon University. The acronym CLAN stands for Computerized Language ANalysis. CLAN is designed specifically to analyze data transcribed in the CHAT format.  This is the format used in the various segments of the TalkBank system.  There are three parts to the overall TalkBank manual.  Part 1 describes the CHAT transcription system.  Part 2 (this current manual) describes the CLAN analysis programs. Part 3 describes the segments of the CLAN program that perform automatic morphosyntactic analysis.

1.1      Why you want to learn CLAN

If you are a researcher studying conversational interaction, language learning, or language disorders, you will want to learn to use CLAN, because it will help you address basic research questions and explore many different language types.  If you are a clinician, CLAN can help you analyze data from individual clients and compare them against a large database of similar transcripts. For both these purposes, CLAN emphasizes the automatic computation of indices such as MLU, TTR, DSS, and IPSyn.  It also provides powerful methods for speeding transcription, linking transcripts to media, sending data to automatic acoustic analysis, and automatically computing a wide range of morphosyntactic features.  For conversation analysts, CLAN provides the full range of Jeffersonian markings within a computationally clear framework.  For all these purposes, CLAN is available free, as is the huge TalkBank database of transcripts compatible with CLAN analyses.

1.2      Learning CLAN

The first six chapters of this manual provide a basic introduction to CLAN. 

1.     Chapter 1 explains how to install and configure CLAN. This process has different steps, depending on whether you are using Windows or Mac OS X. 

2.     Chapter 2 explains how to access and use materials from the CHILDES and TalkBank homepages on the web. 

3.     Chapter 3 provides a tutorial on how to begin using CLAN commands.

4.     Chapter 4 explains how to use the editor.

5.     Chapter 5 explains how to link transcripts to media.

6.     Chapter 6 provides advanced exercises for learning CLAN. 

Ideally, you should work through all six chapters in that order.  However, some users may wish to skip some sections.  If you are not interested in transcribing new data, you can skip chapters 4 and 5 on the editor and linkage. People working with CA (Conversation Analysis) will probably not need to read chapter 3 on CLAN commands.  The examples and analyses all focus on child language data.  People working with other language types such as aphasia, adult conversation, or second language may wish to practice the exercises with CHAT files and media appropriate to those areas.

1.3      Installing CLAN – Mac OS X

Here is how to install and configure CLAN for Mac OS X:

1.     If you need to permit downloading of non-AppStore apps, go to System Preferences / Security / General, open the lock, and click on "Anywhere".

2.     Next, point your browser at http://talkbank.org/clan and download the Mac version of CLAN.  Click to open clan.dmg and then click to start the installer.  It will install in your Applications folder and your working directory will be: Applications/CLAN/work.  For shared computers, there is also an option to install in ~/Applications.

3.     Drag the CLAN file icon into the dock to create a link for easy access.

4.     You may also want to create a link to Applications/CLAN/ in your “favorites” list.

5.     Go to System Preferences and select Keyboard.  Check the two boxes there to use standard function keys and to show Character Viewers.

1.4      Installing CLAN – Windows

Here is how to install and configure CLAN for Windows:

1.     Point your browser at http://talkbank.org/clan and download the Windows version of CLAN.  (Current versions of CLAN are no longer compatible with Windows 95/98/ME.)

2.     CLAN will automatically install in c:/TalkBank and your working directory will be c:/TalkBank/CLAN/work. The installer will create shortcuts for CLAN and the /work folder.

2       Using the Web

In this chapter, we will first survey some of the community resources available at the TalkBank homepages. Then we will learn about how to download and play transcripts and linked media, and how to use the Browsable Database.

2.1      Community Resources

From the TalkBank homepage at http://talkbank.org, look at the community resource information at these links under System:

1.     Ground rules.  Whenever using TalkBank data, remember to cite the sources provided.

2.     Contributing New Data.  How to configure new research projects for eventual inclusion in TalkBank.

3.     IRB Principles. We explain how to configure consent forms to specify the levels of confidentiality protection appropriate for your project.

4.     Programs.  Manuals and programs.

Then take a quick look at the homepages for AphasiaBank, BilingBank, CHILDES, SLABank, PhonBank, CABank, and other banks.  Finally, look at the information about Google Group mailing lists and membership.

2.2      Downloading Materials

By default, CLAN materials will download to your desktop.  You can then download additional materials, such as the manuals. Rather than printing out the long manuals, it is best to keep them in your /work folder and access them through Adobe Reader.

2.3      Using the Browsable Database

The Browsable Database facility allows you to play back transcripts with linked media directly from your browser.  Here are the steps to follow:

1.     Click on the Browsable Database link on the CHILDES homepage.  When the new page opens, glance over the instructions.  You can always come back to read these in detail later.

2.     In the left column, click on Eng-UK / Forrester / biggirl.cha.

3.     Study the display to get a sense of what a CHAT file looks like.  There are headers for the first 11 lines and then the dialog begins on line 12.  *E: is the child Ella and *F: is her father.

4.     Each line is linked to the corresponding segment of the video and both will play back over the web. Place your cursor on the right arrow on line 11 and either click or press “s”. Usually, it takes a few seconds to establish the initial web connection, but playback is smooth after that.

5.     If the video is stopped, pressing the “s” key starts it. If the video is playing, pressing the “s” key stops it. You can just follow along with continuous playback or you can select certain segments to play.

2.4      Downloading Transcripts and Media

If you want to study transcripts more closely, you will probably want to download them, rather than playing through the Browsable Database.  Using the Forrester transcripts as an example, here is how you do this:

1.     At http://childes.talkbank.org, click on Index to Corpora, then Eng-UK and Forrester.  

2.     Click on the link called download transcripts and the .zip file will download to your computer.

3.     If it is not automatically unzipped, you should unzip it.

4.     To download the media, click on Link to Media Folder and you can then download individual videos one by one.  Downloading media takes a lot more time than downloading transcripts.

3       Tutorial

Once you have installed CLAN, you start it by double-clicking on its icon or its shortcut.

3.1      The Commands Window

After this, a window titled Commands opens and you can type commands into this window. If the window does not open automatically, then type Control-d (Windows) or ⌘-d (Macintosh). This window controls many of the functions of CLAN. It remains active until the program is terminated. The main components of the Commands window are the command box in the center and the several buttons. There is also some text in the bottom line giving you the date when your version of CLAN was compiled.

3.1.1           Setting the Working Directory

The first thing you need to do when running CLAN is to set the working directory. The working directory is the place where the files you would like to work with are located. For this tutorial, we will use the CLAN library directory as both our Working directory and our Library directory. To set the working directory:

1.     Download the examples.zip file from this URL: http://talkbank.org/examples.zip.

2.     After downloading, you should have a folder on your desktop called examples.

3.     Press the working button in the Commands window and select the examples directory inside the CLAN directory as your working directory by pressing the Select Current Directory button.

After selecting your working directory, you will return to the Commands window. The directory you selected will be listed to the right of the working button. This is useful because you will always know what directory you are working in without having to leave the Commands window.  You can also double-click on the actual name of the working directory to see and go back to other directories you have recently visited.

By default, CLAN sets your LIB directory to the /lib folder in the CLAN distribution.  You typically also do not have to worry about setting your output directory, because it will be the same as your working directory. To test your installation, type the command “freq sample.cha” into the Commands window. Then either hit the return key or press the Run button. You should get the following output in the CLAN Output window.

> freq sample.cha

freq sample.cha

Tue Aug  7 15:51:12 2007

freq (03-Aug-2007) is conducting analyses on:

  ALL speaker tiers

****************************************

From file <sample.cha>

  1 a

  1 any

  1 are

  5 chalk

  1 delicious

  1 don't

  ---- (more lines here)

  2 you

------------------------------

   32  Total number of different word types used

   51  Total number of words (tokens)

0.627  Type/Token ratio

The output continues down the page.  The exact shape of this window will depend on how you have sized it. 

3.1.2           The Recall Button

If you want to see some of your old commands, you can use the recall function. Just hit the Recall button and you will get a window of old commands. The Recall window contains a list of the last 20 commands entered in the Commands window. These commands can be automatically entered in the Commands window by double-clicking on the line. This is particularly useful for repetitive tasks and tracking command strings. Another way to access previously used commands is by using the up arrow (↑) on the keyboard. This will enter the previous command into the Commands window each time the key is pressed.

3.1.3           The ? Button

Pressing the ? button can give you some basic information about file and directory commands that you may find useful. You enter these commands into the command box. For example, typing dir into the Commands window will list the files in your working directory.

3.1.4           The Progs Menu

The Progs menu gives you a list of CLAN commands you can run. Try clicking this button and then selecting the FREQ command. The name of the command will then be inserted into the Commands window.

3.1.5           The FILE IN Button

Once you have selected the FREQ command, you now see that the File In button will be available. Click on this button and you will get a dialog that asks you to locate some input files in your working directory. The files on the left are the items in your working directory. The files on the right will be the ones used for analysis. The Remove button that appears under the Files for Analysis scrolling list is used to eliminate files from the selected data set. The Clear button removes all the files you have added. The radio button at the bottom right allows you to see only *.cha and *.cex files, if you wish.  When you are finished adding files for analysis, hit Done. After the files are selected and you have returned to the Commands window, an @ is appended onto the command string.  This symbol represents the set of files listed.  In this case, the @ represents the single file “sample.cha”.

3.1.6           The TIERS Button

This button will allow you to restrict your analysis to a certain participant.  For this example, we will restrict our analysis to the child, who is coded as *CHI in the transcript, so we type “CHI” into the Tier Option dialog, leaving the button for “main tier” selected. 

At this point, the command being constructed in the Commands window should look like this:

freq @ +t*CHI

If you hit the RUN button at the bottom right of the Commands window, or if you just hit a carriage return, the FREQ program will run and will display the frequencies of the six words the child is using in this sample transcript.

3.2      Typing Command Lines

There are two ways to build up commands. You can build commands using buttons and menus. This method only provides access to the most basic options, but you will find it useful when you are beginning. Alternatively, you can just type in commands directly to the Commands window. Let us try entering a command just by typing. Suppose we want to run an MLU analysis on the sample.cha file.  Let us say that we also want to restrict the MLU analysis so that it looks only at the child’s utterances. To do this, we enter the following command into the window:

mlu +t*CHI sample.cha

In this command line, there are three parts. The first part gives the name of the command; the second part tells the program to look at only the *CHI lines; and the third part tells the program which file to analyze as input.

If you press the return key after entering this command, you should see a CLAN Output window that gives you the result of this MLU analysis. This analysis is conducted, by default, on the %mor line which was generated by the MOR program.  If a file does not have this %mor line, then you will need to use other forms of the MLU command that only count utterances in words.  Also, you will need to learn how to use the various options, such as +t or +f. One way to learn the options is to use the various buttons in the graphic user interface as a way of learning what CLAN can do. Once you have learned these options, it is often easier to just type in this command directly. However, in other cases, it may be easier to use buttons to locate rare options that are hard to remember. The decision of whether to type directly or to rely on buttons is one that is left to each user. 

What if you want to send the output to a permanent file and not just to the temporary CLAN Output window? To do this you add the +f switch:

mlu +t*CHI +f sample.cha

Try entering this command, ending with a carriage return. You should see a message in the CLAN Output window telling you that a new file called sample.mlu.cex has been created. If you want to look at that file, type Control-O (Windows) or ⌘-o (Mac) for Open File and you can use the standard navigation window to locate the sample.mlu.cex file. It should be in the same directory as your sample.cha file.

You do not need to worry about the order in which the options appear. In fact, the only order rule that is used for CLAN commands is that the command name must come first. After that, you can put the switches and the file name in any order you wish.
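The ordering rule can be pictured with a small Python sketch. This is only an illustration of the rule stated above, not CLAN's actual parser: the function name split_clan_command is hypothetical, and the sketch assumes that no option contains a space.

```python
def split_clan_command(line):
    """Split a CLAN-style command line into (name, options, files).

    Illustrative only: the first token is the command name, tokens that
    start with + or - are treated as options, and the rest as file names.
    """
    tokens = line.split()
    name, rest = tokens[0], tokens[1:]
    options = [t for t in rest if t[0] in "+-"]
    files = [t for t in rest if t[0] not in "+-"]
    return name, options, files


first = split_clan_command("mlu +t*CHI +f sample.cha")
second = split_clan_command("mlu sample.cha +f +t*CHI")

print(first)
# Both orderings carry the same command name, options, and file.
print(first[0] == second[0], sorted(first[1]) == sorted(second[1]), first[2] == second[2])
```

Both command lines contain the same pieces, which is why the order of the switches and the file name does not matter.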

3.2.1           Wildcards

A wildcard uses the asterisk symbol (*) to take the place of something else. For example, if you want to run this command across a group of ten files all ending with the extension .cha, you can enter the command in this form:

mlu +tCHI +f *.cha

Wildcards can be used to refer to a group of files (*.cha), a group of speakers (CH*), or a group of words with a common form (*ing). To see how these could work together, try out this command:

freq *.cha +s"*ing"

This command runs the FREQ program on all the .cha files in the LIB directory and looks for all words ending in “-ing.” The output is sent to the CLAN Output window and you can set your cursor there and scroll back and forth to see the output. You can print this window or you can save it to a file.
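If you want a feel for how the asterisk behaves, the following Python sketch mimics the three uses described above with the standard fnmatch module. It is an analogy only: CLAN uses its own matcher, and the file, speaker, and word lists here are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical lists, used only to show what each pattern would select.
files = ["sample.cha", "0042.cha", "notes.txt"]
speakers = ["CHI", "CHA", "MOT"]
words = ["jumping", "bunny", "reading"]

cha_files = [f for f in files if fnmatchcase(f, "*.cha")]     # a group of files
ch_speakers = [s for s in speakers if fnmatchcase(s, "CH*")]  # a group of speakers
ing_words = [w for w in words if fnmatchcase(w, "*ing")]      # words with a common form

print(cha_files)
print(ch_speakers)
print(ing_words)
```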

3.2.2           Output Files

When you run the command

mlu +f sample.cha

the program will create an output file with the name sample.mlu.cex. It drops the .cha extension from the input file and then adds a two-part extension to indicate which command has run (.mlu) and the fact that this is a CLAN output file (.cex). If you run this command repeatedly, it will create additional files such as sample.ml0.cex, sample.ml1.cex, sample.ml2.cex, and the like. You can add up to three letters after the +f switch, as in:

mlu +fmot sample.cha

If you do this, the output file will have the name “sample.mot.cex.” As an example of a case where this would be helpful, consider how you might want to have a group of output files for the speech of the mother and another group for the speech of the father. The mother’s files would be named *.mot.cex and the father’s files would be named *.fat.cex.
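The naming rule can be summarized in a short Python sketch. The helper output_name and the DEFAULT_EXT table are hypothetical; the table lists only the two extensions mentioned in this tutorial (.mlu for MLU and .frq for FREQ), and the sketch does not model the numbered names created by repeated runs.

```python
from pathlib import Path

# Extensions mentioned in this tutorial; other commands are not covered here.
DEFAULT_EXT = {"mlu": "mlu", "freq": "frq"}


def output_name(input_file, command, letters=""):
    """Build the output file name that the +f switch would produce."""
    stem = Path(input_file).stem                      # drop the .cha extension
    ext = letters[:3] if letters else DEFAULT_EXT[command]
    return f"{stem}.{ext}.cex"                        # two-part extension


plain = output_name("sample.cha", "mlu")             # as in: mlu +f sample.cha
mother = output_name("sample.cha", "mlu", "mot")     # as in: mlu +fmot sample.cha
print(plain)
print(mother)
```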

3.2.3           Redirection

Instead of using the +f switch for output, you may sometimes want to use the redirect symbol (>). This symbol sends all of the outputs to a single file. The individual analysis of each file is preserved and grouped into one output file that is named in the command string. There are three forms of redirection, as illustrated in the following examples:

freq sample.cha > myanalyses

freq sample.cha >> myanalyses

freq sample.cha >& myanalyses

These three forms have slightly different results.

1.     The single arrow overwrites material already in the file.

2.     The double arrow appends new material to the file, placing it at the end of material already in the file.

3.     The single arrow with the ampersand writes both the analyses of the program and various system messages to the file.

If you want to analyze a whole collection of files and send the output from each to a separate file, use the +f switch instead.
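The difference between overwriting and appending is the same one you meet when opening files in most programming languages. The following Python sketch is an analogy under that assumption; it does not run CLAN, the helper names save and read are hypothetical, and the >& form is not modeled.

```python
import os
import tempfile


def save(path, text, append=False):
    """Write text to path, replacing the file (like >) or extending it (like >>)."""
    mode = "a" if append else "w"
    with open(path, mode, encoding="utf-8") as out:
        out.write(text)


def read(path):
    with open(path, encoding="utf-8") as source:
        return source.read()


path = os.path.join(tempfile.mkdtemp(), "myanalyses")

save(path, "first run\n")
save(path, "second run\n")               # like > : earlier material is replaced
after_overwrite = read(path)

save(path, "third run\n", append=True)   # like >> : new material goes at the end
after_append = read(path)

print(repr(after_overwrite))
print(repr(after_append))
```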

3.3      Sample Runs

Now we are ready to try out a few sample runs with the five most basic CLAN commands: KWAL, FREQ, MLU, COMBO, and GEM.

3.3.1           Sample KWAL Run

KWAL searches data for user-specified words and outputs those keywords in context. The +s option is used to specify the words to be searched. The context or cluster is a combination of main tier and the selected dependent tiers in relation to that line. The following command searches for the keyword “bunny” and shows both the two sentences preceding it and the two sentences following it in the output.  To access the 0042.cha file, you need to change your working directory to the /transcripts folder inside the examples folder.

kwal +sbunny -w2 +w2 0042.cha

The -w and +w options indicate how many lines of text should be included before and after the search words. A segment of the output looks as follows:

----------------------------------------

*** File "0042.cha": line 2724. Keyword: bunny

*CHI:   0 .

*MOT:   see ?

*MOT:   is the bunny rabbit jumping ?

*MOT:   okay .

*MOT:   wanna [: want to] open the book ?

----------------------------------------

If you triple-click on the line with the three asterisks, the whole original transcript will open up with that line highlighted.  Repetitions and retracing will be excluded by default unless you add the +r6 switch to the command.
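The idea of a keyword with a window of context can be sketched in a few lines of Python. This is a simplified illustration, not KWAL itself: the function keyword_in_context is hypothetical, it looks only at main tiers held in a list, and it matches whole words exactly.

```python
def keyword_in_context(lines, keyword, before=2, after=2):
    """Return one cluster of lines for every line whose text contains keyword."""
    clusters = []
    for i, line in enumerate(lines):
        _speaker, _, text = line.partition(":")
        if keyword in text.split():
            start = max(0, i - before)
            clusters.append(lines[start:i + after + 1])
    return clusters


# The five lines shown in the sample KWAL output above.
transcript = [
    "*CHI:\t0 .",
    "*MOT:\tsee ?",
    "*MOT:\tis the bunny rabbit jumping ?",
    "*MOT:\tokay .",
    "*MOT:\twanna [: want to] open the book ?",
]

for cluster in keyword_in_context(transcript, "bunny"):
    print("\n".join(cluster))
```

Here the before and after parameters play the role of the -w2 and +w2 options in the command above.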

3.3.2           Sample FREQ Run

FREQ counts the frequencies of words used in selected files. It also calculates the type–token ratio typically used as a measure of lexical diversity. In its simplest mode, it generates an alphabetical list of all the words used by all speakers in a transcript along with the frequency with which these words occur. The following example looks specifically at the child’s tier. The output will be printed in the CLAN window in alphabetical order:

freq +t*CHI 0042.cha

In this file, the child uses the filler “uh” a lot, but that is ignored in the analysis.  The output for this command is:

> freq  +t*CHI 0042.cha

freq +t*CHI 0042.cha

Sat Jun 14 14:38:12 2014

freq (13-Jun-2014) is conducting analyses on:

  ONLY speaker main tiers matching: *CHI;

****************************************

From file <0042.cha>

Speaker: *CHI:

  1 ah

  2 bow+wow

  1 vroom@o

------------------------------

    3  Total number of different item types used

    4  Total number of items (tokens)

0.750  Type/Token ratio

A statistical summary is provided at the end. In the above example, there were a total of 4 words or tokens used with only 3 different word types. The type–token ratio is found by dividing the total of unique words by the total of words spoken. For our example, the type–token ratio would be 3 divided by 4 or 0.750.
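The same arithmetic can be checked with a short Python sketch. The token list is taken from the output above; the variable names are our own.

```python
from collections import Counter

# The child's four tokens from the FREQ output above.
tokens = ["ah", "bow+wow", "bow+wow", "vroom@o"]

counts = Counter(tokens)          # frequency of each word type
types = len(counts)               # number of different word types
ttr = types / len(tokens)         # type-token ratio

for word in sorted(counts):
    print(f"{counts[word]:3d} {word}")
print(f"{types} types, {len(tokens)} tokens, TTR = {ttr:.3f}")
```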

The +f option can be used to save the results to a file. CLAN will automatically add the .frq.cex extension to the new file it creates. By default, FREQ excludes the strings xxx, yyy, www, as well as any string immediately preceded by one of the following symbols: 0, &, +, -, #. However, FREQ includes all retraced material unless otherwise commanded. For example, given this utterance:

*CHI: the dog [/] dog barked.

FREQ would give a count of two for the word “dog,” and one each for the words “the” and “barked.” If you wish to exclude retraced material, use the +r6 option. To learn more about the many variations in FREQ, read the section devoted specifically to this useful command.
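To make the effect of retraced material concrete, here is a rough Python sketch of the counting described above. It is not the FREQ algorithm: the function count_words is hypothetical, it handles only a single-word repetition marked with [/], and it ignores all other CHAT codes.

```python
from collections import Counter

EXCLUDED_WORDS = {"xxx", "yyy", "www"}
EXCLUDED_PREFIXES = ("0", "&", "+", "-", "#")


def count_words(utterance, skip_retraced=False):
    """Count words in one utterance, optionally dropping the word before [/]."""
    tokens = utterance.rstrip(" .?!").split()
    kept = []
    for tok in tokens:
        if tok == "[/]":
            if skip_retraced and kept:
                kept.pop()        # drop the single word that is being repeated
            continue              # the marker itself is never counted
        kept.append(tok)
    words = [t for t in kept
             if t not in EXCLUDED_WORDS and not t.startswith(EXCLUDED_PREFIXES)]
    return Counter(words)


with_retrace = count_words("the dog [/] dog barked.")
without_retrace = count_words("the dog [/] dog barked.", skip_retraced=True)
print(with_retrace["dog"], without_retrace["dog"])
```

With retraced material included, "dog" is counted twice; with it excluded, "dog" is counted once.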

3.3.3           Sample MLU Run

The MLU command is used primarily to determine the mean length of utterance of a specified speaker. It also provides the total number of utterances and of morphemes in a file. The ratio of morphemes over utterances (MLU) is derived from those two totals. The following command would perform an MLU analysis on the mother’s tier (+t*MOT) from the file 0042.cha:

mlu +t*MOT 0042.cha

The output from this command looks like this:

> mlu +t*MOT 0042.cha

mlu +t*MOT 0042.cha

Sat Jun 14 14:41:48 2014

mlu (13-Jun-2014) is conducting analyses on:

  ONLY dependent tiers matching: %MOR;

****************************************

From file <0042.cha>

MLU for Speaker: *MOT:

  MLU (xxx, yyy and www are EXCLUDED from the utterance and morpheme counts):

   Number of: utterances = 511, morphemes = 1588

   Ratio of morphemes over utterances = 3.108

   Standard deviation = 2.214

Thus, we have the mother’s MLU or ratio of morphemes over utterances (3.108) and her total number of utterances (511).
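The ratio itself is simple division, as the following Python sketch shows. The totals come from the output above; the function mlu and the short list of per-utterance counts are hypothetical and serve only to show where the totals come from.

```python
from statistics import mean


def mlu(morphemes_per_utterance):
    """Mean length of utterance: total morphemes divided by number of utterances."""
    return mean(morphemes_per_utterance)


# Totals reported for *MOT in the sample output.
reported = round(1588 / 511, 3)
print(reported)

# A made-up set of four utterances with 2, 4, 3, and 5 morphemes.
toy = mlu([2, 4, 3, 5])
print(toy)
```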

3.3.4           Sample COMBO Run

COMBO is a powerful program that searches the data for specified combinations of words or character strings. For example, COMBO will find instances where a speaker says kitty twice in a row within a single utterance. The following command would search the mother’s tiers (+tMOT) of the specified file 0042.cha:

combo +tMOT +s"kitty^kitty" 0042.cha

Here, the string +tMOT selects the mother’s speaker tier only for analysis. When searching for a combination of words with COMBO, it is necessary to precede the combination with +s (e.g., +s"kitty^kitty") in the command line. The symbol ^ specifies that the word kitty is immediately followed by the word kitty. The output of the command used above is as follows:

> combo +tMOT +s"kitty^kitty" 0042.cha

kitty^kitty

combo +tMOT +skitty^kitty 0042.cha

Sat Jun 14 14:44:21 2014

combo (13-Jun-2014) is conducting analyses on:

  ONLY speaker main tiers matching: *MOT;

****************************************

From file <0042.cha>

----------------------------------------

*** File "0042.cha": line 3034.

*MOT:   (1)kitty (1)kitty kitty .

----------------------------------------

*** File "0042.cha": line 3111.

*MOT:   and (1)kitty (1)kitty .

    Strings matched 2 times
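To make the meaning of the ^ operator concrete, here is a small Python sketch of an "immediately followed by" test on whitespace-separated words. It only illustrates the idea and does not show how COMBO is implemented; the function name and the third sample utterance are hypothetical.

```python
def has_immediate_sequence(utterance, first, second):
    """Return True if `first` is immediately followed by `second`."""
    words = utterance.split()
    # Compare each word with the word right after it.
    return any(a == first and b == second for a, b in zip(words, words[1:]))

# The first two utterances are modelled on the COMBO output above;
# the third is a made-up non-matching example.
utterances = [
    "kitty kitty kitty .",
    "and kitty kitty .",
    "the kitty is here .",
]
matches = [u for u in utterances if has_immediate_sequence(u, "kitty", "kitty")]
print(len(matches), "utterances matched")
```

As in the COMBO output, each matching utterance is counted once, even when the pair occurs more than once inside it.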

3.3.5           Sample GEM and GEMFREQ Runs

GEM and GEMFREQ look at previously tagged selections or “gems” within larger transcripts for further analyses. For example, we might want to divide the transcript by different social situations or activities.  In the 0012.cha file, there are gem markers delineating the segment of the transcript that involves book reading, using the code word “book”.  By dividing the transcripts in this manner, separate analyses can be conducted on each situation type. Once this is done, you can use this command to compute a frequency analysis for material in these segments:

gemfreq +t*CHI +sbook 0012.cha

The output is as follows:

> gemfreq +sbook +t*CHI 0012.cha

gemfreq +sbook +t*CHI 0012.cha

Sat Jun 14 14:54:19 2014

gemfreq (13-Jun-2014) is conducting analyses on:

  ONLY speaker main tiers matching: *CHI;

  and ONLY header tiers matching: @BG:; @EG:;

****************************************

From file <0012.cha>

 24 tiers in gem "book":

      2 kitty

      2 no+no

      2 oh

      2 this

GEM and GEMFREQ are particularly useful in corpora such as the AphasiaBank transcripts.  In these, each participant does a retell of the Cinderella story that is marked with @G: Cinderella.  Using the three Kempler files, the following command will create three new files with only the Cinderella segment:

gem +sCinderella +n +d1 +t*PAR +t%mor +f *.cha

You can then run further programs such as MLU or FREQ on these shorter files.
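The idea of pulling out a tagged segment can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the GEM program: it handles only segments bracketed by begin-gem and end-gem headers, it matches the header names case-insensitively, and the tiny transcript is invented for the example.

```python
def extract_gem(lines, label):
    """Collect the lines between the begin-gem and end-gem headers for `label`."""
    inside = False
    kept = []
    for line in lines:
        parts = line.split(None, 1)
        head = parts[0].upper() if parts else ""
        rest = parts[1].strip() if len(parts) > 1 else ""
        if head == "@BG:" and rest == label:
            inside = True          # segment starts after this header
        elif head == "@EG:" and rest == label:
            inside = False         # segment ends before this header
        elif inside:
            kept.append(line)
    return kept

# Hypothetical transcript fragment.
transcript = [
    "*MOT:\tlet's read .",
    "@Bg:\tbook",
    "*CHI:\tkitty .",
    "*MOT:\tyes .",
    "@Eg:\tbook",
    "*CHI:\tno .",
]
print(extract_gem(transcript, "book"))
```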

3.4      Advanced Commands

This section provides a series of CLAN commands designed to illustrate a fuller range of options available in some of the most popular CLAN commands.  With a few exceptions, the commands are designed to run on the Adler directory included in the examples.zip distribution. So, you should begin by opening CLAN and setting your working directory to the Adler folder.  Each command is followed by an English-language explanation of the meaning of each of the terms in the command, translating in order from left to right.  You should test out each command and study its results.  To save typing, you can copy each command from this document, paste it into the CLAN Commands window, and then hit a carriage return.

Run KWAL on the Participant looking for "slipper" in all the files:

kwal +t*PAR +s"slipper*" *.cha

Run KWAL on the Participant looking for "because" in adler23a.cha:

kwal +t*PAR +s"because" adler23a.cha

Run KWAL on the Participant looking specifically for "because" transcribed as (be)cause (produced as "cause") in adler23a.cha:

kwal +t*PAR +s"(be)cause" adler23a.cha +r2

Run KWAL on the Participant on the list of words in the whwords.cut in adler23a.cha. (The whwords.cut file is in the examples/pos folder).

kwal +t*PAR +s@whwords.cut adler23a.cha

Run KWAL on the Participant to exclude utterances coded with the post-code [+ exc] and create new files in legal CHAT format for all the files:

kwal -s"[+ exc]" +d  +t*PAR +t%mor +t@ +f *.cha

Run COMBO on the Participant to find all sequences of "fairy" followed immediately by "godmother" and combine the results from all the files into a single file:

combo +t*PAR +sfairy^godmother +u *.cha

Run COMBO on the Participant's %mor tier to find all combinations of infinitive and verb in adler01a.cha:

combo +s"inf|*^v|*" +t*PAR +t%mor adler01a.cha

Run MAXWD on the Participant to get the longest utterance in words in all files:

maxwd +g2 +t*PAR *.cha

Run EVAL on the Participant to get a spreadsheet with summary data (duration, MLU, TTR, % word errors, # utterance errors, % various parts of speech, # repetitions, and # revisions) in all the files. Add +o4 to get output in raw numbers instead of percentages.

eval +t*PAR +u *.cha

This program is similar to EVAL, but tailored for child data:

kideval +t*PAR +leng *.cha

Run MLU on the Participant, creating one spreadsheet for all files. Add -b to get mlu in words:

mlu +t*PAR +d +u *.cha

Run MLT on the Participant, creating one spreadsheet for all files. MLT counts utterances and words on a line that may include xxx (unlike MLU):

mlt +t*PAR +d *.cha

Run TIMEDUR on the Participant, creating a spreadsheet with ratio of words and utterances over time duration for all files:

timedur +t*PAR +d10 *.cha

Run GEM on the Participant, including the %mor line, using the “Sandwich” gem with lazy gem marking, outputting legal CHAT format for adler07a.cha:

gem +t*PAR +t%mor +sSandwich +n +d1  adler07a.cha

Same thing, excluding irrelevant lines:

gem +t*PAR +t%mor +sSandwich +n +d1  -s"[+ exc]" adler07a.cha

Run GEM on the Participant main tier and %mor tier for the Sandwich “gem”, using lazy gem marking, creating a new file in legal CHAT format called "Sand" for all Adler files:

gem +t*PAR +t%mor +sSandwich +n +d1 +fSand *.cha

Run VOCD on the Participant, output to spreadsheet only, and exclude repetitions and revisions in all the files:

vocd +t*PAR +d3 +r6 *.cha

Run CHIP to compare the Mother and the Child in terms of utterance overlaps with both the previous speaker (%chi and %adu, echoes) and their own previous utterances (%csr and %asr, self-repetitions) in chip.cha (the chip.cha file is in examples/progs):

chip +bMOT +cCHI chip.cha

Same thing, but excluding printing of the results for the self-repetitions:

chip +bMOT +cCHI -ns chip.cha

The next commands all use the FREQ program to illustrate various options.

Run FREQ on the Participant tier and get output in order of descending frequency for adler01a.cha:

freq +t*PAR +o adler01a.cha

Run FREQ on the Participant tier and send output to a spreadsheet for adler01a.cha. To open the spreadsheet, triple-click on stat.frq.xls:

freq +t*PAR +d2 adler01a.cha

Same, on all the files in Adler:

freq +t*PAR +d2 *.cha

Same, but only include Anomics:

freq +t@"ID=*|Anomic|*" +d2 *.cha

Run FREQ on the Participant tier and get type token ratio only in a spreadsheet for adler01a.cha:

freq +t*PAR +d3 adler01a.cha

Run FREQ on the Participant %mor tier and not the Participant speaker tier and get output in order of descending frequency for adler01a.cha:

freq +t%mor +t*PAR -t* +o adler01a.cha

Run FREQ on the Participant %mor tier for stems only (happily and happier = happy) and get output in order of descending frequency for adler01a.cha:

freq +t*PAR +t%mor -t* +s"@r-*,o-%" +o adler01a.cha

Learn how to use the +s switch for analysis of the %mor line

freq +sm

Learn how to use the +s switch for analysis of the %gra line

freq +sg

Run FREQ on the Participant tier, include fillers "uh" and "um", and get output in order of descending frequency for adler01a.cha:

freq +t*PAR +s+&uh +s+&um +o adler01a.cha

Run FREQ on the Participant tier and count instances of unintelligible jargon for adler01a.cha:

freq +t*PAR +s"xxx" adler01a.cha

Same, but adding +d to see the actual place of occurrence, then triple-click on any line that has a file name to open the original:

freq +t*PAR +s"xxx" +d adler01a.cha

Run FREQ on the Participant tier, counting instances of gestures for adler01a.cha:

freq +t*PAR +s&=ges* adler01a.cha

Run FREQ on the Participant tier, including repetitions and revisions, excluding neologisms (nonword:unknown target), and getting output in order of descending frequency for adler01a.cha. Add +d6 to include error production info. Add +d4 for type token info only.

freq +t*PAR +r6 -s"<*\* n:uk*>" +o adler01a.cha

Run FREQ on the Participant, searching for a list of words in a *.cut file with multiple words searched per line, where multiple words do not have to be found in consecutive alignment, but must be in the same utterance, and merging output across all files:

freq +t*PAR +s@0list.cut +c3 +u *.cha

Same with +d added for outputting the original utterance:

freq +t*PAR +s@0list.cut +c3 +u +d *.cha

Here are some additional switches for making specific exclusions:

·      add -s*\** if you want to exclude words that were produced in error (coded with any of the [* errorcodes] on the main tier)

·      add +r5 if you want to exclude any text replacements (horse [: dog], beds [: breads])

·      add +r6 if you want to include repetitions and revisions

The final section in the description of the FREQ command gives many further detailed examples of how to use FREQ with the %mor and %gra tier.

3.5      Exercises

This section presents exercises designed to help you think about the application of CLAN for specific aspects of language analysis. The illustrations in the section below are based on materials developed by Barbara Pan originally published in Chapter 2 of Sokolov and Snow (1994).  They are included in the /transcripts/ne20 and /transcripts/ne32 folders in the examples.zip file you downloaded. The original text has been edited to reflect subsequent changes in the programs and the database.  Barbara Pan devised the initial form of this extremely useful set of exercises and kindly consented to their inclusion here.

One approach to transcript analysis focuses on the computation of certain measures or scores that characterize the stage of language development in the children or adults in the sample.

1.     One popular measure (Brown, 1973) is the MLU or mean length of utterance, which can be computed by the MLU program.

2.     A second measure is the MLU of the five longest utterances in a sample, or MLU5. Wells (1981) found that increases in MLU of the five longest utterances tend to parallel those in MLU, with both levelling off after about 42 months of age. Brown suggested that MLU of the longest utterance tends, in children developing normally, to be approximately three times greater than MLU.

3.     A third measure is MLT or Mean Length of Turn, which can be computed by the MLT program.

4.     A fourth popular measure of lexical diversity is the type–token ratio of Templin (1957).

In these exercises, we will use CLAN to generate these four measures of spontaneous language production for a group of normally developing children at 20 months. The goals are to use data from a sizeable sample of normally developing children to inform us as to the average (mean) performance and degree of variation (standard deviation) among children at this age on each measure; and to explore whether individual children's performance relative to their peers was constant across domains. That is, were children whose MLU was low relative to their peers also low in terms of lexical diversity and conversational participation? Conversely, were children with relatively advanced syntactic skills as measured by MLU also relatively advanced in terms of lexical diversity and the share of the conversational load they assumed?

The speech samples analyzed here are taken from the New England corpus of the CHILDES database, which includes longitudinal data on 52 normally developing children. Spontaneous speech of the children interacting with their mothers was collected in a play setting when the children were 14, 20, and 32 months of age. Transcripts were prepared according to the CHAT conventions of the Child Language Data Exchange System, including conventions for morphemicizing speech, such that MLU could be computed in terms of morphemes rather than words. Data were available for 48 of the 52 children at 20 months. The means and standard deviations for MLU5, TTR, and MLT reported below are based on these 48 children. Because only 33 of the 48 children produced 50 or more utterances during the observation session at 20 months, the mean and standard deviation for MLU50 is based on 33 subjects.

For illustrative purposes, we will discuss five children: the child whose MLU was the highest for the group (68.cha), the child whose MLU was the lowest (98.cha), and one child each at the first (66.cha), second (55.cha), and third (14.cha) quartiles. Transcripts for these five children at 20 months can be found in the /ne20 directory in the examples.zip file found at http://talkbank.org/examples.zip.

Our goal is to compile the following basic measures for each of the five target children: MLU on 50 utterances, MLU of the five longest utterances, TTR, and MLT. We then compare these five children to their peers by generating z-scores based on the means and standard deviations for the available sample for each measure at 20 months. In this way, we will generate language profiles for each of our five target children.

3.5.1           MLU50 Analysis

The first CLAN analysis we will perform involves calculating MLU for each child on a sample of 50 utterances. By default, the MLU program runs on the %mor line that is already present in these files.  This means that it computes the mean length of utterance in terms of morphemes, not words.  Also by default, the MLU program excludes the strings xxx, yyy, www, as well as any string immediately preceded by one of the following symbols: 0, &, +, -, #, $, or : (see the CHAT manual for a description of transcription conventions). The MLU program also excludes from all counts material in angle brackets followed by [/], [//], or [% bch] (see the CLAN manual for a list of symbols CLAN considers to be word, morpheme, or utterance delimiters). Remember that to perform any CLAN analysis, you need to be in the directory where your data is when you issue the appropriate CLAN command. In this case, we want to be in the /transcripts/ne20 folder in the examples.zip file that you downloaded from http://talkbank.org/examples.zip.

The command string we used to compute MLU for all five children is:

mlu +t*CHI +z50u +f *.cha

+t*CHI        Analyze the child speaker tier only

+z50u         Analyze the first 50 utterances only

+f            Save the results in a file

*.cha         Analyze all files ending with the extension .cha

The only constraint on the order of elements in a CLAN command is that the name of the program (here, MLU) must come first. Many users find it good practice to put the name of the file on which the analysis is to be performed last, so that they can tell at a glance both what program was used and what file(s) were analyzed. Other elements may come in any order.

The option +t*CHI tells CLAN that we want only CHI speaker tiers considered in the analysis. Were we to omit this string, a composite MLU would be computed for all speakers in the file.

The option +z50u tells CLAN to compute MLU on only the first 50 utterances. We could, of course, have specified the child’s first 100 utterances (+z100u) or utterances from the 51st through the 100th (+z51u-100u). With no +z option specified, MLU is computed on the entire file.

The option +f tells CLAN that we want the output recorded in output files, rather than simply displayed onscreen. CLAN will create a separate output file for each file on which it computes MLU. If we wish, we may specify a three-letter file extension for the output files immediately following the +f option in the command line. If a specific file extension is not specified, CLAN will assign one automatically. In the case of MLU, the default extension is .mlu.cex.  The .cex at the end is mostly important for Windows, since it allows the Windows operating system to know that this is a CLAN output file.

Finally, the string *.cha tells CLAN to perform the analysis specified on each file ending in the extension .cha found in the current directory. To perform the analysis on a single file, we would specify the entire file name (e.g., 68.cha). It was possible to use the wildcard * in this and following analyses, rather than specifying each file separately, because all the files to be analyzed ended with the same file extensions and were in the same directory; and in each file, the target child was identified by the same speaker code (i.e., CHI), thus allowing us to specify the child’s tier by means of +t*CHI.

Utilization of wildcards whenever possible is more efficient than repeatedly typing in similar commands. It also cuts down on typing errors. For illustrative purposes, let us suppose that we ran the above analysis on only a single child (68.cha), rather than for all five children at once (by specifying *.cha). We would use the following command:

mlu +t*CHI +z50u 68.cha

The output for this command would be as follows:

> mlu +t*CHI +z50u 68.cha

mlu +t*CHI +z50u 68.cha

Tue Jun 24 17:15:38 2014

mlu (24-Jun-2014) is conducting analyses on:

  ONLY dependent tiers matching: %MOR;

****************************************

From file <68.cha>

MLU for Speaker: *CHI:

  MLU (xxx, yyy and www are EXCLUDED from the utterance and morpheme counts):

   Number of: utterances = 50, morphemes = 133

   Ratio of morphemes over utterances = 2.660

   Standard deviation = 1.595

MLU reports the number of utterances (in this case, the 50 utterances we specified), the number of morphemes that occurred in those 50 utterances, the ratio of morphemes over utterances (MLU in morphemes), and the standard deviation of utterance length in morphemes. The standard deviation statistic gives some indication of how variable the child’s utterance length is. This child’s average utterance is 2.660 morphemes long, with a standard deviation of 1.595 morphemes.

Check line 1 of the output for typing errors in entering the command string. Check lines 3 and possibly 4 of the output to be sure the proper speaker tier and input file(s) were specified. Also, check to be sure that the number of utterances or words reported is what was specified in the command line. If CLAN finds that the transcript contains fewer utterances or words than the number specified with the +z option, it will still run the analysis but will report the actual number of utterances or words analyzed.

3.5.2           MLU5 Analysis

The second CLAN analysis we will perform computes the mean length in morphemes of each child’s five longest utterances. To do this, we will run MAXWD and then MLU on the output of MAXWD.  By default, MAXWD runs on the %mor line, rather than the main line.

maxwd +t*CHI +g1 +c5 +d1 68.cha

+g1          Identify the longest utterances in terms of morphemes

+c5         Identify the five longest utterances

+d1        Output the data in CHAT format

68.cha    The child language transcript to be analyzed

The output of this command is:

*CHI:   <I want to see the other box> [?] .

%mor:   pro:sub|I v|want inf|to v|see det|the qn|other n|box .

*CHI:   there's a dolly in there [= box] .

%mor:   pro:exist|there~cop|be&3S det|a n|doll-DIM

        prep|in pro:dem|there .

*CHI:   that's [= book] the “morning (.) noon and night” .

%mor:   pro:dem|that~cop|be&3S det|the bq|bq n|morning

        n|noon coord|and n|night eq|eq .

*CHI:   it's [= contents of box] crayons and paper .

%mor:   pro|it~aux|be&3S n|crayon-PL coord|and n|paper .

*CHI:   pop g(o)es the weasel .

%mor:   n|pop v|go-3S det|the n|weasel .

We can analyze these results directly for MLU by adding the “piping” symbol which sends the above results from MAXWD directly to MLU:

maxwd +t*CHI +g1 +c5 +d1 68.cha | mlu

The results of this full command are then:

From pipe input

MLU for Speaker: *CHI:

  MLU (xxx, yyy and www are EXCLUDED from the utterance and morpheme counts):

   Number of: utterances = 5, morphemes = 32

   Ratio of morphemes over utterances = 6.400

   Standard deviation = 0.800
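The output reports only totals, so the per-utterance morpheme counts are not visible. As a rough check on how the ratio and standard deviation relate, the Python sketch below uses a hypothetical split of the 32 morphemes over the five utterances (7, 7, 7, 6, 5) and assumes a population standard deviation. Under those assumptions the figures agree with the output above, but this is an illustration rather than a description of how MLU computes them.

```python
from statistics import mean, pstdev

# Hypothetical per-utterance morpheme counts; only their sum (32) and
# count (5) are taken from the MLU output above.
lengths = [7, 7, 7, 6, 5]

ratio = mean(lengths)    # morphemes over utterances
spread = pstdev(lengths) # population standard deviation (an assumption)

print(f"utterances = {len(lengths)}, morphemes = {sum(lengths)}")
print(f"ratio = {ratio:.3f}, standard deviation = {spread:.3f}")
```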

The procedure for obtaining output files in CHAT format differs from program to program but it is always the +d option that performs this operation. You must check the +d options for each program to determine the exact level of the +d option that is required. We can create a single file to run this type of analysis.  This is called a batch file.  The batch file for this analysis would be:

maxwd +t*CHI +g1 +c5 +d1 14.cha | mlu > 14.ml5.cex

maxwd +t*CHI +g1 +c5 +d1 55.cha | mlu > 55.ml5.cex

maxwd +t*CHI +g1 +c5 +d1 66.cha | mlu > 66.ml5.cex

maxwd +t*CHI +g1 +c5 +d1 68.cha | mlu > 68.ml5.cex

maxwd +t*CHI +g1 +c5 +d1 98.cha | mlu > 98.ml5.cex

To run all five commands in sequence automatically, we put the batch file in our working directory with a name such as batchml5.cex and then enter the command

batch batchml5

This command will produce five output files. 
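When a batch file repeats the same command for many transcripts, its lines can be generated rather than typed. The following Python sketch builds the five lines shown above from a list of file names; it is a convenience outside CLAN, and the variable names are illustrative.

```python
# The five transcripts analyzed in this exercise.
files = ["14.cha", "55.cha", "66.cha", "68.cha", "98.cha"]

lines = []
for name in files:
    stem = name.rsplit(".", 1)[0]  # "68.cha" -> "68"
    lines.append(f"maxwd +t*CHI +g1 +c5 +d1 {name} | mlu > {stem}.ml5.cex")

# One command per line, ready to be saved as the batch file.
batch_text = "\n".join(lines) + "\n"
print(batch_text, end="")
```

The printed text could then be saved in the working directory under a name such as batchml5.cex.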

You can also use variables designated as %1, %2 in the lines inside a batch file to allow for the control of values from the command line.  For example, you could have these lines in a batch file called bat.cut:

mlu %1 %2

mlt %1 %2

freq %1

Then you could use this command, in which %1 is barry1.cha and %2 is +t*MOT

batch bat.cut barry1.cha +t*MOT

This single command would effectively run these three commands:

mlu barry1.cha +t*MOT

mlt barry1.cha +t*MOT

freq barry1.cha
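The substitution of %1 and %2 can be pictured with a short Python sketch. It mimics the expansion described above for illustration only; the function name is hypothetical, and the actual behavior of the batch command may differ in details not covered here.

```python
def expand_batch(template_lines, args):
    """Replace %1, %2, ... in each line with the corresponding argument."""
    expanded = []
    for line in template_lines:
        # Work from the highest number down so that %1 never clobbers %10.
        for i in range(len(args), 0, -1):
            line = line.replace(f"%{i}", args[i - 1])
        expanded.append(line)
    return expanded

template = ["mlu %1 %2", "mlt %1 %2", "freq %1"]
commands = expand_batch(template, ["barry1.cha", "+t*MOT"])
for command in commands:
    print(command)
```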

3.5.3           MLT Analysis

The third analysis we will perform is to compute MLT (Mean Length of Turn) for both child and mother. Note that, unlike the MLU program, the CLAN program MLT includes utterances transcribed as xxx and yyy in its utterance and turn counts, although these symbols are excluded from its word counts. Thus, utterances that consist of only unintelligible vocal material still constitute turns, as do nonverbal turns with only “0” on the main line.  We can use a single command to run our complete analysis and put all the results into a single file.

mlt *.cha > allmlt.cex

In this output file, the results for the mother in 68.cha are:

From file <68.cha>

MLT for Speaker: *MOT:

  MLT (xxx, yyy and www are EXCLUDED from the word counts, but are INCLUDED in utterance counts):

    Number of: utterances = 356, turns = 227, words = 1360

   Ratio of words over turns = 5.991

   Ratio of utterances over turns = 1.568

   Ratio of words over utterances = 3.820

There is similar output data for the child.  This output allows us to consider Mean Length of Turn either in terms of words per turn or utterances per turn. We chose to use words per turn in calculating the ratio of child MLT to mother MLT, reasoning that words per turn is likely to be sensitive for a somewhat longer developmental period. MLT ratio, then, was calculated as the ratio of child words/turn over mother words/turn. As the child begins to assume a more equal share of the conversational load, the MLT ratio should approach 1.00. For file 68.cha, this ratio is: 2.184 ÷ 5.991 = 0.365.
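The three ratios in the MLT output, and the MLT ratio derived from them, follow directly from the reported totals. Here is a minimal Python sketch of the arithmetic, using the mother's totals from the output above and the child's words-per-turn value quoted in the text; the variable names are illustrative.

```python
# Mother's totals from the MLT output for 68.cha.
utterances, turns, words = 356, 227, 1360

words_per_turn = words / turns
utterances_per_turn = utterances / turns
words_per_utterance = words / utterances

# Child's words per turn, as quoted in the text above.
child_words_per_turn = 2.184
mlt_ratio = child_words_per_turn / words_per_turn

print(f"words/turn = {words_per_turn:.3f}")
print(f"utterances/turn = {utterances_per_turn:.3f}")
print(f"words/utterance = {words_per_utterance:.3f}")
print(f"MLT ratio = {mlt_ratio:.3f}")
```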

3.5.4           TTR Analysis

The fourth CLAN analysis we will perform for each child is to compute the TTR or type–token ratio. For this we will use the FREQ command. By default, FREQ ignores the strings xxx (unintelligible speech) and www (irrelevant speech researcher chose not to transcribe). It also ignores words beginning with the symbols 0, &, +, -, or #. Here we were interested not in whether the child uses plurals or past tenses, but how many different vocabulary items she uses. Therefore, we wanted to count cats and cat as two tokens (i.e., instances) of the word-type cat. Similarly, we wanted to count play and played as two tokens under the word-type play. To make these distinctions correctly, we need to use MOR and POST to create a %mor line for our transcript.  The process of doing this is described in Chapter 11.  For now, we will assume that the transcripts already have this %mor line.  In that case, the command we use is:

 freq +t*CHI +s"@r-*,o-%" +f *.cha

+t*CHI  Analyze the child speaker only

+s"@r-*,o-%"           Search for roots or lemmas and ignore the rest

+f           Save output in a file

*.cha      Analyze all files ending with the extension .cha

The only new element in this command is +s"@r-*,o-%". The +s option tells FREQ to search for and count certain strings. The r-* part of this switch tells FREQ to look only at the roots or lemmas that follow the | symbol in the %mor line. The o-% part of the switch tells FREQ to ignore the rest of the material on the %mor line.  The output generated from this analysis goes into five files.  For the 68.cha input file, the output is 68.frq.cex.  At the end of this file, we find this summary analysis:

   83  Total number of different item types used

  244  Total number of items (tokens)

0.340  Type/Token ratio

We can look at each of the five output files to get this summary TTR information for each child.
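To see why counting roots changes the TTR, consider the Python sketch below. It pulls a rough stem out of each item on a %mor-style line (the text after | up to the first - or &) and then divides types by tokens. The extraction is a deliberately simplified assumption that ignores prefixes and compounds; it is not how FREQ parses the %mor line, and the sample line is invented.

```python
import re

def stems_from_mor(mor_line):
    """Return a rough list of stems from a %mor-style line (simplified)."""
    stems = []
    for token in mor_line.split():
        for part in token.split("~"):      # clitics are joined with ~
            if "|" not in part:
                continue                   # skip punctuation such as "."
            stem = part.split("|", 1)[1]   # text after the part of speech
            stems.append(re.split(r"[-&]", stem, maxsplit=1)[0])
    return stems

def type_token_ratio(items):
    """Number of different types divided by number of tokens."""
    return len(set(items)) / len(items)

# Invented example: cats/cat and played/play collapse to two types.
stems = stems_from_mor("n|cat-PL n|cat v|play-PAST v|play .")
print(stems, type_token_ratio(stems))

# The summary reported for 68.cha: 83 types over 244 tokens.
print(f"{83 / 244:.3f}")
```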

3.5.5           Generating Language Profiles

Once we have computed these basic measures of utterance length, lexical diversity, and conversational participation for our five target children, we need to see how each child compares to his or her peers in each of these domains. To do this, we use the means and standard deviations for each measure for the whole New England sample at 20 months, as given in the following table.

Measure         Mean     SD       Range
MLU50           1.406    0.360    1.00-2.66
MLU5 longest    2.936    1.271    1.00-6.40
TTR             0.433    0.108    0.255-0.611
MLT Ratio       0.189    0.089    0.034-0.438

 

The distribution of MLU50 scores was quite skewed, with most children who produced at least 50 utterances falling in the MLU range of 1.00-1.30. As noted earlier, 17 of the 48 children failed to produce even 50 utterances. At this age most children in the sample are essentially still at the one-word stage, producing few utterances of more than one word or morpheme. Like MLU50, the shape of the distributions for MLU5 and for the MLT ratio were somewhat skewed toward the lower end, though not as severely as was MLU50.

Z-scores, or standard scores, are computed by subtracting the sample mean score from the child’s score on a par­ticular measure and then dividing the result by the overall standard deviation:

(child's score - group mean) / standard deviation
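As a worked instance of this formula, the Python sketch below computes z-scores for child 68 on two of the measures, using the TTR (0.340) and MLT ratio (0.365) obtained in the preceding sections and the group means and standard deviations from the table above. The function name is illustrative.

```python
def z_score(score, group_mean, sd):
    """(child's score - group mean) / standard deviation"""
    return (score - group_mean) / sd

# Child 68's scores from the earlier sections, with group statistics
# taken from the table of means and standard deviations.
ttr_z = z_score(0.340, 0.433, 0.108)
mlt_ratio_z = z_score(0.365, 0.189, 0.089)

print(f"TTR z-score = {ttr_z:.2f}")
print(f"MLT ratio z-score = {mlt_ratio_z:.2f}")
```

A negative value means the child falls below the group mean on that measure, and a positive value means the child falls above it.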

The results of this computation are given in the following table.

Child    MLU50    MLU5     TTR      MLT Ratio
14        0.26     0.21     1.65    -0.16
55       -0.30    -0.15    -0.36    -0.53
66       -0.16    -0.11    -0.64    -0.84
68        2.30     2.72    -0.86     1.98
98       -0.96    -0.74    -0.63    -0.08

 

We would not expect to see radical departures from the group means on any of the measures. For the most part, this expectation is borne out: we do not see departures greater than 2 standard deviations from the mean on any measure for any of the five children, except for the particularly high MLU50 and MLU5 observed for Subject 68.

It is not the case, however, that all five of our target children have flat profiles. Some children show marked strengths or weaknesses relative to their peers in certain domains. For example, Subject 14, although very close to the mean in terms of utterance length (MLU50 and MLU5), shows marked strength in lexical diversity (TTR), even though she shoulders relatively little of the conversational burden (as measured by MLT ratio). Overall, Subject 68 seems advanced on all measures except TTR. The subjects at the second and third quartile in terms of MLU (Subject 55 and Subject 66) have profiles that are relatively flat: Their z-scores on each measure fall between -1 and 0. However, the child with the lowest MLU50 (Subject 98) again shows an uneven profile. Despite her limited production, she manages to bear her portion of the conversational load. You will recall that unintelligible vocalizations transcribed as xxx or yyy, as well as nonverbal turns indicated by the postcode [+ trn], are all counted in computing MLT. Therefore, it is possible that many of this child’s turns consisted of unintelligible vocalizations or nonverbal gestures.

What we have seen in examining the profiles for these five children is that, even among normally developing children, different children may have strengths in different domains, relative to their age mates. For illustrative purposes, we have considered only three domains, as measured by four indices. To get a more detailed picture of a child’s language production, we might choose to include other indices, or to further refine the measures we use. For example, we might compute TTR based on the number of words, or we might time-sample by examining the number of word types and word tokens the child produced in a certain number of minutes of mother–child interaction. We might also consider other measures of conversational competence, such as number of child initiations and responses; fluency measures, such as number of retraces or hesitations; or pragmatic measures, such as variety of speech acts produced. Computation of some of these measures would require that codes be entered in the transcript prior to analysis; however, the CLAN analyses themselves would, for the most part, simply be variations on the techniques discussed in this chapter. In the exercises that follow, you will have an opportunity to use these techniques to perform analyses on these five children at both 20 months and 32 months.

3.6      Further Exercises

The files needed for the following exercises are in the /transcripts/ne20 and /transcripts/ne32 folders in the examples.zip file found at http://talkbank.org/examples.zip.  No data are available for Subject 14 at 32 months.

1.     Compute the length in morphemes of each target child’s single longest utterance at 20 months. Compare with the MLU of the five longest utterances. Consider why a researcher might want to use MLU of the five longest rather than MLU of the single longest utterance.

2.      Use the +z option to compute TTR on each child’s first 50 words at 32 months. Then do the same for each successive 50-word band up to 300.  Check the output each time to be sure that 50 words were in fact found. If you specify a range of 50 words where there are fewer than 50 words available in the file, FREQ still performs the analysis, but the output will show the actual number of tokens found. What do you observe about the stability of TTR across different samples of 50 words?

3.     Use the MLU and FREQ programs to examine the mother’s (*MOT) language to her child at 20 months and at 32 months. What do you observe about the length/complexity and lexical diversity of the mother’s speech to her child? Do they remain generally the same across time or change as the child’s language develops? If you observe change, how can it be characterized?

4.     Perform the same analyses for the four target children for whom data are available at age 32 months. Use the data given earlier to compute z-scores for each target child on each measure (MLU 50 utterances, MLU of five longest utterances, TTR, MLT ratio). Then plot profiles for each of the target children at 32 months. What consistencies and inconsistencies do you see from 20 to 32 months? Which children, if any, have similar profiles at both ages? Which children's profiles change markedly from 20 to 32 months?

5.     Conduct a case study of a child you know to explore whether type of activity and/or interlocutor affect mean length of turn (MLT). Videotape the child and mother engaged in two different activities (e.g., bookreading, having a snack together, playing with a favorite toy). On another occasion, videotape the child engaged in the same activities with an unfamiliar adult. Compare the MLT ratio for each activity and adult–child pair. Describe any differences you observe.

4       The Editor

CLAN includes an editor that is specifically designed to work cooperatively with CHAT files. To open an editor window, type Command-n (Control-n on Windows) for a new file, or Command-o (Control-o on Windows) to open an existing file.  This is what a new text window looks like on the Macintosh:

You can type into this editor window just as you would in any full-screen text editor, such as MS-Word.  In fact, the basic functions of the CLAN editor and MS-Word are all the same.  Some users say that they find the CLAN editor difficult to learn.  However, on the basic level it is no harder than MS-Word.  What makes the CLAN editor difficult is the fact that it is used to transcribe the difficult material of child language data with all its special forms, overlaps, and precise timings.  These functions are outside of the scope of editors, such as MS-Word or Pages.

4.1      Screencasts

Use of the tutorial can be supplemented through the online screencasts for specific CLAN features found at http://talkbank.org/screencasts and on YouTube.  These movies, created by Davida Fromm and Brian MacWhinney, show the use of specific CLAN functions in real time with real transcripts.

4.2      Text Mode vs. CHAT Mode

The editor works in two basic modes: Text Mode and CHAT Mode. In Text Mode, the editor functions as a basic text editor. To indicate that you are in Text Mode, the bar at the bottom of the editor window displays [E][Text]. To enter Text Mode, you must uncheck the CHAT Mode button on the Mode pulldown menu. In CHAT Mode, the editor facilitates the typing of new CHAT files and the editing of existing CHAT files. If your file has the .cha extension, you will automatically be placed into CHAT Mode when you open it. To indicate that you are in CHAT Mode, the bar at the bottom of the editor window displays [E][CHAT].

When you are first learning to use the editor, it is best to begin in CHAT Mode. When you start CLAN, it automatically opens a new window for text editing. By default, this file will be opened using CHAT Mode. You can use this editor window to start learning the editor or you can open an existing CHAT file using the option in the File menu. It is probably easiest to start work with an existing file. To open a file, type Command-o (Macintosh) or Control-o (Windows). You will be asked to locate a file. Try to open the sample.cha file that you will find in the Lib directory inside the CLAN directory or folder. This is just a sample file, so you do not need to worry about accidentally saving changes.

You should stay in CHAT Mode until you have learned the basic editing commands. You can insert characters by typing in the usual way. Movement of the cursor with the mouse and arrow keys works the same way as in Word or Pages. Functions like scrolling, highlighting, cutting, and pasting also work in the standard way. You should try out these functions right away. Use the arrow keys and the scroll bar to move around in the sample.cha file. Cut and paste sections and type a few sentences, just to convince yourself that you are already familiar with the basic editor functions.

4.3      File, Edit, and Font Menus

The functions of opening files, printing, cutting, undoing, and font changing are the same as in Pages or Word. These commands can be found under the File, Edit, and Font menus in the menu bar. The keyboard shortcuts for pulling down these menu items are listed next to the menu options.  Note that there is also a File function called “Save Last Clip As ...” which you can use to save a time-delimited sound segment as a separate file. 

4.4      Default Window Positioning, Size, and Font Control

When CLAN starts up it will open a new Commands window and a new Text window in the same position it used when you last ran CLAN.  If you want to change the position or size of a window, you can move it and resize it.  You can then close it and open a new text window using ⌘-n and it will assume the size and position of the earlier window. Repositioning also works in the same way for the Commands window, but you cannot resize the Commands window.

When you start video playback, the movie window may occupy too much of the screen.  To size it properly, click on the green button at the top of the QuickTime video window and the window will be resized to its smallest dimension.  Then drag the bottom right corner to expand it to the size you wish.

The system for controlling the default Font depends on your operating system.  On Windows (PC) there is a Font menu under View. You use the Set Font option to set the text window font, and the Set Commands Font to set the Commands window font. The option to Set Default Font is only needed in rare cases when no default font has yet been selected. When using the View pulldown to change font or size, both the font and the size must be selected.  If you select only one, no change will be made. On Macintosh, you can use the Size/Style menu to control the font.

4.5      CA Styles

CHAT supports many of the CA (Conversation Analysis) codes as developed by Sacks, Schegloff, and Jefferson (1974) and their students.  The implementation of CA inside CLAN was guided by suggestions from Johannes Wagner, Chris Ramsden, Michael Forrester, Tim Koschmann, Charles Goodwin, and Curt LeBaron. Files that use CA styles should declare this fact by including CA in the @Options line, as in this example:

@Options:     CA
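
If you keep many transcripts and want to confirm which ones declare CA, a short script can scan the headers for you. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes that the @Options header sits on a single line with its values separated by commas or whitespace.

```python
def declares_ca(chat_text):
    """Return True if an @Options header in the transcript lists CA."""
    prefix = "@Options:"
    for line in chat_text.splitlines():
        if line.startswith(prefix):
            # Assumption: option values are separated by commas or whitespace.
            values = line[len(prefix):].replace(",", " ").split()
            if "CA" in values:
                return True
    return False


sample = "@UTF8\n@Begin\n@Options:\tCA\n@End\n"
print(declares_ca(sample))
```

To check a file on disk, read it with open(path, encoding="utf-8").read() and pass the resulting text to the function.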

By default, CA files will use the CAfont, because the characters in this font have a fixed width, allowing the INDENT program to make sure that CA overlap markers are clearly aligned.   When doing CA transcription, you can also select underlining and italics, although bold is not allowed, because it is too difficult to recognize. Special CA characters can be inserted by typing the F1 function key followed by some letter or number, as indicated in a list that you can find by selecting Special Characters under CLAN’s Windows menu.  The full list is at http://talkbank.org/CABank/codes.html .

The F1 and F2 keys are also used to facilitate the entry of special characters for Hebrew, Arabic, and other systems.  These uses are also listed in the Special Characters window.  The raised h diacritic is bound to F1-shift-h and the subscript dot is bound to F1-comma.

4.6      Setting Special Colors

You can set the color of certain tiers to improve the readability of your files.  To do this, select the Color Keywords option in the Size/Style pulldown menu.  In the dialog that appears, type the tier that you want to color in the upper box.  For example, you may want to have %mor or *CHI in a special color.  Then click on “add to list” and edit the color to the type you wish.  The easiest way to do this is to use the crayon selector.  Then make sure you select “color entire tier.”  To learn the various uses of this dialog, try selecting and applying different options.

4.7      Searching

In the middle of the Edit pulldown menu, you will find a series of commands for searching. The Find command brings up a dialog that allows you to enter a search string and to perform a reverse search. The Find Same command allows you to repeat that same search multiple times. The Go To Line command allows you to move to a particular line number. The Replace command brings up a dialog like the Find dialog.  However, this dialog allows you to find a certain string and replace it with another one. You can replace some strings and not others by skipping over the ones you do not want to replace with the Find-Next function. When you need to perform a large series of different replacements, you can set up a file of replacement forms and use it by pressing the from file button. You then are led through the words in this replacement file one by one. The form of that file is like this:

"String_A"       "Replacement_A"

"String_B"       "Replacement_B"
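
To make the layout of this replacement file concrete, here is a Python sketch that reads pairs in this form and applies them to a piece of text. It is not CLAN's own code: the function names are hypothetical, the parser assumes one quoted pair per line, and it accepts either straight or curly quotation marks. Unlike the editor, which leads you through the replacements one by one, this sketch applies every pair without asking.

```python
import re

# One pair per line: "search string" followed by "replacement string".
PAIR = re.compile(r'"([^"]*)"\s+"([^"]*)"')


def parse_replacements(text):
    """Return a list of (old, new) pairs found in a replacement file."""
    pairs = []
    for line in text.splitlines():
        # Treat curly quotes as straight quotes (an assumption of this sketch).
        line = line.replace("\u201c", '"').replace("\u201d", '"')
        match = PAIR.search(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def apply_replacements(transcript, pairs):
    """Apply each pair in order, replacing every occurrence."""
    for old, new in pairs:
        transcript = transcript.replace(old, new)
    return transcript


rules = parse_replacements('"gonna"   "going to"\n\n"wanna"   "want to"\n')
print(apply_replacements("I wanna go", rules))
```

Because the pairs are applied in file order, an earlier replacement can create text that a later pair then matches, so order the lines with that in mind.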

4.8      Hiding Tiers

To hide or unhide a certain dependent tier, type Esc-4. (Remember to always release the escape key before typing the next key.) Then type e to exclude a tier, followed by the tier name, such as %mor for the morphological tier. If you want to exclude all tiers, type just %. To reset the tiers and see them all, type Esc-4 and then r.

You can use the 0hide.cut file in CLAN’s /lib folder to set defaults for hiding and displaying tiers.  In that file, unused (comment) lines start with the # sign.  If you want to hide a particular tier, just remove the # sign from its line.  To go back to displaying that tier, put the # sign back.
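
The commenting convention can be illustrated with a short Python sketch. The exact contents of 0hide.cut are not reproduced here, so the sketch assumes a simplified layout in which each line holds one tier name, optionally preceded by #; the function names are hypothetical.

```python
def hidden_tiers(cut_text):
    """Return the tiers whose lines are active (not commented out with #)."""
    hidden = []
    for line in cut_text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            hidden.append(entry)
    return hidden


def set_hidden(cut_text, tier, hide):
    """Remove the # from the tier's line to hide it, or restore it to show it."""
    result = []
    for line in cut_text.splitlines():
        name = line.strip().lstrip("#").strip()
        if name == tier:
            line = tier if hide else "#" + tier
        result.append(line)
    return "\n".join(result)


# Assumed example contents: %mor and %gra are displayed, %com is hidden.
settings = "#%mor\n#%gra\n%com"
print(hidden_tiers(settings))
print(hidden_tiers(set_hidden(settings, "%mor", True)))
```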

4.9       Send to Sound Analyzer

This option under the Mode menu allows you to send a bulleted sound segment to Praat or Pitchworks. You choose which analyzer you want to use by an option under the Edit menu.  The default analyzer is Praat. The bullets must be formatted in the current format (post 2006).  If you have a file using the old format, you can use the FIXBULLETS program to fix them. If you are using Praat, you must first start up the Praat window (download Praat from http://www.fon.hum.uva.nl/praat) and place your cursor in front of a bullet for a sound segment.  Selecting Send to Sound Analyzer then sends that clip to the Praat window for further analysis.  To run Praat in the background without a GUI, you can also send this command from a Perl or Tcl script:

system("\"C:\\Program Files\\Praatcon.exe\" myPraatScript.txt");
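
If you work in Python rather than Perl or Tcl, the same call can be made with the standard subprocess module. This is a sketch only: the executable path and script name are the ones used in the example above and may differ on your machine, and the function names are hypothetical.

```python
import subprocess

# Path taken from the example above; adjust it to your own installation.
PRAATCON = r"C:\Program Files\Praatcon.exe"


def praat_command(script_path, praat_exe=PRAATCON):
    """Build the argument list for running a Praat script without the GUI.

    Passing a list avoids hand-quoting the space in "Program Files".
    """
    return [praat_exe, script_path]


def run_praat(script_path, praat_exe=PRAATCON):
    """Run the script and return the exit code reported by the program."""
    completed = subprocess.run(praat_command(script_path, praat_exe), check=False)
    return completed.returncode


print(praat_command("myPraatScript.txt"))
```

Calling run_praat("myPraatScript.txt") starts the program and waits for it to finish; a nonzero return value usually signals that the script failed.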

4.10  Tiers Menu Items

When you open a CHAT file with an @Participants line, the editor looks at each of the participants in that line and inserts their codes into the Tiers menu. You can then enter the name quickly, using the commands listed in that menu. If you make changes to the @Participants line, you can press the Update button at the bottom of the menu to reload new speaker names.  As an alternative to manual typing of information on the @ID lines, you can enter information for each participant separately using the dialog system that you start up using the ID Headers option in the Tiers menu.
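
The speaker codes that appear in the Tiers menu are the first item of each comma-separated entry on the @Participants line. The Python sketch below shows that extraction for illustration; it is not the editor's own code, the function name is hypothetical, and it assumes the header fits on a single line.

```python
def participant_codes(chat_text):
    """Return the speaker codes listed on the @Participants header."""
    prefix = "@Participants:"
    codes = []
    for line in chat_text.splitlines():
        if line.startswith(prefix):
            for entry in line[len(prefix):].split(","):
                fields = entry.split()
                if fields:
                    # The code (e.g. CHI) comes first, before any name and role.
                    codes.append(fields[0])
    return codes


sample = (
    "@UTF8\n"
    "@Begin\n"
    "@Participants:\tCHI Ross Target_Child, MOT Mother\n"
    "*CHI:\thi .\n"
    "@End\n"
)
print(participant_codes(sample))
```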

4.11  Running CHECK Inside the Editor

You can run CHECK from inside the editor. You do this by typing Esc-L or selecting Check Opened File from the Mode menu. If you are in CHAT Mode, CHECK will look for the correct use of CHAT. Make sure that you have set your Lib directory to the place where the depfile.cut file is located. On Windows, this should be c:\TalkBank\CLAN\lib.  CHECK can also be run from the command line using a command such as: check *.cha.  See the section on CHECK in the command descriptions of this manual for more details.  The command line version of CHECK is able to spot a few additional problems that cannot be detected by the version that operates inside the editor.
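
To give a feel for the kind of structural problem a checker looks for, here is a small Python sketch. It is far simpler than CHECK and does not use depfile.cut; it only assumes that a CHAT file contains @Begin, ends with @End, and that every line starts with @, *, %, or a tab for continuation lines. The function name is hypothetical, and blank lines are reported as problems under these assumptions.

```python
def basic_chat_problems(chat_text):
    """Return a list of simple structural problems found in a transcript."""
    problems = []
    lines = chat_text.splitlines()
    if "@Begin" not in lines:
        problems.append("missing @Begin")
    if not lines or lines[-1] != "@End":
        problems.append("file does not end with @End")
    for number, line in enumerate(lines, start=1):
        # Headers, main tiers, dependent tiers, and continuation lines.
        if not line.startswith(("@", "*", "%", "\t")):
            problems.append(f"line {number}: unexpected line start")
    return problems


good = "@UTF8\n@Begin\n@Participants:\tCHI Target_Child\n*CHI:\thello .\n@End"
bad = "@Begin\nhello\n*CHI:\thi ."
print(basic_chat_problems(good))
print(basic_chat_problems(bad))
```

For real work, rely on CHECK itself, since it validates codes and tier contents that this sketch ignores.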

4.12  Preferences and Options

You can set various Editor preferences by pulling down the Edit menu and selecting Options. The following dialog box will pop up:

These options control the following features:

1.     Checkpoint frequency. This controls how often your file will be saved. If you set the frequency to 50, it will save after each group of 50 characters that you enter.

2.     Limit of lines in CLAN output. This determines how many output lines will go to your CLAN output screen. It is good to use a large number, since this will allow you to scroll backwards through large output results.

3.     Tier for disambiguation. This is the default tier for the Disambiguator Mode function.

4.     Open Commands window at startup. Selecting this option makes it so that the Commands window comes up automatically whenever you open CLAN.

5.     No backup file. By default, the editor creates a backup file, in case the program hangs. If you check this, CLAN will not create a backup file.

6.     Start in CHAT Coder mode. Checking this will start you in Text Mode when you open a new text window.

7.     Auto-wrap in Text Mode. This will wrap long lines when you type.

8.     Auto-wrap CLAN output. This will wrap long lines in the output.

9.     Show mixed stereo sound wave.  CLAN can only display a single sound wave when editing.  If you are using a stereo sound, you may want to choose this option.
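
The checkpoint frequency described in option 1 amounts to a counter that triggers a save each time a set number of characters has been typed. The Python sketch below illustrates that idea; it is not CLAN's implementation, and the class and method names are hypothetical.

```python
class CheckpointCounter:
    """Call `save` once for every `frequency` characters typed."""

    def __init__(self, frequency, save):
        if frequency < 1:
            raise ValueError("frequency must be at least 1")
        self.frequency = frequency
        self.save = save
        self.pending = 0  # characters typed since the last save

    def typed(self, count=1):
        """Record `count` newly typed characters and save when due."""
        self.pending += count
        while self.pending >= self.frequency:
            self.save()
            self.pending -= self.frequency


saves = []
counter = CheckpointCounter(50, lambda: saves.append("saved"))
for _ in range(120):
    counter.typed()
print(len(saves), counter.pending)
```

With a frequency of 50, typing 120 characters triggers two saves and leaves 20 characters counted toward the next one.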