Tools for Analyzing Talk

 

Part 3:  Morphosyntactic Analysis

 

 

Brian MacWhinney

Carnegie Mellon University

 

May 24, 2023


https://doi.org/10.21415/T5B97X

 

 

 

 

 

 

When citing the use of TalkBank and CHILDES facilities, please use this reference to the last printed version:

 

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

 

This allows us to systematically track usage of the programs and data through scholar.google.com.


 

1      Introduction
2      Morphosyntactic Coding
2.1    One-to-one correspondence
2.2    Tag Groups and Word Groups
2.3    Words
2.4    Part of Speech Codes
2.5    Stems
2.6    Affixes
2.7    Clitics
2.8    Compounds
2.9    Punctuation Marks
2.10   Sample Morphological Tagging for English
2.11   Ongoing development
3      Running the Program Chain
4      Morphological Analysis
4.1    The Design of MOR
4.2    Example Analyses
4.3    Exclusions in MOR
4.4    Unique Options
4.5    Categories and Components of MOR
4.6    MOR Part-of-Speech Categories
4.7    MOR Grammatical Categories
4.8    Compounds and Complex Forms
4.9    Errors and Replacements
4.10   Affixes
4.11   Control Features and Output Features
5      Correcting errors
5.1    Lexicon Building
5.2    Disambiguator Mode
6      A Formal Description of the Rule Files
6.1    Declarative structure
6.2    Pattern-matching symbols
6.3    Variable notation
6.4    Category Information Operators
6.5    Arules
6.6    Crules
7      Building new MOR grammars
7.1    minMOR
7.2    Adding affixes
7.3    Interactive MOR
7.4    Testing
7.5    Building Arules
7.6    Building crules
8      MOR for Bilingual Corpora
9      POST
9.1    POSTLIST
9.2    POSTMODRULES
9.3    PREPOST
9.4    POSTMORTEM
9.5    POSTTRAIN
9.6    POSTMOD
9.7    TRNFIX
10     GRASP – Syntactic Dependency Analysis
10.1   Grammatical Relations
10.2   Predicate-head relations
10.3   Argument-head relations
10.4   Extra-clausal elements
10.5   Cosmetic relations
10.6   MEGRASP
11     Building a training corpus
11.1   ROOT, SUBJ, DET, PUNCT
11.2   OBJ and OBJ2
11.3   JCT, NJCT and POBJ
11.4   PRED
11.5   AUX
11.6   NEG
11.7   MOD and POSS
11.8   CONJ and COORD
11.9   ENUM
11.10  POSTMOD
11.11  COMP, LINK
11.12  QUANT and PQ
11.13  CSUBJ, COBJ, CPOBJ, CPRED
11.14  CJCT and XJCT
11.15  CMOD and XMOD
11.16  BEG, BEGP, END, ENDP
11.17  COM and TAG
11.18  SRL, APP
11.19  NAME, DATE
11.20  INCROOT, OM
12     GRs for other languages
12.1   Spanish
12.2   Chinese
12.3   Japanese

 

1       Introduction

 

This third volume of the TalkBank manuals deals with the use of the programs that perform automatic computation of the morphosyntactic structure of transcripts in CHAT format.  These manuals, the programs, and the TalkBank datasets can all be downloaded freely from https://talkbank.org.

The first volume of the TalkBank manual describes the CHAT transcription format. The second volume describes the use of the CLAN data analysis programs. This third manual describes the use of the MOR, POST, POSTMORTEM, and MEGRASP programs to add a %mor and %gra line to CHAT transcripts.  The %mor line provides a complete part-of-speech tagging for every word indicated on the main line of the transcript.  The %gra line provides a further analysis of the grammatical dependencies between items in the %mor line.  These programs for morphosyntactic analysis are all built into CLAN. 

Users who do not wish to create or process information on the %mor and %gra lines will not need to read this current manual.  However, researchers and clinicians interested in these features will need to know the basics of the use of these programs, as described in the next chapter.  The additional sections of this manual are directed to researchers who wish to extend or improve the coverage of MOR and GRASP grammars or who wish to build such grammars for languages that are not yet covered.

 

2       Morphosyntactic Coding

Linguists and psycholinguists rely on the analysis of morphosyntax to illuminate core issues in learning and development. Generativist theories have emphasized issues such as: the role of triggers in the early setting of a parameter for subject omission (Hyams & Wexler, 1993), evidence for advanced early syntactic competence (Wexler, 1998), evidence for the early absence of functional categories that attach to the IP node (Radford, 1990), the role of optional infinitives in normal and disordered acquisition (Rice, 1997), and the child’s ability to process syntax without any exposure to relevant data (Crain, 1991). Generativists have sometimes been criticized for paying inadequate attention to the empirical patterns of distribution in children’s productions.  However, work by researchers in this tradition, such as Stromswold (1994), van Kampen (1998), and Meisel (1986), demonstrates the important role that transcript data can play in evaluating alternative generative accounts.

Learning theorists have placed an even greater emphasis on the use of transcripts for understanding morphosyntactic development.  Neural network models have shown how cue validities can determine the sequence of acquisition for both morphological (MacWhinney & Leinbach, 1991; MacWhinney, Leinbach, Taraban, & McDonald, 1989; Plunkett & Marchman, 1991) and syntactic (Elman, 1993; Mintz, Newport, & Bever, 2002; Siskind, 1999) development.  This work derives further support from a broad movement within linguistics toward a focus on data-driven models (Bybee & Hopper, 2001) for understanding language learning and structure.  These approaches view constructions (Tomasello, 2003) and item-based patterns (MacWhinney, 1975) as the loci of statistical learning.

The study of morphosyntax also plays an important role in the study and treatment of language disorders, such as aphasia, specific language impairment, stuttering, and dementia. For this work, both researchers and clinicians can benefit from methods for achieving accurate automatic analysis of correct and incorrect uses of morphosyntactic devices.  To address these needs, the TalkBank system uses the MOR command to automatically generate candidate morphological analyses on the %mor tier, the POST command to disambiguate these analyses, and the MEGRASP command to compute grammatical dependencies on the %gra tier.

2.1      One-to-one correspondence

MOR creates a %mor tier with a one-to-one correspondence between words on the main line and words on the %mor tier. In order to achieve this one-to-one correspondence, the following rules are observed:

1.     Each word group (see below) on the %mor line is surrounded by spaces or an initial tab so that it corresponds to the space-delimited word group on the main line.  The correspondence matches each %mor word (morphological word) to a main-line word in left-to-right order within the utterance.

2.     Utterance delimiters are preserved on the %mor line to facilitate readability and analysis.  These delimiters should be the same as the ones used on the main line.

3.     Along with utterance delimiters, the satellite markers ‡ for the vocative and „ for tag questions or dislocations are also included on the %mor line in a one-to-one alignment format.

4.     Retracings and repetitions are excluded from this one-to-one mapping, as are nonwords such as xxx or strings beginning with &. When word repetitions are marked in the form word [x 3], the material in the square brackets is stripped off and the word is treated as a single form (see the example after this list).

5.     When a replacing form is indicated on the main line with the form [: text], the material on the %mor line corresponds to the replacing material in the square brackets, not the material that is being replaced. For example, if the main line has gonna [: going to], the %mor line will code going to.

6.     The [*] symbol that is used on the main line to indicate errors is not duplicated on the %mor line.
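For example, in the following hand-constructed illustration (the codes are taken from the tables later in this chapter; actual MOR output may differ in detail), the retraced material marked with [//] is excluded, while the utterance delimiter is retained:

*CHI:   <my ball> [//] my big ball fell .
%mor:   det:poss|my adj|big n|ball v|fall&PAST .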

2.2      Tag Groups and Word Groups

On the %mor line, alternative taggings of a given word are clustered together in tag groups. These groups include the alternative taggings of a word that are produced by the MOR program.  Alternatives are separated by the ^ character. Here is an example of a tag group for one of the most ambiguous words in English:

adv|back^adj|back^n|back^v|back

After you run the POST program on your files, all of these alternatives will be disambiguated and each word will have only one alternative.  You can also use the hand disambiguation method built into the CLAN editor to disambiguate each tag group case by case.

The next level of organization for the MOR line is the word group.  Word groups are combinations marked by the preclitic delimiter $, the postclitic delimiter ~ or the compound delimiter +.  For example, the Spanish word dámelo can be represented as

vimpsh|da-2S&IMP~pro:clit|1S~pro:clit|OBJ&MASC=give

This word group is a series of three words (verb~postclitic~postclitic) combined by the ~ marker. Clitics may be either preclitics or postclitics. Separable prefixes of the type found in German or Hungarian and other discontinuous morphemes can be represented as word groups using the preclitic delimiter $, as in this example for ausgegangen (“gone”):

prep|aus$PART#v|geh&PAST:PART=go

Note the difference between the coding of the preclitic “aus” and the prefix “ge” in this example. Compounds are also represented as combinations, as in this analysis of angelfish.

n|+n|angel+n|fish

Here, the first characters (n|) represent the part of speech of the whole compound and the subsequent tags, after each plus sign, are for the parts of speech of the components of the compound.  Proper nouns are not treated as compounds.  Therefore, they take forms with underlines instead of pluses, such as Luke_Skywalker or New_York_City.

2.3      Words

Beneath the level of the word group is the level of the word. The structure of each individual word is:

prefix#

part-of-speech|

stem

&fusionalsuffix

-suffix

=english (optional, underscore joins words)

There can be any number of prefixes, fusional suffixes, and suffixes, but there should be only one stem. Prefixes and suffixes should be given in the order in which they occur in the word. Since fusional suffixes are fused parts of the stem, their order is indeterminate. The English translation of the stem is not a part of the morphology, but is included as a convenience for non-native speakers.  If the English translation requires two words, these words should be joined by an underscore, as in “lose_flowers” for French défleurir.
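As an illustration, the German form from the previous section, PART#v|geh&PAST:PART=go, fills these slots as follows:

PART#         prefix
v|            part-of-speech
geh           stem
&PAST:PART    fusional suffix
=go           English translation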

Now let us look in greater detail at the nature of each of these types of coding. Throughout this discussion, bear in mind that all coding is done on a word-by-word basis, where words are considered to be strings separated by spaces.

2.4      Part of Speech Codes

The morphological codes on the %mor line begin with a part-of-speech code. The basic scheme for the part-of-speech code is:

category:subcategory:subcategory

Additional fields can be added, using the colon character as the field separator. The subcategory fields contain information about syntactic features of the word that are not marked overtly. For example, you may wish to code the fact that Italian “andare” is an intransitive verb even though there is no single morpheme that signals intransitivity. You can do this by using the part-of-speech code v:intrans, rather than by inserting a separate morpheme.

In order to avoid redundancy, information that is marked by a prefix or suffix is not incorporated into the part-of-speech code, as this information will be found to the right of the | delimiter. These codes can be given in either uppercase, as in ADJ, or lowercase, as in adj. In general, CHAT codes are not case-sensitive.

The particular codes given below are the ones that MOR uses for automatic morphological tagging of English. Individual researchers will need to define a system of part-of-speech codes that correctly reflects their own research interests and theoretical commitments. Languages that are typologically quite different from English may have to use very different part-of-speech categories. Quirk, Greenbaum, Leech, and Svartvik (1985) explain some of the intricacies of part-of-speech coding.  Their analysis should be taken as definitive for all part-of-speech coding for English.  However, for many purposes, a more coarse-grained coding can be used.

The following set of top-level part-of-speech codes is the one used by the MOR program.  Additional refinements to this system can be found by studying the organization of the lexicon files for that program.  For example, compounds use the main part-of-speech code, along with codes for their components.  Further distinctions can be found by looking at the MOR lexicon.

 

 

English Parts of Speech

 

Category                        Code
Adjective                       adj
Adjective - Predicative         adj:pred
Adverb                          adv
Adverb - Temporal               adv:tem
Communicator                    co
Complementizer                  comp
Conjunction                     conj
Coordinator                     coord
Determiner - Article            det:art
Determiner - Demonstrative      det:dem
Determiner - Interrogative      det:int
Determiner - Numeral            det:num
Determiner - Possessive         det:poss
Filler                          fil
Infinitive                      inf
Negative                        neg
Noun                            n
Noun - letter                   n:let
Noun - plurale tantum           n:pt
Proper Noun                     n:prop
Onomatopoeia                    on
Particle                        part
Postmodifier                    post
Preposition                     prep
Pronoun - demonstrative         pro:dem
Pronoun - existential           pro:exist
Pronoun - indefinite            pro:indef
Pronoun - interrogative         pro:int
Pronoun - object                pro:obj
Pronoun - personal              pro:per
Pronoun - possessive            pro:poss
Pronoun - reflexive             pro:refl
Pronoun - relative              pro:rel
Pronoun - subject               pro:sub
Quantifier                      qn
Verb                            v
Verb - auxiliary                aux
Verb - copula                   cop
Verb - modal                    mod

 

2.5      Stems

Every word on the %mor tier must include a “lemma” or stem as part of the morpheme analysis. The stem is found on the right-hand side of the | delimiter, following any preclitics or prefixes. If the transcript is in English, this can be simply the canonical form of the word. For nouns, this is the singular. For verbs, it is the infinitive. If the transcript is in another language, it can be the English translation. A single form should be selected for each stem. Thus, the English indefinite article is coded as det|a with the lemma “a,” whether the actual form of the article is “a” or “an.”
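For example, in this hand-constructed illustration (actual MOR output may differ in detail), the surface form “an” receives the lemma “a”:

*CHI:   an apple .
%mor:   det|a n|apple .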

 

When English is not the main language of the transcript, the transcriber must decide whether to use English stems. Using English stems has the advantage that it makes the corpus more available to English-reading researchers. To show how this is done, take the German phrase “wir essen”:

*FRI:   wir essen.

%mor:   pro|wir=we v|ess-INF=eat .

Some projects may have reasons to avoid using English stems, even as translations. In this example, “essen” would be simply v|ess-INF. Other projects may wish to use only English stems and no target-language stems. Sometimes there are multiple possible translations into English. For example, German “Sie”/“sie” could be either “you,” “she,” or “they.”  Choosing a single English meaning for the stem helps fix the German form.

2.6      Affixes

Affixes and clitics are coded in the position in which they occur with relation to the stem. The morphological status of the affix should be identified by the following markers or delimiters: - for a suffix, # for a prefix, and & for fusional or infixed morphology.

The & is used to mark affixes that are not realized in a clearly isolable phonological shape. For example, the form “men” cannot be broken down into a part corresponding to the stem “man” and a part corresponding to the plural marker, because one cannot say that the vowel “e” marks the plural. For this reason, the word is coded as n|man&PL. The past forms of irregular verbs may undergo similar ablaut processes, as in “came,” which is coded v|come&PAST, or they may undergo no phonological change at all, as in “hit”, which is coded v|hit&PAST.  Sometimes there may be several codes indicated with the & after the stem. For example, the form “was” is coded v|be&PAST&13s.  Affix and clitic codes are based either on Latin forms for grammatical function or on English words corresponding to particular closed-class items. MOR uses the following set of affix codes for automatic morphological tagging of English.
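To see these codes in an utterance context, here is a hand-constructed illustration using the forms just discussed (actual MOR output may differ in detail):

*CHI:   the men came .
%mor:   det|the n|man&PL v|come&PAST .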

 

Inflectional Affixes for English

 

Function                      Code
adjective suffix er, r        CP
adjective suffix est, st      SP
noun suffix ie                DIM
noun suffix s, es             PL
noun suffix 's, '             POSS
verb suffix s, es             3S
verb suffix ed, d             PAST
verb suffix ing               PRESP
verb suffix en                PASTP

 

Derivational Affixes for English

 

Function                        Code
adjective and verb prefix un    UN
adverbializer ly                LY
nominalizer er                  ER
noun prefix ex                  EX
verb prefix dis                 DIS
verb prefix mis                 MIS
verb prefix out                 OUT
verb prefix over                OVER
verb prefix pre                 PRE
verb prefix pro                 PRO
verb prefix re                  RE

 

2.7      Clitics

Clitics are marked by a tilde, as in v|parl&IMP:2S=speak~pro|DAT:MASC:SG for Italian “parlagli” and pro|it~v|be&3s for English “it's.” Note that part of speech coding with the | symbol is repeated for clitics after the tilde. Both clitics and contracted elements are coded with the tilde. The use of the tilde for contracted elements extends to forms like “sul” in Italian, “ins” in German, or “rajta” in Hungarian in which prepositions are merged with articles or pronouns.
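On the %mor tier, a clitic group still counts as a single word group matching a single main-line word. Here is a hand-constructed illustration for English “it's,” based on the coding shown above (actual MOR output may differ in detail):

*CHI:   it's big .
%mor:   pro|it~v|be&3S adj|big .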

 

Clitic Codes for English

 

Clitic                            Code
noun phrase post-clitic 'd        v:aux|would, v|have&PAST
noun phrase post-clitic 'll       v:aux|will
noun phrase post-clitic 'm        v|be&1S, v:aux|be&1S
noun phrase post-clitic 're       v|be&PRES, v:aux|be&PRES
noun phrase post-clitic 's        v|be&3S, v:aux|be&3S
verbal post-clitic n't            neg|not

2.8      Compounds

Here are some words that we might want to treat as compounds: sweatshirt, highschool, playground, and horseback. You can find hundreds of these in files in the English lexicon such as adv+n+adv.cut that include the plus symbol in their names. There are also many idiomatic phrases that could be best analyzed as linkages. Here are some examples: a_lot_of, all_of_a_sudden, at_last, for_sure, kind_of, of_course, once_and_for_all, once_upon_a_time, so_far, and lots_of. You can find hundreds of these in files in the English lexicon with names such as adj_under.cut.

On the %mor tier it is necessary to assign a part-of-speech label to each segment of the compound. For example, the word blackboard is coded on the %mor tier as n|+adj|black+n|board. The part of speech of the compound as a whole is usually given by the part-of-speech of the final segment, although this is not always true.

In order to preserve the one-to-one correspondence between words on the main line and words on the %mor tier, words that are not marked as compounds on the main line should not be coded as compounds on the %mor tier. For example, if the words “come here” are used as a rote form, then they should be written as “come_here” on the main tier. On the %mor tier this will be coded as v|come_here. It makes no sense to code this as v|come+adv|here, because that analysis would contradict the claim that this pair functions as a single unit. It is sometimes difficult to assign a part-of-speech code to a morpheme. In the usual case, the part-of-speech code should be chosen from the same set of codes used to label single words of the language. For example, some of these idiomatic phrases are coded as the linkages shown in the table below.
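For example, in this hand-constructed illustration (codings taken from the discussion above; actual MOR output may differ in detail), the compound receives part-of-speech codes for each component while remaining a single word group:

*CHI:   the blackboard fell .
%mor:   det|the n|+adj|black+n|board v|fall&PAST .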

 

Phrases Coded as Linkages

 

qn|a_lot_of             adv|all_of_a_sudden
co|for_sure             adv:int|kind_of
adv|once_and_for_all    adv|once_upon_a_time
adv|so_far              qn|lots_of

2.9      Punctuation Marks

MOR can be configured to recognize certain punctuation marks as whole word characters.  In particular, the file punct.cut contains these entries:

„       {[scat end]} "end"

‡       {[scat beg]} "beg"

,       {[scat cm]} "cm"

When the punctuation marks on the left occur in text, they are treated as separate lexical items and are mapped to forms such as beg|beg on the %mor tier.  The “end” marker is used to mark postposed forms such as tags and sentence final particles.  The “beg” marker is used to mark preposed forms such as vocatives and communicators.  These special characters are important for correctly structuring the dependency relations for the GRASP program.
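For example, in this hand-constructed illustration, the vocative marker is treated as its own lexical item and surfaces as beg|beg on the %mor tier (actual MOR output may differ in detail):

*MOT:   Eve ‡ it fell .
%mor:   n:prop|Eve beg|beg pro:per|it v|fall&PAST .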

2.10  Sample Morphological Tagging for English

The following table describes and illustrates a more detailed set of word class codings for English. The %mor tier examples correspond to the labellings MOR produces for the words in question. It is possible to augment or simplify this set, either by creating additional word categories or by adding additional fields to the part-of-speech label, as discussed previously.  The entries in this table and elsewhere in this manual can always be double-checked against the current version of the MOR grammar by typing “mor +xi” to bring up interactive MOR and then entering the word to be analyzed.

 

Word Classes for English

 

Class                         Examples              Coding of Examples
adjective                     big                   adj|big
adjective, comparative        bigger, better        adj|big-CP, adj|good&CP
adjective, superlative        biggest, best         adj|big-SP, adj|good&SP
adverb                        well                  adv|well
adverb, ending in ly          quickly               adv:adj|quick-LY
adverb, intensifying          very, rather          adv:int|very, adv:int|rather
adverb, post-qualifying       enough, indeed        adv|enough, adv|indeed
adverb, locative              here, then            adv:loc|here, adv:tem|then
communicator                  aha                   co|aha
conjunction, coord.           and, or               conj:coo|and, conj:coo|or
conjunction, subord.          if, although          conj:sub|if, conj:sub|although
determiner, singular          a, the, this          det|a, det|this
determiner, plural            these, those          det|these, det|those
determiner, possessive        my, your, her         det:poss|my
infinitive marker             to                    inf|to
noun, common                  cat, coffee           n|cat, n|coffee
noun, plural                  cats                  n|cat-PL
noun, possessive              cat's                 n|cat~poss|s
noun, plural possessive       cats'                 n|cat-PL~poss|s
noun, proper                  Mary                  n:prop|Mary
noun, proper, plural          Mary-s                n:prop|Mary-PL
noun, proper, possessive      Mary's                n:prop|Mary~poss|s
noun, proper, pl. poss.       Marys'                n:prop|Mary-PL~poss|s
noun, adverbial               home, west            n|home, adv:loc|home
number, cardinal              two                   det:num|two
number, ordinal               second                adj|second
postquantifier                all, both             post|all, post|both
preposition                   in                    prep|in, adv:loc|in
pronoun, personal             I, me, we, us, he     pro|I, pro|me, pro|we, pro|us
pronoun, reflexive            myself, ourselves     pro:refl|myself
pronoun, possessive           mine, yours, his      pro:poss|mine, pro:poss:det|his
pronoun, demonstrative        that, this, these     pro:dem|that
pronoun, indefinite           everybody, nothing    pro:indef|everybody
pronoun, indef., poss.        everybody's           pro:indef|everybody~poss|s
quantifier                    half, all             qn|half, qn|all
verb, base form               walk, run             v|walk, v|run
verb, 3rd singular present    walks, runs           v|walk-3S, v|run-3S
verb, past tense              walked, ran           v|walk-PAST, v|run&PAST
verb, present participle      walking, running      part|walk-PRESP, part|run-PRESP
verb, past participle         walked, run           part|walk-PASTP, part|run&PASTP
verb, modal auxiliary         can, could, must      aux|can, aux|could, aux|must

 

Since it is sometimes difficult to decide what part of speech a word belongs to, we offer the following overview of the major part-of-speech labels used in the standard English grammar.

 

ADJectives modify nouns, either prenominally or predicatively. Unitary compound modifiers such as good-looking should be labeled as adjectives.

 

ADVerbs cover a heterogeneous class of words, including: manner adverbs, which generally end in -ly; locative adverbs, which include expressions of time and place; intensifiers that modify adjectives; and post-head modifiers, such as indeed and enough.

 

COmmunicators are used for interactive and communicative forms which fulfill a variety of functions in speech and conversation. Also included in this category are words used to express emotion, as well as imitative and onomatopoeic forms, such as ah, aw, boom, boom-boom, icky, wow, yuck, and yummy.

 

CONJunctions conjoin two or more words, phrases, or sentences. Examples include: although, because, if, unless, and until.

 

COORDinators include and, or, and as well as.  These can combine clauses, phrases, or words.

 

DETerminers include articles and definite and indefinite determiners. Possessive determiners such as my and your are tagged det:poss.

 

INFinitive is the word “to” which is tagged inf|to.

 

INTerjections are similar to communicators, but they typically can stand alone as complete utterances or fragments, rather than being integrated as parts of the utterances.  They include forms such as wow, hello, good-morning, good-bye, please, thank-you.

 

Nouns are tagged with n for common nouns, and n:prop for proper nouns (names of people, places, fictional characters, brand-name products).

 

NEGative is the word “not” which is tagged neg|not.

 

NUMbers are labelled num for cardinal numbers. The ordinal numbers are adjectives.

 

Onomatopoeia are words that imitate the sounds of nature, animals, and other noises.

 

Particles are words that are often also prepositions, but are serving as verbal particles.

 

PREPositions are the heads of prepositional phrases. When a preposition is not a part of a phrase, it should be coded as a particle or an adverb.

 

PROnouns include a variety of structures, such as reflexives, possessives, personal pronouns, deictic pronouns, etc.

 

QUANTifiers include each, every, all, some, and similar items.

 

Verbs can be either main verbs, copulas, or auxiliaries.

2.11  Ongoing development

Currently, the most highly developed MOR grammar is the one for English, which achieves 99.18% accuracy in tagging productions from adult native speakers in databases such as CHILDES and AphasiaBank.  It is more difficult to reliably determine the accuracy of tagging for child utterances, particularly at the youngest ages, when there are often ambiguities in one-word and two-word utterances (Bloom, 1973) that even human coders cannot resolve. MOR grammars are also highly evolved for Spanish, French, German, Mandarin, Japanese, and Cantonese, achieving over 95% accuracy for these languages as well.  Apart from accuracy of tagging, there is the issue of lexical coverage.  For child language, lexical coverage is largely complete for these languages.  However, as we deal with more advanced adult productions, such as the academic language in the MICASE corpus or the juridical language in the SCOTUS corpus, we often need to add new items to the lexicons for a given language.  Accuracy of grammatical relation tagging by GRASP is highly dependent on accurate tagging by MOR.  However, even with accurate MOR tagging, GRASP tagging is at 93% accuracy for English.

 

3       Running the Program Chain

 

It is possible to construct a complete automatic morphosyntactic analysis of a series of CHAT transcripts through a single command in CLAN, once you have the needed programs in the correct configuration.  This command runs the MOR, POST, POSTMORTEM, and MEGRASP commands in an automatic sequence or chain. To do this, follow these steps:

1.     Place all the files you wish to analyze into a single folder.

2.     Start the CLAN program (see Part 2 of the manual for instructions on installing CLAN).

3.     In CLAN’s Commands window, click on the button labelled Working to set your working directory to the folder that has the files to be analyzed.

4.     Under the File menu at the top of the screen, select Get MOR Grammar and select the language you want to analyze.  To do this, you must be connected to the Internet. If you have already done this once, you do not need to do it again.  By default, the MOR grammar you have chosen will download to your desktop.

5.     If you choose to move your MOR grammar to another location, you will need to use the Mor Lib button in the Commands window to tell CLAN where to locate it.

6.     To analyze all the files in your Working directory folder, enter this command in the Commands window: mor *.cha

7.     CLAN will then run these programs in sequence: MOR, POST, POSTMORTEM, and MEGRASP. These programs will add %mor and %gra lines to your files.

8.     If this command ends with a message saying that some words were not recognized, you will probably want to fix them.  If you do not, some of the entries on the %mor line will be incomplete and the relations on the %gra line will be less accurate. If you have doubts about the spellings of certain words, you can look in the 0allwords.cdc file that is included in the /lex folder for each language.  The words there are listed in alphabetical order.

9.     To correct errors, you can run this command: mor +xb *.cha (see the example after this list). Guidelines for fixing errors are given in chapter 5 below.
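As a minimal sketch of such a session (assuming the working directory contains the .cha files to be analyzed), the commands typed into the Commands window would be:

mor *.cha          runs MOR, POST, POSTMORTEM, and MEGRASP on all .cha files
mor +xb *.cha      lists unrecognized words along with their locations in the files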

4       Morphological Analysis

4.1      The Design of MOR

The computational design of MOR was guided by Roland Hausser’s (1990) MORPH system and was implemented by Mitzi Morris. Since 2000, Leonid Spektor has extended MOR in many ways.  Christophe Parisse built POST and POSTTRAIN (Parisse & Le Normand, 2000). Kenji Sagae built MEGRASP as a part of his dissertation work for the Language Technologies Institute at Carnegie Mellon University (Sagae, MacWhinney, & Lavie, 2004a, 2004b).  Leonid Spektor then integrated the program into CLAN.

The system has been designed to maximize portability across languages, extendability of the lexicon and grammar, and compatibility with the CLAN programs. The basic engine of the parser is language independent. Language-specific information is stored in separate data files that can be modified by the user. The lexical entries are also kept in ASCII files and there are several techniques for improving the match of the lexicon to a corpus. To maximize the complete analysis of regular formations, only stems are stored in the lexicon and inflected forms appropriate for each stem are compiled at run time.

4.2      Example Analyses

To give an example of the results of a MOR analysis for English, consider this sentence from eve15.cha in Roger Brown’s corpus for Eve. 

*CHI:   oops I spilled it.

%mor:   co|oops pro:sub|I v|spill-PAST pro:per|it .

Here, the main line gives the child’s production and the %mor line gives the part of speech for each word, along with the morphological analysis of affixes, such as the past tense mark (-PAST) on the verb.  The %mor lines in these files were not created by hand.  To produce them, we ran the MOR command, using the MOR grammar for English, which can be downloaded using the Get MOR Grammar function described in the previous chapter. The command for running MOR by itself without running the rest of the chain is: mor +d *.cha. After running MOR, the file looks like this:

*CHI:  oops I spilled it .

%mor:  co|oops pro:sub|I part|spill-PASTP^v|spill-PAST pro:per|it .

In the %mor tier, words are labeled by their syntactic category or “scat”, followed by the pipe separator |, and then by the stem and affixes. Notice that the word “spilled” is initially ambiguous between the past tense and participle readings. The two readings are separated by the ^ character.  To resolve such ambiguities, we run a program called POST. The command is just “post *.cha”. After POST has been run, the %mor line will only have v|spill-PAST.

Using this disambiguated form, we can then run the MEGRASP program to create the representation given in the %gra line below:

*CHI:   oops I spilled it .

%mor:   co|oops pro:sub|I v|spill-PAST pro:per|it .

%gra:   1|3|COM 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|3|PUNCT

In the %gra line, we see that the second word “I” is related to the verb (“spilled”) through the grammatical relation (GR) of Subject.  The fourth word “it” is related to the verb through the grammatical relation of Object.  The verb is the Root and it is related to the “left wall” or item 0.

4.3      Exclusions in MOR

Because MOR focuses on the analysis of the target utterance, it excludes a variety of non-words, retraces, and special symbols. Specifically, MOR excludes:

1.     Items that start with &

2.     Pauses such as (.)

3.     Unknown forms marked as xxx, yyy, or www

4.     Data associated with these codes: [/?], [/-], [/], [//], and [///] (see the example after this list).
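For example, in this hand-constructed illustration, the filler, the pause, and the unintelligible material receive no entries on the %mor tier (actual MOR output may differ in detail):

*CHI:   &um the (.) xxx dog .
%mor:   det|the n|dog .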

4.4      Unique Options

+d     do not run POST command automatically.  POST will run automatically after MOR, unless this switch is used or unless the folder name includes the word “train”.

 

+eS    Show the result of the operation of the arules on either a stem S or stems in file @S.  This output will go into a file called debug.cdc in your library directory.  Another way of achieving this is to use the +d option inside “interactive MOR”.

 

+p     use pinyin lexicon format for Chinese

 

+xi     Run mor in the interactive test mode. You type in one word at a time to the test prompt and mor provides the analysis on line.  This facility makes the following commands available in the CLAN Output window:

        word - analyze this word

        :q  quit - exit program

        :c  print out current set of crules

        :d  display application of arules.

        :l  re-load rules and lexicon files

        :h  help - print this message

 

If you type in a word, such as “dog” or “perro,” MOR will try to analyze it and give you its component morphemes.  If you change the rules or the lexicon, use :l to reload and retest.  The :c and :d switches will send output to a file called debug.cdc in your library directory.

 

+xl     Run mor in the lexicon building mode. This mode takes a series of .cha files as input and outputs a small lexical file with the extension .ulx with entries for all words not recognized by mor. This helps in the building of lexicons.

 

+xb   check lexicon mode, include word location in data files

+xa    check lexicon for ambiguous entries

+xc    check lexicon mode, including capitalized words

+xd   check lexicon for compound words conflicting with plain words

+xp   check lexicon mode, including words with prosodic symbols

+xy    analyze words in lex files

4.5      Categories and Components of MOR

MOR breaks up words into their component parts or morphemes.  In a relatively analytic language like English, many words require no analysis at all.  However, even in English, a word like “coworkers” can be seen to contain four component morphemes, including the prefix “co”, the stem, the agential suffix, and the plural.  For this form, MOR will produce the analysis: co#n:v|work-AGT-PL.  This representation uses the symbols # and - to separate the four different morphemes.  Here, the prefix stands at the beginning of the analysis, followed by the stem (n:v|work) and the two suffixes.  In general, stems always have the form of a part-of-speech category, such as “n” for noun, followed by the vertical bar and then a statement of the stem’s lexical form.
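Laid out against the word structure given in section 2.3, this analysis of “coworkers” breaks down as follows:

co#      prefix
n:v|     part-of-speech
work     stem
-AGT     agential (derivational) suffix
-PL      plural suffix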

 

To understand the functioning of the MOR grammar for English, the best place to begin is with a tour of the files inside the ENG folder that you can download from the server.  At the top level, you will see these files:

1.     ar.cut – These are the rules that generate allomorphic variants from the stems and a