Tools for Analyzing Talk


Part 3:  Morphosyntactic Analysis



Brian MacWhinney

Carnegie Mellon University


April 10, 2018







When citing the use of TalkBank and CHILDES facilities, please use this reference to the last printed version:


MacWhinney, B. (2000).  The CHILDES Project: Tools for Analyzing Talk. 3rd Edition.  Mahwah, NJ: Lawrence Erlbaum Associates


This allows us to systematically track usage of the programs and data through


1      Introduction

2      Morphosyntactic Coding

2.1      One-to-one correspondence

2.2      Tag Groups and Word Groups

2.3      Words

2.4      Part of Speech Codes

2.5      Stems

2.6      Affixes

2.7      Clitics

2.8      Compounds

2.9      Punctuation Marks

2.10    Sample Morphological Tagging for English

3      Running the Program Chain

4      Morphological Analysis

4.1      The Design of MOR

4.2      Example Analyses

4.3      Exclusions in MOR

4.4      Unique Options

4.5      Categories and Components of MOR

4.6      MOR Part-of-Speech Categories

4.7      MOR Grammatical Categories

4.8      Compounds and Complex Forms

4.9      Errors and Replacements

4.10    Affixes

4.11    Control Features and Output Features

5      Correcting errors

5.1      Lexicon Building

5.2      Disambiguator Mode

6      A Formal Description of the Rule Files

6.1      Declarative structure

6.2      Pattern-matching symbols

6.3      Variable notation

6.4      Category Information Operators

6.5      Arules

6.6      Crules

7      Building new MOR grammars

7.1      minMOR

7.2      Adding affixes

7.3      Interactive MOR

7.4      Testing

7.5      Building Arules

7.6      Building crules

8      MOR for Bilingual Corpora

9      POST

9.1      POSTLIST

9.2      POSTMODRULES

9.3      POSTMORTEM

9.4      POSTTRAIN

9.5      POSTMOD

10       GRASP – Syntactic Dependency Analysis

10.1    Grammatical Relations

10.2    Predicate-head relations

10.3    Argument-head relations

10.4    Extra-clausal elements

10.5    Cosmetic relations

10.6    MEGRASP

11       Building a training corpus

11.1    OBJ and OBJ2

11.2    3. JCT and POBJ

11.3    PRED and NJCT

11.4    AUX and NEG

11.5    MOD and POSS

11.6    CONJ, and COORD

11.7    ENUM and LP

11.8    POSTMOD

11.9    COMP, LINK

11.10      QUANT and PQ

11.11      CSUBJ, COBJ, CPOBJ

11.12      CJCT and XJCT

11.13      CMOD and XMOD

11.14      BEG, BEGP, END, ENDP

11.15      COM and TAG

11.16      SRL, APP

11.17      NAME, DATE

11.18      INCROOT, OM

12       GRs for other languages

12.1    Spanish

12.2    Chinese

12.3    Japanese

13       IPSyn Rules


1       Introduction


This third volume of the TalkBank manuals deals with the use of the programs that perform automatic computation of the morphosyntactic structure of transcripts in CHAT format.  These manuals, the programs, and the TalkBank datasets can all be downloaded freely from

The first volume of the TalkBank manual describes the CHAT transcription format. The second volume describes the use of the CLAN data analysis programs. This third manual describes the use of the MOR, POST, POSTMORTEM, and MEGRASP programs to add a %mor and %gra line to CHAT transcripts.  The %mor line provides a complete part-of-speech tagging for every word indicated on the main line of the transcript.  The %gra line provides a further analysis of the grammatical dependencies between items in the %mor line.  These programs for morphosyntactic analysis are all built into CLAN. 

Users who do not wish to create or process information on the %mor and %gra lines will not need to read this manual.  However, researchers and clinicians interested in these features will need to know the basics of the use of these programs, as described in chapter 3.  The additional sections of this manual are directed to researchers who wish to extend or improve the coverage of MOR and GRASP grammars or who wish to build such grammars for languages that are not yet covered.


2       Morphosyntactic Coding

Linguists and psycholinguists rely on the analysis of morphosyntax to illuminate core issues in learning and development. Generativist theories have emphasized issues such as: the role of triggers in the early setting of a parameter for subject omission (Hyams & Wexler, 1993), evidence for advanced early syntactic competence (Wexler, 1998), evidence for the early absence of functional categories that attach to the IP node (Radford, 1990), the role of optional infinitives in normal and disordered acquisition (Rice, 1997), and the child’s ability to process syntax without any exposure to relevant data (Crain, 1991). Generativists have sometimes been criticized for paying inadequate attention to the empirical patterns of distribution in children’s productions.  However, work by researchers in this tradition, such as Stromswold (1994), van Kampen (1998), and Meisel (1986), demonstrates the important role that transcript data can play in evaluating alternative generative accounts.

Learning theorists have placed an even greater emphasis on the use of transcripts for understanding morphosyntactic development.  Neural network models have shown how cue validities can determine the sequence of acquisition for both morphological (MacWhinney & Leinbach, 1991; MacWhinney, Leinbach, Taraban, & McDonald, 1989; Plunkett & Marchman, 1991) and syntactic (Elman, 1993; Mintz, Newport, & Bever, 2002; Siskind, 1999) development.  This work derives further support from a broad movement within linguistics toward a focus on data-driven models (Bybee & Hopper, 2001) for understanding language learning and structure.  These accounts view constructions (Tomasello, 2003) and item-based patterns (MacWhinney, 1975) as the loci for statistical learning.

The study of morphosyntax also plays an important role in the study and treatment of language disorders, such as aphasia, specific language impairment, stuttering, and dementia. For this work, both researchers and clinicians can benefit from methods for achieving accurate automatic analysis of correct and incorrect uses of morphosyntactic devices.  To address these needs, the TalkBank system uses the MOR command to automatically generate candidate morphological analyses on the %mor tier, the POST command to disambiguate these analyses, and the MEGRASP command to compute grammatical dependencies on the %gra tier.

2.1      One-to-one correspondence

MOR creates a %mor tier with a one-to-one correspondence between words on the main line and words on the %mor tier. In order to achieve this one-to-one correspondence, the following rules are observed:

1.     Each word group (see below) on the %mor line is surrounded by spaces or an initial tab, so that it matches the corresponding space-delimited word group on the main line.  The correspondence matches each %mor word (morphological word) to a main line word in a left-to-right linear order in the utterance.

2.     Utterance delimiters are preserved on the %mor line to facilitate readability and analysis.  These delimiters should be the same as the ones used on the main line.

3.     Along with utterance delimiters, the satellite markers ‡ for the vocative and „ for tag questions or dislocations are also included on the %mor line in a one-to-one alignment format.

4.     Retracings and repetitions are excluded from this one-to-one mapping, as are nonwords such as xxx or strings beginning with &. When word repetitions are marked in the form word [x 3], the material in square brackets is stripped off and the word is considered as a single form.

5.     When a replacing form is indicated on the main line with the form [: text], the material on the %mor line corresponds to the replacing material in the square brackets, not the material that is being replaced. For example, if the main line has gonna [: going to], the %mor line will code going to.

6.     The [*] symbol that is used on the main line to indicate errors is not duplicated on the %mor line.
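The rules above can be approximated in a few lines of Python. This is only a sketch for checking alignment by hand, not the logic MOR itself uses. The function name alignable_words is hypothetical, and the regular expressions cover only the simple cases named in the rules (retracing codes, [: text] replacements, [x N] repetition counts, [*], forms beginning with &, and xxx, yyy, www).

```python
import re

UNINTELLIGIBLE = {"xxx", "yyy", "www"}

def alignable_words(main_line: str) -> list[str]:
    """Return the main-line tokens that should each have a %mor counterpart.

    Hypothetical helper: a rough approximation of rules 1-6, not MOR's code.
    """
    text = main_line
    # Rule 4: drop retraced material, i.e. a word or <group> followed by
    # [/], [//], or [///].
    text = re.sub(r"(<[^>]*>|\S+)\s*\[/{1,3}\]", " ", text)
    # Rule 5: "word [: text]" is coded as the replacing text.
    text = re.sub(r"\S+\s*\[:\s*([^\]]*)\]", r"\1", text)
    # Rules 4 and 6: remaining bracketed codes such as [x 3] and [*] are dropped.
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Rule 4: nonwords and forms beginning with & are skipped.
    # Rule 2: the utterance delimiter is kept as its own token.
    return [t for t in text.split()
            if not t.startswith("&") and t not in UNINTELLIGIBLE]

words = alignable_words("I [/] I gonna [: going to] eat &um xxx cookie [x 3] .")
print(words)
```

The length of the returned list can be compared with the number of space-delimited items on the %mor line to spot utterances where the two tiers have drifted apart.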

2.2      Tag Groups and Word Groups

On the %mor line, alternative taggings of a given word are clustered together in tag groups. These groups include the alternative taggings of a word that are produced by the MOR program.  Alternatives are separated by the ^ character. Here is an example of a tag group for one of the most ambiguous words in English:


After you run the POST program on your files, all of these alternatives will be disambiguated and each word will have only one alternative.  You can also use the hand disambiguation method built into the CLAN editor to disambiguate each tag group case by case.

The next level of organization for the MOR line is the word group.  Word groups are combinations marked by the preclitic delimiter $, the postclitic delimiter ~ or the compound delimiter +.  For example, the Spanish word dámelo can be represented as


This word group is a series of three words (verb~postclitic~postclitic) combined by the ~ marker. Clitics may be either preclitics or postclitics. Separable prefixes of the type found in German or Hungarian and other discontinuous morphemes can be represented as word groups using the preclitic delimiter $, as in this example for ausgegangen (“gone”):


Note the difference between the coding of the preclitic “aus” and the prefix “ge” in this example. Compounds are also represented as combinations, as in this analysis of angel+fish.


Here, the first characters (n|) represent the part of speech of the whole compound and the subsequent tags, after each plus sign, are for the parts of speech of the components of the compound.  Proper nouns are not treated as compounds.  Therefore, they take forms with underlines instead of pluses, such as Luke_Skywalker or New_York_City.
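For readers who process %mor lines in their own scripts, the delimiters described in this section can be handled with plain string splitting. The following Python sketch is illustrative only. The function names are hypothetical, and the sample strings are taken from examples that appear elsewhere in this manual.

```python
def split_tag_group(item: str) -> list[str]:
    """Split one %mor item into its alternative taggings (separated by ^)."""
    return item.split("^")

def split_word_group(analysis: str) -> dict:
    """Split one tagging into preclitics ($), the core word, and postclitics (~)."""
    *preclitics, rest = analysis.split("$")
    core, *postclitics = rest.split("~")
    return {"preclitics": preclitics, "core": core, "postclitics": postclitics}

def split_compound(core: str):
    """Return (part of speech of the compound, component words), or None.

    Assumes the compound layout pos|+component+component used in this manual.
    """
    if "|+" not in core:
        return None
    pos, components = core.split("|+", 1)
    return pos, components.split("+")

print(split_tag_group("part|spill-PASTP^v|spill-PAST"))
print(split_word_group("pro|it~v|be&3s"))
print(split_compound("n|+adj|black+n|board"))
```

Splitting in this order (tag group first, then clitics, then compound components) mirrors the levels of organization described above.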

2.3      Words

Beneath the level of the word group is the level of the word. The structure of each individual word is:






prefix#

part-of-speech|

stem

&fusionalsuffix

-suffix

=english (optional, underscore joins words)

There can be any number of prefixes, fusional suffixes, and suffixes, but there should be only one stem. Prefixes and suffixes should be given in the order in which they occur in the word. Since fusional suffixes are fused parts of the stem, their order is indeterminate. The English translation of the stem is not a part of the morphology, but is included for convenience for non-native speakers.   If the English translation requires two words, these words should be joined by an underscore as in “lose_flowers” for French défleurir.
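The word structure can be made concrete with a small parser. This is a minimal sketch, assuming that prefixes end in #, that the part-of-speech code (with optional colon-separated subcategories) ends in |, that fusional material follows &, that suffixes follow -, and that the optional English gloss follows =. The name parse_word is hypothetical, and stems that themselves contain a hyphen or ampersand are not handled.

```python
import re

def parse_word(word: str) -> dict:
    """Split one %mor word into its parts (hypothetical helper, simplified)."""
    body, _, gloss = word.partition("=")        # optional English gloss
    *prefixes, rest = body.split("#")           # each prefix ends in #
    pos, _, morph = rest.partition("|")         # part of speech | stem and affixes
    category, *subcategories = pos.split(":")   # e.g. n:v gives n and [v]
    stem = re.match(r"[^&-]*", morph).group(0)  # stem runs up to the first & or -
    affixes = re.findall(r"([&-])([^&-]+)", morph[len(stem):])
    return {
        "prefixes": prefixes,
        "category": category,
        "subcategories": subcategories,
        "stem": stem,
        "fusional": [a for mark, a in affixes if mark == "&"],
        "suffixes": [a for mark, a in affixes if mark == "-"],
        "english": gloss or None,
    }

for example in ("co#n:v|work-AGT-PL", "v|be&PAST&13s", "v|ess-INF=eat"):
    print(example, parse_word(example))
```

The three sample words are forms discussed in this manual: an English noun with a prefix and two suffixes, an irregular verb with two fusional codes, and a German verb with an English gloss.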

Now let us look in greater detail at the nature of each of these types of coding. Throughout this discussion, bear in mind that all coding is done on a word-by-word basis, where words are considered to be strings separated by spaces.

2.4      Part of Speech Codes

The morphological codes on the %mor line begin with a part-of-speech code. The basic scheme for the part-of-speech code is:


Additional fields can be added, using the colon character as the field separator. The subcategory fields contain information about syntactic features of the word that are not marked overtly. For example, you may wish to code the fact that Italian “andare” is an intransitive verb even though there is no single morpheme that signals intransitivity. You can do this by using the part-of-speech code v:intrans, rather than by inserting a separate morpheme.

In order to avoid redundancy, information that is marked by a prefix or suffix is not incorporated into the part-of-speech code, as this information will be found to the right of the | delimiter. These codes can be given in either uppercase, as in ADJ, or lowercase, as in adj. In general, CHAT codes are not case-sensitive.

The particular codes given below are the ones that MOR uses for automatic morphological tagging of English. Individual researchers will need to define a system of part-of-speech codes that correctly reflects their own research interests and theoretical commitments. Languages that are typologically quite different from English may have to use very different part-of-speech categories. Quirk, Greenbaum, Leech, and Svartvik (1985) explain some of the intricacies of part-of-speech coding.  Their analysis should be taken as definitive for all part-of-speech coding for English.  However, for many purposes, a more coarse-grained coding can be used.

The following set of top-level part-of-speech codes is the one used by the MOR program.  Additional refinements to this system can be found by studying the organization of the lexicon files for that program.  For example, in MOR, numbers are coded as types of determiners, because this is their typical usage.  The word “back” is coded as either a noun, verb, preposition, or adjective.  Further distinctions can be found by looking at the MOR lexicon.

Major Parts of Speech
















Infinitive marker to




Proper Noun














Auxiliary verb, including modals


WH words



2.5      Stems

Every word on the %mor tier must include a “lemma” or stem as part of the morpheme analysis. The stem is found on the right-hand side of the | delimiter, following any pre-clitics or prefixes. If the transcript is in English, this can be simply the canonical form of the word. For nouns, this is the singular. For verbs, it is the infinitive. If the transcript is in another language, it can be the English translation. A single form should be selected for each stem. Thus, the English indefinite article is coded as det|a with the lemma “a” whether the actual form of the article is “a” or “an.”


When English is not the main language of the transcript, the transcriber must decide whether to use English stems. Using English stems has the advantage that it makes the corpus more available to English-reading researchers. To show how this is done, take the German phrase “wir essen”:

*FRI:   wir essen.

%mor:   pro|wir=we v|ess-INF=eat .

Some projects may have reasons to avoid using English stems, even as translations. In this example, “essen” would be simply v|ess-INF. Other projects may wish to use only English stems and no target-language stems. Sometimes there are multiple possible translations into English. For example, German “Sie/sie” could be either “you,” “she,” or “they.”  Choosing a single English meaning for the stem helps fix the German form.

2.6      Affixes

Affixes and clitics are coded in the position in which they occur with relation to the stem. The morphological status of the affix should be identified by the following markers or delimiters: - for a suffix, # for a prefix, and & for fusional or infixed morphology.

The & is used to mark affixes that are not realized in a clearly isolable phonological shape. For example, the form “men” cannot be broken down into a part corresponding to the stem “man” and a part corresponding to the plural marker, because one cannot say that the vowel “e” marks the plural. For this reason, the word is coded as n|man&PL. The past forms of irregular verbs may undergo similar ablaut processes, as in “came,” which is coded v|come&PAST, or they may undergo no phonological change at all, as in “hit”, which is coded v|hit&PAST.  Sometimes there may be several codes indicated with the & after the stem. For example, the form “was” is coded v|be&PAST&13s.  Affix and clitic codes are based either on Latin forms for grammatical function or English words corresponding to particular closed-class items. MOR uses the following set of affix codes for automatic morphological tagging of English.


Inflectional Affixes for English




adjective suffix er, r


adjective suffix est, st


noun suffix ie


noun suffix s, es


noun suffix 's, '


verb suffix s, es


verb suffix ed, d


verb suffix ing


verb suffix en



Derivational Affixes for English




adjective and verb prefix un


adverbializer ly


nominalizer er


noun prefix ex


verb prefix dis


verb prefix mis


verb prefix out


verb prefix over


verb prefix pre


verb prefix pro


verb prefix re



2.7      Clitics

Clitics are marked by a tilde, as in v|parl&IMP:2S=speak~pro|DAT:MASC:SG for Italian “parlagli” and pro|it~v|be&3s for English “it's.” Note that part of speech coding with the | symbol is repeated for clitics after the tilde. Both clitics and contracted elements are coded with the tilde. The use of the tilde for contracted elements extends to forms like “sul” in Italian, “ins” in German, or “rajta” in Hungarian in which prepositions are merged with articles or pronouns.


Clitic Codes for English




noun phrase post-clitic 'd

v:aux|would, v|have&PAST

noun phrase post-clitic 'll


noun phrase post-clitic 'm

v|be&1S, v:aux|be&1S

noun phrase post-clitic 're

v|be&PRES, v:aux|be&PRES

noun phrase post-clitic 's

v|be&3S, v:aux|be&3S

verbal post-clitic n't


2.8      Compounds

Here are some words that we might want to treat as compounds: sweat+shirt, tennis+court, bathing+suit, high+school, play+ground, choo+choo+train, rock+'n'+roll, and sit+in. There are also many idiomatic phrases that could be best analyzed as linkages. Here are some examples: a_lot_of, all_of_a_sudden, at_last, for_sure, kind_of, of_course, once_and_for_all, once_upon_a_time, so_far, and lots_of.

On the %mor tier it is necessary to assign a part-of-speech label to each segment of the compound. For example, the word blackboard or black+board is coded on the %mor tier as n|+adj|black+n|board. Although the part of speech of the compound as a whole is usually given by the part of speech of the final segment, forms such as make+believe, which is coded as adj|+v|make+v|believe, show that this is not always true.

In order to preserve the one-to-one correspondence between words on the main line and words on the %mor tier, words that are not marked as compounds on the main line should not be coded as compounds on the %mor tier. For example, if the words “come here” are used as a rote form, then they should be written as “come_here” on the main tier. On the %mor tier this will be coded as v|come_here. It makes no sense to code this as v|come+adv|here, because that analysis would contradict the claim that this pair functions as a single unit. It is sometimes difficult to assign a part-of-speech code to a morpheme. In the usual case, the part-of-speech code should be chosen from the same set of codes used to label single words of the language. For example, some of these idiomatic phrases can be coded as compounds on the %mor line.


Phrases Coded as Linkages












2.9      Punctuation Marks

MOR can be configured to recognize certain punctuation marks as whole word characters.  In particular, the file punct.cut contains these entries:

      {[scat end]} "end"

      {[scat beg]} "beg"

,       {[scat cm]} "cm"

      {[scat bq]} "bq"

      {[scat eq]} "eq"

       {[scat bq]} "bq2"

       {[scat eq]} "eq2"

When the punctuation marks on the left occur in text, they are treated as separate lexical items and are mapped to forms such as beg|beg on the %mor tier.  The “end” marker is used to mark postposed forms such as tags and sentence final particles.  The “beg” marker is used to mark preposed forms such as vocatives and communicators.  The “bq” marks the beginning of a quote and the “eq” marks the end of a quote.  These special characters are important for correctly structuring the dependency relations for the GRASP program.

2.10  Sample Morphological Tagging for English

The following table describes and illustrates a more detailed set of word class codings for English. The %mor tier examples correspond to the labellings MOR produces for the words in question. It is possible to augment or simplify this set, either by creating additional word categories, or by adding additional fields to the part-of-speech label, as discussed previously.  The entries in this table and elsewhere in this manual can always be double-checked against the current version of the MOR grammar by typing “mor +xi” to bring up interactive MOR and then entering the word to be analyzed.


Word Classes for English




Coding of Examples




adjective, comparative

bigger, better

adj|big-CP, adj|good&CP

adjective, superlative

biggest, best

adj|big-SP, adj|good&SP




adverb, ending in ly



adverb, intensifying

very, rather

adv:int|very, adv:int|rather

adverb, post-qualifying

enough, indeed

adv|enough, adv|indeed

adverb, locative

here, then

adv:loc|here, adv:tem|then




conjunction, coord.

and, or

conj:coo|and, conj:coo|or

conjunction, subord.

if, although

conj:sub|if, conj:sub|although

determiner, singular

a, the, this

det|a, det|this

determiner, plural

these, those

det|these, det|those

determiner, possessive

my, your, her


infinitive marker



noun, common

cat, coffee

n|cat, n|coffee

noun, plural



noun, possessive



noun, plural possessive



noun, proper



noun, proper, plural



noun, proper, possessive



noun, proper, pl. poss.



noun, adverbial

home, west

n|home, adv:loc|home

number, cardinal



number, ordinal




all, both

post|all, post|both



prep|in, adv:loc|in

pronoun, personal

I, me, we, us, he

pro|I, pro|me, pro|we, pro|us

pronoun, reflexive

myself, ourselves


pronoun, possessive

mine, yours, his

pro:poss|mine, pro:poss:det|his

pronoun, demonstrative

that, this, these


pronoun, indefinite

everybody, nothing


pronoun, indef., poss.




half, all

qn|half, qn|all

verb, base form

walk, run

v|walk, v|run

verb, 3rd singular present

walks, runs

v|walk-3S, v|run-3S

verb, past tense

walked, ran

v|walk-PAST, v|run&PAST

verb, present participle

walking, running

part|walk-PRESP, part|run-PRESP

verb, past participle

walked, run

part|walk-PASTP, part|run&PASTP

verb, modal auxiliary

can, could, must

aux|can, aux|could, aux|must


Since it is sometimes difficult to decide what part of speech a word belongs to, we offer the following overview of the different part-of-speech labels used in the standard English grammar.


ADJectives modify nouns, either prenominally, or predicatively. Unitary compound modifiers such as good-looking should be labeled as adjectives.


ADVerbs cover a heterogeneous class of words including: manner adverbs, which generally end in -ly; locative adverbs, which include expressions of time and place; intensifiers that modify adjectives; and post-head modifiers, such as indeed and enough.


COmmunicators are used for interactive and communicative forms which fulfill a variety of functions in speech and conversation. Also included in this category are words used to express emotion, as well as imitative and onomatopoeic forms, such as ah, aw, boom, boom-boom, icky, wow, yuck, and yummy.


CONJunctions conjoin two or more words, phrases, or sentences. Examples include: although, because, if, unless, and until.


COORDinators include and, or, and as well as.  These can combine clauses, phrases, or words.


DETerminers include articles, and definite and indefinite determiners. Possessive determiners such as my and your are tagged det:poss.


INFinitive is the word “to” which is tagged inf|to.


INTerjections are similar to communicators, but they typically can stand alone as complete utterances or fragments, rather than being integrated as parts of the utterances.  They include forms such as wow, hello, good-morning, good-bye, please, thank-you.


Nouns are tagged with n for common nouns, and n:prop for proper nouns (names of people, places, fictional characters, brand-name products).


NEGative is the word “not” which is tagged neg|not.


NUMbers are labelled num for cardinal numbers. The ordinal numbers are adjectives.


Onomatopoeia are words that imitate the sounds of nature, animals, and other noises.


Particles are words that are often also prepositions, but are serving as verbal particles.


PREPositions are the heads of prepositional phrases. When a preposition is not a part of a phrase, it should be coded as a particle or an adverb.


PROnouns include a variety of structures, such as reflexives, possessives, personal pronouns, deictic pronouns, etc.


QUANTifiers include each, every, all, some, and similar items.


Verbs can be either main verbs, copulas, or auxiliaries.


3       Running the Program Chain


It is possible to construct a complete automatic morphosyntactic analysis of a series of CHAT transcripts through a single command in CLAN, once you have the needed programs in the correct configuration.  This command runs the MOR, POST, POSTMORTEM, and MEGRASP commands in an automatic sequence or chain. To do this, you follow these steps:

1.     Place all the files you wish to analyze into a single folder.

2.     Start the CLAN program (see Part 2 of the manual for instructions on installing CLAN).

3.     In CLAN’s Commands window, click on the button labelled Working to set your working directory to the folder that has the files to be analyzed.

4.     Under the File menu at the top of the screen, select Get MOR Grammar and select the language you want to analyze.  To do this, you must be connected to the Internet. If you have already done this once, you do not need to do it again.  By default, the MOR grammar you have chosen will download to your desktop.

5.     If you choose to move your MOR grammar to another location, you will need to use the Mor Lib button in the Commands window to tell CLAN where to find it.

6.     To analyze all the files in your Working directory folder, enter this command in the Commands window: mor *.cha

7.     CLAN will then run these programs in sequence: MOR, POST, POSTMORTEM, and MEGRASP. These programs will add %mor and %gra lines to your files.

8.     If this command ends with a message saying that some words were not recognized, you will probably want to fix them.  If you do not, some of the entries on the %mor line will be incomplete and the relations on the %gra line will be less accurate. If you have doubts about the spellings of certain words, you can look in the 0allwords.cdc file that is included in the /lex folder for each language.  The words there are listed in alphabetical order.

9.     To correct errors, you can run this command:  mor +xb *.cha. Guidelines for fixing errors are given in chapter 5 below.
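After the chain has run, it can be useful to confirm outside CLAN that every utterance received a %mor line. The following Python sketch is one way to do this. It is not part of CLAN, the function name utterances_missing_mor is hypothetical, and it assumes only the basic CHAT layout in which main lines begin with *, dependent tiers begin with %, and headers begin with @. Utterances that contain only excluded material may legitimately be reported, so the output should be read as a list of lines to inspect rather than a list of errors.

```python
def utterances_missing_mor(chat_text: str) -> list[str]:
    """Return main lines that are not followed by a %mor tier.

    Hypothetical checking helper, not a CLAN command.
    """
    missing = []
    current = None  # main line still waiting for its %mor tier
    for line in chat_text.splitlines():
        if line.startswith("*"):
            if current is not None:
                missing.append(current)
            current = line
        elif line.startswith("%mor:"):
            current = None
        elif line.startswith("@") and current is not None:
            # A header line ends the current utterance block.
            missing.append(current)
            current = None
    if current is not None:
        missing.append(current)
    return missing

sample = (
    "@Begin\n"
    "*CHI:\toops I spilled it .\n"
    "%mor:\tco|oops pro:subj|I v|spill-PAST pro:per|it .\n"
    "*MOT:\tuhoh .\n"
    "@End\n"
)
print(utterances_missing_mor(sample))
```

In practice the text would be read from each .cha file in the working directory, for example with pathlib.Path.read_text.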

4       Morphological Analysis

4.1      The Design of MOR

The computational design of MOR was guided by Roland Hausser’s (1990) MORPH system and was implemented by Mitzi Morris. Since 2000, Leonid Spektor has extended MOR in many ways.  Christophe Parisse built POST and POSTTRAIN (Parisse & Le Normand, 2000). Kenji Sagae built MEGRASP as a part of his dissertation work for the Language Technologies Institute at Carnegie Mellon University (Sagae, MacWhinney, & Lavie, 2004a, 2004b).  Leonid Spektor then integrated the program into CLAN.

The system has been designed to maximize portability across languages, extensibility of the lexicon and grammar, and compatibility with the CLAN programs. The basic engine of the parser is language independent. Language-specific information is stored in separate data files that can be modified by the user. The lexical entries are also kept in ASCII files and there are several techniques for improving the match of the lexicon to a corpus. To maximize the complete analysis of regular formations, only stems are stored in the lexicon and inflected forms appropriate for each stem are compiled at run time.

4.2      Example Analyses

To give an example of the results of a MOR analysis for English, consider this sentence from eve15.cha in Roger Brown’s corpus for Eve. 

*CHI:   oops I spilled it .

%mor:   co|oops pro:subj|I v|spill-PAST pro:per|it .

Here, the main line gives the child’s production and the %mor line gives the part of speech for each word, along with the morphological analysis of affixes, such as the past tense mark (-PAST) on the verb.  The %mor lines in these files were not created by hand.  To produce them, we ran the MOR command, using the MOR grammar for English, which can be downloaded using the Get MOR Grammar function described in the previous chapter. The command for running MOR by itself without running the rest of the chain is: mor +d *.cha. After running MOR, the file looks like this:

*CHI:  oops I spilled it .

%mor:  co|oops pro:subj|I part|spill-PASTP^v|spill-PAST pro:per|it .

In the %mor tier, words are labeled by their syntactic category or “scat”, followed by the pipe separator |, followed then by the stem and affixes. Notice that the word “spilled” is initially ambiguous between the past tense and participle readings. The two alternative readings are separated by the ^ character.  To resolve such ambiguities, we run a program called POST. The command is just “post *.cha”. After POST has been run, the %mor line will only have v|spill-PAST.

Using this disambiguated form, we can then run the MEGRASP program to create the representation given in the %gra line below:

*CHI:   oops I spilled it .

%mor:   co|oops pro:subj|I v|spill-PAST pro:per|it .

%gra:   1|3|COM 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|3|PUNCT

In the %gra line, we see that the second word “I” is related to the verb (“spilled”) through the grammatical relation (GR) of Subject.  The fourth word “it” is related to the verb through the grammatical relation of Object.  The verb is the Root and it is related to the “left wall” or item 0.
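Each item on the %gra line has the form index|head|relation, so the dependencies can be read off mechanically. The Python sketch below pairs each %gra item with the corresponding %mor item for the example above. The function name parse_gra is hypothetical, and the sketch assumes one %gra item per space-delimited %mor item, which holds for this example. Word groups that contain clitics may need to be split before pairing.

```python
def parse_gra(mor_tier: str, gra_tier: str) -> list[dict]:
    """Pair each %gra item with its %mor item (tier text without the %mor:/%gra: label).

    Hypothetical helper for illustration only.
    """
    mor_items = mor_tier.split()
    relations = []
    for item in gra_tier.split():
        index, head, label = item.split("|")
        index, head = int(index), int(head)
        relations.append({
            "index": index,
            "word": mor_items[index - 1],
            "head": head,
            # Head 0 is the "left wall", which has no %mor item.
            "head_word": mor_items[head - 1] if head > 0 else None,
            "relation": label,
        })
    return relations

mor = "co|oops pro:subj|I v|spill-PAST pro:per|it ."
gra = "1|3|COM 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|3|PUNCT"
for rel in parse_gra(mor, gra):
    print(rel["word"], "--" + rel["relation"] + "->", rel["head_word"])
```

Run on the example, this shows the subject and object both pointing to v|spill-PAST, and the verb itself pointing to the left wall.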

4.3      Exclusions in MOR

Because MOR focuses on the analysis of the target utterance, it excludes a variety of non-words, retraces, and special symbols. Specifically, MOR excludes:

1.     Items that start with &

2.     Pauses such as (.)

3.     Unknown forms marked as xxx, yyy, or www

4.     Data associated with these codes: [/?],  [/-], [/], [//], and [///].

4.4      Unique Options

+d     do not run POST command automatically.  POST will run automatically after MOR, unless this switch is used or unless the folder name includes the word “train”.


+eS    Show the result of the operation of the arules on either a stem S or stems in file @S.  This output will go into a file called debug.cdc in your library directory.  Another way of achieving this is to use the +d option inside “interactive MOR”


+p     use pinyin lexicon format for Chinese


+xi     Run MOR in the interactive test mode. You type in one word at a time to the test prompt and MOR provides the analysis on line.  This facility makes the following commands available in the CLAN Output window:

     word - analyze this word

     :q  quit- exit program

     :c  print out current set of crules

     :d  display application of arules.

     :l  re-load rules and lexicon files

     :h  help - print this message


If you type in a word, such as “dog” or “perro,” MOR will try to analyze it and give you its component morphemes.  If you change the rules or the lexicon, use :l to reload and retest.  The :c and :d switches will send output to a file called debug.cdc in your library directory.


+xl     Run MOR in the lexicon building mode. This mode takes a series of .cha files as input and outputs a small lexical file with the extension .ulx with entries for all words not recognized by MOR. This helps in the building of lexicons.


+xb   check lexicon mode, include word location in data files

+xa    check lexicon for ambiguous entries

+xc    check lexicon mode, including capitalized words

+xd   check lexicon for compound words conflicting with plain words

+xp   check lexicon mode, including words with prosodic symbols

+xy    analyze words in lex files

4.5      Categories and Components of MOR

MOR breaks up words into their component parts or morphemes.  In a relatively analytic language like English, many words require no analysis at all.  However, even in English, a word like “coworkers” can be seen to contain four component morphemes, including the prefix “co”, the stem, the agential suffix, and the plural.  For this form, MOR will produce the analysis: co#n:v|work-AGT-PL.  This representation uses the symbols # and - to separate the four different morphemes.  Here, the prefix stands at the beginning of the analysis, followed by the stem (n:v|work), and the two suffixes.  In general, stems always have the form of a part of speech category, such as “n” for noun, followed by the vertical bar and then a statement of the stem’s lexical form.


To understand the functioning of the MOR grammar for English, the best place to begin is with a tour of the files inside the ENG folder that you can download from the server.  At the top level, you will see these files:

1.     ar.cut – These are the rules that generate allomorphic variants from the stems and affixes in the lexical files.

2.     cr.cut – These are the rules that specify the possible combinations of morphemes going from left to right in a word.

3.     debug.cdc – This file holds the complete trace of an analysis of a given word by MOR.  It always holds the results of the most recent analysis.  It is mostly useful for people who are developing new ar.cut or cr.cut files as a way of tracing out or debugging problems with these rules.

4.     docs – This is a folder containing a file of instructions on how to train POST and a list of tags and categories used in the English grammar.

5.     post.db – This is a file used by POST and should be left untouched.

6.     ex.cut – This file includes analyses that are being “overgenerated” by MOR and should simply be filtered out or excluded whenever they occur.

7.     lex – This folder contains many files listing the stems and affixes of the language.  We will examine it in greater detail below.

8.     sf.cut – This file tells MOR how to deal with words that end with certain special form markers such as @b for babbling.

When examining these files and others, please note that the exact shapes of the files, the word listings, and the rules will change over time.  We recommend that users glance through these various files to understand their contents.


The first action of the parser program is to load the ar.cut file. Next the program reads in the files in your lexicon folder and uses the rules in ar.cut to build the run-time lexicon. Once the run-time lexicon is loaded, the parser then reads in the cr.cut file. Additionally, if the +b option is specified, the dr.cut file is also read in. Once the concatenation rules have been loaded, the program is ready to analyze input words. As a user, you do not need to concern yourself about the run-time lexicon. Your main concern is about the entries in the lexicon files. For languages that already have a MOR grammar, the rules in the ar.cut and cr.cut files are only of concern if you wish to have a set of analyses and labelings that differs from the one given in the chapter on morphosyntactic coding, or if you are trying to write a new set of grammars for some language.

4.6      MOR Part-of-Speech Categories

The final output of MOR on the %mor line uses two sets of categories: part-of-speech (POS) names and grammatical categories.  To survey the part-of-speech names for English, we can take a look at the files contained inside the /lex folder.  These files break out the possible words of English into different files for each specific part of speech or compound structure.  Because these distinctions are so important to the correct transcription of child language and the correct running of MOR, it is worthwhile to consider the contents of each of these various files.  As the following table shows, about half of these word types involve different part of speech configurations within compounds. This analysis of compounds into their part of speech components is intended to further study of the child’s learning of compounds as well as to provide good information regarding the part of speech of the whole. The name of the compound files indicates their composition.  For example, the name adj+n+adj.cut indicates compounds with a noun followed by an adjective (n+adj) whose overall function is that of an adjective. This means that it is treated just as an adjective (adj) by the MOR and GRASP programs.  In English, the part of speech of a compound is usually the same as that of the last component of the compound.

A few additional part of speech (POS) categories are introduced by the 0affix.cut file.  These include: n-cl (noun clitic), v-cl (verb clitic), part (participle), and n:gerund (gerund). Additional categories on the %mor line are introduced from the special form marker file called sf.cut.  The meanings of these various special form markers are given in the CHAT manual.  Finally, the punctuation codes bq, eq, end, beg, and cm are the POS codes used for the special character marks given in the punct.cut file.


File (.cut) | POS      | Function               | Examples
            |          | prefixes and suffixes  | see expanded list below
            |          | terms local to the UK  | fave, doofer, sixpence
            |          | baby talk adjectives   | dipsy, yumsy
            |          | baby talk doubles      | nice+nice, pink+pink
            |          | irregular adjectives   | better, furthest
            |          | ordinal numerals       |
            |          | predicative adjectives | abreast, remiss
            |          | combined adjectives    | close_by, lovey_dovey
            |          | regular adjectives     | tall, redundant
            |          |                        | half+hearted, hot+crossed
            |          |                        | super+duper, easy+peasy
            |          |                        | dog+eared, stir+crazy
            |          |                        | make+believe, see+through
            |          | temporal adverbs       | tomorrow, tonight, anytime
            |          | combined adverbs       | how_about, as_well
            |          | wh term                | where, why
            |          | regular adverbs        | ajar, fast, mostly
            |          |                        | half+off, slant+wise
            |          |                        | half+way, off+shore
            |          | Cantonese forms        | wo, wai, la
            |          |                        | honey, dear, sir
            |          | rhymes, onomatopoeia   |
            |          | multiword phrases      | by_jove, gee_whiz
            |          | regular communicators  | blah, byebye, gah, no
            |          | combined conjunctions  | even_though, in_case_that
            |          |                        | and, although, because
            | det, art | deictic determiners    | this, that, the
            |          |                        | two, twelve
            |          |                        | c_d, t_v, w_c
            |          | babytalk forms         | passie, wawa, booboo
            |          | noun combinations      | cul_de_sac, seven_up
            |          | duplicate nouns        | cow+cow, chick_chick
            |          | irregular nouns        | children, cacti, teeth
            |          | loan words             | goyim, amigo, smuck
            |          | nouns with no singular | golashes, kinesics, scissors
            |          | regular nouns          | dog, corner, window
            |          |                        | big+shot, cutie+pie
            |          |                        | four+by+four, dot+to+dot
            |          |                        | quack+duck, moo+cow
            |          |                        | candy+bar, foot+race
            |          |                        | children+bed, dog+fish
            |          |                        | wee+wee, meow+meow
            |          |                        | squirm+worm, snap+bead
            |          |                        | chin+up, hide+out
            |          |                        | boom, choo_choo
            |          |                        | cluck+cluck, knock+knock
            |          |                        | all, too
            |          | combined prepositions  | out_of, in_between
            |          |                        | under, minus
            |          | demonstrative pronouns | this, that
            |          | indefinite pronouns    | everybody, few
            | see file | personal pronouns      | he, himself
            |          | possessive pronouns    | hers, mine
            |          | possessive determiners | her, my
            |          | interrogative pronouns | who, what
            |          |                        | some, all, only, most
            |          |                        | that, which
            | inf, neg | small forms            | not, to, xxx, yyy
            |          |                        | had, getting
            |          | baby verbs             | wee, poo
            |          | cliticized forms       | gonna, looka
            |          |                        | be, become
            |          | verb duplications      | eat+eat, drip+drip
            |          | irregular verbs        | came, beset, slept
            |          | modal auxiliaries      | hafta, gotta
            |          |                        | can, ought
            |          | regular verbs          | run, take, remember
            |          |                        | deep+fry, tippy+toe
            |          |                        | bunny+hop, sleep+walk
            |          | omitted words          | 0know, 0conj, 0n, 0is


The construction of these lexicon files involves a variety of decisions. Here are some of the most important issues to consider.

1.            Words may often appear in several files.  For example, virtually every noun in English can also function as a verb.  However, when this function is indicated by a suffix, as in “milking”, the noun can be recognized as a verb through a process of morphological derivation contained in a rule in the cr.cut file.  In such cases, it is not necessary to list the word as a verb.  Of course, this process fails for unmarked verbs.  However, it is generally not a good idea to represent all nouns as verbs, since this tends to overgenerate ambiguity.  Instead, it is possible to use the POSTMORTEM program to detect cases where nouns are functioning as bare verbs.

2.            If a word can be analyzed morphologically, it should not be given a full listing.  For example, since “coworker” can be analyzed by MOR into three morphemes as co#n:v|work-AGT, it should not be separately listed in the n.cut file.  If it is, then POST will not be able to distinguish co#n:v|work-AGT from n|coworker.

3.            In the zero.cut file, possible omitted words are listed without the preceding 0.  For example, there is an entry for “conj” and “the”.  However, in the transcript, these would be represented as “0conj” and “0the”.

4.            It is always best to use spaces to break up word sequences that are just combinations of words.  For example, instead of transcribing 1964 as “nineteen+sixty+four”, “nineteen-sixty-four”, or “nineteen_sixty_four”, it is best to transcribe simply as “nineteen sixty four”.  This principle is particularly important for Chinese, where there is a tendency to underutilize spaces, since Chinese itself is written without spaces.

5.            For most languages that use Roman characters, you can rely on capitalization to force MOR to treat words as proper nouns.  To understand this, take a look at the forms in the sf.cut file at the top of the MOR directory.  These various entries tell MOR how to process forms like k@l for the letter “k” or John_Paul_Jones for the famous admiral.  The symbol \c indicates that a form is capitalized and the symbol \l indicates that it is lowercase.

4.7      MOR Grammatical Categories

In addition to the various part-of-speech categories provided by the lexicon, MOR also inserts a series of grammatical categories, based on the information about affixes in the 0affix.cut file, as well as information inserted by the a-rules and c-rules.  If the category is regularly attached, it is preceded by a dash.  If it is irregular, it uses an ampersand. For English, the inflectional categories are:

nominal plural

past tense

present participle

past participle

first singular

third singular present

first and third

In addition to these inflectional categories, English uses these derivational morphemes:

deverbal, denominal

In these examples, the features dn, dv, and dadj indicate derivation of the forms from nouns, verbs, or adjectives.


Other languages use many of these same features, but with many additional ones, particularly for highly inflecting languages.  Sometimes these are lowercase and sometimes upper.  Here are some examples:



















































4.8      Compounds and Complex Forms

The lexical files include many special compound files such as n+n+n.cut or v+n+v.cut. Compounds are listed in the lexical files according to both their overall part of speech (X-bar) and the parts of speech of their components.  However, there are six types of complex word combinations that should not be treated as compounds.

  1. Underscored words.  The n-under.cut file includes 40 forms that resemble compounds, but are best viewed as units with non-morphemic components.  For example, kool_aid and band_aid are not analytic combinations of morphemes, although they clearly have two components.  The same is true for hi_fi and coca_cola.  In general, MOR and CLAN pay little attention to the underscore character, so it can be used as needed when a plus for compounding is not appropriate. The underscore mark is particularly useful for representing the combinations of words found in proper nouns such as John_Paul_Jones, Columbia_University, or The_Beauty_and_the_Beast.  If these words are capitalized, they do not need to be included in the MOR lexicon, since all capitalized words are taken as proper nouns in English.  However, these forms cannot contain pluses, since compounds are not proper nouns.  And please be careful not to overuse this form.
  2. Separate words.  Many noun-noun combinations in English should just be written out as separate words.  An example would be “faucet stem assembly rubber gasket holder”. It is worth noting here that German treats all such forms as single words. This means that different conventions have to be adopted for German in order to avoid the need for exhaustive listing of the infinite number of German compound nouns.
  3. Spelling sequences.  Sequences of letter names such as “O-U-T” for the spelling of “out” are transcribed with the suffix @k, as in out@k.
  4. Acronyms. Forms such as FBI are transcribed with underscores, as in F_B_I.  Presence of the initial capital letter tells MOR to treat F_B_I as a proper noun. This same format is used for non-proper abbreviations such as c_d or d_v_d. 
  5. Products.  Coming up with good forms for commercial products such as Coca-Cola is tricky.  Because of the need to ban the use of the dash on the main line, we have avoided the use of the dash in these names.  They should not be treated as compounds, as in coca+cola, and compounds cannot be capitalized, so Coca+Cola is not possible.  This leaves us with the option of either coca_cola or Coca_Cola.  The option coca_cola seems best, since this is not a proper noun.
  6. Babbling and word play.  In earlier versions of CHAT and MOR, transcribers often represented sequences of babbling or word play syllables as compounds.  This was done mostly because the plus provides a nice way of separating out the separate syllables in these productions.  To make it clear that these separations are simply marked for purposes of syllabification, we now ask transcribers to use forms such as ba^ba^ga^ga@wp or choo^bung^choo^bung@o to represent these patterns.

The introduction of this more precise system for transcription of complex forms opens up additional options for programs like MLU, KWAL, FREQ, and GRASP.  For MLU, compounds will be counted as single words, unless the plus sign is added to the morpheme delimiter set using the +b+ option switch.  For GRASP, processing of compounds only needs to look at the overall part of speech of the compound, since the internal composition of the compound is not relevant to the syntax.  Additionally, forms such as "faucet handle valve washer assembly" do not need to be treated as compounds, since GRASP can learn to treat sequences of nouns as complex phrases headed by the final noun.

4.9      Errors and Replacements

Transcriptions on the main line have to serve two, sometimes conflicting (Edwards, 1992), functions.  On the one hand, they need to represent the form of the speech as actually produced.  On the other hand, they need to provide input that can be used for morphosyntactic analysis.  When words are pronounced in their standard form, these two functions are in alignment.  However, when words are pronounced with phonological or morphological errors, it is important to separate out the actual production from the morphological target.  This can be done through a system for main line tagging of errors.  This system largely replaces the coding of errors on a separate %err line, although that form is still available, if needed.  The form of the newer system is illustrated here:


*CHI:  him [* case] ated [: ate] [* +ed-sup] a f(l)ower and a pun [: bun].


For the first error, there is no need to provide a replacement, since MOR can process “him” as a standard pronoun.  However, since the second word is not a real word form, the replacement is necessary in order to tell MOR how to process the form.  The third error is just an omission of “l” from the cluster and the final error is a mispronunciation of the initial consonant. Phonological errors are not coded here, since that level of analysis is best conducted inside the Phon program (Rose et al., 2005).
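
To make the division of labor between the actual production and the morphological target concrete, here is a minimal Python sketch that derives the target word sequence from a coded main line. It is not CLAN's own parser: the function name morph_targets is hypothetical, and the sketch assumes only the three devices shown in the example above, namely [* code] error codes, [: target] replacements, and parentheses around omitted material.

```python
import re


def morph_targets(main_line):
    """Return the word forms that morphological analysis should look up.

    Hypothetical helper: handles only [* ...] codes, [: ...] replacements,
    and parenthesized omitted material.
    """
    # Drop the speaker code (e.g. '*CHI:') and the utterance terminator.
    text = main_line.split(":", 1)[1].strip().rstrip(".?!").strip()
    # Error codes of the form [* ...] annotate a word but do not replace it.
    text = re.sub(r"\[\*[^\]]*\]", " ", text)
    # 'word [: target]' means that the target replaces the produced word.
    text = re.sub(r"\S+\s*\[:\s*([^\]]+)\]", r"\1", text)
    # Parentheses mark omitted material; restoring it yields the full word.
    text = text.replace("(", "").replace(")", "")
    return text.split()


line = "*CHI:  him [* case] ated [: ate] [* +ed-sup] a f(l)ower and a pun [: bun]."
print(morph_targets(line))
```

For the sample utterance, the sketch returns the sequence him, ate, a, flower, and, a, bun, which is the input that the replacements and parentheses are designed to give the analyzer.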

4.10  Affixes

The inflectional and derivational affixes of English are listed in the 0affix.cut file. 

1.     This file begins with a list of prefixes such as “mis” and “semi” that attach either to nouns or verbs. Each prefix also has a permission feature, such as [allow mis].  This feature only comes into play when a noun or verb in n.cut or v.cut also has the feature [pre no].  For example, the verb “test” has the feature [pre no] included in order to block prefixing with “de-” to produce “detest”, which is not a derivational form of “test”.  However, because we still want to permit prefixing with “re-”, the entry for “test” has [pre no][allow re].  Then, when the relevant rule in cr.cut sees a verb following “re-”, it checks for a match in the [allow] feature and allows the attachment in this case.

2.     Next we see some derivational suffixes such as diminutive –ie or agential –er.  Unlike the prefixes, these suffixes often change the spelling of the stem by dropping silent e or doubling final consonants.  The ar.cut file controls this process, and the [allo x] features listed there control the selection of the correct form of the suffix.

3.     Each suffix is represented by a grammatical category in parentheses.  These categories are taken from a typologically valid list given in the CHAT Manual.

4.     Each suffix specifies the grammatical category of the form that will result after its attachment.  For suffixes that change the part of speech, this is given in the scat, as in [scat adj:n].  Prefixes do not change parts of speech, so they are simply listed as [scat pfx] and use the [pcat x] feature to specify the shape of the forms to which they can attach.

5.     The long list of suffixes concludes with a list of cliticized auxiliaries and reduced main verbs.  These forms are represented in English as contractions.  Many of these forms are multiply ambiguous and it will be the job of POST to choose the correct reading from among the various alternatives.
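
The interaction between [pre no] and [allow x] described in item 1 can be modeled in a few lines of Python. This is only a toy model of the decision, not the actual cr.cut rule: the function name prefix_allowed and the dictionary used to hold a stem's features are assumptions made for this example.

```python
def prefix_allowed(prefix, stem_features):
    """Decide whether a prefix may attach to a stem.

    stem_features is a hypothetical dictionary such as
    {"pre": "no", "allow": {"re"}}, standing in for [pre no][allow re].
    """
    # Without [pre no], the stem places no restriction on prefixes.
    if stem_features.get("pre") != "no":
        return True
    # With [pre no], only prefixes named by an [allow x] feature may attach.
    return prefix in stem_features.get("allow", set())


test_entry = {"pre": "no", "allow": {"re"}}   # models: test [pre no][allow re]
print(prefix_allowed("re", test_entry))       # prefixing with re- is permitted
print(prefix_allowed("de", test_entry))       # prefixing with de- is blocked
```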

4.11  Control Features and Output Features

The lexical files include several control features that specify how stems should be treated.  One important set includes the [comp x+x] features for compounds. This feature controls how compounds will be unpacked for formatting on the %mor line.  Irregular adjectives in adj-ir.cut have features specifying their degree as comparative or superlative. Irregular nouns have features controlling the use of the plural.  Irregular verbs have features controlling consonant doubling [gg +] and the formation of the perfect tense. Features like [block ed] are used to prevent recognition of overregularized forms such as goed.

There are also a variety of features that are included in lexical entries, but not necessarily present in the final output.  For example, the feature of gender is used to determine patterns of suffixation in Spanish, but to include this feature in the output it must be present and not commented in the output.cut file.  Other lexical features of this type include root, ptn, num, tense, and deriv.

5       Correcting errors

When running mor on a new set of chat files, it is important to make sure that mor will be able to recognize all the words in these files.  A first step in this process involves running the CHECK program to see if all the words follow basic CHAT rules, such as not including numbers or capital letters in the middle of words. There are several common reasons for a word not being recognized:

1.     It is misspelled.  If you have doubts about the spellings of certain words, you can look in the 0allwords.cdc file that is included in the /lex folder for each language.  The words there are listed in alphabetical order.

2.     The word should be preceded by an ampersand & to block lookup through MOR. There are four forms using the ampersand.  Nonwords just take the & alone, as in &gaga.  Incomplete words should be transcribed as &+text, as in &+sn for the beginning of snake.  Filler words should be transcribed as &-uh. Finally, sounds like laughing can be transcribed as &=laughs, as described more extensively in the CHAT manual.

3.     The word should have been transcribed with a special form marker, as in bobo@o or bo^bo@o for onomatopoeia.  It is impossible to list all possible onomatopoeic forms in the MOR lexicon, so the @o marker solves this problem by telling MOR how to treat the form. This approach will be needed for other special forms, such as babbling, word play, and so on.

4.     The word was transcribed in “eye-dialect” to represent phonological reductions.  When this is done, there are two basic ways to allow MOR to achieve correct lookup. If the word can be transcribed with parentheses for the missing material, as in “(be)cause”, then MOR will be happy.  This method is particularly useful in Spanish and German.  Alternatively, if there is a sound substitution, then you can transcribe using the [: text] replacement method, as in “pittie [: kittie]”.

5.     You should treat the word as a proper noun by capitalizing the first letter.  This method works for many languages, but not in German where all nouns are capitalized and not in Asian languages, since those languages do not have systems for capitalization.

6.     The stem is in the lexicon, but the inflected form is not recognized.  In this case, it is possible that one of the analytic rules of MOR is not working.  These problems can be reported to

7.     The stem or word is missing from MOR.  In that case, you can create a file called something like 0add.cut in the /lex folder of the MOR grammar.  Once you have accumulated a collection of such words, you can email them to for permanent addition to the lexicon.
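
The four ampersand forms listed in item 2 can be told apart mechanically by their first two characters. The following Python sketch shows one way to do this when scanning a list of unrecognized words; the function name ampersand_type and the category labels it returns are assumptions made for this example, not CLAN terminology.

```python
def ampersand_type(token):
    """Classify a main-line token by its ampersand prefix (illustrative only)."""
    if not token.startswith("&"):
        return "word"
    # Test the two-character prefixes before falling back to the bare '&'.
    if token.startswith("&+"):
        return "incomplete word"
    if token.startswith("&-"):
        return "filler"
    if token.startswith("&="):
        return "sound"
    return "nonword"


for token in ["&gaga", "&+sn", "&-uh", "&=laughs", "snake"]:
    print(token, "->", ampersand_type(token))
```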

Some of these forms can be corrected during the initial process of transcription by running CHECK.  However, others will not be evident until you run the MOR command with +xb or +xl and get a list of unrecognized words. 

To correct these problems, there are basically two possible tools.  The first is the KWAL program built in to CLAN.  Let us say that your filename.ulx.cex list of unrecognized words has the form “cuaght” as a misspelling of “caught.”  Let us further imagine that you have a single collection of 80 files in one folder.  To correct this error, just type this command into the Commands window:

kwal *.cha +scuaght

KWAL will then send output to your screen as it goes through the 80 files.  There may be no more than one case of this misspelling in the whole collection.  You will see this as the output scrolls by.  If necessary, just scroll back in the CLAN Output window to find the error and then triple click to go to the spot of the error and then retype the word correctly.

For errors that are not too frequent, this method works fairly well.  However, if you have made some error consistently and frequently, you may need stronger methods.  Perhaps you transcribed “byebye” as “bye+bye” as many as 60 times.  In this case, you could use the CHSTRING program to fix this, but a better method would involve the use of a Programmer’s Editor system such as BBEdit for the Mac or Epsilon for Windows.  Any system you use must include an ability to process Regular Expressions (RegExp) and to operate smoothly across whole directories at a time.  However, let me give a word of warning about the use of more powerful editors.  When using these systems, particularly at first, you may make some mistakes.  Always make sure that you keep a backup copy of your entire folder before each major replacement command that you issue.
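
A directory-wide replacement of this kind, with the backup made first, can also be scripted. The Python sketch below is one possible approach rather than a CLAN facility: the function name replace_in_corpus, the backup folder suffix, and the folder name in the usage note are all assumptions made for this example. Note that it rewrites every matching string in each file, including any on dependent tiers, so the pattern should be as specific as possible.

```python
import re
import shutil
from pathlib import Path


def replace_in_corpus(folder, pattern, replacement, backup_suffix="_backup"):
    """Back up a folder of .cha files, then apply one regex replacement.

    Returns the total number of replacements made.
    """
    folder = Path(folder)
    backup = folder.with_name(folder.name + backup_suffix)
    # Copy the whole folder first. copytree raises FileExistsError if the
    # backup already exists, so an earlier backup is never overwritten.
    shutil.copytree(folder, backup)

    total = 0
    for path in sorted(folder.glob("*.cha")):
        # newline="" keeps the file's original line endings untouched.
        with open(path, encoding="utf-8", newline="") as handle:
            text = handle.read()
        new_text, count = re.subn(pattern, replacement, text)
        if count:
            with open(path, "w", encoding="utf-8", newline="") as handle:
                handle.write(new_text)
            total += count
    return total
```

For the case mentioned above, a call such as replace_in_corpus("mycorpus", r"\bbye\+bye\b", "byebye") would copy the hypothetical folder mycorpus to mycorpus_backup and then change each “bye+bye” to “byebye”.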

Once you know that a corpus passes CHECK, you will want to see whether it contains words that are either misspelled or not yet in the MOR lexicon.  You do this by running the command:

mor +xb *.cha

The output from this command will have the extension .ulx.cex.  After running the command, its name will appear at the end of the output in the CLAN Output window.  If that window tells you that “all words were found in the lexicon”, then you can proceed with running

mor *.cha

However, if not all words are recognized, you can triple-click on the line listing the “Output File” and it will open the list of words not yet recognized by MOR. In any large corpus, it is extremely unlikely that every word would be listed in even the largest mor lexicon. Therefore, users of mor need to understand how to supplement the basic lexicons with additional entries. Before we look at the process of adding new words to the lexicon, we first need to examine the way in which entries in the disk lexicon are structured.

The disk lexicon contains irregular forms of a word as well as the stems of regular forms. For example, the verb “go” is stored in the disk lexicon, along with the past tense “went,” since this latter form is suppletive and does not undergo regular rules. The disk lexicon contains a series of files each with a series of lexical entries with one entry per line. The lexicon may be annotated with comments, which will not be processed. A comment begins with the percent sign and ends with a new line.  A lexical entry consists of these parts:

1.     The surface form of the word.

2.     Category information about the word, expressed as a set of feature-value pairs. Each feature-value pair is enclosed in square brackets and the full set of feature-value pairs is enclosed in curly braces. All entries must contain a feature-value pair that identifies the syntactic category to which the word belongs, consisting of the feature “scat” with an appropriate value.

3.     Following the category information is information about the lemmatization of irregular forms.  This information is given by having the citation form of the stem followed by the & symbol as the morpheme separator and then the grammatical morphemes it contains.

4.     Finally, if the grammar is for a language other than English, you can enter the English translation of the word preceded by and followed by the = sign.


The following are examples of lexical entries:

can     {[scat v:aux]}

a       {[scat det]}

an      {[scat det]}      "a"

go      {[scat v] [ir +]}

went    {[scat v] [tense past]}       "go&PAST"

When adding new entries to the lexicon it is usually sufficient to enter the citation form of the word, along with the syntactic category information, as in the illustration for the word “a” in the preceding examples.  When working with languages other than English, you may wish to add English glosses and even special character sets to the lexicon.  For example, in Cantonese, you could have this entry:

ping4gwo2        {[scat n]} =apple=

To illustrate this, here is an example of the MOR output for an utterance from Cantonese:

*CHI:   sik6 ping4gwo2 caang2 hoeng1ziu1 .

%mor:   v|sik6=eat n|ping4gwo2=apple

        n|caang2=orange n|hoeng1ziu1=banana .

In languages that use both Roman and non-Roman scripts, such as Chinese, you may also want to add non-Roman characters after the English gloss.  This can be done using this form in which the $ sign separates the English gloss from the representation in characters.

pinyin  {[scat x]} "lemmatization" =gloss$characters=

MOR will take the forms indicated by the lemmatization, the gloss, and the characters and append them after the category representation in the output.  The gloss should not contain spaces or the morpheme delimiters +, -,  and #.  Instead of spaces or the + sign, you can use the underscore character to represent compounds.
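
The parts of a lexical entry described above can be picked apart with a short regular expression. The Python sketch below is an illustration of the entry layout, not the code MOR uses to read its lexicon: the names ENTRY and parse_entry and the keys of the returned dictionary are assumptions, and the sketch expects every feature to be a simple [name value] pair, as in the examples shown.

```python
import re

ENTRY = re.compile(
    r'^(?P<surface>\S+)\s+'           # surface form of the word
    r'\{(?P<features>[^}]*)\}'        # {[feature value] ...}
    r'(?:\s+"(?P<lemma>[^"]*)")?'     # optional "stem&MORPHEME" lemmatization
    r'(?:\s+=(?P<gloss>[^=]*)=)?'     # optional =gloss= or =gloss$characters=
)


def parse_entry(line):
    """Parse one lexicon line into its parts; return None for blank lines."""
    # A comment begins with the percent sign and runs to the end of the line.
    line = line.split("%", 1)[0].strip()
    if not line:
        return None
    match = ENTRY.match(line)
    if match is None:
        raise ValueError("not a lexical entry: " + line)
    # Each [name value] pair inside the braces becomes one dictionary item.
    pairs = re.findall(r"\[([^\]]+)\]", match["features"])
    features = dict(pair.split(None, 1) for pair in pairs)
    # The $ sign separates the English gloss from non-Roman characters.
    gloss, _, characters = (match["gloss"] or "").partition("$")
    return {
        "surface": match["surface"],
        "features": features,
        "lemma": match["lemma"],
        "gloss": gloss or None,
        "characters": characters or None,
    }


print(parse_entry('went    {[scat v] [tense past]}       "go&PAST"'))
print(parse_entry("ping4gwo2        {[scat n]} =apple="))
```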

5.1      Lexicon Building

Instead of running the command in the form mor +xb, you may wish to run it in the form mor +xl.  The +xb form lists each separate token of an unrecognized word, whereas the +xl form combines all the tokens into a single type.  The advantage of the +xb format is that you can click on each occurrence and change it.  However, for very common errors, the +xl format is useful because it will allow you to see what forms should be changed globally using the CHSTRING command.


When working with the output of the +xb form, you must then go through this file and determine whether to discard, complete, or modify each missing case. For example, it may be impossible to decide what category “ta” belongs to without examining where it occurs in the corpus. In this example, a scan of the Sarah files in the Brown corpus (from which these examples were taken), reveals that “ta” is a variant of the infinitive marker “to”:


*MEL:   yeah # (be)cause if it's gon (t)a be a p@l it's

         got ta go that way.

This missing form can be repaired by joining got and ta into gotta, because that form is listed in the lexicon.  Alternatively, the sequence can be coded as here:

*MEL:   yeah # (be)cause if it's gon (t)a be a p@l it's

         gotta [: got to] go that way.

Another common source of error is misspelling.  This can be repaired by correcting the spelling.

In many other cases, you will find that some words are just missing from the lexicon.  For these, you can create a file with a name like 0morewords.cut which you add to the files in /lex.  After doing this, please send the contents of this file to, so that I can add these missing words to the authoritative version of the lexicon.

5.2      Disambiguator Mode

When POST works smoothly, there is little need for hand disambiguation.  However, ambiguities within a given part of speech cannot be resolved by POST and must be disambiguated by hand using Disambiguator Mode.   Also, when developing POST for a new language, you may find this tool useful. Toggling the Disambiguator Mode option in the Mode menu allows you to go back and forth between Disambiguator Mode and standard Editor Mode. In Disambiguator Mode, you will see each ambiguous interpretation on a %mor line broken into its alternative possibilities at the bottom of the editor screen. The user double-clicks on the correct option and it is inserted. An ambiguous entry is defined as any entry that has the ^ symbol in it. For example, the form N|back^Prep|back is ambiguously either the noun “back” or the preposition “back.”

By default, Disambiguator Mode is set to work on the %mor tier. However, you may find it useful for other tiers as well. To change its tier setting, select the Edit menu and pull down to Options to get the Options dialog box. Set the disambiguation tier to the tier you want to disambiguate. To test all of this out, edit the sample.cha file, reset your default tier, and then type Esc-2. The editor should take you to the second %spa line which has:

%spa:   $RES:sel:ve^$DES:tes:ve

At the bottom of the screen, you will have a choice of two options to select. Once the correct one is highlighted, you hit a carriage return and the correct alternative will be inserted. If you find it impossible to decide between alternative tags, you can select the UND or undecided tag, which will produce a form such as “und|drink” for the word drink, when you are not sure whether it is a noun or a verb.
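
Because an ambiguous entry is simply one that contains the ^ symbol, the alternatives offered in Disambiguator Mode can be recovered by splitting on that character. The following Python sketch illustrates the idea; the function names is_ambiguous and split_alternatives are hypothetical and are not part of CLAN.

```python
def is_ambiguous(form):
    """An entry counts as ambiguous if it contains the ^ symbol."""
    return "^" in form


def split_alternatives(form):
    """Return the alternative readings of an entry, in their original order."""
    return form.split("^")


for form in ["N|back^Prep|back", "$RES:sel:ve^$DES:tes:ve", "n|dog"]:
    print(form, is_ambiguous(form), split_alternatives(form))
```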


6       A Formal Description of the Rule Files

Users working with languages for which grammar files have already been built do not need to concern themselves with this section or the next. However, users who need to develop grammars for new languages or who find they need to modify grammars for existing ones will need to understand how to create the two basic rule files themselves.  You do not need to create a new version of the sf.cut file for special form markers.  You just copy this file from the English MOR grammar.

To build new versions of the arules and crules files for your language, you will need to study the English files or files for a related language.  For example, when you are building a grammar for Portuguese, it would be helpful to study the grammar that has already been constructed for Spanish.  This section will help you understand the basic principles underlying the construction of the arules and crules.

6.1      Declarative structure

Both arules and crules are written using a simple declarative notation. The following formatting conventions are used throughout:

1.     Statements are one per line. Statements can be broken across lines by placing the continuation character \ at the end of the line.

2.     Comments begin with a % character and are terminated by the new line. Comments may be placed after a statement on the same line, or they may be placed on a separate line.

3.     Names are composed of alphanumeric symbols, plus these characters:

^ & + - _ : \ @ . /

Both arule and crule files contain a series of rules. Rules contain one or more clauses, each of which is composed of a series of condition statements, followed by a series of action statements. For a clause to apply, the input(s) must satisfy all condition statements. The output is derived from the input via the sequential application of all the action statements.

Both condition and action statements take the form of equations. The left-hand side of the equation is a keyword, which identifies the part of the input or output being processed. The right-hand side of the equation describes either the surface patterns to be matched or generated, or the category information that must be checked or manipulated.

The analyzer manipulates two different kinds of information: information about the surface shape of a word, and information about its category. All statements that match or manipulate category information must make explicit reference to a feature or features. Similarly, it is possible for a rule to contain a literal specification of the shape of a stem or affix. In addition, it is possible to use a pattern matching language to give a more general description of the shape of a string.

6.2      Pattern-matching symbols

The specification of orthographic patterns relies on a set of symbols derived from the regular expression (regexp) system in Unix. The rules of this system are:

1.     The metacharacters are: * [ ] | . ! All other characters are interpreted literally.

2.     A pattern that contains no metacharacters will only match itself, for example the pattern “abc” will match only the string “abc”.

3.     The period matches any character.

4.     The asterisk * allows any number of matches (including 0) on the preceding character. For example, the pattern '.*' will match a string consisting of any number of characters.

5.     The brackets [ ] are used to indicate choice from among a set of characters. The pattern [ab] will match either a or b.

6.     A pattern may consist of a disjunctive choice between two patterns, by use of the | symbol. For example, a pattern such as .*x|.*s|.*sh|.*ch will match all strings which end in x, s, sh, or ch.

7.     It is possible to check that some input does not match a pattern by prefacing the entire pattern with the negation operator !.
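To get a feel for how these seven rules interact, it can help to emulate them outside of MOR. The following Python sketch is not part of MOR or CLAN. It assumes that a pattern must match the whole string, that the symbols * [ ] | . behave like their counterparts in Python's re module, and that a leading ! negates the entire pattern. The function names mor_to_regex and mor_match are hypothetical.

```python
import re

# Symbols that section 6.2 lists as metacharacters (the leading "!" is handled separately).
MOR_META = set("*[]|.")

def mor_to_regex(pattern):
    """Translate a pattern into Python regex syntax, escaping everything
    that the manual says should be interpreted literally."""
    return "".join(ch if ch in MOR_META else re.escape(ch) for ch in pattern)

def mor_match(pattern, text):
    """Return True if text satisfies the pattern (assumed to be a whole-string match)."""
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    matched = re.fullmatch(mor_to_regex(pattern), text) is not None
    return matched != negate

if __name__ == "__main__":
    for pat, word in [("abc", "abc"), (".*x|.*s|.*sh|.*ch", "watch"), ("!.*s", "dogs")]:
        print(pat, word, mor_match(pat, word))
```

Because every non-metacharacter is escaped, this sketch does not support character ranges or the ^ negation used inside brackets in the arules; those would need separate handling.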

6.3      Variable notation

A variable is used to name a regular expression and to record patterns that match it. A variable must first be declared in a special variable declaration statement. Variable declaration statements have the format: “VARNAME = regular-expression” where VARNAME is at most eight characters long. If the variable name is more than one character, this name should be enclosed in parentheses when the variable is invoked.  Variables are particularly important for the arules in the ar.cut file.  In these rules, the negation operator is the up arrow ^, not the exclamation mark.  Variables may be declared through combinations of two types of disjunction markers, as in this example for the definition of a consonant cluster in the English ar.cut file:

O = [^aeiou]|[^aeiou][^aeiou]|[^aeiou][^aeiou][^aeiou]|qu|squ

Here, the square brackets contain the definition of a consonant as not a vowel and the bar or turnstile symbols separate alternative sequences of one, two, or three consonants.  Then, for good measure, the patterns “qu” and “squ” are also listed as consonantal onsets.  For languages that use combining diacritics and other complex symbols, it is best to use the turnstile notation, since the square bracket notation assumes single characters.  In these strings, it is important not to include any spaces or tabs, since the presence of a space will signal the end of the variable.

Once declared, the variable can be invoked in a rule by using the operator $. If the variable name is longer than a single character, the variable name should be enclosed in parentheses when invoked. For example, the statement X = .* declares and initializes a variable named “X.” The name X is entered in a special variable table, along with the regular expression it stands for. Note that variables may not contain other variables.

The variable table also keeps track of the most recent string that matched a named pattern. For example, if the variable X is declared as above, then the pattern $Xle will match all strings that end in le.  For example, the string able will match this pattern, because ab will match the pattern named by X and le will match the literal string le.  Because the string ab is matched against the named pattern X, it will be stored in the variable table as the most recent instantiation of X, until another string matches X.
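The behavior of the variable table can be illustrated with a small Python sketch. This is only an approximation under stated assumptions, not the MOR implementation: the class name VariableTable and its methods are hypothetical, variables are invoked as $X or $(NAME), the literal part of a pattern is passed to Python's re module unchanged, and each variable is assumed to appear at most once in a pattern.

```python
import re

class VariableTable:
    """Hypothetical model of the table that stores named patterns and their latest matches."""

    def __init__(self):
        self.patterns = {}   # variable name -> regular expression
        self.bindings = {}   # variable name -> most recent matching string

    def declare(self, statement):
        """Process a declaration such as 'X = .*'."""
        name, regex = (part.strip() for part in statement.split("=", 1))
        self.patterns[name] = regex

    def _expand(self, pattern):
        # Replace $X or $(NAME) with a named group wrapping the declared expression.
        def repl(m):
            name = m.group(1) or m.group(2)
            return "(?P<%s>%s)" % (name, self.patterns[name])
        return re.sub(r"\$(?:\((\w+)\)|(\w))", repl, pattern)

    def match(self, pattern, text):
        """Match text against pattern; on success, record what each variable matched."""
        m = re.fullmatch(self._expand(pattern), text)
        if m is None:
            return False
        self.bindings.update(m.groupdict())
        return True

if __name__ == "__main__":
    table = VariableTable()
    table.declare("X = .*")
    print(table.match("$Xle", "able"), table.bindings)
```

A failed match leaves the earlier binding in place, which mirrors the statement that a string stays in the table until another string matches the same variable.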

6.4      Category Information Operators

The following operators are used to manipulate category information: ADD [feature value], and DEL [feature value]. These are used in the category action statements. For example, the crule statement “RESULTCAT = ADD [num pl]” adds the feature value pair [num pl] to the result of the concatenation of two morphemes.

6.5      Arules

The function of the arules in the arules.cut file and the additional files in the /ar folder is to expand the entries in the disk lexicon into a larger number of entries in the online lexicon. Words that undergo regular phonological or orthographic changes when combined with an affix only need to have one disk lexicon entry. The arules are used to create online lexicon entries for all inflectional variants. These variants are called allos. For example, the final consonant of the verb “stop” is doubled before a vowel-initial suffix, such as “-ing.” The disk lexicon contains an entry for “stop,” whereas the online lexicon contains two entries: one for the form “stop” and one for the form “stopp”.

An arule consists of a header statement, which contains the rulename, followed by one or more condition-action clauses. Each clause has a series of zero or more conditions on the input, and one or more sets of actions. Here is an example of a typical condition-action clause from the larger n-allo rule in the English ar.cut file:

LEX-ENTRY:

LEXSURF = $Yy

LEXCAT = [scat n]

ALLO:

ALLOSURF = $Yie

ALLOCAT = LEXCAT, ADD [allo nYb]

ALLO:

ALLOSURF = LEXSURF

ALLOCAT = LEXCAT, ADD [allo nYa]

This is a single condition-action clause, labeled by the header statement “LEX-ENTRY:” Conditions begin with one of these two keywords:


1.     LEXSURF matches the surface form of the word in the lexical entry to an abstract pattern. In this case, the variable declaration is

                Y = .*[^aeiou]

Given this variable statement, the statement “LEXSURF = $Yy” will match all lexical entry surfaces that have a final y preceded by a nonvowel.

2.     LEXCAT checks the category information given in the matched lexical item against a given series of feature value pairs, each enclosed in square brackets and separated by commas. In this case, the rule is meant to apply only to nouns, so the category information must be [scat n]. It is possible to check that a feature-value pair is not present by prefacing the feature-value pair with the negation operator !.


Variable declarations should be made at the beginning of the rule, before any of the condition-action clauses. Variables apply to all following condition-action clauses inside a rule, but should be redefined for each rule.

After the condition statements come one or more action statements with the label ALLO: In most cases, one of the action statements is used to create an allomorph and the other is used to enter the original lexical entry into the run-time lexicon. Action clauses begin with one of these three keywords:


1.     ALLOSURF is used to produce an output surface. An output is a form that will be a part of the run-time lexicon used in the analysis. In the first action clause, a lexical entry surface form like “pony” is converted to “ponie” to serve as the stem of the plural. In the second action clause, the original form “pony” is kept because the form “ALLOSURF = LEXSURF” causes the surface form of the lexical entry to be copied over to the surface form of the allo.

2.     ALLOCAT determines the category of the output allos. The statement “ALLOCAT = LEXCAT” causes all category information from the lexical entry to be copied over to the allo entry. In addition, these two actions add the morphological classes such as [allo nYa] or [allo nYb] in order to keep track of the nature of these allomorphs during the application of the crules.

3.     ALLOSTEM is used to produce an output stem. This action is not necessary in this example, because this rule is fully regular and produces a noninflected stem. However, the arule that converts “postman” into “postmen” uses this ALLOSTEM action:

                 ALLOSTEM = $Xman&PL

The result of this action is the form postman&PL that is placed into the %mor line without the involvement of any of the concatenation rules.


There are two special category feature types that operate to dump the contents of the arules and the lexicon into the output.  These are “gen” and “proc”.  The gen feature introduces its value as a component of the stem.  Thus, the entry [gen m] for the Spanish word “hombre” will end up producing n|hombre&m.  The entry [proc dim] for Chinese reduplicative verbs will end up producing v|kan4-DIM for the reduplicated form kan4kan4.  These methods allow allorules to directly influence the output of MOR.

Every set of action statements leads to the generation of an additional allomorph for the online lexicon. Thus, if an arule clause contains several sets of action statements, each labeled by the header ALLO:, then that arule, when applied to one entry from the disk lexicon, will result in several entries in the online lexicon. To create the online lexicon, the arules are applied to the entries in the disk lexicon. Each entry is matched against the arules in the order in which they occur in the arules file. This ordering of arules is an extremely important feature.  It means that you need to order specific cases before general cases to avoid having the general case preempt the specific case.
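The way an ordered list of arules turns one disk lexicon entry into several online entries can be sketched in Python. This is an illustration only, not the MOR code. The rule table RULES, the rule name n-y-allo, and the function expand_entry are hypothetical, categories are simplified to dictionaries with one value per feature, and the pairing of the labels nYa and nYb with particular allos is an assumption. The last rule is a catch-all that copies any entry not matched by a more specific rule.

```python
import re

# Each rule: (name, surface regex, required features, list of actions).
# Each action: (surface template or None to copy the lexical surface, features to add).
RULES = [
    ("n-y-allo", r"(?P<Y>.*[^aeiou])y", {"scat": "n"},
     [("{Y}ie", {"allo": "nYb"}),      # pony -> ponie (stem for the plural)
      (None, {"allo": "nYa"})]),       # pony -> pony (original form kept)
    ("default", r".*", {}, [(None, {})]),  # catch-all: copy the entry unchanged
]

def expand_entry(surface, cat):
    """Apply the first rule whose conditions all hold and return the resulting allos."""
    for name, pattern, required, actions in RULES:
        m = re.fullmatch(pattern, surface)
        if m is None:
            continue
        if any(cat.get(feat) != val for feat, val in required.items()):
            continue
        allos = []
        for template, added in actions:
            allo_surface = surface if template is None else template.format(**m.groupdict())
            allos.append((allo_surface, {**cat, **added}))
        return allos  # first matching rule wins; later rules are never tried
    return []

if __name__ == "__main__":
    for entry in [("pony", {"scat": "n"}), ("carry", {"scat": "v"}), ("boy", {"scat": "n"})]:
        print(entry, "->", expand_entry(*entry))
```

Moving the catch-all rule to the top of RULES would make it match every entry first, which is the kind of preemption that ordering specific cases before general ones is meant to avoid.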

As soon as the input matches all conditions in the condition section of a clause, the actions are applied to that input to generate one or more allos, which are loaded into the online lexicon. No further rules are applied to that input, and the next entry from the disk lexicon is then read in to be processed. The complete set of arules should always end with a default rule to copy over all remaining lexical entries that have not yet been matched by some rule. This default rule must have this shape:

% default rule- copy input to output

RULENAME: default

LEX-ENTRY:

ALLO:

6.6      Crules

The purpose of the crules in the crules.cut file is to allow stems to combine with affixes. In these rules, sets of conditions and actions are grouped together into if-then clauses. This allows a rule to apply to a disjunctive set of inputs. As soon as all the conditions in a clause are met, the actions are carried out. If these are carried out successfully, the rule is considered to have “fired,” and no further clauses in that rule will be tried.

There are two inputs to a crule: the part of the word identified thus far, called the “start,” and the next morpheme identified, called the “next.” The best way to think of this is in terms of a bouncing ball that moves through the word, moving items from the not-yet-processed chunk on the right over to the already processed chunk on the left. The output of a crule is called the “result.” The following is the list of the keywords used in the crules:


keyword     function

STARTSURF   check surface of start input against some pattern

STARTCAT    check start category information

NEXTSURF    check surface of next input against some pattern

NEXTCAT     check next category information

MATCHCAT    check that start and next match for a feature-value pair type

RESULTCAT   output category information

Here is an example of a piece of a rule that uses most of these keywords:

S = .*[sc]h|.*[zxs] % strings that end in affricates

O = .*[^aeiou]o % things that end in o

% clause 1 - special case for "es" suffix

if

STARTSURF = $S

NEXTSURF = es|-es

NEXTCAT = [scat vsfx]

MATCHCAT [allo]

then

RESULTCAT = STARTCAT, NEXTCAT [tense], DEL [allo]

This rule is used to analyze verbs that end in -es. There are four conditions that must be matched in this rule:

1.     The STARTSURF is a stem that is specified in the declaration to end in an affricate. The STARTCAT is not defined.

2.     The NEXTSURF is the -es suffix that is attached to that stem.

3.     The NEXTCAT is the category of the suffix, which is “vsfx” or verbal suffix.

4.     The MATCHCAT [allo] statement checks that both the start and next inputs have the same value for the feature allo.  If there are multiple [allo] entries, all must match.

The shape of the result surface is simply the concatenation of the start and next surfaces. Hence, it is not necessary to specify this via the crules. The category information of the result is specified via the RESULTCAT statement. The statement “RESULTCAT = STARTCAT” causes all category information from the start input to be copied over to the result. The statement “NEXTCAT [tense]” copies the tense value from the NEXT to the RESULT and the statement “DEL [allo]” deletes all the values for the category [allo].
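The four conditions and the three category actions described for this clause can be mirrored in a short Python sketch. It is an approximation for illustration, not the MOR implementation: the function name es_clause is hypothetical, categories are simplified to dictionaries with a single value per feature, and the feature values vES and 3S used in the usage example are invented.

```python
import re

# Same shape as the S variable in the rule: stems ending in sh, ch, z, x, or s.
S = r".*[sc]h|.*[zxs]"

def es_clause(start, nxt):
    """Try to combine a start (surface, category) with a next (surface, category).
    Returns the result (surface, category), or None if any condition fails."""
    start_surf, start_cat = start
    next_surf, next_cat = nxt

    # Conditions
    if re.fullmatch(S, start_surf) is None:              # STARTSURF = $S
        return None
    if re.fullmatch(r"es|-es", next_surf) is None:       # NEXTSURF = es|-es
        return None
    if next_cat.get("scat") != "vsfx":                   # NEXTCAT = [scat vsfx]
        return None
    if start_cat.get("allo") != next_cat.get("allo"):    # MATCHCAT [allo]
        return None

    # Actions
    result_cat = dict(start_cat)                         # RESULTCAT = STARTCAT
    if "tense" in next_cat:
        result_cat["tense"] = next_cat["tense"]          # NEXTCAT [tense]
    result_cat.pop("allo", None)                         # DEL [allo]

    # The result surface is just the concatenation of the two surfaces.
    return (start_surf + next_surf, result_cat)

if __name__ == "__main__":
    stem = ("watch", {"scat": "v", "allo": "vES"})
    suffix = ("es", {"scat": "vsfx", "allo": "vES", "tense": "3S"})
    print(es_clause(stem, suffix))
```

Returning None as soon as one condition fails corresponds to the requirement that all conditions in a clause be met before any action is carried out.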

In addition to the condition-action statements, crules include two other statements: the CTYPE statement, and the RULEPACKAGES statement. The CTYPE statement identifies the kind of concatenation expected and the way in which this concatenation is to be marked. This statement follows the RULENAME header. There are two special CTYPE markers: START and END. “CTYPE: START” is used for those rules that execute as soon as one morpheme has been found. “CTYPE: END” is used for those rules that execute when the end of the input has been reached. Otherwise, the CTYPE marker is used to indicate which concatenation symbol is used when concatenating the morphemes together into a parse for a word. The # is used between a prefix and a stem, - is used between a stem and suffix, and ~ is used between a clitic and a stem.  In most cases, rules that specify possible suffixes will start with CTYPE: -. These rules insert a suffix after the stem.

Rules with CTYPE START are applied as soon as a morpheme has been recognized. In this case, the beginning of the word is considered as the start input, and the next input is the morpheme first recognized. As the start input has no surface and no category information associated with it, conditions and actions are stated only on the next input.

Rules with CTYPE END are invoked when the end of a word is reached, and they are used to rule out spurious parses. For the endrules, the start input is the entire word that has just been parsed, and there is no next input. Thus, conditions and actions are only stated on the start input.

The RULEPACKAGES statement identifies which rules may be applied to the result of a rule, when that result is the input to another rule. The RULEPACKAGES statement follows the action statements in a clause. There is a RULEPACKAGES statement associated with each clause. The rules named in a RULEPACKAGES statement are not tried until after another morpheme has been found. For example, in parsing the input “walking”, the parser first finds the morpheme “walk,” and at that point applies the startrules. Of these startrules, the rule for verbs will be fired. This rule includes a RULEPACKAGES statement specifying that the rule which handles verb conjugation may later be fired. When the parser has further identified the morpheme “ing,” the verb conjugation rule will apply, where “walk” is the start input, and “ing” is the next input.

Note that, unlike the arules which are strictly ordered from top to bottom of the file, the crules have an order of application that is determined by their CTYPE and the way in which the RULEPACKAGES statement channels words from one rule to the next.
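The way CTYPE and RULEPACKAGES channel a word from one rule to the next can be sketched in Python for the “walking” example. This is a simplified model, not the MOR parser: the rule names v-start and v-conj, the tiny LEXICON, and the function parse are all hypothetical, the word is assumed to be already segmented into morphemes, the result surface is written with the CTYPE symbol between morphemes, and endrules are left out.

```python
# Hypothetical two-entry lexicon: surface -> category information.
LEXICON = {
    "walk": {"scat": "v"},
    "ing": {"scat": "vsfx"},
}

def verb_start(start, nxt):
    """START rule: there is no start input yet, so only the next input is checked."""
    surf, cat = nxt
    if cat.get("scat") != "v":
        return None
    return (surf, dict(cat))

def verb_conj(start, nxt):
    """Suffixing rule (CTYPE -): attach a verbal suffix to a verb stem."""
    (start_surf, start_cat), (next_surf, next_cat) = start, nxt
    if start_cat.get("scat") != "v" or next_cat.get("scat") != "vsfx":
        return None
    return (start_surf + "-" + next_surf, dict(start_cat))

# rule name -> (CTYPE, rule function, rule packages to try on the next morpheme)
RULES = {
    "v-start": ("START", verb_start, ["v-conj"]),
    "v-conj": ("-", verb_conj, []),
}

def parse(morphemes):
    """Run the rules over the morphemes, following rule packages from rule to rule."""
    active = [name for name, (ctype, _, _) in RULES.items() if ctype == "START"]
    result = None
    for surf in morphemes:
        nxt = (surf, LEXICON[surf])
        for name in active:
            _, apply_rule, packages = RULES[name]
            out = apply_rule(result, nxt)
            if out is not None:
                result, active = out, packages  # only the packaged rules may fire next
                break
        else:
            return None  # no permitted rule accepted this morpheme
    return result

if __name__ == "__main__":
    print(parse(["walk", "ing"]))
```

In this model the order of application comes entirely from the START marker and the package lists, not from the position of the rules in the table.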

7       Building new MOR grammars

7.1      minMOR

The simplest possible form of a MOR grammar is represented in the “min” grammar that you can download from the MOR grammars page at 

You can begin your work using the sample minimal MOR grammar available from the net.  This grammar includes

1.                  the sf.cut file that all of the MOR grammars use,

2.                 a sample.cha file with a few words

3.                 a basically blank ar.cut file, because no allomorphy is yet involved,

4.                 a cr.cut file that recognizes the parts of speech you will create, along with one rule for making plural nouns, and