Part 3: Morphosyntactic Analysis
Carnegie Mellon University
May 24, 2023
https://doi.org/10.21415/T5B97X
When citing the use of TalkBank and CHILDES facilities, please use this reference to the last printed version:
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates
This allows us to systematically track usage of the programs and data through scholar.google.com.
Part 3: Morphosyntactic Analysis
2.2 Tag Groups and Word Groups
2.10 Sample Morphological Tagging for English
4.5 Categories and Components of MOR
4.6 MOR Part-of-Speech Categories
4.7 MOR Grammatical Categories
4.8 Compounds and Complex Forms
4.11 Control Features and Output Features
6 A Formal Description of the Rule Files
6.4 Category Information Operators
10 GRASP – Syntactic Dependency Analysis
11.13 CSUBJ, COBJ, CPOBJ, CPRED
This third volume of the TalkBank manuals deals with the use of the programs that perform automatic computation of the morphosyntactic structure of transcripts in CHAT format. These manuals, the programs, and the TalkBank datasets can all be downloaded freely from https://talkbank.org.
The first volume of the TalkBank manual describes the CHAT transcription format. The second volume describes the use of the CLAN data analysis programs. This third manual describes the use of the MOR, POST, POSTMORTEM, and MEGRASP programs to add a %mor and %gra line to CHAT transcripts. The %mor line provides a complete part-of-speech tagging for every word indicated on the main line of the transcript. The %gra line provides a further analysis of the grammatical dependencies between items in the %mor line. These programs for morphosyntactic analysis are all built into CLAN.
Users who do not wish to create or process information on the %mor and %gra lines will not need to read this current manual. However, researchers and clinicians interested in these features will need to know the basics of the use of these programs, as described in the next chapter. The additional sections of this manual are directed to researchers who wish to extend or improve the coverage of MOR and GRASP grammars or who wish to build such grammars for languages that are not yet covered.
Linguists and psycholinguists rely on the analysis of morphosyntax to illuminate core issues in learning and development. Generativist theories have emphasized issues such as: the role of triggers in the early setting of a parameter for subject omission (Hyams & Wexler, 1993), evidence for advanced early syntactic competence (Wexler, 1998), evidence for the early absence of functional categories that attach to the IP node (Radford, 1990), the role of optional infinitives in normal and disordered acquisition (Rice, 1997), and the child’s ability to process syntax without any exposure to relevant data (Crain, 1991). Generativists have sometimes been criticized for paying inadequate attention to the empirical patterns of distribution in children’s productions. However, work by researchers in this tradition, such as Stromswold (1994), van Kampen (1998), and Meisel (1986), demonstrates the important role that transcript data can play in evaluating alternative generative accounts.
Learning theorists have placed an even greater emphasis on the use of transcripts for understanding morphosyntactic development. Neural network models have shown how cue validities can determine the sequence of acquisition for both morphological (MacWhinney & Leinbach, 1991; MacWhinney, Leinbach, Taraban, & McDonald, 1989; Plunkett & Marchman, 1991) and syntactic (Elman, 1993; Mintz, Newport, & Bever, 2002; Siskind, 1999) development. This work derives further support from a broad movement within linguistics toward a focus on data-driven models (Bybee & Hopper, 2001) for understanding language learning and structure. This work formulates accounts that view constructions (Tomasello, 2003) and item-based patterns (MacWhinney, 1975) as the loci for statistical learning.
The study of morphosyntax also plays an important role in the study and treatment of language disorders, such as aphasia, specific language impairment, stuttering, and dementia. For this work, both researchers and clinicians can benefit from methods for achieving accurate automatic analysis of correct and incorrect uses of morphosyntactic devices. To address these needs, the TalkBank system uses the MOR command to automatically generate candidate morphological analyses on the %mor tier, the POST command to disambiguate these analyses, and the MEGRASP command to compute grammatical dependencies on the %gra tier.
MOR creates a %mor tier with a one-to-one correspondence between words on the main line and words on the %mor tier. In order to achieve this one-to-one correspondence, the following rules are observed:
2. Utterance delimiters are preserved on the %mor line to facilitate readability and analysis. These delimiters should be the same as the ones used on the main line.
3. Along with utterance delimiters, the satellite markers of ‡ for the vocative and „ for tag questions or dislocations are also included on the %mor line in a one-to-one alignment format.
4. Retracings and repetitions are excluded from this one-to-one mapping, as are nonwords such as xxx or strings beginning with &. When word repetitions are marked in the form word [x 3], the material in square brackets is stripped off and the word is considered as a single form.
5. When a replacing form is indicated on the main line with the form [: text], the material on the %mor line corresponds to the replacing material in the square brackets, not the material that is being replaced. For example, if the main line has gonna [: going to], the %mor line will code going to.
6. The [*] symbol that is used on the main line to indicate errors is not duplicated on the %mor line.
On the %mor line, the alternative taggings of a given word produced by the MOR program are clustered together in tag groups, with the alternatives separated by the ^ character. Here is an example of a tag group for one of the most ambiguous words in English:
adv|back^adj|back^n|back^v|back
After you run the POST program on your files, all of these alternatives will be disambiguated and each word will have only one alternative. You can also use the hand disambiguation method built into the CLAN editor to disambiguate each tag group case by case.
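As a rough illustration of the notation (the helper name here is ours, not a CLAN facility), a tag group can be split into its candidate analyses by breaking on the ^ character:

```python
def split_tag_group(tag_group: str) -> list[str]:
    """Split a %mor tag group into its alternative analyses.

    MOR separates alternatives with '^'; after POST runs, each
    group is reduced to a single analysis, so the list has length 1.
    """
    return tag_group.split("^")

alternatives = split_tag_group("adv|back^adj|back^n|back^v|back")
# -> ['adv|back', 'adj|back', 'n|back', 'v|back']
```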
The next level of organization for the MOR line is the word group. Word groups are combinations marked by the preclitic delimiter $, the postclitic delimiter ~, or the compound delimiter +. For example, the Spanish word dámelo can be represented as:
vimpsh|da-2S&IMP~pro:clit|1S~pro:clit|OBJ&MASC=give
This word group is a series of three words (verb~postclitic~postclitic) combined by the ~ marker. Clitics may be either preclitics or postclitics. Separable prefixes of the type found in German or Hungarian and other discontinuous morphemes can be represented as word groups using the preclitic delimiter $, as in this example for ausgegangen (“gone”):
prep|aus$PART#v|geh&PAST:PART=go
Note the difference between the coding of the preclitic “aus” and the prefix “ge” in this example. Compounds are also represented as combinations, as in this analysis of angelfish:
n|+n|angel+n|fish
Here, the first characters (n|) represent the part of speech of the whole compound and the subsequent tags, after each plus sign, are for the parts of speech of the components of the compound. Proper nouns are not treated as compounds. Therefore, they take forms with underlines instead of pluses, such as Luke_Skywalker or New_York_City.
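To make these delimiters concrete, here is a minimal sketch (our own helper, not part of CLAN) that breaks a word group into its component words at the $, ~, and + delimiters:

```python
import re

def split_word_group(word_group: str) -> list[str]:
    """Split a %mor word group at the preclitic ($), postclitic (~),
    and compound (+) delimiters, keeping the non-empty parts."""
    return [part for part in re.split(r"[$~+]", word_group) if part]

# The Spanish clitic cluster yields the verb plus two postclitic pronouns:
split_word_group("vimpsh|da-2S&IMP~pro:clit|1S~pro:clit|OBJ&MASC=give")
# The compound yields the whole-compound tag 'n|' plus its two parts:
split_word_group("n|+n|angel+n|fish")  # -> ['n|', 'n|angel', 'n|fish']
```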
Beneath the level of the word group is the level of the word. The structure of each individual word is:
part-of-speech|
stem
&fusionalsuffix
-suffix
=english (optional, underscore joins words)
There can be any number of prefixes, fusional suffixes, and suffixes, but there should be only one stem. Prefixes and suffixes should be given in the order in which they occur in the word. Since fusional suffixes are fused parts of the stem, their order is indeterminate. The English translation of the stem is not a part of the morphology, but is included for convenience for non-native speakers. If the English translation requires two words, these words should be joined by an underscore as in “lose_flowers” for French défleurir.
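Under these conventions a single %mor word can be decomposed mechanically. The sketch below (the function and field names are ours, for illustration only) handles the # prefix, & fusional, - suffix, and = translation delimiters described above:

```python
import re

def parse_mor_word(word: str) -> dict:
    """Decompose one %mor word of the form prefix#pos|stem&FUS-SFX=gloss."""
    gloss = None
    if "=" in word:
        word, gloss = word.split("=", 1)   # optional English translation
    *prefixes, core = word.split("#")      # any number of '#'-marked prefixes
    pos, _, rest = core.partition("|")     # part of speech before the pipe
    stem = re.match(r"[^&\-]*", rest).group(0)
    return {
        "prefixes": prefixes,
        "pos": pos,
        "stem": stem,
        "fusional": re.findall(r"&([^&\-]+)", rest),
        "suffixes": re.findall(r"-([^&\-]+)", rest),
        "gloss": gloss,
    }

parse_mor_word("v|be&PAST&13s")       # irregular past of "be"
parse_mor_word("co#n:v|work-AGT-PL")  # "coworkers": prefix, stem, two suffixes
```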
Now let us look in greater detail at the nature of each of these types of coding. Throughout this discussion, bear in mind that all coding is done on a word-by-word basis, where words are considered to be strings separated by spaces.
The morphological codes on the %mor line begin with a part-of-speech code. The basic scheme for the part-of-speech code is:
category:subcategory:subcategory
Additional fields can be added, using the colon character as the field separator. The subcategory fields contain information about syntactic features of the word that are not marked overtly. For example, you may wish to code the fact that Italian “andare” is an intransitive verb even though there is no single morpheme that signals intransitivity. You can do this by using the part-of-speech code v:intrans, rather than by inserting a separate morpheme.
In order to avoid redundancy, information that is marked by a prefix or suffix is not incorporated into the part-of-speech code, as this information will be found to the right of the | delimiter. These codes can be given in either uppercase, as in ADJ, or lowercase, as in adj. In general, CHAT codes are not case-sensitive.
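Because the colon is the field separator and codes are case-insensitive, the category and subcategory fields can be recovered with a simple split (the helper name is ours, for illustration):

```python
def split_pos_code(code: str) -> list[str]:
    """Split a part-of-speech code into its category and subcategory
    fields. CHAT codes are case-insensitive, so normalize to lowercase."""
    return code.lower().split(":")

split_pos_code("pro:poss:det")  # -> ['pro', 'poss', 'det']
split_pos_code("V:INTRANS")     # -> ['v', 'intrans']
```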
The particular codes given below are the ones that MOR uses for automatic morphological tagging of English. Individual researchers will need to define a system of part-of-speech codes that correctly reflects their own research interests and theoretical commitments. Languages that are typologically quite different from English may have to use very different part-of-speech categories. Quirk, Greenbaum, Leech, and Svartvik (1985) explain some of the intricacies of part-of-speech coding. Their analysis should be taken as definitive for all part-of-speech coding for English. However, for many purposes, a more coarse-grained coding can be used.
The following set of top-level part-of-speech codes is the one used by the MOR program. Additional refinements to this system can be found by studying the organization of the lexicon files for that program. For example, compounds use the main part-of-speech code, along with codes for their components. Further distinctions can be found by looking at the MOR lexicon.
English Parts of Speech

    Category                      Code
    Adjective                     adj
    Adjective - Predicative       adj:pred
    Adverb                        adv
    Adverb - Temporal             adv:tem
    Communicator                  co
    Complementizer                comp
    Conjunction                   conj
    Coordinator                   coord
    Determiner - Article          det:art
    Determiner - Demonstrative    det:dem
    Determiner - Interrogative    det:int
    Determiner - Numeral          det:num
    Determiner - Possessive       det:poss
    Filler                        fil
    Infinitive                    inf
    Negative                      neg
    Noun                          n
    Noun - letter                 n:let
    Noun - plurale tantum         n:pt
    Proper Noun                   n:prop
    Onomatopoeia                  on
    Particle                      part
    Postmodifier                  post
    Preposition                   prep
    Pronoun - demonstrative       pro:dem
    Pronoun - existential         pro:exist
    Pronoun - indefinite          pro:indef
    Pronoun - interrogative       pro:int
    Pronoun - object              pro:obj
    Pronoun - personal            pro:per
    Pronoun - possessive          pro:poss
    Pronoun - reflexive           pro:refl
    Pronoun - relative            pro:rel
    Pronoun - subject             pro:sub
    Quantifier                    qn
    Verb                          v
    Verb - auxiliary              aux
    Verb - copula                 cop
    Verb - modal                  mod
Every word on the %mor tier must include a “lemma” or stem as part of the morpheme analysis. The stem is found on the right-hand side of the | delimiter, following any preclitics or prefixes. If the transcript is in English, this can simply be the canonical form of the word. For nouns, this is the singular. For verbs, it is the infinitive. If the transcript is in another language, it can be the English translation. A single form should be selected for each stem. Thus, the English indefinite article is coded as det|a with the lemma “a,” whether the actual form of the article is “a” or “an.”
When English is not the main language of the transcript, the transcriber must decide whether to use English stems. Using English stems has the advantage that it makes the corpus more available to English-reading researchers. To show how this is done, take the German phrase “wir essen”:
*FRI: wir essen.
%mor: pro|wir=we v|ess-INF=eat .
Some projects may have reasons to avoid using English stems, even as translations. In this example, “essen” would then be simply v|ess-INF. Other projects may wish to use only English stems and no target-language stems. Sometimes there are multiple possible translations into English. For example, German “Sie”/“sie” could be “you,” “she,” or “they.” Choosing a single English meaning for the stem helps fix the German form.
Affixes and clitics are coded in the position in which they occur with relation to the stem. The morphological status of the affix should be identified by the following markers or delimiters: - for a suffix, # for a prefix, and & for fusional or infixed morphology.
The & is used to mark affixes that are not realized in a clearly isolable phonological shape. For example, the form “men” cannot be broken down into a part corresponding to the stem “man” and a part corresponding to the plural marker, because one cannot say that the vowel “e” marks the plural. For this reason, the word is coded as n|man&PL. The past forms of irregular verbs may undergo similar ablaut processes, as in “came,” which is coded v|come&PAST, or they may undergo no phonological change at all, as in “hit”, which is coded v|hit&PAST. Sometimes there may be several codes indicated with the & after the stem. For example, the form “was” is coded v|be&PAST&13s. Affix and clitic codes are based either on Latin forms for grammatical function or English words corresponding to particular closed-class items. MOR uses the following set of affix codes for automatic morphological tagging of English.
Inflectional Affixes for English

    Function                      Code
    adjective suffix er, r        CP
    adjective suffix est, st      SP
    noun suffix ie                DIM
    noun suffix s, es             PL
    noun suffix 's, '             POSS
    verb suffix s, es             3S
    verb suffix ed, d             PAST
    verb suffix ing               PRESP
    verb suffix en                PASTP

Derivational Affixes for English

    Function                      Code
    adjective and verb prefix un  UN
    adverbializer ly              LY
    nominalizer er                ER
    noun prefix ex                EX
    verb prefix dis               DIS
    verb prefix mis               MIS
    verb prefix out               OUT
    verb prefix over              OVER
    verb prefix pre               PRE
    verb prefix pro               PRO
    verb prefix re                RE
Clitics are marked by a tilde, as in v|parl&IMP:2S=speak~pro|DAT:MASC:SG for Italian “parlagli” and pro|it~v|be&3s for English “it's.” Note that part of speech coding with the | symbol is repeated for clitics after the tilde. Both clitics and contracted elements are coded with the tilde. The use of the tilde for contracted elements extends to forms like “sul” in Italian, “ins” in German, or “rajta” in Hungarian in which prepositions are merged with articles or pronouns.
Clitic Codes for English

    Clitic                          Code
    noun phrase post-clitic 'd      v:aux|would, v|have&PAST
    noun phrase post-clitic 'll     v:aux|will
    noun phrase post-clitic 'm      v|be&1S, v:aux|be&1S
    noun phrase post-clitic 're     v|be&PRES, v:aux|be&PRES
    noun phrase post-clitic 's      v|be&3S, v:aux|be&3S
    verbal post-clitic n't          neg|not
Here are some words that we might want to treat as compounds: sweatshirt, highschool, playground, and horseback. You can find hundreds of these in files in the English lexicon such as adv+n+adv.cut that include the plus symbol in their names. There are also many idiomatic phrases that could be best analyzed as linkages. Here are some examples: a_lot_of, all_of_a_sudden, at_last, for_sure, kind_of, of_course, once_and_for_all, once_upon_a_time, so_far, and lots_of. You can find hundreds of these in files in the English lexicon with names such as adj_under.cut.
On the %mor tier it is necessary to assign a part-of-speech label to each segment of the compound. For example, the word blackboard is coded on the %mor tier as n|+adj|black+n|board. The part of speech of the compound as a whole is usually given by the part-of-speech of the final segment, although this is not always true.
In order to preserve the one-to-one correspondence between words on the main line and words on the %mor tier, words that are not marked as compounds on the main line should not be coded as compounds on the %mor tier. For example, if the words “come here” are used as a rote form, then they should be written as “come_here” on the main tier. On the %mor tier this will be coded as v|come_here. It makes no sense to code this as v|come+adv|here, because that analysis would contradict the claim that this pair functions as a single unit. It is sometimes difficult to assign a part-of-speech code to a morpheme. In the usual case, the part-of-speech code should be chosen from the same set of codes used to label single words of the language. For example, some of these idiomatic phrases can be coded as compounds on the %mor line.
Phrases Coded as Linkages

    qn|a_lot_of             adv|all_of_a_sudden
    co|for_sure             adv:int|kind_of
    adv|once_and_for_all    adv|once_upon_a_time
    adv|so_far              qn|lots_of
MOR can be configured to recognize certain punctuation marks as whole word characters. In particular, the file punct.cut contains these entries:
„ {[scat end]} "end"
‡ {[scat beg]} "beg"
, {[scat cm]} "cm"
When the punctuation marks on the left occur in text, they are treated as separate lexical items and are mapped to forms such as beg|beg on the %mor tier. The “end” marker is used to mark postposed forms such as tags and sentence final particles. The “beg” marker is used to mark preposed forms such as vocatives and communicators. These special characters are important for correctly structuring the dependency relations for the GRASP program.
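The effect of those entries can be mirrored in a small lookup table (a sketch of the mapping only, not of the actual .cut file handling inside MOR):

```python
# Mirror of the punct.cut entries above: punctuation mark -> (scat, lemma).
PUNCT_SCAT = {
    "\u201e": ("end", "end"),   # „  postposed tags, sentence-final particles
    "\u2021": ("beg", "beg"),   # ‡  vocatives and communicators
    ",": ("cm", "cm"),
}

def punct_to_mor(mark: str) -> str:
    """Render a punctuation mark as it appears on the %mor tier, e.g. beg|beg."""
    scat, lemma = PUNCT_SCAT[mark]
    return f"{scat}|{lemma}"

punct_to_mor("\u2021")  # -> 'beg|beg'
```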
The following table describes and illustrates a more detailed set of word class codings for English. The %mor tier examples correspond to the labellings MOR produces for the words in question. It is possible to augment or simplify this set, either by creating additional word categories, or by adding additional fields to the part-of-speech label, as discussed previously. The entries in this table and elsewhere in this manual can always be double-checked against the current version of the MOR grammar by typing “mor +xi” to bring up interactive MOR and then entering the word to be analyzed.
Word Classes for English

    Class                        Examples              Coding of Examples
    adjective                    big                   adj|big
    adjective, comparative       bigger, better        adj|big-CP, adj|good&CP
    adjective, superlative       biggest, best         adj|big-SP, adj|good&SP
    adverb                       well                  adv|well
    adverb, ending in ly         quickly               adv:adj|quick-LY
    adverb, intensifying         very, rather          adv:int|very, adv:int|rather
    adverb, post-qualifying      enough, indeed        adv|enough, adv|indeed
    adverb, locative             here, then            adv:loc|here, adv:tem|then
    communicator                 aha                   co|aha
    conjunction, coord.          and, or               conj:coo|and, conj:coo|or
    conjunction, subord.         if, although          conj:sub|if, conj:sub|although
    determiner, singular         a, the, this          det|a, det|this
    determiner, plural           these, those          det|these, det|those
    determiner, possessive       my, your, her         det:poss|my
    infinitive marker            to                    inf|to
    noun, common                 cat, coffee           n|cat, n|coffee
    noun, plural                 cats                  n|cat-PL
    noun, possessive             cat's                 n|cat~poss|s
    noun, plural possessive      cats'                 n|cat-PL~poss|s
    noun, proper                 Mary                  n:prop|Mary
    noun, proper, plural         Mary-s                n:prop|Mary-PL
    noun, proper, possessive     Mary's                n:prop|Mary~poss|s
    noun, proper, pl. poss.      Marys'                n:prop|Mary-PL~poss|s
    noun, adverbial              home, west            n|home, adv:loc|home
    number, cardinal             two                   det:num|two
    number, ordinal              second                adj|second
    postquantifier               all, both             post|all, post|both
    preposition                  in                    prep|in, adv:loc|in
    pronoun, personal            I, me, we, us, he     pro|I, pro|me, pro|we, pro|us
    pronoun, reflexive           myself, ourselves     pro:refl|myself
    pronoun, possessive          mine, yours, his      pro:poss|mine, pro:poss:det|his
    pronoun, demonstrative       that, this, these     pro:dem|that
    pronoun, indefinite          everybody, nothing    pro:indef|everybody
    pronoun, indef., poss.       everybody's           pro:indef|everybody~poss|s
    quantifier                   half, all             qn|half, qn|all
    verb, base form              walk, run             v|walk, v|run
    verb, 3rd singular present   walks, runs           v|walk-3S, v|run-3S
    verb, past tense             walked, ran           v|walk-PAST, v|run&PAST
    verb, present participle     walking, running      part|walk-PRESP, part|run-PRESP
    verb, past participle        walked, run           part|walk-PASTP, part|run&PASTP
    verb, modal auxiliary        can, could, must      aux|can, aux|could, aux|must
Since it is sometimes difficult to decide what part of speech a word belongs to, we offer the following overview of the major part-of-speech labels used in the standard English grammar.
ADJectives modify nouns, either prenominally or predicatively. Unitary compound modifiers such as good-looking should be labeled as adjectives.
ADVerbs cover a heterogeneous class of words including: manner adverbs, which generally end in -ly; locative adverbs, which include expressions of time and place; intensifiers that modify adjectives; and post-head modifiers, such as indeed and enough.
COmmunicators are used for interactive and communicative forms that fulfill a variety of functions in speech and conversation. Also included in this category are words used to express emotion, as well as imitative and onomatopoeic forms, such as ah, aw, boom, boom-boom, icky, wow, yuck, and yummy.
CONJunctions conjoin two or more words, phrases, or sentences. Examples include: although, because, if, unless, and until.
COORDinators include and, or, and as well as. These can combine clauses, phrases, or words.
DETerminers include articles, and definite and indefinite determiners. Possessive determiners such as my and your are tagged det:poss.
INFinitive is the word “to”, which is tagged inf|to.
INTerjections are similar to communicators, but they typically can stand alone as complete utterances or fragments, rather than being integrated as parts of utterances. They include forms such as wow, hello, good-morning, good-bye, please, and thank-you.
Nouns are tagged with n for common nouns and n:prop for proper nouns (names of people, places, fictional characters, brand-name products).
NEGative is the word “not”, which is tagged neg|not.
NUMbers are labelled num for cardinal numbers. The ordinal numbers are adjectives.
Onomatopoeia are words that imitate the sounds of nature, animals, and other noises.
Particles are words that are often also prepositions, but are serving as verbal particles.
PREPositions are the heads of prepositional phrases. When a preposition is not part of a phrase, it should be coded as a particle or an adverb.
PROnouns include a variety of structures, such as reflexives, possessives, personal pronouns, and deictic pronouns.
QUANTifiers include each, every, all, some, and similar items.
Verbs can be main verbs, copulas, or auxiliaries.
Currently, the most highly developed MOR grammar is the one for English, which achieves 99.18% accuracy in tagging for productions from adult native speakers in databases such as CHILDES and AphasiaBank. It is more difficult to reliably determine the accuracy of tagging for child utterances, particularly at the youngest ages, when there are often ambiguities in one-word and two-word utterances (Bloom, 1973) that even human coders cannot resolve. MOR grammars are also highly evolved for Spanish, French, German, Mandarin, Japanese, and Cantonese, achieving over 95% accuracy for these languages too. Apart from accuracy of tagging, there is the issue of lexical coverage. For child language, lexical coverage is largely complete for these languages. However, as we deal with more advanced adult productions, such as the academic language in the MICASE corpus or the juridical language in the SCOTUS corpus, we often need to add new items to the lexicons for a given language. Accuracy of grammatical relation tagging by GRASP is highly dependent on accurate tagging by MOR. However, even with accurate MOR tagging, GRASP tagging is at 93% accuracy for English.
It is possible to construct a complete automatic morphosyntactic analysis of a series of CHAT transcripts through a single command in CLAN, once you have the needed programs in the correct configuration. This command runs the MOR, POST, POSTMORTEM, and MEGRASP commands in an automatic sequence or chain. To do this, follow these steps:
1. Place all the files you wish to analyze into a single folder.
2. Start the CLAN program (see Part 2 of the manual for instructions on installing CLAN).
3. In CLAN’s Commands window, click on the button labelled Working to set your working directory to the folder that has the files to be analyzed.
4. Under the File menu at the top of the screen, select Get MOR Grammar and select the language you want to analyze. To do this, you must be connected to the Internet. If you have already done this once, you do not need to do it again. By default, the MOR grammar you have chosen will download to your desktop.
5. If you choose to move your MOR grammar to another location, you will need to use the Mor Lib button in the Commands window to tell CLAN about where to locate it.
6. To analyze all the files in your working directory folder, enter this command in the Commands window: mor *.cha
7. CLAN will then run these programs in sequence: MOR, POST, POSTMORTEM, and MEGRASP. These programs will add %mor and %gra lines to your files.
8. If this command ends with a message saying that some words were not recognized, you will probably want to fix them. If you do not, some of the entries on the %mor line will be incomplete and the relations on the %gra line will be less accurate. If you have doubts about the spellings of certain words, you can look in the 0allwords.cdc file that is included in the /lex folder for each language. The words there are listed in alphabetical order.
9. To correct errors, you can run this command: mor +xb *.cha. Guidelines for fixing errors are given in chapter 4 below.
The computational design of MOR was guided by Roland Hausser’s (1990) MORPH system and was implemented by Mitzi Morris. Since 2000, Leonid Spektor has extended MOR in many ways. Christophe Parisse built POST and POSTTRAIN (Parisse & Le Normand, 2000). Kenji Sagae built MEGRASP as a part of his dissertation work for the Language Technologies Institute at Carnegie Mellon University (Sagae, MacWhinney, & Lavie, 2004a, 2004b). Leonid Spektor then integrated the program into CLAN.
The system has been designed to maximize portability across languages, extendability of the lexicon and grammar, and compatibility with the CLAN programs. The basic engine of the parser is language independent. Language-specific information is stored in separate data files that can be modified by the user. The lexical entries are also kept in ASCII files, and there are several techniques for improving the match of the lexicon to a corpus. To maximize the complete analysis of regular formations, only stems are stored in the lexicon, and inflected forms appropriate for each stem are compiled at run time.
To give an example of the results of a MOR analysis for English, consider this sentence from eve15.cha in Roger Brown’s corpus for Eve.

*CHI: oops I spilled it.
%mor: co|oops pro:subj|I v|spill-PAST pro:per|it .
Here, the main line gives the child’s production and the %mor line gives the part of speech for each word, along with the morphological analysis of affixes, such as the past tense mark (-PAST) on the verb.
The %mor lines in these files were not created by hand. To produce them, we ran the MOR command, using the MOR grammar for English, which can be downloaded using the Get MOR Grammar function described in the previous chapter. The command for running MOR by itself, without running the rest of the chain, is: mor +d *.cha. After running MOR, the file looks like this:
*CHI: oops I spilled it .
%mor: co|oops pro:subj|I part|spill-PASTP^v|spill-PAST
pro:per|it .
In the %mor tier, words are labeled by their syntactic category or “scat”, followed by the pipe separator |, followed then by the stem and affixes. Notice that the word “spilled” is initially ambiguous between the past-tense and participle readings. The two alternatives are separated by the ^ character. To resolve such ambiguities, we run a program called POST. The command is just “post *.cha”. After POST has been run, the %mor line will have only v|spill-PAST.
Using this disambiguated form, we can then run the MEGRASP program to create the representation given in the %gra line below:
*CHI: oops I spilled it .
%mor: co|oops pro:subj|I v|spill-PAST pro:per|it .
%gra: 1|3|COM 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|3|PUNCT
In the %gra line, we see that the second word “I” is related to the verb (“spilled”) through the grammatical relation (GR) of Subject. The fourth word “it” is related to the verb through the grammatical relation of Object. The verb is the Root, and it is related to the “left wall” or item 0.
Because MOR focuses on the analysis of the target utterance, it excludes a variety of non-words, retraces, and special symbols. Specifically, MOR excludes:
1. Items that start with &
2. Pauses such as (.)
3. Unknown forms marked as xxx, yyy, or www
4. Data associated with these codes: [/?], [/-], [/], [//], and [///].
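These exclusions can be sketched as a simple filter over main-line tokens (our own illustration, not CLAN code; retrace-marked material is assumed to have been stripped beforehand):

```python
import re

def mor_analyzes(token: str) -> bool:
    """Return False for tokens MOR excludes from the %mor tier:
    &-forms, pauses such as (.) or (..), and xxx/yyy/www."""
    if token.startswith("&"):
        return False
    if re.fullmatch(r"\(\.+\)", token):
        return False
    if token in {"xxx", "yyy", "www"}:
        return False
    return True

tokens = ["oops", "&um", "(.)", "xxx", "spilled"]
kept = [t for t in tokens if mor_analyzes(t)]
# -> ['oops', 'spilled']
```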
+d Do not run the POST command automatically. POST runs automatically after MOR unless this switch is used or unless the folder name includes the word “train”.
+eS Show the result of the operation of the arules on either a stem S or the stems in file @S. This output goes into a file called debug.cdc in your library directory. Another way of achieving this is to use the +d option inside interactive MOR.
+p use pinyin lexicon format for Chinese
+xi Run MOR in the interactive test mode. You type in one word at a time at the test prompt and MOR provides the analysis on line. This facility makes the following commands available in the CLAN Output window:
	word - analyze this word
	:q - quit and exit the program
	:c - print out the current set of crules
	:d - display the application of the arules
	:l - reload the rules and lexicon files
	:h - help; print this message
If you type in a word, such as “dog” or “perro,” MOR will try to analyze it and give you its component morphemes. If you change the rules or the lexicon, use :l to reload and retest. The :c and :d switches will send output to a file called debug.cdc in your library directory.
+xl Run MOR in the lexicon-building mode. This mode takes a series of .cha files as input and outputs a small lexical file, with the extension .ulx, containing entries for all words not recognized by MOR. This helps in the building of lexicons.
+xb check lexicon mode, include word location in data files
+xa check lexicon for ambiguous entries
+xc check lexicon mode, including capitalized words
+xd check lexicon for compound words conflicting with plain words
+xp check lexicon mode, including words with prosodic symbols
+xy analyze words in lex files
MOR breaks up words into their component parts or morphemes. In a relatively analytic language like English, many words require no analysis at all. However, even in English, a word like “coworkers” can be seen to contain four component morphemes: the prefix “co”, the stem, the agential suffix, and the plural. For this form, MOR will produce the analysis: co#n:v|work-AGT-PL. This representation uses the symbols # and - to separate the four different morphemes. Here, the prefix stands at the beginning of the analysis, followed by the stem (n|work) and the two suffixes. In general, stems always have the form of a part-of-speech category, such as “n” for noun, followed by the vertical bar and then a statement of the stem’s lexical form.
To understand the functioning of the MOR grammar for English, the best place to begin is with a tour of the files inside the ENG folder that you can download from the server. At the top level, you will see these files:
1. ar.cut – These are the rules that generate allomorphic variants from the stems and a