Episodicity Derived Corpus

	Rui He Universitat Pompeu Fabra wolfram.hinzen@upf.edu		Wolfram Hinzen Universitat Pompeu Fabra wolfram.hinzen@upf.edu
DOI:

Citation

If using this corpus in published materials, please use the following citation.

Mary Lofgren, Wolfram Hinzen, Breaking the flow of thought: increase of empty pauses in the connected speech of people with mild and moderate Alzheimer’s disease, Journal of Communication Disorders (2022), doi: https://doi.org/10.1016/j.jcomdis.2022.106214

This .zip file has the publication, the data and the coding scheme.

Methods

We obtained the spontaneous speech data from the Pitt Corpus (https://dementia.talkbank.org/access/English/Pitt.html, grant support: NIA AG03705 and AG05133), available on DementiaBank (Becker et al., 1994). This corpus provides audio recordings and transcripts from healthy controls and people with probable AD, together with their demographic data. We selected 70 healthy older controls (HOC), 82 people with mild AD, and 46 people with moderate AD, applying the same cutoffs as Lofgren and Hinzen (2022). The filtering criteria used in the present study were: (1) HOC: categorized as control in the Pitt Corpus, with an MMSE score above 26; (2) Mild AD: categorized as probable AD in the Pitt Corpus, with an MMSE score from 16 to 24; (3) Moderate AD: categorized as probable AD in the Pitt Corpus, with an MMSE score from 8 to 15. In this corpus, participants were instructed to describe what is going on in the Cookie Theft picture. There were 3896 utterances in total: 1087 from the interviewers and 2809 from the interviewees. Only utterances from the interviewees were included in the analyses.
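The filtering criteria above can be sketched as a small function. This is only an illustration: the function name, the diagnosis strings ("control", "probablead"), and the returned group labels are assumptions made for the example and are not the field values used in the Pitt Corpus files. Only the MMSE cutoffs are taken from the text.

```python
from typing import Optional


def assign_group(diagnosis: str, mmse: int) -> Optional[str]:
    """Map a participant to HOC, MildAD, or ModerateAD, or None if excluded.

    The diagnosis strings and group labels are hypothetical; the MMSE
    cutoffs follow the filtering criteria described in the text.
    """
    dx = diagnosis.strip().lower()
    if dx == "control":
        # (1) HOC: control with an MMSE score above 26
        return "HOC" if mmse > 26 else None
    if dx == "probablead":
        # (2) Mild AD: probable AD with an MMSE score from 16 to 24
        if 16 <= mmse <= 24:
            return "MildAD"
        # (3) Moderate AD: probable AD with an MMSE score from 8 to 15
        if 8 <= mmse <= 15:
            return "ModerateAD"
    # Any other combination falls outside the three groups.
    return None
```

Under these assumptions, a control with an MMSE score of exactly 26, or a person with probable AD and an MMSE score of 25, is assigned to no group.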

Annotation

As episodicity is an event-based concept, we first segmented the texts into utterances. In this study, we defined an utterance as a grammatically independent unit of speech that provides new information to the discourse. According to this definition, an utterance does not have to be a clause (e.g., "Great shot!" can be an utterance although it is not syntactically a clause), and a clause does not have to be an utterance (a clause can occur embedded in another, in which case it is not an utterance). The ‘new information’ requirement excludes reduced utterances that primarily serve a discourse function (e.g., "Oh well"). The original utterance division of the Pitt Corpus was changed wherever it did not comply with our definition. As clausal structures carry event information and can be embedded in an utterance, we also annotated episodicity separately in embedded clauses (EC). These were defined as syntactically dependent units with a predicate and an argument, where the predicates in question are verb phrases (VPs).

We classified utterances into three main categories: episodic (EP), non-episodic (NONEP), and other (OTHER). We defined an episodic utterance as an utterance describing a dynamic (as opposed to static) event that is both specific (rather than indefinite or generic) and presented from a first-person perspective. In the context of a picture description task, this will often be the description of an event as ongoing at the time of speaking. Judgments of episodicity were intentionally qualitative, in the sense that we could not identify a specific set of linguistic markers that were jointly necessary and sufficient for episodicity. However, for every utterance, a number of linguistic indicators were considered to inform this judgment, such as the grammatical Aspect of verbs or copular clauses, as noted in the introduction. Details of these linguistic features, with a flowchart for the annotation scheme, can be found in the supplementary materials. Within OTHER, three subdivisions were introduced, since it seemed necessary to distinguish primary on-task utterances from different kinds of off-task ones: (i) Meta: non-verbal components that appear in the description, verbal but off-task utterances, and meta-task utterances, e.g., xxx (hesitation), I can’t read, thank you; (ii) Interj: interjections through sounds, words, or brief sentences, typically expressing an emotional reaction, e.g., My God, yeah, mhm; (iii) Incomp: incomprehensible utterances that are severely impaired semantically and/or syntactically, e.g., She’s, and then the, cappdfk five seven.
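For anyone processing the annotations programmatically, the label inventory described above could be represented roughly as follows. This is a sketch, not the format of the distributed coding scheme: the function name is hypothetical, and the rule that every OTHER utterance carries exactly one of the three subdivisions (while EP and NONEP carry none) is an assumption made for the example.

```python
from typing import Optional

# Main categories and OTHER subdivisions as named in the annotation scheme.
MAIN_CATEGORIES = {"EP", "NONEP", "OTHER"}
OTHER_SUBTYPES = {"Meta", "Interj", "Incomp"}


def is_valid_label(main: str, subtype: Optional[str] = None) -> bool:
    """Check that a (main, subtype) pair is consistent with the scheme.

    Assumption: OTHER always requires a subtype; EP and NONEP never take one.
    """
    if main not in MAIN_CATEGORIES:
        return False
    if main == "OTHER":
        return subtype in OTHER_SUBTYPES
    return subtype is None
```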

Two annotators labeled all utterances independently. The interrater reliability, measured as the proportion of utterances on which the two annotators agreed in both the whole-utterance and the EC annotations, was 95.91%. A third annotator, who also supervised this study, checked all utterances on which the two annotators disagreed, as well as 10% of the utterances on which they agreed.
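The agreement measure described above is a simple percent agreement, which could be computed along the following lines. The function name and the data layout are assumptions made for illustration: each utterance is represented as a pair of its whole-utterance label and a tuple of its EC labels, and two annotations count as agreeing only if both parts match.

```python
from typing import Sequence, Tuple

# Hypothetical layout: (whole-utterance label, tuple of EC labels)
Annotation = Tuple[str, Tuple[str, ...]]


def percent_agreement(a: Sequence[Annotation], b: Sequence[Annotation]) -> float:
    """Percentage of utterances with identical utterance and EC labels."""
    if len(a) != len(b):
        raise ValueError("Both annotators must label the same utterances.")
    if not a:
        raise ValueError("At least one utterance is required.")
    # Tuple equality requires both the utterance label and all EC labels to match.
    agreed = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * agreed / len(a)


# Toy data, not taken from the corpus.
annotator_1 = [("EP", ()), ("NONEP", ("EP",)), ("OTHER", ()), ("EP", ("NONEP",))]
annotator_2 = [("EP", ()), ("NONEP", ("EP",)), ("OTHER", ()), ("EP", ("EP",))]
toy_agreement = percent_agreement(annotator_1, annotator_2)
```

In the toy data, the fourth utterance receives the same whole-utterance label from both annotators but different EC labels, so it counts as a disagreement. Percent agreement does not correct for chance agreement.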