# PARLO Dementia Corpus

The **PARLO Dementia Corpus (PDC)** is a German multi-center clinical
speech corpus collected at nine academic memory clinics. This release
provides the recordings as audio (`wav/`), manual professional
transcripts (`transcripts/`), and CLAN/CHAT files with word-level
time alignments (`CHAT/`). The raw forced-alignment word tables are
also included under `alignments/`.

## 1. Citation

Braun, F., Witzl, C., Hönig, F., Nöth, E., Bocklet, T., & Riedhammer, K.
(2026). *The PARLO Dementia Corpus: A German Multi-Center Resource for
Alzheimer's Disease.* In Proceedings of the Fifteenth Language Resources
and Evaluation Conference (LREC 2026), pages 9581–9591, Palma, Mallorca,
Spain. European Language Resources Association (ELRA).
<https://doi.org/10.63317/5eo6ayamaqnq>

```bibtex
@inproceedings{braun-etal-2026-parlo,
  author    = {Braun, Franziska and Witzl, Christopher and H{\"o}nig, Florian and N{\"o}th, Elmar and Bocklet, Tobias and Riedhammer, Korbinian},
  title     = {The PARLO Dementia Corpus: A German Multi-Center Resource for Alzheimer's Disease},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  editor    = {Piperidis, Stelios and Bel, N{\'u}ria and van den Heuvel, Henk and Ide, Nancy and Krek, Simon and Toral, Antonio},
  month     = {May},
  year      = {2026},
  pages     = {9581--9591},
  address   = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  doi       = {10.63317/5eo6ayamaqnq},
  abstract  = {Early and accessible detection of Alzheimer's disease (AD) remains a major challenge, as current diagnostic methods often rely on costly and invasive biomarkers. Speech and language analysis has emerged as a promising non-invasive and scalable approach to detecting cognitive impairment, but research in this area is hindered by the lack of publicly available datasets, especially for languages other than English. This paper introduces the PARLO Dementia Corpus (PDC), a new multi-center, clinically validated German resource for AD collected across nine academic memory clinics in Germany. The dataset comprises speech recordings from individuals with AD-related mild cognitive impairment and mild to moderate dementia, as well as cognitively healthy controls. Speech was elicited using a standardized test battery of eight neuropsychological tasks, including confrontation naming, verbal fluency, word repetition, picture description, story reading, and recall tasks. In addition to audio recordings, the dataset includes manually verified transcriptions and detailed demographic, clinical, and biomarker metadata. Baseline experiments on ASR benchmarking, automated test evaluation, and LLM-based classification illustrate the feasibility of automatic, speech-based cognitive assessment and highlight the diagnostic value of recall-driven speech production. The PDC thus establishes the first publicly available German benchmark for multi-modal and cross-lingual research on neurodegenerative diseases.}
}
```

## 2. Corpus Overview


### 2.1 Subjects

| Group                          | Code         | n   |
| ------------------------------ | ------------ | --: |
| Healthy controls               | `control`    |  83 |
| Mild cognitive impairment      | `patient_mci`|  59 |
| Dementia                       | `patient_ad` |  66 |
| **Total**                      |              | 208 |

| Demographic            | Value                                |
| ---------------------- | ------------------------------------ |
| Age range              | 55 – 87 years |
| Mean age (± SD)        | 70.4 ± 8.4 |
| Gender                 | 96 male / 112 female |

### 2.2 Tasks (PARLO Dementia Test Battery, PDTB)

All eight PDTB tasks are present in this distribution. Six are aligned
at word level; the two repetition tasks are distributed as audio and
manual transcript only, without a `.cha` file (see §3 and §4.2).

| task name             | description                                            | word-level FA |
| --------------------- | ------------------------------------------------------ | :-----------: |
| story_reading         | Reading aloud the *Johanna Subway* story               | ✓             |
| boston_naming         | Confrontation naming, 15 line drawings (BNT variant)   | ✓             |
| animal_naming         | Semantic verbal fluency, animals, 60 s                 | ✓             |
| picture_description   | Description of the *Mountain Scene* picture            | ✓             |
| pataka_repetition     | Word repetition `pataka`, 10 s                         | —             |
| sischafu_repetition   | Word repetition `sischafu`, 10 s                       | —             |
| story_recall          | Delayed recall of *Johanna Subway*                     | ✓             |
| picture_recall        | Delayed recall of the *Mountain Scene* picture         | ✓             |

## 3. Directory Layout

```
pdc/
├── alignments/<stem>.tsv      forced-alignment word table
├── CHAT/<stem>.cha            CLAN-format transcript with bullets
├── wav/<stem>.wav             audio (44.1 kHz, 16-bit mono)
└── transcripts/<stem>.txt     manual transcript (Dresing & Pehl)
```

`wav/` and `transcripts/` cover all 1633 recordings.
`CHAT/` and `alignments/` cover the 1196 recordings for
which word-level forced alignment is available, i.e. everything except
the two word-repetition tasks (see §2.2) and a small number of empty recordings.
Stems missing from `CHAT/` always have a matching `wav/` + `transcripts/`
pair; consult those for the manual transcript and the audio.
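
The coverage difference can be checked locally. The following is a minimal sketch, assuming the corpus is unpacked under a directory laid out as shown above; the function name `stems_without_chat` and the default root `pdc` are illustrative and not part of the distribution.

```python
from pathlib import Path


def stems_without_chat(root="pdc"):
    """Return the sorted stems that have a wav file but no .cha file."""
    root = Path(root)
    wav_stems = {p.stem for p in (root / "wav").glob("*.wav")}
    cha_stems = {p.stem for p in (root / "CHAT").glob("*.cha")}
    return sorted(wav_stems - cha_stems)
```

Going by the counts stated above, a complete copy should yield 1633 - 1196 = 437 stems.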

### 3.1 File-Naming Convention

```
PARLO_<patient_id>_<task_name>_<YYYYMMDDhhmm>.<ext>
```

- `<patient_id>` — unique numeric subject ID (matches the `patient_id`
  column in `metadata.csv`)
- `<task_name>` — one of the human-readable task names from §2.2
- `<YYYYMMDDhhmm>` — recording timestamp (minute precision)

Example: `PARLO_1002_picture_description_202006081001.cha`.
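
Because task names contain underscores themselves, splitting a file name on `_` is ambiguous. A minimal sketch of a regex-based parser follows; the names `STEM_RE` and `parse_name` are illustrative, and the pattern assumes lowercase task names and a 12-digit timestamp as described above.

```python
import re
from datetime import datetime

# PARLO_<patient_id>_<task_name>_<YYYYMMDDhhmm>.<ext>
# The numeric ID and the 12-digit timestamp anchor the match on either side
# of the task name, which may itself contain underscores.
STEM_RE = re.compile(
    r"^PARLO_(?P<patient_id>\d+)_(?P<task>[a-z_]+)_(?P<ts>\d{12})\.(?P<ext>\w+)$"
)


def parse_name(filename):
    """Split a PDC file name into its parts; raise ValueError if it does not match."""
    m = STEM_RE.match(filename)
    if m is None:
        raise ValueError(f"not a PDC file name: {filename!r}")
    return {
        "patient_id": m["patient_id"],  # kept as a string, verbatim
        "task": m["task"],
        "recorded": datetime.strptime(m["ts"], "%Y%m%d%H%M"),
        "ext": m["ext"],
    }


info = parse_name("PARLO_1002_picture_description_202006081001.cha")
print(info["patient_id"], info["task"], info["recorded"].isoformat())
# prints: 1002 picture_description 2020-06-08T10:01:00
```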

### 3.2 TSV Format under `alignments/`

Each alignment TSV has the columns `word\tstart\tend\tspeaker`
(`start`/`end` in seconds as floats with 3 decimals; `speaker` is `A` or `B`).
File naming follows the same `PARLO_<patient_id>_<task_name>_<ts>` schema
as the CHA/wav/txt outputs or, for legacy forced-alignment (FA) runs done
before the source rename, the older `psp-prod_<patient_id>_<test_id>_…_<ts>` form.
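
A minimal sketch of reading such a table with the standard library. Whether the files start with a header row is not stated here, so the reader skips one if present; the function name `read_alignment` and the sample rows are made up for illustration.

```python
import csv
import io

COLUMNS = ("word", "start", "end", "speaker")


def read_alignment(text):
    """Parse alignment TSV text into a list of dicts with float times in seconds."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for rec in reader:
        # Skip blank lines and an optional header row (assumption: it may or may not exist).
        if not rec or tuple(rec) == COLUMNS:
            continue
        word, start, end, speaker = rec
        rows.append(
            {"word": word, "start": float(start), "end": float(end), "speaker": speaker}
        )
    return rows


# Illustrative sample, not taken from the corpus.
sample = "word\tstart\tend\tspeaker\nJohanna\t0.750\t2.340\tB\nnahm\t2.340\t2.670\tB\n"
print(read_alignment(sample)[0])
# prints: {'word': 'Johanna', 'start': 0.75, 'end': 2.34, 'speaker': 'B'}
```

For a real file, pass `open(path, encoding="utf-8").read()` as `text`.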

## 4. CHAT Files (`CHAT/*.cha`)

The `.cha` files contain a **word-level time-aligned transcription** of
each recording, encoded in **CLAN-CHAT** format (UTF-8). Each spoken
word carries millisecond-precise start and end times. For the full
format specification see the TalkBank CHAT manual:
<https://talkbank.org/manuals/CHAT.html>.

The transcription is derived from the recorded audio. Surface features
of the manual transcripts (pauses, paraverbal annotations, truncation
markers, direct speech) are not reproduced on the speaker tiers; the
original `transcripts/*.txt` files remain part of this distribution and
are the canonical reference for that layer.

### 4.1 Tier Layout

Each utterance is encoded as a pair of CHAT-standard tiers:

```
*PAR:  Johanna nahm den Flug von Frankfurt nach Barcelona . ●750_10230●
%wor:  Johanna●750_2340● nahm●2340_2670● den●2670_2880● Flug●2880_3210●
       …
```

- **Main tier** (`*PAR:` for the patient, `*INV:` for the investigator):
  the orthographic word forms, followed by a sentence-final `.` and
  the utterance-level time bullet `\x15<start_ms>_<end_ms>\x15`
  (the `\x15` delimiter is rendered as `●` in the example above).
- **`%wor:` dependent tier**: one word-level bullet per token,
  line-wrapped to ≤ 80 characters with tab continuation.

Apart from the fragment tokens described in §4.3, which appear on the
main tier only, token counts on the two tiers are identical by
construction, so no main/`%wor:` mismatches occur.

Speaker mapping in multi-speaker recordings: `A` (investigator) →
`*INV`, `B` (patient) → `*PAR`. Recordings without speaker markers in
the source transcript are patient-only and contain a single `*PAR`
participant.
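
The word-level bullets can be read without CLAN. The sketch below is one possible way to extract them, not the tooling used to build the corpus. It assumes the delimiter is `\x15` as stated above, also accepts `●` so that the printed example parses, and tolerates an optional space between word and bullet. The names `wor_tiers` and `word_times` are illustrative.

```python
import re

# A token followed by its time bullet: word, delimiter, start_end, delimiter.
WORD_BULLET = re.compile(r"([^\s\x15●]+)\s*[\x15●](\d+)_(\d+)[\x15●]")


def wor_tiers(cha_text):
    """Yield each %wor tier as one string, joining indented continuation lines."""
    current = None
    for line in cha_text.split("\n"):
        if line.startswith("%wor:"):
            if current is not None:
                yield current
            current = line[len("%wor:"):]
        elif current is not None and line[:1] in ("\t", " "):
            current += " " + line.strip()  # wrapped continuation of the tier
        else:
            if current is not None:
                yield current
            current = None
    if current is not None:
        yield current


def word_times(cha_text):
    """Return (word, start_ms, end_ms) triples for every token on the %wor tiers."""
    return [
        (word, int(start), int(end))
        for tier in wor_tiers(cha_text)
        for word, start, end in WORD_BULLET.findall(tier)
    ]


demo = "%wor:\tJohanna●750_2340● nahm●2340_2670●"
print(word_times(demo))
# prints: [('Johanna', 750, 2340), ('nahm', 2340, 2670)]
```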

### 4.2 Utterance Segmentation

One utterance corresponds to one continuous **speaker block**: all
consecutive words from the same speaker form a single utterance,
terminated by `.` plus the utterance time bullet. The only segmentation
signal is a change of speaker; pauses within a speaker's stretch are
not split into separate utterances.
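
The speaker-block rule can be sketched with `itertools.groupby`. This illustrates the rule as described and is not the pipeline that produced the files; the function name `speaker_blocks`, the tuple layout, and the sample rows are assumptions made for the example.

```python
from itertools import groupby

SPEAKER_TIER = {"A": "*INV", "B": "*PAR"}  # mapping described in §4.1


def speaker_blocks(rows):
    """Group consecutive words of the same speaker into one utterance each.

    `rows` is an iterable of (word, start_s, end_s, speaker) tuples in time order.
    """
    utterances = []
    for speaker, group in groupby(rows, key=lambda r: r[3]):
        words = list(group)
        utterances.append({
            "tier": SPEAKER_TIER.get(speaker, "*PAR"),
            "text": " ".join(w[0] for w in words) + " .",
            "start_ms": round(words[0][1] * 1000),   # start of first word
            "end_ms": round(words[-1][2] * 1000),    # end of last word
        })
    return utterances


# Illustrative rows, not taken from the corpus.
rows = [
    ("Erzählen", 0.2, 0.9, "A"), ("Sie", 0.9, 1.1, "A"),
    ("Johanna", 1.5, 2.0, "B"), ("nahm", 2.6, 2.9, "B"),
]
for utt in speaker_blocks(rows):
    print(utt["tier"], utt["text"], utt["start_ms"], utt["end_ms"])
```

Note that the 0.6 s gap between the two patient words does not start a new utterance; only the change from `A` to `B` does.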

For tasks without word-level alignment (`pataka_repetition`,
`sischafu_repetition`), no `.cha` file exists. See §3.

### 4.3 Conventions on Special Tokens

**Truncated words / single articulated letters** appear on the main
tier with a CHAT phonological-fragment marker (`&+a`, `&+h`, `&+U` …).
They are not duplicated on `%wor:`. Example: a speaker who began „A…"
before saying „Ameise" shows up as `*PAR: &+A Ameise`.

### 4.4 What Is and Isn't on the Tiers

| Information | On the tiers? | Recovery source |
| ----------- | :-----------: | --------------- |
| Spoken word forms (orthographic) | ✓ | — |
| Word-level start/end times (ms)  | ✓ | — |
| Speaker identity per word        | ✓ | — |
| Capitalisation                   | ✓ | (matches manual transcript) |
| Subject metadata (age, group, MMSE, …) | ✓ | header `@Comment` lines, see §4.5 |
| Audio reference                  | ✓ | `@Media:` header line |
| Filled pauses (`äh`, `ähm`)      | (✓) | only when articulated and aligned |
| Word-truncation marker (trailing `/`, e.g. `Treppen/`) | ✗ | `transcripts/*.txt` |
| Silent pauses (`(.)`, `(...)`, `(5)`) | ✗ | `transcripts/*.txt` |
| Paraverbal annotations (`(lachen)`, `(räuspern)`) | ✗ | `transcripts/*.txt` |
| Direct speech / quotation marks  | ✗ | `transcripts/*.txt` |
| Sentence-level punctuation `.!?` | ✗ | `transcripts/*.txt` |

### 4.5 Per-Subject Metadata in `@Comment` Lines

Each `.cha` header carries every relevant column from
`metadata.csv` as a series of grouped `@Comment` lines (`ids:`,
`demographics:`, `cognition:`, `biolog:`, `biomarker:`, `recording:`,
`study:`). Empty cells appear as `key=`. This makes batch filtering
trivial:

```bash
# all picture-description recordings of AD patients with MMSE ≤ 22
# (the alternation matches the scores 0 to 22)
grep -l 'diagnosis_of_AD=True' CHAT/*picture_description*.cha \
    | xargs grep -lE 'mmse_value=([0-9]|1[0-9]|2[0-2])\b'
```
```
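
A text pattern can only approximate a numeric threshold, so for anything beyond quick filtering it may be easier to parse the header and compare numbers. The sketch below assumes that the `key=value` pairs on a `@Comment` line are separated by whitespace, commas, semicolons or `|`, and that values contain none of these. The exact separator is not specified here, and the names `header_metadata` and `mmse_at_most` are illustrative.

```python
import re

# Assumption: values contain no whitespace, comma, semicolon or "|".
PAIR = re.compile(r"(\w+)=([^\s,;|]*)")


def header_metadata(cha_text):
    """Collect key=value pairs from all @Comment header lines into a dict."""
    meta = {}
    for line in cha_text.split("\n"):
        if line.startswith("@Comment:"):
            meta.update(PAIR.findall(line))
    return meta


def mmse_at_most(meta, limit):
    """True if mmse_value is present and numerically <= limit."""
    try:
        return float(meta.get("mmse_value", "")) <= limit
    except ValueError:  # empty cell (`mmse_value=`) or non-numeric value
        return False


# Illustrative header line, not copied from a corpus file.
sample = "@Comment:\tcognition: mmse_value=21, cdr_value=, diagnosis_of_AD=True"
meta = header_metadata(sample)
print(meta["diagnosis_of_AD"] == "True" and mmse_at_most(meta, 22))
# prints: True
```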

## 5. WAV Files (`wav/*.wav`)

Renamed copies (or symlinks, depending on how the directory was
populated) of the original recordings. Format matches the source:
**44.1 kHz, 16-bit mono PCM WAV**.
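
The stated format can be verified with Python's standard `wave` module. This is a minimal sketch; the function names are illustrative.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, bits_per_sample, channels) of a PCM WAV file."""
    with wave.open(str(path), "rb") as w:
        return w.getframerate(), w.getsampwidth() * 8, w.getnchannels()


def is_expected_format(path):
    """Check a file against the stated 44.1 kHz, 16-bit mono format."""
    return wav_format(path) == (44100, 16, 1)
```

Running `is_expected_format` over `wav/*.wav` gives a quick sanity check after download or conversion.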

## 6. Manual Transcripts (`transcripts/*.txt`)

Manual professional transcription of every recording, following the
**extended scientific transcription rules of Dresing & Pehl**
([Praxisbuch, 9th ed., 2024](https://www.audiotranskription.de/wp-content/uploads/2024/06/Praxisbuch_09_02_Web2.pdf)).
The text includes:

- Filler particles (`äh`, `ähm`, `mhh`)
- Silent-pause notation: `(.)` short, `(..)` medium, `(...)` long;
  numeric for ≥ 4 s, e.g. `(5)` for a 5-second pause
- Paraverbal events in parentheses: `(lachen)`, `(räuspern)`,
  `(nachdenkend)`, `(unv.)` for unintelligible
- Word truncations marked with trailing `/`, e.g. `Treppen/`
- Speaker markers `A:` (investigator) and `B:` (patient) for
  multi-speaker turns
- End-of-turn timestamps `#hh:mm:ss-d#` (`d` = tenths of seconds)
  recorded via a button press by the investigator
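
The pause and timestamp notations above are regular enough to extract with two patterns. The sketch below covers only these two conventions. The function names and the sample line are made up for illustration, and the timestamp pattern assumes two-digit `hh`, `mm` and `ss` fields.

```python
import re

TIMESTAMP = re.compile(r"#(\d{2}):(\d{2}):(\d{2})-(\d)#")
PAUSE = re.compile(r"\((\.{1,3}|\d+)\)")


def timestamp_seconds(text):
    """Convert every #hh:mm:ss-d# marker in `text` to seconds (d = tenths)."""
    return [
        int(h) * 3600 + int(m) * 60 + int(s) + int(d) / 10
        for h, m, s, d in TIMESTAMP.findall(text)
    ]


def pauses(text):
    """Return pause annotations as 'short'/'medium'/'long' or a number of seconds."""
    labels = {".": "short", "..": "medium", "...": "long"}
    return [labels.get(p) or int(p) for p in PAUSE.findall(text)]


# Illustrative line, not taken from the corpus.
line = "B: Da ist ein (.) Berg und (5) ähm eine Treppen/ (lachen) #00:00:23-4#"
print(pauses(line))             # paraverbal events such as (lachen) are not pauses
print(timestamp_seconds(line))
```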

## 7. Metadata Fields

All 56 columns of `metadata.csv` are mirrored into each `.cha`
header (split between `@ID`, `@Date`, and grouped `@Comment` lines).
Boolean fields use the strings `True` / `False`; missing values are
represented by an empty string. Values are passed through verbatim from
the source CSV.
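
When loading `metadata.csv` directly, the string conventions above can be mapped to Python values. This is a minimal sketch that assumes a comma-delimited file with a header row; the delimiter is not stated here, so it is exposed as a parameter. The names `coerce` and `read_metadata` are illustrative.

```python
import csv
import io


def coerce(cell):
    """Map 'True'/'False' to bool and '' to None; leave other values as strings."""
    if cell == "":
        return None
    if cell in ("True", "False"):
        return cell == "True"
    return cell


def read_metadata(text, delimiter=","):
    """Parse metadata CSV text into a list of dicts with coerced cells."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{key: coerce(value) for key, value in row.items()} for row in reader]
```

Numeric columns such as `age` or `mmse_value` stay strings in this sketch and need an explicit `int()` or `float()` where required.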

### 7.1 Identifiers & Setting

| field                     | description                                           |
| ------------------------- | ----------------------------------------------------- |
| `patient_id`              | unique numeric subject ID                             |
| `task`                    | task name (see §2.2; derived from filename)           |
| `type`                    | diagnostic group: `control` / `patient_mci` / `patient_ad` |
| `type_numeric`            | numeric encoding of `type` (0=control, 1=mci, 2=ad)   |
| `clinic_id`               | study-internal numeric ID of the recording clinic     |
| `investigator_clinic_city`| federal state of the recording clinic                 |
| `assessment_creation_date`| date the assessment file was generated (`@Date` header) |
| `assessment_creation_time`| time the assessment file was generated (`HH:MM:SS`)   |
| `ios_version`             | iOS version of the iPad used for recording            |

### 7.2 Demographics

| field                | description                                              |
| -------------------- | -------------------------------------------------------- |
| `age`                | age at recording, in years (also `@ID` position 4)       |
| `age_in_range`       | True if age within the study eligibility range           |
| `gender`             | `male` / `female` (also `@ID` position 5)                |
| `gender_numeric`     | 0 = male, 1 = female                                     |
| `female_not_fertile` | for female subjects: childbearing potential excluded     |
| `education1`         | coded school qualification (e.g. `SDAL`, `SDMS`, `SDPS`) |
| `education2`         | coded professional qualification (e.g. `DUNI`, `DVTR`)   |

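**Education codes** (from the paper, Table 3). `SD*` codes are school
qualifications (`education1`), `D*` codes are professional qualifications
(`education2`), and "SD years" means the years of the subject's school qualification: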

| code | qualification               | avg. years          |
| ---- | --------------------------- | -------------------:|
| SDAL | A-Level                     | 13                  |
| SDPS | Polytechnical School        | 12                  |
| SDSS | Secondary School            | 10                  |
| SDMS | Main Schooling              | 9                   |
| SDWO | Without                     | 9                   |
| SDNC | No Comment                  | 9                   |
| DPHD | PhD                         | 21                  |
| DUNI | University                  | 17                  |
| DVTR | Vocational Training         | SD years + 3        |
| DWTR | Without Vocational Training | equals SD years     |
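
As a worked example of the table, the sketch below combines a school code and a professional code into average years of education. How the two columns combine is an interpretation of the table (fixed values for `DPHD` and `DUNI`, school years plus 3 for `DVTR`, school years alone for `DWTR`) and may not reproduce `totalYearsofEducation` exactly. The function name is illustrative.

```python
SCHOOL_YEARS = {"SDAL": 13, "SDPS": 12, "SDSS": 10, "SDMS": 9, "SDWO": 9, "SDNC": 9}
FIXED_DEGREE_YEARS = {"DPHD": 21, "DUNI": 17}


def education_years(education1, education2):
    """Average years of education implied by a school code and a professional code."""
    school = SCHOOL_YEARS[education1]
    if education2 in FIXED_DEGREE_YEARS:
        return FIXED_DEGREE_YEARS[education2]
    if education2 == "DVTR":  # vocational training: school years + 3
        return school + 3
    if education2 == "DWTR":  # no vocational training: school years only
        return school
    raise ValueError(f"unknown professional code: {education2!r}")


print(education_years("SDMS", "DVTR"))  # 9 + 3
# prints: 12
```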

### 7.3 Cognition & Diagnosis

| field                          | description                                              |
| ------------------------------ | -------------------------------------------------------- |
| `totalYearsofEducation`        | total years of education (also `@ID` position 9)         |
| `mmse_value`                   | Mini-Mental State Examination raw score (0 – 30)         |
| `mmse_score_in_range`          | True if MMSE score is within accepted study range        |
| `mmse_score_above_24`          | True if MMSE > 24                                        |
| `cdr_value`                    | Clinical Dementia Rating (0, 0.5, 1, 2, 3)               |
| `cdr_global_score_in_range`    | True if CDR within accepted study range                  |
| `cdr_global_score_equals_zero` | True if CDR = 0                                          |
| `cerad_cognitive_test_normal`  | CERAD neuropsychological battery normal (≤ 1.2 SD below age-adjusted norms) |
| `cognitive_complaints_reported`| subject reports subjective cognitive complaints          |
| `diagnosis_of_AD`              | clinical diagnosis of Alzheimer's disease present (NIA-AA AD criteria) |
| `diagnosis_of_mci`             | clinical diagnosis of mild cognitive impairment present (NIA-AA AD criteria) |
| `diagnosis_excludes_dementia`  | dementia explicitly excluded by clinician                |

### 7.4 Biological / Speech-Relevant

| field                                              | description                                               |
| -------------------------------------------------- | --------------------------------------------------------- |
| `german_native_speaker`                            | subject is a native German speaker                        |
| `no_impairments_german`                            | no impairments affecting use of the German language       |
| `no_impairments_speech`                            | no impairments of speech production                       |
| `no_impairments_vision_and_hearing`                | no impairments of vision or hearing                       |
| `parkinson`                                        | clinical diagnosis of Parkinson's disease                 |
| `intracranial_lesions_or_other`                    | intracranial lesions or comparable condition present      |
| `major_cerebrovascular_disease`                    | major cerebrovascular disease                             |
| `using_psychoactive_medicine`                      | uses psychoactive medication relevant to cognition        |
| `medical_condition_interfering`                    | any medical condition interfering with the assessment     |
| `medical_condition_prohibits_digital_device_use`   | unable to use the iPad due to medical condition           |
| `ongoing_disease_could_impair_memory_or_speech`    | ongoing disease that could impair memory or speech        |
| `memory_impairment_due_to_other_causes`            | memory impairment attributed to other causes              |

### 7.5 Biomarker & Imaging

| field                            | description                                                  |
| -------------------------------- | ------------------------------------------------------------ |
| `biomarker_in_liquor_amyloid`    | β-amyloid measured in cerebrospinal fluid                    |
| `beta_amyloid_value`             | β-amyloid concentration (study-internal units)               |
| `no_beta_amyloid_change_detected`| no abnormal β-amyloid change detected                        |
| `biomarker_in_liquor_p_tau`      | phosphorylated-tau measured in cerebrospinal fluid           |
| `p_tau_value`                    | p-Tau concentration (study-internal units)                   |
| `mri_t1_t2_flair`                | MRI scan with T1/T2/FLAIR sequences available                |
| `pet_based_on_disease`           | PET imaging consistent with disease group                    |

### 7.6 Recording Conditions

| field             | description                                                |
| ----------------- | ---------------------------------------------------------- |
| `quiet_room`      | recording took place in a sufficiently quiet environment   |
| `sufficient_time` | enough time was available to complete the assessment       |

### 7.7 Study Compliance

| field                                  | description                                                 |
| -------------------------------------- | ----------------------------------------------------------- |
| `informed_consent`                     | written informed consent obtained                           |
| `full_legal_competence`                | subject is legally competent                                |
| `parallel_study_participation`         | participates in a parallel study                            |
| `document_side_effects_agreed`         | consent to document side effects                            |
| `medical_adversed`                     | adverse medical event recorded during participation         |
| `document_bugs_agreed`                 | consent to log app/device technical issues                  |
| `guarantee_stable_internet_connection` | stable internet connection guaranteed at recording site     |
| `pharmakovigilanz`                     | pharmacovigilance reporting applies                         |

## 8. Tools

The corpus is designed to be used with the standard CHAT/TalkBank tools:

- **CLAN** (canonical CHAT editor + audio playback):
  <https://dali.talkbank.org/clan/>
- **ELAN** (timeline-based annotation viewer; convert CHAT → EAF via the
  CLAN command `chat2elan`):
  <https://archive.mpi.nl/tla/elan>
- **TalkBank CHAT manual:** <https://talkbank.org/manuals/CHAT.html>

Open any `CHAT/*.cha` in CLAN — the audio loads automatically through
the relative `@Media` reference into `../wav/`.