TalkBank Workflow


The workflow for ingesting data into TalkBank has the following steps: contributors read the guidelines and send mail to macw@cmu.edu, we reply, and the data are transferred. This is described in greater detail here.
We then check the data using CHECK and Chatter, create metadata files for the OLAC and CMDI/CLARIN systems, add documentation to the database documentation files, create streaming media, commit all files to GitHub, and announce the availability of the new corpus on Google Groups mailing lists.
We change archival data for two purposes. The first is to update the syntax of coding symbols. This does not involve data loss. The second is to normalize spellings for morphosyntactic analysis. In this case, we maintain the original form alongside the normalized form in the text, using a special code.
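As a rough illustration of the second kind of change, the sketch below keeps each original spelling in the text and places the normalized form next to it. It assumes the special code takes the form `original [: normalized]`; the spelling table and the function names `annotate` and `original_text` are hypothetical and are not part of the TalkBank tools.

```python
import re

# Hypothetical table of nonstandard spellings and their normalized forms.
NORMALIZED = {"gonna": "going to", "wanna": "want to"}


def annotate(utterance, table=NORMALIZED):
    """Keep each original form and add its normalized form in brackets.

    Assumes a replacement code of the form `original [: normalized]`.
    """
    def repl(match):
        word = match.group(0)
        target = table.get(word.lower())
        return f"{word} [: {target}]" if target else word

    # Words are runs of letters and apostrophes; everything else is untouched.
    return re.sub(r"[A-Za-z']+", repl, utterance)


def original_text(annotated):
    """Remove the bracketed replacements to recover the original wording."""
    return re.sub(r" \[: [^\]]*\]", "", annotated)


line = "I'm gonna go"
marked = annotate(line)
print(marked)
print(original_text(marked) == line)
```

Because the original form is never overwritten, stripping the bracketed codes returns the transcription as it was first written, which is the sense in which this kind of change avoids data loss.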
Our three programmers all have M.S. degrees in Computer Science. Our transcribers are selected for their ability to transcribe speech accurately.
All data types go through the same workflow.
All data that conform to the CHAT transcription standard are included.
We never receive data that do not conform to the mission, because the work of putting data into CHAT format already ensures that researchers want their relevant data included in the database.
Privacy of subjects is guarded through anonymization.
Data producers are asked to check over the final versions of their files to make sure they conform to their expectations.