The workflow for ingesting data into TalkBank has the
following steps: contributors read the guidelines, send mail
to macw@cmu.edu, and receive our reply, after which the data are
transferred. This is described in greater detail here.
We then check the data using CHECK and Chatter, create metadata
files for the OLAC and CMDI/CLARIN systems, add documentation to
the database documentation files, create streaming media, commit
all files to GitHub, and announce the availability of the new
corpus on Google Groups mailing lists.
We change archival data for two purposes. The first is to
update the syntax of coding symbols. This does not involve data
loss. The second is to normalize spellings for morphosyntactic
analysis. In this case, we maintain the original form alongside the
normalized form in the text, using a special code.
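
As a rough illustration of how both forms can be recovered from one
line of text, the sketch below assumes that the special code is the
bracketed replacement notation `[: ...]` placed after the original
word. The text above does not name the code, so this is an assumption.
The function name `split_forms` and the sample utterance are
hypothetical, and the pattern is a simplification rather than the full
CHAT grammar.

```python
import re

# Assumed notation: an original word followed by "[: normalized form]".
# Illustrative simplification only; it does not cover the full CHAT grammar.
REPLACEMENT = re.compile(r"(\S+) \[: ([^\]]+)\]")


def split_forms(utterance: str) -> tuple[str, str]:
    """Return the (original, normalized) readings of a marked-up utterance."""
    # Keep the word as transcribed and drop the bracketed replacement.
    original = REPLACEMENT.sub(lambda m: m.group(1), utterance)
    # Keep the bracketed replacement and drop the transcribed word.
    normalized = REPLACEMENT.sub(lambda m: m.group(2), utterance)
    return original, normalized


# Hypothetical sample utterance, not taken from any corpus.
original, normalized = split_forms("I wanna [: want to] go .")
print(original)
print(normalized)
```

Because the original word stays in the text, either reading can be
regenerated at any time, which is why this kind of normalization does
not lose information.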
Our three programmers all have M.S. degrees in Computer
Science. Our transcribers are selected for their ability to
transcribe speech accurately.
All data types go through the same workflow.
All data that conform to the CHAT transcription standard are
included.
We never receive data that do not conform to the mission,
because the effort of putting data into CHAT format ensures
that only researchers with relevant data seek to have them
included in the database.
Privacy of subjects is guarded through anonymization.
Data producers are asked to check over the final versions of their files to make sure they conform to their expectations.