Data Preservation and Migration

TalkBank

Data Preservation

TalkBank has a preservation policy based on backups in CMU Cloud and long-term preservation by the University.
Transcript data is backed up through github.com repositories. The git repositories are stored both on several local machines, from which commits are managed, and in a master repository on a server in the CMU Cloud facility running Ubuntu Linux. The TalkBankDB facility also stores time-stamped versions of the database.
Media data are on a local machine with four 5TB local disks, along with backup disks for each local disk. Whenever new data are ingested, the sync-all.sh script sends copies of new materials to synchronize with a master copy in the CMU Cloud Plus facility.
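As an illustration of the synchronization step, the following is a minimal sketch of how one might list the files that a local media disk holds but a mirror copy does not yet have. It is not the contents of sync-all.sh, which are not reproduced here; the function name and the directory layout are assumptions made for the example.

```python
from pathlib import Path


def files_missing_from_mirror(source: Path, mirror: Path) -> list[Path]:
    """Return relative paths of files under `source` that `mirror` lacks.

    Hypothetical helper for illustration only; it checks for presence,
    not for content, so a changed file with the same name is not reported.
    """
    missing = []
    for path in sorted(source.rglob("*")):
        if path.is_file():
            relative = path.relative_to(source)
            if not (mirror / relative).is_file():
                missing.append(relative)
    return missing
```

A script built on this idea would copy each reported file to the mirror; detecting changed content requires a separate comparison step.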
Media data can be recovered from either the local backups or the CMU Cloud Plus backups.
In addition to transcript and media data, we include a variety of documentation files inside the transcript databases. These include Excel .xlsx files, Word .docx files, Adobe PDF files, and various image files.
Transcript and documentation data can be recovered from the git repositories.
Risk management aims to minimize the possibility of either complete data loss, through disk failure or hacking, or partial data loss, through system error. The former is addressed by keeping multiple image copies, and the latter by running ChronoSync comparisons between the image copies and the current archive.
One image copy is kept offsite, one in another University building, and one in another part of Baker Hall. All are under lock and key.
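The comparison between an image copy and the current archive can be pictured with a checksum manifest. The sketch below is not ChronoSync's method, which is a separate commercial tool; it is a generic approach under the assumption that both copies are mounted as ordinary directories, and all function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large media files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to `root` to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_copies(archive: Path, image: Path) -> dict[str, list[str]]:
    """Report files that are missing, extra, or different in the image copy."""
    current, copy = manifest(archive), manifest(image)
    shared = current.keys() & copy.keys()
    return {
        "missing_from_image": sorted(current.keys() - copy.keys()),
        "extra_in_image": sorted(copy.keys() - current.keys()),
        "content_differs": sorted(k for k in shared if current[k] != copy[k]),
    }
```

An empty list under every key indicates that the two copies agree file for file; any entry under content_differs points to possible partial data loss in one of the copies.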
Because the storage media are hard drives, deterioration means disk failure. If one drive fails, we can restore the data from one of the three remaining complete copies or from CMU Campus Cloud. Because these local drives are only accessed during the copying process, they undergo little wear and tear and are unlikely to fail. The chances of all four data storage methods failing at once are extremely low, barring catastrophes impacting the entire city of Pittsburgh. In that case, copies of much of the data would still be preserved in Nijmegen and throughout European CLARIN centers. In the event of a full-scale nuclear war involving several continents, it is possible that all of the data would be lost.

Data Migration and Compatibility

Our basic file format relies on text-only UTF-8 files. We expect only minor changes in this format over time. However, the CHAT coding system occasionally undergoes changes. To guarantee preservation of the data on this level, we use the Chatter program to make sure that the XML version of the CHAT files can be round-tripped from CHAT to XML and back without changes. For audio, we maintain both MP3 and WAV formats, on the assumption that the lossless WAV format can be converted to any new popular format. However, some corpora come with only MP3 files. For video, we have stored raw video for some corpora, but for others we only have resources to store compressed versions. For those, we focus on making sure that everything is in H.264 format.
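The round-trip requirement can be stated as a simple check: converting a transcript to XML and back must reproduce the original text exactly. The sketch below shows only the shape of such a check. It does not call Chatter, whose interface is not described here; the two toy converters are stand-ins invented for the example, and the sample lines are illustrative.

```python
import difflib
import xml.etree.ElementTree as ET
from typing import Callable


def roundtrip_diff(
    original: str,
    to_xml: Callable[[str], str],
    from_xml: Callable[[str], str],
) -> list[str]:
    """Return unified-diff lines between the original and its round trip.

    An empty list means the conversion pair preserved the text exactly.
    """
    restored = from_xml(to_xml(original))
    return list(
        difflib.unified_diff(
            original.splitlines(keepends=True),
            restored.splitlines(keepends=True),
            fromfile="original",
            tofile="roundtrip",
        )
    )


# Stand-in converters (assumptions, not the real CHAT-to-XML mapping):
# each line of text becomes one <line> element.
def toy_to_xml(text: str) -> str:
    root = ET.Element("transcript")
    for line in text.split("\n"):
        ET.SubElement(root, "line").text = line
    return ET.tostring(root, encoding="unicode")


def toy_from_xml(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    return "\n".join(line.text or "" for line in root.findall("line"))


sample = "*CHI:\tmore cookie .\n*MOT:\tyou want milk & cookies ?\n"
changes = roundtrip_diff(sample, toy_to_xml, toy_from_xml)
print("lossless" if not changes else "".join(changes))
```

Running the same check over every file after a change to the coding system would surface any file whose content is altered by the conversion.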

The transcript files will be usable in their current format as long as computers can read text files and Unicode. We have developed programs that convert when necessary to six other current file formats, but we rely on CHAT format as the current standard in the field.