TalkBank Data Preservation

Data Preservation

  1. TalkBank has a preservation policy based on mirroring at other data sites and long-term preservation by the University.
  2. The data is backed up through github.com and a series of three complete image backups on 3 TB Thunderbolt disks.  The image backups are made with ChronoSync and updated weekly in rotation.
  3. Data recovery is from the image backups.
  4. Risk management aims to minimize the possibility of either complete data loss through disk failure or hacking, or partial data loss through system error.  The former is addressed by keeping multiple image copies, and the latter by running ChronoSync comparisons between the image copies and the current archive.
  5. Consistency across archival copies is achieved through use of ChronoSync.
  6. One image copy is kept offsite, one in another University building, and one in another part of Baker Hall. All are under lock and key.
  7. Because the storage media are hard drives, deterioration means disk failure.  If one drive fails, we can restore the data from one of the three remaining complete copies.  Because the backup drives run only during the copying process, they are very unlikely to fail.  The chance of all four drives failing at once is extremely low, barring a catastrophe affecting the entire city of Pittsburgh.  In that case, copies would still be preserved in Nijmegen and at CLARIN centers throughout Europe.  Of course, there could also be global catastrophes.
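
The comparison between an image copy and the current archive described in item 4 can be illustrated with a generic checksum check.  This is only a minimal sketch of the idea, not a description of how ChronoSync works internally; the function names and the directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(archive: Path, copy: Path) -> dict[str, list[str]]:
    """Report files that are missing, extra, or different in the copy."""
    a, c = tree_digests(archive), tree_digests(copy)
    return {
        "missing_in_copy": sorted(a.keys() - c.keys()),
        "extra_in_copy": sorted(c.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & c.keys() if a[k] != c[k]),
    }
```

An empty report (all three lists empty) would indicate that the copy matches the archive file for file; any entry under "changed" would point to possible partial data loss in one of the two trees.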

Data Migration and Compatibility

  1. Our basic file format relies on text-only Unicode files.  We expect only minor changes in this format over time.  More importantly, the CHAT coding system continually undergoes changes.  To guarantee preservation of the data on this level, we use the Chatter program to make sure that CHAT files can be round-tripped from CHAT to XML and back without changes.  Obsolescence of media files is a more difficult problem.  For audio, we maintain both MP3 and WAV formats, in the hope that the latter can be converted without loss to any new formats that become popular.  For video, we have stored raw video for some corpora, but for others we only have the resources to store compressed versions.  For those, we focus on making sure that everything is in H.264 format.
  2. The transcript files will be usable in their current format as long as computers can read text files and Unicode.  We have developed programs that convert CHAT files, when necessary, to six other current file formats, but we rely on CHAT format as the current standard in the field.
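
The round-trip requirement in item 1 can be stated as a simple property: converting a file to XML and back must return exactly the original text.  The sketch below shows that property only; it does not use Chatter or the real CHAT XML schema.  The converters toy_to_xml and toy_from_xml are hypothetical stand-ins, and the sample transcript lines are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def round_trips(
    text: str,
    to_xml: Callable[[str], str],
    from_xml: Callable[[str], str],
) -> bool:
    """True if converting text to XML and back reproduces it exactly."""
    return from_xml(to_xml(text)) == text


# Hypothetical stand-in converters: one <line> element per text line.
# These are NOT Chatter and do not follow the real CHAT XML schema.
def toy_to_xml(text: str) -> str:
    root = ET.Element("transcript")
    for line in text.split("\n"):
        ET.SubElement(root, "line").text = line
    return ET.tostring(root, encoding="unicode")


def toy_from_xml(xml: str) -> str:
    root = ET.fromstring(xml)
    return "\n".join(el.text or "" for el in root.iter("line"))


# Illustrative transcript lines (speaker tier, tab, utterance).
sample = "*CHI:\tmore cookie .\n*MOT:\tyou want more ?"
lossless = round_trips(sample, toy_to_xml, toy_from_xml)
```

In practice the two callables would wrap whatever conversion tool is actually in use, and the check would be run over every transcript in the archive after each change to the coding system.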