TalkBank has a preservation policy based on backups in CMU Cloud
and long-term preservation by the University.
Transcript data are backed up through github.com repositories. The git repositories
are stored both on several local machines, from which commits are managed,
and in a master repository on a server in the CMU Cloud facility running
Ubuntu Linux. The TalkBankDB facility also stores time-stamped versions of the database.
Media data are stored on a local machine with four 5TB local disks, along
with a backup disk for each local disk. Whenever new
data are ingested, the sync-all.sh script copies the new materials to
a master copy in the CMU Cloud Plus facility, keeping the two in sync.
Media data can be recovered from either the local backups or the CMU Cloud Plus backups.
In addition to transcript and media data, we include a variety of documentation files inside
the transcript databases. These include Excel .xlsx files, Word
.docx files, Adobe PDF files, and various image files.
Transcript and documentation data can be recovered from the git repositories.
Risk management aims to minimize the possibility
of either complete data loss through disk failure or hacking, or
partial data loss through system error. The former is addressed
by keeping multiple image copies and the latter by running
ChronoSync comparisons between the image copies and the current
archive.
One image copy is kept offsite, one in another University
building, and one in another part of Baker Hall. All are under lock
and key.
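
As an illustration of the kind of comparison described above, the following Python sketch checks an image copy against the current archive by hashing every file. It is not how ChronoSync works internally; the function names (file_digest, tree_digests, compare_trees) and the choice of SHA-256 are assumptions made for this example only.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(archive_root, copy_root):
    """Report files missing from the copy, extra in the copy, or differing."""
    archive = tree_digests(archive_root)
    copy = tree_digests(copy_root)
    shared = archive.keys() & copy.keys()
    return {
        "missing_in_copy": sorted(archive.keys() - copy.keys()),
        "extra_in_copy": sorted(copy.keys() - archive.keys()),
        "changed": sorted(k for k in shared if archive[k] != copy[k]),
    }
```

A report with three empty lists would suggest that the image copy matches the archive; any entry under "missing_in_copy" or "changed" points to a possible partial loss that should be investigated before the next copy is made.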
Because the storage media are hard drives, deterioration means
disk failure. If one drive fails, we can restore the data from one
of the three remaining complete copies or from CMU Campus Cloud.
Because these local drives are
only accessed during the copying process, they undergo little wear and
tear and are unlikely to fail. The
chances of all four data storage methods failing at once are extremely low, barring
catastrophes impacting the entire city of Pittsburgh. In that case,
copies of much of the data would still be preserved in Nijmegen and
throughout European CLARIN centers. In the event of a full-scale nuclear war
involving several continents, it is possible that all of the data would be lost.

Data Migration and Compatibility

Our basic file format relies on plain-text UTF-8 files. We expect only
minor changes in this format over time. However, the CHAT coding system
occasionally undergoes changes. To guarantee preservation of the data
on this level, we use the Chatter program to make sure that the XML
version of the CHAT files can be roundtripped from CHAT to XML and back
without changes. For audio, we maintain both MP3 and WAV formats,
assuming that the lossless WAV format could be converted to any new
popular formats. However, some corpora come with only MP3 files. For
video, we have stored raw video for some corpora, but for others we only
have resources to store compressed versions. For those we focus on
making sure that everything is in H.264 format.
The transcript files will be usable in their current format as long as computers can
read text files and Unicode. We have developed programs that convert CHAT files, when necessary, to six other
current file formats, but we rely on CHAT format as the current standard in the field.
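
The roundtrip check mentioned above (CHAT to XML and back without changes) can be sketched in a few lines of Python. This is only an illustration of the idea, not the Chatter program or its interface: the converters are passed in as ordinary functions, and to_xml, to_chat, roundtrip_unchanged, and failed_roundtrips are hypothetical names chosen for this example.

```python
def roundtrip_unchanged(chat_text, to_xml, to_chat):
    """Return True if CHAT -> XML -> CHAT reproduces the input exactly."""
    return to_chat(to_xml(chat_text)) == chat_text


def failed_roundtrips(files, to_xml, to_chat):
    """Return the sorted names of files whose roundtrip output differs.

    `files` maps a file name to its CHAT text; `to_xml` and `to_chat`
    are caller-supplied converter functions (assumed, not real APIs).
    """
    return [
        name
        for name in sorted(files)
        if not roundtrip_unchanged(files[name], to_xml, to_chat)
    ]
```

An empty result would indicate that every file survived the conversion in both directions; any name in the list marks a transcript whose content the conversion altered, so it needs attention before a change to the coding system is adopted.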