TalkBank has a preservation policy based on backups in CMU Cloud
and long-term preservation by the University.
Transcript data are backed up through github.com repositories. The git repositories
are stored both on several local machines, from which commits are managed,
and in a master repository on a server in the CMU Cloud facility running
Ubuntu Linux. The TalkBankDB facility also stores time-stamped versions of the database.
Media data are stored on a local machine with four 5TB local disks, along
with a backup disk for each local disk. Whenever new
data are ingested, the sync-all.sh script copies the new materials to
a master copy in the CMU Cloud Plus facility.
Media data can be recovered from either the local backups or the CMU Cloud Plus backups.
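As a rough illustration of the kind of one-way synchronization described above, the following
Python sketch copies files that are missing from, or newer than, their counterparts in a
destination directory. It is not the sync-all.sh script, whose contents are not reproduced here;
the function name, the directory arguments, and the size-and-timestamp test are all assumptions
made for the example.

```python
import shutil
from pathlib import Path


def sync_new_files(source: Path, dest: Path) -> list[Path]:
    """Copy files from source to dest when they are missing or look changed.

    Hypothetical helper: a file is treated as unchanged when its size matches
    and the source is not newer than the destination (whole-second precision).
    Returns the relative paths that were copied.
    """
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        dst_file = dest / rel
        if dst_file.exists():
            s, d = src_file.stat(), dst_file.stat()
            if s.st_size == d.st_size and int(s.st_mtime) <= int(d.st_mtime):
                continue  # destination already holds this version
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 also preserves the timestamp
        copied.append(rel)
    return copied
```

Running the function a second time without new material should copy nothing, which is a
simple way to confirm that the destination is in sync.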
In addition to transcript and media data, we include a variety of documentation files inside
the transcript databases. These include Excel .xlsx files, Word
.docx files, Adobe PDF files, and various image files.
Transcript and documentation data can be recovered from the git repositories.
Risk management aims to minimize the possibility
of either complete data loss through disk failure or hacking, or
partial data loss through system error. The former is addressed
by keeping multiple image copies, and the latter by running
ChronoSync comparisons between the image copies and the current
archive.
One image copy is kept offsite, one in another University
building, and one in another part of Baker Hall. All are under lock
and key.
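The comparison between an image copy and the current archive can be pictured with the
following Python sketch. How ChronoSync performs its comparison is not described here, so the
use of SHA-256 checksums, the function names, and the report layout are assumptions chosen
only to show the idea of detecting partial data loss.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(archive: Path, image: Path) -> dict[str, list[str]]:
    """Report files that are missing on either side or differ in content."""

    def index(root: Path) -> dict[str, Path]:
        # Map each file's path relative to the root onto its full path.
        return {p.relative_to(root).as_posix(): p
                for p in root.rglob("*") if p.is_file()}

    a, b = index(archive), index(image)
    return {
        "missing_from_image": sorted(a.keys() - b.keys()),
        "missing_from_archive": sorted(b.keys() - a.keys()),
        "content_differs": sorted(
            k for k in a.keys() & b.keys()
            if file_digest(a[k]) != file_digest(b[k])
        ),
    }
```

An empty list under all three keys would indicate that the image copy and the archive agree
file for file.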
Because the storage media are hard drives, deterioration means
disk failure. If one drive fails, we can restore the data from one
of the three remaining complete copies or from CMU Campus Cloud.
Because these local drives are
accessed only during the copying process, they undergo little wear and
tear, which makes failure unlikely. The
chances of all four data storage methods failing at once are extremely low, barring
catastrophes affecting the entire city of Pittsburgh. In that case,
copies of much of the data would still be preserved in Nijmegen and
at CLARIN centers throughout Europe. In the event of a full-scale nuclear war
involving several continents, it is possible that all of the data would be lost.
Data Migration and Compatibility
Our basic file format relies on text-only Unicode files. We expect only minor changes
in this format over time. However, the CHAT coding system occasionally undergoes changes.
To guarantee preservation of the data at this level, we use the Chatter program to make sure
that the CHAT files can be round-tripped from CHAT to XML and back without changes.
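The round-trip requirement can be stated as a simple check: converting a file to XML and back
must reproduce the original text exactly. The Python sketch below expresses that check in
general form. It does not call Chatter, whose interface is not described here; the two
converter arguments are hypothetical callables that stand in for whatever conversion tool is used.

```python
from typing import Callable, Iterable


def roundtrip_failures(
    documents: Iterable[tuple[str, str]],
    to_xml: Callable[[str], str],
    to_chat: Callable[[str], str],
) -> list[str]:
    """Return the names of documents that do not survive CHAT -> XML -> CHAT.

    documents yields (name, chat_text) pairs; to_xml and to_chat are
    placeholder converters supplied by the caller.
    """
    failures = []
    for name, chat_text in documents:
        # A lossless converter pair must give back the original text exactly.
        if to_chat(to_xml(chat_text)) != chat_text:
            failures.append(name)
    return failures
```

An empty result means every file survived the round trip; any name in the list identifies a
file whose content would change under conversion and therefore needs attention.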
Obsolescence of media files is a more difficult problem. For audio, we maintain both MP3 and WAV formats,
in the hope that the latter can be converted without loss to any new popular format. However, some corpora come with
only MP3 files.
For video, we have stored raw video for some corpora, but for others we only have resources to
store compressed versions. For those we focus on making sure that everything is in H.264 format.
The transcript files will be usable in their current format as long as computers can
read text files and Unicode. We have developed programs that convert when necessary to six other
current file formats, but we rely on CHAT format as the current standard in the field.
Responses to Core Trust Seal (CTS) queries
Does the repository have a documented approach to preservation? Response: Yes, it is given on this web page.
Is the level of responsibility for the preservation of each item understood? How is this defined?
Response: We assume responsibility for preservation of all items in TalkBank,
including transcripts, media, and documentation files.
Are plans related to future migrations or similar measures to address the threat of obsolescence in place?
Response: We do not expect .txt or .wav files to become obsolete. If the MP4 video format becomes obsolete,
we will convert it to a new format. We did that in previous decades for the .mov, .avi, and .mpeg video formats.
Does the contract between depositor and repository provide for all
actions necessary to meet the responsibilities? Response: We assume responsibility for preservation and migration.
Is the transfer of custody and responsibility handover clear to the depositor and repository? Response: Yes, it is clear.
If a depositor wishes to remove data, we can de-accession it.
Does the repository have the rights to copy, transform, and store
the items, as well as provide access to them? Response: Yes, this is a
fundamental requirement for keeping the TalkBank databases in the best
possible condition. Earlier versions of transcripts or the whole database
can be retrieved from TalkBankDB.
Are actions relevant to preservation specified in documentation,
including custody transfer, submission information standards, and
archival information standards? Response: Yes, the standards are documented
in the CHAT manual. Contribution procedures are documented at
https://talkbank.org/share/contributing.html
Are there measures to ensure these actions are taken? Response: Yes, our
curation software requires that all standards be met.