Implementation of the Data Seal of Approval

The Data Seal of Approval board hereby confirms that the Trusted Digital repository TalkBank complies with the guidelines version 2014-2015 set by the Data Seal of Approval Board.

The afore-mentioned repository has therefore acquired the Data Seal of Approval of 2013 on April 8, 2014.

The Trusted Digital repository is allowed to place an image of the Data Seal of Approval logo corresponding to the guidelines version date on their website. This image must link to this file which is hosted on the Data Seal of Approval website.

Yours sincerely,

 

The Data Seal of Approval Board

Assessment Information

Guidelines Version: 2014-2015 | July 19, 2013
Guidelines Information Booklet: DSA-booklet_2014-2015.pdf
All Guidelines Documentation: Documentation
 
Repository: TalkBank
Seal Acquiry Date: Apr. 08, 2014
 
For the latest version of the awarded DSA
for this repository please visit our website:
http://assessment.datasealofapproval.org/seals/
 
Previously Acquired Seals: None
 
This repository is owned by: TalkBank
254M Baker Hall
CMU - Psychology 5000 Forbes Avenue
15213 Pittsburgh
PA
USA

T +1 412 268-3793
E macw@cmu.edu
W http://talkbank.org/

Assessment

0. Repository Context

Applicant Entry

Self-assessment statement:

A General Description of TalkBank

TalkBank is an archive of transcripts of spoken language interactions, many of which are linked to either audio or video. The major designated communities involved include child language researchers, speech and language pathologists, linguists, conversation analysts, and second language acquisition researchers. Long-term data preservation is provided by Carnegie Mellon University and CLARIN (www.clarin.eu). Several of the CLARIN centers have received the Data Seal of Approval, and TalkBank data is currently mirrored by the CLARIN Center at the MPI in Nijmegen, which has the Data Seal of Approval. The only outsourcing we do is for cloud backup through backblaze.com to guarantee preservation. This project has been funded continuously by the National Institutes of Health since 1984 and has also received support from the National Science Foundation and the MacArthur Foundation. A search of scholar.google.com shows that there are now 6450 published articles based on use of the TalkBank databases. Current NIH support involves four major ongoing five-year grants for child language (CHILDES), aphasia (AphasiaBank), fluency (FluencyBank), and phonology (PhonBank). The central website is http://talkbank.org. Within the overall TalkBank corpus, there are several subcorpora, the largest and oldest of which is CHILDES (Child Language Data Exchange System), located at http://childes.talkbank.org.

In the responses to the Guidelines, "we" refers to the programming and data analysis staff employed by the TalkBank Project at Carnegie Mellon. The term "producers" refers to the scholars who contribute data. The term "users" refers to the scholars who use the data. All URLs were visited on Tuesday June 27th, 2016.

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

1. The data producer deposits the data in a data repository with sufficient information for others to assess the quality of the data, and compliance with disciplinary and ethical norms.

Minimum Required Statement of Compliance:
Level 3: In progress: We are in the implementation phase.

Applicant Entry

Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Self-assessment statement:

Data producers are the scholars who have collected the spoken interactions and produced the transcripts and media that are then included in TalkBank. We support the data producers and guarantee data quality through these methods:

  • We teach best practices through online tutorials available at http://childes.talkbank.org/clan/tutorial.zip that explain how to use the CLAN program for transcription and analysis.
  • We conduct seminars and tutorials at international meetings of the relevant professional societies.
  • The CLAN program is downloadable from http://talkbank.org/clan/
  • The CLAN manual is downloadable from http://childes.talkbank.org/manuals/CLAN.pdf.
  • The data producer verifies the correctness of the transcription process using the CHECK command inside CLAN and the XML checker available from http://talkbank.org/software/chatter.html.
  • Methods for data submission are found here.
  • Standards for corpus documentation are presented in chapter 4.5 of the CHAT manual.
  • Data producers/contributors provide data release forms as found at http://talkbank.org/share/release.pdf
  • Adherence to ethical norms is treated through the IRB (Institutional Review Board) process summarized at http://talkbank.org/share/irb/
  • Metadata regarding each file is constructed in accord with the CMDI standard at https://www.clarin.eu/content/component-metadata and is included in each archive.
  • We track citations of corpora and the database through yearly reviews using scholar.google.com, as well as letters from contributors.
  • Corpora that meet all of these standards are judged to be valuable and are included in TalkBank.

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

2. The data producer provides the data in formats recommended by the data repository.

Minimum Required Statement of Compliance:
Level 3: In progress: We are in the implementation phase.

Applicant Entry

Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Self-assessment statement:
  • The repository has created a manual for the single required format. The format is called CHAT and the manual is available at http://childes.talkbank.org/manuals/CHAT.pdf. There are also published translations into Japanese, Italian, Portuguese, Chinese, and Spanish, as well as several introductions and tutorials.
  • Quality control is achieved by running the CHECK command inside the CLAN program and then the XML checker available at http://talkbank.org/software/chatter.html
  • The tools used to guarantee correct use of the CHAT standard are the CHECK command and the XML checker described in (2) above.
  • We do not accept any data that are not in CHAT.
  • We do not require detailed statements about data formats, because all data must be in CHAT format.

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

3. The data producer provides the data together with the metadata requested by the data repository.

Minimum Required Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Applicant Entry

Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.
Self-assessment statement:

  • Using the documentation provided by producers, we create metadata files for each resource. For the purposes of harvesting by OLAC (Open Language Archives Community) at http://www.language-archives.org, we produce a single metadata file for each corpus, which is included in the relevant downloadable .zip file.
  • For harvesting in the IMDI/CMDI framework at https://www.clarin.eu/content/component-metadata, we use a program built into CLAN to automatically generate metadata records for each transcript and media file. These can be seen for TalkBank, CHILDES, and HomeBank.
  • We create permanent identifiers (PIDs) for each transcript file and media file through the Handle Server system.
  • We also generate and mint a DOI (Digital Object Identifier) code for each corpus using the EZID System.
  • We do not require data producers to generate OLAC and CMDI metadata files. We do this using a program that harvests the information from metadata text files based on the data they provide.
  • The programs creating CMDI, OLAC, and DOI metadata enforce a quality check for consistency and proper use of identifiers.
  • Our metadata formats comply with the two major standards for linguistic metadata documentation, namely OLAC and CMDI. Both include Dublin Core as a subset.
  • The primary use of metadata is for resource discovery through OLAC and IMDI. Secondary analysis depends on use of the CLAN programs themselves.
  • It is possible that data producers will have failed to collect or transcribe some data that will turn out in the future to be important. However, because we have the raw media for most of our new corpora, transcriptions can be refined later on. None of these issues should lead to problems in terms of long-term preservation.

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

4. The data repository has an explicit mission in the area of digital archiving and promulgates it.

Minimum Required Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Applicant Entry

Statement of Compliance:
Level 3: In progress: We are in the implementation phase.
Self-assessment statement:

  • TalkBank is a government-funded project at Carnegie Mellon University. The mission of TalkBank is to provide a preservable archive of publicly shared data on spoken language.
  • By providing data on many different aspects of language learning, processing, conversation, and disorders, TalkBank seeks to help researchers in the development of a more integrated and comprehensive understanding of the nature of human language and thought.
  • TalkBank depends on a deep level of commitment from its component research communities. For child language, aphasia, bilingualism, and CA (Conversation Analysis), this involves maintenance of mailing lists, help centers, presentations at conferences, publications of results in special issues, and summer workshops.
  • We do not outsource our basic functions. However, we back up data in the cloud through Backblaze.com and we mirror data through the CLARIN Max Planck Institute Center in Nijmegen, which also has the Data Seal of Approval.

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

5. The data repository uses due diligence to ensure compliance with legal regulations and contracts including, when applicable, regulations governing the protection of human subjects.

Minimum Required Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Applicant Entry

Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Self-assessment statement:
  • The repository is supported by Carnegie Mellon University, which is the relevant legal entity in contractual matters.
  • We use a standard data contribution form given at http://talkbank.org/share/release.pdf
  • Data consumers are asked to follow our usage guidelines as stated at http://talkbank.org/share/
  • Our conditions and terms of use are given at http://talkbank.org/share/
  • Additional conditions applying to the HomeBank unvetted audio recordings are explained at the HomeBank membership page.
  • If conditions were not met, we would make the specifics of the non-compliance known to the research community. In the 28 years of functioning of TalkBank and CHILDES, there has never been a case of non-compliance.
  • We ensure compliance with national and international laws through the IRB (Institutional Review Board) procedure at Carnegie Mellon University. Copyright is based on a Creative Commons License declared at the bottom of the homepage.
  • Data with disclosure risk are password protected. All data in AphasiaBank are password protected. About 3% of the data in other areas are in this category. This is explained in detail at http://talkbank.org/share/irb/options.html
  • Data with levels of disclosure risk beyond that of password protection are archived but not distributed.
  • Files are anonymized through replacement of last names with the word LastName and replacement of addresses with the word Address.
  • Issues relating to disclosure risk are discussed in detail between the Director and the Data Producer.
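
The anonymization step described above can be illustrated with a minimal Python sketch. It is not the repository's actual tooling: the function name `anonymize` and the idea of passing explicit lists of known last names and addresses are assumptions made for illustration only.

```python
import re


def anonymize(text, last_names, addresses):
    """Replace known last names with 'LastName' and known addresses with 'Address'.

    Hypothetical helper: the lists of names and addresses are assumed to be
    supplied by whoever prepares the transcript for release.
    """
    # Handle longer strings first so a short entry never clobbers part of a longer one.
    for address in sorted(addresses, key=len, reverse=True):
        text = text.replace(address, "Address")
    for name in sorted(last_names, key=len, reverse=True):
        # Word boundaries keep "Smith" from matching inside "Smithson".
        text = re.sub(r"\b" + re.escape(name) + r"\b", "LastName", text)
    return text
```

For example, `anonymize("*MOT: Mrs. Smith lives at 12 Oak Street .", ["Smith"], ["12 Oak Street"])` yields a line in which the surname and the address are replaced by the two placeholder words.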

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

6. The data repository applies documented processes and procedures for managing data storage.

Minimum Required Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Applicant Entry

Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Self-assessment statement:
  • TalkBank has a preservation policy based on mirroring in other data sites and long-term preservation by the University and the CLARIN system. This policy is stated here.
  • The data is backed up through GIT repositories at bitbucket.com, continual incremental backups through backblaze.com, and three complete image backups on 3 TB Thunderbolt disks, which are updated weekly using ChronoSync and backup rotation.
  • Data recovery is from the image backups, BitBucket, GIT repositories, and BackBlaze.
  • Risk management is based on trying to minimize the possibility of data loss through disk failure, hacking, or system error. Disk failure is addressed by keeping multiple image copies and complete copies in BackBlaze. Possible effects of hacking are addressed by running ChronoSync comparisons between the image copies and the current archive, as well as by continual data version backups in GIT.
  • Consistency across archival copies is achieved through use of ChronoSync.
  • One image copy is kept offsite, one in another University building, and one in another part of Baker Hall. All are under lock and key.
  • Because the storage media are hard drives, deterioration means disk failure. If one drive fails, we can restore the data from one of the three remaining complete copies or from BackBlaze. Because backup drives are only running during the copying process, they never fail. The chances of all three failing at once are extremely low. If that happened, data would still be preserved in BackBlaze or in the CLARIN centers.
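
The repository performs the comparison between image copies and the current archive with ChronoSync. Purely as an illustration of what such a comparison checks, the following Python sketch compares two directory trees by SHA-256 checksum. The function names and the choice of SHA-256 are assumptions, not a description of how ChronoSync works internally.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large media files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_copies(archive_root, backup_root):
    """Return (missing_from_backup, extra_in_backup, differing) as sorted lists."""
    archive = file_digests(archive_root)
    backup = file_digests(backup_root)
    missing = sorted(set(archive) - set(backup))
    extra = sorted(set(backup) - set(archive))
    differing = sorted(
        name for name in set(archive) & set(backup) if archive[name] != backup[name]
    )
    return missing, extra, differing
```

Three empty lists mean the two copies hold the same files with the same contents. Any entry in `differing` points to a file that changed in one copy but not the other, which is the kind of discrepancy that would prompt a closer look.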

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

7. The data repository has a plan for long-term preservation of its digital assets.

Minimum Required Statement of Compliance:
Level 3: In progress: We are in the implementation phase.

Applicant Entry

Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Self-assessment statement:
  • Aspects of our system for resource preservation were described in responses to item 6 above.
  • Carnegie Mellon Libraries has established a system for long-term data preservation (LTDP) of resources created at the University, including TalkBank. We are working on the process of including all of TalkBank in this system.
  • In addition to preservation at Carnegie Mellon, all TalkBank materials are included in the TLA (The Language Archive) archive at mpi.nl as a part of the CLARIN system. The plans for CLARIN long term preservation are described here. Brian MacWhinney, the Director of TalkBank, is the Chairman of the CLARIN Scientific Advisory Board. CLARIN is currently adding TalkBank as one of its B-Level data repository centers, based in part on having received the Data Seal of Approval.
  • When the current director, Brian MacWhinney, retires in 2024, the current Director of the AphasiaBank Project, Davida Fromm, will assume the role of Director of that project. Yvan Rose of Memorial University Newfoundland will assume directorship of the PhonBank component. Johannes Wagner of Southern Denmark University will assume directorship of CABank. Nan Bernstein Ratner will assume directorship of FluencyBank. Fromm, Rose, Wagner, and Ratner will continue coordination of efforts through Fromm at CMU.
  • Our basic file format relies on text-only Unicode files. We expect only minor changes in this format over time. More importantly, the CHAT coding system continually undergoes changes. To guarantee preservation of the data on this level, we use the Chatter program to make sure that the XML version of the CHAT files can be roundtripped from CHAT to XML and back without changes. Obsolescence of media files is a more difficult problem. For audio, we maintain both MP3 and WAV formats, in the hope that the latter could be converted without loss to any new popular formats. For video, we have stored raw video for some corpora, but for others we only have resources to store compressed versions. For those we focus on making sure that everything is in H.264 format.
  • The transcript files will be usable in their current format as long as computers can read text files and Unicode. We have developed programs that convert when necessary to six other current file formats, but we rely on CHAT format as the current standard in the field.
  • These principles are posted here.
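
The roundtrip requirement mentioned above (CHAT to XML and back without changes) can be stated compactly in code. The sketch below is illustrative only: the real conversion is done by the Chatter program, whose interface is not described here, so the converters are passed in as parameters and the two `toy_` functions are invented stand-ins.

```python
from xml.sax.saxutils import escape, unescape


def roundtrips_cleanly(chat_text, to_xml, to_chat):
    """Return True if CHAT -> XML -> CHAT reproduces the input exactly."""
    return to_chat(to_xml(chat_text)) == chat_text


def failed_roundtrips(transcripts, to_xml, to_chat):
    """Given a mapping of file name -> CHAT text, list the names that do not roundtrip."""
    return sorted(
        name
        for name, text in transcripts.items()
        if not roundtrips_cleanly(text, to_xml, to_chat)
    )


# Toy stand-ins for the real converters, used only to exercise the check.
def toy_to_xml(chat_text):
    return "<chat>" + escape(chat_text) + "</chat>"


def toy_to_chat(xml_text):
    return unescape(xml_text[len("<chat>"):-len("</chat>")])
```

Calling `failed_roundtrips(transcripts, toy_to_xml, toy_to_chat)` returns an empty list when every file survives the conversion unchanged. A converter that silently drops or alters characters would cause the affected file names to be reported.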

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

8. Archiving takes place according to explicit work flows across the data life cycle.

Minimum Required Statement of Compliance:
Level 3: In progress: We are in the implementation phase.

Applicant Entry

Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Self-assessment statement:
  • The workflow for the inclusion of data in TalkBank has the following steps: contributors read the Ground Rules, they send mail to macw@cmu.edu, we reply, data is transferred through secure accounts at Google Drive, we check the data using CHECK and Chatter, we create metadata files, we add documentation to the database documentation files, we create streaming media, we commit all files to GIT through BitBucket, we generate new Handle Server PIDs and new DOIs, and we announce the availability of the new corpus on one of our googlegroups mailing lists.
  • We change archival data for three reasons. The first is to update the syntax of coding symbols. This does not involve data loss. The second is to normalize spellings for morphosyntactic analysis. In this case, we maintain the original form alongside the normalized form in the text, using a special code. The third reason is to add annotations, such as morphosyntactic coding, grammatical dependency analysis, IPA coding, or gesture coding.
  • Our three programmers all have M.S. degrees in Computer Science. Our transcribers are selected for their ability to transcribe speech accurately.
  • All data types go through the same workflow.
  • All data are included that conform to the CHAT transcription standard.
  • We never receive data that do not conform to the mission, because the work of putting data into CHAT format already ensures that researchers want to have their relevant data included in the database.
  • Privacy of subjects is guarded through anonymization. For HomeBank data, there are additional access restrictions, as described in the HomeBank data use agreement. These are necessary, because we are not initially sure what might be contained in the daylong untranscribed naturalistic audio recordings included in HomeBank.
  • Data producers are asked to check over the final versions of their files to make sure they conform to their expectations.
  • These workflow procedures are described online here.

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

9. The data repository assumes responsibility from the data producers for access and availability of the digital objects.

Minimum Required Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Applicant Entry

Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Self-assessment statement:
  • We obtain contribution forms from data producers, as given at http://talkbank.org/share/release.pdf.
  • We enforce licenses through these contribution forms.
  • Our crisis management plan focuses on possible data loss, as described for Guideline 6.

Reviewer Entry

Accept or send back to applicant for modification:
Accept

Comments:

10. The data repository enables the users to discover and use the data and refer to them in a persistent way.

Minimum Required Statement of Compliance:
Level 3: In progress: We are in the implementation phase.

Applicant Entry

Statement of Compliance:
Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

Self-assessment statement:
  • Corpora for particular projects can be located through a hierarchical series of HTML access tables for TalkBank, CHILDES, and HomeBank.
  • Each of these access tables provides for each corpus: an HTML page documenting the nature of the corpus, a downloadable .zip file with the transcripts, and a web page for downloading of the media.
  • Our data are provided in the CHAT format, which provides the codes required by the relevant research communities (Child Language, CA, SLA, Aphasia, Fluency, Phonology etc.). In addition, some researchers wish to study transcripts in ELAN format, as described at http://tla.mpi.nl/tools/tla-tools/elan/, and we can convert automatically from CHAT to ELAN using CLAN.
  • The repository can be searched directly file by file or through the search commands built into the TalkBank Browser window at the bottom left.
  • Using a PEPPER input module, we have included TalkBank corpora into the ANNIS system for corpus analytic searching.
  • We generate metadata for OAI-PMH harvesting by the CLARIN system, which then feeds data into the CLARIN Virtual Language Observatory.
  • We also generate XML metadata for harvesting through the OLAC system.
  • We generate PIDs through the Handle System for subsequent use by the CLARIN VLO.

Reviewer Entry

    Accept or send back to applicant for modification:
    Accept

    Comments:

    11. The data repository ensures the integrity of the digital objects and the metadata.

    Minimum Required Statement of Compliance:
    Level 3: In progress: We are in the implementation phase.

    Applicant Entry

    Statement of Compliance:
    Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

    Self-assessment statement:
    • Files are downloaded from the database in .zip format. Therefore, data loss through transfer will be revealed when the user unzips the file.
    • Integrity of the data is monitored through daily running of XML roundtrip validation using a Python script called SCONS, use of ChronoSync for file comparison on image copies, and placement of an @End code at the end of each transcript file.
    • Once data are in the database, we avoid creating multiple versions of data files. However, when changes are made, older versions can be retrieved from image backups, GIT, and BackBlaze.
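
Two of the integrity signals named above, the .zip container and the closing @End code, can be checked mechanically. The following Python sketch shows one way a user might do so with the standard library. The function names are hypothetical, and the sketch is not the repository's own validation script.

```python
import zipfile


def first_corrupt_member(zip_source):
    """Return the name of the first member whose CRC check fails, or None if all pass.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return archive.testzip()


def ends_with_end_marker(transcript_text):
    """Return True if the last non-blank line of a transcript is the @End code."""
    lines = [line.strip() for line in transcript_text.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "@End"
```

A download that was cut off in transit will usually fail to open as a zip archive at all, raising `zipfile.BadZipFile`. A transcript that was truncated before being archived would instead be caught by the missing @End line.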

    Reviewer Entry

    Accept or send back to applicant for modification:
    Accept

    Comments:

    12. The data repository ensures the authenticity of the digital objects and the metadata.

    Minimum Required Statement of Compliance:
    Level 3: In progress: We are in the implementation phase.

    Applicant Entry

    Statement of Compliance:
    Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

    Self-assessment statement:
    • We create metadata files ourselves based on documentation provided by contributors.
    • When there are changes in the basic data format, we advise data producers through an email to our members GoogleGroups list.
    • Files are grouped into corpora that all maintain the same provenance and no additional data is inserted. After data have been included in the database, we maintain an audit trail of changes in terms of versions in GIT.
    • We do not maintain links to other datasets. We maintain links to our own metadata files, as described earlier.
    • When making changes to files, we compare the original with the revised version using DIFF.
    • In our online HTML documentation files and separate GoogleDocs gsheets, we maintain complete contact information and personal contacts with all depositors.
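
The comparison of original and revised files mentioned above is done with DIFF. As a rough illustration of what such a comparison reports, the sketch below uses Python's standard `difflib` module instead. The function name and the sample transcript lines used in the note that follows are invented for this example.

```python
import difflib


def changed_lines(original, revised):
    """Return the removed ('- ') and added ('+ ') lines between two versions of a file."""
    diff = difflib.ndiff(original.splitlines(), revised.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; keep only real changes.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

If the original contains the line `*CHI:	wanna go .` and the revision changes it to `*CHI:	wanna [: want to] go .`, the function returns exactly those two lines, marked with `- ` and `+ ` respectively, so a reviewer can confirm that nothing else in the file changed.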

    Reviewer Entry

    Accept or send back to applicant for modification:
    Accept

    Comments:

    13. The technical infrastructure explicitly supports the tasks and functions described in internationally accepted archival standards like OAIS.

    Minimum Required Statement of Compliance:
    Level 3: In progress: We are in the implementation phase.

    Applicant Entry

    Statement of Compliance:
    Level 3: In progress: We are in the implementation phase.
    Self-assessment statement:
    • The repository is structured in accord with OAIS standards.
    • OAIS standards are implemented in terms of our policies for long term preservation, data migration, metadata creation, storage practices, documentation, legal responsibilities, and data access.
    • Our long-term plan for infrastructure development is to continually expand the scope of the archive and the analysis programs to deal with all aspects of human language. We rely on organization of the relevant research community in each case to make a case for federal funding for each of these separate initiatives.

    Reviewer Entry

    Accept or send back to applicant for modification:
    Accept

    Comments:

    We recommend the provision of supporting documentation for this item before the next DSA submission.

    14. The data consumer complies with access regulations set by the data repository.

    Minimum Required Statement of Compliance:
    Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

    Applicant Entry

    Statement of Compliance:
    Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

    Self-assessment statement:
    • We do not require End User licenses.
    • We have no special requirements.
    • We do not need contracts for data access.
    • We use a Creative Commons license, CC BY-NC-SA 3.0, as stated at the bottom of our homepage.
    • Access to unvetted HomeBank data represents an exception to the above. For those data, we have no copyright license, because the audio data cannot be copied. Also, for those data, we have a special data use agreement.
    • As stated under Guideline 5, if conditions are not met, we make cases of non-compliance known to the research community. In the 28 years of functioning of TalkBank and CHILDES, there has never been a case of non-compliance.

    Reviewer Entry

    Accept or send back to applicant for modification:
    Accept

    Comments:

    15. The data consumer conforms to and agrees with any codes of conduct that are generally accepted in the relevant sector for the exchange and proper use of knowledge and information.

    Minimum Required Statement of Compliance:
    Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

    Applicant Entry

    Statement of Compliance:
    Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

    Self-assessment statement:
    • We have a stated code-of-conduct policy at http://talkbank.org/share/.
    • We have received IRB (Institutional Review Board) clearance from Carnegie Mellon and this has been approved by NIH (National Institutes of Health). Contributors also receive IRB review at their institutions.
    • Users agree to the guidelines.
    • Institutional bodies are not involved in agreeing to the terms of use.
    • As stated under Guideline 5, if conditions are not met, we make cases of non-compliance known to the research community. In the 28 years of functioning of TalkBank and CHILDES, there has never been a case of non-compliance.
    • We provide guidance in the responsible use of all data, as described at http://talkbank.org/share/ particularly points 1, 3, and 6.

    Reviewer Entry

    Accept or send back to applicant for modification:
    Accept

    Comments:

    16. The data consumer respects the applicable licences of the data repository regarding the use of the data.

    Minimum Required Statement of Compliance:
    Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

    Applicant Entry

    Statement of Compliance:
    Level 4: Implemented: This guideline has been fully implemented for the needs of our repository.

    Self-assessment statement:

    We rely on Creative Commons license CC BY-NC-SA 3.0 as noted on our homepages. To control password access to data in the AphasiaBank segment of TalkBank with a possible disclosure risk, we rely on these access measures:

    • Password access is only given to full-time faculty or clinicians with SLP (Speech and Language Pathology) certification from ASHA (the American Speech-Language-Hearing Association).
    • Students can only access data under faculty supervision.
    • Faculty must apply for membership in AphasiaBank and state their intended use of the data.
    • Members agree to the Ground Rules.

    Reviewer Entry

    Accept or send back to applicant for modification:
    Accept
    Comments: