TalkBank IRB Approval

This page explains the principles involved in securing IRB permission for data sharing. If you already have IRB clearance and are ready to contribute your data to TalkBank (CHILDES, AphasiaBank, SLABank, etc.), you should follow these instructions on how to actually submit your data.

1. IRB Principles

TalkBank members who are interested in contributing their data need to make sure that they obtain IRB approval for their study, along with informed consent from individual participants. There are no standard forms for IRB applications, since every university or institute creates their own forms, procedures, and templates. For the purposes of contributing to TalkBank, the important thing is to select the appropriate level of access to the data that participants are being asked to grant. To help you determine this, we have created an OPTIONS summary for the 9 options that are available. We would recommend that you ask participants to permit unrestricted access with pseudonymization of the transcripts (Options 1 and 2). You should include on your form the fact that participants always have the right to request that parts or all of the data in which they participate be removed from TalkBank at any time.

2. Contributions of Archival Data

Often researchers will wish to contribute data collected in projects that have already been completed. In such cases, it may be difficult or impossible to contact participants to obtain a new consent form. However, IRBs are allowed to permit including these data in TalkBank, if certain conditions are met.
  1. The original consent forms should not have exclusionary language such as "These data will only be made available to Professor XYZ and her laboratory". If the consent forms says something like "These data will only be made available to qualified researchers," then inclusion in TalkBank should be allowed, as long as only qualified researchers are given the necessary password. If the consent form is still more general, then passwords may not be necessary.
  2. Data should be anonymized.
  3. Additional protection is possible, as described on the options summary page .
  4. It is important to emphasize that granting agencies stipulate that data collected with federal funds should be made available to researchers, as long as anonymity is preserved.

3. GDPR Compliance

The General Data Protection Regulation (GDPR) establishes rules for personal data on the web. The EU web site for GDPR issues is In regards to TalkBank, there are five core GDPR issues
  1. Commercial purposes issue: GDPR is designed to apply to data transferred for commercial purposes. TalkBank has no commercial purposes. However, it could still apply if TalkBank were to collect emails and addresses, which it does not do.
  2. The scientific data issue: A good summary of these issues can be found in this Nature article which notes that, consent is given "to certain areas of scientific research when in keeping with recognised ethical standards for scientific research." Article 89 of the GDPR states that, "Where personal data are processed for scientific or historical research purposes or statistical purposes, Union or Member State law may provide for derogations from the rights referred to in Articles 15, 16, 18 and 21 subject to the conditions and safeguards referred to in paragraph 1 of this Article in so far as such rights are likely to render impossible or seriously impair the achievement of the specific purposes, and such derogations are necessary for the fulfilment of those purposes." In other words, data-sharing is allowed for research purposes. In addition, Recital 113 allows for transfers of data from a limited number of data subjects for scientific purposes for an increase of knowledge.
  3. The informed consent Issue: NIH IRB informed consent guidelines are in accord with the GDPR Consent rules. Given this, if participants give consent for making data available to qualified researchers, then this should be approved. GDPR emphasizes also that this consent must be revocable and that there should be methods for allowing participants to revoke consent.
  4. The deidentification issue: If data are deidentified, then they are not personal data and are not covered by GDPR and IRB. Data are not be anonymous or deidentified if they have: name plus surname, credit card, telephone, address, or number plate. First name alone is not identifying. It has to be Name plus Surname. Anonymization must be irreversible. This means that contributors should destroy participant names. This holds in both EU and USA. However, the GDPR catch-22 here is that a link to the data needs to be maintained to allow for data removal. The solution for this is to make the information linking to a person only available to a third party "honest broker". See below for a discussion of identification based on voice samples.
  5. The Code of Conduct issue: Article 40 allows for development of a Code of Conduct to facilitate data transfer to non-EU countries. In the case that an institution prefers to have identifiable media stored on servers in the EU, it is possible to implement CORS (cross origin resource sharing) from a CHAT file at CMU to a media server in the EU. This is done by allowing access from https://*

4. Deidentification

In order to deidentify transcripts, it is important to replace any last names with the word "Lastname" with a capital L. Also addresses or local city names should be replaced with "Addressname" with a capital A. Other forms include "Cityname", "Schoolname", "Hospitalname" and so on. These same English words should be used even in other languages. It is not crucial to replace children's first names unless they are very unique.

The EU Amnesia project at provides software for deidentification of spreadsheet data.

The Canadian CONP Ethics and Governance Committee has a series of recommendations for deidentification of neuroimaging data .

For audio deidentfication, we can then use the occurrences of the terms Lastname and Address in the transcripts to guide the removal of the names and addresses from the corresponding segment in the audio track. You can follow the suggestions in the section of the CLAN manual on "Audio Anonymization" which are also available here Once this is done, children and others can only be identified by people who already know them. Because of this, contribution of audio is equivalent in IRB terms to contribution of transcripts.

You can also save yourself a lot of trouble if you avoid using idetifying information when making recordings.


Researchers often ask about whether they need to request additional IRB approval for contributing audio data. The concern is that audio data may be less confidential than transcript data. However, as long as identifying material is removed from both transcripts and audio, they do not present additional confidentiality issues.

Some reviewers and IRB committees believe that spoken data is identifiable through voice recognition technology. However, this judgment is based on a confusion between closed-set identification and open-set identification. Closed-set identification relies on a pre-existing pool of voiceprints from a given group, such as members of a company or subscribers to a service. Open-set identification does not rely on this pre-existing pool of voiceprints. As noted by Togneri and Pullella (2011), "in open-set identification the unknown individual can come from the general population. However as identification is always carried out against a finite, known pool of individuals it is not possible to identify arbitrary people."

Togneri, R., & Pullella, D. (2011). An overview of speaker identification: Accuracy and robustness issues. IEEE circuits and systems magazine, 11(2), 23-61. pdf

As Yuan and Liberman (2008) discovered, speaker identification in even a closed group of Supreme Court judges in TalkBank's SCOTUS corpus is still very difficult.

Yuan, J., & Liberman, M. (2008). Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5), 3878. pdf

5. Contributions to CHILDES and PhonBank

Although each University and project will have different requirements, contributors often ask for a generic contribution template form, so here is a sample CHILDES/PhonBank consent form based roughly on the local format at CMU.

6. Contributions to AphasiaBank/DementiaBank/TBIBank/RHDBank:

Research with subjects with disabilities requires additional access restriction, such as password protection. It may also require more complete IRB documentation. In this regard, researchers working with the AphasiaBank protocol will find these additional IRB-approved materials useful:

Contributions to the other three clinical databanks -- DementiaBank, RHDBank, and TBIBank can follow formats similar those given above for AphasiaBank. The issues involved are generally similar.

7. Contributions to FluencyBank

To protect subject confidentiality, all research contributions to FluencyBank are restricted and require password to access. We suggest that new projects use a graduated consent form developed at the University of Maryland, that allows participants to specify use of video, audio-only, or transcript-only in contributed data.

When communicating with your IRB, you may find the suggestions in this briefing sheet helpful.

For projects underway, or recently completed, or longitudinal projects in which PIs would like to have an ongoing relationship before making a contribution request of subjects, we have a sample post-hoc consent form from the University of Maryland.

For completed projects that have used video without permission to share the video, we will work with you to extract the audio tracks from your video files. (Please see Contributing audio, above, for reasons why this may not require additional IRB consideration). Please contact Brian MacWhinney or Nan Bernstein Ratner to determine how best to handle your data.

8. Contributions to HomeBank

Please consult the HomeBank guidelines.