CallHome - English Corpus


Participants: 120
Type of Study: naturalistic
Location: USA
Media type: audio
DOI: 10.21415/T5KP54

Browsable transcripts

Download transcripts

Media folder

Citation information

Some citation here.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The CallHome English corpus of telephone speech was collected and transcribed by the Linguistic Data Consortium primarily in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by the U.S. Department of Defense.

This release of the CallHome English corpus consists of 120 unscripted telephone conversations between native speakers of English. The CD-ROM distribution contains the speech data only, along with essential documentation files and software for handling the compressed speech data. The transcripts and other text data and documentation are distributed separately (typically via electronic transmission from the LDC's FTP/web server) and will be subject to periodic updates. Each transcript covers a contiguous 5- or 10-minute segment (see section 2 below) taken from a recorded conversation lasting up to 30 minutes. The transcripts are timestamped by speaker turn for alignment with the speech signal, and are provided in standard orthography.

All speakers were aware that they were being recorded. They were given no guidelines concerning what they should talk about. Once recruited, a caller was free to choose whom to call. Most participants called family members or close friends overseas. All calls originated in North America; 90 of the 120 calls were placed to various locations overseas, while the remaining 30 were placed within North America. The distribution of call destinations can be found in the file "spkrinfo.tbl".

The LDC recruited speakers for this telephone speech collection effort via the internet, advertisements in publications, and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator required a unique Personal Identification Number (PIN), which the recruiting staff at the LDC issued when the caller enrolled in the project. Both the callers and the call recipients were made aware that the call would be recorded, and the call was allowed only if both parties agreed to be recorded. Each caller was allowed to place only one telephone call and to talk for up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call).

Although the goal of the call collection effort was to have unique speakers in all calls, a handful of repeat speakers are included in the corpus. In all, 200 calls were transcribed. Of these, 80 have been designated as training calls, 20 as development test calls, and 100 as evaluation test calls. For each of the training and development test calls, a contiguous 10-minute region was selected for transcription; for the evaluation test calls, a 5-minute region was transcribed. For the present publication, only 20 of the evaluation test calls are being released; the remaining 80 test calls are being held in reserve for future LVCSR benchmark tests.
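The counts in the paragraph above can be reconciled with the 120 conversations in this release using a few lines of arithmetic. The following Python sketch uses only the figures stated above; the variable names are illustrative, and the minute total is an upper bound that assumes every selected region has the full stated length.

```python
# Calls transcribed, by designation (counts taken from the paragraph above).
transcribed = {"train": 80, "devtest": 20, "evaltest": 100}

# Calls in this release: all training and development test calls,
# but only 20 of the 100 evaluation test calls.
released = {"train": 80, "devtest": 20, "evaltest": 20}

# Length in minutes of the transcribed region for each designation.
region_minutes = {"train": 10, "devtest": 10, "evaltest": 5}

total_transcribed = sum(transcribed.values())
total_released = sum(released.values())
held_in_reserve = total_transcribed - total_released

# Upper bound on transcribed minutes in this release.
released_minutes = sum(released[k] * region_minutes[k] for k in released)

print(total_transcribed, total_released, held_in_reserve, released_minutes)
```

Under these assumptions the release contains 120 of the 200 transcribed calls, with 80 evaluation test calls held in reserve and at most 1100 minutes of transcribed speech.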

After a successful call was completed, a human audit of each telephone call was conducted to verify that the proper language was spoken, to check the quality of the recording, and to select and describe the region to be transcribed. The description of the transcribed region provides information about channel quality, number of speakers, their gender, and other attributes.
English  Sex  Age  Age  Place
0638     F    20   15   PA
6067     F    18   12   NJ
4838     F    18   13   NY
6079     F    19   13   NY
6100     F    19   15   LA
4092     F    22   16   PA
5788     F    22   16   TX
6479     F    22   17   OH
5352     F    22   17   PA
6107     F    23   16   NY
5273     F    23   16   OH
4432     F    23   17   NY
4886     F    24   13   PA
4624     F    24   16   MI
5931     F    24   18   PA
4490     F    24   18   SC
4844     F    25   13   OH
5777     F    25   16   MI
4887     F    25   18   DC
4365     F    25   21   WI
5573     F    26   18   MA
4913     F    26   18   NY
4660     F    27   16   MO
4157     F    28   16   MO
4926     F    30   12   NY
4077     F    30   16   WA
4861     F    30   18   IN
4628     F    30   19   WY
4576     F    31   18   IL
6467     F    31   18   OH
6625     F    32   18   FL
4145     F    32   21   CA
4245     F    32   21   NE
4248     F    32   21   PA
6348     F    33   12   IL
5388     F    33   13   NE
4595     F    33   18   NY
4610     F    33   21   NY
4315     F    34   16   MA
4564     F    34   16   VA
4927     F    34   18   NY
6071     F    35   18   FL
5254     F    35   18   NY
6047     F    36   16   WI
4571     F    36   18   NE
4431     F    36   20   IL
4065     F    36   23   MD
4459     F    37   16   CA
4325     F    38   13   VT
4580     F    38   18   NJ
5907     F    38   20   ID
5046     F    39   22   CA
4104     F    40   16   CA
5700     F    40   16   IA
4234     F    42   17   NY
5866     F    43   12   AR
5888     F    44   16   ID
4544     F    45   18   MA
4822     F    46   18   CA
5278     F    46   18   NY
4310     F    47   18   NH
4665     F    47   18   PA
4666     F    48   16   PA
4335     F    48   18   IA
5551     F    49   16   WI
4623     F    52   13   NY
6161     F    53   18   TX
6456     F    54   20   OH
6033     F    54   20   WI
4941     F    56   16   NY
4112     F    56   19   CA
5736     F    57   16   CA
6274     F    57   16   WI
4705     F    57   22   OR
5495     F    61   20   WA
5242     F    63   12   IL
6314     F    63   14   IL
6252     F    65   17   WI
5712     F    65   20   MI
5648     F    66   18   CT
6045     F    66   18   WI
4556     F    67   14   NJ
5532     F    67   16   IL
6447     F    71   20   WI
5208     F    74   16   WI
6408     F    77   12   MN
4673     F    80   17   MI
4569     F    80   17   MI
4677     M    13   7    WV
4521     M    19   13   NY
6825     M    19   14   NY
6265     M    20   15   NY
6521     M    21   15   OH
4967     M    26   7    UT
4093     M    27   16   WA
4801     M    27   17   WA
4721     M    27   18   FL
6861     M    27   19   UT
5166     M    28   18   MD
5872     M    29   15   CA
4485     M    29   21   MA
4686     M    30   18   FL
4074     M    30   18   TX
4371     M    30   20   NE
5373     M    31   25   Canada
6313     M    32   18   NY
4415     M    33   17   NC
4612     M    37   16   VA
4829     M    38   18   CT
6179     M    43   16   NY
4247     M    43   17   varied
5713     M    43   18   CT
6785     M    46   16   AR
4792     M    48   17   NY
4807     M    54   17   NJ
4184     M    54   20   NY
4702     M    55   13   KA
6298     M    74   12   WI
4808     M    76   14   MA
4629     M    83        IL
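
For readers who want to work with the speaker table programmatically, the following Python sketch splits one row into its fields. It is an illustration under stated assumptions, not part of the corpus tooling: the field names (`call_id`, `sex`, `age`, `fourth`, `place`) are hypothetical labels, and the pattern assumes a four-digit identifier, a one-letter sex code, a two-digit age, an optional one- or two-digit value in the fourth column, and an alphabetic place. Whitespace between fields is optional, so rows parse whether or not the columns are separated.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional

# Assumed row layout: 4-digit id, F/M, 2-digit age,
# optional 1-2 digit fourth column, alphabetic place.
ROW = re.compile(r"^\s*(\d{4})\s*([FM])\s*(\d{2})\s*(\d{1,2})?\s*([A-Za-z]+)\s*$")


class Speaker(NamedTuple):
    call_id: str           # kept as a string to preserve leading zeros
    sex: str
    age: int
    fourth: Optional[int]  # None when the fourth column is empty
    place: str


def parse_row(line: str) -> Speaker:
    """Parse one table row; raise ValueError if it does not fit the layout."""
    match = ROW.match(line)
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    call_id, sex, age, fourth, place = match.groups()
    return Speaker(call_id, sex, int(age), int(fourth) if fourth else None, place)


# A few rows copied from the table, with and without column spacing.
sample = ["0638F2015PA", "4677M137WV", "5373  M  31  25  Canada", "4629M83IL"]
speakers = [parse_row(row) for row in sample]
by_sex = Counter(s.sex for s in speakers)
```

Keeping `call_id` as a string preserves the leading zero in identifiers such as 0638, and a row with an empty fourth column yields `None` for that field instead of failing.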