PhonBank Polish Weist-Jarosz Corpus

Richard Weist
Department of Psychology
SUNY Fredonia
weist@a12t.cc.fredonia.edu

Gaja Jarosz
Department of Linguistics
UMass Amherst
jarosz@linguist.umass.edu

Participants:	4
Type of Study:	naturalistic, longitudinal
Location:	Poland
Media type:	audio
DOI:	doi:10.21415/7TYG-KF32

Citation information

Weist, Richard, & Witkowska-Stadnik, Katarzyna. (1986). Basic relations in child language and the word order myth. International Journal of Psychology, 21, 363–381.

Weist, Richard, Wysocka, Hanna, Witkowska-Stadnik, Katarzyna, Buczowska, Ewa, & Konieczna, Emilia (1984). The defective tense hypothesis: On the emergence of tense and aspect in child Polish. Journal of Child Language, 11, 347–374.

Jarosz, Gaja. (2010). Implicational markedness and frequency in constraint-based computational models of phonological learning. Journal of Child Language, Special Issue on Computational Models of Child Language Learning, 37(3), 565–606.

Jarosz, Gaja, Calamaro, Shira, & Zentz, Jason. (2017). Input frequency and inductive bias in the acquisition of syllable structure in Polish. Manuscript, Linguistics Department, Yale University.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references. For this corpus, please cite one reference by Weist and one by Jarosz.

Project Description

Participant Name	Age Range	Sessions	Sex
Bartosz	1;7-1;11	6	M
Kubuś	2;1-2;6	7	M
Marta	1;7-1;10	6 (3 audio)	F
Wawrzon	2;2-3;2	20 (19 audio)	M

All of the children came from middle-class families and were raised in the urban environment of Poznań, Poland. In general, their parents were highly educated. The children were recorded in their homes (typically an apartment) by two experimenters. One experimenter carried a small bag containing the tape recorder, and the other took context notes, which were integrated into the transcripts during transcription.

Phonetic Transcription Description

The children’s productions were transcribed using broad phonetic transcription with the help of the open-source Phon software (Rose et al. 2006). The orthographic transcripts were used as the basis for creating phonetic transcriptions of the children’s target pronunciations, and the audio recordings were used to phonetically transcribe the children’s actual productions and align them with the target transcriptions word by word. The transcription of all child productions was first performed independently by two transcribers trained in phonetic transcription, at least one of whom was a native speaker of Polish. Then, two Polish speakers trained in phonetic transcription worked together to create a consensus transcription of all productions, relying on a third phonetically trained native speaker of Polish to adjudicate in cases when agreement could not be reached. The resulting corpus includes phonetic transcriptions of the children’s productions in all the available audio files, providing word-by-word alignment of target pronunciations and actual pronunciations.

Transcription Conventions

Boundaries: We use word groups to delineate phonological word boundaries. In all cases except one, orthographic word boundaries correspond to phonological word boundaries. The exception is the proclitics 'w' [v]/[f] and 'z' [z]/[s], which attach to the following word and cannot be pronounced independently. In this case, the orthography tier encodes the orthographic word boundaries, putting the proclitic in its own word group, while the IPA Target and IPA Actual tiers leave the proclitic's word group empty and encode the proclitic together with the next word. For example, '[z][kotem]' would be '[][skotɛm]' on the Target tier and potentially something like '[][sotɛm]' on the Actual tier.
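
The proclitic convention can be illustrated with a minimal Python sketch. It assumes each tier is available as a plain list of word-group strings, which is a hypothetical representation for illustration and not Phon's own data model. The function name pair_word_groups is likewise invented here. The sketch folds the orthography of a proclitic ('w' or 'z') whose IPA group is empty into the following word group, so that each orthographic unit lines up with one IPA unit.

```python
# Hypothetical helper: tiers are assumed to be lists of word-group strings.
PROCLITICS = {"w", "z"}  # the two proclitics named in the conventions


def pair_word_groups(orthography, ipa):
    """Pair orthographic word groups with IPA word groups.

    An empty IPA group under a proclitic means its pronunciation is
    encoded with the following word, so its orthography is carried forward.
    """
    if len(orthography) != len(ipa):
        raise ValueError("tiers must have the same number of word groups")
    pairs = []
    pending = []  # proclitics waiting for their host word
    for orth, phon in zip(orthography, ipa):
        if phon == "" and orth.lower() in PROCLITICS:
            pending.append(orth)
            continue
        pairs.append((" ".join(pending + [orth]), phon))
        pending = []
    # A proclitic with no following host is kept with an empty IPA group.
    pairs.extend((orth, "") for orth in pending)
    return pairs


orthography = ["z", "kotem"]
target = ["", "skotɛm"]
actual = ["", "sotɛm"]
print(pair_word_groups(orthography, target))  # [('z kotem', 'skotɛm')]
print(pair_word_groups(orthography, actual))  # [('z kotem', 'sotɛm')]
```

Empty IPA groups that do not sit under a proclitic are left paired with their own orthographic group, since the sketch makes no assumption about what they mean.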

Tier Conventions: We have maintained many of the conventions from the original CHAT transcripts and introduced several codes to denote special situations regarding phonetic transcription.

The following codes were used on the orthography tier:

@c - child-specific form
@n - neologism
@o - onomatopoeia
@f - family-specific form
@q - quote (material the child is reciting from memory or by repetition)
@i - interjection
@wp - whispered
A comment (error:) after a word indicates a morphological or syntactic error
A comment (t:trail off) is used to indicate an incomplete word or utterance
A comment (++) at the beginning of an utterance means this is a completion of an adult's prompt
Angle brackets < > around a word portion indicate this portion of the word was not uttered and is not present on the IPA tiers
0 at the beginning of a word group indicates unpronounced words
[yyy] as a word group indicates material for which a target could not be identified but which is transcribed
[xxx] as a word group indicates material for which a target could not be identified and which could not be transcribed
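
As a rough illustration of how these orthography-tier codes might be handled in a script, the following Python sketch splits a single orthographic word into its base form, its form marker, and the portion that was actually uttered. The function name describe_token, the returned dictionary layout, and the example words are assumptions made for this sketch, and it assumes the marker is attached directly to the end of the word. It does not cover the comment codes or the 0 prefix.

```python
import re

# Form-marker suffixes from the list above, mapped to their meanings.
FORM_MARKERS = {
    "c": "child-specific form",
    "n": "neologism",
    "o": "onomatopoeia",
    "f": "family-specific form",
    "q": "quote",
    "i": "interjection",
    "wp": "whispered",
}

MARKER_RE = re.compile(r"^(?P<word>.+?)@(?P<code>[a-z]+)$")
OMITTED_RE = re.compile(r"<[^<>]*>")  # word portion that was not uttered


def describe_token(token):
    """Describe one orthographic word (hypothetical output layout)."""
    if token in ("xxx", "yyy"):
        # No target could be identified; only yyy material is transcribed.
        return {"full": None, "spoken": None, "marker": None,
                "transcribed": token == "yyy"}
    word, marker = token, None
    match = MARKER_RE.match(token)
    if match and match.group("code") in FORM_MARKERS:
        word = match.group("word")
        marker = FORM_MARKERS[match.group("code")]
    full = word.replace("<", "").replace(">", "")  # intended word
    spoken = OMITTED_RE.sub("", word)              # uttered portion only
    return {"full": full, "spoken": spoken, "marker": marker,
            "transcribed": True}


# Invented example tokens, not taken from the corpus.
print(describe_token("hau@o"))
print(describe_token("<ko>tek"))
```

Keeping both the full and the spoken form makes it possible to relate the orthography tier to the IPA tiers, where the bracketed portion is absent.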

IPA Conventions

For the most part, each word group is a sequence of individual IPA symbols that can be interpreted literally.
One exception is that we used ligatures for affricates, both for convenience and to ensure that affricates were consistently differentiated from stop-fricative sequences (the two contrast in Polish).
We used the postalveolar affricate ligatures for convenience, although these sounds are usually transcribed as retroflex and belong with the retroflex fricative series, which we do transcribe as retroflex.
Our level of transcription is relatively broad and follows standard practice for Polish, but it does encode some non-contrastive phonetic characteristics of the targets and actuals. In particular:
nasal vowels are transcribed on the target tiers according to their standard pronunciation by context (as vowel+nasal stop homorganic with the following noncontinuant consonant and as vowel+nasalized glide otherwise)
we paid special attention to voicing of obstruents, and target transcriptions account for word-final devoicing and voicing assimilation (including across word boundaries)
due to the longer phonetic length of the palatal portion of palatalized labials (e.g. 'piesek'), we coded these palatals as labial-palatal sequences (e.g. [pjɛsɛk]), and we coded the palatalized velars (e.g. 'kiedy') using secondary palatalization (e.g. [kʲɛdɨ]), both in the standard pronunciation and when children produced them in an adult-like way. These are not contrastive distinctions (there's no contrast between [Cʲ] and [Cj] in Polish).
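
To make the word-final devoicing convention concrete, here is a minimal Python sketch that devoices the run of obstruents at the end of a target form. The transcriptions in the corpus were produced by hand, so this is not the transcribers' procedure. The symbol inventory below, including the choice of affricate ligatures, is an assumption made for this example, and the sketch ignores voicing assimilation within words and across word boundaries.

```python
# Assumed voiced-to-voiceless obstruent pairs (illustrative inventory).
DEVOICE = {
    "b": "p", "d": "t", "ɡ": "k", "g": "k",
    "v": "f", "z": "s", "ʐ": "ʂ", "ʑ": "ɕ",
    "ʣ": "ʦ", "ʤ": "ʧ", "ʥ": "ʨ",
}
VOICELESS = set(DEVOICE.values()) | {"x"}


def devoice_final(target):
    """Devoice every obstruent in the word-final obstruent run."""
    symbols = list(target)
    i = len(symbols) - 1
    # Walk backwards while the symbols are obstruents, voiced or not.
    while i >= 0 and (symbols[i] in DEVOICE or symbols[i] in VOICELESS):
        symbols[i] = DEVOICE.get(symbols[i], symbols[i])
        i -= 1
    return "".join(symbols)


print(devoice_final("xlɛb"))  # xlɛp  ('chleb')
print(devoice_final("muzɡ"))  # musk  ('mózg')
print(devoice_final("kɔt"))   # kɔt   (already voiceless)
```

Because the function works symbol by symbol, forms that end in a diacritic or modifier letter are returned unchanged.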