Bou-Saboun Corpus


Imane Bou-Saboun
Department of Linguistics
Yale University

Participants: 2
Type of Study: longitudinal
Location: Italy
Media type: audio
DOI: xxx

Browsable transcripts

Download transcripts

Link to media folder

Citation information

Bou-Saboun, I. (2025). The Acquisition of Wh-Questions in Tashlhiyt Berber: Novel Behavioral and Corpus Studies (Doctoral dissertation, University of Maryland, College Park).

Project Description

This corpus investigates the acquisition of the morphosyntax of Tashlhiyt Berber, an Afroasiatic language, as spoken by two sisters, Rima and Nala. There is a particular focus on the quality of child-directed speech, so efforts were made to transcribe the speech of the parents as accurately as possible. The children’s speech was narrowly transcribed in IPA, with care taken not to impose an analysis on what the children said.

Transcription format and transcription choices

The transcription is made using IPA and it is largely a phonetic one, to avoid making claims about the underlying form.

The mid vowel [e] and the mid-back vowel [o] are not vowel phonemes of Tashlhiyt, but they may occur as allophones in pharyngealization contexts. For instance, we opted for transcribing the Tashlhiyt Berber (TB) word for ‘no’ as ‘oho’ instead of ‘uhu’. The former is a more accurate phonetic transcription, but ‘uhu’ would be the appropriate underlying form, reflecting TB’s proper set of phonemes.

In this transcription, I chose not to mark pharyngealization on the consonants; at the vowel level, however, pharyngealized u and i are rendered as o and e. Note also that the vowels are otherwise transcribed broadly, so ‘why’, for instance, is transcribed as [maʕʕ] instead of [mæʕʕ].
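The vowel convention described above can be reversed mechanically when an approximate underlying form is wanted. The following is a minimal sketch, assuming that every transcribed o and e is a pharyngealized allophone of u and i; the function name and the mapping table are illustrative and not part of the corpus tooling, and borrowed words or other sources of mid vowels are not handled.

```python
# Assumed mapping, following the convention described in the text:
# transcribed [o] and [e] stand for pharyngealized /u/ and /i/.
ALLOPHONE_TO_PHONEME = str.maketrans({"o": "u", "e": "i"})


def to_underlying_vowels(phonetic: str) -> str:
    """Map the phonetic vowels o/e back to the phonemic vowels u/i."""
    return phonetic.translate(ALLOPHONE_TO_PHONEME)


# 'oho' is the transcribed form of 'no'; 'uhu' is its underlying form.
print(to_underlying_vowels("oho"))  # uhu
```

Because consonant pharyngealization is not marked in the transcripts, the reverse direction (from underlying to phonetic form) cannot be recovered this way.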

The account of wh-questions followed in the transcript is one where there is an inaudible complementizer [a] following the ma wh-word. This ma a transcription will be found in the %xcod: tier under some of the glossed and non-glossed wh-question instances. This specific convention is crucially not meant to signal what is heard, but rather, what rules underlie wh-question formation in Berber languages, since in all of these instances, the ‘ma-a’ string is pronounced [ma].
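Since the ‘ma a’ string appears only in the %xcod: tier, the affected wh-questions can be located by scanning that tier. The following is a minimal sketch, assuming standard CHAT layout (main lines starting with `*`, dependent tiers starting with `%`, and a tab after the tier label). The function name is illustrative, and the `*MOT:` line in the sample is an invented placeholder using `xxx`, not corpus data.

```python
import re


def xcod_ma_a(chat_text: str) -> list:
    """Return (main_line, xcod_line) pairs whose %xcod tier contains 'ma a'."""
    hits = []
    main = None
    for line in chat_text.splitlines():
        if line.startswith("*"):
            main = line  # remember the most recent main (utterance) line
        elif line.startswith("%xcod:") and re.search(r"\bma a\b", line):
            hits.append((main, line))
    return hits


# Placeholder sample: only the *GRAN: utterance comes from the text above.
SAMPLE = "\n".join([
    "*MOT:\tma xxx ?",
    "%xcod:\tma a xxx",
    "*GRAN:\twin mit aj ?",
    "%eng:\tWhose is it?",
])

for main_line, xcod_line in xcod_ma_a(SAMPLE):
    print(main_line, "|", xcod_line)
```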

Similarly, the one context where the “underlying representation” overrode phonetic realization was that of certain assimilation processes. I attempted as much as possible not to transcribe assimilation as such and maintain word boundaries. For instance, let us consider the question “Whose is it?” that appears in the transcript for the final session:

*GRAN: win mit aj?
%gls: Of Who DEM?
%eng: Whose is it?

It is actually pronounced as [wimmitaj], as can be ascertained from the linked audio file. An alternative would be to show the assimilated form in the main utterance line and give the segmentation in an %xcod: line, but the workaround reported here was adopted for time efficiency.
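The difference between the segmented transcription and the pronounced form can be made explicit by comparing the two strings. The following is a minimal sketch, assuming the assimilation changes segments without adding or deleting any, so that both forms have the same length once spaces are removed; the function name is illustrative.

```python
def assimilation_sites(segmented: str, surface: str) -> list:
    """List (index, transcribed, pronounced) for each segment that differs."""
    joined = segmented.replace(" ", "")  # drop the word boundaries
    if len(joined) != len(surface):
        raise ValueError("forms differ in length; align them manually")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(joined, surface))
        if a != b
    ]


# The transcribed 'win mit aj' versus the pronounced [wimmitaj]:
print(assimilation_sites("win mit aj", "wimmitaj"))  # [(2, 'n', 'm')]
```

The single mismatch is the final n of ‘win’, which surfaces as m before the initial m of ‘mit’.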

Improvements and corrections of the transcript on the part of the community are highly encouraged. The transcription is being made available in its current state so that it can be improved by the community of speakers of Tashlhiyt Berber, since it was not possible to recruit people to do the double-checking during the corpus construction process. The transcription was made to the best of the transcriber’s knowledge, drawing from the Berber language literature as much as possible. Negation, for instance, is rendered as “ur”, following Mettouchi 2009 and Bensoukas 2009, inter alia. However, the adverbial “sur” is rendered with a word-final [r], despite being pronounced, and reported, as [sul] in some documentation work from the region, because [sur] is how it is pronounced by the speakers of the dialect represented in the corpus. All inconsistencies are mine.

Grammar categories

At the moment, no MOR grammar or UD has been created for Tashlhiyt. Here are some of the conventions used for morphological analysis throughout the corpus. The %com: tier is used both to label the type of construction in the independent tier and to comment on the context in which the independent tier utterance was produced. It is also used to give context that is not available in the video.

The acronyms in the glosses are generally adapted from the Leipzig glossing rules, plus Berber-specific additions, specified below.

Acronym in gloss or comment tier | What it stands for | Description
AAE | Anti-agreement effect | A suffix that shows up on the verb when the subject is fronted in wh-questions, relative clauses and focus clefts
CS | Construct state | Special marking on subjects in post-verbal position
FS | Free State | ‘Elsewhere’ marking for NPs in TB

Tashlhiyt Child Forms: These are variants of certain strings that were used by the children first and then incorporated into the family’s language, sometimes used in adult-directed speech as well.

Child form | Standard TB | Translation
skukkur/skokkor | smuqqul/smoqqol | Look
otto | ʕmmtˤu/ʕmmto | Aunt (on paternal side)
namnam | tirmt | food
ʕammʕamm | tirmt | food
diddi | ma-d-udnh | That which hurts/booboo
baħħa | ma-d-udnh | That which hurts/booboo
baski@c | - | Bike
Ninni | gʷin | Sleep
Nuhnnu/nohnno | gʷin | Sleep
tiwtiw | Tiglaj/taglajt | eggs
Mummu/mommo | rrosom (borrowing from Arabic word for drawings) | cartoons

Rima

‘Rima’ was born (full-term birth) in Italy on August 26th, 2018, the first-born child of the family involved in the corpus. Her father is a native speaker of Tashlhiyt Berber who grew up in Morocco and moved to Italy when he was 17 years old. The specific locality where he grew up is known as Aït Ilougane, a constellation of villages between Agadir and Tiznit, in the Souss-Massa region. He speaks Italian fluently and is a carpenter by profession. Rima’s mother is a heritage speaker of Tashlhiyt Berber (Tafraoute, Tiznit Province) who grew up in Marrakesh speaking Moroccan Arabic. As for Rima’s development, until schooling, which for her started at age 4, she was exposed to Tashlhiyt Berber at home through her father, paternal grandmother, paternal grandfather, and her uncle and aunts. Rima’s portion of the corpus was collected between ages 2;03,26 and 4;08,05 (Y;MM(,DD)). Recordings were interrupted about 7 months into her schooling, in April of 2023, as the amount of exposure to Italian was increasing. Nevertheless, Tashlhiyt Berber was still the language she was exposed to the most on a daily basis.

Nala

‘Nala’ was born (full-term birth) in Italy on June 3rd, 2021. She is Rima’s younger sister, so the profile of the parents is the same. She was involved in the corpus between the ages of 0;04,06 (4 months and 6 days) and 1;10,28 (Y;MM(,DD)). She joined the corpus as a babbler and an addressee from the second session onwards, dated October 10th, 2021. She is present in sessions #3, #7, #8, #9 and #10. In session #3, at 9 months, she is addressed with full sentences by her grandmother. In sessions #4-6 she is being nursed or is sleeping. Although she is not present across all recording sessions, we document her ability to produce complete one- and two-word utterances in Tashlhiyt Berber during session #7, which is dated January 15th, 2023, when she is 1;07,13.
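The Y;MM,DD ages given for both children can be derived from a birth date and a session date. The following is a minimal sketch, assuming completed calendar months are counted first and the remaining days afterwards; the names `chat_age` and `add_months` are illustrative and not part of any corpus tooling. The day component depends on the counting convention, so results may differ by a day from the ages reported above.

```python
import calendar
from datetime import date


def add_months(d: date, n: int) -> date:
    """Shift a date by n months, clamping the day to the month's length."""
    total = d.year * 12 + (d.month - 1) + n
    year, month0 = divmod(total, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def chat_age(birth: date, session: date) -> str:
    """Format the age at a session as Y;MM,DD (session must not precede birth)."""
    months = (session.year - birth.year) * 12 + (session.month - birth.month)
    if add_months(birth, months) > session:
        months -= 1  # the current month is not yet completed
    anchor = add_months(birth, months)
    years, months = divmod(months, 12)
    return f"{years};{months:02d},{(session - anchor).days:02d}"


# Nala's birth date and the date of session #7, as given in the text.
print(chat_age(date(2021, 6, 3), date(2023, 1, 15)))  # 1;07,12
```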

Acknowledgments

This work was undertaken as part of Imane Bou-Saboun’s PhD dissertation (2025) at the University of Maryland, College Park, entitled The Acquisition of Wh-Questions in Tashlhiyt Berber: Novel Behavioral and Corpus Studies. I want to thank the family involved, first and foremost, who trusted me to document such intimate aspects of their family life and to immortalize the development of their children in a consultable format. I also thank the children, Rima and Nala, for letting me record their play sessions for as long as I did. I am grateful to my two advisors at the Department of Linguistics, Maria Polinsky and Jeffrey Lidz, for their support from conception to development and completion, as well as to the rest of the faculty in the department at UMD and to the Language Acquisition Laboratory, including the lab manager, Tara Mease, and my undergraduate research assistant, Josh Kwak. I am also grateful for the support and encouragement of Professor Nan Bernstein Ratner and the conversations we had on the topic over the years.

I want to thank my students who chose to sign up for and attend my advanced course on the acquisition of understudied languages (LING419M) for all the discussions and for really sharing their minds and wonderfully daring ideas and opinions with me. In alphabetical order: Sofia Bendana, Adelaide Bouthet, Tzipporah Harker, Eli Herbst, Takuya Kameyama, Madeline Keen, Alexa Kolosey, Sara Riso.

I would like to thank the Contact and Documentation reading group at Yale University for welcoming me and for the great discussions about the data here presented during the first year of my postdoc.

Many thanks also go to researchers from other universities whose advice improved the execution of this project. Thank you to Athulya Aravind, Karim Bensoukas, Claire Bowern, Abdellah Elouatiq, Dalila Dehbia Gaoua, Jenia Gutova, Mohamed Lahrouchi, Brian MacWhinney and Sophie Pierson.

Warnings

If you have comments on how to improve the transcription in Tashlhiyt or requests to improve the media linking, or if you come across either major or minor inconsistencies, you are warmly invited to send me an email at: tashlhiytberbercorpus@gmail.com. All are sincerely encouraged to help ameliorate and expand on this effort!

Although the original names of the two children are kept in the transcripts, please refer to the children only by pseudonyms when reporting results. The original names are kept in the transcripts because of certain morpho-phonological phenomena, such as the construct state, that surface on nouns beginning with A-, such as the name of one of the girls.

Corpus-specific correspondence through: tashlhiytberbercorpus@gmail.com