PhonBank Portuguese Pereira/Freitas Corpus

Rodrigo Pereira
Linguistics
University of Lisbon
rodrigopereira1@campus.ul.pt

Maria João Freitas
Linguistics
University of Lisbon
joaofreitas@letras.ul.pt

Participants:	6
Type of Study:	naturalistic
Location:	Portugal
Media type:	audio
DOI:	doi:10.21415/T50P5V

Citation information

Freitas, Maria João (1997). Aquisição da Estrutura Silábica do Português Europeu. Ma, Universidade de Lisboa.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

Name	Age Range	Sessions	Sex
João	0;10.02 – 2;08.27	23	M
Laura	2;02.29 – 3;03.10	12	F
Luís	1;09.29 – 2;11.02	12	M
Marta	1;02.00 – 2;02.17	12	F
Pedro	2;07.00 – 3;07.24	12	M
Raquel	1;10.02 – 2;10.08	11	F
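
The age ranges above follow the CHAT years;months.days notation, so 2;08.27 reads as 2 years, 8 months and 27 days. For users who want to sort or filter sessions by age, a minimal Python sketch of a parser is given below. The function names are illustrative only and are not part of Phon or any TalkBank tool.

```python
import re

# Matches an age written as years;months.days, e.g. "2;08.27".
AGE_PATTERN = re.compile(r"^(\d+);(\d{1,2})\.(\d{1,2})$")


def parse_age(age):
    """Return (years, months, days) for an age such as '2;08.27'."""
    match = AGE_PATTERN.match(age.strip())
    if match is None:
        raise ValueError(f"unrecognised age: {age!r}")
    years, months, days = (int(part) for part in match.groups())
    return years, months, days


def parse_age_range(text):
    """Split a range on a hyphen or an en dash, with or without spaces."""
    start, end = re.split(r"\s*[-–]\s*", text.strip())
    return parse_age(start), parse_age(end)


def age_in_months(age):
    """Return the age in completed months, ignoring the days."""
    years, months, _days = parse_age(age)
    return years * 12 + months
```

For example, parse_age_range("1;02.00 – 2;02.17") yields the start and end ages for Marta as two (years, months, days) tuples, which compare correctly with the usual ordering operators.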

The recordings were made in each child’s home, normally in the child’s bedroom. Naturalistic data were collected longitudinally: each child was videotaped for a period of one year. Each session lasted from 30 to 60 minutes. The recordings were made using a Sony Handycam Video 8, AF Hi-Fi Stereo.

The data collection was supported by Fundação para a Ciência e a Tecnologia (research project PCSH/C/LIN/524/93). This new version of the corpus was also supported by Fundação para a Ciência e a Tecnologia (UID/LIN/00214/2019). The updated version of November 2020 (with João and Pedro) was supported by Fundação para a Ciência e a Tecnologia (UIDB/00214/2020).

The data were manually entered into the Phon application. Orthographic and phonetic transcriptions were made of both the target forms and the children's actual forms. Transcriptions were performed by a native speaker of European Portuguese highly trained in phonetic transcription. All problematic transcriptions were noted and listened to by a second judge, also highly trained in the task. The criteria adopted during the data editing process are listed below. In research contexts, and due to the nature of the transcription task, we advise users to carefully review the selected files.

The data files in this corpus are part of the Acquisition of European Portuguese Databank (AcEP – www.clul.ul.pt/en/research-teams/476-acquisition-of-european-portuguese-databank), in CLUL's research group Grammar & Resources. Only audio files and orthographic/phonetic transcriptions are public and available online. Access to the video recordings is not allowed (for further details on this issue, please contact the AcEP (Acquisition of European Portuguese Databank) Project: joaofreitas@letras.ulisboa.pt). Additional longitudinal data on the acquisition of European Portuguese are available at: https://phon.talkbank.org/access/Romance/Portuguese/CCF.html

It is important to note that the first 8 sessions of “João” are not segmented or transcribed, because the child produced almost no speech; only the audio is provided. These sessions may be of interest to those who aim to study early language acquisition (including babbling).

Criteria used for corpus editing

Overall, the transcriptions were organized in terms of orthographic words, not phonological words. The external sandhi rules and the way they divide words are described below. However, words written with a hyphen, such as verb forms with enclitic pronouns (e.g. “foi-se” [ˈfojsɨ]), were kept in a single transcription group, within one pair of square brackets in the orthography tier. When the speech was unintelligible or contained extralinguistic sounds such as mumbling, singing, screaming, etc., the top tier (orthography) has the mark [xxx] and the other two (IPA target and actual) have [*].

In the notes tier, some abbreviations or indications were used, such as:

PC: Phonological clue given by the adult
Rep: Repetition
Overlapping: Voice overlapping
Whispered: Whispered/murmured speech
Unintelligible: Not perceptible

Any other notes, however diverse, are fully described in English in the notes tier in Phon. The most common one was “Incomplete word after initial clue”, which means that the adult provides the first two or three syllables of the target word and the child supplies only what is missing from that word. Some choices had to be made for cases of external sandhi (across word boundaries), and the following rules apply:

Contraction/crasis of vowels: “[para] [o] [chão]” IPA: [p] [ɔ] [ʃɐ̃w̃]: each orthographic word has its own transcription group, no matter how reduced the sounds become. If a sound was elided, nothing was transcribed.
Identical vowels across word boundaries: “toca aqui” [ˈtɔka] [ˈki]: since the vowels merge, only one side keeps the vowel; the target transcription provides more information about the phonological structure of the words.
Coalescence of [ʃ] and [s] in different words: “olho[ʃ] [s]ão” > [ˈɔʎuʃ] [ˈɐ̃w̃] or [ˈɔʎu] [ˈʃɐ̃w̃]: only one side is transcribed with the esh symbol.
Glide insertion: “é esta” [ˈɛ] [ˈɛʃtɐ] > the glide is transcribed only once, on one side or the other, e.g. [ˈɛ] [ˈjɛʃtɐ].
Voicing assimilation: whenever a coda consonant assimilates the voicing of the initial consonant of the following word, the assimilated form was written in the IPA target instead of the canonical prepausal realization.
Exclamation marks, question marks and other punctuation signs are written in the orthography separated by spaces from the adjacent character sequence, in order to ease word searches.
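
Because punctuation is set off by spaces, a word search on the orthography can rely on whitespace tokenization alone. The Python sketch below illustrates the idea on a plain string; the example utterance and the function names are hypothetical, and the sketch assumes the orthography has been exported as plain text without the group brackets.

```python
def tokenize(orthography):
    """Split an orthographic line on whitespace."""
    return orthography.split()


def words(orthography):
    """Keep only tokens containing a letter, dropping punctuation tokens."""
    return [tok for tok in tokenize(orthography)
            if any(ch.isalpha() for ch in tok)]


def contains_word(orthography, word):
    """Case-insensitive exact match against the whitespace tokens."""
    target = word.lower()
    return any(tok.lower() == target for tok in tokenize(orthography))


utterance = "onde está a bola ?"  # hypothetical example, not from the corpus
print(tokenize(utterance))
print(contains_word(utterance, "bola"))
```

If the question mark were attached to the word (“bola?”), the exact match on “bola” would fail, which is the situation the spacing convention avoids. Hyphenated forms such as “foi-se” remain a single token.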