PhonBank Taiwanese Tsay Corpus


Jane S. Tsay
Linguistics
National Chung Cheng University

website

Participants: 14
Type of Study: longitudinal, naturalistic
Location: Taiwan
Media type: audio
DOI: doi:10.21415/T59K7C

Browsable transcripts

Phon data

CHAT data

Link to media folder

Citation information

Tsay, Jane (2007). Construction and automatization of a Minnan child speech corpus with some research findings. Computational Linguistics and Chinese Language Processing 12(4): 411-442.

Tsay, Jane (2014). A Phonological Corpus of L1 Acquisition of Taiwan Southern Min. In The Oxford Handbook of Corpus Phonology, ed. by Jacques Durand, Ulrike Gut, and Gjert Kristoffersen, 576-587. Oxford, UK: Oxford University Press.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The Taiwanese Child Language Corpus (TAICORP) is a corpus based on spontaneous conversations between young children and their adult caretakers in Minnan-speaking (Taiwan Southern Min) families in Chiayi County, Taiwan. This corpus is distinctive in several ways: (1) it is a Minnan corpus; (2) it is a speech-based corpus; (3) it is a corpus of a language that does not yet have a conventionalized orthography; (4) it is a collection of longitudinal child language data; (5) it is one of the largest child corpora in the world, with about two million syllables in 497,426 lines (utterances) based on about 330 hours of recordings. Regarding the format, TAICORP adopted the Child Language Data Exchange System (CHILDES) [MacWhinney and Snow 1985; MacWhinney 1995] for transcribing and coding the recordings into machine-readable text. The goals of this paper are to introduce the construction of this speech-based corpus and, at the same time, to discuss some of the problems and challenges encountered. The development of an automatic word segmentation program with a spell-checker is also discussed. Finally, some findings on syllable distribution are reported.

From a linguistic point of view, there is an urgent need to construct a Minnan child language corpus, partly because no such corpus has been available and partly because it may be getting more and more difficult to find young children learning Minnan as their first language, especially in the cities. On top of that, the significance of a large collection of longitudinal child language data for linguistic studies goes without saying.

Mandarin and Minnan are the two major Chinese languages in Taiwan. For over forty years, Mandarin was the only official language of instruction at school, in spite of the fact that about 73% of the population belonged to the Minnan ethnic group [Huang 1993]. Young children in kindergartens and elementary schools were not allowed to speak Minnan, even if Minnan was the language spoken at home. This policy caused a decrease in the number of young children learning Minnan as their first language.

Although the situation has changed in recent years, and other local languages besides Mandarin, including Minnan, Hakka, and the aboriginal (Formosan) languages, have been included in the curriculum of elementary schools, there is still serious concern about the decrease in native Minnan speakers. This concern is supported by a more recent survey. Tsay [2005] reports that in a survey of all 8th graders in Chiayi City in Southern Taiwan, an area where the population should be overwhelmingly Minnan, only about 26% of 14-year-olds used Minnan in their daily life, although over 80% of their grandparents and over 70% of their parents were native Minnan speakers.

For these reasons, Minnan was chosen as the target language. The project was conducted in a rural area of Chiayi County in Southern Taiwan in the hope of finding young children who were being raised in a Minnan-speaking environment. Data collection took place over a period of around three years, between August 1997 and July 2000.

Participants

Young children from Minnan-speaking families were recruited in Min-hsiung Township (Xiang), Chiayi County, in Southern Taiwan. Nine boys and five girls from the following villages in Min-hsiung participated in this project: Fengshou (豐收村), Sanxing (三興村), Dongxing (東興村), Xidibu (溪底部), and Zhenbei (鎮北村). They ranged in age from one year, two months (1;02) to three years, eleven months (3;11) at the beginning of the recording. More than half of the children were recorded over a span of more than two years. The age range at the end of the recordings is between two years, seven months (2;07) and five years, three months (5;03).

Participant Name   Age Range          Sessions   Sex
CEY                2;01.27-3;10.00    37         F
CQM                2;09.07-4;06.22    30         M
HBL                2;01.22-4;00.03    45         M
HYS                1;02.28-3;04.12    51         M
LJX                3;09.20-4;02.24    8          M
LMC                2;08.07-5;03.21    50         F
LWJ                2;01.08-3;07.03    36         F
LYC                1;02.13-3;03.29    48         F
TWX                1;05.12-3;06.15    44         F
WZX                2;01.17-4;03.15    44         M
YCX                3;10.16-4;00.16    6          M
YDA                3;11.02-4;04.26    9          M
YJK                2;06.11-2;06.26    2          M
YSW                1;07.17-2;07.14    21         M
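
The ages in this section use the years;months.days notation (for example, 1;02 is one year, two months). As a minimal sketch, not part of the corpus tooling, the following Python shows one way to convert such ages and age ranges into approximate months for analysis. The function names are hypothetical, and treating a month as 30 days is an assumption made only for illustration.

```python
import re

# Age written as years;months.days, e.g. "2;01.27". The days part is optional.
AGE_RE = re.compile(r"^(\d+);(\d{1,2})(?:\.(\d{1,2}))?$")


def parse_age(age: str) -> tuple[int, int, int]:
    """Split an age string into (years, months, days)."""
    m = AGE_RE.match(age.strip())
    if m is None:
        raise ValueError(f"not a years;months.days age: {age!r}")
    return int(m.group(1)), int(m.group(2)), int(m.group(3) or 0)


def age_in_months(age: str) -> float:
    """Approximate age in months, assuming a 30-day month (illustrative only)."""
    years, months, days = parse_age(age)
    return years * 12 + months + days / 30


def span_in_months(age_range: str) -> float:
    """Approximate length of a 'start-end' age range in months."""
    start, end = age_range.split("-")
    return age_in_months(end) - age_in_months(start)
```

For example, `span_in_months("1;02.28-3;04.12")` comes out at roughly 25.5 months, consistent with the statement that more than half of the children were recorded for more than two years.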

Recording & Transcription

Regular home visits were conducted every two weeks for younger children and every three weeks for children older than three. The children were recorded at play in their own homes, interacting naturally with one or more adults, usually one of their caretakers (parents, grandparents, or, in very few cases, the nanny) and/or the investigator. The activities were drawn from the children's daily life at home: playing with toys or games, reading picture books, or just talking without any specific topic. Since we hoped to keep the environment as natural as possible for the children, MiniDisc recorders and microphones were used so that it was easier for the person recording (the investigator) to follow the child wherever she/he went. Each recording session usually lasted 40 to 60 minutes.

The symbol “ǂ” is used in the phonological notations to indicate the contraction of two syllables into one, or three syllables into two. The zero in these notations indicates a missing syllable.

Authors of articles based on this corpus should read and cite this article, which provides additional information on Romanization and part-of-speech coding.

Acknowledgements

This project was conducted with the support of the National Science Council in Taiwan (NSC 87-2411-H-194-019, NSC 88-2411-H-194-019, NSC 88-2418-H-194-002).