SpiCE: Speech in Cantonese and English
A transcribed audio corpus of conversational Cantonese-English bilingual speech
This SpiCE readme.md file was generated on 2021-05-13 by Khia A. Johnson
This is the Speech in Cantonese and English (SpiCE) corpus. SpiCE is an audio corpus of conversational Cantonese-English bilingual speech recorded in Vancouver, Canada during 2018-2020. The corpus includes high-quality recordings of 34 early bilinguals in both English and Cantonese. Participants completed a sentence reading task, storyboard narration, and conversational interview in each language. These different speech tasks are available in a single audio file for each language for each talker. A Praat textgrid file accompanies each audio file. The textgrids provide hand-corrected orthographic transcription and phoneme-level forced-alignment in Cantonese and English. As an open-access language resource, SpiCE will promote bilingualism research for a typologically distinct pair of languages, of which Cantonese remains understudied despite there being millions of speakers around the world. The SpiCE corpus is especially well-suited for phonetic research on conversational speech, and enables researchers to study cross-language within-speaker phenomena for a diverse group of early Cantonese-English bilinguals. These are areas with few existing high-quality resources.
General Information
- Dataset Persistent ID: doi:10.5683/SP2/MJOXP3
- Author Contact Information:
- Johnson, Khia A. (University of British Columbia, Linguistics, khia.johnson@ubc.ca) - ORCID: 0000-0001-6225-9446
- Babel, Molly (University of British Columbia, Linguistics, molly.babel@ubc.ca) - ORCID: 0000-0003-1153-9557
- Date of Collection: Start: 2018-11-05 ; End: 2020-03-10
- Production Place: Vancouver, BC, Canada
- Producer: Speech in Context Lab (Department of Linguistics, University of British Columbia) (SpeeCon) https://speechincontext.arts.ubc.ca/
- Data Collectors: Yiu, Nancy & Fong, Ivan
- Grant Information:
- UBC Public Scholars Initiative Award
- Social Sciences and Humanities Research Council: 435-2017-0136
Sharing and Access Information
- License: This work is licensed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/
- Documentation: While a copy is included in this corpus download, the documentation is best viewed at: https://spice-corpus.readthedocs.io/
- Recommended Citation: Johnson, Khia A., 2021, "SpiCE: Speech in Cantonese and English", https://doi.org/10.5683/SP2/MJOXP3, Scholars Portal Dataverse, V1.
- Related Publication: Johnson, K. A., Babel, M., Fong, I., & Yiu, N. (2020). SpiCE: A New Open-Access Corpus of Conversational Bilingual Speech in Cantonese and English. Proceedings of The 12th Language Resources and Evaluation Conference, 4089–4095. issn: 2522-2686 https://www.aclweb.org/anthology/2020.lrec-1.503/
Data and File Overview
Audio and Transcription
The cantonese/ and english/ directories contain the main corpus data. Each directory has one .wav audio file and .TextGrid annotation file per session with the same base file name. File names provide basic information and can be parsed as follows:
{TALKER}_{LANGUAGE}_{ORDER}_{DATE}.extension
- TALKER:
- V stands for Vancouver
- F(emale) or M(ale) represents self-reported gender
- 19-34 represents the talker's age at time of recording
- A-E represent unique identifiers to differentiate same gender/age talkers
- LANGUAGE: The primary language of the interview is Cantonese or English
- ORDER: Of the two sessions, I1 indicates first and I2 second
- DATE: The date of the recording session in YYYYMMDD format
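The naming scheme above can be parsed programmatically. The following is a minimal Python sketch, not part of the corpus tooling. The function and class names are illustrative, the example file name is hypothetical, and the exact spelling of the LANGUAGE field (Cantonese or English, matched case-insensitively here) is an assumption based on the description above.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class SessionName(NamedTuple):
    """Fields recovered from a SpiCE-style file name (illustrative type)."""
    talker: str      # full talker code, e.g. "VF19A"
    gender: str      # "F" or "M" (self-reported)
    age: int         # age at time of recording
    suffix: str      # A-E, separates talkers with the same gender and age
    language: str    # "Cantonese" or "English"
    order: int       # 1 for the first session (I1), 2 for the second (I2)
    recorded: date   # recording date parsed from YYYYMMDD


# Assumed layout: {TALKER}_{LANGUAGE}_{ORDER}_{DATE}.extension
_NAME = re.compile(
    r"(?P<talker>V(?P<gender>[FM])(?P<age>\d{2})(?P<suffix>[A-E]))"
    r"_(?P<language>(?i:cantonese|english))"
    r"_I(?P<order>[12])"
    r"_(?P<date>\d{8})"
    r"\.(?i:wav|textgrid)"
)


def parse_session_name(filename: str) -> SessionName:
    """Split a file name into its documented parts, or raise ValueError."""
    m = _NAME.fullmatch(filename)
    if m is None:
        raise ValueError(f"not a recognized session file name: {filename!r}")
    return SessionName(
        talker=m["talker"],
        gender=m["gender"],
        age=int(m["age"]),
        suffix=m["suffix"],
        language=m["language"].capitalize(),
        order=int(m["order"]),
        recorded=datetime.strptime(m["date"], "%Y%m%d").date(),
    )


# Hypothetical file name, used only to illustrate the call.
session = parse_session_name("VF19A_Cantonese_I1_20181114.wav")
```

Because the .wav and .TextGrid files for a session share a base name, the same function applies to both.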
Language Background
The info/participants folder contains two tabular data files and one PDF.
- spice-lbq-detailed.tab provides a detailed, de-identified summary of language knowledge, proficiency, and use, in addition to demographic information
- spice-lbq-summary.tab is a more succinct, basic summary of participant language backgrounds
- ubc-lbq-survey.pdf is the Qualtrics survey that participants completed; the tabular files in this folder are based on the output from this survey
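The tabular files can be read with the Python standard library. This is a minimal sketch that assumes the .tab files are tab-delimited UTF-8 text with a header row. It makes no assumption about column names, which should be checked against the files themselves, and the function name is illustrative.

```python
import csv
from pathlib import Path
from typing import Dict, List, Union


def load_tab(path: Union[str, Path]) -> List[Dict[str, str]]:
    """Read a tab-delimited file with a header row into a list of dicts.

    Assumption: the file is UTF-8 text with one header line. All values
    are returned as strings, with no type conversion.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

A call such as `load_tab("info/participants/spice-lbq-summary.tab")` would return one dictionary per row, keyed by whatever column headers the file defines.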
Supplementary Information
The info/alignment directory contains files used for and produced by the forced alignment process with the Montreal Forced Aligner 1.0.1, including:
- Pronunciation dictionaries with .dict extensions
- English: stressed ARPABET, based on dictionary provided with Montreal Forced Aligner 1.0.1
- Cantonese: developed from Jyutping romanization using pycantonese 3.2.3 https://pycantonese.org/
- "Out Of Vocabulary" files identifying words that were not in the pronunciation dictionaries, with oov in the filenames (see Montreal Forced Aligner documentation for additional information: https://montreal-forced-aligner.readthedocs.io/)
- Directories containing the acoustic models, english_model and cantonese_model. If used in future alignment, these folders should be compressed into .zip files.
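Compressing a model directory into a .zip file can be done with the Python standard library. This is a minimal sketch with an illustrative function name. It assumes the aligner expects the directory's contents at the top level of the archive, which should be verified against the Montreal Forced Aligner documentation before use.

```python
import shutil
from pathlib import Path
from typing import Union


def zip_model_dir(model_dir: Union[str, Path]) -> Path:
    """Create <model_dir>.zip next to the directory and return its path.

    Assumption: the directory's contents (not the directory itself)
    belong at the top level of the archive.
    """
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise NotADirectoryError(str(model_dir))
    # make_archive appends ".zip" to the base name it is given.
    archive = shutil.make_archive(str(model_dir), "zip", root_dir=str(model_dir))
    return Path(archive)
```

For example, `zip_model_dir("info/alignment/english_model")` would write `info/alignment/english_model.zip` and leave the original folder in place.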
The info/documentation folder contains a copy of the SpiCE Corpus documentation hosted at https://spice-corpus.readthedocs.io/, which describes the design, recording procedures, and transcription pipeline used to develop the SpiCE Corpus.