Bulgarian Phonetic Corpus BulPhonC Version 3

The Bulgarian Phonetic Corpus BulPhonC contains speech signals that were automatically annotated at the phoneme level. The creation of the BulPhonC corpus was supported by the project AComIn: Advanced Computing for Innovation, funded by the FP7 Capacity Programme, Research Potential of Convergence Regions, Grant Agreement: 316087.

The corpus was compiled at the Department of Linguistic Modelling and Knowledge Processing of the Institute of Information and Communication Technologies at the Bulgarian Academy of Sciences by Dimitar Hristov, Ivan Zamanov, Ivana Yovcheva, Marina Kraeva, Nelly Hateva, Petar Mitankin and Stoyan Mihov.

Corpus Description

Authors:	Dimitar Hristov, Ivan Zamanov, Ivana Yovcheva, Marina Kraeva, Nelly Hateva, Petar Mitankin and Stoyan Mihov
ISLRN:	755-406-235-455-4
DCMI Type:	Sound
Language:	Bulgarian
Year:	2015
Speakers:	140 Bulgarian speakers (59 male, 81 female); average speaker age: 37 years
Recording environment:	Studio
Microphone:	Sennheiser MK 4
Sampling rate:	16 kHz
Number of bits per sample:	16
Sample type:	Single-channel (mono) PCM
Number of utterances:	21891
Number of sentences:	The corpus contains 319 phonetically rich sentences divided into two parts. Part 1 contains 148 sentences and Part 2 contains the remaining 171 sentences. Most speakers read only Part 1.
Phonetic annotation:	Each utterance has a corresponding phoneme-level annotation in a format supported by Praat. The recorded signals were automatically segmented into utterances. All automatically segmented utterances were then manually verified, and incorrectly segmented utterances were removed from the corpus. The remaining utterances were automatically annotated at the phoneme level.
Phonetic system:	The phonetic system consists of 30 phonemes.
Size of the corpus:	2.7 GB tar.gz
Duration:	Approximately 40 hours
Citation:	Hateva, N., Mitankin, P., Mihov, S., BulPhonC: Bulgarian Speech Corpus for the Development of ASR Technology. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 2016, ISBN:978-2-9517408-9-1, pp. 771-774
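
The figures above can be cross-checked with a little arithmetic. The Python sketch below is illustrative and not part of the corpus distribution: it derives the uncompressed audio size and the mean utterance length from the stated duration, sampling rate, bit depth and utterance count, and checks a recording's header against the stated format. It assumes the recordings are stored as WAV files, which the description does not state, and all function names are hypothetical.

```python
import wave

# Figures quoted in the corpus description.
SAMPLE_RATE_HZ = 16_000
BITS_PER_SAMPLE = 16
CHANNELS = 1
UTTERANCES = 21_891
DURATION_HOURS = 40  # stated as approximate


def raw_pcm_bytes(hours, rate=SAMPLE_RATE_HZ, bits=BITS_PER_SAMPLE,
                  channels=CHANNELS):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return int(hours * 3600 * rate * (bits // 8) * channels)


def mean_utterance_seconds(hours=DURATION_HOURS, utterances=UTTERANCES):
    """Average utterance length implied by total duration and count."""
    return hours * 3600 / utterances


def matches_stated_format(path):
    """Return True if a WAV file is 16 kHz, 16-bit, single-channel.

    Assumption: the file is a WAV container readable by the standard
    library `wave` module; the description only says "PCM".
    """
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == SAMPLE_RATE_HZ
                and wav.getsampwidth() * 8 == BITS_PER_SAMPLE
                and wav.getnchannels() == CHANNELS)
```

With the stated values, `raw_pcm_bytes(40)` gives about 4.6 GB of raw audio, compared with the 2.7 GB compressed archive, and `mean_utterance_seconds()` gives roughly 6.6 seconds per utterance. Both are estimates, since the total duration is only given approximately.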

Samples

Contents of the BulPhonC Corpus Version 3

Old versions: BulPhonC Version 2, 28.05.2015, BulPhonC Version 1, 30.04.2015

Availability

Free for scientific use. Orders for a scientific licence must include a written declaration that no part of the corpus will be used for the development of commercial products of any kind.

For more information, please contact BulPhonC at lml dot bas dot bg.