RUSSIAN-BULGARIAN JOINT VENTURE
Goals
The goal of the joint Russian-Bulgarian venture is to apply the high-speed
Bulgarian analysing tool (100,000 w/s) to the Russian texts on the Mannheim
server (TELRI FTP site). The main components of the tool are:
1. Indexing and compressing facilities (made in LML);
2. Russian DELAF Dictionary (wordforms lexica coded in INTEX* format).
This dictionary was constructed on the basis of:
- a CFRL lexeme dictionary;
- a generating program made in CFRL and LML.
* INTEX - a system for lexicon and corpora processing
made in LADL (University Paris VII) where,
within the framework of COPERNICUS'94 JRP #790 BILEDITA,
a new component (Bulgarian) was added
to the existing 5 components (French, English, German,
Italian, Spanish).
Sharing Linguistic Knowledge
(mappings and transformations)
From CFRL tagset to INTEX tagset format
The CFRL Russian tagset (which directly mirrors the notation of A. Zaliznjak's
Grammatical Dictionary of Russian) underwent some changes. They consist
of:
a) removing tags from the tagset;
b) changing the tags hierarchy.
Removing tags. Tags denoting syntactic operations, which pertain to
generation rather than analysis (Phrasal element, Analytic form,
Postposition), were removed.
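The removal step can be pictured as a simple filter. The sketch below is only an illustration: the three labels are taken from the prose above, not from the actual CFRL tag codes (which are not listed here), and the function name strip_generation_tags is hypothetical.

```python
# Labels named in the text as generation-oriented. These are the prose
# names, used as stand-ins for the real CFRL codes (an assumption).
GENERATION_ONLY = {"Phrasal element", "Analytic form", "Postposition"}


def strip_generation_tags(tags):
    """Return the tags that remain after dropping generation-oriented ones.

    The order of the remaining tags is preserved.
    """
    return [tag for tag in tags if tag not in GENERATION_ONLY]
```

For example, a tag list containing "Postposition" comes back without it, while all other tags are kept in their original order.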
Changing the hierarchy. Following the INTEX format, three types
of features (2 lexical and 1 grammatical) are distinguished: lexical identifier
- the Part of Speech, lexical attribute - a characteristic of the lexeme
(the whole paradigm), and grammatical feature - a characteristic of the
concrete paradigm member (in the Table: LEXID, LEXATT and GRAMM).
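The three feature types can be sketched as a small record. This is a minimal illustration, not the project's code: the class name Entry, the method to_delaf, and the tag values "N", "m" and "pN" are hypothetical. The line layout form,lemma.LEXID+LEXATT:GRAMM follows the usual DELAF convention; that the Russian dictionary uses exactly this layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """One wordform entry with the three feature types described above."""
    form: str                                         # inflected wordform
    lemma: str                                        # lexeme (dictionary form)
    lexid: str                                        # LEXID: Part of Speech
    lexatt: List[str] = field(default_factory=list)   # LEXATT: whole-paradigm attributes
    gramm: List[str] = field(default_factory=list)    # GRAMM: features of this paradigm member

    def to_delaf(self) -> str:
        """Render the entry as a DELAF-style line (assumed layout)."""
        head = f"{self.form},{self.lemma}.{self.lexid}"
        attrs = "".join(f"+{a}" for a in self.lexatt)
        feats = "".join(f":{g}" for g in self.gramm)
        return head + attrs + feats


# Hypothetical codes: "N" noun, "m" masculine, "pN" plural nominative.
entry = Entry("столы", "стол", "N", ["m"], ["pN"])
print(entry.to_delaf())
```

Keeping LEXATT on the lexeme and GRAMM on the individual wordform means that all members of one paradigm share the same LEXID and LEXATT and differ only in GRAMM.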
To follow the INTEX hierarchy, some structural changes were made, as in
cases 8-10, 4-7, and 12, 18:
Two nodes of the same level ----> One of them is pushed up in the tree
as a parent and a new sister is added to the other one (8-10)
Main node ----> Subordinated to a new main node (4-7)
Main node ----> Subordinated to an existing main node (12, 18)
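The three kinds of structural change can be sketched on a tag tree stored as a child-to-parent map, where a main node has parent None. This is one possible reading of the descriptions above, not the procedure actually used; the function names and the placeholder node names in the usage note are hypothetical.

```python
def promote_sibling(parent, promoted, other, new_sister):
    """Two nodes of the same level: `promoted` becomes the parent of
    `other`, and `new_sister` is added next to `other`."""
    if parent[promoted] != parent[other]:
        raise ValueError("nodes are not on the same level")
    parent[other] = promoted
    parent[new_sister] = promoted


def subordinate_to_new(parent, node, new_main):
    """A main node is placed under a newly created main node."""
    if new_main in parent:
        raise ValueError("node already exists")
    parent[new_main] = None
    parent[node] = new_main


def subordinate_to_existing(parent, node, existing_main):
    """A main node is placed under a main node that already exists."""
    if parent.get(existing_main, "missing") is not None:
        raise ValueError("target is not an existing main node")
    parent[node] = existing_main
```

Starting from three main nodes, parent = {"X": None, "Y": None, "Z": None}, the call promote_sibling(parent, "X", "Y", "Y2") leaves X as a main node with daughters Y and Y2, and subordinate_to_new(parent, "Z", "W") creates W and moves Z under it.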
The full mapping of CFRL lexicon features to the INTEX-LML tagset is given
in Tables.
A sample of analyzed Russian text is given in Results.