The main research objective of the project is to develop novel methods and associated algorithms for knowledge discovery in big data based on semantic attributes. These methods will solve the classical pattern search tasks with effectiveness comparable to that of existing approaches, and will also address more complex pattern search tasks that take context and sequencing into account.

The project will deliver new theoretical results in the areas of theoretical computer science, computational linguistics, and medical informatics.

In theoretical computer science, new formal models will be developed for representing data collections that contain structured and unstructured information. The novelty of the representation will stem from integrating various approaches to the organization of semantic attributes, covering discrete and continuous numeric values as well as nominal values. The semantic attributes may be unstructured or structured (hierarchical, sequential, network, etc.). The project will also produce formal descriptions of algorithms for pattern search in data whose semantic attributes are organized in diverse ways, taking into consideration frequent patterns of data items as well as temporal sequences of data items, including parallel sequences.
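As a rough illustration of what "frequent patterns of data items" means in the simplest case, the sketch below counts itemsets that occur in at least a given number of records. It is a brute-force toy under stated assumptions, not one of the project's algorithms: the function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the sample records are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return itemsets (as sorted tuples) found in at least min_support transactions.

    Brute-force enumeration; suitable only for small illustrative inputs.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # deduplicate and fix a canonical order
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical records: each one is a set of nominal attribute values.
records = [
    {"diabetes", "hypertension", "metformin"},
    {"diabetes", "metformin"},
    {"hypertension", "beta_blocker"},
    {"diabetes", "hypertension", "metformin"},
]

patterns = frequent_itemsets(records, min_support=3)
for itemset, support in sorted(patterns.items()):
    print(itemset, support)
```

Sorting the items before enumerating combinations ensures that the same itemset is always counted under the same tuple, regardless of the order in which values appear in a record.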

In the area of computational linguistics, IZIDA will develop formal descriptions of algorithms for pattern search in texts that incorporate morphological and syntactic features as semantic attributes. A comparison with the classical statistical methods for pattern search in texts (collocations, n-grams, etc.) will be provided.
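To make the contrast concrete, the following minimal sketch counts classical word n-grams and, alongside them, n-grams over part-of-speech tags used as a simple morphological attribute. The tagged tokens and the tag names are hypothetical, hand-written examples; this is not the project's method, only an illustration of counting patterns over surface forms versus over attributes.

```python
from collections import Counter

def ngrams(sequence, n):
    """Return an iterator over consecutive n-tuples of a sequence."""
    return zip(*(sequence[i:] for i in range(n)))

# Hypothetical, hand-tagged tokens: (word, part-of-speech tag).
tagged = [
    ("the", "DET"), ("patient", "NOUN"), ("reports", "VERB"),
    ("the", "DET"), ("pain", "NOUN"),
]

words = [word for word, _ in tagged]
tags = [tag for _, tag in tagged]

word_bigrams = Counter(ngrams(words, 2))  # classical surface-form bigrams
tag_bigrams = Counter(ngrams(tags, 2))    # bigrams over the attribute layer

print(word_bigrams.most_common())
print(tag_bigrams.most_common())
```

In this toy input every word bigram occurs once, while the tag bigram `("DET", "NOUN")` occurs twice, which shows how an attribute layer can expose a recurring pattern that surface forms alone do not.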

As a contribution to the field of medical informatics, the project will develop formal descriptions of algorithms for pattern search in hybrid data (discrete, continuous, nominal and text) in the medical domain, with diverse organization of the respective semantic attributes. A retrospective analysis will be carried out on outpatient records, anonymized as required by the Bulgarian Personal Data Protection Law. This analysis will yield structured resources that can be combined with secondary data to enable more complex knowledge discovery analytics. The test data sets will be extracted using various techniques: stochastic sampling, quota sampling, and cluster sampling.
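The three sampling techniques named above can be sketched as follows. This is a simplified illustration using only the Python standard library: the record fields (`id`, `sex`, `region`), the function names, and the quota values are all assumptions made for the example, and the quota variant here simply fills each group's quota from a shuffled list, which is only one of several ways quota sampling is done in practice.

```python
import random
from collections import defaultdict

def simple_random_sample(records, k, rng):
    """Stochastic (simple random) sampling: k records without replacement."""
    return rng.sample(records, k)

def quota_sample(records, key, quotas, rng):
    """Quota sampling: take up to quotas[group] records from each group."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    taken = defaultdict(int)
    result = []
    for record in shuffled:
        group = key(record)
        if taken[group] < quotas.get(group, 0):
            taken[group] += 1
            result.append(record)
    return result

def cluster_sample(records, key, n_clusters, rng):
    """Cluster sampling: pick whole clusters at random and keep all their records."""
    clusters = defaultdict(list)
    for record in records:
        clusters[key(record)].append(record)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [record for cluster in chosen for record in clusters[cluster]]

# Hypothetical anonymized records with made-up fields.
records = [
    {"id": i, "sex": "F" if i % 2 == 0 else "M", "region": "ABC"[i % 3]}
    for i in range(12)
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_part = simple_random_sample(records, 4, rng)
quota_part = quota_sample(records, lambda r: r["sex"], {"F": 2, "M": 3}, rng)
cluster_part = cluster_sample(records, lambda r: r["region"], 2, rng)
```

Passing an explicit `random.Random` instance keeps the three samplers reproducible and independent of the global random state.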

Several software prototypes (lab versions) will be implemented as experimental tools for assessing the performance of the novel methods and algorithms. A series of tests will be run on different data collections to evaluate the effectiveness of the novel algorithms. Their scalability will also be investigated through experiments with big data collections.

The results achieved in the project will enable further development of tools that tackle complex social problems in health management: developing more precise methods for studying comorbidity between chronic and acute diseases; monitoring changes in patient status depending on diagnoses, patterns of combined treatment, and illness progression; and investigating the onset of chronic and acute diseases and their effect on overall patient status depending on sex, age, region, etc.

This project operates under contract DNO2/4, dated 13.12.2016.