A modular and extensible framework that ensures inference of a common schema of multi-model data. It enables the reuse of verified single-model approaches and can reveal both intra- and inter-model references, as well as overlapping of models, i.e., data redundancy. Following current trends, the implementation can also process large amounts of data efficiently.
To demonstrate its key contributions, MM-infer supports three DBMSs, selected to cover most of the distinct features related to schema inference. In particular:
If there is more than one possible way to treat a DBMS, a separate wrapper is implemented for each of the options.
In our demo of MM-infer, we will gradually walk through the process of schema inference over a part of the multi-model scenario from Unibench, extended to demonstrate all interesting cases.
In particular, the following steps will be taken during the demonstration:
The user selects a set of named single- or multi-model DBMSs, i.e., creates a polystore, to be used in the schema inference process. Once added, a particular DBMS instance can be edited or removed.
The user can choose from a range of supported DBMSs, select a particular database instance, and set an optional description. Finally, based on the particular DBMS wrapper, the user also specifies the connection to the database, e.g., hostname, port, database, user, and password.
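Such a connection specification could be represented, for instance, as a simple record type. The following is only an illustrative sketch; the field names and values are hypothetical and do not reflect MM-infer's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DbConnection:
    """Illustrative connection settings for one database instance."""
    dbms: str                     # e.g., "MongoDB", "PostgreSQL", "Neo4j"
    host: str = "localhost"
    port: int = 27017
    database: str = ""
    user: Optional[str] = None
    password: Optional[str] = None
    description: str = ""         # optional human-readable note

# Hypothetical example instance for one DBMS in the polystore
conn = DbConnection(dbms="MongoDB", host="db.example.org",
                    port=27017, database="unibench",
                    user="demo", description="customer orders")
```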
The set of DBMSs is automatically loaded, and the kinds stored in the respective database instances are trivially compared, resulting in redundancy recommendations (i.e., overlapping data). The user confirms or refutes these candidates.
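One way such a trivial comparison could work is by intersecting the property-name sets of each pair of kinds; pairs with shared properties become candidates for the user to confirm or refute. This is a minimal sketch under that assumption, not MM-infer's actual algorithm:

```python
def redundancy_candidates(kinds):
    """kinds: dict mapping kind name -> set of property names.
    Returns (kindA, kindB, shared properties) triples as
    redundancy candidates for user confirmation."""
    names = list(kinds)
    candidates = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = kinds[a] & kinds[b]   # overlapping property names
            if shared:
                candidates.append((a, b, shared))
    return candidates

# Hypothetical kinds from the Unibench-like scenario
kinds = {
    "Person":   {"id", "name", "address"},
    "Customer": {"id", "name", "order"},
    "Product":  {"sku", "price"},
}
cands = redundancy_candidates(kinds)
# Person and Customer overlap on {"id", "name"}; Product overlaps with neither
```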
The user selects kinds in the chosen DBMSs to be used for the inference of RSDs and selects the respective inference strategy: 1) inherit, i.e., loading an existing schema from schema-full data, 2) a verified existing strategy for particular schema-less data (e.g., JSON, XML), or 3) a universal strategy covering schema-mixed and schema-less multi-model data in general.
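The choice among the three strategies can be pictured as a simple dispatch on the properties of a kind. The flags below (`has_schema`, `format`) are illustrative assumptions, not MM-infer's internal representation:

```python
def choose_strategy(kind):
    """Illustrative dispatch among the three inference strategies.
    kind: dict with hypothetical flags describing the data."""
    if kind.get("has_schema"):                 # 1) schema-full data: inherit
        return "inherit"
    if kind.get("format") in ("json", "xml"):  # 2) verified single-model strategy
        return "existing"
    return "universal"                         # 3) general multi-model fallback
```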
Having defined the database instances and the particular kinds to be inferred, a distributed Apache Spark task processes all the data (or their schemas, if they exist) in the background and creates a local schema for each kind. In parallel, an additional task is executed to generate candidates for redundancy, references, and selected integrity constraints.
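The local-schema step fits a map-reduce shape: each record is mapped to its property-to-type signature, and signatures are merged with an associative, commutative operation, which is what makes the distributed Spark execution possible. The sketch below runs sequentially with `functools.reduce` and uses made-up records; it only illustrates the shape of the computation:

```python
from functools import reduce

def record_schema(record):
    """Map one flat record to a local schema:
    property name -> set of observed type names."""
    return {k: {type(v).__name__} for k, v in record.items()}

def merge(s1, s2):
    """Associative, commutative merge of two local schemas,
    i.e., a valid reduce step for distributed execution."""
    out = dict(s1)
    for k, types in s2.items():
        out[k] = out.get(k, set()) | types
    return out

# Hypothetical heterogeneous records of one kind
records = [{"id": 1, "name": "Alice"},
           {"id": 2, "name": "Bob", "age": 30},
           {"id": "x3", "name": "Carol"}]
local_schema = reduce(merge, map(record_schema, records))
# "id" was observed both as int and str; "age" is optional
```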
The user modifies the suggested candidates for basic integrity constraints. In the current version of MM-infer, the user confirms or refutes candidates for intra- and inter-model references, the inferred data types of particular properties, and the range of values of each property.
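A data-type and value-range candidate for a single property could be derived from its observed values roughly as follows. This is a simplified illustration, not the actual candidate generation in MM-infer:

```python
def property_candidates(values):
    """Derive a candidate data type and value range for one
    property from its observed values (illustrative only)."""
    types = {type(v).__name__ for v in values}
    dtype = types.pop() if len(types) == 1 else "mixed"
    return {"type": dtype, "min": min(values), "max": max(values)}

# Hypothetical observed values of an integer property
cand = property_candidates([3, 17, 5])
```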
In addition, the user confirms or refutes pre-computed candidates for data redundancy. The redundancy may be of two types: 1) a complete data overlap, in which the active domains of two particular kinds are equal, or 2) a partial data redundancy, where the data of one kind is a subset of the data of another kind, e.g., Customer being a subset of Person.
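The two redundancy types reduce to set equality vs. proper inclusion of the active domains. A minimal sketch, assuming the active domains are available as sets of identifying values:

```python
def redundancy_type(dom_a, dom_b):
    """Classify redundancy between two kinds by comparing their
    active domains (sets of identifying values)."""
    if dom_a == dom_b:
        return "complete"            # equal active domains
    if dom_a < dom_b or dom_b < dom_a:
        return "partial"             # one kind is a subset of the other
    return "none"

# Hypothetical active domains: Customer is a subset of Person
person   = {"p1", "p2", "p3"}
customer = {"p1", "p3"}
```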
Having merged all the local schemas and checked the candidates for integrity constraints and data redundancy, MM-infer visualizes the global schema of the chosen set of DBMSs (including references and data redundancy). Additionally, the resulting RSD can be exported into various popular formats, e.g., JSON Schema, XML Schema, etc.
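The export step can be pictured as a translation of the inferred property-to-type mapping into the target format's vocabulary. The sketch below covers only a tiny, hypothetical subset of a JSON Schema export (the type mapping and input shape are assumptions, not MM-infer's actual RSD structure):

```python
def to_json_schema(properties):
    """Translate a simple property -> type-name mapping into a
    JSON Schema fragment (illustrative subset of a real export)."""
    type_map = {"int": "integer", "str": "string", "float": "number"}
    return {
        "type": "object",
        "properties": {
            prop: {"type": type_map.get(t, "string")}
            for prop, t in properties.items()
        },
    }

# Hypothetical inferred schema of one kind
exported = to_json_schema({"id": "int", "name": "str"})
```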
Department of Software Engineering
Faculty of Mathematics and Physics
Charles University
Malostranské náměstí 25
118 00 Prague
Czech Republic
irena.holubova@matfyz.cuni.cz
pavel.koupil@matfyz.cuni.cz
sebastian.hricko@gmail.com