Informatika - Softwarové a Datové Inženýrství / Computer Science Master Programs

NPRG069 Softwarový projekt / Software project
NPRG070 Výzkumný projekt / Research project
NPRG071 Firemní projekt / Company project

Important documents

Project proposals

Advisor to identify genuine functional dependencies

Parent project: MM-Infer

Currently, there are a number of tools for detecting functional dependencies in relational data (e.g. Metanome). The approaches on which these tools are based are usually optimized for small data samples, which may lead to the detection of functional dependencies that are only valid on a given sample by chance. In a real and larger dataset, these functional dependencies may not be valid.

The goal of this research project is to implement a tool that will not only detect functional dependencies in the data, but more importantly focus on eliminating spurious functional dependencies that are only valid in a small sample of the data. A key component of the tool will be the use of so-called negative examples - data records that purposely violate the detected functional dependencies but still correspond to potentially real data. The goal is to keep the number of these negative examples as small as possible, yet eliminate false functional dependencies as efficiently as possible.
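
The idea of a negative example can be illustrated with a minimal sketch. The helper `fd_holds`, the sample records, and the chosen negative example are all hypothetical and only show the principle, not the method the project will use.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The first value seen for a key is remembered; any other value violates the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical small sample of a larger dataset.
sample = [
    {"zip": "11000", "city": "Praha", "name": "Novak"},
    {"zip": "60200", "city": "Brno", "name": "Svoboda"},
    {"zip": "11000", "city": "Praha", "name": "Dvorak"},
]

# Both dependencies hold on the sample, but name -> city holds only by chance.
genuine = fd_holds(sample, ["zip"], ["city"])
spurious = fd_holds(sample, ["name"], ["city"])

# Hypothetical negative example: a plausible record that violates name -> city.
negative_example = {"zip": "60200", "city": "Brno", "name": "Novak"}
extended = sample + [negative_example]
genuine_after = fd_holds(extended, ["zip"], ["city"])
spurious_after = fd_holds(extended, ["name"], ["city"])
```

A single well-chosen record removes the spurious dependency while leaving zip -> city intact, which is the trade-off the project aims to optimize.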

Furthermore, interaction with domain experts (e.g., crowdsourcing) can play an important role in assessing whether the proposed negative examples correspond to real data values without accidentally disturbing the actual functional dependencies valid in the domain.

The research project can be carried out together with the follow-up thesis "Identification of genuine functional dependencies".

Contact: Pavel Koupil

Advisor for schema conversion from logical to conceptual

Parent project: MM-Infer

Existing tools for extracting schemas from data collections can generate schemas that faithfully match the logical structure of the data. However, these schemas are often disconnected, too complex, or contain repetitive structural elements. These problems are caused, for example, by redundancy in datasets, recursive data structures, or inconsistencies in attribute and entity naming across different datasets.

The goal of the research project is to implement tools that convert the extracted logical schema into a conceptual form. The resulting conceptual schema will be comprehensible, compact and usable by the user without losing fidelity to the original data structure. This schema simplification will be achieved through redundancy detection, resolution of inconsistencies in the data, and merging of similar or duplicate elements. Existing methods such as detecting inconsistencies in the data, schema matching based on ontologies and entity or attribute naming, or techniques for detecting recursive structures in the data can be used for this purpose.
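
One of the simplification steps, merging duplicate elements, can be sketched as follows. This is only an illustration under simple assumptions: the schema is a hypothetical dictionary of entity names and attribute lists, and two entities count as duplicates when their attribute names match after a crude normalization. Real schema matching would need ontologies or similarity measures.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lower-case a name and drop separators so 'first_name' equals 'firstName'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def merge_duplicates(schema):
    """Group entities whose normalized attribute sets are identical."""
    groups = defaultdict(list)
    for entity, attributes in schema.items():
        signature = frozenset(normalize(a) for a in attributes)
        groups[signature].append(entity)
    return sorted(sorted(names) for names in groups.values())


# Hypothetical extracted logical schema with inconsistent naming.
logical_schema = {
    "customer": ["id", "first_name", "email"],
    "Client": ["ID", "firstName", "e-mail"],
    "order": ["id", "total"],
}
merged = merge_duplicates(logical_schema)
```

Here `customer` and `Client` end up in one group and become candidates for a single conceptual entity.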

The research project can be carried out together with the follow-up thesis "Transformation of a logical schema into a conceptual schema".

Contact: Pavel Koupil

Tool for database schema optimization

Parent project: MM-cat

The goal of the project is to implement a tool that, based on changes in the structure of the data and/or in the queries over the data, proposes changes to the database schema so that the queries are evaluated more efficiently. Both the data and the database system are assumed to be multi-model. The implementation is expected to use (integrate) the following existing tools:
- MM-evocat, which allows the user to manually create a conceptual model of a given domain and map it to a chosen combination of multiple models represented in one or more databases (currently PostgreSQL, MongoDB, and neo4j), i.e., to specify which parts of the domain are to be represented as a table, a graph, or a document.
- MM-infer, which allows such a conceptual model to be inferred semi-automatically from the given data.
- MM-quecat, which supports querying via a simplified SPARQL language over the conceptual model of the given domain (represented as a graph).
The goal of the project is to use these tools to create an initial design of the database schema and to modify it subsequently. The tools support this functionality (to a limited extent), but at present the user has to specify everything manually. The target tool should be able to do this (semi-)automatically (i.e., the user is asked only when there are multiple options) based on the input data and its changing structure, the changing queries and possibly the efficiency of their evaluation, knowledge of the history of previous user decisions, etc. For proposing schema changes, it is sufficient to create a small but extensible set of suitable heuristic rules - the goal of the project is to create a suitable user interface for entering the inputs, to integrate and extend the listed tools, and to build a proof of concept. The project can be extended into a diploma thesis that replaces the trivial heuristic rules with more sophisticated algorithms, possibly an LLM or a suitably trained neural network, and experimentally compares them with the proof-of-concept solution.
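
A small but extensible rule set could be organized as a registry of independent functions. The sketch below is purely illustrative: the two heuristics, their thresholds, and the shape of the `workload` statistics are assumptions made for this example, not part of the existing tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    rule: str
    change: str


RULES = []


def rule(func):
    """Register a heuristic so new rules can be added without touching the engine."""
    RULES.append(func)
    return func


@rule
def embed_frequently_joined(stats):
    # Hypothetical heuristic: kinds joined in most queries are candidates for one document.
    for (left, right), share in stats["join_share"].items():
        if share >= 0.8:
            yield Proposal("embed_frequently_joined", f"embed {right} into {left}")


@rule
def graph_for_deep_traversals(stats):
    # Hypothetical heuristic: long path queries suggest a graph representation.
    for kind, depth in stats["avg_path_length"].items():
        if depth >= 3:
            yield Proposal("graph_for_deep_traversals", f"map {kind} to a graph")


def propose_changes(stats):
    """Run every registered rule and collect its proposals."""
    return [p for r in RULES for p in r(stats)]


# Hypothetical workload statistics collected from observed queries.
workload = {
    "join_share": {("order", "order_item"): 0.9, ("order", "customer"): 0.2},
    "avg_path_length": {"friendship": 4.0, "order": 1.0},
}
proposals = propose_changes(workload)
```

When several proposals conflict, the tool would ask the user to choose, in line with the (semi-)automatic mode described above.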

Contact: Jáchym Bártík

Querying over multi-model data

Parent project: MM-cat

Most applications need some persistent storage. The most common way to store data is to use a relational database. However, in more complex use cases, it's much more efficient to store the data in different data models (e.g., graph, document, or key-value) and databases (e.g., Neo4j, MongoDB, or Redis). We want to develop a tool that would enable users to query all these databases in a unified way.

There already is the MM-cat framework that allows us to create a unified, database-independent model of the data and then map its parts to the specific databases. We have also developed MMQL, a SPARQL-like query language over the unified model, and implemented the most basic use cases for three databases (PostgreSQL, Neo4j, MongoDB). The goal of the project is to implement the rest, specifically:
- filtering,
- sorting,
- aggregations,
- optional statements,
- nested queries.
The project can be extended to a diploma thesis, where the focus will be on implementing query rewriting/updating. This means that when the unified model of the data changes, the tool will automatically update the queries to reflect these changes. There are several ways to do this, e.g., heuristics, machine learning, or formal methods.

Contact: Jáchym Bártík

Propagation of changes to multi-model data

Parent project: MM-cat

Data can be stored in various formats (e.g., graph, document, or key-value) and databases (e.g., Neo4j, MongoDB, or Redis). We have developed a framework (called MM-cat) that allows us to access all these databases at once. For this, we use a unified, database-independent model, which is then mapped to the specific databases. However, all applications need to evolve over time, and so does the data model. We want to allow the users to propagate the changes:
- from the unified model to the mappings between it and the databases,
- from the mappings to the data stored in the databases.
The goal of the project is to implement algorithms for the propagation, as well as a GUI for the users to interact with. The theory behind this is already here - we have defined operations that can be applied to the unified model, and we have derived how they should affect the mappings and the data. The focus of the project will be on the implementation of these operations. In many cases, the propagation will require manual input from the users. Therefore, the project can be extended to a diploma thesis, where the focus will be on automating the propagation via machine learning methods.
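
To make the first propagation step concrete, here is a minimal sketch of one operation, renaming a property, applied to the unified model and propagated to the mappings. The data structures and the function `rename_property` are hypothetical and do not reflect the actual MM-cat operations or their formal definitions.

```python
def rename_property(model, mappings, kind, old, new):
    """Apply a rename to the unified model and propagate it to every mapping."""
    model[kind] = [new if p == old else p for p in model[kind]]
    for mapping in mappings:
        if mapping["kind"] == kind and old in mapping["properties"]:
            # The physical name (column, field) stays; only the model side changes.
            mapping["properties"][new] = mapping["properties"].pop(old)
    return model, mappings


# Hypothetical unified model and its mappings to two databases.
model = {"customer": ["id", "surname"]}
mappings = [
    {"database": "PostgreSQL", "kind": "customer",
     "properties": {"id": "id", "surname": "last_name"}},
    {"database": "MongoDB", "kind": "customer",
     "properties": {"id": "_id"}},
]
rename_property(model, mappings, "customer", "surname", "family_name")
```

In this simple case no data has to be touched. Operations that move or split properties would also require the second step, from the mappings to the stored data, and possibly manual input.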

Contact: Jáchym Bártík

Visual querying over multi-model data

Related projects: MM-quecat, MM-evoque.

The goal of the project is to implement a tool that allows the user to specify a query over given graph data by interacting with a visualized graph. The user marks the required graph pattern and adds further information to it (conditions, placeholders, etc.). The tool will be able to express such a query in a predefined subset of SPARQL, and conversely to visualize a given simplified SPARQL query in the graph. A further feature will be query evaluation. The implementation will build on an existing toolset created in other student projects. Thanks to it, not only the graph (representing the conceptual model of some domain) will be available, but also its mapping to one or more database systems (currently PostgreSQL, neo4j, or MongoDB). Query evaluation will consist of decomposing the query according to the given mapping, translating the decomposed parts into the languages of the respective database systems, evaluating the subqueries in those systems, and merging the intermediate results into the final result. For merging the intermediate results, a naive approach without optimizations is sufficient - the goal of the project is to deliver a proof of concept. Further optimization of the evaluation and an experimental comparison with the naive solution can be the subject of a follow-up diploma thesis.
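
The naive merging of intermediate results can be pictured as a nested-loop join on the variables that two subquery results share. The sketch below assumes each subquery returns a list of variable bindings; the function name and the sample results are hypothetical.

```python
def naive_join(left, right):
    """Nested-loop join of two lists of variable bindings on their shared variables."""
    result = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            # Two bindings are compatible when they agree on every shared variable.
            if all(l[v] == r[v] for v in shared):
                result.append({**l, **r})
    return result


# Hypothetical intermediate results of two subqueries evaluated in different systems.
from_postgres = [{"customer": 1, "name": "Alice"}, {"customer": 2, "name": "Bob"}]
from_mongo = [
    {"customer": 1, "order": "A-17"},
    {"customer": 1, "order": "A-18"},
    {"customer": 3, "order": "B-02"},
]
merged = naive_join(from_postgres, from_mongo)
```

The quadratic cost of this approach is exactly the kind of inefficiency that later optimization could address.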

Contact: Irena Holubová

Harnessing the Power of Large Language Models for Conceptual Modeling in Dataspecer

Parent project: Dataspecer - Data modelling and schema generation

Conceptual modeling plays a crucial role in data and software engineering, serving as the foundation for designing complex systems and offering a clear representation of the domain. However, creating precise and comprehensive conceptual models remains a challenge for many engineers. In this project, we aim to explore how large language models can be harnessed to develop an intelligent assistant that guides human modelers in creating structured conceptual models from an abundance of existing resources. These include unstructured resources, such as verbose textual sources, emails, and stakeholder interview notes, as well as structured resources, such as existing database structures or Excel spreadsheets where users collect their proprietary data. The goal is to support and streamline the modeling process, enhancing the overall quality of data and software engineering projects.

In this project, students will have the opportunity to delve into cutting-edge artificial intelligence and data engineering technologies by working with large language models and applying them to real-world problems. The project's success could open new doors for AI-driven conceptual modeling assistance tools and revolutionize how we design and develop intricate systems. By joining this project, you'll contribute to the future of data and software engineering, honing your skills and making a tangible impact on the field. Don't miss this chance to be part of an innovative research endeavor that could transform the way we approach conceptual modeling in data and software engineering with the help of intelligent assistants.

Contact: Martin Nečaský

Revolutionizing Data Querying with AI-Powered Chatbots

Parent project: Dataspecer - Data modelling and schema generation

Imagine you're working at your university's data center, where there are countless databases containing tons of valuable information. These databases aren't just used for the school's own systems, but they're also used to generate reports, analyze data, and manage information in general. Basically, the entire school depends on being able to find answers to questions hidden in the data as quickly as possible. But there's a problem - there are only a few employees who really understand both the organization and the databases, and they're the only ones who can translate management's questions into database queries. This creates a bottleneck, where the people who have questions far outnumber the employees who can help them.

The goal of this project is to create a chatbot that anyone in the organization can use to ask questions. The chatbot will refine the query through conversation and eventually translate it into an evaluatable database query. This means that instead of relying on a few knowledgeable employees, anyone can get the information they need quickly and easily. To make this possible, we'll use Large Language Models (LLMs). But we won't stop there - we'll refine the chatbot's knowledge with a conceptual data model of the organization and technical data artifacts generated from it in Dataspecer.

By joining this research project, you'll play an important role in transforming how your university's data center operates, making valuable information more accessible to everyone. You'll have the opportunity to work with cutting-edge Large Language Models and contribute to the development of a chatbot that streamlines database querying. This experience will not only enhance your technical skills but also enable you to make a tangible impact on the way data-driven insights are obtained within the organization. Don't miss this chance to be part of an important project that combines advanced technology and practical problem-solving.

Contact: Martin Nečaský

Web-based plugin for protein-ligand docking results visualization

Parent project: PrankWeb

Prediction of protein-ligand binding sites is often one of the first steps in a drug discovery pipeline. Previously, we developed a state-of-the-art tool for protein-ligand binding site prediction called P2Rank that is available either as a stand-alone tool or as a web service, PrankWeb. PrankWeb is currently used by more than 1,000 unique users per month. Recently, we started to extend its capabilities towards enabling different types of analysis over the predicted binding sites. One such new feature is the ability to specify a small molecule and predict how it would bind to the predicted binding site. This process is called docking. However, the web server currently cannot interactively visualize the docking results. The goal of this project is to develop a web-based plugin for visualizing the docking results using Mol*, which is already used within the web server.

The project does not require any prior knowledge of molecular biology.

Contact: David Hoksza

Module for the detection of similar protein-ligand binding sites

Parent project: PrankWeb

Prediction of protein-ligand binding sites is often one of the first steps in a drug discovery pipeline. Previously, we developed a state-of-the-art protein-ligand binding site prediction tool called P2Rank, which is available either as a stand-alone tool or as a web service, PrankWeb. PrankWeb is currently used by more than 1,000 unique users per month. Recently, we started to extend its capabilities towards enabling different types of analysis over the predicted binding sites. This project aims to develop a module for PrankWeb that detects binding sites similar to the predicted ones in databases of known proteins. This requires a method for fast detection of similar local protein 3D structures, possibly with the help of FoldSeek, as current structure databases such as the AlphaFold Protein Structure Database can contain hundreds of millions of protein structures. Eventually, the developed pipeline will be available as a Docker module for PrankWeb.

The project does not require any prior knowledge of molecular biology.

Contact: David Hoksza

Module for the detection and exploration of molecular tunnels starting in predicted protein-ligand binding sites

Parent project: PrankWeb

Prediction of protein-ligand binding sites is often one of the first steps in a drug discovery pipeline. Previously, we developed a state-of-the-art protein-ligand binding site prediction tool called P2Rank, which is available either as a stand-alone tool or as a web service, PrankWeb. PrankWeb is currently used by more than 1,000 unique users per month. Recently, we started to extend its capabilities towards enabling different types of analysis over the predicted binding sites. This project aims to develop a module for PrankWeb that explores molecular tunnels starting in the predicted binding sites. Quite often, when a molecule binds to a protein, it uses so-called tunnels to travel inside the protein structure. There are databases of tunnels (such as ChannelsDB) and tools for tunnel detection (such as MOLE). The module, eventually implemented as a Docker module, would use these external resources to connect the predicted binding sites with tunnels and visualize the results in the PrankWeb interface (an existing visualization library will be used for this, so no knowledge of graphics is needed).

The project does not require any prior knowledge of molecular biology.

Contact: David Hoksza

Phosphorylation binding sites prediction with protein language models

Parent project: PrankWeb

Protein language models (pLMs) are a novel technique that harnesses extensive sequence datasets to generate high-dimensional contextual embeddings for every amino acid within a protein sequence. Recently, we have successfully applied pLMs to achieve state-of-the-art performance in predicting protein-ligand binding sites. This project aims to apply pLMs to detect phosphorylation sites. These are specific locations on proteins where a phosphate group attaches, triggering various downstream processes. The project is a collaboration with the Department of Cell Biology of the Faculty of Science, which will provide the data and biological expertise to enrich the prediction process.

Contact: David Hoksza

AlphaFold2 predictions analysis

Parent project: PrankWeb

Recently, the DeepMind (Google) team published a deep learning method for protein structure prediction called AlphaFold2 (AF2), which has far surpassed existing methods.

The goal of this data-driven project is to analyze AF2 predictions. The focus will be on determining how much AF2 actually contributes to the expansion of the protein structure space, with respect to the AF2 "confidence score", which indicates how confident a prediction is at a particular location.
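
A first step of such an analysis could be a per-structure confidence profile. The sketch below assumes that the "confidence score" is the per-residue pLDDT value on a 0-100 scale and uses the commonly cited bands (above 90, 70-90, 50-70, below 50). The function name and the sample scores are made up for illustration.

```python
def confidence_profile(plddt):
    """Return the share of residues falling into each pLDDT confidence band."""
    bands = {"very_high": 0, "confident": 0, "low": 0, "very_low": 0}
    for score in plddt:
        if score > 90:
            bands["very_high"] += 1
        elif score > 70:
            bands["confident"] += 1
        elif score > 50:
            bands["low"] += 1
        else:
            bands["very_low"] += 1
    total = len(plddt)
    return {band: count / total for band, count in bands.items()}


# Hypothetical per-residue scores of one predicted structure.
scores = [95.2, 91.0, 84.5, 66.0, 42.3, 38.9, 72.1, 93.4]
profile = confidence_profile(scores)
```

Aggregating such profiles over many predictions would show how much of the newly covered structure space is backed by confident regions.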

Contact: David Hoksza

Generating documentation and data format schemas based on an ontological model

Parent project: Dataspecer - Data modelling and schema generation

Data and data interoperability are an increasingly important part of software development. A software system is often built on data, processes it, and exchanges it with other systems. This requires developing a number of artifacts: data schemas describing the structure of the data (e.g., JSON schemas, CSV schemas, XML schemas, or RDF vocabularies), code that handles data structured in this way, transformation code that converts data from one form to another (e.g., between two different JSON formats), and REST APIs for exchanging data between different systems. Last but not least, everything has to be documented. That is a lot of programming, testing, and documentation work. Yet much of it is unnecessary effort, because it is routine and trivial. In our research project, we are looking for ways to automate it. The student will join the research project, in which we develop methods and software for generating data documentation and schemas for various data formats based on an ontological model of a given domain (e.g., tourist destinations).
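
As a rough illustration of schema generation, the following sketch derives a JSON Schema from one class of a toy ontological model. The model format, the datatype table, and the function `to_json_schema` are assumptions made for this example and are not Dataspecer's actual representation or API.

```python
# Hypothetical mapping from model datatypes to JSON Schema types.
DATATYPE_TO_JSON = {
    "string": "string",
    "integer": "integer",
    "decimal": "number",
    "boolean": "boolean",
}


def to_json_schema(cls):
    """Derive a JSON Schema object from one class of a toy ontological model."""
    properties = {}
    required = []
    for prop in cls["properties"]:
        properties[prop["name"]] = {
            "type": DATATYPE_TO_JSON[prop["datatype"]],
            "description": prop["label"],
        }
        # A minimum cardinality of 1 or more makes the property mandatory.
        if prop["min"] >= 1:
            required.append(prop["name"])
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "title": cls["label"],
        "type": "object",
        "properties": properties,
        "required": required,
    }


# Hypothetical class from the tourist destinations domain.
tourist_destination = {
    "label": "Tourist destination",
    "properties": [
        {"name": "name", "label": "Name of the destination", "datatype": "string", "min": 1},
        {"name": "capacity", "label": "Visitor capacity", "datatype": "integer", "min": 0},
    ],
}
schema = to_json_schema(tourist_destination)
```

The same model could drive generators for other formats and for the documentation, so a change in the model would need to be made only once.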

Contact: Jakub Klímek

Generating user-friendly diagrams for data specifications

Parent project: Dataspecer - Data modelling and schema generation

Currently, when a data specification contains a diagram of what the specification is about, the diagram is created by hand. The diagram is a very important part of the data specification, because it gives people a quick overview of what the specification is about. However, creating a usable layout is very time-consuming and often distracts authors from the actual contents of the specification while it is being created. The goal of this project is to come up with a convenient way of creating diagrams for data specifications so that the authors can focus on the content and do not have to deal with the layout. This may include collecting additional information about the concepts in the specification, configuring existing auto-layout mechanisms, etc.

Contact: Jakub Klímek
