Framework for Linked Data Integration

  • scraping
  • storing
  • cleaning
  • linking
  • providing aggregated Linked Data

Overview

The paradigm of publishing (governmental) data is shifting from data trapped in private data silos to open data, or even linked open data, giving information consumers (citizens, companies) unrestricted access to the data and enabling agile information aggregation that has not been possible until now. Such information aggregation comes with inherent problems, such as the provision of poor-quality, inaccurate, irrelevant or fraudulent information. All of these increase the cost of data aggregation, which ultimately affects data consumers' decision making and the usage and uptake of linked data applications.

Therefore, as part of the OpenData.cz initiative and the European project LOD2, we are developing a framework which will enable the creation, maintenance, and usage of the data infrastructure formed by linked open data. The overall picture of the framework is as follows:

ODCleanStore accepts arbitrary RDF data through the webservice for publishers, together with provenance metadata. The data is stored in the dirty database, where it is cleaned, scored, linked to other data, etc. Subsequently, the data is moved to the clean database, where it can be queried through the webservice for consumers. The response to a query consists of relevant RDF triples together with their provenance information and a quality estimate. OpenLink Virtuoso is used for storing the data.

The webservices communicate in standard formats (RDF/XML, TriG) in order to integrate with an arbitrary producer or consumer of data. A data acquisition module, Strigil, which obtains information from (X)HTML pages or Excel spreadsheets, converts it to RDF and feeds it to ODCleanStore, is currently under development. In the future, a data visualization and analysis module built on top of the webservice for consumers will be developed.

Strigil - Data Acquisition

Main Requirements

When we started the project, we wanted to create a specialized web crawler with RDF output that could be connected to a database storing all the downloaded data. The main requirements for this system were:

  • open
  • pluggable and easily extendable
  • cross-platform
  • distributed
  • scalable

Strigil Web Crawler

The whole system is split into two logical parts: the Data Application (DA) and the Download System (DS). The Data Application is responsible for data processing and for creating the RDF output, and contains all major data structures (UST, Frontier). The Download System, on the other hand, is responsible for the whole downloading process, which means distributing the downloads evenly over the source servers while respecting the politeness rules.

The DA produces prioritized download requests which are sent to the DS. The DS then tries to complete all the requests while respecting the priorities as much as possible. After a successful download, the files are sent to the DA for processing.

Data Application

The Data Application can again be split into two major parts: the data processing part mentioned above and the control part.

The control part is formed by a Web Application, which provides a simple and easy-to-use control interface. The whole system can be managed via a web browser. This is the place where scraping scripts are created/uploaded and managed. The Web Application is connected to resources, such as the Ontology Repository, which are necessary for creating the scraping scripts.

All the data processing and the usual crawler data structures (UST, Frontier) are included in the Scraper Engine component. Each running script is represented by a Scraping Activity, which is a process managed by the Scraper Application. The Scraper Application is responsible for starting each scraping script, managing its run, and putting it to sleep when it is done or when it is waiting for data provided by the Download System. All download requests created by a Scraping Activity are passed to the URQ Controller, where they are prioritized and then sent to the Download System to be downloaded.

Download System

The DS contains four major components: Download Manager, Downloader, DNS Resolver, and Proxy. Every component can run on a different machine and can be replicated to increase download performance.

Download Manager

The Download Manager is the "brain" of the whole Download System. It is the entry point where the prioritized download requests from the Data Application arrive. All these requests are stored in the Download Requests Buffer and are then processed by the Scheduler, which is responsible for abiding by all the politeness rules and download source restrictions. The Scheduler continuously distributes the download requests over the available Downloaders, so that every download request is passed to one particular Downloader and is assigned one Proxy through which the download is redirected. After a successful download, the Data Application receives a notification about which files were downloaded and where they can be found.
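
The following is a minimal sketch of such a politeness-aware scheduling step, assuming a fixed per-host delay; the class and method names are illustrative, not the actual Strigil API.

    import java.util.*;

    // Illustrative sketch only: picks the most urgent download request whose
    // source host may be contacted again, enforcing a minimum delay per host
    // (politeness). Requests that would violate the rule stay in the buffer.
    class SchedulerSketch {

        static class DownloadRequest {
            final String url;
            final String host;
            final int priority;            // higher value = more urgent
            DownloadRequest(String url, String host, int priority) {
                this.url = url; this.host = host; this.priority = priority;
            }
        }

        private final PriorityQueue<DownloadRequest> buffer = new PriorityQueue<>(
                Comparator.comparingInt((DownloadRequest r) -> r.priority).reversed());
        private final Map<String, Long> nextAllowedAccess = new HashMap<>();
        private final long politenessDelayMs = 2000;   // assumed per-host delay

        void submit(DownloadRequest request) {
            buffer.add(request);
        }

        Optional<DownloadRequest> nextExecutable(long now) {
            List<DownloadRequest> postponed = new ArrayList<>();
            DownloadRequest chosen;
            while ((chosen = buffer.poll()) != null) {
                long allowedAt = nextAllowedAccess.getOrDefault(chosen.host, 0L);
                if (allowedAt <= now) {
                    nextAllowedAccess.put(chosen.host, now + politenessDelayMs);
                    break;                 // this request can be handed to a Downloader
                }
                postponed.add(chosen);     // the source is still "cooling down"
            }
            buffer.addAll(postponed);      // return postponed requests to the buffer
            return Optional.ofNullable(chosen);
        }
    }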

Of course, there is always a possibility that a download request cannot be executed for some time. This can happen, e.g., when the number of downloads would exceed the number allowed for a given source and would end up blocking the Proxy IP address. Such requests are returned to the Data Application, which must re-prioritize them, taking into account the time for which they cannot be downloaded.

Downloader

Each download request from the Scheduler is sent to one particular Downloader. These requests are stored in the Source Queues component and wait to be executed. Every download request has one Proxy assigned, through which the download will be redirected.

Every Downloader needs to maintain open connections to all accessible Proxies; this is handled by the Connection Controller component. It receives a request for a connection to a particular Proxy, which is then provided to the Downloader.

Proxy

Proxy servers are used for several reasons:

  • for security reasons, to hide all the other parts of the system from the internet
  • to provide higher throughput for a single download source
  • to allow downloads to be easily redirected between Proxies if one gets blocked

DNS Resolver

The last component is the DNS Resolver. It is a very simple component that caches DNS translations and thus minimizes the number of DNS requests. Before a download request is sent to a Downloader, the domain name in the address of the downloaded file is translated to an IP address.
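
A minimal sketch of such a caching resolver is shown below, assuming a simple in-memory cache without TTL handling (a production resolver would also expire entries according to DNS TTLs).

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Illustrative caching DNS resolver: each domain name is resolved at most
    // once and subsequent lookups are answered from the in-memory cache.
    class CachingDnsResolver {

        private final ConcurrentMap<String, InetAddress> cache = new ConcurrentHashMap<>();

        InetAddress resolve(String hostName) throws UnknownHostException {
            InetAddress cached = cache.get(hostName);
            if (cached != null) {
                return cached;                                       // cache hit, no DNS request needed
            }
            InetAddress resolved = InetAddress.getByName(hostName);  // single DNS lookup
            cache.putIfAbsent(hostName, resolved);
            return resolved;
        }
    }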

Deployment

The deployment model again shows that every component mentioned above can run on a different machine and can be replicated. This provides a high level of scalability. The throughput for one source can be increased by adding new Proxy servers to the system, and the download performance can be increased by adding more Downloaders. When all of this is not enough, even the number of Download Managers can be increased.

ODCS Module

The central module of our efforts is the ODCleanStore project. The goal of the ODCleanStore project is to build a Java Web application which will clean, link, and score incoming RDF data and provide aggregated and integrated views on the data to Linked Data consumers. The module can be administered via a graphical user interface.

Data Processing

Data accepted through the webservice for publishers is stored as a named graph in the dirty database. The ODCleanStore Engine takes these named graphs and runs them through a pipeline of transformers. A transformer is a Java class implementing a defined interface. Each transformer may modify the processed named graph (e.g. normalize values, deal with blank nodes) or attach a new named graph (e.g. quality assessment results, links to the data in the clean database or links to other datasets). Custom transformers can easily be plugged into an arbitrary place in the processing pipeline.
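
As a rough illustration of this contract, a transformer interface could look like the sketch below; the method names and supporting types are assumptions and may differ from the actual ODCleanStore API.

    // Minimal sketch of a transformer contract; names are illustrative only.
    interface Transformer {
        // May modify the processed named graph in place (e.g. normalize values)
        // and/or attach additional named graphs (e.g. quality assessment results,
        // links to the clean database). The context gives access to the dirty
        // database and to the transformer's configuration.
        void transformGraph(String namedGraphUri, TransformationContext context)
                throws TransformerException;
    }

    // Supporting types assumed for the sake of the sketch.
    interface TransformationContext {
        java.sql.Connection getDirtyDatabaseConnection();
        String getConfiguration();
    }

    class TransformerException extends Exception {
        TransformerException(String message, Throwable cause) {
            super(message, cause);
        }
    }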

Several transformers have special importance with regard to the integration of data from various sources and quality assessment, and are integrated into the web user interface: Data Normalization, Quality Assessment and Object Identification. These are described in more detail below.

When a named graph has passed through all the transformers in the pipeline, it is moved to the clean database and made available for queries. The high-level architecture is captured in a UML component diagram.

Engine

The Engine is a multithreaded Java-based server that runs the input and output webservices.

The Input SOAP WebService receives incoming data in RDF/XML format together with provenance metadata describing the source and other properties of the incoming data. The input service implements a simple authentication and authorization mechanism and supports SSL.

The Output RESTful WebService makes the Query Execution module available directly from the web. It supports HTTP GET and HTTP POST and also allows the consumer to choose the output format. The Output RESTful WebService uses Restlet.
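
A consumer could call the output service roughly as sketched below; the endpoint path and the parameter names ("uri", "format") are assumptions for illustration, not the documented API.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    // Illustrative consumer of the output webservice; the endpoint and the
    // request parameters are hypothetical.
    public class UriQueryExample {
        public static void main(String[] args) throws Exception {
            String endpoint = "http://localhost:8088/odcleanstore/uri-query";    // hypothetical URL
            String queriedUri = "http://dbpedia.org/resource/Berlin";
            String request = endpoint
                    + "?uri=" + URLEncoder.encode(queriedUri, StandardCharsets.UTF_8.name())
                    + "&format=trig";                                            // assumed format switch

            HttpURLConnection connection = (HttpURLConnection) new URL(request).openConnection();
            connection.setRequestMethod("GET");

            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                reader.lines().forEach(System.out::println);   // prints the aggregated RDF response
            }
        }
    }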

According to plans predefined in the Web Frontend, the Engine runs the transformers that modify the stored data. The Engine serves as an orchestrator, calling individual transformers to accomplish well-defined, separate data transformations, one at a time. The Engine also enables the use of transformers from third-party developers.

There are three important data spaces: the dirty RDF database, the clean RDF database, and a relational database storing configuration information, such as user accounts. We use OpenLink Virtuoso to host all three databases. We store the incoming named graphs together with their provenance metadata.

Special emphasis is put on failure handling and failure recovery.

Transformers

Policies can be defined on the global level (applied to all incoming data), based on the ontology used to describe the data, or based on the source of the data. Transformers apply to incoming resources (on their way from the dirty database to the clean database) and are also run regularly on top of the clean database.

Data Normalization

The Data Normalization component is a special type of Transformer intended to be applied early in the data evaluation process to simplify the work of other transformers. Its main goal is to remove inconsistencies in the forms in which the data is provided. This is achieved by rules that specify patterns (data that comply with certain conditions) that need to be transformed and the way to transform them. The pairs of patterns and transformations are stored in the database as rules, and the set of all rules can be modified through the web frontend. The patterns are described as SPARQL conditions.
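
A sketch of how such a stored rule might be turned into an executable SPARQL 1.1 Update is shown below; the property, its namespace and the chosen normalization (trimming surrounding whitespace) are made-up examples, not rules shipped with ODCleanStore.

    // Illustrative only: builds a SPARQL 1.1 Update for one hypothetical
    // normalization rule (trim whitespace around procurement:referenceNumber
    // values) to be executed against the processed named graph.
    class DataNormalizationRuleSketch {

        static String buildUpdate(String namedGraphUri) {
            return
                "PREFIX procurement: <http://purl.org/procurement#>\n" +   // assumed namespace
                "WITH <" + namedGraphUri + ">\n" +
                "DELETE { ?s procurement:referenceNumber ?o }\n" +
                "INSERT { ?s procurement:referenceNumber ?normalized }\n" +
                "WHERE  {\n" +
                "  ?s procurement:referenceNumber ?o .\n" +
                "  BIND (REPLACE(STR(?o), '^\\\\s+|\\\\s+$', '') AS ?normalized)\n" +
                "  FILTER (STR(?o) != ?normalized)\n" +
                "}";
        }
    }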

Quality Assessment

The Quality Assessment component is a special Transformer that assigns a score to each graph, based on the coefficients of the patterns present in the graph, to reflect to what degree the contained data comply with certain rules. Each time the score of a graph changes, the total score of its publisher (domain of origin) is updated. Again, the rules, formed of patterns and coefficients, are supplied as a special resource to this particular transformer. The patterns are defined as a GroupGraphPattern followed by optional GROUP BY and HAVING clauses of SPARQL.

Here are some examples of Quality Assessment rules for the Public Contracts Ontology:

Rule | Description
{{?s procurement:referenceNumber ?o} FILTER (bif:regexp_like(?o, '[a-zA-Z]'))} | Procurement reference number must be valid
{{?s procurement:procedureType ?o}} GROUP BY ?g ?s HAVING count(?o) > 1 | Procedure type must not be ambiguous
{{?s procurement:awardDate ?a; procurement:tenderDeadline ?d.} FILTER (?d > ?a)} | Tender must not be awarded before the application deadline
{{?s procurement:numberOfTenders ?n. ?s procurement:tender ?t}} GROUP BY ?g ?s ?n HAVING count(?t) != ?n | List of tenders must have a size matching procurement:numberOfTenders
{{?s procurement:contactPerson ?c}} GROUP BY ?g HAVING count(?c) != 1 | A procurement contact person must be present
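
Evaluation of such a rule might look roughly like the sketch below: the stored pattern is wrapped into a SELECT over the processed named graph and each result row counts as one violation. The wrapping query, the scoring formula and the method names are assumptions, not the exact ODCleanStore implementation.

    // Illustrative evaluation of a Quality Assessment rule.
    class QualityAssessmentSketch {

        // Wraps the stored GroupGraphPattern into a query over one named graph;
        // an optional "GROUP BY ... HAVING ..." tail is appended after the WHERE
        // block. Required PREFIX declarations would be prepended in practice.
        static String buildRuleQuery(String groupGraphPattern, String groupByHaving,
                                     String namedGraphUri) {
            return "SELECT ?g WHERE { GRAPH ?g " + groupGraphPattern
                    + " FILTER (?g = <" + namedGraphUri + ">) } " + groupByHaving;
        }

        // Assumed scoring rule: each violation lowers the graph score by the
        // rule coefficient, keeping the score within [0, 1].
        static double applyRule(double currentScore, double coefficient, int violations) {
            return Math.max(0.0, currentScore - coefficient * violations);
        }
    }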

Object Identification

The Object Identification component is a special implementation of a Transformer. The main purpose of this component is to interlink URIs which represent the same real-world entity by generating owl:sameAs links. It can also be used for creating other types of links between differently related URIs. The Silk framework is used as the linking engine. Sets of linkage rules for the engine are written in Silk-LSL, stored in the database, and can be managed through our Web Frontend.

Query Execution

The Query Execution component is the logic behind the Output RESTful WebService for data consumers. Its purpose is to retrieve all RDF triples relevant to the given query from the clean database, apply conflict resolution, and return an aggregated view of the data. Currently, two types of queries are supported:

  • URI query - retrieves all triples where the given URI appears as the subject or the object,
  • keyword query - fulltext search for occurrences of the given keyword(s).

Besides triples returned in response to a query, Query Execution also loads all data necessary for the Conflict Resolution process, such as provenance metadata (who created it, when etc.) or relevant owl:sameAs links.

Consumers can also query the application using the SPARQL query language. In this case, however, the information about metadata and data quality scores is limited and conflict resolution is not supported.

Examples

Suppose the clean database contains information about the city of Berlin obtained from multiple sources (e.g. DBpedia, Freebase). First, the user can search for the term "Berlin" or "Berlin Germany" using the keyword query. A list of triples in which the searched keywords occur is returned (using a fulltext search). These may include, e.g., the triple <http://dbpedia.org/resource/Berlin> rdfs:label "Berlin".

Now the user can search for more information by using the URI query for <http://dbpedia.org/resource/Berlin> and receive all triples where this URI occurs. The result will also include data about all other URI resources that represent the same concept (i.e. are linked to the searched URI by owl:sameAs). Query Execution also adds metadata about the provenance of the result and triples with human-readable labels for URI resources in the result, so that applications built on top of ODCleanStore may display a more user-friendly output.

The search can be further refined by limiting the result to data newer than a given date or having a Quality Assessment score higher than a given threshold.

Conflict Resolution

The Conflict Resolution component resolves instance-level conflicts (conflicting values) at query time, aggregates the values according to given aggregation policies, and calculates a quality estimate of each resulting triple. Aggregation policies are defined by the ontology creators, but can be further customized for a particular query by the data consumer.

Conflict Resolution enables the integration of data originating from multiple sources of varying quality. These sources may agree on some information or they may provide conflicting data. Conflict Resolution aggregates the values representing the objects of triples sharing the same subject and predicate. The user can choose from several aggregation strategies, e.g. ANY, ALL, BEST, CONCAT, LATEST, AVG, MEDIAN, MIN. Because data from different sources may use different vocabularies and resource identifiers, ontology mappings and resource mappings are considered during conflict resolution.

The output of Conflict Resolution consists of triples with resolved conflicts, and for each such triple a list of its sources and a quality estimate. The quality depends on the scores of the triple's sources; it is increased when multiple sources agree on the same value and, conversely, decreased when conflicting values appear.
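
A simplified sketch of one aggregation strategy (AVG) and of a quality estimate that penalizes disagreement is given below; the actual ODCleanStore formula may differ, so treat the computation as an assumption. Applied to the three conflicting populationTotal values in the example below, it would return their average with a quality lowered by their spread.

    import java.util.List;

    // Illustrative AVG aggregation over conflicting numeric object values,
    // with a simplified quality estimate (not the exact ODCleanStore formula).
    class AvgAggregationSketch {

        static class SourcedValue {
            final double value;
            final double sourceScore;   // Quality Assessment score of the source graph
            SourcedValue(double value, double sourceScore) {
                this.value = value;
                this.sourceScore = sourceScore;
            }
        }

        // Aggregated value: the average of the conflicting values.
        static double aggregate(List<SourcedValue> values) {
            return values.stream().mapToDouble(v -> v.value).average().orElse(Double.NaN);
        }

        // Simplified quality estimate: the mean source score, scaled down by the
        // relative spread of the values (full agreement keeps the mean score,
        // large disagreement lowers it).
        static double qualityEstimate(List<SourcedValue> values, double aggregated) {
            double meanScore = values.stream().mapToDouble(v -> v.sourceScore).average().orElse(0.0);
            double maxDeviation = values.stream()
                    .mapToDouble(v -> Math.abs(v.value - aggregated))
                    .max().orElse(0.0);
            double spread = aggregated == 0.0 ? 0.0 : maxDeviation / Math.abs(aggregated);
            return meanScore / (1.0 + spread);
        }
    }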

Example

We will continue with the example query for <http://dbpedia.org/resource/Berlin> from above. The result will contain triples such as the geographical location of Berlin, its population, its human-readable name, etc. As the first step, Conflict Resolution translates the URI <http://rdf.freebase.com/ns/en.berlin>, identifying Berlin at Freebase, to its equivalent <http://dbpedia.org/resource/Berlin>; the same is done with other URI resources and properties so that conflicts can be resolved.

Then each set of triples with conflicting values in place of the object is aggregated using the given aggregation policies. For example, the user may choose AVG for the values of latitude and longitude, BEST for labels (rdfs:label), and ALL as the default aggregation for all other properties. The result of the aggregation may look like the following table:

Subject | Predicate | Object | Quality | Source named graphs
dbpedia:Berlin | rdfs:label | "Berlin"@en | 0,88505 | ex:linkedgeodata, ex:geonames, ex:dbpedia, ex:freebase
dbpedia:Berlin | geo:lat | "52.516314575" | 0,82446 | ex:dbpedia, ex:linkedgeodata, ex:freebase, ex:geonames
dbpedia:Berlin | geo:long | "13.40274009" | 0,82446 | ex:linkedgeodata, ex:dbpedia, ex:geonames, ex:freebase
dbpedia:Berlin | dbprop:populationTotal | "3420768" | 0,79598 | ex:linkedgeodata
dbpedia:Berlin | dbprop:populationTotal | "3426354" | 0,79667 | ex:geonames
dbpedia:Berlin | dbprop:populationTotal | "3450889" | 0,89663 | ex:dbpedia
dbpedia:Berlin | rdf:type | http://schema.org/City | 0,92000 | ex:dbpedia, ex:freebase
dbpedia:Berlin | rdf:type | http://schema.org/Place | 0,90000 | ex:dbpedia
dbpedia:Berlin | dbprop:country | dbpedia:Germany | 0,90000 | ex:dbpedia
dbpedia:Germany | rdfs:label | "Deutschland" | 0,90000 | ex:dbpedia2
geo:lat | rdfs:label | "Latitude" | 1,00000 | ex:labels
geo:long | rdfs:label | "Longtitude" | 1,00000 | ex:labels

The user can also customize the aggregation error strategy, i.e. what happens when, for instance, a non-numeric value is processed by the AVG aggregation. Such a value may be either discarded or included in the output. The quality calculation can also be influenced.

Web Frontend

The Web Frontend component is built on top of the Apache Wicket web framework. It is a professional, commonly used, component-oriented framework. It is easy to use and provides a convenient abstraction over the HTTP protocol, which makes web development easier and more robust.

At the DAO tier we use the dependency injection and JdbcTemplate facilities of the Spring application development framework.
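
A minimal sketch of what a DAO built on these facilities might look like follows; the table and column names are hypothetical, not the actual ODCleanStore schema.

    import java.util.List;
    import org.springframework.jdbc.core.JdbcTemplate;

    // Illustrative DAO using Spring's JdbcTemplate; the schema is hypothetical.
    public class UserAccountDao {

        private final JdbcTemplate jdbcTemplate;

        // The JdbcTemplate is supplied by the Spring container (constructor injection).
        public UserAccountDao(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        public List<String> findUsernamesByRole(String roleName) {
            return jdbcTemplate.queryForList(
                    "SELECT u.username FROM users u JOIN roles r ON u.role_id = r.id WHERE r.name = ?",
                    String.class,
                    roleName);
        }
    }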

The Web Frontend allows basic configuration of the whole application, in particular:

  • managing user accounts and the roles associated with them (i.e. restricting permissions to use various parts of the website and to insert data through the input services),
  • managing ontologies,
  • managing all kinds of policies (policies for the Object Identification component, data aggregation policies, Quality Assessment policies and ontology mapping policies),
  • configuring (custom) transformers and the engine.

The configurations are expected to be done through simple HTML forms. A more convenient (AJAX-based) user interface will be implemented in a future version.

The Web Frontend component communicates with the rest of the system (especially the Engine component) directly through the relational and RDF databases.

Other Features

Ontology Maintenance & Mapping

The project will maintain ontologies describing the consumed data (explicitly imported) and enable the creation of mappings between these ontologies, which is a crucial aspect when aggregating data. Mappings between ontologies will be taken into account during query answering.

Roles

The application will support several user roles:

  • administrator - rights to assign other roles and to adjust settings of the application and its components,
  • ontology creator - rights to adjust ontologies, ontology mappings and aggregation policies,
  • policy creator - rights to adjust policies for the Transformers (the Quality Assessment component, the Object Identification component, cleaners),
  • scraper - rights to insert data and list registered ontologies,
  • user - rights to query the application.

Contact information

The ODCleanStore project is developed by a group of six developers under the XRG at KSI MFF UK. A list of all participants follows:

Initiatives