Software project proposal: ParaSite

 

 

Supervisor: Ondřej Bojar (bojar@ufal.mff.cuni.cz)

Project name: ParaSite Framework - Parallel Multilingual Corpora Site Framework

Participants: Karel Vandas (karel.vandas@mensa.cz), Srđan Prodanović (tarzanka@gmail.com), Miloš Ercegovčević (ercegovcevic@hotmail.com), Ximena Gutiérrez (eyesonly19@gmail.com), Tomáš Kraut (tomas.kraut@matfyz.cz) and Jan Návrat (jan.navrat@gmail.com)

 

 

1 Motivation

 

The existence of parallel corpora (see Background) is a necessary prerequisite for the Machine Translation task (see Background). Finding sources of text and building a parallel corpus is often a laborious and demanding task, especially for languages with little digital presence or for very obscure domains. Building aligned corpora (see Background) is also sometimes redundant, simply because researchers are not aware that the corpus they are building already exists elsewhere. This is precisely what we hope to offer: a single location on the Web where the community can share and use parallel corpora, automatically sorted by domain and language.

Further users who will benefit from our system are translators/translatologists and ordinary users looking for ways to improve their translations: they will be able to see the context (neighbouring words) in which the words or phrases they are translating appear, in both the source and the target language. They will also be provided with simple statistics that help disambiguate problematic words.

Access to the resources will be given on demand, and we hope to make ParaSite the key aggregator of parallel texts on the Web.

 

 

2 Background

 

Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. At its basic level, MT performs a simple substitution of words in one natural language for words in another. Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies. [MT]

The term parallel corpus is typically used in linguistic circles to refer to texts that are translations of each other; in other words, a parallel corpus contains texts in two (or more) languages. There are two main types of parallel corpus: a comparable corpus and a translation corpus. [ParallelCorpus]

A comparable corpus consists of texts that are comparable in their content. An example would be a corpus of articles about football from English and Danish newspapers, or legal contracts in Spanish and Greek. A translation corpus is a corpus where the texts in one language (L1) are translations of texts in the other language (L2).

Parallel text alignment is the task of identifying translation relationships among the sentences in the two halves of a parallel text (bitext). Sentence alignment is an important supporting task for most methods of statistical machine translation; the parameters of statistical machine translation models are typically estimated by observing aligned bitexts. [Alignment]

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. [Clustering]

In computational linguistics, clustering is helpful for identifying duplicates. Clusters are built by various, mostly statistical, methods; the resulting classes can group documents that are comparable in their form, their content, or a combination of both.
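As a rough illustration of the duplicate-detection idea (a sketch only, not part of the proposed design; the class name and the 0.9 similarity threshold are arbitrary placeholders), two documents can be compared by the cosine similarity of their bag-of-words vectors; the real system would apply proper clustering over the whole collection:

    // Illustrative sketch: near-duplicate detection via cosine similarity of term-frequency vectors.
    import java.util.HashMap;
    import java.util.Map;

    public class NearDuplicateDemo {

        // Build a bag-of-words term-frequency vector from whitespace-tokenized, lowercased text.
        static Map<String, Integer> termFrequencies(String text) {
            Map<String, Integer> tf = new HashMap<>();
            for (String token : text.toLowerCase().split("\\s+")) {
                tf.merge(token, 1, Integer::sum);
            }
            return tf;
        }

        // Cosine similarity of two term-frequency vectors.
        static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
                normA += e.getValue() * e.getValue();
            }
            for (int v : b.values()) {
                normB += v * v;
            }
            return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            String doc1 = "The agreement enters into force today";
            String doc2 = "The agreement enters into force today .";
            boolean duplicates = cosine(termFrequencies(doc1), termFrequencies(doc2)) > 0.9;
            System.out.println("near-duplicates: " + duplicates);
        }
    }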

 

 

3 Description of the project

 

The purpose of this software project is to design and implement a general NLP system whose main aim will be to aggregate parallel texts (texts in more than one language with sentence-to-sentence correspondence) from web sources and present them in a searchable interface similar to [Linguee] or [Linear]. It is also meant to serve as a common platform where researchers and ordinary users can both contribute and use the resources offered by the system. Apart from that, the system will also help with the construction of comparable corpora (no sentence-to-sentence correspondence) and with document classification by topic.

Through the web interface, users will be able to perform keyword-based or phrase-based searches and see the results as parallel sentences in both the source and the target language. A category-based search, presented as a visualization of the document clustering, will be provided as well.
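To make the intended search behaviour concrete, the sketch below shows how aligned sentence pairs might be indexed and queried with Apache Lucene (see 3.2). It is illustrative only and assumes a Lucene 8/9-style API; the field names (src, tgt, langs) and the class name are our own placeholders, not a committed design:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class ParallelSearchDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();   // in-memory index, for the demo only
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index one aligned sentence pair; the real system would index whole aligned documents.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document pair = new Document();
                pair.add(new TextField("src", "The agreement enters into force today.", Field.Store.YES));
                pair.add(new TextField("tgt", "Dohoda dnes vstupuje v platnost.", Field.Store.YES));
                pair.add(new TextField("langs", "en-cs", Field.Store.YES));
                writer.addDocument(pair);
            }

            // Phrase search on the source side; each hit is shown with both sides of the pair.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("src", analyzer).parse("\"enters into force\"");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    Document d = searcher.doc(hit.doc);
                    System.out.println(d.get("src") + "  |||  " + d.get("tgt"));
                }
            }
        }
    }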

 

 

3.1 General architecture

 

A brief overview of the functionalities is given below, together with the number of team members involved in each module (in parentheses) and the estimated number of working hours (in square brackets).

  1. Web interface (two team members) [576 hours]
     1. inputs: pairs of URLs, pairs of text files (in different formats)
     2. outputs: sentence-aligned documents (with visualized parallel sentences), statistics visualization
     3. search:
        1. keyword- and phrase-based: interactive search results
        2. category-based: as a result of document clustering
  2. Core (one team member) [288 hours]
     1. storage of data
        1. storage of documents
        2. indexing
        3. sentence alignment
  3. Modules (three team members) [864 hours]
     1. Preprocessing (two team members)
        1. crawling: domain-specific spiders
        2. PDF extraction
        3. HTML cleaning
        4. language recognition
        5. tokenization and stemming: based on the previous step
        6. document management: list of processed files and URLs, storage of the full texts, export as gzipped plain text (available through the GUI and a command-line interface or a simple HTTP request); see the export sketch below this list
     2. Alignment (one team member); see the invocation sketch below this list
        1. text indexing
        2. sentence alignment
     3. Postprocessing (one team member)
        1. deduplication: automatic recognition of documents with identical content, based on document clustering techniques
        2. document clustering: automatic grouping of documents into automatically created categories, based on their content
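As a sketch of the gzipped plain-text export mentioned under document management (illustrative only; the class name and file naming are placeholders, and a recent Java version is assumed), the export could be as simple as:

    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.GZIPOutputStream;

    public class PlaintextExporter {
        /** Writes the full text of one stored document as UTF-8 plain text compressed with gzip. */
        public static void export(String documentText, Path target) throws IOException {
            try (Writer out = new OutputStreamWriter(
                    new GZIPOutputStream(Files.newOutputStream(target)), StandardCharsets.UTF_8)) {
                out.write(documentText);
            }
        }

        public static void main(String[] args) throws IOException {
            export("Sample full text of one stored document.\n", Path.of("document-0001.txt.gz"));
        }
    }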

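For the Alignment sub-module, the sentence aligner we plan to integrate (hunalign, see 3.2) can be driven as an external process. The following is a rough sketch under the assumptions that the hunalign binary is on the PATH, that the inputs are tokenized one sentence per line, and that an (optionally empty) dictionary file is supplied; the exact flags should be checked against the hunalign documentation:

    import java.io.File;
    import java.io.IOException;

    public class HunalignRunner {
        public static void align(File dictionary, File sourceSentences, File targetSentences, File output)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                    "hunalign",                   // assumes the hunalign binary is on the PATH
                    "-text",                      // request aligned text rather than the index ladder
                    dictionary.getPath(),         // bilingual dictionary; may be an empty file
                    sourceSentences.getPath(),
                    targetSentences.getPath());
            pb.redirectOutput(output);            // hunalign writes the alignment to standard output
            pb.redirectError(ProcessBuilder.Redirect.INHERIT);
            int exitCode = pb.start().waitFor();
            if (exitCode != 0) {
                throw new IOException("hunalign failed with exit code " + exitCode);
            }
        }
    }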
 

3.2 Integration of existing tools

 

Part of the project will be the integration of a number of available third-party libraries for text processing into a cooperating framework. The table below lists the functionalities and the corresponding third-party libraries.

 

Functionality                        URL
Crawling (domain-specific spiders)   http://www.robotstxt.org/orig.html
PDF extraction                       http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html
Language recognition                 http://www.antlr.org/
Indexing                             http://lucene.apache.org/java/docs/index.html
GZip exporting                       http://java.sun.com/developer/technicalArticles/Programming/compression/
Sentence alignment                   http://mokk.bme.hu/resources/hunalign
Word alignment                       http://code.google.com/p/giza-pp/

 

The system will consist of a web interface (receiving user requests and displaying results) and a core module, which will be divided into several sub-modules, each responsible for a different task. A strictly modular approach will be imposed so that every component is completely independent. The core of the system will hold all interfaces, keeping the system extensible if more functionality is added in the future.
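As an illustration of this extension point (a sketch under our own naming assumptions, not a committed API; a recent Java version is assumed for the record syntax), the core could expose a small interface that every sub-module implements:

    import java.util.ArrayList;
    import java.util.List;

    /** A document as passed between modules: its source URL, language code and plain text. */
    record RawDocument(String url, String language, String text) {}

    /** Contract that every processing sub-module (PDF extraction, HTML cleaning, ...) would implement. */
    interface ProcessingModule {
        String name();
        RawDocument process(RawDocument input);
    }

    /** A trivial pipeline kept by the core: documents pass through the registered modules in order. */
    class Pipeline {
        private final List<ProcessingModule> modules = new ArrayList<>();

        void register(ProcessingModule module) {
            modules.add(module);
        }

        RawDocument run(RawDocument document) {
            for (ProcessingModule module : modules) {
                document = module.process(document);
            }
            return document;
        }
    }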

 

Completion date: 9 months starting from project approval

 

4 References

 

[MT] http://en.wikipedia.org/wiki/Machine_translation

[ParallelCorpus] http://www.athel.com/parallel_corpora.html and http://en.wikipedia.org/wiki/Parallel_text

[Clustering] http://en.wikipedia.org/wiki/Cluster_analysis

[Alignment] http://en.wikipedia.org/wiki/Word_alignment

[Linguee] http://www.linguee.com

[Linear] http://www.linearb.co.uk