Software project proposal: ParaSite
Supervisor: Ondřej Bojar (bojar@ufal.mff.cuni.cz)
Project name: ParaSite Framework - Parallel Multilingual Corpora Site Framework
Participants: Karel Vandas (karel.vandas@mensa.cz), Srđan Prodanović (tarzanka@gmail.com), Miloš Ercegovčević (ercegovcevic@hotmail.com), Ximena Gutiérrez (eyesonly19@gmail.com), Tomáš Kraut (tomas.kraut@matfyz.cz) and Jan Návrat (jan.navrat@gmail.com)
The existence of parallel corpora (see Background) is a necessary prerequisite for the Machine Translation task (see Background). Finding sources of text and building a parallel corpus is often laborious and demanding, especially for languages with low digital representation or for highly obscure domains. The work of building aligned corpora (see Background) is sometimes redundant, mainly because researchers are not aware that the corpus they are building already exists elsewhere. We hope to address precisely this problem by providing a single location on the Web where the community can share and use parallel corpora, sorted automatically by domain and language.
Additional types of users who will benefit from our system are translators, translatologists and ordinary users looking for a way to improve their translations. They will be able to see what the context (neighbouring words) looks like in both the source and the target language for the words or phrases they want to translate. They will also be provided with simple statistics that help to disambiguate problematic words.
Access to resources will be given on demand, and we hope to make ParaSite the key aggregator of parallel texts on the Web.
Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. At its basic level, MT performs simple substitution of words in one natural language for words in another. Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies. [MT]
The term parallel corpus is typically used in linguistic circles to refer to texts that are translations of each other; in other words, a parallel corpus contains texts in two (or more) languages. There are two main types of parallel corpus: a comparable corpus and a translation corpus. [ParallelCorpora]
A comparable corpus consists of texts that are comparable in their content. An example would be a corpus of articles about football from English and Danish newspapers, or legal contracts in Spanish and Greek. A translation corpus is a corpus in which the texts in one language (L1) are translations of texts in the other language (L2).
Parallel text alignment is the task of identifying translation relationships among the sentences in both halves of the parallel texts (bitexts). Sentence alignment is an important supporting task for most methods of statistical machine translation; the parameters of statistical machine translation models are typically estimated by observing aligned bitexts. [Alignment]
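The alignment task described above can be illustrated with a minimal sketch. It is a simplified length-based aligner using dynamic programming, loosely in the spirit of classical length-based methods; it is not the aligner this project will integrate. The names BEAD_PENALTY, bead_cost and align_sentences, as well as the penalty values, are assumptions made for illustration only.

```python
from typing import List, Tuple

Bead = Tuple[List[str], List[str]]

# Allowed bead shapes: (source sentences used, target sentences used) -> penalty.
# The penalty values are illustrative assumptions, not tuned parameters.
BEAD_PENALTY = {(1, 1): 0, (2, 1): 5, (1, 2): 5, (1, 0): 10, (0, 1): 10}


def bead_cost(src: List[str], tgt: List[str]) -> float:
    """Character-length mismatch of a bead plus the penalty for its shape."""
    src_len = sum(len(s) for s in src)
    tgt_len = sum(len(t) for t in tgt)
    return abs(src_len - tgt_len) + BEAD_PENALTY[(len(src), len(tgt))]


def align_sentences(source: List[str], target: List[str]) -> List[Bead]:
    """Return the cheapest sequence of beads covering both sentence lists."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by each bead shape.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEAD_PENALTY:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                candidate = cost[i][j] + bead_cost(source[i:ni], target[j:nj])
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (di, dj)

    # Trace back from the end of both texts to recover the chosen beads.
    beads: List[Bead] = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((source[i - di:i], target[j - dj:j]))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


if __name__ == "__main__":
    english = ["It rained.", "We stayed home."]
    german = ["Es regnete, also blieben wir zu Hause."]
    for src, tgt in align_sentences(english, german):
        print(src, "<->", tgt)
```

Sentence length is only a rough signal; a production aligner would typically combine it with lexical evidence.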
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. [Clustering]
Clustering in computational linguistics is helpful for identifying duplicates. Clusters are built by various methods, mostly statistical. Documents in the same class can be comparable in their form, their content, or a combination of both.
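As an illustration of duplicate detection by clustering, the sketch below groups documents whose word 3-gram sets overlap strongly (Jaccard similarity) using a greedy leader-based pass. This is one simple technique chosen for illustration, not the clustering method the project commits to; the function names and the 0.5 threshold are assumptions.

```python
from typing import List, Set, Tuple


def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    """Set of overlapping word n-grams; short texts yield one shingle."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(docs: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy pass: each document joins the first cluster whose leader
    (its first member) is similar enough, otherwise it starts a new cluster."""
    sets = [shingles(d) for d in docs]
    clusters: List[List[int]] = []
    for idx, current in enumerate(sets):
        for cluster in clusters:
            if jaccard(current, sets[cluster[0]]) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


if __name__ == "__main__":
    documents = [
        "the cat sat on the mat today",
        "the cat sat on the mat yesterday",
        "parallel corpora help machine translation research",
    ]
    print(cluster_near_duplicates(documents))
```

The greedy pass depends on document order and compares only against cluster leaders, which keeps it cheap but approximate.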
The purpose of this software project is to design and implement a general NLP system whose main aim will be to aggregate parallel texts (texts in more than one language with sentence-to-sentence correspondence) from web sources and present them in a searchable interface similar to [Linguee] or [Linear]. It is also meant to serve as a common platform where researchers and ordinary users will be able both to contribute and to use the resources offered by the system. Apart from that, the system will also help with the construction of comparable corpora (no sentence-to-sentence correspondence) and with topic-based document classification.
Through a web interface, users will be able to perform keyword-based or phrase-based searches and see the results as parallel sentences in both the source and the target language. Category-based search will be provided as well, as a visualization of the document clustering.
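The keyword-based and phrase-based search described above can be sketched with a small in-memory inverted index over aligned sentence pairs. The proposal names Lucene for indexing; the sketch below does not use it and only illustrates the idea. The class ParallelIndex, its methods and the sample sentences are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


class ParallelIndex:
    """Inverted index from source-side tokens to sentence-pair ids."""

    def __init__(self, pairs: List[SentencePair]) -> None:
        self.pairs = list(pairs)
        self.index: Dict[str, Set[int]] = defaultdict(set)
        for pair_id, (source, _target) in enumerate(self.pairs):
            for token in tokenize(source):
                self.index[token].add(pair_id)

    def _candidates(self, tokens: List[str]) -> List[int]:
        """Ids of pairs whose source side contains every query token."""
        if not tokens:
            return []
        postings = [self.index.get(token, set()) for token in tokens]
        return sorted(set.intersection(*postings))

    def keyword_search(self, query: str) -> List[SentencePair]:
        """All query words must occur, in any order."""
        return [self.pairs[i] for i in self._candidates(tokenize(query))]

    def phrase_search(self, query: str) -> List[SentencePair]:
        """Query words must occur consecutively, in the given order."""
        tokens = tokenize(query)
        phrase = " " + " ".join(tokens) + " "
        hits = []
        for i in self._candidates(tokens):
            source_text = " " + " ".join(tokenize(self.pairs[i][0])) + " "
            if phrase in source_text:
                hits.append(self.pairs[i])
        return hits


if __name__ == "__main__":
    index = ParallelIndex([
        ("The parallel corpus is large.", "Paralelní korpus je velký."),
        ("A corpus of parallel texts.", "Korpus paralelních textů."),
    ])
    for source, target in index.phrase_search("parallel corpus"):
        print(source, "|", target)
```

Each hit carries both sides of the sentence pair, so the interface can display source and target text together.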
A brief overview of the functionalities is given below, along with the estimated total number of participants involved in each module (in parentheses) and the estimated number of working hours (in square brackets).
3.2 Integration of existing tools
A part of the project will be the integration of a number of available third-party libraries for text processing into a cooperating framework. A table describing functionalities and links of available third-party libraries is given below.
Functionality | URL |
Crawling - domain-specific spiders | http://www.robotstxt.org/orig.html |
PDF extraction | http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html |
Language recognition | |
Indexing | http://lucene.apache.org/java/docs/index.html |
GZip exporting | http://java.sun.com/developer/technicalArticles/Programming/compression/ |
Sentence alignment | |
Word alignment | |
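For the crawling row of the table above, which points to the robots exclusion standard, a crawler would be expected to check a site's robots.txt before fetching a page. The following is a minimal sketch using Python's standard urllib.robotparser module rather than any library named in the table; the user agent name "ParaSiteBot", the example.org URLs and the helper build_policy are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a site might publish.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    policy = build_policy(EXAMPLE_ROBOTS_TXT)
    for url in (
        "http://example.org/texts/en/article.html",
        "http://example.org/private/draft.pdf",
    ):
        # "ParaSiteBot" is an assumed user agent name for illustration.
        print(url, policy.can_fetch("ParaSiteBot", url))
```

Parsing from a string keeps the sketch runnable offline; a real spider would download robots.txt from each host first.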
The system will consist of a web interface (receiving user requests and displaying results) and a core module, which will be divided into several sub-modules, each responsible for a different task. A fully modular approach will be enforced so that every component is independent. The core of the system will hold all interfaces, so that the system remains extensible if more functionality is added in the future.
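The modular design described above, with a core that holds the interfaces and independent sub-modules plugged into it, could look roughly like the sketch below. It only illustrates the idea; the names Module, Core, WhitespaceNormalizer and SentenceSplitter are hypothetical and do not describe the final design.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Document = Dict[str, Any]


class Module(ABC):
    """Interface held by the core; every sub-module implements it."""

    name: str

    @abstractmethod
    def process(self, document: Document) -> Document:
        """Return a new document enriched by this module."""


class Core:
    """Registry that knows modules only through the Module interface."""

    def __init__(self) -> None:
        self._modules: Dict[str, Module] = {}

    def register(self, module: Module) -> None:
        if module.name in self._modules:
            raise ValueError(f"module already registered: {module.name}")
        self._modules[module.name] = module

    def run(self, names: List[str], document: Document) -> Document:
        """Apply the named modules in order."""
        for name in names:
            document = self._modules[name].process(document)
        return document


class WhitespaceNormalizer(Module):
    name = "normalize"

    def process(self, document: Document) -> Document:
        return {**document, "text": " ".join(document["text"].split())}


class SentenceSplitter(Module):
    name = "split"

    def process(self, document: Document) -> Document:
        # Naive split after sentence-final punctuation followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", document["text"])
        return {**document, "sentences": sentences}


if __name__ == "__main__":
    core = Core()
    core.register(WhitespaceNormalizer())
    core.register(SentenceSplitter())
    print(core.run(["normalize", "split"], {"text": "Hello   world.  How are you?"}))
```

New functionality is added by registering another Module implementation; the core itself does not change.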
Completion date: 9 months starting from project approval
[MT] http://en.wikipedia.org/wiki/Machine_translation
[ParallelCorpora] http://www.athel.com/parallel_corpora.html
http://en.wikipedia.org/wiki/Parallel_text
[Clustering] http://en.wikipedia.org/wiki/Cluster_analysis
[Alignment] http://en.wikipedia.org/wiki/Word_alignment
[Linguee] www.linguee.com
[Linear] www.linearb.co.uk