About Project

Head Investigator: Prof. RNDr. Jaroslav Pokorný, CSc.
Time Span: 2013-2015
Contact: para-escience (at) ksi (dot) mff (dot) cuni (dot) cz
Topic: Development of methods that employ highly parallel hardware in the tasks of knowledge discovery and data mining in astroinformatics.

Project Description

The Internet and the sciences (notably the physical sciences, biological sciences, medicine, and engineering) generate large and complex datasets (Big Data) that require more advanced database and architectural support. In the former, this development led to web databases; in the latter, to the kind of data processing now called e-science. Traditional scientific methods based on the individual examination of facts are not applicable to data of this enormous size. A new kind of research methodology has emerged, sometimes called the fourth paradigm of scientific exploration [Hey2007], based on the statistical exploration of large amounts of data; in many areas, this approach has led to new results unreachable by traditional approaches. Astroinformatics is a good example of such an approach: it is based on the systematic application of modern informatics and advanced statistics to huge astronomical data sets. Machine learning, classification, clustering, and data mining yield new discoveries and a better understanding of the nature of astronomical objects. Astroinformatics is sometimes presented as a new way of doing astronomy [Borne2009].

For a long time, the term scientific computing was almost synonymous with high-performance computing, i.e., large-scale numerical processing in a distributed environment. In addition, as modern science created a number of large datasets, storing and organizing the data themselves became an important problem. Although the principal requirements placed on a scientific database are similar to those of other database applications, there are also significant differences that often make standard database architectures inapplicable.

For example, to successfully solve problems whose dynamics are “hidden” in the data, it is important to have fast and effective data-mining methods. Although data mining has been studied for decades, only a few data-mining methods are scalable enough to cope with data of present and future sizes [GarciaPedrajas2012]. In astronomy and astrophysics, for instance, the volume of data doubles every nine months, twice as quickly as hardware power grows according to Moore's law [Quinn2007]. Thus, new approaches to data mining and data processing in general are needed.
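
The growth gap described above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the project itself. The nine-month doubling period for data comes from the text; the 18-month doubling period for hardware is an assumption (one common reading of Moore's law, consistent with "twice as quickly"). The function names are hypothetical.

```python
# Doubling period of astronomical data volume, as stated in the text.
DATA_DOUBLING_MONTHS = 9
# ASSUMPTION: hardware capability doubles every 18 months
# (a common reading of Moore's law; not stated in the source).
HARDWARE_DOUBLING_MONTHS = 18


def growth_factor(months, doubling_period):
    """Multiplicative growth after `months` for a given doubling period."""
    return 2 ** (months / doubling_period)


def data_to_hardware_gap(months):
    """Ratio of data growth to hardware growth after `months`."""
    data = growth_factor(months, DATA_DOUBLING_MONTHS)
    hardware = growth_factor(months, HARDWARE_DOUBLING_MONTHS)
    return data / hardware


if __name__ == "__main__":
    for years in (3, 6, 9):
        gap = data_to_hardware_gap(12 * years)
        print(f"after {years} years the gap is a factor of {gap:.0f}")
```

Under these assumptions the gap itself doubles every 18 months, so waiting for faster hardware cannot close it; the shortfall has to be covered by more scalable algorithms and architectures.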

The complexity of knowledge discovery in databases (KDD) is one of the reasons for the slow adoption of these methods by the astronomical community [Brescia2011]. To be effective, a KDD application requires a good understanding of the mathematics underlying the methods, of the computing infrastructure, and of the implemented workflows. In most cases, optimal results can be found only on a trial-and-error basis, by comparing the outputs of different methods or of different implementations of the same method. Even in the simplest cases, KDD methodology requires multiple experiments to be run with the same method in order to optimize the internal parameters of the models or to evaluate the internal error. Solving a specific problem therefore often requires a long fine-tuning phase.
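
To illustrate why even a simple method needs many runs, the sketch below tunes one internal parameter by repeated experiments. It is an illustration only, not the project's workflow: the choice of a k-nearest-neighbour classifier, the 5-fold cross-validation, the synthetic one-dimensional data, and all function names are assumptions made for the example.

```python
import random
from collections import Counter


def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda s: (s[0] - point) ** 2)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_validated_error(samples, k, folds=5):
    """Fraction of samples misclassified when each fold is held out in turn."""
    errors = 0
    for f in range(folds):
        test = samples[f::folds]  # every folds-th sample, starting at f
        train = [s for i, s in enumerate(samples) if i % folds != f]
        errors += sum(knn_predict(train, x, k) != y for x, y in test)
    return errors / len(samples)


def tune_k(samples, candidates):
    """Run one full experiment per candidate k and keep the lowest error."""
    results = {k: cross_validated_error(samples, k) for k in candidates}
    best = min(results, key=results.get)
    return best, results


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data: two overlapping classes (illustrative only).
    samples = [(rng.gauss(0.0, 1.0), "a") for _ in range(100)]
    samples += [(rng.gauss(2.0, 1.0), "b") for _ in range(100)]
    rng.shuffle(samples)
    best, results = tune_k(samples, candidates=(1, 3, 5, 9, 15))
    print("best k:", best)
    print("error per k:", results)
```

Each candidate value costs a full set of training and evaluation runs, so the number of experiments grows with the number of parameters and candidate values. This is the fine-tuning cost the paragraph above refers to.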

Generally, queries in e-science may require domain-specific algorithms that can be computationally difficult to evaluate. To evaluate such queries in acceptable time, the following measures are required: special indexing methods, a high level of parallelism, and approximate evaluation (where acceptable).

Until now, the parallel capabilities and the extensibility of relational database management systems (RDBMS) have been used successfully in a number of computationally intensive analytical and text-processing applications. Unfortunately, these database systems may fail to achieve the expected performance in scientific tasks for various reasons, such as invalid cost estimation, skewed data distribution, or poor cache performance. Discussions initiated mainly by Stonebraker et al. in [Stonebraker2005] and [Stonebraker2007] have shown the advantages of specialized database architectures for stream data processing, data warehouses, text processing, business intelligence applications, and also for scientific data. Highly scalable solutions, in terms of both data size and the number of concurrent users, are now achieved mainly within the architectures of NoSQL databases [Pokorny2011].

The goals of this project comprise the application of advanced knowledge discovery methods in astroinformatics, including the use of soft-computing techniques such as evolutionary design in the development of new data mining and knowledge discovery algorithms and techniques. For the specific needs of such algorithms, a parallel and distributed platform will be designed and implemented, covering traditional as well as emerging hardware architectures such as GPGPU. Such a platform will also allow the evaluation of several non-conventional database design approaches in the context of e-science.

Modified or new methods will be tested on real data sets from astrophysical observations. We want to use large archives of stellar spectra from spectroscopic surveys such as SDSS or LAMOST, complemented by large photometric archives, to extract feature vectors and then apply different KDD techniques of clustering and supervised learning to find objects of interesting types (e.g., emission stars, blazars, active galactic nuclei). To prepare the feature vectors, wavelet power spectra and other advanced data analysis techniques will be used [Li2010]. As advanced clustering methods and self-organizing maps cannot be easily implemented using the MapReduce principle or massively parallel processing, the new database handling techniques described in this project will be used.
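
As a rough illustration of how a wavelet power spectrum can turn a spectrum into a short feature vector, the sketch below computes the mean squared detail coefficient at each scale of a Haar wavelet decomposition. This is a sketch under stated assumptions, not the method the project will use: the Haar wavelet is chosen here only because it is the simplest, the input is assumed to be a list of flux samples whose length is a power of two, and the function name and sample values are hypothetical.

```python
import math


def haar_power_spectrum(signal):
    """Mean squared Haar detail coefficient per scale, finest scale first.

    ASSUMPTION: len(signal) is a power of two (no padding or resampling).
    """
    n = len(signal)
    if n < 2 or n & (n - 1):
        raise ValueError("signal length must be a power of two >= 2")
    approx = list(signal)
    powers = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Orthonormal Haar step: differences capture detail at this scale,
        # sums carry the smoothed signal on to the next, coarser scale.
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        powers.append(sum(d * d for d in detail) / len(detail))
    return powers


if __name__ == "__main__":
    # Hypothetical flux samples; real input would come from a survey spectrum.
    flux = [1.0, 1.1, 0.9, 1.0, 0.2, 0.3, 1.0, 1.05]
    print(haar_power_spectrum(flux))
```

A spectrum of n samples is reduced to log2(n) numbers, one per scale, which can then be passed to a clustering or classification method as a feature vector.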

  • [Borne2009] K. D. Borne. Scientific Data Mining in Astronomy. In: Next Generation of Data Mining (Taylor & Francis: CRC Press), pp. 91-114 (2009)
  • [Brescia2011] M. Brescia, S. Cavuoti, S. G. Djorgovski, C. Donalek, G. Longo, et al. Extracting Knowledge From Massive Astronomical Data Sets. Astrostatistics and Data Mining in Large Astronomical Databases, eds. L.M. Barrosaro et al., Springer Series on Astrostatistics (2011)
  • [GarciaPedrajas2012] García-Pedrajas, N., Haro-García, A., Scaling up data mining algorithms: review and taxonomy, Progress in Artificial Intelligence, 1, 2012.
  • [Hey2007] Hey, T., Tansley, S., and Tolle, K. (Eds.): Jim Gray on eScience: A Transformed Scientific Method. Based on the transcript of a talk given by Jim Gray to the NRC-CSTB1 in Mountain View, CA, on Jan. 11, 2007, Microsoft Research.
  • [Li2010] T. Li, S. Ma, and M. Ogihara. Wavelet methods in data mining. In Oded Maimon and Lior Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 553–571. Springer, 2010.
  • [Pokorny2011] Pokorný, J.: NoSQL Databases: a step to database scalability in Web environment. In: Proc. of the 13th International Conference on Information Integration and Web-based Applications & Services (iiWAS) 2011, December 5–7, 2011, Ho Chi Minh City, Vietnam, Taniar D., Pardede E., Nguyen H.-Q., Rahayu W., Khalil I. (Eds.), ACM, 278-283
  • [Quinn2007] Quinn, P. 2007, Data Intensive Science needs for Australian Astronomy, http://astronomyaustralia.org.au/ASTRO-projects-infrastructure.pdf
  • [Stonebraker2005] Stonebraker, M., Çetintemel, U.: "One Size Fits All": An Idea Whose Time Has Come and Gone. In: Proc. of the International Conference on Data Engineering (ICDE), 2005, pp. 2-11.
  • [Stonebraker2007] Stonebraker, M., Bear, Ch., Çetintemel, U., Cherniack, M., Hachem, T.N., Harizopoulos, S., LiRogers, L. J, and Zdonik, S.: One Size Fits All? – Part 2: Benchmarking Results. In: Proc. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA, pp. 173-184