NDBI040: Big Data Management and NoSQL Databases
Basic Information
- Course annotation
- Lecturer: Martin Svoboda
- Language: English
- Schedule
- Lectures: Tuesday 12:20 - 13:50 (S1)
- Practical classes: Tuesday 17:20 - 18:50 (SW2)
- Table with points
Exam Dates
- Friday 6. 1. 2017: 14:00 - 17:00 (FEL ČVUT, Karlovo náměstí, KN:E-107)
- Tuesday 24. 1. 2017: 14:00 - 17:00 (S4)
- Monday 30. 1. 2017: 14:00 - 17:00 (S1)
- Monday 6. 2. 2017: 14:00 - 17:00 (S1, S7)
- Monday 13. 2. 2017: 14:00 - 17:00 (S1, S6)
Lectures
Practical Classes
Formal Requirements
- Attendance at practical classes is not compulsory, yet warmly recommended.
- Altogether, 6 individual homework assignments will be given during the semester.
- Each student must choose their own distinct topic before the first homework.
- This topic must be reported to and accepted by the lecturer.
- Possible examples are: library, cinema, cookbook, university, flights, etc.
- Your solutions must stick to the topic, be original, realistic, and non-trivial.
- Homework solutions are submitted via a script executed on the NoSQL server.
- At most 20 points can be gained for each homework, i.e. 120 points in total.
- In case of any shortcomings, fewer points will be granted appropriately.
- Solutions can be submitted even repeatedly before the deadline.
- However, each homework will be assessed by the lecturer only once.
- A delay of up to one day is penalized by up to 5 points, proportionally to the number of hours.
- Should the delay be even longer, the penalty nevertheless stays at just 5 points.
- During some of the practical classes, extra activity points can be acquired, too.
- At least 100 points are required for the course credit to be granted.
- All the points above this boundary are transferred to the examination as a bonus.
- Only students with a credit already acquired can sign up for the final test.
- At most 100 points can be acquired from the actual final written test.
- The final grade is based on a sum of the test and bonus points, if any.
- 90 points or more for 1 (excellent), 75+ for 2 (very good), and 60+ for 3 (good).
- However, with fewer than 30 points from the test itself, no final grade is granted at all.
- Moreover, each such unsuccessful attempt deducts 10 points from the bonus, if any.
Assignments
- Preliminaries:
- NoSQL server: nosql.ms.mff.cuni.cz:42222
- Login and password: sent by e-mail
- Submissions:
- Put all the files you would like to submit into a separate directory.
- The name of this directory must correspond to the name of the assignment.
- I.e. mapreduce, riak, redis, mongodb, neo4j, cassandra (case sensitive).
- Run submit name from the parent directory and wait for the confirmation of success.
- Should any complications appear, send your solution by e-mail to svoboda@ksi.mff.cuni.cz.
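For instance, the submission of the first assignment might proceed as follows (the file names are only illustrative, following the examples in the assignment; submit is the course server's own command):

```bash
# On the NoSQL server, with your solution files in the current directory:
mkdir mapreduce
cp data.txt mapreduce.java script.sh mapreduce/
submit mapreduce   # called from the parent directory; wait for the confirmation
```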
1: MapReduce
- Assignment:
- Implement a non-trivial MapReduce problem within the topic you have selected.
- Choose, e.g., from the general MapReduce patterns: aggregation, grouping, or querying.
- Create a sample input data file with about 10 realistic items of one entity type.
- It is recommended to keep the format of the input simple, i.e. to keep items organized into lines.
- Use the WordCount.java example as a basis and implement the actual MapReduce job.
- Both the Map and Reduce functions should be non-trivial, each about 10 lines of code.
- Comment the source file and provide a description of the problem you are solving.
- Finally, create a shell script that allows for the execution of the MapReduce job.
- I.e., compile the job, prepare the input files, execute the job and retrieve the results.
- Do not use absolute paths when accessing local working files within your submission.
- Make sure that the output HDFS directory is removed at the beginning of the script.
- Alter the script so that /user/svoboda/your-login/ directory is created and used instead.
- Submission:
- Input data file (e.g. data.txt): text file containing sample input data.
- Java source file (e.g. mapreduce.java): source file with MapReduce implementation.
- Bash script (e.g. script.sh): shell script allowing the job to be executed repeatedly.
- Software:
- Deadline: Thursday 27. 10. 2016 until 23:59 (extended from Monday 24. 10. 2016)
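The run script required above might be structured roughly as follows (a minimal sketch: MyJob, data.txt, and your-login are placeholders to be adjusted to your topic and login, and the exact compilation commands depend on the Hadoop installation):

```bash
#!/bin/bash
HDFS_DIR=/user/svoboda/your-login    # per-student HDFS directory, as required

# Compile the job and package it into a jar (assumes HADOOP_CLASSPATH is set)
hadoop com.sun.tools.javac.Main MyJob.java
jar cf myjob.jar MyJob*.class

# Prepare the input files; only relative local paths are used
hdfs dfs -mkdir -p $HDFS_DIR/input
hdfs dfs -put -f data.txt $HDFS_DIR/input/

# Remove the output directory first, then execute the job
hdfs dfs -rm -r -f $HDFS_DIR/output
hadoop jar myjob.jar MyJob $HDFS_DIR/input $HDFS_DIR/output

# Retrieve the results
hdfs dfs -cat $HDFS_DIR/output/part-r-00000
```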
2: Riak
- Assignment:
- Create a shell script that works with Riak database via HTTP interface using cURL tool.
- Insert about 5 key-value objects of different entity types into each of 3 separate buckets.
- You can work with any data format you like (e.g. text, JSON), but include content headers.
- Prepend names of all the buckets you will use with your login name, e.g. svoboda_movies.
- Perform 1 read, 1 nontrivial update, and 1 remove request. Add also 5 meaningful links.
- Express 2 link walking queries, one with at least 1 navigation step, one with at least 2.
- Comment the script, especially explain the real-world meaning of your link walking queries.
- Submission:
- Bash script (e.g. script.sh): shell script allowing to execute all the HTTP requests.
- Software:
- Deadline: Wednesday 2. 11. 2016 until 23:59 (extended from Monday 31. 10. 2016)
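A fragment of such a script might look as follows (shortened: the full solution needs all the required objects, links, and both link walking queries; the HTTP endpoint is an assumption based on Riak's default port, and all bucket and key names are only illustrative):

```bash
#!/bin/bash
BASE=http://localhost:8098   # assumed default Riak HTTP port on the NoSQL server

# Insert a JSON object, including the content type header
curl -s -X PUT -H 'Content-Type: application/json' \
  -d '{"name": "Ridley Scott"}' $BASE/riak/svoboda_directors/scott

# Insert an object with a link (tag "director") to the object above
curl -s -X PUT -H 'Content-Type: application/json' \
  -H 'Link: </riak/svoboda_directors/scott>; riaktag="director"' \
  -d '{"title": "Gladiator", "year": 2000}' $BASE/riak/svoboda_movies/gladiator

# Read and remove requests
curl -s $BASE/riak/svoboda_movies/gladiator
curl -s -X DELETE $BASE/riak/svoboda_directors/scott

# Link walking with 1 navigation step: the director of a given movie
curl -s $BASE/riak/svoboda_movies/gladiator/svoboda_directors,director,1
```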
3: Redis
- Assignment:
- Create a script (ordinary text file) with a sequence of commands working with Redis.
- Switch to your database (number was sent by e-mail), then remove all its content.
- Illustrate that you can work with strings, lists, sets, sorted sets and hashes. In particular:
- Strings: 5 insertions (SET), 1 read (GET), 1 update (APPEND, SETRANGE, INCR, ...), 1 removal (DEL).
- Lists: 5 insertions (LPUSH, RPUSH, ...), 2 reads (LPOP, RPOP, LINDEX, LRANGE), 1 removal (LREM).
- Sets: 5 insertions (SADD), 2 reads (SISMEMBER, SUNION, SINTER, SDIFF), 1 removal (SREM).
- Sorted sets: 5 insertions (ZADD), 1 read (ZRANGE, ZRANGEBYSCORE), 1 update (ZINCRBY), 1 removal (ZREM).
- Hashes: 5 insertions (HSET, HMSET), 2 reads (HGET, HMGET, HKEYS, HVALS, ...), 1 removal (HDEL).
- Submission:
- Redis script (e.g. script.txt): text file with Redis database commands.
- Software:
- Deadline: Monday 7. 11. 2016 until 23:59
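A shortened sketch of such a script follows (the database number 7 stands for the one sent by e-mail, key names are illustrative, and the operation counts are below the required ones; note that redis-cli reads the file as plain commands, so avoid comment lines):

```
SELECT 7
FLUSHDB
SET movie:1 "Alien"
APPEND movie:1 " (1979)"
GET movie:1
DEL movie:1
RPUSH genres sci-fi horror drama
LRANGE genres 0 -1
LREM genres 1 drama
SADD actors Weaver Holm Hurt
SISMEMBER actors Weaver
SREM actors Hurt
ZADD ratings 8.5 Alien 7.9 Aliens
ZRANGEBYSCORE ratings 8 9
ZINCRBY ratings 0.1 Aliens
ZREM ratings Aliens
HMSET movie:2 title Aliens year 1986
HGET movie:2 title
HDEL movie:2 year
```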
4: MongoDB
- Assignment:
- Create a JavaScript script with a sequence of commands working with MongoDB database.
- Execute your script using mongo login script.js, where login is the name of your database.
- Create 3 collections for entities of different types, insert about 5 documents into each one of them.
- These documents must be realistic, non-trivial (e.g. with nested arrays or objects) and with references.
- Express 2 non-trivial update operations, one without update operators, the other with at least 2 different ones.
- Express 5 non-trivial find queries (with non-trivial selections) and describe their meaning in comments.
- Use positive projection, negative projection, and sort modifier (each of these constructs at least once).
- You can print the output of the queries using, e.g., the db.actors.find().forEach(printjson); approach.
- Express 1 non-trivial MapReduce query, describe its meaning and contents of emitted key-value pairs.
- Submission:
- JavaScript script (e.g. script.js): script with MongoDB database commands.
- Software:
- Deadline: Monday 28. 11. 2016 until 23:59
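A shortened sketch of such a script (collection, field and value names are illustrative; the db object refers to the database selected on the mongo command line):

```javascript
db.movies.drop();
// Nested array and object, plus references into the actors collection
db.movies.insert({ _id: 1, title: "Alien", year: 1979,
                   genres: ["sci-fi", "horror"],
                   director: { name: "Ridley Scott" },
                   actorIds: [10, 11] });

// Update without update operators: the matched document is replaced
db.movies.update({ _id: 1 }, { title: "Alien", year: 1979 });

// Update with 2 different update operators
db.movies.update({ _id: 1 },
                 { $set: { rating: 8.5 }, $push: { genres: "thriller" } });

// Find: movies of the 1970s; positive projection and sort modifier
db.movies.find({ year: { $gte: 1970, $lt: 1980 } },
               { title: 1, year: 1 }).sort({ year: 1 }).forEach(printjson);
```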
5: Neo4j
- Assignment:
- Implement a Java application working with an embedded Neo4j graph database.
- Insert about 10 nodes and 15 relationships into your database, both with a few properties.
- Work with at least 2 different node labels and 2 relationship types. Associate nodes with user identifiers.
- Define and process results of at least 2 non-trivial graph traversals (with expanders and evaluators).
- Express, execute and process results of at least 5 non-trivial Cypher query expressions.
- Use each of the MATCH, OPTIONAL MATCH, RETURN, WITH, WHERE, and ORDER BY (sub)clauses at least once.
- Describe meaning of both your graph traversals and Cypher expressions, comment your source file.
- Submission:
- Java source file (e.g. main.java): source file for your application.
- Software:
- Deadline: Monday 19. 12. 2016 until 23:59
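The overall shape of such an application might be as follows (a minimal sketch assuming the Neo4j 3.x embedded API; node, property and type names are illustrative, and the traversal framework part as well as the required numbers of nodes, relationships and queries are omitted for brevity):

```java
import java.io.File;
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class Main {
    public static void main(String[] args) {
        GraphDatabaseService db =
            new GraphDatabaseFactory().newEmbeddedDatabase(new File("data"));
        try (Transaction tx = db.beginTx()) {
            // Two node labels and one relationship type (illustrative names)
            Node movie = db.createNode(Label.label("Movie"));
            movie.setProperty("id", "alien");          // user identifier
            movie.setProperty("title", "Alien");
            Node person = db.createNode(Label.label("Person"));
            person.setProperty("id", "scott");
            person.setProperty("name", "Ridley Scott");
            person.createRelationshipTo(movie, RelationshipType.withName("DIRECTED"));

            // One of the Cypher queries: directors and their movies
            Result result = db.execute(
                "MATCH (p:Person)-[:DIRECTED]->(m:Movie) "
                + "RETURN p.name AS name, m.title AS title ORDER BY title");
            while (result.hasNext()) System.out.println(result.next());
            tx.success();
        }
        db.shutdown();
    }
}
```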
6: Cassandra
- Assignment:
- Create a script (ordinary text file) with a sequence of CQL statements working with Cassandra database.
- Create (if it does not yet exist) and use your own keyspace. Use your login name as the name of this keyspace.
- Create 2 tables for entities of different types, insert about 5 rows into each one of them.
- Define at least one column for each of the following data types: tuple, list, set and map.
- Express 3 update statements with replace, add and remove operations on columns of all 3 collection types.
- Express 3 select statements, use WHERE clause and ALLOW FILTERING. Create at least 1 secondary index.
- Submission:
- CQL script (e.g. queries.cql): text file with CQL statements.
- Software:
- Deadline: Monday 2. 1. 2017 until 23:59
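A shortened sketch of such a CQL script (svoboda stands for your own login; table, column and value names are illustrative, and the full solution needs all the required statements):

```cql
CREATE KEYSPACE IF NOT EXISTS svoboda
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE svoboda;

CREATE TABLE IF NOT EXISTS movies (
  id text PRIMARY KEY,
  title text,
  genres list<text>,                -- one column for each required type:
  actors set<text>,                 -- list, set, map,
  ratings map<text, int>,           -- and tuple
  released frozen<tuple<int, int>>
);

INSERT INTO movies (id, title, genres, actors, ratings, released)
  VALUES ('alien', 'Alien', ['sci-fi'], {'Weaver'}, {'csfd': 89}, (1979, 5));

-- Replace, add, and remove operations on the 3 collection types
UPDATE movies SET genres = ['horror', 'sci-fi'] WHERE id = 'alien';
UPDATE movies SET actors = actors + {'Holm'} WHERE id = 'alien';
UPDATE movies SET ratings = ratings - {'csfd'} WHERE id = 'alien';

-- Secondary index and select statements
CREATE INDEX IF NOT EXISTS ON movies (title);
SELECT * FROM movies WHERE title = 'Alien';
SELECT id, title FROM movies WHERE released = (1979, 5) ALLOW FILTERING;
```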
Examination Topics
Introduction
- Big Data and NoSQL terms, V characteristics (volume, variety, velocity, veracity, ...), current trends and challenges, principles of RDBMS; types of NoSQL systems (key-value, wide column, document, graph, ...), their data models, features and use cases; general features of NoSQL systems (aggregates, schemalessness, maintenance, scaling, flexibility, ...)
Principles
- Vertical and horizontal scaling, pros and cons, network fallacies
- Cluster architecture, design questions (scalability, availability, consistency, latency, durability, resilience)
- distribution models, sharding (objectives, strategies), master-slave and peer-to-peer replication (objectives, replication factor, handling of read and write requests)
- CAP theorem, its guarantees (consistency, availability, partition tolerance) and consequences, CA, CP and AP systems, ACID properties (atomicity, consistency, isolation, durability), BASE properties (basically available, soft state, eventual consistency), their mutual relationship
- Strong and eventual consistency, read and write consistency, inconsistency window, session consistency, read and write quorums
MapReduce
- Programming models, paradigms and languages; parallel programming models, process interaction (shared memory, message passing), problem decomposition (task parallelism, data parallelism)
- MapReduce programming model, functions (map, reduce, partition, compare, combine), architecture (master and workers), execution phases (splitting, mapping, shuffling, reducing), implementation details, counters, fault tolerance, stragglers, usage patterns (aggregation, grouping, querying, sorting, ...)
- Apache Hadoop: modules (HDFS, YARN, MapReduce); HDFS: data model (namespace, files, blocks), architecture (NameNode, DataNode), replica placement, FsImage and EditLog structures, HeartBeat messages, FS commands; MapReduce: architecture (JobTracker, TaskTracker), job implementation (Configuration, Mapper, Reducer, Combiner, Context and write method, Writable and WritableComparable interfaces), job execution schema
Data Formats
- XML: constructs (element, attribute, text, ...), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML
- JSON: constructs (object, array, value), types of values (strings, numbers, ...); BSON: document structure (elements, type selectors, property names and values)
- RDF: data model (resources, IRI identifiers, literals), triples (subject, predicate, object), statements, blank nodes, literals (types, language tags); graph representation (vertices, edges); N-Triples notation (RDF file, statements, triple components, literals, IRI references); Turtle notation (TTL file, object and predicate-object lists, blank nodes, prefixed names)
- Protocol Buffers: components (language, compiler, format), intended usage; schema structure (messages, enumerations), fields (rules, names, types, tags)
Key-Value Stores
- Data model (key-value pairs), key management, use cases, representatives, TTL functionality
- Riak: data model (buckets, objects); HTTP interface, cURL tool; CRUD operations (POST, PUT, GET and DELETE methods, composition of URLs); links (definition, tags, link walking, navigational steps), data types (Convergent Replicated Data Types: register, flag, counter, set, map), Riak Search 2.0 Yokozuna (architecture, SOLR document, extractors, fields, schema, full-text index creation and usage, wildcards, ranges, ...); causal context (timestamps, vector clocks); Riak Ring (physical and virtual nodes, consistent hashing, partitions, replica placement, hinted handoff, handling of requests)
- Redis: data model (databases, objects), data types (string, list, set, sorted set, hash), basic operations, TTL
Wide Column Stores
- Data model (column families, rows, columns), query patterns, use cases, representatives
- Cassandra: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing, empty, native data types, tuples, user-defined types, collections: lists, sets, maps), additional data (TTL, timestamp); CQL language: DDL statements: CREATE KEYSPACE (replication options), DROP KEYSPACE, USE keyspace, CREATE TABLE (column definitions, primary key), DROP TABLE, TRUNCATE TABLE; DML statements: SELECT statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY and LIMIT clauses, DISTINCT modifier, selectors, non/filtering queries, ALLOW FILTERING mode, aggregates), INSERT statements (update parameters), UPDATE statements (assignments, modification of collections), DELETE statements (deletion of rows, removal of columns, removal of values in collections)
Document Stores
- Data model (documents), query patterns, use cases, representatives
- MongoDB: data model (databases, collections, documents), document identifiers; CRUD operations (insert, update, save, remove, find); update operation: replace vs. update behavior, multi and upsert modes, update operators; find operation: query criteria (value equality vs. query operators), query operators (comparisons, logical, array, ...), projection (positive and negative), projection operators, modifiers (sort, skip, limit); dot notation; MapReduce; primary and secondary index structures (index types, forms and properties)
XML Databases
- Native XML databases vs. XML-enabled RDBMS; data model (XDM): tree (nodes for document, elements, attributes, texts, ...), document order, sequences, atomic values
- XPath language: path expressions (relative vs. absolute, steps, evaluation algorithm), axes (forward, reverse, attribute), node tests, predicates (positions), abbreviations
- XQuery language: path expressions, constructors (direct and computed, nested queries), FLWOR expressions (for, let, where, order by, and return clauses), common FLWOR use cases (joining, grouping, aggregation, integration, ...), conditional expressions (if, then, else), switch expressions (case, default, return), universal and existential quantifiers (some, every, satisfies), comparisons (value, general, node), atomization
Graph Databases
- Data model (property graphs), use cases, representatives
- Non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, subgraph/supergraph patterns, similarity querying); data structures (adjacency matrix, adjacency list, incidence matrix), graph traversals, data locality (BFS Layout, matrix bandwidth, Cuthill-McKee minimization), graph partitioning (1D, 2D), non/mining based indexing
- Neo4j: data model (graph, nodes, relationships, labels, types, properties); traversal framework: traversal description, expanders (directions), order (breadth-first, depth-first), uniqueness (NODE_GLOBAL, ...), evaluators (predefined and custom evaluators, include/exclude and continue/prune results), traverser (paths, end nodes, last relationships); Cypher language: path patterns, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); read sub/clauses: MATCH (WHERE conditions, uniqueness requirement, OPTIONAL mode); general sub/clauses: RETURN (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation), WITH; write sub/clauses: CREATE, DELETE (DETACH mode), SET, REMOVE
RDF Stores
- SPARQL: graph pattern matching (solution sequence, variable binding, compatibility of solutions), graph patterns (basic, group, optional, alternative, graph, minus); prologue declarations (BASE, PREFIX clauses), SELECT queries (SELECT, FROM, and WHERE clauses), dataset (default graph, named graphs), variable assignments (BIND), FILTER constraints (comparisons, logical connectives, accessors, tests, ...), solution modifiers (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), query forms (SELECT, ASK, DESCRIBE, CONSTRUCT)
Advanced Aspects
- Transactions: business vs. system transactions, local vs. distributed transactions, optimistic and pessimistic offline locks; performance tuning: scalability goals, Amdahl's law, Little's law, message cost model; visualization: motivation, visualization types (scatter plot, matrix chart, network diagram, correlation matrix, dendrogram, bar chart, histogram, box plot, bubble chart, line graph, stack graph, pie chart, treemap, tag cloud, arc diagram, centralized burst, globe, radial chart); polyglot persistence
Recommended Literature
- Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: Big Data a NoSQL databáze.
ISBN: 978-80-247-5466-6. Grada Publishing, a.s., 2015.
- Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled.
ISBN: 978-0-321-82662-6. Pearson Education, Inc., 2013.