B4M36DS2, BE4M36DS2: Database Systems 2
Basic Information
- Annotations: B4M36DS2 (Czech), BE4M36DS2 (English)
- Lecturer: Martin Svoboda
- Schedule: B4M36DS2, BE4M36DS2
- Lectures: Monday 9:15 - 10:45 (KN:E-301) (English)
- Practical classes (group 101): Monday 12:45 - 14:15 (KN:E-328) (English)
- Practical classes (group 102): Monday 14:30 - 16:00 (KN:E-328) (Czech)
- Practical classes (group 103): Monday 16:15 - 17:45 (KN:E-328) (Czech)
- Table with points
Exam Dates
- Tuesday 16. 1. 2018: 9:15 - 11:30 (KN:E-301)
- Tuesday 23. 1. 2018: 9:15 - 11:30 (KN:E-301)
- Monday 5. 2. 2018: 9:15 - 11:30 (KN:E-301)
- Monday 12. 2. 2018: 9:15 - 11:30 (KN:E-301)
- There will be no additional exam dates
Lectures
- 02. 10. 2017: 01 - Introduction: Big Data, NoSQL Databases
- 09. 10. 2017: 02 - Data Formats: XML, JSON, BSON, RDF, Protocol Buffers
- 16. 10. 2017: 03 - XML Databases: XPath, XQuery
- 23. 10. 2017: 04 - RDF Stores: SPARQL
- 30. 10. 2017: 05 - Apache Hadoop: MapReduce, HDFS
- 06. 11. 2017: 06 - Basic Principles: Scaling, Sharding, Replication, CAP Theorem, Consistency
- 13. 11. 2017: 07 - Key-Value Stores: RiakKV
- 20. 11. 2017: 08 - Wide Column Stores: Cassandra, CQL
- 27. 11. 2017: 09 - Document Databases: MongoDB
- 04. 12. 2017: 10 - Graph Databases: Neo4j, Cypher
- 11. 12. 2017: 11 - Advanced Aspects: Graph Databases
- 18. 12. 2017: 12 - Advanced Aspects: Transactions, Performance Tuning, Visualization
- 08. 01. 2018: Canceled
Practical Classes
- 02. 10. 2017: 00 - Organization
- 09. 10. 2017: Canceled
- 16. 10. 2017: 01 - XPath
- 23. 10. 2017: 02 - XQuery
- 30. 10. 2017: 03 - SPARQL
- 06. 11. 2017: 04 - MapReduce
- See /home/NOSQL/mapreduce/ for input data, Java source files, and Hadoop libraries
- 13. 11. 2017: 05 - RiakKV
- 20. 11. 2017: 06 - Redis
- 27. 11. 2017: 07 - Cassandra
- See /home/NOSQL/cassandra/ for CQL exercise solutions
- 04. 12. 2017: 08 - MongoDB
- 11. 12. 2017: 09 - MongoDB
- See /home/NOSQL/mongodb/ for input JSON data and exercise solutions
- 18. 12. 2017: 10 - Neo4j
- See /home/NOSQL/neo4j/ for input graph data and Cypher exercise solutions
- 08. 01. 2018: 11 - Neo4j
- See /home/NOSQL/neo4j/ for Neo4j libraries
Formal Requirements
- Attendance at the practical classes is not compulsory, yet it is warmly recommended
- Altogether, 7 individual homework assignments will be given during the semester
- Everyone must choose their own distinct topic, no later than the second practical class
- This topic must always be reported to and accepted by the lecturer in advance
- Possible topics include: library, cinema, cookbook, university, flights, etc.
- See the list below for additional suitable topics, or feel free to choose your own topic
- Your homework solutions must be within the topic, original, realistic, and non-trivial
- The solutions can only be submitted via a script executed on the NoSQL server
- At most 120 points in total can be gained for all the homework assignments
- Solutions are awarded up to 20, 15, or 10 points, depending on the assignment
- In case of any shortcomings, fewer points will be awarded appropriately
- Solutions may be submitted repeatedly; however, only the latest version is assessed
- Once a given assignment has been assessed by the lecturer, it cannot be resubmitted
- A delay of one whole day is penalized by 5 points; shorter delays are penalized proportionally
- Should the delay be even longer, the penalty stays the same and does not increase further
- During some of the practical classes, extra activity points can be acquired, too
- At least 100 points are required for the course credit to be granted
- One half of all the points above this boundary is transferred as a bonus to the exam
- Only students with a course credit already acquired can sign up for the final exam
- The final exam consists of a compulsory written test and an optional oral examination
- At most 100 points can be acquired from the actual final written test
- In particular, 50 points from the theoretical part and 50 from the practical part
- Obtaining fewer than 15 points in either of the two parts prevents you from passing the exam
- The final score corresponds to the sum of the written test and bonus points, if any
- Based on the result, everyone can voluntarily choose to undergo an oral examination
- In such a case, the final score is further adjusted by up to minus 5 to plus 5 points
- Final grade: 90 points and more for A, 80+ for B, 70+ for C, 60+ for D, and 50+ for E
Assignments
- Preliminaries:
- NoSQL server: 147.32.83.196:22 or acheron.ms.mff.cuni.cz:20104
- Login and password: sent by e-mail
- Tools:
- Submissions:
- Use sftp or WinSCP to upload your files to the NoSQL server
- Put all the files to be submitted into a standalone directory
- The name of this directory must correspond to the assignment name
- I.e. xquery, mapreduce, riak, redis, cassandra, mongodb, neo4j (case sensitive)
- Execute submit name from the parent directory and wait for the confirmation of success
- Should any complications appear, send your solution by e-mail to martin.svoboda@fel.cvut.cz
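To illustrate, a submission session might look like the following sketch (the login is a placeholder, and the exact upload and submit workflow is an assumption; adjust to the actual procedure described above):

```shell
# Hypothetical session: upload the whole submission directory, log in, submit
scp -P 22 -r cassandra your-login@147.32.83.196:~/
ssh your-login@147.32.83.196
# now on the NoSQL server, from the parent directory of the submission:
submit cassandra
# wait for the confirmation of success before logging out
```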
- Requirements:
- Respect the prescribed names of individual files to be submitted (case sensitive)
- Place all the files in the root directory of your submission
- Do not include shared libraries or files that are not requested
- I.e. do not submit files that were not explicitly requested
1: XQuery
- Points: 20
- Assignment:
- Create an XML document with sample data from your individual topic
- Work with mutually interlinked entities of at least 3 different types (e.g. lines, flights and tickets)
- Insert data about at least 15 particular entities (e.g. 3 lines, 4 flights, 8 tickets)
- Create expressions for 2 different XPath queries
- Use each of the following constructs at least once
- Axes: descendant or descendant-or-self
- Predicates: path expression, position testing, value comparison, general comparison
- Create expressions for 3 different XQuery queries
- Use each of the following constructs at least once
- Direct or computed constructor
- FLWOR expression
- Conditional expression
- Existential or universal quantifier
- Requirements:
- XML document must be well-formed (i.e. syntactically correct)
- Put each XPath / XQuery expression into a standalone file (e.g. xpath1.xp)
- Always add a comment describing the intended meaning of the query in natural language via (: comment :)
- Each query expression must evaluate to a non-empty sequence
- Submission:
- data.xml: XML document with your data to be queried
- xpath1.xp and xpath2.xp: files with XPath expressions
- xquery1.xq, xquery2.xq and xquery3.xq: files with XQuery expressions
- Software:
- Server: 147.32.83.196:22
- Deadline: Sunday 29. 10. 2017 until 23:59
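For illustration, one of the submitted XQuery files might look like the following sketch (the flights topic and all element and attribute names are hypothetical; note the required (: ... :) comment and the use of a FLWOR expression, a quantifier, and constructors):

```xquery
(: For each line that has at least one flight, list its flights ordered by code :)
let $d := doc("data.xml")
for $line in $d//line
where some $f in $d//flight satisfies $f/@line eq $line/@id
return
  <line name="{ $line/name }">{
    for $f in $d//flight[@line eq $line/@id]
    order by $f/@code
    return <flight code="{ $f/@code }"/>
  }</line>
```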
2: MapReduce
- Points: 20
- Assignment:
- Create an input text file with sample data from your individual topic
- Insert realistic and non-trivial data about at least 10 entities of one type
- Put each of these entities on a separate line, i.e. assume that each line of the input file yields one input record
- Organize the actual entity data in whatever format you are able to parse
- E.g. Medvídek 2007 53 100 Trojan Macháček Vilhelmová corresponding to a pattern Movie Year Rating Length Actors...
- Implement a non-trivial MapReduce job
- Choose from aggregation, grouping, filtering or any other general MapReduce query pattern
- Use WordCount.java source file as a basis for your own implementation
- Both the Map and Reduce functions should be non-trivial, each about 10 lines of code
- Comment the source file and also provide a description of the problem you are solving
- Create a shell script that allows for the execution of your MapReduce job
- I.e. compile the sources, prepare the input directories and files, execute the job, and retrieve the result
- Requirements:
- Only use /user/NOSQL/your-login/ HDFS directory for your input and output directories and files
- More precisely, use your own HDFS directory but switch to this one before you submit your assignment
- Create this directory (and all the other nested ones) at the beginning of your script (if they do not yet exist)
- Make sure your shell script is executable (i.e. has the x permission set)
- Also make sure your script can be executed repeatedly without failures
- I.e. at least remove the output HDFS directory at the beginning of your script (not end)
- Only use relative paths when accessing local files of your submission (e.g. mapreduce.java, data.txt etc.)
- Use Java Standard Edition version 7
- Do not organize your Java source files into explicit packages
- Do not submit your Netbeans (or other) project directory
- Submission:
- input.txt: text file with your sample input data
- readme.txt: description of the input data structure and objective of the MapReduce job
- *.java: Java source files with your MapReduce implementation (e.g. mapreduce.java)
- script.sh: Bash script allowing for the compilation and execution of your job
- result.txt: expected output of your MapReduce job (i.e. submit the result of the execution you performed by yourself)
- output.txt: actual output of your MapReduce job (do not submit it again, just make sure you will use this file name)
- Software:
- Server: acheron.ms.mff.cuni.cz:20104
- Deadline: Sunday 12. 11. 2017 until 23:59
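As an illustration, script.sh might follow this sketch (the MapReduceJob class name and your-login are placeholders; the hadoop and hdfs commands are the standard Hadoop CLI tools, but their availability and paths on the server are assumptions):

```shell
#!/bin/bash
# Sketch only: adjust the class name and the your-login placeholder
HDFS_DIR=/user/NOSQL/your-login/mapreduce

# compile the sources against the Hadoop libraries (relative local paths only)
javac -cp "$(hadoop classpath)" MapReduceJob.java
jar cf job.jar *.class

# prepare HDFS directories; remove any previous output so reruns do not fail
hdfs dfs -mkdir -p "$HDFS_DIR/input"
hdfs dfs -rm -r -f "$HDFS_DIR/output"
hdfs dfs -put -f input.txt "$HDFS_DIR/input/"

# execute the job and retrieve the result into output.txt
hadoop jar job.jar MapReduceJob "$HDFS_DIR/input" "$HDFS_DIR/output"
hdfs dfs -cat "$HDFS_DIR/output/part-r-*" > output.txt
```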
3: Riak
- Points: 15
- Assignment:
- Create a shell script that works with the Riak database via its HTTP interface using the cURL tool
- Insert about 5 key-value objects of different entity types into each of 3 buckets
- You can work with any data format you like (e.g. text, JSON)
- However, always include content headers
- Include at least 5 meaningful links of at least 2 different tags
- Execute 1 read and 1 update request
- Perform 2 link walking queries
- First with at least 1 navigational step
- Second with at least 2 navigational steps
- Remove all your objects at the end of your script (i.e. empty all your buckets)
- Requirements:
- Use the following convention for names of your buckets: login_bucket
- Replace login with your actual login name and bucket with the intended bucket name
- E.g. f171_svobom25_movies
- Explain the real-world meaning of your link walking queries (using comments #)
- Make sure your shell script is executable (i.e. has the x permission set)
- Also make sure your script can be executed repeatedly without failures
- Submission:
- script.sh: Bash script allowing to execute all the HTTP requests
- Software:
- Server: 147.32.83.196:22
- Deadline: Sunday 19. 11. 2017 until 23:59
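The HTTP requests might follow this sketch (the localhost:8098 endpoint and the legacy /riak URL scheme are assumptions that may differ on the server; bucket and key names are illustrative, with login replaced by your actual login name):

```shell
# Sketch only: endpoint and URL scheme may differ on the server
R=http://localhost:8098/riak

# insert a JSON object with a content type header and a tagged link
curl -X PUT "$R/f171_login_flights/EK140" \
  -H 'Content-Type: application/json' \
  -H 'Link: </riak/f171_login_airports/PRG>; riaktag="departs"' \
  -d '{ "to": "DXB", "year": 2017 }'

# read the object, then walk the link (one navigational step)
curl "$R/f171_login_flights/EK140"
curl "$R/f171_login_flights/EK140/f171_login_airports,departs,1"

# remove the object at the end of the script
curl -X DELETE "$R/f171_login_flights/EK140"
```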
4: Redis
- Points: 10
- Assignment:
- Create a script (ordinary text file) with a sequence of commands working with Redis
- Demonstrate that you can work with all data types (strings, lists, sets, sorted sets and hashes)
- In particular, perform all the following operations:
- Strings: 5 insertions (SET), 1 read (GET), 1 update (APPEND, SETRANGE, INCR, ...), 1 removal (DEL).
- Lists: 5 insertions (LPUSH, RPUSH, ...), 2 different reads (LPOP, RPOP, LINDEX, LRANGE), 1 removal (LREM).
- Sets: 5 insertions (SADD), 2 different reads (SISMEMBER, SUNION, SINTER, SDIFF), 1 removal (SREM).
- Sorted sets: 5 insertions (ZADD), 1 read (ZRANGE, ZRANGEBYSCORE), 1 update (ZINCRBY), 1 removal (ZREM).
- Hashes: 5 insertions (HSET, HMSET), 2 different reads (HGET, HMGET, HKEYS, HVALS, ...), 1 removal (HDEL).
- Your database (i.e. keys and values) as well as commands must be realistic and within your individual topic
- E.g. use a hash to store a mapping from seats to passengers for each flight: HMSET tickets-EK140-20171121 42A Peter 65F John
- Requirements:
- Only use a database you were assigned during your work on the assignment (sent by e-mail)
- Assume your script will be executed using cat script.txt | redis-cli -n 5
- Do not switch to your database in your script (via a SELECT command)
- I.e. assume that a specific database will be used when assessing your homework
- And this database will be specified outside your script (see above)
- Make sure your script can be executed repeatedly without failures
- Empty the database at the end of your script
- Submission:
- script.txt: text file with Redis database commands
- Software:
- Server: 147.32.83.196:22
- Deadline: Sunday 26. 11. 2017 until 23:59
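A fragment of such a script, within a hypothetical flights topic, might look as follows (all keys and values are illustrative only; a full solution needs the complete operation counts listed above):

```
SET flight:EK140:status boarding
GET flight:EK140:status
APPEND flight:EK140:status " at gate B4"
RPUSH flight:EK140:stops Prague Dubai
LRANGE flight:EK140:stops 0 -1
SADD flight:EK140:crew Peter Jane
SISMEMBER flight:EK140:crew Peter
ZADD flights:by-delay 15 EK140 0 OK536
ZRANGEBYSCORE flights:by-delay 0 10
HMSET tickets:EK140 42A Peter 65F John
HGET tickets:EK140 42A
DEL flight:EK140:status
```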
5: Cassandra
- Points: 15
- Assignment:
- Create a script (ordinary text file) with a sequence of CQL statements working with Cassandra database
- Define a schema for 2 tables for entities of different types
- Define at least one column for each of the following data types: tuple, list, set and map
- Insert about 5 rows into each of your tables
- Express 3 update statements that together use all of the replace, add, and remove operations on columns of all the collection types
- Express 3 select statements; use the WHERE clause and the ALLOW FILTERING mode
- Create at least 1 secondary index
- Requirements:
- Create (if it does not yet exist) and only work with your own keyspace; use your login name as the name of this keyspace
- Make sure your script can be executed repeatedly without failures
- Empty the database at the end of your script
- Submission:
- queries.cql: text file with CQL statements
- Software:
- Server: 147.32.83.196:22
- Deadline: Sunday 3. 12. 2017 until 23:59
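For illustration, the CQL script might follow this sketch (the your_login keyspace, table, and column names are placeholders; a full solution needs more rows, statements, and tables than shown here):

```cql
-- Sketch only: keyspace, table, and column names are illustrative
CREATE KEYSPACE IF NOT EXISTS your_login
  WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 1 };
USE your_login;

CREATE TABLE IF NOT EXISTS flights (
  code text,
  day date,
  route frozen<tuple<text, text>>,   -- tuple column
  stops list<text>,                  -- list column
  crew set<text>,                    -- set column
  seats map<text, text>,             -- map column
  PRIMARY KEY (code, day)
);

INSERT INTO flights (code, day, route, stops, crew, seats)
  VALUES ('EK140', '2017-11-21', ('PRG', 'DXB'), ['Prague'], {'Peter'}, {'42A': 'John'});

-- add to a list, remove from a set, replace a map entry
UPDATE flights SET stops = stops + ['Dubai'] WHERE code = 'EK140' AND day = '2017-11-21';
UPDATE flights SET crew = crew - {'Peter'} WHERE code = 'EK140' AND day = '2017-11-21';
UPDATE flights SET seats['42A'] = 'Jane' WHERE code = 'EK140' AND day = '2017-11-21';

CREATE INDEX IF NOT EXISTS flights_crew_idx ON flights (crew);
SELECT code, day FROM flights WHERE day > '2017-01-01' ALLOW FILTERING;

TRUNCATE flights;
```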
6: MongoDB
- Points: 20
- Assignment:
- Create a JavaScript script with a sequence of commands working with MongoDB database
- Explicitly create 2 collections for entities of different types
- Insert about 5 documents into each one of them
- These documents must be realistic, non-trivial, and with both embedded objects and arrays
- Interlink the documents using references
- Use both insert and save operations, each at least once
- Express 3 update operations
- One without update operators, another with at least 2 different operators, the last based on the upsert mode
- Express 5 find queries (with non-trivial selections)
- Use at least one logical operator ($and, $or, $not)
- Use $elemMatch operator on array fields at least once
- Use both positive and negative projection (each at least once)
- Use sort modifier
- Describe the real-world meaning of all your queries in comments
- Express 1 MapReduce query (non-trivial, i.e. not easily expressed using ordinary find operation)
- Describe its meaning, contents of intermediate key-value pairs and the final output
- Note that the reduce function must be associative, idempotent, and commutative
- Print the output of your MapReduce job using out: { inline: 1 } option
- Requirements:
- Call export LC_ALL=C in case you have difficulties launching the mongo shell
- Execute your script using mongo database script.js
- Replace database with your database name, i.e. your login name
- Do not switch to your database explicitly from within your script
- I.e. do not execute use database nor db.getSiblingDB('database') commands
- Print the output of all your find operations
- Use db.collection.find().forEach(printjson); approach for this purpose
- Make sure your script can be executed repeatedly without failures
- Empty the database at the end of your script
- Submission:
- script.js: JavaScript script with MongoDB database commands
- Software:
- Server: 147.32.83.196:22
- Deadline: Sunday 17. 12. 2017 until 23:59
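A fragment of such a script.js might look as follows (collections, fields, and values are illustrative only, within a hypothetical flights topic; a full solution needs all the operation counts listed above):

```javascript
// Sketch only: collections, fields, and values are illustrative
db.createCollection("flights");
db.createCollection("airports");

db.airports.insert({ _id: "PRG", city: "Prague" });
db.flights.save({ _id: "EK140", from: "PRG", to: "DXB",
                  crew: [ { name: "Peter", role: "pilot" } ],
                  plane: { type: "B777", seats: 364 } });

// update with two different update operators
db.flights.update({ _id: "EK140" },
                  { $set: { gate: "B4" },
                    $push: { crew: { name: "Jane", role: "purser" } } });

// find: logical operator, $elemMatch on an array, negative projection, sorting
// "Flights with a pilot on board that depart from or arrive at PRG, without crew details"
db.flights.find({ $or: [ { from: "PRG" }, { to: "PRG" } ],
                  crew: { $elemMatch: { role: "pilot" } } },
                { crew: 0 }).sort({ _id: 1 }).forEach(printjson);
```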
7: Neo4j
- Points: 20
- Assignment:
- Insert realistic nodes and relationships into your database
- Use a single CREATE statement for this purpose
- Insert altogether at least 10 nodes for entities of at least 2 different types (i.e. different labels)
- Insert altogether at least 15 relationships of at least 2 different types
- Include properties (both for nodes and relationships)
- Associate all your nodes with user-defined identifiers
- Express 5 Cypher query expressions
- Use each of the MATCH, OPTIONAL MATCH, RETURN, WITH, WHERE, and ORDER BY (sub)clauses at least once
- At least one query should be based on aggregation
- Requirements:
- Assume your script will be executed via cat queries.cypher | neo4j-shell --path database
- Describe the meaning of your Cypher expressions (using comments // ...)
- Make sure your script can be executed repeatedly without failures
- I.e. empty your database (remove the entire graph) at the end of your script
- Submission:
- queries.cypher: text file with a sequence of Cypher statements
- Software:
- Server: 147.32.83.196:22
- Deadline: Sunday 7. 1. 2018 until 23:59
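For illustration, queries.cypher might follow this sketch (labels, relationship types, and properties are hypothetical, and the full solution needs more nodes, relationships, and queries than shown here):

```cypher
// Sketch only: labels, relationship types, and properties are illustrative
CREATE (prg:Airport { id: 'PRG', name: 'Prague' }),
       (dxb:Airport { id: 'DXB', name: 'Dubai' }),
       (f:Flight { id: 'EK140', year: 2017 }),
       (f)-[:DEPARTS_FROM { gate: 'B4' }]->(prg),
       (f)-[:ARRIVES_AT]->(dxb);

// For each airport, count the departing flights in 2017, busiest first
MATCH (a:Airport)
OPTIONAL MATCH (f:Flight)-[:DEPARTS_FROM]->(a)
WHERE f.year = 2017
WITH a, count(f) AS departures
RETURN a.name, departures
ORDER BY departures DESC;

// empty the database at the end so the script can be re-executed
MATCH (n) DETACH DELETE n;
```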
Individual Topics
- Preferably, propose an original topic of your own
- You can also draw inspiration from the following topics (in alphabetical order):
Adresní místa,
Armáda,
Autobusové nádraží,
Autosalon,
Autoškola,
Banka,
Bankovní účet,
Bazar,
Bezpečnostní agentura,
Blog,
Botanická zahrada,
Burza,
Catering,
Cestovní agentura,
Cestovní kancelář,
Cukrárna,
Cvičiště pro psy,
Čajovna,
Čerpací stanice,
Dálniční poplatky,
Darování zážitků,
Deskové hry,
Diskuzní fórum,
Divadelní hry,
Divadlo,
Dodávka vody,
Docházkový systém,
Dopravní dispečink,
Dopravní nehody,
Dopravní podnik,
Dopravní uzavírky,
Doručování zásilek,
Dotační programy,
Elektronická evidence tržeb,
Elektronické recepty,
Evidence smluv,
Evidence součástek,
Evidence zaměstnanců,
Exekuce,
Farmářské trhy,
Filmy,
Finanční poradenství,
Finanční trhy,
Finanční úřad,
Fitness centrum,
Fotbalová liga,
Fotbalový tým,
Fotoalbum,
Galerie,
Golfové kluby,
Grantová agentura,
Hobby market,
Hodinový manžel,
Hokejová liga,
Horská služba,
Hotel,
Hrady a zámky,
Hudební festival,
Hudební nástroje,
Hudební produkce,
Jaderná elektrárna,
Jazyková škola,
Jazykové pobyty,
Jednání zastupitelstva,
Jeskyně,
Jídelníček,
Jízdenky na autobus,
Jízdní řády,
Kadeřnický salon,
Kamionová doprava,
Kasino,
Katastr nemovitostí,
Kavárna,
Kino,
Kniha jízd,
Knihkupectví,
Knihovna,
Konference,
Kravín,
Kuchařka,
Kurýrní služba,
Kurzy vaření,
Lékárna,
Léky,
Lesní školka,
Letecká společnost,
Letecká záchranná služba,
Letiště,
Letní tábor,
Logistická firma,
Logistické centrum,
Logistický sklad,
Lyžařská škola,
Lyžařský areál,
Mateřská škola,
Menzy,
Městská hromadná doprava,
Mobilní operátor,
Mobilní telefony,
Modely vláčků,
Multifunkční aréna,
Muniční sklad,
Muzeum,
Mýtné brány,
Nabídky dovolené,
Nabídky práce,
Nadnárodní společnost,
Národní park,
Nebankovní půjčky,
Nemocnice,
Nutriční hodnoty,
Obchodní centrum,
Obchodní rejstřík,
Očkování do ciziny,
Odevzdávání úkolů,
Online cvičení,
Online půjčovna seriálů,
Ordinace lékaře,
Orientační běh,
Osobní doklady,
Osobní trenér,
Parkoviště,
Pekařství,
Personální agentura,
Pěstounská péče,
Pizzerie,
Plánovací kalendář,
Plánování termínů schůzek,
Platební karty,
Plavecký bazén,
Počítačové hry,
Pohádky,
Pojišťovna,
Policejní databáze,
Politické strany,
Populární hudba,
Porodnice,
Poslanecká sněmovna,
Pošta,
Požární ochrana,
Pracovní úřad,
Prodej výtvarných děl,
Provoz metra,
Průmyslová zóna,
Předpověď počasí,
Přepravní kontrola,
Přírodní rezervace,
Přístupový systém,
Psychiatrická léčebna,
Půjčování kol po městě,
Půjčovna auta,
Půjčovna lodí,
Půjčovna svatebních šatů,
Realitní agentura,
Redakční systém,
Registr obyvatel,
Regulační poplatky,
Restaurace,
Rezervace letenek,
Rezervace místností,
Rezervace ubytování,
Rezervace v restauraci,
Rozvodná síť,
Rozvoz jídla,
Řízení letecké dopravy,
Řízení projektů,
Sázková kancelář,
Sbírka zákonů,
Sdílené cestování,
Síť bankomatů,
Síť multikin,
Skautské středisko,
Sklad nápojů,
Sklárna,
Sociální dávky,
Sociální síť,
Soudní řízení,
Spediční firma,
Společenství vlastníků jednotek,
Sportovní klub,
Sportovní turnaj,
Správa hřbitova,
Správa objektů,
Správce financí,
Srovnání elektrospotřebičů,
Srovnávač ubytování,
Státy světa,
Stavebnice lego,
Střední škola,
Studijní materiály,
Studijní systém,
Supermarket,
Světové dědictví,
Svoz a likvidace odpadů,
Symfonický orchestr,
Taneční škola,
Taxi služba,
Televizní program,
Televizní seriály,
Turistické cesty,
Turistický oddíl,
Turistický ruch,
Ubytování v soukromí,
Uprchlický tábor,
Válečné konflikty,
Včelař,
Vědecké projekty,
Vědecké publikace,
Velkochov drůbeže,
Velkoobchod,
Veřejná zeleň,
Veřejné zakázky,
Vesmír,
Vězení,
Videopůjčovna,
Virtuální prohlídky,
Víza,
Vlakové nádraží,
Vojenský prostor,
Volby,
Volnočasové kurzy,
Vozový park,
Vydavatelství novin,
Výkaz práce,
Výrobní procesy,
Vysokoškolská kolej,
Výstaviště,
Vývoj softwaru,
Vzdělávací instituce,
Webhosting,
Webový obchod,
Zábavní centrum,
Zahrádkářská kolonie,
Zahradnictví,
Zastavárna,
Zbraně,
Zdravotní pojišťovna,
Zdravotní úhrady,
Zemědělská výroba,
Zimní úklid komunikací,
Zoologická zahrada,
Zpravodajská služba,
Žákovská knížka,
Železniční síť
- Nevertheless, the following topics are not allowed this semester
- Movies, actors
- Airport, flights
Exam Requirements
NoSQL Introduction
- Big Data and NoSQL terms, V characteristics (volume, variety, velocity, veracity, value, validity, volatility), current trends and challenges (Big Data, Big Users, processing paradigms, ...), principles of relational databases (functional dependencies, normal forms, transactions, ACID properties); types of NoSQL systems (key-value, wide column, document, graph, ...), their data models, features and use cases; common features of NoSQL systems (aggregates, schemalessness, scaling, flexibility, sharding, replication, automated maintenance, eventual consistency, ...)
Data Formats
- XML: constructs (element, attribute, text, ...), content model (empty, text, elements, mixed), entities, well-formedness; document and data oriented XML
- JSON: constructs (object, array, value), types of values (strings, numbers, ...); BSON: document structure (elements, type selectors, property names and values)
- RDF: data model (resources, referents, values), triples (subject, predicate, object), statements, blank nodes, IRI identifiers, literals (types, language tags); graph representation (vertices, edges); N-Triples notation (RDF file, statements, triple components, literals, IRI references); Turtle notation (TTL file, prefix definitions, triples, object and predicate-object lists, blank nodes, prefixed names, literals)
- CSV: constructs (document, header, record, field)
- Protocol Buffers: components (interface description language, source code compiler, binary serialization format), intended usage; schema structure (messages, enumerations), message fields (rules, types, names, tags)
XML Databases
- Native XML databases vs. XML-enabled relational databases; data model (XDM): tree (nodes for document, elements, attributes, texts, ...), document order, reverse document order, sequences, atomic values, singleton sequences
- XPath language: path expressions (relative vs. absolute, evaluation algorithm), path step (axis, node test, predicates), axes (forward: child, descendant, following, ...; reverse: parent, ancestor, preceding, ...; attribute), node tests, predicates (path conditions, position testing, ...), abbreviations
- XQuery language: path expressions, direct constructors (elements, attributes, nested queries, well-formedness), computed constructors (dynamic names), FLWOR expressions (for, let, where, order by, and return clauses), typical FLWOR use cases (joining, grouping, aggregation, integration, ...), conditional expressions (if, then, else), switch expressions (case, default, return), universal and existential quantified expressions (some, every, satisfies), comparisons (value, general, node; errors), atomization of values (elements, attributes)
RDF Stores
- Linked Data: principles (identification, standard formats, interlinking, open license), Linked Open Data Cloud
- SPARQL: graph pattern matching (solution sequence, solution, variable binding, compatibility of solutions), graph patterns (basic, group, optional, alternative, graph, minus); prologue declarations (BASE, PREFIX clauses), SELECT queries (SELECT, FROM, and WHERE clauses), query dataset (default graph, named graphs), variable assignments (BIND), FILTER constraints (comparisons, logical connectives, accessors, tests, ...), solution modifiers (DISTINCT, REDUCED; aggregation: GROUP BY, HAVING; sorting: ORDER BY, LIMIT, OFFSET), query forms (SELECT, ASK, DESCRIBE, CONSTRUCT)
MapReduce
- Programming models, paradigms and languages; parallel programming models, process interaction (shared memory, message passing, implicit interaction), problem decomposition (task parallelism, data parallelism, implicit parallelism)
- MapReduce: programming model (data parallelism, map and reduce functions), cluster architecture (master, workers, message passing, data distribution), map and reduce functions (input arguments, emission and reduction of intermediate key-value pairs, final output), data flow phases (mapping, shuffling, reducing), input parsing (input file, split, record), execution steps (parsing, mapping, partitioning, combining, merging, reducing), combine function (commutativity, associativity), additional functions (input reader, partition, compare, output writer), implementation details (counters, fault tolerance, stragglers, task granularity), usage patterns (aggregation, grouping, querying, sorting, ...)
- Apache Hadoop: modules (Common, HDFS, YARN, MapReduce), related projects (Cassandra, HBase, ...); HDFS: data model (hierarchical namespace, directories, files, blocks, permissions), architecture (NameNode, DataNode, HeartBeat messages, failures), replica placement (rack-aware strategy), FsImage (namespace, mapping of blocks, system properties) and EditLog structures, FS commands (ls, mkdir, ...); MapReduce: architecture (JobTracker, TaskTracker), job implementation (Configuration; Mapper, Reducer, and Combiner classes; Context, write method; Writable and WritableComparable interfaces), job execution schema
NoSQL Principles
- Scaling: scalability definition; vertical scaling (scaling up/down), pros and cons (performance limits, higher costs, vendor lock-in, ...); horizontal scaling (scaling out/in), pros and cons, network fallacies (reliability, latency, bandwidth, security, ...), cluster architecture; design questions (scalability, availability, consistency, latency, durability, resilience)
- Distribution models: sharding: idea, motivation, objectives (balanced distribution, workload, ...), strategies (mapping structures, general rules), difficulties (evaluation of requests, changing cluster structure, obsolete or incomplete knowledge, network partitioning, ...); replication: idea, motivation, objectives, replication factor, architectures (master-slave and peer-to-peer), internal details (handling of read and write requests, consistency issues, failure recovery), replica placement strategies; mutual combinations of sharding and replication
- CAP theorem: CAP guarantees (consistency, availability, partition tolerance), CAP theorem, consequences (CA, CP and AP systems), consistency-availability spectrum, ACID properties (atomicity, consistency, isolation, durability), BASE properties (basically available, soft state, eventual consistency)
- Consistency: strong vs. eventual consistency; write consistency (write-write conflict, context, pessimistic and optimistic strategies), read consistency (read-write conflict, context, inconsistency window, session consistency), read and write quora (formulae, motivation, workload balancing)
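The read and write quorum formulae referenced above take the standard form, with N the replication factor, W the write quorum size, and R the read quorum size:

```latex
R + W > N \qquad\qquad W > \frac{N}{2}
```

The first condition guarantees that every read quorum overlaps every write quorum (strongly consistent reads); the second prevents two conflicting writes from both reaching a quorum.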
Key-Value Stores
- Data model (key-value pairs), key management (real-world identifiers, automatically generated, structured keys, prefixes), basic CRUD operations, use cases, representatives, extended functionality (MapReduce, TTL, links, structured store, ...)
- Riak: data model (buckets, objects, metadata headers); HTTP interface, cURL tool (options); CRUD operations (POST, PUT, GET, and DELETE methods, structure of URLs, data, headers), buckets operations (buckets, keys, properties); links (definition, headers, tags, link walking, navigational steps: bucket, tag and keep components), data types (Convergent Replicated Data Types: register, flag, counter, set, map; conflict resolution policies; usage restrictions), Search 2.0 Yokozuna (architecture; indexation and query evaluation processes; extractors: text, XML, JSON; SOLR document: extracted and technical fields; indexing schema: tokens, triples; full-text index creation, association and usage; query patterns: wildcards, ranges, ...); causal context (motivation, low-level techniques: timestamps, vector clocks, ...); vector clocks (logical clocks, vector of clocks, message passing); Riak Ring (physical vs. virtual nodes, consistent hashing, partitions, replica placement strategy, hinted handoff, handling of read and write requests)
- Redis: features (in-memory, data structure store), data model (databases, objects), data types (string, list, set, sorted set, hash), string commands (SET, GET, APPEND, SETRANGE, INCR, DEL, ...), list commands (LPUSH, RPUSH, LPOP, RPOP, LINDEX, LRANGE, LREM, ...), set commands (SADD, SISMEMBER, SUNION, SINTER, SDIFF, SREM, ...), sorted set commands (ZADD, ZRANGE, ZRANGEBYSCORE, ZINCRBY, ZREM, ...), hash commands (HSET, HMSET, HGET, HMGET, HKEYS, HVALS, HDEL, ...), general commands (EXISTS, KEYS, DEL, RENAME, ...), time-to-live commands (EXPIRE, TTL, PERSIST)
Wide Column Stores
- Data model (column families, rows, columns), query patterns, use cases, representatives
- Cassandra: data model (keyspaces, tables, rows, columns), primary keys (partition key, clustering columns), column values (missing; empty; native data types, tuples, user-defined types; collections: lists, sets, maps; frozen mode), additional data (TTL, timestamp); CQL language: DDL statements: CREATE KEYSPACE (replication options), DROP KEYSPACE, USE keyspace, CREATE TABLE (column definitions, usage of types, primary key), DROP TABLE, TRUNCATE TABLE; native data types (int, varint, double, boolean, text, timestamp, ...); literals (atomic, collections, ...); DML statements: SELECT statements (SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT clauses; DISTINCT modifier; selectors; non/filtering queries, ALLOW FILTERING mode; filtering relations; aggregates; restrictions on sorting and aggregation), INSERT statements (update parameters: TTL, TIMESTAMP), UPDATE statements (assignments; modification of collections: additions, removals), DELETE statements (deletion of rows, removal of columns, removal of items from collections)
Document Stores
- Data model (documents), query patterns, use cases, representatives
- MongoDB: data model (databases, collections, documents, field names), document identifiers (features, ObjectId), data modeling (embedded documents, references); CRUD operations (insert, update, save, remove, find); insert operation (management of identifiers); update operation: replace vs. update mode, multi option, upsert mode, update operators (field: $set, $rename, $inc, ...; array: $push, $pop, ...); save operation (insert vs. replace mode); remove operation (justOne option); find operation: query conditions (value equality vs. query operators), query operators (comparison: $eq, $ne, ...; element: $exists; evaluation: $regex, ...; logical: $and, $or, $not; array: $all, $elemMatch, ...), dot notation (embedded fields, array items), querying of arrays, projection (positive, negative), projection operators (array: $slice, $elemMatch), modifiers (sort, skip, limit); MapReduce (map function, reduce function, options: query, sort, limit, out); primary and secondary index structures (index types: value, hashed, ...; forms; properties: unique, partial, sparse, TTL)
Graph Databases
- Data model (property graphs), use cases, representatives
- Neo4j: data model (graph, nodes, relationships, directions, labels, types, properties), properties (fields, atomic values, arrays); embedded database mode; traversal framework: traversal description, order (breadth-first, depth-first, branch ordering policies), expanders (relationship types, directions), uniqueness (NODE_GLOBAL, RELATIONSHIP_GLOBAL, ...), evaluators (INCLUDE/EXCLUDE and CONTINUE/PRUNE results; predefined evaluators: all, excludeStartPosition, ...; custom evaluators: evaluate method), traverser (starting nodes, iteration modes: paths, end nodes, last relationships); Java interface (labels, types, nodes, relationships, properties, transactions); Cypher language: graph matching (solutions, variable bindings); query sub/clauses (read, write, general); path patterns, node patterns (variable, labels, properties), relationship patterns (variable, types, properties, variable length); MATCH clause (path patterns, WHERE conditions, uniqueness requirement, OPTIONAL mode); RETURN clause (DISTINCT modifier, ORDER BY, LIMIT, SKIP subclauses, aggregation); WITH clause (motivation, subclauses); write clauses: CREATE, DELETE (DETACH mode), SET (properties, labels), REMOVE (properties, labels); query structure (chaining of clauses, query parts, restrictions)
Advanced Aspects
- Graph databases: non/transactional databases, query patterns (CRUD, graph algorithms, graph traversals, graph pattern matching, similarity querying); data structures (adjacency matrix, adjacency list, incidence matrix, Laplacian matrix), graph traversals, data locality (BFS layout, matrix bandwidth, bandwidth minimization problem, Cuthill-McKee algorithm), graph partitioning (1D partitioning, 2D partitioning, BFS evaluation), graph matching (sub-graph, super-graph patterns), non/mining based indexing
- Transactions: business vs. system transactions, local vs. distributed transactions, optimistic vs. pessimistic offline locks
- Performance tuning: scalability goals (reduce latency, increase throughput), Amdahl's law, Little's law, message cost model
- Visualization: motivation, visualization types (scatter plot, matrix chart, network diagram, correlation matrix, dendrogram, bar chart, histogram, box plot, bubble chart, line graph, stack graph, pie chart, treemap, tag cloud, arc diagram, centralized burst, globe, radial chart)
- Polyglot persistence
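For reference, the two laws listed under performance tuning: Amdahl's law for the overall speedup S when a fraction p of the work is accelerated by a factor s, and Little's law relating the mean number of requests L in a system to the arrival rate λ and the mean residence time W:

```latex
S = \frac{1}{(1 - p) + \frac{p}{s}} \qquad\qquad L = \lambda W
```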
Recommended Literature
- Holubová, Irena - Kosek, Jiří - Minařík, Karel - Novák, David: Big Data a NoSQL databáze.
ISBN: 978-80-247-5466-6 (hardcover), 978-80-247-5938-8 (eBook PDF), 978-80-247-5939-5 (eBook EPUB).
Grada Publishing, a.s., 2015.
- Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled.
ISBN: 978-0-321-82662-6.
Pearson Education, Inc., 2013.
- Wiese, Lena: Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases.
ISBN: 978-3-11-044140-6 (hardcover), 978-3-11-044141-3 (eBook PDF), 978-3-11-043307-4 (eBook EPUB).
DOI: 10.1515/9783110441413.
Walter de Gruyter GmbH, 2015.
- Zomaya, Albert Y. - Sakr, Sherif: Handbook of Big Data Technologies.
ISBN: 978-3-319-49339-8 (hardcover), 978-3-319-49340-4 (eBook).
DOI: 10.1007/978-3-319-49340-4.
Springer International Publishing AG, 2017.