Data Intensive System Design — Effective Designing data models

Data Intensive System Design — Effective Designing data models

Data models are the most important part of developing software. As a developer, it is always key to design the data structure in such a way that it represents the real world scenario like people, actions, money flows etc.,

To represent these data models, in the relational database this is shown as tables where as in general purpose model it can be shown as JSON or XML or it can also be shown as graph. This helps us to query, search or manipulate and process the data in the form of this model. In complex applications, there are APIs which are used to manipulate the data and can be multiple layers of APIs as abstraction but beneath is the clean data model.

Relational Model vs Document Model

SQL is well known data model where data is organized into relations (tables) and each relation is an unordered collection of tuples ( rows). There are numerous other models that was proposed to replace the relational model but was not succeeded. The 2 main competitive approaches were network model and hierarchical model.

NoSQL which gained popularity from a conference raised its adoption which supported greater scalability with very high write throughput and its open source software with specialized query operations supported. It is more dynamic and expressive is one of the key reasons for getting into the race.

The main issue with current structure is the SQL data model requires translational layer to adopt it to the code and so many Object-relational mapping (ORM) frameworks has come into picture like Hibernate and ActiveRecord. Depending on the usecase for example you would like to store a resume, which should be made into rows and identify features that can be put as columns. But each one resume is different and do not have any universal adoption. It can be created with one-to-many or many-to-one relationships but it messes the output query with too many joins. Here document model can be used which is in the form of JSON.

If there are too many joins for many-to-many relationships, we use query optimizers to optimize the query and the code remains simple. But with document databases, it somewhat follows hierarchical model of storing nested records. However, when it comes to many to one and many-to-many relationships both are not different. In both cases, the related item is referenced by a unique identifier which is called a foriegn key / document reference

When should each one be used?

If shredding a document like structure into multiple tables with joins is complex then it can lead to unnecessary schemas and unnecessarily complicated application code.

At the same time, the document model has limitations. If there are too many nested items within the document, then it seems more like hierarchial model which refers data with the access path. As long as it is not deeply nested document model can be used. Document database can be well used to record the events occurred at a given point of time.

It is always good to evaluate if the denormalizing helps by reducing the joins but the application again should keep the data consistent so the complexity is moved to the application code.

For highly interconnected data, the document model is awkward, the relational model is acceptable and graph models are more natural.

Schema in document model is little flexible which means it is not schema less but is schema-on-read like dynamic runtime type checking and schema-on-write like static compile time type checking. For example, if you want to change the name field which has full name to first name and last name then going forward you will modify your code to store with new fields and handle all the old records in your code to split the field. But with relational there is significant downtime to adopt the changes and also we have to use the default fields as NULL so it does not throw error for existing data.

If you want to access large part of the document, then it is useful to be on document model. If the data is split across multiple tables then multiple index lookups are required that needs more disk seeks and take more time.

Relational and document databases are becoming more similar overtime, if the database is able to handle the document like data and also perform relational queries on it. It is the best route.

Query languages for Data

Relational model is declarative query language where we specify the pattern of the data and what conditions the result should meet and how the data should be transformed. It is upto the query optimizers to decide the indexes and join methods to use. this is more consice and simple. It also allows for parallel execution, because we are only specifying the pattern of the results and not the algorithm that is used to determine the results.

MapReduce is a programming model for processing large amounts of data in bulk across many machines. This is not declarative or imperative query API. It is based on the map ( also known as collect) and reduce ( also known as fold or inject) functions that exist in many functional programming languages.

map function is called once for every document that matches query and emit a key and value which is grouped by key and the reduce function is called once with same key and adds up the number of animals. Final output is written to the collection. the map and reduce functions cannot perform any additional database queries. It can also be used for distributed execution on a cluster of machines. MongoDB also introduced a support for declarative query language called aggregation pipeline which has plain English sentence style syntax.

A graph consists of vertices and edges. There are different ways we can structure and query data in graphs. Some of the models are property graph model and triple store model.

Property graph model which has vertex —unique identifier, outgoing edges, incoming edges and collection of properties. Edges — unique identifier, tail vertex where edge starts, head vertex where edge ends, a label to describe relationship, collection of properties. Cypher Query language for property graphs.

We can also represent in SQL queries with recursive common table expressions like below

Same query of 4 lines written in 29 lines.

Triple Store model is mostly equivalent to property graph model but uses different words to describe same ideas. Here the information is stored in the form of very simple three-part statements subject, predicate and object

There is also RDF model which takes same information but the format is similar to the XML. We use SPARQL query language for triple stores using the RDF data model. The query becomes as follows

Datalog is similar but older than SPARQL or CYPHER. Datalog’s data model is similar to the triple store model but instead of writing triple we write it as Predicate(subject, object)

We define rules that tells the database about new predicates within_recursive and migrated. The rules can refer to other rules just like functions can call other functions or recursively call themselves. A rule applies if the system can find a match for all predicates on the righthand side of the :- operator.

The datalog approach requires different kind of thinking to the other query languages but it is powerful approach, because rules can be combined and reused in different queries.

To Summarize, Document databases target use cases where data comes in self contained documents and relationships between one document and another are rare. Graph databases go in opposite direction, targeting use cases where anything is potentially related to everything.

All three models — document, relational and graph are widely used today and each is good in its respective domain.

References: