Designing Data-Intensive Applications summary: Chapter 2

Bimala Bogati
7 min read · Oct 2, 2022

Part 1. Foundations of Data Systems

The summary of Part 1, Chapter 1 can be found here.

Chapter 2. Data Models and Query Languages

Data models play an important role in how we think about the problem we are solving. Real-world entities such as people, organizations, goods, actions, and sensors are modeled as objects and data structures, along with APIs that manipulate those data structures. To store those data structures, developers express them in terms of a data model such as JSON or XML documents, tables in a relational database, or a graph model. Such data models are in turn represented as bytes in memory, on disk, or on a network; that representation allows the data to be queried, searched, manipulated, and processed in various ways. At lower levels still, hardware engineers represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more.

Every data model embeds assumptions about how it is going to be used. Some operations are fast while others are very slow; some kinds of usage are easy while others are not supported at all; some data transformations feel natural while others are awkward.

Relational Model Versus Document Model

In the relational model, by far the most popular, data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).
Relational databases have their roots in business data processing performed on mainframe computers in the 1960s and '70s. Other databases of the '70s forced developers to think in terms of the internal representation of the data in the database. The relational model introduced the idea of hiding that implementation detail behind a cleaner interface.

Data storage and querying have been areas of high interest for many years. In the '80s the relational model came to dominate the network model and the hierarchical model. As computers became more powerful and networked, relational databases proved able to serve increasingly diverse purposes. Even today the relational model has a firm grip on online publishing, social networking, e-commerce, SaaS productivity applications, and much more.

The Birth of NoSQL

The origins of NoSQL stem from the need for greater scalability, a preference for free and open source software, specialized query operations not well supported by the relational model, and frustration with the restrictiveness of relational schemas.

Polyglot persistence seems likely in the near future, meaning continued use of relational databases alongside non-relational datastores.

The Object-Relational Mismatch

The SQL data model requires a developer to write a translation layer so that the objects created in an object-oriented application language can be understood by the database's model of tables, rows, and columns. This disconnect is sometimes called an impedance mismatch, and object-relational mapping (ORM) frameworks are used to write such translation layers.
For a self-contained document like a résumé, a JSON representation is appropriate. MongoDB, RethinkDB, CouchDB, and Espresso are document-oriented databases that support a JSON-like document model. The lack of a schema is often considered an advantage by developers when JSON is used as a data encoding format, and the JSON representation also has better locality: the one-to-many relationships in the application are made explicit by the document's tree structure.
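A minimal sketch of such a self-contained document (the field names and values here are illustrative, not a prescribed schema): the one-to-many relationships form a tree, so a single read retrieves everything.

```python
import json

# A self-contained résumé document: one person, many positions and
# education entries. The tree structure makes the one-to-many
# relationships explicit.
resume = {
    "user_id": 251,
    "first_name": "Bill",
    "positions": [
        {"job_title": "Co-chair", "organization": "Gates Foundation"},
        {"job_title": "Co-founder", "organization": "Microsoft"},
    ],
    "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
}

# One read retrieves the whole tree; no joins across tables are needed.
doc = json.dumps(resume)
loaded = json.loads(doc)
print([p["organization"] for p in loaded["positions"]])
```

In a relational schema, the same data would be spread over a users table, a positions table, and an education table, joined back together at query time.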

Many-to-One and Many-to-Many Relationships
Document databases suit one-to-many tree structures, where joins are not needed. Their support for joins is often weak, so joins must be emulated in application code; the work of implementing the join is effectively shifted from the database to the application. As features are added, the relationships in the data change too, and at that point a document database may no longer fit the use case.
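What "emulating a join in application code" looks like can be sketched as follows (the data and field names are hypothetical): the application, not the database, performs one lookup per record.

```python
# Hypothetical many-to-one data: users reference organizations by id.
organizations = {1: {"name": "Gates Foundation"}, 2: {"name": "Microsoft"}}
users = [
    {"name": "Bill", "org_id": 2},
    {"name": "Melinda", "org_id": 1},
]

# The "join" runs in application code: one dictionary lookup per user.
joined = [
    {"name": u["name"], "organization": organizations[u["org_id"]]["name"]}
    for u in users
]
print(joined)
```

In a relational database, a single `JOIN` clause would do this, and the query optimizer would pick the access path.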

Are Document Databases Repeating History?

How to best represent the many-to-many relationships in database?
This debate is much older than NoSQL. IBM's Information Management System (IMS) used a simple data model called the hierarchical model (quite similar to JSON), which made many-to-many relationships and joins difficult. Of the various solutions proposed to resolve the limitations of the hierarchical model, two prominent models emerged: the relational model and the network model.
The relational model made life easier for developers, who had to write less code, while the network model forced developers to write detailed code for querying and updating the database that was complicated and inflexible. Changing an application's data model was hard in the network model, whereas the relational model rests on a very simple idea: lay the data out in the open as relations (tables), each a collection of tuples (rows). There are no complicated access paths to follow as in the network model; you simply designate a column as a key to look up a record. The key difference is that in the relational model the developer does not write code for access paths; the query optimizer chooses them automatically.
Query optimizers consumed years of research and development effort, but once they succeeded, every database built on one benefits. You only need to build the query optimizer once, and it replaces the access-path code you would otherwise have to write for every application.

Document databases resemble the hierarchical model in one way: they store nested records within their parent record rather than in a separate table.

Pros of relational databases
-support for joins
-support for many-to-one and many-to-many relationships

Pros of document databases
-better performance because of locality (the tree of one-to-many relationships means the entire document is loaded in one read)
-data model has schema flexibility

which data model leads to a simpler application code?
It depends on the kinds of relationships that exist between data items. The document model is an awkward choice for highly interconnected data.

Schema flexibility in the document model
Document databases have an implicit schema, but it is not enforced by the database. Schema-on-read (analogous to dynamic, runtime type checking) means the structure of the data is implicit and only interpreted when the data is read. Schema-on-write (analogous to static, compile-time type checking), the approach of the relational model, means the schema is explicit and the database ensures all written data conforms to it.
Schema-on-read is useful when the data is heterogeneous, i.e. there are many different types of objects and it is not practical to put each type in its own table. It is also useful when the structure of the data is determined by external systems that you have no control over.
In use cases where you expect all records to have the same structure, schemas are useful for documenting and enforcing that structure.
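The schema-on-read idea can be sketched with a hypothetical reader that interprets the structure at read time, so old and new document shapes coexist without a migration:

```python
def full_name(doc):
    # Schema-on-read: the structure is only interpreted when the data
    # is read. Older documents may have a single "name" field; newer
    # ones split it into first and last name (illustrative fields).
    if "first_name" in doc:
        return f'{doc["first_name"]} {doc["last_name"]}'
    return doc["name"]

old_doc = {"name": "Ada Lovelace"}
new_doc = {"first_name": "Ada", "last_name": "Lovelace"}
print(full_name(old_doc), "|", full_name(new_doc))
```

Under schema-on-write, the same change would require an explicit schema migration before the new shape could be stored.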

Data locality for queries
A document is stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (e.g. MongoDB's BSON). If an application needs to access an entire document, there is a performance advantage to this storage locality. If the data were split across multiple tables, as in a relational database, multiple index lookups would be required to retrieve it all, requiring more disk seeks and more time.

The suggested approach for taking full advantage of locality is to keep documents fairly small and to avoid writes that increase the size of a document. Grouping related data together for locality is not unique to the document model: Google's Spanner database offers the same locality properties in a relational data model by allowing the schema to declare that a table's rows should be nested within a parent table. Oracle's multi-table index cluster tables allow the same, and the column-family concept in the Bigtable data model also serves the purpose of managing locality.

A hybrid of the relational and document models, where a database can handle document-like data and also perform relational queries on it, is a good route for databases to take in the future!

Query Languages for Data
IBM's Information Management System (IMS) and CODASYL queried the database with imperative code, while SQL is a declarative query language. An imperative language fixes the order of certain operations; in a declarative query language, you don't tell the computer how to achieve the goal. It is up to the database system's query optimizer to decide which indexes and join methods to use, and in which order to execute the various parts of the query. Declarative languages also lend themselves to parallel execution, because no order of operations is enforced, and queries don't need to be changed to benefit from performance improvements, since the implementation details of the database engine are hidden.
Declarative query languages are not limited to databases; they are a good fit for the web, too (CSS, for example, is declarative).
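The contrast can be sketched in plain Python (a toy filter, not real SQL): the imperative version fixes the iteration order and bookkeeping, while the comprehension only states what is wanted.

```python
animals = [
    {"name": "shark", "family": "Sharks"},
    {"name": "dolphin", "family": "Mammals"},
]

# Imperative: spell out *how* to iterate and collect, fixing the order
# of operations.
sharks = []
for a in animals:
    if a["family"] == "Sharks":
        sharks.append(a["name"])

# Declarative in spirit: state *what* you want and let the runtime
# decide how to produce it.
sharks2 = [a["name"] for a in animals if a["family"] == "Sharks"]
print(sharks, sharks2)
```

A real declarative query goes further: a query optimizer is free to reorder, parallelize, or use an index, because the query never specified how.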

MapReduce Querying
MapReduce is a programming model, popularized by Google, for processing huge amounts of data in bulk across many machines. It is a fairly low-level programming model for distributed execution on a cluster of machines, somewhere in between a declarative query language and a fully imperative query API: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework.
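A single-machine sketch of the model (the data is hypothetical; a real framework would shard the map and reduce steps across machines):

```python
from collections import defaultdict
from functools import reduce

observations = [
    {"family": "Sharks", "num_animals": 3},
    {"family": "Sharks", "num_animals": 4},
    {"family": "Mammals", "num_animals": 1},
]

# map: a snippet of code emits (key, value) pairs from each record
mapped = [(o["family"], o["num_animals"]) for o in observations]

# shuffle: the framework groups all values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: another snippet combines each group's values into one result
totals = {key: reduce(lambda a, b: a + b, vals) for key, vals in groups.items()}
print(totals)  # {'Sharks': 7, 'Mammals': 1}
```

The map and reduce snippets are the imperative part; where and in what order they run is left to the framework, which is the declarative part.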

Graph-Like Data Models
The relational model can handle simple cases of many-to-many relationships, but if the connections within your data grow more complex, a graph-like model is the right pick. A graph allows completely different types of objects to be stored in a single datastore, and it has two kinds of objects: vertices and edges.

Cypher is a declarative query language for property graphs, such as those in Neo4j, which consist of vertices and edges. You can think of the data as two relational tables: one for vertices and one for edges.
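The two-table view can be sketched in Python (vertices, edges, and labels here are illustrative, loosely echoing the book's "born in / within" example):

```python
# Two "tables": one for vertices, one for edges, as in a property graph.
vertices = {
    1: {"label": "Person", "name": "Lucy"},
    2: {"label": "City", "name": "London"},
    3: {"label": "Country", "name": "England"},
}
edges = [
    {"tail": 1, "head": 2, "label": "BORN_IN"},
    {"tail": 2, "head": 3, "label": "WITHIN"},
]

def outgoing(vertex_id, label):
    """Follow all edges with a given label out of a vertex."""
    return [e["head"] for e in edges
            if e["tail"] == vertex_id and e["label"] == label]

# Traverse: person -> BORN_IN -> city -> WITHIN -> country
city = outgoing(1, "BORN_IN")[0]
country = outgoing(city, "WITHIN")[0]
print(vertices[country]["name"])  # England
```

A Cypher query would express the same traversal declaratively as a pattern, leaving the traversal order to the database.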

SPARQL is a query language for triple-stores. The triple-store model is similar to the property graph model, but instead of vertices and edges, all information is stored in the form of very simple three-part statements: (subject, predicate, object). Triples can be a great choice as an internal data model for applications even if you have no interest in publishing RDF data on the semantic web.
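A tiny triple-store sketch (hypothetical data and a pattern-matching helper, not SPARQL itself): every fact, whether an edge-like relationship or a property, is just a three-part statement.

```python
# All information as (subject, predicate, object) triples.
triples = [
    ("lucy", "born_in", "idaho"),
    ("idaho", "within", "usa"),
    ("lucy", "age", 33),
]

def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard,
    like a variable in a SPARQL triple pattern."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

print(match(predicate="within"))  # [('idaho', 'within', 'usa')]
```

A SPARQL query is essentially a conjunction of such patterns with shared variables, solved by the store.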

Turtle is a human-readable format for Resource Description Framework (RDF) data; RDF is sometimes also written in an XML format. RDF's intent was to let different websites publish data in a consistent format, allowing data from different sites to be automatically combined into a web of data.

Datalog is a foundational query language, much older than SPARQL or Cypher, and other query languages build on its ideas. It requires a different mode of thinking than the other query languages, but it is a powerful one, because rules can be combined and reused in different queries. Rules in Datalog can refer to other rules, just as functions can call other functions or themselves.
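The power of a self-referential rule can be sketched by computing a transitive closure in Python, mimicking a recursive Datalog rule (the place names are illustrative; the rule in the docstring is the Datalog being imitated):

```python
# Base facts, as a Datalog program would state them: within(gillingham, kent), ...
within = {("gillingham", "kent"), ("kent", "england"), ("england", "uk")}

def within_recursive(facts):
    """Transitive closure, imitating the recursive rule
    within_recursive(X, Z) :- within(X, Y), within_recursive(Y, Z)."""
    closure = set(facts)
    changed = True
    while changed:  # iterate until no new derived facts appear
        changed = False
        for (a, b) in list(closure):
            for (c, d) in facts:
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

print(("gillingham", "uk") in within_recursive(within))  # True
```

In Datalog the rule itself is the whole program; the evaluation loop above is what the Datalog engine does for you.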

Data models are a complex research subject, and much remains to be explored for specialized problems, but what we learned in this chapter should give you a good understanding of which data model best fits your application's requirements.
