Designing Data-Intensive Applications summary — preface and chapter 1

Bimala Bogati
7 min read · Sep 30, 2022

Preface:
Some buzzwords relating to the storage and processing of data include:
NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time!

Some developments in the last decade:
Most companies deal with huge volumes of data and are creating tools to handle that scale efficiently.

short development cycles and flexible data models

popularity of free and open source software

increase in parallel computing as multi-core processors became common

distributed systems made practical by IaaS offerings such as AWS

highly available services

Data-intensive: the primary challenge is the quantity and complexity of data, or the speed at which it is changing.

Compute-intensive: CPU cycles are the bottleneck.

Software engineers and architects need a precise understanding of various technologies and their trade-offs if we want to build good applications. Understanding the buzzwords is just the first step.

This book is for those who develop applications that involve some kind of backend for storing or processing data. If you are responsible for choosing the tools for a given problem and figuring out how best to apply them, this book is for you.

This book is about the basic principles that remain true no matter which technologies you use. It digs into the internals of these systems, discusses their principles and the trade-offs they have to make, and teaches an approach for thinking about data systems.

It will help you decide which technology is appropriate for which purpose, and understand how different tools can be combined to form the foundation of a good architecture.

This book is valuable if you want to:
make data systems scalable, e.g., supporting web/mobile apps with millions of users
make applications highly available
keep systems maintainable as they grow and requirements and technology change
satisfy a natural curiosity about how things work and what goes on inside major websites and online services

“This book has a bias toward free and open source software (FOSS) because reading, modifying, and executing source code is a great way to understand how something works in detail.” (Preface, p. xiv)

Part 1: Foundations of Data Systems

Chapter 1: Reliable, Scalable, and Maintainable Applications

Many applications today are data-intensive, which makes the amount of data, the complexity of data, and the speed at which it is changing the big problems.

Building blocks of data-intensive applications:
1. store data so it can be found again later (databases)
2. remember the results of expensive operations for fast reads (caches)
3. filter or search data by keyword (search indexes)
4. handle message passing asynchronously (stream processing)
5. periodically process huge amounts of accumulated data (batch processing)

Stream processing handles never-ending data streams gracefully and naturally; it is also called real-time analytics, streaming analytics, complex event processing, real-time streaming analytics, or event processing.

How to think about data systems?
A database and a message queue have different access patterns, which means different performance characteristics and different implementations.
Nowadays, many applications fulfill their business requirements by stitching several tools together with application code.
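That stitching can be sketched as an application-managed cache in front of a database (the cache-aside pattern). In this sketch the "database" and "cache" are plain dicts standing in for real systems, and all names are illustrative, not from the book:

```python
# Minimal sketch: application code stitching a cache and a database
# together (cache-aside). The dicts stand in for a real DB and a real
# cache such as memcached or Redis.

db = {"user:1": {"name": "Ada"}}   # stand-in for the system of record
cache = {}                          # stand-in for an in-memory cache

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:                # fast path: serve from cache
        return cache[key]
    record = db.get(key)            # slow path: read from the database
    if record is not None:
        cache[key] = record         # populate the cache for later reads
    return record

def update_user(user_id, record):
    key = f"user:{user_id}"
    db[key] = record                # write to the system of record
    cache.pop(key, None)            # invalidate the stale cache entry
```

Notice that keeping the cache correct on writes is the application's job here; that hidden responsibility is exactly why the book says you have effectively created a new, special-purpose data system.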

Questions to ask when designing a data system:
1. When things go wrong internally, how do you ensure that data remains correct and complete?
2. When parts of the system are degraded, how do you provide consistently good performance to clients?
3. How do you scale to handle an increase in load?
4. What does a good API for the service look like?

3 important concerns in most software systems:
1. Reliability

A fault is not the same as a failure: a fault is one component deviating from its spec, whereas a failure is the system as a whole no longer providing its service. Fault-tolerant systems can be tested by deliberately triggering faults, e.g., randomly killing processes without warning; the Netflix Chaos Monkey does this continually, which gives you confidence that faults are handled naturally. In security matters, however, prevention is better than cure.

Hardware faults:
System failures can be caused by hard disk crashes, faulty RAM, a power grid blackout, or someone unplugging the wrong network cable. These happen all the time in large data centers. The mean time to failure (MTTF) of a hard disk is 10 to 50 years.
Multi-machine redundancy used to be required only by a small number of applications for which high availability was essential. As applications use ever larger numbers of machines to cope with huge data volumes, the rate of hardware faults increases proportionally. There is therefore a move toward software fault-tolerance techniques in addition to hardware redundancy, because it is fairly common for virtual machine instances to become unavailable without warning.
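A quick back-of-the-envelope calculation shows why scale turns rare faults into a daily event (assuming independent failures and taking the MTTF figure at face value):

```python
# With an MTTF of 10-50 years per disk, a cluster of 10,000 disks
# should expect very roughly one disk failure per day on average.

def expected_failures_per_day(num_disks, mttf_years):
    """Expected disk failures per day, assuming independent failures."""
    return num_disks / (mttf_years * 365)

low = expected_failures_per_day(10_000, 50)   # optimistic MTTF -> ~0.55/day
high = expected_failures_per_day(10_000, 10)  # pessimistic MTTF -> ~2.74/day
```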

Software errors:
Unlike hardware faults, software errors can be correlated across nodes. For example:
a software bug that causes every instance of an application server to crash when given a bad input
a runaway process that uses up a shared resource such as CPU time, memory, disk space, or network bandwidth
a service that the application depends on slows down, becomes unresponsive, or starts returning corrupted responses
cascading failures
Such errors often occur when an assumption the software makes about its environment stops being true.

Human Errors
One study found that configuration errors by operators are the leading cause of outages, whereas hardware faults were responsible for only 10–25% of outages.

How do we make systems reliable given that human beings are unreliable?
1. Design abstractions, APIs, and admin interfaces that make it easy to do the right thing and discourage the wrong thing. I would probably keep a list of these and distribute it to the team.
2. Provide sandbox environments where people can make mistakes, explore, and experiment safely using real data, without affecting real users.
3. Test thoroughly at all levels: unit tests, whole-system integration tests, and manual tests.
4. Allow quick and easy recovery from human errors.
5. Set up detailed monitoring (called telemetry in other engineering disciplines); it detects early warning signals and lets us check whether any assumptions or constraints are being violated.
6. Adopt good management practices and training.

2. Scalability

Describing Load:
Load is described with load parameters; the best choice of parameters depends on the architecture. Examples:
requests per second to a web server
the ratio of reads to writes in a database
the number of simultaneously active users in a chat room
the hit rate on a cache

Twitter's read volume is roughly two orders of magnitude higher than its write volume, so Twitter decided to do more work at write time than at read time. An exception is made for celebrities with large followings, because a single post by a celebrity would trigger an enormous number of writes; their tweets are instead fetched and merged in at read time.
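The write-time fan-out with a celebrity exception might be sketched like this. The follower graph, the timeline stores, and the celebrity threshold are all illustrative assumptions, not figures from the book:

```python
# Sketch of Twitter-style fan-out on write: a new post is pushed into
# each follower's home-timeline cache, except for celebrities, whose
# posts are merged in at read time instead.

CELEBRITY_THRESHOLD = 1_000_000  # illustrative cutoff, not from the book

followers = {"alice": ["bob", "carol"]}      # who follows whom
follower_counts = {"alice": 2}
timelines = {"bob": [], "carol": []}         # per-user home-timeline caches
celebrity_posts = []                         # pulled in lazily at read time

def post(author, text):
    if follower_counts.get(author, 0) >= CELEBRITY_THRESHOLD:
        celebrity_posts.append((author, text))   # cheap write, costlier read
    else:
        for f in followers.get(author, []):      # fan out at write time
            timelines[f].append((author, text))

def home_timeline(user):
    # Merge the precomputed timeline with any celebrity posts.
    return timelines.get(user, []) + celebrity_posts
```

The design choice is a trade-off: fan-out on write makes reads cheap at the cost of expensive writes, which breaks down exactly at the celebrity hot spot, hence the hybrid.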

Describing Performance:
Once the load on your application is described, the next question is what happens when the load increases:
how is performance affected, and how many more resources do you need to keep performance unchanged?
Latency and response time are not the same: response time is what the client sees (including network and queueing delays), while latency is the duration a request spends waiting to be handled.

Requests never all take the same time. Variation comes from factors such as a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack, and many other causes.
The mean is not a good metric for response time because it does not tell you how many users actually experienced a delay; percentiles are a better measure.
High percentiles of response time, also known as tail latencies, are crucial because they directly affect users' experience.
Amazon states its response time requirements in terms of the 99.9th percentile, even though that affects only 1 in 1,000 requests. Amazon's data shows that a 1-second slowdown reduces a customer satisfaction metric by 16%.
Random events that you have no control over make it hard to reduce response times at very high percentiles.
Queueing delays account for a large part of response time at high percentiles.

Approaches for Coping with Load
The scalability of a system is highly specific to the application. The factors involved are the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, and the access patterns (e.g., sequential or random). The practice is to identify the common operations before designing an architecture. In early-stage startups it is common to iterate quickly on product features rather than to scale for a hypothetical future load.

3. Maintainability

The majority of the cost of software comes from ongoing maintenance: fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
For maintainable software, three things are important:
operability, simplicity, and evolvability (also called extensibility, modifiability, or plasticity).
Symptoms of complexity include: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more. There is also a high chance of introducing a bug when changes are made to a complex codebase.
High-level programming languages can be seen as abstractions that hide machine code, CPU registers, and syscalls. SQL hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. This book takes the approach of extracting parts of a large system into well-defined, reusable components.

Systems evolve over time for many reasons: new ideas emerge, new facts and previously unanticipated use cases come to light, business priorities change, users request new features, the underlying platform is replaced, legal or regulatory requirements change, the architecture changes, and so on. Agile practices such as TDD are useful in a frequently changing environment, and simple, easy-to-understand systems are easier to modify.

There is no easy fix for making applications reliable, scalable, or maintainable, but certain patterns and techniques reappear across different kinds of applications.


M-F: 4pm-8pm, Sat/Sun: 7a.m.-7p.m. I teach coding for AP/undergrads; reach me at bimalacal2022@gmail.com. How about we build/learn/discover/engineer something new today?