Designing Data-Intensive Applications summary-chapter 3
Part 1: Foundations of Data Systems
Chapter 3: Storage and Retrieval
summary for chapter 2 can be found here
Chapter 2 was all about Application developer giving user’s data to the database in certain format so that it could be asked for later!
In this chapter we will learn the same but from the database’s point of view, how data can be stored by the database and how it can be found when asked. Even though an app developer may not necessarily need to know how to implement a database from scratch but it can be helpful in choosing the right storage engine when you understand what goes behind the scene.
Storage engines have lot of differences in terms of optimization depending upon whether the use case is for transactional workloads or for analytics. The two families of storage engines that we will talk about are log-structured storage engines and page-oriented storage engines such as B-trees.
Data Structures That Power Your Database
A simple implementation of database is as simple as a text file. Databases internally use log, an append-only data file. However, we need to be able to find the key in db efficiently. Indexing is the answer. Index is an additional data structure that is derived from the primary data. Index is good for reads, but it is hard to beat the performance of log for writes where we simply append at the end of the file.
Indexes for key-value data can simply be created by keeping an in-memory hash map where every key is mapped to a byte offset in the data file- the location at which value can be found.
Storage engine like Bitcask is suitable where value for each requires frequent updating. for ex. counts.
How can we handle the eventual run out of disk space from append to file approach?
break log into segments of specific size by closing a segment file when it reaches certain size and making subsequent writes to a new segment file, after that we can perform compaction on those segments. Compaction is the process of throwing away the duplicate keys and keeping only the most recent update for each key. segments are never modified after they have been written.
CSV are not the best format for log. Bitcask speeds up the crash recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.
why not update a file in place by overwriting the old value with new value instead of compaction?
1.appending and segment merging are sequential write operations which are generally much faster than random writes, especially on magnetic spinning-disk hard drives.
2.concurrency and crash recovery are much simpler if segment files are append only or immutable.
limitations o hash table indexes:
1.hash table must fit in memory and causes issue with large number of keys. maintaining hash map on disk is very difficult.
2.range queries are not efficient
SSTables and LSM-Trees
SSTable is short form of Sorted String Table. the order of the key-value pairs didn’t matter in log-structured storage segment. In SSTable, we require that the sequence of key-value pairs is sorted by key.
Advantages over log:
1.Merging segments is simple and efficient, even if files are bigger than the available memory. Algorithm is similar to the one used in merge-sort algorithm.
2.Sorted keys makes it easier to find the keys easier even if you will not have index of all keys in memory, but you will still need indexes of some keys to get offsets but it can be sparse: 1 key for every few kilobytes of segment file is enough as a few kilobytes can be scanned very quickly.
3.compression reduces I/O bandwidth use as it is made possible by range of records access.
Constructing and maintaining SSTables
B Trees allows to achieve maintaining a sorted structure on disk. red-black trees or AVL trees can be used achieve sorting in memory. However at the times of crash, we have to keep a separate log on disk to which every write is immediately appended and can be restored back!
Making an LSM-tree out of SSTables
LSM-Tree is the short form of Log-Structured Merge-Tree building on earlier work of log-structured filesystems. Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines.
Lucene, an indexing engine for full text search used by Elasticsearch and Solr also makes use of similar method like that of LSM-Tree.
LSM-Trees can be slow for accessing a key that does not exist in db. Bloom filters can be used for optimizing such access. LevelDB and RocksDB use leveled compaction . HBase uses size-tiered. Cassandra supports both! sorted data in LSM-Trees makes it easy for range queries giving performance boost!
B-trees are most widely used indexing structure.
B-tree is a balanced tree which means that a B-tree with n keys always has a depth of O(log n). Most db can fit into a B-Tree that is 3 or 4 levels deep. A 4 level tree of 4Kb pages with a branching factor of 500 can store upto 256TB. The number of references to child pages in 1 page of the B-tree is called the branching factor.
Making B-Trees reliable
Write operation in B-Tree is to overwrite a page on disk with new data. to make db resilient to crashes, B-Tree implementations include an additional data structure on disk: a write-ahead log(WAL, aka redo log)
LSM-trees are faster for writes and B-trees are faster for reads.
Other Indexing Structures
It is common to have secondary indexes. They are useful for performing joins efficiently. multi-column indexes are used to query multiple columns of a table simultaneously. It is particularly important for geospatial data. A standard B-tree or LSM-tree is unable to answer the query efficiently where you will need ti query multiple fields in the document or or columns in table simultaneously. Most common specialized spatial indexes are R-trees. PostGIS implements geospatial indexes as R-trees using PostgreSQL’s generalized search tree indexing facility. 3 dimensional index are sometimes used to search for products in a certain range of colors. with one dimensional index you would have to go through all rows, 2d indexes narrow down the searches.
Fuzzy querying requires different techniques unlike the query we discussed so far that look for the exact data. Full-text search allows a search for one word to be expanded to include synonyms of the word, to ignore grammatical variations of words and to search for occurrences of words near each other in the same document and support various other features that depend on linguistic analysis of the text. Lucene is able to search text for words within a certain edit distance to cope with typos in documents or queries.
Keeping everything in memory
The data structures discussed so far resolves the limitations of disks. Even though disk has some disadvantage, they are durable and have a lower cost per gigabyte than RAM. Many datasets are not too huge, so they can entirely be stored in memory rather than having to deal with disks. This can be attained by the use of in-memory databases such as Memcached(only intended for caching and it should be acceptable if data is lost when machine restarts). Other in memory databases however aim for durability that can be achieved by special hardware. VoltDB, MemSQL, and Oracle TimesTen are in-memory databases with a relational model and have big performance boost as overheads associated with managing on-disk data structures are removed. In memory databases provide data models that are difficult to implement with disk-based indexes. Ex. Redis offers a db-like interface to various data structures such as priority queues and sets. In memory database can support large datasets by adopting approaches like LRU for eviction policies.
Transaction Processing or Analytics?
In early days computers were mostly used for commercial transactions such as making a sale, placing an order with supplier, paying an employees salary etc. with growing data, database has been used or accessed for data analytics. Data analytics purpose db has a very different access patterns. they need to scan over huge number of records and deriving statistical insights from them.
OLTP: Online Transaction Processing
OLAP: Online Analytic Processing
Courtesy: pg 91
A data warehouse is a separate db designed for analysts to run query without affecting OLTP operations. Data is extracted from OLTP databases transformed into analysis-friendly schema, cleaned up, and then loaded into the data warehouse. It contains the read-only copy of the data. It can be optimized for analytic access patterns.
Stars and Snowflakes: Schema for Analytics
Unlike diverse data models available for transaction processing, analytics has much lesser diversity of data models. one of them is star schema.(aka dimensional modeling). Variation of star model is called snowflake schema, but star schemas are preferred for their easy use.
Most OLTP database, data is stored in row-oriented fashion: all the values from one row of a table are stored next to each other. Document db stores entire document as one contiguous sequence of bytes. Such storage design does not fit for all use cases where you don’t necessarily have to load all the rows from disk into memory , parse, and filter out to get to the final answer. This is where the power of column-oriented storage comes into place. The idea is not to store all the values from one row together, but store all the values from each column together. the query needs to read and parse only the columns used in query! This saves lot of time.
Aggregation: Data Cubes and Materialized Views
Every data warehouse is not a column store. In relational data model, materialized view is often defined like a standard(virtual) view: a table-like object whose contents are the results of some query. they are useful for read heavy applications. Special case of materialized view is data cube or OLAP cube. It is a grid of aggregates grouped by different dimensions.
This chapter has armed you with understanding of internals of storage engines, it will put you in a better position to pick the tool best suited for your application.