Logs and Distributed Systems

Why Logs are so important

Mohit Gupta
Level Up Coding

--

Logs are not new to the software engineering world. They are probably one of the first things many of us heard about and learned. However, that is not the whole story: the log is a potent data structure that enables faster data writes and efficient replication in distributed systems.

Let us understand this in simple terms.

What is a Log

Writing any event (or its metadata), as it happens, to simple storage in sequential order is called logging. Each entry written is a log record.

It is the simplest form of storage, as it does not need any complex database checks or constraints to implement. A few highlights:

  • Logs can be written in simple plain files.
  • New information keeps being appended to the end of the file.
  • Logs are append-only and are not meant for updates or deletions.
  • Any update or deletion event is logged as a new entry in the log.
  • Logs are always ordered by the time of the event.
  • If the file grows too big, the system can roll over to a new file.
  • Previous files are kept in order, so the overall log sequence is maintained.
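The points above can be sketched in a few lines of Python. This is a minimal illustration, not a production log; the file name and record format are invented for the example:

```python
import os

LOG_FILE = "events.log"  # illustrative file name

if os.path.exists(LOG_FILE):  # start fresh for this demo
    os.remove(LOG_FILE)

def append_event(event: str) -> None:
    """Append a single event record to the end of the log file."""
    with open(LOG_FILE, "a") as f:  # "a" mode always writes at the end
        f.write(event + "\n")

def read_events() -> list:
    """Read back all events in the order they were written."""
    with open(LOG_FILE) as f:
        return [line.rstrip("\n") for line in f]

append_event("user=alice action=login")
append_event("user=alice action=logout")
print(read_events())  # events come back in write order
```

Note there is no update-in-place anywhere: a later change to the same user would simply become another appended line.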

Logs in Application Development

In its simplest form, which most developers learn first, logging is used heavily in software development to record system information for debugging and informational purposes.

Whatever information is important is written to files in an append-only manner.

This helps a lot later in debugging any issue in the system, and also helps in extracting useful system information by analyzing the logged information.

Read more about the importance of logging in software development here

However, there are more places where log-based storage plays a very important role. Let us discuss them.

Logs in Database Systems

Logs are very useful in the implementation of many database systems.

These systems first store the change instructions in logs, and then make the changes to the core data structures.

The reason is that storing change instructions in logs is simpler and faster: the system just appends the instructions to log files, in sequence. No other processing, checks, or constraints are required.

Once the change instructions are stored in logs, the database system applies them in sequence to the actual storage data structures. This step takes care of ACID and related properties.

In case of any issue with these operations, the database system can always refer back to the logs to make the necessary corrections. Logs are the original raw data representing the ‘intent of change’.

These logs also play an important role in data replication across the nodes of a database system.

Data replication can be done by copying data table-to-table across nodes, but that can be a heavy process: the data is in normalized form, or may be spread across a multitude of structures, indexes, and more.

A simpler solution is to replicate the event logs (change instructions). Once the logs are on the other database nodes, the instructions can be (re)played in sequence to recreate the same state of the system.

It is like a state machine: if the same events are fired in the same sequence, the system always reaches the same state.
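The state-machine idea can be illustrated with a toy key-value store that is rebuilt purely by replaying change instructions in order. The instruction format here is invented for the example:

```python
def apply(state: dict, instruction: tuple) -> None:
    """Apply one change instruction to the in-memory state."""
    op, key, value = instruction
    if op == "set":
        state[key] = value
    elif op == "delete":
        state.pop(key, None)

# The log records the intent of every change, in order.
change_log = [
    ("set", "balance", 100),
    ("set", "owner", "alice"),
    ("set", "balance", 80),     # an update is just a new log entry
    ("delete", "owner", None),  # so is a deletion
]

state = {}
for instruction in change_log:  # replaying in order is deterministic
    apply(state, instruction)
print(state)  # {'balance': 80}
```

Replaying the same `change_log` on any node, any number of times from an empty state, always produces the same final dictionary.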

This is much simpler and more efficient than replicating far more complex data structures.

Many products are built on this approach to enable replication across database nodes; Oracle GoldenGate is one example.

This database use case makes it evident that logs are an important vehicle for replicating data across nodes.

Let us understand how this is useful in Distributed systems.

Logs in Distributed Systems

From the previous use case, it is clear that logs play an important role in data replication. This helps ensure better system uptime by keeping multiple nodes up to date.

This is a key aspect of distributed systems. In a distributed system, multiple nodes work together to provide services with high availability and high throughput, while ensuring that system state stays consistent (even if only eventually) and that data is not lost or left outdated for long.

To enable this, system state/data has to be replicated efficiently across the nodes. Logs and log replication make this implementation possible.

The basic idea is to log events on the primary node as they happen.

Events can be logged in any format, which could be specific to each system.

  • It can be simple instructions that the system receives from the client.
  • It could be the service and operation names that are invoked by the client.
  • It could be the resultant changes in the system once change requests are executed by the system.

Each system can design this differently, but ultimately the events are logged. Once logged, they are replicated to other nodes and replayed there to recreate the same state on those nodes.

As long as the events are replayed in the same sequence, the system will produce the same state.
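A minimal sketch of this replication loop, with invented names and record formats: each node keeps a cursor into a shared log and applies only the entries it has not yet seen, so a replica can catch up incrementally or replay from scratch and still converge to the same state.

```python
class Node:
    """A node that rebuilds its state by replaying a shared log."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # index of the next log entry to apply

    def replay(self, log):
        # Apply only the entries this node has not yet seen, in order.
        for op, key, value in log[self.applied:]:
            if op == "set":
                self.state[key] = value
            else:  # "delete"
                self.state.pop(key, None)
        self.applied = len(log)

primary_log = [("set", "x", 1), ("set", "y", 2)]
replica_a, replica_b = Node(), Node()
replica_a.replay(primary_log)           # replica A is now up to date
primary_log.append(("set", "x", 7))     # a new write lands on the primary
replica_a.replay(primary_log)           # A catches up incrementally
replica_b.replay(primary_log)           # B replays everything from scratch
print(replica_a.state == replica_b.state)  # True: same log, same state
```

Real systems add ordering guarantees, acknowledgements, and failure handling on top, but the convergence argument is exactly this one.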

That's how Logs play an important role in enabling Distributed systems and act as basic building blocks for the whole ecosystem.

This reminds me of a very interesting article by Martin Fowler on LMAX, which is based on event processing. Refer to it here.

But Why ‘Logs’?

The important question is: why should we use logs to store the data, and not some other data structure like a map or a tree?

The reasons are:

  • Append-only logs are a very simple yet powerful data structure to capture events as they happen.
  • Writing to them is much faster than to most other data structures.
  • Logs are immutable: data can be appended but never changed in place.
  • Immutability and sequential writing make logs very fast, as there is no need to find the right place to change in the data storage.
  • Another reason they are faster: logging does not need to apply the various checks and optimizations that database systems usually apply before writing data to ensure ACID (and similar) properties.

Hence logs are very fast for writing, and optimizing writes is one of the biggest tasks in data-intensive applications.

Reading performance is equally important; however, systems have more opportunities to optimize reads by introducing:

  • Multiple levels of caches
  • A Bloom-filter-like data structure for checking whether data might be present in a specific file
  • Storage optimizations such as merging files with the latest updates, and compaction/compression techniques.
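To make the Bloom filter point concrete, here is a toy version built from a bit array and a few hash functions. The sizes are arbitrary; real systems tune them for a target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely absent' or 'possibly present'."""

    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item: str):
        # Derive several bit positions per item by salting one hash function.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:999"))  # very likely False (false positives are possible)
```

A storage engine keeps one such filter per data file: a `False` answer lets it skip reading that file from disk entirely.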

Writing, however, remains a bottleneck, as it has to happen in (near) real time while ensuring the durability of data.

A simple, efficient data structure like the plain append-only log helps a lot here.

Won’t Writing to Disk Still Make it Slower?

The simplicity of the log-based data structure helps with faster writes. However, writing to disk is always costly and hence slower.

One improvement is to use caches: a bunch of writes can be collected in a cache first and flushed to disk later. However, this carries the risk of losing the cached data if the system crashes.

Moreover, caches are present at every level. Even operating systems and drivers cache writes before flushing them to disk. So even after we execute a statement to write data to persistent storage, the data can still be sitting in one of these caches.

Data storage systems usually handle this by tuning the flush frequency of the OS (and drivers, etc.) to an optimal value. Setting the right cache-flush frequency is one way to strike a balance between performance and durability.
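In Python, these layers map onto `flush()` (application buffer to OS page cache) and `os.fsync()` (page cache to the device). A durable append, sketched with an invented file name, looks roughly like this:

```python
import os

WAL_FILE = "wal.log"  # illustrative file name
if os.path.exists(WAL_FILE):
    os.remove(WAL_FILE)  # start fresh for this demo

def durable_append(path: str, records: list) -> None:
    """Append a batch of records and push them through every cache layer."""
    with open(path, "a") as f:
        f.write("".join(r + "\n" for r in records))
        f.flush()             # drain the application buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush its page cache to disk

# Batching amortizes the fsync cost: one sync covers many records.
durable_append(WAL_FILE, ["evt-1", "evt-2", "evt-3"])
with open(WAL_FILE) as f:
    print(f.read().splitlines())  # ['evt-1', 'evt-2', 'evt-3']
```

Calling `fsync` after every single record is the most durable and slowest setting; syncing once per batch, or on a timer, is exactly the performance/durability trade-off described above. (Some drives additionally cache writes internally, which is why the text mentions drivers too.)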

There is also innovation happening in hardware to enable faster writing. Flash-based SSDs are one example: they offer very high performance and do not lose data even if the system goes down. I hope even more is on the way.

That’s how writing is becoming faster and faster: by keeping the data structure simple, and by using better cache designs and better hardware.

More benefits of Logs

  • Logs are the raw data for everything that happened in the system. This raw data can be used to recreate any state, from any point in time. It is like replaying the whole life of the system. Refer to the LMAX architecture in the references section for an interesting illustration.
  • The raw data can also be used to produce different reports and to power whole analytics ecosystems.

Summing up

Logs are a simple yet very powerful and efficient data structure for writing data. Due to this simplicity and efficiency, they are an effective solution for enabling data replication in distributed system design.

Data replication is a core requirement of distributed systems. Having an efficient data structure here improves the efficiency of the whole ecosystem.

That’s how logs play an important role in Distributed System designs.

In a future blog, we shall do a deep dive into how distributed systems store such huge volumes of data efficiently, and how good log-based design plays an important role there.

References

Wikipedia: append-only logs

LSM Tree

HDD vs SSD

LMAX — Event-driven Architecture based trading platform

Refer here for more System Design articles

If you enjoyed reading this, please share, give a clap, and follow for more stories!

For any suggestions, feel free to reach me on Linkedin: Mohit Gupta
