September 13, 2016
It’s worth emphasizing that fully incremental architectures are so widespread that many people don’t realize it’s possible to avoid their problems with a different architecture. These are great examples of familiar complexity—complexity that’s so ingrained, you don’t even think to find a way to avoid it. ~ Nathan Marz
The SSIX Platform will be an implementation of the Lambda Architecture, which has been proposed for big data systems. Most existing systems and architectures are fully incremental, meaning that the system updates its state whenever new data enters it. In other words, processing is a one-way street: once data has been processed, it cannot be reprocessed. Let us look at a few common problems with fully incremental systems.
Problems with Fully Incremental Systems
- Disk Compaction – Compaction is the process of reclaiming unused disk space. In a fully incremental architecture it is known to lock up servers in production environments.
- Human-fault Tolerance – Coding mistakes are inevitable. In a fully incremental system, there is no way to recover when a mistake produces incorrect calculations, so the stored results are left in an inconsistent state.
- Embracing Updates – If an algorithm is updated, the new version can only be applied to new data. There is no way to apply it to data that has already been processed.
- Generalization – Because they are built around specific ideas, fully incremental systems are hard to generalize.
The Lambda Architecture is designed to solve the problems listed above, among others. It has three layers: the batch, speed, and serving layers. An overview diagram of the architecture and a description of its basic components are given below.
- New Data: All data entering the system is dispatched to both the batch layer and the speed layer for processing.
- Batch layer: This layer has two functions: (i) managing the master dataset, an immutable, append-only set of raw data, and (ii) precomputing arbitrary query functions, whose results are called batch views.
- Serving layer: This layer indexes the batch views so that they can be queried ad hoc with low latency.
- Speed layer: This layer compensates for the high latency of updates to the serving layer, which is caused by the batch layer. Using fast, incremental algorithms, the speed layer deals with recent data only.
- Queries: Last but not least, any incoming query can be answered by merging results from batch views and real-time views.
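The flow described in the list above can be sketched as a toy in-memory model. This is only an illustration, not the SSIX implementation: the class `LambdaSketch`, its method names, and the per-key event count used as the query function are all hypothetical, and a real batch run would be a long-running distributed job rather than a synchronous call.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the three layers; every name here is hypothetical."""

    def __init__(self):
        self.master = []                # batch layer: append-only raw events
        self.batch_view = Counter()     # serving layer: result of the last batch run
        self.realtime_view = Counter()  # speed layer: events since the last batch run

    def ingest(self, event):
        # New data is dispatched to both the batch layer and the speed layer.
        self.master.append(event)
        self.realtime_view[event["key"]] += 1  # fast incremental update

    def run_batch(self):
        # Recompute the batch view from scratch over the whole master dataset.
        self.batch_view = Counter(e["key"] for e in self.master)
        # The batch view now covers these events, so the real-time view is reset.
        self.realtime_view = Counter()

    def query(self, key):
        # Answer a query by merging the batch view and the real-time view.
        return self.batch_view[key] + self.realtime_view[key]


arch = LambdaSketch()
for key in ["ABC", "ABC", "XYZ"]:  # hypothetical keys
    arch.ingest({"key": key})
before_batch = arch.query("ABC")   # answered from the real-time view only
arch.run_batch()
arch.ingest({"key": "ABC"})
after_batch = arch.query("ABC")    # batch view plus one recent event
```

In this sketch the query result stays correct whether or not a batch run has happened yet; the batch run only moves work from the real-time view into the batch view.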
The primary storage is append-only, so no disk compaction is needed. The combination of the speed and batch layers helps in achieving high availability and eventual consistency. Because the raw data remains available and can be batch processed again, the architecture provides human-fault tolerance, makes it possible to embrace updates, and supports ad hoc queries.
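To illustrate how keeping the raw data gives human-fault tolerance, here is a minimal sketch in which a buggy scoring function is replaced and the view is simply rebuilt from the untouched raw records. The sample records, the scoring rules, and the function names are all made up for the example.

```python
# Immutable raw records (hypothetical sample data); never edited in place.
RAW = (
    {"symbol": "ABC", "text": "great results"},
    {"symbol": "ABC", "text": "not great"},
    {"symbol": "XYZ", "text": "great"},
)


def score_v1(text):
    # Buggy version: counts "not great" as positive.
    return 1 if "great" in text else 0


def score_v2(text):
    # Corrected version: treats the phrase "not great" as negative.
    if "not great" in text:
        return -1
    return 1 if "great" in text else 0


def build_batch_view(raw, score):
    """Recompute the whole view from the raw data with the given scoring function."""
    view = {}
    for rec in raw:
        view[rec["symbol"]] = view.get(rec["symbol"], 0) + score(rec["text"])
    return view


old_view = build_batch_view(RAW, score_v1)  # view produced by the buggy code
new_view = build_batch_view(RAW, score_v2)  # same raw data, corrected code
```

Nothing derived from the buggy version has to be patched: the old view is discarded and replaced by the recomputed one.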
The SSIX architecture is based on the Lambda Architecture mainly because of its suitability for big data systems. The Lambda Architecture advocates storing the complete raw data, which shields against human error. Storing immutable raw data and being able to reprocess it is particularly useful for a research project, where the exact scope may not be clear at the beginning and the processing algorithms are very likely to change. Such changes can be reapplied to all the raw data. The architecture also has two processing layers: the batch layer for processing large amounts of data and the speed layer for processing streaming data into real-time views. Using both layers efficiently makes it possible to maintain throughput while managing latency. SSIX aims to provide a one-minute feed, which can be achieved by using the speed layer for the feed and moving heavier tasks to the batch layer. This is in addition to other uses of the batch layer, such as reprocessing raw data for back-testing. The SSIX platform is being developed around the financial domain but should be applicable to other domains such as news analytics, politics, and product reviews. The Lambda Architecture is highly generalizable, so systems built on it can be adapted to different domains and datasets.
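As a rough illustration of how a speed layer could maintain a one-minute feed incrementally, the sketch below keeps a running sum and count per minute and symbol. It is an assumption-laden example, not the SSIX design: the class `MinuteFeed`, the integer-second timestamps, and the numeric score attached to each event are all hypothetical.

```python
from collections import defaultdict


def minute_bucket(ts_seconds):
    """Round an integer timestamp in seconds down to the start of its minute."""
    return ts_seconds - ts_seconds % 60


class MinuteFeed:
    """Hypothetical speed-layer view: running mean score per (minute, symbol)."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, ts_seconds, symbol, score):
        # Incremental update: only the affected bucket is touched.
        key = (minute_bucket(ts_seconds), symbol)
        self.sums[key] += score
        self.counts[key] += 1

    def mean(self, minute, symbol):
        # Return None for buckets that have received no events.
        key = (minute, symbol)
        count = self.counts.get(key, 0)
        if count == 0:
            return None
        return self.sums[key] / count


feed = MinuteFeed()
feed.update(125, "ABC", 0.5)   # falls in the minute starting at 120
feed.update(170, "ABC", 1.0)   # same minute
feed.update(185, "ABC", -1.0)  # next minute, starting at 180
```

Heavier computations over the full history would stay in the batch layer, while a structure like this only ever needs to hold recent minutes.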
By the end of year three, SSIX is expected to handle over a dozen terabytes of data for the financial domain alone. As a rough estimate, that figure may more than double after five years. SSIX also has to be adaptable to other domains such as elections or education. Implementing the Lambda Architecture will provide the scalability and generality that SSIX needs to serve as an open source analysis platform.