Designing a scalable real-time analytics model is a step-by-step process involving multiple tools and technologies; no single tool provides a complete solution. One approach to solution development is the Lambda architecture. This approach takes advantage of both batch and stream processing to balance latency, throughput, and fault tolerance. It uses batch processing to provide comprehensive, accurate analytics over the entire dataset, while simultaneously using real-time stream processing to provide incremental analytics on the continually arriving data. The two sets of results are then merged to generate a comprehensive overview of insights. The Lambda architecture consists of three layers:
- Batch layer – for batch processing of all data
- Speed layer – for real-time processing of streaming data
- Serving layer – for responding to queries
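The interaction of these three layers can be sketched in a few lines. The snippet below is a minimal, illustrative model (the user names and counts are invented for the example): the batch view is periodically recomputed over all data, the speed layer holds only the increments that arrived since the last batch run, and the serving layer merges the two at query time.

```python
from collections import Counter

# Batch layer: recomputed periodically over the entire master dataset.
batch_view = Counter({"user_a": 120, "user_b": 45})

# Speed layer: incremental counts for events that arrived after the
# last batch run and are not yet reflected in the batch view.
realtime_view = Counter({"user_a": 3, "user_c": 7})

def query(user):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch_view[user] + realtime_view[user]

print(query("user_a"))  # 123: 120 from the batch view + 3 from the speed layer
```

Because a `Counter` returns 0 for missing keys, a user seen only by the speed layer (like `user_c`) is still answered correctly before the next batch run folds those events into the batch view.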
The diagram below illustrates an example of a scalable real-time analytics architecture.
Designing a scalable, real-time analytics system requires the integration of several components. We have described the major building blocks of such a system, combining batch processing and stream processing according to the Lambda architecture. A scalable, real-time analytics architecture can be realized by integrating the following open-source components:
- Apache Flume for data aggregation and transformation
- Apache Kafka as the distributed messaging layer
- Apache HDFS as the batch layer's incoming data store
- Apache Spark for the batch processing logic
- ElephantDB or Druid as the batch layer output store
- Spark Streaming for the speed layer
- Apache Cassandra or Druid as the speed layer store
- Druid as the serving layer store
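To make the data flow between these components concrete, the sketch below simulates the pipeline with plain Python stand-ins: a deque plays the role of a Kafka topic, a list plays the role of the HDFS master dataset, and a counter plays the role of the Cassandra/Druid speed store. The function names and event shape are assumptions for illustration only; in a real deployment each piece would be replaced by the corresponding system.

```python
from collections import deque, Counter

message_bus = deque()       # stands in for a Kafka topic
master_dataset = []         # stands in for HDFS (append-only raw events)
realtime_view = Counter()   # stands in for the Cassandra/Druid speed store

def ingest(event):
    """Flume/Kafka path: every incoming event lands on the message bus."""
    message_bus.append(event)

def speed_layer_tick():
    """Spark Streaming role: drain the bus and update views incrementally."""
    while message_bus:
        event = message_bus.popleft()
        master_dataset.append(event)       # also archived for the batch layer
        realtime_view[event["user"]] += 1  # incremental speed-layer update

def batch_job():
    """Spark batch role: full recomputation over the master dataset."""
    batch_view = Counter(e["user"] for e in master_dataset)
    realtime_view.clear()  # these increments are now covered by the batch view
    return batch_view

for u in ("a", "a", "b"):
    ingest({"user": u})
speed_layer_tick()
print(realtime_view["a"])   # 2 (answered by the speed layer before any batch run)
batch_view = batch_job()
print(batch_view["a"])      # 2 (answered by the batch view after recomputation)
```

The key design point the simulation captures is that the same event stream feeds both paths: the speed layer answers immediately from increments, while the batch layer periodically recomputes from the full archive and supersedes them.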