Realtime Analytics

Designing a scalable real-time analytics model is a step-by-step process involving multiple tools and technologies; no single tool provides a complete solution. One common approach is the Lambda architecture, which combines batch and stream processing to balance latency, throughput, and fault tolerance. It uses batch processing to produce comprehensive, accurate analytics over the entire dataset, while simultaneously using real-time stream processing to produce incremental analytics on continually arriving data. The results from the two paths are then merged to form a complete view of the insights. The Lambda architecture has three layers:

  • Batch layer – for batch processing of all historical data
  • Speed layer – for real-time processing of streaming data
  • Serving layer – for merging the two views and responding to queries
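The interplay of the three layers can be sketched in a few lines of plain Python. This is a minimal illustration, not production code: the page-view counting task and all function names are invented for the example, with `Counter` standing in for the batch and speed-layer stores.

```python
from collections import Counter

def batch_view(events):
    """Batch layer: recompute counts over the entire historical dataset."""
    return Counter(e["page"] for e in events)

def update_realtime_view(view, event):
    """Speed layer: incrementally update counts as each event streams in."""
    view[event["page"]] += 1
    return view

def query(batch, realtime):
    """Serving layer: merge the batch and real-time views to answer a query."""
    return batch + realtime

# Historical events already processed by the batch layer
history = [{"page": "home"}, {"page": "home"}, {"page": "about"}]
batch = batch_view(history)

# New events handled by the speed layer while the next batch job runs
realtime = Counter()
for event in [{"page": "home"}, {"page": "pricing"}]:
    realtime = update_realtime_view(realtime, event)

print(query(batch, realtime))  # merged view spanning both layers
```

The key property shown here is that the speed layer only ever holds the delta since the last batch run; once the batch layer recomputes over all data, the real-time view can be discarded and rebuilt from scratch.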

The diagram below illustrates an example of a scalable real-time analytics architecture.

Designing a scalable, real-time analytics system requires integrating several components. The major building blocks combine batch and stream processing following the Lambda architecture principles described above. Such an architecture can be realized by integrating the following open-source components:

  • Apache Flume for data aggregation and transformation
  • Apache Kafka as the distributed messaging layer
  • Apache HDFS for batch layer incoming data store
  • Apache Spark for batch processing logic
  • ElephantDB or Druid for batch layer output store
  • Spark Streaming for speed layer
  • Apache Cassandra or Druid for speed layer store
  • Druid for serving layer store
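How these components fit together can be sketched as a data-flow simulation in plain Python. Everything here is a stand-in: `queue.Queue` plays the role of a Kafka topic, a list stands in for the HDFS batch store, and a dict stands in for the Cassandra/Druid speed-layer store; none of the real APIs are used.

```python
import queue

# Stand-in for a Kafka topic: events are published once and fanned out to both layers
bus = queue.Queue()

hdfs_store = []    # stand-in for the HDFS batch-layer raw data store
speed_counts = {}  # stand-in for the Cassandra/Druid speed-layer store

def ingest(raw_event):
    """Flume-like ingestion: normalize the record and publish it to the bus."""
    bus.put({"page": raw_event.strip().lower()})

def consume():
    """Drain the bus, delivering each event to both the batch and speed layers."""
    while not bus.empty():
        event = bus.get()
        # Batch layer: append the immutable raw event for later full recomputation
        hdfs_store.append(event)
        # Speed layer: update the incremental aggregate immediately
        page = event["page"]
        speed_counts[page] = speed_counts.get(page, 0) + 1

for raw in ["Home", "home ", "About"]:
    ingest(raw)
consume()
```

The design point this mirrors is that the messaging layer decouples producers from consumers: the batch layer stores raw data untouched so any bug in the speed layer can be corrected by recomputation, while the speed layer trades accuracy guarantees for low latency.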