Overview

A Fortune 100 firm set out to migrate its Enterprise Data Lake from a Cloudera platform to a new data platform, primarily to:

  • reduce cost
  • adopt modern data architectures, and
  • importantly, make the data lake easy to govern and consume

Business Challenges:

The customer's data lake had been in development for many years. After years of iterative development and ad hoc data governance, the lake had become very difficult to consume because it lacked proper metadata management, access provisioning, and cataloging. The lake was built on a Hadoop cluster, and as costs kept rising, the customer needed to re-evaluate its options for new technologies and platforms.

Solutions Delivered:

Quadratic Systems was the primary partner for designing and implementing the data lake solution on a new platform comprising:

  • On-prem S3 object store (Scality) for data storage (replacing HDFS)
  • Spark/Scala on Kubernetes containers (CaaS platform)
  • Dremio as a query tool (replacing Hive/Impala)

Scality was chosen for data storage and a CaaS platform (Kubernetes) for compute, together replacing the existing Hadoop environment.

Data Governance

  • Data quality addressed with a Balance and Controls and Reconciliation Framework
  • Data access and security handled through ServiceNow integration for provisioning
  • Metadata management through integration with Informatica EDC
  • Incremental feeds to the lake were recorded as metadata so that consumers could easily find new data as it became available
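
The case study does not describe how the Balance and Controls and Reconciliation Framework works internally. As a rough illustration of the general idea, a balance-and-control check typically compares control totals (for example a row count and the sum of a numeric column) captured at the source with the same totals computed after the load. The sketch below is a minimal, plain-Python version of such a check; `ControlTotals`, `compute_totals`, `reconcile`, and the `amount` field are hypothetical names chosen for this example and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlTotals:
    """Control totals captured for one feed (hypothetical structure)."""
    row_count: int
    amount_sum: float


def compute_totals(rows, amount_field="amount"):
    """Compute the row count and the sum of one numeric field over dict rows."""
    rows = list(rows)
    return ControlTotals(
        row_count=len(rows),
        amount_sum=round(sum(r[amount_field] for r in rows), 2),
    )


def reconcile(source, target, tolerance=0.01):
    """Return a list of discrepancies; an empty list means the feed balances."""
    issues = []
    if source.row_count != target.row_count:
        issues.append(
            f"row count mismatch: source={source.row_count}, target={target.row_count}"
        )
    if abs(source.amount_sum - target.amount_sum) > tolerance:
        issues.append(
            f"amount mismatch: source={source.amount_sum}, target={target.amount_sum}"
        )
    return issues


if __name__ == "__main__":
    # Illustrative data only: the target is missing one row from the source.
    source_rows = [{"amount": 10.0}, {"amount": 5.5}]
    target_rows = [{"amount": 10.0}]
    for issue in reconcile(compute_totals(source_rows), compute_totals(target_rows)):
        print(issue)
```

In a real pipeline the totals would more likely be computed by the processing engine itself (for example as Spark aggregations) and stored alongside the feed metadata, so that each incremental load can be audited later.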

Data Ingestion Framework

  • A repeatable, config-driven framework written in Scala enabled Systems of Record to onboard to the lake quickly
  • Kafka integration and Spark Streaming provided near-real-time ingestion into the lake
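
To make "config-driven" concrete: in this style of framework, onboarding a new System of Record usually means adding a small configuration entry rather than writing new pipeline code. The project's framework was written in Scala and its configuration schema is not described here, so the following is only a hedged Python sketch of the pattern. The config keys (`source_system`, `dataset`, `format`, `load_type`), the `datalake-raw` bucket name, and the path layout are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Keys every feed config must provide (hypothetical schema for this sketch).
REQUIRED_KEYS = ("source_system", "dataset", "format", "load_type")

# Map the declared load type to a write mode for the target location.
WRITE_MODES = {"full": "overwrite", "incremental": "append"}


@dataclass(frozen=True)
class IngestionPlan:
    """Everything a generic ingestion job needs to load one feed."""
    source_system: str
    dataset: str
    reader_format: str
    write_mode: str
    target_path: str


def build_plan(feed_config, bucket="datalake-raw"):
    """Validate a feed config and turn it into an ingestion plan."""
    missing = [key for key in REQUIRED_KEYS if key not in feed_config]
    if missing:
        raise ValueError(f"feed config is missing keys: {missing}")

    load_type = feed_config["load_type"]
    if load_type not in WRITE_MODES:
        raise ValueError(f"unsupported load_type: {load_type!r}")

    source_system = feed_config["source_system"]
    dataset = feed_config["dataset"]
    return IngestionPlan(
        source_system=source_system,
        dataset=dataset,
        reader_format=feed_config["format"],
        write_mode=WRITE_MODES[load_type],
        # Assumed layout: one prefix per source system and dataset.
        target_path=f"s3a://{bucket}/{source_system}/{dataset}/",
    )


if __name__ == "__main__":
    # Onboarding a new feed is just another config entry.
    config = {
        "source_system": "crm",
        "dataset": "accounts",
        "format": "parquet",
        "load_type": "incremental",
    }
    print(build_plan(config))
```

Keeping validation and planning separate from execution means one generic job can serve every feed, which is what makes the onboarding process repeatable.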

Data Consumption

  • Dremio was made available so that users could easily consume data on Scality for exploration and reporting
  • Advanced users employed PySpark for analytical consumption
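
For readers unfamiliar with reading from an S3-compatible store in PySpark, the sketch below shows one common way to do it through the Hadoop S3A connector. It is an illustration, not the project's actual setup: the endpoint URL, credentials handling, application name, path-style access setting, and the Parquet file format are all assumptions, and running `read_dataset` requires `pyspark` plus the `hadoop-aws` connector on the classpath.

```python
def s3a_options(endpoint, access_key, secret_key):
    """Build Spark settings for the Hadoop S3A connector against a custom endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Assumption: on-prem S3-compatible stores often need path-style URLs.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def read_dataset(path, options):
    """Open a Spark session with the given options and read a Parquet dataset."""
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("lake-exploration")
    for key, value in options.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    return spark.read.parquet(path)


if __name__ == "__main__":
    # Placeholder endpoint and credentials; real values would come from a secret store.
    options = s3a_options("https://s3.example.internal", "ACCESS_KEY", "SECRET_KEY")
    for key, value in sorted(options.items()):
        shown = "***" if "key" in key.rsplit(".", 1)[-1] else value
        print(f"{key} = {shown}")
```

A call such as `read_dataset("s3a://datalake-raw/crm/accounts/", options)` would then return a DataFrame for analysis, with the bucket and prefix here being hypothetical.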

Enabling Technology

  • On-prem S3 (Scality)
  • Spark on Kubernetes
  • Dremio
  • Scala / PySpark