Business Problem

The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization. Additional ingestions from data sources are frequently added on an as-needed basis, making it difficult to leverage shared functionality between pipelines. Identifying when technical debt is prohibitive for an organization can be difficult, but remedying it can be even more so.

As our customer's data engineering team grappled with their own technical debt, they identified the need for higher data quality enforcement, the consolidation of shared pipeline functionality, and a scalable way to implement complex business logic for their downstream data scientists and machine learning engineers.

In this case study, we will explain how Quadratic designed our client's new end-to-end pipeline architecture to make the creation of additional pipelines robust, maintainable, and scalable, all while writing fewer lines of code with Apache Spark.

Original Architecture:

The original architecture was designed to support our client's Annual Enrollment and started with the collection of data. Support for both internal and external sources was needed. For internal sources, product teams landed data in a staging area; for external sources, typical API calls were made to land the same data. The data landed as JSON, and the first step was to convert it to Parquet. Depending on the requirements of the pipeline, a small amount of custom logic might be implemented. All of the pipelines were run using Oozie as an orchestrator, and all logic was executed as a combination of Pig and Hive jobs. Each pipeline was architected independently.

In the last stage of processing, data sources with similar structure and business rules were combined into a single table, which reduced the number of tables the team would have to maintain. Lastly, the data was exposed via Hive tables.

Data Engineering Refactor Scope:

The original architecture evolved over time but continued to implement data pipelines using the older Hadoop technologies. The problem with this was twofold: (1) it was very difficult to onboard new data sources, and (2) the pipelines were non-performant and buggy, so the customer had to invest in 24x7 monitoring of the jobs, most of it done manually because of the short remediation windows allowed by the SLAs. There was a huge opportunity to refactor these pipelines.

New Architecture:

In the new architecture, we drive the generation of each pipeline through three components. The first component is a YAML configuration file structured as a list of transforms, the configurations for those transforms, and the information needed to construct the actual pipeline; each transform's configuration defines all the information needed to configure that step. The second component is a set of schema files that define how the data is structured in the three Hive layers throughout the pipeline. We use these schema files to create the actual Hive tables and to drive the schema validation step at the beginning of the pipeline. The third component is an orchestration library that parses the configuration files, constructs the ingest calls and other operational blocks, relates those steps together, and then constructs the Airflow DAG and tasks.
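
To make this concrete, here is a minimal sketch of what the first two components, a pipeline configuration and one of its schema files, might look like, along with how they could be parsed. The file names, keys, transform names, and column definitions below are illustrative assumptions, not the client's actual format.

    import yaml

    # Hypothetical pipeline configuration: an ordered list of transforms, the
    # settings for each transform, and the information needed to build the pipeline.
    pipeline_config = yaml.safe_load("""
    pipeline:
      name: enrollment_ingest
      schedule: "0 2 * * *"
      source: external_api            # or: internal_staging
    transforms:
      - name: schema_validation
        schema_file: schemas/enrollment_raw.yaml
      - name: convert_to_parquet
        target_table: raw.enrollment
      - name: apply_business_rules
        rules: [dedupe, normalize_dates]
        target_table: curated.enrollment
    """)

    # Hypothetical schema file for one of the three Hive layers; the same file is
    # used to create the Hive table and to validate incoming data.
    raw_schema = yaml.safe_load("""
    table: raw.enrollment
    columns:
      - {name: member_id,   type: string,    nullable: false}
      - {name: plan_code,   type: string,    nullable: false}
      - {name: enrolled_at, type: timestamp, nullable: true}
    """)

    # The orchestration library walks the transform list in order.
    for step in pipeline_config["transforms"]:
        print(step["name"], step)

Keeping the transform list ordered in the configuration lets the orchestration library derive the pipeline structure directly from the file, so adding a new data source is largely a matter of writing a new configuration and schema rather than new pipeline code.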

Config Driven Ingestion

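To illustrate the config driven approach end to end, below is a minimal sketch of how an orchestration library along these lines might parse a pipeline configuration and construct the Airflow DAG and tasks from it. The function names, operator choice, and file path are assumptions made for illustration, not the client's actual implementation.

    import yaml
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_transform(transform_config, **_):
        # Placeholder: in the real library each transform name would dispatch to
        # the corresponding Spark job (schema validation, Parquet conversion,
        # business rules, ...).
        print(f"running {transform_config['name']}: {transform_config}")

    def build_dag(config_path):
        """Parse a pipeline YAML file and construct an Airflow DAG from it."""
        with open(config_path) as f:
            config = yaml.safe_load(f)

        dag = DAG(
            dag_id=config["pipeline"]["name"],
            schedule=config["pipeline"]["schedule"],
            start_date=datetime(2024, 1, 1),
            catchup=False,
        )

        previous = None
        for transform in config["transforms"]:
            task = PythonOperator(
                task_id=transform["name"],
                python_callable=run_transform,
                op_kwargs={"transform_config": transform},
                dag=dag,
            )
            # Chain tasks in the order they appear in the configuration file.
            if previous is not None:
                previous >> task
            previous = task

        return dag

    # Airflow discovers DAG objects defined at module level.
    dag = build_dag("pipelines/enrollment_ingest.yaml")  # hypothetical path

Because the DAG structure is derived entirely from the configuration file, onboarding a new data source becomes a configuration change rather than a new, independently architected pipeline.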