Unified Data Lake Platform in Azure for a British Multinational Consumer Goods Enterprise 

 

Solution Approach

Technical Architecture

Common Data Model – Data Flow Diagram

Common Data Model – Technical Architecture

  1. Data from the EDAP transformation layer will be read and prepared using PySpark on a Databricks cluster to suit the CDM design.
  2. Common Data Model subject area designs will be deployed to the Data Lake as Microsoft Common Data Model folders, with manifest files holding the metadata of entities, attributes, and relationships.
  3. Data will be ingested into the CDM folders according to the model defined using Industry Data Workbench, and stored in the Data Lake as Parquet files.
  4. Data from the CDM folders will be read using PySpark on a Databricks cluster and converted to Delta format. Delta tables will then be created with Spark SQL; the Delta log enables data-warehouse-like properties for managing data, along with Change Data Capture.
  5. Delta tables will be archived periodically according to the defined EDAP data policy.
  6. Enterprise ETL tools will be used to build data pipelines and manage metadata.
  7. Azure Active Directory and CI/CD will be implemented for platform security and management.
  8. The whole EDAP and CDM design will be hosted on Microsoft Azure and will use the aforementioned Azure services.
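
Step 4 above (Parquet in CDM folders converted to Delta tables) can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the folder layout (cdm/ and delta/ under the lake root), the database name, and the helper names are all hypothetical, and using Delta's change data feed table property to support Change Data Capture is an assumption rather than something the design states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityPaths:
    """Source and target locations for one CDM entity (hypothetical layout)."""
    entity: str
    parquet_path: str
    delta_path: str


def entity_paths(lake_root: str, subject_area: str, entity: str) -> EntityPaths:
    # Assumed layout: <root>/cdm/<subject area>/<entity> holds the Parquet files,
    # and <root>/delta/<subject area>/<entity> receives the Delta copy.
    root = lake_root.rstrip("/")
    return EntityPaths(
        entity=entity,
        parquet_path=f"{root}/cdm/{subject_area}/{entity}",
        delta_path=f"{root}/delta/{subject_area}/{entity}",
    )


def table_statements(database: str, paths: EntityPaths) -> list[str]:
    # Register the Delta files as a table, then switch on the change data feed
    # (assumption: this is how row-level change tracking would be enabled).
    table = f"{database}.{paths.entity}"
    return [
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"USING DELTA LOCATION '{paths.delta_path}'",
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)",
    ]


def convert_entity(spark, database: str, paths: EntityPaths) -> None:
    """Read one entity's Parquet files and publish them as a Delta table.

    `spark` is an existing SparkSession with Delta Lake available (as on a
    Databricks cluster); the target database is assumed to exist already.
    """
    df = spark.read.parquet(paths.parquet_path)
    df.write.format("delta").mode("overwrite").save(paths.delta_path)
    for statement in table_statements(database, paths):
        spark.sql(statement)
```

On a cluster this might be called as convert_entity(spark, "cdm", entity_paths(lake_root, "sales", "Customer")), where the subject area and entity names are again placeholders. Keeping the path and SQL construction separate from the Spark calls lets that part be checked without a cluster.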

Benefits

Single Data Model

30% saving in project delivery cost (effort) due to faster delivery, data model reuse, and quick dashboarding directly from landing-layer data

Alignment of Data Strategy

Creating a single source of truth will reduce the effort of validating data across multiple sources

Centralized Governance

Data security framework & DevOps to be centrally governed, thus reducing project costs by approx. 10%

Improved Pipeline Performance

Approximately 20% lower data storage costs due to reduced data redundancy

Reduced Operating Costs

Around 25% saving due to pulse/RB Once optimization (redesign of Azure components)

Data Democratization

Faster delivery of data reports/projects to the business through reduced turnaround time, thus delivering value faster through CDM & Dremio