
December 10, 2023

Data Lake vs Delta Lake vs Data Warehouse vs Data Lakehouse

Data Lake

Definition: A centralized repository that allows you to store all your structured and unstructured data at any scale.
Key Features:
  • ✔️ Stores all types of data (no schema enforcement).
  • ✔️ Built on cheap object storage (S3, ADLS, HDFS).
  • ✔️ Supports batch & streaming processing (Spark, Flink).
  • ❌ No ACID transactions (risk of dirty reads/writes).
  • ❌ No built-in indexing (slow for analytics).
Data Types: Supports all types of data, including raw, unprocessed data (e.g., logs, images, videos).
Schema: Schema-on-read: data is stored in its raw format and a schema is applied only when the data is read (see the sketch below).
Use Cases: Big data analytics (Hadoop, Spark), machine learning (raw data storage), data exploration, log & event data archiving.
Examples: Amazon S3, Azure Data Lake Storage, Google Cloud Storage.
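
To make schema-on-read concrete, here is a minimal PySpark sketch: the raw JSON lives in object storage with no enforced schema, and a schema is applied only at read time. The bucket path and column names are hypothetical, and the Spark session is assumed to be configured for S3 access.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is chosen by the reader, not enforced by the lake
# (hypothetical clickstream columns).
clickstream_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

logs = (
    spark.read
    .schema(clickstream_schema)  # applied now, at read time
    .json("s3a://my-data-lake/raw/clickstream/")  # hypothetical bucket
)
logs.show(5)
```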

Delta Lake

Definition: An open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to data lakes, built on top of Apache Parquet. It adds schema enforcement and time travel to data lakes.
Key Features:
  • ✔️ Adds ACID transactions to data lakes (ensures data integrity).
  • ✔️ Supports schema evolution & enforcement (prevents bad data).
  • ✔️ Provides time travel (query historical data versions; see the sketch below).
  • ✔️ Built on Parquet format (efficient storage & compression).
  • ✔️ Integrates with big data tools (Spark, Presto, Hive).
  • ❌ Not a standalone DB (needs Spark/Flink).
Data Types: Structured and semi-structured data (tables, JSON, CSV).
Schema: Schema-on-write: enforces schema when data is written, but allows schema evolution.
Use Cases: Data engineering (ETL pipelines), machine learning (feature store), real-time analytics, streaming data processing.
Examples: Delta Lake on Databricks, Delta Lake on AWS, Delta Lake on Azure.
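
A minimal sketch of schema enforcement and time travel, assuming a Spark session configured with the open-source delta-spark package; the table path is hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes Delta is on the classpath, e.g. via delta-spark's
# configure_spark_with_delta_pip helper.
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

path = "s3a://my-data-lake/delta/orders"  # hypothetical location

orders = spark.createDataFrame([(1, "shoes", 59.0)], ["order_id", "item", "amount"])
orders.write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending a mismatched schema raises an error.
bad = spark.createDataFrame([("oops",)], ["wrong_column"])
# bad.write.format("delta").mode("append").save(path)  # would fail fast

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```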

Data Warehouse

Definition: A centralized repository designed for reporting and data analysis, optimized for read-heavy operations and complex queries.
Key Features:
  • ✔️ Optimized for complex queries and analytics (OLAP).
  • ✔️ Enforces a strict schema (schema-on-write).
  • ✔️ Supports indexing and partitioning (improves query performance).
  • ✔️ Provides data governance and security features.
  • ❌ Expensive storage (compared to data lakes).
  • ❌ Not ideal for unstructured data.
Data Types: Primarily structured data (tables, relational data).
Schema: Schema-on-write: data must conform to a predefined schema before being loaded.
Use Cases: Business intelligence (BI) reporting, historical data analysis, complex queries, data aggregation.
Examples: Amazon Redshift, Google BigQuery, Snowflake, SQL Server, Oracle, MySQL.

Data Lakehouse

Definition: A modern data architecture that combines the benefits of data lakes and data warehouses, enabling both structured and unstructured data processing with ACID transactions.
Key Features:
  • ✔️ Combines the flexibility of data lakes with the performance of data warehouses.
  • ✔️ Supports ACID transactions (data integrity).
  • ✔️ Enables both batch and real-time processing.
  • ✔️ Allows schema evolution and enforcement (see the sketch below).
  • ✔️ Optimized for machine learning and analytics workloads.
  • ❌ Relatively new concept (less mature ecosystem).
Data Types: Supports all types of data (structured, semi-structured, unstructured).
Schema: Flexible schema: supports both schema-on-read and schema-on-write.
Use Cases: Unified analytics, machine learning, real-time analytics, data science workloads.
Examples: Databricks Lakehouse, Snowflake, Apache Hudi, Apache Iceberg.
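
Schema evolution, one of the lakehouse's defining capabilities, can be sketched with Delta (one of several table formats that implement the pattern). This assumes the hypothetical orders table from the Delta Lake sketch above already exists.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-evolution").getOrCreate()
path = "s3a://my-data-lake/delta/orders"  # hypothetical table from the sketch above

# Append rows carrying a new `country` column; mergeSchema evolves the
# table schema on write instead of rejecting the batch.
new_rows = spark.createDataFrame(
    [(2, "hat", 19.0, "US")], ["order_id", "item", "amount", "country"]
)
(
    new_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(path)
)
```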

Comparison Table

Aspect | Data Lake | Delta Lake | Data Warehouse | Data Lakehouse
Definition | Centralized repository for all types of data at any scale. | Storage layer that adds ACID transactions to data lakes. | Centralized repository optimized for reporting and analysis. | Combines benefits of data lakes and warehouses with ACID transactions.
Key Features | Stores all data types, cheap storage, batch & streaming processing, no ACID, no indexing. | ACID transactions, schema enforcement, time travel, Parquet-based, integrates with big data tools. | Optimized for complex queries, strict schema, indexing, data governance, expensive storage. | Flexibility of lakes + performance of warehouses, ACID, batch & real-time processing, schema evolution.
ACID | ❌ No | ✔️ Yes | ✔️ Yes | ✔️ Yes
Performance | Slow (full scan) | Improved (indexing, caching) | Fast (optimized for queries) | Fast (optimized for analytics)
Cost | Cheap (object storage) | Moderate (storage + compute) | Expensive (optimized storage) | Moderate to high (depends on implementation)
Data Types | All types (structured & unstructured). | Structured & semi-structured (tables, JSON, CSV). | Primarily structured (relational data). | All types (structured, semi-structured, unstructured).
Schema | Schema-on-read. | Schema-on-write with evolution support. | Schema-on-write. | Flexible: both schema-on-read & schema-on-write.
Use Cases | Big data analytics, machine learning (raw data), data exploration, log archiving. | Data engineering (ETL), machine learning (feature store), real-time analytics. | BI reporting, historical analysis, complex queries, data aggregation. | Unified analytics, machine learning, real-time analytics, data science.
Examples | Amazon S3, Azure Data Lake Storage, Google Cloud Storage. | Delta Lake on Databricks, AWS, Azure. | Amazon Redshift, Google BigQuery, Snowflake, SQL Server, Oracle. | Databricks Lakehouse, Snowflake, Apache Hudi, Apache Iceberg.
Best For | ML, raw storage | Data engineering | BI, complex queries | Unified analytics

When to Use Which?

  • Data Lake (low cost): when you need to store vast amounts of raw, unstructured data for future processing or exploration.
  • Delta Lake (reliability & transactions): when you want to add reliability and ACID transactions to an existing data lake for better data integrity.
  • Data Warehouse (fast SQL analytics): when your primary need is structured data analysis, reporting, and complex queries with high performance.
  • Data Lakehouse (raw data plus analytics): when you want a unified platform that supports both structured and unstructured data with ACID transactions and real-time analytics.

Key Takeaways

  • Dump everything here. Data Lakes are great for storing all types of data cheaply but lack ACID transactions and indexing.
  • Data Lake with reliability. Delta Lake enhances data lakes with ACID transactions, schema enforcement, and time travel capabilities.
  • Clean data for fast SQL. Data Warehouses are optimized for structured data analysis with strict schemas and high performance but can be costly.
  • Best of both worlds. Data Lakehouses combine the flexibility of data lakes with the performance of data warehouses, supporting diverse data types and workloads.
  • Choose based on needs. Select the right architecture based on your specific use cases, data types, and performance requirements.

There are real-world scenarios where Data Lake, Delta Lake, and Data Warehouse work together in a unified architecture.

Here’s a practical use case where all three are needed:

Use Case: Unified Analytics Platform for an E-Commerce Company

Goal: Build a scalable, end-to-end data pipeline that handles raw data ingestion, reliable transformations, and business-friendly analytics.

1. Data Lake (Raw Storage)

  • What: Store all raw data from various sources in native formats.
  • Data Examples:
    • Clickstream logs (JSON)
    • Product images (JPEG, PNG)
    • Customer reviews (CSV)
    • Transaction records (Parquet)
    • IoT sensor data (CSV/Parquet)
    • Social media feeds (APIs → JSON)
  • Why Needed:
    • Cheap storage for large volumes of diverse data.
    • Flexibility to store unstructured and semi-structured data.
    • Foundation for future processing and exploration.
  • Technology: Use Amazon S3, Azure Data Lake Storage, or HDFS to store raw, unstructured data in its native format (JSON, CSV, Parquet); a minimal ingestion sketch follows.
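
As a minimal ingestion sketch, raw events can land in S3 exactly as produced; the bucket, key layout, and payload below are hypothetical, and AWS credentials are assumed to be configured.

```python
import json

import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

event = {"user_id": "u123", "page": "/checkout", "event_time": "2023-12-10T12:00:00Z"}

# Raw data lands in its native format (JSON here) with no transformation;
# the hypothetical key layout partitions by date for later batch reads.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/clickstream/dt=2023-12-10/event-0001.json",
    Body=json.dumps(event),
)
```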

2. Delta Lake (Reliable Processing Layer)

  • What: Adds structure, reliability, and ACID compliance to the Data Lake.
  • Use Cases:
    • CDC (Change Data Capture): Merge real-time order updates from databases (see the MERGE sketch after this list).
    • Schema Enforcement: Ensure clean data before loading to the warehouse.
    • Time Travel: Debug issues by querying past versions (e.g., "Why did sales drop on Day X?").
  • Why Needed:
    • Prevents "data swamp" issues (bad quality, duplicates).
    • Prevents bad data with schema enforcement.
    • Enables historical data analysis with time travel.
    • Enables batch + streaming in one system (e.g., Spark + Kafka).
  • Technology: Delta Lake on Databricks / Delta Lake Standalone on AWS / Azure.
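
The CDC use case above maps to Delta's MERGE API. A minimal sketch, assuming delta-spark is installed and the hypothetical orders table from earlier exists:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

orders = DeltaTable.forPath(spark, "s3a://my-data-lake/delta/orders")  # hypothetical

# Hypothetical micro-batch of order changes captured from the source database.
updates = spark.createDataFrame(
    [(1, "shoes", 49.0), (2, "hat", 19.0)], ["order_id", "item", "amount"]
)

# ACID upsert: matched rows are updated and new rows inserted atomically,
# so readers never see a half-applied batch.
(
    orders.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```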

3. Data Warehouse (Business Analytics)

  • What: Optimized SQL layer for reporting, dashboards, and ad-hoc queries.
  • Use Cases:
    • BI Reports: "Top 10 products by revenue last quarter."
    • Customer Segmentation: SQL queries for marketing teams.
    • Regulatory Compliance: Auditable, governed data.
  • Why Needed:
    • Fast performance for complex JOINs/aggregations.
    • Self-service for business users (Tableau, Power BI).
    • Data governance and security features.
    • Optimized storage for structured data.
  • Technology: Snowflake / BigQuery / Redshift (a query sketch follows).
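
The "top 10 products by revenue last quarter" report could look like the sketch below, using Snowflake's Python connector; the connection details and table/column names are hypothetical.

```python
import snowflake.connector  # assumes snowflake-connector-python is installed

# Hypothetical connection details.
conn = snowflake.connector.connect(
    user="ANALYST",
    password="...",
    account="my_account",
    warehouse="BI_WH",
    database="ANALYTICS",
    schema="SALES",
)

# Top 10 products by revenue last quarter (hypothetical table and columns).
query = """
    SELECT product_id, SUM(amount) AS revenue
    FROM curated_orders
    WHERE order_date >= DATEADD(quarter, -1, DATE_TRUNC(quarter, CURRENT_DATE))
      AND order_date <  DATE_TRUNC(quarter, CURRENT_DATE)
    GROUP BY product_id
    ORDER BY revenue DESC
    LIMIT 10
"""
for product_id, revenue in conn.cursor().execute(query):
    print(product_id, revenue)
```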

How They Work Together

  1. Data lands in the Data Lake (e.g., raw JSON logs from apps).
  2. Delta Lake processes it (clean, deduplicate, enforce schema).
  3. Curated data moves to the Data Warehouse (for BI/analytics; a hand-off sketch follows).
  4. Business users query the warehouse (fast, reliable reports).
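
Step 3, the hand-off from the curated Delta layer to the warehouse, might look like the sketch below using Spark's generic JDBC writer; the URL, credentials, and table names are hypothetical, and managed warehouses usually offer dedicated bulk-load connectors instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curated-to-warehouse").getOrCreate()

# Read the curated Delta table (hypothetical path).
curated = spark.read.format("delta").load("s3a://my-data-lake/delta/orders_clean")

# Hand off to the warehouse over JDBC (hypothetical URL and credentials;
# the matching JDBC driver jar must be on Spark's classpath).
(
    curated.write.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5439/analytics")
    .option("dbtable", "sales.curated_orders")
    .option("user", "loader")
    .option("password", "...")
    .mode("append")
    .save()
)
```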

Real-World Example: Fraud Detection Pipeline

  • Data Lake: Stores raw transaction logs (some fraudulent, some legit).
  • Delta Lake (sketched below):
    • Runs Spark ML to flag suspicious transactions.
    • Uses ACID transactions to update fraud labels.
  • Data Warehouse:
    • Executes fast SQL queries for fraud dashboards.
    • Alerts analysts via Tableau.
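
A rough sketch of the middle (Delta Lake) stage, assuming a previously trained Spark ML pipeline and a transactions Delta table with a fraud_flag column; all paths and column names are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fraud-scoring").getOrCreate()

txns_path = "s3a://my-data-lake/delta/transactions"  # hypothetical table
model = PipelineModel.load("s3a://models/fraud_v1")  # hypothetical trained pipeline

# Score the latest transactions; prediction == 1.0 marks suspected fraud.
scored = model.transform(spark.read.format("delta").load(txns_path))
flagged = scored.filter("prediction = 1.0").select("txn_id")

# ACID update of fraud labels on the shared table, so downstream readers
# never see a partially labeled batch.
(
    DeltaTable.forPath(spark, txns_path).alias("t")
    .merge(flagged.alias("s"), "t.txn_id = s.txn_id")
    .whenMatchedUpdate(set={"fraud_flag": "true"})
    .execute()
)
```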