Iceberg/Hudi/Delta on EMR
1. Introduction
In this lesson, we will explore how to use Apache Iceberg, Apache Hudi, and Delta Lake on Amazon EMR (Elastic MapReduce) for data engineering tasks. These technologies are designed to manage large datasets, providing capabilities for schema evolution, time travel, and ACID transactions.
2. Key Concepts
2.1 Apache Iceberg
Iceberg is an open table format for huge analytic datasets. It brings SQL-table semantics (schema evolution, hidden partitioning, and snapshot-based time travel) to data files stored in S3 or HDFS.
2.2 Apache Hudi
Hudi (Hadoop Upserts Deletes and Incrementals) is a data management framework that provides capabilities for upserts and incremental data processing.
2.3 Delta Lake
Delta Lake is a storage layer that brings ACID transactions to Apache Spark and big data workloads, allowing for reliable data lakes.
3. Setup & Configuration
To get started with Iceberg, Hudi, or Delta on EMR, follow these steps:
- Launch an Amazon EMR cluster with Spark and the desired libraries (Iceberg, Hudi, Delta).
- Configure the EMR cluster to use the appropriate file system (S3 or HDFS).
- Install necessary dependencies via bootstrap actions or custom scripts.
3.1 Example of launching an EMR Cluster
aws emr create-cluster --name "Iceberg-Hudi-Delta-Cluster" \
--release-label emr-6.9.0 \
--applications Name=Spark \
--configurations '[{"Classification":"iceberg-defaults","Properties":{"iceberg.enabled":"true"}},{"Classification":"delta-defaults","Properties":{"delta.enabled":"true"}}]' \
--ec2-attributes KeyName=myKey --instance-type m5.xlarge \
--instance-count 3 --use-default-roles
Note that Hudi, Iceberg, and Delta Lake are not separate EMR application names: Hudi is installed automatically when Spark is selected, while Iceberg (EMR 6.5.0+) and Delta Lake (EMR 6.9.0+) are enabled through the iceberg-defaults and delta-defaults configuration classifications shown above.
4. Data Management
Once your EMR cluster is set up, you can start managing your data.
4.1 Writing Data with Iceberg
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("IcebergExample")
  // Register an Iceberg catalog (here backed by the AWS Glue Data Catalog)
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.glue_catalog.warehouse", "s3://path/to/warehouse")
  .getOrCreate()

val df = spark.read.format("csv").option("header", "true").load("s3://path/to/your/data.csv")
df.writeTo("glue_catalog.db.iceberg_table_name").createOrReplace()
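Because Iceberg tracks every commit as a snapshot, a table written this way can also be read as of an earlier point in time. The sketch below assumes the catalog and table name from the example above; the snapshot ID is hypothetical and would come from the table's snapshots metadata table.

```scala
// List the table's snapshots to find an ID to travel back to
val snapshots = spark.read
  .format("iceberg")
  .load("glue_catalog.db.iceberg_table_name.snapshots")
snapshots.select("snapshot_id", "committed_at").show()

// Read the table as of a specific snapshot (ID below is illustrative)
val historicalDf = spark.read
  .format("iceberg")
  .option("snapshot-id", "1234567890")
  .load("glue_catalog.db.iceberg_table_name")
```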
4.2 Upserting Data with Hudi
df.write.format("hudi")
  .option("hoodie.table.name", "hudi_table_name")                  // required: target table name
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")   // copy-on-write table
  .option("hoodie.datasource.write.recordkey.field", "record_id")  // unique record key
  .option("hoodie.datasource.write.precombine.field", "timestamp") // latest value per key wins
  .option("hoodie.datasource.write.operation", "upsert")           // upsert rather than bulk insert
  .mode("append")                                                  // append so existing records are upserted, not rewritten
  .save("s3://path/to/hudi_table_name")
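Hudi's incremental processing works on the read side as well: a query can pull only the records committed after a given instant. The sketch below assumes the table path from the example above; the begin instant is illustrative (Hudi instants are yyyyMMddHHmmss timestamp strings taken from the table's commit timeline).

```scala
// Fetch only records committed after the given instant
val incrementalDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20230101000000") // hypothetical commit instant
  .load("s3://path/to/hudi_table_name")
```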
4.3 Reading Data with Delta Lake
val deltaDf = spark.read.format("delta").load("s3://path/to/delta-table")
deltaDf.show()
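Delta Lake also supports time travel: the transaction log keeps earlier table versions, which can be read with the versionAsOf option. A minimal sketch, assuming the same table path as above:

```scala
// Read the table as it was at version 0 (the initial commit)
val oldDf = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("s3://path/to/delta-table")
oldDf.show()
```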
5. Best Practices
- Use partitioning wisely to optimize query performance.
- Regularly run VACUUM on your Delta Lake tables to remove data files no longer referenced by the transaction log.
- Monitor and optimize your EMR cluster resources based on workload.
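The vacuum recommendation above can be applied programmatically through the DeltaTable API, assuming the table path used in the earlier examples:

```scala
import io.delta.tables.DeltaTable

val table = DeltaTable.forPath(spark, "s3://path/to/delta-table")
// Delete unreferenced files older than 168 hours (7 days, the default retention window)
table.vacuum(168)
```

Shortening the retention window below the default requires extra care, since concurrent readers of older versions may still need those files.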
6. FAQ
What are the differences between Iceberg, Hudi, and Delta Lake?
While all three provide similar functionality, their designs and implementations differ. Iceberg focuses on large-scale analytic tables with flexible schema and partition evolution, Hudi specializes in upserts and incremental processing, and Delta Lake emphasizes ACID transactions tightly integrated with Spark.
Can I use these frameworks together?
Yes, you can use them in different contexts or stages of your data pipeline, but each has its own set of optimizations and should be used based on your specific use case.