Data Engineering on AWS

Amazon EMR on EKS

Introduction

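Amazon EMR on Amazon EKS lets you run big data frameworks, primarily Apache Spark, on a Kubernetes cluster managed by Amazon EKS. Because jobs run as pods, they can share the cluster with other workloads, scale with the cluster's node capacity, and avoid the cost of keeping a dedicated EMR cluster idle.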

Key Concepts

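**Amazon EMR**: A managed big data platform for running frameworks such as Apache Spark.
**Amazon EKS**: A managed Kubernetes service for running containerized applications.
**Pod**: The smallest deployable unit in Kubernetes. Spark drivers and executors each run in their own pod.
**Virtual cluster**: A Kubernetes namespace registered with Amazon EMR. Jobs are submitted to a virtual cluster.
**Job run**: A unit of work in EMR on EKS, such as a Spark application submitted to a virtual cluster.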

Setup Steps

1. Create an EKS Cluster

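To create an EKS cluster, use the AWS Management Console or the AWS CLI. The CLI command below creates only the control plane; add a managed node group or a Fargate profile before submitting jobs so that pods have capacity to run on.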

aws eks create-cluster --name my-cluster --role-arn arn:aws:iam::123456789012:role/EKS-ClusterRole --resources-vpc-config subnetIds=subnet-12345678,subnet-87654321
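Cluster creation takes several minutes, so a script usually has to wait before moving on. The following is a minimal sketch of that wait, followed by a kubeconfig update so that kubectl can reach the cluster. The function name `prepare_cluster` is made up for illustration; the cluster name and region are passed in as arguments.

```shell
# Sketch: block until the EKS cluster is usable, then point kubectl at it.
# prepare_cluster is an illustrative helper name, not part of any AWS tool.
prepare_cluster() {
  local cluster_name="$1" region="$2"

  # Polls the cluster status until it reports ACTIVE (or the waiter times out).
  aws eks wait cluster-active --name "$cluster_name" --region "$region" || return 1

  # Adds or updates the kubeconfig entry for this cluster.
  aws eks update-kubeconfig --name "$cluster_name" --region "$region"
}
```

With the names used in this lesson, the call would be `prepare_cluster my-cluster us-west-2`.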

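2. Create an EMR on EKS Virtual Cluster

Register a Kubernetes namespace on the EKS cluster as an EMR virtual cluster. The container provider id is the EKS cluster name, not its ARN. The namespace must already exist, and EMR on EKS needs access to it; emr-jobs below is a placeholder namespace name:

aws emr-containers create-virtual-cluster --name "MyVirtualCluster" --container-provider "{ \"type\": \"EKS\", \"id\": \"my-cluster\", \"info\": { \"eksInfo\": { \"namespace\": \"emr-jobs\" } } }"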
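Two helper steps usually surround the virtual cluster command. Before it, the namespace has to exist and the EMR service needs permission to act inside it. After it, you need the virtual cluster ID for job submission. The sketch below covers both, assuming `kubectl` and `eksctl` are installed and configured for the cluster. The function names are illustrative, and the namespace is whatever you pass in.

```shell
# Sketch: prepare a namespace for EMR on EKS and look up a virtual cluster ID.
# grant_emr_access and virtual_cluster_id are illustrative helper names.

# Create the namespace and allow the EMR on EKS service to use it.
grant_emr_access() {
  local eks_cluster="$1" namespace="$2"

  kubectl create namespace "$namespace" || return 1

  # Sets up the Kubernetes role and binding and maps the EMR service identity.
  eksctl create iamidentitymapping \
    --cluster "$eks_cluster" \
    --namespace "$namespace" \
    --service-name "emr-containers"
}

# Print the ID of the first RUNNING virtual cluster with the given name.
virtual_cluster_id() {
  local name="$1"

  aws emr-containers list-virtual-clusters \
    --states RUNNING \
    --query "virtualClusters[?name=='${name}'].id | [0]" \
    --output text
}
```

For example, `grant_emr_access my-cluster emr-jobs` would run before the virtual cluster is created, and `VC_ID="$(virtual_cluster_id MyVirtualCluster)"` afterwards.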

3. Submit a Spark Job

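Submit a job to your EMR on EKS virtual cluster. The command requires the virtual cluster ID, the ARN of an IAM job execution role, and an EMR release label; the values in angle brackets are placeholders to fill in:

aws emr-containers start-job-run --virtual-cluster-id <virtual-cluster-id> --execution-role-arn <execution-role-arn> --release-label <release-label> --job-driver "{ \"sparkSubmitJobDriver\": { \"entryPoint\": \"s3://your-bucket/your-script.py\" } }"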
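Inline JSON with escaped quotes gets hard to read once Spark settings are added. One way to keep the submission manageable is to build the job driver JSON in a small function, as sketched below. The function names, the job name `sample-job`, and the executor sizing are illustrative assumptions, not recommended values.

```shell
# Sketch: submit a PySpark script and check on it afterwards.
# All function names and the Spark settings are illustrative.

# Print the --job-driver JSON for a script stored in S3.
job_driver_json() {
  local entry_point="$1"
  cat <<EOF
{
  "sparkSubmitJobDriver": {
    "entryPoint": "${entry_point}",
    "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2G"
  }
}
EOF
}

# Start a job run; the response JSON includes the job run id.
submit_job() {
  local virtual_cluster_id="$1" execution_role_arn="$2"
  local release_label="$3" entry_point="$4"

  aws emr-containers start-job-run \
    --virtual-cluster-id "$virtual_cluster_id" \
    --name "sample-job" \
    --execution-role-arn "$execution_role_arn" \
    --release-label "$release_label" \
    --job-driver "$(job_driver_json "$entry_point")"
}

# Print the current state of a job run (for example RUNNING or COMPLETED).
job_state() {
  local virtual_cluster_id="$1" job_run_id="$2"

  aws emr-containers describe-job-run \
    --virtual-cluster-id "$virtual_cluster_id" \
    --id "$job_run_id" \
    --query 'jobRun.state' \
    --output text
}
```

A call would look like `submit_job "$VC_ID" "$ROLE_ARN" "$RELEASE_LABEL" s3://your-bucket/your-script.py`, where the three variables hold your own values.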

Best Practices

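Run Spark executors on Spot Instances to reduce costs, and keep the driver on On-Demand capacity so that a Spot interruption does not end the whole job.
Optimize your Spark jobs for performance, for example by sizing executor cores and memory to fit the instance types in the cluster.
Enable logging to Amazon S3 or Amazon CloudWatch Logs to monitor job performance and debug failures.
Keep the EKS Kubernetes version and the EMR release label up to date.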
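The Spot and logging practices above can both be expressed through the `--configuration-overrides` option of `start-job-run`. The sketch below builds that JSON. It assumes executors should land on EKS managed node group nodes labeled `eks.amazonaws.com/capacityType=SPOT`, and that the Spark version in your release supports executor-specific node selectors (Spark 3.3 or later). The function name, log stream prefix, and S3 prefix are illustrative.

```shell
# Sketch: configuration overrides that send logs to S3 and CloudWatch
# and steer executors (not the driver) onto Spot-labeled nodes.
# config_overrides_json is an illustrative helper name.
config_overrides_json() {
  local log_bucket="$1" log_group="$2"
  cat <<EOF
{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.kubernetes.executor.node.selector.eks.amazonaws.com/capacityType": "SPOT"
      }
    }
  ],
  "monitoringConfiguration": {
    "cloudWatchMonitoringConfiguration": {
      "logGroupName": "${log_group}",
      "logStreamNamePrefix": "spark"
    },
    "s3MonitoringConfiguration": {
      "logUri": "s3://${log_bucket}/emr-on-eks-logs/"
    }
  }
}
EOF
}
```

The output would be passed as `--configuration-overrides "$(config_overrides_json your-bucket /emr-on-eks/jobs)"`. The job execution role needs permission to write to the bucket and the log group.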

FAQ

What is the difference between EMR and EMR on EKS?

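Standard EMR (EMR on EC2) provisions dedicated clusters of EC2 instances for big data processing. EMR on EKS instead submits jobs as pods to an existing EKS cluster, so big data jobs can share infrastructure with other containerized workloads and start without waiting for a cluster to be provisioned.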

Can I run non-Spark jobs on EMR on EKS?

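Only to a limited extent. EMR on EKS is built around Apache Spark, including Spark SQL scripts, and newer releases also support Apache Flink. Hadoop MapReduce and Hive are not offered as job types; for those, use EMR on EC2. Check the documentation for your release label to confirm which frameworks are available.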
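As an illustration of a job that is not a spark-submit script, a SQL file can be run through the Spark SQL job driver on releases that support it. The sketch below builds that job driver JSON. The function name, the S3 path in the usage note, and the executor count are illustrative assumptions.

```shell
# Sketch: job driver JSON for running a SQL file with Spark SQL.
# sql_job_driver_json is an illustrative helper name.
sql_job_driver_json() {
  local sql_file="$1"
  cat <<EOF
{
  "sparkSqlJobDriver": {
    "entryPoint": "${sql_file}",
    "sparkSqlParameters": "--conf spark.executor.instances=2"
  }
}
EOF
}
```

The output replaces the `sparkSubmitJobDriver` JSON in `start-job-run`, for example `--job-driver "$(sql_job_driver_json s3://your-bucket/your-query.sql)"`.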

How is pricing done for EMR on EKS?

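The EMR on EKS charge is based on the vCPU and memory resources requested by the pods that run your jobs, billed for the time those pods run. You pay separately for the EKS cluster itself and for the underlying compute (EC2 instances or Fargate) and storage.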
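The EMR on EKS portion of the bill can be estimated with simple arithmetic: hours run, multiplied by the sum of vCPUs times the vCPU rate and memory times the memory rate. The sketch below does that calculation. The rates in the example call are made-up round numbers, not AWS prices; substitute the current rates for your Region. EKS and EC2 or Fargate charges are not included.

```shell
# Sketch: estimate the EMR on EKS charge for a set of pods.
# Args: vCPUs requested, memory (GB) requested, hours run,
#       price per vCPU-hour, price per GB-hour.
emr_on_eks_charge() {
  awk -v vcpu="$1" -v mem="$2" -v hours="$3" \
      -v vcpu_rate="$4" -v mem_rate="$5" \
      'BEGIN { printf "%.4f\n", hours * (vcpu * vcpu_rate + mem * mem_rate) }'
}

# 4 vCPUs and 16 GB for 2 hours at hypothetical rates of 0.01 and 0.001:
# 2 * (4 * 0.01 + 16 * 0.001) = 0.112
emr_on_eks_charge 4 16 2 0.01 0.001   # prints 0.1120
```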