Batch vs Streaming in Data Engineering on AWS

Introduction

In the field of data engineering, understanding the differences between batch and streaming data processing is crucial. This lesson explores these two paradigms, their use cases, and how they can be implemented using AWS services.

Definitions

Batch Processing

Batch processing is the execution of a series of jobs on a computer without manual intervention. Data is collected over a period and processed in a single batch.

Streaming Processing

Streaming processing involves a continuous flow of data in which each record is processed as soon as it arrives, enabling real-time analytics.

Batch Processing

Batch processing is suitable for scenarios where data can be processed periodically. Common characteristics include:

  • Data is accumulated over time.
  • Processing occurs at scheduled intervals.
  • Latency is acceptable, often on the order of hours or days.
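As a toy illustration of the accumulate-then-process pattern (plain Python, no AWS involved; the event shape here is made up):

```python
# Toy batch pipeline: events accumulate in a buffer and are only
# processed when the scheduled run fires, all in one pass.
buffer = []

def ingest(event):
    """Accumulate an incoming event; nothing is processed yet."""
    buffer.append(event)

def run_batch():
    """Process everything accumulated so far, then clear the buffer."""
    total = sum(e['value'] for e in buffer)
    count = len(buffer)
    buffer.clear()
    return {'count': count, 'total': total}

for v in (10, 20, 30):
    ingest({'value': v})

print(run_batch())  # {'count': 3, 'total': 60}
```

In a real pipeline the "buffer" is typically object storage such as S3, and the scheduled run is a Glue job or similar.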

Example in AWS

A common service for batch processing in AWS is AWS Glue. Below is an example of how to create a Glue job:


import boto3

glue = boto3.client('glue')

# Define a Glue ETL job whose PySpark script lives in S3.
response = glue.create_job(
    Name='BatchJob',
    Role='AWSGlueServiceRole',  # IAM role that Glue assumes to run the job
    Command={
        'Name': 'glueetl',  # Spark ETL job type
        'ScriptLocation': 's3://my-bucket/scripts/my_script.py',
        'PythonVersion': '3'
    },
    MaxRetries=0,
    Timeout=60  # minutes
)

print(response)  # contains the name of the created job

Streaming Processing

Streaming processing is ideal for cases where immediate processing of data is required. Characteristics include:

  • Data is processed in real-time.
  • Lower latency, often in milliseconds.
  • Continuous data flow.

Example in AWS

A popular service for streaming data processing on AWS is Amazon Kinesis. Below is an example of how to put records into a Kinesis stream:


import boto3
import json

kinesis = boto3.client('kinesis')

data = {
    'message': 'Hello, World!'
}

# Write one record; records with the same partition key land on the same shard.
response = kinesis.put_record(
    StreamName='MyStream',
    Data=json.dumps(data).encode('utf-8'),  # JSON-encode the payload as bytes
    PartitionKey='partitionkey1'
)

print(response)  # contains the ShardId and SequenceNumber of the record
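On the consuming side, `get_records` returns each payload as a raw bytes blob. Here is a minimal sketch of decoding such records back into Python objects; the sample record below is hypothetical, shaped like an entry in the `Records` list that `kinesis.get_records` returns:

```python
import json

# Hypothetical record shaped like the 'Records' list returned by
# kinesis.get_records(); the Data field arrives as raw bytes.
records = [
    {'SequenceNumber': '1',
     'PartitionKey': 'partitionkey1',
     'Data': json.dumps({'message': 'Hello, World!'}).encode('utf-8')},
]

def decode_records(records):
    """Decode the JSON payload carried in each record's Data blob."""
    return [json.loads(r['Data'].decode('utf-8')) for r in records]

for payload in decode_records(records):
    print(payload['message'])  # Hello, World!
```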

Best Practices

For Batch Processing

  • Use columnar storage formats such as Parquet or ORC to reduce scan time and cost.
  • Use partitioning to improve read performance.
  • Schedule jobs during off-peak hours to reduce resource contention.
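To illustrate the partitioning point, here is a minimal sketch (the bucket name, table name, and key layout are hypothetical) that builds Hive-style `year=/month=/day=` S3 key prefixes, the layout that services like Glue and Athena can prune on at read time:

```python
from datetime import date

def partitioned_key(bucket, table, event_date, filename):
    """Build a Hive-style partitioned S3 key for a daily batch output."""
    prefix = (f"year={event_date.year:04d}/"
              f"month={event_date.month:02d}/"
              f"day={event_date.day:02d}")
    return f"s3://{bucket}/{table}/{prefix}/{filename}"

print(partitioned_key('my-bucket', 'sales', date(2024, 1, 5), 'part-0000.parquet'))
# s3://my-bucket/sales/year=2024/month=01/day=05/part-0000.parquet
```

Queries filtered on `year`, `month`, or `day` then read only the matching prefixes instead of scanning the whole table.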

For Streaming Processing

  • Implement backpressure handling to manage data spikes.
  • Use proper monitoring tools to track stream health.
  • Utilize windowing to aggregate data over specific time intervals.
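As a sketch of the windowing idea, here is a pure-Python tumbling-window aggregation over `(timestamp, value)` pairs (the event shape and 60-second window size are illustrative; managed services like Kinesis Data Analytics handle this for you):

```python
from collections import defaultdict

def tumbling_windows(events, window_seconds=60):
    """Group (timestamp, value) events into fixed-size tumbling windows
    and sum the values within each window."""
    sums = defaultdict(int)
    for ts, value in events:
        # Each window is identified by its start time.
        window_start = (ts // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sums)

events = [(0, 1), (30, 2), (61, 5), (119, 3), (120, 4)]
print(tumbling_windows(events))
# {0: 3, 60: 8, 120: 4}
```

Because windows never overlap, each event contributes to exactly one aggregate, which keeps per-window state small and bounded.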

FAQ

What are the primary differences between batch and streaming processing?

Batch processing is designed for processing large volumes of data at intervals, while streaming processing handles data in real-time as it is produced.

Can AWS handle both batch and streaming processing?

Yes, AWS provides services like AWS Glue for batch processing and Amazon Kinesis for streaming processing, allowing users to implement both approaches.

When should I use batch processing over streaming?

Batch processing is suitable when latency is not a critical factor and when processing large datasets at once is more efficient.