Data Engineering on AWS

Distribution & Sort Keys in Amazon Redshift

Overview

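In Amazon Redshift, data distribution and sorting are crucial for optimizing query performance and resource utilization. Distribution controls which compute node and slice each row is stored on; sorting controls the order of rows on disk within each slice. Choosing both well reduces data movement between nodes and the number of blocks a query has to scan.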

Distribution Keys

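A distribution key is the column whose values determine which slice, and therefore which compute node, each row is stored on. Rows with the same key value land on the same slice, so joins and aggregations on that column can run locally instead of redistributing rows across the network at query time.

Important: A distribution key with few distinct values, or with a handful of very common values, skews rows onto a small number of slices. A query finishes only when its slowest slice finishes, so skew degrades performance for the whole cluster.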
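
One way to spot uneven distribution is to look at the skew_rows column of the SVV_TABLE_INFO system view. The query below is a minimal sketch; the 'public' schema filter is an assumption and should be adjusted to wherever your tables live.

```sql
-- Assumption: the tables of interest are in the "public" schema.
-- skew_rows is the ratio of rows on the fullest slice to rows on the
-- emptiest slice, so values far above 1 suggest an uneven distribution.
SELECT "table", diststyle, tbl_rows, skew_rows
FROM svv_table_info
WHERE "schema" = 'public'
ORDER BY skew_rows DESC NULLS LAST;
```

Tables at the top of the result are the first candidates for a different distribution key or style.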

Types of Distribution Styles

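AUTO: Redshift chooses the distribution style based on table size and may change it as the table grows. This is the default when no style is specified.
KEY: Rows are distributed by the values of one specified column; rows with matching values are stored on the same slice.
EVEN: Rows are distributed round-robin across the slices, regardless of their values.
ALL: A full copy of the table is stored on every node. Useful for small, slowly changing dimension tables.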
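
For comparison with the KEY example that follows, here is a sketch of the other styles in DDL form. AUTO, where Redshift picks the style itself, is included as well. All table and column names are hypothetical and chosen only for illustration.

```sql
-- EVEN: no obvious join column, so spread rows round-robin.
CREATE TABLE clickstream_events (
    event_id BIGINT,
    event_time TIMESTAMP,
    page_url VARCHAR(256)
)
DISTSTYLE EVEN;

-- ALL: small lookup table copied to every node so joins never move it.
CREATE TABLE country_codes (
    country_code CHAR(2),
    country_name VARCHAR(64)
)
DISTSTYLE ALL;

-- AUTO: let Redshift choose and adjust the style as the table grows.
CREATE TABLE staging_orders (
    order_id BIGINT,
    customer_id INT,
    order_total DECIMAL(12,2)
)
DISTSTYLE AUTO;
```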

Example of Creating a Table with a Distribution Key

CREATE TABLE sales (
    sales_id INT,
    product_id INT,
    quantity INT
)
DISTSTYLE KEY
DISTKEY (product_id);
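
KEY distribution pays off when the tables you join share the same distribution column. The sketch below assumes a hypothetical products table (not part of the original example) distributed on product_id, then uses EXPLAIN to inspect how the join with sales would run.

```sql
-- Hypothetical dimension table, distributed on the same column as sales.
CREATE TABLE products (
    product_id INT,
    product_name VARCHAR(100)
)
DISTSTYLE KEY
DISTKEY (product_id);

-- Both tables share the distribution key, so rows with the same
-- product_id are stored on the same slice.
EXPLAIN
SELECT p.product_name, SUM(s.quantity) AS total_quantity
FROM sales s
JOIN products p ON p.product_id = s.product_id
GROUP BY p.product_name;
-- In the plan output, a join step labelled DS_DIST_NONE indicates that
-- no rows had to be redistributed for that join.
```

The plan you actually get depends on table statistics and sizes, so treat the label as something to check rather than a guaranteed result.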

Sort Keys

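Sort keys define the order in which rows are stored on disk. Redshift records the minimum and maximum values held in each block, so when a query filters on a sort key column it can skip blocks that fall entirely outside the requested range. This mainly benefits range-restricted queries.

Tip: Choose sort keys from the columns you filter on most often, particularly in range predicates such as date ranges.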

Types of Sort Keys

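Compound Sort Key: Rows are sorted by the listed columns in order: by the first column, then by the second, and so on. It helps most when queries filter on the leading column or on a prefix of the key columns. This is the default sort key type.
Interleaved Sort Key: Gives equal weight to every column in the key, which helps when different queries filter on different columns.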
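
The two types differ only in a keyword at table creation. The sketch below declares the same columns both ways; the table names sales_compound and sales_interleaved are hypothetical.

```sql
-- Compound: sorted by sale_date first, then product_id within each date.
-- Best when queries filter on sale_date, alone or together with product_id.
CREATE TABLE sales_compound (
    sales_id INT,
    product_id INT,
    sale_date DATE,
    quantity INT
)
COMPOUND SORTKEY (sale_date, product_id);

-- Interleaved: sale_date and product_id get equal weight.
-- Suits workloads where some queries filter only on product_id
-- and others only on sale_date.
CREATE TABLE sales_interleaved (
    sales_id INT,
    product_id INT,
    sale_date DATE,
    quantity INT
)
INTERLEAVED SORTKEY (sale_date, product_id);
```

Interleaved tables tend to need more maintenance (VACUUM REINDEX) as data is loaded, so compound is usually the safer starting point.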

Example of Creating a Table with a Sort Key

CREATE TABLE product_sales (
    sales_id INT,
    product_id INT,
    sale_date DATE,
    quantity INT
)
SORTKEY (sale_date);
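
A range-restricted query against this table might look like the sketch below. The date range is illustrative; the point is that the filter is on the sort key column, so blocks holding only other dates can be skipped.

```sql
-- Filter on the sort key column (sale_date) with a half-open date range.
SELECT product_id, SUM(quantity) AS units_sold
FROM product_sales
WHERE sale_date >= '2024-01-01'
  AND sale_date <  '2024-02-01'
GROUP BY product_id;
```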

Best Practices

To optimize your Redshift database performance, consider the following best practices:

Analyze your query patterns and choose keys accordingly.
Regularly monitor and adjust distribution and sort keys based on data growth and usage.
Avoid the ALL distribution style for large tables; every node stores a full copy, which multiplies storage and slows loads and updates.
Use the VACUUM command to reclaim space and re-sort rows after large deletes, updates, or loads.
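
As a sketch of the maintenance step, the statements below check how unsorted a table is and then vacuum and re-analyze it. They use the product_sales table from the sort key example; whether a manual VACUUM is worth running depends on your workload.

```sql
-- unsorted: percentage of rows not in sort order.
-- stats_off: how stale the planner statistics are (0 is current).
SELECT "table", unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'product_sales';

-- Reclaim space from deleted rows and re-sort the table.
VACUUM FULL product_sales;

-- Or re-sort only, without reclaiming deleted space.
VACUUM SORT ONLY product_sales;

-- Refresh planner statistics after large changes.
ANALYZE product_sales;
```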

FAQ

What is the default distribution style in Redshift?

The default distribution style is AUTO. Redshift picks a style based on table size, typically starting small tables on ALL and moving them to EVEN or KEY as they grow.

Can I change the distribution style of an existing table?

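Yes. ALTER TABLE can change the distribution style or distribution key of an existing table in place, and Redshift redistributes the data for you. Creating a new table with the desired distribution style and copying the data over (a deep copy) remains an option.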
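
A sketch of changing keys in place with ALTER TABLE, using the sales and product_sales tables from the examples above, followed by the deep-copy alternative. The target keys and the sales_new name are illustrative assumptions, and the ALTER statements are alternatives rather than a sequence to run together.

```sql
-- Change the distribution key in place (sales_id is just an illustration).
ALTER TABLE sales ALTER DISTKEY sales_id;

-- Or hand the choice of distribution style back to Redshift.
ALTER TABLE sales ALTER DISTSTYLE AUTO;

-- Sort keys can be changed in place as well (compound keys only).
ALTER TABLE product_sales ALTER SORTKEY (sale_date, product_id);

-- Deep-copy alternative: build a new table, copy the rows, swap names.
CREATE TABLE sales_new (
    sales_id INT,
    product_id INT,
    quantity INT
)
DISTSTYLE EVEN;

INSERT INTO sales_new SELECT * FROM sales;
DROP TABLE sales;
ALTER TABLE sales_new RENAME TO sales;
```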

How do I determine the best keys for my table?

Analyze your query workload. Columns frequently used in joins are candidates for the distribution key; columns frequently used in filters, especially range filters, are candidates for sort keys.