Skip to main content

Cluster Key: Automatic Data Organization for Query Acceleration

Cluster keys provide automatic data organization to dramatically improve query performance on large tables. Databend seamlessly and continually manages all clustering operations in the background - you simply define the cluster key and Databend handles the rest.

What Problem Does It Solve?

Large tables without proper organization create significant performance and maintenance challenges:

ProblemImpactAutomatic Clustering Solution
Full Table ScansQueries read entire tables even for filtered dataAutomatically organize data, read only relevant blocks
Random Data AccessSimilar data scattered across storageContinuously group related data together
Slow Filter QueriesWHERE clauses scan unnecessary rowsAuto-skip irrelevant blocks entirely
High I/O CostsReading massive amounts of unused dataMinimize data transfer automatically
Manual MaintenanceNeed to monitor and manually re-cluster tablesZero maintenance - automatic background optimization
Resource ManagementMust allocate compute for clustering operationsDatabend handles all clustering resources automatically

Example: An e-commerce table with millions of products. Without clustering, querying WHERE category IN ('Electronics', 'Computers') must scan all product categories. With automatic clustering by category, Databend continuously groups Electronics and Computers products together, scanning only 2 blocks instead of 1000+ blocks.

Benefits of Automatic Clustering

Ease-of-Maintenance: Databend eliminates the need for:

  • Monitoring the state of clustered tables
  • Manually triggering re-clustering operations
  • Designating compute resources for clustering
  • Scheduling maintenance windows

How it Works: After you define a cluster key, Databend automatically:

  • Monitors table changes from DML operations
  • Evaluates when tables would benefit from re-clustering
  • Performs background clustering optimization
  • Maintains optimal data organization continuously

All you need to do is define a clustering key for each table (if appropriate) and Databend manages all future maintenance automatically.

How It Works

Cluster keys organize data into storage blocks (Parquet files) based on specified columns:

Cluster Key Visualization

  1. Data Organization → Similar values grouped into adjacent blocks
  2. Metadata Creation → Block-to-value mappings stored for fast lookup
  3. Query Optimization → Only relevant blocks read during queries
  4. Performance Boost → Fewer rows scanned, faster results

Quick Setup

-- Create table with cluster key
CREATE TABLE sales (
order_id INT,
order_date TIMESTAMP,
region VARCHAR,
amount DECIMAL
) CLUSTER BY (region);

-- Or add cluster key to existing table
ALTER TABLE sales CLUSTER BY (region, order_date);

Choosing the Right Cluster Key

Select columns based on your most common query filters:

Query PatternRecommended Cluster KeyExample
Filter by single columnThat columnCLUSTER BY (region)
Filter by multiple columnsMultiple columnsCLUSTER BY (region, category)
Date range queriesDate/timestamp columnsCLUSTER BY (order_date)
High cardinality columnsUse expressions to reduce valuesCLUSTER BY (DATE(created_at))

Good vs Bad Cluster Keys

✅ Good Choices❌ Poor Choices
Frequently filtered columnsRarely used columns
Medium cardinality (100-10K values)Boolean columns (too few values)
Date/time columnsUnique ID columns (too many values)
Region, category, statusRandom or hash columns

Monitoring Performance

-- Check clustering effectiveness
SELECT * FROM clustering_information('database_name', 'table_name');

-- Key metrics to watch:
-- average_depth: Lower is better (< 2 ideal)
-- average_overlaps: Lower is better
-- block_depth_histogram: More blocks at depth 1-2

When to Re-cluster

Tables become disorganized over time with data changes:

-- Check if re-clustering is needed
SELECT IF(average_depth > 2 * LEAST(GREATEST(total_block_count * 0.001, 1), 16),
'Re-cluster needed',
'Clustering is good')
FROM clustering_information('your_database', 'your_table');

-- Re-cluster the table
ALTER TABLE your_table RECLUSTER;

Performance Tuning

Custom Block Size

Adjust block size for better performance:

-- Smaller blocks = fewer rows per query
ALTER TABLE sales SET OPTIONS(
ROW_PER_BLOCK = 100000,
BLOCK_SIZE_THRESHOLD = 52428800
);

Automatic Re-clustering

  • COPY INTO and REPLACE INTO automatically trigger re-clustering
  • Monitor clustering metrics regularly
  • Re-cluster when average_depth becomes too high

Best Practices

PracticeBenefit
Start SimpleUse single-column cluster keys first
Monitor MetricsCheck clustering_information regularly
Test PerformanceMeasure query speed before/after clustering
Re-cluster PeriodicallyMaintain clustering after data changes
Consider CostsClustering consumes compute resources

Important Notes

tip

When to Use Cluster Keys:

  • Large tables (millions+ rows)
  • Slow query performance
  • Frequent filter-based queries
  • Analytical workloads

When NOT to Use:

  • Small tables
  • Random access patterns
  • Frequently changing data

Cluster keys are most effective on large, frequently queried tables with predictable filter patterns. Start with your most common WHERE clause columns.