When ClickHouse CPU Explodes: A Practical Lesson in OLAP vs OLTP Design


Abstract

ClickHouse is an extremely fast analytical database—but it is not a key-value store.
This article walks through a real-world performance incident where a seemingly harmless query (LIMIT 1) caused massive CPU usage, explains why it happened, and presents a systemic fix using proper data-responsibility separation between OLAP and OLTP systems.

If you are using ClickHouse for event tracking, affiliate analytics, or high-volume logging, this will save you from a painful production outage.


The Symptom

The system showed sustained high CPU usage on the ClickHouse server:

  • CPU usage: ~60%
  • Disk IO: low
  • Memory usage: normal
  • No long-running queries visible

At first glance, nothing looked obviously wrong.

However, inspecting system.query_log revealed a pattern:

  • Queries finished very fast (~50–80 ms)
  • Each query scanned millions of rows
  • The same query was executed very frequently
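
A query along these lines against system.query_log surfaces the pattern (the one-hour window and the LIMIT are illustrative; the columns are standard in modern ClickHouse releases):

-- Frequent queries that scan many rows despite finishing quickly
SELECT
    normalized_query_hash,
    any(query) AS sample_query,
    count() AS executions,
    round(avg(query_duration_ms)) AS avg_ms,
    round(avg(read_rows)) AS avg_rows_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY normalized_query_hash
ORDER BY executions * avg_rows_read DESC
LIMIT 10;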

This combination is lethal:

Fast queries × huge scans × high QPS = CPU exhaustion


The Query That Looked Innocent

SELECT
    tenant_id,
    campaign_id,
    traffic_source_id,
    source_id,
    country
FROM clicks
WHERE system_click_id = '...'
LIMIT 1;

At face value:

  • LIMIT 1
  • Single equality filter
  • Should be fast, right?

Wrong.


Why This Breaks ClickHouse

ClickHouse is an OLAP database.
Its performance model depends heavily on the table’s ORDER BY (sorting key).

Example table definition:

ENGINE = MergeTree
PARTITION BY toDate(clicked_at)
ORDER BY (tenant_id, campaign_id, clicked_at);
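
Spelling the table out in full makes the problem easier to see; a complete definition consistent with the columns used in this post might look like the sketch below (the column list and types are assumptions, not the original schema):

CREATE TABLE clicks (
    system_click_id String,
    tenant_id UUID,
    campaign_id UUID,
    traffic_source_id UUID,
    source_id String,
    country LowCardinality(String),
    clicked_at DateTime
)
ENGINE = MergeTree
PARTITION BY toDate(clicked_at)
ORDER BY (tenant_id, campaign_id, clicked_at);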

Key observation:

system_click_id is not part of the ORDER BY key

What ClickHouse Must Do

  • It cannot use the sparse primary index to locate rows by system_click_id
  • It must scan the entire partition
  • LIMIT 1 is applied only after scanning and filtering, so it does not prevent the scan
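
You can verify the full scan directly: EXPLAIN indexes = 1 (available in recent ClickHouse releases) shows how much of the table the primary index can prune, and for this query you would expect it to select essentially every part and granule, because the filter column does not appear in the sorting key:

EXPLAIN indexes = 1
SELECT tenant_id, campaign_id, traffic_source_id, source_id, country
FROM clicks
WHERE system_click_id = '...'
LIMIT 1;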

In practice:

  • ~8 million rows scanned
  • ~200–300 MB read
  • Per query

Repeat this dozens or hundreds of times per second → CPU meltdown.
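
Put rough numbers on that multiplication (taking 100 QPS as a round, illustrative figure) and the symptom profile at the top of this article falls out directly:

8,000,000 rows/query × 100 queries/s ≈ 800,000,000 rows/s to decompress and filter

The hot partition usually sits in the page cache, so disk IO stays low while the CPU does all of that decompression and filtering.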


The Root Cause (Not a Bug)

This was not a ClickHouse bug.
It was a data-responsibility violation.

The system was using ClickHouse for:

  • ✅ Analytics (good)
  • ❌ Single-row lookup by unique ID (bad)

OLAP vs OLTP: The Missing Layer

The system lacked a proper lookup layer.

Two fundamentally different access patterns were mixed:

Access Pattern             Characteristics                     Correct Storage
Single-row lookup          High QPS, low latency, exact key    OLTP / KV
Aggregation & reporting    Large scans, batch queries          OLAP

ClickHouse excels at the second—and performs terribly at the first.


The Correct Architecture

High-Level Separation

              ┌───────────────────┐
Click Event → │ OLTP Lookup Store │ ← Conversion lookup
              └─────────┬─────────┘
                        │ async / batch
                        ▼
              ┌───────────────────┐
              │ ClickHouse (OLAP) │ ← Analytics & reporting
              └───────────────────┘

The Minimal Fix (No Big Rewrite)

1. Keep Raw Events in ClickHouse

ClickHouse remains the source of truth for analytics:

  • Clicks
  • Conversions
  • Rollups
  • Daily statistics

Nothing is removed.


2. Add a Lightweight Lookup Table (OLTP)

A new table is introduced in an OLTP database (e.g., PostgreSQL):

CREATE TABLE clicks_lookup (
    system_click_id TEXT PRIMARY KEY,
    tenant_id UUID,
    campaign_id UUID,
    offer_id UUID,
    traffic_source_id UUID,
    source_id TEXT,
    country TEXT,
    clicked_at TIMESTAMPTZ
);

Characteristics:

  • Very small rows
  • Written once per click
  • Indexed by primary key
  • Built for high-QPS point lookups

3. Adjust the Data Flow

On Click

User click →
1. Insert into OLTP clicks_lookup
2. Insert into ClickHouse clicks (analytics)
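
As a sketch, the click handler performs two writes; everything below is a placeholder bound from the incoming request (the zero UUIDs are stand-ins, not real data):

-- 1. PostgreSQL: the point-lookup record, written synchronously
INSERT INTO clicks_lookup
    (system_click_id, tenant_id, campaign_id, offer_id,
     traffic_source_id, source_id, country, clicked_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, now());

-- 2. ClickHouse: the analytics row, ideally batched or written asynchronously
INSERT INTO clicks
    (system_click_id, tenant_id, campaign_id, traffic_source_id,
     source_id, country, clicked_at)
VALUES ('click-0001', '00000000-0000-0000-0000-000000000000',
        '00000000-0000-0000-0000-000000000000',
        '00000000-0000-0000-0000-000000000000', 'src-1', 'US', now());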

On Conversion

Conversion callback →
1. Lookup click metadata via OLTP
2. Insert conversion into ClickHouse
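
In SQL terms, a sketch of the conversion path (the conversions table name and columns are assumptions for illustration):

-- 1. PostgreSQL: single indexed read; $1 is the bound click ID
SELECT tenant_id, campaign_id, offer_id, traffic_source_id, source_id, country
FROM clicks_lookup
WHERE system_click_id = $1;

-- 2. ClickHouse: insert the conversion enriched with that metadata,
--    using the same batched/async write pattern as the click insert
INSERT INTO conversions
    (system_click_id, tenant_id, campaign_id, country, converted_at)
VALUES ('click-0001', '00000000-0000-0000-0000-000000000000',
        '00000000-0000-0000-0000-000000000000', 'US', now());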

Key rule:

Conversion processing must never query ClickHouse.


Why This Works

  • OLTP handles high-QPS point lookups efficiently
  • ClickHouse focuses purely on analytics
  • CPU usage drops immediately
  • System becomes horizontally scalable

This is not an optimization—it is correct system design.


Common Anti-Patterns to Avoid

❌ Using ClickHouse for:

  • WHERE unique_id = ?
  • LIMIT 1 lookups
  • High-frequency API queries
  • OLTP-style joins

✅ Use ClickHouse for:

  • Time-range queries
  • Aggregations
  • Rollups
  • Reporting and analytics
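
By contrast, this is the shape of query ClickHouse is built for; a representative example against the clicks table above (the date range and grouping are illustrative):

SELECT
    campaign_id,
    count() AS clicks,
    uniq(source_id) AS distinct_sources
FROM clicks
WHERE tenant_id = toUUID('00000000-0000-0000-0000-000000000000')
  AND clicked_at >= now() - INTERVAL 7 DAY
GROUP BY campaign_id
ORDER BY clicks DESC;

This hits the sorting key prefix (tenant_id), prunes partitions by date, and aggregates millions of rows in one pass, which is exactly the workload the storage engine is optimized for.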

Takeaway

ClickHouse is not slow.
It is precise about what it is good at.

Most ClickHouse “performance problems” are actually architecture problems.

If your system starts showing unexplained CPU spikes:

  • Check what you are querying
  • Not just how fast it runs

Designing for the right access pattern will always outperform any tuning knob.


Final Thought

This incident marked a transition point:

from “it works”
to “it scales”

And that transition almost always requires separating OLTP and OLAP responsibilities.

P.S.

This article is quite subjective; if it is not to your taste, feel free to close it.
If you found it helpful, you can subscribe to this site via RSS.
