When ClickHouse CPU Explodes: A Practical Lesson in OLAP vs OLTP Design


Abstract

ClickHouse is an extremely fast analytical database—but it is not a key-value store.
This article walks through a real-world performance incident where a seemingly harmless query (LIMIT 1) caused massive CPU usage, explains why it happened, and presents a systemic fix using proper data-responsibility separation between OLAP and OLTP systems.

If you are using ClickHouse for event tracking, affiliate analytics, or high-volume logging, this will save you from a painful production outage.


The Symptom

The system showed sustained high CPU usage on the ClickHouse server:

  • CPU usage: ~60%
  • Disk IO: low
  • Memory usage: normal
  • No long-running queries visible

At first glance, nothing looked obviously wrong.

However, inspecting system.query_log revealed a pattern:

  • Queries finished very fast (~50–80 ms)
  • Each query scanned millions of rows
  • The same query was executed very frequently
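
A query along these lines against system.query_log surfaces the pattern (the one-hour window and the LIMIT are illustrative; the columns are standard in modern ClickHouse releases):

-- Frequent queries that scan many rows despite finishing quickly
SELECT
    normalized_query_hash,
    any(query) AS sample_query,
    count() AS executions,
    round(avg(query_duration_ms)) AS avg_ms,
    round(avg(read_rows)) AS avg_rows_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY normalized_query_hash
ORDER BY executions * avg_rows_read DESC
LIMIT 10;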

This combination is lethal:

Fast queries × huge scans × high QPS = CPU exhaustion


The Query That Looked Innocent

SELECT
    tenant_id,
    campaign_id,
    traffic_source_id,
    source_id,
    country
FROM clicks
WHERE system_click_id = '...'
LIMIT 1;

At face value:

  • LIMIT 1
  • Single equality filter
  • Should be fast, right?

Wrong.


Why This Breaks ClickHouse

ClickHouse is an OLAP database.
Its performance model depends heavily on the table’s ORDER BY (sorting key).

Example table definition:

ENGINE = MergeTree
PARTITION BY toDate(clicked_at)
ORDER BY (tenant_id, campaign_id, clicked_at);
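
Spelling the table out in full makes the problem easier to see; a complete definition consistent with the columns used in this post might look like the sketch below (the column list and types are assumptions, not the original schema):

CREATE TABLE clicks (
    system_click_id String,
    tenant_id UUID,
    campaign_id UUID,
    traffic_source_id UUID,
    source_id String,
    country LowCardinality(String),
    clicked_at DateTime
)
ENGINE = MergeTree
PARTITION BY toDate(clicked_at)
ORDER BY (tenant_id, campaign_id, clicked_at);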

Key observation:

system_click_id is not part of the ORDER BY key

What ClickHouse Must Do

  • It cannot use the sparse primary index to locate rows by system_click_id
  • It must scan the entire partition
  • LIMIT 1 is applied only after scanning and filtering, so it does not prevent the scan
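
You can verify the full scan directly: EXPLAIN indexes = 1 (available in recent ClickHouse releases) shows how much of the table the primary index can prune, and for this query you would expect it to select essentially every part and granule, because the filter column does not appear in the sorting key:

EXPLAIN indexes = 1
SELECT tenant_id, campaign_id, traffic_source_id, source_id, country
FROM clicks
WHERE system_click_id = '...'
LIMIT 1;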

In practice:

  • ~8 million rows scanned
  • ~200–300 MB read
  • Per query

Repeat this dozens or hundreds of times per second → CPU meltdown.
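
Put rough numbers on that multiplication (taking 100 QPS as a round, illustrative figure) and the symptom profile at the top of this article falls out directly:

8,000,000 rows/query × 100 queries/s ≈ 800,000,000 rows/s to decompress and filter

The hot partition usually sits in the page cache, so disk IO stays low while the CPU does all of that decompression and filtering.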


The Root Cause (Not a Bug)

This was not a ClickHouse bug.
It was a data-responsibility violation.

The system was using ClickHouse for:

  • ✅ Analytics (good)
  • ❌ Single-row lookup by unique ID (bad)

OLAP vs OLTP: The Missing Layer

The system lacked a proper lookup layer.

Two fundamentally different access patterns were mixed:

Access Pattern             Characteristics                     Correct Storage
Single-row lookup          High QPS, low latency, exact key    OLTP / KV
Aggregation & reporting    Large scans, batch queries          OLAP

ClickHouse excels at the second—and performs terribly at the first.


The Correct Architecture

High-Level Separation

              ┌───────────────────┐
Click Event → │ OLTP Lookup Store │ ← Conversion lookup
              └─────────┬─────────┘
                        │ async / batch
                        ▼
              ┌───────────────────┐
              │ ClickHouse (OLAP) │ ← Analytics & reporting
              └───────────────────┘

The Minimal Fix (No Big Rewrite)

1. Keep Raw Events in ClickHouse

ClickHouse remains the source of truth for analytics:

  • Clicks
  • Conversions
  • Rollups
  • Daily statistics

Nothing is removed.


2. Add a Lightweight Lookup Table (OLTP)

A new table is introduced in an OLTP database (e.g., PostgreSQL):

CREATE TABLE clicks_lookup (
    system_click_id TEXT PRIMARY KEY,
    tenant_id UUID,
    campaign_id UUID,
    offer_id UUID,
    traffic_source_id UUID,
    source_id TEXT,
    country TEXT,
    clicked_at TIMESTAMPTZ
);

Characteristics:

  • Very small rows
  • Written once per click
  • Indexed by primary key
  • Built for high-QPS point lookups

3. Adjust the Data Flow

On Click

User click →
1. Insert into OLTP clicks_lookup
2. Insert into ClickHouse clicks (analytics)
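
As a sketch, the click handler performs two writes; everything below is a placeholder bound from the incoming request (the zero UUIDs are stand-ins, not real data):

-- 1. PostgreSQL: the point-lookup record, written synchronously
INSERT INTO clicks_lookup
    (system_click_id, tenant_id, campaign_id, offer_id,
     traffic_source_id, source_id, country, clicked_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, now());

-- 2. ClickHouse: the analytics row, ideally batched or written asynchronously
INSERT INTO clicks
    (system_click_id, tenant_id, campaign_id, traffic_source_id,
     source_id, country, clicked_at)
VALUES ('click-0001', '00000000-0000-0000-0000-000000000000',
        '00000000-0000-0000-0000-000000000000',
        '00000000-0000-0000-0000-000000000000', 'src-1', 'US', now());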

On Conversion

Conversion callback →
1. Lookup click metadata via OLTP
2. Insert conversion into ClickHouse
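
In SQL terms, a sketch of the conversion path (the conversions table name and columns are assumptions for illustration):

-- 1. PostgreSQL: single indexed read; $1 is the bound click ID
SELECT tenant_id, campaign_id, offer_id, traffic_source_id, source_id, country
FROM clicks_lookup
WHERE system_click_id = $1;

-- 2. ClickHouse: insert the conversion enriched with that metadata,
--    using the same batched/async write pattern as the click insert
INSERT INTO conversions
    (system_click_id, tenant_id, campaign_id, country, converted_at)
VALUES ('click-0001', '00000000-0000-0000-0000-000000000000',
        '00000000-0000-0000-0000-000000000000', 'US', now());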

Key rule:

Conversion processing must never query ClickHouse.


Why This Works

  • OLTP handles high-QPS point lookups efficiently
  • ClickHouse focuses purely on analytics
  • CPU usage drops immediately
  • System becomes horizontally scalable

This is not an optimization—it is correct system design.


Common Anti-Patterns to Avoid

❌ Using ClickHouse for:

  • WHERE unique_id = ?
  • LIMIT 1 lookups
  • High-frequency API queries
  • OLTP-style joins

✅ Use ClickHouse for:

  • Time-range queries
  • Aggregations
  • Rollups
  • Reporting and analytics
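
By contrast, this is the shape of query ClickHouse is built for; a representative example against the clicks table above (the date range and grouping are illustrative):

SELECT
    campaign_id,
    count() AS clicks,
    uniq(source_id) AS distinct_sources
FROM clicks
WHERE tenant_id = toUUID('00000000-0000-0000-0000-000000000000')
  AND clicked_at >= now() - INTERVAL 7 DAY
GROUP BY campaign_id
ORDER BY clicks DESC;

This hits the sorting key prefix (tenant_id), prunes partitions by date, and aggregates millions of rows in one pass, which is exactly the workload the storage engine is optimized for.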

Takeaway

ClickHouse is not slow.
It is precise about what it is good at.

Most ClickHouse “performance problems” are actually architecture problems.

If your system starts showing unexplained CPU spikes:

  • Check what you are querying
  • Not just how fast it runs

Designing for the right access pattern will always outperform any tuning knob.


Final Thought

This incident marked a transition point:

from “it works”
to “it scales”

And that transition almost always requires separating OLTP and OLAP responsibilities.

P.S.

This article is quite subjective; if it is not to your taste, feel free to close it.
If you found it helpful, you can subscribe to this site via RSS.
