ClickHouse, StarRocks, and Apache Doris are the three dominant open-source real-time columnar OLAP engines as of 2026. They share the broad goal of sub-second analytical queries over massive datasets, but their lineages and governance structures shape their design philosophies in meaningful ways.

ClickHouse originated at Yandex in 2016 as a purpose-built engine for Yandex.Metrica (web analytics at the scale of billions of events). It is governed by ClickHouse Inc., the commercial entity that reports 2,000+ paying customers as of early 2025. The engine’s DNA is single-table scan performance and raw throughput — it was designed for wide denormalized tables, not complex joins.

Apache Doris began life as Palo at Baidu, donated to the Apache Software Foundation in 2018. It adopted a classic FE/BE architecture from the start and focused on ease of use for small-to-medium teams running interactive dashboards. In 2020, a subset of Doris contributors forked the project to create StarRocks (originally called DorisDB). StarRocks claims approximately 90% of the original codebase has been rewritten since the fork. It moved to the Linux Foundation in 2023 under Apache 2.0 licensing, with CelerData as its primary commercial backer. The two projects have not exchanged code since the fork, and their architectures have diverged substantially — particularly in the optimizer, execution engine, and storage-compute separation story.

Important

Despite the shared ancestor, treating StarRocks and Doris as “the same system” is a mistake. The optimizer alone (Cascades CBO vs. Nereids) was rebuilt independently in each project, and the storage layer differences (delete-vector primary keys in StarRocks vs. merge-on-write in Doris) have real consequences for update-heavy workloads.

All three are Apache 2.0 licensed. ClickHouse has the broadest Western adoption; Doris and StarRocks have exceptionally strong traction in China/APAC, with increasing Western penetration — Pinterest’s migration to StarRocks in 2025 being a notable example.


Architecture and Cluster Design

Topology

ClickHouse uses a homogeneous node model: every node stores data and executes queries. There is no dedicated coordinator — any node can receive a query and fan it out via Distributed tables, which are virtual layers over local MergeTree tables sharded across the cluster. Sharding is manual: you define a sharding key (commonly rand() or cityHash64(user_id)) in the distributed table DDL, and inserts are routed accordingly. This gives operators fine-grained control but requires explicit cluster topology management.

StarRocks and Doris both use a FE/BE split. FE (Frontend) nodes handle SQL parsing, query planning, metadata management, and scheduling. BE (Backend) nodes handle storage and execution. FE nodes are stateful (they persist metadata) and elect a leader among themselves. BE nodes are the MPP workers.

StarRocks extends this with a shared-data mode (introduced around v3.0), where compute nodes (CNs) are stateless and data lives in object storage (S3, HDFS, GCS). CNs maintain a local hot cache for frequently accessed segments. This makes elastic scaling straightforward — spin up CNs during peak, tear them down after — at the cost of cold-read latency on cache misses. The shared-nothing mode remains available for latency-sensitive workloads. Doris introduced a similar storage-compute separation architecture in recent versions, also targeting S3-backed deployments.

ClickHouse has no native open-source equivalent. ClickHouse Cloud achieves separation through a proprietary SharedMergeTree engine not available in the self-hosted OSS build.

Coordination and Metadata

The three systems make fundamentally different choices about where metadata lives and how it is coordinated, which has significant operational implications.

ClickHouse delegates coordination to an external service. Historically this was Apache ZooKeeper; the modern replacement is ClickHouse Keeper, a C++ reimplementation using the Raft consensus algorithm (replacing ZAB). Keeper is wire-compatible with ZooKeeper’s client protocol but offers better compression, lower memory usage, and faster recovery. It can run embedded inside the ClickHouse process or as a standalone 3-node cluster:

<keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id>
    <raft_configuration>
        <server><id>1</id><hostname>node1</hostname><port>9234</port></server>
        <server><id>2</id><hostname>node2</hostname><port>9234</port></server>
        <server><id>3</id><hostname>node3</hostname><port>9234</port></server>
    </raft_configuration>
</keeper_server>

Keeper stores the replication log (which parts exist on which replicas), distributed DDL task queues, and merge coordination metadata. ClickHouse nodes themselves are stateless with respect to coordination — all shared state lives in Keeper.

StarRocks and Doris take the opposite approach: they embed the metadata store inside the FE process using BDB-JE (Berkeley DB Java Edition). BDB-JE is a pure-Java embedded key-value store — no separate server process, no network hop for metadata access. It runs inside the FE’s JVM and stores table schemas, partition maps, tablet-to-BE assignments, load job states, and user privileges.

Internally, BDB-JE uses a log-structured storage engine with an in-memory B+Tree index. All writes are appended to sequential .jdb log files on disk; B+Tree leaf nodes point into these logs. A background cleaner thread reclaims space from old log files once the live entries have been superseded — conceptually similar to LSM compaction but operating on a B+Tree rather than sorted runs. This is different from a traditional pages-on-disk B+Tree (like InnoDB or SQLite) where nodes are modified in place; BDB-JE’s log-structured approach makes writes always sequential at the cost of cleaner overhead.

For high availability, BDB-JE runs a replication group across the FE nodes using a Raft-like protocol (BDB-JE’s replication predates the Raft paper but was later aligned with its semantics). The leader FE accepts all metadata writes, appends them to its log, and streams entries to follower FEs. A write is acknowledged once a majority quorum has persisted the log entry. If the leader fails, followers run leader election — the node with the most complete log wins. This is why you deploy 3 or 5 FE nodes (odd count for clean majority). Observer FEs can be added for read scaling without participating in the quorum.

Tip

If you know etcd (the Raft-based KV store in Kubernetes), BDB-JE plays an analogous role: a small, replicated, strongly-consistent metadata store. The difference is etcd is a standalone service, while BDB-JE is an in-process library. The trade-off is clear — embedding the metadata store eliminates an operational dependency (no Keeper/ZK cluster to manage), but makes the FE process heavier and stateful.

Replication and Consistency

All distributed systems have replication lag. The meaningful question is: what consistency guarantee does the client get at write acknowledgment time, and can the read path observe partially-replicated state?

ClickHouse

Replication operates at the table engine level via ReplicatedMergeTree, using asynchronous multi-master semantics. The write path works as follows:

  1. Client sends INSERT to any replica
  2. That replica sorts the data, writes it to disk as an immutable part
  3. The replica writes a log entry to Keeper: “part X created on replica A”
  4. INSERT returns success to the client

At this point, only one replica holds the data. Other replicas asynchronously poll the Keeper replication log, discover the new entry, and fetch the part from the origin replica (or from each other). Until a given replica has fetched the part, reads against that replica will not see the new data.

Conflicts cannot arise because parts are immutable and uniquely named — two concurrent inserts to different replicas simply produce two different parts, and all replicas eventually accumulate both. Background merges are also coordinated through Keeper to avoid duplicate merge work across replicas.

The consequence is that read consistency depends on which replica you query. ClickHouse provides a select_sequential_consistency setting that forces a query to wait until the local replica has caught up to the latest Keeper log position, but this adds latency proportional to the replication lag.

StarRocks

Replication operates at the tablet level — each tablet (a horizontal slice of a table partition) has N replicas across BEs. Writes are always coordinated by the FE leader, which acts as a transaction coordinator. When a load job completes, the FE waits for replica acknowledgment before publishing a version number that makes the data visible to queries.

The write_quorum table property controls how many replicas must confirm before the write is committed:

  • ONE: committed after a single replica confirms (fastest writes, weakest durability)
  • MAJORITY: committed after ⌈N/2⌉ + 1 replicas confirm (the default)
  • ALL: committed after every replica confirms (strongest durability, highest write latency)

Because reads go through the FE’s version system, a query only sees data at a published version. If the FE published version V, it means at least write_quorum replicas have that version. The remaining replicas may still be catching up, but the read path is not affected — the FE routes reads to replicas that have the required version.

Apache Doris

Doris uses a stricter protocol: a two-phase commit across all replicas. In the prepare phase, all BE replicas flush their data. In the commit phase, the FE assigns a publish version only after all replicas have confirmed. This means at the moment data becomes visible, every replica has it — the strongest write-time consistency of the three, at the cost of write latency being bounded by the slowest replica.

Important

The real architectural distinction is not “who has replication lag” (everyone does) but where the consistency boundary sits. ClickHouse commits after 1 replica + Keeper log entry, and reads can see stale state if you hit a lagging replica. StarRocks commits after a configurable quorum and serves reads only at published versions. Doris commits after all replicas and guarantees every replica is current at publish time. These are different points on the consistency-latency spectrum, not different levels of “correctness.”


Storage Engine Internals

ClickHouse MergeTree

The MergeTree family is the core of ClickHouse’s storage. Data is not buffered in a MemTable — inserts are sorted by the sorting key (ORDER BY clause) and written directly to disk as an immutable part. Each part is a directory containing one file per column plus index files. Parts are self-contained: you can literally cp a part directory to another node’s data path and attach it.

Parts are organized into granules, the smallest read unit. A granule contains up to 8,192 rows by default (index_granularity). The sparse primary index stores the sorting key value at the start of each granule. Because it indexes granule boundaries rather than individual rows, the entire index for a part with hundreds of millions of rows might be only a few megabytes — trivially fitting in memory.

The write path is therefore: sort batch → write part to disk → fsync → done. There is no WAL. Background merge threads continuously select parts within the same partition and merge them into larger parts. Merges are non-blocking: reads continue to see the old parts until the merge completes and the new part atomically replaces them.

This design has a critical implication: ClickHouse does not have an LSM tree. There is no MemTable, no L0/L1/L2 level hierarchy, and no write-ahead log. The merge process superficially resembles LSM compaction, but the absence of in-memory buffering and level-based organization means the write amplification profile is different. Small frequent inserts (row-at-a-time) create many tiny parts that stress the merge scheduler — this is why ClickHouse strongly recommends batched inserts of at least 1,000–10,000 rows.

StarRocks and Doris: LSM-Inspired Columnar Storage

Both StarRocks and Doris use a storage architecture that is genuinely LSM-inspired, with a clear MemTable → flush → compaction lifecycle.

Incoming data lands in an in-memory MemTable (default flush threshold: 128 MB or time-based). The MemTable provides write buffering and sorting, which means StarRocks/Doris tolerate small, frequent writes far better than ClickHouse. For unique/primary key tables the MemTable also performs primary key deduplication on flush, which is central to their real-time update story.

When the MemTable flushes, it produces a Segment — a self-contained columnar file containing:

  • Data Pages: 64 KB blocks of encoded, compressed column data
  • Ordinal Index: maps row ordinals to page offsets for positional access
  • Zone-Map Index: per-page and per-segment min/max/null-count metadata
  • Short-Key (Prefix) Index: sparse binary-search table of the first 36 bytes of the sort key, stored every ~1,024 rows

Multiple segments form a Rowset, which represents a versioned snapshot of ingested data. Rowsets are immutable once committed. This versioning enables MVCC-like read consistency — a query sees all rowsets whose version is ≤ its read version.

Compaction

All three systems rely on background compaction/merge to consolidate small files, reclaim space from deletes, and improve read performance.

ClickHouse’s merge process selects parts within the same partition using a size-tiered heuristic. Specialized MergeTree variants execute logic during merges — AggregatingMergeTree applies aggregate functions, ReplacingMergeTree deduplicates by sorting key, CollapsingMergeTree cancels paired insert/delete rows. Semantic merge logic pushed into the storage layer is an unusual and powerful design: the aggregation or deduplication is not a query-time operation but a background storage maintenance task.

StarRocks and Doris both implement a two-tier compaction model:

  • Cumulative Compaction: merges recent small rowsets into medium-sized ones, triggered frequently
  • Base Compaction: merges all rowsets into a single large rowset, triggered infrequently, reclaiming deleted/obsolete data

Doris adds two specialized strategies. Time-Series Compaction merges temporally adjacent segments with the invariant that each file participates in compaction only once, reducing write amplification for append-mostly workloads. Segment Compaction runs during data loading itself, compacting segments within the same load job to prevent OLAP_ERR_TOO_MANY_SEGMENTS errors during large bulk loads.

Warning

Compaction falling behind ingest rate is a common operational problem across all three systems. StarRocks exposes a compaction score per partition (visible in FE metrics) — a rising score means queries must merge more rowsets at read time, degrading scan performance. Monitoring this metric is essential.

Indexing

Primary and Sparse Indexes

All three systems sort data on disk by a sort key and build a sparse index over it.

ClickHouse’s sparse primary index stores the sort key value at each granule boundary (every 8,192 rows). A query predicate on sort key columns performs binary search on this index to identify the granule range to read. The primary key does not enforce uniqueness — it only controls which columns appear in the sparse index. The sorting key (which may include additional trailing columns beyond the primary key) determines physical row ordering.

StarRocks/Doris use a Short-Key Index storing a prefix (first 36 bytes) of the sort key every ~1,024 rows — denser than ClickHouse’s granule-level index, providing finer-grained pruning for prefix-match lookups. Both also build Zone-Map Indexes (per-page min/max) automatically for every column, enabling predicate pushdown on arbitrary columns without requiring the column to be part of the sort key.

Secondary Indexes

ClickHouse supports data skipping indexes that attach metadata to groups of granules:

CREATE TABLE events (
    ts DateTime,
    user_id UInt64,
    event_type LowCardinality(String),
    payload String,
    INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4,
    INDEX idx_event event_type TYPE set(100) GRANULARITY 4,
    INDEX idx_payload payload TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 1
) ENGINE = MergeTree()
ORDER BY (ts, user_id);

GRANULARITY 4 means the index aggregates metadata over 4 consecutive granules (4 × 8,192 = 32,768 rows). Available types include minmax, set (distinct values up to a limit), bloom_filter, tokenbf_v1 (tokenized bloom filter for text search), and ngrambf_v1. These are strictly skip indexes: they can only exclude blocks, never locate specific rows.

Doris provides Bloom Filter Indexes, Inverted Indexes (full-text search capable, based on CLucene), and the automatic zone-map indexes already described. StarRocks offers bloom filters and bitmap indexes for low-cardinality columns. Doris’s inverted index support is notably more mature than either competitor for mixed OLAP + text search workloads.

Compression and Encoding

ClickHouse has the most expressive per-column codec pipeline. You chain encoding transforms before a final compression codec:

CREATE TABLE metrics (
    ts DateTime64(3) CODEC(DoubleDelta, ZSTD(3)),
    device_id UInt32 CODEC(Delta(4), LZ4),
    temperature Float32 CODEC(Gorilla, LZ4),
    status LowCardinality(String) CODEC(ZSTD(1))
) ENGINE = MergeTree()
ORDER BY (device_id, ts);

The encoding layer exploits data structure before generic compression sees it:

  • Delta: stores differences between consecutive values — effective for monotonically increasing columns (timestamps, auto-increment IDs)
  • DoubleDelta: stores differences of differences — optimal for timestamps with near-constant intervals (e.g., 1-second sensor readings where deltas are almost always 1000ms, making double-deltas near zero and highly compressible)
  • Gorilla: XOR-based encoding from Facebook’s Gorilla paper — consecutive similar floats produce XOR values with many leading zeros, which compress extremely well
  • T64: truncates unused high bits from integer columns where the actual value range is narrow relative to the type width

The compression layer applies LZ4 (default, ~3–4 GB/s decode throughput, optimized for decompression speed) or ZSTD (configurable level 1–22, higher compression ratio, slower decode). Chaining domain-specific encoding before generic compression is what unlocks extreme ratios — a DoubleDelta + ZSTD timestamp column can achieve 50–100× compression on regular time-series.

StarRocks and Doris support LZ4 and ZSTD at the table level ("compression" = "zstd" table property). They apply dictionary encoding and run-length encoding internally within segments for low-cardinality columns, but do not expose a per-column codec pipeline. Doris’s ZSTD support (added in v1.1.0) showed 53% better compression ratio over LZ4 and 30% faster decompression from disk in their text-log benchmarks.

Tip

If your workload has diverse column types (timestamps, floats, strings, low-cardinality categoricals), ClickHouse’s per-column codec chaining can deliver meaningfully better compression — which directly translates to less I/O and faster scans. For homogeneous schemas, the difference matters less.


Query Execution

Vectorized Engine and SIMD

All three engines use vectorized, columnar execution: operators process batches of values from a single column rather than row-at-a-time, exploiting CPU cache locality and enabling SIMD.

ClickHouse processes data in blocks of up to 65,536 rows. The execution pipeline is a DAG of processors with typed input/output ports. ClickHouse uses runtime SIMD dispatch — at startup it detects CPU capabilities (SSE4.2, AVX2, AVX-512) and selects the optimal kernel. It also employs JIT compilation via LLVM for expression evaluation: compound WHERE clauses and aggregate function combinations are compiled into a single fused native function at query time, eliminating interpreter overhead. Neither StarRocks nor Doris has JIT — for queries with complex computed expressions (e.g., WHERE hash(a) % 1000 < 50 AND sqrt(x*x + y*y) < radius), ClickHouse generates one machine-code function while StarRocks/Doris interpret each node in the expression tree separately (albeit in vectorized batches).

StarRocks rebuilt its execution engine in C++ with explicit SIMD intrinsics targeting AVX2 and AVX-512. StarRocks 4.0 (2025) introduced parallelization improvements that reduced CPU consumption for complex multi-table joins and COUNT DISTINCT by ~60%.

Doris transitioned to a fully vectorized C++ engine in v2.0, reporting 3–5× over the previous version. Its 2025 roadmap includes ARM architecture tuning and adaptive concurrency.

Query Optimization

This is where the three systems diverge most sharply.

ClickHouse: Rule-Based Optimizer

ClickHouse uses a Rule-Based Optimizer (RBO) as of March 2026. It applies a fixed set of transformations: predicate pushdown, constant folding, projection pruning, PREWHERE optimization (moving filter predicates before full column reads). There is no cost model and no statistics collection — the optimizer does not estimate cardinalities. It cannot automatically choose between a hash join and a sort-merge join, cannot reorder a 10-way join, and cannot optimize CTE reuse. ByteHouse (ByteDance’s internal ClickHouse fork) built a CBO on top and reported 6× speedups, underscoring what the open-source engine leaves on the table.

In practice, ClickHouse users manually optimize complex queries: denormalize aggressively, use dictionaries for dimension lookups, and pre-aggregate with materialized views — moving optimization effort from the planner to the schema designer.

StarRocks: Cascades-Based CBO

StarRocks has a Cost-Based Optimizer built on the Cascades framework (same theoretical foundation as SQL Server’s optimizer and Orca in Greenplum). It collects column-level statistics (NDV, histograms, min/max, null fraction) and estimates cardinalities at every plan node. Join reordering strategies adapt to query complexity:

  • ≤ 4 joins: exhaustive enumeration of all orderings
  • 4–10 joins: left-deep tree search with dynamic programming
  • > 10 joins: greedy algorithm with DP fallback

The CBO also handles CTE reuse, subquery decorrelation, and partition/bucket pruning. The cost model accounts for CPU, memory, network, and disk I/O.

Doris: Nereids Optimizer

Doris replaced its original optimizer with Nereids in v2.0 — also Cascades-based, supporting all 99 TPC-DS queries. It collects statistics and performs join reordering, predicate pushdown, and materialized view rewriting. The Nereids introduction was the primary driver of Doris’s reported 10× v1.x → v2.0 performance improvement — most of that gain came from better plan selection, not engine changes.

Important

For workloads with multi-table joins (ad-hoc analytics, star/snowflake schemas), the CBO in StarRocks and Doris is a decisive advantage. If queries are predominantly single-table scans or pre-denormalized wide tables, ClickHouse’s RBO is not a bottleneck.


Data Mutation Semantics

OLAP engines historically treated data as append-only. All three have evolved toward mutable data, but the mechanisms differ.

ClickHouse originally supported only mutationsALTER TABLE ... UPDATE/DELETE that rewrite entire parts in the background, essentially batch jobs. In 2025, ClickHouse introduced lightweight deletes and lightweight updates. Lightweight deletes set a hidden _row_exists mask column to 0; the row is excluded from reads immediately but physically removed during the next merge. Lightweight updates create a small patch part with only the changed columns; the patch is applied at query time and materialized during merges.

DELETE FROM events WHERE user_id = 12345;
UPDATE events SET status = 'processed' WHERE event_id = 99999;

Physical cleanup depends on merge timing, and concurrent updates have configurable consistency semantics via update_sequential_consistency. The row mask and patch parts add read-time overhead until they are merged into the base parts.

StarRocks’s Primary Key table uses a Delete+Insert strategy with a DelVector — a per-segment bitmap of deleted row positions. On upsert, the old row is marked in the DelVector and the new row is appended. This is merge-on-write: deduplication happens at write time in the MemTable, so reads never encounter duplicates. StarRocks reports 3–10× query performance improvement over a merge-on-read approach.

CREATE TABLE orders (
    order_id BIGINT,
    status STRING,
    amount DECIMAL(10,2),
    updated_at DATETIME
) PRIMARY KEY (order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 16;

Doris’s Unique Key Model with Merge-on-Write (default since v2.0) works similarly — writes merge at ingest time, maintaining one physical row per key. Doris additionally supports partial column updates, where you update a subset of columns without supplying the full row:

SET enable_unique_key_partial_update = true;
INSERT INTO orders (order_id, status) VALUES (12345, 'shipped');

Flexible partial updates (ALTER TABLE ... ENABLE FEATURE "UPDATE_FLEXIBLE_COLUMNS") allow each row in a batch to update a different column set — useful for CDC-style ingest where different source events touch different fields.


Materialized Views and Projections

ClickHouse offers two pre-computation mechanisms. Materialized views are triggered on insert: when data arrives in a source table, it is transformed and written to a hidden destination table. They act as streaming transforms — they see only new inserts, not updates or deletes, and cannot perform JOINs. A materialized view over an AggregatingMergeTree destination is ClickHouse’s primary pattern for real-time rollups.

Projections are a secondary physical layout stored within the same table, maintained atomically with the main data — always consistent, unlike materialized views which miss mutations. The optimizer automatically selects a projection when it matches a query’s sort/aggregation pattern. Projections cannot filter or join and increase storage proportionally.

ALTER TABLE events ADD PROJECTION events_by_type (
    SELECT event_type, toDate(ts) AS day, count(), sum(value)
    GROUP BY event_type, day
);
ALTER TABLE events MATERIALIZE PROJECTION events_by_type;

StarRocks has asynchronous materialized views supporting multi-table definitions (including joins) with multiple refresh strategies: scheduled (cron-based), manual, partition-level incremental (refreshing only partitions whose base data changed, via Partition Change Tracking), and IVM (Incremental View Maintenance) for append-only tables where only the delta is applied. The optimizer performs automatic query rewriting — transparently substituting a materialized view when it can satisfy a query.

Doris supports synchronous single-table materialized views (similar to ClickHouse projections — atomic, always consistent) and asynchronous multi-table materialized views with scheduled refresh. The Nereids optimizer can also rewrite queries to use them.


Concurrency Model

Apache Doris is designed for high concurrency and reports 30,000+ QPS per node for point queries. It achieves this by short-circuiting the analytical query path: a row storage format (stored alongside columnar data for primary key tables) avoids column reassembly for point lookups, a short-circuit execution plan bypasses full query planning for simple key lookups, prepared statement support amortizes SQL parsing, and a row cache accelerates hot keys.

StarRocks targets 10,000+ QPS for point queries using hybrid row-column storage (since v3.2.3) on primary key tables, combined with configurable query queues for admission control.

ClickHouse is architecturally optimized for fewer, heavier queries — wide scans processing billions of rows. The default max_concurrent_queries is 100; pushing higher causes contention on the merge scheduler and memory allocator. High-QPS patterns are addressed by adding read replicas behind a load balancer — horizontal scaling for concurrency rather than per-node efficiency.


Lakehouse Integration

All three engines can query data in open table formats (Iceberg, Hudi, Delta Lake, Paimon) without ingestion.

StarRocks offers a Unified Catalog (since v3.2) that treats Hive, Iceberg, Hudi, Delta Lake, Paimon, and Kudu as first-class external sources. You can also write back to Iceberg/Hive tables and use async materialized views to incrementally ETL lakehouse data into native tables for acceleration:

CREATE EXTERNAL CATALOG lakehouse_catalog
PROPERTIES (
    "type" = "unified",
    "unified.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://metastore:9083"
);
 
SET CATALOG lakehouse_catalog;
SELECT * FROM iceberg_db.events WHERE dt = '2026-03-01';

Doris provides similar multi-catalog support with a 2025–2026 roadmap emphasizing catalog-level metadata caching and cross-catalog query optimization.

ClickHouse supports Iceberg, Delta Lake, and Hudi via table functions and table engines, as well as direct Parquet/ORC reads from S3. The integration is functional but less tightly optimized — ClickHouse doesn’t perform predicate pushdown into Iceberg metadata as aggressively as StarRocks, and there is no unified catalog abstraction for browsing external databases.


Comparative Summary

DimensionClickHouseStarRocksApache Doris
Storage engineMergeTree (direct-to-disk parts, no MemTable)LSM-inspired (MemTable → Segment → Rowset)LSM-inspired (MemTable → Segment → Rowset)
OptimizerRule-based (RBO)Cost-based (Cascades CBO)Cost-based (Nereids / Cascades)
Multi-table joinsWeak (no CBO, manual optimization)Strong (CBO + distributed shuffle)Strong (Nereids + distributed shuffle)
Single-table scanExcellentExcellentExcellent
CompressionPer-column codec chainingTable-level LZ4/ZSTDTable-level LZ4/ZSTD
Write consistency1 replica + Keeper logConfigurable quorum (ONE/MAJORITY/ALL)All replicas (2PC)
Real-time updatesLightweight delete/update (2025), deferred physical cleanupPrimary Key table with DelVectorUnique Key with Merge-on-Write, partial column updates
Concurrency~100 concurrent queries (scale via replicas)~10,000 QPS (hybrid row-column)~30,000 QPS (row store, short-circuit plan)
Materialized viewsInsert-triggered (no joins); projections for consistencyAsync multi-table, partition-incremental, IVM, auto-rewriteSync single-table + async multi-table
LakehouseTable functions/enginesUnified Catalog with write-backMulti-catalog
JIT compilationYes (LLVM)NoNo
CoordinationExternal (ClickHouse Keeper / ZK)Embedded (BDB-JE in FE, Raft)Embedded (BDB-JE in FE, Raft)

Decision Framework

Vendor-published benchmarks are unreliable in this space — Doris publishes 30× over ClickHouse on TPC-H; StarRocks publishes 1.87× over ClickHouse on SSB; ClickHouse dominates on single-table scans. Each benchmark is tuned to the engine’s strengths. The right engine depends on workload shape.

ClickHouse fits best when the workload is dominated by wide denormalized tables, time-series with diverse column types benefiting from per-column codecs, and the team has expertise to manually optimize schemas and queries. Its ecosystem (Grafana, Kafka engine, client libraries) is the most mature in the Western market. It is the worst choice for complex ad-hoc joins or high-concurrency dashboard serving.

StarRocks fits best when you need real-time updates (primary key upserts), complex multi-table joins, and lakehouse federation from a single system. The CBO makes it forgiving of non-expert query writing. Shared-data mode suits cloud-native elastic deployments. Its materialized view story (partition-incremental, IVM, auto-rewrite) is the most complete.

Apache Doris fits best for high-concurrency dashboard serving (30K QPS point queries), teams wanting the simplest operational experience, and use cases requiring partial column updates from CDC sources. Its inverted index support makes it the strongest for mixed OLAP + text search. ASF governance and 1,600+ contributors make it the most community-governed option.


Pinterest Context

Pinterest migrated real-time advertising analytics from Apache Druid to StarRocks in 2025. The migration was driven by Druid’s limitations: high query latency on multi-dimensional dashboards, no SQL join/subquery support, complex ingestion pipelines requiring external denormalization, and rising compute costs.

Post-migration results: p90 latency halved, infrastructure cost reduced ~68% (3× cost-performance improvement), and data freshness reached 10 seconds via native Kafka ingestion without external Flink/Spark pipelines. The key enablers were StarRocks’s SQL support for joins and subqueries (eliminating pre-denormalization), built-in Kafka ingestion, and the CBO’s ability to optimize complex multi-dimensional aggregation queries powering advertiser dashboards. Pinterest presented these results at the StarRocks Summit in September 2025.