VibeCrypto: Crypto Watch

Helius · June 8, 5:00 PM · 22 days ago

From ClickHouse to RocksDB: How Solana's Archival Layer Was Rebuilt

Moving 300 TB of Solana archival data from ClickHouse to RocksDB: a counterintuitive bet that cut compressed storage by roughly 40% and reduced p95 query latency from 350 ms to 30 ms.

Helius, a Solana infrastructure provider, explains why it migrated more than 300 TB of archival data from ClickHouse to RocksDB. The stakes are high: the archival layer stores Solana's entire history (more than 500 billion transactions) and targets end-to-end query times under 50 ms. Existing solutions such as Google BigTable and Old Faithful fell short on cost, control, and performance.

The migration to RocksDB reduced the compressed storage footprint from ~330 TB to ~190 TB and cut p95 latency for getTransactionsForAddress calls by more than a factor of ten (from 350 ms to 30 ms). According to Helius, the change opens the way to further work on Solana's read layer that would not have been possible with ClickHouse at the required performance and scale.
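As a quick sanity check on the figures quoted above, the relative gains can be worked out directly. This is only a minimal sketch: the numbers are the approximate values reported by Helius, and the helper names `percent_reduction` and `speedup` are illustrative, not part of any Helius tooling.

```python
# Approximate figures as reported in the Helius article.
storage_before_tb = 330
storage_after_tb = 190
p95_before_ms = 350
p95_after_ms = 30


def percent_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


storage_cut = percent_reduction(storage_before_tb, storage_after_tb)
latency_factor = speedup(p95_before_ms, p95_after_ms)

print(f"Storage reduction: {storage_cut:.1f}%")       # 42.4%
print(f"p95 latency speedup: {latency_factor:.1f}x")  # 11.7x
```

In other words, the storage footprint shrank by a bit more than 40% (not quite half), while p95 latency improved by nearly 12x.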

Details

Source: Helius
Published: June 8 at 5:00 PM
Direct link: https://www.helius.dev/blog/migrating-from-clickhouse-to-rocksdb

Source content (raw)

When we tell people that we’ve moved over 300 terabytes of Solana archival data off of ClickHouse and onto RocksDB, the reaction is almost always the same: Why would you ever do that?

ClickHouse is the obvious choice for petabyte-scale analytical workloads, whereas RocksDB is not. ClickHouse is trusted by some of the world's largest data consumers. For example, <a href="https://blog.cloudflare.com/clickhouse-query-plan-contention/" rel="noopener noreferrer" target="_blank">Cloudflare</a> runs it across more than a thousand replicas to handle hundreds of millions of inserts per second. <a href="https://clickhouse.com/blog/how-anthropic-is-using-clickhouse-to-scale-observability-for-ai-era" rel="noopener noreferrer" target="_blank">Anthropic</a> runs a custom, air-gapped ClickHouse deployment to power Claude's observability. <a href="https://www.uber.com/us/en/blog/logging/" rel="noopener noreferrer" target="_blank">Uber</a>, <a href="https://innovation.ebayinc.com/stories/ou-online-analytical-processing/" rel="noopener noreferrer" target="_blank">eBay</a>, and <a href="https://clickhouse.com/blog/nyc-meetup-report-large-scale-financial-market-analytics-with-clickhouse" rel="noopener noreferrer" target="_blank">Bloomberg</a> have also been using it in production for years.

<a href="http://rocksdb.org/" rel="noopener noreferrer" target="_blank">RocksDB</a>, on the other hand, is an embedded key-value store with sparse documentation, primarily used as the engine within other databases (e.g., CockroachDB, TiKV, MyRocks), rather than as the direct foundation for a customer-facing historical data service.

However, the move cut our compressed storage footprint from ~330TB to ~190TB, fixed the long tail of our slowest queries (e.g., <a href="https://x.com/nick_pennie/status/2045213000596852861?s=20" rel="noopener noreferrer" target="_blank">p95 latency for <code>getTransactionsForAddress</code> calls dropped from 350ms to 30ms</a>), and put us in a position to innovate on 
Solana’s read layer, a position that wouldn’t have been possible with ClickHouse at the performance and scale demanded at Helius.

This article explains why we migrated from ClickHouse, what we learned about the workload along the way, and how we’re using RocksDB in production at scale.

<h2>What is Solana Archival and Why It Matters</h2>

At its core, Solana is a network of nodes that communicate to agree on new information without having to trust one another. A node is a computer on Solana that runs a client (e.g., Agave, <a href="https://www.helius.dev/blog/what-is-firedancer" rel="noopener noreferrer" target="_blank">Firedancer</a>) that follows a specific set of rules to help the network agree on new information. A <a href="https://www.helius.dev/validator" rel="noopener noreferrer" target="_blank">validator</a> is a Solana node that secures the network by producing blocks (i.e., appending groups of transactions to Solana’s ledger) and voting on the validity of other blocks.

<a href="https://www.helius.dev/solana-rpc-nodes" rel="noopener noreferrer" target="_blank">RPC nodes</a> are nodes that do not participate in block production or voting. Instead, they observe the network and track all the new information it produces. RPCs allow users to query the network for specific data using the <a href="https://www.helius.dev/docs/api-reference/rpc/http-methods" rel="noopener noreferrer" target="_blank">JSON-RPC specification</a>. These nodes, however, do not keep this data forever. To stay within Solana’s hardware requirements, older blocks, transactions, and account states are pruned so that nodes retain only the most recent view of Solana’s history. This becomes a problem when you need to look up a transaction signature from a year ago, pull every signature that ever touched a given wallet across its lifetime, or scan a program’s full execution history.

Archival broadly refers to the entire layer that stores Solana’s data since genesis. 
It logs and indexes everything the chain produces, including every block, transaction, account interaction, Cross Program Invocation (CPI), and transaction log, and keeps it queryable beyond the standard ~2-day (i.e., 1 <a href="https://www.helius.dev/blog/solana-slots-blocks-and-epochs#defining-epochs-in-solana" rel="noopener noreferrer" target="_blank">epoch</a>) short-term storage window. Archival is what makes historical queries possible.

The default approach on Solana to storing archival data has been Google BigTable. Anza maintains a BigTable instance that RPC providers can reach on demand to serve historical queries. It works, but BigTable is expensive, egress costs make running your own copy painful, and there’s almost no engineering work that can be done on your end to make it faster. You’re at the mercy of Google’s storage, with all of its cost structure and none of the control.

Old Faithful is a collection of tooling maintained by Triton One that can produce Content Addressable Archives (CARs) from ledger RocksDB archives and serve them via Solana’s standard RPC and gRPC interfaces. It is a meaningful step toward decentralizing Solana’s archive layer, providing a valuable source of redundant, verifiable history. However, its trade-offs in developer experience and performance (e.g., getting started requires custom tooling rather than interfaces teams can build against, and it’s optimized for durable archival rather than performance and scale) mean it is not a practical alternative for latency-sensitive workloads.

The core problem with archival is that you have N petabytes of raw data. With over 500 billion transactions, ~1.3 trillion-row account-to-transaction indexes, and random access patterns, how are sub-10ms lookups achieved? What do sub-50ms end-to-end queries look like? 
Both Google BigTable and Old Faithful fail to offer a solution to this.

Moreover, if you wanted to build custom filtering and sorting options, or improve the developer experience by offering anything richer than the standard JSON-RPC methods, you’d need your own archival index. So, <a href="https://www.helius.dev/blog/introducing-gettransactionsforaddress" rel="noopener noreferrer" target="_blank">we built one</a>.

<h2>ClickHouse: The Pragmatic First Bet</h2>

<a href="https://clickhouse.com/" rel="noopener noreferrer" target="_blank">ClickHouse</a> was the most pragmatic place to start building out our new archival system. It was the fastest path to shipping a product that was already better than one built on BigTable.

ClickHouse is a mature, columnar database with great tooling and documentation. It compresses time-series data well, which is especially useful for Solana, since its block data is fundamentally time-ordered. It’s straightforward to create a database, write SQL against it, iterate on schemas, and optimize for time-to-market, so customers have something in front of them relatively quickly.

ClickHouse didn’t serve traffic all on its own. Rather, it was the bottom