From Liquidation Storms to Cloud Outages: A Crisis Moment for Crypto Infrastructure
On the 20th, an AWS issue at Amazon caused Coinbase and dozens of other major crypto platforms including Robinhood, Infura, Base, and Solana to go down.
Original Title: Crypto Infrastructure is Far From Perfect
Original Author: YQ, Crypto KOL
Original Translation: AididiaoJP, Foresight News
Amazon Web Services has once again experienced a major outage, severely impacting crypto infrastructure. AWS issues in the US East 1 region (Northern Virginia data center) caused outages for Coinbase and dozens of other major crypto platforms, including Robinhood, Infura, Base, and Solana.
AWS has acknowledged an "increased error rate" affecting Amazon DynamoDB and EC2, which are core database and computing services relied upon by thousands of companies. This outage provides an immediate and vivid validation of the central argument of this article: the crypto infrastructure's reliance on centralized cloud service providers creates systemic vulnerabilities that repeatedly manifest under stress.
The timing is particularly instructive. Just ten days after a $19.3 billion liquidation cascade exposed infrastructure failures at the trading platform level, today’s AWS outage demonstrates that the problem extends beyond a single platform to the foundational cloud infrastructure layer. When AWS fails, the cascading effects simultaneously impact centralized exchanges, "decentralized" platforms with centralized dependencies, and countless other services.
This is not an isolated incident, but a pattern. The following analysis documents similar AWS outages in April 2025, December 2021, and March 2017, each time resulting in major crypto service outages. The question is not whether the next infrastructure failure will occur, but when and what will trigger it.
October 10-11, 2025 Liquidation Cascade: A Case Study
The liquidation cascade of October 10-11, 2025 provides an instructive case study of infrastructure failure patterns. At 20:00 UTC, a major geopolitical announcement triggered a market-wide sell-off. Within one hour, $6 billion in liquidations occurred. By the time Asian markets opened, $19.3 billion in leveraged positions had evaporated from 1.6 million trader accounts.
Figure 1: Timeline of the October 2025 Liquidation Cascade
This interactive timeline chart shows the hourly progression of liquidations. In the first hour alone, $6 billion evaporated, and a further $4.2 billion followed in the second hour as exchange APIs began to throttle. The visualization shows:
· 20:00-21:00: Initial shock - $6 billion liquidated (red area)
· 21:00-22:00: Cascade peak - $4.2 billion, with APIs beginning to throttle
· 22:00-04:00: Ongoing deterioration - $9.1 billion liquidated in illiquid markets
· Key turning points: API rate limiting, market makers withdrawing, thinning order books
The scale is at least an order of magnitude larger than any previous crypto market event. Historical comparisons show the step-function nature of this event:
Figure 2: Comparison of Historical Liquidation Events
The bar chart dramatically illustrates the prominence of the October 2025 event:
· March 2020 (COVID): $1.2 billion
· May 2021 (Crash): $1.6 billion
· November 2022 (FTX): $1.6 billion
· October 2025: $19.3 billion, about 12 times the previous record of $1.6 billion
But liquidation numbers tell only part of the story. The more interesting question concerns the mechanism: how did an external market event trigger this specific failure pattern? The answer reveals systemic weaknesses in centralized exchange infrastructure and blockchain protocol design.
Off-chain Failures: Centralized Exchange Architecture
Infrastructure Overload and Rate Limiting
Exchange APIs implement rate limits to prevent abuse and manage server load. During normal operations, these limits allow legitimate trading while blocking potential attacks. During extreme volatility, when thousands of traders simultaneously try to adjust positions, these same rate limits become bottlenecks.
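How a limit that is invisible in normal trading becomes the bottleneck under a burst can be sketched with a token-bucket limiter. This is a minimal illustration only: the 100 requests per second rate, the burst capacity, and the names `TokenBucket` and `accepted_fraction` are assumptions, not any exchange's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token-bucket limiter: `rate` tokens refill per second, up to `capacity`."""
    rate: float        # tokens added per second (hypothetical value)
    capacity: float    # maximum burst size (hypothetical value)
    tokens: float = 0.0
    last: float = 0.0  # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def accepted_fraction(requests_per_second: int, seconds: int = 10) -> float:
    """Send evenly spaced requests and return the share that was accepted."""
    bucket = TokenBucket(rate=100.0, capacity=100.0, tokens=100.0)
    total = requests_per_second * seconds
    accepted = sum(bucket.allow(i / requests_per_second) for i in range(total))
    return accepted / total
```

With these assumed numbers, a client sending 50 requests per second never notices the limiter, while a burst of 10,000 requests per second gets only about 1% of its requests through.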
CEXs throttle their public liquidation feeds to one reported order per second, even when they are processing thousands of liquidation orders per second. During the October cascade, this created opacity. Users could not determine the real-time severity of the cascade. Third-party monitoring tools showed hundreds of liquidations per minute, while official data sources showed far fewer.
API rate limits prevented traders from modifying positions during the critical first hour: connection requests timed out and order submissions failed. Stop-loss orders failed to execute, and position queries returned outdated data. This infrastructure bottleneck turned a market event into an operational crisis.
Traditional exchanges configure infrastructure for normal load plus a safety margin. But normal load is very different from stress load; average daily volume does not predict peak stress demand well. During the cascade, trading volume surged 100x or more, and position data queries increased 1,000x as every user checked their account simultaneously.
Figure 4.5: AWS Outage Impacting Crypto Services
Auto-scaling cloud infrastructure helps, but cannot respond instantly; spinning up additional database read replicas takes minutes. Creating new API gateway instances takes minutes. In those minutes, margin systems continue marking position values based on corrupted price data from overloaded order books.
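The cost of those minutes can be made concrete with a back-of-the-envelope queue model. The sketch below assumes constant arrival and service rates and a fixed scale-up delay; the function name and every number in the usage note are hypothetical.

```python
def backlog_during_scale_up(
    arrival_rps: float,
    base_capacity_rps: float,
    scaled_capacity_rps: float,
    scale_up_delay_s: int,
    horizon_s: int,
) -> list[float]:
    """Return the queued-request backlog at the end of each second."""
    backlog = 0.0
    history = []
    for second in range(horizon_s):
        # New capacity only becomes available once the scale-up delay has passed.
        capacity = base_capacity_rps if second < scale_up_delay_s else scaled_capacity_rps
        backlog = max(0.0, backlog + arrival_rps - capacity)
        history.append(backlog)
    return history
```

For example, with 100,000 requests per second arriving against 1,000 of base capacity and a three-minute scale-up, `backlog_during_scale_up(100_000, 1_000, 150_000, 180, 300)` ends the delay with about 17.8 million queued requests, and the backlog is still far from cleared two minutes after the new capacity arrives.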
Oracle Manipulation and Pricing Vulnerabilities
During the October cascade, a key design choice in margin systems became apparent: some exchanges calculate collateral value based on internal spot market prices rather than external oracle data feeds. Under normal market conditions, arbitrageurs maintain price consistency across venues. But under infrastructure stress, this coupling breaks down.
Figure 3: Oracle Manipulation Flowchart
This interactive flowchart visualizes a five-stage attack vector:
· Initial sell-off: $60 million in selling pressure on USDe
· Price manipulation: USDe crashes from $1.00 to $0.65 on a single exchange
· Oracle failure: Margin system uses corrupted internal price data feed
· Cascade trigger: Collateral is marked down, forced liquidations begin
· Amplification: $19.3 billion in total liquidations (322x amplification)
This attack exploited Binance’s use of spot market prices for wrapped synthetic collateral. When the attacker dumped $60 million of USDe into a relatively thin order book, the spot price crashed from $1.00 to $0.65. Margin systems configured to mark collateral at spot price marked down all USDe-collateralized positions by 35%. This triggered margin calls and forced liquidations for thousands of accounts.
These liquidations forced more sell orders into the same illiquid market, further depressing prices. Margin systems observed these lower prices and marked down more positions, creating a feedback loop that amplified $60 million in selling pressure into $19.3 billion in forced liquidations.
Figure 4: Liquidation Cascade Feedback Loop
This feedback loop chart illustrates the self-reinforcing nature of the cascade:
Price drops → triggers liquidation → forced selling → further price drop → [cycle repeats]
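The loop can be reproduced in a toy model. The sketch below assumes a linear price-impact rule (selling `depth` of notional moves the price by 1%) and a hand-picked set of positions. It is an illustration of the feedback mechanism, not a reconstruction of the October order books.

```python
def simulate_cascade(
    price: float,
    initial_sell: float,
    positions: list[tuple[float, float]],
    depth: float,
) -> tuple[float, float]:
    """Simulate forced selling into a thin book.

    positions: (liquidation_price, notional) pairs; a position is force-sold
               once the market price is at or below its liquidation price.
    depth:     notional needed to move the price by 1% (linear-impact assumption).
    Returns (final_price, total_liquidated_notional).
    """
    def apply_sale(p: float, notional: float) -> float:
        return p * (1.0 - 0.01 * notional / depth)

    price = apply_sale(price, initial_sell)
    remaining = sorted(positions, reverse=True)  # highest liquidation price first
    liquidated = 0.0
    while remaining and remaining[0][0] >= price:
        _, notional = remaining.pop(0)
        liquidated += notional
        price = apply_sale(price, notional)  # the forced sale pushes the price lower
    return price, liquidated
```

In a thin book, each liquidation moves the price far enough to trigger the next one. With a deep book (large `depth`), the same initial sale triggers nothing.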
This mechanism would not work if a properly designed oracle system were used. If Binance used a time-weighted average price (TWAP) across multiple exchanges, instantaneous price manipulation would not affect collateral valuation. If they used aggregated price feeds from Chainlink or other multi-source oracles, the attack would fail.
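A minimal sketch of the two defenses named above, time-weighting and multi-venue aggregation, might look like the following. The function names, the sampling scheme, and the choice of a median are assumptions for illustration and do not describe Chainlink's or any exchange's actual implementation.

```python
from statistics import median


def twap(samples: list[tuple[float, float]]) -> float:
    """Time-weighted average price over (timestamp, price) samples.

    Each price is weighted by how long it stayed in effect, i.e. until the
    next sample; the final sample only marks the end of the window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    weighted = 0.0
    for (t0, p0), (t1, _) in zip(samples, samples[1:]):
        weighted += p0 * (t1 - t0)
    return weighted / (samples[-1][0] - samples[0][0])


def robust_mark_price(venue_prices: dict[str, float]) -> float:
    """Median across venues, so one dislocated venue cannot set the mark."""
    return median(venue_prices.values())
```

In this sketch, a five-second print at $0.65 inside a sixty-second window moves the TWAP only to about $0.97, and a single venue quoting $0.65 does not move the median at all while the other venues stay near $1.00.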
The wBETH incident four days earlier demonstrated a similar vulnerability. wBETH should maintain a 1:1 redemption ratio with ETH. During the cascade, liquidity dried up and the wBETH/ETH spot market showed a 20% discount. Margin systems marked down wBETH collateral accordingly, triggering liquidations of positions that were in fact fully collateralized by underlying ETH.
Auto-Deleveraging (ADL) Mechanism
When liquidations cannot be executed at current market prices, exchanges implement auto-deleveraging (ADL), passing losses onto profitable traders. ADL forcibly closes profitable positions at current prices to cover the gap left by liquidated positions.
During the October cascade, Binance executed ADL on multiple trading pairs. Traders holding profitable long positions found their trades forcibly closed, not due to their own risk management failures, but because other traders’ positions became insolvent.
ADL reflects a fundamental architectural choice in centralized derivatives trading. Exchanges guarantee they will not lose money. This means losses must be borne by one or more of the following:
· Insurance fund (exchange reserves to cover liquidation shortfalls)
· ADL (forcibly closing profitable traders’ positions)
· Socialized losses (spreading losses across all users)
The size of the insurance fund relative to open interest determines the frequency of ADL. Binance’s insurance fund totaled about $2 billion in October 2025. Relative to $4 billion in open interest on BTC, ETH, and BNB perpetual contracts, this provided 50% coverage. But during the October cascade, total open interest across all pairs exceeded $20 billion. The insurance fund could not cover the gap.
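The coverage arithmetic, and one possible ordering of the three loss absorbers, can be sketched as follows. The fund-first, ADL-second, socialized-last order and the `adl_capacity` parameter are assumptions for illustration; the article only says losses fall on one or more of the three.

```python
def coverage_ratio(insurance_fund: float, open_interest: float) -> float:
    """Insurance fund as a fraction of open interest."""
    return insurance_fund / open_interest


def allocate_shortfall(
    shortfall: float, insurance_fund: float, adl_capacity: float
) -> dict[str, float]:
    """Absorb a liquidation shortfall: fund first, then ADL, then socialized loss."""
    from_fund = min(shortfall, insurance_fund)
    from_adl = min(shortfall - from_fund, adl_capacity)
    socialized = shortfall - from_fund - from_adl
    return {"insurance_fund": from_fund, "adl": from_adl, "socialized": socialized}
```

Using the figures above, `coverage_ratio(2e9, 4e9)` gives 50%, while the same $2 billion fund against $20 billion in open interest gives 10%.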
After the October cascade, Binance announced that as long as total open interest remains below $4 billion, they guarantee no ADL on BTC, ETH, and BNB USDⓈ-M contracts. This creates an incentive structure: exchanges can maintain larger insurance funds to avoid ADL, but this ties up capital that could otherwise be profitably deployed.
On-chain Failures: Blockchain Protocol Limitations
Figure 5: Major Network Outages - Duration Analysis
The bar chart compares downtime across different events:
· Solana (February 2024): 5 hours - voting throughput bottleneck
· Polygon (March 2024): 11 hours - validator version mismatch
· Optimism (June 2024): 2.5 hours - sequencer overload (airdrop)
· Solana (September 2024): 4.5 hours - transaction spam attack
· Arbitrum (December 2024): 1.5 hours - RPC provider failure
Solana: Consensus Bottlenecks
Solana experienced multiple outages during 2024-2025. The February 2024 outage lasted about 5 hours, and the September 2024 outage lasted about 4.5 hours. These outages stemmed from similar root causes: the network could not handle transaction volume during spam attacks or extreme activity.
Figure 5 details: Solana’s outages (5 hours in February, 4.5 hours in September) highlight recurring issues with network resilience under stress.
Solana’s architecture is optimized for throughput. Under ideal conditions, the network processes 3,000-5,000 transactions per second with sub-second finality. This performance is orders of magnitude higher than Ethereum. But during stress events, this optimization creates vulnerabilities.
The September 2024 outage resulted from a flood of spam transactions that overwhelmed validator voting. Solana validators must vote on blocks to reach consensus. Under normal operations, validators prioritize voting transactions to ensure consensus progress. But the protocol previously treated voting transactions the same as regular transactions in the fee market.
When the mempool filled with millions of spam transactions, validators struggled to propagate voting transactions. Without enough votes, blocks could not be finalized. Without finalized blocks, the chain halted. Users with pending transactions saw them stuck in the mempool. New transactions could not be submitted.
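The starvation effect can be shown with a toy block packer. This is not Solana's scheduler: the `pack_block` function, the fee values, and the idea of a fixed number of reserved vote slots are assumptions used only to contrast a single shared fee auction with one that sets capacity aside for votes.

```python
def pack_block(
    txs: list[tuple[str, int]], capacity: int, reserved_for_votes: int = 0
) -> list[str]:
    """Pick transactions for one block and return their kinds.

    txs: (kind, fee) pairs where kind is "vote" or "user".
    reserved_for_votes: slots filled with votes before the fee auction runs.
    """
    limit = min(reserved_for_votes, capacity)
    vote_idx = [i for i, (kind, _) in enumerate(txs) if kind == "vote"][:limit]
    reserved = set(vote_idx)
    # Remaining slots go to the highest bidders, votes and user traffic alike.
    others = sorted(
        (i for i in range(len(txs)) if i not in reserved),
        key=lambda i: txs[i][1],
        reverse=True,
    )
    chosen = vote_idx + others[: capacity - len(vote_idx)]
    return [txs[i][0] for i in chosen]
```

With ten low-fee votes competing against a thousand higher-fee spam transactions for 100 slots, the shared auction includes no votes at all, while reserving ten slots lets every vote through.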
StatusGator recorded multiple Solana service outages in 2024-2025, while Solana never officially acknowledged them. This created information asymmetry. Users could not distinguish between local connectivity issues and network-wide problems. Third-party monitoring services provided accountability, but platforms should maintain comprehensive status pages.
Ethereum: Gas Fee Explosion
Ethereum experienced extreme gas fee spikes during the 2021 DeFi boom, with simple transfers costing over $100. Complex smart contract interactions cost $500-1,000. These fees made the network unusable for smaller transactions and enabled a different attack vector: MEV extraction.
Figure 7: Transaction Costs During Network Stress
This line chart dramatically shows gas fee escalation across networks during stress events:
· Ethereum: $5 (normal) → $450 (peak congestion) - 90x increase
· Arbitrum: $0.50 → $15 - 30x increase
· Optimism: $0.30 → $12 - 40x increase
The visualization shows that even Layer 2 solutions experienced significant gas fee escalation, though from a much lower starting point.
Maximum Extractable Value (MEV) describes the profit validators can extract by reordering, including, or excluding transactions. In high gas fee environments, MEV becomes especially lucrative. Arbitrageurs race to front-run large DEX trades, and liquidation bots race to liquidate undercollateralized positions first. This competition manifests as gas fee bidding wars.
Users wanting to ensure their transactions are included during congestion must outbid MEV bots. This creates scenarios where transaction fees exceed transaction value. Want to claim your $100 airdrop? Pay $150 in gas fees. Need to add collateral to avoid liquidation? Compete with bots paying $500 in priority fees.
Ethereum’s gas limit caps total computation per block. During congestion, users bid for scarce block space. The fee market works as designed: higher bidders get priority. But this design makes the network increasingly expensive during periods of high usage—precisely when users most need access.
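The bidding dynamic can be sketched as a greedy block builder that fills a fixed gas budget with the highest bids first. This simplification ignores Ethereum's actual base-fee and priority-fee split. The function name, transaction labels, and numbers are hypothetical.

```python
def build_block(pending: list[tuple[str, int, float]], gas_limit: int) -> list[str]:
    """Greedy block builder: highest fee-per-gas bids first, until gas runs out.

    pending: (tx_id, gas_used, fee_per_gas) triples.
    """
    included, gas_left = [], gas_limit
    for tx_id, gas, _ in sorted(pending, key=lambda tx: tx[2], reverse=True):
        if gas <= gas_left:
            included.append(tx_id)
            gas_left -= gas
    return included
```

When the block is nearly full, a low-bid airdrop claim is simply left out, while a liquidation bot's high bid and a user's urgent collateral top-up take the available space.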
Layer 2 solutions attempt to address this by moving computation off-chain while inheriting Ethereum’s security through periodic settlement. Optimism, Arbitrum, and other rollups process thousands of transactions off-chain, then submit compressed proofs to Ethereum. This architecture successfully lowers per-transaction costs during normal operations.
Layer 2: Sequencer Bottlenecks
But Layer 2 solutions introduce new bottlenecks. Optimism experienced an outage in June 2024 when 250,000 addresses simultaneously claimed an airdrop. The sequencer, which orders transactions before submitting them to Ethereum, was overwhelmed, and users were unable to submit transactions for several hours.
This outage shows that moving computation off-chain does not eliminate infrastructure demands. Sequencers must process incoming transactions, order them, execute them, and generate fraud or ZK proofs for Ethereum settlement. Under extreme traffic, sequencers face the same scaling challenges as standalone blockchains.
Multiple RPC providers must remain available. If the main provider fails, users should seamlessly fail over to alternatives. During the Optimism outage, some RPC providers remained functional while others failed. Users whose wallets defaulted to failed providers could not interact with the chain, even though the chain itself remained online.
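A wallet-side failover loop of the kind described here could be as small as the sketch below. Providers are modeled as plain callables so the example stays self-contained; the function name and the catch-all error handling are assumptions, and a real client would add timeouts and health checks.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")
```

The design choice that matters is that the fallback list is configured in advance, so a user whose default provider is down still reaches the chain without changing any settings.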
AWS outages have repeatedly demonstrated the centralized infrastructure risks in the crypto ecosystem:
· October 20, 2025 (today): US East 1 region outage impacts Coinbase, as well as Venmo, Robinhood, and Chime. AWS acknowledges increased error rates for DynamoDB and EC2 services.
· April 2025: Regional outage simultaneously impacts Binance, KuCoin, and MEXC. Multiple major exchanges became unavailable when their AWS-hosted components failed.
· December 2021: US East 1 region outage caused Coinbase, Binance.US, and "decentralized" exchange dYdX to go down for 8-9 hours, also impacting Amazon’s own warehouses and major streaming services.
· March 2017: S3 outage prevented users from logging into Coinbase and GDAX for five hours, alongside widespread internet outages.
The pattern is clear: these exchanges host critical components on AWS infrastructure. When AWS experiences a regional outage, multiple major exchanges and services become unavailable simultaneously. Users cannot access funds, execute trades, or modify positions during outages—precisely when market volatility may require immediate action.
Polygon: Consensus Version Mismatch
Polygon (formerly Matic) experienced an 11-hour outage in March 2024. The root cause involved validator version mismatch: some validators ran old software versions, while others ran upgraded versions. These versions computed state transitions differently.
Figure 5 details: Polygon’s outage (11 hours) was the longest among the major events analyzed, highlighting the severity of consensus failures.
When validators reach different conclusions about the correct state, consensus fails and the chain cannot produce new blocks because validators cannot agree on block validity. This creates a deadlock: validators running old software reject blocks produced by upgraded validators, and vice versa.
Resolution requires coordinating validator upgrades, but coordinating validator upgrades during an outage takes time. Each validator operator must be contacted, the correct software version must be deployed, and their validator must be restarted. In a decentralized network with hundreds of independent validators, this coordination can take hours or days.
Hard forks typically use block height triggers. All validators upgrade before a specific block height to ensure simultaneous activation, but this requires advance coordination. Incremental upgrades—where validators gradually adopt new versions—risk exactly the version mismatches that caused the Polygon outage.
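The difference between the two upgrade styles can be illustrated with a height-gated rule switch. In this sketch, `None` models a validator still running old software. The version labels and the fork height used in the example are hypothetical.

```python
from typing import Optional


def rules_version(block_height: int, activation_height: Optional[int]) -> str:
    """Rule set a node applies at a height; None models a node that never upgraded."""
    if activation_height is not None and block_height >= activation_height:
        return "v2"
    return "v1"


def nodes_agree(block_height: int, node_configs: list[Optional[int]]) -> bool:
    """True if every node applies the same rule set at this height."""
    return len({rules_version(block_height, h) for h in node_configs}) == 1
```

Before the fork height, upgraded and non-upgraded nodes still agree. At the fork height, a single straggler is enough to split the validator set, which is why the upgrade has to be coordinated ahead of time.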
Architectural Trade-offs
Figure 6: Blockchain Trilemma - Decentralization vs. Performance
This scatter plot visualizes different systems mapped to two key dimensions:
· Bitcoin: high decentralization, low performance
· Ethereum: high decentralization, medium performance
· Solana: medium decentralization, high performance
· Binance (CEX): minimal decentralization, maximum performance
· Arbitrum/Optimism: medium-high decentralization, medium performance
Key insight: no system achieves both maximum decentralization and maximum performance; each design makes thoughtful trade-offs for different use cases.
Centralized exchanges achieve low latency through architectural simplicity: matching engines process orders in microseconds, and state resides in centralized databases. No consensus protocol overhead is introduced, but this simplicity creates single points of failure, and cascading failures propagate through tightly coupled systems under stress.
Decentralized protocols distribute state across validators, eliminating single points of failure. High-throughput chains maintain this property during outages (no loss of funds, only temporary loss of liveness). But reaching consensus among distributed validators introduces computational overhead; validators must agree before state transitions are finalized. When validators run incompatible versions or face overwhelming traffic, the consensus process may temporarily halt.
Adding replicas increases fault tolerance but also coordination costs. In Byzantine fault-tolerant systems, each additional validator increases communication overhead. High-throughput architectures minimize this overhead through optimized validator communication, achieving superior performance but exposing themselves to certain attack patterns. Security-focused architectures prioritize validator diversity and consensus robustness, limiting base layer throughput while maximizing resilience.
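The trade-off can be quantified under a simple assumption: a classic PBFT-style protocol in which every validator sends a message to every other validator in each voting round. Many production protocols reduce this cost with aggregation or gossip, so the numbers below are an upper-bound illustration rather than a description of any specific chain.

```python
def max_byzantine_faults(n: int) -> int:
    """Classic BFT bound: n validators tolerate f faults when n >= 3f + 1."""
    return (n - 1) // 3


def all_to_all_messages(n: int) -> int:
    """Messages in one round where every validator sends to every other one."""
    return n * (n - 1)
```

Going from 100 to 1,000 validators raises the tolerated faults from 33 to 333, a tenfold gain, while the per-round message count grows from 9,900 to 999,000, roughly a hundredfold increase.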
Layer 2 solutions attempt to provide both properties through layered design. They inherit Ethereum’s security properties via L1 settlement while providing high throughput through off-chain computation. However, they introduce new bottlenecks at the sequencer and RPC layers, showing that architectural complexity solves some problems while creating new failure modes.
Scalability Remains the Fundamental Problem
These events reveal a consistent pattern: systems are provisioned for normal load, then catastrophically fail under stress. Solana handles routine traffic effectively but collapses when transaction volume increases by 10,000%. Ethereum gas fees remain reasonable until DeFi adoption triggers congestion. Optimism’s infrastructure works well until 250,000 addresses claim an airdrop simultaneously. Binance’s APIs function during normal trading but are throttled during liquidation cascades.
The October 2025 event demonstrated this dynamic at the exchange level. During normal operations, Binance’s API rate limits and database connections are sufficient, but during a liquidation cascade, when every trader tries to adjust positions simultaneously, these limits become bottlenecks. Margin systems designed to protect exchanges through forced liquidations amplify crises by creating forced sellers at the worst possible moment.
Auto-scaling provides insufficient protection against step-function increases in load. Spinning up additional servers takes minutes; in those minutes, margin systems mark position values based on corrupted price data from thin order books, and by the time new capacity comes online, the cascade has already propagated.
Over-provisioning for rare stress events is costly during normal operations. Exchange operators optimize for typical load, accepting occasional failures as an economically rational choice. The cost of downtime is externalized to users, who experience liquidations, stuck trades, or inaccessible funds during critical market moves.
Infrastructure Improvements
Figure 8: Distribution of Infrastructure Failure Modes (2024-2025)
The root cause pie chart breakdown shows:
· Infrastructure overload: 35% (most common)
· Network congestion: 20%
· Consensus failure: 18%
· Oracle manipulation: 12%
· Validator issues: 10%
· Smart contract vulnerabilities: 5%
Several architectural changes can reduce the frequency and severity of failures, though each involves trade-offs:
Separation of Pricing and Liquidation Systems
The October issue partly stemmed from coupling margin calculations with spot market prices. Using redemption rates instead of spot prices for wrapped assets could have avoided wBETH’s mispricing. More generally, critical risk management systems should not rely on potentially manipulable market data. Independent oracle systems with multi-source aggregation and TWAP calculations provide more robust price feeds.
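The difference between the two valuation rules for a wrapped asset is a one-line change, as the sketch below shows. The function name, its parameters, and the $4,000 ETH price in the example are assumptions for illustration.

```python
def collateral_value_usd(
    amount: float,
    spot_ratio: float,
    redemption_rate: float,
    underlying_usd_price: float,
    use_redemption_rate: bool,
) -> float:
    """Value wrapped collateral via its redemption rate or its spot ratio.

    spot_ratio:      market price of the wrapped token in units of the underlying.
    redemption_rate: units of the underlying each wrapped token redeems for.
    """
    rate = redemption_rate if use_redemption_rate else spot_ratio
    return amount * rate * underlying_usd_price
```

With 10 wrapped tokens, a 1:1 redemption rate, and a spot market showing a 20% discount, spot-based marking values the collateral at $32,000 while redemption-based marking keeps it at $40,000, so the position is never flagged for liquidation.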
Over-provisioned and Redundant Infrastructure
The April 2025 AWS outage affecting Binance, KuCoin, and MEXC demonstrated the risks of centralized infrastructure dependency. Running critical components across multiple cloud providers increases operational complexity and cost but greatly reduces the risk of correlated failures. Layer 2 networks can maintain multiple RPC providers with automatic failover. The extra expense may seem wasteful during normal operations but prevents hours of downtime during peak demand.
Enhanced Stress Testing and Capacity Planning
The pattern of systems running well until they fail under stress indicates insufficient testing under load. Simulating 100x normal load should be standard practice; identifying bottlenecks during development is less costly than discovering them during real outages. However, realistic load testing remains challenging. Production traffic exhibits patterns that synthetic tests cannot fully capture, and user behavior during real crashes differs from test scenarios.
The Road Ahead
Over-provisioning provides the most reliable solution but conflicts with economic incentives. Maintaining 10x excess capacity for rare events costs money every day to prevent a problem that occurs once a year. Until catastrophic failures impose enough cost to justify over-provisioning, systems will continue to fail under stress.
Regulatory pressure may force change. If regulations mandate 99.9% uptime or limit acceptable downtime, exchanges will need to over-provision. But regulation usually follows disasters, not prevents them. The collapse of Mt. Gox in 2014 led Japan to implement formal crypto exchange regulations. The October 2025 cascade is likely to trigger a similar regulatory response. Whether these responses specify outcomes (maximum acceptable downtime, maximum slippage during liquidations) or implementation (specific oracle providers, circuit breaker thresholds) remains uncertain.
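To make an uptime mandate concrete, the downtime budget it implies can be computed directly. The helper below is a simple arithmetic sketch that assumes a 365-day year, and its name is hypothetical.

```python
def allowed_downtime_hours(uptime_pct: float, period_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted per period at a given uptime percentage."""
    return period_hours * (1.0 - uptime_pct / 100.0)
```

A 99.9% requirement allows about 8.76 hours of downtime per year, so a single 11-hour outage would exceed the entire annual budget.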
The fundamental challenge is that these systems operate in global markets around the clock but rely on infrastructure designed for traditional business hours. When stress occurs at 02:00, teams scramble to deploy fixes while users face mounting losses. Traditional markets halt trading during stress; crypto markets simply crash. Whether this is a feature or a bug depends on one's perspective.
Blockchain systems have achieved remarkable technical sophistication in a short time. Maintaining distributed consensus across thousands of nodes is a true engineering achievement. But achieving reliability under stress requires moving beyond prototype architectures to production-grade infrastructure. This shift requires funding and prioritizing robustness over feature development speed.
The challenge is how to prioritize robustness over growth during bull markets, when everyone is making money and downtime seems like someone else’s problem. By the time the next cycle stress-tests the system, new weaknesses will emerge. Whether the industry learns from October 2025 or repeats the same patterns remains an open question. History suggests we will discover the next critical vulnerability through another multi-billion-dollar failure under stress.
Disclaimer: The content of this article solely reflects the author's opinion and does not represent the platform in any capacity. This article is not intended to serve as a reference for making investment decisions.