
Engineering at Extreme Scale: How Shopify Built a Petabyte Platform on Ruby

Most companies at scale rewrite their early technology choices. Shopify took the opposite path. During Black Friday 2025, they processed 90 petabytes of data, handled 14.8 trillion database queries, and peaked at 489 million requests per minute—all while running one of the largest Ruby on Rails monoliths in production.

The platform looks deceptively simple from the outside: Ruby on Rails, React, MySQL, and Kafka. But that simplicity hides thousands of deliberate architectural trade-offs, years of refactoring, and a systematic approach to reliability that most organizations never achieve. Shopify’s engineering demonstrates something counterintuitive: extreme scale doesn’t require abandoning proven tools. It requires investing in them intelligently.

Pragmatic Architecture Over Theoretical Perfection

The most striking pattern across Shopify’s infrastructure is their commitment to pragmatism. As ByteByteGo’s analysis of their tech stack reveals, while most startups eventually rewrite their early frameworks, Shopify “doubled down to help ensure Ruby and Rails are 100-year tools that will continue to merit being in their toolchain of choice.”

This commitment manifests in concrete investments. Rather than migrating away from Ruby, Shopify built YJIT, a Just-in-Time compiler written in Rust that dramatically improves runtime performance without changing developer ergonomics. They contributed heavily to Sorbet, a static type checker for Ruby, making it a first-class part of their stack. They even contribute to TruffleRuby, Oracle’s high-performance Ruby implementation.

The result is a 2.8-million-line Rails application with over 500,000 commits that processes hundreds of billions of requests during peak events. Tools like Packwerk enforce dependency boundaries between Rails Engines—mini-applications within the monolith that enable isolation and ownership without the operational complexity of microservices.

This philosophy extends to their data architecture. Instead of implementing theoretically optimal consistency models, Shopify chose practical ones. As detailed in their petabyte-scale MySQL management approach, they implemented monotonic read consistency rather than expensive tight consistency or complex causal consistency with global transaction identifiers.

Monotonic reads ensure that “successive reads follow a consistent timeline even if the data read is not real-time” by routing related requests to the same replica server using consistent hashing. The tradeoff? Intermittent inconsistencies during server outages. But this pragmatic solution has low overhead and handles Shopify’s replication lag challenges without sacrificing performance.
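The routing idea can be sketched with a small consistent-hash ring. This is a hypothetical illustration, not Shopify's implementation: the point is that a given shop always hashes to the same replica, so its successive reads never jump to a replica that is further behind in replication.

```ruby
require "digest"

# Minimal consistent-hash ring sketch (names and structure are assumptions).
# Pinning a shop's reads to one stable replica yields monotonic reads:
# repeated reads for the same shop land on the same server, so they never
# move backwards along the replication timeline.
class ReplicaRing
  def initialize(replicas, vnodes: 100)
    @ring = {}
    replicas.each do |replica|
      # Virtual nodes spread each replica around the ring for balance.
      vnodes.times { |i| @ring[hash_key("#{replica}-#{i}")] = replica }
    end
    @sorted_keys = @ring.keys.sort
  end

  # Pick the first virtual node clockwise from the shop's hash position.
  def replica_for(shop_id)
    h = hash_key(shop_id.to_s)
    key = @sorted_keys.bsearch { |k| k >= h } || @sorted_keys.first
    @ring[key]
  end

  private

  def hash_key(s)
    Digest::SHA1.hexdigest(s)[0, 8].to_i(16)
  end
end
```

When a replica drops out of the ring, only the shops that hashed to it get remapped—which is exactly the "intermittent inconsistencies during server outages" trade-off described above.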

Infrastructure Built for Isolation and Scale

At the heart of Shopify’s infrastructure lies a pod-based architecture that evolved from simple database sharding. Each pod is a fully isolated slice containing its own MySQL instance, Redis node, and Memcached cluster. Shops are distributed across pods using shop_id as a natural partition key—a design that leverages e-commerce’s natural tenant isolation.
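A toy router makes the pattern concrete. The class below is a hypothetical sketch, not Shopify's routing layer: resolving the pod once by shop_id resolves every datastore in that pod, and migrating a shop is just an update to the assignment table.

```ruby
# Each pod bundles its own isolated datastores (sketch, not production schema).
Pod = Struct.new(:id, :mysql_host, :redis_host, :memcached_host)

# Hypothetical routing-table sketch keyed by shop_id.
class PodRouter
  def initialize(pods)
    @pods = pods
    @assignments = {} # shop_id => pod index; updated when a shop migrates
  end

  def assign(shop_id, pod_index)
    @assignments[shop_id] = pod_index
  end

  def pod_for(shop_id)
    # Fall back to hashing for shops without an explicit assignment.
    index = @assignments.fetch(shop_id) { shop_id % @pods.size }
    @pods[index]
  end
end
```

Because every request carries a shop_id, a single lookup pins the request to one pod's MySQL, Redis, and Memcached—the tenant isolation that e-commerce gives almost for free.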

This cell-based architecture solves two fundamental problems. First, it removes single points of failure. As ByteByteGo notes, “an issue in one pod won’t cascade across the fleet.” Second, it enables horizontal scaling by adding more pods instead of vertically scaling databases—a pattern that becomes critical when managing over 100 isolated shards processing 45 million database queries per second at peak read load.

But static pod assignments create problems as merchants grow. When multiple large shops live on the same pod, database utilization becomes unbalanced. Some shards risk failure from over-utilization while others sit underutilized, wasting resources. The solution required zero-downtime shop migration—moving a shop's data between pods while customers continue placing orders.

Shopify built Ghostferry, an open-source tool written in Go, to handle this challenge through three phases. First, batch copying with binlog tailing: Ghostferry iterates over tables, selects rows by shop_id using SELECT ... FOR UPDATE to implement locking reads, and writes them to the target shard while simultaneously tailing MySQL's binlog to track ongoing changes.

Second, cutover: when the queue of pending binlog events becomes nearly real-time, Ghostferry stops writes on the source, records the final binlog coordinate, and processes remaining events until reaching that stopping point. Third, traffic switch: the routing table updates to point traffic at the new pod, followed by verification and cleanup.
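Ghostferry itself is written in Go; the Ruby sketch below only mirrors the shape of the three phases, with in-memory stand-ins for the source shard, the binlog queue, and the routing table.

```ruby
# Three-phase shop migration, heavily simplified (illustrative only).
class ShopMover
  attr_reader :target

  def initialize(source_rows, binlog)
    @source = source_rows.dup # { id => row } for one shop on the source shard
    @binlog = binlog          # queue of { id:, row: } change events
    @target = {}
  end

  # Phase 1: batch copy (locking reads per batch in the real tool) while the
  # binlog keeps accumulating concurrent writes.
  def batch_copy
    @source.each { |id, row| @target[id] = row }
  end

  # Phase 2: cutover — writes are stopped upstream, then remaining binlog
  # events are drained up to the recorded final coordinate.
  def cutover
    while (event = @binlog.shift)
      @target[event[:id]] = event[:row]
    end
  end

  # Phase 3: traffic switch — point the routing table at the new pod.
  def switch_traffic!(routing_table, shop_id, new_pod)
    routing_table[shop_id] = new_pod
  end
end
```

The real tool also verifies row checksums and cleans up the source copy, but the ordering—copy, drain, then flip routing—is what guarantees no order is lost mid-migration.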

The constraints were clear: no visible downtime, zero data loss, and reasonable throughput. Every technical requirement tied directly to business impact—a pattern that appears throughout Shopify’s architecture.

Messaging at Unprecedented Scale

While pods isolate data, Kafka connects the platform. Shopify’s Kafka infrastructure handles 66 million messages per second at peak, serving as the backbone for event distribution, ML inference pipelines, search indexing, and business workflows.

This messaging layer decouples producers from consumers, buffers high-volume traffic, and supports real-time pipelines. When downstream services crash, the event stream holds data until systems recover—a practical approach to resilience that avoids tight synchronous coupling.
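The buffering behavior comes from Kafka's log-and-offset model, sketched below with a plain in-memory stand-in (this is not Kafka's client API): producers append regardless of consumer health, and a consumer that crashes simply resumes from its last committed offset.

```ruby
# Stand-in for a durable event log with per-consumer-group offsets.
class EventLog
  def initialize
    @events = []
    @offsets = Hash.new(0) # consumer group => next offset to read
  end

  # Producers append unconditionally; they never block on consumers.
  def publish(event)
    @events << event
  end

  # Deliver everything since the group's last commit, then advance the offset.
  def consume(group)
    batch = @events[@offsets[group]..] || []
    @offsets[group] += batch.size
    batch
  end
end
```

Events published while a consumer is down are simply waiting at its offset when it comes back—the "holds data until systems recover" property, with no synchronous coupling between producer and consumer.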

For synchronous communication, Shopify uses a mix of REST, GraphQL, and gRPC. But GraphQL has become the preferred interface for public-facing clients, enabling precise data queries while reducing over-fetching. The Admin interface, built on Remix, functions as a stateless GraphQL client with strict separation: no business logic in the client, no shared state across views.

This discipline enforces consistency across platforms. Mobile apps (all running on React Native) and web admin screens speak the same language, reducing duplication and preventing platform drift. As one unified interface, GraphQL ensures that changes propagate consistently rather than creating divergent client implementations.

Systematic Preparation Over Last-Minute Heroics

Perhaps Shopify’s most distinctive characteristic isn’t their architecture but their approach to reliability. Their Black Friday Cyber Monday preparation reveals a systematic nine-month program that makes most pre-launch efforts look amateur.

Starting in March, three parallel tracks run simultaneously. Capacity Planning models traffic using historical data and merchant projections, submitted early to cloud providers so they can provision physical infrastructure. The Infrastructure Roadmap sequences architectural changes months before BFCM—critically, “Shopify never uses BFCM as a release deadline.”

Risk Assessments document failure scenarios through “What Could Go Wrong” exercises that feed into Game Days—chaos engineering exercises that intentionally simulate production failures at BFCM scale. These aren’t theoretical drills. Teams inject network faults, bust caches, and run cross-system disaster simulations on critical business paths: checkout, payment processing, order creation, and fulfillment.

All findings feed into the Resiliency Matrix, centralized documentation tracking vulnerabilities, incident response procedures, recovery time objectives, and on-call coverage across the entire platform. This matrix becomes the roadmap for system hardening.

But Game Days test components in isolation. Scale tests validate the entire platform working together. From April through October, Shopify runs five major scale tests progressively ramping from 2024’s baseline to 150% of expected load. Test four hit 146 million requests per minute and 80,000 checkouts per minute. The final test reached 200 million requests per minute at their p99 scenario.

These tests are so large that Shopify coordinates with YouTube because they impact shared Google Cloud infrastructure. And they don’t just validate capacity—they execute regional failovers, evacuating traffic from US and EU regions to validate that disaster recovery actually works at BFCM volumes.

The philosophy is captured perfectly: “preparation gets you ready, but operational excellence keeps you steady.” When BFCM 2024 processed 57.3 petabytes and 10.5 trillion database queries, then BFCM 2025 went even bigger at 90 petabytes and 14.8 trillion queries, it wasn’t luck. It was systematic validation that the entire platform could handle the load.

Operational Excellence Through Continuous Deployment

Shopify’s operational model rejects conventional wisdom. They don’t use staging environments. Instead, they rely on canary deployments, feature flags, and fast rollback mechanisms. If a feature misbehaves, it can be turned off without redeploying code.
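A minimal flag implementation shows why this works without a staging environment. This is a hypothetical sketch, not Shopify's internal tooling: a flag ramps to a percentage of shops for a canary, and flipping it off is a data change, not a deploy.

```ruby
require "zlib"

# Hypothetical feature-flag sketch with percentage rollouts and a kill switch.
class FeatureFlags
  def initialize
    @rollout = Hash.new(0) # flag name => percentage of shops enabled
  end

  def set_rollout(flag, percent)
    @rollout[flag] = percent
  end

  def disable!(flag)
    @rollout[flag] = 0 # instant kill switch; no redeploy needed
  end

  # Deterministic bucketing: a shop stays in or out as the ramp grows,
  # so a canary cohort is stable across requests.
  def enabled?(flag, shop_id)
    Zlib.crc32("#{flag}:#{shop_id}") % 100 < @rollout[flag]
  end
end
```

Hashing the flag name together with the shop_id keeps cohorts independent across flags—being in the 5% canary for one feature says nothing about another.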

The monolith contains over 400,000 unit tests. Running them serially would take days, so Buildkite orchestrates test runs across hundreds of parallel workers, keeping builds within 15-20 minutes. Once builds pass, deployments don’t go straight to production—they use throttling to make issues easier to trace and minimize blast radius when something breaks.
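The fan-out idea can be sketched with deterministic sharding—Buildkite's actual scheduling is more elaborate (it balances by historical timing, not just hashing), but hash-based bucketing shows the core property: shards are disjoint, cover every file, and stay stable between builds.

```ruby
require "zlib"

# Assign a test file to a worker bucket by content-independent hashing.
def shard_for(test_file, worker_count)
  Zlib.crc32(test_file) % worker_count
end

# Each parallel worker runs only the files that hash into its bucket.
def files_for_worker(files, worker_index, worker_count)
  files.select { |f| shard_for(f, worker_count) == worker_index }
end
```

Because the assignment depends only on the file name, adding a test file moves nothing else around, and every worker can compute its own shard without coordination.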

Incident response isn’t siloed into an ops team. Shopify uses lateral escalation: all engineers share responsibility for uptime, and escalation happens based on domain expertise, not job title. This encourages shared ownership and reduces handoff delays during critical outages.

For fault tolerance, two tools prove essential. Semian, a circuit breaker library for Ruby, protects core services like Redis and MySQL from cascading failures during degradation. Toxiproxy simulates bad network conditions—latency spikes, dropped packets, service flaps—in test environments to validate resilience assumptions before issues appear in production.
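The pattern Semian implements can be illustrated with a generic breaker—the sketch below is not Semian's actual API, just the mechanism: after a run of consecutive failures the circuit opens, and callers fail fast instead of piling load onto a degraded Redis or MySQL.

```ruby
# Generic circuit-breaker sketch (illustrative; not Semian's interface).
class CircuitBreaker
  class OpenCircuitError < StandardError; end

  def initialize(error_threshold: 3, error_timeout: 30)
    @error_threshold = error_threshold # consecutive failures before opening
    @error_timeout = error_timeout     # seconds the circuit stays open
    @failures = 0
    @opened_at = nil
  end

  # Run the protected call, failing fast while the circuit is open.
  def call
    raise OpenCircuitError, "failing fast" if open?
    begin
      result = yield
      @failures = 0 # a success resets the consecutive-failure count
      result
    rescue StandardError
      @failures += 1
      @opened_at = Time.now if @failures >= @error_threshold
      raise
    end
  end

  def open?
    !@opened_at.nil? && (Time.now - @opened_at) < @error_timeout
  end
end
```

Failing fast is the point: during a MySQL brownout, an open circuit returns errors in microseconds instead of tying up worker threads in timeouts, which is what turns a local degradation into a cascading outage.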

Many of these tools started as internal needs but became open-source contributions: Ghostferry, Toxiproxy, Semian, Bootsnap, Packwerk, Tapioca. This pattern reveals another insight: Shopify’s infrastructure investments strengthen the broader ecosystem while solving their own problems.

The Cloud-Native Database Strategy

Traditional database backup at petabyte scale is time-consuming. Shopify initially used Percona's XtraBackup utility to create files archived on Google Cloud Storage. The restore time for each shard exceeded six hours—an unacceptable Recovery Time Objective when merchants depend on the platform for revenue.

The solution leveraged cloud-native capabilities. Since MySQL servers run on Google Cloud’s VMs using Persistent Disk, Shopify switched to PD snapshots taken every 15 minutes via CronJobs running on Kubernetes. This approach reduced RTO from 6+ hours to under 30 minutes.
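A CronJob for this could look roughly like the sketch below. Disk names, zones, and labels are assumptions for illustration, not Shopify's actual manifest; the mechanism is simply scheduling gcloud's disk-snapshot command every 15 minutes.

```yaml
# Illustrative Kubernetes CronJob sketch: snapshot one shard's Persistent Disk
# every 15 minutes. Names and flags below are hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-pd-snapshot-shard-001
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: google/cloud-sdk:slim
              command:
                - /bin/sh
                - -c
                - >
                  gcloud compute disks snapshot mysql-data-shard-001
                  --zone=us-central1-a
                  --snapshot-names=mysql-shard-001-$(date +%s)
```

Because PD snapshots are incremental at the block level, a 15-minute cadence stays cheap, and restoring is just creating a fresh disk from the latest snapshot rather than replaying a multi-hour file restore.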

Retention policies keep costs down while maintaining necessary recovery points. Creating new Persistent Disks from snapshots enables database restoration in minutes—a dramatic improvement achieved by intelligently using cloud provider features rather than building custom solutions.

Machine Learning at Production Scale

Even Shopify’s ML infrastructure reflects their pragmatic philosophy. Their real-time semantic search doesn’t just rely on keyword matching—it uses text and image embeddings that process 2,500 embeddings per second, translating to 216 million per day.

Intelligent deduplication groups visually identical images to avoid unnecessary inference. This optimization alone reduced image embedding memory usage from 104GB to under 40GB, freeing GPU resources and cutting costs across the pipeline. Embeddings are stored in BigQuery for offline analytics without affecting live systems.
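The deduplication idea reduces to caching by content identity. In the sketch below the hash is a stand-in (production systems typically use perceptual or content fingerprints to catch near-duplicates): identical images share one embedding computation, so a product photo reused across thousands of listings costs a single inference.

```ruby
require "digest"

# Sketch of embedding deduplication (illustrative; hashing scheme is assumed).
class EmbeddingCache
  attr_reader :inferences

  def initialize(&embed)
    @embed = embed # the expensive model call in a real pipeline
    @cache = {}    # content hash => embedding
    @inferences = 0
  end

  def embedding_for(image_bytes)
    key = Digest::SHA256.hexdigest(image_bytes)
    @cache[key] ||= begin
      @inferences += 1 # only cache misses hit the model
      @embed.call(image_bytes)
    end
  end
end
```

Every cache hit is an inference the GPU never runs—the mechanism behind cutting embedding memory from 104GB to under 40GB while keeping the pipeline near real time.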

The data pipeline infrastructure balances latency, throughput, and cost through careful optimization. Embeddings generate fast enough for near-real-time updates, GPU memory is used efficiently, and redundant computation is avoided through smart caching and pre-filtering.

Key Lessons from Shopify’s Approach

Several patterns emerge from Shopify’s engineering that challenge conventional scaling wisdom.

Investment over replacement. Instead of chasing new frameworks, Shopify invests in making their chosen stack world-class. This reduces rewrites, preserves institutional knowledge, and creates ecosystem improvements benefiting the entire community.

Pragmatism over perfection. Monotonic reads instead of strict consistency. Monolith with engines instead of microservices. Canary deployments instead of staging environments. Each choice prioritizes practical operation over theoretical optimization.

Business-driven architecture. Every technical decision is justified through merchant impact. Unbalanced shards risk merchant downtime. Efficient infrastructure reduces costs. Zero-downtime migrations protect revenue. The engineering serves clear business outcomes.

Systematic reliability. Most companies load test once or twice and hope for the best. Shopify’s nine-month continuous testing cycle—finding breaking points, fixing issues, validating fixes—represents a fundamentally different approach to readiness.

Permanent improvements. Tools built for BFCM (Resiliency Matrix, Critical Journey Game Days, real-time forecasting) aren’t temporary scaffolding. They become permanent infrastructure improvements making Shopify more resilient every day, not just during peak season.

Continuous feedback loops. The three preparation tracks feed into each other. Risk findings reveal capacity gaps. Infrastructure changes introduce new risks requiring assessment. This prevents siloed preparation and ensures comprehensive readiness.

Conclusion

Shopify’s infrastructure demonstrates that extreme scale doesn’t require abandoning proven technologies or adopting the latest frameworks. It requires systematic investment in fundamentals, pragmatic architectural choices, and relentless preparation.

Their success with Ruby on Rails at petabyte scale challenges the assumption that monoliths can’t scale or that mature languages can’t handle modern workloads. By investing in YJIT, Sorbet, and Rails Engines, Shopify made their foundation competitive with any modern stack.

Their approach to reliability—nine months of preparation, five major scale tests, chaos engineering, and continuous feedback loops—reveals what separates organizations that merely survive peak events from those that thrive during them.

Perhaps most importantly, Shopify’s engineering philosophy shows that many “scaling problems” are actually preparation problems. The difference between 284 million requests per minute and 489 million requests per minute isn’t a rewrite. It’s systematic validation that your platform can handle the load before merchants depend on it.

For organizations facing their own scaling challenges, Shopify’s path offers a valuable alternative to perpetual rewrites and framework chasing: invest in your foundation, choose pragmatism over perfection, and prepare systematically for what’s coming.

Sources

This post synthesizes insights from:

All articles published by ByteByteGo in collaboration with Shopify’s engineering team.