Building at Billion-User Scale: Reddit’s Architectural Evolution
When a platform serves 1.2 billion monthly users processing 469 million posts annually, every architectural decision carries enormous weight. Reddit’s journey from a Lisp application rewritten in Python to a modern microservices architecture reveals a masterclass in managing technical evolution at scale. What makes Reddit’s approach particularly instructive isn’t just the technologies they chose, but how they orchestrated transitions without breaking the experience for hundreds of millions of users.
Three recent deep dives from the Reddit Engineering team illuminate different facets of this evolution: their overall architectural journey over two decades, their ML-powered notification system, and their migration of the critical Comments model from Python to Go. Together, these stories reveal a consistent philosophy about building and evolving systems at massive scale.
The Philosophy of Incremental Evolution
Reddit’s architectural story begins with a pragmatic rewrite. As Steve Huffman noted about the early Lisp implementation: “Since we’re building a site largely by standing on the shoulders of others, this made things a little tougher. There just aren’t as many shoulders on which to stand.” The move to Python in 2005 gave them access to a richer ecosystem, but it also created the r2 monolith—a massive Python application that would serve as Reddit’s backbone for years.
Rather than attempting a risky big-bang rewrite, Reddit embraced incremental extraction. Their GraphQL Federation strategy exemplifies this approach: new Go subgraphs gradually take over functionality from the Python monolith while both systems run in parallel. A load balancer sits between the gateway and subgraphs, controlling what percentage of traffic flows to the new services versus the legacy system.
This “blue/green subgraph deployment” pattern solves a critical problem: the GraphQL Federation specification doesn’t natively support gradual traffic migration. By running the Python monolith and Go subgraph side-by-side with shared schema ownership, Reddit can ramp up traffic incrementally, monitor error rates and latencies, and switch back instantly if issues arise.
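To make the pattern concrete, here is a minimal Go sketch of a percentage-based split between the two backends. The backend addresses and the `trafficSplit` helper are hypothetical, and Reddit's actual gateway and load balancer are far more sophisticated, but the core mechanism is the same: ramp a percentage, watch metrics, and roll back by setting it to zero.

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// trafficSplit sends a configurable percentage of requests to the new Go
// subgraph and the rest to the legacy Python monolith. Because both own
// the same schema, either backend can answer any request.
func trafficSplit(goPercent float64, goBackend, pyBackend http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rand.Float64()*100 < goPercent {
			goBackend.ServeHTTP(w, r) // ramped slice of traffic
			return
		}
		pyBackend.ServeHTTP(w, r) // everything else stays on the monolith
	})
}

func proxyTo(rawURL string) http.Handler {
	u, err := url.Parse(rawURL)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	// Hypothetical backend addresses. Start at 1%, ramp up as error rates
	// and latencies stay healthy, and set it back to 0 to roll back.
	handler := trafficSplit(1.0,
		proxyTo("http://go-subgraph.internal"),
		proxyTo("http://python-monolith.internal"))
	log.Fatal(http.ListenAndServe(":8080", handler))
}
```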
The same incremental philosophy appears in Reddit’s Comments migration to Go. Rather than switching all endpoints at once, the team migrated three distinct operations—create, update, and increment—one at a time. Each migration validated not just application logic, but cross-language compatibility, database access patterns, and downstream consumer behavior.
Testing Strategies for Zero-Downtime Migration
The most innovative aspect of Reddit’s evolution isn’t the destination technologies, but the testing strategies that made safe migration possible.
Tap Compare for Reads
For read operations, Reddit uses “tap compare”—a technique where a small percentage of traffic routes to the new Go service, which generates its response internally, then calls the old Python endpoint for comparison. The critical detail: users always receive the old endpoint’s response. This means bugs in the new service never affect user experience, while the team validates production behavior with real traffic.
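A minimal Go sketch of a tap-compare wrapper follows, assuming hypothetical request and response types. Mismatches are only logged for review; the legacy answer is always what the user receives.

```go
package comments

import (
	"context"
	"log"
	"reflect"
)

// Request and Response are stand-ins for the real comment read types.
type Request struct{ CommentID string }
type Response struct{ Body string }

type readFn func(ctx context.Context, req Request) (Response, error)

// tapCompare wraps a legacy read and a candidate read. The candidate runs
// against real traffic, but the caller always receives the legacy answer,
// so a bug in the new service never reaches users; divergence is only logged.
func tapCompare(legacy, candidate readFn) readFn {
	return func(ctx context.Context, req Request) (Response, error) {
		candResp, candErr := candidate(ctx, req) // exercise the new Go path first
		legacyResp, legacyErr := legacy(ctx, req)
		if candErr != nil || !reflect.DeepEqual(legacyResp, candResp) {
			log.Printf("tap-compare mismatch for %s: legacy=%+v candidate=%+v",
				req.CommentID, legacyResp, candResp)
		}
		return legacyResp, legacyErr // users only ever see the legacy result
	}
}
```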
Sister Datastores for Writes
Write operations present a harder challenge. When you post a comment on Reddit, the system writes to three datastores simultaneously: Postgres (permanent storage), Memcached (fast reads), and Redis (CDC events for downstream services). Reddit guarantees 100% delivery of these CDC events because features across the platform depend on them.
The breakthrough came through what the team calls “sister datastores”. They created completely isolated mirrors of their production infrastructure—separate Postgres, Memcached, and Redis instances that only the new Go service would touch.
Here’s how the dual-write flow works: When a small percentage of write traffic routes to Go, the service first calls the legacy Python service to perform the real production write (users see their comments posted normally). Then Go performs its own write to the sister datastores. After both writes complete, the system compares production data against sister data across all three datastores, logging any differences.
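Here is a hedged sketch of that flow in Go. The `Store` interface, `Comment` type, and store pairing are assumptions for illustration; the real system coordinates the Postgres, Memcached, and Redis writes with far more care.

```go
package comments

import (
	"context"
	"log"
	"reflect"
)

// Comment is a stand-in for the real comment record.
type Comment struct {
	ID   string
	Body string
}

// Store abstracts one datastore (Postgres, Memcached, or Redis); each
// production store has an isolated sister counterpart at the same index.
type Store interface {
	Write(ctx context.Context, c Comment) error
	Read(ctx context.Context, id string) (Comment, error)
}

// dualWrite routes the real write through the legacy Python service, mirrors
// it into the sister stores that only the Go service touches, then compares
// the two copies and logs any drift.
func dualWrite(ctx context.Context, c Comment,
	legacyWrite func(context.Context, Comment) error,
	production, sister []Store) error {

	// 1. The legacy Python service performs the production write; this is
	// the write users actually see.
	if err := legacyWrite(ctx, c); err != nil {
		return err
	}
	// 2. The Go service writes only to the isolated sister datastores, so
	// its bugs can never corrupt real user data.
	for _, s := range sister {
		if err := s.Write(ctx, c); err != nil {
			log.Printf("sister write failed for %s: %v", c.ID, err)
		}
	}
	// 3. Compare production data against sister data in every store pair.
	for i := range production {
		prod, _ := production[i].Read(ctx, c.ID)
		sis, _ := sister[i].Read(ctx, c.ID)
		if !reflect.DeepEqual(prod, sis) {
			log.Printf("dual-write mismatch in store %d for %s", i, c.ID)
		}
	}
	return nil
}
```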
This approach is elegant because bugs in Go’s implementation only affect the isolated sister stores, never touching real user data. The team discovered serialization problems early: Go and Python serialize data differently, so they added verification layers to ensure Python services could deserialize Go’s writes.
Multi-Stage Migration Process
The media metadata migration demonstrates how these validation strategies combine into systematic migration processes:
- Enable dual writes into new metadata APIs from clients
- Backfill data from older databases to the new store
- Enable dual reads on media metadata from service clients
- Monitor data comparison for every read request and fix issues
- Ramp up read traffic to the new metadata store
This five-step choreography moved terabytes of data while maintaining service availability, achieving impressive results: 2.6ms read latency at p50, 4.7ms at p90, and 17ms at p99—50% faster than the previous system.
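Steps 3 and 4 of that list (dual reads with per-request comparison) might look roughly like this Go sketch; the store interface and metadata fields are illustrative assumptions, not the actual media metadata API.

```go
package media

import (
	"context"
	"log"
	"reflect"
)

// Metadata is an illustrative shape for a media metadata record.
type Metadata struct {
	PostID string
	URL    string
	Width  int
	Height int
}

type metadataStore interface {
	Get(ctx context.Context, postID string) (Metadata, error)
}

// dualRead serves from the legacy store while comparing every response
// against the new store. Once mismatches stop appearing, read traffic can
// be ramped to the new store and the legacy store retired.
func dualRead(ctx context.Context, postID string, legacy, next metadataStore) (Metadata, error) {
	oldMeta, err := legacy.Get(ctx, postID)
	newMeta, newErr := next.Get(ctx, postID)
	if newErr != nil || !reflect.DeepEqual(oldMeta, newMeta) {
		log.Printf("media metadata mismatch for %s: legacy=%+v new=%+v",
			postID, oldMeta, newMeta)
	}
	return oldMeta, err // legacy remains authoritative until ramp-up
}
```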
Intelligence at Scale: ML-Powered Notifications
While migration strategies handle legacy transformation, Reddit’s notification system shows how they build new intelligent systems from scratch.
Push notifications present a fundamental trade-off: done well, they reconnect users with content they care about; done poorly, they become noise that leads to muting or app uninstallation. Reddit’s solution is a four-stage pipeline that processes millions of posts daily to deliver personalized notifications to tens of millions of users.
Causal Budgeting Prevents Fatigue
The system starts with budgeting—determining how many notifications each user should receive daily. This isn’t a simple heuristic. Reddit uses causal modeling to weigh outcomes: positive signals like clicking notifications and browsing content versus negative signals like long usage gaps or disabling push entirely.
Basic correlation (“this user got 5 notifications and stayed active”) doesn’t reveal causation. Instead, the system intentionally varies notification volumes across user groups to estimate treatment effects—what would have happened under different conditions. A multi-model ensemble simulates different budget scenarios, then selects the one that optimizes a final engagement score reflecting both expected gains and expected risks.
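As a rough illustration of the selection step, the sketch below scores candidate budgets with an ensemble of models and picks the one with the best net score. The interface and weighting are assumptions; Reddit's actual causal models and engagement score are not public.

```go
package budget

// budgetModel estimates, for a user and a candidate daily notification
// budget, the expected engagement gain and the expected risk (muting,
// disabling push, long usage gaps). In production this is a trained causal
// model; here it is only an interface.
type budgetModel interface {
	Predict(userID string, budget int) (gain, risk float64)
}

// chooseBudget simulates each candidate budget with a model ensemble and
// picks the one maximizing a net engagement score. It assumes at least one
// candidate and one model; the scoring is an illustrative simplification.
func chooseBudget(userID string, candidates []int, ensemble []budgetModel) int {
	best, bestScore := candidates[0], -1e18
	for _, b := range candidates {
		var gain, risk float64
		for _, m := range ensemble { // average the ensemble's estimates
			g, r := m.Predict(userID, b)
			gain += g
			risk += r
		}
		n := float64(len(ensemble))
		score := gain/n - risk/n // expected gains net of expected risks
		if score > bestScore {
			best, bestScore = b, score
		}
	}
	return best
}
```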
Two-Stage Retrieval Balances Speed and Personalization
Before ranking, the system needs candidate posts worth considering. Ranking every new post with heavy ML models would be computationally infeasible at Reddit’s scale of millions of posts per day.
The retrieval stage combines rule-based and model-based approaches. Rule-based retrieval uses subscriptions: list the user’s subscribed subreddits, pick top X by engagement, select top Y posts from each using upvotes and age, and use round-robin selection to prevent one active community from dominating.
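A small Go sketch of the round-robin step, assuming posts are already ranked best-first per subreddit; the types and limits are illustrative.

```go
package retrieval

// Post is a minimal stand-in for a candidate post.
type Post struct {
	ID        string
	Subreddit string
}

// roundRobin interleaves the pre-ranked posts of each subscribed subreddit,
// taking one post per subreddit per pass, so that a single very active
// community cannot dominate the candidate set.
func roundRobin(postsBySub map[string][]Post, limit int) []Post {
	out := make([]Post, 0, limit)
	for i := 0; len(out) < limit; i++ {
		progressed := false
		for _, posts := range postsBySub {
			if i < len(posts) {
				out = append(out, posts[i])
				progressed = true
				if len(out) == limit {
					break
				}
			}
		}
		if !progressed { // every subreddit is exhausted
			break
		}
	}
	return out
}
```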
Model-based retrieval uses two-tower architecture—one tower learns user representations, the other learns post representations, both outputting fixed-length embeddings. Post embeddings are precomputed and indexed. At runtime, the system computes user embeddings on-the-fly and performs nearest-neighbor search to find posts the user is most likely to click.
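The runtime side of that retrieval might look like the exhaustive-scan sketch below. A production system would use an approximate nearest-neighbor index instead, and the embedding representation here is an assumption.

```go
package retrieval

import "sort"

// nearestPosts is the runtime half of two-tower retrieval: score the
// precomputed post embeddings against a freshly computed user embedding by
// dot product and return the top-k post IDs. A real system substitutes an
// approximate nearest-neighbor index for this exhaustive scan.
func nearestPosts(user []float32, postIDs []string, postEmb [][]float32, k int) []string {
	type scored struct {
		id    string
		score float32
	}
	scores := make([]scored, len(postIDs))
	for i, emb := range postEmb {
		var dot float32
		for j := range user { // both towers emit fixed-length embeddings
			dot += user[j] * emb[j]
		}
		scores[i] = scored{postIDs[i], dot}
	}
	sort.Slice(scores, func(a, b int) bool { return scores[a].score > scores[b].score })
	if k > len(scores) {
		k = len(scores)
	}
	out := make([]string, k)
	for i := range out {
		out[i] = scores[i].id
	}
	return out
}
```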
Multi-Task Learning Captures Nuanced Engagement
The ranking model uses a deep neural network that predicts multiple behaviors simultaneously through multi-task learning: P(click), P(upvote), P(comment), and more. These predictions combine via a weighted formula:
Final Score = W_click × P(click) + W_upvote × P(upvote) + ... + W_downvote × P(downvote)
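In code, that combination is just a weighted sum over per-task predictions. The task names and weight values below are placeholders; a negative weight penalizes unwanted outcomes like downvotes.

```go
package ranking

// finalScore combines the multi-task model's per-task probabilities with
// tunable weights, as in the formula above.
func finalScore(predictions, weights map[string]float64) float64 {
	var score float64
	for task, w := range weights {
		score += w * predictions[task]
	}
	return score
}

// Example (illustrative values only):
//   finalScore(
//       map[string]float64{"click": 0.31, "upvote": 0.12, "downvote": 0.02},
//       map[string]float64{"click": 1.0, "upvote": 0.5, "downvote": -2.0})
```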
The model architecture features shared layers at the beginning that process all input features into a common representation, then task-specific heads that specialize in predicting each target. This allows generalization across behaviors while optimizing for the nuances of each interaction type.
A subtle but critical detail: the system uses prediction logs that record all features at the moment it served a notification along with actual outcomes. This eliminates train-serve skew—the mismatch between training data and production conditions that can degrade model performance.
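A prediction log row might be shaped like the struct below. The field names are illustrative assumptions; the key idea is recording serve-time features and predictions so training data matches production exactly.

```go
package ranking

import "time"

// PredictionLog captures exactly what the model saw and predicted when a
// notification was served, later joined with the actual outcome. Training
// on these rows guarantees that training features match serving features.
type PredictionLog struct {
	NotificationID string
	UserID         string
	ServedAt       time.Time
	Features       map[string]float64 // feature values as computed at serve time
	Predictions    map[string]float64 // model outputs, e.g. P(click), P(upvote)
	Clicked        bool               // actual outcome, joined in asynchronously
}
```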
Reranking Applies Editorial Judgment
Even after ML ranking, a final reranking stage adjusts results based on product strategy. ML models optimize for historical patterns, but product goals evolve faster than models can retrain. Reranking applies editorial judgment without overriding the model entirely: boosting subscribed content over generic posts, enforcing diversity to prevent one subreddit from dominating, and personalizing content-type emphasis based on user behavior patterns.
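A simplified Go sketch of two such rules: boost subscribed content, then cap each subreddit's share of the final list. The boost factor and cap are hypothetical knobs, not Reddit's actual values.

```go
package ranking

import "sort"

// Candidate is a post that survived ML ranking.
type Candidate struct {
	PostID     string
	Subreddit  string
	Subscribed bool
	Score      float64 // output of the ML ranking stage
}

// rerank layers product rules on top of ML scores: boost posts from
// subscribed subreddits, then cap how many slots any one subreddit can take.
func rerank(cands []Candidate, boost float64, perSubCap int) []Candidate {
	for i := range cands {
		if cands[i].Subscribed {
			cands[i].Score *= boost // favor subscribed communities
		}
	}
	sort.Slice(cands, func(a, b int) bool { return cands[a].Score > cands[b].Score })
	out := make([]Candidate, 0, len(cands))
	taken := make(map[string]int)
	for _, c := range cands {
		if taken[c.Subreddit] < perSubCap { // enforce subreddit diversity
			out = append(out, c)
			taken[c.Subreddit]++
		}
	}
	return out
}
```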
The Cross-Language Migration Challenge
A natural question arises: if cross-language differences caused so many problems in the Comments migration, why not simply build Python microservices instead?
Reddit’s infrastructure organization has made a strategic commitment to Go for several reasons: concurrency advantages (higher throughput with fewer pods), existing ecosystem support, and better tooling. The engineering team accepted short-term cross-language challenges for long-term infrastructure consistency.
This decision paid off. Beyond achieving the primary goal of improved reliability, all three migrated write endpoints saw p99 latency cut in half. The legacy Python service occasionally had latency spikes reaching 15 seconds; the new Go service shows consistently lower latency staying well under 100 milliseconds.
However, the migration surfaced unexpected complications beyond serialization:
Database access patterns: Python’s ORM had hidden optimizations the team didn’t fully understand. When they ramped up the Go service with direct database queries, it put unexpected pressure on Postgres. The same operations that ran smoothly in Python caused performance issues in Go until they optimized queries and improved resource monitoring.
Race conditions in validation: The team started seeing mismatches that were timing problems, not bugs. A user might update a comment to “hello,” and Go writes that value to its sister datastore; in those same milliseconds, another user edits the comment to “hello again.” When Go compares its write against production, the values don’t match, but this is a race condition, not a bug. The team developed custom code to detect and ignore these cases, and plans database versioning for future migrations.
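A timestamp-based heuristic for classifying those mismatches might look like the sketch below; the types and rule are illustrative simplifications of what proper row versioning would solve more cleanly.

```go
package comments

import "time"

// VersionedComment pairs a comment body with its last-modified time.
type VersionedComment struct {
	ID         string
	Body       string
	ModifiedAt time.Time
}

// classifyMismatch decides whether a production/sister difference is a real
// bug or a concurrent edit: if production changed after the Go service
// captured its write, the rows legitimately diverged. A timestamp heuristic
// like this is a stand-in for the database versioning planned for future
// migrations.
func classifyMismatch(prod, sister VersionedComment, goWroteAt time.Time) string {
	switch {
	case prod.Body == sister.Body:
		return "match"
	case prod.ModifiedAt.After(goWroteAt):
		return "race" // another writer changed the row after our write
	default:
		return "bug" // genuine divergence worth investigating
	}
}
```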
Testing strategy evolution: Much of the migration time was spent manually reviewing tap-compare logs in production. The sheer complexity of comment data (plain text vs. rich text vs. media, photos and GIFs with different dimensions, subreddit-specific workflows, awards, moderation states) creates thousands of possible combinations. For future migrations, the team plans to use real production data to generate comprehensive test cases before deploying to production.
Emerging Patterns: Lessons from Reddit’s Evolution
Across these three areas—architectural evolution, ML systems, and critical migrations—several patterns emerge:
Incremental beats revolutionary. Whether migrating from monolith to microservices, introducing ML-powered features, or moving between languages, Reddit consistently chooses gradual transitions with fallback mechanisms over risky rewrites.
Testing in production is inevitable at scale. Despite comprehensive local testing, production traffic patterns reveal issues impossible to anticipate. Tap compare and sister datastores enable production validation with zero user risk—learning from reality without betting the platform.
Strategic technology bets compound over time. Reddit’s commitment to Go, GraphQL Federation, and Kubernetes creates a consistent foundation that makes each subsequent migration smoother. The team explicitly chose short-term migration pain for long-term ecosystem benefits.
Validation must include downstream consumers. Byte-level data comparison isn’t enough. Reddit’s addition of end-to-end verification—running CDC events through actual Python consumers to ensure deserialization works—caught compatibility issues that direct comparison would miss.
Causal modeling beats correlation. In the notification system, understanding what would have happened under different conditions (treatment effects) proves more valuable than observing what did happen (correlation). This principle extends beyond ML to architectural decisions—reasoning about counterfactuals improves decision quality.
Scale forces specialization. What works at 10,000 users fails at 1 billion. Reddit’s move from generic Post objects to server-driven UI descriptions, from third-party image optimization to in-house services (reducing costs to 0.9% of original), and from synchronous processing to queue-based asynchronous infrastructure—all reflect how scale demands specialized solutions.
Conclusion
Reddit’s architectural journey reveals that evolution at billion-user scale isn’t primarily about choosing the right technologies, though that matters. It’s about developing methodologies for safe transition: testing strategies that validate in production without risk, migration patterns that allow incremental rollback, and ML systems that balance automation with product intuition.
As ByteByteGo’s analysis notes, “What began as a monolithic Lisp application was rewritten in Python, but this monolithic approach couldn’t keep pace as Reddit’s popularity exploded.” The solution wasn’t a single heroic rewrite, but an ongoing series of thoughtful extractions, each validated through production traffic and backed by fallback mechanisms.
For teams building systems at scale, Reddit’s experience offers a valuable lesson: the path from monolith to microservices, from rules to ML, from legacy to modern infrastructure—these journeys require more than technical skill. They demand patience, systematic testing, and the discipline to move incrementally even when revolutionary change seems faster. The sister datastores strategy alone—creating isolated infrastructure to validate writes without production risk—demonstrates the kind of creative problem-solving that separates successful large-scale migrations from failed rewrites.
Reddit’s evolution continues. Comments became the first core model to fully escape the legacy monolith, with Accounts completed and Posts and Subreddits in progress. Each migration incorporates lessons from previous ones: database versioning for race conditions, comprehensive local testing from production data, better resource monitoring. The notification system explores dynamic weight adjustment and real-time learning to improve relevance.
What remains constant is the philosophy: evolve incrementally, validate continuously, and never bet the platform on untested changes.
Sources
This post synthesizes insights from:
- Reddit’s Architecture: The Evolutionary Journey - ByteByteGo’s comprehensive overview of Reddit’s architectural evolution from Lisp through Python to modern Go microservices, covering GraphQL Federation, CDC with Debezium, media metadata management, and more
- How Reddit Delivers Notifications to Tens of Millions of Users - Deep dive into Reddit’s ML-powered notification pipeline using causal budgeting, two-tower retrieval, multi-task learning for ranking, and strategic reranking
- How Reddit Migrated Comments Functionality from Python to Go - Detailed playbook for migrating Reddit’s most critical model using tap compare for reads and sister datastores for writes, with lessons on cross-language compatibility