You click a thumbs-up icon. The number increments. Seems simple, right?

But behind that innocent Like button on Facebook lies one of the most fascinating distributed systems challenges in modern software engineering. What appears as a simple counter is actually a carefully orchestrated dance of dozens of services, multiple databases, caching layers, and consistency protocols.

Let me take you on a journey from "it's just a counter" to "oh god, what have we built" — the natural evolution of any system at scale.

Level 0: The Naive Implementation

Let's start with what every developer would build on day one.

# Simple Flask endpoint (db is a placeholder database handle, used throughout)
from flask import Flask, request

app = Flask(__name__)

@app.post('/like')
def like_post():
    # Flask doesn't inject body parameters; read them from the JSON payload
    post_id = request.json['post_id']
    user_id = request.json['user_id']
    # Check if user already liked this
    existing = db.query(
        "SELECT * FROM likes WHERE post_id = ? AND user_id = ?",
        post_id, user_id
    )
    
    if existing:
        return {"error": "Already liked"}
    
    # Insert the like
    db.execute(
        "INSERT INTO likes (post_id, user_id, created_at) VALUES (?, ?, NOW())",
        post_id, user_id
    )
    
    # Update counter
    db.execute(
        "UPDATE posts SET like_count = like_count + 1 WHERE id = ?",
        post_id
    )
    
    return {"success": True}

This works perfectly fine... until about 1,000 users. Then everything falls apart.

The Problems That Emerge

  1. Race conditions: Two requests can both pass the duplicate check before either insert commits, double-counting a like; and any code that reads like_count = 100 and writes back 101 loses concurrent updates (demonstrated below)

  2. Database load: Every like hits the database three times (check + insert + update)

  3. Lock contention: Popular posts create update bottlenecks

  4. No scalability: Single database can't handle millions of likes per second

When your system goes from "college project" to "actual users," these aren't theoretical problems. They're 3 AM pages from your monitoring system.
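
Here's a toy demonstration of that lost-update race, with plain Python threads standing in for concurrent requests and a sleep standing in for query latency:

# Toy sketch of the lost-update race: each "request" reads the count,
# pauses (simulated latency), then writes back a possibly stale value.
import threading
import time

like_count = {"post_1": 0}

def like(post_id):
    current = like_count[post_id]       # read
    time.sleep(0.01)                    # latency widens the race window
    like_count[post_id] = current + 1   # write back a stale value

threads = [threading.Thread(target=like, args=("post_1",)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(like_count["post_1"])  # usually far less than 100

Run it a few times: the final count lands well below 100, because nearly every thread reads the counter before anyone has written it back.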

Level 1: Add Some Transactions

First attempt at fixing: wrap it in a transaction with proper isolation.

@app.post('/like')
def like_post():
    post_id = request.json['post_id']
    user_id = request.json['user_id']

    with db.transaction(isolation='SERIALIZABLE'):
        # Lock the post row
        post = db.query(
            "SELECT like_count FROM posts WHERE id = ? FOR UPDATE",
            post_id
        )
        
        existing = db.query(
            "SELECT * FROM likes WHERE post_id = ? AND user_id = ?",
            post_id, user_id
        )
        
        if existing:
            return {"error": "Already liked"}
        
        db.execute(
            "INSERT INTO likes (post_id, user_id, created_at) VALUES (?, ?, NOW())",
            post_id, user_id
        )
        
        db.execute(
            "UPDATE posts SET like_count = ? WHERE id = ?",
            post['like_count'] + 1, post_id
        )
    
    return {"success": True}

Better! No more race conditions. The counter is accurate.

But now you have a new problem: throughput has collapsed. That FOR UPDATE lock means only one person can like a post at a time. A viral post with 10,000 likes per second? Your database is now a smoking crater.

sequenceDiagram
    participant User1
    participant User2
    participant API
    participant Database
    
    User1->>API: Like post
    API->>Database: BEGIN TRANSACTION
    API->>Database: SELECT ... FOR UPDATE
    Database-->>API: Row locked
    
    User2->>API: Like post
    API->>Database: BEGIN TRANSACTION
    API->>Database: SELECT ... FOR UPDATE
    Note over Database: WAITING... User2 is blocked
    
    API->>Database: INSERT like
    API->>Database: UPDATE count
    API->>Database: COMMIT
    API-->>User1: Success
    
    Note over Database: Lock released, User2 can proceed
    Database-->>API: Row locked for User2
    API->>Database: INSERT like
    API->>Database: UPDATE count
    API->>Database: COMMIT
    API-->>User2: Success
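
There's a cheaper intermediate fix worth knowing before we reach for heavier machinery: push the duplicate check and the increment into single atomic statements, so no lock outlives a statement. A sketch, assuming a PostgreSQL-style ON CONFLICT clause and a unique index on (post_id, user_id):

# Sketch: the unique constraint does the duplicate check, and the
# increment is one atomic statement, so no long-lived row lock is held
@app.post('/like')
def like_post():
    post_id = request.json['post_id']
    user_id = request.json['user_id']

    inserted = db.execute(
        """INSERT INTO likes (post_id, user_id, created_at)
           VALUES (?, ?, NOW())
           ON CONFLICT (post_id, user_id) DO NOTHING""",
        post_id, user_id
    )
    if not inserted:  # rowcount 0 means the like already existed
        return {"error": "Already liked"}

    db.execute(
        "UPDATE posts SET like_count = like_count + 1 WHERE id = ?",
        post_id
    )
    return {"success": True}

This restores most of the throughput, but every like still lands on one hot counter row, and reads still hammer the database. Which brings us to caching.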

Level 2: Read-Heavy Optimization with Caching

Here's a realization: likes are read way more often than they're written. Every time someone scrolls their feed, they see hundreds of like counts. But they only click like once.

Enter caching.

class LikeService:
    def __init__(self):
        self.cache = Redis()
        self.db = Database()
    
    def get_like_count(self, post_id):
        # Try cache first
        cached = self.cache.get(f"post:{post_id}:likes")
        if cached is not None:
            return int(cached)
        
        # Cache miss - hit database
        count = self.db.query(
            "SELECT like_count FROM posts WHERE id = ?",
            post_id
        )[0]
        
        # Cache for 5 minutes
        self.cache.setex(f"post:{post_id}:likes", 300, count)
        return count
    
    def add_like(self, post_id, user_id):
        # Best-effort duplicate check; if the likers set was evicted,
        # a unique (post_id, user_id) constraint in the database is the real guard
        if self.cache.sismember(f"post:{post_id}:likers", user_id):
            return False
        
        # Write to database
        with self.db.transaction():
            self.db.execute(
                "INSERT INTO likes (post_id, user_id) VALUES (?, ?)",
                post_id, user_id
            )
            new_count = self.db.execute(
                "UPDATE posts SET like_count = like_count + 1 WHERE id = ? RETURNING like_count",
                post_id
            )[0]
        
        # Update cache
        self.cache.sadd(f"post:{post_id}:likers", user_id)
        self.cache.setex(f"post:{post_id}:likes", 300, new_count)
        
        return True

This helps a lot with reads. But we've introduced a new nightmare: cache invalidation.

What happens when:

  • The cache expires but the database hasn't been updated yet?

  • Multiple app servers update the cache simultaneously?

  • A user unlikes and then immediately likes again?

You end up with users seeing different like counts depending on which server they hit. One API server shows 1,000 likes, another shows 1,002, and the actual database has 1,001.
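
One common partial fix, offered here as a sketch rather than a prescription: invalidate the cached count on writes instead of updating it, so two racing writers can never publish a stale value. The price is one extra cache miss per write:

    def add_like(self, post_id, user_id):
        # Delete-on-write invalidation: nobody writes counts to the
        # cache here, so racing writers can't overwrite it with an
        # older value; the next reader repopulates from the database
        with self.db.transaction():
            self.db.execute(
                "INSERT INTO likes (post_id, user_id) VALUES (?, ?)",
                post_id, user_id
            )
            self.db.execute(
                "UPDATE posts SET like_count = like_count + 1 WHERE id = ?",
                post_id
            )
        self.cache.sadd(f"post:{post_id}:likers", user_id)
        self.cache.delete(f"post:{post_id}:likes")
        return True

Even this leaves a window where a reader repopulates the cache with a value that's stale by the time it lands, which is part of why the next level stops chasing exactness altogether.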

Level 3: Eventually Consistent Counters

This is where most companies make a critical decision: exact counts don't actually matter.

Does anyone care if a post has exactly 1,847,293 likes versus 1,847,301? Nope. They care about the magnitude. So we embrace eventual consistency.

class EventualLikeCounter:
    def __init__(self):
        self.redis = Redis()
        self.kafka = KafkaProducer()
        self.db = Database()
    
    def add_like(self, post_id, user_id):
        # sadd returns 1 only for a brand-new member, so a duplicate
        # like never touches the counter (pipelining an unconditional
        # incr alongside it would overcount repeat clicks)
        if self.redis.sadd(f"post:{post_id}:likers", user_id) == 1:
            self.redis.incr(f"post:{post_id}:likes")

            # Publish event for async processing
            self.kafka.send('likes', {
                'post_id': post_id,
                'user_id': user_id,
                'timestamp': time.time()
            })

            return True
        return False
    
    def get_like_count(self, post_id):
        # Always return from cache (fast reads)
        count = self.redis.get(f"post:{post_id}:likes")
        return int(count) if count else 0

Now we have separate workers that batch-process likes:

from collections import defaultdict

class LikeAggregator:
    def process_batch(self, events):
        # Group ~1000 like events by post
        batch = defaultdict(set)

        for event in events:
            batch[event['post_id']].add(event['user_id'])

        # One database round-trip per post instead of one per like
        for post_id, user_ids in batch.items():
            with self.db.transaction():
                inserted = self.db.executemany(
                    "INSERT INTO likes (post_id, user_id) VALUES (?, ?) ON CONFLICT DO NOTHING",
                    [(post_id, uid) for uid in user_ids]
                )
                # Increment by the rows actually inserted; adding
                # len(user_ids) would drift the count upward whenever
                # events are replayed or duplicates slip through
                self.db.execute(
                    "UPDATE posts SET like_count = like_count + ? WHERE id = ?",
                    inserted, post_id
                )

graph TD
    A[User Clicks Like] --> B[API Server]
    B --> C[Redis: Increment Counter]
    B --> D[Kafka: Publish Event]
    C --> E[Return Success to User]
    D --> F[Aggregator Worker]
    F --> G[Batch 1000 Events]
    G --> H[PostgreSQL: Bulk Insert]
    H --> I[Update DB Counter]
    
    style C fill:#90EE90
    style E fill:#90EE90
    style F fill:#FFD700
    style H fill:#87CEEB

This architecture handles millions of likes per second. The user gets instant feedback (Redis is fast), and the database gets updates in digestible batches.

But we've created new problems:

  • What if Kafka goes down? Lost likes?

  • What if Redis crashes? Counter goes to zero? (a recovery sketch follows this list)

  • What if the aggregator falls behind? Database gets stale?
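
The Redis question at least has a workable answer, because the aggregator keeps the database roughly current. Here's a sketch of a read path that rebuilds a lost counter instead of reporting zero:

    def get_like_count(self, post_id):
        # If the key was lost (eviction, restart), fall back to the
        # count the aggregator has been maintaining, then repopulate
        count = self.redis.get(f"post:{post_id}:likes")
        if count is not None:
            return int(count)

        row = self.db.query("SELECT like_count FROM posts WHERE id = ?", post_id)
        db_count = row[0] if row else 0

        # nx=True: don't clobber a counter a concurrent writer just recreated
        self.redis.set(f"post:{post_id}:likes", db_count, nx=True)
        return db_count

The rebuilt value lags by whatever the aggregator hasn't flushed yet, which is precisely the eventual-consistency trade we already accepted.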

Level 4: The Real Facebook Architecture

Facebook doesn't just store a counter. They need to:

  • Show WHO liked a post

  • Support reactions (like, love, haha, wow, sad, angry)

  • Handle un-likes

  • Provide real-time updates

  • Scale to billions of posts

  • Maintain consistency across datacenters worldwide

Here's a simplified version of what they actually do:

Sharded Write-Heavy Store

import hashlib

class LikeWriteStore:
    """
    Sharded across thousands of MySQL instances
    Each shard handles ~10M posts
    """
    def __init__(self):
        self.shards = [MySQLConnection(f"shard-{i}") for i in range(1000)]

    def get_shard(self, post_id):
        # A stable hash is essential here: Python's built-in hash() is
        # randomized per process and would route the same post to
        # different shards. Modulo keeps this sketch short; real systems
        # prefer consistent hashing (see the ring sketch after this class)
        h = int(hashlib.md5(str(post_id).encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]
    
    def add_like(self, post_id, user_id, reaction_type):
        shard = self.get_shard(post_id)
        
        # Insert into likes table
        shard.execute(
            """INSERT INTO likes (post_id, user_id, reaction_type, created_at) 
               VALUES (?, ?, ?, NOW())
               ON DUPLICATE KEY UPDATE reaction_type = ?, updated_at = NOW()""",
            post_id, user_id, reaction_type, reaction_type
        )
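
And since the comment above name-drops it, here's what minimal consistent hashing actually looks like, with virtual nodes to smooth the key distribution (a sketch; production rings add weighting and replication):

import bisect
import hashlib

class ConsistentHashRing:
    """
    Adding or removing a shard remaps only ~1/N of the keys,
    versus nearly all of them with plain modulo
    """
    def __init__(self, shard_names, vnodes=100):
        # Each shard appears at `vnodes` points on the ring
        self.ring = sorted(
            (self._hash(f"{name}#{v}"), name)
            for name in shard_names
            for v in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_shard(self, post_id):
        # First ring position clockwise from the key's hash
        i = bisect.bisect(self.hashes, self._hash(str(post_id))) % len(self.ring)
        return self.ring[i][1]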

Separate Aggregation Store

class LikeAggregationStore:
    """
    Separate system for fast count reads
    Updated asynchronously from write store
    """
    def __init__(self):
        self.memcache = MemcacheClient()
        self.tao = TAOService()  # Facebook's custom graph store
    
    def get_counts(self, post_id):
        # Try L1 cache (in-memory)
        key = f"like_counts:{post_id}"
        cached = self.memcache.get(key)
        if cached is not None:
            return cached
        
        # Try L2 cache (TAO)
        counts = self.tao.get_edge_count(post_id, "likes")
        
        # Cache in memcache
        self.memcache.set(key, counts, ttl=60)
        return counts

Real-time Updates via Streaming

class LikeStreamProcessor:
    """
    Processes like events in real-time
    Updates caches and sends notifications
    """
    def process_like_event(self, event):
        post_id = event['post_id']
        user_id = event['user_id']
        
        # Update aggregation cache
        self.update_cache(post_id)
        
        # Notify post author
        self.send_notification(event['post_author_id'], {
            'type': 'like',
            'from': user_id,
            'post': post_id
        })
        
        # Update friend feeds
        self.fanout_to_feeds(user_id, event)

Here's the full data flow:

graph TB
    subgraph "Write Path"
        A[User Clicks Like] --> B[API Server]
        B --> C[Write to Sharded MySQL]
        B --> D[Publish to Streaming System]
        D --> E[Update Memcache]
        D --> F[Update TAO Graph]
        D --> G[Send Notifications]
    end
    
    subgraph "Read Path"
        H[User Views Post] --> I[Check Memcache]
        I -->|Hit| J[Return Count]
        I -->|Miss| K[Query TAO]
        K --> L[Aggregate from Shards]
        L --> I
    end
    
    subgraph "Background Jobs"
        M[Hourly Reconciliation]
        M --> N[Count Actual Likes in MySQL]
        M --> O[Fix Drift in Caches]
    end
    
    style C fill:#ff6b6b
    style E fill:#4ecdc4
    style F fill:#4ecdc4
    style I fill:#95e1d3

Level 5: The Hidden Challenges

Even with all this architecture, Facebook engineers deal with:

1. Geographic Distribution

A user in Mumbai likes a post from Tokyo. The write goes to the closest datacenter, but reads might happen from any of 15 global datacenters. How do you keep them in sync?

graph LR
    subgraph "Asia DC"
        A1[Write: +1 Like]
        A2[Local Cache: 100]
    end
    
    subgraph "Europe DC"
        E1[Read Request]
        E2[Local Cache: 99]
    end
    
    subgraph "US DC"
        U1[Master DB]
        U2[Count: 100]
    end
    
    A1 --> U1
    U1 -.Async Replication.-> E2
    E1 --> E2
    
    style E2 fill:#ffaaaa
    style A2 fill:#aaffaa

Solution: Eventual consistency with regional caches. Accept that someone in Europe might see the count a second late.
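
One well-studied way to make multi-region counting conflict-free is a CRDT G-counter (CRDTs reappear in the further-reading list at the end): each datacenter increments only its own slot, replicas exchange slots asynchronously, and the displayed count is the sum. A minimal sketch:

class GCounter:
    """
    Grow-only CRDT counter: per-region slots merge by max, so
    replicas can sync in any order without losing increments
    """
    def __init__(self, region, regions):
        self.region = region
        self.counts = {r: 0 for r in regions}

    def increment(self, n=1):
        self.counts[self.region] += n  # only ever touch our own slot

    def merge(self, other_counts):
        # Replication step: per-region max is order-independent
        for region, count in other_counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())

Unlikes need the PN-counter variant (a second grow-only map for decrements), but the property is the same: merges commute, so every region converges to the same total.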

2. Celebrity Posts

When Taylor Swift posts something, 50 million people like it in the first hour. That's ~14,000 likes per second on a single post.

Traditional sharding doesn't help here because all traffic hits one shard. Facebook's solution:

class HotPostHandler:
    """
    Detects viral posts and routes them to special infrastructure
    """
    def is_hot_post(self, post_id):
        # Redis returns a string (or None); coerce before comparing
        rate = int(self.redis.get(f"like_rate:{post_id}:last_minute") or 0)
        return rate > 1000  # More than 1000 likes/min
    
    def handle_like(self, post_id, user_id):
        if self.is_hot_post(post_id):
            # Route to special high-throughput queue
            self.hot_queue.publish(post_id, user_id)
        else:
            # Normal path
            self.normal_queue.publish(post_id, user_id)

Hot posts get:

  • Dedicated queue workers

  • Higher cache TTLs

  • Approximate counts (less precision, more speed; a sketch follows this list)

  • Rate limiting on reads
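
"Approximate counts" can be as simple as sampling. A hedged sketch (the key name and the 1-in-100 rate are illustrative, not Facebook's numbers): record one in every N likes and multiply back on read.

import random

class SampledCounter:
    """
    Approximate counting for hot posts: record 1-in-N likes and
    scale up on read. Write load drops by a factor of N, and the
    relative error shrinks as the true count grows
    """
    def __init__(self, redis, sample_rate=100):
        self.redis = redis
        self.n = sample_rate

    def incr(self, post_id):
        # Each like is recorded with probability 1/N
        if random.random() < 1.0 / self.n:
            self.redis.incr(f"post:{post_id}:sampled_likes")

    def estimate(self, post_id):
        sampled = int(self.redis.get(f"post:{post_id}:sampled_likes") or 0)
        return sampled * self.n  # unbiased estimate of the true count

At 50 million likes the sampling error is a rounding error; at 50 likes it would be garbage, which is exactly why only hot posts take this path.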

3. The Unlike Problem

Removing a like is harder than adding one. You need to:

  1. Remove from the set of likers (so they can like again)

  2. Decrement the counter

  3. Update all caches

  4. Handle the case where they unlike and immediately re-like

def unlike(self, post_id, user_id):
    # srem returns 1 only if the user was actually in the likers set;
    # decrementing unconditionally would drift the counter negative
    # (and note: in redis-py, pipeline results only come back from execute())
    removed = self.redis.srem(f"post:{post_id}:likers", user_id)

    if removed:
        self.redis.decr(f"post:{post_id}:likes")

        # Only publish event if they actually had liked it
        self.kafka.send('unlikes', {
            'post_id': post_id,
            'user_id': user_id,
            'timestamp': time.time()
        })

        # Invalidate aggregate caches
        self.memcache.delete(f"like_counts:{post_id}")

The tricky part: what if the unlike event is processed before the like event (because of queue reordering)? You need idempotency and sequence numbers:

class IdempotentLikeProcessor:
    def process_event(self, event):
        # Each event has a sequence number
        last_seq = self.redis.get(f"last_seq:{event['user_id']}:{event['post_id']}")

        # Redis returns a string; coerce before comparing sequence numbers
        if last_seq is not None and event['sequence'] <= int(last_seq):
            # Stale or reordered event, ignore it
            return
        
        # Process event
        if event['type'] == 'like':
            self.add_like(event['post_id'], event['user_id'])
        else:
            self.remove_like(event['post_id'], event['user_id'])
        
        # Update sequence
        self.redis.set(
            f"last_seq:{event['user_id']}:{event['post_id']}",
            event['sequence']
        )

Level 6: Observability and Debugging

With this distributed architecture, debugging becomes its own challenge. When a user reports "my like didn't count," you need to trace through:

  1. API server logs

  2. Redis operations

  3. Kafka message delivery

  4. Aggregator processing

  5. Database writes

  6. Cache invalidations

import uuid

class LikeTracer:
    """
    Distributed tracing for like operations
    """
    def add_like_with_trace(self, post_id, user_id):
        trace_id = str(uuid.uuid4())
        
        with self.tracer.span("add_like", trace_id=trace_id) as span:
            span.set_tag("post_id", post_id)
            span.set_tag("user_id", user_id)
            
            # Redis increment
            with span.child_span("redis_incr"):
                self.redis.incr(f"post:{post_id}:likes")
            
            # Kafka publish
            with span.child_span("kafka_publish"):
                self.kafka.send('likes', {
                    'post_id': post_id,
                    'user_id': user_id,
                    'trace_id': trace_id
                })
            
            # Log for debugging
            logger.info("Like added", extra={
                'trace_id': trace_id,
                'post_id': post_id,
                'user_id': user_id,
                'timestamp': time.time()
            })

You also need monitoring dashboards tracking:

  • Like latency (p50, p95, p99)

  • Cache hit rates

  • Queue lag

  • Database load

  • Error rates by failure type

The Real Lesson

The Like button teaches us something fundamental about distributed systems: there's no such thing as a simple feature at scale.

What started as:

UPDATE posts SET likes = likes + 1

Became:

  • 15+ microservices

  • 5 different data stores

  • 3 levels of caching

  • 2 message queues

  • Custom consistency protocols

  • Dedicated monitoring infrastructure

And that's just for incrementing a number.

This pattern repeats everywhere in large-scale systems. Twitter's retweet count, YouTube's view counter, Instagram's follower count—they all face the same challenges. The solutions involve embracing:

  1. Eventual consistency: Not every user needs to see the exact same number at the exact same millisecond

  2. Approximate accuracy: 1.2M vs 1.2M + 7 likes doesn't change user behavior

  3. Tiered architecture: Different paths for reads vs writes, hot vs cold data

  4. Async processing: Decouple user-facing latency from data consistency

  5. Graceful degradation: When systems fail, show stale data rather than errors

The next time you click a Like button, remember: you just participated in a carefully choreographed dance involving hundreds of servers across multiple continents, all working together to increment a number that doesn't even need to be perfectly accurate.

And somehow, it still manages to feel instant.


Want to dive deeper? Here are the key concepts to research further:

  • Last-write-wins (LWW) registers

  • CRDT (Conflict-free Replicated Data Types)

  • Facebook's TAO (The Associations and Objects) architecture

  • Lamport timestamps and vector clocks

  • Event sourcing patterns

The real magic isn't in any single technique—it's in knowing when to apply which solution as your scale changes.
