You're the new engineer at a startup. The founder comes to you with a simple request: "Users need to reset their passwords when they forget them. Should be easy, right?"

Sure. Let's build it.

By the end of this journey, your "simple feature" will involve 8 services, 4 databases, a message queue, and a fraud detection system, at an infrastructure cost of about $2,500/month.

Welcome to the real world of distributed systems.

Day 1: The MVP

You're building the simplest possible solution. One table, one endpoint, done by lunch.

Your First Design

graph LR
    A[User submits email] --> B[Check if user exists]
    B --> C[Generate reset token]
    C --> D[Save token to database]
    D --> E[Send email with link]
    E --> F[User clicks link]
    F --> G[Validate token]
    G --> H[Update password]
    
    style A fill:#a8e6cf
    style H fill:#a8e6cf

Database Schema:

users
  - id
  - email
  - password_hash
  - created_at

password_reset_tokens
  - id
  - user_id
  - token (random string)
  - created_at
  - expires_at

The Flow:

  1. User submits email

  2. You generate a random token (like a3f8b2c...)

  3. Store it with user_id and expiration (1 hour)

  4. Email them: Click here: example.com/reset?token=a3f8b2c...

  5. When they click, check if token is valid and not expired

  6. Let them set a new password
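
The six steps above can be sketched roughly as follows. This is a minimal illustration, not the startup's actual code: it assumes SQLite as the database, and the function names (`init_db`, `request_reset`, `validate_token`) and the use of `secrets.token_hex` for the random string are placeholder choices. Sending the email and updating the password are left out.

```python
import secrets
import sqlite3
from datetime import datetime, timedelta, timezone

TOKEN_TTL = timedelta(hours=1)  # expiration from step 3


def init_db(conn):
    """Create the two tables from the schema above (column types are assumed)."""
    conn.executescript("""
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            email TEXT UNIQUE NOT NULL,
            password_hash TEXT NOT NULL,
            created_at TEXT NOT NULL
        );
        CREATE TABLE password_reset_tokens (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            token TEXT NOT NULL,
            created_at TEXT NOT NULL,
            expires_at TEXT NOT NULL
        );
    """)


def request_reset(conn, email, now):
    """Steps 1-3: look up the user, generate a token, store it with an expiry."""
    row = conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchone()
    if row is None:
        return None
    token = secrets.token_hex(16)  # placeholder generator, 32 hex characters
    conn.execute(
        "INSERT INTO password_reset_tokens (user_id, token, created_at, expires_at) "
        "VALUES (?, ?, ?, ?)",
        (row[0], token, now.isoformat(), (now + TOKEN_TTL).isoformat()),
    )
    return token  # step 4 would put this in the emailed link


def validate_token(conn, token, now):
    """Step 5: return the user_id if the token exists and has not expired."""
    row = conn.execute(
        "SELECT user_id, expires_at FROM password_reset_tokens WHERE token = ?",
        (token,),
    ).fetchone()
    if row is None:
        return None
    user_id, expires_at = row
    if datetime.fromisoformat(expires_at) <= now:
        return None
    return user_id
```

Passing `now` in explicitly, instead of reading the clock inside the functions, keeps the expiry check easy to test.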

You deploy on Friday. It works perfectly. You go home feeling accomplished.

Monday Morning: The First Attack

You wake up to 47 Slack messages.

Someone is hammering your endpoint. They're requesting password resets for every possible email: admin@yourcompany.com, support@yourcompany.com, ceo@yourcompany.com...

The Email Enumeration Attack

Your current system has a fatal flaw:

graph TD
    A[Attacker: Reset 'admin@company.com'] --> B{User exists?}
    B -->|Yes| C[Response: 'Email sent']
    B -->|No| D[Response: 'User not found']
    
    C --> E[Attacker knows: This email IS registered]
    D --> F[Attacker knows: This email is NOT registered]
    
    style E fill:#ffcccc
    style F fill:#ffcccc

The attacker can now enumerate your registered users without ever touching your database. They now know:

  • Which emails are registered

  • Which employees work at companies using your service

  • Potential targets for spear-phishing

Design Fix #1: Information Disclosure

New rule: Always return the same message, regardless of whether the user exists.

graph LR
    A[User submits email] --> B{User exists?}
    B -->|Yes| C[Send reset email]
    B -->|No| D[Do nothing]
    C --> E[Always respond: 'If that email exists, we sent you a link']
    D --> E
    
    style E fill:#a8e6cf
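
In code, the rule might look something like this. It is a sketch under assumptions: `handle_reset_request`, `registered_emails`, and `send_reset_email` are hypothetical names, and the user lookup is a plain set instead of a database query.

```python
GENERIC_MESSAGE = "If that email exists, we sent you a link"


def handle_reset_request(email, registered_emails, send_reset_email):
    """Return the same status and body whether or not the email is registered."""
    if email.strip().lower() in registered_emails:
        send_reset_email(email)
    # Both branches fall through to the identical response.
    return 200, GENERIC_MESSAGE
```

The status code and body are identical on both paths. Response timing can still differ between the two branches, which this sketch does not address.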

But now you've created a UX problem. Legitimate users who typo their email won't know they made a mistake. They'll wait for an email that never comes.

Trade-off #1: Security vs User Experience. You chose security.

Tuesday: The DDoS Attack

The same attacker is back. Now they're sending 10,000 reset requests per minute for victim@company.com.

The victim's inbox is flooded. Your email service provider is throttling you. Your email reputation score is tanking. Other legitimate emails aren't getting delivered.

System Design Problem: No Rate Limiting

You need to track requests and limit them. But track WHAT exactly?

graph TD
    A[Rate Limit By What?] --> B[By IP Address?]
    A --> C[By Email?]
    A --> D[By Session?]
    A --> E[Combination?]
    
    B --> F[Problem: Shared IPs, VPNs]
    C --> G[Problem: Attackers target victims' emails]
    D --> H[Problem: No session before login]
    E --> I[Problem: Complex logic]
    
    style F fill:#ffe6e6
    style G fill:#ffe6e6
    style H fill:#ffe6e6
    style I fill:#fff4e6

Design Fix #2: Multi-Layer Rate Limiting

You implement multiple limits:

Rate Limiting Strategy:

graph TB
    subgraph "Per IP Address"
        A[Max 5 requests per hour]
    end
    
    subgraph "Per Email"
        B[Max 3 requests per hour]
    end
    
    subgraph "Global"
        C[Max 1000 total requests per minute]
    end
    
    D[Incoming Request] --> A
    A --> B
    B --> C
    C --> E{All limits OK?}
    E -->|Yes| F[Process Request]
    E -->|No| G[Return 429 Too Many Requests]

New Infrastructure Needed:

  • Redis or Memcached for fast rate limit counters

  • Sliding window algorithm to prevent bursts

  • Different time windows for different limits
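
Here is one way the three layers could fit together, using a sliding window log per key. To keep the example self-contained it stores timestamps in process memory instead of Redis, and the names `SlidingWindowLimiter`, `make_limits`, and `check_request` are made up for illustration. The limits themselves (5 per IP per hour, 3 per email per hour, 1000 global per minute) come from the diagram above.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per key within the last window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def would_allow(self, key, now):
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        return len(hits) < self.max_requests

    def record(self, key, now):
        self._hits[key].append(now)


def make_limits():
    return [
        ("ip", SlidingWindowLimiter(5, 3600)),
        ("email", SlidingWindowLimiter(3, 3600)),
        ("global", SlidingWindowLimiter(1000, 60)),
    ]


def check_request(limits, ip, email, now):
    """Return 200 if every layer allows the request, otherwise 429."""
    keys = {"ip": ip, "email": email, "global": "*"}
    if not all(limiter.would_allow(keys[name], now) for name, limiter in limits):
        return 429
    # Only accepted requests count against the limits (a design choice).
    for name, limiter in limits:
        limiter.record(keys[name], now)
    return 200
```

With these limits, a fourth request for the same email within an hour gets a 429 even if each request comes from a different IP address.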

But now you have a new problem: Where do you store these counters? In-memory is fast but doesn't survive server restarts. Database is persistent but too slow.

Trade-off #2: Speed vs Durability. You choose Redis (fast, but counters might be lost on a crash).

Wednesday: The Token Prediction Attack

A security researcher emails you. They've figured out your token generation is predictable.

It turns out your "random" tokens were generated like this: md5(email + timestamp)

They can generate all possible tokens for a given email within a time window. They don't need to intercept emails—they can just guess valid tokens.

System Design Problem: Cryptographic Randomness

graph LR
    A["Bad: md5(email + time)"] --> B[Predictable Pattern]
    B --> C[Attacker can brute force]
    
    D[Good: Cryptographically secure random] --> E[Impossible to guess]
    E --> F[Must access email to get token]
    
    style A fill:#ffcccc
    style D fill:#ccffcc

Design Fix #3: Proper Token Generation

  • Use cryptographically secure random number generator

  • Make tokens long enough (128+ bits)

  • Make them unpredictable even if attacker knows algorithm
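
To make the difference concrete, here is a small sketch comparing the two approaches. It assumes the timestamp in the old scheme was an integer number of Unix seconds, which the story does not specify, and `weak_token`, `guess_weak_token`, and `secure_token` are illustrative names. The secure version uses Python's standard `secrets` module.

```python
import hashlib
import secrets


def weak_token(email, timestamp):
    """The predictable scheme: md5(email + timestamp)."""
    return hashlib.md5(f"{email}{timestamp}".encode()).hexdigest()


def guess_weak_token(email, start_ts, end_ts, target):
    """Try every second in a time window until the token matches."""
    for ts in range(start_ts, end_ts + 1):
        if weak_token(email, ts) == target:
            return ts
    return None


def secure_token(num_bytes=32):
    """32 random bytes (256 bits) from the OS CSPRNG, URL-safe for a reset link."""
    return secrets.token_urlsafe(num_bytes)
```

An attacker who knows the email and roughly when the reset was requested needs at most 3,600 guesses to cover a one-hour window of `weak_token` values. Nothing comparable exists for `secure_token`, because its output does not depend on any value the attacker can know.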

But now another issue: Should you hash the token before storing it in the database?

graph TD
    A{Store token in DB} --> B[Store plain text]
    A --> C[Store hashed]
    
    B --> D[Pro: Easy to query]
    B --> E[Con: DB breach exposes all tokens]
    
    C --> F[Pro: DB breach doesn't expose tokens]
    C --> G[Con: Can't recover or re-send the original token]
    
    style E fill:#ffcccc
    style F fill:#ccffcc

Trade-off #3: Convenience vs Defense-in-Depth. You hash the tokens.
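
A minimal sketch of the hashed approach, assuming SHA-256 and an in-memory dictionary in place of the tokens table. `TokenStore`, `issue`, and `redeem` are hypothetical names. A fast hash is a reasonable choice here, unlike for passwords, because the token is already long and random.

```python
import hashlib
import secrets


def hash_token(token):
    return hashlib.sha256(token.encode()).hexdigest()


class TokenStore:
    """Keeps only token hashes; the raw token exists only in the email."""

    def __init__(self):
        self._by_hash = {}  # token_hash -> user_id

    def issue(self, user_id):
        token = secrets.token_urlsafe(32)
        self._by_hash[hash_token(token)] = user_id
        return token  # goes into the reset link, never into storage

    def redeem(self, token):
        """Hash the presented token and look it up; each token works once."""
        return self._by_hash.pop(hash_token(token), None)
```

Lookup stays a simple equality query on the hash column, and someone who dumps the table gets hashes that cannot be pasted into a reset link.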

Thursday: The Support Ticket Avalanche

Your support team is drowning. People are getting locked out because:

  • They typo their email 3 times → rate limited for an hour

  • They're traveling → different IP → system seems suspicious

  • They changed phone numbers → can't receive SMS codes

  • They're not tech-savvy → confused by the process

System Design Problem: No Escape Hatch

Your security is too good. Legitimate users can't get in.

graph TD
    A[User Locked Out] --> B{Has Support Team?}
    B -->|No| C[User lost forever]
    B -->|Yes| D[Manual verification]
    
    D --> E[Support agent verifies identity]
    E --> F[Support resets password manually]
    
    F --> G[Problem: Support has too much power]
    G --> H[Problem: Support becomes attack vector]
    
    style C fill:#ffcccc
    style H fill:#ffcccc

Design Fix #4: Support Tools with Audit Logging

You build a support portal with:

  • Elevated permissions for support staff

  • Mandatory audit logs of every action

  • Multi-person approval for sensitive resets

  • Time-delayed resets (user gets notified, has 24h to object)

New Infrastructure:

  • Audit log database (append-only, immutable)

  • Admin portal with role-based access control

  • Alert system for suspicious support actions
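
One possible way to make an audit log tamper-evident is to chain each entry to the hash of the previous one. The sketch below is an assumption about how that could look, not a description of any particular product: `AuditLog` and its field names are invented, and a real system would persist entries to storage that support staff cannot modify.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(record):
        body = {k: v for k, v in record.items() if k != "hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, actor, action, target, timestamp):
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        record = {
            "actor": actor,
            "action": action,
            "target": target,
            "timestamp": timestamp,
            "prev_hash": prev_hash,
        }
        record["hash"] = self._digest(record)
        self._entries.append(record)
        return record["hash"]

    def verify(self):
        """Recompute the chain; any edited or removed entry breaks it."""
        prev_hash = self.GENESIS
        for record in self._entries:
            if record["prev_hash"] != prev_hash or record["hash"] != self._digest(record):
                return False
            prev_hash = record["hash"]
        return True
```

If someone edits an old entry to hide a suspicious reset, `verify()` returns `False` from that point on, which is the kind of signal the alert system can watch for.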

Trade-off #4: Accessibility vs Attack Surface. You've just made your support team a high-value target.

Friday: The Scale Problem

Your startup is doing well. You now have:

  • 1 million users

  • 10,000 password reset requests per day

  • Global users across 100 countries

System Design Problem: Single Point of Failure

graph TD
    A[All Users Worldwide] --> B[Single Web Server]
    B --> C[Single Database]
    C --> D[Single Email Service]
    
    B --> E[Problem: Latency for distant users]
    C --> F[Problem: Database becomes bottleneck]
    D --> G[Problem: Email delays]
    
    style E fill:#ffcccc
    style F fill:#ffcccc
    style G fill:#ffcccc

Current Architecture Pain Points:

  1. Database Writes: Every reset request writes to DB (token creation)

  2. Database Reads: Every token validation reads from DB

  3. Email Delays: Synchronous email sending blocks response

  4. No Geographic Distribution: Asian users hit US servers

Design Fix #5: Distributed Architecture

You redesign the system:

graph TB
    subgraph "Frontend Layer"
        A[Load Balancer]
        A --> B[API Server 1]
        A --> C[API Server 2]
        A --> D[API Server N]
    end
    
    subgraph "Cache Layer"
        E[Redis Cluster]
        F[Rate Limit Counters]
        G[Token Cache]
    end
    
    subgraph "Processing Layer"
        H["Message Queue (Kafka)"]
        I[Worker 1: Email Sender]
        J[Worker 2: Token Validator]
        K[Worker 3: Analytics]
    end
    
    subgraph "Storage Layer"
        L[(Primary DB)]
        M[(Read Replica 1)]
        N[(Read Replica 2)]
    end
    
    B --> E
    C --> F
    D --> G
    
    B --> H
    C --> H
    D --> H
    
    H --> I
    H --> J
    H --> K
    
    I --> L
    J --> M
    K --> N

Architectural Changes:

Async Email Sending:

sequenceDiagram
    participant User
    participant API
    participant Queue
    participant Worker
    participant Email
    
    User->>API: Request password reset
    API->>Queue: Publish reset event
    API-->>User: 200 OK (instant response)
    
    Note over Queue: Event waits in queue
    
    Worker->>Queue: Poll for events
    Queue-->>Worker: Reset event
    Worker->>Email: Send email
    Email-->>Worker: Sent

Benefits:

  • User gets instant response (doesn't wait for email to send)

  • Email failures don't affect API response

  • Can retry failed emails

  • Can batch emails for efficiency

Trade-offs:

  • More complex to debug (distributed tracing needed)

  • Can't tell user immediately if email failed

  • Need to handle queue failures
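
The shape of the async flow, including retries, can be sketched with Python's standard `queue` module standing in for Kafka. Everything here is illustrative: `api_request_reset`, `drain`, `MAX_ATTEMPTS`, and the dead-letter list are assumed names, and a real worker would run as a separate process rather than being drained in the same thread.

```python
import queue

MAX_ATTEMPTS = 3  # assumed retry budget


def api_request_reset(events, email):
    """API side: publish the event and return immediately."""
    events.put({"email": email, "attempts": 0})
    return 200


def drain(events, send_email, dead_letters):
    """Worker side: process events until the queue is empty.

    Failed sends are re-queued until MAX_ATTEMPTS is reached, then moved
    to dead_letters for someone to look at.
    """
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return
        try:
            send_email(event["email"])
        except Exception:
            event["attempts"] += 1
            if event["attempts"] < MAX_ATTEMPTS:
                events.put(event)
            else:
                dead_letters.append(event)
```

The API returns 200 before any email is attempted, which is exactly why it cannot tell the user whether the send eventually failed.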

Month 2: The SIM Swap Attack

You added SMS two-factor authentication. Users love it—until someone's phone number gets stolen.

The Attack:

sequenceDiagram
    participant Attacker
    participant TelecomCompany
    participant Victim
    participant YourSystem
    
    Attacker->>TelecomCompany: Social engineering: "I lost my SIM"
    TelecomCompany->>TelecomCompany: Ports number to new SIM
    
    Attacker->>YourSystem: Reset password for victim
    YourSystem->>Attacker: SMS code sent to victim's number
    Note over Attacker: Attacker now receives the SMS
    
    Attacker->>YourSystem: Submit SMS code
    YourSystem->>Attacker: Access granted
    
    Attacker->>YourSystem: Change password
    Victim->>YourSystem: Try to login - LOCKED OUT

The Problem: SMS isn't as secure as you thought. Telecom companies have weak identity verification. Attackers exploit this.

Design Fix #6: Risk-Based Authentication

Different users get different security requirements based on risk signals:

graph TD
    A[Password Reset Request] --> B[Risk Assessment Engine]
    
    B --> C{Calculate Risk Score}
    
    C --> D[Check: Location]
    C --> E[Check: Device Fingerprint]
    C --> F[Check: Time of Day]
    C --> G[Check: Recent Activity]
    C --> H[Check: IP Reputation]
    
    D --> I{Risk Level}
    E --> I
    F --> I
    G --> I
    H --> I
    
    I -->|Low 0-30| J[Email Only]
    I -->|Medium 31-60| K[Email + SMS]
    I -->|High 61-80| L[Email + SMS + Security Questions]
    I -->|Critical 81-100| M[Block + Manual Review]
    
    style J fill:#ccffcc
    style K fill:#ffffcc
    style L fill:#ffddcc
    style M fill:#ffcccc

Risk Signals:

| Signal | Why It Matters | Weight |
|---|---|---|
| New location | User never logged in from this city before | +30 |
| VPN/Proxy | Hiding real location | +20 |
| Suspicious IP | Known bad actor IP | +40 |
| Unusual time | User normally active 9am-5pm, request at 3am | +15 |
| Recent resets | Multiple resets in 24 hours | +25 |
| Device match | Same device/browser as usual | -20 |
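
As a worked example, the weights and thresholds above can be combined with simple addition. This is a rule-based sketch, not the machine learning model mentioned elsewhere in this section. Clamping the total to the 0-100 range is an assumption, and the signal keys and return labels are made-up identifiers.

```python
# Weights from the risk signals listed above.
WEIGHTS = {
    "new_location": 30,
    "vpn_or_proxy": 20,
    "suspicious_ip": 40,
    "unusual_time": 15,
    "recent_resets": 25,
    "device_match": -20,
}


def risk_score(signals):
    """Sum the weights of the observed signals, clamped to 0-100 (assumed)."""
    raw = sum(WEIGHTS[name] for name in signals)
    return max(0, min(100, raw))


def required_verification(score):
    """Map a score to the verification tiers from the diagram."""
    if score <= 30:
        return "email_only"
    if score <= 60:
        return "email_and_sms"
    if score <= 80:
        return "email_sms_security_questions"
    return "block_and_manual_review"
```

For instance, a request from a new location through a VPN scores 30 + 20 = 50, which lands in the medium tier (email + SMS). The same request from the user's usual device scores 50 - 20 = 30 and drops back to email only.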

New Infrastructure Needed:

  • Geolocation service

  • Device fingerprinting

  • Behavioral analytics

  • Machine learning model for risk scoring

Trade-off #5: Privacy vs Security. You're now tracking user behavior patterns.

Month 3: The Distributed Systems Nightmare

You're now running in 5 data centers across 3 continents. A user in Tokyo requests a reset, the email gets sent from Singapore, they click the link and hit a server in California.

System Design Problem: Distributed State

graph TD
    A[User in Tokyo] --> B[Closest Server: Singapore]
    B --> C[Creates Token]
    C --> D[Writes to Singapore DB]
    
    E[User clicks link] --> F[Closest Server: California]
    F --> G[Reads from California DB]
    
    D -.Replication Delay 2 seconds.-> G
    
    G --> H{Token found?}
    H --> I[No - Replication hasn't finished]
    I --> J[User sees: Invalid token]
    
    style I fill:#ffcccc
    style J fill:#ffcccc

The CAP Theorem Strikes:

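The popular summary is that you can only pick 2 of 3. More precisely, network partitions will happen in any multi-region system, so the real choice during a partition is between consistency and availability: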

  • Consistency: Every read sees the latest write

  • Availability: Every request gets a response

  • Partition Tolerance: System works despite network failures

graph TD
    A{CAP Theorem Choice} --> B[Consistency + Availability]
    A --> C[Consistency + Partition Tolerance]
    A --> D[Availability + Partition Tolerance]
    
    B --> E[Single datacenter only]
    C --> F[Strong consistency, but downtime during network issues]
    D --> G[Eventually consistent, always available]
    
    E --> H[Not an option - you're global]
    F --> I[Not acceptable - users expect high uptime]
    G --> J[This is your reality]
    
    style H fill:#ffcccc
    style I fill:#ffcccc
    style J fill:#ffffcc

Design Fix #7: Eventually Consistent Architecture

You embrace eventual consistency:

sequenceDiagram
    participant User
    participant Singapore
    participant Queue
    participant California
    participant Tokyo
    
    User->>Singapore: Request reset
    Singapore->>Singapore: Generate token
    Singapore->>Queue: Publish "token created" event
    Queue-->>California: Replicate event
    Queue-->>Tokyo: Replicate event
    
    Note over Singapore,Tokyo: All DCs eventually have the token
    
    User->>California: Click reset link (2 seconds later)
    California->>California: Check local cache first
    California->>California: Found in cache!
    California->>User: Valid token, proceed

Key Strategies:

  1. Write to local region, replicate globally

  2. Cache aggressively (Redis with short TTL)

  3. Accept eventual consistency (token might take 1-2 seconds to propagate)

  4. Increase token validity (from 1 hour to 24 hours to account for delays)
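
To see what "write locally, replicate globally" means for a reader in another region, here is a toy model. It is only a simulation under assumptions: `Region`, `ReplicationLog`, and `validate` are invented names, regions are plain dictionaries, and replication is a queue that is flushed by hand instead of a real event stream.

```python
from collections import deque


class Region:
    def __init__(self, name):
        self.name = name
        self.tokens = {}  # token_hash -> user_id, this region's local view


class ReplicationLog:
    """Applies writes locally at once and to other regions later."""

    def __init__(self, regions):
        self.regions = regions
        self.pending = deque()

    def write(self, origin, token_hash, user_id):
        origin.tokens[token_hash] = user_id  # local write is immediate
        for region in self.regions:
            if region is not origin:
                self.pending.append((region, token_hash, user_id))

    def deliver_all(self):
        """Simulate the replication delay elapsing."""
        while self.pending:
            region, token_hash, user_id = self.pending.popleft()
            region.tokens[token_hash] = user_id


def validate(region, token_hash):
    """Each region answers from its own copy, without a cross-region call."""
    return region.tokens.get(token_hash)
```

A token written in Singapore validates there right away, returns `None` in California until `deliver_all()` runs, and validates everywhere afterwards. That window is the 1-2 seconds you are choosing to accept.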

Trade-off #6: Consistency vs Latency. Users in distant regions might wait slightly longer.

Month 6: The Account Takeover Wave

Despite all your defenses, accounts are still getting compromised. Attackers evolved:

Modern Attack Vectors

graph TD
    A[Attacker Goals] --> B[Phishing]
    A --> C[Credential Stuffing]
    A --> D[Social Engineering]
    A --> E[Malware]
    
    B --> F[Fake reset page steals token]
    C --> G[Reused passwords from other breaches]
    D --> H[Trick support into resetting]
    E --> I[Keylogger captures new password]
    
    F --> J[Your reset system didn't fail...]
    G --> J
    H --> J
    I --> J
    J --> K[...the ecosystem did]
    
    style K fill:#ffcccc

The Realization: Your password reset system is secure, but passwords themselves are the problem.

Design Fix #8: Moving Beyond Passwords

You start implementing passwordless alternatives:

Option 1: Magic Links

sequenceDiagram
    participant User
    participant System
    participant Email
    
    User->>System: I want to login
    System->>System: Generate one-time login token
    System->>Email: Send magic link
    Email->>User: Click to login (no password needed)
    User->>System: Clicks link
    System->>User: Logged in!
    
    Note over System: Token expires after use or 15 minutes
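
A magic link is essentially a login token that expires after one use or 15 minutes, whichever comes first. The sketch below shows that rule with an in-memory store. `MagicLinks`, `create`, and `consume` are hypothetical names, and storing only a SHA-256 hash of the token is an added assumption, not something the diagram specifies.

```python
import hashlib
import secrets

MAGIC_LINK_TTL_SECONDS = 15 * 60  # from the note in the diagram


class MagicLinks:
    def __init__(self):
        self._pending = {}  # sha256(token) -> (email, expires_at)

    @staticmethod
    def _key(token):
        return hashlib.sha256(token.encode()).hexdigest()

    def create(self, email, now):
        """Return a one-time token to embed in the emailed link."""
        token = secrets.token_urlsafe(32)
        self._pending[self._key(token)] = (email, now + MAGIC_LINK_TTL_SECONDS)
        return token

    def consume(self, token, now):
        """Return the email to log in as, or None if unknown, used, or expired."""
        entry = self._pending.pop(self._key(token), None)  # pop enforces single use
        if entry is None:
            return None
        email, expires_at = entry
        return email if now < expires_at else None
```

Because `consume` removes the entry before checking expiry, a second click on the same link always fails, even inside the 15-minute window.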

Option 2: Passkeys (WebAuthn)

graph LR
    A[User wants to login] --> B[System: Use your fingerprint]
    B --> C[User's device verifies biometric]
    C --> D[Device signs challenge with private key]
    D --> E[System verifies signature]
    E --> F[Logged in!]
    
    style F fill:#ccffcc
    
    G[No password involved] --> H[Nothing to reset]
    H --> I[Nothing to phish]
    I --> J[Nothing to breach]
    
    style J fill:#ccffcc

Benefits of Passwordless:

  • No passwords to forget

  • No passwords to reset

  • Phishing becomes much harder

  • Credential stuffing stops working (there is no reused password to try)

  • Better security AND better UX

Challenges:

  • Requires modern devices/browsers

  • User education needed

  • Fallback mechanisms still required

  • Device loss scenarios need handling

Month 12: The Reality Check

You've been building password reset functionality for a year. Here's what you've created:

The Final Architecture

graph TB
    subgraph "Edge Layer"
        CDN[CDN / Edge Cache]
        WAF[Web Application Firewall]
    end
    
    subgraph "API Layer"
        LB[Load Balancer]
        API1[API Server]
        API2[API Server]
        API3[API Server]
    end
    
    subgraph "Intelligence Layer"
        RISK[Risk Assessment Engine]
        FRAUD[Fraud Detection ML]
        GEO[Geolocation Service]
    end
    
    subgraph "State Management"
        REDIS[Redis Cluster]
        QUEUE[Kafka Message Queue]
    end
    
    subgraph "Processing Workers"
        W1[Email Worker]
        W2[SMS Worker]
        W3[Notification Worker]
        W4[Analytics Worker]
    end
    
    subgraph "Storage Layer"
        DB_PRIMARY[(Primary PostgreSQL)]
        DB_REPLICA1[(Read Replica 1)]
        DB_REPLICA2[(Read Replica 2)]
        DB_ANALYTICS[(Analytics DB)]
    end
    
    subgraph "External Services"
        EMAIL[SendGrid]
        SMS[Twilio]
        MONITORING[DataDog]
    end
    
    CDN --> WAF
    WAF --> LB
    LB --> API1
    LB --> API2
    LB --> API3
    
    API1 --> RISK
    API2 --> FRAUD
    API3 --> GEO
    
    API1 --> REDIS
    API2 --> REDIS
    API3 --> REDIS
    
    API1 --> QUEUE
    API2 --> QUEUE
    API3 --> QUEUE
    
    QUEUE --> W1
    QUEUE --> W2
    QUEUE --> W3
    QUEUE --> W4
    
    W1 --> EMAIL
    W2 --> SMS
    W1 --> DB_PRIMARY
    W2 --> DB_PRIMARY
    W3 --> DB_PRIMARY
    W4 --> DB_ANALYTICS
    
    DB_PRIMARY -.Replication.-> DB_REPLICA1
    DB_PRIMARY -.Replication.-> DB_REPLICA2
    
    W1 --> MONITORING
    W2 --> MONITORING
    API1 --> MONITORING

What You've Learned

The Cost of "Simple" Features:

| Metric | Start | Now |
|---|---|---|
| Lines of Code | 50 | 12,000+ |
| Services | 1 | 8 |
| Databases | 1 | 4 |
| Team Members | You | 3 engineers full-time |
| Infrastructure Cost | $5/month | $2,500/month |
| Support Tickets | 0/day | 15/day |

The Trade-offs You Made:

graph LR
    A[Security] -.vs.-> B[Usability]
    C[Speed] -.vs.-> D[Consistency]
    E[Privacy] -.vs.-> F[Fraud Detection]
    G[Simplicity] -.vs.-> H[Features]
    I[Cost] -.vs.-> J[Reliability]
    
    style A fill:#ffd6d6
    style B fill:#d6f5d6
    style C fill:#d6f5d6
    style D fill:#ffd6d6
    style E fill:#ffd6d6
    style F fill:#d6f5d6
    style G fill:#ffd6d6
    style H fill:#d6f5d6
    style I fill:#ffd6d6
    style J fill:#d6f5d6

The Lessons

1. There Are No Simple Features at Scale

What seems simple has hidden complexity:

  • User expectations (fast, reliable, always works)

  • Attacker sophistication (always evolving)

  • Edge cases (the user in Antarctica on a satellite connection)

  • Regulations (GDPR, data residency, accessibility)

2. Every Security Layer Adds Friction

graph TD
    A[More Security] --> B[More Steps]
    B --> C[More Confusion]
    C --> D[More Support Tickets]
    D --> E[Higher Costs]
    E --> F[User Frustration]
    F --> G[Users Choose Competitors]
    
    style G fill:#ffcccc

The trick is finding the balance. Not maximum security—optimal security.

3. Distributed Systems Are Hard

You thought you were building a password reset feature. You actually built:

  • A distributed state machine

  • An event-driven architecture

  • A real-time risk assessment engine

  • A multi-region data replication system

4. The Best Reset Flow is No Reset Flow

The industry is moving toward:

  • Passkeys (device-bound key pairs, typically unlocked with a biometric or PIN)

  • Magic links (one-time email tokens)

  • Hardware tokens (YubiKeys)

  • Single Sign-On (let someone else handle it)

Because the ultimate truth: passwords are the problem, not password resets.

The Modern Solution: Risk-Adaptive Security

Today's best systems don't apply the same security to everyone. They adapt:

graph TD
    A[Password Reset Request] --> B[Risk Analysis]
    
    B --> C{User Signals}
    C --> D[Known Device + Normal Location]
    C --> E[Unknown Device OR Unusual Location]
    C --> F[Multiple Red Flags]
    
    D --> G[Low Friction: Email only]
    E --> H[Medium Friction: Email + SMS]
    F --> I[High Friction: Multiple verifications]
    
    G --> J[Reset in 2 minutes]
    H --> K[Reset in 5 minutes]
    I --> L[Reset in 15 minutes + possible manual review]
    
    style G fill:#ccffcc
    style H fill:#ffffcc
    style I fill:#ffcccc

The Philosophy:

  • Trust your legitimate users

  • Make attackers' lives hard

  • Use data to tell them apart

Looking Forward

The future of account recovery isn't better password resets. It's:

  1. Eliminating passwords entirely (passkeys, biometrics)

  2. Continuous authentication (behavior-based trust scores)

  3. Decentralized identity (blockchain-based identity)

  4. AI-powered fraud detection (real-time risk assessment)

But for now, we're stuck in the middle. Passwords are dying, but they're not dead yet.

So we keep building more complex reset flows, adding more steps, more verifications, more friction.

All to protect users from a threat that exists because we're still using a technology from the 1960s: the password.


Key Takeaways for System Design

If you're building a password reset system today:

Start Simple, Add Complexity Only When Needed:

  • Begin with basic token-based reset

  • Add rate limiting when you see abuse

  • Add 2FA when risk increases

  • Add ML when scale demands it

Think in Layers:

graph TD
    A[Layer 1: Basic Token] --> B[Layer 2: Rate Limiting]
    B --> C[Layer 3: Risk Assessment]
    C --> D[Layer 4: Multi-Factor Auth]
    D --> E[Layer 5: Fraud Detection]
    
    F[Each layer is optional] --> G[Add based on your threat model]

Measure Everything:

  • Success rate (how many resets complete?)

  • Time to complete (how long does it take?)

  • Drop-off points (where do users abandon?)

  • False positive rate (how many legitimate users blocked?)

  • Support ticket volume (how much manual intervention?)
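
The first and third metrics can be computed from a simple event log. This is a sketch under assumptions: the four funnel step names and the `(reset_id, step)` event shape are invented for illustration, and your own instrumentation will likely look different.

```python
from collections import Counter

# Hypothetical funnel steps, in order.
FUNNEL = ["requested", "email_sent", "link_clicked", "password_changed"]


def funnel_metrics(events):
    """Compute success rate and drop-off points from (reset_id, step) events."""
    furthest = {}  # reset_id -> index of the furthest step reached
    for reset_id, step in events:
        idx = FUNNEL.index(step)
        furthest[reset_id] = max(furthest.get(reset_id, -1), idx)

    last = len(FUNNEL) - 1
    total = len(furthest)
    completed = sum(1 for idx in furthest.values() if idx == last)
    # Count where each unfinished reset stopped.
    drop_off = Counter(FUNNEL[idx] for idx in furthest.values() if idx < last)
    return {
        "success_rate": completed / total if total else 0.0,
        "drop_off": dict(drop_off),
    }
```

If most abandoned resets stop at `email_sent`, the problem is probably deliverability or typos in the address. If they stop at `link_clicked`, look at the reset page itself.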

Plan Your Evolution:

  • MVP: Email-based tokens

  • Phase 2: Rate limiting + token security

  • Phase 3: SMS 2FA for high-value accounts

  • Phase 4: Risk-based authentication

  • Phase 5: Passwordless alternatives

Remember: The goal isn't to build the most secure system possible. It's to build a system that balances security, usability, cost, and reliability for YOUR specific threat model and user base.

A banking app needs different security than a recipe website.

Know your risks. Design accordingly.
