You're the new engineer at a startup. The founder comes to you with a simple request: "Users need to reset their passwords when they forget them. Should be easy, right?"
Sure. Let's build it.
By the end of this journey, your "simple feature" will involve 8 services, 4 databases, a message queue, a fraud detection model, and an infrastructure bill of $2,500/month.
Welcome to the real world of distributed systems.
Day 1: The MVP
You're building the simplest possible solution. One table, one endpoint, done by lunch.
Your First Design
graph LR
A[User submits email] --> B[Check if user exists]
B --> C[Generate reset token]
C --> D[Save token to database]
D --> E[Send email with link]
E --> F[User clicks link]
F --> G[Validate token]
G --> H[Update password]
style A fill:#a8e6cf
style H fill:#a8e6cf
Database Schema:
users
- id
- email
- password_hash
- created_at
password_reset_tokens
- id
- user_id
- token (random string)
- created_at
- expires_at
The Flow:
User submits email
You generate a random token (like a3f8b2c...)
Store it with user_id and expiration (1 hour)
Email them: Click here: example.com/reset?token=a3f8b2c...
When they click, check if token is valid and not expired
Let them set a new password
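The schema and flow above can be sketched in a few lines. This is a minimal illustration, not the article's actual code: it assumes an in-memory SQLite database, the function names `open_db`, `request_reset`, and `complete_reset` are hypothetical, and deleting the token after use is an added assumption. Token generation here uses Python's `secrets` module; the story's own Day 1 generator turns out to be weaker.

```python
import secrets
import sqlite3
import time

TOKEN_TTL_SECONDS = 3600  # 1 hour, as in the flow above


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the two tables from the schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            email TEXT UNIQUE NOT NULL,
            password_hash TEXT NOT NULL,
            created_at REAL NOT NULL
        );
        CREATE TABLE password_reset_tokens (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            token TEXT UNIQUE NOT NULL,
            created_at REAL NOT NULL,
            expires_at REAL NOT NULL
        );
    """)
    return db


def request_reset(db, email, now=None):
    """Look up the user, create a token, store it. Returns the token or None."""
    now = time.time() if now is None else now
    row = db.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchone()
    if row is None:
        return None
    token = secrets.token_hex(16)
    db.execute(
        "INSERT INTO password_reset_tokens (user_id, token, created_at, expires_at) "
        "VALUES (?, ?, ?, ?)",
        (row[0], token, now, now + TOKEN_TTL_SECONDS),
    )
    return token  # the caller would email example.com/reset?token=<token>


def complete_reset(db, token, new_password_hash, now=None):
    """Validate the token, update the password, and delete the token."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT user_id FROM password_reset_tokens WHERE token = ? AND expires_at > ?",
        (token, now),
    ).fetchone()
    if row is None:
        return False
    db.execute("UPDATE users SET password_hash = ? WHERE id = ?", (new_password_hash, row[0]))
    db.execute("DELETE FROM password_reset_tokens WHERE token = ?", (token,))  # single use (assumption)
    return True
```

The `now` parameter exists only so that expiry can be exercised without waiting an hour.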
You deploy on Friday. It works perfectly. You go home feeling accomplished.
Monday Morning: The First Attack
You wake up to 47 Slack messages.
Someone is hammering your endpoint. They're requesting password resets for every possible email: admin@yourcompany.com, support@yourcompany.com, ceo@yourcompany.com...
The Email Enumeration Attack
Your current system has a fatal flaw:
graph TD
A[Attacker: Reset 'admin@company.com'] --> B{User exists?}
B -->|Yes| C[Response: 'Email sent']
B -->|No| D[Response: 'User not found']
C --> E[Attacker knows: This email IS registered]
D --> F[Attacker knows: This email is NOT registered]
style E fill:#ffcccc
style F fill:#ffcccc
The attacker can now enumerate your registered users without ever touching your database. They learn:
Which emails are registered
Which employees work at companies using your service
Potential targets for spear-phishing
Design Fix #1: Information Disclosure
New rule: Always return the same message, regardless of whether the user exists.
graph LR
A[User submits email] --> B{User exists?}
B -->|Yes| C[Send reset email]
B -->|No| D[Do nothing]
C --> E[Always respond: 'If that email exists, we sent you a link']
D --> E
style E fill:#a8e6cf
But now you've created a UX problem. Legitimate users who typo their email won't know they made a mistake. They'll wait for an email that never comes.
Trade-off #1: Security vs User Experience. You chose security.
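A minimal sketch of the uniform-response rule, assuming a set of known emails and a list standing in for the outgoing mail system (both hypothetical, as is the handler name):

```python
GENERIC_MESSAGE = "If that email exists, we sent you a link"


def handle_reset_request(email, known_emails, outbox):
    """Return the same response whether or not the email is registered."""
    normalized = email.strip().lower()
    if normalized in known_emails:
        outbox.append(normalized)  # stand-in for sending the reset email
    # Identical status and body on both branches: nothing to enumerate.
    return {"status": 200, "message": GENERIC_MESSAGE}
```

Response timing can also reveal which branch ran; this sketch does not address that.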
Tuesday: The DDoS Attack
The same attacker is back. Now they're sending 10,000 reset requests per minute for victim@company.com.
The victim's inbox is flooded. Your email service provider is throttling you. Your email reputation score is tanking. Other legitimate emails aren't getting delivered.
System Design Problem: No Rate Limiting
You need to track requests and limit them. But track WHAT exactly?
graph TD
A[Rate Limit By What?] --> B[By IP Address?]
A --> C[By Email?]
A --> D[By Session?]
A --> E[Combination?]
B --> F[Problem: Shared IPs, VPNs]
C --> G[Problem: Attackers target victims' emails]
D --> H[Problem: No session before login]
E --> I[Problem: Complex logic]
style F fill:#ffe6e6
style G fill:#ffe6e6
style H fill:#ffe6e6
style I fill:#fff4e6
Design Fix #2: Multi-Layer Rate Limiting
You implement multiple limits:
Rate Limiting Strategy:
graph TB
subgraph "Per IP Address"
A[Max 5 requests per hour]
end
subgraph "Per Email"
B[Max 3 requests per hour]
end
subgraph "Global"
C[Max 1000 total requests per minute]
end
D[Incoming Request] --> A
A --> B
B --> C
C --> E{All limits OK?}
E -->|Yes| F[Process Request]
E -->|No| G[Return 429 Too Many Requests]
New Infrastructure Needed:
Redis or Memcached for fast rate limit counters
Sliding window algorithm, so a burst straddling a fixed-window boundary can't double the limit
Different time windows for different limits
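But now you have a new problem: Where do you store these counters? Counters held in each API server's memory are fast, but they vanish on restart and aren't shared between servers. The main database is persistent and shared, but too slow for a check on every request.
Trade-off #2: Speed vs Durability. You choose Redis (fast and shared, but counters can be lost on a crash).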
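The layered limits can be sketched as follows. This is an in-process illustration only: plain Python deques stand in for the Redis counters, and the class and method names are hypothetical. The limits (5 per IP per hour, 3 per email per hour, 1000 globally per minute) are the ones from the diagram.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the last `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.events = defaultdict(deque)  # key -> timestamps of accepted events

    def would_allow(self, key, now):
        q = self.events[key]
        while q and q[0] <= now - self.window:  # drop events that left the window
            q.popleft()
        return len(q) < self.limit

    def record(self, key, now):
        self.events[key].append(now)


class ResetRateLimiter:
    """Per-IP, per-email, and global limits checked together."""

    def __init__(self):
        self.per_ip = SlidingWindowLimiter(5, 3600)
        self.per_email = SlidingWindowLimiter(3, 3600)
        self.everyone = SlidingWindowLimiter(1000, 60)

    def check(self, ip, email, now=None):
        now = time.time() if now is None else now
        checks = [(self.per_ip, ip), (self.per_email, email), (self.everyone, "all")]
        if not all(limiter.would_allow(key, now) for limiter, key in checks):
            return 429  # Too Many Requests
        for limiter, key in checks:
            limiter.record(key, now)  # count only requests that were accepted
        return 200
```

Rejected requests are not counted here, which is one possible policy; counting them as well would keep a persistent attacker locked out for longer.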
Wednesday: The Token Prediction Attack
A security researcher emails you. They've figured out your token generation is predictable.
You were generating tokens like this: md5(email + timestamp)
They can generate all possible tokens for a given email within a time window. They don't need to intercept emails—they can just guess valid tokens.
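To make the attack concrete, here is a small sketch. It assumes the timestamp is a whole number of seconds and that the attacker has some way to test a candidate token (the `is_valid` callback is hypothetical, standing in for requests to the reset endpoint).

```python
import hashlib


def weak_token(email, timestamp):
    """The flawed scheme from the text: md5(email + timestamp)."""
    return hashlib.md5(f"{email}{timestamp}".encode()).hexdigest()


def guess_token(email, issued_between, is_valid):
    """Try every whole-second timestamp in the window until one is accepted."""
    start, end = issued_between
    for ts in range(start, end + 1):
        candidate = weak_token(email, ts)
        if is_valid(candidate):
            return candidate
    return None
```

Under these assumptions, an attacker who knows the request happened within a five-minute window needs at most 301 guesses.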
System Design Problem: Cryptographic Randomness
graph LR
A["Bad: md5(email + time)"] --> B[Predictable Pattern]
B --> C[Attacker can brute force]
D[Good: Cryptographically secure random] --> E[Impossible to guess]
E --> F[Must access email to get token]
style A fill:#ffcccc
style D fill:#ccffcc
Design Fix #3: Proper Token Generation
Use a cryptographically secure random number generator
Make tokens long enough (128+ bits)
Make them unpredictable even if the attacker knows the algorithm
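In Python, the three requirements above are covered by the standard library's `secrets` module. A minimal sketch (the function name is hypothetical, and 32 bytes is an assumed size that comfortably exceeds 128 bits):

```python
import secrets


def new_reset_token(nbytes=32):
    """Return a URL-safe token built from `nbytes` random bytes from the OS CSPRNG."""
    return secrets.token_urlsafe(nbytes)
```

The token carries no email or timestamp, so knowing how it is generated gives an attacker nothing to work with.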
But now another issue: Should you hash the token before storing it in the database?
graph TD
A{Store token in DB} --> B[Store plain text]
A --> C[Store hashed]
B --> D[Pro: Easy to query]
B --> E[Con: DB breach exposes all tokens]
C --> F[Pro: DB breach doesn't expose tokens]
C --> G[Con: Can't read back or re-send the original token]
style E fill:#ffcccc
style F fill:#ccffcc
Trade-off #3: Convenience vs Defense-in-Depth. You hash the tokens.
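A minimal sketch of storing only a hash, with a dict standing in for the tokens table (the function names are hypothetical). Plain SHA-256 is used on the assumption that the token is long and random; unlike a password, it does not need a slow hash.

```python
import hashlib
import secrets


def issue_token(store, user_id):
    """Store only the SHA-256 digest; return the raw token for the email link."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode()).hexdigest()
    store[digest] = user_id
    return token


def lookup_token(store, presented):
    """Hash the presented token and look up the digest."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return store.get(digest)
```

Someone who dumps `store` sees digests only, and cannot turn them back into working reset links.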
Thursday: The Support Ticket Avalanche
Your support team is drowning. People are getting locked out because:
They typo their email 3 times → rate limited for an hour
They're traveling → different IP → system seems suspicious
They changed phone numbers → can't receive SMS codes
They're not tech-savvy → confused by the process
System Design Problem: No Escape Hatch
Your security is too good. Legitimate users can't get in.
graph TD
A[User Locked Out] --> B{Has Support Team?}
B -->|No| C[User lost forever]
B -->|Yes| D[Manual verification]
D --> E[Support agent verifies identity]
E --> F[Support resets password manually]
F --> G[Problem: Support has too much power]
G --> H[Problem: Support becomes attack vector]
style C fill:#ffcccc
style H fill:#ffcccc
Design Fix #4: Support Tools with Audit Logging
You build a support portal with:
Elevated permissions for support staff
Mandatory audit logs of every action
Multi-person approval for sensitive resets
Time-delayed resets (user gets notified, has 24h to object)
New Infrastructure:
Audit log database (append-only, immutable)
Admin portal with role-based access control
Alert system for suspicious support actions
Trade-off #4: Accessibility vs Attack Surface. You've just made your support team a high-value target.
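One way to combine multi-person approval, the 24-hour objection window, and audit logging is sketched below. Everything here is illustrative: the `SupportReset` class, its fields, and the rule that one approver other than the requester is enough are all assumptions, and a plain list stands in for the audit log database.

```python
from dataclasses import dataclass, field

OBJECTION_WINDOW_SECONDS = 24 * 3600  # user has 24h to object


@dataclass
class SupportReset:
    user_id: int
    requested_by: str
    requested_at: float
    approvers: set = field(default_factory=set)
    objected: bool = False

    def approve(self, agent, audit_log, now):
        """Record an approval from a second agent; every approval is logged."""
        if agent == self.requested_by:
            raise PermissionError("requester cannot approve their own reset")
        self.approvers.add(agent)
        audit_log.append((now, agent, "approve", self.user_id))

    def can_execute(self, now):
        """True once approved, not objected to, and the delay has elapsed."""
        return (
            len(self.approvers) >= 1
            and not self.objected
            and now >= self.requested_at + OBJECTION_WINDOW_SECONDS
        )
```

No single support agent can complete a reset alone, which limits what a compromised or coerced agent can do.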
Friday: The Scale Problem
Your startup is doing well. You now have:
1 million users
10,000 password reset requests per day
Global users across 100 countries
System Design Problem: Single Point of Failure
graph TD
A[All Users Worldwide] --> B[Single Web Server]
B --> C[Single Database]
C --> D[Single Email Service]
B --> E[Problem: Latency for distant users]
C --> F[Problem: Database becomes bottleneck]
D --> G[Problem: Email delays]
style E fill:#ffcccc
style F fill:#ffcccc
style G fill:#ffcccc
Current Architecture Pain Points:
Database Writes: Every reset request writes to DB (token creation)
Database Reads: Every token validation reads from DB
Email Delays: Synchronous email sending blocks response
No Geographic Distribution: Asian users hit US servers
Design Fix #5: Distributed Architecture
You redesign the system:
graph TB
subgraph "Frontend Layer"
A[Load Balancer]
A --> B[API Server 1]
A --> C[API Server 2]
A --> D[API Server N]
end
subgraph "Cache Layer"
E[Redis Cluster]
F[Rate Limit Counters]
G[Token Cache]
end
subgraph "Processing Layer"
H[Message Queue Kafka]
I[Worker 1: Email Sender]
J[Worker 2: Token Validator]
K[Worker 3: Analytics]
end
subgraph "Storage Layer"
L[(Primary DB)]
M[(Read Replica 1)]
N[(Read Replica 2)]
end
B --> E
C --> F
D --> G
B --> H
C --> H
D --> H
H --> I
H --> J
H --> K
I --> L
J --> M
K --> N
Architectural Changes:
Async Email Sending:
sequenceDiagram
participant User
participant API
participant Queue
participant Worker
participant Email
User->>API: Request password reset
API->>Queue: Publish reset event
API-->>User: 200 OK (instant response)
Note over Queue: Event waits in queue
Worker->>Queue: Poll for events
Queue-->>Worker: Reset event
Worker->>Email: Send email
Email-->>Worker: Sent
Benefits:
User gets instant response (doesn't wait for email to send)
Email failures don't affect API response
Can retry failed emails
Can batch emails for efficiency
Trade-offs:
More complex to debug (distributed tracing needed)
Can't tell user immediately if email failed
Need to handle queue failures
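The publish-then-return pattern in the sequence diagram can be sketched with the standard library. Here `queue.Queue` and a thread stand in for Kafka and a separate worker process, and the function names and retry count are assumptions made for illustration.

```python
import queue
import threading


def enqueue_reset(events, email, token):
    """API handler: publish the event and return before any email is sent."""
    events.put({"email": email, "token": token})
    return 200


def email_worker(events, send_email, max_attempts=3):
    """Consume reset events until a None sentinel arrives; retry failed sends."""
    while True:
        event = events.get()
        if event is None:  # shutdown signal
            events.task_done()
            break
        for attempt in range(1, max_attempts + 1):
            try:
                send_email(event["email"], event["token"])
                break
            except ConnectionError:
                if attempt == max_attempts:
                    print(f"giving up on {event['email']}")
        events.task_done()


def start_worker(events, send_email):
    """Run the worker in a background thread and return the thread."""
    worker = threading.Thread(target=email_worker, args=(events, send_email))
    worker.start()
    return worker
```

The handler's return value says nothing about delivery, which is exactly the second trade-off listed above.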
Month 2: The SIM Swap Attack
You added SMS two-factor authentication. Users love it—until someone's phone number gets stolen.
The Attack:
sequenceDiagram
participant Attacker
participant TelecomCompany
participant Victim
participant YourSystem
Attacker->>TelecomCompany: Social engineering: "I lost my SIM"
TelecomCompany->>TelecomCompany: Ports number to new SIM
Attacker->>YourSystem: Reset password for victim
YourSystem->>Attacker: SMS code sent to victim's number
Note over Attacker: Attacker now receives the SMS
Attacker->>YourSystem: Submit SMS code
YourSystem->>Attacker: Access granted
Attacker->>YourSystem: Change password
Victim->>YourSystem: Try to login - LOCKED OUT
The Problem: SMS isn't as secure as you thought. Carriers often have weak identity verification for SIM replacement, and attackers exploit this.
Design Fix #6: Risk-Based Authentication
Different users get different security requirements based on risk signals:
graph TD
A[Password Reset Request] --> B[Risk Assessment Engine]
B --> C{Calculate Risk Score}
C --> D[Check: Location]
C --> E[Check: Device Fingerprint]
C --> F[Check: Time of Day]
C --> G[Check: Recent Activity]
C --> H[Check: IP Reputation]
D --> I{Risk Level}
E --> I
F --> I
G --> I
H --> I
I -->|Low 0-30| J[Email Only]
I -->|Medium 31-60| K[Email + SMS]
I -->|High 61-80| L[Email + SMS + Security Questions]
I -->|Critical 81-100| M[Block + Manual Review]
style J fill:#ccffcc
style K fill:#ffffcc
style L fill:#ffddcc
style M fill:#ffcccc
Risk Signals:
| Signal | Why It Matters | Weight |
|---|---|---|
| New location | User never logged in from this city before | +30 |
| VPN/Proxy | Hiding real location | +20 |
| Suspicious IP | Known bad actor IP | +40 |
| Unusual time | User normally active 9am-5pm, request at 3am | +15 |
| Recent resets | Multiple resets in 24 hours | +25 |
| Device match | Same device/browser as usual | -20 |
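The table and the risk bands in the diagram translate into a short additive scorer. This is a sketch under assumptions: the signal keys and return labels are hypothetical names, and clamping the total to 0-100 is an added choice so that it fits the bands.

```python
# Weights taken from the risk signals table.
WEIGHTS = {
    "new_location": 30,
    "vpn_or_proxy": 20,
    "suspicious_ip": 40,
    "unusual_time": 15,
    "recent_resets": 25,
    "device_match": -20,
}


def risk_score(signals):
    """Sum the weights of the signals present, clamped to 0-100."""
    total = sum(WEIGHTS[s] for s in signals)
    return max(0, min(100, total))


def required_verification(score):
    """Map a score to the bands from the diagram."""
    if score <= 30:
        return "email"
    if score <= 60:
        return "email+sms"
    if score <= 80:
        return "email+sms+security_questions"
    return "block+manual_review"
```

For example, a request from a new location over a VPN scores 50 and lands in the medium band.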
New Infrastructure Needed:
Geolocation service
Device fingerprinting
Behavioral analytics
Machine learning model for risk scoring
Trade-off #5: Privacy vs Security. You're now tracking user behavior patterns.
Month 3: The Distributed Systems Nightmare
You're now running in 5 data centers across 3 continents. A user in Tokyo requests a reset, the email gets sent from Singapore, they click the link and hit a server in California.
System Design Problem: Distributed State
graph TD
A[User in Tokyo] --> B[Closest Server: Singapore]
B --> C[Creates Token]
C --> D[Writes to Singapore DB]
E[User clicks link] --> F[Closest Server: California]
F --> G[Reads from California DB]
D -.Replication Delay 2 seconds.-> G
H{Token found?} --> I[No - Replication hasn't finished]
I --> J[User sees: Invalid token]
style I fill:#ffcccc
style J fill:#ffcccc
The CAP Theorem Strikes:
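The popular summary is that you can only pick 2 of 3 (across data centers, partitions are a given, so the real choice is between the first two):
Consistency: Every read sees the latest write
Availability: Every request gets a response
Partition Tolerance: System keeps working when messages between nodes are lost or delayed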
graph TD
A{CAP Theorem Choice} --> B[Consistency + Availability]
A --> C[Consistency + Partition Tolerance]
A --> D[Availability + Partition Tolerance]
B --> E[Single datacenter only]
C --> F[Strong consistency, but downtime during network issues]
D --> G[Eventually consistent, always available]
E --> H[Not an option - you're global]
F --> I[Not acceptable - users expect high uptime]
G --> J[This is your reality]
style H fill:#ffcccc
style I fill:#ffcccc
style J fill:#ffffcc
Design Fix #7: Eventually Consistent Architecture
You embrace eventual consistency:
sequenceDiagram
participant User
participant Singapore
participant Queue
participant California
participant Tokyo
User->>Singapore: Request reset
Singapore->>Singapore: Generate token
Singapore->>Queue: Publish "token created" event
Queue-->>California: Replicate event
Queue-->>Tokyo: Replicate event
Note over Singapore,Tokyo: All DCs eventually have the token
User->>California: Click reset link (2 seconds later)
California->>California: Check local cache first
California->>California: Found in cache!
California->>User: Valid token, proceed
Key Strategies:
Write to local region, replicate globally
Cache aggressively (Redis with short TTL)
Accept eventual consistency (token might take 1-2 seconds to propagate)
Keep the 1-hour expiry and retry a missed lookup against the region that issued the token (a 1-2 second delay does not call for a 24-hour token)
Trade-off #6: Consistency vs Latency. Users in distant regions might wait slightly longer.
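A minimal sketch of local-first validation with a fallback to the issuing region, which is one way to hide the propagation window. The `Region` class and function names are hypothetical, dicts stand in for each region's database and cache, and the sketch assumes the validating server can tell which region issued the token (for example, from the link itself).

```python
class Region:
    def __init__(self, name):
        self.name = name
        self.tokens = {}  # token -> user_id, this region's local copy


def create_token(origin, token, user_id):
    """Write to the local region only; replication happens later."""
    origin.tokens[token] = user_id


def replicate(origin, others):
    """Stand-in for the asynchronous 'token created' event reaching other regions."""
    for region in others:
        region.tokens.update(origin.tokens)


def validate(token, local, origin):
    """Check the local copy first; on a miss, ask the origin before rejecting."""
    if token in local.tokens:
        return local.tokens[token]
    return origin.tokens.get(token)  # slower cross-region read, only on a miss
```

The cross-region read costs latency only for the rare user who clicks before replication finishes.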
Month 6: The Account Takeover Wave
Despite all your defenses, accounts are still getting compromised. Attackers evolved:
Modern Attack Vectors
graph TD
A[Attacker Goals] --> B[Phishing]
A --> C[Credential Stuffing]
A --> D[Social Engineering]
A --> E[Malware]
B --> F[Fake reset page steals token]
C --> G[Reused passwords from other breaches]
D --> H[Trick support into resetting]
E --> I[Keylogger captures new password]
F --> J[Your reset system didn't fail...]
G --> J
H --> J
I --> J
J --> K[...the ecosystem did]
style K fill:#ffcccc
The Realization: Your password reset system is secure, but passwords themselves are the problem.
Design Fix #8: Moving Beyond Passwords
You start implementing passwordless alternatives:
Option 1: Magic Links
sequenceDiagram
participant User
participant System
participant Email
User->>System: I want to login
System->>System: Generate one-time login token
System->>Email: Send magic link
Email->>User: Click to login (no password needed)
User->>System: Clicks link
System->>User: Logged in!
Note over System: Token expires after use or 15 minutes
Option 2: Passkeys (WebAuthn)
graph LR
A[User wants to login] --> B[System: Use your fingerprint]
B --> C[User's device verifies biometric]
C --> D[Device signs challenge with private key]
D --> E[System verifies signature]
E --> F[Logged in!]
style F fill:#ccffcc
G[No password involved] --> H[Nothing to reset]
H --> I[Nothing to phish]
I --> J[Nothing to breach]
style J fill:#ccffcc
Benefits of Passwordless:
No passwords to forget
No passwords to reset
Phishing becomes much harder
Credential stuffing stops working (there is no password to reuse)
Better security AND better UX
Challenges:
Requires modern devices/browsers
User education needed
Fallback mechanisms still required
Device loss scenarios need handling
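The magic-link option (Option 1 above) can be sketched as a one-time token that expires after use or after 15 minutes. The function names, the `example.com/login` URL, and the dict standing in for token storage are assumptions for illustration.

```python
import secrets
import time

MAGIC_LINK_TTL_SECONDS = 15 * 60


def create_login_link(pending, email, now=None):
    """Generate a one-time login token and return the link to email."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    pending[token] = (email, now + MAGIC_LINK_TTL_SECONDS)
    return f"https://example.com/login?token={token}"


def redeem(pending, token, now=None):
    """Return the email to log in, or None. pop() makes the token single-use."""
    now = time.time() if now is None else now
    entry = pending.pop(token, None)
    if entry is None:
        return None
    email, expires_at = entry
    return email if now < expires_at else None
```

Structurally this is the reset flow again, minus the step where the user picks a new password.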
Month 12: The Reality Check
You've been building password reset functionality for a year. Here's what you've created:
The Final Architecture
graph TB
subgraph "Edge Layer"
CDN[CDN / Edge Cache]
WAF[Web Application Firewall]
end
subgraph "API Layer"
LB[Load Balancer]
API1[API Server]
API2[API Server]
API3[API Server]
end
subgraph "Intelligence Layer"
RISK[Risk Assessment Engine]
FRAUD[Fraud Detection ML]
GEO[Geolocation Service]
end
subgraph "State Management"
REDIS[Redis Cluster]
QUEUE[Kafka Message Queue]
end
subgraph "Processing Workers"
W1[Email Worker]
W2[SMS Worker]
W3[Notification Worker]
W4[Analytics Worker]
end
subgraph "Storage Layer"
DB_PRIMARY[(Primary PostgreSQL)]
DB_REPLICA1[(Read Replica 1)]
DB_REPLICA2[(Read Replica 2)]
DB_ANALYTICS[(Analytics DB)]
end
subgraph "External Services"
EMAIL[SendGrid]
SMS[Twilio]
MONITORING[DataDog]
end
CDN --> WAF
WAF --> LB
LB --> API1
LB --> API2
LB --> API3
API1 --> RISK
API2 --> FRAUD
API3 --> GEO
API1 --> REDIS
API2 --> REDIS
API3 --> REDIS
API1 --> QUEUE
API2 --> QUEUE
API3 --> QUEUE
QUEUE --> W1
QUEUE --> W2
QUEUE --> W3
QUEUE --> W4
W1 --> EMAIL
W2 --> SMS
W1 --> DB_PRIMARY
W2 --> DB_PRIMARY
W3 --> DB_PRIMARY
W4 --> DB_ANALYTICS
DB_PRIMARY -.Replication.-> DB_REPLICA1
DB_PRIMARY -.Replication.-> DB_REPLICA2
W1 --> MONITORING
W2 --> MONITORING
API1 --> MONITORING
What You've Learned
The Cost of "Simple" Features:
| Metric | Start | Now |
|---|---|---|
| Lines of Code | 50 | 12,000+ |
| Services | 1 | 8 |
| Databases | 1 | 4 |
| Team Members | You | 3 engineers full-time |
| Infrastructure Cost | $5/month | $2,500/month |
| Support Tickets | 0/day | 15/day |
The Trade-offs You Made:
graph LR
A[Security] -.vs.-> B[Usability]
C[Speed] -.vs.-> D[Consistency]
E[Privacy] -.vs.-> F[Fraud Detection]
G[Simplicity] -.vs.-> H[Features]
I[Cost] -.vs.-> J[Reliability]
style A fill:#ffd6d6
style B fill:#d6f5d6
style C fill:#d6f5d6
style D fill:#ffd6d6
style E fill:#ffd6d6
style F fill:#d6f5d6
style G fill:#ffd6d6
style H fill:#d6f5d6
style I fill:#ffd6d6
style J fill:#d6f5d6
The Lessons
1. There Are No Simple Features at Scale
What seems simple has hidden complexity:
User expectations (fast, reliable, always works)
Attacker sophistication (always evolving)
Edge cases (the user in Antarctica on a satellite connection)
Regulations (GDPR, data residency, accessibility)
2. Every Security Layer Adds Friction
graph TD
A[More Security] --> B[More Steps]
B --> C[More Confusion]
C --> D[More Support Tickets]
D --> E[Higher Costs]
E --> F[User Frustration]
F --> G[Users Choose Competitors]
style G fill:#ffcccc
The trick is finding the balance. Not maximum security, but optimal security.
3. Distributed Systems Are Hard
You thought you were building a password reset feature. You actually built:
A distributed state machine
An event-driven architecture
A real-time risk assessment engine
A multi-region data replication system
4. The Best Reset Flow is No Reset Flow
The industry is moving toward:
Passkeys (public-key credentials unlocked with a biometric or PIN)
Magic links (one-time email tokens)
Hardware tokens (YubiKeys)
Single Sign-On (let someone else handle it)
Because the ultimate truth: passwords are the problem, not password resets.
The Modern Solution: Risk-Adaptive Security
Today's best systems don't apply the same security to everyone. They adapt:
graph TD
A[Password Reset Request] --> B[Risk Analysis]
B --> C{User Signals}
C --> D[Known Device + Normal Location]
C --> E[Unknown Device OR Unusual Location]
C --> F[Multiple Red Flags]
D --> G[Low Friction: Email only]
E --> H[Medium Friction: Email + SMS]
F --> I[High Friction: Multiple verifications]
G --> J[Reset in 2 minutes]
H --> K[Reset in 5 minutes]
I --> L[Reset in 15 minutes + possible manual review]
style G fill:#ccffcc
style H fill:#ffffcc
style I fill:#ffcccc
The Philosophy:
Trust your legitimate users
Make attackers' lives hard
Use data to tell them apart
Looking Forward
The future of account recovery isn't better password resets. It's:
Eliminating passwords entirely (passkeys, biometrics)
Continuous authentication (behavior-based trust scores)
Decentralized identity (blockchain-based identity)
AI-powered fraud detection (real-time risk assessment)
But for now, we're stuck in the middle. Passwords are dying, but they're not dead yet.
So we keep building more complex reset flows, adding more steps, more verifications, more friction.
All to protect users from a threat that exists because we're still using a technology from the 1960s: the password.
Key Takeaways for System Design
If you're building a password reset system today:
Start Simple, Add Complexity Only When Needed:
Begin with basic token-based reset
Add rate limiting when you see abuse
Add 2FA when risk increases
Add ML when scale demands it
Think in Layers:
graph TD
A[Layer 1: Basic Token] --> B[Layer 2: Rate Limiting]
B --> C[Layer 3: Risk Assessment]
C --> D[Layer 4: Multi-Factor Auth]
D --> E[Layer 5: Fraud Detection]
F[Each layer is optional] --> G[Add based on your threat model]
Measure Everything:
Success rate (how many resets complete?)
Time to complete (how long does it take?)
Drop-off points (where do users abandon?)
False positive rate (how many legitimate users blocked?)
Support ticket volume (how much manual intervention?)
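Several of these numbers fall out of a simple funnel calculation. The sketch below assumes each reset attempt is recorded with the last step it reached and, if completed, the elapsed seconds; the step names and record shape are hypothetical.

```python
from statistics import median

STEPS = ["requested", "email_opened", "link_clicked", "password_set"]


def funnel_metrics(attempts):
    """attempts: dicts with 'last_step' (one of STEPS) and 'seconds' (or None)."""
    if not attempts:
        return {"success_rate": None, "median_seconds": None, "drop_offs": {}}
    completed = [a for a in attempts if a["last_step"] == "password_set"]
    drop_offs = {step: 0 for step in STEPS[:-1]}
    for a in attempts:
        if a["last_step"] != "password_set":
            drop_offs[a["last_step"]] += 1  # where the user abandoned the flow
    return {
        "success_rate": len(completed) / len(attempts),
        "median_seconds": median(a["seconds"] for a in completed) if completed else None,
        "drop_offs": drop_offs,
    }
```

A spike in drop-offs at a single step usually points at the layer that was added most recently.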
Plan Your Evolution:
MVP: Email-based tokens
Phase 2: Rate limiting + token security
Phase 3: SMS 2FA for high-value accounts
Phase 4: Risk-based authentication
Phase 5: Passwordless alternatives
Remember: The goal isn't to build the most secure system possible. It's to build a system that balances security, usability, cost, and reliability for YOUR specific threat model and user base.
A banking app needs different security than a recipe website.
Know your risks. Design accordingly.