Systematic Debugging

Debugging: reproduce, isolate, diagnose, fix, verify. Replaces ad-hoc “try random things” with systematic root cause analysis.

When to Use

Investigating bugs or unexpected behavior
Test failures with unclear causes
Runtime errors (crashes, exceptions, wrong output)
Performance issues (slow, memory leaks)
User reports “it doesn’t work”

Don’t use for:

Writing new code (use technology-specific skills)
Code review (use critical-partner)
Planning (use brainstorming)

Critical Patterns

✅ REQUIRED: Reproduce First

Never guess at causes. Reproduce the bug reliably before investigating.

## Bug Reproduction

**Reported behavior:** [What user sees]
**Expected behavior:** [What should happen]
**Steps to reproduce:**
1. [Step 1]
2. [Step 2]
3. [Step 3 — bug occurs]

**Reproducible?** Yes/No/Intermittent
**Environment:** [OS, browser, Node version, etc.]

If bug is intermittent: look for race conditions, timing issues, or state-dependent behavior.

✅ REQUIRED: Read the Error Message Completely

// ❌ WRONG: "I got an error" (ignoring the message)
// TypeError: Cannot read properties of undefined (reading 'map')
// → This tells you EXACTLY what's wrong: something is undefined

// ✅ CORRECT: Parse the error message
// 1. Error type: TypeError (type-related, not logic)
// 2. What: Cannot read properties of undefined
// 3. Specifically: reading 'map'
// 4. Therefore: the variable before .map() is undefined
// 5. Question: WHY is it undefined? (check data flow)

✅ REQUIRED: Isolate the Problem

Narrow down where the bug lives using binary search.

## Isolation Strategy

1. **Which layer?** UI / API / Database / External service
2. **Which file?** Add console.log at entry/exit points
3. **Which function?** Step through with debugger
4. **Which line?** Narrow to exact statement
5. **Which value?** Log inputs and outputs

**Binary search**: If 10 possible causes, test the middle one first.
If it passes → bug is in the second half. If it fails → first half.
Repeat until isolated.

✅ REQUIRED: Check Assumptions (“Excuse vs Reality”)

Common debugging traps — what developers assume vs what’s actually true.

Developer Says	Reality Check
”The data is correct”	Log the actual data. Is it what you expect?
”This function is being called”	Add a log. Is it really being called? With correct args?
”The API returns the right data”	Check the actual response. Status code? Body? Headers?
”Nothing changed”	Check git diff. Something always changed
”It works on my machine”	Check environment differences (versions, config, data)
“The type is correct”	Log `typeof` and actual value. TypeScript types are compile-time only
”The state is updated”	State updates are async in React. Log in useEffect, not after setState

✅ REQUIRED: Fix and Verify

After finding root cause, fix it and verify the fix works.

## Fix Verification

**Root cause:** [One sentence explaining why the bug occurred]
**Fix:** [What was changed and why]
**Verification:**
- [ ] Original reproduction steps no longer trigger the bug
- [ ] Related functionality still works (regression check)
- [ ] Test added to prevent recurrence
- [ ] Edge cases considered

❌ NEVER: Apply Random Fixes

// ❌ WRONG: "Let me try adding a null check everywhere"
if (data) { if (data.items) { if (data.items.length) { /* ... */ } } }

// ✅ CORRECT: Understand WHY data is undefined
// Question: Where should data come from?
// Answer: API call in useEffect
// Investigation: Is the API call failing? Is state not updating?
// Root cause: API call has wrong endpoint
// Fix: Correct the endpoint (don't add defensive null checks)

Decision Tree

Error message present?
  → Read it completely. Parse type, what, and where.
  → See references/common-mistakes.md for common error patterns

No error, wrong behavior?
  → Compare expected vs actual output
  → Log intermediate values to find where divergence starts

Intermittent bug?
  → Race condition? (async operations, event ordering)
  → State-dependent? (different starting conditions)
  → Environment-dependent? (OS, browser, data)

Performance issue?
  → Profile first (don't guess)
  → Measure before and after any change

After finding root cause?
  → Fix the root cause, not the symptom
  → Add test to prevent recurrence
  → Document for team knowledge

Example

Debugging “user can’t log in” using the systematic approach.

## Bug Reproduction
**Reported behavior:** User enters correct credentials → "Invalid password" error
**Expected behavior:** User is logged in and redirected to dashboard
**Steps to reproduce:**
1. Navigate to /login
2. Enter registered email + correct password → error shown
**Reproducible?** Yes, consistently
**Environment:** Production only; works in dev

## Isolation Strategy
Layer-by-layer:
1. UI layer OK — form submits the right values (verified via Network tab)
2. API layer: POST /auth/login returns 401 → narrows to backend

Add log in AuthService:
  console.log("stored hash:", user.passwordHash);
  console.log("compare result:", await bcrypt.compare(password, user.passwordHash));
  → compare result: false  ← bug confirmed here

## Check Assumptions
"The password hash is correct" → LOG IT
  Actual stored hash: "$2b$12$OLD_ROUNDS..." (bcrypt rounds mismatch)
  Dev DB: rounds=10; Prod DB: rounds=12 — hashes generated with different round counts

## Root Cause
Password hashes in production were generated during a bcrypt config change.
Existing users have hashes with old round count; new bcrypt.compare uses new count.

## Fix Verification
- [ ] Re-hash passwords on next successful login (progressive migration)
- [ ] Verify login works for affected accounts in staging
- [ ] Add test: login with hash from each supported round count
- [ ] Edge case: users who haven't logged in since migration get password reset email

Edge Cases

Third-party library bug: Verify with minimal reproduction. Check GitHub issues. Consider workaround vs upgrade vs fork.

Heisenbug (disappears when observed): Logging or debugger changes timing. Use non-intrusive logging or production telemetry.

Works in dev, fails in prod: Check env vars, API endpoints, CORS, minification, tree-shaking.

Flaky tests: Usually timing-dependent. Add proper waits, mock time, or use deterministic test data.

Checklist

Bug reproduced reliably with clear steps
Error message read completely and parsed
Problem isolated to specific layer/file/function
Assumptions verified (not guessed)
Root cause identified (not just symptom)
Fix addresses root cause
Regression test added
Fix verified with original reproduction steps

Resources

debugging-strategies.md — Detailed debugging techniques per technology
common-mistakes.md — Common error patterns and their causes
critical-partner — For reviewing fix quality