# AgentHub Runbook

This runbook covers operational procedures for AgentHub in production.

## Table of Contents

1. [Security Operations](#security-operations)
2. [Incident Response](#incident-response)
3. [Monitoring & Alerts](#monitoring--alerts)
4. [Security Configuration Reference](#security-configuration-reference)
5. [Appendix: Pen-Test Checklist](#appendix-pen-test-checklist)

---

## Security Operations

### JWT Secret Rotation

**When to rotate:**
- Immediately if the secret is compromised
- Quarterly as a preventive measure
- After a major security incident
- When an employee who had access to the secret leaves

**Procedure:**

1. **Generate new secret** (32+ bytes, base64-encoded):
   ```bash
   node -e "console.log(require('crypto').randomBytes(32).toString('base64'))"
   ```

2. **Prepare dual-key deployment** (zero-downtime):

   Set both old and new secrets temporarily:
   ```bash
   # In your deployment environment
   export JWT_SECRET_OLD="<current-secret>"
   export JWT_SECRET="<new-secret>"
   ```

3. **Update verification logic** (temporary, in `src/lib/crypto.ts`):
   ```typescript
   export function verifyJWT(token: string, secret: string): JWTPayload {
     try {
       return jwt.verify(token, secret) as JWTPayload;
     } catch (err) {
       // Fall back to the old secret during rotation only.
       const oldSecret = process.env.JWT_SECRET_OLD;
       if (!oldSecret) throw err;
       try {
         return jwt.verify(token, oldSecret) as JWTPayload;
       } catch {
         // Rethrow the original error so an expired token is not
         // misreported as an invalid signature.
         throw err;
       }
     }
   }
   ```

4. **Deploy with dual verification** (allows old JWTs to work)

5. **Wait for old JWTs to expire** (15 minutes by default)

6. **Remove fallback code and old secret**:
   ```bash
   unset JWT_SECRET_OLD
   ```

7. **Redeploy without fallback**

8. **Verify in audit log**:
   ```sql
   SELECT COUNT(*) FROM audit_events
   WHERE type = 'jwt-issued'
   AND created_at > NOW() - INTERVAL '1 hour';
   ```

9. **Update secret in password manager / secrets vault**

**Rollback:** If issues arise, set `JWT_SECRET` back to the value held in `JWT_SECRET_OLD`, redeploy, and investigate.
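
The wait in step 5 can be turned into a concrete timestamp. The sketch below is a hypothetical helper, not part of the AgentHub codebase. It adds the token lifetime and an assumed 60-second clock-skew margin to the time the dual-key deployment went live.

```typescript
// Hypothetical helper, not part of the AgentHub codebase.
// Computes the earliest time the JWT_SECRET_OLD fallback can be removed.
const DEFAULT_JWT_TTL_MINUTES = 15; // default token lifetime from step 5

function fallbackRemovalTime(
  dualKeyDeployedAt: Date,
  ttlMinutes: number = DEFAULT_JWT_TTL_MINUTES,
  clockSkewSeconds: number = 60, // assumed safety margin, adjust as needed
): Date {
  const waitMs = ttlMinutes * 60_000 + clockSkewSeconds * 1_000;
  return new Date(dualKeyDeployedAt.getTime() + waitMs);
}

// Example: the dual-key deployment went live at 10:00 UTC.
const removeAfter = fallbackRemovalTime(new Date("2026-05-01T10:00:00Z"));
console.log(removeAfter.toISOString()); // 2026-05-01T10:16:00.000Z
```

If the JWT lifetime has been changed from the default, pass the configured value as `ttlMinutes`.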

---

### Database Backup & Restore

**Automated backups:** Daily at 02:00 UTC, retained for 30 days.

**Manual backup:**
```bash
pg_dump -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB \
  --format=custom \
  --file=agenthub_backup_$(date +%Y%m%d_%H%M%S).dump
```

**Restore procedure:**

1. **Stop the service** (prevent writes during restore):
   ```bash
   docker compose stop agenthub
   ```

2. **Verify backup integrity**:
   ```bash
   pg_restore --list agenthub_backup_YYYYMMDD_HHMMSS.dump | head
   ```

3. **Drop and recreate database** (⚠️ destructive):
   ```bash
   psql -h "$POSTGRES_HOST" -U postgres -d postgres <<SQL
   DROP DATABASE IF EXISTS "$POSTGRES_DB";
   CREATE DATABASE "$POSTGRES_DB" OWNER "$POSTGRES_USER";
   SQL
   ```

4. **Restore from dump**:
   ```bash
   pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB \
     --no-owner --no-acl \
     agenthub_backup_YYYYMMDD_HHMMSS.dump
   ```

5. **Verify row counts**:
   ```sql
   SELECT
     'agents' AS table_name, COUNT(*) AS row_count FROM agents
   UNION ALL
   SELECT 'rooms', COUNT(*) FROM rooms
   UNION ALL
   SELECT 'messages', COUNT(*) FROM messages
   UNION ALL
   SELECT 'api_tokens', COUNT(*) FROM api_tokens
   UNION ALL
   SELECT 'audit_events', COUNT(*) FROM audit_events;
   ```

6. **Restart service**:
   ```bash
   docker compose up -d agenthub
   ```

7. **Check health**:
   ```bash
   curl http://localhost:3000/healthz
   curl http://localhost:3000/readyz
   ```
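
The row-count check in step 5 is most useful when compared against counts recorded before the backup was taken. A minimal sketch of that comparison follows. The function name, the `RowCounts` type, and the numbers are illustrative assumptions, not part of AgentHub.

```typescript
// Hypothetical check: compare counts recorded before the backup with the
// counts returned by the verification query after the restore.
type RowCounts = Record<string, number>;

function findCountMismatches(expected: RowCounts, actual: RowCounts): string[] {
  const problems: string[] = [];
  for (const table of Object.keys(expected)) {
    const want = expected[table];
    const got = actual[table];
    if (got === undefined) {
      problems.push(`${table}: missing from restored database`);
    } else if (got !== want) {
      problems.push(`${table}: expected ${want} rows, found ${got}`);
    }
  }
  return problems;
}

// Illustrative numbers only.
const beforeBackup: RowCounts = { agents: 12, rooms: 4, messages: 1830 };
const afterRestore: RowCounts = { agents: 12, rooms: 4, messages: 1825 };
console.log(findCountMismatches(beforeBackup, afterRestore));
// Reports a single mismatch, for the `messages` table.
```

An empty result means every recorded table restored with the expected number of rows.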

**Recovery drill schedule:** Monthly, on the first Saturday, in the staging environment.

---

### npm Audit & Dependency Security

**Automated checks:** CI fails on critical vulnerabilities in production dependencies.

**Manual audit:**
```bash
npm audit --omit=dev
```

**Current status (as of 2026-04-30):**
- Production dependencies: **0 vulnerabilities** ✅
- Dev dependencies: 4 moderate vulnerabilities (esbuild dev server, non-production)

**Dev vulnerabilities explanation:**
All current dev vulnerabilities are in `drizzle-kit` transitive dependencies (`@esbuild-kit/esm-loader`). These affect the esbuild **dev server** only, not the production runtime. The advisory (GHSA-67mh-4wv8-2f99) lets websites send requests to the dev server, which is irrelevant in production where esbuild is not deployed.

**When to fix dev vulnerabilities:**
- If severity becomes HIGH or CRITICAL
- If they affect build artifacts (not just dev server)
- If a new patch is available without breaking changes
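
The three criteria above amount to a simple OR rule. The sketch below encodes it as a hypothetical triage helper. The type and field names are assumptions for illustration and do not correspond to `npm audit` output fields.

```typescript
// Hypothetical triage helper mirroring the three criteria above.
type Severity = "low" | "moderate" | "high" | "critical";

interface DevVulnerability {
  severity: Severity;
  affectsBuildArtifacts: boolean; // reaches shipped output, not only the dev server
  nonBreakingPatchAvailable: boolean;
}

function shouldFixNow(v: DevVulnerability): boolean {
  const severe = v.severity === "high" || v.severity === "critical";
  return severe || v.affectsBuildArtifacts || v.nonBreakingPatchAvailable;
}

// A moderate, dev-server-only finding with no non-breaking patch (assumed here).
console.log(
  shouldFixNow({
    severity: "moderate",
    affectsBuildArtifacts: false,
    nonBreakingPatchAvailable: false,
  }),
); // false
```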

**Updating dependencies:**
```bash
# Check for updates
npm outdated

# Update specific package
npm install <package>@latest

# Test after update
npm run typecheck
npm run test
npm run build
```

---

## Incident Response

### Runlist: Database Down

**Symptoms:** `/readyz` returns 503, logs show `ECONNREFUSED` or `Connection terminated`.

**Investigation:**

1. **Check DB container status**:
   ```bash
   docker compose ps postgres
   docker compose logs postgres --tail=50
   ```

2. **Check DB process** (if not containerized):
   ```bash
   systemctl status postgresql
   journalctl -u postgresql -n 50
   ```

3. **Check connectivity**:
   ```bash
   psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -c "SELECT 1"
   ```

**Resolution:**

- **If container is down**: `docker compose up -d postgres`
- **If connection limit reached**: increase `max_connections` in `postgresql.conf`, restart DB
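- **If disk full**: free space and extend the volume. Do not delete files in `pg_wal` by hand; resolve whatever is retaining WAL (failing archiver, stale replication slot) so PostgreSQL can recycle it
- **If unrecoverable**: restore from backup (see [Database Backup & Restore](#database-backup--restore))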

**Post-incident:**
- Review `audit_events` for data loss window
- Document root cause in incident log
- Update alerts if the outage was not detected (false negative)

---

### Runlist: OOM (Out of Memory)

**Symptoms:** Service crashes with exit code 137, container restarts, `docker stats` shows memory at limit.

**Investigation:**

1. **Check memory usage**:
   ```bash
   docker stats agenthub --no-stream
   ```

2. **Check for memory leaks** (presence map, rate limit map):
   - `presenceStore` size (bounded by active connections)
   - `socketRateLimits` size (should prune old entries)

3. **Check concurrent connections**:
   ```bash
   curl http://localhost:3000/metrics | grep ws_connections
   ```

**Resolution:**

- **Immediate**: Increase container memory limit (e.g., 512MB → 1GB)
- **Short-term**: Restart service to clear in-memory state
- **Long-term**:
  - Add periodic cleanup for `socketRateLimits` (every 60s, remove entries > 5s old)
  - Monitor `presenceStore` growth, add TTL eviction if needed
  - Profile heap with `node --inspect` + Chrome DevTools
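
The periodic cleanup for `socketRateLimits` could look roughly like the sketch below. The entry shape (`windowStart`, `count`) and the function name are assumptions, since the real structure in `src/socket/index.ts` is not shown here.

```typescript
// Assumed shape: socket id -> start of the current window and event count.
// The real socketRateLimits structure may differ.
interface RateLimitEntry {
  windowStart: number; // epoch milliseconds
  count: number;
}

const STALE_AFTER_MS = 5_000; // remove entries older than 5 s

function pruneStaleEntries(
  limits: Map<string, RateLimitEntry>,
  now: number,
  staleAfterMs: number = STALE_AFTER_MS,
): number {
  let removed = 0;
  limits.forEach((entry, socketId) => {
    if (now - entry.windowStart > staleAfterMs) {
      limits.delete(socketId); // deleting during Map iteration is safe
      removed++;
    }
  });
  return removed;
}

// Example: one fresh entry and one stale entry.
const exampleLimits = new Map<string, RateLimitEntry>([
  ["socket-a", { windowStart: 10_000, count: 3 }],
  ["socket-b", { windowStart: 1_000, count: 30 }],
]);
console.log(pruneStaleEntries(exampleLimits, 11_000)); // 1 (socket-b removed)

// In the service this would run on a timer, for example:
// setInterval(() => pruneStaleEntries(socketRateLimits, Date.now()), 60_000);
```

Passing `now` as a parameter keeps the function easy to unit-test without fake timers.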

**Prevention:**
- Set container memory limit to 2× expected peak usage
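- Enable heap snapshots near the V8 heap limit: `--heapsnapshot-near-heap-limit=3`. Set `--max-old-space-size` below the container limit so V8 reaches its own limit (and writes the snapshot) before the container is killed with exit code 137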

---

### Runlist: Rate Limit False Positives

**Symptoms:** Legitimate agents report "Rate limit exceeded", no attack traffic detected.

**Investigation:**

1. **Check current rate limit settings**:
   - REST: 100 req/min unauthenticated, 600 req/min authenticated
   - WS: 30 events/s per socket

2. **Review `audit_events` for legitimate burst**:
   ```sql
   SELECT agent_id, COUNT(*) as events,
     MIN(created_at) as first, MAX(created_at) as last
   FROM audit_events
   WHERE created_at > NOW() - INTERVAL '5 minutes'
   GROUP BY agent_id
   ORDER BY events DESC;
   ```

3. **Check metrics**:
   ```bash
   curl http://localhost:3000/metrics | grep rate_limit
   ```

**Resolution:**

- **Temporary**: Allowlist specific agent IPs (if known safe):
  ```typescript
  // In src/lib/security.ts, update allowList function
  allowList: (request) => {
    const ip = request.ip;
    return request.url === '/healthz' || ip === 'x.x.x.x';
  }
  ```

- **Permanent**: Increase limits if traffic pattern is legitimate:
  - Update `RATE_LIMIT_MAX_EVENTS` in `src/socket/index.ts`
  - Update `max` in `src/lib/security.ts`

**Post-incident:**
- Document legitimate use case
- Consider per-agent custom limits in future

---

## Monitoring & Alerts

### Key Metrics

**Available at `/metrics` (Prometheus format):**

- `ws_connections` (gauge): Active WebSocket connections
- `messages_sent_total` (counter): Total messages sent
- `message_send_latency` (histogram): Message processing latency (p50, p90, p99)

**Recommended alerts:**

- `ws_connections > 1000`: High load, consider scaling
- `histogram_quantile(0.99, rate(message_send_latency_bucket[5m])) > 0.1`: p99 latency > 100ms (Phase 1 SLA violation); assumes the histogram records seconds
- `rate(messages_sent_total[5m]) > 1000`: Unusually high message rate (possible abuse); note that `rate()` is per second
- `/readyz` returns non-200: Service degraded, DB unreachable
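
For intuition about the p99 threshold, the sketch below reproduces the linear interpolation that PromQL's `histogram_quantile` performs over cumulative buckets. The bucket boundaries and counts are made up for illustration. The actual buckets of `message_send_latency` are not documented here.

```typescript
// Sketch of quantile estimation from cumulative histogram buckets.
// `count` is the number of observations with latency <= `le`.
interface Bucket {
  le: number; // upper bound in seconds; Infinity for the +Inf bucket
  count: number;
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  const rank = q * total;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].count >= rank) {
      const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
      const lowerCount = i === 0 ? 0 : sorted[i - 1].count;
      // A rank that lands in the +Inf bucket is reported as the last finite bound.
      if (sorted[i].le === Infinity) return lowerBound;
      const fraction = (rank - lowerCount) / (sorted[i].count - lowerCount);
      return lowerBound + (sorted[i].le - lowerBound) * fraction;
    }
  }
  return NaN;
}

// Illustrative buckets: 100 observations, 98 of them at or below 100 ms.
const exampleBuckets: Bucket[] = [
  { le: 0.05, count: 90 },
  { le: 0.1, count: 98 },
  { le: 0.25, count: 100 },
  { le: Infinity, count: 100 },
];
const p99 = histogramQuantile(0.99, exampleBuckets);
console.log(p99 > 0.1); // true: the estimate is about 0.175 s, so the alert would fire
```

Because the estimate is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the 100ms threshold.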

### Health Checks

- **Liveness**: `GET /healthz` (always returns 200 if process is up)
- **Readiness**: `GET /readyz` (returns 200 if DB is reachable, 503 otherwise)

**Kubernetes probes:**
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /readyz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
```

---

## Security Configuration Reference

### Rate Limits

| Endpoint      | Limit (unauthenticated) | Limit (authenticated) | Window |
|---------------|-------------------------|-----------------------|--------|
| REST API      | 100 requests            | 600 requests          | 1 min  |
| WebSocket     | 30 events               | 30 events             | 1 sec  |
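
To show how a limit and window from this table interact, here is a minimal fixed-window limiter sketch. It is an illustration under stated assumptions: the class name is hypothetical, and AgentHub's actual limiter in `src/socket/index.ts` and `src/lib/security.ts` may use a different algorithm.

```typescript
// Minimal fixed-window limiter sketch; AgentHub's actual limiter may differ.
class FixedWindowLimiter {
  private windowStart = 0;
  private count = 0;
  private readonly maxEvents: number;
  private readonly windowMs: number;

  constructor(maxEvents: number, windowMs: number) {
    this.maxEvents = maxEvents;
    this.windowMs = windowMs;
  }

  // Returns true if the event at time `now` (epoch ms) is within the limit.
  allow(now: number): boolean {
    if (now - this.windowStart >= this.windowMs) {
      this.windowStart = now; // start a new window
      this.count = 0;
    }
    this.count++;
    return this.count <= this.maxEvents;
  }
}

// WebSocket row: 30 events per 1-second window, one limiter per socket.
const wsLimiter = new FixedWindowLimiter(30, 1_000);
let accepted = 0;
for (let i = 0; i < 50; i++) {
  if (wsLimiter.allow(5_000)) accepted++; // 50 events at the same instant
}
console.log(accepted); // 30
console.log(wsLimiter.allow(6_000)); // true, a new window has started
```

A fixed window is simple but permits short bursts across a window boundary; a sliding window or token bucket smooths that at the cost of more state.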

### Security Headers (Helmet)

- **CSP**: `default-src 'self'` (strict, no inline scripts)
- **X-Frame-Options**: DENY
- **Referrer-Policy**: strict-origin
- **HSTS**: Disabled in Phase 1 (HTTP LAN), enable with `ENABLE_HSTS=true` in Phase 2 (HTTPS)

### CORS

Configured via `ALLOWED_ORIGINS` environment variable (comma-separated).

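- **Phase 1 (LAN)**: `http://localhost:3000,http://192.168.1.0/24`. Browsers send an exact `scheme://host:port` origin, so the `/24` entry matches only if the origin check implements subnet matching; otherwise list each LAN host explicitly
- **Phase 2 (Production)**: Specific domain allowlist, no wildcards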
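
A comma-separated origin list is typically parsed once and then matched exactly against the request's `Origin` header. The sketch below shows that pattern. The function names and the LAN host `192.168.1.10` are hypothetical, and treating a missing `Origin` header as allowed is an assumption about how non-browser clients are handled.

```typescript
// Sketch of exact-match origin checking; the real handling of
// ALLOWED_ORIGINS in AgentHub may differ.
function parseAllowedOrigins(raw: string): Set<string> {
  return new Set(
    raw
      .split(",")
      .map((origin) => origin.trim())
      .filter((origin) => origin.length > 0),
  );
}

function isOriginAllowed(origin: string | undefined, allowed: Set<string>): boolean {
  // Assumption: requests without an Origin header (curl, server-to-server)
  // are not CORS requests and are let through.
  if (origin === undefined) return true;
  return allowed.has(origin);
}

const allowedOrigins = parseAllowedOrigins("http://localhost:3000, http://192.168.1.10:3000");
console.log(isOriginAllowed("http://192.168.1.10:3000", allowedOrigins)); // true
console.log(isOriginAllowed("http://evil.com", allowedOrigins)); // false
```

Exact string matching means scheme, host, and port must all agree, so `http://localhost` and `http://localhost:3000` are different origins.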

---

## Appendix: Pen-Test Checklist

**Run before each release:**

1. **SQL Injection**: Test all endpoints with payloads like `' OR '1'='1`, `'; DROP TABLE agents--`
2. **Header Injection**: Send malformed headers (e.g., `X-Agent-Id: <script>alert(1)</script>`)
3. **Rate Limit Bypass**: Burst 200 requests in 10 seconds from single IP
4. **JWT Tampering**: Modify JWT payload, re-sign with weak secret, submit
5. **CORS Bypass**: Send request with `Origin: http://evil.com`, check if accepted
6. **WebSocket Flood**: Connect and send 50 events/s, verify rate limit triggers
7. **Message Injection**: Send message with `body: "<script>alert(1)</script>"`, verify escaping

**Expected results:**
- All injections are rejected with 400/401/403
- Rate limits are enforced at the defined thresholds
- CORS rejects unauthorized origins
- No script execution in message rendering
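
For the message-injection check, the expected outcome is that markup in a message body is rendered as text. The sketch below shows one common way to achieve that and what the escaped payload looks like. It is illustrative only; AgentHub's renderer may rely on a framework's built-in escaping instead.

```typescript
// Illustrative escaping function; AgentHub's renderer may use a different mechanism.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const payload = "<script>alert(1)</script>";
console.log(escapeHtml(payload)); // &lt;script&gt;alert(1)&lt;/script&gt;
```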