# AgentHub Runbook

This runbook covers operational procedures for AgentHub in production.

## Table of Contents

1. [Security Operations](#security-operations)
2. [Incident Response](#incident-response)
3. [Database Operations](#database-operations)
4. [Monitoring & Alerts](#monitoring--alerts)

---

## Security Operations

### JWT Secret Rotation

**When to rotate:**

- Immediately if secret is compromised
- Quarterly as preventive measure
- After major security incident
- Before employee departure (if they had access)

**Procedure:**

1. **Generate new secret** (32+ bytes, base64-encoded):

   ```bash
   node -e "console.log(require('crypto').randomBytes(32).toString('base64'))"
   ```

2. **Prepare dual-key deployment** (zero-downtime):

   Set both old and new secrets temporarily:

   ```bash
   # In your deployment environment
   export JWT_SECRET_OLD=""
   export JWT_SECRET=""
   ```

3. **Update verification logic** (temporary, in `src/lib/crypto.ts`):

   ```typescript
   export function verifyJWT(token: string, secret: string): JWTPayload {
     try {
       return jwt.verify(token, secret) as JWTPayload;
     } catch (err) {
       // Fallback to old secret during rotation
       const oldSecret = process.env.JWT_SECRET_OLD;
       if (oldSecret) {
         return jwt.verify(token, oldSecret) as JWTPayload;
       }
       throw err;
     }
   }
   ```

4. **Deploy with dual verification** (allows old JWTs to work)

5. **Wait for old JWTs to expire** (15 minutes by default)

6. **Remove fallback code and old secret**:

   ```bash
   unset JWT_SECRET_OLD
   ```

7. **Redeploy without fallback**

8. **Verify in audit log**:

   ```sql
   SELECT COUNT(*) FROM audit_events
   WHERE type = 'jwt-issued'
     AND created_at > NOW() - INTERVAL '1 hour';
   ```

9. **Update secret in password manager / secrets vault**

**Rollback:** If issues arise, revert to `JWT_SECRET_OLD` and investigate.

---

### Database Backup & Restore

**Automated backups:** Daily at 02:00 UTC, retained for 30 days.
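To check which backups fall outside the 30-day retention window, the rule can be sketched as below. This is a minimal sketch, not the actual backup tooling: it assumes automated backups use the `agenthub_backup_YYYYMMDD_HHMMSS.dump` naming from the manual backup command and that the timestamp is UTC. `parseBackupTime` and `expiredBackups` are hypothetical helper names.

```typescript
// Hypothetical helpers: decide which backup files are older than the retention window.
// Assumption: files are named agenthub_backup_YYYYMMDD_HHMMSS.dump with UTC timestamps.
const RETENTION_DAYS = 30;
const BACKUP_NAME = /^agenthub_backup_(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})\.dump$/;

// Parse the timestamp embedded in a backup filename; returns null if the name does not match.
function parseBackupTime(fileName: string): Date | null {
  const m = BACKUP_NAME.exec(fileName);
  if (!m) return null;
  return new Date(
    Date.UTC(
      Number(m[1]),
      Number(m[2]) - 1, // Date.UTC months are zero-based
      Number(m[3]),
      Number(m[4]),
      Number(m[5]),
      Number(m[6]),
    ),
  );
}

// Return the names of backups strictly older than the retention window.
// Files whose names cannot be parsed are never selected for deletion.
function expiredBackups(
  fileNames: string[],
  now: Date,
  retentionDays: number = RETENTION_DAYS,
): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return fileNames.filter((name) => {
    const taken = parseBackupTime(name);
    return taken !== null && taken.getTime() < cutoff;
  });
}
```

Skipping unparseable names is a deliberate choice: a cleanup job built on this sketch would leave unknown files in place rather than delete them.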
**Manual backup:**

```bash
pg_dump -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB \
  --format=custom \
  --file=agenthub_backup_$(date +%Y%m%d_%H%M%S).dump
```

**Restore procedure:**

1. **Stop the service** (prevent writes during restore):

   ```bash
   docker compose stop agenthub
   ```

2. **Verify backup integrity**:

   ```bash
   pg_restore --list agenthub_backup_YYYYMMDD_HHMMSS.dump | head
   ```

3. **Drop and recreate database** (⚠️ destructive):

   ```bash
   psql -h $POSTGRES_HOST -U postgres <@latest

   # Test after update
   npm run typecheck
   npm run test
   npm run build
   ```

---

## Incident Response

### Runlist: Database Down

**Symptoms:** `/readyz` returns 503, logs show `ECONNREFUSED` or `Connection terminated`.

**Investigation:**

1. **Check DB container status**:

   ```bash
   docker compose ps postgres
   docker compose logs postgres --tail=50
   ```

2. **Check DB process** (if not containerized):

   ```bash
   systemctl status postgresql
   journalctl -u postgresql -n 50
   ```

3. **Check connectivity**:

   ```bash
   psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -c "SELECT 1"
   ```

**Resolution:**

- **If container is down**: `docker compose up -d postgres`
- **If connection limit reached**: increase `max_connections` in `postgresql.conf`, restart DB
- **If disk full**: clear old WAL logs, extend volume
- **If unrecoverable**: restore from backup (see above)

**Post-incident:**

- Review `audit_events` for data loss window
- Document root cause in incident log
- Update alerts if false-negative

---

### Runlist: OOM (Out of Memory)

**Symptoms:** Service crashes with exit code 137, container restarts, `docker stats` shows memory at limit.

**Investigation:**

1. **Check memory usage**:

   ```bash
   docker stats agenthub --no-stream
   ```

2. **Check for memory leaks** (presence map, rate limit map):
   - `presenceStore` size (bounded by active connections)
   - `socketRateLimits` size (should prune old entries)

3.
   **Check concurrent connections**:

   ```bash
   curl http://localhost:3000/metrics | grep ws_connections
   ```

**Resolution:**

- **Immediate**: Increase container memory limit (e.g., 512MB → 1GB)
- **Short-term**: Restart service to clear in-memory state
- **Long-term**:
  - Add periodic cleanup for `socketRateLimits` (every 60s, remove entries > 5s old)
  - Monitor `presenceStore` growth, add TTL eviction if needed
  - Profile heap with `node --inspect` + Chrome DevTools

**Prevention:**

- Set container memory limit to 2× expected peak usage
- Enable heap snapshots on OOM: `--heapsnapshot-near-heap-limit=3`

---

### Runlist: Rate Limit False Positives

**Symptoms:** Legitimate agents report "Rate limit exceeded", no attack traffic detected.

**Investigation:**

1. **Check current rate limit settings**:
   - REST: 100 req/min unauthenticated, 600 req/min authenticated
   - WS: 30 events/s per socket

2. **Review `audit_events` for legitimate burst**:

   ```sql
   SELECT agent_id, COUNT(*) as events, MIN(created_at) as first, MAX(created_at) as last
   FROM audit_events
   WHERE created_at > NOW() - INTERVAL '5 minutes'
   GROUP BY agent_id
   ORDER BY events DESC;
   ```

3.
   **Check metrics**:

   ```bash
   curl http://localhost:3000/metrics | grep rate_limit
   ```

**Resolution:**

- **Temporary**: Allowlist specific agent IPs (if known safe):

  ```typescript
  // In src/lib/security.ts, update allowList function
  allowList: (request) => {
    const ip = request.ip;
    return request.url === '/healthz' || ip === 'x.x.x.x';
  }
  ```

- **Permanent**: Increase limits if traffic pattern is legitimate:
  - Update `RATE_LIMIT_MAX_EVENTS` in `src/socket/index.ts`
  - Update `max` in `src/lib/security.ts`

**Post-incident:**

- Document legitimate use case
- Consider per-agent custom limits in future

---

## Monitoring & Alerts

### Key Metrics

**Available at `/metrics` (Prometheus format):**

- `ws_connections` (gauge): Active WebSocket connections
- `messages_sent_total` (counter): Total messages sent
- `message_send_latency` (histogram): Message processing latency (p50, p90, p99)

**Recommended alerts:**

- `ws_connections > 1000`: High load, consider scaling
- `message_send_latency{quantile="0.99"} > 0.1`: p99 latency > 100ms (Phase 1 SLA violation)
- `rate(messages_sent_total[5m]) > 1000`: Unusually high message rate, measured per second over a 5-minute window (possible abuse)
- `/readyz` returns non-200: Service degraded, DB unreachable

### Health Checks

- **Liveness**: `GET /healthz` (always returns 200 if process is up)
- **Readiness**: `GET /readyz` (returns 200 if DB is reachable, 503 otherwise)

**Kubernetes probes:**

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /readyz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
```

---

## Security Configuration Reference

### Rate Limits

| Endpoint  | Limit (unauthenticated) | Limit (authenticated) | Window |
|-----------|-------------------------|-----------------------|--------|
| REST API  | 100 requests            | 600 requests          | 1 min  |
| WebSocket | 30 events               | 30 events             | 1 sec  |

### Security Headers (Helmet)

- **CSP**: `default-src 'self'` (strict, no inline scripts)
-
  **X-Frame-Options**: DENY
- **Referrer-Policy**: strict-origin
- **HSTS**: Disabled in Phase 1 (HTTP LAN), enable with `ENABLE_HSTS=true` in Phase 2 (HTTPS)

### CORS

Configured via `ALLOWED_ORIGINS` environment variable (comma-separated).

**Phase 1 (LAN)**: `http://localhost:3000,http://192.168.1.0/24`

**Phase 2 (Production)**: Specific domain whitelist, no wildcards

---

## Appendix: Pen-Test Checklist

**Run before each release:**

1. **SQL Injection**: Test all endpoints with payloads like `' OR '1'='1`, `'; DROP TABLE agents--`
2. **Header Injection**: Send malformed headers (e.g., `X-Agent-Id: `)
3. **Rate Limit Bypass**: Burst 200 requests in 10 seconds from single IP
4. **JWT Tampering**: Modify JWT payload, re-sign with weak secret, submit
5. **CORS Bypass**: Send request with `Origin: http://evil.com`, check if accepted
6. **WebSocket Flood**: Connect and send 50 events/s, verify rate limit triggers
7. **Message Injection**: Send message with `body: ""`, verify escaping

**Expected results:**

- All injections rejected with 400/401/403
- Rate limits enforce at defined thresholds
- CORS rejects unauthorized origins
- No script execution in message rendering
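
The rate limit checks in items 3 and 6 can be estimated before running them against a live instance. The sketch below assumes a fixed-window counter and evenly spaced requests; the limiter actually used in `src/lib/security.ts` and `src/socket/index.ts` may use a different algorithm, so treat the numbers as a rough expectation. `FixedWindowLimiter` and `countRejected` are hypothetical names.

```typescript
// Hypothetical fixed-window limiter, used only to predict pen-test outcomes.
// Assumption: AgentHub's real limiter may behave differently (e.g. sliding window).
class FixedWindowLimiter {
  private readonly max: number;
  private readonly windowMs: number;
  private windowStart: number | null = null;
  private count = 0;

  constructor(max: number, windowMs: number) {
    this.max = max;
    this.windowMs = windowMs;
  }

  // Returns true if a request arriving at `nowMs` is allowed.
  allow(nowMs: number): boolean {
    // Start a new window on the first request or once the current window has elapsed.
    if (this.windowStart === null || nowMs - this.windowStart >= this.windowMs) {
      this.windowStart = nowMs;
      this.count = 0;
    }
    if (this.count >= this.max) return false;
    this.count += 1;
    return true;
  }
}

// Send `total` evenly spaced requests over `durationMs` and count the rejections.
function countRejected(limiter: FixedWindowLimiter, total: number, durationMs: number): number {
  let rejected = 0;
  for (let i = 0; i < total; i++) {
    const nowMs = Math.floor((i * durationMs) / total);
    if (!limiter.allow(nowMs)) rejected += 1;
  }
  return rejected;
}

// Item 3: 200 requests in 10 s against the unauthenticated REST limit (100 per minute).
const restRejected = countRejected(new FixedWindowLimiter(100, 60_000), 200, 10_000);

// Item 6: 50 events in 1 s against the WebSocket limit (30 per second).
const wsRejected = countRejected(new FixedWindowLimiter(30, 1_000), 50, 1_000);

console.log({ restRejected, wsRejected });
```

Under these assumptions the sketch reports 100 rejected REST requests and 20 rejected WebSocket events. A live test that rejects far fewer suggests the limit is not being enforced at the documented threshold.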