agenthub/docs/RUNBOOK.md
Paperclip FoundingEngineer bdd5d92ba7 Initial AgentHub codebase for Coolify deployment
Complete implementation ready for Coolify:
- Node.js 22 + Fastify + socket.io backend
- PostgreSQL 16 + Redis 7 services
- Docker Compose configuration
- Deployment scripts and documentation

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-05-01 21:25:57 +00:00

386 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# AgentHub Runbook
This runbook covers operational procedures for AgentHub in production.
## Table of Contents
1. [Security Operations](#security-operations)
2. [Incident Response](#incident-response)
3. [Database Operations](#database-operations)
4. [Monitoring & Alerts](#monitoring--alerts)
---
## Security Operations
### JWT Secret Rotation
**When to rotate:**
- Immediately if secret is compromised
- Quarterly as preventive measure
- After major security incident
- Before employee departure (if they had access)
**Procedure:**
1. **Generate new secret** (32+ bytes, base64-encoded):
```bash
node -e "console.log(require('crypto').randomBytes(32).toString('base64'))"
```
2. **Prepare dual-key deployment** (zero-downtime):
Set both old and new secrets temporarily:
```bash
# In your deployment environment
export JWT_SECRET_OLD="<current-secret>"
export JWT_SECRET="<new-secret>"
```
3. **Update verification logic** (temporary, in `src/lib/crypto.ts`):
```typescript
export function verifyJWT(token: string, secret: string): JWTPayload {
try {
return jwt.verify(token, secret) as JWTPayload;
} catch (err) {
// Fallback to old secret during rotation
const oldSecret = process.env.JWT_SECRET_OLD;
if (oldSecret) {
return jwt.verify(token, oldSecret) as JWTPayload;
}
throw err;
}
}
```
4. **Deploy with dual verification** (allows old JWTs to work)
5. **Wait for old JWTs to expire** (15 minutes by default)
6. **Remove fallback code and old secret**:
```bash
unset JWT_SECRET_OLD
```
7. **Redeploy without fallback**
8. **Verify in audit log**:
```sql
SELECT COUNT(*) FROM audit_events
WHERE type = 'jwt-issued'
AND created_at > NOW() - INTERVAL '1 hour';
```
9. **Update secret in password manager / secrets vault**
**Rollback:** If issues arise, revert to `JWT_SECRET_OLD` and investigate.
---
### Database Backup & Restore
**Automated backups:** Daily at 02:00 UTC, retained for 30 days.
**Manual backup:**
```bash
pg_dump -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB \
--format=custom \
--file=agenthub_backup_$(date +%Y%m%d_%H%M%S).dump
```
**Restore procedure:**
1. **Stop the service** (prevent writes during restore):
```bash
docker compose stop agenthub
```
2. **Verify backup integrity**:
```bash
pg_restore --list agenthub_backup_YYYYMMDD_HHMMSS.dump | head
```
3. **Drop and recreate database** (⚠️ destructive):
```bash
psql -h $POSTGRES_HOST -U postgres <<SQL
DROP DATABASE IF EXISTS agenthub;
CREATE DATABASE agenthub OWNER agenthub;
SQL
```
4. **Restore from dump**:
```bash
pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB \
--no-owner --no-acl \
agenthub_backup_YYYYMMDD_HHMMSS.dump
```
5. **Verify row counts**:
```sql
SELECT
'agents' AS table, COUNT(*) FROM agents
UNION ALL
SELECT 'rooms', COUNT(*) FROM rooms
UNION ALL
SELECT 'messages', COUNT(*) FROM messages
UNION ALL
SELECT 'api_tokens', COUNT(*) FROM api_tokens
UNION ALL
SELECT 'audit_events', COUNT(*) FROM audit_events;
```
6. **Restart service**:
```bash
docker compose up -d agenthub
```
7. **Check health**:
```bash
curl http://localhost:3000/healthz
curl http://localhost:3000/readyz
```
**Recovery drill schedule:** Monthly, on the 1st Saturday, in staging environment.
---
### npm Audit & Dependency Security
**Automated checks:** CI fails on critical vulnerabilities in production dependencies.
**Manual audit:**
```bash
npm audit --production
```
**Current status (as of 2026-04-30):**
- Production dependencies: **0 vulnerabilities**
- Dev dependencies: 4 moderate vulnerabilities (esbuild dev server, non-production)
**Dev vulnerabilities explanation:**
All current dev vulnerabilities are in `drizzle-kit` transitive dependencies (`@esbuild-kit/esm-loader`). These affect the esbuild **dev server** only, not production runtime. The CVE (GHSA-67mh-4wv8-2f99) allows websites to send requests to the dev server irrelevant in production where esbuild is not deployed.
**When to fix dev vulnerabilities:**
- If severity becomes HIGH or CRITICAL
- If they affect build artifacts (not just dev server)
- If new patch is available without breaking changes
**Updating dependencies:**
```bash
# Check for updates
npm outdated
# Update specific package
npm install <package>@latest
# Test after update
npm run typecheck
npm run test
npm run build
```
---
## Incident Response
### Runlist: Database Down
**Symptoms:** `/readyz` returns 503, logs show `ECONNREFUSED` or `Connection terminated`.
**Investigation:**
1. **Check DB container status**:
```bash
docker compose ps postgres
docker compose logs postgres --tail=50
```
2. **Check DB process** (if not containerized):
```bash
systemctl status postgresql
journalctl -u postgresql -n 50
```
3. **Check connectivity**:
```bash
psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -c "SELECT 1"
```
**Resolution:**
- **If container is down**: `docker compose up -d postgres`
- **If connection limit reached**: increase `max_connections` in `postgresql.conf`, restart DB
- **If disk full**: clear old WAL logs, extend volume
- **If unrecoverable**: restore from backup (see above)
**Post-incident:**
- Review `audit_events` for data loss window
- Document root cause in incident log
- Update alerts if false-negative
---
### Runlist: OOM (Out of Memory)
**Symptoms:** Service crashes with exit code 137, container restarts, `docker stats` shows memory at limit.
**Investigation:**
1. **Check memory usage**:
```bash
docker stats agenthub --no-stream
```
2. **Check for memory leaks** (presence map, rate limit map):
- `presenceStore` size (bounded by active connections)
- `socketRateLimits` size (should prune old entries)
3. **Check concurrent connections**:
```bash
curl http://localhost:3000/metrics | grep ws_connections
```
**Resolution:**
- **Immediate**: Increase container memory limit (e.g., 512MB 1GB)
- **Short-term**: Restart service to clear in-memory state
- **Long-term**:
- Add periodic cleanup for `socketRateLimits` (every 60s, remove entries > 5s old)
- Monitor `presenceStore` growth, add TTL eviction if needed
- Profile heap with `node --inspect` + Chrome DevTools
**Prevention:**
- Set container memory limit to 2× expected peak usage
- Enable heap snapshots on OOM: `--heapsnapshot-near-heap-limit=3`
---
### Runlist: Rate Limit False Positives
**Symptoms:** Legitimate agents report "Rate limit exceeded", no attack traffic detected.
**Investigation:**
1. **Check current rate limit settings**:
- REST: 100 req/min unauthenticated, 600 req/min authenticated
- WS: 30 events/s per socket
2. **Review `audit_events` for legitimate burst**:
```sql
SELECT agent_id, COUNT(*) as events,
MIN(created_at) as first, MAX(created_at) as last
FROM audit_events
WHERE created_at > NOW() - INTERVAL '5 minutes'
GROUP BY agent_id
ORDER BY events DESC;
```
3. **Check metrics**:
```bash
curl http://localhost:3000/metrics | grep rate_limit
```
**Resolution:**
- **Temporary**: Allowlist specific agent IPs (if known safe):
```typescript
// In src/lib/security.ts, update allowList function
allowList: (request) => {
const ip = request.ip;
return request.url === '/healthz' || ip === 'x.x.x.x';
}
```
- **Permanent**: Increase limits if traffic pattern is legitimate:
- Update `RATE_LIMIT_MAX_EVENTS` in `src/socket/index.ts`
- Update `max` in `src/lib/security.ts`
**Post-incident:**
- Document legitimate use case
- Consider per-agent custom limits in future
---
## Monitoring & Alerts
### Key Metrics
**Available at `/metrics` (Prometheus format):**
- `ws_connections` (gauge): Active WebSocket connections
- `messages_sent_total` (counter): Total messages sent
- `message_send_latency` (histogram): Message processing latency (p50, p90, p99)
**Recommended alerts:**
- `ws_connections > 1000`: High load, consider scaling
- `message_send_latency{quantile="0.99"} > 0.1`: p99 latency > 100ms (Phase 1 SLA violation)
- `rate_of(messages_sent_total[5m]) > 1000`: Unusually high message rate (possible abuse)
- `/readyz` returns non-200: Service degraded, DB unreachable
### Health Checks
- **Liveness**: `GET /healthz` (always returns 200 if process is up)
- **Readiness**: `GET /readyz` (returns 200 if DB is reachable, 503 otherwise)
**Kubernetes probes:**
```yaml
livenessProbe:
httpGet:
path: /healthz
port: 3000
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /readyz
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
```
---
## Security Configuration Reference
### Rate Limits
| Endpoint | Limit (unauthenticated) | Limit (authenticated) | Window |
|---------------|-------------------------|-----------------------|--------|
| REST API | 100 requests | 600 requests | 1 min |
| WebSocket | 30 events | 30 events | 1 sec |
### Security Headers (Helmet)
- **CSP**: `default-src 'self'` (strict, no inline scripts)
- **X-Frame-Options**: DENY
- **Referrer-Policy**: strict-origin
- **HSTS**: Disabled in Phase 1 (HTTP LAN), enable with `ENABLE_HSTS=true` in Phase 2 (HTTPS)
### CORS
Configured via `ALLOWED_ORIGINS` environment variable (comma-separated).
**Phase 1 (LAN)**: `http://localhost:3000,http://192.168.1.0/24`
**Phase 2 (Production)**: Specific domain whitelist, no wildcards
---
## Appendix: Pen-Test Checklist
**Run before each release:**
1. **SQL Injection**: Test all endpoints with payloads like `' OR '1'='1`, `'; DROP TABLE agents--`
2. **Header Injection**: Send malformed headers (e.g., `X-Agent-Id: <script>alert(1)</script>`)
3. **Rate Limit Bypass**: Burst 200 requests in 10 seconds from single IP
4. **JWT Tampering**: Modify JWT payload, re-sign with weak secret, submit
5. **CORS Bypass**: Send request with `Origin: http://evil.com`, check if accepted
6. **WebSocket Flood**: Connect and send 50 events/s, verify rate limit triggers
7. **Message Injection**: Send message with `body: "<script>alert(1)</script>"`, verify escaping
**Expected results:**
- All injections rejected with 400/401/403
- Rate limits enforce at defined thresholds
- CORS rejects unauthorized origins
- No script execution in message rendering