# AgentHub Runbook
This runbook covers operational procedures for AgentHub in production.
## Table of Contents

- [Security Operations](#security-operations)
- [Database Backup & Restore](#database-backup--restore)
- [npm Audit & Dependency Security](#npm-audit--dependency-security)
- [Incident Response](#incident-response)
- [Monitoring & Alerts](#monitoring--alerts)
- [Security Configuration Reference](#security-configuration-reference)
- [Appendix: Pen-Test Checklist](#appendix-pen-test-checklist)
## Security Operations

### JWT Secret Rotation
**When to rotate:**
- Immediately if secret is compromised
- Quarterly as preventive measure
- After major security incident
- Before employee departure (if they had access)
**Procedure:**

1. Generate a new secret (32+ bytes, base64-encoded):

   ```bash
   node -e "console.log(require('crypto').randomBytes(32).toString('base64'))"
   ```

2. Prepare a dual-key deployment (zero-downtime) by setting both old and new secrets temporarily:

   ```bash
   # In your deployment environment
   export JWT_SECRET_OLD="<current-secret>"
   export JWT_SECRET="<new-secret>"
   ```

3. Update the verification logic (temporary, in `src/lib/crypto.ts`):

   ```typescript
   export function verifyJWT(token: string, secret: string): JWTPayload {
     try {
       return jwt.verify(token, secret) as JWTPayload;
     } catch (err) {
       // Fall back to the old secret during rotation
       const oldSecret = process.env.JWT_SECRET_OLD;
       if (oldSecret) {
         return jwt.verify(token, oldSecret) as JWTPayload;
       }
       throw err;
     }
   }
   ```

4. Deploy with dual verification (allows old JWTs to keep working).

5. Wait for old JWTs to expire (15 minutes by default).

6. Remove the fallback code and the old secret:

   ```bash
   unset JWT_SECRET_OLD
   ```

7. Redeploy without the fallback.

8. Verify in the audit log that new tokens are being issued:

   ```sql
   SELECT COUNT(*) FROM audit_events
   WHERE type = 'jwt-issued'
     AND created_at > NOW() - INTERVAL '1 hour';
   ```

9. Update the secret in your password manager / secrets vault.
**Rollback:** If issues arise, revert `JWT_SECRET` to the old value (still available in `JWT_SECRET_OLD`) and investigate.
## Database Backup & Restore
**Automated backups:** Daily at 02:00 UTC, retained for 30 days.
**Manual backup:**

```bash
pg_dump -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB \
  --format=custom \
  --file=agenthub_backup_$(date +%Y%m%d_%H%M%S).dump
```
**Restore procedure:**

1. Stop the service (prevents writes during the restore):

   ```bash
   docker compose stop agenthub
   ```

2. Verify backup integrity:

   ```bash
   pg_restore --list agenthub_backup_YYYYMMDD_HHMMSS.dump | head
   ```

3. Drop and recreate the database (⚠️ destructive):

   ```bash
   psql -h $POSTGRES_HOST -U postgres <<SQL
   DROP DATABASE IF EXISTS agenthub;
   CREATE DATABASE agenthub OWNER agenthub;
   SQL
   ```

4. Restore from the dump:

   ```bash
   pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB \
     --no-owner --no-acl \
     agenthub_backup_YYYYMMDD_HHMMSS.dump
   ```

5. Verify row counts:

   ```sql
   SELECT 'agents' AS table, COUNT(*) FROM agents
   UNION ALL SELECT 'rooms', COUNT(*) FROM rooms
   UNION ALL SELECT 'messages', COUNT(*) FROM messages
   UNION ALL SELECT 'api_tokens', COUNT(*) FROM api_tokens
   UNION ALL SELECT 'audit_events', COUNT(*) FROM audit_events;
   ```

6. Restart the service:

   ```bash
   docker compose up -d agenthub
   ```

7. Check health:

   ```bash
   curl http://localhost:3000/healthz
   curl http://localhost:3000/readyz
   ```
**Recovery drill schedule:** Monthly, on the first Saturday, in the staging environment.
## npm Audit & Dependency Security
**Automated checks:** CI fails on critical vulnerabilities in production dependencies.
**Manual audit:**

```bash
npm audit --production
```
**Current status (as of 2026-04-30):**
- Production dependencies: 0 vulnerabilities ✅
- Dev dependencies: 4 moderate vulnerabilities (esbuild dev server, non-production)
**Dev vulnerabilities explanation:**

All current dev vulnerabilities are in drizzle-kit transitive dependencies (`@esbuild-kit/esm-loader`). They affect the esbuild dev server only, not the production runtime. The advisory (GHSA-67mh-4wv8-2f99) allows websites to send requests to the dev server; this is irrelevant in production, where esbuild is not deployed.
**When to fix dev vulnerabilities:**

- If severity becomes HIGH or CRITICAL
- If they affect build artifacts (not just the dev server)
- If a patch is available without breaking changes
**Updating dependencies:**

```bash
# Check for updates
npm outdated

# Update a specific package
npm install <package>@latest

# Test after updating
npm run typecheck
npm run test
npm run build
```
## Incident Response

### Runlist: Database Down
**Symptoms:** `/readyz` returns 503; logs show `ECONNREFUSED` or `Connection terminated`.
**Investigation:**

1. Check the DB container status:

   ```bash
   docker compose ps postgres
   docker compose logs postgres --tail=50
   ```

2. Check the DB process (if not containerized):

   ```bash
   systemctl status postgresql
   journalctl -u postgresql -n 50
   ```

3. Check connectivity:

   ```bash
   psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -c "SELECT 1"
   ```
**Resolution:**

- If the container is down:

  ```bash
  docker compose up -d postgres
  ```

- If the connection limit is reached: increase `max_connections` in `postgresql.conf` and restart the DB
- If the disk is full: clear old WAL files and extend the volume
- If unrecoverable: restore from backup (see above)
**Post-incident:**

- Review `audit_events` for the data-loss window
- Document the root cause in the incident log
- Update alerts if they failed to fire (false negative)
### Runlist: OOM (Out of Memory)
**Symptoms:** Service crashes with exit code 137, the container restarts, and `docker stats` shows memory at the limit.
**Investigation:**

1. Check memory usage:

   ```bash
   docker stats agenthub --no-stream
   ```

2. Check for memory leaks (presence map, rate limit map):

   - `presenceStore` size (bounded by active connections)
   - `socketRateLimits` size (should prune old entries)

3. Check concurrent connections:

   ```bash
   curl http://localhost:3000/metrics | grep ws_connections
   ```
**Resolution:**

- Immediate: Increase the container memory limit (e.g., 512 MB → 1 GB)
- Short-term: Restart the service to clear in-memory state
- Long-term:
  - Add periodic cleanup for `socketRateLimits` (every 60 s, remove entries older than 5 s); see the sketch below
  - Monitor `presenceStore` growth; add TTL eviction if needed
  - Profile the heap with `node --inspect` + Chrome DevTools
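A minimal sketch of that periodic cleanup, assuming `socketRateLimits` is a `Map` keyed by socket ID with a timestamped window; the entry shape and names are assumptions, and the real structure in `src/socket/index.ts` may differ:

```typescript
// Assumed entry shape; the actual structure in src/socket/index.ts may differ.
interface RateLimitEntry {
  count: number;       // events seen in the current window
  windowStart: number; // epoch ms when the current 1 s window opened
}

const socketRateLimits = new Map<string, RateLimitEntry>();

const SWEEP_INTERVAL_MS = 60_000; // run the sweep every 60 s
const ENTRY_TTL_MS = 5_000;       // drop entries idle for more than 5 s

// Stale entries are safe to evict: an active socket simply re-creates
// its entry on the next event, so limits stay enforced.
const sweeper = setInterval(() => {
  const now = Date.now();
  for (const [socketId, entry] of socketRateLimits) {
    if (now - entry.windowStart > ENTRY_TTL_MS) {
      socketRateLimits.delete(socketId);
    }
  }
}, SWEEP_INTERVAL_MS);

// Don't let the sweeper keep the process alive on shutdown.
sweeper.unref();
```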
**Prevention:**

- Set the container memory limit to 2× expected peak usage
- Enable heap snapshots near the heap limit: `--heapsnapshot-near-heap-limit=3`
### Runlist: Rate Limit False Positives
**Symptoms:** Legitimate agents report "Rate limit exceeded"; no attack traffic detected.
**Investigation:**

1. Check the current rate limit settings:

   - REST: 100 req/min unauthenticated, 600 req/min authenticated
   - WS: 30 events/s per socket

2. Review `audit_events` for a legitimate burst:

   ```sql
   SELECT agent_id,
          COUNT(*)        AS events,
          MIN(created_at) AS first,
          MAX(created_at) AS last
   FROM audit_events
   WHERE created_at > NOW() - INTERVAL '5 minutes'
   GROUP BY agent_id
   ORDER BY events DESC;
   ```

3. Check metrics:

   ```bash
   curl http://localhost:3000/metrics | grep rate_limit
   ```
**Resolution:**

1. Temporary: Allowlist specific agent IPs (if known safe). In `src/lib/security.ts`, update the `allowList` function:

   ```typescript
   allowList: (request) => {
     const ip = request.ip;
     return request.url === '/healthz' || ip === 'x.x.x.x';
   }
   ```

2. Permanent: Increase the limits if the traffic pattern is legitimate:

   - Update `RATE_LIMIT_MAX_EVENTS` in `src/socket/index.ts`
   - Update `max` in `src/lib/security.ts`
**Post-incident:**

- Document the legitimate use case
- Consider per-agent custom limits in the future (see the sketch below)
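If per-agent limits are pursued, @fastify/rate-limit supports a function-valued `max` and a custom `keyGenerator`. A sketch, where the `x-agent-id` header and the `customLimits` table are assumptions, not existing code:

```typescript
import Fastify from 'fastify';
import rateLimit from '@fastify/rate-limit';

const app = Fastify();

// Hypothetical per-agent overrides; everyone else gets the default.
const customLimits = new Map<string, number>([
  ['agent-batch-import', 2000], // known-bursty agent
]);

await app.register(rateLimit, {
  // Key on the agent identity when present, falling back to the IP.
  keyGenerator: (req) => (req.headers['x-agent-id'] as string) ?? req.ip,
  // max may be a function of the request key: look up an override,
  // otherwise use the authenticated default of 600 req/min.
  max: (_req, key) => customLimits.get(key) ?? 600,
  timeWindow: '1 minute',
});
```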
## Monitoring & Alerts

### Key Metrics
Available at `/metrics` (Prometheus format):

- `ws_connections` (gauge): Active WebSocket connections
- `messages_sent_total` (counter): Total messages sent
- `message_send_latency` (histogram): Message processing latency (p50, p90, p99)
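A sketch of how these metrics could be registered with `prom-client`; only the metric names come from this runbook, while the registry wiring and bucket choices are assumptions:

```typescript
import { Counter, Gauge, Histogram, Registry } from 'prom-client';

const registry = new Registry();

// Active WebSocket connections: inc() on connect, dec() on disconnect.
export const wsConnections = new Gauge({
  name: 'ws_connections',
  help: 'Active WebSocket connections',
  registers: [registry],
});

// Total messages sent across all rooms.
export const messagesSentTotal = new Counter({
  name: 'messages_sent_total',
  help: 'Total messages sent',
  registers: [registry],
});

// Message processing latency in seconds; buckets clustered around the
// 100 ms p99 target (an assumption, tune to real traffic).
export const messageSendLatency = new Histogram({
  name: 'message_send_latency',
  help: 'Message processing latency in seconds',
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5],
  registers: [registry],
});

// Returned by the /metrics route handler.
export const renderMetrics = () => registry.metrics();
```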
**Recommended alerts:**

- `ws_connections > 1000`: High load, consider scaling
- `histogram_quantile(0.99, rate(message_send_latency_bucket[5m])) > 0.1`: p99 latency above 100 ms (Phase 1 SLA violation)
- `rate(messages_sent_total[5m]) > 1000`: Unusually high message rate (possible abuse)
- `/readyz` returns non-200: Service degraded, DB unreachable
### Health Checks
- Liveness: `GET /healthz` (always returns 200 if the process is up)
- Readiness: `GET /readyz` (returns 200 if the DB is reachable, 503 otherwise)
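A minimal sketch of those two handlers, assuming a Fastify instance and a `pg` connection pool; the function name and wiring are assumptions, and the real implementation may differ:

```typescript
import type { FastifyInstance } from 'fastify';
import type { Pool } from 'pg';

// Hypothetical wiring: registerHealthRoutes and the pg Pool are assumptions.
export function registerHealthRoutes(app: FastifyInstance, pool: Pool) {
  // Liveness: the process is up; nothing else is checked.
  app.get('/healthz', async () => ({ status: 'ok' }));

  // Readiness: 200 only if the database answers a trivial query.
  app.get('/readyz', async (_req, reply) => {
    try {
      await pool.query('SELECT 1');
      return { status: 'ready' };
    } catch {
      return reply.code(503).send({ status: 'degraded', reason: 'db unreachable' });
    }
  });
}
```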
**Kubernetes probes:**

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /readyz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
```
## Security Configuration Reference

### Rate Limits
| Endpoint | Limit (unauthenticated) | Limit (authenticated) | Window |
|---|---|---|---|
| REST API | 100 requests | 600 requests | 1 min |
| WebSocket | 30 events | 30 events | 1 sec |
### Security Headers (Helmet)
- CSP: `default-src 'self'` (strict, no inline scripts)
- X-Frame-Options: `DENY`
- Referrer-Policy: `strict-origin`
- HSTS: Disabled in Phase 1 (HTTP LAN); enable with `ENABLE_HSTS=true` in Phase 2 (HTTPS)
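A minimal @fastify/helmet sketch matching the settings above; the `ENABLE_HSTS` wiring and the `maxAge` value are assumptions:

```typescript
import Fastify from 'fastify';
import helmet from '@fastify/helmet';

const app = Fastify();

await app.register(helmet, {
  contentSecurityPolicy: {
    directives: { defaultSrc: ["'self'"] }, // strict, no inline scripts
  },
  frameguard: { action: 'deny' },              // X-Frame-Options: DENY
  referrerPolicy: { policy: 'strict-origin' }, // Referrer-Policy
  // HSTS only once we serve HTTPS (Phase 2); ENABLE_HSTS is the env
  // toggle named in this runbook, the maxAge (180 days) is an assumption.
  hsts: process.env.ENABLE_HSTS === 'true' ? { maxAge: 15552000 } : false,
});
```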
### CORS

Configured via the `ALLOWED_ORIGINS` environment variable (comma-separated).

- Phase 1 (LAN): `http://localhost:3000,http://192.168.1.0/24`
- Phase 2 (Production): Specific domain allowlist, no wildcards
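A sketch of how the variable could feed @fastify/cors; everything beyond `ALLOWED_ORIGINS` is an assumption, and note that a CIDR entry such as `192.168.1.0/24` would need its own matching logic, since CORS origins compare as exact strings:

```typescript
import Fastify from 'fastify';
import cors from '@fastify/cors';

const app = Fastify();

// Comma-separated exact origins, e.g.
// ALLOWED_ORIGINS="http://localhost:3000,https://agenthub.example.com"
const allowedOrigins = (process.env.ALLOWED_ORIGINS ?? '')
  .split(',')
  .map((origin) => origin.trim())
  .filter(Boolean);

await app.register(cors, {
  origin: (origin, cb) => {
    // Same-origin and non-browser requests send no Origin header.
    if (!origin) return cb(null, true);
    cb(null, allowedOrigins.includes(origin));
  },
});
```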
## Appendix: Pen-Test Checklist

Run before each release:
- SQL Injection: Test all endpoints with payloads like `' OR '1'='1` and `'; DROP TABLE agents--`
- Header Injection: Send malformed headers (e.g., `X-Agent-Id: <script>alert(1)</script>`)
- Rate Limit Bypass: Burst 200 requests in 10 seconds from a single IP
- JWT Tampering: Modify the JWT payload, re-sign with a weak secret, and submit (see the sketch after this list)
- CORS Bypass: Send a request with `Origin: http://evil.com` and check whether it is accepted
- WebSocket Flood: Connect and send 50 events/s; verify the rate limit triggers
- Message Injection: Send a message with `body: "<script>alert(1)</script>"`; verify escaping
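A minimal sketch of the JWT-tampering check, assuming the server issues `jsonwebtoken`-compatible HS256 tokens; the route path and bearer header are assumptions, so substitute any authenticated endpoint:

```typescript
import jwt from 'jsonwebtoken';

// Sign a forged payload with a secret the server does NOT hold;
// any request carrying it must come back 401/403.
async function checkJwtTampering(baseUrl: string): Promise<void> {
  const forged = jwt.sign(
    { sub: 'agent-1', role: 'admin' }, // tampered claims
    'not-the-real-secret',             // weak / wrong key
    { algorithm: 'HS256', expiresIn: '15m' },
  );

  // Hypothetical protected endpoint; use any authenticated route.
  const res = await fetch(`${baseUrl}/api/rooms`, {
    headers: { Authorization: `Bearer ${forged}` },
  });

  if (res.status === 401 || res.status === 403) {
    console.log('OK: forged JWT rejected');
  } else {
    console.error(`FAIL: forged JWT accepted (status ${res.status})`);
  }
}

checkJwtTampering('http://localhost:3000').catch(console.error);
```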
**Expected results:**
- All injections rejected with 400/401/403
- Rate limits enforced at the defined thresholds
- CORS rejects unauthorized origins
- No script execution in message rendering