# AgentHub Runbook

This runbook covers operational procedures for AgentHub in production.

## Table of Contents

1. [Security Operations](#security-operations)
2. [Incident Response](#incident-response)
3. [Monitoring & Alerts](#monitoring--alerts)
4. [Security Configuration Reference](#security-configuration-reference)
5. [Appendix: Pen-Test Checklist](#appendix-pen-test-checklist)

---

## Security Operations

### JWT Secret Rotation

**When to rotate:**
- Immediately if the secret is compromised
- Quarterly as a preventive measure
- After a major security incident
- When an employee who had access to the secret leaves

**Procedure:**

1. **Generate new secret** (32+ bytes, base64-encoded):
   ```bash
   node -e "console.log(require('crypto').randomBytes(32).toString('base64'))"
   ```

2. **Prepare dual-key deployment** (zero-downtime):

   Set both old and new secrets temporarily:
   ```bash
   # In your deployment environment
   export JWT_SECRET_OLD="<current-secret>"
   export JWT_SECRET="<new-secret>"
   ```

3. **Update verification logic** (temporary, in `src/lib/crypto.ts`):
   ```typescript
   export function verifyJWT(token: string, secret: string): JWTPayload {
     try {
       return jwt.verify(token, secret) as JWTPayload;
     } catch (err) {
       // Fallback to old secret during rotation
       const oldSecret = process.env.JWT_SECRET_OLD;
       if (oldSecret) {
         return jwt.verify(token, oldSecret) as JWTPayload;
       }
       throw err;
     }
   }
   ```

4. **Deploy with dual verification** (allows old JWTs to work)

5. **Wait for old JWTs to expire** (15 minutes by default)

6. **Remove fallback code and old secret**:
   ```bash
   unset JWT_SECRET_OLD
   ```

7. **Redeploy without fallback**

8. **Verify in audit log** that tokens are still being issued:
   ```sql
   SELECT COUNT(*) FROM audit_events
   WHERE type = 'jwt-issued'
     AND created_at > NOW() - INTERVAL '1 hour';
   ```

9. **Update secret in password manager / secrets vault**

**Rollback:** If issues arise, set `JWT_SECRET` back to the previous value (the one held in `JWT_SECRET_OLD`), redeploy, and investigate.
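
While both secrets are active (steps 4 and 5), it can help to know which secret signed a given token, for example to confirm that old-secret tokens have stopped arriving. The following is a minimal sketch using only `node:crypto`, not AgentHub's implementation; `issueToken` and `matchingSecretIndex` are hypothetical names, and the sketch checks only the HS256 signature, ignoring expiry and all other claims.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: HMAC-SHA256 over "header.payload", base64url-encoded.
function sign(data: string, secret: string): string {
  return createHmac("sha256", secret).update(data).digest("base64url");
}

// Hypothetical helper: builds a minimal HS256 token (no expiry handling).
export function issueToken(payload: object, secret: string): string {
  const header = Buffer.from(JSON.stringify({ alg: "HS256", typ: "JWT" })).toString("base64url");
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  return `${header}.${body}.${sign(`${header}.${body}`, secret)}`;
}

// Returns the index of the first secret whose signature matches, or -1 if none does.
export function matchingSecretIndex(token: string, secrets: string[]): number {
  const parts = token.split(".");
  if (parts.length !== 3) return -1;
  const [header, body, signature] = parts as [string, string, string];
  const given = Buffer.from(signature, "base64url");
  return secrets.findIndex((secret) => {
    const expected = Buffer.from(sign(`${header}.${body}`, secret), "base64url");
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return expected.length === given.length && timingSafeEqual(expected, given);
  });
}
```

Called as `matchingSecretIndex(token, [newSecret, oldSecret])`, a result of `1` means the token was signed with the old secret, and `-1` means neither secret matches.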

---

### Database Backup & Restore

**Automated backups:** Daily at 02:00 UTC, retained for 30 days.

**Manual backup:**
```bash
pg_dump -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d "$POSTGRES_DB" \
  --format=custom \
  --file="agenthub_backup_$(date +%Y%m%d_%H%M%S).dump"
```
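
The 30-day retention rule can be applied to dump files that follow the `agenthub_backup_YYYYMMDD_HHMMSS.dump` naming used above. This is a sketch under assumptions: the function names are hypothetical, the timestamp in the file name is treated as UTC, and the function only lists candidates without deleting anything.

```typescript
const BACKUP_NAME = /^agenthub_backup_(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})\.dump$/;

// Parses the timestamp embedded in a backup file name (assumed to be UTC).
export function backupTimestamp(fileName: string): Date | null {
  const m = BACKUP_NAME.exec(fileName);
  if (m === null) return null;
  return new Date(
    Date.UTC(Number(m[1]), Number(m[2]) - 1, Number(m[3]), Number(m[4]), Number(m[5]), Number(m[6])),
  );
}

// Returns the backup files older than the retention period; other files are ignored.
export function expiredBackups(fileNames: string[], now: Date, retentionDays = 30): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return fileNames.filter((name) => {
    const timestamp = backupTimestamp(name);
    return timestamp !== null && timestamp.getTime() < cutoff;
  });
}
```

Files that do not match the naming pattern are never returned, so unrelated files in the backup directory are left alone.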

**Restore procedure:**

1. **Stop the service** (prevent writes during restore):
   ```bash
   docker compose stop agenthub
   ```

2. **Verify backup integrity**:
   ```bash
   pg_restore --list agenthub_backup_YYYYMMDD_HHMMSS.dump | head
   ```

3. **Drop and recreate database** (⚠️ destructive):
   ```bash
   psql -h $POSTGRES_HOST -U postgres <<SQL
   DROP DATABASE IF EXISTS agenthub;
   CREATE DATABASE agenthub OWNER agenthub;
   SQL
   ```

4. **Restore from dump**:
   ```bash
   pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB \
     --no-owner --no-acl \
     agenthub_backup_YYYYMMDD_HHMMSS.dump
   ```

5. **Verify row counts**:
   ```sql
   -- "table" is a reserved word in PostgreSQL, so the alias is table_name
   SELECT
     'agents' AS table_name, COUNT(*) AS row_count FROM agents
   UNION ALL
   SELECT 'rooms', COUNT(*) FROM rooms
   UNION ALL
   SELECT 'messages', COUNT(*) FROM messages
   UNION ALL
   SELECT 'api_tokens', COUNT(*) FROM api_tokens
   UNION ALL
   SELECT 'audit_events', COUNT(*) FROM audit_events;
   ```

6. **Restart service**:
   ```bash
   docker compose up -d agenthub
   ```

7. **Check health**:
   ```bash
   curl http://localhost:3000/healthz
   curl http://localhost:3000/readyz
   ```
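
After the restart in step 6, the service may need a few seconds before `/readyz` answers with 200. A small polling helper can replace repeated manual `curl` calls. This is a sketch: `waitUntilReady` and `readyzProbe` are hypothetical names, and the attempt count and delay are arbitrary defaults.

```typescript
// Polls a probe until it reports HTTP 200 or the attempts run out.
export async function waitUntilReady(
  probe: () => Promise<number>,
  maxAttempts = 10,
  delayMs = 1000,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if ((await probe()) === 200) return true;
    } catch {
      // The connection may be refused while the service is still starting; retry.
    }
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Example probe using the global fetch available in Node.js 18+.
export const readyzProbe = async (): Promise<number> =>
  (await fetch("http://localhost:3000/readyz")).status;
```

Passing the probe as a parameter keeps the retry loop independent of the HTTP client, so it can be exercised with a fake probe in a recovery drill script.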

**Recovery drill schedule:** Monthly, on the first Saturday, in the staging environment.

---

### npm Audit & Dependency Security

**Automated checks:** CI fails on critical vulnerabilities in production dependencies.

**Manual audit:**
```bash
# --omit=dev replaces the deprecated --production flag
npm audit --omit=dev
```
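
The CI gate described above could be built on the JSON report from `npm audit --omit=dev --json`. This is a sketch, not the project's actual CI step: the function names are hypothetical, and the code assumes the report exposes per-severity counts under `metadata.vulnerabilities`.

```typescript
type Severity = "info" | "low" | "moderate" | "high" | "critical";

// Subset of the audit report used here (assumed shape).
interface AuditReport {
  metadata: { vulnerabilities: Record<Severity, number> };
}

const SEVERITY_ORDER: Severity[] = ["info", "low", "moderate", "high", "critical"];

// Sums the vulnerability counts at or above the given severity.
export function countAtOrAbove(report: AuditReport, threshold: Severity): number {
  return SEVERITY_ORDER.slice(SEVERITY_ORDER.indexOf(threshold)).reduce(
    (sum, level) => sum + (report.metadata.vulnerabilities[level] ?? 0),
    0,
  );
}

// Returns true when the build should fail for the given raw JSON report.
export function shouldFailCi(auditJson: string, threshold: Severity = "critical"): boolean {
  const report = JSON.parse(auditJson) as AuditReport;
  return countAtOrAbove(report, threshold) > 0;
}
```

With the default threshold, a report containing only moderate findings does not fail the build, which matches the policy of blocking on critical issues only.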

**Current status (as of 2026-04-30):**
- Production dependencies: **0 vulnerabilities** ✅
- Dev dependencies: 4 moderate vulnerabilities (esbuild dev server, non-production)

**Dev vulnerabilities explanation:**
All current dev vulnerabilities are in `drizzle-kit` transitive dependencies (`@esbuild-kit/esm-loader`). These affect the esbuild **dev server** only, not the production runtime. The advisory (GHSA-67mh-4wv8-2f99) allows websites to send requests to the dev server, which is irrelevant in production, where esbuild is not deployed.

**When to fix dev vulnerabilities:**
- If severity becomes HIGH or CRITICAL
- If they affect build artifacts (not just the dev server)
- If a new patch is available without breaking changes

**Updating dependencies:**
```bash
# Check for updates
npm outdated

# Update specific package
npm install <package>@latest

# Test after update
npm run typecheck
npm run test
npm run build
```

---

## Incident Response

### Runlist: Database Down

**Symptoms:** `/readyz` returns 503, logs show `ECONNREFUSED` or `Connection terminated`.

**Investigation:**

1. **Check DB container status**:
   ```bash
   docker compose ps postgres
   docker compose logs postgres --tail=50
   ```

2. **Check DB process** (if not containerized):
   ```bash
   systemctl status postgresql
   journalctl -u postgresql -n 50
   ```

3. **Check connectivity**:
   ```bash
   psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -c "SELECT 1"
   ```

**Resolution:**

- **If container is down**: `docker compose up -d postgres`
- **If connection limit reached**: increase `max_connections` in `postgresql.conf`, restart DB
- **If disk full**: extend the volume and free space outside the data directory; do not delete WAL files by hand, fix whatever blocks WAL recycling (failed archiving, stale replication slots) instead
- **If unrecoverable**: restore from backup (see [Database Backup & Restore](#database-backup--restore))

**Post-incident:**
- Review `audit_events` for data loss window
- Document root cause in incident log
- Update alerts if the outage went undetected (false negative)
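
One way to estimate the data loss window is to look for the largest gap between consecutive `audit_events` timestamps around the incident. The following sketch assumes the `created_at` values have already been fetched into an array; `largestGap` is a hypothetical helper name.

```typescript
export interface Gap {
  start: Date;
  end: Date;
  durationMs: number;
}

// Finds the longest interval between two consecutive timestamps, or null if fewer than two.
export function largestGap(timestamps: Date[]): Gap | null {
  const sorted = [...timestamps].sort((a, b) => a.getTime() - b.getTime());
  let best: Gap | null = null;
  let previous: Date | undefined;
  for (const current of sorted) {
    if (previous !== undefined) {
      const durationMs = current.getTime() - previous.getTime();
      if (best === null || durationMs > best.durationMs) {
        best = { start: previous, end: current, durationMs };
      }
    }
    previous = current;
  }
  return best;
}
```

A long gap only indicates that no events were recorded; it should be compared against the normal event rate before treating it as lost data.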

---

### Runlist: OOM (Out of Memory)

**Symptoms:** Service crashes with exit code 137, container restarts, `docker stats` shows memory at limit.

**Investigation:**

1. **Check memory usage**:
   ```bash
   docker stats agenthub --no-stream
   ```

2. **Check for memory leaks** (presence map, rate limit map):
   - `presenceStore` size (bounded by active connections)
   - `socketRateLimits` size (should prune old entries)

3. **Check concurrent connections**:
   ```bash
   curl http://localhost:3000/metrics | grep ws_connections
   ```

**Resolution:**

- **Immediate**: Increase container memory limit (e.g., 512MB → 1GB)
- **Short-term**: Restart service to clear in-memory state
- **Long-term**:
  - Add periodic cleanup for `socketRateLimits` (every 60s, remove entries > 5s old)
  - Monitor `presenceStore` growth, add TTL eviction if needed
  - Profile heap with `node --inspect` + Chrome DevTools
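
The periodic cleanup for `socketRateLimits` could look like the following sketch. The entry shape (`count`, `windowStart`) is an assumption, since the real map may store something different, and `pruneStaleEntries` is a hypothetical name.

```typescript
// Assumed entry shape; the real `socketRateLimits` map may differ.
interface RateLimitEntry {
  count: number;
  windowStart: number; // epoch milliseconds when the current window began
}

// Removes entries whose window started more than maxAgeMs ago; returns how many were removed.
export function pruneStaleEntries(
  limits: Map<string, RateLimitEntry>,
  now: number,
  maxAgeMs = 5000,
): number {
  let removed = 0;
  // Deleting the current key while iterating a Map is safe in JavaScript.
  for (const [socketId, entry] of limits) {
    if (now - entry.windowStart > maxAgeMs) {
      limits.delete(socketId);
      removed++;
    }
  }
  return removed;
}
```

To run it every 60 seconds, schedule it with `setInterval(() => pruneStaleEntries(socketRateLimits, Date.now()), 60000)` at startup and clear the interval on shutdown.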
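
**Prevention:**
- Set container memory limit to 2× expected peak usage
- Capture heap snapshots as the V8 heap nears its limit: `--heapsnapshot-near-heap-limit=3` (this triggers on the V8 heap limit, not on a container kill, so keep `--max-old-space-size` below the container memory limit)

---
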
### Runlist: Rate Limit False Positives

**Symptoms:** Legitimate agents report "Rate limit exceeded", no attack traffic detected.

**Investigation:**

1. **Check current rate limit settings**:
   - REST: 100 req/min unauthenticated, 600 req/min authenticated
   - WS: 30 events/s per socket

2. **Review `audit_events` for legitimate burst**:
   ```sql
   SELECT agent_id, COUNT(*) as events,
          MIN(created_at) as first, MAX(created_at) as last
   FROM audit_events
   WHERE created_at > NOW() - INTERVAL '5 minutes'
   GROUP BY agent_id
   ORDER BY events DESC;
   ```

3. **Check metrics**:
   ```bash
   curl http://localhost:3000/metrics | grep rate_limit
   ```

**Resolution:**

- **Temporary**: Allowlist specific agent IPs (if known safe):
  ```typescript
  // In src/lib/security.ts, update allowList function
  allowList: (request) => {
    const ip = request.ip;
    return request.url === '/healthz' || ip === 'x.x.x.x';
  }
  ```

- **Permanent**: Increase limits if traffic pattern is legitimate:
  - Update `RATE_LIMIT_MAX_EVENTS` in `src/socket/index.ts`
  - Update `max` in `src/lib/security.ts`

**Post-incident:**
- Document legitimate use case
- Consider per-agent custom limits in future

---

## Monitoring & Alerts

### Key Metrics

**Available at `/metrics` (Prometheus format):**

- `ws_connections` (gauge): Active WebSocket connections
- `messages_sent_total` (counter): Total messages sent
- `message_send_latency` (histogram): Message processing latency (p50, p90, p99)
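
For scripted checks, a single unlabeled sample such as `ws_connections` can be read from the `/metrics` text output. This is a simplified sketch, not a full Prometheus parser: `readMetric` is a hypothetical helper that skips comment lines and any sample carrying labels.

```typescript
// Returns the value of the first unlabeled sample with the given name, or null if absent.
export function readMetric(exposition: string, name: string): number | null {
  for (const line of exposition.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue; // skip blanks, HELP and TYPE lines
    const match = /^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)/.exec(trimmed);
    if (match !== null && match[1] === name && match[2] === undefined) {
      return Number(match[3]);
    }
  }
  return null;
}
```

Histogram series such as `message_send_latency_bucket` carry labels and are deliberately ignored here; they are better queried through Prometheus itself.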

**Recommended alerts:**

- `ws_connections > 1000`: High load, consider scaling
- `message_send_latency{quantile="0.99"} > 0.1`: p99 latency > 100ms (Phase 1 SLA violation). The `quantile` label exists only on summary metrics; if the metric is exported as a true histogram, use `histogram_quantile(0.99, rate(message_send_latency_bucket[5m])) > 0.1` instead
- `rate(messages_sent_total[5m]) > 1000`: Unusually high message rate, above 1000 messages/s averaged over 5 minutes (possible abuse)
- `/readyz` returns non-200: Service degraded, DB unreachable

### Health Checks

- **Liveness**: `GET /healthz` (always returns 200 if process is up)
- **Readiness**: `GET /readyz` (returns 200 if DB is reachable, 503 otherwise)
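
The difference between the two endpoints can be sketched as plain functions, independent of the web framework. This is an illustration under assumptions, not the actual AgentHub handlers: `healthz`, `readyz`, `ProbeResult`, and the injected `pingDb` callback (for example, a function that runs `SELECT 1`) are hypothetical.

```typescript
export interface ProbeResult {
  statusCode: number;
  body: { status: string };
}

// Liveness: succeeds whenever the process is able to answer at all.
export function healthz(): ProbeResult {
  return { statusCode: 200, body: { status: "ok" } };
}

// Readiness: succeeds only if the database ping resolves; any rejection maps to 503.
export async function readyz(pingDb: () => Promise<unknown>): Promise<ProbeResult> {
  try {
    await pingDb();
    return { statusCode: 200, body: { status: "ready" } };
  } catch {
    return { statusCode: 503, body: { status: "unavailable" } };
  }
}
```

Keeping the database check out of the liveness probe avoids restarting healthy application containers when only the database is down.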

**Kubernetes probes:**
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /readyz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
```

---

## Security Configuration Reference

### Rate Limits

| Endpoint      | Limit (unauthenticated) | Limit (authenticated) | Window |
|---------------|-------------------------|-----------------------|--------|
| REST API      | 100 requests            | 600 requests          | 1 min  |
| WebSocket     | 30 events               | 30 events             | 1 sec  |
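
The WebSocket row (30 events per 1-second window) can be illustrated with a fixed-window counter. This is a sketch of the general technique, not the code in `src/socket/index.ts`; `FixedWindowLimiter` is a hypothetical class, and the real limiter may use a different algorithm such as a sliding window.

```typescript
interface WindowState {
  count: number;
  windowStart: number; // epoch milliseconds
}

export class FixedWindowLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly maxEvents: number;
  private readonly windowMs: number;

  constructor(maxEvents: number, windowMs: number) {
    this.maxEvents = maxEvents;
    this.windowMs = windowMs;
  }

  // Returns true if the event is allowed, false if the key has used up its window.
  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (current === undefined || now - current.windowStart >= this.windowMs) {
      // First event for this key, or the previous window has ended: start a new one.
      this.windows.set(key, { count: 1, windowStart: now });
      return true;
    }
    if (current.count >= this.maxEvents) return false;
    current.count++;
    return true;
  }
}
```

A fixed window is simple but allows short bursts of up to twice the limit across a window boundary, which is worth keeping in mind when investigating false positives.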

### Security Headers (Helmet)

- **CSP**: `default-src 'self'` (strict, no inline scripts)
- **X-Frame-Options**: DENY
- **Referrer-Policy**: strict-origin
- **HSTS**: Disabled in Phase 1 (HTTP LAN), enable with `ENABLE_HSTS=true` in Phase 2 (HTTPS)

### CORS

Configured via `ALLOWED_ORIGINS` environment variable (comma-separated).

- **Phase 1 (LAN)**: `http://localhost:3000,http://192.168.1.0/24`
- **Phase 2 (Production)**: Specific domain allowlist, no wildcards
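
A comma-separated `ALLOWED_ORIGINS` value can be parsed and checked as shown in this sketch. It assumes exact string matching against the request's `Origin` header, which may not be how AgentHub evaluates origins; under that assumption a CIDR-style entry such as the Phase 1 example would need separate range handling. The function names are hypothetical.

```typescript
// Splits the comma-separated setting into a set of trimmed, non-empty origins.
export function parseAllowedOrigins(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((origin) => origin.trim())
      .filter((origin) => origin.length > 0),
  );
}

// Exact-match check; requests without an Origin header are treated as not allowed.
export function isOriginAllowed(origin: string | undefined, allowed: Set<string>): boolean {
  return origin !== undefined && allowed.has(origin);
}
```

In use, the set would be built once at startup from `process.env.ALLOWED_ORIGINS` and consulted for each incoming request.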

---

## Appendix: Pen-Test Checklist

**Run before each release:**

1. **SQL Injection**: Test all endpoints with payloads like `' OR '1'='1`, `'; DROP TABLE agents--`
2. **Header Injection**: Send malformed headers (e.g., `X-Agent-Id: <script>alert(1)</script>`)
3. **Rate Limit Bypass**: Burst 200 requests in 10 seconds from a single IP
4. **JWT Tampering**: Modify the JWT payload, re-sign with a weak secret, submit
5. **CORS Bypass**: Send a request with `Origin: http://evil.com`, check if accepted
6. **WebSocket Flood**: Connect and send 50 events/s, verify the rate limit triggers
7. **Message Injection**: Send a message with `body: "<script>alert(1)</script>"`, verify escaping

**Expected results:**
- All injections rejected with 400/401/403
- Rate limits are enforced at the defined thresholds
- CORS rejects unauthorized origins
- No script execution in message rendering
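
For the message injection check, the expected escaped form of the payload can be computed with a small helper. This is a generic HTML-escaping sketch, not AgentHub's rendering code; `escapeHtml` is a hypothetical name, and a UI framework's built-in escaping is normally preferable to a hand-written function.

```typescript
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Replaces the five HTML-significant characters with their entity equivalents.
export function escapeHtml(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}
```

A test can then assert that the rendered message contains the escaped string and no literal `<script>` tag.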