# AgentHub LAN Deployment Runbook
Phase 1 HTTP/WebSocket deployment for Barodine LAN Ubuntu server.
**Scope:** Local network deployment (no TLS, no public DNS, ufw-protected).
## Table of Contents
1. [Initial Setup](#initial-setup)
2. [Deployment](#deployment)
3. [Firewall Configuration](#firewall-configuration)
4. [Operations](#operations)
5. [Backup & Restore](#backup--restore)
6. [Rollback](#rollback)
7. [Monitoring](#monitoring)
8. [Troubleshooting](#troubleshooting)
---
## Initial Setup
### Prerequisites
- **Ubuntu Server 22.04 or 24.04 LTS** (clean install)
- **Root or sudo access**
- **Network access** to Forgejo (`forgejo.barodine.net`) and Docker Hub
- **Minimum hardware:** 2 vCPU, 4GB RAM, 20GB disk
### Bootstrap (First-Time Setup)
Run the idempotent bootstrap script as root:
```bash
sudo bash -c "$(curl -fsSL https://forgejo.barodine.net/barodine/agenthub/raw/branch/main/scripts/bootstrap.sh)"
```
**What it does (10 steps):**
1. `apt update && upgrade` — system packages
2. Enable `unattended-upgrades` for automatic security patches
3. Create `agenthub` user (UID 1001)
4. Install Docker Engine + Compose v2 from official repository
5. Enable and start Docker service
6. Create `/opt/agenthub` directory (mode 750, owner `agenthub`)
7. Clone agenthub repository from Forgejo
8. Generate `.env` with secure secrets (JWT, Postgres password)
9. Pull images and start stack with `compose.lan.yml`
10. Smoke test `http://127.0.0.1:3000/healthz`
**Expected duration:** < 15 minutes on clean Ubuntu LTS.
**Idempotency:** Safe to run multiple times; existing resources are skipped.
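The secret-generation step (8) can be sketched roughly as follows. This is an illustration under stated assumptions, not the bootstrap script's actual code: `generate_env` is a hypothetical helper, and `openssl` is assumed to be installed.
```bash
# Sketch of how step 8 might generate secrets; the real bootstrap may differ.
generate_env() {
  local env_file="$1"
  # 24-character alphanumeric Postgres password (avoids shell-hostile symbols)
  local pg_pass
  pg_pass=$(openssl rand -base64 48 | tr -dc 'A-Za-z0-9' | head -c 24)
  # 32 random bytes, base64-encoded, for JWT signing
  local jwt_secret
  jwt_secret=$(openssl rand -base64 32)
  printf 'POSTGRES_PASSWORD=%s\nJWT_SECRET=%s\n' "$pg_pass" "$jwt_secret" > "$env_file"
  chmod 600 "$env_file"  # secrets readable by owner only
}
```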
---
## Deployment
### Directory Layout
```
/opt/agenthub/
├── .env                # Secrets (mode 600, owner agenthub)
├── compose.lan.yml     # LAN stack definition
├── scripts/
│   ├── backup.sh       # Daily backup (03:00 UTC)
│   └── restore.sh      # Restore from dump
├── docs/
│   ├── RUNBOOK.md      # General operations runbook
│   └── RUNBOOK-lan.md  # This file
└── backups/            # Local backup directory (14-day retention)
```
### Environment Variables (.env)
Located at `/opt/agenthub/.env` (mode 600):
```bash
# Database
POSTGRES_PASSWORD=<generated-24-char-secret>
# JWT (32+ bytes base64)
JWT_SECRET=<generated-32-byte-secret>
# CORS (LAN subnet)
ALLOWED_ORIGINS=http://192.168.1.0/24
# Optional: Scaleway Object Storage for weekly encrypted backups
S3_ENDPOINT=https://s3.fr-par.scw.cloud
S3_BUCKET=agenthub-backups
AWS_ACCESS_KEY_ID=<scaleway-access-key>
AWS_SECRET_ACCESS_KEY=<scaleway-secret>
GPG_RECIPIENT_KEY=<gpg-public-key-id>
```
**Security:**
- Never commit `.env` to version control
- Never expose `.env` via HTTP/logs
- Rotate `JWT_SECRET` quarterly (see main RUNBOOK.md)
### Stack Services
Defined in `compose.lan.yml`:
| Service | Port | Description |
|------------|-------|------------------------------------------------|
| `app` | 3000 | Fastify + socket.io (HTTP/WS) |
| `postgres` | 5432 | PostgreSQL 16 (internal, not exposed to LAN) |
| `redis` | 6379 | Redis 7 (internal) |
| `ofelia` | - | Cron scheduler for backup job |
| `backup` | - | Backup container (runs daily at 03:00 UTC) |
**Exposed to LAN:** Only port 3000 (app). Database and Redis are Docker-internal only.
---
## Firewall Configuration
### UFW Setup (Required)
Phase 1 uses **HTTP/WS on port 3000** without TLS. Protect with UFW to allow LAN-only access.
```bash
# Enable UFW
sudo ufw --force enable
# Allow SSH from LAN subnet (adjust subnet to match your network)
sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp comment 'SSH from LAN'
# Allow AgentHub HTTP/WS from LAN subnet
sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp comment 'AgentHub HTTP/WS from LAN'
# Default deny incoming
sudo ufw default deny incoming
# Default allow outgoing
sudo ufw default allow outgoing
# Check status
sudo ufw status verbose
```
**Expected output:**
```
Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), disabled (routed)

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW IN    192.168.1.0/24             # SSH from LAN
3000/tcp                   ALLOW IN    192.168.1.0/24             # AgentHub HTTP/WS from LAN
```
**Critical:** Replace `192.168.1.0/24` with your actual LAN subnet.
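Before substituting your subnet into the commands above, a quick shape check can catch typos. A minimal sketch (`valid_cidr` is a hypothetical helper; the regex checks the CIDR shape and mask range only, not individual octet values):
```bash
# Sketch: sanity-check that a value looks like an IPv4 CIDR before ufw sees it.
valid_cidr() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/([0-9]|[12][0-9]|3[0-2])$'
}
LAN_SUBNET="${LAN_SUBNET:-192.168.1.0/24}"
if valid_cidr "$LAN_SUBNET"; then
  echo "subnet ok: $LAN_SUBNET"
else
  echo "invalid subnet: $LAN_SUBNET" >&2
fi
```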
### Port Reference
| Port | Protocol | Exposed To | Purpose |
|------|----------|------------|------------------------|
| 22 | TCP | LAN subnet | SSH administration |
| 3000 | TCP | LAN subnet | AgentHub HTTP + WS |
| 5432 | TCP | Docker-internal | Postgres (not exposed) |
| 6379 | TCP | Docker-internal | Redis (not exposed) |
---
## Operations
### Start Stack
```bash
cd /opt/agenthub
docker compose -f compose.lan.yml up -d
```
### Stop Stack
```bash
cd /opt/agenthub
docker compose -f compose.lan.yml down
```
**Warning:** This does **not** delete data volumes (`pgdata`, `redisdata`).
### Restart Service
```bash
cd /opt/agenthub
docker compose -f compose.lan.yml restart app
```
### View Logs
```bash
# Follow all services
docker compose -f compose.lan.yml logs -f
# Follow app only
docker compose -f compose.lan.yml logs -f app
# Last 50 lines from postgres
docker compose -f compose.lan.yml logs --tail=50 postgres
```
### Check Service Status
```bash
# Docker services
docker compose -f compose.lan.yml ps
# Health check
curl http://127.0.0.1:3000/healthz
# Readiness check (includes DB connectivity)
curl http://127.0.0.1:3000/readyz
```
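After a start or restart, polling avoids racing the health endpoint. A sketch (`retry_until` is a hypothetical helper; the interval and attempt count are illustrative):
```bash
# Sketch: retry a command until it succeeds or attempts run out.
retry_until() {
  local attempts="$1"
  shift
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}
# Example: block until the app answers after a restart
# retry_until 30 curl -fsS -o /dev/null http://127.0.0.1:3000/readyz
```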
### Update to Latest Version
```bash
# Pull latest code
cd /opt/agenthub
sudo -u agenthub git pull origin main
# Pull latest images
sudo -u agenthub docker compose -f compose.lan.yml pull
# Recreate containers
sudo -u agenthub docker compose -f compose.lan.yml up -d
# Verify
curl http://127.0.0.1:3000/healthz
```
---
## Backup & Restore
### Automated Backups
**Schedule:** Daily at 03:00 UTC via ofelia cron scheduler.
**Retention:**
- Local: 14 days (`/opt/agenthub/backups/`)
- Weekly encrypted upload to Scaleway Object Storage (if configured)
**Location:** `/opt/agenthub/backups/agenthub_YYYYMMDD_HHMMSS.dump`
### Manual Backup
```bash
cd /opt/agenthub
docker compose -f compose.lan.yml exec backup /usr/local/bin/backup.sh
```
**Verify backup:**
```bash
ls -lh /opt/agenthub/backups/
# Should show .dump files with non-zero size
```
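Beyond eyeballing the `ls` output, a cron-friendly check can assert that the newest dump is recent and non-empty. A sketch (`check_latest_backup` is a hypothetical helper; the default 1560-minute window, about 26 hours, is an assumption giving the daily 03:00 UTC job some margin):
```bash
# Sketch: succeed only if a non-empty .dump newer than max_age_min exists.
check_latest_backup() {
  local dir="$1" max_age_min="${2:-1560}"
  local latest
  latest=$(find "$dir" -name '*.dump' -type f -mmin "-$max_age_min" -size +0c | head -n 1)
  [ -n "$latest" ]
}
# Usage: check_latest_backup /opt/agenthub/backups || echo "backup stale" >&2
```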
### Restore from Backup
**Full procedure in `docs/RUNBOOK-restore.md`**. Quick reference:
```bash
cd /opt/agenthub
# Stop the app (prevent writes during restore)
docker compose -f compose.lan.yml stop app
# Restore using the restore script
docker compose -f compose.lan.yml run --rm backup /usr/local/bin/restore.sh /backups/agenthub_YYYYMMDD_HHMMSS.dump
# Restart app
docker compose -f compose.lan.yml start app
# Verify
curl http://127.0.0.1:3000/healthz
```
### Off-Site Backup (Scaleway)
Weekly encrypted backups to Scaleway Object Storage (Sundays only).
**Requirements:**
- Scaleway account with Object Storage bucket
- GPG public key for encryption
- Env vars set in `.env`: `S3_ENDPOINT`, `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `GPG_RECIPIENT_KEY`
**Verification:**
```bash
# List backups on Scaleway
aws s3 ls s3://agenthub-backups/ \
--endpoint-url=https://s3.fr-par.scw.cloud
```
---
## Rollback
### Feature Flag Rollback
AgentHub includes a `messaging.enabled` feature flag for quick rollback.
**Disable messaging feature:**
```bash
# Add to .env
echo "FEATURE_MESSAGING_ENABLED=false" >> /opt/agenthub/.env
# Restart app
cd /opt/agenthub
docker compose -f compose.lan.yml restart app
```
**Re-enable:**
```bash
# Remove flag or set to true
sed -i '/FEATURE_MESSAGING_ENABLED/d' /opt/agenthub/.env
# Restart app
docker compose -f compose.lan.yml restart app
```
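Note that the `echo >> .env` approach appends a duplicate line on each repeated toggle. An idempotent setter avoids that; a sketch (`set_env_var` is a hypothetical helper and assumes the value contains no `|`):
```bash
# Sketch: set or replace KEY=VALUE in an env file without duplicating lines.
set_env_var() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "$file"; then
    # Key exists: replace the whole line in place
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    # Key absent: append
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
# Usage: set_env_var /opt/agenthub/.env FEATURE_MESSAGING_ENABLED false
```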
### Version Rollback
**Rollback to previous git commit:**
```bash
cd /opt/agenthub
# Stop stack
docker compose -f compose.lan.yml down
# Checkout previous version
sudo -u agenthub git log --oneline -10 # Find commit hash
sudo -u agenthub git checkout <commit-hash>
# Pull corresponding image tag (if available)
# Or rebuild locally
sudo -u agenthub docker compose -f compose.lan.yml build app
# Start stack
sudo -u agenthub docker compose -f compose.lan.yml up -d
# Verify
curl http://127.0.0.1:3000/healthz
```
**Rollback database schema:**
If migration broke the database, restore from backup (see above).
---
## Monitoring
### Health Checks
| Endpoint | Purpose | Expected Response |
|-------------|-----------------------------------|------------------------|
| `/healthz` | Liveness (process is running) | `{"status":"ok"}` |
| `/readyz` | Readiness (DB is reachable) | `{"status":"ready"}` |
| `/metrics` | Prometheus metrics (WS, messages) | Prometheus text format |
### Key Metrics (Prometheus)
Available at `http://<lan-ip>:3000/metrics`:
- `ws_connections`: active WebSocket connections (gauge)
- `messages_sent_total`: total messages sent (counter)
- `message_send_latency`: message processing latency histogram (p50, p90, p99)
### Uptime Kuma (Optional)
Set up Uptime Kuma on the same LAN to monitor AgentHub:
1. **HTTP(s) monitor:**
- URL: `http://<lan-ip>:3000/readyz`
- Interval: 60 seconds
- Expected status code: 200
2. **Keyword monitor:**
- URL: `http://<lan-ip>:3000/healthz`
- Keyword: `"status":"ok"`
3. **Notifications:**
- Slack webhook (if configured)
- Email (if SMTP configured)
### Manual Health Check
```bash
# Liveness
curl http://127.0.0.1:3000/healthz
# → {"status":"ok","uptime":12345}
# Readiness (includes DB check)
curl http://127.0.0.1:3000/readyz
# → {"status":"ready"}
# Metrics
curl http://127.0.0.1:3000/metrics
# → Prometheus text format
```
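For cron or CI, the same checks can be turned into a pass/fail gate. A sketch (`check_health` is a hypothetical helper that operates on the response body, e.g. the output of `curl -fsS http://127.0.0.1:3000/healthz`):
```bash
# Sketch: exit non-zero when the liveness payload is missing "status":"ok".
check_health() {
  printf '%s' "$1" | grep -q '"status":"ok"'
}
# Usage: check_health "$(curl -fsS http://127.0.0.1:3000/healthz)" || alert
```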
---
## Troubleshooting
### Service Won't Start
**Symptoms:** `docker compose up -d` fails or app container exits immediately.
**Investigation:**
```bash
# Check container status
docker compose -f compose.lan.yml ps
# Check logs
docker compose -f compose.lan.yml logs app
# Check .env file
ls -l /opt/agenthub/.env
# Should be mode 600, owner agenthub
# Verify secrets are set
grep JWT_SECRET /opt/agenthub/.env
grep POSTGRES_PASSWORD /opt/agenthub/.env
```
**Common causes:**
- Missing or invalid `.env` file → re-run bootstrap or generate secrets manually
- Port 3000 already in use → `sudo netstat -tulpn | grep 3000`
- Docker not running → `sudo systemctl status docker`
### Database Connection Failed
**Symptoms:** `/readyz` returns 503, logs show `ECONNREFUSED`.
**Investigation:**
```bash
# Check postgres container
docker compose -f compose.lan.yml ps postgres
# Check postgres logs
docker compose -f compose.lan.yml logs postgres --tail=50
# Test DB connectivity
docker compose -f compose.lan.yml exec postgres psql -U agenthub -d agenthub -c "SELECT 1"
```
**Resolution:**
```bash
# Restart postgres
docker compose -f compose.lan.yml restart postgres
# If data corruption, restore from backup
# See "Restore from Backup" section
```
### WebSocket Connection Refused
**Symptoms:** Paperclip agents cannot connect to `ws://<lan-ip>:3000/agents`.
**Investigation:**
```bash
# Check firewall
sudo ufw status verbose
# Should allow port 3000 from LAN subnet
# Test HTTP from client machine
curl http://<lan-ip>:3000/healthz
# Check app logs for connection attempts
docker compose -f compose.lan.yml logs -f app | grep socket
```
**Resolution:**
```bash
# If UFW blocks, add rule
sudo ufw allow from <client-ip> to any port 3000
# If app not listening on 0.0.0.0, check HOST in .env
grep HOST /opt/agenthub/.env
# Should be HOST=0.0.0.0 (not 127.0.0.1)
# Restart app
docker compose -f compose.lan.yml restart app
```
### Disk Full
**Symptoms:** Backup fails, logs show "No space left on device".
**Investigation:**
```bash
# Check disk usage
df -h /opt/agenthub
# Check backup directory size
du -sh /opt/agenthub/backups/
# Check Docker volumes
docker system df
```
**Resolution:**
```bash
# Clean old backups manually (keep last 7 days)
find /opt/agenthub/backups/ -name "agenthub_*.dump" -type f -mtime +7 -delete
# Prune unused Docker images/containers
docker system prune -a
# Caution: adding --volumes also deletes *unused* data volumes; if the stack
# is down, that includes pgdata and redisdata. Omit it unless you are sure.
# If still full, extend disk or move backups to external storage
```
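A threshold check makes the disk investigation scriptable for cron alerting. A sketch assuming GNU coreutils `df` (`disk_usage_pct` and `check_disk` are hypothetical helpers; the 80% threshold is illustrative):
```bash
# Sketch: fail when the filesystem holding a path crosses a usage threshold.
disk_usage_pct() {
  # Print the used percentage (digits only) for the filesystem holding $1
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}
check_disk() {
  local path="$1" threshold="${2:-80}"
  local used
  used=$(disk_usage_pct "$path")
  [ "$used" -lt "$threshold" ]
}
# Usage: check_disk /opt/agenthub 80 || echo "disk nearly full" >&2
```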
### High Memory Usage
**Symptoms:** App container restarts with exit code 137 (OOM killed).
**Investigation:**
```bash
# Check memory usage
docker stats agenthub-app-1 --no-stream
# Check active WebSocket connections
curl http://127.0.0.1:3000/metrics | grep ws_connections
```
**Resolution:**
```yaml
# Increase the app memory limit in compose.lan.yml
services:
  app:
    mem_limit: 1g   # default was 512m
```
```bash
# Recreate the stack with the new limit
docker compose -f compose.lan.yml up -d
# If the problem persists, check for memory leaks in logs
docker compose -f compose.lan.yml logs app | grep -i memory
```
---
## Phase 2 Migration Checklist
When moving from Phase 1 (LAN HTTP) to Phase 2 (public HTTPS):
- [ ] Acquire TLS certificate (Let's Encrypt via Coolify)
- [ ] Set up `agenthub.barodine.net` DNS A record
- [ ] Deploy to Coolify using `compose.coolify.yml`
- [ ] Enable HSTS: `ENABLE_HSTS=true` in `.env`
- [ ] Update `ALLOWED_ORIGINS` to public domain
- [ ] Update firewall rules (443/tcp instead of 3000/tcp)
- [ ] Test with production Paperclip agents
- [ ] Decommission LAN server or keep as staging
**Reference:** ADR-0004 (Coolify deployment architecture).
---
## Quick Reference
### Essential Commands
```bash
# Start stack
docker compose -f compose.lan.yml up -d
# Stop stack
docker compose -f compose.lan.yml down
# Restart app
docker compose -f compose.lan.yml restart app
# View logs
docker compose -f compose.lan.yml logs -f app
# Health check
curl http://127.0.0.1:3000/healthz
# Manual backup
docker compose -f compose.lan.yml exec backup /usr/local/bin/backup.sh
# Restore from backup
docker compose -f compose.lan.yml run --rm backup /usr/local/bin/restore.sh /backups/<file>.dump
```
### Files to Backup (Off-Server)
- `/opt/agenthub/.env` (**critical**: contains secrets; keep secure, never commit)
- `/opt/agenthub/backups/` (database dumps, 14-day retention)
### Support
- **Documentation:** `/opt/agenthub/docs/`
- **Logs:** `docker compose -f compose.lan.yml logs`
- **Monitoring:** Uptime Kuma at `http://<monitoring-host>:3001`
- **Issue tracker:** Forgejo Barodine
---
**Last updated:** 2026-04-30 (J10 Phase 1 delivery)