# AgentHub LAN Deployment Runbook

Phase 1 HTTP/WebSocket deployment for the Barodine LAN Ubuntu server.

**Scope:** Local network deployment (no TLS, no public DNS, ufw-protected).

## Table of Contents

1. [Initial Setup](#initial-setup)
2. [Deployment](#deployment)
3. [Firewall Configuration](#firewall-configuration)
4. [Operations](#operations)
5. [Backup & Restore](#backup--restore)
6. [Rollback](#rollback)
7. [Monitoring](#monitoring)
8. [Troubleshooting](#troubleshooting)
9. [Phase 2 Migration Checklist](#phase-2-migration-checklist)
10. [Quick Reference](#quick-reference)

---

## Initial Setup

### Prerequisites

- **Ubuntu Server 22.04 or 24.04 LTS** (clean install)
- **Root or sudo access**
- **Network access** to Forgejo (`forgejo.barodine.net`) and Docker Hub
- **Minimum hardware:** 2 vCPU, 4 GB RAM, 20 GB disk

### Bootstrap (First-Time Setup)

Run the idempotent bootstrap script as root:

```bash
sudo bash -c "$(curl -fsSL https://forgejo.barodine.net/barodine/agenthub/raw/branch/main/scripts/bootstrap.sh)"
```

**What it does (10 steps):**

1. `apt update && apt upgrade`: update system packages
2. Enable `unattended-upgrades` for automatic security patches
3. Create `agenthub` user (UID 1001)
4. Install Docker Engine + Compose v2 from the official repository
5. Enable and start the Docker service
6. Create `/opt/agenthub` directory (mode 750, owner `agenthub`)
7. Clone the agenthub repository from Forgejo
8. Generate `.env` with secure secrets (JWT, Postgres password)
9. Pull images and start the stack with `compose.lan.yml`
10. Smoke test `http://127.0.0.1:3000/healthz`

**Expected duration:** under 15 minutes on a clean Ubuntu LTS install.

**Idempotency:** Safe to run multiple times; existing resources are skipped.

---

## Deployment

### Directory Layout

```
/opt/agenthub/
├── .env                    # Secrets (mode 600, owner agenthub)
├── compose.lan.yml         # LAN stack definition
├── scripts/
│   ├── backup.sh           # Daily backup (03:00 UTC)
│   └── restore.sh          # Restore from dump
├── docs/
│   ├── RUNBOOK.md          # General operations runbook
│   └── RUNBOOK-lan.md      # This file
└── backups/                # Local backup directory (14-day retention)
```

### Environment Variables (.env)

Located at `/opt/agenthub/.env` (mode 600):

```bash
# Database
POSTGRES_PASSWORD=<generated-24-char-secret>

# JWT (32+ bytes base64)
JWT_SECRET=<generated-32-byte-secret>

# CORS (LAN subnet). Browsers send an exact Origin (scheme://host:port),
# so confirm the app accepts CIDR notation here; otherwise list explicit origins.
ALLOWED_ORIGINS=http://192.168.1.0/24

# Optional: Scaleway Object Storage for weekly encrypted backups
S3_ENDPOINT=https://s3.fr-par.scw.cloud
S3_BUCKET=agenthub-backups
AWS_ACCESS_KEY_ID=<scaleway-access-key>
AWS_SECRET_ACCESS_KEY=<scaleway-secret>
GPG_RECIPIENT_KEY=<gpg-public-key-id>
```

**Security:**
- Never commit `.env` to version control
- Never expose `.env` via HTTP or logs
- Rotate `JWT_SECRET` quarterly (see `docs/RUNBOOK.md`)

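
If the bootstrap script is unavailable, the two generated secrets can be produced by hand. The following is a minimal sketch, not the bootstrap script's actual method: `gen_secret` and `print_env_secrets` are hypothetical helper names, and the byte counts are chosen only to match the lengths shown in the listing above (18 random bytes encode to 24 base64 characters, 32 bytes to 44).

```bash
# Hypothetical helpers: generate secrets matching the lengths in the .env listing.

# Print N random bytes from the kernel, base64-encoded on a single line.
gen_secret() {
  head -c "$1" /dev/urandom | base64 | tr -d '\n'
}

# Print the two generated .env lines to stdout (nothing is written to disk here).
print_env_secrets() {
  printf 'POSTGRES_PASSWORD=%s\n' "$(gen_secret 18)"   # 18 bytes -> 24 characters
  printf 'JWT_SECRET=%s\n' "$(gen_secret 32)"          # 32 bytes -> 44 characters
}
```

Redirect the output of `print_env_secrets` into a scratch file, review it, and merge it into `/opt/agenthub/.env` as the `agenthub` user with mode 600. Base64 output can contain `+`, `/` and `=`, which need escaping if the Postgres password is ever embedded in a connection URL.
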
### Stack Services

Defined in `compose.lan.yml`:

| Service    | Port | Description                                  |
|------------|------|----------------------------------------------|
| `app`      | 3000 | Fastify + socket.io (HTTP/WS)                |
| `postgres` | 5432 | PostgreSQL 16 (internal, not exposed to LAN) |
| `redis`    | 6379 | Redis 7 (internal)                           |
| `ofelia`   | -    | Cron scheduler for backup job                |
| `backup`   | -    | Backup container (runs daily at 03:00 UTC)   |

**Exposed to LAN:** Only port 3000 (app). Database and Redis are Docker-internal only.

---

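## Firewall Configuration

### UFW Setup (Required)

Phase 1 uses **HTTP/WS on port 3000** without TLS. Use UFW to restrict access to the LAN subnet. Ports published by Docker are handled by Docker's own iptables rules and can bypass UFW's incoming filter, so confirm from a host outside the subnet that port 3000 is unreachable.
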
```bash
# Set default policies first
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow SSH from LAN subnet (adjust subnet to match your network)
sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp comment 'SSH from LAN'

# Allow AgentHub HTTP/WS from LAN subnet
sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp comment 'AgentHub HTTP/WS from LAN'

# Enable UFW last, so the SSH rule is in place before filtering starts
sudo ufw --force enable

# Check status
sudo ufw status verbose
```

**Expected output:**

```
Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), disabled (routed)

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW IN    192.168.1.0/24             # SSH from LAN
3000/tcp                   ALLOW IN    192.168.1.0/24             # AgentHub HTTP/WS from LAN
```

**Critical:** Replace `192.168.1.0/24` with your actual LAN subnet.

### Port Reference

| Port | Protocol | Exposed To      | Purpose                |
|------|----------|-----------------|------------------------|
| 22   | TCP      | LAN subnet      | SSH administration     |
| 3000 | TCP      | LAN subnet      | AgentHub HTTP + WS     |
| 5432 | TCP      | Docker-internal | Postgres (not exposed) |
| 6379 | TCP      | Docker-internal | Redis (not exposed)    |

---

## Operations

### Start Stack

```bash
cd /opt/agenthub
docker compose -f compose.lan.yml up -d
```

### Stop Stack

```bash
cd /opt/agenthub
docker compose -f compose.lan.yml down
```

**Note:** `down` does **not** delete the data volumes (`pgdata`, `redisdata`). Only `down -v` removes them.

### Restart Service

```bash
cd /opt/agenthub
docker compose -f compose.lan.yml restart app
```

### View Logs

```bash
# Follow all services
docker compose -f compose.lan.yml logs -f

# Follow app only
docker compose -f compose.lan.yml logs -f app

# Last 50 lines from postgres
docker compose -f compose.lan.yml logs --tail=50 postgres
```

### Check Service Status

```bash
# Docker services
docker compose -f compose.lan.yml ps

# Health check
curl http://127.0.0.1:3000/healthz

# Readiness check (includes DB connectivity)
curl http://127.0.0.1:3000/readyz
```

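
Right after a start or restart, the readiness endpoint may take a few seconds to come up, so a single `curl` can fail spuriously. A small retry loop avoids that. This is a sketch only: `wait_until` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary choices rather than values from this deployment.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    # Success as soon as the probed command exits 0
    "$@" >/dev/null 2>&1 && return 0
    # Give up once the last allowed attempt has failed
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}
```

For example, `wait_until 30 2 curl -fsS http://127.0.0.1:3000/readyz` polls for up to about a minute. The `-f` flag makes `curl` exit non-zero on HTTP errors such as 503, which is what lets the loop tell "not ready" from "ready".
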
### Update to Latest Version

```bash
# Pull latest code
cd /opt/agenthub
sudo -u agenthub git pull origin main

# Pull latest images
sudo -u agenthub docker compose -f compose.lan.yml pull

# Recreate containers
sudo -u agenthub docker compose -f compose.lan.yml up -d

# Verify
curl http://127.0.0.1:3000/healthz
```

---

## Backup & Restore

### Automated Backups

**Schedule:** Daily at 03:00 UTC via the ofelia cron scheduler.

**Retention:**
- Local: 14 days (`/opt/agenthub/backups/`)
- Weekly encrypted upload to Scaleway Object Storage (if configured)

**Location:** `/opt/agenthub/backups/agenthub_YYYYMMDD_HHMMSS.dump`

### Manual Backup

```bash
cd /opt/agenthub
docker compose -f compose.lan.yml exec backup /usr/local/bin/backup.sh
```

**Verify backup:**

```bash
ls -lh /opt/agenthub/backups/
# Should show .dump files with non-zero size
```

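
The "non-zero size" check above can be scripted so that a missed nightly run is noticed. A minimal sketch, assuming the `agenthub_*.dump` naming shown earlier; `has_recent_backup` is a hypothetical helper, and it relies on the `-maxdepth` and `-mmin` options of GNU `find` as shipped with Ubuntu.

```bash
# Hypothetical helper: succeed if DIR contains a non-empty agenthub_*.dump
# that was modified within the last MAX_AGE_MIN minutes.
has_recent_backup() {
  dir=$1
  max_age_min=$2
  found=$(find "$dir" -maxdepth 1 -type f -name 'agenthub_*.dump' \
            -size +0 -mmin "-$max_age_min" | head -n 1)
  [ -n "$found" ]
}
```

With a daily schedule, `has_recent_backup /opt/agenthub/backups 1500 || echo "no fresh backup"` allows 25 hours before complaining. The check only proves that a file exists and is not empty; it says nothing about whether the dump can actually be restored.
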
### Restore from Backup

**Full procedure in `docs/RUNBOOK-restore.md`**. Quick reference:

```bash
cd /opt/agenthub

# Stop the app (prevent writes during restore)
docker compose -f compose.lan.yml stop app

# Restore using the restore script
docker compose -f compose.lan.yml run --rm backup /usr/local/bin/restore.sh /backups/agenthub_YYYYMMDD_HHMMSS.dump

# Restart app
docker compose -f compose.lan.yml start app

# Verify
curl http://127.0.0.1:3000/healthz
```

### Off-Site Backup (Scaleway)

Encrypted backups are uploaded to Scaleway Object Storage once a week, on Sundays.

**Requirements:**
- Scaleway account with an Object Storage bucket
- GPG public key for encryption
- Env vars set in `.env`: `S3_ENDPOINT`, `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `GPG_RECIPIENT_KEY`

**Verification:**

```bash
# List backups on Scaleway
aws s3 ls s3://agenthub-backups/ \
  --endpoint-url=https://s3.fr-par.scw.cloud
```

---

## Rollback

### Feature Flag Rollback

AgentHub includes a `messaging.enabled` feature flag for quick rollback.

**Disable messaging feature:**

```bash
# Add to .env
echo "FEATURE_MESSAGING_ENABLED=false" >> /opt/agenthub/.env

# Recreate app so the new value is loaded
# (a plain `restart` keeps the old container environment)
cd /opt/agenthub
docker compose -f compose.lan.yml up -d --force-recreate app
```

**Re-enable:**

```bash
# Remove flag or set to true
sed -i '/FEATURE_MESSAGING_ENABLED/d' /opt/agenthub/.env

# Recreate app so the change is loaded
cd /opt/agenthub
docker compose -f compose.lan.yml up -d --force-recreate app
```

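
Appending with `echo >>` adds a second line if the flag is already present in `.env`. An idempotent alternative is sketched below. `set_env_var` is a hypothetical helper, not part of the AgentHub scripts, and it assumes plain `KEY=value` lines with no `export` prefix.

```bash
# Hypothetical helper: set KEY=VALUE in an env file, replacing any existing
# assignment of KEY instead of appending a duplicate.
set_env_var() {
  file=$1
  key=$2
  value=$3
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  # Copy every line except existing assignments of this key
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  chmod 600 "$tmp"
  # Replace the original in one step
  mv "$tmp" "$file"
}
```

Usage would look like `set_env_var /opt/agenthub/.env FEATURE_MESSAGING_ENABLED false`, run as the `agenthub` user so that ownership of the file is preserved, followed by recreating the `app` container.
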
### Version Rollback

**Rollback to previous git commit:**

```bash
cd /opt/agenthub

# Stop stack
docker compose -f compose.lan.yml down

# Checkout previous version
sudo -u agenthub git log --oneline -10  # Find commit hash
sudo -u agenthub git checkout <commit-hash>

# Pull corresponding image tag (if available)
# Or rebuild locally
sudo -u agenthub docker compose -f compose.lan.yml build app

# Start stack
sudo -u agenthub docker compose -f compose.lan.yml up -d

# Verify
curl http://127.0.0.1:3000/healthz
```

**Rollback database schema:**

If a migration broke the database, restore from the most recent good backup (see [Restore from Backup](#restore-from-backup)).

---

## Monitoring

### Health Checks

| Endpoint   | Purpose                           | Expected Response      |
|------------|-----------------------------------|------------------------|
| `/healthz` | Liveness (process is running)     | `{"status":"ok"}`      |
| `/readyz`  | Readiness (DB is reachable)       | `{"status":"ready"}`   |
| `/metrics` | Prometheus metrics (WS, messages) | Prometheus text format |

### Key Metrics (Prometheus)

Available at `http://<lan-ip>:3000/metrics`:

- `ws_connections`: active WebSocket connections (gauge)
- `messages_sent_total`: total messages sent (counter)
- `message_send_latency`: message processing latency (histogram, used to derive p50, p90, p99)

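
For a quick look without a Prometheus server, a single value can be pulled out of the text output with `awk`. The sketch below uses a hypothetical helper, `metric_value`, and assumes the metric is exposed either without labels or with label values that contain no spaces.

```bash
# Hypothetical helper: print the value of the first sample of metric NAME
# from Prometheus text format read on stdin.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                                            # skip HELP/TYPE lines
    $1 == name || index($1, name "{") == 1 { print $2; exit }
  '
}
```

For example: `curl -fsS http://127.0.0.1:3000/metrics | metric_value ws_connections`. The helper prints nothing when the metric is absent, so an empty result is itself a signal worth checking.
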
### Uptime Kuma (Optional)

Set up Uptime Kuma on the same LAN to monitor AgentHub:

1. **HTTP(s) monitor:**
   - URL: `http://<lan-ip>:3000/readyz`
   - Interval: 60 seconds
   - Expected status code: 200

2. **Keyword monitor:**
   - URL: `http://<lan-ip>:3000/healthz`
   - Keyword: `"status":"ok"`

3. **Notifications:**
   - Slack webhook (if configured)
   - Email (if SMTP configured)

### Manual Health Check

```bash
# Liveness
curl http://127.0.0.1:3000/healthz
# → {"status":"ok","uptime":12345}

# Readiness (includes DB check)
curl http://127.0.0.1:3000/readyz
# → {"status":"ready"}

# Metrics
curl http://127.0.0.1:3000/metrics
# → Prometheus text format
```

---

## Troubleshooting

### Service Won't Start

**Symptoms:** `docker compose up -d` fails or the app container exits immediately.

**Investigation:**

```bash
# Check container status
docker compose -f compose.lan.yml ps

# Check logs
docker compose -f compose.lan.yml logs app

# Check .env file
ls -l /opt/agenthub/.env
# Should be mode 600, owner agenthub

# Verify secrets are set without printing their values
sudo grep -cE '^(JWT_SECRET|POSTGRES_PASSWORD)=.+' /opt/agenthub/.env
# Should print 2
```

**Common causes:**

- Missing or invalid `.env` file → Re-run bootstrap or generate secrets manually
- Port 3000 already in use → `sudo ss -tulpn | grep ':3000'`
- Docker not running → `sudo systemctl status docker`

### Database Connection Failed

**Symptoms:** `/readyz` returns 503, logs show `ECONNREFUSED`.

**Investigation:**

```bash
# Check postgres container
docker compose -f compose.lan.yml ps postgres

# Check postgres logs
docker compose -f compose.lan.yml logs --tail=50 postgres

# Test DB connectivity
docker compose -f compose.lan.yml exec postgres psql -U agenthub -d agenthub -c "SELECT 1"
```

**Resolution:**

```bash
# Restart postgres
docker compose -f compose.lan.yml restart postgres

# If data corruption, restore from backup
# See "Restore from Backup" section
```

### WebSocket Connection Refused

**Symptoms:** Paperclip agents cannot connect to `ws://<lan-ip>:3000/agents`.

**Investigation:**

```bash
# Check firewall
sudo ufw status verbose
# Should allow port 3000 from LAN subnet

# Test HTTP from client machine
curl http://<lan-ip>:3000/healthz

# Check app logs for connection attempts
docker compose -f compose.lan.yml logs -f app | grep -i socket
```

**Resolution:**

```bash
# If UFW blocks, add rule
sudo ufw allow from <client-ip> to any port 3000 proto tcp

# If app not listening on 0.0.0.0, check HOST in .env
grep '^HOST=' /opt/agenthub/.env
# Should be HOST=0.0.0.0 (not 127.0.0.1)

# Recreate app so .env changes are loaded
# (a plain `restart` keeps the old container environment)
docker compose -f compose.lan.yml up -d --force-recreate app
```

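
Before adding a per-client rule, it helps to confirm whether the client's address already falls inside the subnet that UFW allows. The arithmetic can be sketched in plain shell. `ip_to_int` and `ip_in_cidr` are hypothetical helpers, IPv4 only, and they assume well-formed dotted-quad input and 64-bit shell arithmetic.

```bash
# Hypothetical helpers: test whether an IPv4 address lies inside a CIDR block.

# Convert a dotted quad (a.b.c.d) to its 32-bit integer value.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address on "." into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Usage: ip_in_cidr ADDRESS NETWORK/PREFIX
ip_in_cidr() {
  net=${2%/*}          # part before the slash
  bits=${2#*/}         # prefix length after the slash
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
}
```

For example, `ip_in_cidr 192.168.1.42 192.168.1.0/24 && echo "covered by the LAN rule"`. If the client is outside the subnet, the existing rules will never match it, and either the subnet in the rule or the client's network configuration needs attention.
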
### Disk Full

**Symptoms:** Backup fails, logs show "No space left on device".

**Investigation:**

```bash
# Check disk usage
df -h /opt/agenthub

# Check backup directory size
du -sh /opt/agenthub/backups/

# Check Docker disk usage (images, containers, volumes)
docker system df
```

**Resolution:**

```bash
# Clean old backups manually (keep last 7 days)
find /opt/agenthub/backups/ -name "agenthub_*.dump" -type f -mtime +7 -delete

# Prune unused Docker images/containers
# (do not add --volumes: it can delete pgdata/redisdata while the stack is stopped)
docker system prune -a

# If still full, extend disk or move backups to external storage
```

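
Disk pressure is easier to handle before backups start failing. The sketch below turns the `df` check into a threshold test that could run from cron or a monitoring hook. Both function names are hypothetical, and the 85 percent figure in the usage line is an arbitrary example rather than a documented limit.

```bash
# Hypothetical helpers: warn when a filesystem crosses a usage threshold.

# Extract the "Capacity" percentage from `df -P` output read on stdin.
usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the filesystem holding PATH is at or above THRESHOLD percent.
disk_over_threshold() {
  pct=$(df -P "$1" | usage_pct)
  [ "$pct" -ge "$2" ]
}
```

For example: `disk_over_threshold /opt/agenthub 85 && echo "disk usage high"`. The `-P` flag asks `df` for the POSIX output format, which keeps each filesystem on one line so the column position is stable.
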
### High Memory Usage

**Symptoms:** App container restarts with exit code 137 (OOM killed).

**Investigation:**

```bash
# Check memory usage
docker stats agenthub-app-1 --no-stream

# Check active WebSocket connections
curl http://127.0.0.1:3000/metrics | grep ws_connections
```

**Resolution:**

```bash
# Increase the container memory limit by editing compose.lan.yml:
#
#   services:
#     app:
#       mem_limit: 1g   # Default was 512m

# Recreate the container so the new limit applies
docker compose -f compose.lan.yml up -d

# If problem persists, check for memory leaks in logs
docker compose -f compose.lan.yml logs app | grep -i memory
```

---

## Phase 2 Migration Checklist

When moving from Phase 1 (LAN HTTP) to Phase 2 (public HTTPS):

- [ ] Acquire TLS certificate (Let's Encrypt via Coolify)
- [ ] Set up `agenthub.barodine.net` DNS A record
- [ ] Deploy to Coolify using `compose.coolify.yml`
- [ ] Enable HSTS: `ENABLE_HSTS=true` in `.env`
- [ ] Update `ALLOWED_ORIGINS` to public domain
- [ ] Update firewall rules (443/tcp instead of 3000/tcp)
- [ ] Test with production Paperclip agents
- [ ] Decommission LAN server or keep as staging

**Reference:** ADR-0004 (Coolify deployment architecture).

---

## Quick Reference

### Essential Commands

```bash
# Start stack
docker compose -f compose.lan.yml up -d

# Stop stack
docker compose -f compose.lan.yml down

# Restart app
docker compose -f compose.lan.yml restart app

# View logs
docker compose -f compose.lan.yml logs -f app

# Health check
curl http://127.0.0.1:3000/healthz

# Manual backup
docker compose -f compose.lan.yml exec backup /usr/local/bin/backup.sh

# Restore from backup
docker compose -f compose.lan.yml run --rm backup /usr/local/bin/restore.sh /backups/<file>.dump
```

### Files to Backup (Off-Server)

- `/opt/agenthub/.env`: **Critical**, contains secrets (keep secure, never commit)
- `/opt/agenthub/backups/`: Database dumps (14-day retention)

### Support

- **Documentation:** `/opt/agenthub/docs/`
- **Logs:** `docker compose -f compose.lan.yml logs`
- **Monitoring:** Uptime Kuma at `http://<monitoring-host>:3001`
- **Issue tracker:** Forgejo Barodine

---

**Last updated:** 2026-04-30 (J10 Phase 1 delivery)