# AgentHub LAN Deployment Runbook

Phase 1 HTTP/WebSocket deployment for the Barodine LAN Ubuntu server.

**Scope:** Local network deployment (no TLS, no public DNS, UFW-protected).

## Table of Contents

1. [Initial Setup](#initial-setup)
2. [Deployment](#deployment)
3. [Firewall Configuration](#firewall-configuration)
4. [Operations](#operations)
5. [Backup & Restore](#backup--restore)
6. [Rollback](#rollback)
7. [Monitoring](#monitoring)
8. [Troubleshooting](#troubleshooting)

---

## Initial Setup

### Prerequisites

- **Ubuntu Server 22.04 or 24.04 LTS** (clean install)
- **Root or sudo access**
- **Network access** to Forgejo (`forgejo.barodine.net`) and Docker Hub
- **Minimum hardware:** 2 vCPU, 4 GB RAM, 20 GB disk

### Bootstrap (First-Time Setup)

Run the idempotent bootstrap script as root:

```bash
sudo bash -c "$(curl -fsSL https://forgejo.barodine.net/barodine/agenthub/raw/branch/main/scripts/bootstrap.sh)"
```

**What it does (10 steps):**

1. `apt update && upgrade` — system packages
2. Enable `unattended-upgrades` for automatic security patches
3. Create `agenthub` user (UID 1001)
4. Install Docker Engine + Compose v2 from the official repository
5. Enable and start the Docker service
6. Create `/opt/agenthub` directory (mode 750, owner `agenthub`)
7. Clone the agenthub repository from Forgejo
8. Generate `.env` with secure secrets (JWT, Postgres password)
9. Pull images and start the stack with `compose.lan.yml`
10. Smoke test `http://127.0.0.1:3000/healthz`

**Expected duration:** under 15 minutes on a clean Ubuntu LTS install.

**Idempotency:** Safe to run multiple times — skips existing resources.
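The Troubleshooting section later in this runbook suggests generating secrets manually when `.env` is missing or invalid. A hedged sketch of how the values from step 8 could be produced (`gen_secret` is a hypothetical helper, not part of the repo; `openssl` ships with Ubuntu Server):

```shell
# Hypothetical helper to generate a secret of the shape .env expects:
# 32 random bytes, base64-encoded (~44 characters).
gen_secret() {
  openssl rand -base64 32
}

# Example: produce fresh values; write them into .env yourself and
# keep the file at mode 600, owner agenthub.
JWT_SECRET=$(gen_secret)
POSTGRES_PASSWORD=$(gen_secret)
printf 'JWT_SECRET=%s\nPOSTGRES_PASSWORD=%s\n' "$JWT_SECRET" "$POSTGRES_PASSWORD"
```

Each call returns an independent random value, so re-running never reuses a secret.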
---

## Deployment

### Directory Layout

```
/opt/agenthub/
├── .env                # Secrets (mode 600, owner agenthub)
├── compose.lan.yml     # LAN stack definition
├── scripts/
│   ├── backup.sh       # Daily backup (03:00 UTC)
│   └── restore.sh      # Restore from dump
├── docs/
│   ├── RUNBOOK.md      # General operations runbook
│   └── RUNBOOK-lan.md  # This file
└── backups/            # Local backup directory (14-day retention)
```

### Environment Variables (.env)

Located at `/opt/agenthub/.env` (mode 600):

```bash
# Database
POSTGRES_PASSWORD=

# JWT (32+ bytes, base64)
JWT_SECRET=

# CORS (LAN subnet)
ALLOWED_ORIGINS=http://192.168.1.0/24

# Optional: Scaleway Object Storage for weekly encrypted backups
S3_ENDPOINT=https://s3.fr-par.scw.cloud
S3_BUCKET=agenthub-backups
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
GPG_RECIPIENT_KEY=
```

**Security:**

- Never commit `.env` to version control
- Never expose `.env` via HTTP or logs
- Rotate `JWT_SECRET` quarterly (see the main RUNBOOK.md)

### Stack Services

Defined in `compose.lan.yml`:

| Service    | Port | Description                                  |
|------------|------|----------------------------------------------|
| `app`      | 3000 | Fastify + socket.io (HTTP/WS)                |
| `postgres` | 5432 | PostgreSQL 16 (internal, not exposed to LAN) |
| `redis`    | 6379 | Redis 7 (internal)                           |
| `ofelia`   | -    | Cron scheduler for the backup job            |
| `backup`   | -    | Backup container (runs daily at 03:00 UTC)   |

**Exposed to LAN:** Only port 3000 (app). The database and Redis are Docker-internal only.

---

## Firewall Configuration

### UFW Setup (Required)

Phase 1 serves **HTTP/WS on port 3000** without TLS. Protect it with UFW to allow LAN-only access.
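The rules below hard-code `192.168.1.0/24` and must be edited for your network. As a guard against typos when substituting your own value, a hedged sketch that validates the subnet string before it reaches ufw (`is_cidr` is a hypothetical helper, not part of the repo):

```shell
# Hypothetical guard: check the dotted-quad/prefix shape of a CIDR so a
# typo fails loudly instead of producing a malformed firewall rule.
# (Shape check only; it does not validate octet ranges.)
is_cidr() {
  echo "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/([0-9]|[12][0-9]|3[0-2])$'
}

SUBNET="192.168.1.0/24"   # adjust to your LAN
if is_cidr "$SUBNET"; then
  echo "ok: $SUBNET"
  # sudo ufw allow from "$SUBNET" to any port 3000 proto tcp
else
  echo "invalid subnet: $SUBNET" >&2
fi
```

The commented `ufw` line mirrors the real rule below; run it only after the check passes.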
```bash
# Allow SSH from the LAN subnet first (adjust the subnet to match your
# network), so enabling UFW cannot drop your session
sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp comment 'SSH from LAN'

# Allow AgentHub HTTP/WS from the LAN subnet
sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp comment 'AgentHub HTTP/WS from LAN'

# Default deny incoming, allow outgoing
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Enable UFW (after the SSH rule is in place)
sudo ufw --force enable

# Check status
sudo ufw status verbose
```

**Expected output:**

```
Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), disabled (routed)

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW IN    192.168.1.0/24             # SSH from LAN
3000/tcp                   ALLOW IN    192.168.1.0/24             # AgentHub HTTP/WS from LAN
```

**Critical:** Replace `192.168.1.0/24` with your actual LAN subnet.

### Port Reference

| Port | Protocol | Exposed To      | Purpose                |
|------|----------|-----------------|------------------------|
| 22   | TCP      | LAN subnet      | SSH administration     |
| 3000 | TCP      | LAN subnet      | AgentHub HTTP + WS     |
| 5432 | TCP      | Docker-internal | Postgres (not exposed) |
| 6379 | TCP      | Docker-internal | Redis (not exposed)    |

---

## Operations

### Start Stack

```bash
cd /opt/agenthub
docker compose -f compose.lan.yml up -d
```

### Stop Stack

```bash
cd /opt/agenthub
docker compose -f compose.lan.yml down
```

**Warning:** This does **not** delete data volumes (`pgdata`, `redisdata`).
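To back that warning with an explicit check after a `down`, the named volumes can be confirmed against `docker volume ls`. A hedged sketch (`volumes_present` is a hypothetical helper that just greps a volume listing; the volume names come from the warning above):

```shell
# Hypothetical check: succeed only if every NAME appears somewhere in LIST.
# volumes_present LIST NAME...
volumes_present() {
  local list=$1; shift
  local v
  for v in "$@"; do
    echo "$list" | grep -q "$v" || return 1
  done
}

# Usage after `docker compose -f compose.lan.yml down`:
# volumes_present "$(docker volume ls --format '{{.Name}}')" pgdata redisdata \
#   || echo "data volumes missing!" >&2
```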
### Restart Service

```bash
cd /opt/agenthub
docker compose -f compose.lan.yml restart app
```

### View Logs

```bash
# Follow all services
docker compose -f compose.lan.yml logs -f

# Follow app only
docker compose -f compose.lan.yml logs -f app

# Last 50 lines from postgres
docker compose -f compose.lan.yml logs --tail=50 postgres
```

### Check Service Status

```bash
# Docker services
docker compose -f compose.lan.yml ps

# Health check
curl http://127.0.0.1:3000/healthz

# Readiness check (includes DB connectivity)
curl http://127.0.0.1:3000/readyz
```

### Update to Latest Version

```bash
# Pull latest code
cd /opt/agenthub
sudo -u agenthub git pull origin main

# Pull latest images
sudo -u agenthub docker compose -f compose.lan.yml pull

# Recreate containers
sudo -u agenthub docker compose -f compose.lan.yml up -d

# Verify
curl http://127.0.0.1:3000/healthz
```

---

## Backup & Restore

### Automated Backups

**Schedule:** Daily at 03:00 UTC via the ofelia cron scheduler.

**Retention:**

- Local: 14 days (`/opt/agenthub/backups/`)
- Weekly encrypted upload to Scaleway Object Storage (if configured)

**Location:** `/opt/agenthub/backups/agenthub_YYYYMMDD_HHMMSS.dump`

### Manual Backup

```bash
cd /opt/agenthub
docker compose -f compose.lan.yml exec backup /usr/local/bin/backup.sh
```

**Verify backup:**

```bash
ls -lh /opt/agenthub/backups/
# Should show .dump files with non-zero size
```

### Restore from Backup

**Full procedure in `docs/RUNBOOK-restore.md`.** Quick reference:

```bash
cd /opt/agenthub

# Stop the app (prevent writes during restore)
docker compose -f compose.lan.yml stop app

# Restore using the restore script
docker compose -f compose.lan.yml run --rm backup /usr/local/bin/restore.sh /backups/agenthub_YYYYMMDD_HHMMSS.dump

# Restart app
docker compose -f compose.lan.yml start app

# Verify
curl http://127.0.0.1:3000/healthz
```

### Off-Site Backup (Scaleway)

Weekly encrypted backups to Scaleway Object Storage (Sundays only).
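Whichever tier you rely on, a dump that silently stopped updating defeats the retention policy. A hedged freshness check suitable for cron, matching the `agenthub_*.dump` naming above (`check_latest_backup` is a hypothetical helper, not part of the repo):

```shell
# Hypothetical freshness check: succeed only if DIR contains a non-empty
# agenthub_*.dump modified within the last MAX_DAYS days (default 2).
check_latest_backup() {
  local dir=$1 max_days=${2:-2}
  find "$dir" -name 'agenthub_*.dump' -type f -size +0c -mtime -"$max_days" \
    | grep -q .
}

# Usage from cron (alert on non-zero exit):
# check_latest_backup /opt/agenthub/backups 2 \
#   || echo "stale or missing AgentHub backup" >&2
```

The `-size +0c` filter also catches the zero-byte dumps that a failed `pg_dump` can leave behind.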
**Requirements:**

- Scaleway account with an Object Storage bucket
- GPG public key for encryption
- Env vars set in `.env`: `S3_ENDPOINT`, `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `GPG_RECIPIENT_KEY`

**Verification:**

```bash
# List backups on Scaleway
aws s3 ls s3://agenthub-backups/ \
  --endpoint-url=https://s3.fr-par.scw.cloud
```

---

## Rollback

### Feature Flag Rollback

AgentHub includes a `messaging.enabled` feature flag for quick rollback.

**Disable messaging feature:**

```bash
# Add to .env
echo "FEATURE_MESSAGING_ENABLED=false" >> /opt/agenthub/.env

# Restart app
cd /opt/agenthub
docker compose -f compose.lan.yml restart app
```

**Re-enable:**

```bash
# Remove the flag (or set it to true)
sed -i '/FEATURE_MESSAGING_ENABLED/d' /opt/agenthub/.env

# Restart app
docker compose -f compose.lan.yml restart app
```

### Version Rollback

**Rollback to a previous git commit:**

```bash
cd /opt/agenthub

# Stop stack
docker compose -f compose.lan.yml down

# Check out the previous version
sudo -u agenthub git log --oneline -10   # find the commit hash
sudo -u agenthub git checkout <commit-hash>

# Pull the corresponding image tag (if available), or rebuild locally
sudo -u agenthub docker compose -f compose.lan.yml build app

# Start stack
sudo -u agenthub docker compose -f compose.lan.yml up -d

# Verify
curl http://127.0.0.1:3000/healthz
```

**Rollback database schema:** If a migration broke the database, restore from backup (see above).
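Note that the `echo ... >> .env` step in the feature-flag rollback appends a new line on every invocation, so repeated toggles leave duplicate entries. A hedged, idempotent alternative (`set_env_flag` is a hypothetical helper, not part of the repo):

```shell
# Hypothetical idempotent setter: update KEY in place if it exists,
# otherwise append it once. Avoids duplicate lines in .env.
# set_env_flag FILE KEY VALUE
set_env_flag() {
  local file=$1 key=$2 value=$3
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    echo "${key}=${value}" >> "$file"
  fi
}

# Usage:
# set_env_flag /opt/agenthub/.env FEATURE_MESSAGING_ENABLED false
# docker compose -f compose.lan.yml restart app
```

Re-enabling then becomes `set_env_flag ... true` instead of a `sed` delete.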
---

## Monitoring

### Health Checks

| Endpoint   | Purpose                           | Expected Response      |
|------------|-----------------------------------|------------------------|
| `/healthz` | Liveness (process is running)     | `{"status":"ok"}`      |
| `/readyz`  | Readiness (DB is reachable)       | `{"status":"ready"}`   |
| `/metrics` | Prometheus metrics (WS, messages) | Prometheus text format |

### Key Metrics (Prometheus)

Available at `http://<server-ip>:3000/metrics`:

- `ws_connections` — Active WebSocket connections (gauge)
- `messages_sent_total` — Total messages sent (counter)
- `message_send_latency` — Message processing latency histogram (p50, p90, p99)

### Uptime Kuma (Optional)

Set up Uptime Kuma on the same LAN to monitor AgentHub:

1. **HTTP(s) monitor:**
   - URL: `http://<server-ip>:3000/readyz`
   - Interval: 60 seconds
   - Expected status code: 200
2. **Keyword monitor:**
   - URL: `http://<server-ip>:3000/healthz`
   - Keyword: `"status":"ok"`
3. **Notifications:**
   - Slack webhook (if configured)
   - Email (if SMTP configured)

### Manual Health Check

```bash
# Liveness
curl http://127.0.0.1:3000/healthz
# → {"status":"ok","uptime":12345}

# Readiness (includes DB check)
curl http://127.0.0.1:3000/readyz
# → {"status":"ready"}

# Metrics
curl http://127.0.0.1:3000/metrics
# → Prometheus text format
```

---

## Troubleshooting

### Service Won't Start

**Symptoms:** `docker compose up -d` fails, or the app container exits immediately.
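A frequent root cause is an incomplete `.env`. Before the deeper steps below, a hedged validator sketch that fails on the first required key that is absent or empty (`env_has_keys` is a hypothetical helper, not part of the repo):

```shell
# Hypothetical validator: check that each KEY exists in FILE with a
# non-empty value; report the first offender on stderr.
# env_has_keys FILE KEY...
env_has_keys() {
  local file=$1; shift
  local key
  for key in "$@"; do
    grep -Eq "^${key}=.+" "$file" \
      || { echo "missing or empty: $key" >&2; return 1; }
  done
}

# Usage:
# env_has_keys /opt/agenthub/.env JWT_SECRET POSTGRES_PASSWORD
```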
**Investigation:**

```bash
# Check container status
docker compose -f compose.lan.yml ps

# Check logs
docker compose -f compose.lan.yml logs app

# Check the .env file
ls -l /opt/agenthub/.env
# Should be mode 600, owner agenthub

# Verify secrets are set
grep JWT_SECRET /opt/agenthub/.env
grep POSTGRES_PASSWORD /opt/agenthub/.env
```

**Common causes:**

- Missing or invalid `.env` file → re-run bootstrap or generate secrets manually
- Port 3000 already in use → `sudo ss -tulpn | grep 3000` (netstat is not installed by default on recent Ubuntu)
- Docker not running → `sudo systemctl status docker`

### Database Connection Failed

**Symptoms:** `/readyz` returns 503, logs show `ECONNREFUSED`.

**Investigation:**

```bash
# Check the postgres container
docker compose -f compose.lan.yml ps postgres

# Check postgres logs
docker compose -f compose.lan.yml logs postgres --tail=50

# Test DB connectivity
docker compose -f compose.lan.yml exec postgres psql -U agenthub -d agenthub -c "SELECT 1"
```

**Resolution:**

```bash
# Restart postgres
docker compose -f compose.lan.yml restart postgres

# If data is corrupted, restore from backup
# (see "Restore from Backup")
```

### WebSocket Connection Refused

**Symptoms:** Paperclip agents cannot connect to `ws://<server-ip>:3000/agents`.

**Investigation:**

```bash
# Check the firewall
sudo ufw status verbose
# Should allow port 3000 from the LAN subnet

# Test HTTP from a client machine
curl http://<server-ip>:3000/healthz

# Check app logs for connection attempts
docker compose -f compose.lan.yml logs -f app | grep socket
```

**Resolution:**

```bash
# If UFW blocks, add a rule
sudo ufw allow from <lan-subnet> to any port 3000

# If the app is not listening on 0.0.0.0, check HOST in .env
grep HOST /opt/agenthub/.env
# Should be HOST=0.0.0.0 (not 127.0.0.1)

# Restart app
docker compose -f compose.lan.yml restart app
```

### Disk Full

**Symptoms:** Backups fail, logs show "No space left on device".
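To quantify the symptom before digging in, a hedged helper that extracts the used-space percentage for a mount point so it can be compared against a threshold in cron (`disk_pct_used` is a hypothetical helper, not part of the repo):

```shell
# Hypothetical helper: print the used% (integer, no "%" sign) for the
# filesystem holding PATH, using portable POSIX df output.
# disk_pct_used PATH
disk_pct_used() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Usage (alert when the disk is over 90% full):
# [ "$(disk_pct_used /opt/agenthub)" -lt 90 ] || echo "disk nearly full" >&2
```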
**Investigation:**

```bash
# Check disk usage
df -h /opt/agenthub

# Check backup directory size
du -sh /opt/agenthub/backups/

# Check Docker volumes
docker system df
```

**Resolution:**

```bash
# Clean old backups manually (keep the last 7 days)
find /opt/agenthub/backups/ -name "agenthub_*.dump" -type f -mtime +7 -delete

# Prune unused Docker images/containers
# Caution: with the stack stopped, --volumes also removes the unused
# named data volumes (pgdata, redisdata); run this only while the
# stack is up, or omit --volumes
docker system prune -a --volumes

# If still full, extend the disk or move backups to external storage
```

### High Memory Usage

**Symptoms:** The app container restarts with exit code 137 (OOM-killed).

**Investigation:**

```bash
# Check memory usage
docker stats agenthub-app-1 --no-stream

# Check active WebSocket connections
curl http://127.0.0.1:3000/metrics | grep ws_connections
```

**Resolution:**

```bash
# Increase the container memory limit in compose.lan.yml:
#   services:
#     app:
#       mem_limit: 1g   # default was 512m

# Restart stack
docker compose -f compose.lan.yml up -d

# If the problem persists, check for memory leaks in the logs
docker compose -f compose.lan.yml logs app | grep -i memory
```

---

## Phase 2 Migration Checklist

When moving from Phase 1 (LAN HTTP) to Phase 2 (public HTTPS):

- [ ] Acquire TLS certificate (Let's Encrypt via Coolify)
- [ ] Set up `agenthub.barodine.net` DNS A record
- [ ] Deploy to Coolify using `compose.coolify.yml`
- [ ] Enable HSTS: `ENABLE_HSTS=true` in `.env`
- [ ] Update `ALLOWED_ORIGINS` to the public domain
- [ ] Update firewall rules (443/tcp instead of 3000/tcp)
- [ ] Test with production Paperclip agents
- [ ] Decommission the LAN server or keep it as staging

**Reference:** ADR-0004 (Coolify deployment architecture).
---

## Quick Reference

### Essential Commands

```bash
# Start stack
docker compose -f compose.lan.yml up -d

# Stop stack
docker compose -f compose.lan.yml down

# Restart app
docker compose -f compose.lan.yml restart app

# View logs
docker compose -f compose.lan.yml logs -f app

# Health check
curl http://127.0.0.1:3000/healthz

# Manual backup
docker compose -f compose.lan.yml exec backup /usr/local/bin/backup.sh

# Restore from backup
docker compose -f compose.lan.yml run --rm backup /usr/local/bin/restore.sh /backups/agenthub_YYYYMMDD_HHMMSS.dump
```

### Files to Backup (Off-Server)

- `/opt/agenthub/.env` — **Critical:** secrets (keep secure, never commit)
- `/opt/agenthub/backups/` — Database dumps (14-day retention)

### Support

- **Documentation:** `/opt/agenthub/docs/`
- **Logs:** `docker compose -f compose.lan.yml logs`
- **Monitoring:** Uptime Kuma at `http://<server-ip>:3001`
- **Issue tracker:** Forgejo Barodine

---

**Last updated:** 2026-04-30 (J10 Phase 1 delivery)