
AgentHub LAN Deployment Runbook

Phase 1 HTTP/WebSocket deployment for the Barodine LAN Ubuntu server.

Scope: Local network deployment (no TLS, no public DNS, ufw-protected).

Table of Contents

  1. Initial Setup
  2. Deployment
  3. Firewall Configuration
  4. Operations
  5. Backup & Restore
  6. Rollback
  7. Monitoring
  8. Troubleshooting

Initial Setup

Prerequisites

  • Ubuntu Server 22.04 or 24.04 LTS (clean install)
  • Root or sudo access
  • Network access to Forgejo (forgejo.barodine.net) and Docker Hub
  • Minimum hardware: 2 vCPU, 4GB RAM, 20GB disk

Bootstrap (First-Time Setup)

Run the idempotent bootstrap script as root:

sudo bash -c "$(curl -fsSL https://forgejo.barodine.net/barodine/agenthub/raw/branch/main/scripts/bootstrap.sh)"

What it does (10 steps):

  1. apt update && upgrade — system packages
  2. Enable unattended-upgrades for automatic security patches
  3. Create agenthub user (UID 1001)
  4. Install Docker Engine + Compose v2 from official repository
  5. Enable and start Docker service
  6. Create /opt/agenthub directory (mode 750, owner agenthub)
  7. Clone agenthub repository from Forgejo
  8. Generate .env with secure secrets (JWT, Postgres password)
  9. Pull images and start stack with compose.lan.yml
  10. Smoke test http://127.0.0.1:3000/healthz

Expected duration: < 15 minutes on clean Ubuntu LTS.

Idempotency: Safe to run multiple times — skips existing resources.
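
If .env ever needs to be rebuilt by hand, secrets in the shapes step 8 produces can be generated with openssl. This is a sketch; the bootstrap script's exact method may differ:

```shell
# Generate secrets matching the shapes .env expects (sketch; bootstrap.sh may differ).
POSTGRES_PASSWORD=$(openssl rand -hex 12)    # 24 hex characters
JWT_SECRET=$(openssl rand -base64 32)        # 32 random bytes, base64-encoded
echo "POSTGRES_PASSWORD=${POSTGRES_PASSWORD}"
echo "JWT_SECRET=${JWT_SECRET}"
```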


Deployment

Directory Layout

/opt/agenthub/
├── .env                      # Secrets (mode 600, owner agenthub)
├── compose.lan.yml           # LAN stack definition
├── scripts/
│   ├── backup.sh             # Daily backup (03:00 UTC)
│   └── restore.sh            # Restore from dump
├── docs/
│   ├── RUNBOOK.md            # General operations runbook
│   └── RUNBOOK-lan.md        # This file
└── backups/                  # Local backup directory (14-day retention)

Environment Variables (.env)

Located at /opt/agenthub/.env (mode 600):

# Database
POSTGRES_PASSWORD=<generated-24-char-secret>

# JWT (32+ bytes base64)
JWT_SECRET=<generated-32-byte-secret>

# CORS (LAN subnet)
ALLOWED_ORIGINS=http://192.168.1.0/24

# Optional: Scaleway Object Storage for weekly encrypted backups
S3_ENDPOINT=https://s3.fr-par.scw.cloud
S3_BUCKET=agenthub-backups
AWS_ACCESS_KEY_ID=<scaleway-access-key>
AWS_SECRET_ACCESS_KEY=<scaleway-secret>
GPG_RECIPIENT_KEY=<gpg-public-key-id>

Security:

  • Never commit .env to version control
  • Never expose .env via HTTP/logs
  • Rotate JWT_SECRET quarterly (see main RUNBOOK.md)

Stack Services

Defined in compose.lan.yml:

| Service  | Port | Description                                  |
|----------|------|----------------------------------------------|
| app      | 3000 | Fastify + socket.io (HTTP/WS)                |
| postgres | 5432 | PostgreSQL 16 (internal, not exposed to LAN) |
| redis    | 6379 | Redis 7 (internal)                           |
| ofelia   | -    | Cron scheduler for backup job                |
| backup   | -    | Backup container (runs daily at 03:00 UTC)   |

Exposed to LAN: Only port 3000 (app). Database and Redis are Docker-internal only.


Firewall Configuration

UFW Setup (Required)

Phase 1 uses HTTP/WS on port 3000 without TLS. Protect with UFW to allow LAN-only access.

# Enable UFW
sudo ufw --force enable

# Allow SSH from LAN subnet (adjust subnet to match your network)
sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp comment 'SSH from LAN'

# Allow AgentHub HTTP/WS from LAN subnet
sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp comment 'AgentHub HTTP/WS from LAN'

# Default deny incoming
sudo ufw default deny incoming

# Default allow outgoing
sudo ufw default allow outgoing

# Check status
sudo ufw status verbose

Expected output:

Status: active
Logging: on (low)
Default: deny (incoming), allow (outgoing), disabled (routed)

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW IN    192.168.1.0/24             # SSH from LAN
3000/tcp                   ALLOW IN    192.168.1.0/24             # AgentHub HTTP/WS from LAN

Critical: Replace 192.168.1.0/24 with your actual LAN subnet.
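
If you are unsure of your subnet, read the interface address and prefix with ip, then derive the network address from the reported CIDR:

```shell
# Show IPv4 address/prefix per interface; use the prefix to write the UFW rules above.
ip -o -4 addr show scope global | awk '{print $2, $4}'
# e.g. "eth0 192.168.1.42/24" means the subnet rule should use 192.168.1.0/24
```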

Port Reference

| Port | Protocol | Exposed To      | Purpose                |
|------|----------|-----------------|------------------------|
| 22   | TCP      | LAN subnet      | SSH administration     |
| 3000 | TCP      | LAN subnet      | AgentHub HTTP + WS     |
| 5432 | TCP      | Docker-internal | Postgres (not exposed) |
| 6379 | TCP      | Docker-internal | Redis (not exposed)    |

Operations

Start Stack

cd /opt/agenthub
docker compose -f compose.lan.yml up -d

Stop Stack

cd /opt/agenthub
docker compose -f compose.lan.yml down

Warning: This does not delete data volumes (pgdata, redisdata).

Restart Service

cd /opt/agenthub
docker compose -f compose.lan.yml restart app

View Logs

# Follow all services
docker compose -f compose.lan.yml logs -f

# Follow app only
docker compose -f compose.lan.yml logs -f app

# Last 50 lines from postgres
docker compose -f compose.lan.yml logs --tail=50 postgres

Check Service Status

# Docker services
docker compose -f compose.lan.yml ps

# Health check
curl http://127.0.0.1:3000/healthz

# Readiness check (includes DB connectivity)
curl http://127.0.0.1:3000/readyz

Update to Latest Version

# Pull latest code
cd /opt/agenthub
sudo -u agenthub git pull origin main

# Pull latest images
sudo -u agenthub docker compose -f compose.lan.yml pull

# Recreate containers
sudo -u agenthub docker compose -f compose.lan.yml up -d

# Verify
curl http://127.0.0.1:3000/healthz

Backup & Restore

Automated Backups

Schedule: Daily at 03:00 UTC via ofelia cron scheduler.

Retention:

  • Local: 14 days (/opt/agenthub/backups/)
  • Weekly encrypted upload to Scaleway Object Storage (if configured)

Location: /opt/agenthub/backups/agenthub_YYYYMMDD_HHMMSS.dump

Manual Backup

cd /opt/agenthub
docker compose -f compose.lan.yml exec backup /usr/local/bin/backup.sh

Verify backup:

ls -lh /opt/agenthub/backups/
# Should show .dump files with non-zero size
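
The size check above can be scripted. A minimal sketch follows; verify_latest_backup is a hypothetical helper, not one of the repo's scripts, and on the server you would point it at /opt/agenthub/backups:

```shell
# Hypothetical helper: fail unless the newest agenthub_*.dump exists and is non-empty.
verify_latest_backup() {
  local dir="$1" latest
  latest=$(ls -1t "$dir"/agenthub_*.dump 2>/dev/null | head -n1)
  [ -n "$latest" ] && [ -s "$latest" ] || { echo "no usable backup in $dir" >&2; return 1; }
  echo "OK: $latest"
}

# Runnable demo against a throwaway directory; on the server use /opt/agenthub/backups.
demo=$(mktemp -d)
printf 'pg_dump payload' > "$demo/agenthub_20260101_030000.dump"
verify_latest_backup "$demo"
```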

Restore from Backup

Full procedure in docs/RUNBOOK-restore.md. Quick reference:

cd /opt/agenthub

# Stop the app (prevent writes during restore)
docker compose -f compose.lan.yml stop app

# Restore using the restore script
docker compose -f compose.lan.yml run --rm backup /usr/local/bin/restore.sh /backups/agenthub_YYYYMMDD_HHMMSS.dump

# Restart app
docker compose -f compose.lan.yml start app

# Verify
curl http://127.0.0.1:3000/healthz

Off-Site Backup (Scaleway)

Weekly encrypted backups to Scaleway Object Storage (Sundays only).

Requirements:

  • Scaleway account with Object Storage bucket
  • GPG public key for encryption
  • Env vars set in .env: S3_ENDPOINT, S3_BUCKET, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, GPG_RECIPIENT_KEY

Verification:

# List backups on Scaleway
aws s3 ls s3://agenthub-backups/ \
  --endpoint-url=https://s3.fr-par.scw.cloud

Rollback

Feature Flag Rollback

AgentHub includes a messaging.enabled feature flag for quick rollback.

Disable messaging feature:

# Add to .env
echo "FEATURE_MESSAGING_ENABLED=false" >> /opt/agenthub/.env

# Restart app
cd /opt/agenthub
docker compose -f compose.lan.yml restart app

Re-enable:

# Remove flag or set to true
sed -i '/FEATURE_MESSAGING_ENABLED/d' /opt/agenthub/.env

# Restart app
docker compose -f compose.lan.yml restart app
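
Note that appending on every disable, as above, leaves duplicate lines in .env over time. An idempotent toggle can be sketched as follows; set_flag is a hypothetical helper, not a repo script:

```shell
# Hypothetical helper: set KEY=VALUE in an env file exactly once (update or append).
set_flag() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Demo on a throwaway file; on the server the file is /opt/agenthub/.env.
demo=$(mktemp)
set_flag "$demo" FEATURE_MESSAGING_ENABLED false
set_flag "$demo" FEATURE_MESSAGING_ENABLED true
cat "$demo"   # → FEATURE_MESSAGING_ENABLED=true (a single line)
```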

Version Rollback

Rollback to previous git commit:

cd /opt/agenthub

# Stop stack
docker compose -f compose.lan.yml down

# Checkout previous version
sudo -u agenthub git log --oneline -10  # Find commit hash
sudo -u agenthub git checkout <commit-hash>

# Pull corresponding image tag (if available)
# Or rebuild locally
sudo -u agenthub docker compose -f compose.lan.yml build app

# Start stack
sudo -u agenthub docker compose -f compose.lan.yml up -d

# Verify
curl http://127.0.0.1:3000/healthz

Rollback database schema:

If migration broke the database, restore from backup (see above).


Monitoring

Health Checks

| Endpoint | Purpose                           | Expected Response      |
|----------|-----------------------------------|------------------------|
| /healthz | Liveness (process is running)     | {"status":"ok"}        |
| /readyz  | Readiness (DB is reachable)       | {"status":"ready"}     |
| /metrics | Prometheus metrics (WS, messages) | Prometheus text format |

Key Metrics (Prometheus)

Available at http://<lan-ip>:3000/metrics:

  • ws_connections — Active WebSocket connections (gauge)
  • messages_sent_total — Total messages sent (counter)
  • message_send_latency — Message processing latency histogram (p50, p90, p99)
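
A single gauge can be pulled out of the Prometheus text format with awk. The snippet below uses a canned sample payload so it runs anywhere; on the server, pipe `curl -s http://127.0.0.1:3000/metrics` in instead:

```shell
# Extract the ws_connections gauge from Prometheus text output (illustrative sample).
sample='# HELP ws_connections Active WebSocket connections
# TYPE ws_connections gauge
ws_connections 42'
printf '%s\n' "$sample" | awk '$1 == "ws_connections" { print $2 }'
# → 42
```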

Uptime Kuma (Optional)

Set up Uptime Kuma on the same LAN to monitor AgentHub:

  1. HTTP(s) monitor:

    • URL: http://<lan-ip>:3000/readyz
    • Interval: 60 seconds
    • Expected status code: 200
  2. Keyword monitor:

    • URL: http://<lan-ip>:3000/healthz
    • Keyword: "status":"ok"
  3. Notifications:

    • Slack webhook (if configured)
    • Email (if SMTP configured)

Manual Health Check

# Liveness
curl http://127.0.0.1:3000/healthz
# → {"status":"ok","uptime":12345}

# Readiness (includes DB check)
curl http://127.0.0.1:3000/readyz
# → {"status":"ready"}

# Metrics
curl http://127.0.0.1:3000/metrics
# → Prometheus text format
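
After a restart the app can take a few seconds to answer /healthz, so scripts benefit from a small retry loop. wait_for is a hypothetical helper, demonstrated here with a stub command so the sketch runs anywhere:

```shell
# Hypothetical helper: retry a command up to N times, one second apart.
wait_for() {
  local tries="$1"; shift
  local i
  for i in $(seq 1 "$tries"); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "not healthy after $tries attempts" >&2
  return 1
}

# On the server: wait_for 30 curl -fsS http://127.0.0.1:3000/healthz
wait_for 3 true
# → healthy after 1 attempt(s)
```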

Troubleshooting

Service Won't Start

Symptoms: docker compose up -d fails or app container exits immediately.

Investigation:

# Check container status
docker compose -f compose.lan.yml ps

# Check logs
docker compose -f compose.lan.yml logs app

# Check .env file
ls -l /opt/agenthub/.env
# Should be mode 600, owner agenthub

# Verify secrets are set
grep JWT_SECRET /opt/agenthub/.env
grep POSTGRES_PASSWORD /opt/agenthub/.env

Common causes:

  • Missing or invalid .env file → Re-run bootstrap or generate secrets manually
  • Port 3000 already in use → sudo ss -tulpn | grep 3000 (ss ships with Ubuntu; netstat requires net-tools)
  • Docker not running → sudo systemctl status docker

Database Connection Failed

Symptoms: /readyz returns 503, logs show ECONNREFUSED.

Investigation:

# Check postgres container
docker compose -f compose.lan.yml ps postgres

# Check postgres logs
docker compose -f compose.lan.yml logs postgres --tail=50

# Test DB connectivity
docker compose -f compose.lan.yml exec postgres psql -U agenthub -d agenthub -c "SELECT 1"

Resolution:

# Restart postgres
docker compose -f compose.lan.yml restart postgres

# If data corruption, restore from backup
# See "Restore from Backup" section

WebSocket Connection Refused

Symptoms: Paperclip agents cannot connect to ws://<lan-ip>:3000/agents.

Investigation:

# Check firewall
sudo ufw status verbose
# Should allow port 3000 from LAN subnet

# Test HTTP from client machine
curl http://<lan-ip>:3000/healthz

# Check app logs for connection attempts
docker compose -f compose.lan.yml logs -f app | grep socket

Resolution:

# If UFW blocks, add rule
sudo ufw allow from <client-ip> to any port 3000

# If app not listening on 0.0.0.0, check HOST in .env
grep HOST /opt/agenthub/.env
# Should be HOST=0.0.0.0 (not 127.0.0.1)

# Restart app
docker compose -f compose.lan.yml restart app

Disk Full

Symptoms: Backup fails, logs show "No space left on device".

Investigation:

# Check disk usage
df -h /opt/agenthub

# Check backup directory size
du -sh /opt/agenthub/backups/

# Check Docker volumes
docker system df

Resolution:

# Clean old backups manually (keep last 7 days)
find /opt/agenthub/backups/ -name "agenthub_*.dump" -type f -mtime +7 -delete

# Prune unused Docker images and stopped containers
# (run --volumes only while the stack is up, so named volumes like pgdata remain in use and are kept)
docker system prune -a --volumes

# If still full, extend disk or move backups to external storage

High Memory Usage

Symptoms: App container restarts with exit code 137 (OOM killed).

Investigation:

# Check memory usage
docker stats agenthub-app-1 --no-stream

# Check active WebSocket connections
curl http://127.0.0.1:3000/metrics | grep ws_connections

Resolution:

# Increase container memory limit (edit compose.lan.yml)
services:
  app:
    mem_limit: 1g  # Default was 512m

# Restart stack
docker compose -f compose.lan.yml up -d

# If problem persists, check for memory leaks in logs
docker compose -f compose.lan.yml logs app | grep -i memory

Phase 2 Migration Checklist

When moving from Phase 1 (LAN HTTP) to Phase 2 (public HTTPS):

  • Acquire TLS certificate (Let's Encrypt via Coolify)
  • Set up agenthub.barodine.net DNS A record
  • Deploy to Coolify using compose.coolify.yml
  • Enable HSTS: ENABLE_HSTS=true in .env
  • Update ALLOWED_ORIGINS to public domain
  • Update firewall rules (443/tcp instead of 3000/tcp)
  • Test with production Paperclip agents
  • Decommission LAN server or keep as staging

Reference: ADR-0004 (Coolify deployment architecture).


Quick Reference

Essential Commands

# Start stack
docker compose -f compose.lan.yml up -d

# Stop stack
docker compose -f compose.lan.yml down

# Restart app
docker compose -f compose.lan.yml restart app

# View logs
docker compose -f compose.lan.yml logs -f app

# Health check
curl http://127.0.0.1:3000/healthz

# Manual backup
docker compose -f compose.lan.yml exec backup /usr/local/bin/backup.sh

# Restore from backup
docker compose -f compose.lan.yml run --rm backup /usr/local/bin/restore.sh /backups/<file>.dump

Files to Backup (Off-Server)

  • /opt/agenthub/.env — Critical: secrets (keep secure, never commit)
  • /opt/agenthub/backups/ — Database dumps (14-day retention)

Support

  • Documentation: /opt/agenthub/docs/
  • Logs: docker compose -f compose.lan.yml logs
  • Monitoring: Uptime Kuma at http://<monitoring-host>:3001
  • Issue tracker: Forgejo Barodine

Last updated: 2026-04-30 (J10 Phase 1 delivery)