Home Lab Action Plan

Phase 1: Critical Fixes (Do This Week)

1.1 Fix Failing Services

bewcloud-memos (Restarting Loop)

# SSH to controller
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130

# Check what's wrong
docker service logs bewcloud-memos-ssogxn-memos --tail 100

# Common fixes:
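# If it is a database connection issue, correct the connection settings.
# MEMOS_DB_HOST is a placeholder name: Memos normally reads its database
# connection from MEMOS_DRIVER and MEMOS_DSN, so check which variables
# this service actually sets before changing anything.
docker service update --env-add "MEMOS_DB_HOST=correct-hostname" bewcloud-memos-ssogxn-memos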

# If it keeps failing, try recreating:
docker service rm bewcloud-memos-ssogxn-memos
# Then redeploy via Dokploy UI

bendtstudio-webstatic (Rollback Paused)

# Check the error
docker service ps bendtstudio-webstatic-iq9evl --no-trunc

# Force update to retry
docker service update --force bendtstudio-webstatic-iq9evl

# If that fails, inspect the image
docker service inspect bendtstudio-webstatic-iq9evl --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'

syncthing (Stopped)

# Option A: Start it if you need it
docker service scale syncthing=1

# Option B: Remove it if not needed
docker service rm syncthing
# Also remove the volume if no longer needed
docker volume rm cloud-syncthing-i2rpwr_syncthing_config
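
To find every service in a similar state at once, rather than checking them one by one, you can filter the output of `docker service ls`. The following is a minimal sketch: the function name `under_replicated` is an assumption, and services intentionally scaled to zero (shown as `0/0`) are not reported.

```shell
#!/usr/bin/env bash
# Print services whose running replica count is below the desired count.
# Input: lines of "<name> <running>/<desired>[ extra text]" on stdin,
# as produced by: docker service ls --format '{{.Name}} {{.Replicas}}'
under_replicated() {
    awk '{
        split($2, r, "/")            # r[1] = running, r[2] = desired
        if (r[1] + 0 < r[2] + 0) print $1, $2
    }'
}

# Usage on the manager node:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | under_replicated
```

Each reported service can then be inspected with `docker service ps <service> --no-trunc` as shown above.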

1.2 Clean Up Unused Resources

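# Prune commands act only on the node where they are run.
# Warning: named volumes of stopped or scaled-to-zero services (e.g. syncthing)
# count as unused. Older Docker releases delete them here; Docker 23.0 and
# later prune only anonymous volumes unless --all is passed.
# Decide which stopped services to keep before pruning.

# Remove unused volumes (reclaim ~595MB)
docker volume prune

# Remove unused images
docker image prune -a

# System-wide cleanup (covers both steps above, plus stopped containers,
# unused networks, and build cache)
docker system prune -a --volumes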
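
Before pruning, it can help to preview which volumes are candidates and exclude the ones you know you want to keep. This is a rough sketch, not a complete safety net: `prune_candidates` and the `KEEP` pattern are hypothetical names, and the pattern must be adapted to your own volumes.

```shell
#!/usr/bin/env bash
# Filter a list of volume names (stdin), dropping any that match a
# "keep" pattern (extended regex). What remains are prune candidates.
prune_candidates() {
    local keep_pattern="$1"
    grep -Ev -- "$keep_pattern" || true   # grep exits 1 when nothing is left
}

# Usage (KEEP is a hypothetical pattern; adjust it to your volumes):
#   KEEP='syncthing|gitea'
#   docker volume ls -q --filter dangling=true | prune_candidates "$KEEP"
# Review the list, then remove specific volumes with: docker volume rm <name>
```

Removing reviewed volumes by name is slower than a blanket prune, but it cannot take a stopped service's data with it.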

1.3 Document Current State

Take screenshots of:


Phase 2: Configuration Backup (Do This Week)

2.1 Create Git Repository for Infrastructure

# On the controller node:
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130

# Create a backup directory
mkdir -p ~/infrastructure-backup/$(date +%Y-%m-%d)
cd ~/infrastructure-backup/$(date +%Y-%m-%d)

# Copy all compose files
cp -r /etc/dokploy/compose ./dokploy-compose
cp -r /etc/dokploy/traefik ./traefik-config
cp ~/minio-stack.yml ./

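# Export service configs (files are named by service name, not by opaque ID).
# Note: inspect output includes environment variables, so these files may
# contain the credentials discussed in Phase 3.
mkdir -p ./service-configs
docker service ls --format '{{.Name}}' | while read -r service; do
    docker service inspect "$service" > "./service-configs/${service}.json"
done

# Export stack task lists ("docker stack ls" has no -q flag; use --format)
docker stack ls --format '{{.Name}}' | while read -r stack; do
    docker stack ps "$stack" > "./service-configs/${stack}-tasks.txt"
done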

# Create a summary
cat > README.txt << EOF
Infrastructure Backup - $(date)
Cluster: Docker Swarm with Dokploy
Nodes: 3 (tpi-n1, tpi-n2, node-nas)
Services: $(docker service ls -q | wc -l) services
Stacks: $(docker stack ls --format '{{.Name}}' | wc -l) stacks

See HOMELAB_AUDIT.md for full documentation.
EOF

# Create tar archive
cd ..
tar -czf infrastructure-$(date +%Y-%m-%d).tar.gz $(date +%Y-%m-%d)
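
A backup is only useful if the archive can be read back. The sketch below checks that an archive is readable and contains an expected path; the function name `archive_contains` is an assumption, and the check does not validate the contents of the files themselves.

```shell
#!/usr/bin/env bash
# Verify that a .tar.gz archive is readable and lists the given path.
# Returns 0 on success, non-zero if the archive is missing, corrupt,
# or does not contain the path.
archive_contains() {
    local archive="$1" expected="$2"
    tar -tzf "$archive" 2>/dev/null | grep -Fq -- "$expected"
}

# Usage, from ~/infrastructure-backup:
#   archive_contains "infrastructure-$(date +%Y-%m-%d).tar.gz" "README.txt" \
#       && echo "archive OK"
```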

2.2 Commit to Gitea

# Clone your infrastructure repo (create if needed)
# Replace with your actual Gitea URL
git clone http://gitea.bendtstudio.com:3000/sirtimbly/homelab-configs.git
cd homelab-configs

# Copy backed up configs
cp -r ~/infrastructure-backup/$(date +%Y-%m-%d)/* .

# Organize by service
mkdir -p {stacks,compose,dokploy,traefik,docs}
mv dokploy-compose/* compose/ 2>/dev/null || true
mv traefik-config/* traefik/ 2>/dev/null || true
mv minio-stack.yml stacks/
mv service-configs/* docs/ 2>/dev/null || true

# Commit
git add .
git commit -m "Initial infrastructure backup - $(date +%Y-%m-%d)

- All Dokploy compose files
- Traefik configuration
- MinIO stack definition
- Service inspection exports
- Task history exports

Services backed up:
$(docker service ls --format '- {{.Name}}' | sort)"

git push origin main
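
Because the service inspection exports include environment variables, it is worth scanning the working tree for likely credentials before running `git add`. The following is a heuristic sketch only: `files_with_secrets` is a hypothetical helper, and its keyword list will miss secrets stored under unusual names.

```shell
#!/usr/bin/env bash
# List files under a directory that contain likely credential assignments,
# e.g. "MINIO_ROOT_PASSWORD=..." or "password: ...".
# -r recurse, -I skip binary files, -l print file names only, -i ignore case.
files_with_secrets() {
    local dir="$1"
    grep -rIliE -- '(password|passwd|secret|token|api_?key)[^=:]*[=:]' "$dir" | sort
}

# Usage, from inside the cloned repository:
#   files_with_secrets .
```

Any file it lists should be redacted or excluded (for example via `.gitignore`) before committing.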

Phase 3: Security Hardening (Do Next Week)

3.1 Remove Exposed Credentials

Problem: Services have passwords in environment variables visible in Docker configs

Solution: Use Docker secrets or Dokploy environment variables

# Example: Securing MinIO
# Instead of having password in compose file, use Docker secret:

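# printf avoids the trailing newline that echo would store in the secret.
# (A password typed on the command line still ends up in shell history.)
printf '%s' "your-minio-password" | docker secret create minio_root_password -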

# Then in compose:
# environment:
#   MINIO_ROOT_PASSWORD_FILE: /run/secrets/minio_root_password
# secrets:
#   - minio_root_password

Action items:

  1. List all services with exposed passwords:

    docker service ls -q | xargs -I {} docker service inspect {} --format '{{.Spec.Name}}: {{range .Spec.TaskTemplate.ContainerSpec.Env}}{{.}} {{end}}' | grep -i password
    
  2. For each service, create a plan to move credentials to:

    • Docker secrets (best for swarm)
    • Environment files (easier to manage)
    • Dokploy UI environment variables
  3. Update compose files and redeploy
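
The listing command in item 1 prints the password values themselves to the terminal. If you only need to know which variables to migrate, a filter that prints names and never values is safer. This is a sketch; `sensitive_var_names` is a hypothetical helper and its keyword list is a heuristic.

```shell
#!/usr/bin/env bash
# Read "KEY=value" lines on stdin and print only the names of variables
# that look sensitive. Values are never printed.
sensitive_var_names() {
    awk -F= 'tolower($1) ~ /password|passwd|secret|token|key/ { print $1 }'
}

# Usage for one service (manager node):
#   docker service inspect <service> \
#     --format '{{range .Spec.TaskTemplate.ContainerSpec.Env}}{{println .}}{{end}}' \
#     | sensitive_var_names
```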

3.2 Update Default Passwords

Check for default/weak passwords:

  • Dokploy (if still default)
  • MinIO
  • Gitea admin
  • Technitium DNS
  • Any databases

3.3 Review Exposed Ports

# Check all published ports
docker service ls --format '{{.Name}}: {{.Ports}}'

# Check if any services are exposed without Traefik
# (Should only be: 53, 2222, 3000, 8384, 9000-9001)
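
The comparison against the expected port list can be automated. The sketch below parses the `<name>: <ports>` lines printed by the command above and reports any published port outside the allowed set. It assumes the usual `*:published->target/proto` rendering of the `Ports` column; `unexpected_ports` and `ALLOWED_PORTS` are names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Allowed published ports, taken from the list in the comment above.
ALLOWED_PORTS="53 2222 3000 8384 9000 9001"

# Read "<name>: <ports>" lines on stdin and print "<name>: <port>" for
# every published port that is not in ALLOWED_PORTS.
unexpected_ports() {
    local line name spec first last p
    while IFS= read -r line; do
        name="${line%%:*}"
        # Extract the published side of each mapping, e.g. "9000-9001" or "53".
        for spec in $(grep -oE '\*:[0-9]+(-[0-9]+)?->' <<<"$line" | sed -E 's/^\*:|->$//g'); do
            first="${spec%-*}"; last="${spec#*-}"   # equal when not a range
            for p in $(seq "$first" "$last"); do
                case " $ALLOWED_PORTS " in
                    *" $p "*) ;;                    # allowed, ignore
                    *) echo "$name: $p" ;;
                esac
            done
        done
    done | sort -u
}

# Usage:
#   docker service ls --format '{{.Name}}: {{.Ports}}' | unexpected_ports
```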

Phase 4: Monitoring Setup (Do Next Week)

4.1 Set Up Prometheus + Grafana

You mentioned these in PLAN.md but they're not running. Let's add them:

Create monitoring-stack.yml:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - prometheus-data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - dokploy-network
    deploy:
      placement:
        constraints:
          - node.role == manager

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
    secrets:
      - grafana_admin_password
    networks:
      - dokploy-network
    deploy:
      labels:
        - traefik.http.routers.grafana.rule=Host(`grafana.bendtstudio.com`)
        - traefik.http.routers.grafana.entrypoints=websecure
        - traefik.http.routers.grafana.tls.certresolver=letsencrypt
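        - traefik.enable=true
        # Swarm services must declare the container port explicitly (3000 is Grafana's default)
        - traefik.http.services.grafana.loadbalancer.server.port=3000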

volumes:
  prometheus-data:
  grafana-data:

networks:
  dokploy-network:
    external: true

secrets:
  grafana_admin_password:
    external: true
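
The stack file mounts `./prometheus.yml` and expects the `grafana_admin_password` secret to exist, but neither is created above. The following sketch fills both gaps under stated assumptions: the scrape configuration is a bare minimum that only scrapes Prometheus itself, and the stack name `monitoring` and both function names are hypothetical. Because `prometheus.yml` is a bind mount, the file must exist on the manager node where the Prometheus task runs.

```shell
#!/usr/bin/env bash
# Write a minimal Prometheus configuration that only scrapes Prometheus itself.
write_prometheus_config() {
    local path="${1:-./prometheus.yml}"
    cat > "$path" <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
EOF
}

# Create the external secret, then deploy the stack.
# "monitoring" is an assumed stack name.
deploy_monitoring() {
    local pw
    write_prometheus_config ./prometheus.yml
    # Read the password from the terminal so it stays out of shell history.
    read -rsp 'Grafana admin password: ' pw && echo
    printf '%s' "$pw" | docker secret create grafana_admin_password -
    docker stack deploy -c monitoring-stack.yml monitoring
}
```

Run `deploy_monitoring` from the directory that contains `monitoring-stack.yml`.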

4.2 Add Node Exporter

Deploy node-exporter on all nodes to collect system metrics.
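
One way to get an exporter onto every node is a Swarm service in global mode. The sketch below writes such a stack file; the file name, the function name, and the choice to mount the host root read-only at `/host` are assumptions, not something taken from your current setup.

```shell
#!/usr/bin/env bash
# Write a stack file that runs one node-exporter task on every node.
write_node_exporter_stack() {
    local path="${1:-./node-exporter-stack.yml}"
    cat > "$path" <<'EOF'
version: '3.8'

services:
  node-exporter:
    image: prom/node-exporter:latest
    command:
      - '--path.rootfs=/host'
    volumes:
      - /:/host:ro
    networks:
      - dokploy-network
    deploy:
      mode: global   # one task per node, managers and workers alike

networks:
  dokploy-network:
    external: true
EOF
}

# Usage (stack name "monitoring" is an assumption):
#   write_node_exporter_stack
#   docker stack deploy -c node-exporter-stack.yml monitoring
```

Prometheus then needs a scrape job that targets port 9100, node-exporter's default, on each task.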

4.3 Configure Alerts

Set up alerts for:

  • Service down
  • High CPU/memory usage
  • Disk space low
  • Certificate expiration
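
As a starting point, the first three alerts can be expressed as Prometheus alerting rules over the standard `up` metric and node-exporter metrics. The thresholds and durations below are illustrative assumptions, the names `write_alert_rules` and `alert-rules.yml` are hypothetical, and certificate expiration is not covered because it needs an additional exporter.

```shell
#!/usr/bin/env bash
# Write a small Prometheus alerting rules file.
# Thresholds (10% free) and "for" durations are example values.
write_alert_rules() {
    local path="${1:-./alert-rules.yml}"
    cat > "$path" <<'EOF'
groups:
  - name: homelab
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 15m
      - alert: HighMemoryUsage
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 10m
EOF
}

# Usage:
#   write_alert_rules
#   promtool check rules alert-rules.yml   # validate before loading
```

The file still has to be referenced from `rule_files` in `prometheus.yml`, and notifications require an Alertmanager, which this plan does not yet include.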

Phase 5: Backup Strategy (Do Within 2 Weeks)

5.1 Define What to Back Up

Critical Data:

  1. Gitea repositories (/data/git)
  2. Dokploy database
  3. MinIO buckets
  4. Immich photos (/mnt/synology-data/immich)
  5. PostgreSQL databases
  6. Configuration files

5.2 Create Backup Scripts

Example backup script for Gitea:

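#!/bin/bash
# /opt/backup/backup-gitea.sh
set -euo pipefail  # stop on the first failed step instead of uploading a bad archive

CONTAINER="gitea-giteasqlite-bhymqw-gitea-1"
BACKUP_DIR="/backup/gitea/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Backup Gitea data (the archive is built inside the container, then copied out)
docker exec "$CONTAINER" tar czf /tmp/gitea-backup.tar.gz /data
docker cp "$CONTAINER:/tmp/gitea-backup.tar.gz" "$BACKUP_DIR/"
docker exec "$CONTAINER" rm -f /tmp/gitea-backup.tar.gz

# Backup to MinIO (offsite)
mc cp "$BACKUP_DIR/gitea-backup.tar.gz" minio/backups/gitea/

# Clean up old backups (keep 30 days). Only the dated subdirectories are
# considered, never /backup/gitea itself.
find /backup/gitea -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +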
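
The retention step will be needed by every backup script, so it may be worth factoring it into a guarded helper. This is a sketch under the assumption that each backup lives in a dated subdirectory directly below a root directory; `prune_old_backups` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Delete dated backup subdirectories older than N days.
# Only direct children of the root are considered, and the function
# refuses to run on an empty path, on "/", or on a missing directory.
prune_old_backups() {
    local root="$1" days="$2"
    [ -n "$root" ] && [ "$root" != "/" ] && [ -d "$root" ] || return 1
    find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -exec rm -rf {} +
}

# Usage:
#   prune_old_backups /backup/gitea 30
```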

5.3 Automate Backups

Add to crontab:

# Daily backups at 2 AM
0 2 * * * /opt/backup/backup-gitea.sh
0 3 * * * /opt/backup/backup-dokploy.sh
0 4 * * * /opt/backup/backup-databases.sh
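
Cron discards output unless mail is configured, so a failed backup can go unnoticed. A small wrapper that records each run and its exit code makes failures visible. This is a sketch; `run_logged` and the log path in the usage comment are assumptions.

```shell
#!/usr/bin/env bash
# Run a command, append its output to a log file, then append one
# summary line with a timestamp, the command, and its exit code.
# The wrapper returns the command's own exit code.
run_logged() {
    local log="$1"; shift
    local status=0
    "$@" >>"$log" 2>&1 || status=$?
    printf '%s %s exit=%d\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" "$status" >>"$log"
    return "$status"
}

# Usage inside a script called from cron:
#   run_logged /var/log/backup-gitea.log /opt/backup/backup-gitea.sh
```

Searching the log for lines that do not end in `exit=0` then gives a quick list of failed runs.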

Phase 6: Documentation (Ongoing)

6.1 Create Service Catalog

For each service, document:

  • Purpose: What does it do?
  • Access URL: How do I reach it?
  • Dependencies: What does it need?
  • Data location: Where is data stored?
  • Backup procedure: How to back it up?
  • Restore procedure: How to restore it?
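
To avoid starting the catalog from a blank page, a skeleton with one entry per service can be generated from the running cluster. This is a sketch; `catalog_entry` and the output path `docs/SERVICES.md` are hypothetical, and the fields still have to be filled in by hand.

```shell
#!/usr/bin/env bash
# Print an empty catalog entry for one service, with the six fields
# listed above.
catalog_entry() {
    local name="$1"
    cat <<EOF
## $name

- Purpose:
- Access URL:
- Dependencies:
- Data location:
- Backup procedure:
- Restore procedure:
EOF
}

# Usage on the manager node:
#   docker service ls --format '{{.Name}}' | sort | while read -r s; do
#       catalog_entry "$s"; echo
#   done > docs/SERVICES.md
```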

6.2 Create Runbooks

Common operations:

  • Adding a new service
  • Scaling a service
  • Updating a service
  • Removing a service
  • Recovering from node failure
  • Restoring from backup

6.3 Network Diagram

Create a visual diagram showing:

  • Nodes and their roles
  • Services and their locations
  • Network connections
  • Data flows

Quick Reference Commands

# Cluster status
docker node ls
docker service ls
docker stack ls

# Service management
docker service logs <service> --tail 100 -f
docker service ps <service>
docker service scale <service>=<count>
docker service update --force <service>

# Resource usage
docker system df
docker stats

# SSH access
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130  # Manager
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.19   # Worker

# Web UIs
curl http://192.168.2.130:3000    # Dokploy
curl http://192.168.2.130:888     # Swarmpit
curl http://192.168.2.130:8080    # Traefik
curl http://192.168.2.18:5380     # Technitium DNS
curl http://192.168.2.18:9001     # MinIO Console

Questions for You

Before we proceed, I need to clarify a few things:

  1. NAS Node Access: What are the SSH credentials for node-nas (192.168.2.18)?

  2. bendtstudio-app: Is this service needed? It has 0 replicas.

  3. syncthing: Do you want to keep this? It's currently stopped.

  4. Monitoring: Do you want me to set up Prometheus/Grafana now, or later?

  5. Gitea: Can you provide access credentials so I can check what's already version controlled?

  6. Priority: Which phase should we tackle first? I recommend Phase 1 (critical fixes).


Action Plan Version 1.0 - February 9, 2026