# Home Lab Action Plan
## Phase 1: Critical Fixes (Do This Week)
### 1.1 Fix Failing Services
**bewcloud-memos (Restarting Loop)**
```bash
# SSH to controller
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130
# Check what's wrong
docker service logs bewcloud-memos-ssogxn-memos --tail 100
# Common fixes:
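# If it is a database connection issue, correct the relevant variable
# (MEMOS_DB_HOST is a placeholder; use whichever variable your Memos deployment actually reads):
docker service update --env-add "MEMOS_DB_HOST=correct-hostname" bewcloud-memos-ssogxn-memos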
# If it keeps failing, try recreating:
docker service rm bewcloud-memos-ssogxn-memos
# Then redeploy via Dokploy UI
```
**bendtstudio-webstatic (Rollback Paused)**
```bash
# Check the error
docker service ps bendtstudio-webstatic-iq9evl --no-trunc
# Force update to retry
docker service update --force bendtstudio-webstatic-iq9evl
# If that fails, inspect the image
docker service inspect bendtstudio-webstatic-iq9evl --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'
```
**syncthing (Stopped)**
```bash
# Option A: Start it if you need it
docker service scale syncthing=1
# Option B: Remove it if not needed
docker service rm syncthing
# Also remove the volume if no longer needed
docker volume rm cloud-syncthing-i2rpwr_syncthing_config
```
### 1.2 Clean Up Unused Resources
```bash
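# Preview what would be removed before deleting anything
docker system df
docker volume ls -f dangling=true
# Remove unused volumes (reclaim ~595MB)
# Decide on syncthing (above) first: while its service is stopped, its volume may count as unused
docker volume prune
# Remove all images not used by a container
docker image prune -a
# System-wide cleanup (covers both commands above, plus stopped containers, unused networks, and build cache)
docker system prune -a --volumes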
```
### 1.3 Document Current State
Take screenshots of:
- Dokploy UI (all projects)
- Swarmpit dashboard
- Traefik dashboard (http://192.168.2.130:8080)
- MinIO console (http://192.168.2.18:9001)
- Gitea repositories
---
## Phase 2: Configuration Backup (Do This Week)
### 2.1 Create Git Repository for Infrastructure
```bash
# On the controller node:
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130
# Create a backup directory
mkdir -p ~/infrastructure-backup/$(date +%Y-%m-%d)
cd ~/infrastructure-backup/$(date +%Y-%m-%d)
# Copy all compose files
cp -r /etc/dokploy/compose ./dokploy-compose
cp -r /etc/dokploy/traefik ./traefik-config
cp ~/minio-stack.yml ./
# Export service configs (files are named by service name rather than by ID)
mkdir -p ./service-configs
docker service ls --format '{{.Name}}' | while read -r service; do
  docker service inspect "$service" > "./service-configs/${service}.json"
done
# Export the task list for each stack (--format prints just the stack names)
docker stack ls --format '{{.Name}}' | while read -r stack; do
  docker stack ps "$stack" > "./service-configs/${stack}-tasks.txt"
done
# Create a summary
cat > README.txt << EOF
Infrastructure Backup - $(date)
Cluster: Docker Swarm with Dokploy
Nodes: 3 (tpi-n1, tpi-n2, node-nas)
Services: $(docker service ls -q | wc -l) services
Stacks: $(docker stack ls --format '{{.Name}}' | wc -l) stacks
See HOMELAB_AUDIT.md for full documentation.
EOF
# Create tar archive
cd ..
tar -czf infrastructure-$(date +%Y-%m-%d).tar.gz $(date +%Y-%m-%d)
```
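Before relying on the archive, it may be worth confirming that it is readable. The following is a minimal sketch, assuming GNU `tar` and `sha256sum` are available on the controller; `verify_archive` is a hypothetical helper name, not part of Docker or Dokploy.

```bash
#!/usr/bin/env bash
# Hypothetical helper: checks that a .tar.gz archive can be read end to end
# and writes a checksum file next to it.
verify_archive() {
  local archive=$1
  # Listing the contents forces tar to read the whole archive without extracting it;
  # a truncated or corrupt file makes this step fail.
  tar -tzf "$archive" > /dev/null || return 1
  # Record a checksum so the file can be re-checked later with `sha256sum -c`.
  sha256sum "$archive" > "$archive.sha256"
}

# Example (run from ~/infrastructure-backup):
# verify_archive "infrastructure-$(date +%Y-%m-%d).tar.gz" && echo "archive OK"
```

The checksum file records the path exactly as it was passed in, so `sha256sum -c` should later be run from the same directory.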
### 2.2 Commit to Gitea
```bash
# Clone your infrastructure repo (create if needed)
# Replace with your actual Gitea URL
git clone http://gitea.bendtstudio.com:3000/sirtimbly/homelab-configs.git
cd homelab-configs
# Copy backed up configs
cp -r ~/infrastructure-backup/$(date +%Y-%m-%d)/* .
# Organize by service
mkdir -p {stacks,compose,dokploy,traefik,docs}
mv dokploy-compose/* compose/ 2>/dev/null || true
mv traefik-config/* traefik/ 2>/dev/null || true
mv minio-stack.yml stacks/
mv service-configs/* docs/ 2>/dev/null || true
# Commit
git add .
git commit -m "Initial infrastructure backup - $(date +%Y-%m-%d)

- All Dokploy compose files
- Traefik configuration
- MinIO stack definition
- Service inspection exports
- Task history exports

Services backed up:
$(docker service ls --format '- {{.Name}}' | sort)"
git push origin main
```
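The service inspection exports copied into `docs/` include each service's environment variables, so they can carry the passwords that Phase 3 sets out to remove. One option is to mask secret-looking values before pushing. The sketch below is an assumption-laden illustration: `redact_env` is a hypothetical helper, it only recognises upper-case variable names containing `PASSWORD`, `SECRET`, `TOKEN`, or `KEY`, and it assumes each `"NAME=value"` string holds no escaped quotes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: masks the value of any "NAME=value" JSON string whose
# NAME contains PASSWORD, SECRET, TOKEN, or KEY. Reads stdin, writes stdout.
redact_env() {
  sed -E 's/"([A-Za-z0-9_]*(PASSWORD|SECRET|TOKEN|KEY)[A-Za-z0-9_]*)=[^"]*"/"\1=REDACTED"/g'
}

# Example: rewrite every exported file in place before `git add`.
# for f in docs/*.json; do
#   redact_env < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
# done
```

Credentials embedded in other forms (connection strings under an unrelated name, for example) would pass through untouched, so a manual review of the diff is still advisable.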
---
## Phase 3: Security Hardening (Do Next Week)
### 3.1 Remove Exposed Credentials
**Problem:** Services have passwords in environment variables, which are visible to anyone who can run `docker service inspect`
**Solution:** Use Docker secrets or Dokploy environment variables
```bash
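# Example: Securing MinIO
# Instead of keeping the password in the compose file, store it as a Docker secret.
# printf avoids the trailing newline that echo would add to the secret value.
# The literal password still ends up in shell history, so consider piping it from a file instead.
printf '%s' "your-minio-password" | docker secret create minio_root_password -
# Then in compose:
# environment:
#   MINIO_ROOT_PASSWORD_FILE: /run/secrets/minio_root_password
# secrets:
#   - minio_root_password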
```
**Action items:**
1. List all services with exposed passwords:
```bash
docker service ls -q | xargs -I {} docker service inspect {} --format '{{.Spec.Name}}: {{range .Spec.TaskTemplate.ContainerSpec.Env}}{{.}} {{end}}' | grep -i password
```
2. For each service, create a plan to move credentials to:
- Docker secrets (best for swarm)
- Environment files (easier to manage)
- Dokploy UI environment variables
3. Update compose files and redeploy
### 3.2 Update Default Passwords
Check for default/weak passwords:
- Dokploy (if still default)
- MinIO
- Gitea admin
- Technitium DNS
- Any databases
### 3.3 Review Exposed Ports
```bash
# Check all published ports
docker service ls --format '{{.Name}}: {{.Ports}}'
# Check if any services are exposed without Traefik
# (Should only be: 53, 2222, 3000, 8384, 9000-9001)
```
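The comparison against the expected port list can be automated. The following is a rough sketch, assuming the `Ports` column uses the usual `*:published->target/proto` shape and that contiguous ports are shown as a range such as `9000-9001`; `find_unexpected_ports` and `ALLOWED_PORTS` are hypothetical names introduced here.

```bash
#!/usr/bin/env bash
# Allowlist copied from the comment above; ranges must match the text exactly.
ALLOWED_PORTS="53 2222 3000 8384 9000-9001"

# Reads "name: ports" lines on stdin and prints "service port" for every
# published port (or port range) that is not in ALLOWED_PORTS.
find_unexpected_ports() {
  local line name ports port
  while IFS= read -r line; do
    name=${line%%:*}    # everything before the first colon
    ports=${line#*: }   # everything after "name: "
    # Keep only the published side of each "*:8080->80/tcp" mapping.
    for port in $(printf '%s\n' "$ports" \
        | grep -oE '\*:[0-9]+(-[0-9]+)?->' \
        | sed -E 's/^\*:(.*)->$/\1/'); do
      case " $ALLOWED_PORTS " in
        *" $port "*) ;;                            # expected, ignore
        *) printf '%s %s\n' "$name" "$port" ;;     # not on the allowlist
      esac
    done
  done
}

# Example:
# docker service ls --format '{{.Name}}: {{.Ports}}' | find_unexpected_ports
```

An empty result means every published port is on the allowlist. A single port that falls inside an allowed range but is listed on its own would still be reported, since the check is a plain string match.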
---
## Phase 4: Monitoring Setup (Do Next Week)
### 4.1 Set Up Prometheus + Grafana
You mentioned these in PLAN.md but they're not running. Let's add them:
Create `monitoring-stack.yml`:
```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - prometheus-data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - dokploy-network
    deploy:
      placement:
        constraints:
          - node.role == manager
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
    secrets:
      - grafana_admin_password
    networks:
      - dokploy-network
    deploy:
      labels:
        - traefik.http.routers.grafana.rule=Host(`grafana.bendtstudio.com`)
        - traefik.http.routers.grafana.entrypoints=websecure
        - traefik.http.routers.grafana.tls.certresolver=letsencrypt
        # Traefik cannot detect ports in Swarm mode; 3000 is Grafana's default HTTP port
        - traefik.http.services.grafana.loadbalancer.server.port=3000
        - traefik.enable=true
volumes:
  prometheus-data:
  grafana-data:
networks:
  dokploy-network:
    external: true
secrets:
  grafana_admin_password:
    external: true
```
### 4.2 Add Node Exporter
Deploy node-exporter on all nodes to collect system metrics.
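One way to run an exporter on every node is a Swarm service in `global` mode. The snippet below writes a possible stack file as a sketch only: the file name `node-exporter-stack.yml`, the `/host` mount point, and the reuse of `dokploy-network` are assumptions, and nothing is deployed until the commented `docker stack deploy` line is run.

```bash
#!/usr/bin/env bash
# Write a hypothetical stack definition for node-exporter.
# The quoted 'EOF' keeps the shell from expanding anything inside the file.
cat > node-exporter-stack.yml << 'EOF'
version: '3.8'
services:
  node-exporter:
    image: prom/node-exporter:latest
    command:
      # Read host metrics from the read-only mount instead of the container's own root
      - '--path.rootfs=/host'
    volumes:
      - /:/host:ro
    networks:
      - dokploy-network
    deploy:
      # global = exactly one task on every node in the swarm
      mode: global
networks:
  dokploy-network:
    external: true
EOF

# Deploy from a manager node (stack name "monitoring" is an assumption):
# docker stack deploy -c node-exporter-stack.yml monitoring
```

Prometheus would still need a scrape job pointing at the exporter tasks (node-exporter listens on port 9100 by default) before any metrics appear.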
### 4.3 Configure Alerts
Set up alerts for:
- Service down
- High CPU/memory usage
- Disk space low
- Certificate expiration
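
As a starting point, three of the alerts above could be expressed as Prometheus alerting rules. This is a sketch under assumptions: the file name `alert-rules.yml`, the group name, and the thresholds (10% free, 5 to 15 minute windows) are illustrative, and the memory and disk rules rely on node-exporter metrics being scraped. Certificate expiration is left out because it needs a separate probing exporter.

```bash
#!/usr/bin/env bash
# Write a hypothetical rules file; the quoted 'EOF' preserves the {{ ... }}
# template syntax and $labels references exactly as Prometheus expects them.
cat > alert-rules.yml << 'EOF'
groups:
  - name: homelab
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is down"
      - alert: HighMemoryUsage
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% memory available on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"
EOF

# Validate the syntax if promtool is installed:
# promtool check rules alert-rules.yml
```

The file takes effect only once it is listed under `rule_files:` in `prometheus.yml`, and notifications additionally require an Alertmanager, which this plan does not yet define.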
---
## Phase 5: Backup Strategy (Do Within 2 Weeks)
### 5.1 Define What to Back Up
**Critical Data:**
1. Gitea repositories (/data/git)
2. Dokploy database
3. MinIO buckets
4. Immich photos (/mnt/synology-data/immich)
5. PostgreSQL databases
6. Configuration files
### 5.2 Create Backup Scripts
Example backup script for Gitea:
```bash
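#!/bin/bash
# /opt/backup/backup-gitea.sh
set -euo pipefail  # stop at the first failed step so a broken archive is never uploaded
BACKUP_DIR="/backup/gitea/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Back up Gitea data
docker exec gitea-giteasqlite-bhymqw-gitea-1 tar czf /tmp/gitea-backup.tar.gz /data
docker cp gitea-giteasqlite-bhymqw-gitea-1:/tmp/gitea-backup.tar.gz "$BACKUP_DIR/"
# Remove the temporary archive from inside the container
docker exec gitea-giteasqlite-bhymqw-gitea-1 rm -f /tmp/gitea-backup.tar.gz
# Copy to MinIO on the NAS node (a second copy, but on the same LAN, so not truly offsite)
mc cp "$BACKUP_DIR/gitea-backup.tar.gz" minio/backups/gitea/
# Clean up old backups (keep 30 days); only the dated subdirectories are considered
find /backup/gitea -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +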
```
### 5.3 Automate Backups
Add to crontab:
```bash
# Daily backups at 2 AM
0 2 * * * /opt/backup/backup-gitea.sh
0 3 * * * /opt/backup/backup-dokploy.sh
0 4 * * * /opt/backup/backup-databases.sh
```
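Cron jobs tend to fail quietly, so it can help to record each run and its exit status. The following is a minimal sketch; `run_logged` is a hypothetical helper and the log path shown in the example is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: runs a command, appends its output to a log file along
# with START/END markers, and returns the command's own exit status.
# Usage: run_logged <log-file> <command> [args...]
run_logged() {
  local log_file=$1
  shift
  local status=0
  printf '%s START %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" >> "$log_file"
  # Capture both stdout and stderr; remember the exit code without aborting.
  "$@" >> "$log_file" 2>&1 || status=$?
  printf '%s END %s (exit %d)\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" "$status" >> "$log_file"
  return "$status"
}

# Example:
# run_logged /var/log/backup-gitea.log /opt/backup/backup-gitea.sh
```

To use it from cron, the function could be saved in its own script that ends with `run_logged "$@"`, with each crontab entry then calling that script with a log path followed by the backup script.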
---
## Phase 6: Documentation (Ongoing)
### 6.1 Create Service Catalog
For each service, document:
- **Purpose:** What does it do?
- **Access URL:** How do I reach it?
- **Dependencies:** What does it need?
- **Data location:** Where is data stored?
- **Backup procedure:** How to back it up?
- **Restore procedure:** How to restore it?
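
To keep catalog entries consistent, a small generator can print an empty entry with the six fields above. This is only a sketch; `new_catalog_entry` and the `SERVICE_CATALOG.md` file name are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a blank catalog entry for the given service name.
new_catalog_entry() {
  local service=$1
  cat << EOF
### ${service}
- **Purpose:**
- **Access URL:**
- **Dependencies:**
- **Data location:**
- **Backup procedure:**
- **Restore procedure:**
EOF
}

# Example: append one stub per running service (run on a manager node).
# docker service ls --format '{{.Name}}' | sort | while read -r name; do
#   new_catalog_entry "$name" >> SERVICE_CATALOG.md
# done
```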
### 6.2 Create Runbooks
Common operations:
- Adding a new service
- Scaling a service
- Updating a service
- Removing a service
- Recovering from node failure
- Restoring from backup
### 6.3 Network Diagram
Create a visual diagram showing:
- Nodes and their roles
- Services and their locations
- Network connections
- Data flows
---
## Quick Reference Commands
```bash
# Cluster status
docker node ls
docker service ls
docker stack ls
# Service management
docker service logs <service> --tail 100 -f
docker service ps <service>
docker service scale <service>=<count>
docker service update --force <service>
# Resource usage
docker system df
docker stats
# SSH access
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130 # Manager
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.19 # Worker
# Web UIs
curl http://192.168.2.130:3000 # Dokploy
curl http://192.168.2.130:888 # Swarmpit
curl http://192.168.2.130:8080 # Traefik
curl http://192.168.2.18:5380 # Technitium DNS
curl http://192.168.2.18:9001 # MinIO Console
```
---
## Questions for You
Before we proceed, I need to clarify a few things:
1. **NAS Node Access:** What are the SSH credentials for node-nas (192.168.2.18)?
2. **bendtstudio-app:** Is this service needed? It has 0 replicas.
3. **syncthing:** Do you want to keep this? It's currently stopped.
4. **Monitoring:** Do you want me to set up Prometheus/Grafana now, or later?
5. **Gitea:** Can you provide access credentials so I can check what's already version controlled?
6. **Priority:** Which phase should we tackle first? I recommend Phase 1 (critical fixes).
---
*Action Plan Version 1.0 - February 9, 2026*