# Home Lab Action Plan

## Phase 1: Critical Fixes (Do This Week)

### 1.1 Fix Failing Services

**bewcloud-memos (Restarting Loop)**

```bash
# SSH to controller
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130

# Check what's wrong
docker service logs bewcloud-memos-ssogxn-memos --tail 100

# Common fixes:
# If database connection issue:
docker service update --env-add "MEMOS_DB_HOST=correct-hostname" bewcloud-memos-ssogxn-memos

# If it keeps failing, try recreating:
docker service rm bewcloud-memos-ssogxn-memos
# Then redeploy via Dokploy UI
```

**bendtstudio-webstatic (Rollback Paused)**

```bash
# Check the error
docker service ps bendtstudio-webstatic-iq9evl --no-trunc

# Force update to retry
docker service update --force bendtstudio-webstatic-iq9evl

# If that fails, inspect the image
docker service inspect bendtstudio-webstatic-iq9evl --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'
```

**syncthing (Stopped)**

```bash
# Option A: Start it if you need it
docker service scale syncthing=1

# Option B: Remove it if not needed
docker service rm syncthing
# Also remove the volume if no longer needed
docker volume rm cloud-syncthing-i2rpwr_syncthing_config
```
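
To spot other services in a similar state without reading the whole `docker service ls` table, a small filter can help. This is a minimal sketch rather than part of the original plan: the function name `under_replicated` is hypothetical, and it assumes the `NAME REPLICAS` layout produced by the `--format` string shown in the usage comment.

```bash
# Print services whose running replica count differs from the desired count.
# Reads "NAME REPLICAS" lines on stdin, for example "bewcloud-memos 0/1".
under_replicated() {
  awk '{
    split($2, r, "/")            # r[1] = running, r[2] = desired
    if (r[1] + 0 != r[2] + 0) print $1 " (" $2 ")"
  }'
}

# Usage on a manager node:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | under_replicated
```

Services that were deliberately scaled to zero report `0/0` and are not listed, so a stopped service such as syncthing still has to be reviewed by hand.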
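
### 1.2 Clean Up Unused Resources

```bash
# Prune acts only on the node where it runs, so repeat it on each Swarm node.
# Volumes of stopped services (e.g. syncthing) count as unused, so settle
# section 1.1 before pruning.

# Remove unused volumes (reclaim ~595MB)
# Docker 23.0+ prunes only anonymous volumes unless --all is added
docker volume prune

# Remove unused images
docker image prune -a

# System-wide cleanup (also removes stopped containers, unused networks,
# and build cache)
docker system prune -a --volumes
```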
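
Because pruning cannot be undone, it may be worth previewing what would go before deleting anything. The sketch below is one possible approach, not the plan's own procedure: the function names are hypothetical, and it removes exactly the volumes it listed rather than calling `docker volume prune`.

```bash
# List volumes that no container on this node references. Nothing is deleted.
list_unused_volumes() {
  docker volume ls --quiet --filter dangling=true
}

# Show the unused volumes and remove them only after an explicit "yes".
confirm_volume_cleanup() {
  unused=$(list_unused_volumes)
  if [ -z "$unused" ]; then
    echo "No unused volumes found."
    return 0
  fi
  printf 'Unused volumes on this node:\n%s\n' "$unused"
  printf 'Type "yes" to remove them: '
  read -r answer
  if [ "$answer" = "yes" ]; then
    # Unquoted on purpose: one argument per volume name.
    docker volume rm $unused
  else
    echo "Skipped."
  fi
}
```

Run `confirm_volume_cleanup` on each node in turn. A volume that belongs to a service scaled to zero shows up in this list, so read it before answering.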

### 1.3 Document Current State

Take screenshots of:

- Dokploy UI (all projects)
- Swarmpit dashboard
- Traefik dashboard (http://192.168.2.130:8080)
- MinIO console (http://192.168.2.18:9001)
- Gitea repositories

---

## Phase 2: Configuration Backup (Do This Week)

### 2.1 Create Git Repository for Infrastructure

```bash
# On the controller node:
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130

# Create a backup directory (capture the date once so every step agrees)
BACKUP_DATE=$(date +%Y-%m-%d)
mkdir -p ~/infrastructure-backup/"$BACKUP_DATE"
cd ~/infrastructure-backup/"$BACKUP_DATE"

# Copy all compose files (prefix with sudo if /etc/dokploy is root-only)
cp -r /etc/dokploy/compose ./dokploy-compose
cp -r /etc/dokploy/traefik ./traefik-config
cp ~/minio-stack.yml ./

# Export service configs, one file per service name
# Note: these exports include environment variables, passwords and all
mkdir -p ./service-configs
docker service ls --format '{{.Name}}' | while read -r service; do
  docker service inspect "$service" > "./service-configs/${service}.json"
done

# Export stack task lists (--format, because docker stack ls offers no -q)
docker stack ls --format '{{.Name}}' | while read -r stack; do
  docker stack ps "$stack" > "./service-configs/${stack}-tasks.txt"
done

# Create a summary
cat > README.txt << EOF
Infrastructure Backup - $(date)
Cluster: Docker Swarm with Dokploy
Nodes: 3 (tpi-n1, tpi-n2, node-nas)
Services: $(docker service ls -q | wc -l) services
Stacks: $(docker stack ls --format '{{.Name}}' | wc -l) stacks

See HOMELAB_AUDIT.md for full documentation.
EOF

# Create tar archive
cd ..
tar -czf "infrastructure-$BACKUP_DATE.tar.gz" "$BACKUP_DATE"
```
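
An archive is only useful if it can be read back, so a quick check after `tar -czf` may be worthwhile. The following is a minimal sketch under stated assumptions: `verify_archive` is a hypothetical helper name, and the minimum entry count is an arbitrary sanity threshold.

```bash
# Succeed only if the archive is readable and holds at least min_files
# entries (default 1). Prints the entry count on success.
verify_archive() {
  archive=$1
  min_files=${2:-1}
  # tar -tzf lists members without extracting and fails on a corrupt file.
  listing=$(tar -tzf "$archive" 2>/dev/null) || return 1
  count=$(printf '%s\n' "$listing" | grep -c . || true)
  if [ "$count" -lt "$min_files" ]; then
    return 1
  fi
  echo "$archive: $count entries"
}

# Usage:
#   verify_archive ~/infrastructure-backup/infrastructure-2026-02-09.tar.gz 10
```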

### 2.2 Commit to Gitea

```bash
# Clone your infrastructure repo (create if needed)
# Replace with your actual Gitea URL
git clone http://gitea.bendtstudio.com:3000/sirtimbly/homelab-configs.git
cd homelab-configs

# Copy backed up configs (run on the same day as 2.1, or substitute that date)
cp -r ~/infrastructure-backup/$(date +%Y-%m-%d)/* .

# Organize by service
mkdir -p {stacks,compose,dokploy,traefik,docs}
mv dokploy-compose/* compose/ 2>/dev/null || true
mv traefik-config/* traefik/ 2>/dev/null || true
mv minio-stack.yml stacks/
mv service-configs/* docs/ 2>/dev/null || true

# Commit (check the exports for passwords first, see Phase 3.1)
git add .
git commit -m "Initial infrastructure backup - $(date +%Y-%m-%d)

- All Dokploy compose files
- Traefik configuration
- MinIO stack definition
- Service inspection exports
- Task history exports

Services backed up:
$(docker service ls --format '- {{.Name}}' | sort)"

git push origin main
```
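
The service inspection exports contain environment variables, so a scan before `git add` can keep passwords out of the repository history. This is a rough sketch, not a complete secret scanner: `scan_for_secrets` is a hypothetical name and the pattern is deliberately broad, so it also flags harmless entries such as `*_PASSWORD_FILE=...`.

```bash
# Print files under a directory that contain credential-like settings.
# Returns 0 if none are found and 1 otherwise, so it can gate a commit.
scan_for_secrets() {
  dir=${1:-.}
  # -r recurse, -l file names only, -i ignore case, -E extended regex
  hits=$(grep -rliE --exclude-dir=.git \
    '(password|passwd|secret|token|api_?key)[A-Za-z_]*=' "$dir" 2>/dev/null || true)
  if [ -n "$hits" ]; then
    printf 'Possible credentials in:\n%s\n' "$hits"
    return 1
  fi
  echo "No credential-like entries found in $dir"
}

# Usage, from inside the cloned repository:
#   scan_for_secrets . && git add .
```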

---

## Phase 3: Security Hardening (Do Next Week)

### 3.1 Remove Exposed Credentials

**Problem:** Services carry passwords in plain environment variables, which anyone who can run `docker service inspect` can read.

**Solution:** Use Docker secrets or Dokploy environment variables.

```bash
# Example: Securing MinIO
# Instead of having password in compose file, use Docker secret:

# printf avoids the trailing newline that echo would store in the secret
printf '%s' "your-minio-password" | docker secret create minio_root_password -

# Then in compose:
# environment:
#   MINIO_ROOT_PASSWORD_FILE: /run/secrets/minio_root_password
# secrets:
#   - minio_root_password
```
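
Typing the password on the command line leaves it in shell history. One way around that, sketched here with a hypothetical helper name, is to read the value from stdin and pipe it straight into `docker secret create`.

```bash
# Read one line from stdin without echoing it on a terminal, then store it
# as a Docker secret. The value is never written to shell history.
create_secret_from_stdin() {
  name=$1
  if [ -t 0 ]; then
    printf 'Value for %s: ' "$name" >&2
    stty -echo                     # hide typing on an interactive terminal
  fi
  IFS= read -r value || true
  if [ -t 0 ]; then
    stty echo
    printf '\n' >&2
  fi
  if [ -z "$value" ]; then
    echo "Empty value, secret not created." >&2
    return 1
  fi
  # printf '%s' sends the value without a trailing newline.
  printf '%s' "$value" | docker secret create "$name" -
}

# Usage on a manager node:
#   create_secret_from_stdin minio_root_password
```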

**Action items:**

1. List all services with exposed passwords:

   ```bash
   docker service ls -q | xargs -I {} docker service inspect {} --format '{{.Spec.Name}}: {{range .Spec.TaskTemplate.ContainerSpec.Env}}{{.}} {{end}}' | grep -i password
   ```

2. For each service, create a plan to move credentials to:
   - Docker secrets (best for swarm)
   - Environment files (easier to manage)
   - Dokploy UI environment variables

3. Update compose files and redeploy

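
The listing command in step 1 prints the passwords themselves to the terminal. If only the variable names are needed, the output can be masked first. This is a small sketch with a hypothetical function name; it assumes upper-case variable names and values without spaces.

```bash
# Replace the value of any variable whose name contains PASSWORD, SECRET,
# or TOKEN with ****, leaving all other variables untouched.
redact_env() {
  sed -E 's/([A-Za-z0-9_]*(PASSWORD|SECRET|TOKEN)[A-Za-z0-9_]*)=[^ ]*/\1=****/g'
}

# Usage: append "| redact_env" to the listing command from step 1.
```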

### 3.2 Update Default Passwords

Check for default/weak passwords:

- Dokploy (if still default)
- MinIO
- Gitea admin
- Technitium DNS
- Any databases

### 3.3 Review Exposed Ports

```bash
# Check all published ports
docker service ls --format '{{.Name}}: {{.Ports}}'

# Check if any services are exposed without Traefik
# (Should only be: 53, 2222, 3000, 8384, 9000-9001)
```
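
Comparing the port listing against the expected set by eye is error-prone, so the check can be scripted. The sketch below assumes the `NAME: PORTS` layout of the command above; `unexpected_ports` is a hypothetical name and its default allowlist is copied from the comment in that block.

```bash
# Read "NAME: PORTS" lines on stdin and print every published port that is
# not in the allowlist (a space-separated list passed as the first argument).
unexpected_ports() {
  allowed=${1:-"53 2222 3000 8384 9000 9001"}
  awk -v allowed="$allowed" '
    BEGIN { n = split(allowed, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    {
      name = $1; sub(/:$/, "", name)
      for (f = 2; f <= NF; f++) {
        # Fields look like "*:2222->22/tcp," or "*:9000-9001->9000-9001/tcp"
        if ($f !~ /->/) continue
        pub = $f
        sub(/->.*/, "", pub)        # drop the target side
        sub(/^.*:/, "", pub)        # drop the "*:" address prefix
        m = split(pub, range, "-")
        lo = range[1] + 0
        hi = (m > 1 ? range[2] : range[1]) + 0
        for (p = lo; p <= hi; p++)
          if (!(p in ok)) print name ": " p
      }
    }'
}

# Usage:
#   docker service ls --format '{{.Name}}: {{.Ports}}' | unexpected_ports
```

The Quick Reference section lists web UIs on ports 888, 8080, and 5380, which are not in this allowlist, so either those services are meant to move behind Traefik or the list needs extending.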

---

## Phase 4: Monitoring Setup (Do Next Week)

### 4.1 Set Up Prometheus + Grafana

You mentioned these in PLAN.md but they're not running. Let's add them:

Create `monitoring-stack.yml`:

```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - prometheus-data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - dokploy-network
    deploy:
      placement:
        constraints:
          - node.role == manager

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
    secrets:
      - grafana_admin_password
    networks:
      - dokploy-network
    deploy:
      labels:
        - traefik.http.routers.grafana.rule=Host(`grafana.bendtstudio.com`)
        - traefik.http.routers.grafana.entrypoints=websecure
        - traefik.http.routers.grafana.tls.certresolver=letsencrypt
        # Traefik cannot auto-detect ports in Swarm mode; 3000 is Grafana's default
        - traefik.http.services.grafana.loadbalancer.server.port=3000
        - traefik.enable=true

volumes:
  prometheus-data:
  grafana-data:

networks:
  dokploy-network:
    external: true

secrets:
  grafana_admin_password:
    external: true
```
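
The stack file mounts `./prometheus.yml`, which has to exist before deployment. The sketch below generates a minimal one. It is an assumption-laden starting point rather than a tuned configuration: the function name, the 30-second scrape interval, and the `node` job name are all illustrative, and the targets in the usage comment assume node-exporter on its default port 9100.

```bash
# Write a minimal Prometheus config that scrapes the given host:port targets.
# Usage: write_prometheus_config OUTPUT_FILE TARGET [TARGET...]
write_prometheus_config() {
  if [ "$#" -lt 2 ]; then
    echo "usage: write_prometheus_config OUTPUT_FILE TARGET [TARGET...]" >&2
    return 1
  fi
  out=$1
  shift
  {
    echo "global:"
    echo "  scrape_interval: 30s"
    echo "scrape_configs:"
    echo "  - job_name: node"
    echo "    static_configs:"
    echo "      - targets:"
    for target in "$@"; do
      echo "          - '$target'"
    done
  } > "$out"
}

# Usage (node IPs taken from this plan, port 9100 assumed):
#   write_prometheus_config ./prometheus.yml \
#     192.168.2.130:9100 192.168.2.19:9100 192.168.2.18:9100
```

With the file and the `grafana_admin_password` secret in place, the stack could then be deployed from a manager with `docker stack deploy -c monitoring-stack.yml monitoring`, where `monitoring` is an arbitrary stack name.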

### 4.2 Add Node Exporter

Deploy node-exporter on all nodes to collect system metrics.

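
One way to run node-exporter on every node is a global Swarm service. The sketch below is an assumption, not a tested deployment for this cluster: the service name, the host-mode port 9100, and the read-only root mount are illustrative choices, and the function name is hypothetical.

```bash
# Create a global service so that one node-exporter task runs on each node.
# The image can be overridden by passing it as the first argument.
deploy_node_exporter() {
  image=${1:-prom/node-exporter:latest}
  docker service create \
    --name node-exporter \
    --mode global \
    --publish mode=host,target=9100,published=9100 \
    --mount type=bind,source=/,destination=/host,readonly \
    "$image" \
    --path.rootfs=/host
}

# Usage on a manager node:
#   deploy_node_exporter
```

Publishing in host mode exposes port 9100 on each node's own IP, which lets Prometheus scrape the nodes by address.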

### 4.3 Configure Alerts

Set up alerts for:

- Service down
- High CPU/memory usage
- Disk space low
- Certificate expiration

---

## Phase 5: Backup Strategy (Do Within 2 Weeks)

### 5.1 Define What to Back Up

**Critical Data:**

1. Gitea repositories (/data/git)
2. Dokploy database
3. MinIO buckets
4. Immich photos (/mnt/synology-data/immich)
5. PostgreSQL databases
6. Configuration files

### 5.2 Create Backup Scripts

Example backup script for Gitea:

```bash
#!/bin/bash
# /opt/backup/backup-gitea.sh
set -euo pipefail  # stop at the first failed step

STAMP=$(date +%Y%m%d)
BACKUP_DIR="/backup/gitea/$STAMP"
mkdir -p "$BACKUP_DIR"

# Backup Gitea data
docker exec gitea-giteasqlite-bhymqw-gitea-1 tar czf /tmp/gitea-backup.tar.gz /data
docker cp gitea-giteasqlite-bhymqw-gitea-1:/tmp/gitea-backup.tar.gz "$BACKUP_DIR/"
docker exec gitea-giteasqlite-bhymqw-gitea-1 rm -f /tmp/gitea-backup.tar.gz

# Copy to MinIO (a second copy on the NAS, not a true offsite backup);
# the dated name keeps one day's upload from overwriting the previous one
mc cp "$BACKUP_DIR/gitea-backup.tar.gz" "minio/backups/gitea/gitea-backup-$STAMP.tar.gz"

# Clean up old backups (keep 30 days); the depth limits keep find from
# matching /backup/gitea itself
find /backup/gitea -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
```
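
Since the other backup scripts will need the same retention step, it could be factored into a guarded function. This is a sketch assuming GNU `find`, as shipped on Ubuntu; `prune_old_backups` is a hypothetical name.

```bash
# Delete dated backup directories directly under root that are older than
# the given number of days (default 30). The root itself is never removed.
prune_old_backups() {
  root=$1
  days=${2:-30}
  # Refuse an empty or root path so that a typo cannot wipe the disk.
  case "$root" in
    ""|"/") echo "refusing to prune '$root'" >&2; return 1 ;;
  esac
  find "$root" -mindepth 1 -maxdepth 1 -type d -mtime "+$days" -exec rm -rf {} +
}

# Usage:
#   prune_old_backups /backup/gitea 30
```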

### 5.3 Automate Backups

Add to crontab (only `backup-gitea.sh` is shown above; the other two scripts still need to be written):

```bash
# Daily backups, staggered hourly from 2 AM
0 2 * * * /opt/backup/backup-gitea.sh
0 3 * * * /opt/backup/backup-dokploy.sh
0 4 * * * /opt/backup/backup-databases.sh
```

---

## Phase 6: Documentation (Ongoing)

### 6.1 Create Service Catalog

For each service, document:

- **Purpose:** What does it do?
- **Access URL:** How do I reach it?
- **Dependencies:** What does it need?
- **Data location:** Where is data stored?
- **Backup procedure:** How to back it up?
- **Restore procedure:** How to restore it?

### 6.2 Create Runbooks

Common operations:

- Adding a new service
- Scaling a service
- Updating a service
- Removing a service
- Recovering from node failure
- Restoring from backup

### 6.3 Network Diagram

Create a visual diagram showing:

- Nodes and their roles
- Services and their locations
- Network connections
- Data flows

---

## Quick Reference Commands

```bash
# Cluster status
docker node ls
docker service ls
docker stack ls

# Service management
docker service logs <service> --tail 100 -f
docker service ps <service>
docker service scale <service>=<count>
docker service update --force <service>

# Resource usage
docker system df
docker stats

# SSH access
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130 # Manager
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.19 # Worker

# Web UIs
curl http://192.168.2.130:3000 # Dokploy
curl http://192.168.2.130:888 # Swarmpit
curl http://192.168.2.130:8080 # Traefik
curl http://192.168.2.18:5380 # Technitium DNS
curl http://192.168.2.18:9001 # MinIO Console
```

---

## Questions for You

Before we proceed, I need to clarify a few things:

1. **NAS Node Access:** What are the SSH credentials for node-nas (192.168.2.18)?

2. **bendtstudio-app:** Is this service needed? It has 0 replicas.

3. **syncthing:** Do you want to keep this? It's currently stopped.

4. **Monitoring:** Do you want me to set up Prometheus/Grafana now, or later?

5. **Gitea:** Can you provide access credentials so I can check what's already version controlled?

6. **Priority:** Which phase should we tackle first? I recommend Phase 1 (critical fixes).

---

*Action Plan Version 1.0 - February 9, 2026*