new plan and docs

ACTION_PLAN.md (new file, 403 lines)

# Home Lab Action Plan

## Phase 1: Critical Fixes (Do This Week)

### 1.1 Fix Failing Services

**bewcloud-memos (Restarting Loop)**
```bash
# SSH to controller
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130

# Check what's wrong
docker service logs bewcloud-memos-ssogxn-memos --tail 100

# Common fixes:
# If database connection issue:
docker service update --env-add "MEMOS_DB_HOST=correct-hostname" bewcloud-memos-ssogxn-memos

# If it keeps failing, try recreating:
docker service rm bewcloud-memos-ssogxn-memos
# Then redeploy via Dokploy UI
```

**bendtstudio-webstatic (Rollback Paused)**
```bash
# Check the error
docker service ps bendtstudio-webstatic-iq9evl --no-trunc

# Force update to retry
docker service update --force bendtstudio-webstatic-iq9evl

# If that fails, inspect the image
docker service inspect bendtstudio-webstatic-iq9evl --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'
```

**syncthing (Stopped)**
```bash
# Option A: Start it if you need it
docker service scale syncthing=1

# Option B: Remove it if not needed
docker service rm syncthing
# Also remove the volume if no longer needed
docker volume rm cloud-syncthing-i2rpwr_syncthing_config
```

### 1.2 Clean Up Unused Resources

```bash
# Remove unused volumes (reclaim ~595MB)
docker volume prune

# Remove unused images
docker image prune -a

# System-wide cleanup (destructive: removes ALL unused images and volumes,
# including stopped-service data — review what's unused before running)
docker system prune -a --volumes
```

### 1.3 Document Current State

Take screenshots of:
- Dokploy UI (all projects)
- Swarmpit dashboard
- Traefik dashboard (http://192.168.2.130:8080)
- MinIO console (http://192.168.2.18:9001)
- Gitea repositories

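Screenshots capture the moment; if you also want a machine-readable snapshot, a small script like the following records each dashboard's HTTP status. The URLs are taken from the Quick Reference section of this plan — adjust them to your setup.

```shell
# Record the HTTP status of each dashboard; "unreachable" means curl failed
# (timeout, refused connection, no route).
urls=(
  "http://192.168.2.130:3000"  # Dokploy
  "http://192.168.2.130:888"   # Swarmpit
  "http://192.168.2.130:8080"  # Traefik
  "http://192.168.2.18:9001"   # MinIO console
)
for url in "${urls[@]}"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code="unreachable"
  echo "$url -> $code"
done
```

Run it before and after any change in Phase 1 to confirm nothing else broke.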
---

## Phase 2: Configuration Backup (Do This Week)

### 2.1 Create Git Repository for Infrastructure

```bash
# On the controller node:
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130

# Create a backup directory
mkdir -p ~/infrastructure-backup/$(date +%Y-%m-%d)
cd ~/infrastructure-backup/$(date +%Y-%m-%d)

# Copy all compose files
cp -r /etc/dokploy/compose ./dokploy-compose
cp -r /etc/dokploy/traefik ./traefik-config
cp ~/minio-stack.yml ./

# Export service configs
mkdir -p ./service-configs
docker service ls -q | while read service; do
  docker service inspect "$service" > "./service-configs/${service}.json"
done

# Export stack configs (`docker stack ls` has no -q flag; use --format)
docker stack ls --format '{{.Name}}' | while read stack; do
  docker stack ps "$stack" > "./service-configs/${stack}-tasks.txt"
done

# Create a summary
cat > README.txt << EOF
Infrastructure Backup - $(date)
Cluster: Docker Swarm with Dokploy
Nodes: 3 (tpi-n1, tpi-n2, node-nas)
Services: $(docker service ls -q | wc -l) services
Stacks: $(docker stack ls --format '{{.Name}}' | wc -l) stacks

See HOMELAB_AUDIT.md for full documentation.
EOF

# Create tar archive
cd ..
tar -czf infrastructure-$(date +%Y-%m-%d).tar.gz $(date +%Y-%m-%d)
```

### 2.2 Commit to Gitea

```bash
# Clone your infrastructure repo (create it first if needed)
# Replace with your actual Gitea URL
git clone http://gitea.bendtstudio.com:3000/sirtimbly/homelab-configs.git
cd homelab-configs

# Copy backed-up configs
cp -r ~/infrastructure-backup/$(date +%Y-%m-%d)/* .

# Organize by service
mkdir -p {stacks,compose,dokploy,traefik,docs}
mv dokploy-compose/* compose/ 2>/dev/null || true
mv traefik-config/* traefik/ 2>/dev/null || true
mv minio-stack.yml stacks/
mv service-configs/* docs/ 2>/dev/null || true

# Commit
git add .
git commit -m "Initial infrastructure backup - $(date +%Y-%m-%d)

- All Dokploy compose files
- Traefik configuration
- MinIO stack definition
- Service inspection exports
- Task history exports

Services backed up:
$(docker service ls --format '- {{.Name}}' | sort)"

git push origin main
```

---

## Phase 3: Security Hardening (Do Next Week)

### 3.1 Remove Exposed Credentials

**Problem:** Services have passwords in plain-text environment variables, visible to anyone who can inspect the Docker configs.

**Solution:** Use Docker secrets or Dokploy environment variables.

```bash
# Example: Securing MinIO
# Instead of having the password in the compose file, use a Docker secret:

echo "your-minio-password" | docker secret create minio_root_password -

# Then in compose:
# environment:
#   MINIO_ROOT_PASSWORD_FILE: /run/secrets/minio_root_password
# secrets:
#   - minio_root_password
```

**Action items:**

1. List all services with exposed passwords:

   ```bash
   docker service ls -q | xargs -I {} docker service inspect {} --format '{{.Spec.Name}}: {{range .Spec.TaskTemplate.ContainerSpec.Env}}{{.}} {{end}}' | grep -i password
   ```

2. For each service, create a plan to move credentials to:
   - Docker secrets (best for swarm)
   - Environment files (easier to manage)
   - Dokploy UI environment variables

3. Update compose files and redeploy.

### 3.2 Update Default Passwords

Check for default/weak passwords:
- Dokploy (if still default)
- MinIO
- Gitea admin
- Technitium DNS
- Any databases

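When rotating any of these, a quick sketch: generate a strong random password with `openssl rand` and feed it straight into a Swarm secret (the secret name below is only an example).

```shell
# 24 random bytes, base64-encoded, gives a 32-character password
NEW_PASS=$(openssl rand -base64 24)
echo "Generated a ${#NEW_PASS}-character password"

# Store it as a Swarm secret (run on a manager node; name is an example):
# printf '%s' "$NEW_PASS" | docker secret create minio_root_password -
```

Piping directly into `docker secret create` keeps the password out of shell history and compose files.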
### 3.3 Review Exposed Ports

```bash
# Check all published ports
docker service ls --format '{{.Name}}: {{.Ports}}'

# Check if any services are exposed without Traefik
# (Should only be: 53, 2222, 3000, 8384, 9000-9001)
```

---

## Phase 4: Monitoring Setup (Do Next Week)

### 4.1 Set Up Prometheus + Grafana

You mentioned these in PLAN.md, but they're not running yet. Let's add them.

Create `monitoring-stack.yml`:
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - prometheus-data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - dokploy-network
    deploy:
      placement:
        constraints:
          - node.role == manager

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
    secrets:
      - grafana_admin_password
    networks:
      - dokploy-network
    deploy:
      labels:
        - traefik.http.routers.grafana.rule=Host(`grafana.bendtstudio.com`)
        - traefik.http.routers.grafana.entrypoints=websecure
        - traefik.http.routers.grafana.tls.certresolver=letsencrypt
        - traefik.enable=true

volumes:
  prometheus-data:
  grafana-data:

networks:
  dokploy-network:
    external: true

secrets:
  grafana_admin_password:
    external: true
```

### 4.2 Add Node Exporter

Deploy node-exporter on all nodes to collect system metrics.

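One way to do this (a sketch; assumes the `dokploy-network` overlay network exists and that you run this on a manager node) is a global-mode service, which schedules exactly one task per node:

```bash
# --mode global = one node-exporter task on every node in the swarm.
# /proc and /sys are bind-mounted read-only so the exporter reads host metrics.
docker service create \
  --name node-exporter \
  --mode global \
  --network dokploy-network \
  --mount type=bind,source=/proc,target=/host/proc,readonly \
  --mount type=bind,source=/sys,target=/host/sys,readonly \
  prom/node-exporter:latest \
  --path.procfs=/host/proc \
  --path.sysfs=/host/sys
```

Prometheus can then scrape each task on port 9100 via a scrape job in `prometheus.yml`.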
### 4.3 Configure Alerts

Set up alerts for:
- Service down
- High CPU/memory usage
- Disk space low
- Certificate expiration

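As a starting point, two of these could look like the following Prometheus rule file (a sketch: the disk metrics assume node-exporter is being scraped, and the thresholds are arbitrary):

```yaml
groups:
  - name: homelab-alerts
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk left on {{ $labels.instance }} ({{ $labels.mountpoint }})"
```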
---

## Phase 5: Backup Strategy (Do Within 2 Weeks)

### 5.1 Define What to Back Up

**Critical Data:**
1. Gitea repositories (/data/git)
2. Dokploy database
3. MinIO buckets
4. Immich photos (/mnt/synology-data/immich)
5. PostgreSQL databases
6. Configuration files

### 5.2 Create Backup Scripts

Example backup script for Gitea:
```bash
#!/bin/bash
# /opt/backup/backup-gitea.sh

BACKUP_DIR="/backup/gitea/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Backup Gitea data
docker exec gitea-giteasqlite-bhymqw-gitea-1 tar czf /tmp/gitea-backup.tar.gz /data
docker cp gitea-giteasqlite-bhymqw-gitea-1:/tmp/gitea-backup.tar.gz "$BACKUP_DIR/"

# Backup to MinIO (offsite)
mc cp "$BACKUP_DIR/gitea-backup.tar.gz" minio/backups/gitea/

# Clean up old backups (keep 30 days; -mindepth 1 protects /backup/gitea itself)
find /backup/gitea -mindepth 1 -type d -mtime +30 -exec rm -rf {} +
```

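A backup is only as good as its restore. A rough counterpart sketch — the container name and MinIO path mirror the backup script above; verify it against a throwaway copy before relying on it:

```bash
# Fetch the archive from MinIO (or use a local copy from /backup/gitea/<date>)
mc cp minio/backups/gitea/gitea-backup.tar.gz /tmp/

# Copy it into the container and unpack. tar stored paths without the
# leading slash, so extracting at / recreates /data.
docker cp /tmp/gitea-backup.tar.gz gitea-giteasqlite-bhymqw-gitea-1:/tmp/
docker exec gitea-giteasqlite-bhymqw-gitea-1 tar xzf /tmp/gitea-backup.tar.gz -C /
```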
### 5.3 Automate Backups

Add to crontab:
```bash
# Daily backups starting at 2 AM
0 2 * * * /opt/backup/backup-gitea.sh
0 3 * * * /opt/backup/backup-dokploy.sh
0 4 * * * /opt/backup/backup-databases.sh
```

---

## Phase 6: Documentation (Ongoing)

### 6.1 Create Service Catalog

For each service, document:
- **Purpose:** What does it do?
- **Access URL:** How do I reach it?
- **Dependencies:** What does it need?
- **Data location:** Where is data stored?
- **Backup procedure:** How to back it up?
- **Restore procedure:** How to restore it?

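A filled-in catalog entry might look like this (every value below is a placeholder, not a real service in this lab):

```markdown
## example-wiki

- **Purpose:** Team wiki
- **Access URL:** https://wiki.bendtstudio.com (via Traefik)
- **Dependencies:** PostgreSQL, dokploy-network
- **Data location:** volume `wiki-data` on node-nas
- **Backup procedure:** /opt/backup/backup-wiki.sh (daily, 2 AM)
- **Restore procedure:** restore `wiki-data` from MinIO bucket backups/wiki/
```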
### 6.2 Create Runbooks

Common operations:
- Adding a new service
- Scaling a service
- Updating a service
- Removing a service
- Recovering from node failure
- Restoring from backup

### 6.3 Network Diagram

Create a visual diagram showing:
- Nodes and their roles
- Services and their locations
- Network connections
- Data flows

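Since this plan lives in a Markdown repo, a Mermaid sketch is a low-effort starting point. Node names and addresses come from this plan, but which worker maps to which IP is an assumption, and service placement is illustrative:

```mermaid
graph TD
    lan[LAN 192.168.2.0/24]
    lan --- n1["tpi-n1 (192.168.2.130)<br/>Swarm manager"]
    lan --- n2["tpi-n2 (192.168.2.19)<br/>Swarm worker"]
    lan --- nas["node-nas (192.168.2.18)<br/>Swarm worker / storage"]
    n1 --> traefik[Traefik :8080]
    n1 --> dokploy[Dokploy :3000]
    nas --> minio[MinIO :9000/:9001]
    nas --> dns[Technitium DNS :5380]
```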
---

## Quick Reference Commands

```bash
# Cluster status
docker node ls
docker service ls
docker stack ls

# Service management
docker service logs <service> --tail 100 -f
docker service ps <service>
docker service scale <service>=<count>
docker service update --force <service>

# Resource usage
docker system df
docker stats

# SSH access
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130   # Manager
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.19    # Worker

# Web UIs
curl http://192.168.2.130:3000   # Dokploy
curl http://192.168.2.130:888    # Swarmpit
curl http://192.168.2.130:8080   # Traefik
curl http://192.168.2.18:5380    # Technitium DNS
curl http://192.168.2.18:9001    # MinIO Console
```

---

## Questions for You

Before we proceed, I need to clarify a few things:

1. **NAS Node Access:** What are the SSH credentials for node-nas (192.168.2.18)?
2. **bendtstudio-app:** Is this service needed? It has 0 replicas.
3. **syncthing:** Do you want to keep this? It's currently stopped.
4. **Monitoring:** Do you want me to set up Prometheus/Grafana now, or later?
5. **Gitea:** Can you provide access credentials so I can check what's already version controlled?
6. **Priority:** Which phase should we tackle first? I recommend Phase 1 (critical fixes).

---

*Action Plan Version 1.0 - February 9, 2026*