new plan and docs

Bendt
2026-02-09 09:52:00 -05:00
parent 8d23a6f576
commit 15c9dee474
4 changed files with 965 additions and 0 deletions

HOMELAB_AUDIT.md Normal file

@@ -0,0 +1,425 @@
# Home Lab Cluster Audit Report
- **Date:** February 9, 2026
- **Auditor:** opencode
- **Cluster:** Docker Swarm with Dokploy
---
## 1. Cluster Overview
- **Cluster Type:** Docker Swarm (3 nodes)
- **Orchestration:** Dokploy v3.x
- **Reverse Proxy:** Traefik v3.6.1
- **DNS:** Technitium DNS Server
- **Monitoring:** Swarmpit
- **Git Server:** Gitea v1.24.4
- **Object Storage:** MinIO
---
## 2. Node Inventory
### Node 1: tpi-n1 (Controller/Manager)
- **IP:** 192.168.2.130
- **Role:** Manager (Leader)
- **Architecture:** aarch64 (ARM64)
- **OS:** Linux
- **CPU:** 8 cores
- **RAM:** ~8 GB
- **Docker:** v27.5.1
- **Labels:**
  - `infra=true`
  - `role=storage`
  - `storage=high`
- **Status:** Ready, Active
### Node 2: tpi-n2 (Worker)
- **IP:** 192.168.2.19
- **Role:** Worker
- **Architecture:** aarch64 (ARM64)
- **OS:** Linux
- **CPU:** 8 cores
- **RAM:** ~8 GB
- **Docker:** v27.5.1
- **Labels:**
  - `role=compute`
- **Status:** Ready, Active
### Node 3: node-nas (Storage Worker)
- **IP:** 192.168.2.18
- **Role:** Worker (NAS/Storage)
- **Architecture:** x86_64
- **OS:** Linux
- **CPU:** 2 cores
- **RAM:** ~8 GB
- **Docker:** v29.1.2
- **Labels:**
  - `type=nas`
- **Status:** Ready, Active
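
The inventory above lists two different Docker engine versions (v27.5.1 on the tpi nodes, v29.1.2 on node-nas). A minimal sketch for surfacing that kind of skew follows; `engine_versions` is a hypothetical helper name, and it assumes input lines of the form `hostname version`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarise Docker engine versions across Swarm nodes.
# Reads "hostname version" pairs on stdin and prints each distinct version
# preceded by the number of nodes running it.
engine_versions() {
  awk '{ count[$2]++ } END { for (v in count) print count[v], v }' | sort -k2
}

# On a manager node (assumes the docker CLI is available):
#   docker node ls --format '{{.Hostname}} {{.EngineVersion}}' | engine_versions
```

More than one line of output means the nodes are not all on the same engine version.
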
---
## 3. Docker Stacks (Swarm Mode)
### Active Stacks:
#### 1. minio
- **Services:** 1 (minio_minio)
- **Status:** Running
- **Node:** node-nas (constrained to NAS)
- **Ports:** 9000 (API), 9001 (Console)
- **Storage:** /mnt/synology-data/minio (bind mount)
- **Credentials:** [REDACTED - see service config]
#### 2. swarmpit
- **Services:** 4
  - swarmpit_app (UI) - Running on tpi-n1, Port 888
  - swarmpit_agent (global) - Running on all 3 nodes
  - swarmpit_db (CouchDB) - Running on tpi-n2
  - swarmpit_influxdb - Running on node-nas
- **Status:** Active with historical failures
- **Issues:** Multiple container failures in history (mostly resolved)
---
## 4. Dokploy-Managed Services
### Running Services (via Dokploy Compose):
1. **ai-lobechat-yqvecg** - AI chat interface
2. **bewcloud-memos-ssogxn** - Note-taking app (⚠️ Restarting loop)
3. **bewcloud-silverbullet-42sjev** - SilverBullet markdown editor + Watchtower
4. **cloud-bewcloud-u2pls5** - BewCloud instance with Radicale (CalDAV/CardDAV)
5. **cloud-fizzy-ezuhfq** - Fizzy web app
6. **cloud-ironcalc-0id5k8** - IronCalc spreadsheet
7. **cloud-radicale-wqldcv** - Standalone Radicale server
8. **cloud-uptimekuma-jdeivt** - Uptime monitoring
9. **dns-technitum-6ojgo2** - Technitium DNS server
10. **gitea-giteasqlite-bhymqw** - Git server (Port 3000, SSH on 2222)
11. **gitea-registry-vdftrt** - Docker registry (Port 5000)
### Dokploy Infrastructure Services:
- **dokploy** - Main Dokploy UI (Port 3000, host mode)
- **dokploy-postgres** - Dokploy database
- **dokploy-redis** - Dokploy cache
- **dokploy-traefik** - Reverse proxy (Ports 80, 443, 8080)
---
## 5. Standalone Services (docker-compose)
### Running:
- **technitium-dns** - DNS server (Port 53, 5380)
- **immich3-compose** - Photo management (Immich v2.3.0)
  - immich-server
  - immich-machine-learning
  - immich-database (pgvecto-rs)
  - immich-redis
### Stack Services:
- **bendtstudio-pancake-bzgfpc** - MariaDB database (Port 3306)
- **bendtstudio-webstatic-iq9evl** - Static web files (⚠️ Rollback paused state)
---
## 6. Issues Identified
### 🔴 Critical Issues:
1. **bewcloud-memos in Restart Loop**
   - Container restarts repeatedly (last restart observed 24 seconds before the audit)
   - Status: `Restarting (0) 24 seconds ago`; the `(0)` is the exit code, which suggests the process exits cleanly rather than crashing
   - **Action Required:** Check logs and fix configuration
2. **bendtstudio-webstatic in Rollback Paused State**
   - The last update did not complete, and the rollback that followed has paused
   - State: `rollback_paused`
   - **Action Required:** Investigate update failure
3. **bendtstudio-app Not Running**
   - Service has 0/0 replicas
   - **Action Required:** Determine if needed or remove
4. **syncthing Stopped**
   - Service has 0 replicas
   - Should be on node-nas
   - **Action Required:** Restart or remove if not needed
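
Items 3 and 4 are both replica-count problems, which can be spotted mechanically. The sketch below is illustrative only: `replica_gaps` is a hypothetical helper, and it assumes input lines of the form `name running/desired`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list services whose running task count is below the
# desired count, or whose desired count is zero.
# Reads "name running/desired" pairs on stdin.
replica_gaps() {
  awk '{
    split($2, r, "/")
    running = r[1] + 0; desired = r[2] + 0
    if (desired == 0)            print $1, "scaled-to-zero"
    else if (running < desired)  print $1, "degraded", $2
  }'
}

# On a manager node (assumes the docker CLI is available):
#   docker service ls --format '{{.Name}} {{.Replicas}}' | replica_gaps
```

A service reported as `scaled-to-zero` was stopped deliberately or by a tool, whereas `degraded` means Swarm wants tasks running but cannot keep them up.
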
### 🟡 Warning Issues:
5. **Swarmpit Agent Failures (Historical)**
   - Multiple past task failures on all three nodes
   - Agents are currently running, but the failure history is a concern
   - **Action Required:** Monitor for stability
6. **No Monitoring or Backup Strategy for MinIO**
   - MinIO is running, but no backup or monitoring strategy is documented
   - **Action Required:** Set up monitoring and backup
7. **Credential Management**
   - Passwords are visible in service configs (bendtstudio-webstatic, MinIO, DNS)
   - **Action Required:** Migrate to Docker secrets or env files
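
For item 7, one possible shape of the migration to Docker secrets is sketched here. The function names, the secret name, and the `DOCKER` override are assumptions made for illustration; only `docker secret create` and `docker service update --secret-add` are actual Docker CLI commands.

```bash
#!/usr/bin/env bash
# Sketch: move a password out of a service's environment and into a Swarm
# secret. DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"

# create_secret NAME: reads the secret value from stdin so it does not
# appear as a command-line argument.
create_secret() {
  "$DOCKER" secret create "$1" -
}

# attach_secret SERVICE NAME: mounts the secret at /run/secrets/NAME inside
# the service's containers.
attach_secret() {
  "$DOCKER" service update --secret-add "$2" "$1"
}

# Example (the secret name is illustrative):
#   printf '%s' "$MINIO_ROOT_PASSWORD" | create_secret minio_root_password
#   attach_secret minio_minio minio_root_password
```

Secrets are a Swarm feature, so this applies to services deployed as stacks. The application must also be configured to read the value from `/run/secrets/<name>`, which is worth checking per image before removing the environment variable.
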
### 🟢 Informational:
8. **13 Unused/Orphaned Volumes**
   - 33 total volumes, only 20 in use
   - **Action Required:** Clean up unused volumes to reclaim ~595MB
9. **Gitea Repository Status Unknown**
   - Could not verify whether all compose files are version controlled
   - **Action Required:** Audit Gitea repositories
---
## 7. Storage Configuration
### Local Volumes (33 total):
Key volumes include:
- `dokploy-postgres-database`
- `bewcloud-postgres-in40hh-data`
- `gitea-data`, `gitea-registry-data`
- `immich-postgres`, `immich-redis-data`, `immich-model-cache`
- `bendtstudio-pancake-data`
- `shared-data` (NFS/shared)
- Various app-specific volumes
### Bind Mounts:
- **MinIO:** `/mnt/synology-data/minio` → `/data`
- **Syncthing:** `/mnt/synology-data` → `/var/syncthing` (currently stopped)
- **Dokploy:** `/etc/dokploy` → `/etc/dokploy`
### NFS Mounts:
- Synology NAS mounted at `/mnt/synology-data/`
- Contains: immich/, minio/
---
## 8. Networking
### Overlay Networks:
- `dokploy-network` - Main Dokploy network
- `minio_default` - MinIO stack network
- `swarmpit_net` - Swarmpit monitoring network
- `ingress` - Docker Swarm ingress
### Bridge Networks:
- Multiple app-specific networks created by compose
  - `ai-lobechat-yqvecg`
  - `bewcloud-memos-ssogxn`
  - `bewcloud-silverbullet-42sjev`
  - `cloud-fizzy-ezuhfq_default`
  - `cloud-uptimekuma-jdeivt`
  - `gitea-giteasqlite-bhymqw`
  - `gitea-registry-vdftrt`
  - `immich3-compose-ubyhe9_default`
---
## 9. SSL/TLS Configuration
- **Certificate Resolver:** Let's Encrypt (ACME)
- **Email:** sirtimbly@gmail.com
- **Challenge Type:** HTTP-01
- **Storage:** `/etc/dokploy/traefik/dynamic/acme.json`
- **Entry Points:** web (80) → websecure (443) with auto-redirect
- **HTTP/3:** Enabled on websecure
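
Traefik expects its ACME storage file to be readable only by its owner (mode 600) and rejects more permissive modes. The check below is a small sketch under that assumption; `check_acme_perms` is a hypothetical helper, and it relies on GNU `stat`.

```bash
#!/usr/bin/env bash
# Sketch: verify that Traefik's ACME storage file is private (mode 600).
# Uses GNU stat (-c), as found on typical Linux hosts.
check_acme_perms() {
  local file="$1" mode
  mode=$(stat -c '%a' "$file") || return 2
  if [ "$mode" = "600" ]; then
    echo "ok: $file is mode 600"
  else
    echo "warn: $file is mode $mode, expected 600"
    return 1
  fi
}

# Example, using the storage path listed above:
#   check_acme_perms /etc/dokploy/traefik/dynamic/acme.json
```

The non-zero exit status on a mismatch makes the function usable from a cron job or a health-check script.
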
---
## 10. Traefik Routing
### Configured Routes (via labels):
- gitea.bendtstudio.com → Gitea
- Multiple apps via traefik.me subdomains
- HTTP → HTTPS redirect enabled
- Middlewares configured in `/etc/dokploy/traefik/dynamic/`
---
## 11. DNS Configuration
### Technitium DNS:
- **Port:** 53 (TCP/UDP), 5380 (Web UI)
- **Domain:** dns.bendtstudio.com
- **Admin Password:** [REDACTED]
- **Placement:** Locked to tpi-n1
- **TZ:** America/New_York
### Services using DNS:
- All services accessible via bendtstudio.com subdomains
- Internal DNS resolution for Docker services
---
## 12. Configuration Files Location
### In `/etc/dokploy/`:
- `traefik/traefik.yml` - Main Traefik config
- `traefik/dynamic/*.yml` - Dynamic routes and middlewares
- `compose/*/code/docker-compose.yml` - Dokploy-managed compose files
### In `/home/ubuntu/`:
- `minio-stack.yml` - MinIO stack definition
### In local workspace:
- Various compose files (not all deployed via Dokploy)
- May be out of sync with running services
---
## 13. Missing Configuration in Version Control
Based on the analysis, the following may NOT be properly tracked in Gitea:
1. **Gitea** itself - compose file present
2. **MinIO** - stack file in ~/minio-stack.yml
3. ⚠️ **Dokploy dynamic configs** - traefik routes
4. ⚠️ **All Dokploy-managed compose files** - 11 services
5. **Technitium DNS** - compose file in /etc/dokploy/
6. **Immich** - compose configuration
7. **Swarmpit** - stack configuration
8. **Dokploy infrastructure** - internal services
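
One way to get the Dokploy-managed compose files into version control is to snapshot them into a Git working tree. This is a sketch only: `collect_compose` and the destination repository path are hypothetical, and the directory layout is assumed from the `compose/*/code/docker-compose.yml` pattern listed in section 12.

```bash
#!/usr/bin/env bash
# Sketch: copy every Dokploy-managed compose file into a working tree so it
# can be committed. Assumed layout:
#   <src_root>/<service>/code/docker-compose.yml
collect_compose() {
  local src_root="$1" dest="$2" f service
  for f in "$src_root"/*/code/docker-compose.yml; do
    [ -f "$f" ] || continue
    service=$(basename "$(dirname "$(dirname "$f")")")
    mkdir -p "$dest/$service"
    cp "$f" "$dest/$service/docker-compose.yml"
    echo "$service"
  done
}

# Example (the destination repository path is hypothetical):
#   collect_compose /etc/dokploy/compose ~/homelab-config/compose
#   cd ~/homelab-config && git add compose && git commit -m "Snapshot compose files"
```

Because section 6 notes that some service configs contain visible passwords, the copied files should be reviewed before they are committed.
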
---
## 14. Resource Usage
### Docker System:
- **Images:** 23 (10.91 GB)
- **Containers:** 26 (135 MB)
- **Volumes:** 33 (2.02 GB, 595MB reclaimable)
- **Build Cache:** 0
### Node Resources:
- **tpi-n1 & tpi-n2:** 8 cores ARM64, 8GB RAM each
- **node-nas:** 2 cores x86_64, 8GB RAM
---
## 15. Recommendations
### Immediate Actions (High Priority):
1. **Fix bewcloud-memos**
```bash
docker service logs bewcloud-memos-ssogxn-memos --tail 50
```
2. **Fix bendtstudio-webstatic**
```bash
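docker service ps bendtstudio-webstatic-iq9evl --no-trunc
# Show why the update/rollback stalled before forcing a new rollout
docker service inspect --format '{{json .UpdateStatus}}' bendtstudio-webstatic-iq9evl
docker service update --force bendtstudio-webstatic-iq9evl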
```
3. **Restart or Remove syncthing**
```bash
# Option 1: Scale up
docker service scale syncthing=1
# Option 2: Remove
docker service rm syncthing
```
4. **Clean up unused volumes**
```bash
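docker volume ls --filter dangling=true  # review candidates first (includes volumes of stopped services)
docker volume prune --all                # since Docker 23.0, named volumes are only pruned with --all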
```
### Short-term Actions (Medium Priority):
5. **Audit Gitea repositories**
   - Access Gitea at http://gitea.bendtstudio.com
   - Verify which compose files are tracked
   - Commit missing configurations
6. **Secure credentials**
   - Use Docker secrets for passwords
   - Move credentials to environment files
   - Never commit .env files with real passwords
7. **Set up automated backups**
   - Back up Dokploy database
   - Back up Gitea repositories
   - Back up MinIO data
8. **Document all services**
   - Create README for each service
   - Document dependencies and data locations
   - Create runbook for common operations
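
As a starting point for item 7, a dated archive of a configuration directory could look like the sketch below. `backup_dir` and the destination directory are hypothetical names, not something found on the cluster.

```bash
#!/usr/bin/env bash
# Sketch: write a dated, compressed archive of one directory.
# backup_dir SRC DEST_DIR  ->  prints the path of the archive it created.
backup_dir() {
  local src="$1" dest_dir="$2" name stamp archive
  name=$(basename "$src")
  stamp=$(date +%Y%m%d-%H%M%S)
  archive="$dest_dir/$name-$stamp.tar.gz"
  mkdir -p "$dest_dir" || return 1
  # -C keeps paths inside the archive relative to the parent of SRC.
  tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
  echo "$archive"
}

# Example (the destination directory is hypothetical):
#   backup_dir /etc/dokploy /mnt/synology-data/backups
```

A file-level archive suits configuration directories. For live databases such as the Dokploy Postgres instance, a database dump is usually the safer route, since copying data files while the server is writing can produce an inconsistent backup.
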
### Long-term Actions (Low Priority):
9. **Implement proper monitoring**
   - Prometheus/Grafana for metrics (mentioned in PLAN.md but not found)
   - Alerting for service failures
   - Disk usage monitoring
10. **Implement GitOps workflow**
    - All changes through Git
    - Automated deployments via Dokploy webhooks
    - Configuration drift detection
11. **Consolidate storage strategy**
    - Define clear policy for volumes vs bind mounts
    - Document backup procedures for each storage type
12. **Security audit**
    - Review all exposed ports
    - Check for default/weak passwords
    - Implement network segmentation if needed
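
Until a full metrics stack is in place, the disk usage monitoring from item 9 can be approximated with a small filter over `df` output. This is a stopgap sketch: `disk_alerts` is a hypothetical helper, the 80% default is an arbitrary choice, and mount points are assumed to contain no spaces.

```bash
#!/usr/bin/env bash
# Sketch: flag filesystems at or above a usage threshold.
# Reads `df -P` output on stdin; the first argument is a percentage
# threshold (default 80).
disk_alerts() {
  local threshold="${1:-80}"
  awk -v limit="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Example:
#   df -P | disk_alerts 85
```

Each output line names a mount point and its current usage, so empty output means every filesystem is below the threshold.
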
---
## 16. Next Steps Checklist
- [ ] Fix critical service issues (memos, webstatic)
- [ ] Document all running services with purpose
- [ ] Commit all compose files to Gitea
- [ ] Create backup strategy
- [ ] Set up monitoring and alerting
- [ ] Clean up unused resources
- [ ] Create disaster recovery plan
- [ ] Document SSH access for all nodes
---
## Appendix A: Quick Commands Reference
```bash
# View cluster status
docker node ls
docker service ls
docker stack ls
# View service logs
docker service logs <service-name> --tail 100 -f
# View container logs
docker logs <container-name> --tail 100 -f
# Scale a service
docker service scale <service-name>=<replicas>
# Update a service
docker service update --force <service-name>
# SSH to nodes
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.130 # tpi-n1 (manager)
ssh -i ~/.ssh/id_ed25519 ubuntu@192.168.2.19 # tpi-n2 (worker)
# NAS node requires different credentials
# Access Dokploy UI:        http://192.168.2.130:3000
# Access Swarmpit UI:       http://192.168.2.130:888
# Access Traefik Dashboard: http://192.168.2.130:8080
```
---
*End of Audit Report*