# RK Whisper + VLM API
OpenAI-compatible API server for:
- Whisper-style speech-to-text
- Vision understanding through the RKLLM multimodal demo (Qwen3-VL)
This service exposes:
- `GET /health`
- `POST /v1/audio/transcriptions` (Whisper-style multipart API)
- `POST /v1/vision/understand` (multipart image + prompt)
- `POST /v1/chat/completions` (OpenAI-style JSON with image_url)
Request and response shapes are compatible with clients built for the OpenAI Whisper and Chat Completions APIs.
## Repo Layout
- `app/server.py` - FastAPI app
- `Dockerfile` - container image
- `docker-compose.yml` - local run
- `stack.yml` - Docker Swarm deploy with node placement
- `app/download_models.py` - downloads Whisper assets into a target directory/volume
## 1) Initialize model volumes
```bash
cp .env.example .env
docker compose --profile init run --rm whisper-models-init
```
This seeds the named Docker volume `whisper-models` with:
- `whisper_encoder_base_20s.onnx`
- `whisper_decoder_base_20s.onnx`
- `mel_80_filters.txt`
- `vocab_en.txt`
- `vocab_zh.txt`
## 2) Run with docker compose
```bash
docker compose up --build -d
curl http://127.0.0.1:9000/health
```
By default, compose runs STT only. To enable the VLM endpoints locally, set the flag in your environment (for example in `.env`) and restart the stack:
```bash
VLM_ENABLED=true
```
Then copy RKLLM assets into the `rkllm-root` volume (one-time):
```bash
docker volume create rk-whisper-stt-api_rkllm-root
docker run --rm \
-v rk-whisper-stt-api_rkllm-root:/dst \
-v /home/ubuntu/rkllm-demo:/src:ro \
alpine:3.20 \
sh -c 'cp -r /src/models /dst/ && mkdir -p /dst/quickstart && cp -r /src/quickstart/demo_Linux_aarch64 /dst/quickstart/'
```
## 3) Test transcription
```bash
curl http://127.0.0.1:9000/v1/audio/transcriptions \
-F file=@/path/to/audio.wav \
-F model=whisper-base-onnx \
-F language=en \
-F response_format=json
```
If you set `STT_API_KEY`, send the key as a bearer token on every request:
```bash
curl http://127.0.0.1:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer <your-key>" \
  -F file=@/path/to/audio.wav
```
## 4) Build and push image
```bash
docker build -t registry.lan/openai-whisper-stt:latest .
docker push registry.lan/openai-whisper-stt:latest
```
## 5) Deploy to Swarm on a specific node
```bash
cp .env.example .env
# edit STT_NODE_HOSTNAME to the target node
docker stack deploy -c stack.yml whisper-stt
```
The service is pinned by:
```yaml
placement:
constraints:
- node.hostname == ${STT_NODE_HOSTNAME}
```
The stack uses named volumes for model persistence and backups:
```yaml
whisper-models:/models
rkllm-root:/opt/rkllm-root
```
Seed those volumes on the target node before deploying (same copy/download steps as compose).
## API fields
`POST /v1/audio/transcriptions` form fields:
- `file` (required)
- `model` (default `whisper-base-onnx`)
- `language` (`en` or `zh`, default `en`)
- `response_format` (`json`, `text`, or `verbose_json`)
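Response bodies are expected to mirror the OpenAI transcription API (the exact field set is an assumption based on the compatibility claim above): `text` returns the bare transcript as plain text, `json` returns a minimal object, and `verbose_json` adds metadata such as detected language and duration. A sketch of the `json` shape:
```json
{
  "text": "transcribed speech goes here"
}
```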
`POST /v1/vision/understand` form fields:
- `file` (required image)
- `prompt` (default `Describe this image in English.`)
- `model` (default `qwen3-vl-2b-rkllm`)
`POST /v1/chat/completions` accepts OpenAI-style content with `image_url`:
```json
{
"model": "qwen3-vl-2b-rkllm",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
]
}
```
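For the `image_url` variant, the image is embedded as a base64 `data:` URL. A minimal Python sketch for assembling the request body shown above (`to_data_url` and `build_chat_payload` are illustrative helpers, not part of this repo; sending the payload with an HTTP client is left to the caller):

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    """Encode a local image file as a data: URL for the image_url field."""
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"


def build_chat_payload(image_path: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat/completions request body."""
    return {
        "model": "qwen3-vl-2b-rkllm",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": to_data_url(image_path)}},
                ],
            }
        ],
    }
```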
Example call:
```bash
curl http://127.0.0.1:9000/v1/vision/understand \
-F file=@demo.jpg \
-F prompt="Describe this image in English." \
-F model=qwen3-vl-2b-rkllm
```