# RK Whisper + VLM API

OpenAI-compatible API server for:

- Whisper-style speech-to-text
- Vision understanding through the RKLLM multimodal demo (Qwen3-VL)

This service exposes:

- `GET /health`
- `POST /v1/audio/transcriptions` (Whisper-style multipart API)
- `POST /v1/vision/understand` (multipart image + prompt)
- `POST /v1/chat/completions` (OpenAI-style JSON with `image_url`)

The endpoint shapes are compatible with clients that call the OpenAI Whisper and Chat Completions APIs.

## Repo Layout

- `app/server.py` - FastAPI app
- `Dockerfile` - container image
- `docker-compose.yml` - local run
- `stack.yml` - Docker Swarm deploy with node placement
- `app/download_models.py` - downloads Whisper assets into a target directory/volume

## 1) Initialize model volumes

```bash
cp .env.example .env
docker compose --profile init run --rm whisper-models-init
```

This seeds the named Docker volume `whisper-models` with:

- `whisper_encoder_base_20s.onnx`
- `whisper_decoder_base_20s.onnx`
- `mel_80_filters.txt`
- `vocab_en.txt`
- `vocab_zh.txt`

## 2) Run with docker compose

```bash
docker compose up --build -d
curl http://127.0.0.1:9000/health
```

By default, compose runs STT only. To enable VLM locally, set:

```bash
VLM_ENABLED=true
```

Then copy the RKLLM assets into the `rkllm-root` volume (one-time):

```bash
docker volume create rk-whisper-stt-api_rkllm-root
docker run --rm \
  -v rk-whisper-stt-api_rkllm-root:/dst \
  -v /home/ubuntu/rkllm-demo:/src:ro \
  alpine:3.20 \
  sh -c 'cp -r /src/models /dst/ && mkdir -p /dst/quickstart && cp -r /src/quickstart/demo_Linux_aarch64 /dst/quickstart/'
```

## 3) Test transcription

```bash
curl http://127.0.0.1:9000/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=whisper-base-onnx \
  -F language=en \
  -F response_format=json
```

If you set `STT_API_KEY`, send an auth header:

```bash
Authorization: Bearer <your STT_API_KEY>
```

## 4) Build and push image

```bash
docker build -t registry.lan/openai-whisper-stt:latest .
docker push registry.lan/openai-whisper-stt:latest
```

## 5) Deploy to Swarm on a specific node

```bash
cp .env.example .env
# edit STT_NODE_HOSTNAME to the target node
docker stack deploy -c stack.yml whisper-stt
```

The service is pinned by:

```yaml
placement:
  constraints:
    - node.hostname == ${STT_NODE_HOSTNAME}
```

The stack uses named volumes for model persistence and backups:

```yaml
- whisper-models:/models
- rkllm-root:/opt/rkllm-root
```

Seed those volumes on the target node before deploying (same copy/download steps as for compose).

## API fields

`POST /v1/audio/transcriptions` form fields:

- `file` (required)
- `model` (default `whisper-base-onnx`)
- `language` (`en` or `zh`, default `en`)
- `response_format` (`json`, `text`, or `verbose_json`)

`POST /v1/vision/understand` form fields:

- `file` (required image)
- `prompt` (default `Describe this image in English.`)
- `model` (default `qwen3-vl-2b-rkllm`)

`POST /v1/chat/completions` accepts OpenAI-style content with `image_url`:

```json
{
  "model": "qwen3-vl-2b-rkllm",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ]
}
```

Example call:

```bash
curl http://127.0.0.1:9000/v1/vision/understand \
  -F file=@demo.jpg \
  -F prompt="Describe this image in English." \
  -F model=qwen3-vl-2b-rkllm
```
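For reference, here is a minimal Python sketch of how a server such as `app/server.py` might pull the prompt text and base64 image bytes out of an OpenAI-style chat payload like the one above. The helper name `extract_image_and_text` is illustrative, not the actual server API.

```python
# Hypothetical helper: extract (image_bytes, prompt_text) from the first
# user message of an OpenAI-style /v1/chat/completions payload.
# The function name is illustrative; it is not part of app/server.py's API.
import base64


def extract_image_and_text(payload: dict) -> tuple[bytes, str]:
    """Return (image_bytes, prompt_text) from the first user message."""
    image_bytes = b""
    prompt = ""
    for message in payload.get("messages", []):
        if message.get("role") != "user":
            continue
        for part in message.get("content", []):
            if part.get("type") == "text":
                prompt = part.get("text", "")
            elif part.get("type") == "image_url":
                url = part["image_url"]["url"]
                # Data URLs look like: data:image/jpeg;base64,<payload>
                if url.startswith("data:"):
                    _, _, b64_data = url.partition("base64,")
                    image_bytes = base64.b64decode(b64_data)
        break  # this sketch only considers the first user message
    return image_bytes, prompt


# Build a payload shaped like the JSON example above, with fake image bytes.
payload = {
    "model": "qwen3-vl-2b-rkllm",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64,"
                           + base64.b64encode(b"fake-jpeg-bytes").decode()}},
        ],
    }],
}
img, text = extract_image_and_text(payload)
print(text)  # Describe this image
print(img)   # b'fake-jpeg-bytes'
```

The decoded bytes would then be handed to the RKLLM multimodal pipeline the same way an uploaded `file` is for `POST /v1/vision/understand`.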