RK Whisper + VLM API
OpenAI-compatible API server for:
- Whisper-style speech-to-text
- Vision understanding through the RKLLM multimodal demo (Qwen3-VL)
This service exposes:
- GET /health
- POST /v1/audio/transcriptions (Whisper-style multipart API)
- POST /v1/vision/understand (multipart image + prompt)
- POST /v1/chat/completions (OpenAI-style JSON with image_url)
The endpoint shape is compatible with clients that call OpenAI Whisper and Chat Completions APIs.
Repo Layout
- app/server.py - FastAPI app
- Dockerfile - container image
- docker-compose.yml - local run
- stack.yml - Docker Swarm deploy with node placement
- app/download_models.py - downloads Whisper assets into a target directory/volume
1) Initialize model volumes
```sh
cp .env.example .env
docker compose --profile init run --rm whisper-models-init
```
This seeds the named Docker volume whisper-models with:
- whisper_encoder_base_20s.onnx
- whisper_decoder_base_20s.onnx
- mel_80_filters.txt
- vocab_en.txt
- vocab_zh.txt
2) Run with docker compose
```sh
docker compose up --build -d
curl http://127.0.0.1:9000/health
```
By default, compose runs STT only. To enable VLM locally, set this in .env:

```sh
VLM_ENABLED=true
```
Then copy RKLLM assets into the rkllm-root volume (one-time):
```sh
docker volume create rk-whisper-stt-api_rkllm-root
docker run --rm \
  -v rk-whisper-stt-api_rkllm-root:/dst \
  -v /home/ubuntu/rkllm-demo:/src:ro \
  alpine:3.20 \
  sh -c 'cp -r /src/models /dst/ && mkdir -p /dst/quickstart && cp -r /src/quickstart/demo_Linux_aarch64 /dst/quickstart/'
```
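After the copy, the volume root should contain models/ and quickstart/demo_Linux_aarch64/. A small Python sketch for sanity-checking that layout (rkllm_root_ok is a hypothetical helper, not shipped with the repo):

```python
import pathlib

# Layout that the copy command above produces inside the rkllm-root volume.
REQUIRED_PATHS = ["models", "quickstart/demo_Linux_aarch64"]

def rkllm_root_ok(root: str) -> bool:
    # True only when every expected path exists under the given root.
    base = pathlib.Path(root)
    return all((base / rel).exists() for rel in REQUIRED_PATHS)
```

Run it against the mounted volume path (e.g. /opt/rkllm-root inside the container) before starting the VLM side.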
3) Test transcription
```sh
curl http://127.0.0.1:9000/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=whisper-base-onnx \
  -F language=en \
  -F response_format=json
```
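Under the hood, curl -F sends a multipart/form-data body. A stdlib-only Python sketch of the same request body (the encoder below is illustrative, not code from this repo):

```python
import io
import uuid

def encode_multipart(fields: dict, file_field: str, filename: str, file_bytes: bytes):
    # Encode plain form fields plus one file part as multipart/form-data,
    # mirroring what `curl -F` sends to POST /v1/audio/transcriptions.
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    buf.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
    )
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", buf.getvalue()

# Placeholder bytes; in practice read a real WAV file from disk.
content_type, body = encode_multipart(
    {"model": "whisper-base-onnx", "language": "en", "response_format": "json"},
    "file",
    "audio.wav",
    b"RIFF....WAVE",
)
```

POST the body with the returned Content-Type header; any HTTP client that sets both will look identical to the curl call above.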
If you set STT_API_KEY, send an auth header:
```
Authorization: Bearer <your-key>
```
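In a Python client, the same header can be derived from the STT_API_KEY environment variable used in .env; a minimal sketch (auth_headers is a hypothetical helper, not part of the repo):

```python
import os

def auth_headers() -> dict:
    # Send "Authorization: Bearer <key>" only when STT_API_KEY is set,
    # matching the optional auth described above.
    key = os.environ.get("STT_API_KEY")
    return {"Authorization": f"Bearer {key}"} if key else {}
```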
4) Build and push image
```sh
docker build -t registry.lan/openai-whisper-stt:latest .
docker push registry.lan/openai-whisper-stt:latest
```
5) Deploy to Swarm on a specific node
```sh
cp .env.example .env
# edit STT_NODE_HOSTNAME to the target node
docker stack deploy -c stack.yml whisper-stt
```
The service is pinned by:
```yaml
placement:
  constraints:
    - node.hostname == ${STT_NODE_HOSTNAME}
```
The stack uses named volumes for model persistence and backups:
```
whisper-models:/models
rkllm-root:/opt/rkllm-root
```
Seed those volumes on the target node before deploying (same copy/download steps as compose).
API fields
POST /v1/audio/transcriptions form fields:
- file (required)
- model (default whisper-base-onnx)
- language (en or zh, default en)
- response_format (json, text, or verbose_json)
POST /v1/vision/understand form fields:
- file (required image)
- prompt (default "Describe this image in English.")
- model (default qwen3-vl-2b-rkllm)
POST /v1/chat/completions accepts OpenAI-style content with image_url:
```json
{
  "model": "qwen3-vl-2b-rkllm",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ]
}
```
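That payload can be assembled programmatically. A minimal Python sketch that inlines an image as a base64 data URL (the helper names here are illustrative, not part of the repo):

```python
import base64

def image_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    # Inline raw image bytes as a data: URL for the image_url content part.
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")

def chat_payload(prompt: str, image_bytes: bytes, model: str = "qwen3-vl-2b-rkllm") -> dict:
    # Build the OpenAI-style chat/completions body shown above.
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_data_url(image_bytes)}},
                ],
            }
        ],
    }
```

Serialize the dict with json.dumps and POST it to /v1/chat/completions with Content-Type: application/json.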
Example call:
```sh
curl http://127.0.0.1:9000/v1/vision/understand \
  -F file=@demo.jpg \
  -F prompt="Describe this image in English." \
  -F model=qwen3-vl-2b-rkllm
```