RK Whisper + VLM API
OpenAI-compatible API server for:
- Whisper-style speech-to-text
- Vision understanding through the RKLLM multimodal demo (Qwen3-VL)
This service exposes:
- GET /health
- POST /v1/audio/transcriptions (Whisper-style multipart API)
- POST /v1/vision/understand (multipart image + prompt)
- POST /v1/chat/completions (OpenAI-style JSON with image_url)
The endpoint shape is compatible with clients that call OpenAI Whisper and Chat Completions APIs.
Repo Layout
- app/server.py - FastAPI app
- Dockerfile - container image
- docker-compose.yml - local run
- stack.yml - Docker Swarm deploy with node placement
- app/download_models.py - downloads Whisper assets into a target directory/volume
1) Initialize model volumes
```sh
cp .env.example .env
docker compose --profile init run --rm whisper-models-init
```
This seeds the named Docker volume whisper-models with:
- whisper_encoder_base_20s.onnx
- whisper_decoder_base_20s.onnx
- mel_80_filters.txt
- vocab_en.txt
- vocab_zh.txt
2) Run with docker compose
```sh
docker compose up --build -d
curl http://127.0.0.1:9000/health
```
By default, compose runs STT only. To enable VLM locally, set this in .env:

```sh
VLM_ENABLED=true
```
Then copy RKLLM assets into the rkllm-root volume (one-time):
```sh
docker volume create rk-whisper-stt-api_rkllm-root
docker run --rm \
  -v rk-whisper-stt-api_rkllm-root:/dst \
  -v /home/ubuntu/rkllm-demo:/src:ro \
  alpine:3.20 \
  sh -c 'cp -r /src/models /dst/ && mkdir -p /dst/quickstart && cp -r /src/quickstart/demo_Linux_aarch64 /dst/quickstart/'
```
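After the copy, the volume root should contain models/ and quickstart/demo_Linux_aarch64/. A small Python sketch for sanity-checking that layout (rkllm_root_ok is a hypothetical helper, not shipped with the repo):

```python
import pathlib

# Layout that the copy command above produces inside the rkllm-root volume.
REQUIRED_PATHS = ["models", "quickstart/demo_Linux_aarch64"]

def rkllm_root_ok(root: str) -> bool:
    # True only when every expected path exists under the given root.
    base = pathlib.Path(root)
    return all((base / rel).exists() for rel in REQUIRED_PATHS)
```

Run it against the mounted volume path (e.g. /opt/rkllm-root inside the container) before starting the VLM side.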
3) Test transcription
```sh
curl http://127.0.0.1:9000/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=whisper-base-onnx \
  -F language=en \
  -F response_format=json
```
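Under the hood, curl -F sends a multipart/form-data body. A stdlib-only Python sketch of the same request body (the encoder below is illustrative, not code from this repo):

```python
import io
import uuid

def encode_multipart(fields: dict, file_field: str, filename: str, file_bytes: bytes):
    # Encode plain form fields plus one file part as multipart/form-data,
    # mirroring what `curl -F` sends to POST /v1/audio/transcriptions.
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    buf.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
    )
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", buf.getvalue()

# Placeholder bytes; in practice read a real WAV file from disk.
content_type, body = encode_multipart(
    {"model": "whisper-base-onnx", "language": "en", "response_format": "json"},
    "file",
    "audio.wav",
    b"RIFF....WAVE",
)
```

POST the body with the returned Content-Type header; any HTTP client that sets both will look identical to the curl call above.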
If you set STT_API_KEY, send an auth header:
```
Authorization: Bearer <your-key>
```
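In a Python client, the same header can be derived from the STT_API_KEY environment variable used in .env; a minimal sketch (auth_headers is a hypothetical helper, not part of the repo):

```python
import os

def auth_headers() -> dict:
    # Send "Authorization: Bearer <key>" only when STT_API_KEY is set,
    # matching the optional auth described above.
    key = os.environ.get("STT_API_KEY")
    return {"Authorization": f"Bearer {key}"} if key else {}
```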
4) Build and push image
```sh
docker build -t registry.lan/openai-whisper-stt:latest .
docker push registry.lan/openai-whisper-stt:latest
```
5) Deploy to Swarm on a specific node
```sh
cp .env.example .env
# edit STT_NODE_HOSTNAME to the target node
docker stack deploy -c stack.yml whisper-stt
```
The service is pinned by:
```yaml
placement:
  constraints:
    - node.hostname == ${STT_NODE_HOSTNAME}
```
The stack uses named volumes for model persistence and backups:
```
whisper-models:/models
rkllm-root:/opt/rkllm-root
```
Seed those volumes on the target node before deploying (same copy/download steps as compose).
API fields
POST /v1/audio/transcriptions form fields:
- file (required)
- model (default whisper-base-onnx)
- language (en or zh, default en)
- response_format (json, text, or verbose_json)
POST /v1/vision/understand form fields:
- file (required image)
- prompt (default "Describe this image in English.")
- model (default qwen3-vl-2b-rkllm)
POST /v1/chat/completions accepts OpenAI-style content with image_url:
```json
{
  "model": "qwen3-vl-2b-rkllm",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ]
}
```
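That payload can be assembled programmatically. A minimal Python sketch that inlines an image as a base64 data URL (the helper names here are illustrative, not part of the repo):

```python
import base64

def image_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    # Inline raw image bytes as a data: URL for the image_url content part.
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")

def chat_payload(prompt: str, image_bytes: bytes, model: str = "qwen3-vl-2b-rkllm") -> dict:
    # Build the OpenAI-style chat/completions body shown above.
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_data_url(image_bytes)}},
                ],
            }
        ],
    }
```

Serialize the dict with json.dumps and POST it to /v1/chat/completions with Content-Type: application/json.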
Example call:
```sh
curl http://127.0.0.1:9000/v1/vision/understand \
  -F file=@demo.jpg \
  -F prompt="Describe this image in English." \
  -F model=qwen3-vl-2b-rkllm
```