# RK Whisper + VLM API

OpenAI-compatible API server for:

- Whisper-style speech-to-text
- Vision understanding through the RKLLM multimodal demo (Qwen3-VL)

This service exposes:

- `GET /health`
- `POST /v1/audio/transcriptions` (Whisper-style multipart API)
- `POST /v1/vision/understand` (multipart image + prompt)
- `POST /v1/chat/completions` (OpenAI-style JSON with `image_url`)

The endpoint shapes are compatible with clients that call the OpenAI Whisper and Chat Completions APIs.

## Repo Layout

- `app/server.py` - FastAPI app
- `Dockerfile` - container image
- `docker-compose.yml` - local run
- `stack.yml` - Docker Swarm deploy with node placement
- `app/download_models.py` - downloads Whisper assets into a target directory/volume

## 1) Initialize model volumes

```bash
cp .env.example .env
docker compose --profile init run --rm whisper-models-init
```

This seeds the named Docker volume `whisper-models` with:

- `whisper_encoder_base_20s.onnx`
- `whisper_decoder_base_20s.onnx`
- `mel_80_filters.txt`
- `vocab_en.txt`
- `vocab_zh.txt`

## 2) Run with docker compose

```bash
docker compose up --build -d
curl http://127.0.0.1:9000/health
```

By default, compose runs STT only. To enable VLM locally, set in `.env`:

```bash
VLM_ENABLED=true
```

Then copy RKLLM assets into the `rkllm-root` volume (one-time):

```bash
docker volume create rk-whisper-stt-api_rkllm-root

docker run --rm \
  -v rk-whisper-stt-api_rkllm-root:/dst \
  -v /home/ubuntu/rkllm-demo:/src:ro \
  alpine:3.20 \
  sh -c 'cp -r /src/models /dst/ && mkdir -p /dst/quickstart && cp -r /src/quickstart/demo_Linux_aarch64 /dst/quickstart/'
```

## 3) Test transcription

```bash
curl http://127.0.0.1:9000/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=whisper-base-onnx \
  -F language=en \
  -F response_format=json
```
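The same request can be made programmatically. The sketch below uses only the Python standard library and assumes the server returns an OpenAI-style JSON body with a `text` field; the `build_multipart` helper is hypothetical glue for this example, not part of this repo.

```python
import json
import urllib.request
import uuid

API_URL = "http://127.0.0.1:9000/v1/audio/transcriptions"

def build_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Encode plain form fields plus one file as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{file_field}"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcribe(path, language="en", api_key=None):
    """POST an audio file to the transcription endpoint and return the text."""
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"model": "whisper-base-onnx", "language": language, "response_format": "json"},
        "file", path.rsplit("/", 1)[-1], audio,
    )
    req = urllib.request.Request(API_URL, data=body, method="POST")
    req.add_header("Content-Type", content_type)
    if api_key:  # only needed when STT_API_KEY is set on the server
        req.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

Usage: `print(transcribe("/path/to/audio.wav"))`.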

If you set `STT_API_KEY`, send an auth header with each request:

```
Authorization: Bearer <your-key>
```

## 4) Build and push image

```bash
docker build -t registry.lan/openai-whisper-stt:latest .
docker push registry.lan/openai-whisper-stt:latest
```

## 5) Deploy to Swarm on a specific node

```bash
cp .env.example .env
# edit STT_NODE_HOSTNAME to the target node
docker stack deploy -c stack.yml whisper-stt
```

The service is pinned to that node by a placement constraint:

```yaml
placement:
  constraints:
    - node.hostname == ${STT_NODE_HOSTNAME}
```

The stack uses named volumes for model persistence and backups:

```yaml
volumes:
  - whisper-models:/models
  - rkllm-root:/opt/rkllm-root
```

Seed those volumes on the target node before deploying (same copy/download steps as for compose).

## API fields

`POST /v1/audio/transcriptions` form fields:

- `file` (required)
- `model` (default `whisper-base-onnx`)
- `language` (`en` or `zh`, default `en`)
- `response_format` (`json`, `text`, or `verbose_json`)
`POST /v1/vision/understand` form fields:
|
|
|
|
- `file` (required image)
|
|
- `prompt` (default `Describe this image in English.`)
|
|
- `model` (default `qwen3-vl-2b-rkllm`)
|
|
|
|
`POST /v1/chat/completions` accepts OpenAI-style content with `image_url`:
|
|
|
|
```json
|
|
{
|
|
"model": "qwen3-vl-2b-rkllm",
|
|
"messages": [
|
|
{
|
|
"role": "user",
|
|
"content": [
|
|
{"type": "text", "text": "Describe this image"},
|
|
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
|
|
]
|
|
}
|
|
]
|
|
}
|
|
```
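Building that request body by hand is mostly base64 plumbing. A minimal Python sketch (the `image_chat_payload` helper is hypothetical, not part of this repo):

```python
import base64

def image_chat_payload(prompt, image_bytes, mime="image/jpeg",
                       model="qwen3-vl-2b-rkllm"):
    """Wrap raw image bytes and a prompt in an OpenAI-style chat payload."""
    # image_url carries the image inline as a base64 data URL
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode('ascii')}"
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }
```

POST the returned dict as the JSON body of `/v1/chat/completions`.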

Example call:

```bash
curl http://127.0.0.1:9000/v1/vision/understand \
  -F file=@demo.jpg \
  -F prompt="Describe this image in English." \
  -F model=qwen3-vl-2b-rkllm
```