# RK Whisper + VLM API
OpenAI-compatible API server for:
- Whisper-style speech-to-text
- Vision understanding through the RKLLM multimodal demo (Qwen3-VL)
This service exposes:
- `GET /health`
- `POST /v1/audio/transcriptions` (Whisper-style multipart API)
- `POST /v1/vision/understand` (multipart image + prompt)
- `POST /v1/chat/completions` (OpenAI-style JSON with image_url)
Request and response shapes are compatible with clients built for the OpenAI Whisper and Chat Completions APIs.
## Repo Layout
- `app/server.py` - FastAPI app
- `Dockerfile` - container image
- `docker-compose.yml` - local run
- `stack.yml` - Docker Swarm deploy with node placement
- `app/download_models.py` - downloads Whisper assets into a target directory/volume
## 1) Initialize model volumes
```bash
cp .env.example .env
docker compose --profile init run --rm whisper-models-init
```
This seeds the named Docker volume `whisper-models` with:
- `whisper_encoder_base_20s.onnx`
- `whisper_decoder_base_20s.onnx`
- `mel_80_filters.txt`
- `vocab_en.txt`
- `vocab_zh.txt`
## 2) Run with docker compose
```bash
docker compose up --build -d
curl http://127.0.0.1:9000/health
```
By default, compose runs STT only. To enable the VLM endpoints locally, set the flag in your environment (for example in `.env`) and restart the stack:
```bash
VLM_ENABLED=true
```
Then copy RKLLM assets into the `rkllm-root` volume (one-time):
```bash
docker volume create rk-whisper-stt-api_rkllm-root
docker run --rm \
-v rk-whisper-stt-api_rkllm-root:/dst \
-v /home/ubuntu/rkllm-demo:/src:ro \
alpine:3.20 \
sh -c 'cp -r /src/models /dst/ && mkdir -p /dst/quickstart && cp -r /src/quickstart/demo_Linux_aarch64 /dst/quickstart/'
```
## 3) Test transcription
```bash
curl http://127.0.0.1:9000/v1/audio/transcriptions \
-F file=@/path/to/audio.wav \
-F model=whisper-base-onnx \
-F language=en \
-F response_format=json
```
If you set `STT_API_KEY`, send the key as a bearer token on every request:
```bash
curl http://127.0.0.1:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer <your-key>" \
  -F file=@/path/to/audio.wav
```
## 4) Build and push image
```bash
docker build -t registry.lan/openai-whisper-stt:latest .
docker push registry.lan/openai-whisper-stt:latest
```
## 5) Deploy to Swarm on a specific node
```bash
cp .env.example .env
# edit STT_NODE_HOSTNAME to the target node
docker stack deploy -c stack.yml whisper-stt
```
The service is pinned by:
```yaml
placement:
constraints:
- node.hostname == ${STT_NODE_HOSTNAME}
```
The stack uses named volumes for model persistence and backups:
```yaml
whisper-models:/models
rkllm-root:/opt/rkllm-root
```
Seed those volumes on the target node before deploying (same copy/download steps as compose).
## API fields
`POST /v1/audio/transcriptions` form fields:
- `file` (required)
- `model` (default `whisper-base-onnx`)
- `language` (`en` or `zh`, default `en`)
- `response_format` (`json`, `text`, or `verbose_json`)
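Response bodies are expected to mirror the OpenAI transcription API (the exact field set is an assumption based on the compatibility claim above): `text` returns the bare transcript as plain text, `json` returns a minimal object, and `verbose_json` adds metadata such as detected language and duration. A sketch of the `json` shape:
```json
{
  "text": "transcribed speech goes here"
}
```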
`POST /v1/vision/understand` form fields:
- `file` (required image)
- `prompt` (default `Describe this image in English.`)
- `model` (default `qwen3-vl-2b-rkllm`)
`POST /v1/chat/completions` accepts OpenAI-style content with `image_url`:
```json
{
"model": "qwen3-vl-2b-rkllm",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}
]
}
```
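For the `image_url` variant, the image is embedded as a base64 `data:` URL. A minimal Python sketch for assembling the request body shown above (`to_data_url` and `build_chat_payload` are illustrative helpers, not part of this repo; sending the payload with an HTTP client is left to the caller):

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    """Encode a local image file as a data: URL for the image_url field."""
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"


def build_chat_payload(image_path: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat/completions request body."""
    return {
        "model": "qwen3-vl-2b-rkllm",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": to_data_url(image_path)}},
                ],
            }
        ],
    }
```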
Example call:
```bash
curl http://127.0.0.1:9000/v1/vision/understand \
-F file=@demo.jpg \
-F prompt="Describe this image in English." \
-F model=qwen3-vl-2b-rkllm
```