# RK Whisper + VLM API

OpenAI-compatible API server for:

- Whisper-style speech-to-text
- Vision understanding through the RKLLM multimodal demo (Qwen3-VL)

This service exposes:

- `GET /health`
- `POST /v1/audio/transcriptions` (Whisper-style multipart API)
- `POST /v1/vision/understand` (multipart image + prompt)
- `POST /v1/chat/completions` (OpenAI-style JSON with `image_url`)

The endpoint shapes are compatible with clients that call the OpenAI Whisper and Chat Completions APIs.

## Repo Layout

- `app/server.py` - FastAPI app
- `Dockerfile` - container image
- `docker-compose.yml` - local run
- `stack.yml` - Docker Swarm deploy with node placement
- `app/download_models.py` - downloads Whisper assets into a target directory/volume

## 1) Initialize model volumes

```bash
cp .env.example .env
docker compose --profile init run --rm whisper-models-init
```

This seeds the named Docker volume `whisper-models` with:

- `whisper_encoder_base_20s.onnx`
- `whisper_decoder_base_20s.onnx`
- `mel_80_filters.txt`
- `vocab_en.txt`
- `vocab_zh.txt`

## 2) Run with docker compose

```bash
docker compose up --build -d
curl http://127.0.0.1:9000/health
```

By default, compose runs STT only. To enable VLM locally, set in `.env`:

```bash
VLM_ENABLED=true
```

Then copy RKLLM assets into the `rkllm-root` volume (one-time):

```bash
docker volume create rk-whisper-stt-api_rkllm-root

docker run --rm \
  -v rk-whisper-stt-api_rkllm-root:/dst \
  -v /home/ubuntu/rkllm-demo:/src:ro \
  alpine:3.20 \
  sh -c 'cp -r /src/models /dst/ && mkdir -p /dst/quickstart && cp -r /src/quickstart/demo_Linux_aarch64 /dst/quickstart/'
```

## 3) Test transcription

```bash
curl http://127.0.0.1:9000/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=whisper-base-onnx \
  -F language=en \
  -F response_format=json
```
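The same request can be made programmatically. The sketch below uses only the Python standard library and assumes the server returns an OpenAI-style JSON body with a `text` field; the `build_multipart` helper is hypothetical glue for this example, not part of this repo.

```python
import json
import urllib.request
import uuid

API_URL = "http://127.0.0.1:9000/v1/audio/transcriptions"

def build_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Encode plain form fields plus one file as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{file_field}"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcribe(path, language="en", api_key=None):
    """POST an audio file to the transcription endpoint and return the text."""
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"model": "whisper-base-onnx", "language": language, "response_format": "json"},
        "file", path.rsplit("/", 1)[-1], audio,
    )
    req = urllib.request.Request(API_URL, data=body, method="POST")
    req.add_header("Content-Type", content_type)
    if api_key:  # only needed when STT_API_KEY is set on the server
        req.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

Usage: `print(transcribe("/path/to/audio.wav"))`.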

If you set `STT_API_KEY`, send an auth header with each request:

```
Authorization: Bearer <your-key>
```

## 4) Build and push image

```bash
docker build -t registry.lan/openai-whisper-stt:latest .
docker push registry.lan/openai-whisper-stt:latest
```

## 5) Deploy to Swarm on a specific node

```bash
cp .env.example .env
# edit STT_NODE_HOSTNAME to the target node
docker stack deploy -c stack.yml whisper-stt
```

The service is pinned to that node by a placement constraint:

```yaml
placement:
  constraints:
    - node.hostname == ${STT_NODE_HOSTNAME}
```

The stack uses named volumes for model persistence and backups:

```yaml
volumes:
  - whisper-models:/models
  - rkllm-root:/opt/rkllm-root
```

Seed those volumes on the target node before deploying (same copy/download steps as for compose).

## API fields

`POST /v1/audio/transcriptions` form fields:

- `file` (required)
- `model` (default `whisper-base-onnx`)
- `language` (`en` or `zh`, default `en`)
- `response_format` (`json`, `text`, or `verbose_json`)
`POST /v1/vision/understand` form fields:
|
|
|
|
- `file` (required image)
|
|
- `prompt` (default `Describe this image in English.`)
|
|
- `model` (default `qwen3-vl-2b-rkllm`)
|
|
|
|
`POST /v1/chat/completions` accepts OpenAI-style content with `image_url`:
|
|
|
|
```json
|
|
{
|
|
"model": "qwen3-vl-2b-rkllm",
|
|
"messages": [
|
|
{
|
|
"role": "user",
|
|
"content": [
|
|
{"type": "text", "text": "Describe this image"},
|
|
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
|
|
]
|
|
}
|
|
]
|
|
}
|
|
```
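Building that request body by hand is mostly base64 plumbing. A minimal Python sketch (the `image_chat_payload` helper is hypothetical, not part of this repo):

```python
import base64

def image_chat_payload(prompt, image_bytes, mime="image/jpeg",
                       model="qwen3-vl-2b-rkllm"):
    """Wrap raw image bytes and a prompt in an OpenAI-style chat payload."""
    # image_url carries the image inline as a base64 data URL
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode('ascii')}"
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }
```

POST the returned dict as the JSON body of `/v1/chat/completions`.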

Example call:

```bash
curl http://127.0.0.1:9000/v1/vision/understand \
  -F file=@demo.jpg \
  -F prompt="Describe this image in English." \
  -F model=qwen3-vl-2b-rkllm
```