# API Integration Examples
OpenAI-compatible protocol; existing clients connect with zero code changes.

Once the AI engine is running, all three services expose OpenAI-compatible APIs. Point `base_url` at the matching port; any string works as `api_key`.
## Service Overview
| Service | Port | Endpoint |
|---|---|---|
| VLM | 8080 | POST /v1/chat/completions |
| Embedding | 8081 | POST /v1/embeddings |
| STT | 8082 | POST /inference |
| Health | all | GET /health |
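Each service also answers `GET /health` on its own port (see the table above). A minimal readiness probe, sketched with `requests` (a refused connection or any non-200 status counts as down):

```python
import requests

SERVICES = {"VLM": 8080, "Embedding": 8081, "STT": 8082}

def is_healthy(port, timeout=2.0):
    """True if GET /health on the given port answers 200."""
    try:
        r = requests.get(f"http://127.0.0.1:{port}/health", timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

for name, port in SERVICES.items():
    print(f"{name}: {'up' if is_healthy(port) else 'down'}")
```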
## 1. Image and Text Analysis (VLM)

💡 Gemma 4 is a thinking model: it writes out its reasoning before emitting the final content, so set `max_tokens` to at least 256.
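Because reasoning consumes output tokens before the answer appears, a too-small `max_tokens` can cut the reply off mid-thought. The standard OpenAI `finish_reason` field reveals this; a sketch (the stub below only imitates the response shape):

```python
from types import SimpleNamespace

def reply_truncated(resp):
    """True if generation stopped because max_tokens was exhausted."""
    return resp.choices[0].finish_reason == "length"

# Stub imitating the shape of a chat completion response
stub = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(reply_truncated(stub))  # True: raise max_tokens and retry
```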
### Python (openai SDK)

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

# Text only
resp = client.chat.completions.create(
    model="gemma-4-e4b",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=256,
)
```
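The base64 data-URL used for image input can be packaged with a small helper (a sketch; `image_part` is not part of the SDK, and the MIME type is guessed from the filename):

```python
import base64
import mimetypes

def image_part(path):
    """Build an OpenAI-style image_url content part from a local file."""
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"}}
```

The returned dict can then sit next to a `{"type": "text", ...}` part in the `content` list of a user message.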
```python
# With image (same client as above)
import base64

img_b64 = base64.b64encode(open("photo.jpg", "rb").read()).decode()
resp = client.chat.completions.create(
    model="gemma-4-e4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this photo"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
    max_tokens=256,
)
```

### curl
```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e4b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256
  }'
```

### JavaScript
```javascript
const resp = await fetch("http://127.0.0.1:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma-4-e4b",
    messages: [{ role: "user", content: "Hello" }],
    max_tokens: 256,
  }),
});
const data = await resp.json();
```

## 2. Text Vectors (Embedding)
### Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8081/v1", api_key="not-needed")

# Single input
resp = client.embeddings.create(model="bge-base-zh", input="Sample text")
vec = resp.data[0].embedding  # list[float], len == 768
```
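Embedding vectors are usually compared with cosine similarity; a minimal pure-Python sketch (no numpy dependency assumed):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```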
# Batch
texts = ["Text 1", "Text 2", "Text 3"]
resp = client.embeddings.create(model="bge-base-zh", input=texts)
vectors = [d.embedding for d in resp.data] # 3 x 768 curl
```bash
curl http://127.0.0.1:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-base-zh","input":"Sample text"}'
```

## 3. Speech Recognition (STT)
ℹ️ The whisper.cpp server uses its native multipart interface, not the OpenAI Audio API.
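whisper.cpp reads 16-bit PCM WAV and works best with 16 kHz mono input, so it can pay off to sanity-check a file before uploading. A sketch using the stdlib `wave` module (`wav_ok` is a hypothetical helper; the 16 kHz mono recommendation comes from the whisper.cpp docs):

```python
import wave

def wav_ok(path, rate=16000, channels=1):
    """True if the WAV file is 16-bit PCM at the given rate and channel count."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getnchannels() == channels
                and w.getsampwidth() == 2)  # 2 bytes per sample = 16-bit
```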
### curl

```bash
curl http://127.0.0.1:8082/inference \
  -F "file=@audio.wav" \
  -F "language=zh" \
  -F "response_format=json"
# Response: {"text": "transcribed text..."}
```

### Python (requests)
```python
import requests

with open("audio.wav", "rb") as f:
    r = requests.post(
        "http://127.0.0.1:8082/inference",
        files={"file": f},
        data={"language": "zh", "response_format": "json"},
    )
print(r.json())
```

## Limitations
- Bind address: 127.0.0.1 only; not reachable from other machines
- Authentication: none (same-host access is treated as trusted)
- Concurrency: 1 by default (llama-server has a single slot; additional clients queue)
- Context: VLM has no limit by default; Embedding is capped at 512