LLM & Image API
Chat, reasoning, and image generation through a single OpenAI-compatible API. Claude, GPT-5, Gemini, Grok, DeepSeek, Nano Banana, and more — one key, one balance.
Overview
GPUniq provides a single API surface for 90+ language models and image generators across Anthropic, OpenAI, Google, xAI, and DeepSeek. One API key, one balance, one usage dashboard — chat completions, reasoning, and text-to-image all in the same place.
You can use GPUniq LLMs two ways:
- Native GPUniq API (/v1/llm/*): wrapped responses, persistent chat sessions, terminal-command generator, SDK helpers.
- OpenAI-compatible API (/v1/openai/*): drop-in replacement for api.openai.com/v1. Works with Claude Code, Cursor, Continue.dev, Aider, LiteLLM, and the official OpenAI Python/JS SDKs without code changes.
Available Models
Chat & Reasoning
| Provider | Models | Best for |
|---|---|---|
| Anthropic | Claude Opus 4.7 / 4.6 / 4.5, Sonnet 4.6 / 4.5, Haiku 4.5 | General reasoning, coding, agents |
| OpenAI | GPT-5.5, GPT-5.2 Pro / Codex, GPT-5, o3, o3-mini, GPT-4o, GPT-4.1 | Reasoning, structured output, vision |
| Google | Gemini 3 Pro / Flash, Gemini 2.5 | Long context, fast batch work |
| xAI | Grok 4, Grok 4.1 Thinking, Grok 4 Fast | Real-time knowledge, low latency |
| DeepSeek | V4 Pro / V4 Flash, V3.2 / V3.2 Thinking, V3.1 / Terminus, R1 / R1 (May 2025), Reasoner, Chat, OCR | Cost-efficient reasoning, OCR, conversational |
| MiniMax | M2.7 / M2.5 / M2.1 / M2 | Long-context Chinese & multilingual, balanced cost |
DeepSeek pricing (USD per 1M tokens, already discounted −20%)
| Slug | Input | Output | Category | Notes |
|---|---|---|---|---|
| deepseek-v4-pro | $2.40 | $4.00 | flagship | V4 family flagship |
| deepseek-v4-flash | $0.18 | $0.30 | fast | V4 family fast tier |
| deepseek-v3.2 | $2.16 | $3.24 | flagship | Latest flagship general model |
| deepseek-v3.2-thinking | $0.30 | $0.45 | reasoning | Reasoning-tuned V3.2 (very cheap) |
| deepseek-v3.1 | $4.32 | $12.96 | flagship | Previous flagship |
| deepseek-v3.1-terminus | $0.15 | $0.30 | balanced | Updated V3.1, very cheap |
| deepseek-v3 | $2.16 | $8.64 | balanced | Original V3 (Dec 2024) |
| deepseek-r1 | $4.32 | $17.28 | reasoning | First reasoning model |
| deepseek-r1-0528 | $0.59 | $1.81 | reasoning | Updated R1, May 2025 |
| deepseek-reasoner | $0.30 | $0.45 | reasoning | Reasoning-focused alias |
| deepseek-chat | $0.29 | $1.17 | balanced | Conversational alias |
| deepseek-ocr | $0.23 | $0.23 | fast | OCR model |
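The per-1M-token prices translate into a per-request estimate with simple arithmetic. The sketch below is illustrative only: the `PRICES` dictionary and the `estimate_cost_usd` helper are hypothetical names, with a few rows copied from the table above. The amount actually billed is the one reported in the cost_usd field of the response.

```python
# Prices are USD per 1M tokens (input, output), copied from the DeepSeek table.
PRICES = {
    "deepseek-v4-flash": (0.18, 0.30),
    "deepseek-v3.2": (2.16, 3.24),
    "deepseek-r1-0528": (0.59, 1.81),
}


def estimate_cost_usd(slug: str, input_tokens: int, output_tokens: int) -> float:
    """Rough pre-flight estimate; not a substitute for the billed cost_usd."""
    input_price, output_price = PRICES[slug]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# 12,000 prompt tokens and 3,000 completion tokens on the fast tier.
print(f"{estimate_cost_usd('deepseek-v4-flash', 12_000, 3_000):.6f}")
```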
MiniMax pricing (USD per 1M tokens)
| Slug | Input | Output | Category | Public discount |
|---|---|---|---|---|
| MiniMax-M2.7 | $0.33 | $1.32 | balanced | −10% off API |
| MiniMax-M2.5 | $0.33 | $1.32 | balanced | −10% off API |
| MiniMax-M2.1 | $0.297 | $1.188 | balanced | −20% off API |
| MiniMax-M2 | $2.079 | $8.316 | flagship | −20% off API |
Image Generation
Image models are billed per returned image, not per token.
| Model | Slug | Price / image | Notes |
|---|---|---|---|
| Nano Banana | nano-banana | $0.0312 | Fast text-to-image & image-to-image, 1K |
| Nano Banana 2 | nano-banana-2 | $0.0500 | Quality-value generation up to 2K |
| Nano Banana Pro | nano-banana-pro | $0.1072 | Higher quality, ~1K resolution |
| Nano Banana Pro 4K | nano-banana-pro-4k | $0.192 | 4K resolution |
| Grok 4 Image | grok-4-image | $0.0352 | xAI image generator |
| GPT Image 2 | gpt-image-2 | $0.0464 | OpenAI image, default 1K |
| GPT Image 1.5 | gpt-image-1-5 | $0.020 | OpenAI image (cheaper tier) |
| GPT-4o Image | gpt-4o-image | $0.040 | OpenAI 4o image |
| FLUX.2 Pro | flux-2-pro | $0.060 | Black Forest Labs FLUX.2 Pro 1K |
| FLUX.2 Flex | flux-2-flex | $0.180 | Premium quality 1K |
| Flux Kontext Pro | flux-kontext-pro | $0.080 | Text-to-image & edit |
| Flux Kontext Max | flux-kontext-max | $0.160 | Premium edit / generation |
| Seedream 4 | seedream-4 | $0.050 | ByteDance Seedream 4 |
| Seedream 4.5 | seedream-4-5 | $0.040 | ByteDance Seedream 4.5 |
| Seedream 5.0 Lite | seedream-5-0-lite | $0.035 | ByteDance Seedream 5.0 Lite |
| Z-Image | z-image | $0.020 | Alibaba Z-Image |
The synchronous POST /v1/llm/images/generations (and its OpenAI-compat
twin POST /v1/openai/images/generations) holds the connection open
for the full 5-minute upstream budget, which is plenty for every model
in the catalog. If you sit behind a CDN with a strict idle-read limit
(Cloudflare's free tier caps responses at ~100 s) and call from a
browser, prefer the job-based API below — it returns a job_id in
under a second and you poll GET /v1/llm/images/jobs/{job_id} every
2-3 s until completion.
Job-based image generation (recommended)
POST /v1/llm/images/jobs returns a job_id in under a second, and you
poll GET /v1/llm/images/jobs/{job_id} every 2-3 seconds until the
status is terminal. You are charged only when the completion poll
returns — a timed-out or failed job costs nothing. Server-side, polls
that arrive within 2 seconds of each other are coalesced via Redis, so
hammering the endpoint will not be billed as repeated upstream calls.
import time, requests
BASE = "https://api.gpuniq.com/v1/llm"
HEADERS = {"X-API-Key": "gpuniq_your_key"}
# 1. Kickoff
start = requests.post(
    f"{BASE}/images/jobs",
    headers=HEADERS,
    json={"model": "nano-banana-pro", "prompt": "a cozy cabin at sunrise", "n": 1},
).json()
job_id = start["data"]["job_id"]
# 2. Poll: the 5-minute budget covers the slowest Pro / 4K runs
deadline = time.time() + 300
while time.time() < deadline:
    time.sleep(2.5)
    r = requests.get(f"{BASE}/images/jobs/{job_id}", headers=HEADERS).json()
    d = r["data"]
    if d["status"] == "completed":
        image_b64 = d["image"]["b64_json"]
        print(f"Cost: ${d['cost_usd']}, balance: ${d['balance_usd']}")
        break
    if d["status"] == "failed":
        print("failed:", d.get("error"))
        break
Only Nano Banana slugs are accepted on this surface. n must be 1 —
issue separate jobs in parallel for batches.
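Since each job returns a single image, a batch amounts to several independent jobs running side by side. The following is a minimal sketch of that fan-out using a thread pool. `run_batch` and `fake_run_job` are hypothetical names; in real use `run_job` would wrap the kickoff-and-poll routine, and the stand-in here exists only so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_batch(
    prompts: List[str],
    run_job: Callable[[str], dict],
    max_workers: int = 4,
) -> List[dict]:
    """Run one single-image job per prompt in parallel, preserving prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, prompts))


def fake_run_job(prompt: str) -> dict:
    # Stand-in for "POST /images/jobs, then poll until terminal".
    return {"status": "completed", "prompt": prompt}


results = run_batch(["a cabin at sunrise", "a cabin at dusk"], fake_run_job)
print([r["status"] for r in results])
```

Keeping `max_workers` small is a conservative choice: every worker polls on its own schedule, and those requests presumably count toward the per-key rate limit.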
Generating an image inside a chat session
When you want the image to appear as a turn in an existing chat (so the
prompt and result both land in the chat history), POST to
/v1/llm/chats/{chat_id}/messages with an image model. The response
returns immediately with type: "image_pending" plus a job_id and the
dialogue_id of a placeholder row that already lives in the chat
history. Poll GET /v1/llm/chats/{chat_id}/image-jobs/{job_id} until
the status is completed (placeholder is rewritten with the image and
the balance is debited) or failed (placeholder is marked, nothing
charged). The polling endpoint 404s once the job is terminal — the
final dialogue is the source of truth from then on.
import time, requests
BASE = "https://api.gpuniq.com/v1/llm"
HEADERS = {"X-API-Key": "gpuniq_your_key"}
chat_id = 42  # existing chat created via POST /v1/llm/chats
# 1. Kickoff (POST /chats/{id}/messages with an image model)
start = requests.post(
    f"{BASE}/chats/{chat_id}/messages",
    headers=HEADERS,
    json={"model": "nano-banana-pro", "message": "a cozy cabin at sunrise"},
).json()
job_id = start["data"]["job_id"]
dialogue_id = start["data"]["dialogue_id"]
# 2. Poll: same 5-minute budget as the standalone /images/jobs flow
deadline = time.time() + 300
while time.time() < deadline:
    time.sleep(2.5)
    r = requests.get(
        f"{BASE}/chats/{chat_id}/image-jobs/{job_id}", headers=HEADERS,
    ).json()
    d = r["data"]
    if d["status"] == "completed":
        image_b64 = d["image"]["b64_json"]
        print(f"Cost: ${d['cost_usd']}, balance: ${d['balance_usd']}")
        break
    if d["status"] == "failed":
        print("failed:", d.get("error"))
        break
Use this surface when the image should be part of a multi-turn
conversation. Use the standalone /images/jobs surface when you don't
need persistence — it has the same job semantics without creating a
chat row.
Video Generation
Video models are billed per delivered video, not per token. Every
generation is asynchronous — POST to /v1/llm/videos/jobs to kick
off a job, then poll GET /v1/llm/videos/jobs/{job_id} until the
status is terminal. You are charged only when the completion poll
returns a video.url — a failed or timed-out job costs nothing.
| Family | Slug | Headline price / video | Notes |
|---|---|---|---|
| OpenAI Sora 2 | sora-2-video | $0.060 | Sora 2, default 10s |
| OpenAI Sora 2 Pro | sora-2-pro-video | $1.000 | Premium quality, 10s |
| Sora 2 Official | sora-2-official | $0.480 | 8s, official API |
| Sora 2 Pro Official | sora-2-pro-official | $0.560 | 8s 1080p, official API |
| Google Veo 3.1 Lite | veo-3-1-lite | $0.100 | 720p / 1080p |
| Google Veo 3.1 Fast | veo-3-1-fast | $0.200 | Balanced 720p / 1080p |
| Google Veo 3.1 Quality | veo-3-1-quality | $1.200 | Flagship Google video |
| Kling 2.1 Pro | kling-2-1 | $0.405 | Standard / Pro / Master tiers, 5s or 10s, i2v |
| Kling 2.5 Turbo Pro | kling-2-5-turbo-pro | $0.315 | 5s or 10s, t2v / i2v |
| Kling 2.6 | kling-2-6 | $0.315 | Optional audio, 5s or 10s, t2v / i2v |
| Kling 3.0 | kling-3-0 | $0.504 | 720p / 1080p / 4K, audio, multi-shot to 15s |
| Kling O3 (Video) | kling-o3-video | $0.150 | Premium audio video, 5s |
| Kling 2.6 Motion Control | kling-2-6-motion-control | $0.504 | 720p / 1080p video-to-video |
| Kling 3.0 Motion Control | kling-3-0-motion-control | $0.756 | 720p / 1080p video-to-video |
| Kling AI Avatar Pro | kling-avatar-pro | $1.035 | 1080p lip-sync, up to 15s |
| Kling AI Avatar Standard | kling-avatar-standard | $0.506 | 720p lip-sync, up to 15s |
| Hailuo 02 | hailuo-02 | $0.200 | 768p, 6s default |
| Hailuo 2.3 | hailuo-2-3 | $0.350 | 768p 6s |
| Seedance 1.0 Pro | seedance-1-0-pro | $0.210 | ByteDance 720p 5s |
| Seedance 1.5 Pro | seedance-1-5-pro | $0.160 | ByteDance 720p 5s |
| Seedance 2 | seedance-2 | $0.200 | ByteDance 720p 5s |
| Alibaba Wan 2.2 Fast | wan-2-2-fast | $0.120 | 720p fast tier |
| Alibaba Wan 2.5 | wan-2-5 | $0.600 | 720p 5s |
| Alibaba Wan 2.6 | wan-2-6 | $0.800 | 720p 5s flagship |
| Wan Animate | wan-animate | $0.150 | 720p animation |
| Happy Horse | happy-horse | $0.160 | 720p |
| Grok Imagine Video | grok-imagine-video | $0.300 | xAI video, 6s |
| Runway Gen-4.5 | runway-gen-4-5 | $0.750 | Runway flagship 5s |
Kling SKUs are billed at −10% off the official public price. The
headline above is the cheapest default configuration (1080p / no
audio / 5s / Pro tier). Audio, longer duration, 4K, and Master tier
scale the price linearly off the underlying reference rate × 0.9 — the
exact cost is returned in the cost_usd field of the completion
response. A 10% margin floor against the upstream supplier
guarantees we never bill below source cost, so on a provider
fallback the price may rise by 1-3%.
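As a rough illustration of how the headline price scales, the sketch below applies two rules taken from this page: price grows with duration, and audio doubles it on Kling 2.6 / 3.0. The exact formula is not published here, so treating the two factors as multiplicative and proportional to a 5-second baseline is an assumption, and `estimate_kling_cost` is a hypothetical helper. Resolution and tier multipliers are left out because their values are not stated. The authoritative figure is always the cost_usd field.

```python
def estimate_kling_cost(headline_usd: float, duration_s: int = 5, audio: bool = False) -> float:
    """Scale a 5-second, no-audio headline price by duration and audio.

    Assumptions (not confirmed by the pricing text): cost is proportional to
    duration relative to the 5 s default, and audio doubles the result.
    """
    cost = headline_usd * (duration_s / 5)
    if audio:
        cost *= 2
    return round(cost, 4)


# Kling 2.6 headline price, stretched to 10 s with audio enabled.
print(estimate_kling_cost(0.315, duration_s=10, audio=True))
```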
Job-based video generation
Same kickoff-then-poll shape as the image-jobs API. The catalog covers
text-to-video (t2v), image-to-video (i2v, pass image_url), and
video-to-video / motion-control (v2v, pass video_url + image_url
for the conditioning frame). Avatar SKUs accept an audio reference URL
in the prompt body — see the model-specific docs for the schema.
import time, requests
BASE = "https://api.gpuniq.com/v1/llm"
HEADERS = {"X-API-Key": "gpuniq_your_key"}
# 1. Kickoff
start = requests.post(
    f"{BASE}/videos/jobs",
    headers=HEADERS,
    json={
        "model": "kling-2-6",
        "prompt": "A small black cat slowly turns toward the camera at golden hour",
        "duration": 5,
        "audio": False,  # opt-in, doubles price on Kling 2.6 / 3.0
        "resolution": "1080p",  # 720p | 1080p | 4k (where supported)
    },
).json()
job_id = start["data"]["job_id"]
print(f"job: {job_id}, est cost: ${start['data']['estimated_cost_usd']}")
# 2. Poll: video models deliver in 30-90s; budget 5 minutes for the slowest variants
deadline = time.time() + 300
while time.time() < deadline:
    time.sleep(3)
    r = requests.get(f"{BASE}/videos/jobs/{job_id}", headers=HEADERS).json()
    d = r["data"]
    if d["status"] == "completed":
        print(f"video: {d['video']['url']}")
        print(f"cost: ${d['cost_usd']}, balance: ${d['balance_usd']}")
        break
    if d["status"] == "failed":
        print("failed:", d.get("error"))
        break
Request body
| Field | Type | Required | Notes |
|---|---|---|---|
| model | string | yes | Slug from the table above. |
| prompt | string | yes | Up to 4000 characters. |
| duration | int | no | Seconds; valid range depends on model (5 / 10 for most Kling, 1-15 for Avatar). |
| aspect_ratio | string | no | 16:9 (default), 9:16, 1:1 where supported. |
| image_url | string | no | https URL or data URI; enables image-to-video. |
| video_url | string | no | https URL; required for motion-control v2v variants. |
| resolution | string | no | 720p (default for some SKUs), 1080p (default for Kling), 4k (Kling 3.0 only). |
| audio | bool | no | Default false. Kling 2.6 / 3.0 double the price when true. |
| mode | string | no | standard / pro (default) / master for Kling 2.1; turbo for 2.5 Turbo Pro. |
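A small client-side builder can catch the most common mistakes from the table before a request is sent. This is a sketch rather than an official helper: `build_video_request` is a hypothetical name, and recognising motion-control models by their slug suffix is an assumption based on the two slugs listed in the catalog.

```python
from typing import Any, Dict, Optional

MAX_PROMPT_CHARS = 4000  # prompt limit from the request-body table


def build_video_request(
    model: str,
    prompt: str,
    *,
    duration: Optional[int] = None,
    image_url: Optional[str] = None,
    video_url: Optional[str] = None,
    **optional: Any,
) -> Dict[str, Any]:
    """Assemble a /videos/jobs body, omitting fields that were not set."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    # Assumption: motion-control slugs are recognisable by this suffix.
    if model.endswith("motion-control") and not video_url:
        raise ValueError("motion-control models require video_url")
    body: Dict[str, Any] = {"model": model, "prompt": prompt}
    candidates = {
        "duration": duration,
        "image_url": image_url,
        "video_url": video_url,
        **optional,
    }
    # Keep explicit False values (e.g. audio=False); drop only unset fields.
    body.update({k: v for k, v in candidates.items() if v is not None})
    return body


print(build_video_request("kling-2-6", "A cat at golden hour", duration=5, audio=False))
```

The returned dictionary can be passed as the json argument of the kickoff request. Server-side validation still applies, so a body that passes these checks may be rejected for model-specific reasons.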
Response
The kickoff returns immediately with the GPUniq job id, the resolved
parameter snapshot, and the cost estimate. Internal routing is
opaque — the same job_id is valid across fallbacks, and the
user-facing price stays stable.
// POST /v1/llm/videos/jobs
{
"job_id": "vid_e93e98c7ca5e4982876b",
"status": "pending",
"model": "kling-2-6",
"estimated_cost_usd": 0.315,
"config": { "resolution": "1080p", "audio": false, "duration": 5, "task": "t2v", "mode": null }
}
// GET /v1/llm/videos/jobs/{job_id} — completed
{
"job_id": "vid_e93e98c7ca5e4982876b",
"status": "completed",
"model": "kling-2-6",
"video": { "url": "https://cdn.example.com/.../output.mp4" },
"cost_usd": 0.315,
"balance_usd": 9.17825791,
"config": { "resolution": "1080p", "audio": false, "duration": 5, "task": "t2v", "mode": null }
}
The polling endpoint transparently falls back across internal routes if the first attempt fails — your job_id and the user-facing price stay stable across fallbacks. Internal route identifiers are deliberately omitted from the public response; they live only in admin/operator logs.
Chat models are sold at 20% below vendor list price.
Fetch the live catalog at any time:
models = client.llm.models()
for model in models["models"]:
    print(model)
curl https://api.gpuniq.com/v1/llm/models/catalog
curl https://api.gpuniq.com/v1/openai/models \
-H "Authorization: Bearer gpuniq_your_key"
The default model is claude-haiku-4-5 — fast, cheap, strong at code.
Long generations & streaming
The edge proxy closes inbound connections after ~100 seconds of streaming silence. A non-streaming request asking for max_tokens > 4096 is rejected up front with HTTP 400 streaming_required, because buffered responses of that length routinely take longer than the ~100-second cap. For long replies, set "stream": true or use the job-based long-poll API.
| Your request | What to do |
|---|---|
| ≤ 4096 output tokens, fast model | Plain POST /chat/completions works. |
| > 4096 output tokens OR slow / reasoning model | Set "stream": true. |
| Client can't speak SSE | Use POST /v1/llm/chat/jobs (long-poll). |
Reasoning models (Gemini 3 Pro, DeepSeek R1, o3, Claude Opus thinking) burn tokens on hidden chain-of-thought before the visible reply, so they need extra max_tokens headroom — see the Long generations guide for the full streaming / job-based / reasoning-token recipe.
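The decision table above reduces to a few lines of client logic. The sketch below is one possible encoding; `choose_transport` and its return labels are hypothetical, and whether a model counts as "slow" is left to the caller.

```python
STREAMING_THRESHOLD = 4096  # non-streaming requests above this are rejected


def choose_transport(max_tokens: int, slow_model: bool = False, can_stream: bool = True) -> str:
    """Return 'plain', 'stream', or 'job' following the table above."""
    needs_long_path = max_tokens > STREAMING_THRESHOLD or slow_model
    if not needs_long_path:
        return "plain"  # ordinary POST /chat/completions
    # Long or slow generations: stream if the client speaks SSE, else long-poll.
    return "stream" if can_stream else "job"


print(choose_transport(1000))
print(choose_transport(8000))
print(choose_transport(8000, can_stream=False))
```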
Errors
Every failure returns a stable OpenAI error envelope with a structured code you can branch on — streaming_required, insufficient_balance, model_not_found, rate_limit_per_key, etc. See the Error reference for the complete catalog (29 codes), recovery strategies, and the native vs. OpenAI-compat envelope shapes.
{
"error": {
"message": "…human-readable description…",
"type": "invalid_request_error",
"code": "streaming_required",
"doc_url": "https://docs.gpuniq.com/llm/long-generations",
"meta": { "max_tokens": 8000, "limit": 4096 }
},
"status_code": 400,
"request_id": "…"
}
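Because the code field is stable, client code can branch on it instead of parsing the message. The sketch below maps the four codes named above to a next step. The action labels and the `next_action` helper are illustrative assumptions, not the recovery strategies from the Error reference.

```python
def next_action(payload: dict) -> str:
    """Pick a client-side reaction from the structured error code."""
    code = payload.get("error", {}).get("code")
    actions = {
        "streaming_required": "retry_with_stream",
        "rate_limit_per_key": "retry_after_backoff",
        "insufficient_balance": "top_up_balance",
        "model_not_found": "check_model_slug",
    }
    # Unknown or missing codes are surfaced rather than retried blindly.
    return actions.get(code, "surface_to_user")


sample = {
    "error": {
        "message": "non-streaming request too long",
        "type": "invalid_request_error",
        "code": "streaming_required",
    },
    "status_code": 400,
}
print(next_action(sample))
```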
OpenAI-Compatible Endpoint
Point any OpenAI-compatible tool at GPUniq by setting two environment variables:
OPENAI_API_KEY=gpuniq_your_key
OPENAI_BASE_URL=https://api.gpuniq.com/v1/openai
Every field of the OpenAI Chat Completions protocol is forwarded unchanged: tools, tool_choice, response_format, logprobs, seed, stream, stream_options, etc.
Official OpenAI SDK
from openai import OpenAI
client = OpenAI(
api_key="gpuniq_your_key",
base_url="https://api.gpuniq.com/v1/openai",
)
resp = client.chat.completions.create(
model="claude-opus-4-7",
messages=[{"role": "user", "content": "Write a binary search in Rust."}],
)
print(resp.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "gpuniq_your_key",
baseURL: "https://api.gpuniq.com/v1/openai",
});
const resp = await client.chat.completions.create({
model: "claude-opus-4-7",
messages: [{ role: "user", content: "Write a binary search in Rust." }],
});
console.log(resp.choices[0].message.content);
curl https://api.gpuniq.com/v1/openai/chat/completions \
-H "Authorization: Bearer gpuniq_your_key" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-opus-4-7",
"messages": [{"role": "user", "content": "Write a binary search in Rust."}]
}'
Streaming
Set stream: true — GPUniq returns a text/event-stream with byte-identical OpenAI SSE framing:
stream = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Explain MoE in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
Image Generation
Both API surfaces expose a /images/generations endpoint that matches
OpenAI's images.generate protocol. Pass any image slug from the
catalog above (e.g. nano-banana-pro, gpt-image-2, flux-2-pro,
seedream-4). Billing is flat per returned image — no token accounting.
Image requests route through a multi-tier reliability chain behind the
scenes: a per-model priority gateway, two cost-optimised intermediaries,
then a generic OpenAI-compatible fallback for safety. The chain is
selected automatically per slug, so SDK callers never pick a backend
themselves. If the primary fails or returns no image, the next tier is
tried within the same HTTP request — you still see one synchronous
POST /images/generations and pay for delivered images only.
Heavy generations (Pro / 4K, multi-image batches, high-quality preset) can run up to 5 minutes end-to-end; the connection is held open for that whole budget so SDKs never need to re-poll. For interactive UIs that cannot keep an HTTP connection open that long, prefer the job-based API.
import base64
from openai import OpenAI
client = OpenAI(
    api_key="gpuniq_your_key",
    base_url="https://api.gpuniq.com/v1/openai",
)
resp = client.images.generate(
    model="nano-banana-pro",
    prompt="A cozy mountain cabin at sunrise, cinematic lighting",
    n=2,
    size="1024x1024",
    response_format="b64_json",
)
# Write each returned image to disk; fewer than n may come back.
for i, img in enumerate(resp.data):
    with open(f"out_{i}.png", "wb") as f:
        f.write(base64.b64decode(img.b64_json))
import base64, requests
with open("reference.jpg", "rb") as fh:
    ref = "data:image/jpeg;base64," + base64.b64encode(fh.read()).decode()
resp = requests.post(
"https://api.gpuniq.com/v1/llm/images/generations",
headers={"X-API-Key": "gpuniq_your_key"},
json={
"model": "nano-banana-pro",
"prompt": "Redraw the cabin in watercolor style",
"n": 1,
"size": "2048x2048",
"input_images": [ref],
},
).json()
print("cost:", resp["data"]["cost_usd"])
curl -X POST https://api.gpuniq.com/v1/openai/images/generations \
-H "Authorization: Bearer gpuniq_your_key" \
-H "Content-Type: application/json" \
-d '{
"model": "grok-4-image",
"prompt": "Studio portrait of an astronaut in a pink desert",
"n": 1,
"size": "1024x1024"
}'
Parameters
model: Any image slug from the catalog: the Nano Banana family, grok-4-image, gpt-image-2, gpt-image-1-5, gpt-4o-image, flux-2-pro, flux-2-flex, flux-kontext-pro, flux-kontext-max, seedream-4, seedream-4-5, seedream-5-0-lite, or z-image.
prompt: Text description of the image you want. Up to 4000 characters.
n: Number of images to generate, from 1 to 4.
size: Output resolution hint forwarded to the upstream, e.g. 1024x1024, 2048x2048, 4096x4096. Nano Banana Pro 4K defaults to 4096.
quality: Optional upstream quality hint (e.g. standard, hd). Models that don't recognise the value silently fall back to their default.
response_format: b64_json returns inline PNG base64 (browser-renderable). url returns a short-lived upstream URL.
Re-encode every delivered image into this format on the server before returning, so the client doesn't need a Pillow / Sharp pipeline. One of:
- png (default if omitted): pass-through, lossless.
- jpeg (alias jpg): ~10× smaller payload; alpha is flattened onto white because JPEG has no transparency.
- webp: ~5× smaller at comparable quality, alpha preserved.
Quality for the lossy formats is fixed at 92 — visually indistinguishable from the source PNG. Conversion failures degrade to "return source PNG unchanged" so you always get an image, never a 502 after the upstream has done the expensive work. The MIME type of the converted bytes is echoed back in data[i].mime_type.
input_images: Optional reference photos for image-to-image / editing. Each entry is a data: URL, https:// URL, or bare base64 string. Supported by Nano Banana family, GPT Image, FLUX Kontext, Seedream and Nano Banana Pro edit slugs.
If the upstream returns fewer images than requested (content-policy rejects, partial failures, etc.), you are billed only for what was delivered.
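When a converted format is requested, the file extension should follow the echoed MIME type rather than be hard-coded, and the loop should iterate over what was delivered rather than over the requested count. The sketch below assumes each returned item is a dictionary with b64_json and mime_type keys. `decode_images` and the `EXTENSIONS` map are hypothetical names, and the sample item is fabricated for offline use.

```python
import base64

EXTENSIONS = {"image/png": ".png", "image/jpeg": ".jpg", "image/webp": ".webp"}


def decode_images(items):
    """Return (filename, bytes) pairs for each delivered image."""
    decoded = []
    for i, item in enumerate(items):
        # Fall back to .png, the documented default format.
        ext = EXTENSIONS.get(item.get("mime_type"), ".png")
        decoded.append((f"out_{i}{ext}", base64.b64decode(item["b64_json"])))
    return decoded


# Fabricated response item; real payloads carry actual image bytes.
sample_items = [
    {"b64_json": base64.b64encode(b"fake-bytes").decode(), "mime_type": "image/webp"},
]
for name, data in decode_images(sample_items):
    print(name, len(data))
```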
Claude Code
Claude Code can route through GPUniq via a LiteLLM proxy. Run LiteLLM locally as an Anthropic-compatible front-end for the GPUniq OpenAI endpoint:
# ~/litellm.yaml
model_list:
  - model_name: claude-opus-4-7
    litellm_params:
      model: openai/claude-opus-4-7
      api_base: https://api.gpuniq.com/v1/openai
      api_key: os.environ/GPUNIQ_API_KEY
export GPUNIQ_API_KEY=gpuniq_your_key
litellm --config ~/litellm.yaml --port 4000
# In another shell — point Claude Code at the proxy
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_API_KEY=sk-litellm-anything
claude
All tokens are billed against your GPUniq balance — no separate Anthropic account required.
Cursor
Settings → Models → Override OpenAI Base URL:
Base URL: https://api.gpuniq.com/v1/openai
API Key: gpuniq_your_key
Model: claude-opus-4-7 # or any slug from /v1/openai/models
Continue.dev / Aider / LiteLLM
Any tool that accepts an OPENAI_BASE_URL works the same way:
export OPENAI_API_KEY=gpuniq_your_key
export OPENAI_BASE_URL=https://api.gpuniq.com/v1/openai
aider --model claude-sonnet-4-6
The OpenAI-compat endpoint returns raw OpenAI response objects (not wrapped in GPUniq's ResponseSchema). Errors use OpenAI's {"error": {"message", "type", "code"}} envelope so SDK retry logic works unchanged.
Native GPUniq SDK
For the fullest feature set — persistent chat sessions, USD balance conversion, usage history — use the native API.
Simple Chat
response = client.llm.chat("claude-haiku-4-5", "Explain how transformers work")
print(response)
curl -X POST "https://api.gpuniq.com/v1/llm/chat/completions" \
-H "X-API-Key: gpuniq_your_key" \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Explain how transformers work"}],
"model": "claude-haiku-4-5"
}'
Chat Completion (Full)
data = client.llm.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What is gradient descent?"},
],
model="claude-sonnet-4-6",
temperature=0.7,
max_tokens=1000,
top_p=0.9,
)
print(data["content"])
print(f"Tokens used: {data['tokens_used']} cost: ${data['cost_usd']:.6f}")
Parameters
messages: List of message objects with role ("system", "user", "assistant") and content.
model: Model slug (e.g., claude-opus-4-7, gpt-5.2, gemini-3-pro). Defaults to claude-haiku-4-5.
max_tokens: Maximum tokens in the response.
temperature: Sampling temperature (0.0-2.0). Higher = more creative.
top_p: Top-p nucleus sampling parameter.
Account Balance
Chat and image requests are billed directly against your GPUniq account balance in USD — there is no separate "token pool" anymore. Each call deducts the model's blended retail rate × the tokens it actually consumed (or per-image flat rate for image models). Prepaid token packages and ruble-to-token conversions are no longer required and the corresponding endpoints have been retired.
balance = client.llm.balance()
print(f"Available: ${balance['balance_usd']:.4f} USD")
curl https://api.gpuniq.com/v1/llm/balance \
  -H "X-API-Key: gpuniq_your_key"
Top up the balance from the web dashboard → Billing (Stripe / YooKassa / crypto). The balance is shared with every other GPUniq surface — GPU rentals, volume storage, image generations — so a single deposit covers the whole platform.
Usage History
Per-request detail with prompt / completion / cached / reasoning
tokens and the USD cost charged at retail. Backed by the
/v1/llm/usage/history endpoint; pair it with
/v1/llm/usage/breakdown for daily / weekly aggregates.
history = client.llm.usage_history(limit=50, offset=0)
for log in history["logs"]:
    print(f"{log['model']}: {log['total_tokens']} tokens — ${log['cost_usd']:.6f}")
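Per-request logs are easy to roll up on the client when a custom grouping is needed. The sketch below totals tokens and cost per model, relying only on the model, total_tokens, and cost_usd fields used in the example above. `cost_by_model` is a hypothetical helper and the sample logs are made up.

```python
from collections import defaultdict


def cost_by_model(logs):
    """Sum tokens and USD cost per model slug."""
    totals = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0})
    for log in logs:
        entry = totals[log["model"]]
        entry["tokens"] += log["total_tokens"]
        entry["cost_usd"] += log["cost_usd"]
    return dict(totals)


# Made-up log entries in the shape returned under history["logs"].
sample_logs = [
    {"model": "claude-haiku-4-5", "total_tokens": 1200, "cost_usd": 0.002},
    {"model": "claude-haiku-4-5", "total_tokens": 800, "cost_usd": 0.001},
    {"model": "gpt-5.2", "total_tokens": 500, "cost_usd": 0.004},
]
for model, total in cost_by_model(sample_logs).items():
    print(f"{model}: {total['tokens']} tokens, ${total['cost_usd']:.6f}")
```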
Chat Sessions
Persistent conversations stored server-side — the model sees the full history on every call:
# Create a session
session = client.llm.create_chat_session(
model="claude-sonnet-4-6",
title="Research Assistant",
)
# Send messages within the session
reply = client.llm.send_message(
chat_id=session["id"],
message="What are the key papers on attention mechanisms?",
temperature=0.5,
)
# List all sessions
sessions = client.llm.list_chat_sessions(limit=50)
# Get a session with full message history
full = client.llm.get_chat_session(chat_id=session["id"])
# Update title
client.llm.update_chat_session(chat_id=session["id"], title="New Title")
# Delete
client.llm.delete_chat_session(chat_id=session["id"])
Generate Terminal Commands
Convert natural language to a ranked list of shell commands with danger annotations:
cmds = client.llm.generate_commands(
prompt="find all Python files larger than 1MB and sort by size",
max_commands=5,
)
for c in cmds["commands"]:
    print(f"[{c['danger']}] {c['command']} # {c['description']}")
API Key Management
API keys are created from the web dashboard (LLM API Keys) and sent as Authorization: Bearer gpuniq_... on OpenAI-compat routes, or X-API-Key: gpuniq_... on native routes.
Rate limit: 120 req/min per key, sliding window.
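A client can stay under the limit by tracking its own send times in a sliding window. The sketch below is a client-side approximation, not a description of how the server counts: `SlidingWindowLimiter` is a hypothetical class, and the injectable clock exists so the behaviour can be checked without sleeping.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Track recent request times and report how long to wait before the next one."""

    def __init__(self, max_requests=120, window_s=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock
        self._sent = deque()

    def wait_time(self) -> float:
        """Seconds until another request fits in the window (0.0 if it fits now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_s - (now - self._sent[0])

    def record(self) -> None:
        """Call right after sending a request."""
        self._sent.append(self.clock())


limiter = SlidingWindowLimiter()
delay = limiter.wait_time()
if delay:
    time.sleep(delay)
limiter.record()  # the actual request would be sent here
```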