LLM & Image API

Chat, reasoning, and image generation through a single OpenAI-compatible API. Claude, GPT-5, Gemini, Grok, DeepSeek, Nano Banana, and more — one key, one balance.

Overview

GPUniq provides a single API surface for 90+ language models and image generators from Anthropic, OpenAI, Google, xAI, DeepSeek, MiniMax, and other providers. One API key, one balance, and one usage dashboard cover chat completions, reasoning, and text-to-image alike.

You can use GPUniq LLMs two ways:

Native GPUniq API (/v1/llm/*) — wrapped responses, persistent chat sessions, terminal-command generator, SDK helpers.
OpenAI-compatible API (/v1/openai/*) — drop-in replacement for api.openai.com/v1. Works with Claude Code, Cursor, Continue.dev, Aider, LiteLLM, and the official OpenAI Python/JS SDKs without code changes.

Available Models

Chat & Reasoning

Provider	Models	Best for
Anthropic	Claude Opus 4.7 / 4.6 / 4.5, Sonnet 4.6 / 4.5, Haiku 4.5	General reasoning, coding, agents
OpenAI	GPT-5.5, GPT-5.2 Pro / Codex, GPT-5, o3, o3-mini, GPT-4o, GPT-4.1	Reasoning, structured output, vision
Google	Gemini 3 Pro / Flash, Gemini 2.5	Long context, fast batch work
xAI	Grok 4, Grok 4.1 Thinking, Grok 4 Fast	Real-time knowledge, low latency
DeepSeek	V4 Pro / V4 Flash, V3.2 / V3.2 Thinking, V3.1 / Terminus, R1 / R1 (May 2025), Reasoner, Chat, OCR	Cost-efficient reasoning, OCR, conversational
MiniMax	M2.7 / M2.5 / M2.1 / M2	Long-context Chinese & multilingual, balanced cost

DeepSeek pricing (USD per 1M tokens, already discounted −20%)

Slug	Input	Output	Category	Notes
`deepseek-v4-pro`	$2.40	$4.00	flagship	V4 family flagship
`deepseek-v4-flash`	$0.18	$0.30	fast	V4 family fast tier
`deepseek-v3.2`	$2.16	$3.24	flagship	Latest flagship general model
`deepseek-v3.2-thinking`	$0.30	$0.45	reasoning	Reasoning-tuned V3.2 (very cheap)
`deepseek-v3.1`	$4.32	$12.96	flagship	Previous flagship
`deepseek-v3.1-terminus`	$0.15	$0.30	balanced	Updated V3.1, very cheap
`deepseek-v3`	$2.16	$8.64	balanced	Original V3 (Dec 2024)
`deepseek-r1`	$4.32	$17.28	reasoning	First reasoning model
`deepseek-r1-0528`	$0.59	$1.81	reasoning	Updated R1, May 2025
`deepseek-reasoner`	$0.30	$0.45	reasoning	Reasoning-focused alias
`deepseek-chat`	$0.29	$1.17	balanced	Conversational alias
`deepseek-ocr`	$0.23	$0.23	fast	OCR model
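
As a rough illustration of how per-million-token prices turn into a per-request cost, the sketch below hard-codes two rows from the table above. The helper name `estimate_cost_usd` and the `PRICES_PER_MTOK` dictionary are illustrative and not part of any GPUniq SDK; the amount actually charged is whatever the API reports in `cost_usd`.

```python
from decimal import Decimal

# Input/output prices in USD per 1M tokens, copied from the table above.
PRICES_PER_MTOK = {
    "deepseek-v4-flash": (Decimal("0.18"), Decimal("0.30")),
    "deepseek-v4-pro": (Decimal("2.40"), Decimal("4.00")),
}

MILLION = Decimal(1_000_000)


def estimate_cost_usd(slug: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one request from its token counts."""
    input_price, output_price = PRICES_PER_MTOK[slug]
    return (input_price * input_tokens + output_price * output_tokens) / MILLION


# Example: 12,000 prompt tokens and 3,000 completion tokens on deepseek-v4-flash.
print(estimate_cost_usd("deepseek-v4-flash", 12_000, 3_000))
```

`Decimal` is used instead of `float` so that sub-cent amounts add up without binary rounding noise.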

MiniMax pricing (USD per 1M tokens)

Slug	Input	Output	Category	Public discount
`MiniMax-M2.7`	$0.33	$1.32	balanced	−10% off API
`MiniMax-M2.5`	$0.33	$1.32	balanced	−10% off API
`MiniMax-M2.1`	$0.297	$1.188	balanced	−20% off API
`MiniMax-M2`	$2.079	$8.316	flagship	−20% off API

Image Generation

Image models are billed per returned image, not per token.

Model	Slug	Price / image	Notes
Nano Banana	`nano-banana`	$0.0312	Fast text-to-image & image-to-image, 1K
Nano Banana 2	`nano-banana-2`	$0.0500	Quality-value generation up to 2K
Nano Banana Pro	`nano-banana-pro`	$0.1072	Higher quality, ~1K resolution
Nano Banana Pro 4K	`nano-banana-pro-4k`	$0.192	4K resolution
Grok 4 Image	`grok-4-image`	$0.0352	xAI image generator
GPT Image 2	`gpt-image-2`	$0.0464	OpenAI image, default 1K
GPT Image 1.5	`gpt-image-1-5`	$0.020	OpenAI image (cheaper tier)
GPT-4o Image	`gpt-4o-image`	$0.040	OpenAI 4o image
FLUX.2 Pro	`flux-2-pro`	$0.060	Black Forest Labs FLUX.2 Pro 1K
FLUX.2 Flex	`flux-2-flex`	$0.180	Premium quality 1K
Flux Kontext Pro	`flux-kontext-pro`	$0.080	Text-to-image & edit
Flux Kontext Max	`flux-kontext-max`	$0.160	Premium edit / generation
Seedream 4	`seedream-4`	$0.050	ByteDance Seedream 4
Seedream 4.5	`seedream-4-5`	$0.040	ByteDance Seedream 4.5
Seedream 5.0 Lite	`seedream-5-0-lite`	$0.035	ByteDance Seedream 5.0 Lite
Z-Image	`z-image`	$0.020	Alibaba Z-Image

The synchronous POST /v1/llm/images/generations (and its OpenAI-compat twin POST /v1/openai/images/generations) holds the connection open for the full 5-minute upstream budget, which is plenty for every model in the catalog. If you sit behind a CDN with a strict idle-read limit (Cloudflare's free tier caps responses at ~100 s) and call from a browser, prefer the job-based API below — it returns a job_id in under a second and you poll GET /v1/llm/images/jobs/{job_id} every 2-3 s until completion.

Job-based image generation (recommended)

POST /v1/llm/images/jobs returns a job_id in under a second, and you poll GET /v1/llm/images/jobs/{job_id} every 2-3 seconds until the status is terminal. You are charged only when the completion poll returns — a timed-out or failed job costs nothing. Server-side, polls that arrive within 2 seconds of each other are coalesced via Redis, so hammering the endpoint will not be billed as repeated upstream calls.

import time, requests

BASE = "https://api.gpuniq.com/v1/llm"
HEADERS = {"X-API-Key": "gpuniq_your_key"}

# 1. Kickoff
start = requests.post(
    f"{BASE}/images/jobs",
    headers=HEADERS,
    json={"model": "nano-banana-pro", "prompt": "a cozy cabin at sunrise", "n": 1},
).json()
job_id = start["data"]["job_id"]

# 2. Poll — 5-minute budget covers the slowest Pro / 4K runs
deadline = time.time() + 300
while time.time() < deadline:
    time.sleep(2.5)
    r = requests.get(f"{BASE}/images/jobs/{job_id}", headers=HEADERS).json()
    d = r["data"]
    if d["status"] == "completed":
        image_b64 = d["image"]["b64_json"]
        print(f"Cost: ${d['cost_usd']}, balance: ${d['balance_usd']}")
        break
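    if d["status"] == "failed":
        print("failed:", d.get("error"))
        break
else:
    # Loop ended without a break: the 5-minute budget ran out.
    print("timed out waiting for job", job_id)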

Only Nano Banana slugs are accepted on this surface. n must be 1 — issue separate jobs in parallel for batches.
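
The kickoff-then-poll pattern can be factored into a small reusable helper. The sketch below is one possible shape, not an SDK function: `poll_until_terminal` and its parameters are hypothetical names, and the status fetch is passed in as a callable so the loop can be exercised without network access. It assumes the job payload carries a `status` field that ends in `completed` or `failed`, as in the example above.

```python
import time
from typing import Callable, Optional


def poll_until_terminal(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.5,
    budget_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Call fetch_status() until the job completes, fails, or the budget runs out.

    Returns the final job payload, or None when the deadline passes first.
    """
    deadline = clock() + budget_s
    while clock() < deadline:
        sleep(interval_s)
        job = fetch_status()
        if job["status"] in ("completed", "failed"):
            return job
    return None


# Stand-in for requests.get(...).json()["data"]; swap in a real call in practice.
_fake_responses = iter([{"status": "pending"}, {"status": "completed"}])
result = poll_until_terminal(
    lambda: next(_fake_responses), interval_s=0, sleep=lambda seconds: None
)
print(result)
```

Returning `None` on timeout keeps the three outcomes (completed, failed, timed out) distinguishable for the caller.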

Generating an image inside a chat session

When you want the image to appear as a turn in an existing chat (so the prompt and result both land in the chat history), POST to /v1/llm/chats/{chat_id}/messages with an image model. The response returns immediately with type: "image_pending" plus a job_id and the dialogue_id of a placeholder row that already lives in the chat history. Poll GET /v1/llm/chats/{chat_id}/image-jobs/{job_id} until the status is completed (placeholder is rewritten with the image and the balance is debited) or failed (placeholder is marked, nothing charged). The polling endpoint 404s once the job is terminal — the final dialogue is the source of truth from then on.

import time, requests

BASE = "https://api.gpuniq.com/v1/llm"
HEADERS = {"X-API-Key": "gpuniq_your_key"}
chat_id = 42  # existing chat created via POST /v1/llm/chats

# 1. Kickoff (POST /chats/{id}/messages with an image model)
start = requests.post(
    f"{BASE}/chats/{chat_id}/messages",
    headers=HEADERS,
    json={"model": "nano-banana-pro", "message": "a cozy cabin at sunrise"},
).json()
job_id = start["data"]["job_id"]
dialogue_id = start["data"]["dialogue_id"]

# 2. Poll — same 5-minute budget as the standalone /images/jobs flow
deadline = time.time() + 300
while time.time() < deadline:
    time.sleep(2.5)
    r = requests.get(
        f"{BASE}/chats/{chat_id}/image-jobs/{job_id}", headers=HEADERS,
    ).json()
    d = r["data"]
    if d["status"] == "completed":
        image_b64 = d["image"]["b64_json"]
        print(f"Cost: ${d['cost_usd']}, balance: ${d['balance_usd']}")
        break
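    if d["status"] == "failed":
        print("failed:", d.get("error"))
        break
else:
    # Loop ended without a break: the 5-minute budget ran out.
    print("timed out waiting for job", job_id)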

Use this surface when the image should be part of a multi-turn conversation. Use the standalone /images/jobs surface when you don't need persistence — it has the same job semantics without creating a chat row.

Video Generation

Video models are billed per delivered video, not per token. Every generation is asynchronous — POST to /v1/llm/videos/jobs to kick off a job, then poll GET /v1/llm/videos/jobs/{job_id} until the status is terminal. You are charged only when the completion poll returns a video.url — a failed or timed-out job costs nothing.

Family	Slug	Headline price / video	Notes
OpenAI Sora 2	`sora-2-video`	$0.060	Sora 2, default 10s
OpenAI Sora 2 Pro	`sora-2-pro-video`	$1.000	Premium quality, 10s
Sora 2 Official	`sora-2-official`	$0.480	8s, official API
Sora 2 Pro Official	`sora-2-pro-official`	$0.560	8s 1080p, official API
Google Veo 3.1 Lite	`veo-3-1-lite`	$0.100	720p / 1080p
Google Veo 3.1 Fast	`veo-3-1-fast`	$0.200	Balanced 720p / 1080p
Google Veo 3.1 Quality	`veo-3-1-quality`	$1.200	Flagship Google video
Kling 2.1 Pro	`kling-2-1`	$0.405	Standard / Pro / Master tiers, 5s or 10s, i2v
Kling 2.5 Turbo Pro	`kling-2-5-turbo-pro`	$0.315	5s or 10s, t2v / i2v
Kling 2.6	`kling-2-6`	$0.315	Optional audio, 5s or 10s, t2v / i2v
Kling 3.0	`kling-3-0`	$0.504	720p / 1080p / 4K, audio, multi-shot to 15s
Kling O3 (Video)	`kling-o3-video`	$0.150	Premium audio video, 5s
Kling 2.6 Motion Control	`kling-2-6-motion-control`	$0.504	720p / 1080p video-to-video
Kling 3.0 Motion Control	`kling-3-0-motion-control`	$0.756	720p / 1080p video-to-video
Kling AI Avatar Pro	`kling-avatar-pro`	$1.035	1080p lip-sync, up to 15s
Kling AI Avatar Standard	`kling-avatar-standard`	$0.506	720p lip-sync, up to 15s
Hailuo 02	`hailuo-02`	$0.200	768p, 6s default
Hailuo 2.3	`hailuo-2-3`	$0.350	768p 6s
Seedance 1.0 Pro	`seedance-1-0-pro`	$0.210	ByteDance 720p 5s
Seedance 1.5 Pro	`seedance-1-5-pro`	$0.160	ByteDance 720p 5s
Seedance 2	`seedance-2`	$0.200	ByteDance 720p 5s
Alibaba Wan 2.2 Fast	`wan-2-2-fast`	$0.120	720p fast tier
Alibaba Wan 2.5	`wan-2-5`	$0.600	720p 5s
Alibaba Wan 2.6	`wan-2-6`	$0.800	720p 5s flagship
Wan Animate	`wan-animate`	$0.150	720p animation
Happy Horse	`happy-horse`	$0.160	720p
Grok Imagine Video	`grok-imagine-video`	$0.300	xAI video, 6s
Runway Gen-4.5	`runway-gen-4-5`	$0.750	Runway flagship 5s

Kling SKUs are billed at −10% off the official public price. The headline above is the cheapest default configuration (1080p / no audio / 5s / Pro tier). Audio, longer duration, 4K, and Master tier scale the price linearly off the underlying reference rate × 0.9 — the exact cost is returned in the cost_usd field of the completion response. A 10% margin floor against the upstream supplier guarantees we never bill below source cost, so on a provider fallback the price may rise by 1-3%.
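
To make the scaling concrete, here is a back-of-the-envelope sketch for `kling-2-6`. It assumes that "linearly" means the price is proportional to duration relative to the 5-second headline, and it applies the audio doubling documented for Kling 2.6. Resolution and tier multipliers are left out because their values are not given here. The function name is hypothetical, and the authoritative figure is always the `cost_usd` returned by the API.

```python
from decimal import Decimal

# Headline price for kling-2-6 from the table above: 1080p, no audio, 5 s.
KLING_2_6_HEADLINE_USD = Decimal("0.315")
HEADLINE_DURATION_S = 5


def rough_kling_2_6_estimate(duration_s: int, audio: bool) -> Decimal:
    """Rough estimate only: assumes price is proportional to duration and doubles with audio."""
    duration_factor = Decimal(duration_s) / HEADLINE_DURATION_S
    audio_factor = Decimal(2) if audio else Decimal(1)
    return KLING_2_6_HEADLINE_USD * duration_factor * audio_factor


print(rough_kling_2_6_estimate(5, audio=False))
print(rough_kling_2_6_estimate(10, audio=True))
```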

Job-based video generation

Same kickoff-then-poll shape as the image-jobs API. The catalog covers text-to-video (t2v), image-to-video (i2v, pass image_url), and video-to-video / motion-control (v2v, pass video_url + image_url for the conditioning frame). Avatar SKUs accept an audio reference URL in the prompt body — see the model-specific docs for the schema.

import time, requests

BASE = "https://api.gpuniq.com/v1/llm"
HEADERS = {"X-API-Key": "gpuniq_your_key"}

# 1. Kickoff
start = requests.post(
    f"{BASE}/videos/jobs",
    headers=HEADERS,
    json={
        "model": "kling-2-6",
        "prompt": "A small black cat slowly turns toward the camera at golden hour",
        "duration": 5,
        "audio": False,          # opt-in, doubles price on Kling 2.6 / 3.0
        "resolution": "1080p",   # 720p | 1080p | 4k (where supported)
    },
).json()
job_id = start["data"]["job_id"]
print(f"job: {job_id}, est cost: ${start['data']['estimated_cost_usd']}")

# 2. Poll — video models deliver in 30-90s; budget 5 minutes for the slowest variants
deadline = time.time() + 300
while time.time() < deadline:
    time.sleep(3)
    r = requests.get(f"{BASE}/videos/jobs/{job_id}", headers=HEADERS).json()
    d = r["data"]
    if d["status"] == "completed":
        print(f"video: {d['video']['url']}")
        print(f"cost: ${d['cost_usd']}, balance: ${d['balance_usd']}")
        break
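    if d["status"] == "failed":
        print("failed:", d.get("error"))
        break
else:
    # Loop ended without a break: the 5-minute budget ran out.
    print("timed out waiting for job", job_id)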

Request body

Field	Type	Required	Notes
`model`	string	yes	Slug from the table above.
`prompt`	string	yes	Up to 4000 characters.
`duration`	int	no	Seconds; valid range depends on model (5 / 10 for most Kling, 1-15 for Avatar).
`aspect_ratio`	string	no	`16:9` (default), `9:16`, `1:1` where supported.
`image_url`	string	no	https URL or data URI — enables image-to-video.
`video_url`	string	no	https URL — required for motion-control v2v variants.
`resolution`	string	no	`720p` (default for some SKUs), `1080p` (default for Kling), `4k` (Kling 3.0 only).
`audio`	bool	no	Default `false`. Kling 2.6 / 3.0 double the price when `true`.
`mode`	string	no	`standard` / `pro` (default) / `master` for Kling 2.1; `turbo` for 2.5 Turbo Pro.
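
A client-side pre-flight check can catch obviously invalid bodies before a job is submitted. The sketch below encodes only the constraints stated in the table above; `check_video_request` is a hypothetical helper, and the server remains the final authority on what each model accepts.

```python
from typing import List

ALLOWED_RESOLUTIONS = {"720p", "1080p", "4k"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1"}
MAX_PROMPT_CHARS = 4000


def check_video_request(body: dict) -> List[str]:
    """Return a list of problems found in a /videos/jobs request body (empty if none)."""
    problems = []
    for field in ("model", "prompt"):
        if not isinstance(body.get(field), str) or not body[field]:
            problems.append(f"{field} is required and must be a non-empty string")
    if isinstance(body.get("prompt"), str) and len(body["prompt"]) > MAX_PROMPT_CHARS:
        problems.append(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    if "resolution" in body and body["resolution"] not in ALLOWED_RESOLUTIONS:
        problems.append("resolution must be one of 720p, 1080p, 4k")
    if body.get("resolution") == "4k" and body.get("model") != "kling-3-0":
        problems.append("4k is listed for kling-3-0 only")
    if "aspect_ratio" in body and body["aspect_ratio"] not in ALLOWED_ASPECT_RATIOS:
        problems.append("aspect_ratio must be one of 16:9, 9:16, 1:1")
    if "audio" in body and not isinstance(body["audio"], bool):
        problems.append("audio must be a boolean")
    return problems


print(check_video_request({"model": "kling-2-6", "prompt": "A cat", "resolution": "4k"}))
```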

Response

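The kickoff returns immediately with the GPUniq job id, the resolved parameter snapshot, and the cost estimate. Internal routing is opaque: the same job_id is valid across fallbacks, and the user-facing price stays stable. The objects below show the payload that the Python example above reads from the data field of the native response.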

// POST /v1/llm/videos/jobs
{
  "job_id": "vid_e93e98c7ca5e4982876b",
  "status": "pending",
  "model": "kling-2-6",
  "estimated_cost_usd": 0.315,
  "config": { "resolution": "1080p", "audio": false, "duration": 5, "task": "t2v", "mode": null }
}

// GET /v1/llm/videos/jobs/{job_id} — completed
{
  "job_id": "vid_e93e98c7ca5e4982876b",
  "status": "completed",
  "model": "kling-2-6",
  "video": { "url": "https://cdn.example.com/.../output.mp4" },
  "cost_usd": 0.315,
  "balance_usd": 9.17825791,
  "config": { "resolution": "1080p", "audio": false, "duration": 5, "task": "t2v", "mode": null }
}

The polling endpoint transparently falls back across internal routes if the first attempt fails — your job_id and the user-facing price stay stable across fallbacks. Internal route identifiers are deliberately omitted from the public response; they live only in admin/operator logs.

Chat models are sold at 20% below vendor list price.

Fetch the live catalog at any time:

models = client.llm.models()
for model in models["models"]:
    print(model)

The default model is claude-haiku-4-5 — fast, cheap, strong at code.

Long generations & streaming

The edge proxy closes inbound connections after ~100 seconds of streaming silence. A non-streaming request with max_tokens > 4096 is rejected up front with HTTP 400 streaming_required, because buffered responses of that length routinely take longer than the ~100-second cap to produce. For long replies, set "stream": true or use the job-based long-poll API.

Your request	What to do
≤ 4096 output tokens, fast model	Plain `POST /chat/completions` works.
> 4096 output tokens OR slow / reasoning model	Set `"stream": true`.
Client can't speak SSE	Use `POST /v1/llm/chat/jobs` (long-poll).

Reasoning models (Gemini 3 Pro, DeepSeek R1, o3, Claude Opus thinking) burn tokens on hidden chain-of-thought before the visible reply, so they need extra max_tokens headroom — see the Long generations guide for the full streaming / job-based / reasoning-token recipe.
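
The decision table above can be expressed as a small function. This is only a sketch of the rule as stated on this page; `choose_transport` and its argument names are hypothetical, and whether a model counts as "slow or reasoning" is left to the caller.

```python
NON_STREAMING_MAX_TOKENS = 4096  # limit quoted in the section above


def choose_transport(max_tokens: int, slow_or_reasoning: bool, client_supports_sse: bool) -> str:
    """Map a request profile onto 'plain', 'stream', or 'job' following the table above."""
    needs_long_path = max_tokens > NON_STREAMING_MAX_TOKENS or slow_or_reasoning
    if not needs_long_path:
        return "plain"
    return "stream" if client_supports_sse else "job"


print(choose_transport(8000, slow_or_reasoning=False, client_supports_sse=True))
```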

Errors

Every failure returns a stable OpenAI error envelope with a structured code you can branch on — streaming_required, insufficient_balance, model_not_found, rate_limit_per_key, etc. See the Error reference for the complete catalog (29 codes), recovery strategies, and the native vs. OpenAI-compat envelope shapes.

{
  "error": {
    "message": "…human-readable description…",
    "type": "invalid_request_error",
    "code": "streaming_required",
    "doc_url": "https://docs.gpuniq.com/llm/long-generations",
    "meta": { "max_tokens": 8000, "limit": 4096 }
  },
  "status_code": 400,
  "request_id": "…"
}
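
Branching on the structured code might look like the sketch below. The four codes are the ones named on this page; the suggested reactions and the helper names `error_code` and `suggested_action` are illustrative assumptions, not the recovery strategies from the Error reference.

```python
from typing import Optional

# Illustrative reactions only; consult the Error reference for the real guidance.
ACTIONS = {
    "streaming_required": "retry with stream=true",
    "rate_limit_per_key": "back off and retry",
    "insufficient_balance": "top up the balance",
    "model_not_found": "check the model slug",
}


def error_code(payload: dict) -> Optional[str]:
    """Extract error.code from an error envelope, or None if the payload is not an error."""
    error = payload.get("error")
    if isinstance(error, dict):
        return error.get("code")
    return None


def suggested_action(payload: dict) -> str:
    """Pick a client-side reaction for a response payload."""
    code = error_code(payload)
    if code is None:
        return "ok"
    return ACTIONS.get(code, "surface the error to the caller")


sample = {
    "error": {"type": "invalid_request_error", "code": "streaming_required"},
    "status_code": 400,
}
print(suggested_action(sample))
```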

OpenAI-Compatible Endpoint

Point any OpenAI-compatible tool at GPUniq by setting two environment variables:

OPENAI_API_KEY=gpuniq_your_key
OPENAI_BASE_URL=https://api.gpuniq.com/v1/openai

Every field of the OpenAI Chat Completions protocol is forwarded unchanged: tools, tool_choice, response_format, logprobs, seed, stream, stream_options, etc.

Official OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="gpuniq_your_key",
    base_url="https://api.gpuniq.com/v1/openai",
)

resp = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "Write a binary search in Rust."}],
)
print(resp.choices[0].message.content)

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "gpuniq_your_key",
  baseURL: "https://api.gpuniq.com/v1/openai",
});

const resp = await client.chat.completions.create({
  model: "claude-opus-4-7",
  messages: [{ role: "user", content: "Write a binary search in Rust." }],
});
console.log(resp.choices[0].message.content);

curl https://api.gpuniq.com/v1/openai/chat/completions \
  -H "Authorization: Bearer gpuniq_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-opus-4-7",
    "messages": [{"role": "user", "content": "Write a binary search in Rust."}]
  }'

Streaming

Set stream: true — GPUniq returns a text/event-stream with byte-identical OpenAI SSE framing:

stream = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Explain MoE in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
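
For clients that read the event stream without the OpenAI SDK, the framing can be parsed by hand. The sketch below assumes the standard OpenAI layout, where each event is a `data: {json}` line and the stream ends with `data: [DONE]`. The function name `collect_stream_text` is hypothetical and the sample lines are made up for the demonstration.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from OpenAI-style SSE lines until the [DONE] sentinel."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample_lines))
```

Chunks with an empty `choices` list (for example a trailing usage chunk) are skipped rather than treated as errors.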

Image Generation

Both API surfaces expose a /images/generations endpoint that matches OpenAI's images.generate protocol. Pass any image slug from the catalog above (e.g. nano-banana-pro, gpt-image-2, flux-2-pro, seedream-4). Billing is flat per returned image — no token accounting.

Image requests route through a multi-tier reliability chain behind the scenes: a per-model priority gateway, two cost-optimised intermediaries, then a generic OpenAI-compatible fallback for safety. The chain is selected automatically per slug, so SDK callers never pick a backend themselves. If the primary fails or returns no image, the next tier is tried within the same HTTP request — you still see one synchronous POST /images/generations and pay for delivered images only.

Heavy generations (Pro / 4K, multi-image batches, high-quality preset) can run up to 5 minutes end-to-end; the connection is held open for that whole budget so SDKs never need to re-poll. For interactive UIs that cannot keep an HTTP connection open that long, prefer the job-based API.

import base64

from openai import OpenAI

client = OpenAI(
    api_key="gpuniq_your_key",
    base_url="https://api.gpuniq.com/v1/openai",
)

resp = client.images.generate(
    model="nano-banana-pro",
    prompt="A cozy mountain cabin at sunrise, cinematic lighting",
    n=2,
    size="1024x1024",
    response_format="b64_json",
)

# Decode each returned image and write it to disk.
for i, img in enumerate(resp.data):
    with open(f"out_{i}.png", "wb") as f:
        f.write(base64.b64decode(img.b64_json))

import base64, requests

with open("reference.jpg", "rb") as fh:
    ref = "data:image/jpeg;base64," + base64.b64encode(fh.read()).decode()

resp = requests.post(
    "https://api.gpuniq.com/v1/llm/images/generations",
    headers={"X-API-Key": "gpuniq_your_key"},
    json={
        "model": "nano-banana-pro",
        "prompt": "Redraw the cabin in watercolor style",
        "n": 1,
        "size": "2048x2048",
        "input_images": [ref],
    },
).json()

print("cost:", resp["data"]["cost_usd"])

curl -X POST https://api.gpuniq.com/v1/openai/images/generations \
  -H "Authorization: Bearer gpuniq_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-4-image",
    "prompt": "Studio portrait of an astronaut in a pink desert",
    "n": 1,
    "size": "1024x1024"
  }'

Parameters

body

model

Any image slug from the catalog: the Nano Banana family, grok-4-image, gpt-image-2, gpt-image-1-5, gpt-4o-image, flux-2-pro, flux-2-flex, flux-kontext-pro, flux-kontext-max, seedream-4, seedream-4-5, seedream-5-0-lite, or z-image.

body

prompt

Text description of the image you want. Up to 4000 characters.

body

n

Number of images to generate. 1–4.

body

size

Output resolution hint forwarded to the upstream, e.g. 1024x1024, 2048x2048, 4096x4096. Nano Banana Pro 4K defaults to 4096.

body

quality

Optional upstream quality hint (e.g. standard, hd). Models that don't recognise the value silently fall back to their default.

body

response_format

b64_json returns inline PNG base64 (browser-renderable). url returns a short-lived upstream URL.

body

output_format

Re-encode every delivered image into this format on the server before returning, so the client doesn't need a Pillow / Sharp pipeline. One of:

png (default if omitted) — pass-through, lossless.
jpeg (alias jpg) — ~10× smaller payload, alpha is flattened onto white because JPEG has no transparency.
webp — ~5× smaller at comparable quality, alpha preserved.

Quality for the lossy formats is fixed at 92 — visually indistinguishable from the source PNG. Conversion failures degrade to "return source PNG unchanged" so you always get an image, never a 502 after the upstream has done the expensive work. The MIME type of the converted bytes is echoed back in data[i].mime_type.

body

input_images

Optional reference photos for image-to-image / editing. Each entry is a data: URL, https:// URL, or bare base64 string. Supported by the Nano Banana family (including Nano Banana Pro), GPT Image, FLUX Kontext, and Seedream slugs.

If the upstream returns fewer images than requested (content-policy rejects, partial failures, etc.), you are billed only for what was delivered.
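
When `output_format` is set, the file extension should follow the returned MIME type rather than being hard-coded. The sketch below assumes each delivered item is a plain dict with `b64_json` and, optionally, `mime_type`, as described above. `save_images` is a hypothetical helper, and the demo payloads are placeholder bytes rather than real images.

```python
import base64
import tempfile
from pathlib import Path
from typing import List

EXTENSIONS = {"image/png": ".png", "image/jpeg": ".jpg", "image/webp": ".webp"}


def save_images(items: List[dict], out_dir: Path, stem: str = "out") -> List[Path]:
    """Write each delivered image to out_dir, choosing the extension from mime_type."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, item in enumerate(items):
        # Fall back to .png when mime_type is missing or unrecognised.
        extension = EXTENSIONS.get(item.get("mime_type"), ".png")
        path = out_dir / f"{stem}_{index}{extension}"
        path.write_bytes(base64.b64decode(item["b64_json"]))
        written.append(path)
    return written


# Demo with placeholder bytes; real items come from the response's data array.
demo_items = [
    {"b64_json": base64.b64encode(b"abc").decode(), "mime_type": "image/webp"},
    {"b64_json": base64.b64encode(b"xyz").decode()},
]
demo_paths = save_images(demo_items, Path(tempfile.mkdtemp()))
print([path.name for path in demo_paths])
```

Because the loop iterates over whatever was delivered, it also copes with responses that contain fewer images than requested.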

Claude Code

Claude Code can route through GPUniq via a LiteLLM proxy. Run LiteLLM locally as an Anthropic-compatible front-end for the GPUniq OpenAI endpoint:

# ~/litellm.yaml
model_list:
  - model_name: claude-opus-4-7
    litellm_params:
      model: openai/claude-opus-4-7
      api_base: https://api.gpuniq.com/v1/openai
      api_key: os.environ/GPUNIQ_API_KEY

export GPUNIQ_API_KEY=gpuniq_your_key
litellm --config ~/litellm.yaml --port 4000

# In another shell — point Claude Code at the proxy
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_API_KEY=sk-litellm-anything
claude

All tokens are billed against your GPUniq balance — no separate Anthropic account required.

Cursor

Settings → Models → Override OpenAI Base URL:

Base URL:  https://api.gpuniq.com/v1/openai
API Key:   gpuniq_your_key
Model:     claude-opus-4-7   # or any slug from /v1/openai/models

Continue.dev / Aider / LiteLLM

Any tool that accepts an OPENAI_BASE_URL works the same way:

export OPENAI_API_KEY=gpuniq_your_key
export OPENAI_BASE_URL=https://api.gpuniq.com/v1/openai

aider --model claude-sonnet-4-6

The OpenAI-compat endpoint returns raw OpenAI response objects (not wrapped in GPUniq's ResponseSchema). Errors use OpenAI's {"error": {"message", "type", "code"}} envelope so SDK retry logic works unchanged.

Native GPUniq SDK

For the fullest feature set — persistent chat sessions, USD balance conversion, usage history — use the native API.

Simple Chat

response = client.llm.chat("claude-haiku-4-5", "Explain how transformers work")
print(response)

curl -X POST "https://api.gpuniq.com/v1/llm/chat/completions" \
  -H "X-API-Key: gpuniq_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain how transformers work"}],
    "model": "claude-haiku-4-5"
  }'

Chat Completion (Full)

data = client.llm.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is gradient descent?"},
    ],
    model="claude-sonnet-4-6",
    temperature=0.7,
    max_tokens=1000,
    top_p=0.9,
)

print(data["content"])
print(f"Tokens used: {data['tokens_used']}  cost: ${data['cost_usd']:.6f}")

Parameters

body

messages

List of message objects with role ("system", "user", "assistant") and content.

body

model

Model slug (e.g., claude-opus-4-7, gpt-5.2, gemini-3-pro). Defaults to claude-haiku-4-5.

body

max_tokens

Maximum tokens in the response.

body

temperature

Sampling temperature (0.0-2.0). Higher = more creative.

body

top_p

Top-p nucleus sampling parameter.

Account Balance

Chat and image requests are billed directly against your GPUniq account balance in USD — there is no separate "token pool" anymore. Each call deducts the model's blended retail rate × the tokens it actually consumed (or per-image flat rate for image models). Prepaid token packages and ruble-to-token conversions are no longer required and the corresponding endpoints have been retired.

balance = client.llm.balance()
print(f"Available: ${balance['balance_usd']:.4f} USD")

Top up the balance from the web dashboard → Billing (Stripe / YooKassa / crypto). The balance is shared with every other GPUniq surface — GPU rentals, volume storage, image generations — so a single deposit covers the whole platform.

Usage History

Per-request detail with prompt / completion / cached / reasoning tokens and the USD cost charged at retail. Backed by the /v1/llm/usage/history endpoint; pair it with /v1/llm/usage/breakdown for daily / weekly aggregates.

history = client.llm.usage_history(limit=50, offset=0)
for log in history["logs"]:
    print(f"{log['model']}: {log['total_tokens']} tokens — ${log['cost_usd']:.6f}")
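
The per-request logs can be rolled up client-side, for instance to see which model accounts for most of the spend. The sketch below relies only on the `model` and `cost_usd` fields used in the example above; `cost_by_model` is a hypothetical helper and the sample entries are invented for the demonstration.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable


def cost_by_model(logs: Iterable[dict]) -> Dict[str, Decimal]:
    """Sum cost_usd per model across usage-history log entries."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for log in logs:
        # Go through str() so float inputs do not carry binary rounding noise.
        totals[log["model"]] += Decimal(str(log["cost_usd"]))
    return dict(totals)


# Invented sample entries shaped like history["logs"].
sample_logs = [
    {"model": "claude-haiku-4-5", "cost_usd": 0.0012},
    {"model": "claude-sonnet-4-6", "cost_usd": 0.0150},
    {"model": "claude-haiku-4-5", "cost_usd": 0.0008},
]
for model, total in sorted(cost_by_model(sample_logs).items()):
    print(f"{model}: ${total}")
```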

Chat Sessions

Persistent conversations stored server-side — the model sees the full history on every call:

# Create a session
session = client.llm.create_chat_session(
    model="claude-sonnet-4-6",
    title="Research Assistant",
)

# Send messages within the session
reply = client.llm.send_message(
    chat_id=session["id"],
    message="What are the key papers on attention mechanisms?",
    temperature=0.5,
)

# List all sessions
sessions = client.llm.list_chat_sessions(limit=50)

# Get a session with full message history
full = client.llm.get_chat_session(chat_id=session["id"])

# Update title
client.llm.update_chat_session(chat_id=session["id"], title="New Title")

# Delete
client.llm.delete_chat_session(chat_id=session["id"])

Generate Terminal Commands

Convert natural language to a ranked list of shell commands with danger annotations:

cmds = client.llm.generate_commands(
    prompt="find all Python files larger than 1MB and sort by size",
    max_commands=5,
)
for c in cmds["commands"]:
    print(f"[{c['danger']}] {c['command']}  # {c['description']}")

API Key Management

API keys are created from the web dashboard (LLM API Keys) and sent as Authorization: Bearer gpuniq_... on OpenAI-compat routes, or X-API-Key: gpuniq_... on native routes.

Rate limit: 120 req/min per key, sliding window.
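
A client can stay under the limit by tracking its own request timestamps. The class below is a minimal client-side sketch of a sliding window with the documented 120 requests per 60 seconds as defaults. `SlidingWindowLimiter` is a hypothetical name, and the server's exact accounting may differ, so a 429 handler is still worth having.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Client-side guard that mirrors a limit of max_calls per window_s seconds."""

    def __init__(self, max_calls: int = 120, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock
        self._stamps: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.window_s - (now - self._stamps[0])

    def record(self) -> None:
        """Note that a request was just sent."""
        self._stamps.append(self._clock())


limiter = SlidingWindowLimiter()
delay = limiter.wait_time()
if delay > 0:
    time.sleep(delay)
limiter.record()
# ... send the request here ...
```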
