Changelog

Track the latest updates and improvements to GPUniq.

2026-05-14 Kling video: multi-provider auto-fallback, parametric −10% off official pricing, Avatar SKUs

feature · video · kling · pricing

Production Kling video — cheapest-of-two routing with transparent fallback

Kling video generation now ranks multiple upstreams per request by their parametric source price for the exact (slug, resolution, audio, duration, mode, task) combination, and falls back through the chain transparently if the active upstream errors out mid-flight. The GPUniq job_id and the user-facing price stay stable across fallbacks: clients just keep polling and get either a completed status with the video URL or a final failed status if every upstream rejects the job.

The response now exposes a config snapshot of the resolved generation parameters:

{
  "job_id": "vid_e93e98c7ca5e4982876b",
  "status": "completed",
  "model": "kling-2-6",
  "video": { "url": "https://cdn.example.com/.../output.mp4" },
  "cost_usd": 0.315,
  "balance_usd": 9.17825791,
  "config": { "resolution": "1080p", "audio": false, "duration": 5, "task": "t2v", "mode": null }
}
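The polling loop described above could look roughly like the sketch below. It is an illustration only: the polling endpoint for video jobs is not named in this entry, so the HTTP call is injected as a hypothetical `fetch_job` callable, and the 2-second interval and 5-minute deadline are assumptions rather than documented values for video.

```python
import time
from typing import Callable, Dict

TERMINAL_STATUSES = {"completed", "failed"}  # the two final statuses named above


def poll_video_job(
    fetch_job: Callable[[str], Dict],          # hypothetical: wraps the HTTP GET for one job
    job_id: str,
    interval_s: float = 2.0,                   # assumed polling interval
    deadline_s: float = 300.0,                 # assumed client-side deadline
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll until the job reaches a terminal status or the deadline passes."""
    started = clock()
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() - started >= deadline_s:
            raise TimeoutError(f"job {job_id} did not finish within {deadline_s}s")
        sleep(interval_s)
```

Because the job_id does not change across fallbacks, the caller never needs to know which upstream served the job; it only inspects `status`.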

Kling retail is now anchored at the official reference price × 0.90

Every Kling SKU with a published official reference listing is billed at exactly 10% off the official public price for the requested configuration, the same headline price regardless of which upstream actually served the job. A 10% margin floor against the chosen supplier guarantees we never bill below source cost, so on a provider fallback the user may see a 1–3% price bump but never a freebie. Catalog SKUs expose the badge as discount_percent_label: 10.0, so the frontend can render the "−10% off official" tag without hard-coding model lists.

Parametric examples:

SKU	Config	Retail
`kling-2-6`	5s no-audio	$0.315
`kling-2-6`	5s with audio	$0.630
`kling-3-0`	1080p+audio 5s	$0.756
`kling-3-0`	4K 5s	$1.890
`kling-2-1`	Pro 5s i2v	$0.405
`kling-2-6-motion-control`	1080p 5s v2v	$0.504
`kling-avatar-pro`	10s 1080p	$1.035
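The arithmetic behind these rows can be sketched as follows. This is a minimal illustration, not the billing code: the floor is modelled here as source cost × 1.10, which is one reading of "10% margin floor", and the official and source prices in the example are back-computed or invented for the demonstration.

```python
from decimal import Decimal, ROUND_HALF_UP

DISCOUNT = Decimal("0.90")      # retail anchor: official price x 0.90
MARGIN_FLOOR = Decimal("1.10")  # ASSUMPTION: floor modelled as source cost x 1.10


def kling_retail(official_usd: str, source_cost_usd: str) -> Decimal:
    """Return the discounted anchor price, raised to the margin floor if needed."""
    anchor = Decimal(official_usd) * DISCOUNT
    floor = Decimal(source_cost_usd) * MARGIN_FLOOR
    return max(anchor, floor).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)


# 0.35 is back-computed from the $0.315 row (0.315 / 0.90); source costs are invented.
print(kling_retail("0.35", "0.25"))  # anchor wins: 0.315
print(kling_retail("0.35", "0.29"))  # floor wins: 0.319
```

In the second call the floor lifts the price by about 1.3%, which is the kind of small bump a fallback to a more expensive supplier would produce under this model.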

Two new Kling-Avatar SKUs (lip-sync)

kling-avatar-pro (1080p) and kling-avatar-standard (720p) join the catalog for lip-sync avatars up to 15 seconds. Per-second billing; lip-sync is currently served by a single upstream only.

Request shape gained `resolution`, `audio`, `mode`, `video_url`

POST /v1/llm/videos/jobs accepts four new optional fields used by the Kling parametric pricing path:

resolution: 720p / 1080p (default) / 4k
audio: true to enable audio on Kling 2.6 / 3.0 (default false)
mode: standard / pro / master for Kling 2.1; turbo for 2.5 Turbo Pro
video_url: reference video for motion-control v2v variants

Non-Kling SKUs (Sora, Veo, Wan, Hailuo, Seedance) ignore the new fields — their flat-price catalog path is unchanged, and existing callers see no regression.
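A request body using the new fields might be assembled as in the sketch below. The helper name `build_video_job` and the `model` / `prompt` keys are assumptions for illustration (only `resolution`, `audio`, `mode`, and `video_url` are described in this entry), and the client-side validation is optional since the server remains authoritative.

```python
from typing import Any, Dict, Optional

RESOLUTIONS = {"720p", "1080p", "4k"}
MODES = {"standard", "pro", "master", "turbo"}


def build_video_job(
    model: str,
    prompt: str,                      # assumed field name, not confirmed by this entry
    resolution: str = "1080p",
    audio: bool = False,
    mode: Optional[str] = None,
    video_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a JSON body for POST /v1/llm/videos/jobs with the optional Kling fields."""
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    if mode is not None and mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    body: Dict[str, Any] = {
        "model": model,
        "prompt": prompt,
        "resolution": resolution,
        "audio": audio,
    }
    # Optional fields are omitted rather than sent as null.
    if mode is not None:
        body["mode"] = mode
    if video_url is not None:
        body["video_url"] = video_url
    return body
```

For example, `build_video_job("kling-2-6", "a red kite over dunes", audio=True)` yields a body with the default 1080p resolution and audio enabled.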

Full reference: LLM API → Video Generation.

2026-05-11 Hardened error envelope, max_tokens > 4096 requires streaming, full provider failover refactor

feature · llm · errors · breaking

Stable error catalog with 29 codes

Every /v1/openai/* and /v1/llm/* failure now returns a structured error.code you can branch on — streaming_required, insufficient_balance, model_not_found, rate_limit_per_key, upstream_timeout, and 24 more. Full reference, recovery strategies, and the native vs. OpenAI-compat envelope shapes are documented at LLM API → Error reference.

The OpenAI envelope is now byte-identical to the spec — earlier deployments emitted a double-wrapped {"error":{"error":{…}}} body that broke the OpenAI SDK's typed exception parser. That regression is fixed: the wire format is {"error":{message,type,code,doc_url?,meta?}, status_code, request_id}, which lets OpenAI clients raise BadRequestError, RateLimitError, AuthenticationError etc. without special-casing.

`max_tokens > 4096` without `stream: true` is now rejected up-front (BREAKING)

A non-streaming request asking for more than 4096 output tokens now returns HTTP 400 with error.code = "streaming_required". Earlier deployments silently upgraded the upstream call to streaming and reassembled the SSE chunks into a non-stream response — that worked inside SDKs but ate connections on every Cloudflare-fronted client (browsers, mobile, behind corporate proxies). The new behaviour fails fast with a clear hint:

{
  "error": {
    "message": "Requested max_tokens=8000 exceeds the non-streaming limit of 4096. Long responses must use streaming.",
    "type": "invalid_request_error",
    "code": "streaming_required",
    "meta": { "max_tokens": 8000, "limit": 4096, "hint": "stream=true" },
    "doc_url": "https://docs.gpuniq.com/llm/long-generations"
  }
}

Migration: add "stream": true to the request body. If your client can't speak SSE, use the job-based long-poll API. The threshold is configurable per-deployment; the 4096 default is the clean intersection of "fits in the 100s edge-proxy window" and "below every model's stream-chunk delivery rate".
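On the client side, the migration can be sketched as a small guard plus a check for the new error code. Both helper names are hypothetical, and the 4096 constant is the default quoted above, which a given deployment may have changed.

```python
from typing import Any, Dict

NON_STREAM_LIMIT = 4096  # default threshold from this entry; configurable per deployment


def ensure_streaming(body: Dict[str, Any], limit: int = NON_STREAM_LIMIT) -> Dict[str, Any]:
    """Return a copy of the request body with stream=True when max_tokens exceeds the limit."""
    patched = dict(body)
    if (patched.get("max_tokens") or 0) > limit and not patched.get("stream"):
        patched["stream"] = True
    return patched


def needs_streaming_retry(status_code: int, payload: Dict[str, Any]) -> bool:
    """True when the server rejected the call with error.code == "streaming_required"."""
    error = payload.get("error") or {}
    return status_code == 400 and error.get("code") == "streaming_required"
```

Applying `ensure_streaming` before sending avoids the round trip entirely; `needs_streaming_retry` covers deployments whose threshold differs from the default.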

Full provider failover chain on non-stream

The non-streaming chat path now iterates the cost-sorted provider chain top-to-bottom on transient failures (network errors, upstream 5xx, model_not_found / insufficient_user_quota patterns, vendor maintenance envelopes). Previously, only the cheapest provider was tried before falling straight to the safety-net upstream — the middle entries of the chain were skipped on failover, costing margin and reliability.

The streaming path now uses the same chain iterator. Both paths emit exactly ONE operator alert when the chain is fully exhausted (was sometimes two per request before).
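The chain iteration described above can be illustrated with the following sketch. It is not the router's code: `Provider`, `TransientError`, and `run_chain` are hypothetical names, and the classification of which failures count as transient is reduced to a single exception type.

```python
from dataclasses import dataclass
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical marker for network errors, upstream 5xx, and similar failures."""


@dataclass
class Provider:
    name: str
    cost: float
    call: Callable[[dict], dict]


def run_chain(providers: List[Provider], request: dict,
              alert: Callable[[str], None]) -> dict:
    """Try providers cheapest-first; alert exactly once if every one of them fails."""
    failures = []
    for provider in sorted(providers, key=lambda p: p.cost):
        try:
            return provider.call(request)
        except TransientError as exc:
            failures.append(f"{provider.name}: {exc}")  # move on to the next entry
    alert("chain exhausted: " + "; ".join(failures))
    raise RuntimeError("all providers failed")
```

Keeping the alert outside the loop is what guarantees a single operator alert per exhausted request, regardless of how many providers were tried.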

Claude now streams via an Anthropic-native endpoint

Requests for Claude Haiku/Sonnet/Opus 4-5 and 4-6 are now served through an Anthropic-native endpoint with real SSE event streaming (verified end-to-end on Opus 4.6 and Sonnet 4.6). A redundant upstream was added at the same price point, so if one upstream throttles, another picks up without a price change.

Sliding-window rate limiter + per-user gate

core/rate_limiter.py is now a true Redis ZSET sliding window — the previous fixed-bucket implementation let 2× the limit through at the minute boundary. A per-user aggregate cap (default 600 req/min) fires independently of the per-key cap and surfaces as error.code = "rate_limit_per_user" so clients can branch on which gate triggered.
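To see why a sliding window closes the minute-boundary gap, consider this in-memory sketch. It is a stand-in for the idea only, not the Redis implementation in core/rate_limiter.py; the class name is hypothetical and a deque replaces the sorted set.

```python
from collections import deque
from typing import Deque


class SlidingWindowLimiter:
    """In-memory stand-in for a sorted-set sliding window (not the Redis implementation)."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have left the window (ZREMRANGEBYSCORE in Redis terms).
        while self._hits and self._hits[0] <= now - self.window_s:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False
        self._hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=2, window_s=60.0)
# Two requests at the end of one minute, two at the start of the next.
print([limiter.allow(t) for t in (59.0, 59.5, 60.0, 60.5)])  # [True, True, False, False]
```

A fixed one-minute bucket would reset at t=60 and admit all four requests, which is the 2× burst the entry describes.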

Admin → Streaming Providers tab

A new admin tab under Unit Economics → Streaming Providers shows the live per-model eligibility table: cost / balance / enabled / stream-capable flag / reason for every provider in the chain, plus the actual stream and non-stream fallback chains the router would build right now. Useful when a customer reports an "expected provider X, got provider Y" mismatch.

2026-05-07 Image format conversion, 5-min timeouts, expanded image catalog

feature · llm · images

Server-side image format conversion

POST /v1/llm/images/generations and POST /v1/openai/images/generations now accept an optional output_format field. The server re-encodes delivered images into the chosen format before they go over the wire, so clients no longer need a Pillow / Sharp pipeline of their own.

png — pass-through (default if omitted), lossless.
jpeg (alias jpg) — ~10× smaller payload; alpha is flattened onto white because JPEG has no transparency.
webp — ~5× smaller at comparable quality, alpha preserved.

The chosen format is echoed back in usage.output_format and each delivered image entry carries its mime_type. Conversion failures degrade to "return source PNG unchanged" so a corrupt encode never loses an image the upstream already produced.
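The degrade-to-PNG behaviour can be pictured with the sketch below. It is a simplified model under stated assumptions: `deliver_image` is a hypothetical name, the real encoder is abstracted as an injected `encode` callable, and rejecting unknown formats with an error is a guess at the validation behaviour.

```python
from typing import Callable, Tuple

MIME = {"png": "image/png", "jpeg": "image/jpeg", "webp": "image/webp"}


def deliver_image(
    png_bytes: bytes,
    output_format: str,
    encode: Callable[[bytes, str], bytes],   # hypothetical encoder: (png, format) -> bytes
) -> Tuple[bytes, str]:
    """Return (payload, mime_type); fall back to the source PNG if encoding fails."""
    fmt = "jpeg" if output_format == "jpg" else output_format  # jpg is an alias
    if fmt not in MIME:
        raise ValueError(f"unsupported output_format: {output_format!r}")  # assumed behaviour
    if fmt == "png":
        return png_bytes, MIME["png"]  # pass-through, no re-encode
    try:
        return encode(png_bytes, fmt), MIME[fmt]
    except Exception:
        return png_bytes, MIME["png"]  # keep the image the upstream already produced
```

Returning the MIME type alongside the bytes mirrors the per-image `mime_type` field, so a client can detect when the fallback path was taken.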

5-minute upstream timeout across the chain

Long-running reasoning models (Gemini 3 Pro thinking, GPT-5.2 Pro, Claude Opus 4.7 thinking, Sora-2 video) that legitimately run for several minutes were occasionally cut off mid-generation by a tight 60-80 s upstream budget and silently re-routed at full official rate. The whole LLM stack now holds the connection open for up to 5 minutes end-to-end (10 minutes worst-case if the primary fully exhausts and the reliability fallback also takes the full window). In practice we never see anything past ~3 minutes, but you'll never lose a slow honest generation again.

A dead gateway still fast-fails on TCP connect (10 s) and re-routes inside the same request — a 5-minute upper bound is the budget for progress, not a forced wait.

Expanded image model catalog

The image-generation endpoint now serves 11 additional models on top of the Nano Banana family and Grok 4 Image:

Slug	Price / image	Notes
`gpt-image-2`	$0.0464	OpenAI image, 1K default
`gpt-image-1-5`	$0.020	OpenAI cheaper tier
`gpt-4o-image`	$0.040	OpenAI 4o image
`flux-2-pro`	$0.060	Black Forest Labs FLUX.2 Pro 1K
`flux-2-flex`	$0.180	Premium quality 1K
`flux-kontext-pro`	$0.080	Text-to-image & edit
`flux-kontext-max`	$0.160	Premium edit / generation
`seedream-4`	$0.050	ByteDance Seedream 4
`seedream-4-5`	$0.040	ByteDance Seedream 4.5
`seedream-5-0-lite`	$0.035	ByteDance Seedream 5.0 Lite
`z-image`	$0.020	Alibaba Z-Image

Image-to-image (input_images) is supported on the FLUX Kontext, Seedream, GPT Image and Nano Banana edit slugs.

Gemini 3.1 Flash Lite at the standard discount tier

gemini-3.1-flash-lite joined the Gemini −20% retail tier:

	Before	Now
Retail $/MTok	$0.25 / $1.50	$0.20 / $1.20

Existing API keys see the lower price automatically — no migration needed.

Gemini 3 Pro fallback works again

gemini-3-pro requests now have a working fallback chain (the upstream alias was previously routing to a missing model and 404-ing). Recovery is transparent — same slug, same retail price, no client-side change.

2026-04-26 Chat-image jobs + 5-minute polling budget

feature · llm · images

In-chat image generation moves to the job-based pattern

POST /v1/llm/chats/{chat_id}/messages with an image model now returns in under 1 second with type: "image_pending", a job_id, and the dialogue_id of a placeholder dialogue that already lives in the chat history.

New endpoint: GET /v1/llm/chats/{chat_id}/image-jobs/{job_id} — poll every 2-3s until completed (placeholder is rewritten with the image and balance is debited) or failed (placeholder is marked, nothing charged).
Pro / 4K turns no longer hit a Cloudflare 524 when they cross 100 seconds — the request itself is short, the image arrives over polling.
Server-side, polls are coalesced via Redis (2-second TTL), so a client polling every 250ms still costs one upstream call every 2s.
Both this and the existing standalone /v1/llm/images/jobs flow share the same kickoff / poll / charge code path — pricing, ownership checks, and balance debit are guaranteed to stay in lockstep.

The recommended client-side polling deadline is now 5 minutes for either surface. The examples in the LLM API docs have been updated accordingly.
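The server-side poll coalescing mentioned in the list above amounts to a short-lived cache in front of the upstream status call. The sketch below shows the idea with an in-memory dictionary standing in for Redis; `CoalescedPoller` and its parameters are hypothetical names.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CoalescedPoller:
    """Serve repeated polls from a short-lived cache (in-memory stand-in for Redis)."""

    def __init__(self, fetch_upstream: Callable[[str], Any], ttl_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_upstream
        self._ttl_s = ttl_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}

    def poll(self, job_id: str) -> Any:
        now = self._clock()
        cached = self._cache.get(job_id)
        if cached is not None and now - cached[0] < self._ttl_s:
            return cached[1]  # fresh enough: no upstream call
        result = self._fetch(job_id)
        self._cache[job_id] = (now, result)
        return result
```

With a 2-second TTL, eight polls spaced 250 ms apart hit the upstream once; the ninth, at the 2-second mark, triggers the next upstream call.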

New chat models

gpt-5.5: OpenAI flagship, $4.00 / $24.00 per MTok (20% off the official API price)
deepseek-v4-pro: flagship, $2.40 / $4.00 per MTok
deepseek-v4-flash: fast tier, $0.18 / $0.30 per MTok

2026-04-25 v3.5.4: gpuniq SDK

feature · sdk · cli · llm · images

LLM chat + image generation in the SDK and CLI

The gpuniq Python package (pip install -U gpuniq) now ships chat and image-generation helpers, plus matching gg subcommands so you never need to leave the terminal.

Python SDK

client.llm.generate_image(prompt, model, n, size, quality, input_images, save_to) — synchronous text-to-image / image-to-image.
client.llm.start_image_job / get_image_job / generate_image_async — job-based surface for Nano Banana (polls automatically, emits on_progress callbacks, streams past proxy read-timeouts).
input_images accepts local paths, data: URLs, https:// URLs, raw bytes, or bare base64 — the SDK inlines local files as data URLs for you.
save_to accepts a filename (single image) or a directory (many), decodes b64_json and writes PNG(s). The list of written paths is returned as saved_paths.
client.llm.default_model() and model_catalog() expose the platform default and pricing metadata.
Dropped stale purchase_tokens / convert_rubles_to_tokens / packages — LLM and image usage are billed directly in USD from user.balance.
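The `input_images` normalisation described in the list above can be approximated as follows. This is not the SDK's source: `to_image_input` is a hypothetical helper, and treating raw bytes and bare base64 as PNG is an assumption made to keep the sketch short.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Union


def to_image_input(item: Union[str, bytes, Path]) -> str:
    """Normalise one input image to a URL string (data: or https://)."""
    if isinstance(item, bytes):
        # ASSUMPTION: raw bytes are labelled as PNG.
        return "data:image/png;base64," + base64.b64encode(item).decode("ascii")
    text = str(item)
    if text.startswith(("data:", "https://")):
        return text  # already a URL: pass through unchanged
    try:
        is_file = Path(text).is_file()
    except OSError:  # e.g. a long base64 string is not a valid file name
        is_file = False
    if is_file:
        path = Path(text)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        return f"data:{mime};base64," + base64.b64encode(path.read_bytes()).decode("ascii")
    return "data:image/png;base64," + text  # ASSUMPTION: anything else is bare base64
```

Checking for URL prefixes before touching the filesystem keeps remote references from being misread as local paths.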

CLI

gg llm "prompt" — one-shot chat, prints the answer plus tokens / cost / balance.
gg llm — interactive REPL with /exit, /clear.
gg image "prompt" — generates image(s), saves PNG(s) to disk, prints paths + cost + balance. Auto-uses the async-poll path for Nano Banana slugs.

2026-04-25 v3.5.2 / 3.5.3: gpuniq CLI polish

cli · ux

CLI polish pass

gg help now works (alias for gg --help). Typos like gg oders print a clear Error: unknown command '<x>'. + full help instead of the confusing "gg not initialized" message — the shell-fallback only kicks in when a GPU-side gg init config actually exists.
2D GPU picker — arrow-key navigation through a matrix laid out by generation (Datacenter · 50XX · 40XX · 30XX · 20XX · 1660) with Any GPU / Other… meta rows.
Templates — docker image presets (PyTorch, ComfyUI, vLLM, Ubuntu VM, Custom). gg rent picks PyTorch by default; gg replace defaults to the old instance's image so on-disk data keeps working.
gg open always goes via ssh.gpuniq.com — the CLI calls a new POST /v1/instances/{id}/ssh-proxy/ensure to allocate a proxy port on demand for older orders whose allocation failed at order time.
gg replace fully destroys the old instance (DELETE, not just stop) so the provider machine and SSH proxy port are released before the new one is placed.
Billing plans trimmed to week (default), month, and minute. Hourly/daily billing is no longer exposed in the rental UI.

2026-04-24 v3.3.0 / 3.4.0 / 3.5.0 / 3.5.1: gpuniq CLI

feature · cli

`gg rent` and `gg replace`

gg rent — interactive GPU rental from the terminal. Filter wizard (GPU model picker, min count, max price, verified, sort), full-width marketplace table with n / p / f / s controls, template picker, volume picker (pick existing / create new / skip), confirm, place order. On HTTP 410 (offer taken mid-flow) the picker loops back without losing plan / volume choices.
gg replace <id> — swap the GPU on a running instance. Same picker, same filters. Preserves the original billing plan and volume; defaults the Docker image to whatever the old instance was running.
Adaptive table — columns resize with the terminal. Narrow: GPU / VRAM / RAM / DISK / LOCATION / RELIA / PRICE / VER. Wide: adds CPU, NET ↓/↑, AVAIL, CPU MODEL, HOSTING. GPU and LOCATION flex-share leftover width.
gg status recognises both gg login (client) and gg init (GPU) configs and shows a combined status view.
--image, --disk, --gpu, --count, --max-price, --sort, --pricing, --volume-id, --no-volume, --verified flags on gg rent / gg replace for non-interactive use.
OrderOfferGone typed exception and richer FastAPI error parsing (dict / list / string detail shapes) — you see order failed (400): docker_image is required instead of 400 Client Error: Bad Request for url.
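The richer error parsing mentioned in the last item can be illustrated with a small flattener for the three `detail` shapes a FastAPI backend commonly returns. This is a sketch rather than the CLI's actual code; `format_detail` and `describe_failure` are hypothetical names.

```python
from typing import Any


def format_detail(detail: Any) -> str:
    """Flatten a FastAPI-style `detail` value (string, dict, or list) into one line."""
    if isinstance(detail, str):
        return detail
    if isinstance(detail, dict):
        # Prefer a human-readable field when present; otherwise show key=value pairs.
        for key in ("message", "msg", "detail"):
            if isinstance(detail.get(key), str):
                return detail[key]
        return ", ".join(f"{k}={v}" for k, v in detail.items())
    if isinstance(detail, list):
        parts = []
        for item in detail:
            if isinstance(item, dict) and "msg" in item:
                # Validation errors carry a `loc` path such as ["body", "docker_image"].
                field = ".".join(str(p) for p in item.get("loc", []) if p != "body")
                parts.append(f"{field} {item['msg']}".strip())
            else:
                parts.append(format_detail(item))
        return "; ".join(parts)
    return str(detail)


def describe_failure(status: int, body: dict) -> str:
    """Build a message in the 'order failed (400): ...' shape quoted above."""
    return f"order failed ({status}): {format_detail(body.get('detail'))}"
```

For instance, `describe_failure(400, {"detail": "docker_image is required"})` produces the message shown in the entry instead of a generic HTTP error string.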

2026-04-21 v3.2.0

feature · llm · images

Image Generation

Text-to-image and image-to-image via the OpenAI-compatible /v1/openai/images/generations and native /v1/llm/images/generations endpoints.
Four new models: nano-banana ($0.0312/img), nano-banana-pro ($0.1072/img), nano-banana-pro-4k ($0.192/img), grok-4-image ($0.0352/img).
Reference photos: attach up to 4 photos with any prompt — Nano Banana handles the rest for image editing and style transfer.
Flat per-image billing: you pay only for delivered images. If the upstream content policy rejects an image, you're charged only for what arrived.
New in-app studio at /chat — pick any image model and the composer switches to a text+photos panel with a gallery view.

2026-04-21 v3.1.0

feature · llm · api

OpenAI-Compatible LLM Endpoint

Drop-in OpenAI API at /v1/openai/chat/completions and /v1/openai/models — every field of the OpenAI Chat Completions protocol is forwarded unchanged (tools, tool_choice, response_format, logprobs, seed, streaming).
Works with Claude Code (via LiteLLM proxy), Cursor, Continue.dev, Aider, LiteLLM, and the official OpenAI Python / JS SDKs — no code changes required.
Byte-identical SSE streaming — plug directly into OpenAI SDK's streaming parser.
Authenticates with your existing GPUniq API key via Authorization: Bearer gpuniq_....

Expanded Model Catalog

Anthropic Claude — Opus 4.7 / 4.6 / 4.5, Sonnet 4.6 / 4.5, Haiku 4.5 (now the platform default).
OpenAI GPT-5 family — GPT-5.2 Pro / Codex, GPT-5.1 Codex Max, GPT-5, o3 / o3-mini / o4-mini, GPT-4o, GPT-4.1.
Google Gemini — Gemini 3 Pro / Flash, Gemini 2.5, Nano Banana.
xAI Grok — Grok 4, Grok 4.1 Thinking, Grok 4 Fast.
Premium models priced 20% below vendor list price; 30+ free-tier community models remain available.

2026-02-28 v3.0.0

feature · cli

CLI Tool (`gg`)

Command checkpointing: Every command run via gg is saved with full output, exit code, and timing
Replay on restart: gg replay re-runs interrupted commands after instance restarts
PTY support: Full terminal colors, progress bars, and interactive prompts
Backend sync: Checkpoints are synced to the GPUniq backend automatically
6 commands: gg init, gg run, gg list, gg logs, gg replay, gg status
Shorthand syntax: gg python train.py works the same as gg run python train.py

2026-02-23 v2.0.0

feature · sdk · api

Python SDK v2.0

Full Python SDK: pip install GPUniq — all endpoints accessible via GPUniq client
Universal API Key Auth: API keys now work across all endpoints (marketplace, instances, volumes, LLM, payments, settings)
Rate Limiting: 120 req/min per API key with auto-retry in SDK
8 SDK Modules: marketplace, gpu_cloud, burst, instances, volumes, llm, payments, settings
Backward compatible: v1.x gpuniq.init() and client.request() still work

GPU Dex-Cloud

Deploy by GPU type: Pick a GPU model, the platform finds the best server automatically
One-call deployment: client.gpu_cloud.deploy(gpu_name="RTX_4090")
Pricing API: Check pricing before deploying

GPU Burst

Multi-GPU burst orders: Scale to 100 GPUs in a single order
Fallback GPUs: Define alternative GPU types with price caps
Cost estimation: Estimate burst cost before committing
Per-order billing: Transaction and run history per order

Persistent Volumes

Persistent storage: Create volumes that survive instance restarts
File management: Upload, download, list, and delete files via API/SDK
Attach to any instance: Mount volumes at creation time across all deployment modes
Sync logs: Monitor volume sync operations

LLM API

Unified LLM access: OpenAI, Qwen, DeepSeek, Llama, Mistral, and more
Chat sessions: Persistent conversations with message history
Token management: Balance, packages, usage history
Terminal commands: Generate shell commands from natural language

Settings

SSH key management: Add, update, toggle, delete SSH keys via API
Per-instance SSH keys: Attach/detach keys to individual instances
Telegram notifications: Link Telegram for status alerts

2026-01-09 v0.5.1

feature · improvement

New Features

GPU Analytics Dashboard: Market analytics with price tracking and availability trends
Enhanced AI Recommendations: Improved GPU suggestions based on usage history
Async Order Creation: Reliable order creation with job status polling
SLA Monitoring: Real-time uptime tracking per instance