Changelog
Track the latest updates and improvements to GPUniq.
Follow us on Telegram for real-time announcements.
Production Kling video — cheapest-of-two routing with transparent fallback
Kling video generation now ranks multiple upstreams per request by
their parametric source price for the exact (slug, resolution, audio, duration, mode, task) combination, and falls back
through the chain transparently if the active upstream errors out
mid-flight. The GPUniq job_id and the user-facing price stay stable
across fallbacks: clients just keep polling and receive either a
completed status with the video URL or a final failed status if every
upstream rejects.
The response now exposes a config snapshot of the resolved
generation parameters:
{
"job_id": "vid_e93e98c7ca5e4982876b",
"status": "completed",
"model": "kling-2-6",
"video": { "url": "https://cdn.example.com/.../output.mp4" },
"cost_usd": 0.315,
"balance_usd": 9.17825791,
"config": { "resolution": "1080p", "audio": false, "duration": 5, "task": "t2v", "mode": null }
}
Kling retail is now anchored at the official reference price × 0.90
Every Kling SKU with a published official reference listing is billed
at exactly −10% off the official public price for the requested
configuration — the same headline regardless of which upstream
actually served the job. A 10% margin floor against the chosen
supplier guarantees we never bill below source cost, so on a provider
fallback the user may see a 1-3% price bump but never a freebie. Catalog SKUs surface
the badge as discount_percent_label: 10.0 for the frontend to
render the "−10% off official" tag without hard-coding model lists.
Parametric examples:
| SKU | Config | Retail |
|---|---|---|
| kling-2-6 | 5s no-audio | $0.315 |
| kling-2-6 | 5s with audio | $0.630 |
| kling-3-0 | 1080p+audio 5s | $0.756 |
| kling-3-0 | 4K 5s | $1.890 |
| kling-2-1 | Pro 5s i2v | $0.405 |
| kling-2-6-motion-control | 1080p 5s v2v | $0.504 |
| kling-avatar-pro | 10s 1080p | $1.035 |
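As a worked check of the pricing rule above, the sketch below computes a retail price as the official reference price × 0.90 and then applies the margin floor. The exact floor formula is not spelled out here, so the sketch assumes it means source cost × 1.10; the function name and both source-cost figures are hypothetical. The $0.35 official price is back-derived from the $0.315 table entry (0.315 / 0.90).

```python
from decimal import Decimal

DISCOUNT = Decimal("0.90")       # retail = official reference price x 0.90
MARGIN_FLOOR = Decimal("1.10")   # assumption: floor = source cost x 1.10


def kling_retail(official_price: Decimal, source_cost: Decimal) -> Decimal:
    """Return the billed price: discounted official price, never below the floor."""
    anchored = official_price * DISCOUNT
    floor = source_cost * MARGIN_FLOOR
    return max(anchored, floor)


# kling-2-6, 5s no-audio: 0.35 x 0.90 = 0.315, the headline price.
cheap_upstream = kling_retail(Decimal("0.35"), Decimal("0.20"))
# Hypothetical pricier fallback upstream: the floor (0.29 x 1.10 = 0.319) wins.
fallback_upstream = kling_retail(Decimal("0.35"), Decimal("0.29"))
print(cheap_upstream, fallback_upstream)
```

Under these assumed numbers the fallback price is about 1.3% above the headline price, which is the kind of small bump described above.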
Two new Kling-Avatar SKUs (lip-sync)
kling-avatar-pro (1080p) and kling-avatar-standard (720p) join
the catalog for lip-sync avatars up to 15 seconds. Per-second
billing; lip-sync is currently served by a single upstream only.
Request shape gained resolution, audio, mode, video_url
POST /v1/llm/videos/jobs accepts four new optional fields used by
the Kling parametric pricing path:
- resolution: 720p / 1080p (default) / 4k
- audio: true to enable audio on Kling 2.6 / 3.0 (default false)
- mode: standard / pro / master for Kling 2.1; turbo for 2.5 Turbo Pro
- video_url: reference video for motion-control v2v variants
Non-Kling SKUs (Sora, Veo, Wan, Hailuo, Seedance) ignore the new fields — their flat-price catalog path is unchanged, and existing callers see no regression.
Full reference: LLM API → Video Generation.
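A minimal sketch of building a request body with the four new optional fields. Only resolution, audio, mode and video_url come from the list above; the model and prompt field names, the helper name, and the client-side validation are assumptions for illustration, not the documented request schema.

```python
from typing import Any, Optional

RESOLUTIONS = {"720p", "1080p", "4k"}
MODES = {"standard", "pro", "master", "turbo"}


def build_video_job_payload(
    model: str,
    prompt: str,
    resolution: str = "1080p",
    audio: bool = False,
    mode: Optional[str] = None,
    video_url: Optional[str] = None,
) -> dict[str, Any]:
    """Build a JSON body for POST /v1/llm/videos/jobs (base field names assumed)."""
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if mode is not None and mode not in MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    payload: dict[str, Any] = {
        "model": model,      # assumed field name
        "prompt": prompt,    # assumed field name
        "resolution": resolution,
        "audio": audio,
    }
    # mode and video_url are only sent when the caller sets them.
    if mode is not None:
        payload["mode"] = mode
    if video_url is not None:
        payload["video_url"] = video_url
    return payload


print(build_video_job_payload("kling-2-6", "a cat surfing", audio=True))
```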
Stable error catalog with 29 codes
Every /v1/openai/* and /v1/llm/* failure now returns a structured
error.code you can branch on — streaming_required,
insufficient_balance, model_not_found, rate_limit_per_key,
upstream_timeout, and 24 more. Full reference, recovery strategies,
and the native vs. OpenAI-compat envelope shapes are documented at
LLM API → Error reference.
The OpenAI envelope is now byte-identical to the spec — earlier
deployments emitted a double-wrapped {"error":{"error":{…}}} body
that broke the OpenAI SDK's typed exception parser. That regression
is fixed: the wire format is {"error":{message,type,code,doc_url?,meta?}, status_code, request_id},
which lets OpenAI clients raise BadRequestError, RateLimitError,
AuthenticationError etc. without special-casing.
max_tokens > 4096 without stream: true is now rejected up-front (BREAKING)
A non-streaming request asking for more than 4096 output tokens now
returns HTTP 400 with error.code = "streaming_required". Earlier
deployments silently upgraded the upstream call to streaming and
reassembled the SSE chunks into a non-stream response — that worked
inside SDKs but ate connections on every Cloudflare-fronted client
(browsers, mobile, behind corporate proxies). The new behaviour fails
fast with a clear hint:
{
"error": {
"message": "Requested max_tokens=8000 exceeds the non-streaming limit of 4096. Long responses must use streaming.",
"type": "invalid_request_error",
"code": "streaming_required",
"meta": { "max_tokens": 8000, "limit": 4096, "hint": "stream=true" },
"doc_url": "https://docs.gpuniq.com/llm/long-generations"
}
}
Migration: add "stream": true to the request body. If your client
can't speak SSE, use the job-based long-poll API.
The threshold is configurable per-deployment; the 4096 default is the
clean intersection of "fits in the 100s edge-proxy window" and "below
every model's stream-chunk delivery rate".
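The migration step above (add "stream": true) can be sketched as a small helper that inspects the error body and returns a retry request. The helper name is hypothetical, and the caller is assumed to handle the SSE response of the retried call itself.

```python
from typing import Any, Optional


def streaming_retry_body(
    request_body: dict[str, Any], error_body: dict[str, Any]
) -> Optional[dict[str, Any]]:
    """Return a copy of the request with stream=True if the error asks for it."""
    error = error_body.get("error") or {}
    if error.get("code") != "streaming_required":
        return None   # some other failure: not fixable by switching to streaming
    retry = dict(request_body)   # shallow copy; the original request is untouched
    retry["stream"] = True
    return retry


original = {"model": "example-model", "max_tokens": 8000}   # illustrative body
error = {
    "error": {
        "type": "invalid_request_error",
        "code": "streaming_required",
        "meta": {"max_tokens": 8000, "limit": 4096, "hint": "stream=true"},
    }
}
print(streaming_retry_body(original, error))
```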
Full provider failover chain on non-stream
The non-streaming chat path now iterates the cost-sorted provider chain
top-to-bottom on transient failures (network errors, upstream 5xx,
model_not_found / insufficient_user_quota patterns,
vendor maintenance envelopes). Previously, only the cheapest provider
was tried before falling straight to the safety-net upstream —
the middle entries of the chain were
skipped on failover, costing margin and reliability.
The streaming path now uses the same chain iterator. Both paths emit exactly ONE operator alert when the chain is fully exhausted (was sometimes two per request before).
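The chain iteration described above can be sketched as follows. This is an illustration of the idea, not the router's code: Provider, TransientError, and the alert callback are hypothetical names, and real transient-failure detection is reduced to a single exception type.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class TransientError(Exception):
    """Stand-in for network errors, upstream 5xx, and similar retryable failures."""


@dataclass
class Provider:
    name: str
    cost: float
    call: Callable[[str], str]   # hypothetical: takes a prompt, returns a completion


def complete_with_failover(
    providers: list[Provider],
    prompt: str,
    alert: Callable[[str], None],
) -> Optional[str]:
    """Try providers cheapest-first; alert once if every one of them fails."""
    for provider in sorted(providers, key=lambda p: p.cost):
        try:
            return provider.call(prompt)
        except TransientError:
            continue   # try the next entry instead of jumping to the last one
    alert("provider chain exhausted")   # emitted exactly once per request
    return None
```

Because the loop visits every entry in cost order, the middle providers get a chance before the most expensive safety-net one is used.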
Claude now streams via an Anthropic-native endpoint
For Claude haiku/sonnet/opus 4-5/4-6, requests are now served through an Anthropic-native endpoint with real SSE event streaming (verified end-to-end on opus 4.6 / sonnet 4.6). A redundant upstream was added at the same price point, so if one upstream throttles, another picks up without a price change, which improves reliability.
Sliding-window rate limiter + per-user gate
core/rate_limiter.py is now a true Redis ZSET sliding window — the
previous fixed-bucket implementation let 2× the limit through at the
minute boundary. A per-user aggregate cap (default 600 req/min)
fires independently of the per-key cap and surfaces as
error.code = "rate_limit_per_user" so clients can branch on which
gate triggered.
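To illustrate why a sliding window closes the minute-boundary gap, here is an in-memory sketch of the technique. It is not the code in core/rate_limiter.py: a Redis ZSET version would typically do the same steps with ZREMRANGEBYSCORE, ZCARD and ZADD, and the class name and demo timestamps here are hypothetical.

```python
from collections import deque


class SlidingWindowLimiter:
    """In-memory sliding window: at most `limit` hits in any `window` seconds."""

    def __init__(self, limit: int, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self._hits: deque[float] = deque()   # accepted timestamps, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have slid out of the window.
        while self._hits and self._hits[0] <= now - self.window:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False
        self._hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=3, window=60.0)
# Three hits just before the minute boundary are accepted...
before = [limiter.allow(t) for t in (59.0, 59.5, 59.9)]
# ...and hits just after it are rejected, where a fixed bucket would reset.
after = [limiter.allow(t) for t in (60.0, 60.1)]
print(before, after)
```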
Admin → Streaming Providers tab
A new admin tab under Unit Economics → Streaming Providers shows the live per-model eligibility table: cost / balance / enabled / stream-capable flag / reason for every provider in the chain, plus the actual stream and non-stream fallback chains the router would build right now. Useful when a customer reports an "expected provider X, got provider Y" mismatch.
Server-side image format conversion
POST /v1/llm/images/generations and POST /v1/openai/images/generations
now accept an optional output_format field. The server re-encodes
delivered images into the chosen format before they go over the wire,
so clients no longer need a Pillow / Sharp pipeline of their own.
- png: pass-through (default if omitted), lossless.
- jpeg (alias jpg): ~10× smaller payload; alpha is flattened onto white because JPEG has no transparency.
- webp: ~5× smaller at comparable quality, alpha preserved.
The chosen format is echoed back in usage.output_format and each
delivered image entry carries its mime_type. Conversion failures
degrade to "return source PNG unchanged" so a corrupt encode never
loses an image the upstream already produced.
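A small client-side sketch of how the output_format values above map to the mime_type a caller should expect. The helper name is hypothetical, and rejecting unknown values with ValueError is an assumption about how a client might validate input, not a statement about server behaviour.

```python
from typing import Optional

MIME_TYPES = {"png": "image/png", "jpeg": "image/jpeg", "webp": "image/webp"}
ALIASES = {"jpg": "jpeg"}


def resolve_output_format(requested: Optional[str] = None) -> tuple[str, str]:
    """Map a requested output_format to (canonical_format, mime_type)."""
    fmt = requested or "png"           # png is the default when the field is omitted
    fmt = ALIASES.get(fmt, fmt)        # jpg is accepted as an alias for jpeg
    if fmt not in MIME_TYPES:
        raise ValueError(f"unsupported output_format: {requested!r}")
    return fmt, MIME_TYPES[fmt]


print(resolve_output_format("jpg"))
```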
5-minute upstream timeout across the chain
Long-running reasoning models (Gemini 3 Pro thinking, GPT-5.2 Pro, Claude Opus 4.7 thinking, Sora-2 video) that legitimately run for several minutes were occasionally cut off mid-generation by a tight 60-80 s upstream budget and silently re-routed at full official rate. The whole LLM stack now holds the connection open for up to 5 minutes end-to-end (10 minutes worst-case if the primary fully exhausts and the reliability fallback also takes the full window). In practice we never see anything past ~3 minutes, but you'll never lose a slow honest generation again.
A dead gateway still fast-fails on TCP connect (10 s) and re-routes inside the same request: the 5-minute upper bound is a budget for progress, not a forced wait.
Expanded image model catalog
The image-generation endpoint now serves 11 additional models on top of the Nano Banana family and Grok 4 Image:
| Slug | Price / image | Notes |
|---|---|---|
| gpt-image-2 | $0.0464 | OpenAI image, 1K default |
| gpt-image-1-5 | $0.020 | OpenAI cheaper tier |
| gpt-4o-image | $0.040 | OpenAI 4o image |
| flux-2-pro | $0.060 | Black Forest Labs FLUX.2 Pro 1K |
| flux-2-flex | $0.180 | Premium quality 1K |
| flux-kontext-pro | $0.080 | Text-to-image & edit |
| flux-kontext-max | $0.160 | Premium edit / generation |
| seedream-4 | $0.050 | ByteDance Seedream 4 |
| seedream-4-5 | $0.040 | ByteDance Seedream 4.5 |
| seedream-5-0-lite | $0.035 | ByteDance Seedream 5.0 Lite |
| z-image | $0.020 | Alibaba Z-Image |
Image-to-image (input_images) is supported on the FLUX Kontext,
Seedream, GPT Image and Nano Banana edit slugs.
Gemini 3.1 Flash Lite at the standard discount tier
gemini-3.1-flash-lite joined the Gemini −20% retail tier:
| | Before | Now |
|---|---|---|
| Retail $/MTok | $0.25 / $1.50 | $0.20 / $1.20 |
Existing API keys see the lower price automatically — no migration needed.
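A short arithmetic check of the new rates. It assumes the two figures in each cell are input / output prices per million tokens (the ordering is not stated above), and the token counts in the example are made up.

```python
from decimal import Decimal

MTOK = Decimal(1_000_000)
# Retail $/MTok after the change; the (input, output) ordering is an assumption.
INPUT_PER_MTOK = Decimal("0.20")
OUTPUT_PER_MTOK = Decimal("1.20")


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in USD of one request at the new gemini-3.1-flash-lite retail rates."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / MTOK


# The new rates are the old ones times 0.80 (the 20% discount tier).
assert Decimal("0.25") * Decimal("0.80") == INPUT_PER_MTOK
assert Decimal("1.50") * Decimal("0.80") == OUTPUT_PER_MTOK

# 12,000 input + 3,000 output tokens: 0.0024 + 0.0036 USD.
print(request_cost(12_000, 3_000))
```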
Gemini 3 Pro fallback works again
gemini-3-pro requests now have a working fallback chain (the upstream
alias was previously routing to a missing model and 404-ing). Recovery
is transparent — same slug, same retail price, no client-side change.
In-chat image generation moves to the job-based pattern
POST /v1/llm/chats/{chat_id}/messages with an image model now returns
in under 1 second with type: "image_pending", a job_id, and the dialogue_id
of a placeholder dialogue that already lives in the chat history.
- New endpoint: GET /v1/llm/chats/{chat_id}/image-jobs/{job_id}. Poll every 2-3s until completed (placeholder is rewritten with the image and balance is debited) or failed (placeholder is marked, nothing charged).
- Pro / 4K turns no longer hit a Cloudflare 524 when they cross 100 seconds: the request itself is short, and the image arrives over polling.
- Server-side, polls are coalesced via Redis (2-second TTL), so a client polling every 250ms still costs one upstream call every 2s.
- Both this and the existing standalone /v1/llm/images/jobs flow share the same kickoff / poll / charge code path, so pricing, ownership checks, and balance debit are guaranteed to stay in lockstep.
The recommended polling deadline (client-side) is now 5 minutes for either surface. Examples in the LLM API docs updated.
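The polling pattern above can be sketched as a loop with a client-side deadline. The fetch callable is hypothetical (in practice it would wrap the GET on the image-jobs endpoint), and the status field name is assumed to match the video job response; sleep and clock are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Any, Callable


def poll_job(
    fetch: Callable[[], dict[str, Any]],
    deadline_s: float = 300.0,      # recommended client-side deadline: 5 minutes
    interval_s: float = 2.5,        # poll every 2-3 s
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Call fetch() until the job is completed or failed, or the deadline passes."""
    give_up_at = clock() + deadline_s
    while True:
        job = fetch()
        if job.get("status") in ("completed", "failed"):
            return job
        if clock() + interval_s > give_up_at:
            raise TimeoutError("job did not finish before the polling deadline")
        sleep(interval_s)
```

A caller would pass something like `lambda: http_get_json(job_url)` as fetch, where http_get_json stands for whatever HTTP client is in use.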
New chat models
- gpt-5.5: OpenAI flagship, 4.00 / 24.00 USD per MTok (−20% off API)
- deepseek-v4-pro: flagship, 2.40 / 4.00
- deepseek-v4-flash: fast tier, 0.18 / 0.30
LLM chat + image generation in the SDK and CLI
The gpuniq Python package (pip install -U gpuniq) now ships chat and image-generation helpers, plus matching gg subcommands so you never need to leave the terminal.
Python SDK
- client.llm.generate_image(prompt, model, n, size, quality, input_images, save_to): synchronous text-to-image / image-to-image.
- client.llm.start_image_job / get_image_job / generate_image_async: job-based surface for Nano Banana (polls automatically, emits on_progress callbacks, streams past proxy read-timeouts).
- input_images accepts local paths, data: URLs, https:// URLs, raw bytes, or bare base64; the SDK inlines local files as data URLs for you.
- save_to accepts a filename (single image) or a directory (many), decodes b64_json and writes PNG(s). The list of written paths is returned as saved_paths.
- client.llm.default_model() and model_catalog() expose the platform default and pricing metadata.
- Dropped stale purchase_tokens / convert_rubles_to_tokens / packages: LLM and image usage are billed directly in USD from user.balance.
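For readers curious what "inlines local files as data URLs" amounts to, here is a standard-library sketch of the general technique. It is not the SDK's own implementation; the function name and the fallback MIME type are choices made for this example.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str) -> str:
    """Encode a local file as a base64 data: URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"   # fallback when the extension is unknown
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The resulting string can be passed anywhere a data: URL is accepted, which is one of the input_images forms listed above.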
CLI
- gg llm "prompt": one-shot chat, prints the answer plus tokens / cost / balance.
- gg llm: interactive REPL with /exit, /clear.
- gg image "prompt": generates image(s), saves PNG(s) to disk, prints paths + cost + balance. Auto-uses the async-poll path for Nano Banana slugs.
CLI polish pass
- gg help now works (alias for gg --help). Typos like gg oders print a clear "Error: unknown command '<x>'." plus full help instead of the confusing "gg not initialized" message; the shell-fallback only kicks in when a GPU-side gg init config actually exists.
- 2D GPU picker: arrow-key navigation through a matrix laid out by generation (Datacenter · 50XX · 40XX · 30XX · 20XX · 1660) with Any GPU / Other… meta rows.
- Templates: docker image presets (PyTorch, ComfyUI, vLLM, Ubuntu VM, Custom). gg rent picks PyTorch by default; gg replace defaults to the old instance's image so on-disk data keeps working.
- gg open always goes via ssh.gpuniq.com; the CLI calls a new POST /v1/instances/{id}/ssh-proxy/ensure to allocate a proxy port on demand for older orders whose allocation failed at order time.
- gg replace fully destroys the old instance (DELETE, not just stop) so the provider machine and SSH proxy port are released before the new one is placed.
- Billing plans trimmed to week (default), month, and minute. Hourly/daily billing is no longer exposed in the rental UI.
gg rent and gg replace
- gg rent: interactive GPU rental from the terminal. Filter wizard (GPU model picker, min count, max price, verified, sort), full-width marketplace table with n/p/f/s controls, template picker, volume picker (pick existing / create new / skip), confirm, place order. On HTTP 410 (offer taken mid-flow) the picker loops back without losing plan / volume choices.
- gg replace <id>: swap the GPU on a running instance. Same picker, same filters. Preserves the original billing plan and volume; defaults the Docker image to whatever the old instance was running.
- Adaptive table: columns resize with the terminal. Narrow: GPU / VRAM / RAM / DISK / LOCATION / RELIA / PRICE / VER. Wide: adds CPU, NET ↓/↑, AVAIL, CPU MODEL, HOSTING. GPU and LOCATION flex-share leftover width.
- gg status recognises both gg login (client) and gg init (GPU) configs and shows a combined status view.
- --image, --disk, --gpu, --count, --max-price, --sort, --pricing, --volume-id, --no-volume, --verified flags on gg rent / gg replace for non-interactive use.
- OrderOfferGone typed exception and richer FastAPI error parsing (dict / list / string detail shapes): you see "order failed (400): docker_image is required" instead of "400 Client Error: Bad Request for url".
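The three detail shapes mentioned in the last item can be handled along these lines. This is a sketch of the idea rather than the CLI's parser: the function name is hypothetical, and the "message" key looked up in dict-shaped details is an assumption. The list branch follows the usual FastAPI validation-error layout of objects with loc, msg and type.

```python
from typing import Any


def format_order_error(status_code: int, body: Any) -> str:
    """Turn a FastAPI-style error body into a one-line message."""
    detail = body.get("detail") if isinstance(body, dict) else body
    if isinstance(detail, list):
        # Validation errors: a list of {"loc": [...], "msg": "...", "type": "..."}
        parts = [
            item.get("msg", str(item)) if isinstance(item, dict) else str(item)
            for item in detail
        ]
        message = "; ".join(parts)
    elif isinstance(detail, dict):
        # Custom dict details: prefer a "message" key if present (assumed key name)
        message = str(detail.get("message", detail))
    else:
        message = str(detail)
    return f"order failed ({status_code}): {message}"


print(format_order_error(400, {"detail": "docker_image is required"}))
```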
Image Generation
- Text-to-image and image-to-image via the OpenAI-compatible /v1/openai/images/generations and native /v1/llm/images/generations endpoints.
- Four new models: nano-banana ($0.0312/img), nano-banana-pro ($0.1072/img), nano-banana-pro-4k ($0.192/img), grok-4-image ($0.0352/img).
- Reference photos: attach up to 4 photos with any prompt; Nano Banana handles the rest for image editing and style transfer.
- Flat per-image billing: you pay only for delivered images. If the upstream content-policy rejects a frame you're charged only for what arrived.
- New in-app studio at /chat: pick any image model and the composer switches to a text+photos panel with a gallery view.
OpenAI-Compatible LLM Endpoint
- Drop-in OpenAI API at /v1/openai/chat/completions and /v1/openai/models: every field of the OpenAI Chat Completions protocol is forwarded unchanged (tools, tool_choice, response_format, logprobs, seed, streaming).
- Works with Claude Code (via LiteLLM proxy), Cursor, Continue.dev, Aider, LiteLLM, and the official OpenAI Python / JS SDKs; no code changes required.
- Byte-identical SSE streaming: plug directly into OpenAI SDK's streaming parser.
- Authenticates with your existing GPUniq API key via Authorization: Bearer gpuniq_....
Expanded Model Catalog
- Anthropic Claude — Opus 4.7 / 4.6 / 4.5, Sonnet 4.6 / 4.5, Haiku 4.5 (now the platform default).
- OpenAI GPT-5 family — GPT-5.2 Pro / Codex, GPT-5.1 Codex Max, GPT-5, o3 / o3-mini / o4-mini, GPT-4o, GPT-4.1.
- Google Gemini — Gemini 3 Pro / Flash, Gemini 2.5, Nano Banana.
- xAI Grok — Grok 4, Grok 4.1 Thinking, Grok 4 Fast.
- Premium models priced 20% below vendor list price; 30+ free-tier community models remain available.
CLI Tool (gg)
- Command checkpointing: Every command run via gg is saved with full output, exit code, and timing
- Replay on restart: gg replay re-runs interrupted commands after instance restarts
- PTY support: Full terminal colors, progress bars, and interactive prompts
- Backend sync: Checkpoints are synced to the GPUniq backend automatically
- 6 commands: gg init, gg run, gg list, gg logs, gg replay, gg status
- Shorthand syntax: gg python train.py works the same as gg run python train.py
Python SDK v2.0
- Full Python SDK: pip install GPUniq; all endpoints accessible via GPUniq client
- Universal API Key Auth: API keys now work across all endpoints (marketplace, instances, volumes, LLM, payments, settings)
- Rate Limiting: 120 req/min per API key with auto-retry in SDK
- 8 SDK Modules: marketplace, gpu_cloud, burst, instances, volumes, llm, payments, settings
- Backward compatible: v1.x gpuniq.init() and client.request() still work
GPU Dex-Cloud
- Deploy by GPU type: Pick a GPU model, the platform finds the best server automatically
- One-call deployment: client.gpu_cloud.deploy(gpu_name="RTX_4090")
- Pricing API: Check pricing before deploying
GPU Burst
- Multi-GPU burst orders: Scale to 100 GPUs in a single order
- Fallback GPUs: Define alternative GPU types with price caps
- Cost estimation: Estimate burst cost before committing
- Per-order billing: Transaction and run history per order
Persistent Volumes
- Persistent storage: Create volumes that survive instance restarts
- File management: Upload, download, list, and delete files via API/SDK
- Attach to any instance: Mount volumes at creation time across all deployment modes
- Sync logs: Monitor volume sync operations
LLM API
- Unified LLM access: OpenAI, Qwen, DeepSeek, Llama, Mistral, and more
- Chat sessions: Persistent conversations with message history
- Token management: Balance, packages, usage history
- Terminal commands: Generate shell commands from natural language
Settings
- SSH key management: Add, update, toggle, delete SSH keys via API
- Per-instance SSH keys: Attach/detach keys to individual instances
- Telegram notifications: Link Telegram for status alerts
New Features
- GPU Analytics Dashboard: Market analytics with price tracking and availability trends
- Enhanced AI Recommendations: Improved GPU suggestions based on usage history
- Async Order Creation: Reliable order creation with job status polling
- SLA Monitoring: Real-time uptime tracking per instance
Improvements
- Faster marketplace loading with optimized queries
- Improved SSH connection reliability
- Enhanced provider verification with reliability scoring
New Features
- USDT Payments (TRC20): Deposit and withdraw using Tether on Tron network
- Flexible Pricing Types: Hourly, daily, weekly, and monthly billing
- Provider Dashboard: GPU owners can manage listings and track earnings
New Features
- AI GPU Recommendations: Natural language GPU suggestions
- Task History API: Rental history with detailed statistics
- Session Management: View and manage active sessions
New Features
- Multi-GPU Support: Rent machines with multiple GPUs
- Provider Agent: Connect your GPUs to the marketplace
- Heartbeat Monitoring: Real-time provider availability tracking
Initial Launch
- GPU Marketplace: Browse and rent GPUs from providers worldwide
- Instant SSH Access: Get credentials immediately after rental
- YooKassa Payments: Russian payment methods
- User Dashboard: Manage rentals, balance, and usage
- REST API: Full API access for automation
Upgrade Instructions
Python SDK: pip install --upgrade GPUniq
The API is backward compatible. Existing integrations continue to work.