Serverless GPU inference
on AMD GPUs
Deploy any HuggingFace or Docker model on AMD's largest-VRAM GPUs. 192 GB per card, scale to zero, OpenAI-compatible.
Currently working with a small group of early design partners.
Any inference server in a Docker image — vLLM, HuggingFace TGI, Triton, or custom. Expose an HTTP port.
Push your image, pick your GPU config. Inferix provisions an AMD GPU VM and routes HTTPS traffic.
POST to your function URL. Replicas spin up on demand, scale to zero when idle. Pay per second of active compute.
# 1. Deploy — any OpenAI-compatible inference server
curl -X POST https://api.inferix.dev/v1/functions \
-H "X-Api-Key: $INFERIX_KEY" \
-d '{
"name": "llama3-70b",
"image": "vllm/vllm-openai:latest",
"gpu_count": 1,
"env_vars": {
"MODEL": "meta-llama/Meta-Llama-3-70B-Instruct",
"ROCM_VISIBLE_DEVICES": "0"
}
}'
# 2. Invoke — standard OpenAI chat format
curl -X POST https://api.inferix.dev/v1/functions/fn-abc123/invoke \
-H "X-Api-Key: $INFERIX_KEY" \
-d '{"payload": {"messages": [{"role": "user", "content": "Hello"}]}}'
Built for serious AI workloads
AMD GPUs offer some of the highest VRAM available — built for models that don't fit elsewhere.
192 GB VRAM per GPU
Run Llama-3 70B, Mixtral 8×22B, or Stable Diffusion XL on a single AMD GPU. No tensor parallelism needed for most frontier models.
Any framework, any image
vLLM, HuggingFace TGI, Triton Inference Server, ONNX Runtime, or your own FastAPI server. If it runs in Docker and exposes HTTP, Inferix can host it.
Scale to zero
Set min_replicas: 0 and replicas shut down when traffic stops. Cold start in seconds. You pay nothing while idle.
Pay per second of compute
Billed per second of actual GPU time consumed. No subscriptions, no seat licenses, no idle charges. Pricing tailored to workload — let's talk.
RPS-based autoscaling
Inferix watches requests-per-second and scales replicas between your min and max automatically. Handle traffic spikes without pre-provisioning.
ROCm-native
AMD GPUs run ROCm — the open-source GPU computing stack. Full HIP support, compatible with most CUDA-based images via HIPify. PyTorch, JAX, and ONNX work out of the box.
AMD GPU — built for large models
The most memory-dense GPU available. Load larger models without quantisation.
| Model | Size | VRAM needed | Fits on one GPU? |
|---|---|---|---|
| Llama-3 8B | 8B | ~16 GB | ✓ Yes |
| Llama-3 70B | 70B | ~140 GB | ✓ Yes |
| Mixtral 8×7B | 47B | ~94 GB | ✓ Yes |
| Mixtral 8×22B | 141B | ~282 GB | Use 2× GPU |
| Stable Diffusion XL | — | ~6 GB | ✓ Yes |
| Flux.1 dev | 12B | ~24 GB | ✓ Yes |
Who is this for?
Inferix is built for teams that have outgrown OpenAI economics or need infrastructure they control.
Llama, Qwen, Mistral, DeepSeek — if you're hosting your own model, 192 GB of single-GPU VRAM lets you skip the multi-GPU tensor-parallel headache.
Whisper + LLM + TTS on a single dedicated GPU. Predictable TTFT. No vendor cloud carries your audio data.
Single-tenant deployments where customer data must not flow through OpenAI or Anthropic infrastructure. Healthcare, finance, regulated industries.
If you're spending $5K+/month on per-token APIs, self-hosted inference on AMD GPUs usually pays for itself within weeks.
Bring your own LoRA, full fine-tune, or hand-rolled architecture. Any HuggingFace or HIP-compatible model works.
Stable Diffusion XL, Flux, Whisper, CLIP — AMD GPUs handle non-LLM workloads with the same per-second economics.
FAQ
Answers to the things people ask first.
For a warm replica, requests start instantly. For a cold GPU (replica spinning up from zero), expect 2–5 minutes depending on model size and container size. Keep min_replicas: 1 if you need always-warm latency.
Most PyTorch and HuggingFace workloads run unchanged on ROCm. Custom CUDA kernels need HIPify (usually automatic). vLLM, TGI, and Triton all ship official ROCm builds.
Any model with a Docker image and HTTP endpoint. Llama, Qwen, Mistral, DeepSeek, Mixtral, Stable Diffusion, Flux, Whisper, custom — if it runs on a GPU with ROCm, Inferix can host it.
Billed per second of GPU compute. Final pricing depends on your model, workload, and concurrency profile — we tune the offer to fit. Email us with rough numbers and we'll quote.
Yes — dedicated AMD GPU deployments are available for production workloads with isolation or compliance requirements. Hardware-isolated, single-tenant, predictable billing.
We're working with a small group of design partners right now. If your use case fits, we onboard one customer at a time so we can give each the attention they need. Reply timeframe: ~1 week.
Built by a small team focused on giving serious AI workloads an AMD-native alternative to NVIDIA-locked inference platforms. Solo founder with deep infrastructure background — see "About" below.
About
Inferix exists because AMD's flagship inference GPUs are some of the most under-utilised hardware in the market today — significantly more VRAM than comparable NVIDIA cards, comparable bandwidth, available at substantially lower cost. The catch was that most serverless inference platforms only support NVIDIA.
We built Inferix to make AMD GPUs as easy to deploy on as Modal or Replicate make NVIDIA. Bring your Docker image, choose your config, get an HTTPS endpoint. The infrastructure complexity (ROCm, vLLM tuning, scaling, billing) stays on our side.
We're in private beta and working with one customer at a time. If you're building something where AMD-native economics or single-tenant isolation matter, we'd like to hear about it.
Interested in early access?
We're talking to a small group of design partners. If your workload fits, we'd love to hear about it.
Reply timeframe: ~1 week · No commitment, no card on file