Private beta · AMD GPU infrastructure

Serverless GPU inference
on AMD GPUs

Deploy any HuggingFace or Docker model on AMD's largest-VRAM GPUs. 192 GB per card, scale to zero, OpenAI-compatible.

Currently working with a small group of early design partners.

Get in touch →

How it works

Step 1

Package your model

Any inference server in a Docker image — vLLM, HuggingFace TGI, Triton, or custom. Expose an HTTP port.
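
For the "custom" option, a minimal sketch of an HTTP inference server using only the Python standard library might look like this. The `run_model` stand-in, the `{"prompt": ...}` request shape, and port 8000 are assumptions for illustration, not Inferix requirements.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"echo: {prompt}"


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON request body.
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        # Run the (stand-in) model and return a JSON response.
        reply = json.dumps({"output": run_model(body.get("prompt", ""))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # silence per-request logging


def make_server(port: int = 8000) -> HTTPServer:
    # Bind to all interfaces so the container's published port is reachable.
    return HTTPServer(("0.0.0.0", port), InferenceHandler)
```

Calling `make_server().serve_forever()` from the container's entrypoint would start the server on the exposed port.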

Step 2

Deploy to Inferix

Push your image, pick your GPU config. Inferix provisions an AMD GPU VM and routes HTTPS traffic.

Step 3

Call your endpoint

POST to your function URL. Replicas spin up on demand, scale to zero when idle. Pay per second of active compute.

Deploy an OpenAI-compatible endpoint — 2 API calls

# 1. Deploy: any OpenAI-compatible inference server
curl -X POST https://api.inferix.dev/v1/functions \
  -H "X-Api-Key: $INFERIX_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama3-70b",
    "image": "vllm/vllm-openai:latest",
    "gpu_count": 1,
    "env_vars": {
      "MODEL": "meta-llama/Meta-Llama-3-70B-Instruct",
      "HIP_VISIBLE_DEVICES": "0"
    }
  }'

# 2. Invoke: standard OpenAI chat format
curl -X POST https://api.inferix.dev/v1/functions/fn-abc123/invoke \
  -H "X-Api-Key: $INFERIX_KEY" \
  -H "Content-Type: application/json" \
  -d '{"payload": {"messages": [{"role": "user", "content": "Hello"}]}}'
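
The same two calls can be sketched in Python with only the standard library. The endpoints, header name, and core fields are taken from the curl example above; the `build_request` helper is a hypothetical name, and the response format is not documented here, so the sketch only builds the requests without sending them.

```python
import json
import os
import urllib.request

API_BASE = "https://api.inferix.dev/v1"  # base URL from the curl example


def build_request(path: str, body: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the API key."""
    return urllib.request.Request(
        url=f"{API_BASE}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


api_key = os.environ.get("INFERIX_KEY", "")

# 1. Deploy: same core fields as the curl example.
deploy = build_request(
    "/functions",
    {
        "name": "llama3-70b",
        "image": "vllm/vllm-openai:latest",
        "gpu_count": 1,
        "env_vars": {"MODEL": "meta-llama/Meta-Llama-3-70B-Instruct"},
    },
    api_key,
)

# 2. Invoke: the function ID is the placeholder used in the curl example.
invoke = build_request(
    "/functions/fn-abc123/invoke",
    {"payload": {"messages": [{"role": "user", "content": "Hello"}]}},
    api_key,
)

# To send either request: urllib.request.urlopen(deploy)
```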

Scales to zero when idle: no charge between requests

ROCm · HIP · PyTorch

AMD GPU

Among the largest-VRAM inference GPUs available

Run Llama-3 70B, Mixtral 8×7B, Stable Diffusion XL, and more on a single GPU — no quantisation, no tensor parallelism needed.

Fully ROCm-native. Works with PyTorch, vLLM, HuggingFace TGI, ONNX Runtime, and any HIP-compatible framework out of the box.

Built for serious AI workloads

AMD GPUs offer some of the highest VRAM available — built for models that don't fit elsewhere.

192 GB VRAM per GPU

Run Llama-3 70B, Mixtral 8×7B, or Stable Diffusion XL on a single AMD GPU. Models whose fp16 weights fit in 192 GB need no tensor parallelism; larger ones such as Mixtral 8×22B span two GPUs.

Any framework, any image

vLLM, HuggingFace TGI, Triton Inference Server, ONNX Runtime, or your own FastAPI server. If it runs in Docker and exposes HTTP, Inferix can host it.

Scale to zero

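Set min_replicas: 0 and replicas shut down when traffic stops. The first request after an idle period triggers a cold start, typically 2–5 minutes depending on model and container size. You pay nothing while idle.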

Pay per second of compute

Billed per second of actual GPU time consumed. No subscriptions, no seat licenses, no idle charges. Pricing tailored to workload — let's talk.
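
As a rough sketch of the arithmetic, per-second billing means cost tracks active seconds only. The rate below is a made-up placeholder (real pricing is quoted per workload), and multiplying by GPU count is an assumption about how multi-GPU deployments are metered.

```python
def compute_cost(active_seconds: float, rate_per_gpu_second: float,
                 gpu_count: int = 1) -> float:
    """Cost of active compute; idle (scaled-to-zero) time contributes nothing."""
    if active_seconds < 0 or rate_per_gpu_second < 0 or gpu_count < 1:
        raise ValueError("inputs must be non-negative, with at least one GPU")
    return active_seconds * rate_per_gpu_second * gpu_count


# Placeholder rate purely for illustration; not a real Inferix price.
PLACEHOLDER_RATE = 0.001  # currency units per GPU-second (hypothetical)

# 90 minutes of active compute in a day; the remaining idle hours cost nothing.
daily_cost = compute_cost(active_seconds=90 * 60,
                          rate_per_gpu_second=PLACEHOLDER_RATE)
print(f"{daily_cost:.2f}")
```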

RPS-based autoscaling

Inferix watches requests-per-second and scales replicas between your min and max automatically. Handle traffic spikes without pre-provisioning.
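
One common way to implement this kind of scaling rule is sketched below. It is an illustration of the general technique, not Inferix's actual algorithm; `target_rps_per_replica` and `max_replicas` are hypothetical parameter names.

```python
import math


def desired_replicas(rps: float, target_rps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to serve `rps`, clamped to the configured bounds."""
    if target_rps_per_replica <= 0:
        raise ValueError("target_rps_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    # Round up so a partially loaded replica still gets provisioned.
    needed = math.ceil(rps / target_rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


print(desired_replicas(rps=0, target_rps_per_replica=5, min_replicas=0, max_replicas=4))
print(desired_replicas(rps=12, target_rps_per_replica=5, min_replicas=0, max_replicas=4))
print(desired_replicas(rps=100, target_rps_per_replica=5, min_replicas=0, max_replicas=4))
```

With no traffic and a minimum of zero the result is zero replicas; under a spike the count is capped at the maximum.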

ROCm-native

AMD GPUs run ROCm, the open-source GPU computing stack. Full HIP support; most CUDA source code can be ported with HIPify. PyTorch, JAX, and ONNX Runtime work out of the box.

AMD GPU — built for large models

Among the most memory-dense GPUs available. Load larger models without quantisation.

192 GB

HBM3 VRAM

5.3 TB/s

Memory bandwidth

1536 GB

8-GPU node total

ROCm

Open GPU stack

What fits on a single AMD GPU (fp16, no quantisation)

Model	Parameters	VRAM needed (fp16 weights)	Fits on one GPU?
Llama-3 8B	8B	~16 GB	✓ Yes
Llama-3 70B	70B	~140 GB	✓ Yes
Mixtral 8×7B	47B	~94 GB	✓ Yes
Mixtral 8×22B	141B	~282 GB	Use 2× GPU
Stable Diffusion XL	—	~6 GB	✓ Yes
Flux.1 dev	12B	~24 GB	✓ Yes
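
The figures in the table follow from simple arithmetic: fp16 stores each parameter in 2 bytes, so weights take roughly 2 GB per billion parameters. A small sketch of that estimate is below; the function names are illustrative, and the result covers weights only (KV cache and activations need additional headroom).

```python
import math

GPU_VRAM_GB = 192          # per-GPU VRAM quoted above
BYTES_PER_PARAM_FP16 = 2   # fp16 uses 2 bytes per parameter


def weights_gb(params_billions: float) -> float:
    """Approximate fp16 weight size in GB (1e9 params * 2 bytes = 2 GB)."""
    return params_billions * BYTES_PER_PARAM_FP16


def gpus_needed(params_billions: float) -> int:
    """Minimum GPU count whose combined VRAM holds the weights."""
    return math.ceil(weights_gb(params_billions) / GPU_VRAM_GB)


for name, size in [("Llama-3 70B", 70), ("Mixtral 8x7B", 47), ("Mixtral 8x22B", 141)]:
    print(name, weights_gb(size), gpus_needed(size))
```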

Who is this for?

Inferix is built for teams that have outgrown OpenAI economics or need infrastructure they control.

Teams running open-weight LLMs in production

Llama, Qwen, Mistral, DeepSeek — if you're hosting your own model, 192 GB of single-GPU VRAM lets you skip the multi-GPU tensor-parallel headache.

Voice / real-time AI applications

Run Whisper + LLM + TTS on a single dedicated GPU with predictable time to first token (TTFT). Your audio never passes through a third-party model API.

Enterprises with data residency requirements

Single-tenant deployments where customer data must not flow through OpenAI or Anthropic infrastructure. Healthcare, finance, regulated industries.

High-volume token workloads

If you're spending $5K+/month on per-token APIs, self-hosted inference on AMD GPUs can pay for itself within weeks.

Fine-tuned or custom model deployments

Bring your own LoRA, full fine-tune, or hand-rolled architecture. Any HuggingFace or HIP-compatible model works.

Image, audio, and multimodal pipelines

Stable Diffusion XL, Flux, Whisper, CLIP — AMD GPUs handle non-LLM workloads with the same per-second economics.

FAQ

Answers to the things people ask first.

What's the cold start latency?

A warm replica serves requests immediately. A cold start (a replica spinning up from zero) takes 2–5 minutes depending on model size and container size. Set min_replicas: 1 if you need always-warm latency.
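
On the client side, a cold start is usually handled by retrying with backoff until the replica is up. The sketch below assumes the request fails with a connection-level error (`OSError` and its subclasses) while the replica is starting; how Inferix actually signals a cold replica is not specified here, so treat that as an assumption. The 300-second default mirrors the upper end of the range above.

```python
import time


def call_with_cold_start_retry(send, timeout_s=300.0, initial_delay_s=2.0,
                               max_delay_s=30.0, sleep=time.sleep,
                               clock=time.monotonic):
    """Call `send()` until it succeeds or `timeout_s` elapses.

    `send` is any zero-argument callable that performs the request and
    raises OSError (e.g. a connection error) while the replica is cold.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        try:
            return send()
        except OSError as exc:
            # Give up if the next wait would run past the deadline.
            if clock() + delay > deadline:
                raise TimeoutError("replica did not become ready in time") from exc
            sleep(delay)
            # Exponential backoff, capped so polling stays reasonably frequent.
            delay = min(delay * 2, max_delay_s)
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without real waiting.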

Does my CUDA code work on AMD?

Most PyTorch and HuggingFace workloads run unchanged on ROCm. Custom CUDA kernels need HIPify (usually automatic). vLLM, TGI, and Triton all ship official ROCm builds.
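
To see why most PyTorch code runs unchanged: ROCm builds of PyTorch expose the GPU through the same `torch.cuda` API, and set `torch.version.hip` to identify themselves. A quick check, as a sketch (the function name is illustrative, and it degrades gracefully if PyTorch is not installed):

```python
def describe_backend() -> str:
    """Report which GPU backend the installed PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    # ROCm builds set torch.version.hip; CUDA builds leave it as None.
    if getattr(torch.version, "hip", None):
        return "rocm"
    if getattr(torch.version, "cuda", None):
        return "cuda"
    return "cpu"


print(describe_backend())
```

On a ROCm build, existing calls such as `torch.cuda.is_available()` and `tensor.to("cuda")` target the AMD GPU without code changes.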

What models can I run?

Any model with a Docker image and HTTP endpoint. Llama, Qwen, Mistral, DeepSeek, Mixtral, Stable Diffusion, Flux, Whisper, custom — if it runs on a GPU with ROCm, Inferix can host it.

How does pricing work?

Billed per second of GPU compute. Final pricing depends on your model, workload, and concurrency profile — we tune the offer to fit. Email us with rough numbers and we'll quote.

Can I get a dedicated GPU?

Yes — dedicated AMD GPU deployments are available for production workloads with isolation or compliance requirements. Hardware-isolated, single-tenant, predictable billing.

When can I start?

We're working with a small group of design partners right now. If your use case fits, we onboard one customer at a time so we can give each the attention they need. Reply timeframe: ~1 week.

Who's behind Inferix?

Built by a small team focused on giving serious AI workloads an AMD-native alternative to NVIDIA-locked inference platforms. Solo founder with deep infrastructure background — see "About" below.

About

Inferix exists because AMD's flagship inference GPUs are among the most under-utilised hardware on the market today: they offer significantly more VRAM than comparable NVIDIA cards and comparable bandwidth, at substantially lower cost. The catch is that most serverless inference platforms support only NVIDIA.

We built Inferix to make AMD GPUs as easy to deploy on as Modal or Replicate make NVIDIA. Bring your Docker image, choose your config, get an HTTPS endpoint. The infrastructure complexity (ROCm, vLLM tuning, scaling, billing) stays on our side.

We're in private beta and working with one customer at a time. If you're building something where AMD-native economics or single-tenant isolation matter, we'd like to hear about it.

Interested in early access?

We're talking to a small group of design partners. If your workload fits, we'd love to hear about it.

Reply timeframe: ~1 week · No commitment, no card on file

Apply for early access →
