Ollama vs vLLM vs LocalAI: Self-Hosted LLM Inference on a VPS in 2026
When I started looking at self-hosted alternatives for the AI features inside our internal stack β specifically SmartExam (an AI question generator), DocSumm (a long-document summarizer), and BizChat (a revenue assistant chatbot we built at Warung Digital Teknologi) β the OpenAI API bill had quietly climbed past USD 400/month across roughly 50,000 short completions. That is not catastrophic, but it is also the price of two solid VPS rigs running 24/7, and we already operate seven aggregator sites on Hostinger + several VPS nodes for client projects. Self-hosting at least the cheaper completions started to look very reasonable.
The three names that keep coming up in the self-hosted LLM inference conversation are Ollama, vLLM, and LocalAI. They are all open source. They all expose an OpenAI-compatible API. They all run on a GPU VPS. But once you actually deploy them, you find out very quickly that they are solving different problems and that picking the wrong one will either burn money or fall over the moment a second user hits the endpoint.
This is a hands-on comparison based on what I learned standing each of them up on a rented GPU box, plus the published 2026 benchmark numbers from the community. I will tell you what each tool is good at, where the throughput ceilings sit, what a realistic GPU VPS looks like in 2026, and which one I would put behind a production app.
The three tools at a glance
Before we go deep, here is the one-paragraph summary for each. If you read nothing else, read this:
- Ollama β A friendly wrapper around
llama.cpp. Single-binary install, pulls quantized models with one command, sequential request queue. Brilliant for a developer laptop or a small internal tool. Falls apart under concurrent traffic. - vLLM β A production inference server built around PagedAttention and continuous batching. Needs an NVIDIA GPU with CUDA, eats more VRAM, but serves many concurrent users without latency collapse. This is what you put in front of paying customers.
- LocalAI β An orchestration layer that exposes an OpenAI-compatible endpoint and routes to multiple backends (
llama.cpp, vLLM, Whisper, Stable Diffusion, embeddings, TTS). Best when you need one URL that handles text, audio, images, and embeddings in a single deployment.
The mistake people make is treating these as direct alternatives. They are not. Ollama and vLLM are competing inference engines. LocalAI is a router that can sit in front of either one (or its own bundled backends). I will come back to this distinction at the end when I lay out the decision matrix.
What a 2026 GPU VPS actually looks like
Before picking a tool, you have to be honest about hardware. The fundamental constraint for LLM serving is VRAM, not raw FLOPS. Here are the rough numbers I worked with when sizing our setup:
- Llama 3.1 8B at FP16 β about 16 GB of weights, plus 2β4 GB of KV cache headroom. Fits on a 24 GB RTX 4090 or RTX 3090.
- Llama 3.1 8B at INT4 (Q4_K_M) β about 5 GB of weights. Runs on a 12 GB RTX 3060 or even on CPU with degraded throughput.
- Llama 3.1 70B at FP16 β roughly 140 GB. You need either an H100 80GB pair, an A100 80GB pair, or a quantized version.
- Llama 3.1 70B at INT4 β roughly 35β40 GB for weights. Fits on a single A100 80GB or dual RTX 4090.
From what I have seen pricing at in May 2026, a single RTX 4090 GPU VPS on Hetzner or one of the GPU-specialist hosts (RunPod, TensorDock, Lambda, Vast) runs USD 0.30β0.50/hour for on-demand and around USD 180β280/month if you commit. A dedicated H100 box is still in the USD 1.80β2.50/hour range. For most small-to-medium SaaS workloads, a single RTX 4090 24GB box with an 8B or quantized 13B model handles a surprising amount of traffic if you pick the right server.
Across the projects we have shipped at wardigi.com β close to fifty over the last eleven years β I have not yet seen a small-team SaaS workload that genuinely needed an H100. The 70B-class models are nice, but for chat assistants, summarizers, and structured extraction, an 8B or 13B model with a good system prompt is usually within 5β10% of what a frontier model gives you, and self-hosting it is 80β90% cheaper at our usage levels.
Ollama: the path of least resistance
Installing Ollama is genuinely one command:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama serve
That is it. You now have an OpenAI-compatible endpoint on http://localhost:11434/v1 serving Llama 3.1 8B. It auto-detects whether you have NVIDIA, AMD ROCm, or Apple Silicon and uses the right backend. For a developer prototype, nothing else comes close.
When I tested Ollama on a single RTX 4090 VPS with Llama 3.1 8B at Q4_K_M, single-stream output came in around 110β130 tokens/sec, which is more than fast enough for a chat UI. Time-to-first-token on a 1,500-token prompt sat around 220β280 ms. Memory footprint was 7.8 GB of VRAM. Honestly, sitting in front of that endpoint, you would have a hard time telling it apart from GPT-4o-mini for short responses.
The wheels come off at concurrency. Ollama uses a sequential request queue by default. When I fired five concurrent requests at it (using hey with 5 workers), aggregate throughput stayed almost flat at around 130 tok/s total β meaning each request was serialized and the per-user experience dropped to roughly 26 tok/s. At ten concurrent requests, the queue depth caused tail latency to balloon past 8 seconds for the 99th percentile.
The community benchmarks I cross-checked tell the same story: at fifty concurrent users, vLLM delivers roughly 20Γ the aggregate throughput of Ollama with a fraction of the tail latency. Ollama is not engineered for that load and never claimed to be. It is the right answer for: a single-developer copilot, an internal tool with under 5 simultaneous users, a Raspberry Pi / Mac mini experiment, or any case where AMD GPUs are involved (Ollama's ROCm support is the most mature of the three).
vLLM: the throughput king
vLLM is the opposite philosophy. It assumes you have an NVIDIA GPU with CUDA 11.8 or newer, that you are serving an API, and that you care about packing as many concurrent users onto one card as physically possible. Its two big ideas:
- PagedAttention β Treats the KV cache like virtual memory pages. Lets requests share GPU memory without the giant pre-allocations that classic transformers do. Reduces memory fragmentation by 50%+ and lets the server fit far more in-flight sequences.
- Continuous batching β As soon as one request in the batch finishes, a new one is slotted in. Classic batching waits for the whole batch to finish; continuous batching keeps the GPU saturated.
The cost is setup complexity. You need a real NVIDIA GPU (not Apple Silicon, not AMD without extra work), correct CUDA + Python versions, and a basic understanding of --gpu-memory-utilization and --max-model-len flags. A minimal launch looks like:
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--port 8000
On the same RTX 4090 box, vLLM with Llama 3.1 8B unquantized in FP16 sustained around 2,400 tokens/sec aggregate throughput across 50 concurrent users β versus Ollama's ~130 tok/s in the same scenario. Single-stream latency was actually slightly slower than Ollama at low concurrency (vLLM's overhead is bigger when there is nothing to batch), but the curve never bends β adding users barely moves per-request latency until you hit the VRAM ceiling.
The published benchmark numbers go further: specialized vLLM deployments have hit 793 TPS where Ollama was at 41 TPS in head-to-head tests. PagedAttention plus continuous batching typically delivers 2β4Γ the throughput of naive serving for the same hardware.
The tradeoffs to know:
- VRAM is hungrier. vLLM reserves a big chunk of VRAM up front for the KV cache pool. On a 24 GB card you will not fit much beyond an 8B FP16 model unless you quantize (AWQ or GPTQ work well in vLLM).
- Model format matters. vLLM loves Hugging Face safetensors. GGUF support exists but is not the happy path.
- Cold starts are slower. Loading a model into vLLM takes 30β90 seconds versus Ollama's lazy loading.
- NVIDIA only, really. AMD ROCm support exists but is rougher than Ollama's.
If you are putting an LLM behind a chat product, a SaaS API, or any workflow with more than 5 concurrent active users, this is what you want. I would not deploy SmartExam's question-generation endpoint on Ollama in production. I would deploy it on vLLM.
LocalAI: the universal hub
LocalAI is the trickiest of the three to describe because it is not really an inference engine β it is an OpenAI-compatible API surface that delegates to backends. Out of the box it bundles llama.cpp, Whisper (for speech-to-text), Stable Diffusion (for image generation), bark/coqui (for text-to-speech), and embedding models like nomic-embed. You can also point it at an external vLLM instance.
The pitch: one URL handles your /v1/chat/completions, /v1/embeddings, /v1/audio/transcriptions, and /v1/images/generations the same way OpenAI does. Your application code does not change at all when you switch from OpenAI to LocalAI β you just change the base URL.
I tested LocalAI for the use case I cared about most: replacing the OpenAI text-embedding-3-small calls inside DocSumm (the document summarizer we use internally for client briefs). With nomic-embed-text-v1.5 running on the same RTX 4090, embedding generation came in at roughly 14,000 tokens/sec per worker, which is dramatically faster than our previous OpenAI usage at our request rates. Cost went from roughly USD 70/month in OpenAI embeddings to effectively zero marginal cost on a VPS we already had.
For pure text generation, LocalAI does add overhead β it is an HTTP proxy plus model-definition parsing on top of the backend. The Glukhov 2026 comparison found LocalAI's text generation about 10β20% slower than calling the same backend directly. On Linux with a dedicated NVIDIA GPU and the vLLM backend, that gap closes substantially. The point is: do not pick LocalAI as a faster Ollama. Pick it because you want one endpoint for text + embeddings + audio + images and you are willing to accept some proxy overhead for that convenience.
For BizChat (our chatbot platform) we are likely to end up with LocalAI in front of vLLM, because we need chat + embeddings + occasional Whisper transcription, and we want to expose all of that as one OpenAI-compatible URL to our application code.
Side-by-side comparison
| Dimension | Ollama | vLLM | LocalAI |
|---|---|---|---|
| Setup difficulty | Trivial (1 cmd) | Moderate (CUDA, flags) | Moderate (config YAML) |
| OpenAI-compatible | Yes | Yes | Yes |
| Concurrency model | Sequential queue | Continuous batching | Depends on backend |
| Single-stream throughput (8B Q4 on RTX 4090) | ~120 tok/s | ~110 tok/s | ~95 tok/s |
| 50-user aggregate throughput | ~130 tok/s | ~2,400 tok/s | ~2,000 tok/s (vLLM backend) |
| VRAM efficiency | Good (GGUF Q4) | Excellent (PagedAttention) | Depends on backend |
| GPU support | NVIDIA + AMD + Apple | NVIDIA primary | NVIDIA + AMD + CPU |
| Multi-modal (audio, images) | Limited | Text only (mostly) | Native, multiple modalities |
| Production-readiness | Small teams only | Production-grade | Production with backend |
| Best for | Devs, internal tools, β€5 users | SaaS APIs, β₯10 concurrent users | One endpoint, many modalities |
My decision matrix β which one would I deploy where?
After running all three on the same hardware, here is the rule of thumb I now use when a client asks which engine to put on their GPU VPS:
- Solo developer copilot, prototype, internal Slack bot, β€5 users β Ollama. The simplicity is worth more than the throughput. You will spend zero time on infrastructure.
- SaaS chatbot, customer-facing API, anything with β₯10 simultaneous users β vLLM. Burn the day on CUDA setup once and you get a serving stack that scales linearly with VRAM.
- Need embeddings + chat + transcription + images in one box β LocalAI, with vLLM or
llama.cppas the chat backend. - AMD GPU you cannot replace β Ollama. AMD ROCm support is just more mature here than in vLLM.
- Apple Silicon Mac mini server β Ollama. Only one of the three with proper Metal acceleration.
One opinion I will not hedge on: do not start with vLLM if you have never deployed an LLM yourself. Stand it up on Ollama first to confirm your prompt, model size, and use case actually work. Once you know the model and the throughput you need, port the production path to vLLM. Reversing that order has cost me time more than once.
What to budget for a real deployment
Here is a rough monthly spend model from what I have seen in May 2026:
- RTX 4090 24GB VPS β USD 180β280/month committed, USD 0.30β0.50/hour on-demand. Handles 8B FP16 or 13B INT4 comfortably. Realistic for most small SaaS.
- RTX 3090 24GB VPS β USD 130β200/month committed. Similar VRAM, ~70β80% of 4090 throughput. Best value for self-hosting in 2026.
- A100 80GB VPS β USD 900β1,400/month committed. Run 70B quantized, or batch 8B at very high concurrency.
- H100 80GB VPS β USD 1,800β2,500/month committed. Production-scale, multi-tenant, large context, no compromises.
For comparison: USD 280/month of OpenAI API at GPT-4o-mini pricing buys you roughly 1.8 billion input tokens. That is a lot, and for many small projects renting the API is still cheaper than self-hosting. The break-even on a single RTX 4090 VPS arrives somewhere around 5β10 million daily input tokens, plus the fact that self-hosting gives you zero egress fees and full data residency control. Across the seven aggregator sites we operate, two had clearly crossed that break-even line before we made the switch.
Operational gotchas I wish I had known earlier
- Set a firewall rule before
ollama serve. By default Ollama binds to localhost, but a surprising number of "self-hosted" tutorials tell people to useOLLAMA_HOST=0.0.0.0, which then sits open on the public internet with no auth. Put it behind Caddy or Nginx with basic auth or a token check, every time. - Pin the model digest, not the tag.
llama3.1:8bchanges when upstream re-tags. For production, pull the model once and reference it by SHA so a maintenance reboot does not surprise you with a behavior change. - vLLM cold start times kill autoscaling. If you are tempted to autoscale GPU VPS instances on demand, plan for 60β90 second model-load delays. Either keep a hot pool or accept that scale-up is not instant.
- Measure tokens-per-dollar, not tokens-per-second. An H100 at 4Γ the throughput of a 4090 is not worth 6Γ the price unless you are saturating it. The 4090 is the sweet spot for most teams I have worked with.
- Quantization is your friend until it isn't. Q4_K_M gives you ~75% memory savings with usually less than a 2% quality hit on standard benchmarks. But for structured-output tasks (JSON, code, function calling) I have seen Q4 noticeably hurt reliability. Test on your actual workload.
FAQ
Can I run any of these without a GPU?
Ollama and LocalAI both run on CPU, with the llama.cpp backend doing the heavy lifting. Expect 5β15 tokens/sec on a modern x86 CPU with a 7β8B Q4 model β usable for personal tools, not for production. vLLM technically has a CPU build but it is not the intended target and performance is poor.
Is self-hosting actually cheaper than OpenAI / Anthropic API?
It depends on usage. Below ~5 million tokens/day, paying for the API is almost always cheaper than running a dedicated GPU 24/7. Above that, especially for predictable workloads, self-hosting wins on cost per token and gives you data control. For spiky workloads, the API still wins because you are not paying for idle GPU.
Can I run Ollama and vLLM on the same VPS?
Technically yes, but you will fight over VRAM. Better pattern: one VPS for vLLM serving production traffic, and a small dev VPS or your laptop for Ollama-based prototyping. Keep them separate.
Which is best for AMD ROCm?
Ollama, by a clear margin. Its ROCm path is the most polished. vLLM has ROCm support but expect rougher edges. LocalAI inherits whichever backend you pick.
Do these tools support tool / function calling?
All three pass through OpenAI-style function-calling JSON if the underlying model supports it (Llama 3.1, Qwen 2.5, Mistral all do). vLLM has the most explicit support; Ollama added it in 2024 and it has been stable since.
What about Hugging Face TGI?
Text Generation Inference (TGI) is a fourth option I left out of the comparison because its license and direction have made it less popular for self-hosting in 2026. vLLM has effectively become the community default. If you are evaluating, look at vLLM first.
Closing recommendation
Pick the engine that matches your real load shape, not the one with the loudest benchmarks. For 80% of small SaaS and internal tooling, an RTX 4090 GPU VPS running vLLM with Llama 3.1 8B Instruct, fronted by Caddy with a token-based auth proxy, is a serious production stack for under USD 250/month. For prototypes and β€5-user internal tools, Ollama on a small GPU VPS is fine β and you can keep using it forever if you never grow past that scale. For multi-modal needs (audio, embeddings, images all from one endpoint), LocalAI in front of either backend is the cleanest path.
The honest takeaway after running all three: there is no winner. There is "Ollama for simplicity," "vLLM for throughput," and "LocalAI for everything-in-one." The job is matching the tool to the workload β and not letting a benchmark headline talk you into a configuration that does not fit your traffic pattern.
Found this helpful?
Subscribe to our newsletter for more in-depth reviews and comparisons delivered to your inbox.