Reference Guide

Local LLM Home Server: What You Actually Need

Running a large language model at home means owning the hardware, the data, and the results. Here is what that setup looks like, what it costs, and where to start.

Running a large language model at home – on hardware you own, on your local network, with nothing sent to any external server – has gone from a niche technical project to something a growing number of people are doing seriously. The software has matured, the hardware requirements have come down, and the privacy and cost arguments for doing it have only gotten stronger as cloud AI pricing and data practices have become better understood.

This page covers what a local LLM home server actually requires: the hardware, the software, the realistic performance expectations at different budget levels, and how it fits into a broader offline knowledge setup. It is written for people who might build one themselves as much as for people who want a pre-built option – both paths are real and worth understanding.

What a Local LLM Home Server Actually Is

A local LLM home server is a computer on your home network running software that serves large language model inference locally. You send it a prompt – from a browser, a chat interface, or an API call – and it responds using a model running on that machine’s processor. No internet connection involved in the inference itself. No request leaving your network.

The two pieces of software that make this practical for most people are Ollama and Open WebUI. Ollama handles model downloading, management, and serving. It runs on Linux, macOS, and Windows, exposes a local API, and integrates with a growing number of front-end chat interfaces. Open WebUI is the most widely used browser-based front end – it gives you a chat interface that looks and behaves like a commercial AI product, running entirely against your local Ollama instance.
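As a rough illustration of what "exposes a local API" means in practice, the sketch below sends one prompt to Ollama's /api/generate endpoint using only the Python standard library. It assumes Ollama is listening on its default address (localhost, port 11434) and that a model has already been pulled; the helper names build_generate_request and ask are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama is running on this machine at its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With a model such as llama3.1:8b already pulled (the tag is an assumption; use whichever model is installed), ask("llama3.1:8b", "Explain what a token is in one sentence.") would return the reply as a string. The request goes to localhost and never leaves the machine.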

Setup on a capable machine is genuinely straightforward. Install Ollama, pull a model, point a browser at the interface. The technical barrier is lower than it was two years ago. The hardware barrier is the remaining constraint that determines how good the experience actually is.

The Hardware That Determines Everything

Local LLM performance comes down to one variable more than any other: how fast the model can generate tokens, measured in tokens per second. A model generating 5 tokens per second produces responses one halting word at a time. At 30 tokens per second, responses stream naturally. At 100 tokens per second or above, the experience is faster than most people read.
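To make those numbers concrete, here is a small sketch of the arithmetic. The 300-token answer length is an arbitrary assumption for illustration, and both function names are hypothetical. The second helper derives tokens per second from the eval_count and eval_duration fields that Ollama reports in its generate response, where the duration is given in nanoseconds.

```python
def seconds_to_generate(response_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a full response at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return response_tokens / tokens_per_second


def measured_tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation rate from a token count and a duration in nanoseconds."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Assumption: a 300-token answer, roughly a few short paragraphs.
for rate in (5, 30, 100):
    print(f"{rate:>3} tok/s -> {seconds_to_generate(300, rate):.0f} s")
```

Under that assumption the same answer takes a minute at 5 tokens per second, ten seconds at 30, and three seconds at 100.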

Token generation speed is almost entirely determined by your GPU – specifically, how much GPU memory you have and how fast it is. This is why the hardware question is not peripheral to running a local LLM home server. It is the central question.

CPU-Only Inference

Any modern x86 computer can run a small language model using the CPU alone. A 1 to 3 billion parameter model on a capable Intel or AMD desktop processor produces 5 to 20 tokens per second. That is enough to verify the technology works, run simple queries, and get a feel for local AI. The experience is noticeably slower than what most people are used to from cloud AI.

CPU-only inference is also where RAM matters most. Models load into system memory when running on CPU. A 7 billion parameter model quantized to 4-bit precision requires around 4 to 5 GB of memory to load. Running it comfortably alongside an operating system and other applications requires 16 GB of RAM minimum, 32 GB preferred.
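The 4 to 5 GB figure follows from simple arithmetic: parameters times bits per weight, divided by eight to get bytes, plus some headroom. The sketch below is a back-of-the-envelope estimate, not a measurement. The 20 percent overhead for context cache and runtime buffers is an assumption, and real 4-bit quantization formats often use slightly more than 4 bits per weight on average.

```python
def estimate_model_memory_gb(
    params_billion: float,
    bits_per_weight: float,
    overhead_fraction: float = 0.2,  # assumption: cache and runtime buffers
) -> float:
    """Rough memory needed to load a quantized model, in decimal gigabytes."""
    # One billion parameters at 8 bits each is one gigabyte of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead_fraction)


for size in (7, 13, 70):
    print(f"{size}B at 4-bit: about {estimate_model_memory_gb(size, 4):.1f} GB")
```

For a 7B model this gives about 4.2 GB, in line with the range above. The same formula puts a 13B model near 8 GB and a 70B model above 40 GB.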

AMD Integrated GPU (Radeon 780M / 890M)

This is where local LLM home servers become genuinely practical for everyday use. AMD’s Ryzen 7 and Ryzen 9 mobile processors include capable integrated graphics, the Radeon 780M on Zen 4 parts and the Radeon 890M on the newer Zen 5 parts – real GPU compute that Ollama can use for inference acceleration through ROCm.

The performance difference over CPU-only is significant. A 7 to 8 billion parameter model runs at 30 to 55 tokens per second on Radeon 780M or 890M hardware. Responses stream at a natural conversational pace. The experience is close to what early-generation cloud AI products delivered, running entirely on a mini PC that draws 15 to 35 watts and costs $400 to $700.

The constraint is VRAM – integrated graphics share system memory, so a 64 GB RAM configuration gives the GPU a larger pool to work with than a 32 GB configuration. For 7 to 8 billion parameter models, 32 GB is sufficient. For 13B models, 64 GB is where performance stays comfortable.

Community benchmarks from the Project NOMAD leaderboard – which tracks over 1,270 real hardware builds running Ollama – show Radeon 780M averaging 73.6 across 57 submissions and Radeon 890M averaging 76.3 across 23 submissions. The Minisforum AI X1 Pro with Ryzen AI 9 HX 370 and Radeon 890M is verified at 51.7 tokens per second on 7B models.

NVIDIA Discrete GPU

A dedicated NVIDIA GPU with its own VRAM changes the performance picture substantially. NVIDIA’s CUDA ecosystem has had the longest support for LLM inference, and the tooling around it is mature.

An RTX 3060 with 12 GB of VRAM runs a 7B model at around 75 tokens per second based on community benchmarks. An RTX 4070 with 12 GB pushes that higher. An RTX 5070 or above runs larger models at speeds that make local AI faster than most people type prompts.

The tradeoff is cost, power draw, and form factor. A discrete GPU requires a desktop chassis or an external GPU enclosure. Power consumption climbs from the 15 to 35 watts of a mini PC to 150 to 300 watts under GPU load. For a home server running continuously, that difference shows up in electricity costs over time.
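A quick way to see how that difference adds up is to convert watts to kilowatt-hours per year. The sketch below assumes the machine draws the stated wattage around the clock, which overstates the cost for a GPU that is only under load part of the time. The $0.15 per kWh price is a placeholder, so substitute your own rate.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_cost(
    watts: float,
    price_per_kwh: float,
    hours: float = HOURS_PER_YEAR,
) -> float:
    """Cost of drawing a constant number of watts for the given hours."""
    kilowatt_hours = watts / 1000 * hours
    return kilowatt_hours * price_per_kwh


PRICE = 0.15  # assumption: placeholder electricity price in dollars per kWh
for label, watts in (("mini PC, 35 W", 35), ("discrete GPU, 300 W", 300)):
    print(f"{label}: ${annual_energy_cost(watts, PRICE):.2f} per year")
```

At the placeholder rate, the upper ends of the two ranges come out to roughly $46 versus $394 per year of continuous draw.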

For users who want maximum capability – running 13B or 70B models, coding assistance with large context windows, or inference speeds above 100 tokens per second – the discrete GPU path is where that lives.

Model Selection: What to Actually Run

The model landscape changes faster than almost any other aspect of local AI. What was a frontier capability six months ago is now a mid-tier option. A few practical guidelines that remain stable:

7 to 8 billion parameter models are the practical sweet spot for most home server use cases on iGPU hardware. Meta’s Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, and Google’s Gemma 2 9B all run well in this range. Response quality is sufficient for summarization, question answering, drafting, coding assistance, and general research support.

13 billion parameter models are where reasoning quality improves noticeably, at the cost of slower inference on iGPU hardware and a requirement for 64 GB of system RAM to run comfortably. On a discrete GPU with 24 GB of VRAM, a 13B model runs quickly.

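70 billion parameter models require either a discrete GPU with large VRAM (24 GB or more), multiple GPUs, or acceptance of very slow CPU inference. At 4-bit quantization the weights alone come to roughly 35 GB, so even a single 24 GB card has to offload part of the model to system RAM. The quality step up is real. The hardware requirement is significant.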

For a home server used by a household for everyday tasks, a 7 to 8B model on AMD iGPU hardware covers most practical needs. The gap between this and frontier cloud models is real but smaller than it was, and for many use cases – drafting, summarization, explanation, basic coding – it is not the limiting factor.

Privacy: The Argument That Does Not Age Out

The performance case for local LLMs gets better as hardware improves. The privacy case is structural and does not depend on hardware getting faster.

When you use a cloud AI service, every prompt you send is processed on infrastructure you do not control. Terms of service vary by provider and change over time. What is retained, how it is used for training, and what third parties have access to it are questions with answers that are not always clear and not always in your favor.

A local LLM home server means your prompts stay on your hardware. A question about a medical situation, a legal matter, a financial decision, business strategy, or anything else you would rather keep private is answered on your machine and goes nowhere. There is no retention policy because there is nothing retained externally. There is no data breach risk from the AI provider because the AI provider is not involved.

For professionals handling client information, households with strong privacy preferences, or anyone who has read the terms of service on a cloud AI product and found them unsatisfying, this is a durable argument independent of where inference speeds land.

Local LLM as Part of a Broader Knowledge System

A standalone Ollama instance is a capable tool. It becomes more useful when it runs alongside a reference library – offline Wikipedia, medical references, technical documentation – that the AI can help you navigate and understand.

Project NOMAD packages Ollama-powered local AI alongside offline Wikipedia via Kiwix, global OpenStreetMap maps, and the Khan Academy education library into a single managed system. The AI and the reference library run on the same hardware, accessible from the same browser interface on your local network. You can ask the AI a question, look up the relevant Wikipedia article, and come back to the AI with follow-up questions – all without any internet connection.

For a home setup where local AI is one part of a broader offline knowledge infrastructure rather than a standalone tool, this is the architecture that makes the most sense. One box, one local network, everything accessible from any device in the house.

Building vs. Buying

The DIY path for a local LLM home server is accessible. Install Linux or use your existing OS, install Ollama, pull a model, optionally install Open WebUI for a browser interface. On capable hardware – an AMD Ryzen 7 mini PC with 32 or 64 GB RAM – this is an afternoon. The software is all free and open source.
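After the install and the first model pull, a quick sanity check is to ask the server which models it has. The sketch below queries Ollama's /api/tags endpoint, which lists locally available models. It assumes the default local address, and the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running on this machine at its default port.
TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [model["name"] for model in tags_response.get("models", [])]


def installed_models(timeout: float = 5.0) -> list[str]:
    """Ask the local Ollama server which models are available."""
    with urllib.request.urlopen(TAGS_URL, timeout=timeout) as response:
        return model_names(json.loads(response.read().decode("utf-8")))
```

Calling installed_models() on a working setup returns the names of the pulled models. A connection error usually means the Ollama service is not running, and an empty list means no model has been pulled yet.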

Where the DIY path gets more involved is if you want the full knowledge stack alongside the AI – Wikipedia, maps, and education library loaded and configured. That is where the NOMAD setup process adds time: a Linux install if you are not already running it, the NOMAD configuration script, and content downloads that run into hundreds of gigabytes.

The pre-built option – the Codex Standard – ships with NOMAD configured on AMD Ryzen 7 Zen 4 hardware, the full content stack downloaded and loaded, and the system benchmarked before it arrives. You plug it into your router via Ethernet and access everything from any browser on your network. No Linux install, no content download queue, no configuration process.

Both paths lead to the same system. For a technically confident person who enjoys the setup process, DIY on appropriate hardware is the highest-value option. For anyone who wants the capability without the overhead, the pre-built path is the practical one.

The Codex Standard runs Ollama on AMD Radeon iGPU hardware alongside offline Wikipedia, global maps, and Khan Academy – configured and benchmarked before it ships. See full specs at Codex Standard.
