Running LLMs locally with llama.cpp and Open WebUI on macOS or Linux
Running DeepSeek and other LLMs locally has become very trendy recently. However, I found that it wasn’t always obvious how to set these up and run them locally, so I’m sharing my current setup in this blog post.
A popular approach seems to be just using Ollama or LM Studio, but I’d like to avoid both. I’ve been getting a bad impression of Ollama for multiple reasons, such as their use of misleading names and tags for models, and their apparent leveraging of llama.cpp without giving proper credit. As for LM Studio, I’m mainly avoiding it because it isn’t free software.
This post will focus on llama.cpp and Open WebUI. The instructions are tailored to Homebrew on macOS, but they should work mostly the same on Linux, just with different ways of installing the tools.
Setting up Docker #
Because, as of March 2025, I’m unable to use the open-webui package from Nix, we’ll be using Docker to run Open WebUI.
Docker Desktop on macOS seems to be having some issues lately, so I took the chance to try out colima for a leaner setup. We can install it, along with the Docker CLI, like so:
brew install colima docker
After that, we can just do:
colima start
And the docker
command will work as expected.
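To double-check that the Docker CLI is talking to the colima VM, listing containers is a quick sanity test; an empty table (rather than a connection error) means the daemon is reachable:
docker ps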
Installing llama.cpp #
To run the model, we’ll be using llama.cpp, a leading open-source project for running LLMs locally.
Again, we can install it with Homebrew:
brew install llama.cpp
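As a quick check that the install worked, the llama.cpp binaries accept a --version flag, which should print the build they were compiled from:
llama-server --version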
Running a model #
For a more minimalist setup, it is possible to run the model with llama-cli
from llama.cpp and interact with it directly in the terminal.
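For reference, a minimal terminal-only session could look something like this (using the same Qwen QwQ model as below; recent llama.cpp builds should drop you straight into an interactive chat when the model ships a chat template):
llama-cli -hf bartowski/Qwen_QwQ-32B-GGUF:q4_k_m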
However, I’ll show you how to run the model with llama-server
so that it hosts an API to connect with Open WebUI, where we’ll have niceties like conversation history.
For example, here’s how we can run the freshly released Qwen QwQ model:
llama-server \
-hf bartowski/Qwen_QwQ-32B-GGUF:q4_k_m \
--host 127.0.0.1 \
--port 10000
I haven’t seen this -hf flag mentioned very often, but I find it useful: it has llama.cpp download the model directly from Hugging Face.
If at a later point you wish to move or delete these models, on macOS they’re stored at ~/Library/Caches/llama.cpp.
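To see how much space the downloaded models are taking up there (on Linux, the cache should live under ~/.cache/llama.cpp instead, though I haven’t verified that path myself):
ls -lh ~/Library/Caches/llama.cpp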
From my (superficial) research, the Q4_K_M level of quantization seems to provide a nice balance between small size and decent results.
I tend to search for models in the form of GGUFs by bartowski
on Hugging Face.
On my MacBook Pro 14 M1 Pro with 32GB of RAM, I find models of size 32B with Q4_K_M quantization to be the sweet spot.
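As a rough back-of-the-envelope check: Q4_K_M works out to somewhere around 4.8 bits per weight, so a 32B model comes to roughly 32 × 10⁹ × 0.6 bytes ≈ 19–20 GB, which leaves some headroom for the context and the rest of the system within 32GB of unified memory.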
There’s also lots of room for exploration and tuning with llama.cpp’s other flags and parameters. However, because I currently don’t really know what those do, I prefer to keep them to their default values.
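For reference, the flags I see mentioned most often are -c (context window size), -ngl (number of layers to offload to the GPU) and --temp (sampling temperature). The values below are purely illustrative, not something I’ve tuned:
llama-server \
  -hf bartowski/Qwen_QwQ-32B-GGUF:q4_k_m \
  --host 127.0.0.1 \
  --port 10000 \
  -c 16384 \
  -ngl 99 \
  --temp 0.7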
Running Open WebUI #
Finally, to interact with the model, we’ll use Open WebUI.
We can launch it like this:
docker run \
  --pull=always \
  --rm \
  -p 3000:8080 \
  -e WEBUI_AUTH=False \
  -v ~/Documents/open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
This will launch an Open WebUI instance with logins disabled, accessible at localhost:3000 in a web browser.
Stateful data is kept in the ~/Documents/open-webui
directory.
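Since the container is started with --rm, stopping it also removes it, but the data in that directory survives, so Open WebUI can be stopped and relaunched freely:
docker stop open-webui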
Opening localhost:3000 in your web browser of choice will now greet you with the Open WebUI interface.
All that’s left to do is to connect it to the llama.cpp server we left running (I’ve also included a quick sanity check after these steps):
- Admin Settings
- Connections > OpenAI Connections
- Add Connection:
- URL: http://host.docker.internal:10000/v1
- API Key: none
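Before starting a chat, you can also verify from the host that the llama.cpp API is reachable on the expected port (the connection above uses host.docker.internal because Open WebUI runs inside the container; from the host itself, 127.0.0.1 works). A quick check against the OpenAI-compatible endpoint, which should return a small JSON document listing the loaded model:
curl http://127.0.0.1:10000/v1/models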
And now you should be able to start a new chat with the model. I find it sometimes takes a while to “warm up”, but then responses come faster.
Closing thoughts #
Very exciting times ahead in the local LLM space. It seems that while state-of-the-art commercial models are plateauing, open-source models keep achieving better performance at ever smaller sizes. A world where everyone is able to run excellent models on their own computer is rapidly approaching.
I’m definitely interested in digging further. I’d like to try out MLX models as an alternative to GGUFs on llama.cpp, to see if I can get faster speeds out of my MacBook. It would also be interesting to see whether a local SearxNG instance can be set up to power Web Search in Open WebUI. Finally, I’d like to check out tools like aider and tabby for local code assistance and completion.