Running LLMs locally with llama.cpp and Open WebUI on macOS or Linux
Running DeepSeek and other LLMs locally has become very trendy recently. However, I found that it wasn’t always obvious how to set these up and run them locally, so I’m sharing my current setup in this blog post.
A popular approach seems to be just using Ollama or LM Studio, but I’d like to avoid both. I’ve been getting a bad impression of Ollama for multiple reasons, such as their use of misleading names and tags for models, and their apparent leveraging of llama.cpp without giving proper credit. As for LM Studio, I’m mainly avoiding it because it isn’t free software.
This post will focus on llama.cpp and Open WebUI. The instructions are tailored to Homebrew on macOS, but they should work mostly the same on Linux, just with different ways of installing the tools.
Setting up Docker #
Because, as of March 2025, I’m unable to use the open-webui package from Nix, we’ll be using Docker to run Open WebUI.
Docker Desktop on macOS seems to be having some issues lately, so I took the chance to try out colima for a leaner setup. We can install it, along with the Docker CLI, like so:
brew install colima docker
After that, we can just do:
colima start
And the docker
command will work as expected.
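To double-check that the Docker CLI is talking to the colima VM, listing containers is a quick sanity test; an empty table (rather than a connection error) means the daemon is reachable:
docker ps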
Installing llama.cpp #
To run the model, we’ll be using llama.cpp, a leading open-source project for running LLMs locally.
Again, we can install it with Homebrew:
brew install llama.cpp
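As a quick check that the install worked, the llama.cpp binaries accept a --version flag, which should print the build they were compiled from:
llama-server --version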
Running a model #
For a more minimalist setup, it is possible to run the model with llama-cli
from llama.cpp and interact with it directly in the terminal.
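For reference, a minimal terminal-only session could look something like this (using the same Qwen QwQ model as below; recent llama.cpp builds should drop you straight into an interactive chat when the model ships a chat template):
llama-cli -hf bartowski/Qwen_QwQ-32B-GGUF:q4_k_m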
However, I’ll show you how to run the model with llama-server
so that it hosts an API to connect with Open WebUI, where we’ll have niceties like conversation history.
For example, here’s how we can run the freshly released Qwen QwQ model:
llama-server \
-hf bartowski/Qwen_QwQ-32B-GGUF:q4_k_m \
--host 127.0.0.1 \
--port 10000
I haven’t seen this -hf flag mentioned very often, but I find it useful: it has llama.cpp download the model directly from Hugging Face.
If at a later point you wish to move or delete these models, on macOS they’re stored at ~/Library/Caches/llama.cpp.
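To see how much space the downloaded models are taking up there (on Linux, the cache should live under ~/.cache/llama.cpp instead, though I haven’t verified that path myself):
ls -lh ~/Library/Caches/llama.cpp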
From my (superficial) research, the Q4_K_M level of quantization seems to provide a nice balance between small size and decent results.
I tend to search for models in the form of GGUFs by bartowski
on Hugging Face.
On my MacBook Pro 14 M1 Pro with 32GB of RAM, I find models of size 32B with Q4_K_M quantization to be the sweet spot.
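As a rough back-of-the-envelope check: Q4_K_M works out to somewhere around 4.8 bits per weight, so a 32B model comes to roughly 32 × 10⁹ × 0.6 bytes ≈ 19–20 GB, which leaves some headroom for the context and the rest of the system within 32GB of unified memory.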
There’s also lots of room for exploration and tuning with llama.cpp’s other flags and parameters. However, because I currently don’t really know what those do, I prefer to keep them to their default values.
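For reference, the flags I see mentioned most often are -c (context window size), -ngl (number of layers to offload to the GPU) and --temp (sampling temperature). The values below are purely illustrative, not something I’ve tuned:
llama-server \
  -hf bartowski/Qwen_QwQ-32B-GGUF:q4_k_m \
  --host 127.0.0.1 \
  --port 10000 \
  -c 16384 \
  -ngl 99 \
  --temp 0.7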
Running Open WebUI #
Finally, to interact with the model, we’ll use Open WebUI.
We can launch it like this:
docker run \
  --pull=always \
  --rm \
  -p 3000:8080 \
  -e WEBUI_AUTH=False \
  -v ~/Documents/open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
This will launch an Open WebUI instance with logins disabled, accessible at localhost:3000 in a web browser.
Stateful data is kept in the ~/Documents/open-webui
directory.
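Since the container is started with --rm, stopping it also removes it, but the data in that directory survives, so Open WebUI can be stopped and relaunched freely:
docker stop open-webui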
Opening localhost:3000 in your web browser of choice will now greet you with the Open WebUI interface.
All that’s left to do is to connect it to the llama.cpp server we left running (I’ve also included a quick sanity check after these steps):
- Admin Settings
- Connections > OpenAI Connections
- Add Connection:
- URL: http://host.docker.internal:10000/v1
- API Key: none
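Before starting a chat, you can also verify from the host that the llama.cpp API is reachable on the expected port (the connection above uses host.docker.internal because Open WebUI runs inside the container; from the host itself, 127.0.0.1 works). A quick check against the OpenAI-compatible endpoint, which should return a small JSON document listing the loaded model:
curl http://127.0.0.1:10000/v1/models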
And now you should be able to start a new chat with the model. I find it sometimes takes a while to “warm up”, but then responses come faster.
Closing thoughts #
Very exciting times ahead in the local LLM space. It seems that while state-of-the-art commercial models are plateauing, open-source models keep achieving better performance at ever smaller sizes. A world where everyone is able to run excellent models on their own computer is rapidly approaching.
I’m definitely interested in digging further. I’d like to try out MLX models as an alternative to GGUFs on llama.cpp, to see if I can get faster speeds out of my MacBook. It would also be interesting to see whether a local SearxNG instance can be set up to power Web Search in Open WebUI. Finally, I’d like to check out tools like aider and tabby for local code assistance and completion.