What it’s good for

Full privacy — no data leaves your machine. Run any GGUF model with direct control over quantization, context size, and GPU layers. Best for users who want maximum control over their inference setup.

Requirements

  • llama.cpp built with llama-server
  • A GGUF model file downloaded to your machine
  • Sufficient RAM/VRAM for your chosen model and context size
Install llama.cpp:
brew install llama.cpp    # macOS
# or build from source: https://github.com/ggerganov/llama.cpp#build

Configure in Spaceduck

Chat

1. Start llama-server

llama-server \
  -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 8192 \
  -ngl 99
-ngl 99 offloads all layers to GPU. Reduce this number if you run out of VRAM.
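The KV cache grows linearly with --ctx-size, which matters when budgeting RAM/VRAM. A back-of-envelope sketch, assuming Llama 3.1 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache:

```shell
# KV cache bytes = 2 (K and V) x layers x tokens x kv_heads x head_dim x 2 bytes (fp16)
python3 -c 'print(2 * 32 * 8192 * 8 * 128 * 2 // 2**20, "MiB")'
```

So at --ctx-size 8192 the cache needs roughly 1 GiB on top of the model weights themselves; halving --ctx-size halves that figure.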
2. Configure Spaceduck

In Settings > Chat:
  • Provider: llama.cpp
  • Base URL: http://127.0.0.1:8080/v1
  • Model: leave empty (llama-server uses the loaded model)
Or via CLI:
spaceduck config set /ai/provider llamacpp
spaceduck config set /ai/baseUrl http://127.0.0.1:8080/v1
3. Verify

curl http://127.0.0.1:8080/v1/models
You should see your loaded model in the response.
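You can also exercise the chat endpoint directly; llama-server exposes an OpenAI-compatible /v1/chat/completions route, so a minimal request (assuming the server from step 1 is still running) looks like:

```shell
# The model field can be omitted: llama-server answers with whichever model it has loaded.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Reply with one short sentence."}]}'
```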

Embeddings

llama.cpp serves embeddings from a separate llama-server instance started with the --embeddings flag. This is the two-server pattern: one process for chat (port 8080) and one for embeddings (port 8081).
1. Start embedding server

llama-server \
  -hf nomic-ai/nomic-embed-text-v1.5-GGUF:Q5_K_M \
  --host 127.0.0.1 \
  --port 8081 \
  --embeddings
The -hf flag downloads the model from Hugging Face automatically. You can also use -m /path/to/model.gguf for a local file.
2. Configure Spaceduck

In Settings > Memory:
  • Toggle Semantic recall on
  • Provider: llama.cpp
  • Server URL: http://127.0.0.1:8081/v1
  • Model: leave empty
  • Dimensions: 768 (for nomic-embed-text-v1.5)
Or via CLI:
spaceduck config set /embedding/enabled true
spaceduck config set /embedding/provider llamacpp
spaceduck config set /embedding/baseUrl http://127.0.0.1:8081/v1
spaceduck config set /embedding/dimensions 768
3. Verify

curl http://127.0.0.1:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "test", "model": "nomic"}'
You should get back a JSON response containing an embedding array. Then, in Settings > Memory, click the Test button; a green status indicator confirms the connection.
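The Dimensions setting must match the vector length the server actually returns. One way to check is to pipe the curl output through a short Python one-liner; shown here against a truncated sample of the response shape (the real response for nomic-embed-text-v1.5 should print 768):

```shell
# Sample of the /v1/embeddings response shape, with a 3-element stand-in vector
echo '{"data": [{"embedding": [0.1, 0.2, 0.3]}]}' \
  | python3 -c 'import json, sys; print(len(json.load(sys.stdin)["data"][0]["embedding"]))'
```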
| Model | Dimensions | Size | Notes |
|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | ~260 MB (Q5_K_M) | Good quality, fast, multilingual |
| bge-small-en-v1.5 | 384 | ~130 MB | English-focused, very fast |
| mxbai-embed-large-v1 | 1024 | ~670 MB | Higher quality, more RAM |

Test and troubleshoot

| Problem | Cause | Fix |
|---|---|---|
| ECONNREFUSED on port 8080 | llama-server not running | Start it with the command above |
| Garbled or wrong responses | Missing or wrong chat template | Add --chat-template flag to llama-server |
| Embeddings return errors | Missing --embeddings flag | Restart the embedding server with --embeddings |
| Slow generation | Not enough GPU offload | Increase -ngl or use a smaller quantization |
If responses look wrong (e.g., raw tokens, repeated text), the most common cause is a missing chat template. Add --chat-template chatml or the appropriate template for your model to the llama-server command.
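For example, to force a template explicitly (llama3 and chatml are among llama.cpp's built-in template names; check llama-server --help for the current list):

```shell
llama-server \
  -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 8192 \
  -ngl 99 \
  --chat-template llama3
```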