## What it’s good for

Full privacy — no data leaves your machine. Run any GGUF model with direct control over quantization, context size, and GPU layers. Best for users who want maximum control over their inference setup.

## Requirements

- llama.cpp built with `llama-server`
- A GGUF model file downloaded to your machine
- Sufficient RAM/VRAM for your chosen model and context size
## Chat

### Configure Spaceduck

In Settings > Chat:
- Provider: llama.cpp
- Base URL: `http://127.0.0.1:8080/v1`
- Model: leave empty (llama-server uses the loaded model)
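If llama-server isn’t running yet, a minimal launch might look like this sketch. The model path, context size, and `-ngl` value are placeholders, not part of this guide; adjust them for your model and hardware:

```shell
# Sketch: serve a local GGUF model on the default port 8080.
# -c sets the context size; -ngl sets how many layers to offload to the GPU.
# The model path and flag values below are placeholders.
llama-server -m /path/to/model.gguf -c 8192 -ngl 99 --port 8080
```

Once it starts, the Base URL above points at its OpenAI-compatible API.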
## Embeddings

llama.cpp runs embeddings on a separate server instance with the `--embeddings` flag. This is the two-server pattern.
### Start the embedding server
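A minimal invocation might look like the following sketch. The Hugging Face repo is an assumption (any GGUF embedding model works), and port 8081 is chosen to match the Spaceduck settings below:

```shell
# Sketch: start a second llama-server instance dedicated to embeddings.
# The Hugging Face repo below is an example, not a requirement.
llama-server -hf nomic-ai/nomic-embed-text-v1.5-GGUF --embeddings --port 8081
```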
The `-hf` flag downloads the model from Hugging Face automatically. You can also use `-m /path/to/model.gguf` for a local file.

### Configure Spaceduck
In Settings > Memory:
- Toggle Semantic recall on
- Provider: llama.cpp
- Server URL: `http://127.0.0.1:8081/v1`
- Model: leave empty
- Dimensions: 768 (for nomic-embed-text-v1.5)
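To sanity-check that the embedding server is up and that the Dimensions setting matches the model, you can query the OpenAI-compatible endpoint directly. This is a sketch assuming the server from the previous step is on port 8081 and that `jq` is installed; llama-server generally ignores the `model` field here:

```shell
# Sketch: request one embedding and count its dimensions.
curl -s http://127.0.0.1:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "hello world", "model": "default"}' \
  | jq '.data[0].embedding | length'
# For nomic-embed-text-v1.5 this should print 768.
```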
### Recommended embedding models
| Model | Dimensions | Size | Notes |
|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | ~260 MB (Q5_K_M) | Good quality, fast, multilingual |
| bge-small-en-v1.5 | 384 | ~130 MB | English-focused, very fast |
| mxbai-embed-large-v1 | 1024 | ~670 MB | Higher quality, more RAM |
## Test and troubleshoot

| Problem | Cause | Fix |
|---|---|---|
| `ECONNREFUSED` on port 8080 | llama-server not running | Start it with the command above |
| Garbled or wrong responses | Missing or wrong chat template | Add the `--chat-template` flag to llama-server |
| Embeddings return errors | Missing `--embeddings` flag | Restart the embedding server with `--embeddings` |
| Slow generation | Not enough GPU offload | Increase `-ngl` or use a smaller quantization |
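Before digging into the table above, a quick liveness check on both servers can rule out connection problems. This sketch assumes the ports used throughout this guide; llama-server exposes a `/health` endpoint:

```shell
# Sketch: check that both llama-server instances are reachable.
curl -s http://127.0.0.1:8080/health   # chat server
curl -s http://127.0.0.1:8081/health   # embedding server
```

A healthy server responds with a small JSON status body; a connection error here means the server isn’t running on that port.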
