Models & hardware
Riverforge runs every model locally through Ollama. Models are interchangeable by design: point a slot at any Ollama model and Riverforge adapts its prompting and sampling to suit it. A small 4B model runs comfortably on a laptop GPU; go bigger if you have the VRAM.
Bring your own model
Riverforge isn’t tied to one model. Pull any model into Ollama, point a slot at it, and Riverforge adapts how it prompts and samples to suit that model.
It applies each model family’s recommended settings instead of forcing one configuration on everything, and any reasoning the model produces is shown in a separate “thinking” card so your answer stays clean.
Adding a model is standard Ollama. Pull whatever you want from the command line — ollama pull <model> — and it’s immediately available to Riverforge. Open the Models panel, hit Refresh, and select it. Nothing Riverforge-specific to learn.
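If you would rather confirm from a script that a pulled model is visible, one option is to ask Ollama directly. The sketch below assumes Ollama is listening on its default address, http://localhost:11434, and uses Ollama's /api/tags endpoint, which lists locally installed models. The helper names are illustrative and are not part of Riverforge.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address; adjust if yours differs


def installed_model_names(tags_response: dict) -> list:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def is_installed(model: str, names: list) -> bool:
    """True if `model` is in `names`; a name without a tag is treated as ':latest'."""
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in names


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With Ollama running, `is_installed("<model>", installed_model_names(fetch_tags()))` should return True once the pull has finished, which is roughly the check Refresh performs for you in the Models panel.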
The model slots
Riverforge has two model slots, plus a separate thinking control:
| Slot | What it does |
|---|---|
| Planner | Planning, replies, Ask mode and finalising |
| Executor | Fast code and tool execution |
| Thinking | When the model produces extended reasoning — Auto, Adaptive, On or Off |
By default both Planner and Executor use the chat model you chose at install. On an 8 GB card that’s Gemma 4 Turbo, picked for speed; if you have more VRAM, a larger model such as Qwen 3.5 9B or Gemma 4 26B is more capable — see Choosing your model. Thinking mode is explained on the Chat Modes page.
Switching models
Models can be changed from any of the following places, all of which do the same thing:
- The Models panel in the VS Code chat composer (with Refresh to re-scan your installed models).
- The local server API endpoint POST /models/config.
Switching a model doesn’t need a restart — the next message uses it. Make sure the model is pulled in Ollama first (ollama pull <model>), then hit Refresh in the Models panel so it appears in the list.
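For scripted switching, a call to the API endpoint might look like the sketch below. Only the method and path (POST /models/config) come from this page. The server address and the JSON field names (`slot`, `model`) are assumptions made for illustration, so check them against your running server before relying on this.

```python
import json
import urllib.request

# ASSUMPTIONS: the base URL and the JSON field names are placeholders.
# Only the method and path (POST /models/config) are documented above.
RIVERFORGE_URL = "http://localhost:8000"  # hypothetical; use your server's address


def build_switch_request(slot: str, model: str,
                         base_url: str = RIVERFORGE_URL) -> urllib.request.Request:
    """Build (but do not send) a request that points one slot at a model."""
    body = json.dumps({"slot": slot, "model": model}).encode("utf-8")  # hypothetical schema
    return urllib.request.Request(
        url=f"{base_url}/models/config",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Once the payload matches what your server actually expects, the request can be sent with `urllib.request.urlopen(build_switch_request("planner", "<model>"))`. Keeping the build step separate from the send makes it easy to inspect the request first.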
Want a bigger model just for a tough task? Switch a slot, do the work, then switch back. Keep a fast model for everyday work and reach for a larger one only when it earns it.
Hardware fit
Riverforge was designed low-end first and scales straight up. The same agent simply gets smarter as your hardware does — larger models, longer context, a deeper brain.
| Concern | What Riverforge does |
|---|---|
| 8 GB VRAM budget | One resident coding model shared across slots so runs don’t thrash VRAM |
| GPU needed for games / 3D | Pause VRAM unloads models instantly while the server stays up; Resume VRAM warms it again |
| Cold-start latency | The chat and embed models are pre-warmed in the background, and kept warm after every run |
| Bigger card available | Point a slot at a larger model — same agent, more capability |
Ollama tuning
The installer sets these for you; if you’re running from source or tuning by hand, these are the values Riverforge expects. Restart Ollama after changing them.
| Variable | Value | Effect |
|---|---|---|
| OLLAMA_CONTEXT_LENGTH | 16384 | Default server context — avoids Ollama’s 2048-token silent truncation |
| OLLAMA_MAX_LOADED_MODELS | 3 | Lets the chat, embedding and helper models coexist |
| OLLAMA_KEEP_ALIVE | -1 | Keep models loaded indefinitely |
| OLLAMA_FLASH_ATTENTION | 1 | Enable flash attention on Ampere+ GPUs |
| OLLAMA_KV_CACHE_TYPE | q8_0 | Quantise the KV cache to reduce long-context VRAM pressure |
If responses look truncated, this is almost always Ollama’s default 2048-token context. Set OLLAMA_CONTEXT_LENGTH=16384 on the Ollama service and restart it. More fixes on the Troubleshooting page.
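When running from source, it can help to keep the tuning values in one place and check the environment against them. The following is a minimal sketch: the values are copied from the table above, while the helper names (`ollama_env`, `mismatches`) are illustrative and not part of Riverforge or Ollama.

```python
import os
from typing import Dict, Optional, Tuple

# Values copied from the Ollama tuning table above.
RIVERFORGE_OLLAMA_ENV = {
    "OLLAMA_CONTEXT_LENGTH": "16384",
    "OLLAMA_MAX_LOADED_MODELS": "3",
    "OLLAMA_KEEP_ALIVE": "-1",
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
}


def ollama_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` (default: the current environment) with the tuning values applied."""
    env = dict(os.environ if base is None else base)
    env.update(RIVERFORGE_OLLAMA_ENV)
    return env


def mismatches(env: Dict[str, str]) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each variable that differs from the table to (current value, expected value)."""
    return {
        name: (env.get(name), expected)
        for name, expected in RIVERFORGE_OLLAMA_ENV.items()
        if env.get(name) != expected
    }
```

If you start Ollama yourself, `subprocess.Popen(["ollama", "serve"], env=ollama_env())` launches it with these values. If Ollama runs as a system service, the variables have to be set on the service instead, and `mismatches(dict(os.environ))` only tells you about the shell you ran it from.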
Staying offline
Riverforge works offline by default; the web tools only run when you ask for research, and you can leave them off entirely for a fully air-gapped setup. Either way, no model call ever leaves your machine — you can prove it with netstat.
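Reading netstat output by eye gets tedious, so a small helper can flag any remote endpoint that is not a loopback address. This is a rough sketch, assuming numeric output (for example `netstat -n`) with endpoints written as `host:port`, which is the form Linux and Windows use; macOS separates the port with a dot and would need different parsing. The function names are illustrative.

```python
import ipaddress


def remote_host(endpoint: str) -> str:
    """Return the host part of 'host:port', '[v6]:port' or Linux-style 'v6:port'."""
    host, _, _port = endpoint.rpartition(":")
    return host.strip("[]")


def is_local(endpoint: str) -> bool:
    """True if the remote endpoint is loopback, i.e. the traffic stays on this machine."""
    host = remote_host(endpoint)
    if host in ("", "*", "0.0.0.0", "::"):  # listening sockets have no remote peer
        return True
    # Raises ValueError for hostnames, so run netstat with numeric output.
    return ipaddress.ip_address(host).is_loopback
```

Feed it the foreign-address column for the Ollama and Riverforge processes: with the web tools off, every entry should come back True.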