Models & hardware

Riverforge runs every model locally through Ollama. The model is interchangeable by design: point a slot at any Ollama model and Riverforge adapts its prompting and sampling to suit it. A small 4B model runs comfortably on a laptop GPU; go bigger if you have the VRAM.

Bring your own model

Riverforge isn’t tied to one model. Pull any model into Ollama, point a slot at it, and Riverforge adapts how it prompts and samples to suit that model — so it just works:

Gemma 4 Turbo · Qwen · Llama · Phi · DeepSeek-R1 · your fine-tune

It tunes itself to each model family’s recommended settings instead of forcing one configuration on everything, and any reasoning the model produces is shown in a separate “thinking” card so your answer stays clean.


Adding a model is standard Ollama. Pull whatever you want from the command line — ollama pull <model> — and it’s immediately available to Riverforge. Open the Models panel, hit Refresh, and select it. Nothing Riverforge-specific to learn.
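To confirm a pull succeeded before hitting Refresh, you can ask Ollama directly which models it has. The sketch below uses Ollama's own `GET /api/tags` endpoint on its default address (`http://localhost:11434`); it talks only to Ollama, not to Riverforge, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def parse_model_names(tags_json: str) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    payload = json.loads(tags_json)
    return sorted(model["name"] for model in payload.get("models", []))


def installed_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the local Ollama server which models are currently pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(resp.read().decode("utf-8"))


# Usage (requires a running Ollama server):
#   for name in installed_models():
#       print(name)
```

Any name this prints should also show up in the Models panel after a Refresh.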

The model slots

Riverforge has two model slots, plus a separate thinking control:

Slot	What it does
Planner	Planning, replies, Ask mode and finalising
Executor	Fast code and tool execution
Thinking	When the model produces extended reasoning — Auto, Adaptive, On or Off

By default both Planner and Executor use the chat model you chose at install. On an 8 GB card that’s Gemma 4 Turbo, picked for speed; if you have more VRAM, a larger model such as Qwen 3.5 9B or Gemma 4 26B is more capable — see Choosing your model. Thinking mode is explained on the Chat Modes page.

Switching models

You can change models from any of these places, and they all do the same thing:

The Models panel in the VS Code chat composer (with Refresh to re-scan your installed models).
The local server API endpoint POST /models/config.

Switching a model doesn’t need a restart — the next message uses it. Make sure the model is pulled in Ollama first (ollama pull <model>), then hit Refresh in the Models panel so it appears in the list.
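If you want to script a switch through the API, a request might be built roughly as follows. This is only a sketch: the page names the endpoint (`POST /models/config`) but not the server address or the request body, so the base URL and the `planner`/`executor` JSON fields below are assumptions to replace with whatever your install actually uses.

```python
import json
import urllib.request

# Hypothetical address: the local server's host and port are not given here.
RIVERFORGE_URL = "http://127.0.0.1:8000"


def build_switch_request(planner: str, executor: str,
                         base_url: str = RIVERFORGE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /models/config request.

    The JSON field names are assumed for illustration, not documented.
    """
    body = json.dumps({"planner": planner, "executor": executor}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/models/config",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (sends the request; the model must already be pulled in Ollama):
#   req = build_switch_request("<larger-model>", "<fast-model>")
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(resp.status)
```

Building the request separately from sending it makes it easy to inspect the payload before pointing it at a live server.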


Want a bigger model just for a tough task? Switch a slot, do the work, then switch back. Keep a fast model for everyday work and reach for a larger one only when it earns it.

Hardware fit

Riverforge was designed low-end first and scales straight up. The same agent simply gets smarter as your hardware does — larger models, longer context, a deeper brain.

8 GB · 32 GB → 16 GB → 24 GB+ → more capable

Concern	What Riverforge does
8 GB VRAM budget	One resident coding model shared across slots so runs don’t thrash VRAM
GPU needed for games / 3D	Pause VRAM unloads models instantly while the server stays up; Resume VRAM warms them again
Cold-start latency	The chat and embed models are pre-warmed in the background, and kept warm after every run
Bigger card available	Point a slot at a larger model — same agent, more capability

Ollama tuning

The installer sets these for you; if you’re running from source or tuning by hand, these are the values Riverforge expects. Restart Ollama after changing them.

Variable	Value	Effect
`OLLAMA_CONTEXT_LENGTH`	`16384`	Default server context — avoids Ollama’s 2048-token silent truncation
`OLLAMA_MAX_LOADED_MODELS`	`3`	Lets the chat, embedding and helper models coexist
`OLLAMA_KEEP_ALIVE`	`-1`	Keep models loaded indefinitely
`OLLAMA_FLASH_ATTENTION`	`1`	Enable flash attention on Ampere+ GPUs
`OLLAMA_KV_CACHE_TYPE`	`q8_0`	Quantise the KV cache to reduce long-context VRAM pressure
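
When tuning by hand, a quick check like the one below can flag variables that differ from the table. It is a minimal sketch with an illustrative helper name, and it only sees the environment of the process that runs it; if Ollama runs as a service, its environment may be configured elsewhere, so treat the result as a hint rather than proof.

```python
import os

# Values taken from the table above.
EXPECTED = {
    "OLLAMA_CONTEXT_LENGTH": "16384",
    "OLLAMA_MAX_LOADED_MODELS": "3",
    "OLLAMA_KEEP_ALIVE": "-1",
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
}


def find_mismatches(env) -> dict:
    """Return {name: (expected, actual)} for every variable that differs.

    `actual` is None when the variable is not set at all.
    """
    return {
        name: (want, env.get(name))
        for name, want in EXPECTED.items()
        if env.get(name) != want
    }


# Usage:
#   for name, (want, got) in find_mismatches(os.environ).items():
#       print(f"{name}: expected {want}, found {got}")
```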


If responses look truncated, this is almost always Ollama’s default 2048-token context. Set OLLAMA_CONTEXT_LENGTH=16384 on the Ollama service and restart it. More fixes on the Troubleshooting page.

Staying offline

Riverforge works offline by default; the web tools only run when you ask for research, and you can leave them off entirely for a fully air-gapped setup. Either way, no model call ever leaves your machine — you can prove it with netstat.
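
One way to read netstat output for this check is to collect the remote (foreign) addresses of the relevant connections and look for anything that is not loopback. The helper below is an illustrative sketch of that filter only; it assumes you have already extracted the remote hosts yourself, since netstat's output format differs between platforms.

```python
import ipaddress


def non_loopback_peers(remote_hosts: list[str]) -> list[str]:
    """Return the remote hosts that are not loopback addresses.

    Wildcard addresses (0.0.0.0, ::) are skipped because netstat shows them
    for listening sockets, which have no peer. Anything that is not a plain
    IP address (for example a hostname) is flagged for a manual look.
    """
    flagged = []
    for host in remote_hosts:
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            flagged.append(host)
            continue
        if addr.is_unspecified or addr.is_loopback:
            continue
        flagged.append(host)
    return flagged


# Usage:
#   non_loopback_peers(["127.0.0.1", "::1", "0.0.0.0"])  # nothing flagged
```

An empty result for the connections made during a model call is consistent with the traffic staying on your machine.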