Models & hardware

Riverforge runs every model locally through Ollama. The model is interchangeable on purpose — point it at any Ollama model and Riverforge adapts how it prompts and runs to suit it. A small 4B model runs happily on a laptop GPU; go bigger if you have the VRAM.

Bring your own model

Riverforge isn’t tied to one model. Pull any model into Ollama, point a slot at it, and Riverforge adapts how it prompts and samples to suit that model — so it just works:

Gemma 4 Turbo, Qwen, Llama, Phi, DeepSeek-R1, your fine-tune

It tunes itself to each model family’s recommended settings instead of forcing one configuration on everything, and any reasoning the model produces is shown in a separate “thinking” card so your answer stays clean.

Adding a model is standard Ollama. Pull whatever you want from the command line — ollama pull <model> — and it’s immediately available to Riverforge. Open the Models panel, hit Refresh, and select it. Nothing Riverforge-specific to learn.
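
Before hitting Refresh, it can help to confirm that Ollama actually has the model. The sketch below asks Ollama's local HTTP API for its installed models through the `/api/tags` endpoint. It assumes Ollama is listening on its default address, `http://localhost:11434`; the function names are illustrative and are not part of Riverforge.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address


def parse_model_names(payload: dict) -> list[str]:
    """Extract sorted model names from an /api/tags response body."""
    return sorted(m["name"] for m in payload.get("models", []))


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))
```

With Ollama running, `list_local_models()` returns the same names that `ollama list` prints, so a model missing from that list has not been pulled yet.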

The model slots

Riverforge has two model slots, plus a separate thinking control:

| Slot | What it does |
| --- | --- |
| Planner | Planning, replies, Ask mode and finalising |
| Executor | Fast code and tool execution |
| Thinking | When the model produces extended reasoning: Auto, Adaptive, On or Off |

By default both Planner and Executor use the chat model you chose at install. On an 8 GB card that’s Gemma 4 Turbo, picked for speed; if you have more VRAM, a larger model such as Qwen 3.5 9B or Gemma 4 26B is more capable — see Choosing your model. Thinking mode is explained on the Chat Modes page.

Switching models

There are three places to change models, and all of them do the same thing.

Switching a model doesn’t need a restart — the next message uses it. Make sure the model is pulled in Ollama first (ollama pull <model>), then hit Refresh in the Models panel so it appears in the list.

Want a bigger model just for a tough task? Switch a slot, do the work, then switch back. Keep a fast model for everyday work and reach for a larger one only when it earns it.

Hardware fit

Riverforge was designed low-end first and scales straight up. The same agent simply gets smarter as your hardware does — larger models, longer context, a deeper brain.

[Hardware scale graphic: 8 GB · 32 GB, 16 GB, 24 GB+ (more capable)]

| Concern | What Riverforge does |
| --- | --- |
| 8 GB VRAM budget | One resident coding model shared across slots so runs don’t thrash VRAM |
| GPU needed for games / 3D | Pause VRAM unloads models instantly while the server stays up; Resume VRAM warms it again |
| Cold-start latency | The chat and embed models are pre-warmed in the background, and kept warm after every run |
| Bigger card available | Point a slot at a larger model: same agent, more capability |
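
Ollama's own API offers one way to get this kind of load and unload behaviour, sketched below: a generate request with no prompt loads a model, and `keep_alive` controls how long it stays resident (`0` unloads it straight away, `-1` keeps it loaded indefinitely). This is not necessarily how Riverforge implements Pause VRAM and Resume VRAM; the helper names are placeholders, and the address is assumed to be Ollama's default.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address


def residency_payload(model: str, resident: bool) -> dict:
    """Build an /api/generate body that only loads or unloads a model.

    With no prompt, Ollama generates nothing: keep_alive=-1 loads the model
    and keeps it in memory, keep_alive=0 unloads it immediately.
    """
    return {"model": model, "keep_alive": -1 if resident else 0}


def set_residency(model: str, resident: bool, base_url: str = OLLAMA_URL) -> None:
    """Send the load/unload request to the local Ollama server."""
    body = json.dumps(residency_payload(model, resident)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()  # drain the response; the body is not needed here
```

Calling `set_residency(name, False)` before launching a game frees the VRAM that model was using, and `set_residency(name, True)` warms it again afterwards.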

Ollama tuning

The installer sets these for you; if you’re running from source or tuning by hand, these are the values Riverforge expects. Restart Ollama after changing them.

| Variable | Value | Effect |
| --- | --- | --- |
| OLLAMA_CONTEXT_LENGTH | 16384 | Default server context; avoids Ollama’s 2048-token silent truncation |
| OLLAMA_MAX_LOADED_MODELS | 3 | Lets the chat, embedding and helper models coexist |
| OLLAMA_KEEP_ALIVE | -1 | Keep models loaded indefinitely |
| OLLAMA_FLASH_ATTENTION | 1 | Enable flash attention on Ampere+ GPUs |
| OLLAMA_KV_CACHE_TYPE | q8_0 | Quantise the KV cache to reduce long-context VRAM pressure |
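
To see why quantising the KV cache matters at a 16384-token context, here is a rough size estimate. The model dimensions are made up for illustration and do not describe any specific model. The byte counts assume the usual GGML layouts, where f16 uses 2 bytes per value and q8_0 stores each block of 32 values in 34 bytes. Real usage differs with architecture and runtime overhead.

```python
# Bytes used per block of 32 cached values (assumed GGML layouts).
BYTES_PER_32_VALUES = {"f16": 64, "q8_0": 34}


def kv_cache_bytes(layers, kv_heads, head_dim, context, cache_type):
    """Estimate KV cache size: keys and values for every layer and token."""
    values = 2 * layers * kv_heads * head_dim * context  # 2 = keys + values
    return values * BYTES_PER_32_VALUES[cache_type] // 32


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
for cache_type in ("f16", "q8_0"):
    size = kv_cache_bytes(32, 8, 128, 16384, cache_type)
    print(f"{cache_type}: {size / 1024**3:.2f} GiB")
```

For these assumed dimensions the cache comes to about 2 GiB at f16 and a little over 1 GiB at q8_0, which is a meaningful share of an 8 GB card.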

If responses look truncated, this is almost always Ollama’s default 2048-token context. Set OLLAMA_CONTEXT_LENGTH=16384 on the Ollama service and restart it. More fixes on the Troubleshooting page.
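
When tuning by hand, a small check like the one below can confirm the values from the table above. It is only a sketch: it reads the environment of the process that runs it, so it is meaningful only when run in the same environment the Ollama service starts with (on setups where Ollama runs as a system service, its variables are configured separately). The function name is illustrative.

```python
import os

# Values Riverforge expects, taken from the Ollama tuning table.
EXPECTED = {
    "OLLAMA_CONTEXT_LENGTH": "16384",
    "OLLAMA_MAX_LOADED_MODELS": "3",
    "OLLAMA_KEEP_ALIVE": "-1",
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
}


def find_mismatches(env):
    """Return {variable: (expected, actual)} for every unset or wrong value."""
    return {
        name: (want, env.get(name))
        for name, want in EXPECTED.items()
        if env.get(name) != want
    }


# Report anything that differs in the current process environment.
for name, (want, got) in find_mismatches(os.environ).items():
    print(f"{name}: expected {want}, found {got!r}")
```

No output means every variable matches; an entry reported as `None` is unset.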

Staying offline

Riverforge works offline by default; the web tools only run when you ask for research, and you can leave them off entirely for a fully air-gapped setup. Either way, no model call ever leaves your machine — you can prove it with netstat.
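
One way to read that netstat output: every peer address on the Ollama connections should be a loopback address. The sketch below classifies endpoints written as `host:port`, the form netstat uses on Linux and Windows (IPv6 hosts may appear in brackets); other systems format addresses differently. The function names and sample addresses are illustrative, and port 11434 is assumed to be Ollama's default.

```python
import ipaddress


def is_loopback_endpoint(endpoint: str) -> bool:
    """True if a 'host:port' endpoint points at this machine's loopback."""
    host, _, _port = endpoint.rpartition(":")
    host = host.strip("[]")  # IPv6 hosts may be shown as [::1]
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return host == "localhost"


def non_local_peers(endpoints):
    """Return the endpoints that are NOT on the loopback interface."""
    return [e for e in endpoints if not is_loopback_endpoint(e)]
```

For example, `non_local_peers(["127.0.0.1:11434", "[::1]:11434"])` returns an empty list, which is what a fully local setup should look like.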