Chat and Agents

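llamaR turns a local GGUF model into a chat backend for the R ecosystem. You can talk to it in three ways: through a chat object, through the server directly, or from the command line. The last two sections cover ragnar integration and concurrency.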

1. The chat object: `chat_llamar()`

chat_llamar() returns an ellmer Chat. It has two modes, selected by which argument you pass, much as DBI::dbConnect() accepts either connection parameters or a ready connection.

Mode A — spawn a server for a model

Give it a model file and it starts llama_serve_openai() in a background process (via the callr package), waits for the server to come up, and points a Chat at it. The server’s lifetime is tied to the returned object: when that object is garbage-collected (or R exits), the process is killed.

chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")

chat$chat("Why is the sky blue?")

chat_llamar_stop(chat)   # stop the spawned server (or just let GC do it)
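
The lifetime rule described above (stop explicitly, or let garbage collection do it) can be illustrated with base R's reg.finalizer(). This is only a sketch of the general technique, not the package's actual implementation: new_server_handle() and stop_server() are hypothetical names, and an environment with an alive flag stands in for the real background process.

```r
# Hypothetical cleanup routine. A real implementation would kill the
# background server process here instead of flipping a flag.
stop_server <- function(handle) {
  if (isTRUE(handle$alive)) {
    handle$alive <- FALSE
  }
  invisible(handle)
}

# Hypothetical constructor: the environment stands in for a server handle.
new_server_handle <- function() {
  handle <- new.env()
  handle$alive <- TRUE
  # Ask R to call stop_server(handle) when the handle is garbage-collected,
  # and also at the end of the R session (onexit = TRUE).
  reg.finalizer(handle, stop_server, onexit = TRUE)
  handle
}

h <- new_server_handle()
stop_server(h)   # explicit stop, analogous to chat_llamar_stop(chat)
```

Because the cleanup routine checks the flag first, calling it explicitly and then again from the finalizer is harmless.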

Large models can take a while to load from disk. Raise timeout (default 180 seconds) if, for example, a 14B model at Q8 does not come up in time:

chat <- chat_llamar(model_path = "Qwen3-14B-Q8_0.gguf", timeout = 300)
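
Waiting for a server "to come up" within a timeout usually amounts to a polling loop. The sketch below shows that pattern in base R under stated assumptions: wait_until_ready() is a hypothetical helper, and the readiness check is a stand-in function rather than a real HTTP probe. How chat_llamar() actually waits is not shown in this document.

```r
# Poll is_ready() until it returns TRUE or `timeout` seconds have elapsed.
# Returns TRUE on success and FALSE on timeout.
wait_until_ready <- function(is_ready, timeout = 180, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(is_ready())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in readiness check that succeeds on the third attempt.
attempts <- 0
fake_check <- function() {
  attempts <<- attempts + 1
  attempts >= 3
}

ok <- wait_until_ready(fake_check, timeout = 5, interval = 0.01)
```

A larger timeout only changes the deadline; the loop still returns as soon as the check succeeds.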

Mode B — connect to a running server

If a server is already running (in another process, or as one of a pool), pass its URL. No process is spawned.

# In another process / shell:
#   llama_serve_openai("model.gguf", port = 11434L)

chat <- chat_llamar(base_url = "http://127.0.0.1:11434/v1")
chat$chat("Hello!")

System prompt

chat <- chat_llamar(
  model_path    = "Ministral-3B-Instruct.gguf",
  system_prompt = "You are a concise assistant. Answer in one sentence."
)
chat$chat("What is R?")

Under the hood. chat_llamar() wraps ellmer::chat_vllm(), which talks to the /v1/chat/completions endpoint, the de facto standard that the server implements. (ellmer’s chat_openai() targets OpenAI’s newer /v1/responses API, which the server does not implement.)
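
To make the protocol concrete, here is a minimal sketch of the body a client sends to /v1/chat/completions: a model id plus a list of role/content messages. chat_request_body() is a hypothetical helper written for illustration; it is not part of llamaR or ellmer, and ellmer builds its own requests internally.

```r
# Build the body of a POST /v1/chat/completions request as a nested list.
# Hypothetical helper, for illustration only.
chat_request_body <- function(user_message, system_prompt = NULL,
                              model = "model", stream = FALSE) {
  messages <- list()
  if (!is.null(system_prompt)) {
    messages[[length(messages) + 1]] <-
      list(role = "system", content = system_prompt)
  }
  messages[[length(messages) + 1]] <-
    list(role = "user", content = user_message)
  list(model = model, messages = messages, stream = stream)
}

body <- chat_request_body("What is R?",
                          system_prompt = "Answer in one sentence.")
str(body)
# With the jsonlite package installed,
# jsonlite::toJSON(body, auto_unbox = TRUE) turns this into the JSON payload.
```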

2. The server: `llama_serve_openai()`

chat_llamar(model_path=) is a convenience wrapper; you can also run the server directly, for example to serve non-R clients. The server needs the optional drogonR package for the HTTP/SSE layer.

llama_serve_openai("model.gguf", port = 11434L, n_ctx = 8192L)

The call blocks and serves two endpoints:

GET /v1/models
POST /v1/chat/completions (both non-streaming and streaming, with "stream": true)

Point any OpenAI client at http://127.0.0.1:11434/v1:

curl http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"model","messages":[{"role":"user","content":"Hello"}]}'

A runnable launcher lives at inst/examples/serve_openai.R.

Connecting OpenCode

Add an OpenAI-compatible provider in opencode.json (see the one in this repo) with baseURL set to http://127.0.0.1:11434/v1 and the model id matching what /v1/models reports.

3. The command-line example

inst/examples/chat.R wraps both modes for the terminal:

# Spawn a server for the model and open an interactive prompt
Rscript inst/examples/chat.R model.gguf

# Positional [port] [n_ctx], plus flags
Rscript inst/examples/chat.R model.gguf 11434 8192 \
  --system "Be concise." --timeout 300

# One-shot: a trailing message prints a single reply and exits
Rscript inst/examples/chat.R model.gguf "Why is the sky blue?"

# Connect to a server you already started
Rscript inst/examples/chat.R --url http://127.0.0.1:11434/v1

In interactive mode, type a message and press Enter; a blank line or Ctrl-D quits. A spawned server is stopped automatically on exit.

4. ragnar: retrieval-augmented chat

Because chat_llamar() returns a real ellmer::Chat, it plugs into ragnar. Pair it with embed_llamar() (see vignette("getting-started")) for a fully local RAG stack: local embeddings for the store, local generation for the chat.

library(ragnar)

store <- ragnar_store_create(
  location = "store.duckdb",
  embed    = embed_llamar(model = "embedding-model.gguf")
)
ragnar_store_insert(store, documents)
ragnar_store_build_index(store)

chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")
ragnar_register_tool_retrieve(chat, store)
chat$chat("What do the documents say about X?")

Note. Tool calling is mediated by the OpenAI protocol, so it works only as far as the server implements it. The current server does not emit tool_calls yet, so a model will not autonomously invoke the registered retrieve tool. Plain chat and manual retrieval work today; automatic tool-driven retrieval is on the roadmap (see TODO.md).
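
Manual retrieval, as mentioned in the note, means fetching relevant chunks yourself and placing them in the prompt. The sketch below shows only the prompt-assembly step in base R. build_rag_prompt() is a hypothetical helper, the chunks are made-up stand-ins, and the ragnar call named in the comment is an assumption to verify against the ragnar documentation.

```r
# Combine retrieved chunks and a question into a single prompt string.
# Hypothetical helper, for illustration only.
build_rag_prompt <- function(question, chunks) {
  context <- paste(sprintf("[%d] %s", seq_along(chunks), chunks),
                   collapse = "\n\n")
  paste0(
    "Answer the question using only the context below.\n\n",
    "Context:\n", context, "\n\n",
    "Question: ", question
  )
}

# Stand-in chunks. With a real store they might come from something like
# ragnar_retrieve(store, question)$text (assumed usage).
chunks <- c("llamaR serves GGUF models.",
            "The server handles one request at a time.")
prompt <- build_rag_prompt("How many requests run at once?", chunks)
cat(prompt)
# chat$chat(prompt)   # send it to a Chat created with chat_llamar()
```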

5. Concurrency

The server is single-sequence: it handles one request at a time on the main R thread. That is enough for a single local user or agent. For parallel sessions, run a pool of servers on different ports and create one chat_llamar(base_url=) per worker. The worker-pool architecture is described in TODO.md.

ports <- c(11434L, 11435L, 11436L)
chats <- lapply(ports, function(p)
  chat_llamar(base_url = sprintf("http://127.0.0.1:%d/v1", p)))
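
One simple way to spread work over such a pool is round-robin assignment. The sketch below only decides which prompt goes to which port. The prompts are invented examples, and running the batches in parallel (for instance in separate R processes) is left out.

```r
ports   <- c(11434L, 11435L, 11436L)
prompts <- c("Summarize A", "Summarize B", "Summarize C", "Summarize D")

# Assign each prompt to a worker index in round-robin order: 1, 2, 3, 1, ...
worker <- rep_len(seq_along(ports), length(prompts))

# Group the prompts by the port of the server that should handle them.
by_port <- split(prompts, factor(ports[worker], levels = ports))
str(by_port)
```

Each element of by_port would then be sent through the chat connected to that port, so that no single server receives two requests at once.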