llamaR turns a local GGUF model into a chat backend for the R ecosystem. You can talk to it three ways, from lowest to highest level:
Server: llama_serve_openai() exposes an OpenAI-compatible API any client can hit (OpenCode, the openai SDK, curl).
Chat: chat_llamar() returns an ellmer::Chat, so the whole ellmer / ragnar toolchain works against local inference.
CLI: inst/examples/chat.R wraps both for quick use.

chat_llamar()

chat_llamar() returns an ellmer Chat. It has two modes, picked by which argument you pass: model_path to spawn a server, or base_url to connect to one that is already running. This is the same DBI-style choice as DBI::dbConnect() (connection parameters or a ready connection).
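The argument-based choice can be pictured with a small base R sketch. The helper name resolve_mode() and its error for the "both or neither" case are assumptions made for illustration; the text above does not say how chat_llamar() itself handles that case.

```r
# Hypothetical helper (not part of llamaR): decide the mode from which
# argument was supplied.
resolve_mode <- function(model_path = NULL, base_url = NULL) {
  has_model <- !is.null(model_path)
  has_url <- !is.null(base_url)
  # Assumption: exactly one of the two arguments must be given.
  if (has_model == has_url) {
    stop("Pass exactly one of `model_path` or `base_url`.", call. = FALSE)
  }
  if (has_model) "spawn" else "connect"
}

resolve_mode(model_path = "Ministral-3B-Instruct.gguf")  # "spawn"
resolve_mode(base_url = "http://127.0.0.1:11434/v1")      # "connect"
```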
Give it a model file and it starts llama_serve_openai()
in a background process (via the callr package), waits
for it to come up, and points a Chat at it. The server’s
lifetime is tied to the returned object: when it is garbage-collected
(or R exits) the process is killed.
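Tying a resource's lifetime to an R object is typically done with a finalizer. The sketch below shows the general mechanism in base R under stated assumptions: the handle is a plain environment standing in for a background process, and both function names are hypothetical. It is not llamaR's actual code.

```r
# Stand-in for a spawned server: an environment that records whether the
# "process" is alive. A real handle would hold a background process object.
new_server_handle <- function() {
  handle <- new.env()
  handle$alive <- TRUE
  # Run stop_server_handle() when the handle is collected or when R exits.
  reg.finalizer(handle, stop_server_handle, onexit = TRUE)
  handle
}

# Idempotent stop: safe to call explicitly and again later from the finalizer.
stop_server_handle <- function(handle) {
  if (isTRUE(handle$alive)) {
    handle$alive <- FALSE  # a real implementation would kill the process here
  }
  invisible(handle)
}

h <- new_server_handle()
stop_server_handle(h)  # explicit stop, analogous to chat_llamar_stop()
h$alive                # FALSE
```

Making the stop function idempotent matters because the finalizer still runs after an explicit stop.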
chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")
chat$chat("Why is the sky blue?")
chat_llamar_stop(chat) # stop the spawned server (or just let GC do it)
Large models can take a while to load from disk; raise timeout (default 180s) if, say, a 14B model at Q8 doesn’t come up in time.
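As a sketch of raising the limit, the call below passes a longer timeout for a large model. The file name and the 600-second value are placeholders, and the guard exists only so the snippet can be run safely when llamaR or the model file is absent.

```r
model_path <- "large-model-Q8_0.gguf"  # placeholder file name
load_timeout <- 600                     # seconds; the default is 180

if (requireNamespace("llamaR", quietly = TRUE) && file.exists(model_path)) {
  # Same call as before, with a longer wait for the server to come up.
  chat <- llamaR::chat_llamar(model_path = model_path, timeout = load_timeout)
  chat$chat("Why is the sky blue?")
  llamaR::chat_llamar_stop(chat)
}
```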
If you already run a server (in another process, or a pool of them), pass its URL. No process is spawned.
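A minimal sketch of connect mode, assuming a server is listening on the default address used elsewhere in this document. The server_is_up() probe is a hypothetical helper, not part of llamaR; it only checks that GET /v1/models answers before a Chat is created.

```r
base_url <- "http://127.0.0.1:11434/v1"

# Hypothetical helper (not part of llamaR): TRUE if <base_url>/models responds.
server_is_up <- function(base_url) {
  tryCatch(
    {
      readLines(paste0(base_url, "/models"), warn = FALSE)
      TRUE
    },
    warning = function(w) FALSE,
    error = function(e) FALSE
  )
}

if (requireNamespace("llamaR", quietly = TRUE) && server_is_up(base_url)) {
  # No process is spawned; the Chat simply points at the running server.
  chat <- llamaR::chat_llamar(base_url = base_url)
  chat$chat("Why is the sky blue?")
}
```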
Pass system_prompt to set the system message for the conversation:
chat <- chat_llamar(
  model_path = "Ministral-3B-Instruct.gguf",
  system_prompt = "You are a concise assistant. Answer in one sentence."
)
chat$chat("What is R?")
Under the hood. chat_llamar() wraps ellmer::chat_vllm(), which talks to the server’s /v1/chat/completions endpoint, the de-facto standard our server implements. (ellmer’s chat_openai() targets OpenAI’s newer /v1/responses API, which the server does not implement.)

llama_serve_openai()

chat_llamar(model_path=) is a convenience wrapper; you
can run the server directly for non-R clients. It needs the optional
drogonR package for the HTTP/SSE layer.
It blocks, serving:
GET /v1/models
POST /v1/chat/completions (both blocking and stream = true)
Point any OpenAI client at http://127.0.0.1:11434/v1:
curl http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"model","messages":[{"role":"user","content":"Hello"}]}'
A runnable launcher lives at inst/examples/serve_openai.R.
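The same non-streaming request can be sent from R without going through ellmer. This is a sketch that assumes the httr2 package is installed and that the response follows the usual OpenAI chat-completions shape; completion_body() and ask_server() are hypothetical helpers, not part of llamaR.

```r
# Build the JSON body as an R list; "model" mirrors the id in the curl call.
completion_body <- function(prompt, model = "model") {
  list(
    model = model,
    messages = list(list(role = "user", content = prompt))
  )
}

# Hypothetical helper: POST one prompt and return the reply text.
ask_server <- function(prompt, base_url = "http://127.0.0.1:11434/v1") {
  resp <- httr2::request(paste0(base_url, "/chat/completions")) |>
    httr2::req_body_json(completion_body(prompt)) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
  # Non-streaming responses carry the text in choices[[1]]$message$content.
  resp$choices[[1]]$message$content
}
```

With a server running, ask_server("Hello") should return the reply as a character string.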
Add an OpenAI-compatible provider in opencode.json (see
the one in this repo) with baseURL set to
http://127.0.0.1:11434/v1 and the model id matching what
/v1/models reports.
inst/examples/chat.R wraps both modes for the
terminal:
# Spawn a server for the model and open an interactive prompt
Rscript inst/examples/chat.R model.gguf
# Positional [port] [n_ctx], plus flags
Rscript inst/examples/chat.R model.gguf 11434 8192 \
  --system "Be concise." --timeout 300
# One-shot: a trailing message prints a single reply and exits
Rscript inst/examples/chat.R model.gguf "Why is the sky blue?"
# Connect to a server you already started
Rscript inst/examples/chat.R --url http://127.0.0.1:11434/v1
In interactive mode, type a message and press Enter; a blank line or Ctrl-D quits. A spawned server is stopped automatically on exit.
Because chat_llamar() returns a real
ellmer::Chat, it plugs into ragnar. Pair it with
embed_llamar() (see
vignette("getting-started")) for a fully local RAG stack:
local embeddings for the store, local generation for the chat.
library(ragnar)
store <- ragnar_store_create(
  location = "store.duckdb",
  embed = embed_llamar(model = "embedding-model.gguf")
)
ragnar_store_insert(store, documents)
ragnar_store_build_index(store)
chat <- chat_llamar(model_path = "Ministral-3B-Instruct.gguf")
ragnar_register_tool_retrieve(chat, store)
chat$chat("What do the documents say about X?")
Note. Tool calling is mediated by the OpenAI protocol, so it works only as far as the server implements it. The current server does not emit tool_calls yet, so a model will not autonomously invoke the registered retrieve tool. Plain chat and manual retrieval work today; automatic tool-driven retrieval is on the roadmap (see TODO.md).
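Manual retrieval can be sketched as "retrieve first, then ask", with no tool call involved. The two helper functions and the prompt wording below are assumptions for illustration; the sketch relies on ragnar::ragnar_retrieve() returning a data frame with a text column.

```r
# Pure helper: fold retrieved chunks and the question into one prompt.
build_rag_prompt <- function(question, chunks) {
  context <- paste(
    sprintf("[%d] %s", seq_along(chunks), chunks),
    collapse = "\n\n"
  )
  paste0(
    "Answer using only the context below.\n\n",
    "Context:\n", context, "\n\n",
    "Question: ", question
  )
}

# Hypothetical wrapper: query the store, then send the augmented prompt.
ask_with_context <- function(chat, store, question, top_k = 3L) {
  hits <- ragnar::ragnar_retrieve(store, question, top_k = top_k)
  chat$chat(build_rag_prompt(question, hits$text))
}
```

With the store and chat objects from the example above, a call would look like ask_with_context(chat, store, "What do the documents say about X?").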
The server is single-sequence: it handles one
request at a time on the main R thread. That is enough for a single
local user or agent. For parallel sessions, run a pool of servers on
different ports and create one chat_llamar(base_url=) per
worker — the worker-pool architecture is described in
TODO.md.
ports <- c(11434L, 11435L, 11436L)
chats <- lapply(ports, function(p)
  chat_llamar(base_url = sprintf("http://127.0.0.1:%d/v1", p)))

See also:
vignette("getting-started"): the rest of the package.
?chat_llamar, ?llama_serve_openai
inst/examples/chat.R, inst/examples/serve_openai.R