Answer (RAG)
Retrieve context from the store and generate an answer with the local LLM.
POST
Overview
Run retrieval-augmented generation (RAG) on the local store:
- Search using the provided `query_vector` (same dimension as the store).
- Build a prompt from the top matching passages.
- Call Ollama on the host with the fixed model `llama3.2:1b-instruct-q4_K_M`.
Embed the query locally before calling this endpoint.
Requires Ollama running on the host. Start the stack with `moorcheh-edge up` (use `--skip-ollama` for search-only).
Request body
Original question text (included in the LLM prompt and echoed in the response).
JSON array of floats used for similarity search. Length must match the store dimension (768 for text stores).
Number of passages to retrieve for context. Capped at 100.
Minimum search score applied when `kiosk_mode` is true.
When true, filters out retrieved passages that score below `threshold`.
Optional system instruction (replaces the default RAG system prompt).
Optional instruction appended before the user question in the final user message.
Prior turns: `[{"role": "user"|"assistant", "content": "..."}]`.
LLM sampling temperature (0.0–2.0).
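The constraints above can be checked on the client before sending a request. The following is a minimal sketch, not the service's own validation: `build_answer_payload` is a hypothetical helper, and the `query` key for the question text is an assumption based on the response echoing it as `query`. Only `query_vector`, `kiosk_mode`, and `threshold` are named in this section, so the other optional parameters are left out rather than guessed.

```python
import math
from typing import Any, Dict, Optional, Sequence

TEXT_STORE_DIM = 768  # dimension stated above for text stores


def build_answer_payload(
    question: str,
    query_vector: Sequence[float],
    *,
    kiosk_mode: bool = False,
    threshold: Optional[float] = None,
    store_dim: int = TEXT_STORE_DIM,
) -> Dict[str, Any]:
    """Build a JSON-serializable request body (hypothetical helper)."""
    # The vector length must match the store dimension.
    if len(query_vector) != store_dim:
        raise ValueError(
            f"query_vector has {len(query_vector)} values, expected {store_dim}"
        )
    vector = [float(x) for x in query_vector]
    # NaN and infinity are not valid JSON numbers, so reject them early.
    if not all(math.isfinite(x) for x in vector):
        raise ValueError("query_vector must contain only finite floats")

    # Assumption: the question text is sent under the key "query".
    payload: Dict[str, Any] = {"query": question, "query_vector": vector}
    if kiosk_mode:
        payload["kiosk_mode"] = True
    if threshold is not None:
        payload["threshold"] = float(threshold)
    return payload
```

Pass the resulting dict to the HTTP client of your choice as the JSON body of the POST request.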
Response fields
| Field | Description |
|---|---|
| answer | Generated answer text |
| model | LLM model id (llama3.2:1b-instruct-q4_K_M) |
| query | Echo of the request question |
| context_count | Number of passages passed to the LLM |
| sources | Search hits used as context (same shape as /search results) |
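A client might map these fields onto a small typed structure. The sketch below assumes the response is a flat JSON object with exactly the field names in the table; `AnswerResponse`, `parse_answer_response`, and `SAMPLE_RESPONSE` are illustrative names, and the sample body is made up for demonstration. Because the shape of each `sources` entry is defined by the /search results, the entries are kept here as opaque dicts.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AnswerResponse:
    answer: str
    model: str
    query: str
    context_count: int
    # Each entry mirrors a /search hit; its keys are not interpreted here.
    sources: List[Dict[str, Any]] = field(default_factory=list)


def parse_answer_response(raw: str) -> AnswerResponse:
    """Parse a JSON response body into an AnswerResponse."""
    data = json.loads(raw)
    return AnswerResponse(
        answer=data["answer"],
        model=data["model"],
        query=data["query"],
        context_count=int(data["context_count"]),
        sources=list(data.get("sources", [])),
    )


# Illustrative body only; real answers and sources will differ.
SAMPLE_RESPONSE = json.dumps(
    {
        "answer": "Example answer text.",
        "model": "llama3.2:1b-instruct-q4_K_M",
        "query": "Example question?",
        "context_count": 0,
        "sources": [],
    }
)

if __name__ == "__main__":
    parsed = parse_answer_response(SAMPLE_RESPONSE)
    print(parsed.model, parsed.context_count)
```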
Errors
| Condition | Status | Message (example) |
|---|---|---|
| LLM not configured | 400 | LLM is not configured: start Ollama on the host and run moorcheh-edge up |
| LLM unreachable or error | 400 | LLM request failed |
| Empty store / no matches | 200 | Answer may state insufficient context |
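One way a caller could branch on these outcomes is sketched below. It rests on several assumptions: the error message is available as a plain string (the error body's shape is not specified here), the message texts in the table are examples and may differ in practice, and an empty store shows up as `context_count == 0` in a 200 response. `classify_outcome` and its return labels are hypothetical.

```python
from typing import Optional


def classify_outcome(
    status: int, message: str = "", context_count: Optional[int] = None
) -> str:
    """Map an HTTP status, message, and context_count to a coarse label."""
    if status == 200:
        # A 200 with no retrieved passages may carry an
        # "insufficient context" style answer.
        return "no_context" if context_count == 0 else "ok"
    if status == 400:
        text = message.lower()
        # Substring checks are brittle: the documented messages are examples.
        if "not configured" in text:
            return "llm_not_configured"
        if "request failed" in text:
            return "llm_failed"
        return "bad_request"
    return "unexpected"
```

For `llm_not_configured`, the remedy given in the table is to start Ollama on the host and run `moorcheh-edge up`.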