Answer (RAG) - Moorcheh Documentation

POST /answer

curl -X POST "http://localhost:8080/answer" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Who won the football match?",
    "query_vector": [0.01, -0.02, "... 768 floats ..."],
    "top_k": 5
  }'

{
  "answer": "Manchester United beat Chelsea 2-1.",
  "model": "llama3.2:1b-instruct-q4_K_M",
  "query": "Who won the football match?",
  "context_count": 1,
  "sources": [
    {
      "id": "doc-1",
      "score": 0.894123,
      "label": "Close Match",
      "text": "Manchester United beat Chelsea 2-1 in the Premier League on Saturday."
    }
  ]
}

Overview

Run retrieval-augmented generation (RAG) on the local store:

1. Search using the provided query_vector (same dimension as the store).
2. Build a prompt from the top matching passages.
3. Call Ollama on the host with the fixed model llama3.2:1b-instruct-q4_K_M.

The CLI and SDK embed the query text locally before calling this endpoint. When calling the HTTP API directly, you must supply query_vector yourself.

Requires Ollama running on the host. Start the stack with moorcheh-edge up (use --skip-ollama for search-only).
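
A minimal Python sketch of calling the endpoint directly, assuming the server listens on http://localhost:8080 as in the curl example. The helper names build_answer_request and ask are hypothetical, and the query vector must come from your own embedding step, which is not shown here.

```python
import json
import urllib.request

# Base URL taken from the curl example; adjust if your server runs elsewhere.
BASE_URL = "http://localhost:8080"


def build_answer_request(query, query_vector, top_k=5, base_url=BASE_URL):
    """Build (but do not send) a POST /answer request. Hypothetical helper."""
    payload = {
        "query": query,
        "query_vector": [float(x) for x in query_vector],
        "top_k": top_k,
    }
    return urllib.request.Request(
        f"{base_url}/answer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query, query_vector, top_k=5, timeout=60):
    """Send the request and return the decoded JSON response as a dict."""
    req = build_answer_request(query, query_vector, top_k)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Splitting request construction from sending keeps the payload easy to inspect without a running server. The 60-second timeout is an arbitrary choice for this sketch, not a documented value.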

Request body

query (string, required)

Original question text (included in the LLM prompt and echoed in the response).

query_vector (array, required)

JSON array of floats used for similarity search. Its length must match the store dimension (768 for text stores).
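
Because a length mismatch is easy to introduce, it may help to check the vector before sending it. The sketch below is a client-side check only; the function name check_query_vector is hypothetical, and how the server itself reports a mismatch is not specified here.

```python
import math

# 768 is the documented dimension for text stores; other stores may differ.
TEXT_STORE_DIM = 768


def check_query_vector(vector, expected_dim=TEXT_STORE_DIM):
    """Return the vector as a list of floats, or raise ValueError."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} floats, got {len(vector)}")
    floats = []
    for i, x in enumerate(vector):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(x, bool) or not isinstance(x, (int, float)):
            raise ValueError(f"element {i} is not a number: {x!r}")
        if not math.isfinite(x):
            raise ValueError(f"element {i} is not finite: {x!r}")
        floats.append(float(x))
    return floats
```

The string placeholder in the curl example ("... 768 floats ...") is shorthand for the remaining values; a real request needs numbers in every position.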

top_k (number, default: 5)

Number of passages to retrieve for context. Capped at 100.

threshold (number, default: 0)

Minimum search score a passage must reach to be kept. Applied only when kiosk_mode is true.

kiosk_mode (boolean, default: false)

When true, drops retrieved passages whose score is below threshold.
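
The interaction between threshold and kiosk_mode can be sketched as follows. This illustrates the documented behavior on the client side and is not the server's code: the function name filter_passages is hypothetical, the doc-2 hit is made up for the example, and keeping a hit whose score exactly equals threshold is an assumption.

```python
def filter_passages(hits, threshold=0.0, kiosk_mode=False):
    """Mimic the documented kiosk_mode filter on a list of search hits."""
    if not kiosk_mode:
        # threshold has no effect unless kiosk_mode is true.
        return list(hits)
    # Assumption: a hit scoring exactly `threshold` is kept.
    return [hit for hit in hits if hit["score"] >= threshold]


hits = [
    {"id": "doc-1", "score": 0.894123},
    {"id": "doc-2", "score": 0.31},  # made-up second hit for illustration
]

print([h["id"] for h in filter_passages(hits, threshold=0.5, kiosk_mode=True)])
# prints ['doc-1']
print(len(filter_passages(hits, threshold=0.5)))
# prints 2
```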

header_prompt (string)

Optional system instruction (replaces the default RAG system prompt).

footer_prompt (string)

Optional instruction placed before the user question in the final user message.

chat_history (array)

Prior conversation turns, each an object of the form {"role": "user"|"assistant", "content": "..."}.
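
A short sketch of building a chat_history array in the shape described above. The helper name make_chat_history is hypothetical, the follow-up question is invented for the example, and listing turns oldest first is an assumption.

```python
ALLOWED_ROLES = {"user", "assistant"}


def make_chat_history(turns):
    """Convert (role, content) pairs into the chat_history array shape."""
    history = []
    for role, content in turns:
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        history.append({"role": role, "content": content})
    return history


# Assumption: turns are listed oldest first.
history = make_chat_history([
    ("user", "Who won the football match?"),
    ("assistant", "Manchester United beat Chelsea 2-1."),
])

# Fragment of a request body; query_vector and other fields are omitted here.
payload_fragment = {"query": "What was the score?", "chat_history": history}
```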

temperature (number, default: 0.2)

LLM sampling temperature, from 0.0 to 2.0.

Response fields

Field	Description
`answer`	Generated answer text
`model`	LLM model id (`llama3.2:1b-instruct-q4_K_M`)
`query`	Echo of the request question
`context_count`	Number of passages passed to the LLM
`sources`	Search hits used as context (same shape as `/search` results)
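
The response fields map naturally onto typed records. The sketch below parses a response body into dataclasses; the class and function names are hypothetical, and the fields on each source are taken from the example response only, so real /search results may carry more.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    id: str
    score: float
    label: str
    text: str


@dataclass
class AnswerResponse:
    answer: str
    model: str
    query: str
    context_count: int
    sources: List[Source]


def parse_answer(raw: str) -> AnswerResponse:
    """Parse a JSON response body from POST /answer."""
    data = json.loads(raw)
    sources = [
        # label and text default to "" in case a hit omits them (assumption).
        Source(s["id"], float(s["score"]), s.get("label", ""), s.get("text", ""))
        for s in data.get("sources", [])
    ]
    return AnswerResponse(
        answer=data["answer"],
        model=data["model"],
        query=data["query"],
        context_count=int(data["context_count"]),
        sources=sources,
    )
```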

Errors

Condition	Status	Message (example)
LLM not configured	`400`	`LLM is not configured: start Ollama on the host and run moorcheh-edge up`
LLM unreachable or error	`400`	`LLM request failed`
Empty store / no matches	`200`	Answer may state insufficient context
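
A client can branch on the conditions in the table roughly as sketched below. The function name and the returned labels are hypothetical. The error body format is not specified here, so the sketch only looks for the example messages as substrings, which is fragile if the wording changes. Treating context_count == 0 as the signal for an empty store or no matches is also an assumption.

```python
def interpret_answer_result(status, body_text, context_count=None):
    """Map an HTTP status and raw body text to a coarse outcome label."""
    if status == 200:
        # Assumption: no retrieved passages shows up as context_count == 0.
        if context_count == 0:
            return "ok_no_context"
        return "ok"
    if status == 400:
        # Substring checks against the example messages from the table.
        if "LLM is not configured" in body_text:
            return "llm_not_configured"
        if "LLM request failed" in body_text:
            return "llm_request_failed"
        return "bad_request"
    return "unexpected"
```

With the "ok_no_context" outcome the answer text is still present, but it may only state that the context was insufficient.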
