POST /answer/stream
curl -N -X POST "http://localhost:8080/answer/stream" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Who won the football match?",
    "query_vector": [0.01, -0.02, "... 768 floats ..."],
    "top_k": 5
  }'
event: meta
data: {"model":"llama3.2:1b-instruct-q4_K_M","context_count":1,"sources":[...],"query":"Who won the football match?"}

event: token
data: {"delta":"Manchester"}

event: token
data: {"delta":" United beat Chelsea 2-1."}

event: done
data: {"answer":"Manchester United beat Chelsea 2-1.","query":"Who won the football match?","model":"llama3.2:1b-instruct-q4_K_M","context_count":1}

Overview

Same retrieval and prompting as POST /answer, but the LLM response is streamed as Server-Sent Events (SSE) instead of a single JSON body. Typical event order:
  1. meta - model, sources, and context count (sent once)
  2. token - one or more answer text deltas
  3. done - full answer echo and metadata
  4. error - sent instead of done if the LLM call fails
Requires Ollama running on the host. Start the stack with moorcheh-edge up (use --skip-ollama for search-only).
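
The event order above can be consumed with a small line-based parser. The following is a minimal sketch in Python, not the project's own client: the function name iter_sse_events is hypothetical, and it assumes every data field carries one JSON object, as in the example response.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, parsed_data) pairs from decoded SSE text lines."""
    event = "message"  # SSE default name when no "event:" field was sent
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event frame.
            if data_parts:
                yield event, json.loads("\n".join(data_parts))
            event, data_parts = "message", []
            continue
        if line.startswith(":"):
            continue  # SSE comment line (often used as a keep-alive)
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # drop the single optional space after the colon
        if field == "event":
            event = value
        elif field == "data":
            data_parts.append(value)
    # Pragmatic choice: flush a final frame even if the stream ended
    # without a trailing blank line.
    if data_parts:
        yield event, json.loads("\n".join(data_parts))
```

The function accepts any iterable of text lines, so it can be fed the decoded lines of an HTTP response body or a saved transcript of the stream.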

Request body

Same fields as Answer (RAG):
query (string, required)
Original question text (included in the LLM prompt and echoed in the response).

query_vector (array, required)
JSON array of floats used for similarity search. Length must match the store dimension (768 for text stores).

top_k (number, default: 5)
Number of passages to retrieve for context. Capped at 100.

threshold (number, default: 0)
Minimum search score when kiosk_mode is true.

kiosk_mode (boolean, default: false)
When true, filters retrieved passages below threshold.

header_prompt (string)
Optional system instruction (replaces the default RAG system prompt).
Optional instruction appended before the user question in the final user message.

chat_history (array)
Prior turns: [{"role": "user"|"assistant", "content": "..."}].

temperature (number, default: 0.2)
LLM sampling temperature (0.0–2.0).
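
The field constraints above can be checked on the client before a request is sent. This is a hedged sketch: build_answer_request and MAX_TOP_K are hypothetical names, the default dimension of 768 applies to text stores only, and clamping top_k locally simply mirrors the documented cap rather than reproducing the server's behavior.

```python
from typing import Any, Dict, List, Optional, Sequence

MAX_TOP_K = 100  # documented cap on top_k


def build_answer_request(
    query: str,
    query_vector: Sequence[float],
    *,
    dim: int = 768,  # assumption: a text store; pass the real store dimension
    top_k: int = 5,
    threshold: Optional[float] = None,
    kiosk_mode: Optional[bool] = None,
    header_prompt: Optional[str] = None,
    chat_history: Optional[List[Dict[str, str]]] = None,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Return a JSON-serializable body for POST /answer/stream."""
    if not query:
        raise ValueError("query is required")
    if len(query_vector) != dim:
        raise ValueError(
            f"query_vector has {len(query_vector)} values, expected {dim}"
        )
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")

    body: Dict[str, Any] = {
        "query": query,
        "query_vector": [float(x) for x in query_vector],
        "top_k": min(int(top_k), MAX_TOP_K),
    }
    # Send optional fields only when the caller set them, so the
    # server-side defaults apply otherwise.
    optional = {
        "threshold": threshold,
        "kiosk_mode": kiosk_mode,
        "header_prompt": header_prompt,
        "chat_history": chat_history,
        "temperature": temperature,
    }
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

Passing the result through json.dumps gives a body that can be sent with the Content-Type: application/json header used in the curl example.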

SSE events

Event  When                 Data shape
meta   Once, before tokens  model, context_count, sources, query
token  Per LLM delta        {"delta": "..."}
done   Stream complete      answer, query, model, context_count
error  LLM failure          {"message": "..."}
The sources array in meta matches the shape returned by POST /search (id, score, label, text).
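
One way to handle the four event types is a small dispatcher over already-parsed (event, data) pairs. The sketch below is illustrative only: collect_answer and StreamError are hypothetical names, and falling back to the concatenated deltas when done carries no answer is an assumption, not documented behavior.

```python
from typing import Any, Dict, Iterable, List, Tuple


class StreamError(RuntimeError):
    """Raised when the stream reports an error event or ends early."""


def collect_answer(events: Iterable[Tuple[str, Dict[str, Any]]]) -> Dict[str, Any]:
    """Fold a sequence of SSE events into one result dictionary."""
    meta: Dict[str, Any] = {}
    deltas: List[str] = []
    for name, data in events:
        if name == "meta":
            meta = data  # sent once, before any token
        elif name == "token":
            deltas.append(data["delta"])  # render incrementally here if desired
        elif name == "done":
            return {
                # Prefer the full answer echoed by "done"; the joined
                # deltas are a fallback (assumption).
                "answer": data.get("answer", "".join(deltas)),
                "model": data.get("model", meta.get("model")),
                "sources": meta.get("sources", []),
            }
        elif name == "error":
            raise StreamError(data.get("message", "unknown error"))
    raise StreamError("stream ended without a done or error event")
```

A caller that wants live output can print each delta inside the token branch and still use the returned dictionary for the final answer and sources.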

Errors

Non-streaming HTTP errors (empty store, invalid vector, LLM not configured) return JSON with a 4xx status before SSE starts. Once streaming begins, LLM failures arrive as an SSE error event.
Condition                 Status / event  Message (example)
LLM not configured        400             LLM is not configured: start Ollama on the host and run moorcheh-edge up
LLM unreachable or error  SSE error       LLM request failed
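
The two failure paths call for different handling on the client: an HTTP 4xx response arrives before any SSE data, while an LLM failure arrives inside the stream. The following sketch covers the first path using only the Python standard library. The names AnswerRequestError, request_error_from, and open_answer_stream are hypothetical, the Accept header is an assumption, and because the key names of the JSON error body are not documented here, the body is kept as parsed without reading specific fields.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Dict


class AnswerRequestError(Exception):
    """HTTP error returned before the SSE stream started."""

    def __init__(self, status: int, body: Any) -> None:
        super().__init__(f"HTTP {status}: {body}")
        self.status = status
        self.body = body


def request_error_from(exc: urllib.error.HTTPError) -> AnswerRequestError:
    """Convert an HTTPError into AnswerRequestError, parsing JSON if possible."""
    raw = exc.read().decode("utf-8", errors="replace")
    try:
        body: Any = json.loads(raw)
    except json.JSONDecodeError:
        body = raw  # keep the raw text if the body is not valid JSON
    return AnswerRequestError(exc.code, body)


def open_answer_stream(payload: Dict[str, Any],
                       base_url: str = "http://localhost:8080"):
    """POST the payload and return the open response for line-by-line reading."""
    req = urllib.request.Request(
        base_url + "/answer/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "text/event-stream",  # assumption: harmless if ignored
        },
        method="POST",
    )
    try:
        return urllib.request.urlopen(req)
    except urllib.error.HTTPError as exc:
        # 4xx responses (for example, LLM not configured) end up here,
        # before any SSE event has been sent.
        raise request_error_from(exc) from exc
```

Once open_answer_stream returns, any further LLM failure would be reported as an SSE error event in the response body rather than as an HTTP status.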