
NVIDIA NIM embeddings + Moorcheh

This integration uses the NVIDIA NIM OpenAI-compatible Embeddings API with nvidia/llama-nemotron-embed-vl-1b-v2 and Moorcheh vector namespaces to store and search vectors with ITS ranking. The model outputs 2048-dimensional vectors. For this model you must set input_type: use passage when embedding content you index and query when embedding search strings; mixing the two modes degrades retrieval quality.

Architecture

Embedding generation

POST https://integrate.api.nvidia.com/v1/embeddings with model, input, and input_type

Vector storage

Store vectors in Moorcheh vector namespaces

Semantic retrieval

Embed the query with input_type: query and run vector search

Authentication

Authorization: Bearer your NVIDIA API key (NVIDIA API Catalog)
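As a minimal sketch (the helper name is ours, not part of either SDK), the header can be built and validated like this:

```python
def nvidia_auth_headers(api_key: str) -> dict:
    """Build the Authorization header expected by the NVIDIA NIM embeddings endpoint."""
    if not api_key:
        raise ValueError("NVIDIA_API_KEY is required")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

# Example with a placeholder key; pass your real key from the environment.
headers = nvidia_auth_headers("nvapi-example-key")
```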

Prerequisites

pip install -r integrations/nvidia/requirements.txt
Or explicitly:
pip install moorcheh-sdk requests python-dotenv

.env file

MOORCHEH_API_KEY=your_moorcheh_key
NVIDIA_API_KEY=your_nvidia_key
Do not commit API keys. If a key is exposed, rotate it in the NVIDIA dashboard and update your local .env.

input_type (passage vs query)

input_type   When to use
passage      Chunks or documents you store in Moorcheh
query        User or system queries at search time
The NIM inference reference states that using the wrong mode can significantly reduce retrieval accuracy.
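A small guard can keep the two modes from being mixed up when building the request body. This is a sketch; the function name and the validation step are ours, while the payload fields match the request shown above:

```python
VALID_INPUT_TYPES = {"passage", "query"}

def build_embedding_payload(texts, input_type):
    """Build the JSON body for POST /v1/embeddings, rejecting invalid modes."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(
            f"input_type must be one of {sorted(VALID_INPUT_TYPES)}, got {input_type!r}"
        )
    return {
        "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
        "input": list(texts),
        "input_type": input_type,
        "encoding_format": "float",
    }
```

Calling build_embedding_payload(chunks, "passage") at index time and build_embedding_payload([query], "query") at search time makes it hard to send the wrong mode by accident.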

Vector dimensions

nvidia/llama-nemotron-embed-vl-1b-v2 outputs 2048 dimensions per text. Set Moorcheh vector_dimension to 2048 for the namespace.
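Before uploading, it can help to verify every embedding against the namespace dimension so a mismatch fails fast on the client instead of at upload time. A sketch (the helper name is ours):

```python
EXPECTED_DIM = 2048  # must match the Moorcheh namespace vector_dimension

def check_dimensions(vectors, expected=EXPECTED_DIM):
    """Raise if any embedding does not match the namespace's vector_dimension."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected:
            raise ValueError(f"vector {i} has {len(vec)} dims, expected {expected}")
    return True
```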

End-to-end example

The following example loads keys from .env, embeds passages and a query through the NVIDIA embeddings endpoint, uploads vectors to Moorcheh, and runs similarity search.
import os
import textwrap
from typing import List

import requests
from dotenv import load_dotenv
from moorcheh_sdk import MoorchehClient

load_dotenv()

MOORCHEH_API_KEY = os.getenv("MOORCHEH_API_KEY", "").strip()
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY", "").strip()
if not MOORCHEH_API_KEY or not NVIDIA_API_KEY:
    raise SystemExit("Set MOORCHEH_API_KEY and NVIDIA_API_KEY.")

NVIDIA_EMBEDDINGS_URL = "https://integrate.api.nvidia.com/v1/embeddings"
MODEL = "nvidia/llama-nemotron-embed-vl-1b-v2"
VECTOR_DIMENSION = 2048

NAMESPACE = "nvidia-nemotron-embed-demo"
CHUNK_SIZE = 900
CHUNK_OVERLAP = 180


def nvidia_embed(texts: List[str], input_type: str) -> List[List[float]]:
    headers = {
        "Authorization": f"Bearer {NVIDIA_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": MODEL,
        "input": texts,
        "input_type": input_type,
        "encoding_format": "float",
    }
    r = requests.post(NVIDIA_EMBEDDINGS_URL, headers=headers, json=payload, timeout=120)
    r.raise_for_status()
    body = r.json()
    items = sorted(body["data"], key=lambda x: x["index"])
    return [item["embedding"] for item in items]


def to_float_vector(values: List[float]) -> List[float]:
    return [float(x) for x in values]


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = max(end - overlap, 0)
    return [c for c in chunks if c]


def extract_text(result: dict) -> str:
    if result.get("text"):
        return str(result["text"])
    metadata = result.get("metadata") or {}
    if isinstance(metadata, dict):
        return str(metadata.get("text") or metadata.get("raw_text") or metadata.get("content") or "")
    return ""


def clean_text(text: str) -> str:
    return " ".join(str(text).split())


def print_result(idx: int, result: dict) -> None:
    metadata = result.get("metadata") or {}
    text_value = clean_text(extract_text(result))
    wrapped = textwrap.fill(text_value, width=100)
    print(f"[{idx}] id={result.get('id')}")
    print(f"score={result.get('score')} label={result.get('label')}")
    print(f"section={metadata.get('section')} source_doc_id={metadata.get('source_doc_id')}")
    print("text:")
    print(wrapped if wrapped else "(no text returned)")
    print("-" * 120)


mc = MoorchehClient(api_key=MOORCHEH_API_KEY)

try:
    mc.namespaces.create(
        namespace_name=NAMESPACE,
        type="vector",
        vector_dimension=VECTOR_DIMENSION,
    )
except Exception:
    # The namespace may already exist; ignore creation errors for this demo.
    pass

source_documents = [
    {
        "id": "guide-vector-namespaces",
        "section": "vector-namespace-best-practices",
        "text": (
            "Moorcheh vector namespaces support bring-your-own-embedding workflows. "
            "Use nvidia/llama-nemotron-embed-vl-1b-v2 with input_type passage for chunks and query for search strings; "
            "match vector_dimension to the embedding size (2048)."
        ),
    },
    {
        "id": "guide-search-tuning",
        "section": "semantic-search-tuning",
        "text": (
            "Tune similarity_search top_k and threshold for your use case. "
            "Nemotron retrieval embeddings use passage vs query modes; keep them consistent at index and search time."
        ),
    },
]

documents = []
for doc in source_documents:
    parts = chunk_text(doc["text"])
    for idx, chunk in enumerate(parts):
        documents.append(
            {
                "id": f"{doc['id']}-chunk-{idx}",
                "text": chunk,
                "source_doc_id": doc["id"],
                "section": doc["section"],
                "chunk_index": idx,
                "total_chunks": len(parts),
            }
        )

texts = [d["text"] for d in documents]
doc_embeddings = nvidia_embed(texts, input_type="passage")

mc.vectors.upload(
    namespace_name=NAMESPACE,
    vectors=[
        {
            "id": documents[i]["id"],
            "vector": to_float_vector(doc_embeddings[i]),
            "text": documents[i]["text"],
            "source": "nvidia-nim-embeddings",
            "model": MODEL,
            "input_type": "passage",
            "section": documents[i]["section"],
            "source_doc_id": documents[i]["source_doc_id"],
            "chunk_index": documents[i]["chunk_index"],
            "total_chunks": documents[i]["total_chunks"],
        }
        for i in range(len(documents))
    ],
)

query = "What input_type and vector_dimension should I use with Nemotron and Moorcheh?"
query_vecs = nvidia_embed([query], input_type="query")
query_vec = to_float_vector(query_vecs[0])

results = mc.similarity_search.query(
    namespaces=[NAMESPACE],
    query=query_vec,
    top_k=5,
    kiosk_mode=True,
    threshold=0.15,
)

print(f"namespace={NAMESPACE} total_results={len(results.get('results', []))}")
print("=" * 120)
for idx, r in enumerate(results.get("results", []), start=1):
    print_result(idx, r)

Runnable demo script

See integrations/nvidia/nvidia_moorcheh_demo.py. Run from the repo root (or set PYTHONPATH as needed):
python integrations/nvidia/nvidia_moorcheh_demo.py

Important notes

  • nvidia/llama-nemotron-embed-vl-1b-v2 is 2048 dimensions. Create the Moorcheh namespace with vector_dimension=2048.
  • Use passage for stored chunks and query for search queries, per the NIM API schema.
  • Include text on each uploaded vector so search results can return the original chunk.
  • You can also use an OpenAI-compatible client with base_url=https://integrate.api.nvidia.com/v1 and the same model and input_type fields; the example above uses requests for clarity.

Troubleshooting

  • 401 / auth errors: Verify NVIDIA_API_KEY and Authorization: Bearer format.
  • Dimension mismatch: Namespace must be 2048 for this model’s default output.
  • Low relevance: Check input_type (passage at index, query at search), chunking, threshold, and top_k.