MemoryRouter

Local inference mode

Use MemoryRouter for retrieval and storage while your app calls the model provider directly.

In proxy mode, MemoryRouter sits in the model path and does retrieval, the provider call, and storage in one trip. That is the recommended default for most teams.

Local inference mode is for teams that want to keep the model call inside their own stack. MemoryRouter becomes retrieval and storage only:

1. Your app calls /memory/prepare   -> get this user's memory as text
2. Your app injects that text and calls the provider directly
3. Your app calls /memory/ingest     -> store the completed exchange

Your provider keys, model routing, streaming, retries, evals, and logging all stay in your app. MemoryRouter never sees your inference traffic.

When to use it

| Use proxy mode | Use local inference mode |
| --- | --- |
| You want the fastest integration | You already own provider routing |
| You want the fewest round trips | You run your own model gateway |
| You want zero retrieval/storage code | You need full control of streaming, retries, evals, logging |

Proxy mode is one round trip. Local inference mode is three (prepare, provider, ingest). You trade extra round trips, and the latency they add, for full control of the model call.

Is it provider-agnostic?

Yes. /memory/prepare does not call a model and does not know which provider you use next. You send it the conversation messages, it returns a block of memory text, and you decide how to put that text into your own request to OpenAI, Anthropic, Gemini, an open model, or your own gateway.

Because the response is plain text, it works with any provider, any SDK, and any prompt format.
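As a minimal sketch of what provider-agnostic injection can look like, the helpers below place the same context string into two request shapes: an OpenAI-style chat payload, where the system prompt is the first message, and an Anthropic-style payload, where it is a top-level system field. The function names, the Turn type, and the maxTokens default are illustrative assumptions, not part of MemoryRouter.

```typescript
// Hypothetical helpers: names and types are illustrative, not part of MemoryRouter.
type Turn = { role: 'user' | 'assistant'; content: string }

// Append memory text to the base system prompt; use the base alone when context is null.
function withMemory(baseSystem: string, context: string | null): string {
  return context ? `${baseSystem}\n\n${context}` : baseSystem
}

// OpenAI-style chat payload: the system prompt travels as the first message.
function toOpenAIStyle(model: string, baseSystem: string, context: string | null, turns: Turn[]) {
  return {
    model,
    messages: [{ role: 'system' as const, content: withMemory(baseSystem, context) }, ...turns]
  }
}

// Anthropic-style payload: the system prompt is a top-level field.
function toAnthropicStyle(
  model: string,
  baseSystem: string,
  context: string | null,
  turns: Turn[],
  maxTokens = 1024 // assumed default, pick what fits your app
) {
  return {
    model,
    max_tokens: maxTokens,
    system: withMemory(baseSystem, context),
    messages: turns
  }
}
```

Either object can be passed to the matching SDK call or sent through your own gateway. Only the placement of the system text differs between the two.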

Step 1: Prepare (retrieve memory)

POST /v1/memory/prepare searches the user's vault and returns formatted memory context. Authenticate with the user's Memory Key.

curl -X POST https://api.memoryrouter.ai/v1/memory/prepare \
  -H "Authorization: Bearer mk_user_123" \
  -H "Content-Type: application/json" \
  -H "X-Session-ID: conversation_abc" \
  -d '{
    "messages": [
      { "role": "user", "content": "What should I focus on today?" }
    ]
  }'

Request fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| messages | array | Yes | Conversation messages. MemoryRouter builds the retrieval query from the recent non-system messages, so send the same messages you are about to send the model. |
| session_id | string | No | Session grouping. Also accepted as the X-Session-ID header. |
| density | string | No | How much memory to retrieve: low, default, high, or xhigh. |
| context_limit | number | No | Explicit override for how many memory chunks to retrieve. |
| embeddings | string | No | Embedding model override. Also accepted as the X-Embedding-Model header. |

You pass messages, not a hand-written query string. MemoryRouter reads the recent turns and finds what is relevant. No query engineering on your side.
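The optional fields can be set per request. The builder below is a hypothetical helper (its name, the PrepareOptions shape, and the camelCase option names are assumptions); only the endpoint URL and the body field names come from this page. It is a sketch of one way to keep optional fields out of the body unless they are set.

```typescript
// Field names in the request body come from the prepare request fields;
// the builder itself is a hypothetical helper.
type Density = 'low' | 'default' | 'high' | 'xhigh'

interface PrepareOptions {
  sessionId?: string
  density?: Density
  contextLimit?: number
  embeddings?: string
}

interface ChatMessage {
  role: string
  content: string
}

function buildPrepareRequest(memoryKey: string, messages: ChatMessage[], opts: PrepareOptions = {}) {
  // Only include optional fields that were actually provided.
  const body: Record<string, unknown> = { messages }
  if (opts.sessionId !== undefined) body.session_id = opts.sessionId
  if (opts.density !== undefined) body.density = opts.density
  if (opts.contextLimit !== undefined) body.context_limit = opts.contextLimit
  if (opts.embeddings !== undefined) body.embeddings = opts.embeddings

  return {
    url: 'https://api.memoryrouter.ai/v1/memory/prepare',
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${memoryKey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(body)
    }
  }
}
```

The returned url and init can be handed to fetch as-is. Here session_id travels in the body rather than as the X-Session-ID header; either placement is accepted.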

Response

{
  "context": "<memory_context>\n[MEMORY - 2 days ago (Wed, Jun 3, 9:14 AM)] User prefers concise coaching and trains at 6am.\n\n[MEMORY - 5 hours ago (Fri, Jun 5, 8:30 AM)] User is preparing for a product launch next week.\n</memory_context>\n\nThe above are retrieved memories from past conversations. Use them as background context, do not respond to them directly.",
  "memories_found": 2,
  "memory_tokens": 48,
  "retrieval_tokens": 48,
  "tokens_billed": 240,
  "metrics": { "total_ms": 41 }
}

| Field | Description |
| --- | --- |
| context | Ready-to-inject memory text. null when nothing relevant is found. |
| memories_found | Number of memory chunks returned. |
| memory_tokens | Token size of the returned context. |
| tokens_billed | Billable units for this retrieval. |
| metrics.total_ms | Server-side time for the retrieval. |

The context value is a single string you drop into your prompt. You do not parse it or reshape it. When context is null, just call the provider normally.
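One way to handle the null case defensively is to normalize the response before using it. The sketch below is an assumption about how an app might do this, not a MemoryRouter client: readPrepareResponse and PreparedMemory are hypothetical names, and treating a malformed payload as "no memory" is a design choice, since the error response format is not described here.

```typescript
// Hypothetical normalizer for the prepare response; not part of MemoryRouter.
interface PreparedMemory {
  context: string | null
  memoriesFound: number
}

function readPrepareResponse(payload: unknown): PreparedMemory {
  // Anything that is not a JSON object is treated as "no memory" (assumed policy).
  if (typeof payload !== 'object' || payload === null) {
    return { context: null, memoriesFound: 0 }
  }
  const p = payload as Record<string, unknown>
  // Keep context only when it is a non-empty string; null stays null.
  const context = typeof p.context === 'string' && p.context.length > 0 ? p.context : null
  const memoriesFound = typeof p.memories_found === 'number' ? p.memories_found : 0
  return { context, memoriesFound }
}
```

With this in place, the caller has a single check: inject when context is a string, otherwise send the request to the provider unchanged.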

Step 2: Inject and call your provider

The returned context is plain text, so injection is the same idea on every provider: put it in the system prompt (or developer message), then send your normal request.

Before (no memory)

const messages = [
  { role: 'system', content: 'You are the AI coach inside our product.' },
  { role: 'user', content: 'What should I focus on today?' }
]

After (memory injected)

const prepared = await fetch('https://api.memoryrouter.ai/v1/memory/prepare', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${memoryKey}`,
    'Content-Type': 'application/json',
    'X-Session-ID': sessionId
  },
  body: JSON.stringify({ messages })
}).then((r) => r.json())

const systemContent = prepared.context
  ? `You are the AI coach inside our product.\n\n${prepared.context}`
  : 'You are the AI coach inside our product.'

const response = await openai.chat.completions.create({
  model: 'gpt-5.5',
  messages: [
    { role: 'system', content: systemContent },
    { role: 'user', content: 'What should I focus on today?' }
  ]
})

The model now sees the user's history. Your request still goes straight to OpenAI. Swap openai.chat.completions.create for an Anthropic, Gemini, or custom gateway call and the injection step does not change.

Step 3: Ingest (store the exchange)

After your provider returns, send the completed exchange to POST /v1/memory/ingest. It returns 202 Accepted immediately and stores in the background, so it does not add latency to your user response.

curl -X POST https://api.memoryrouter.ai/v1/memory/ingest \
  -H "Authorization: Bearer mk_user_123" \
  -H "Content-Type: application/json" \
  -H "X-Session-ID: conversation_abc" \
  -d '{
    "model": "openai/gpt-5.5",
    "messages": [
      { "role": "user", "content": "What should I focus on today?" },
      { "role": "assistant", "content": "Focus on the launch checklist and your 6am training block." }
    ]
  }'

Request fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| messages | array | Yes | The user and assistant turns to store. Send the exchange you just completed. |
| session_id | string | No | Session grouping. Also accepted as the X-Session-ID header. |
| model | string | No | Model name, stored as usage metadata. |
| embeddings | string | No | Embedding model override. Also accepted as the X-Embedding-Model header. |

Response

{
  "accepted": true,
  "queued": true,
  "retrieval_tokens": 27,
  "response_tokens": 14,
  "message": "Ingest accepted for background processing"
}

For streaming responses, call ingest once after the final assistant message is assembled.
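A rough sketch of that pattern: forward each streamed text delta to the user as it arrives, accumulate the full text, and build the ingest body only after the stream ends. The helper names and the idea of a stream of plain string deltas are assumptions; real provider SDKs emit richer chunk objects that you would map to strings first.

```typescript
// Hypothetical helpers for streaming; not part of MemoryRouter or any provider SDK.
type Turn = { role: 'user' | 'assistant'; content: string }

// Forward each delta to the caller and return the fully assembled text.
async function collectStream(
  deltas: AsyncIterable<string>,
  onDelta: (delta: string) => void = () => {}
): Promise<string> {
  let full = ''
  for await (const delta of deltas) {
    full += delta
    onDelta(delta)
  }
  return full
}

// Build the ingest body once, from the turns sent and the assembled assistant text.
function buildIngestBody(model: string, turns: Turn[], assistantText: string) {
  return {
    model,
    messages: [...turns, { role: 'assistant' as const, content: assistantText }]
  }
}
```

The body returned by buildIngestBody is what you would JSON-encode and POST to /v1/memory/ingest, a single time per exchange.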

Full TypeScript example

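import OpenAI from 'openai'

// Assumed setup: an OpenAI client (reads OPENAI_API_KEY from the environment),
// a minimal Message type, and your own helper that returns a user's Memory Key.
const openai = new OpenAI()
type Message = { role: 'user' | 'assistant'; content: string }
declare function getOrCreateMemoryKey(userId: string): Promise<string>

async function answerWithLocalInference(userId: string, sessionId: string, messages: Message[]) {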
  const memoryKey = await getOrCreateMemoryKey(userId)

  // 1. Retrieve memory
  const prepared = await fetch('https://api.memoryrouter.ai/v1/memory/prepare', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${memoryKey}`,
      'Content-Type': 'application/json',
      'X-Session-ID': sessionId
    },
    body: JSON.stringify({ messages })
  }).then((r) => r.json())

  // 2. Inject + call provider directly (your keys, your routing)
  const baseSystem = 'You are the AI inside our product.'
  const response = await openai.chat.completions.create({
    model: 'gpt-5.5',
    messages: [
      {
        role: 'system',
        content: prepared.context ? `${baseSystem}\n\n${prepared.context}` : baseSystem
      },
      ...messages
    ]
  })

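  const assistantMessage = response.choices[0]?.message
  if (!assistantMessage) throw new Error('Provider returned no choices')

  // 3. Store the exchange (fire and forget: not awaited, so the reply is not delayed).
  // On runtimes that stop work once the response is sent, await this call instead.
  void fetch('https://api.memoryrouter.ai/v1/memory/ingest', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${memoryKey}`,
      'Content-Type': 'application/json',
      'X-Session-ID': sessionId
    },
    body: JSON.stringify({
      model: 'openai/gpt-5.5',
      // Send only role and content, not the full SDK message object
      messages: [...messages, { role: 'assistant', content: assistantMessage.content ?? '' }]
    })
  }).catch((err) => console.error('MemoryRouter ingest failed', err))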

  return assistantMessage
}

Key modes for local inference

Key suffixes let you split retrieval and storage cleanly:

| Key | Behavior | Use with |
| --- | --- | --- |
| mk_xxx | Retrieve and store | Both endpoints |
| mk_xxx:read | Retrieve only | prepare-only flows |
| mk_xxx:write | Store only | ingest-only flows |
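If a service should only retrieve or only store, it can be handed a scoped form of the key. The helper below is a hypothetical sketch that assumes the suffix is appended to the base key literally, as the key patterns above suggest; scopedKey and KeyScope are illustrative names.

```typescript
// Hypothetical helper; assumes the scope is a literal ":read" or ":write" suffix.
type KeyScope = 'read' | 'write'

function scopedKey(memoryKey: string, scope: KeyScope): string {
  // Strip any existing scope suffix so repeated calls do not stack suffixes.
  const base = memoryKey.replace(/:(read|write)$/, '')
  return `${base}:${scope}`
}
```

For example, a retrieval-only worker could call prepare with scopedKey(memoryKey, 'read'), while a separate logging pipeline calls ingest with scopedKey(memoryKey, 'write').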

Notes

  • Use one stable Memory Key per end user.
  • prepare returns memory text, not a model response.
  • ingest stores completed conversations; it does not call a provider.
  • The context string is provider-agnostic. Inject it as text into any prompt.
  • You can start in proxy mode and move to local inference mode later without changing the user's vault.
