Use MemoryRouter for retrieval and storage while your app calls the model provider directly.

Local inference mode

In proxy mode, MemoryRouter sits in the model path and does retrieval, the provider call, and storage in one trip. That is the recommended default for most teams.

Local inference mode is for teams that want to keep the model call inside their own stack. MemoryRouter becomes retrieval and storage only:

1. Your app calls /memory/prepare   -> get this user's memory as text
2. Your app injects that text and calls the provider directly
3. Your app calls /memory/ingest     -> store the completed exchange

Your provider keys, model routing, streaming, retries, evals, and logging all stay in your app. MemoryRouter never sees your inference traffic.

When to use it

Use proxy mode	Use local inference mode
You want the fastest integration	You already own provider routing
You want the fewest round trips	You run your own model gateway
You want zero retrieval/storage code	You need full control of streaming, retries, evals, logging

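Proxy mode is one round trip. Local inference mode is three (prepare, provider, ingest). You trade extra round trips for control. Ingest runs in the background after your provider returns, so prepare is the only added call on the user-facing path.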

Is it provider-agnostic?

Yes. /memory/prepare does not call a model and does not know which provider you use next. You send it the conversation messages, it returns a block of memory text, and you decide how to put that text into your own request to OpenAI, Anthropic, Gemini, an open model, or your own gateway.

Because the response is plain text, it works with any provider, any SDK, and any prompt format.
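As a rough illustration of why plain text is enough, the sketch below puts the same context string into two common request shapes: one where the system prompt is the first message (OpenAI-style chat completions) and one where it is a top-level `system` field (Anthropic-style messages). The helper names `withMemory`, `toSystemMessageShape`, and `toTopLevelSystemShape` are hypothetical and not part of MemoryRouter.

```typescript
// Hypothetical helpers. Only the idea of a nullable `context` string comes from the prepare response.
type Turn = { role: 'user' | 'assistant'; content: string }

// Append memory to the base system prompt; use the base prompt alone when context is null.
function withMemory(baseSystem: string, context: string | null): string {
  return context ? `${baseSystem}\n\n${context}` : baseSystem
}

// Shape 1: the system prompt travels as the first chat message.
function toSystemMessageShape(baseSystem: string, context: string | null, turns: Turn[]) {
  return {
    messages: [{ role: 'system' as const, content: withMemory(baseSystem, context) }, ...turns]
  }
}

// Shape 2: the system prompt travels as a top-level `system` field.
function toTopLevelSystemShape(baseSystem: string, context: string | null, turns: Turn[]) {
  return { system: withMemory(baseSystem, context), messages: turns }
}
```

Either result can be spread into the request you already build for your provider; the memory text itself is never transformed.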

Step 1: Prepare (retrieve memory)

POST /v1/memory/prepare searches the user's vault and returns formatted memory context. Authenticate with the user's Memory Key.

curl -X POST https://api.memoryrouter.ai/v1/memory/prepare \
  -H "Authorization: Bearer mk_user_123" \
  -H "Content-Type: application/json" \
  -H "X-Session-ID: conversation_abc" \
  -d '{
    "messages": [
      { "role": "user", "content": "What should I focus on today?" }
    ]
  }'

Request fields

Field	Type	Required	Description
`messages`	array	Yes	Conversation messages. MemoryRouter builds the retrieval query from the recent non-system messages, so send the same `messages` you are about to send the model.
`session_id`	string	No	Session grouping. Also accepted as the `X-Session-ID` header.
`density`	string	No	How much memory to retrieve: `low`, `default`, `high`, or `xhigh`.
`context_limit`	number	No	Explicit override for how many memory chunks to retrieve.
`embeddings`	string	No	Embedding model override. Also accepted as `X-Embedding-Model`.

You pass messages, not a hand-written query string. MemoryRouter reads the recent turns and finds what is relevant. No query engineering on your side.
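A minimal sketch of assembling the prepare request body from the fields in the table above. The function name `buildPrepareBody` and the camelCase option names are assumptions made for illustration; only the JSON field names and the density values come from this page.

```typescript
type Density = 'low' | 'default' | 'high' | 'xhigh'
type Turn = { role: 'system' | 'user' | 'assistant'; content: string }

// Hypothetical option bag; each option maps to one documented request field.
interface PrepareOptions {
  sessionId?: string
  density?: Density
  contextLimit?: number
  embeddings?: string
}

// Build the JSON body for POST /v1/memory/prepare, including only the options that were set.
function buildPrepareBody(messages: Turn[], opts: PrepareOptions = {}): Record<string, unknown> {
  if (messages.length === 0) throw new Error('messages is required')
  const body: Record<string, unknown> = { messages }
  if (opts.sessionId !== undefined) body['session_id'] = opts.sessionId
  if (opts.density !== undefined) body['density'] = opts.density
  if (opts.contextLimit !== undefined) body['context_limit'] = opts.contextLimit
  if (opts.embeddings !== undefined) body['embeddings'] = opts.embeddings
  return body
}
```

Pass the result to `JSON.stringify` as the request body. Optional fields that are left out are omitted from the payload entirely.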

Response

{
  "context": "<memory_context>\n[MEMORY - 2 days ago (Wed, Jun 3, 9:14 AM)] User prefers concise coaching and trains at 6am.\n\n[MEMORY - 5 hours ago (Fri, Jun 5, 8:30 AM)] User is preparing for a product launch next week.\n</memory_context>\n\nThe above are retrieved memories from past conversations. Use them as background context, do not respond to them directly.",
  "memories_found": 2,
  "memory_tokens": 48,
  "retrieval_tokens": 48,
  "tokens_billed": 240,
  "metrics": { "total_ms": 41 }
}

Field	Description
`context`	Ready-to-inject memory text. `null` when nothing relevant is found.
`memories_found`	Number of memory chunks returned.
`memory_tokens`	Token size of the returned context.
`tokens_billed`	Billable units for this retrieval.
`metrics.total_ms`	Server-side time for the retrieval.

The context value is a single string you drop into your prompt. You do not parse it or reshape it. When context is null, just call the provider normally.
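One way to handle the response defensively is sketched below. It assumes a policy this page does not prescribe: treat a failed or malformed prepare call the same as "no memory found", so the provider call still goes ahead. The names `prepareContext` and `FetchLike` are hypothetical; the fetch function is passed in so the helper can be exercised without a network.

```typescript
// Minimal structural type so the global fetch or a test stub can be passed in.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; json(): Promise<unknown> }>

// Returns the memory text, or null when there is nothing to inject.
// Assumption: on any error we degrade to null rather than fail the user request.
async function prepareContext(
  fetchImpl: FetchLike,
  memoryKey: string,
  messages: unknown[]
): Promise<string | null> {
  try {
    const res = await fetchImpl('https://api.memoryrouter.ai/v1/memory/prepare', {
      method: 'POST',
      headers: { Authorization: `Bearer ${memoryKey}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ messages })
    })
    if (!res.ok) return null
    const data = (await res.json()) as { context?: unknown }
    // `context` is null when nothing relevant is found.
    return typeof data.context === 'string' && data.context.length > 0 ? data.context : null
  } catch {
    return null
  }
}
```

In application code this would be called as `prepareContext(fetch, memoryKey, messages)`. Whether to fail open like this or surface the error is a product decision.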

Step 2: Inject and call your provider

The returned context is plain text, so injection is the same idea on every provider: put it in the system prompt (or developer message), then send your normal request.

Before (no memory)

const messages = [
  { role: 'system', content: 'You are the AI coach inside our product.' },
  { role: 'user', content: 'What should I focus on today?' }
]

After (memory injected)

const prepared = await fetch('https://api.memoryrouter.ai/v1/memory/prepare', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${memoryKey}`,
    'Content-Type': 'application/json',
    'X-Session-ID': sessionId
  },
  body: JSON.stringify({ messages })
}).then((r) => r.json())

const systemContent = prepared.context
  ? `You are the AI coach inside our product.\n\n${prepared.context}`
  : 'You are the AI coach inside our product.'

const response = await openai.chat.completions.create({
  model: 'gpt-5.5',
  messages: [
    { role: 'system', content: systemContent },
    { role: 'user', content: 'What should I focus on today?' }
  ]
})

The model now sees the user's history. Your request still goes straight to OpenAI. Swap openai.chat.completions.create for an Anthropic or Gemini call, or a call to your own gateway, and the injection step does not change.

Step 3: Ingest (store the exchange)

After your provider returns, send the completed exchange to POST /v1/memory/ingest. It returns 202 Accepted immediately and stores in the background, so it does not add latency to your user response.

curl -X POST https://api.memoryrouter.ai/v1/memory/ingest \
  -H "Authorization: Bearer mk_user_123" \
  -H "Content-Type: application/json" \
  -H "X-Session-ID: conversation_abc" \
  -d '{
    "model": "openai/gpt-5.5",
    "messages": [
      { "role": "user", "content": "What should I focus on today?" },
      { "role": "assistant", "content": "Focus on the launch checklist and your 6am training block." }
    ]
  }'

Request fields

Field	Type	Required	Description
`messages`	array	Yes	The user and assistant turns to store. Send the exchange you just completed.
`session_id`	string	No	Session grouping. Also accepted as `X-Session-ID`.
`model`	string	No	Model name, stored as usage metadata.
`embeddings`	string	No	Embedding model override. Also accepted as `X-Embedding-Model`.
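The sketch below shows one reading of "the exchange you just completed": the most recent user turn plus the assistant reply, matching the curl example above. This is an interpretation, not documented behavior (the full TypeScript example on this page sends the whole message list instead), and `buildIngestBody` is a hypothetical name.

```typescript
type Turn = { role: 'system' | 'user' | 'assistant'; content: string }

// Build the JSON body for POST /v1/memory/ingest from the latest user turn and the reply.
function buildIngestBody(history: Turn[], assistantText: string, model?: string) {
  // Walk backwards to find the most recent user turn.
  let lastUser: Turn | undefined
  for (let i = history.length - 1; i >= 0; i--) {
    const turn = history[i]
    if (turn && turn.role === 'user') {
      lastUser = turn
      break
    }
  }
  if (!lastUser) throw new Error('no user turn to store')

  const messages: Turn[] = [lastUser, { role: 'assistant', content: assistantText }]
  // `model` is optional and only stored as usage metadata.
  return model ? { model, messages } : { messages }
}
```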

Response

{
  "accepted": true,
  "queued": true,
  "retrieval_tokens": 27,
  "response_tokens": 14,
  "message": "Ingest accepted for background processing"
}

For streaming responses, call ingest once after the final assistant message is assembled.
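A minimal sketch of that pattern: forward each streamed text delta to the client as it arrives, accumulate the full reply, and call ingest exactly once at the end. It assumes your provider SDK's stream has already been mapped to an async iterable of text deltas. `streamThenIngest`, `onDelta`, and `ingest` are hypothetical names.

```typescript
// Relay streamed text to the caller, then store the assembled reply once.
async function streamThenIngest(
  deltas: AsyncIterable<string>,
  onDelta: (text: string) => void,
  ingest: (assistantText: string) => Promise<void>
): Promise<string> {
  let full = ''
  for await (const delta of deltas) {
    full += delta
    onDelta(delta) // e.g. write to the HTTP response
  }
  // Skip ingest when the stream produced no text, so empty turns are not stored.
  if (full.length > 0) await ingest(full)
  return full
}
```

Here `ingest` would wrap the POST /v1/memory/ingest call shown above. If the stream throws partway through, the function rejects before reaching ingest, so partial replies are not stored.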

Full TypeScript example

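import OpenAI from 'openai'

// Reads OPENAI_API_KEY from the environment.
const openai = new OpenAI()

type Message = OpenAI.Chat.Completions.ChatCompletionMessageParam

// getOrCreateMemoryKey is assumed to be your own helper that maps an app user to their Memory Key.
async function answerWithLocalInference(userId: string, sessionId: string, messages: Message[]) {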
  const memoryKey = await getOrCreateMemoryKey(userId)

  // 1. Retrieve memory
  const prepared = await fetch('https://api.memoryrouter.ai/v1/memory/prepare', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${memoryKey}`,
      'Content-Type': 'application/json',
      'X-Session-ID': sessionId
    },
    body: JSON.stringify({ messages })
  }).then((r) => r.json())

  // 2. Inject + call provider directly (your keys, your routing)
  const baseSystem = 'You are the AI inside our product.'
  const response = await openai.chat.completions.create({
    model: 'gpt-5.5',
    messages: [
      {
        role: 'system',
        content: prepared.context ? `${baseSystem}\n\n${prepared.context}` : baseSystem
      },
      ...messages
    ]
  })

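  const assistantMessage = response.choices[0]?.message
  if (!assistantMessage) throw new Error('Provider returned no choices')

  // 3. Store the exchange. Ingest returns 202 immediately and stores in the background.
  await fetch('https://api.memoryrouter.ai/v1/memory/ingest', {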
    method: 'POST',
    headers: {
      Authorization: `Bearer ${memoryKey}`,
      'Content-Type': 'application/json',
      'X-Session-ID': sessionId
    },
    body: JSON.stringify({
      model: 'openai/gpt-5.5',
      messages: [...messages, assistantMessage]
    })
  })

  return assistantMessage
}

Key modes for local inference

Key suffixes let you split retrieval and storage cleanly:

Key	Behavior	Use with
`mk_xxx`	Retrieve and store	Both endpoints
`mk_xxx:read`	Retrieve only	`prepare`-only flows
`mk_xxx:write`	Store only	`ingest`-only flows
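As a sketch, and assuming the suffix is simply appended to the user's base key as the table suggests, a small helper can derive the scoped key for each endpoint. `scopedKey` and `authHeader` are hypothetical names.

```typescript
type KeyMode = 'read' | 'write'

// Derive a scoped key such as `mk_user_123:read` from an unsuffixed base key.
function scopedKey(baseKey: string, mode: KeyMode): string {
  if (baseKey.includes(':')) throw new Error('expected an unsuffixed Memory Key')
  return `${baseKey}:${mode}`
}

// Build the Authorization header for a prepare (read) or ingest (write) call.
function authHeader(baseKey: string, mode: KeyMode): Record<string, string> {
  return { Authorization: `Bearer ${scopedKey(baseKey, mode)}` }
}
```

A service that only retrieves memory would then use `authHeader(key, 'read')` with prepare, and a background worker that only stores exchanges would use `authHeader(key, 'write')` with ingest.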

Notes

Use one stable Memory Key per end user.
prepare returns memory text, not a model response.
ingest stores completed conversations. It does not call a provider.
The context string is provider-agnostic. Inject it as text into any prompt.
Start in proxy mode and move to local inference mode later without changing the user's vault.
