
Context Engine

The context engine enhances your AI interactions through three mechanisms: conversation compression, cross-profile context sharing, and local RAG (Retrieval-Augmented Generation).

Conversation Compression

When conversations grow beyond a token threshold, Claudex uses an LLM to summarize older messages, keeping recent ones intact.

```toml
[context.compression]
enabled = true
threshold_tokens = 50000            # compress when total tokens exceed this
keep_recent = 10                    # always keep the last N messages
profile = "openrouter"              # reuse a profile's base_url + api_key
model = "qwen/qwen-2.5-7b-instruct" # override model (optional)
```
  1. Before forwarding a request, Claudex estimates total token count
  2. If tokens exceed threshold_tokens, older messages (beyond keep_recent) are replaced with a summary
  3. The summary is generated by the configured local LLM
  4. The compressed conversation is then forwarded to the provider
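The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Claudex's implementation: the `summarize()` placeholder stands in for the call to the configured LLM, and the 4-characters-per-token estimate is a rough hypothetical heuristic.

```python
# Sketch of the compression flow. summarize() and the 4-chars-per-token
# estimate are illustrative stand-ins, not Claudex's actual API.

THRESHOLD_TOKENS = 50000  # mirrors threshold_tokens
KEEP_RECENT = 10          # mirrors keep_recent

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    # Placeholder for the call to the configured summarization LLM.
    return "Summary of earlier conversation: ..."

def compress(messages):
    if estimate_tokens(messages) <= THRESHOLD_TOKENS:
        return messages  # under threshold: forward unchanged
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = {"role": "system", "content": summarize(older)}
    return [summary] + recent  # summary replaces the older messages
```

Only the messages beyond `keep_recent` are summarized, so the most recent exchange is always forwarded verbatim.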

Cross-Profile Context Sharing

Share context across different provider profiles within the same session.

```toml
[context.sharing]
enabled = true
max_context_size = 2000 # max tokens to inject from other profiles
```

This is useful when switching between providers mid-task: relevant context from previous interactions is included automatically.
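One plausible way to apply the `max_context_size` cap is to walk the other profiles' history newest-first and stop once the token budget runs out. The sketch below assumes that strategy; the function name, message shape, and 4-chars-per-token estimate are all hypothetical.

```python
# Illustrative sketch of cross-profile sharing: inject recent context from
# other profiles, capped at max_context_size tokens. All names hypothetical.

MAX_CONTEXT_SIZE = 2000  # mirrors max_context_size

def estimate_tokens(text):
    return len(text) // 4  # rough heuristic: ~4 characters per token

def inject_shared_context(request_messages, other_profile_messages):
    budget = MAX_CONTEXT_SIZE
    shared = []
    # Walk the other profiles' history newest-first until the budget runs out.
    for msg in reversed(other_profile_messages):
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break
        budget -= cost
        shared.append(msg["content"])
    if not shared:
        return request_messages
    header = "Context from other profiles:\n" + "\n".join(reversed(shared))
    return [{"role": "system", "content": header}] + request_messages
```

Injecting the shared context as a single system message keeps the original request untouched, so the provider sees it as ordinary conversation preamble.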

Local RAG

Index local code and documentation for retrieval-augmented generation. Relevant code snippets are automatically injected into requests.

```toml
[context.rag]
enabled = true
index_paths = ["./src", "./docs"] # directories to index
profile = "openrouter"            # reuse a profile's base_url + api_key
model = "openai/text-embedding-3-small" # embedding model
chunk_size = 512                  # text chunk size
top_k = 5                         # number of results to inject
```
  1. On startup, Claudex indexes files in index_paths using the embedding model
  2. For each request, the user’s message is embedded and compared against the index
  3. The top-k most relevant chunks are injected as additional context in the request
  4. The provider receives richer context about your codebase
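The chunking and top-k retrieval steps can be sketched as follows. To stay self-contained, the example uses a toy letter-frequency `embed()` in place of the real embedding model; in Claudex the vectors would come from the configured `model` (e.g. `openai/text-embedding-3-small`).

```python
# Sketch of RAG chunking and retrieval. embed() is a toy bag-of-letters
# stand-in for the configured embedding model; everything else is generic.
import math

TOP_K = 5        # mirrors top_k
CHUNK_SIZE = 512 # mirrors chunk_size

def chunk(text, size=CHUNK_SIZE):
    # Split indexed files into fixed-size character chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    # Toy 26-dim letter-frequency vector (stand-in for the real model).
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(query, chunks, k=TOP_K):
    # Rank every indexed chunk by similarity to the embedded query.
    qv = embed(query)
    scored = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)
    return scored[:k]
```

The returned chunks would then be prepended to the request as additional context, which is what gives the provider visibility into your codebase.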