
How I Built a Copilot AI at VTEX: RAG, Multi-Agent, and What I Learned

A technical deep-dive into designing the agentic AI system that powers VTEX's Copilot feature.


The Challenge

When I joined the AI team at VTEX, the mission was clear: build an AI assistant that could help merchants manage their e-commerce operations. But “AI assistant” is vague. What does that actually mean?

After months of research and iteration, we defined Copilot as:

An agentic AI system that understands merchant intent, retrieves relevant context from VTEX’s knowledge base, and executes or guides actions across the platform.

This meant building three core capabilities:

  1. Understanding — Parse natural language queries into actionable intents
  2. Retrieval — Find relevant documentation, settings, and historical data
  3. Action — Execute operations or guide users through complex workflows

The Architecture

RAG Pipeline

The heart of Copilot is a Retrieval-Augmented Generation (RAG) pipeline. Here’s how it works:

User Query → Intent Classification → Retrieval → Context Assembly → LLM → Response
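
Spelled out as code, the flow looks roughly like this. It's a runnable but deliberately toy sketch: every function here is an illustrative stand-in, not the production implementation.

from dataclasses import dataclass

@dataclass
class Document:
    text: str
    score: float

def classify_intent(query: str) -> str:
    # Stand-in: the real classifier is a model, not a keyword check
    return "analytics" if "sales" in query.lower() else "general"

def retrieve(query: str, intent: str) -> list[Document]:
    # Stand-in: the real system does hybrid search (covered below)
    return [Document(text=f"VTEX docs relevant to {intent}", score=0.9)]

def assemble_context(docs: list[Document], history: list[str]) -> str:
    return "\n\n".join([d.text for d in docs] + history)

def answer(query: str, history: list[str]) -> str:
    intent = classify_intent(query)            # Intent Classification
    docs = retrieve(query, intent)             # Retrieval
    context = assemble_context(docs, history)  # Context Assembly
    return f"[LLM answer grounded in:]\n{context}"  # LLM call goes here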

Why RAG?

Pure LLMs hallucinate. A lot. RAG grounds the model in actual documentation and data, dramatically reducing hallucination while keeping responses current.

Key decisions we made:

  • Chunking strategy — We use semantic chunking with overlap, not fixed-size chunks. Documentation sections have natural boundaries; there's a sketch after this list.
  • Embedding model — OpenAI’s text-embedding-3-large for production, with plans to move to open-source models for cost reduction.
  • Vector store — Pinecone for low-latency retrieval at scale.
  • Re-ranking — Cohere’s reranker to improve retrieval precision before feeding to the LLM.
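
To illustrate the chunking idea, here's a simplified semantic chunker that splits on markdown headings and carries a short sentence overlap between neighbors. It's a toy version of the approach, not the production chunker.

import re

def chunk_by_sections(markdown: str, overlap_sentences: int = 2) -> list[str]:
    """Split a markdown doc on heading boundaries, with a short overlap."""
    sections = re.split(r"\n(?=#{1,3} )", markdown)  # split before #, ##, ### headings
    chunks: list[str] = []
    for i, section in enumerate(sections):
        if i > 0:
            # Carry the last sentences of the previous section into this chunk,
            # so answers that span a boundary still land in one retrievable chunk
            tail = sections[i - 1].split(". ")[-overlap_sentences:]
            section = ". ".join(tail) + "\n" + section
        chunks.append(section.strip())
    return chunks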

Multi-Agent Orchestration

Complex queries require multiple specialized agents. For example:

“Why did my sales drop last week and how do I fix it?”

This triggers:

  1. Analytics Agent — Queries sales data, identifies anomalies
  2. Diagnostic Agent — Checks for configuration changes, promotions that ended
  3. Recommendation Agent — Suggests actions based on findings

We use a supervisor pattern where a main orchestrator agent delegates to specialists and synthesizes their outputs.

// The supervisor fans the query out to specialists, then merges their answers
const orchestrator = new SupervisorAgent({
  agents: [analyticsAgent, diagnosticAgent, recommendationAgent],
  strategy: 'parallel-then-synthesize', // run specialists concurrently, then synthesize
  timeout: 30000, // ms: abort specialists that blow the latency budget
});

Multi-agent systems are powerful but add complexity. Start with a single agent and only add more when you have clear, separable responsibilities.

Technical Deep-Dive: The Retrieval Layer

Pure vector search isn’t enough. We combine:

  • Semantic search — Vector similarity for conceptual matches
  • Keyword search — BM25 for exact term matches (especially for product names, IDs)
  • Metadata filtering — Filter by document type, recency, language

def hybrid_search(query: str, filters: dict) -> list[Document]:
    # Semantic search: vector similarity for conceptual matches
    semantic_results = vector_store.similarity_search(
        query,
        k=20,
        filter=filters,
    )

    # Keyword search: BM25 for exact term matches (product names, IDs)
    keyword_results = bm25_index.search(query, k=20)

    # Merge the two ranked lists with Reciprocal Rank Fusion, keep the top 10
    return rrf_merge(semantic_results, keyword_results, k=10)
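
The rrf_merge helper implements Reciprocal Rank Fusion: a document scores 1/(c + rank) in each list it appears in, and the sums decide the final order. A minimal version, assuming each result object carries a stable id:

def rrf_merge(list_a: list, list_b: list, k: int = 10, c: int = 60) -> list:
    # c = 60 is the constant from the original RRF paper (Cormack et al., 2009)
    scores: dict[str, float] = {}
    docs_by_id: dict[str, object] = {}
    for ranked in (list_a, list_b):
        for rank, doc in enumerate(ranked, start=1):
            # Documents ranked high in either list accumulate the most score
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (c + rank)
            docs_by_id[doc.id] = doc
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [docs_by_id[doc_id] for doc_id in best]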

Context Window Management

GPT-4’s context window is large but not infinite. We developed a priority system:

  1. Must-have — Directly retrieved documents
  2. Should-have — Conversation history, user profile
  3. Nice-to-have — Related articles, examples

When we approach the limit, we truncate from the bottom of that list up: nice-to-have context goes first, must-have documents last.
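
A simplified version of that truncation logic, using a character budget as a stand-in for real token counting:

def build_context(tiers: list[list[str]], budget_chars: int) -> str:
    # tiers[0] = must-have, tiers[1] = should-have, tiers[2] = nice-to-have
    parts: list[str] = []
    used = 0
    for tier in tiers:  # highest priority first
        for item in tier:
            if used + len(item) > budget_chars:
                # Budget exhausted: everything lower in priority is dropped
                return "\n\n".join(parts)
            parts.append(item)
            used += len(item)
    return "\n\n".join(parts)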

Lessons Learned

1. Latency is Everything

Users expect instant responses. Our targets:

  • P50 < 2 seconds
  • P95 < 5 seconds

We achieved this through:

  • Streaming responses (start showing text immediately)
  • Parallel retrieval across multiple sources
  • Aggressive caching of embeddings
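
Of those, parallel retrieval is the easiest win to show. With asyncio, fanning out to every source at once means total retrieval time is bounded by the slowest source, not the sum of all of them. Both search functions here are stand-ins:

import asyncio

async def search_docs(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # stand-in for a vector store call
    return [f"doc hit for {query!r}"]

async def search_settings(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # stand-in for a platform settings lookup
    return [f"settings hit for {query!r}"]

async def retrieve_all(query: str) -> list[str]:
    # Fan out to every source concurrently and merge the hits
    doc_hits, setting_hits = await asyncio.gather(
        search_docs(query), search_settings(query)
    )
    return doc_hits + setting_hits

print(asyncio.run(retrieve_all("why did sales drop")))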

2. Evaluation is Hard

How do you know if the AI is getting better? We built:

  • Golden dataset — 500 query-response pairs rated by humans
  • Automated evals — LLM-as-judge for fluency, relevance, accuracy
  • A/B testing — Real user comparisons in production
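
The LLM-as-judge evals deserve a sketch. The judge model sees the query, the human-rated reference from the golden dataset, and the candidate answer, and returns a numeric score. Here, call_llm is a placeholder for whichever client you use:

from typing import Callable

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Query: {query}
Reference answer (human-approved): {reference}
Candidate answer: {candidate}

Rate the candidate's accuracy against the reference from 1 to 5.
Reply with only the number."""

def judge(query: str, reference: str, candidate: str,
          call_llm: Callable[[str], str]) -> int:
    # call_llm is any function that maps a prompt string to a completion string
    prompt = JUDGE_PROMPT.format(query=query, reference=reference, candidate=candidate)
    return int(call_llm(prompt).strip())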

3. Prompt Engineering is a Real Job

Our prompts went through 50+ iterations. Key learnings:

  • Be specific about output format
  • Include examples of good and bad responses
  • Use system prompts for persona, user prompts for task
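
In practice that last split looks something like this, in the common chat-completions message shape. The prompt text is illustrative, not our actual prompt:

messages = [
    {
        "role": "system",
        # Persona and hard rules live in the system prompt...
        "content": (
            "You are Copilot, an assistant for VTEX merchants. "
            "Answer only from the provided context. Output valid markdown."
        ),
    },
    {
        "role": "user",
        # ...while the task, retrieved context, and question live in the user turn
        "content": "Context:\n{retrieved_docs}\n\nQuestion: {user_query}",
    },
]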

Prompt Injection

Never trust user input directly in prompts. We implemented input sanitization and guardrails to prevent prompt injection attacks.
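
As a flavor of what sanitization means, here is a deliberately naive pattern filter. It catches low-effort attempts only; production guardrails go well beyond this:

import re

INJECTION_PATTERNS = [
    r"ignore (all |the )?(previous|prior) instructions",
    r"you are now",
    r"reveal (your|the) system prompt",
]

def looks_like_injection(user_input: str) -> bool:
    # A naive first pass only: real guardrails layer pattern checks,
    # classifier models, and strict separation of user text from instructions
    return any(re.search(p, user_input, re.IGNORECASE) for p in INJECTION_PATTERNS)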

4. Observability is Critical

Every request is logged with:

  • Full prompt and response
  • Retrieval sources and scores
  • Latency breakdown by component
  • User feedback signal

This data is gold for debugging and improvement.
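
Concretely, every request emits one structured record along these lines. Field names are illustrative, and retrieval results are assumed to carry an id and a score:

import json, time, uuid

def log_request(query: str, response: str, sources: list,
                latency_ms: dict, feedback: str | None = None) -> None:
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "response": response,
        # Assumes each retrieval result carries an id and a score
        "retrieval": [{"source": s.id, "score": s.score} for s in sources],
        "latency_ms": latency_ms,  # e.g. {"retrieval": 140, "llm": 1800}
        "feedback": feedback,      # thumbs up/down, if the user left one
    }
    print(json.dumps(record))  # stand-in for shipping to an observability backend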

What’s Next?

Copilot is live and helping thousands of merchants daily. But we’re just getting started:

  • Function calling — Let the AI execute actions directly
  • Memory — Long-term context across conversations
  • Personalization — Adapt to each merchant's specific store and history

Bringing It to My Projects

Everything I learned at VTEX is going into WalioAI. The patterns are the same — RAG, multi-agent, streaming — but applied to lead generation instead of e-commerce.


Building in public means sharing what I learn. Follow along for more technical deep-dives.
