
How I Built a Copilot AI at VTEX: RAG, Multi-Agent, and What I Learned

A technical deep-dive into designing the agentic AI system that powers VTEX's Copilot feature.


The Challenge

When I joined the AI team at VTEX, the mission was clear: build an AI assistant that could help merchants manage their e-commerce operations. But “AI assistant” is vague. What does that actually mean?

After months of research and iteration, we defined Copilot as:

An agentic AI system that understands merchant intent, retrieves relevant context from VTEX’s knowledge base, and executes or guides actions across the platform.

This meant building three core capabilities:

  1. Understanding — Parse natural language queries into actionable intents
  2. Retrieval — Find relevant documentation, settings, and historical data
  3. Action — Execute operations or guide users through complex workflows

The Architecture

RAG Pipeline

The heart of Copilot is a Retrieval-Augmented Generation (RAG) pipeline. Here’s how it works:

User Query → Intent Classification → Retrieval → Context Assembly → LLM → Response
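
Spelled out as code, the flow looks roughly like this. It's a runnable but deliberately toy sketch: every function here is an illustrative stand-in, not the production implementation.

from dataclasses import dataclass

@dataclass
class Document:
    text: str
    score: float

def classify_intent(query: str) -> str:
    # Stand-in: the real classifier is a model, not a keyword check
    return "analytics" if "sales" in query.lower() else "general"

def retrieve(query: str, intent: str) -> list[Document]:
    # Stand-in: the real system does hybrid search (covered below)
    return [Document(text=f"VTEX docs relevant to {intent}", score=0.9)]

def assemble_context(docs: list[Document], history: list[str]) -> str:
    return "\n\n".join([d.text for d in docs] + history)

def answer(query: str, history: list[str]) -> str:
    intent = classify_intent(query)            # Intent Classification
    docs = retrieve(query, intent)             # Retrieval
    context = assemble_context(docs, history)  # Context Assembly
    return f"[LLM answer grounded in:]\n{context}"  # LLM call goes here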

Why RAG?

Pure LLMs hallucinate. A lot. RAG grounds the model in actual documentation and data, dramatically reducing hallucination while keeping responses current.

Key decisions we made:

  • Chunking strategy — We use semantic chunking with overlap, not fixed-size chunks. Documentation sections have natural boundaries; there's a sketch after this list.
  • Embedding model — OpenAI’s text-embedding-3-large for production, with plans to move to open-source models for cost reduction.
  • Vector store — Pinecone for low-latency retrieval at scale.
  • Re-ranking — Cohere’s reranker to improve retrieval precision before feeding to the LLM.
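
To illustrate the chunking idea, here's a simplified semantic chunker that splits on markdown headings and carries a short sentence overlap between neighbors. It's a toy version of the approach, not the production chunker.

import re

def chunk_by_sections(markdown: str, overlap_sentences: int = 2) -> list[str]:
    """Split a markdown doc on heading boundaries, with a short overlap."""
    sections = re.split(r"\n(?=#{1,3} )", markdown)  # split before #, ##, ### headings
    chunks: list[str] = []
    for i, section in enumerate(sections):
        if i > 0:
            # Carry the last sentences of the previous section into this chunk,
            # so answers that span a boundary still land in one retrievable chunk
            tail = sections[i - 1].split(". ")[-overlap_sentences:]
            section = ". ".join(tail) + "\n" + section
        chunks.append(section.strip())
    return chunks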

Multi-Agent Orchestration

Complex queries require multiple specialized agents. For example:

“Why did my sales drop last week and how do I fix it?”

This triggers:

  1. Analytics Agent — Queries sales data, identifies anomalies
  2. Diagnostic Agent — Checks for configuration changes, promotions that ended
  3. Recommendation Agent — Suggests actions based on findings

We use a supervisor pattern where a main orchestrator agent delegates to specialists and synthesizes their outputs.

// The supervisor fans the query out to specialists, then merges their answers
const orchestrator = new SupervisorAgent({
  agents: [analyticsAgent, diagnosticAgent, recommendationAgent],
  strategy: 'parallel-then-synthesize', // run specialists concurrently, then synthesize
  timeout: 30000, // ms: abort specialists that blow the latency budget
});

Multi-agent systems are powerful but add complexity. Start with a single agent and only add more when you have clear, separable responsibilities.

Technical Deep-Dive: The Retrieval Layer

Pure vector search isn’t enough. We combine:

  • Semantic search — Vector similarity for conceptual matches
  • Keyword search — BM25 for exact term matches (especially for product names, IDs)
  • Metadata filtering — Filter by document type, recency, language

def hybrid_search(query: str, filters: dict) -> list[Document]:
    # Semantic search: vector similarity for conceptual matches
    semantic_results = vector_store.similarity_search(
        query,
        k=20,
        filter=filters,
    )

    # Keyword search: BM25 for exact term matches (product names, IDs)
    keyword_results = bm25_index.search(query, k=20)

    # Merge the two ranked lists with Reciprocal Rank Fusion, keep the top 10
    return rrf_merge(semantic_results, keyword_results, k=10)
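
The rrf_merge helper implements Reciprocal Rank Fusion: a document scores 1/(c + rank) in each list it appears in, and the sums decide the final order. A minimal version, assuming each result object carries a stable id:

def rrf_merge(list_a: list, list_b: list, k: int = 10, c: int = 60) -> list:
    # c = 60 is the constant from the original RRF paper (Cormack et al., 2009)
    scores: dict[str, float] = {}
    docs_by_id: dict[str, object] = {}
    for ranked in (list_a, list_b):
        for rank, doc in enumerate(ranked, start=1):
            # Documents ranked high in either list accumulate the most score
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (c + rank)
            docs_by_id[doc.id] = doc
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [docs_by_id[doc_id] for doc_id in best]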

Context Window Management

GPT-4’s context window is large but not infinite. We developed a priority system:

  1. Must-have — Directly retrieved documents
  2. Should-have — Conversation history, user profile
  3. Nice-to-have — Related articles, examples

When we approach the limit, we truncate from the bottom of that list up: nice-to-have context goes first, must-have documents last.
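
A simplified version of that truncation logic, using a character budget as a stand-in for real token counting:

def build_context(tiers: list[list[str]], budget_chars: int) -> str:
    # tiers[0] = must-have, tiers[1] = should-have, tiers[2] = nice-to-have
    parts: list[str] = []
    used = 0
    for tier in tiers:  # highest priority first
        for item in tier:
            if used + len(item) > budget_chars:
                # Budget exhausted: everything lower in priority is dropped
                return "\n\n".join(parts)
            parts.append(item)
            used += len(item)
    return "\n\n".join(parts)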

Lessons Learned

1. Latency is Everything

Users expect instant responses. Our targets:

  • P50 < 2 seconds
  • P95 < 5 seconds

We achieved this through:

  • Streaming responses (start showing text immediately)
  • Parallel retrieval across multiple sources
  • Aggressive caching of embeddings
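
Of those, parallel retrieval is the easiest win to show. With asyncio, fanning out to every source at once means total retrieval time is bounded by the slowest source, not the sum of all of them. Both search functions here are stand-ins:

import asyncio

async def search_docs(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # stand-in for a vector store call
    return [f"doc hit for {query!r}"]

async def search_settings(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # stand-in for a platform settings lookup
    return [f"settings hit for {query!r}"]

async def retrieve_all(query: str) -> list[str]:
    # Fan out to every source concurrently and merge the hits
    doc_hits, setting_hits = await asyncio.gather(
        search_docs(query), search_settings(query)
    )
    return doc_hits + setting_hits

print(asyncio.run(retrieve_all("why did sales drop")))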

2. Evaluation is Hard

How do you know if the AI is getting better? We built:

  • Golden dataset — 500 query-response pairs rated by humans
  • Automated evals — LLM-as-judge for fluency, relevance, accuracy
  • A/B testing — Real user comparisons in production
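
The LLM-as-judge evals deserve a sketch. The judge model sees the query, the human-rated reference from the golden dataset, and the candidate answer, and returns a numeric score. Here, call_llm is a placeholder for whichever client you use:

from typing import Callable

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Query: {query}
Reference answer (human-approved): {reference}
Candidate answer: {candidate}

Rate the candidate's accuracy against the reference from 1 to 5.
Reply with only the number."""

def judge(query: str, reference: str, candidate: str,
          call_llm: Callable[[str], str]) -> int:
    # call_llm is any function that maps a prompt string to a completion string
    prompt = JUDGE_PROMPT.format(query=query, reference=reference, candidate=candidate)
    return int(call_llm(prompt).strip())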

3. Prompt Engineering is a Real Job

Our prompts went through 50+ iterations. Key learnings:

  • Be specific about output format
  • Include examples of good and bad responses
  • Use system prompts for persona, user prompts for task
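
In practice that last split looks something like this, in the common chat-completions message shape. The prompt text is illustrative, not our actual prompt:

messages = [
    {
        "role": "system",
        # Persona and hard rules live in the system prompt...
        "content": (
            "You are Copilot, an assistant for VTEX merchants. "
            "Answer only from the provided context. Output valid markdown."
        ),
    },
    {
        "role": "user",
        # ...while the task, retrieved context, and question live in the user turn
        "content": "Context:\n{retrieved_docs}\n\nQuestion: {user_query}",
    },
]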

Prompt Injection

Never trust user input directly in prompts. We implemented input sanitization and guardrails to prevent prompt injection attacks.
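
As a flavor of what sanitization means, here is a deliberately naive pattern filter. It catches low-effort attempts only; production guardrails go well beyond this:

import re

INJECTION_PATTERNS = [
    r"ignore (all |the )?(previous|prior) instructions",
    r"you are now",
    r"reveal (your|the) system prompt",
]

def looks_like_injection(user_input: str) -> bool:
    # A naive first pass only: real guardrails layer pattern checks,
    # classifier models, and strict separation of user text from instructions
    return any(re.search(p, user_input, re.IGNORECASE) for p in INJECTION_PATTERNS)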

4. Observability is Critical

Every request is logged with:

  • Full prompt and response
  • Retrieval sources and scores
  • Latency breakdown by component
  • User feedback signal

This data is gold for debugging and improvement.
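
Concretely, every request emits one structured record along these lines. Field names are illustrative, and retrieval results are assumed to carry an id and a score:

import json, time, uuid

def log_request(query: str, response: str, sources: list,
                latency_ms: dict, feedback: str | None = None) -> None:
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "response": response,
        # Assumes each retrieval result carries an id and a score
        "retrieval": [{"source": s.id, "score": s.score} for s in sources],
        "latency_ms": latency_ms,  # e.g. {"retrieval": 140, "llm": 1800}
        "feedback": feedback,      # thumbs up/down, if the user left one
    }
    print(json.dumps(record))  # stand-in for shipping to an observability backend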

What’s Next?

Copilot is live and helping thousands of merchants daily. But we’re just getting started:

  • Function calling — Let the AI execute actions directly
  • Memory — Long-term context across conversations
  • Personalization — Adapt to each merchant's specific store and history

Bringing It to My Projects

Everything I learned at VTEX is going into WalioAI. The patterns are the same — RAG, multi-agent, streaming — but applied to lead generation instead of e-commerce.


Building in public means sharing what I learn. Follow along for more technical deep-dives.
