bitfab

Automatically test and fix your AI stack using your trace data.

A semi-autonomous loop to find the best code changes for your LLM workflows and agents using real customer scenarios.

Teams already building their AI stack factory with Bitfab.

Built for agentic engineers.

We wrapped all the functionality of our MCP, CLI, and SDKs into two well-tested coding agent skills so you don’t need to build infra around our infra.

Claude Code · Cursor · Codex

Find and label the scenarios that matter in your traces, using your coding agent powered by advanced trace search.

Coding agent searching and analyzing production traces

Precise data labeling with your coding agent. Tighten when you need to.

The agent annotates the spans that matter and collapses noisy traces. Override whatever’s wrong. The dataset reflects your judgment and your customers’, not the agent’s.

Bitfab · chatbot · Trace with annotations · ✓ Done

trace · chatbot/order_status · 2.4s · 5 spans · 3 annotated

  • classify_intent · llm_call · 45ms
  • search_orders · tool_call · 1.2s
    Claude · performance · fail

    search_orders is most of the wall clock. A bounded query or cache knocks this under a second.

  • fetch_order · tool_call · 210ms
    Claude · wrong tool call · fail

    Called with order_id=4821. Customer wrote #4821-A, so the wrong order was returned.

  • summarize · llm_call · 180ms
  • generate_response · llm_call · 890ms
    You · missing context · fail

    Trace shows a 3-day shipping delay; the reply never mentions it. Override for the dataset.

2 labels by Claude · 1 human edit
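
The override rule described above, where a human label takes precedence over the agent's label on the same span, can be sketched as a small data model. This is only an illustration and not Bitfab's SDK: the names Annotation, Span, and effective_label are hypothetical, and the agent label on generate_response is invented here to show an override.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    author: str    # "agent" or "human" (hypothetical convention)
    category: str  # e.g. "performance", "wrong tool call"
    verdict: str   # "pass" or "fail"
    note: str = ""


@dataclass
class Span:
    name: str
    kind: str  # "llm_call" or "tool_call"
    duration_ms: int
    annotations: list[Annotation] = field(default_factory=list)


def effective_label(span: Span) -> Optional[Annotation]:
    """Pick the label the dataset keeps: the latest human one, else the latest agent one."""
    human = [a for a in span.annotations if a.author == "human"]
    if human:
        return human[-1]
    return span.annotations[-1] if span.annotations else None


# Span values taken from the trace above; the agent's "pass" is an assumed starting point.
span = Span(
    name="generate_response",
    kind="llm_call",
    duration_ms=890,
    annotations=[
        Annotation("agent", "missing context", "pass"),
        Annotation("human", "missing context", "fail", "Reply never mentions the 3-day delay."),
    ],
)
label = effective_label(span)
print(label.author, label.verdict)
```

Keeping both annotations on the span, rather than overwriting the agent's label, preserves a record of where the human disagreed.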

Find fixes by automatically rerunning customer scenarios as verification.

Rerun experiments locally or in the cloud on production trace data with custom sandboxes.

Request a custom sandbox

replay · dataset: search-quality-v3 · 50 scenarios
46 pass · 3 fail · 1 changed · baseline: prod@main · candidate: prompt-v4

#07 · User asks about refund window · prompt edit
prod: "Refunds are available within 30 days."
Candidate output
v4: "Refunds within 30 days of purchase, no questions asked."
Clearer phrasing, intent preserved.

#23 · User asks for order #4521 status · retrieval config
prod: "Order #4521 shipped Monday, size 10 Pegasus."
Candidate output
v4: "I don't have access to order details right now."
New filter dropped the order-history index.

#31 · User requests refund for damaged item · tool schema
prod: process_refund(order=4521, reason="damaged")
Candidate tool call
v4: escalate_to_human(reason="damage refund")
Tool renamed. Model chose safer path.

Replay completed in 47s against isolated sandbox. · 47 of 50 shown
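
One way to read the pass / fail / changed summary above is as a tally over per-scenario comparisons between the baseline and candidate outputs. The following is a minimal sketch under stated assumptions, not a description of how Bitfab computes these statuses: the Scenario type, the classify and summarize helpers, the judge callback, and the rule "identical output passes, an accepted difference is changed, anything else fails" are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Scenario:
    id: int
    baseline_output: str
    candidate_output: str


def classify(s: Scenario, judge: Callable[[Scenario], bool]) -> str:
    """Return "pass", "changed", or "fail" for one replayed scenario."""
    if s.candidate_output == s.baseline_output:
        return "pass"
    # The output differs: the judge decides whether the difference is acceptable.
    return "changed" if judge(s) else "fail"


def summarize(scenarios: Iterable[Scenario], judge: Callable[[Scenario], bool]) -> Counter:
    """Count how many scenarios fall into each status."""
    return Counter(classify(s, judge) for s in scenarios)


scenarios = [
    Scenario(7, "Refunds are available within 30 days.",
             "Refunds within 30 days of purchase, no questions asked."),
    Scenario(23, "Order #4521 shipped Monday, size 10 Pegasus.",
             "I don't have access to order details right now."),
    Scenario(1, "Your order has shipped.", "Your order has shipped."),  # made-up scenario
]

# Assumption for illustration: a reviewer accepted the rewording in scenario 7.
accepted_ids = {7}
summary = summarize(scenarios, lambda s: s.id in accepted_ids)
print(dict(summary))
```

In practice the judge could be a human review step, an assertion on tool calls, or a model-graded check; the tally logic stays the same.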

Auto-research for your AI stack using your customer data.

Book a demo