Automatically test and fix your AI stack using your trace data.

A semi-autonomous loop that finds the best code changes for your LLM workflows and agents, verified against real customer scenarios.

Teams already building their AI stack factory with Bitfab.

Haystack

Truco

Built for agentic engineers.

We wrapped all the functionality of our MCP, CLI, and SDKs into two well-tested coding agent skills so you don’t need to build infra around our infra.

/bitfab:setup

Set up trace instrumentation and replay scripts in under 5 minutes without reading any docs.

/bitfab:improve

Build and experiment against datasets of production traces using an autonomous loop that you can guide as much or as little as you want.

Find and label the scenarios that matter in your traces with your coding agent, backed by advanced trace search.

Precise data labeling with your coding agent. Tighten when you need to.

The agent annotates the spans that matter and collapses noisy traces. Override whatever’s wrong. The dataset reflects your judgment and your customers’, not the agent’s.

Bitfab · chatbot · Trace with annotations · ✓ Done

trace · chatbot/order_status · 2.4s · 5 spans · 3 annotated

classify_intent · llm_call · 45ms
search_orders · tool_call · 1.2s
  Claude · performance · fail
  search_orders is most of the wall clock. A bounded query or cache knocks this under a second.
fetch_order · tool_call · 210ms
  Claude · wrong tool call · fail
  Called with order_id=4821. Customer wrote #4821-A, so the wrong order was returned.
summarize · llm_call · 180ms
generate_response · llm_call · 890ms
  You · missing context · fail
  Trace shows a 3-day shipping delay; the reply never mentions it. Override for the dataset.

2 labels by Claude · 1 human edit
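
The override rule shown in the trace above (a human label takes precedence over the agent's) can be sketched as a small data model. This is a hypothetical illustration, not Bitfab's SDK or storage format: `Annotation`, `Span`, and `effective_label` are assumed names, and the agent's "pass" label in the example is invented to show an override.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    author: str    # "agent" or "human" (hypothetical convention)
    category: str  # e.g. "performance", "missing context"
    verdict: str   # "pass" or "fail"
    note: str = ""


@dataclass
class Span:
    name: str
    kind: str      # "llm_call" or "tool_call"
    duration_ms: int
    annotations: list[Annotation] = field(default_factory=list)


def effective_label(span: Span) -> Optional[Annotation]:
    """Return the annotation a dataset should use for this span.

    A human annotation always wins over an agent one; among annotations
    from the same kind of author, the most recently added wins.
    """
    chosen: Optional[Annotation] = None
    for ann in span.annotations:
        if chosen is None or ann.author == "human" or chosen.author != "human":
            chosen = ann
    return chosen


# Example: the agent's "pass" is invented for illustration; the human
# label mirrors the "missing context" override in the trace above.
span = Span("generate_response", "llm_call", 890)
span.annotations.append(Annotation("agent", "quality", "pass"))
span.annotations.append(
    Annotation("human", "missing context", "fail",
               "Reply never mentions the 3-day shipping delay.")
)
label = effective_label(span)
```

Here `label` is the human annotation, so the dataset records the span as a failure even though the agent passed it.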

Find fixes by automatically rerunning customer scenarios as verification.

Rerun experiments locally or in the cloud on production trace data with custom sandboxes.

Request a custom sandbox

replay · dataset: search-quality-v3 · 50 scenarios

46 pass · 3 fail · 1 changed · baseline: prod@main · candidate: prompt-v4

✓ #07

User asks about refund window · prompt edit

prod: "Refunds are available within 30 days."

Candidate output

v4: "Refunds within 30 days of purchase, no questions asked."

Clearer phrasing, intent preserved.

✕ #23

User asks for order #4521 status · retrieval config

prod: "Order #4521 shipped Monday, size 10 Pegasus."

Candidate output

v4: "I don't have access to order details right now."

New filter dropped the order-history index.

⊕ #31

User requests refund for damaged item · tool schema

prod: process_refund(order=4521, reason="damaged")

Candidate tool call

v4: escalate_to_human(reason="damage refund")

Tool renamed. Model chose safer path.

Replay completed in 47s against isolated sandbox. · 47 of 50 shown
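
The replay summary above boils down to rerunning each recorded scenario through a candidate and tallying verdicts. A minimal sketch of that loop follows, under stated assumptions: `Scenario`, `replay`, and the toy `candidate` and `judge` functions are hypothetical and are not Bitfab's API. How a real verdict is decided is left to the caller.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Scenario:
    id: str
    prompt: str
    baseline_output: str  # what production returned in the recorded trace


def replay(
    scenarios: Iterable[Scenario],
    candidate: Callable[[str], str],
    judge: Callable[[Scenario, str], str],
) -> Counter:
    """Rerun each scenario through `candidate` and count the verdicts.

    `judge` returns a verdict string such as "pass", "fail", or "changed".
    """
    tally: Counter = Counter()
    for scenario in scenarios:
        output = candidate(scenario.prompt)
        tally[judge(scenario, output)] += 1
    return tally


# Toy data taken from scenarios #07 and #23 above.
scenarios = [
    Scenario("07", "refund window",
             "Refunds are available within 30 days."),
    Scenario("23", "order #4521 status",
             "Order #4521 shipped Monday, size 10 Pegasus."),
]

# Stand-in for the candidate system: canned outputs keyed by prompt.
canned = {
    "refund window": "Refunds within 30 days of purchase, no questions asked.",
    "order #4521 status": "I don't have access to order details right now.",
}


def candidate(prompt: str) -> str:
    return canned[prompt]


def judge(scenario: Scenario, output: str) -> str:
    # Toy rule for illustration only: refusing to answer counts as a failure.
    return "fail" if "don't have access" in output else "pass"


tally = replay(scenarios, candidate, judge)
summary = " · ".join(f"{n} {verdict}" for verdict, n in sorted(tally.items()))
```

Passing the judge in as a function keeps the loop independent of how verdicts are reached, whether by exact match, a rubric, or a human review step.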

Auto-research for your AI stack using your customer data.

Book a demo