The Simulate tab provides tools to run and evaluate AI agent behavior against predefined test cases. It is divided into two sub-sections: Test Cases and Batch Runs.

Overview

The LLM Simulation page allows you to:
  • Define and manage test cases for your AI agents
  • Trigger batch runs that execute multiple test cases simultaneously
  • Review results, success/failure status, and detailed remarks for each job

The page is opened from the Simulate tab in the top navigation bar, next to the Build tab.
A secondary tab bar within the page switches between:
Tab | Description
--- | ---
Test Cases | Create and manage individual simulation scenarios
Batch Runs | View and manage grouped execution runs

Batch Runs

Batch Run List (Left Panel)

The left panel displays all batch runs sorted by creation time. Each entry shows:
  • Batch ID — a unique identifier (e.g. batch_Xb1RnQ8vfbE0Ed8b)
  • Created At — timestamp of when the batch was created (e.g. 4/18/2026, 4:57:05 AM)
Clicking a batch run loads its details in the right panel.
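The Created At values use a month/day/year, 12-hour display format. To reproduce the list order outside the UI, for example from values copied out of the panel, a minimal Python sketch can parse them with `datetime.strptime`. The `BatchRun` record and its field names are hypothetical, and newest-first ordering is an assumption, since the page only says runs are sorted by creation time.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical record for one entry in the Batch Run list (not a documented API).
@dataclass
class BatchRun:
    batch_id: str
    created_at: str  # display format, e.g. "4/18/2026, 4:57:05 AM"

# strptime pattern matching the displayed Created At format.
DISPLAY_FORMAT = "%m/%d/%Y, %I:%M:%S %p"

def parse_created_at(value: str) -> datetime:
    """Parse a displayed Created At timestamp into a datetime."""
    return datetime.strptime(value, DISPLAY_FORMAT)

def sort_batch_runs(runs: list[BatchRun], newest_first: bool = True) -> list[BatchRun]:
    """Order batch runs by creation time; newest-first is an assumed default."""
    return sorted(runs, key=lambda r: parse_created_at(r.created_at), reverse=newest_first)
```

Parsing into `datetime` objects matters because sorting the raw strings would compare them character by character, which misorders dates such as "4/9/2026" and "4/18/2026".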

Batch Run Summary (Right Panel — Top)

When a batch run is selected, three summary cards are displayed:
Metric | Description
--- | ---
Total Attempts | Total number of jobs executed in the batch
Successful Attempts | Number of jobs where is_success = true
Failed Attempts | Number of jobs where is_success = false
Example:
Total Attempts: 30   Successful Attempts: 17   Failed Attempts: 13
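Assuming every job in a batch ends up with a true or false is_success value, the Successful and Failed counts add up to Total Attempts (17 + 13 = 30 in the example above). The following minimal Python sketch derives the three card values from a list of per-job flags; `BatchSummary` and `summarize` are hypothetical names, not part of the product.

```python
from dataclasses import dataclass

@dataclass
class BatchSummary:
    total_attempts: int
    successful_attempts: int
    failed_attempts: int

def summarize(is_success_values: list[bool]) -> BatchSummary:
    """Derive the three summary-card values from each job's is_success flag.

    Assumes every job has a definite True/False result (no pending jobs).
    """
    successful = sum(1 for ok in is_success_values if ok)
    failed = sum(1 for ok in is_success_values if not ok)
    return BatchSummary(len(is_success_values), successful, failed)

# Reproduces the example cards: 17 passing jobs and 13 failing jobs.
print(summarize([True] * 17 + [False] * 13))
# BatchSummary(total_attempts=30, successful_attempts=17, failed_attempts=13)
```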

Job Results Table

Below the summary cards, a detailed table lists every job in the selected batch run.

Columns

Column | Type | Description
--- | --- | ---
Job ID | string | Unique identifier for the individual job (e.g. job_h5IDmcZfPRUIGaRv)
Test Case ID | string | The test case the job was run against (e.g. tc_wlKlRIsGjKV9gFDL)
Chat ID | string | Identifier of the chat session generated during the simulation (e.g. chat-r5ThYqOTUZpzk8Jl)
Is Success | boolean | true (green) if the agent met all evaluation criteria; false (red) if it did not
Status | badge | Current state of the job, typically Completed
Remarks | text | AI-generated evaluation summary explaining why the job passed or failed
Created Datetime | timestamp | When the job was created
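To work with these rows programmatically, for example after copying them out of the table, one way to model a job is a Python dataclass that mirrors the columns. The field names and the `jobs_to_review` helper are hypothetical; the product's own export or data format, if any, is not described on this page.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class JobResult:
    """One row of the Job Results table; field names are hypothetical."""
    job_id: str                 # e.g. "job_h5IDmcZfPRUIGaRv"
    test_case_id: str           # e.g. "tc_wlKlRIsGjKV9gFDL"
    chat_id: str                # e.g. "chat-r5ThYqOTUZpzk8Jl"
    is_success: bool            # True if the agent met all evaluation criteria
    status: str                 # current execution state, e.g. "Completed"
    remarks: str                # AI-generated evaluation summary
    created_datetime: datetime  # when the job was created

def jobs_to_review(jobs: list[JobResult]) -> list[JobResult]:
    """Return only the jobs with is_success False, oldest first, for reading their remarks."""
    return sorted((j for j in jobs if not j.is_success), key=lambda j: j.created_datetime)
```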

Success / Failure Badge

The Is Success column renders a color-coded badge:
  • 🟢 true — Agent passed all evaluation criteria
  • 🔴 false — Agent failed to meet one or more criteria

Status Badge

The Status column shows the current execution state. Common values:
Status | Meaning
--- | ---
Completed | Job finished execution
Running | Job is currently in progress
Failed | Job encountered an execution error
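Status and Is Success describe different things: Status reports whether the job ran, while Is Success reports whether the agent met the evaluation criteria. A job can be Completed and still have is_success = false, and the Failed Attempts card counts is_success = false, not the Failed status. The Python sketch below models the three listed statuses as an enum. The `describe` helper is hypothetical, and treating a Failed job's is_success flag as not meaningful is an assumption.

```python
from enum import Enum

class JobStatus(Enum):
    """Common Status values from the table above; other values may exist."""
    COMPLETED = "Completed"  # job finished execution
    RUNNING = "Running"      # job is currently in progress
    FAILED = "Failed"        # job encountered an execution error

def describe(status: JobStatus, is_success: bool) -> str:
    """Combine execution state and evaluation outcome into one label."""
    if status is JobStatus.RUNNING:
        return "in progress"
    if status is JobStatus.FAILED:
        # Assumption: an errored job has no meaningful evaluation result.
        return "execution error"
    return "passed evaluation" if is_success else "failed evaluation"

# A Completed job can still fail evaluation.
print(describe(JobStatus("Completed"), False))  # failed evaluation
```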

Remarks

The Remarks column contains a natural-language summary of agent performance. Examples:
“The agent successfully introduced themselves warmly, clearly explained the overdue balance, offered a payment plan in response to the customer’s financial difficulty, confirmed the 3-installment option, and closed the conversation by…”
“The agent acknowledged the customer’s time constraints and kept the pitch brief, but failed to effectively execute a micro-close. While the agent did mention the credit impact reminder, it was not integrated naturally into the conversation and…”
Remarks are truncated in the table view. Click a row to view the full remark.

Example Batch Run

Below is a sample from a batch run showing mixed results:
Job ID | Test Case ID | Is Success | Remarks (summary)
--- | --- | --- | ---
job_h5IDmcZfPRUIGaRv | tc_wlKlRIsGjKV9gFDL | ❌ false | Agent failed to complete the primary goal (customer requested a callback)
job_nNjj6EgdIyOlqTcb | tc_jBuFAVeD0OsSekgW | ✅ true | Successfully introduced, explained balance, confirmed payment plan
job_hFO6nnPvfw8mGa9c | tc_jBuFAVeD0OsSekgW | ✅ true | Successfully handled objection and confirmed 3-installment option
job_7NknE7cMoPvsppko | tc_jBuFAVeD0OsSekgW | ✅ true | Payment plan confirmed, conversation closed appropriately
job_FMCxaLhVViQj51Kl | tc_lDawKCNE05hzUaOv | ✅ true | Remained empathetic, handled objection twice with different approaches
job_AC59MKlMCJWlLbvS | tc_wlKlRIsGjKV9gFDL | ❌ false | Micro-close failed; credit reminder not integrated naturally
job_KxemQu0GeTC951ao | tc_lDawKCNE05hzUaOv | ❌ false | Failed to provide non-payment alternatives per customer's primary request

Tips

  • Use Batch Runs to run large-scale evaluations across many test cases at once.
  • Monitor the Successful / Failed Attempts summary to quickly gauge agent quality.
  • Read Remarks carefully — they provide specific, actionable feedback about agent behavior.
  • Multiple jobs can share the same Test Case ID, allowing you to test consistency across runs.
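The last tip can be applied to the example batch run shown earlier. This minimal Python sketch groups the sample rows by Test Case ID and counts how many runs of each test case passed. It covers only the seven sample rows, not the full batch of 30, and `pass_rate_by_test_case` is a hypothetical helper.

```python
from collections import defaultdict

# (job_id, test_case_id, is_success) rows copied from the example batch run.
SAMPLE_JOBS = [
    ("job_h5IDmcZfPRUIGaRv", "tc_wlKlRIsGjKV9gFDL", False),
    ("job_nNjj6EgdIyOlqTcb", "tc_jBuFAVeD0OsSekgW", True),
    ("job_hFO6nnPvfw8mGa9c", "tc_jBuFAVeD0OsSekgW", True),
    ("job_7NknE7cMoPvsppko", "tc_jBuFAVeD0OsSekgW", True),
    ("job_FMCxaLhVViQj51Kl", "tc_lDawKCNE05hzUaOv", True),
    ("job_AC59MKlMCJWlLbvS", "tc_wlKlRIsGjKV9gFDL", False),
    ("job_KxemQu0GeTC951ao", "tc_lDawKCNE05hzUaOv", False),
]

def pass_rate_by_test_case(jobs):
    """Return {test_case_id: (passed, total)} so repeated runs can be compared."""
    counts = defaultdict(lambda: [0, 0])  # test_case_id -> [passed, total]
    for _job_id, test_case_id, is_success in jobs:
        counts[test_case_id][1] += 1
        if is_success:
            counts[test_case_id][0] += 1
    return {tc: (passed, total) for tc, (passed, total) in counts.items()}

for tc, (passed, total) in pass_rate_by_test_case(SAMPLE_JOBS).items():
    print(f"{tc}: {passed}/{total} passed")
# tc_wlKlRIsGjKV9gFDL: 0/2 passed
# tc_jBuFAVeD0OsSekgW: 3/3 passed
# tc_lDawKCNE05hzUaOv: 1/2 passed
```

A test case with mixed results, such as tc_lDawKCNE05hzUaOv here, points to inconsistent agent behavior; comparing the Remarks of its passing and failing jobs shows what differed between the runs.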