Overview
The LLM Simulation page allows you to:
- Define and manage test cases for your AI agents
- Trigger batch runs that execute multiple test cases simultaneously
- Review results, success/failure status, and detailed remarks for each job
Navigation
The page is accessed from the top navigation bar under the Simulate tab (alongside Build). It has two tabs:
| Tab | Description |
|---|---|
| Test Cases | Create and manage individual simulation scenarios |
| Batch Runs | View and manage grouped execution runs |
Batch Runs
Batch Run List (Left Panel)
The left panel displays all batch runs sorted by creation time. Each entry shows:
- Batch ID: a unique identifier (e.g. batch_Xb1RnQ8vfbE0Ed8b)
- Created At: timestamp of when the batch was created (e.g. 4/18/2026, 4:57:05 AM)
Batch Run Summary (Right Panel — Top)
When a batch run is selected, three summary cards are displayed:
| Metric | Description |
|---|---|
| Total Attempts | Total number of jobs executed in the batch |
| Successful Attempts | Number of jobs where is_success = true |
| Failed Attempts | Number of jobs where is_success = false |
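The three cards are simple tallies over the jobs in the selected batch. The sketch below shows that arithmetic in Python, assuming each job is available as a record with an is_success boolean. The Job class and the summarize function are hypothetical names used for illustration; they are not part of the product's API.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record mirroring the Job ID, Test Case ID, and Is Success columns.
    job_id: str
    test_case_id: str
    is_success: bool


def summarize(jobs):
    """Return the three summary-card values for a list of Job records."""
    successful = sum(1 for job in jobs if job.is_success)
    failed = sum(1 for job in jobs if not job.is_success)
    return {
        "total_attempts": len(jobs),
        "successful_attempts": successful,
        "failed_attempts": failed,
    }


if __name__ == "__main__":
    # Placeholder IDs, not real jobs.
    example = [
        Job("job_a", "tc_1", True),
        Job("job_b", "tc_1", False),
        Job("job_c", "tc_2", True),
    ]
    print(summarize(example))
```

Dividing successful_attempts by total_attempts gives a single pass rate that can be compared across batch runs.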
Job Results Table
Below the summary cards, a detailed table lists every job in the selected batch run.
Columns
| Column | Type | Description |
|---|---|---|
| Job ID | string | Unique identifier for the individual job (e.g. job_h5IDmcZfPRUIGaRv) |
| Test Case ID | string | The test case the job was run against (e.g. tc_wlKlRIsGjKV9gFDL) |
| Chat ID | string | Identifier of the chat session generated during simulation (e.g. chat-r5ThYqOTUZpzk8Jl) |
| Is Success | boolean | true (green) if the agent met all evaluation criteria; false (red) if it did not |
| Status | badge | Current state of the job, typically Completed |
| Remarks | text | AI-generated evaluation summary explaining why the job passed or failed |
| Created Datetime | timestamp | When the job was created |
Success / Failure Badge
The Is Success column renders a color-coded badge:
- 🟢 true: Agent passed all evaluation criteria
- 🔴 false: Agent failed to meet one or more criteria
Status Badge
The Status column shows the current execution state. Common values:
| Status | Meaning |
|---|---|
| Completed | Job finished execution |
| Running | Job is currently in progress |
| Failed | Job encountered an execution error |
Remarks
The Remarks column contains a natural-language summary of agent performance. Examples:
“The agent successfully introduced themselves warmly, clearly explained the overdue balance, offered a payment plan in response to the customer’s financial difficulty, confirmed the 3-installment option, and closed the conversation by…”
“The agent acknowledged the customer’s time constraints and kept the pitch brief, but failed to effectively execute a micro-close. While the agent did mention the credit impact reminder, it was not integrated naturally into the conversation and…”
Remarks are truncated in the table view. Click a row to view the full remark.
Example Batch Run
Below is a sample from a batch run showing mixed results:
| Job ID | Test Case ID | Is Success | Remarks (summary) |
|---|---|---|---|
| job_h5IDmcZfPRUIGaRv | tc_wlKlRIsGjKV9gFDL | ❌ false | Agent failed to complete the primary goal; customer requested a callback |
| job_nNjj6EgdIyOlqTcb | tc_jBuFAVeD0OsSekgW | ✅ true | Successfully introduced, explained balance, confirmed payment plan |
| job_hFO6nnPvfw8mGa9c | tc_jBuFAVeD0OsSekgW | ✅ true | Successfully handled objection and confirmed 3-installment option |
| job_7NknE7cMoPvsppko | tc_jBuFAVeD0OsSekgW | ✅ true | Payment plan confirmed, conversation closed appropriately |
| job_FMCxaLhVViQj51Kl | tc_lDawKCNE05hzUaOv | ✅ true | Remained empathetic, handled objection twice with different approaches |
| job_AC59MKlMCJWlLbvS | tc_wlKlRIsGjKV9gFDL | ❌ false | Micro-close failed; credit reminder not integrated naturally |
| job_KxemQu0GeTC951ao | tc_lDawKCNE05hzUaOv | ❌ false | Failed to provide non-payment alternatives per customer’s primary request |
Tips
- Use Batch Runs to run large-scale evaluations across many test cases at once.
- Monitor the Successful / Failed Attempts summary to quickly gauge agent quality.
- Read Remarks carefully — they provide specific, actionable feedback about agent behavior.
- Multiple jobs can share the same Test Case ID, allowing you to test consistency across runs.
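To make the consistency tip concrete, the following Python sketch groups the seven jobs from the example batch run by Test Case ID and computes a pass rate for each. The rows are copied from the example table; the tuple layout and the pass_rate_by_test_case helper are assumptions made for illustration, since this page does not describe an export format or API.

```python
from collections import defaultdict

# (job_id, test_case_id, is_success) rows copied from the example batch run.
SAMPLE_JOBS = [
    ("job_h5IDmcZfPRUIGaRv", "tc_wlKlRIsGjKV9gFDL", False),
    ("job_nNjj6EgdIyOlqTcb", "tc_jBuFAVeD0OsSekgW", True),
    ("job_hFO6nnPvfw8mGa9c", "tc_jBuFAVeD0OsSekgW", True),
    ("job_7NknE7cMoPvsppko", "tc_jBuFAVeD0OsSekgW", True),
    ("job_FMCxaLhVViQj51Kl", "tc_lDawKCNE05hzUaOv", True),
    ("job_AC59MKlMCJWlLbvS", "tc_wlKlRIsGjKV9gFDL", False),
    ("job_KxemQu0GeTC951ao", "tc_lDawKCNE05hzUaOv", False),
]


def pass_rate_by_test_case(jobs):
    """Map each test case ID to the fraction of its jobs that succeeded."""
    counts = defaultdict(lambda: [0, 0])  # test_case_id -> [passed, total]
    for _job_id, test_case_id, is_success in jobs:
        counts[test_case_id][1] += 1
        if is_success:
            counts[test_case_id][0] += 1
    return {tc: passed / total for tc, (passed, total) in counts.items()}


if __name__ == "__main__":
    for test_case_id, rate in pass_rate_by_test_case(SAMPLE_JOBS).items():
        print(f"{test_case_id}: {rate:.0%}")
    # tc_wlKlRIsGjKV9gFDL: 0%
    # tc_jBuFAVeD0OsSekgW: 100%
    # tc_lDawKCNE05hzUaOv: 50%
```

A rate strictly between 0% and 100%, as with tc_lDawKCNE05hzUaOv here, means the agent behaved differently on the same scenario, so the Remarks for those jobs are a good place to start.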