Evaluate

Score every call on 100s of metrics.

Code, LLM-as-judge, and audio checks run on every conversation — so you know exactly where your agent is failing, and why.

Start testing free · Book a demo
Batch Runs · Run #157
Verify bot authentication
Nightly Regression · Jun 19, 2026 · 11:42 PM
02:45 · Evaluating
Audio analysis in progress: scoring metrics across the audio and transcript...

Conversation Metrics
Latency: Computing
User interrupting AI: Computing

Custom LLM Metrics (LLM-as-judge)
Authentication Verified (Critical): Computing
Customer Satisfaction: Computing
Script Adherence: Computing
Tone & Empathy: Computing
Hallucination Check (Critical): Computing

Transcript (7 turns)
Agent (00:02): Thanks for calling Acme Support, this is Ava. Who am I speaking with?
Caller (00:06): Hi, it's Sarah Miller. I'd like to check my balance.
Agent (00:10): Happy to help. First I need to verify you: date of birth, last four of your SSN, and ZIP code?
Caller (00:17): March 14th 1990, 4821, and 94107.
Agent (00:23): Perfect, you're verified.
Agent (00:27): Your current balance is $1,240.50.
Caller (00:31): Hmm, that seems higher than I expected.

100s of metrics

System-defined, audio, LLM-as-judge, and code-as-judge — all scored automatically.

Transcript search

Search across thousands of conversations to find patterns and outliers.

Automatic flagging

Issues highlighted inline with severity and the exact reason they failed.

Metrics

Four ways to score every call

Mix and match hundreds of checks across four judge types — from deterministic rules to LLM reasoning and raw audio analysis — all running on every conversation.

LLM-as-judge — hallucination, compliance, resolution, custom rubrics
Audio — voice clarity, speech rate, tone and sentiment
System-defined — latency, silence gaps, talk-over, interruptions
Code-as-judge — deterministic rules and programmatic checks
LLM-as-judge (custom): reasons over the full transcript. Examples: Compliance Check, Resolution Status, Agent Empathy.
Audio (12+ metrics pre-defined): signal analysis on the recording. Examples: Voice Tone & Clarity, Voice Change Detection, Words / Minute.
System-defined (11+ metrics pre-defined): measured from call timing. Examples: Response Latency, User Interruptions, Silence Detection.
Code-as-judge (custom): deterministic rules you define. Examples: Disclosure phrase check, Permitted options check, Custom rule.
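As an illustration of the code-as-judge idea, a disclosure phrase check can be written as a small deterministic function. The sketch below is only an assumption about how such a rule might look: the turn format (`speaker` and `text` keys), the function names, and the sample phrase are hypothetical and not taken from the product's API.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def disclosure_phrase_present(turns, phrase, speaker="Agent"):
    """Return True if any turn by `speaker` contains `phrase` as whole words.

    `turns` is a hypothetical transcript format: a list of dicts with
    "speaker" and "text" keys.
    """
    # Pad with spaces so the phrase only matches on word boundaries.
    target = f" {normalize(phrase)} "
    return any(
        target in f" {normalize(turn['text'])} "
        for turn in turns
        if turn["speaker"] == speaker
    )

# Hypothetical sample transcript
turns = [
    {"speaker": "Agent", "text": "This call may be recorded for quality purposes."},
    {"speaker": "Caller", "text": "Okay, that's fine."},
]
print(disclosure_phrase_present(turns, "this call may be recorded"))  # True
```

Because the rule is plain code, it returns the same verdict on every run, which is what separates it from an LLM-as-judge check.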

The metric library

Hundreds of metrics, ready to run

Start from a curated library and add your own. Here's a sample of what ships out of the box.

Compliance Check (LLM, Pass / Fail): Required disclosures & procedures followed
Resolution Status (LLM, Pass / Fail): Customer's primary issue resolved
PII Handling (LLM, Pass / Fail): Sensitive data handled safely
Agent Empathy (LLM, Rating 1–5): Empathy toward the customer
Script Adherence (LLM, Rating 1–5): Followed the expected call flow
Customer Sentiment (LLM, Rating 1–5): Overall sentiment across the call
Objection Handling (LLM, Pass / Fail): Acknowledged & addressed objections
Numeric Verbalization (LLM, Pass / Fail): Numbers read back accurately
Response Latency (System, e.g. 1.4s avg): Time to first response per turn
User Interruptions (System, Count): Times the caller cut in
Silence Detection (System, % of call): Dead air across the conversation
AI Reaction Time (System, ms): Stop time after an interruption
Voice Tone & Clarity (Audio, Rating 1–5): Clarity and tone of the voice
Voice Change Detection (Audio, Pass / Fail): Same speaker throughout the call
Words / Minute (Audio, WPM): Speech rate of the agent
Disclosure Phrase (Code, Pass / Fail): Exact required phrase was spoken
+ 150 more in the gallery, plus unlimited custom metrics
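To make the timing-based entries concrete, here is a minimal sketch of how Response Latency and Words / Minute could be derived from turn timestamps. The `Turn` structure, the function names, and the sample values are assumptions for illustration, and the words-per-minute figure here is computed from the transcript text rather than from the audio signal.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    start: float  # seconds from the start of the call
    end: float    # seconds from the start of the call
    text: str

def response_latencies(turns):
    """Gap between the end of each caller turn and the agent turn right after it."""
    return [
        round(nxt.start - cur.end, 3)
        for cur, nxt in zip(turns, turns[1:])
        if cur.speaker == "Caller" and nxt.speaker == "Agent"
    ]

def words_per_minute(turns, speaker="Agent"):
    """Words spoken by `speaker` divided by that speaker's talk time in minutes."""
    words = sum(len(t.text.split()) for t in turns if t.speaker == speaker)
    seconds = sum(t.end - t.start for t in turns if t.speaker == speaker)
    return 60.0 * words / seconds if seconds > 0 else 0.0

# Hypothetical sample turns
turns = [
    Turn("Agent", 2.0, 5.0, "Thanks for calling, who am I speaking with?"),
    Turn("Caller", 6.0, 9.0, "Hi, it's Sarah Miller."),
    Turn("Agent", 10.0, 13.0, "Happy to help. First I need to verify you."),
]
print(response_latencies(turns))  # [1.0]
print(words_per_minute(turns))    # 170.0
```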

More in Evaluate

Everything you need to understand call quality.

Transcript · call #47-038 · Hallucination
Caller: Can I get a refund if I cancel today?
Agent: Absolutely! We offer a 30-day money-back guarantee on all plans.
Flag: Policy removed 6 months ago; current window is 7 days
Transcripts

Read every call, catch every issue

Full transcripts with inline issue highlighting — hallucinations, policy violations, and unauthorized actions flagged right where they happen.

Re-runs

Re-run metrics after every fix

Changed a prompt? Re-run the same metrics on the same batch and watch the scores move. Kick off multiple runs in parallel and compare them as they finish.

Batch #47 · 240 calls

Call      Caller · scenario             Score    Action
call_e07  Maria C. · billing dispute    92       Re-run
call_e19  James K. · cancel flow        Running
call_e12  Robert T. · tech help         88       Re-run
call_e23  Sophia L. · refund request    95       Re-run
call_e31  David M. · plan upgrade       90       Re-run

Re-run any call or the full batch · same 18 metrics
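Comparing a re-run against the earlier run comes down to a per-call score difference. The sketch below assumes each run is available as a plain mapping from call ID to score; that shape, the function name, and the sample IDs and scores are hypothetical.

```python
def compare_runs(before, after):
    """Return the score change per call for calls scored in both runs."""
    shared = before.keys() & after.keys()
    return {call_id: after[call_id] - before[call_id] for call_id in sorted(shared)}

# Hypothetical scores from two runs of the same metrics on the same batch
before = {"call_a": 85, "call_b": 88, "call_c": 95}
after = {"call_a": 92, "call_b": 88, "call_c": 91}

deltas = compare_runs(before, after)
regressions = [call_id for call_id, delta in deltas.items() if delta < 0]

print(deltas)       # {'call_a': 7, 'call_b': 0, 'call_c': -4}
print(regressions)  # ['call_c']
```

Calls that are still running in one of the runs are simply left out of the comparison until both scores exist.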
Tagged calls: 17 of 240 · 7.1%

Call ID   Phone               Agent          Tags
call_e07  +1 (415) 555-0182   Billing Agent  hallucination
call_e19  +1 (628) 555-0145   Retention Bot  regression, unresolved
call_e12  +1 (212) 555-0177   Support Agent  latency 3.4s
call_e23  +1 (305) 555-0193   Billing Agent  pii leak
Tags

Auto-tag every call

Each call is automatically tagged by the issues it hit — hallucination, latency, unresolved, and more — so you can filter, group, and triage the calls that need you first.
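One way to picture auto-tagging is as a mapping from metric results to tag strings. The following is a rough sketch under assumptions: the result keys, the "pass"/"fail" values, and the 3.0 s latency threshold are all hypothetical and not the product's actual rules.

```python
def auto_tag(results, latency_threshold_s=3.0):
    """Derive tags from one call's metric results (hypothetical result keys)."""
    tags = []
    if results.get("hallucination_check") == "fail":
        tags.append("hallucination")
    if results.get("resolution_status") == "fail":
        tags.append("unresolved")
    latency = results.get("response_latency_s")
    # Only tag latency when it was measured and exceeds the assumed threshold.
    if latency is not None and latency > latency_threshold_s:
        tags.append(f"latency {latency:.1f}s")
    return tags

# Hypothetical metric results for a single call
results = {
    "hallucination_check": "pass",
    "resolution_status": "fail",
    "response_latency_s": 3.4,
}
print(auto_tag(results))  # ['unresolved', 'latency 3.4s']
```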

How it works

From a finished call to scored metrics

Every completed simulation and production call flows through the pipeline automatically — transcribed, evaluated, and scored across every judge.

Inputs
Call completed: live simulation run
Production calls added: via Observability / API

Evaluation pipeline (running)
Audio processing: transcription…
Diarization & speaker alignment…
MetricsEvaluationJob triggered…
Scoring 18 metrics · 3 judges…
Aggregating results & flags…

Evaluating metrics…
Latency (Audio metric): 0.48s
read_compliance_message (LLM-as-judge): Yes
resolution_status (LLM-as-judge): Pass
disclosure_phrase_present (Code-as-judge): true
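The final scoring and aggregation steps can be thought of as running each judge over a finished call and collecting the failures as flags. The sketch below is a simplified assumption, not the actual pipeline: the call record shape, the two judge functions, and the 1.0 s latency limit are hypothetical, and transcription and diarization are assumed to have already happened.

```python
from typing import Any, Callable, Dict

Judge = Callable[[dict], Any]

def evaluate_call(call: dict, judges: Dict[str, Judge]) -> dict:
    """Run every judge on one call and flag the metrics that returned False."""
    results = {name: judge(call) for name, judge in judges.items()}
    flags = [name for name, value in results.items() if value is False]
    return {"results": results, "flags": flags}

# Hypothetical judges; a real setup would mix code, LLM, and audio checks.
judges: Dict[str, Judge] = {
    "disclosure_phrase_present": lambda call: "this call may be recorded"
    in call["transcript"].lower(),
    "latency_under_1s": lambda call: call["avg_latency_s"] < 1.0,
}

# Hypothetical call record produced by the earlier pipeline stages
call = {"transcript": "Hi! This call may be recorded.", "avg_latency_s": 0.48}
report = evaluate_call(call, judges)
print(report["flags"])  # []
```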

Why it matters

What happens without evaluation metrics?

Hallucinations go undetected

Your agent confidently quotes policies that don't exist, promises discounts you don't offer, and gives medical advice it shouldn't. Without LLM evaluation, these slip through every manual review.

23% of calls contain at least one hallucination

Latency creeps up silently

After a prompt change, response time goes from 1.2s to 2.8s. Completion rates drop 34%. You don't notice until the CSAT scores come in next month.

Every 500ms of latency = 7% drop in completion

Compliance violations accumulate

Your agent collects SSNs without reading disclosures, skips required disclaimers, and stores data it shouldn't. Each violation is a potential fine.

Single HIPAA violation: up to $50K

Score your next batch in minutes.

Connect your agent and get 100s of metrics on every call — automatically.
