Evaluate

Score every call on 100s of metrics.

Code, LLM-as-judge, and audio checks run on every conversation — so you know exactly where your agent is failing, and why.

Start testing free · Book a demo


Batch Runs · Run #157

Verify bot authentication

Nightly Regression · Jun 19, 2026 · 11:42 PM

02:45 · Evaluating · Rerun Metrics

Audio analysis in progress — scoring metrics across the audio and transcript...

Conversation Metrics

Latency · Computing

User interrupting AI · Computing

Custom LLM Metrics · LLM-as-judge

Authentication Verified · Critical · Computing

Customer Satisfaction · Computing

Script Adherence · Computing

Tone & Empathy · Computing

Hallucination Check · Critical · Computing


Transcript · 7 turns

Agent · 00:02

Thanks for calling Acme Support, this is Ava — who am I speaking with?

Caller · 00:06

Hi, it's Sarah Miller. I'd like to check my balance.

Agent · 00:10

Happy to help. First I need to verify you — date of birth, last four of your SSN, and ZIP code?

Caller · 00:17

March 14th 1990, 4821, and 94107.

Agent · 00:23

Perfect, you're verified.

Agent · 00:27

Your current balance is $1,240.50.

Caller · 00:31

Hmm, that seems higher than I expected.

100s of metrics

System-defined, audio, LLM-as-judge, and code-as-judge — all scored automatically.

Transcript search

Search across thousands of conversations to find patterns and outliers.

Automatic flagging

Issues highlighted inline with severity and the exact reason they failed.

Metrics

Four ways to score every call

Mix and match hundreds of checks across four judge types — from deterministic rules to LLM reasoning and raw audio analysis — all running on every conversation.

LLM-as-judge — hallucination, compliance, resolution, custom rubrics

Audio — voice clarity, speech rate, tone and sentiment

System-defined — latency, silence gaps, talk-over, interruptions

Code-as-judge — deterministic rules and programmatic checks

LLM-as-judge · Custom · Reasons over the full transcript

Compliance Check · Resolution Status · Agent Empathy

Audio · 12+ metrics pre-defined · Signal analysis on the recording

Voice Tone & Clarity · Voice Change Detection · Words / Minute

System-defined · 11+ metrics pre-defined · Measured from call timing

Response Latency · User Interruptions · Silence Detection

Code-as-judge · Custom · Deterministic rules you define

Disclosure phrase check · Permitted options check · Custom rule
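
To make the code-as-judge idea concrete, here is a minimal sketch of a disclosure phrase check. It is not the product's implementation: the function names, the list-of-strings input, and the sample phrase are all assumptions made for illustration. The check ignores case and punctuation, which is one reasonable reading of "exact phrase" for spoken text.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def disclosure_phrase_present(agent_turns: list[str], required_phrase: str) -> bool:
    """Return True if any single agent turn contains the required phrase.

    Padding both sides with spaces keeps the match on whole words only.
    """
    needle = f" {normalize(required_phrase)} "
    return any(needle in f" {normalize(turn)} " for turn in agent_turns)


# Hypothetical agent turns and phrase, for illustration only.
turns = [
    "Thanks for calling Acme Support, this is Ava.",
    "This call may be recorded for quality purposes.",
]
print(disclosure_phrase_present(turns, "this call may be recorded"))
print(disclosure_phrase_present(turns, "rates are subject to change"))
```

Because the rule is deterministic, the same transcript always gets the same result, which is the main reason to prefer code over an LLM judge for checks like this one.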

The metric library

Hundreds of metrics, ready to run

Start from a curated library and add your own. Here's a sample of what ships out of the box.

Metric | Judge | Output | What it checks
Compliance Check | LLM | Pass / Fail | Required disclosures & procedures followed
Resolution Status | LLM | Pass / Fail | Customer's primary issue resolved
PII Handling | LLM | Pass / Fail | Sensitive data handled safely
Agent Empathy | LLM | Rating 1–5 | Empathy toward the customer
Script Adherence | LLM | Rating 1–5 | Followed the expected call flow
Customer Sentiment | LLM | Rating 1–5 | Overall sentiment across the call
Objection Handling | LLM | Pass / Fail | Acknowledged & addressed objections
Numeric Verbalization | LLM | Pass / Fail | Numbers read back accurately
Response Latency | System | 1.4s avg | Time to first response per turn
User Interruptions | System | Count | Times the caller cut in
Silence Detection | System | % of call | Dead air across the conversation
AI Reaction Time | System | ms | Stop time after an interruption
Voice Tone & Clarity | Audio | Rating 1–5 | Clarity and tone of the voice
Voice Change Detection | Audio | Pass / Fail | Same speaker throughout the call
Words / Minute | Audio | WPM | Speech rate of the agent
Disclosure Phrase | Code | Pass / Fail | Exact required phrase was spoken

+ 150 more in the gallery, plus unlimited custom metrics
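
As a rough illustration of how timing-based metrics such as Response Latency and Words / Minute could be derived, the sketch below works from turns that carry start and end timestamps. The `Turn` shape, the function names, and the sample timings are assumptions, not the platform's data model.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "agent" or "caller"
    start: float  # seconds from the start of the call
    end: float
    text: str


def response_latencies(turns: list[Turn]) -> list[float]:
    """Gap between the end of each caller turn and the agent turn that follows it."""
    gaps = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "caller" and cur.speaker == "agent":
            gaps.append(max(0.0, cur.start - prev.end))
    return gaps


def words_per_minute(turns: list[Turn], speaker: str = "agent") -> float:
    """Words spoken by one speaker divided by that speaker's talk time in minutes."""
    spoken = [t for t in turns if t.speaker == speaker]
    seconds = sum(t.end - t.start for t in spoken)
    words = sum(len(t.text.split()) for t in spoken)
    return 60.0 * words / seconds if seconds > 0 else 0.0


# Made-up timings for illustration.
turns = [
    Turn("agent", 0.0, 4.0, "Thanks for calling, who am I speaking with?"),
    Turn("caller", 4.5, 8.0, "Hi, it's Sarah Miller."),
    Turn("agent", 9.0, 11.0, "Happy to help."),
]
print(response_latencies(turns))  # [1.0]
print(words_per_minute(turns))    # 110.0
```

Real recordings would need diarized timestamps before either function is meaningful; the sketch assumes that step has already happened.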

More in Evaluate

Everything you need to understand call quality.

Transcript · call #47-038 · Hallucination

Can I get a refund if I cancel today?

Absolutely! We offer a 30-day money-back guarantee on all plans.

Policy removed 6 months ago — current window is 7 days

Transcripts

Read every call, catch every issue

Full transcripts with inline issue highlighting — hallucinations, policy violations, and unauthorized actions flagged right where they happen.

Re-runs

Re-run metrics after every fix

Changed a prompt? Re-run the same metrics on the same batch and watch the scores move. Kick off multiple runs in parallel and compare them as they finish.

Batch #47 · 240 calls · Re-run all

Call | Caller · topic | Score | Action
call_e07 | Maria C. · billing dispute | 92 | Re-run
call_e19 | James K. · cancel flow | - | Running
call_e12 | Robert T. · tech help | 88 | Re-run
call_e23 | Sophia L. · refund request | 95 | Re-run
call_e31 | David M. · plan upgrade | 90 | Re-run

Re-run any call or the full batch · same 18 metrics
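
One way to "watch the scores move" between two runs of the same batch is to diff the per-call scores. The sketch below assumes each run is available as a plain mapping from call ID to score; the function names and the "before" numbers are invented for illustration.

```python
def score_deltas(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-call score change, limited to calls that were scored in both runs."""
    return {call: after[call] - before[call] for call in before if call in after}


def summarize(deltas: dict[str, float]) -> dict:
    """Split calls into improved and regressed, and report the mean change."""
    improved = sorted(c for c, d in deltas.items() if d > 0)
    regressed = sorted(c for c, d in deltas.items() if d < 0)
    mean = sum(deltas.values()) / len(deltas) if deltas else 0.0
    return {"improved": improved, "regressed": regressed, "mean_delta": mean}


# Hypothetical scores: "before" is made up; a call still running is simply absent.
before = {"call_e07": 85, "call_e12": 90, "call_e23": 95}
after = {"call_e07": 92, "call_e12": 88, "call_e23": 95, "call_e31": 90}

deltas = score_deltas(before, after)
print(deltas)
print(summarize(deltas))
```

Restricting the comparison to calls present in both runs keeps a half-finished re-run from skewing the summary.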

Tagged calls · 17 of 240 · 7.1%

Call ID | Phone | Agent | Tags
call_e07 | +1 (415) 555-0182 | Billing Agent | hallucination
call_e19 | +1 (628) 555-0145 | Retention Bot | regression, unresolved
call_e12 | +1 (212) 555-0177 | Support Agent | latency 3.4s
call_e23 | +1 (305) 555-0193 | Billing Agent | pii leak

Auto-tag every call

Each call is automatically tagged by the issues it hit — hallucination, latency, unresolved, and more — so you can filter, group, and triage the calls that need you first.
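
Auto-tagging can be pictured as a small set of rules over a call's metric results. The sketch below is an assumption about how such rules might look: the metric keys, the "fail" values, and the 2.0 s latency limit are hypothetical, chosen only to reproduce the kinds of tags shown above.

```python
def auto_tag(metrics: dict, latency_limit_s: float = 2.0) -> list[str]:
    """Derive issue tags from one call's metric results."""
    tags = []
    if metrics.get("hallucination_check") == "fail":
        tags.append("hallucination")
    if metrics.get("resolution_status") == "fail":
        tags.append("unresolved")
    if metrics.get("pii_handling") == "fail":
        tags.append("pii leak")
    latency = metrics.get("response_latency_s")
    if latency is not None and latency > latency_limit_s:
        tags.append(f"latency {latency:.1f}s")
    return tags


# Hypothetical metric results keyed by call ID.
calls = {
    "call_e07": {"hallucination_check": "fail", "response_latency_s": 1.1},
    "call_e12": {"hallucination_check": "pass", "response_latency_s": 3.4},
    "call_e31": {"hallucination_check": "pass", "response_latency_s": 0.9},
}

tagged = {cid: auto_tag(m) for cid, m in calls.items() if auto_tag(m)}
print(tagged)  # {'call_e07': ['hallucination'], 'call_e12': ['latency 3.4s']}
```

Calls with no tags drop out of the result, which mirrors filtering the batch down to the calls that need attention first.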

How it works

From a finished call to scored metrics

Every completed simulation and production call flows through the pipeline automatically — transcribed, evaluated, and scored across every judge.

Call completed

Live simulation run

Production calls added via Observability / API

evaluation pipeline · running

Audio processing — transcription…
Diarization & speaker alignment…
MetricsEvaluationJob triggered…
Scoring 18 metrics · 3 judges…
Aggregating results & flags…

Evaluating metrics…

Metric | Judge | Result
Latency | Audio metric | 0.48s
read_compliance_message | LLM-as-judge | Yes
resolution_status | LLM-as-judge | Pass
disclosure_phrase_present | Code-as-judge | true
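
The scoring and aggregation steps can be sketched as a loop over metrics that collects results and raises flags for critical failures. Everything here is an assumption for illustration: the `Metric` shape, the call record fields, the 1.5 s latency threshold, and the choice of which metric is critical. The LLM judge is a stub that reads a precomputed verdict rather than calling a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Metric:
    name: str
    judge: str  # "system", "audio", "llm", or "code"
    run: Callable[[dict], tuple[object, bool]]  # returns (value, passed)
    critical: bool = False


def latency_metric(call: dict) -> tuple[object, bool]:
    value = call["latency_s"]
    return value, value <= 1.5  # hypothetical threshold


def resolution_metric(call: dict) -> tuple[object, bool]:
    # Stub: a real LLM judge would reason over the transcript here.
    verdict = call["llm_verdicts"]["resolution_status"]
    return verdict, verdict == "Pass"


def disclosure_metric(call: dict) -> tuple[object, bool]:
    present = "this call may be recorded" in call["agent_text"].lower()
    return present, present


METRICS = [
    Metric("latency", "system", latency_metric),
    Metric("resolution_status", "llm", resolution_metric, critical=True),
    Metric("disclosure_phrase_present", "code", disclosure_metric),
]


def evaluate(call: dict, metrics: list[Metric]) -> dict:
    """Score every metric, then flag critical metrics that did not pass."""
    results, flags = [], []
    for m in metrics:
        value, ok = m.run(call)
        results.append({"metric": m.name, "judge": m.judge, "value": value, "passed": ok})
        if m.critical and not ok:
            flags.append(m.name)
    return {"results": results, "flags": flags}


call = {
    "latency_s": 0.48,
    "agent_text": "Hi, this call may be recorded.",
    "llm_verdicts": {"resolution_status": "Pass"},
}
report = evaluate(call, METRICS)
for row in report["results"]:
    print(row)
print(report["flags"])  # []
```

Keeping every judge behind the same `(value, passed)` contract means new metric types can be added without touching the aggregation loop.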

Why it matters

What happens without evaluation metrics?

Hallucinations go undetected

Your agent confidently quotes policies that don't exist, promises discounts you don't offer, and gives medical advice it shouldn't. Without LLM evaluation, these slip through every manual review.

23% of calls contain at least one hallucination

Latency creeps up silently

After a prompt change, response time goes from 1.2s to 2.8s. Completion rates drop 34%. You don't notice until the CSAT scores come in next month.

Every 500ms of latency = 7% drop in completion

Compliance violations accumulate

Your agent collects SSNs without reading disclosures, skips required disclaimers, and stores data it shouldn't. Each violation is a potential fine.

Single HIPAA violation: up to $50K

Score your next batch in minutes.

Connect your agent and get 100s of metrics on every call — automatically.
