Evaluation Reliability Layer

See what your evaluation set is hiding.

Thousands of model responses and human scores form one living map. AnnotRift finds the contested ratings, ambiguous rules, and unreliable data inside it — and helps your team resolve them.

Audit a dataset · View sample report

Live scan · summarization batch · scanning

Scoring conflict · Disagreement pocket (38) · Gold-verified

48,210 responses

7 models

19 reviewers

1,364 flagged

912 gold-verified

The microscope

Walk one contested rating from the full map to a resolved decision.

Full evaluation set

Anomaly region

Two scoring standards collide on the same kind of answer. Red marks where reviewers and the rubric disagree; lime marks where the model output itself looks anomalous.

Scoring conflict · Anomalous cluster · Task: summarization

Inside a Disagreement Pocket of 38 records, three reviewers gave the same model response three different scores.

R1 · Accept
R2 · Reject
R3 · Borderline

Prompt

"Summarize the refund policy for orders over $500."

Model response

"Orders above $500 qualify for expedited refunds within 5 business days, minus a restocking fee."

Evidence: policy §4.2 · Similar case INC-1180

Rubric overlay — where the disagreement actually lives.

Accuracy: Restocking fee not in source

Completeness: Missing "no fee under warranty"

Relevance: On-topic

Safety: Overstates guarantee

AI review suggestion · awaiting confirmation

Recommend Borderline → Reject: response asserts a restocking fee not supported by the policy source.

For: §4.2 lists no restocking fee

Against: fee applies in a separate SKU policy

The map recomputes. This one judgment reclassifies the pocket and updates the Gold Dataset and report.

Pocket resolved: 38 → 0

Rubric fix queued: 1

Gold-verified: +37

Built for the teams grading model output

frontier labs · eval vendors · RAG teams · data ops · safety teams

Core capability

Import your evaluation set. Get back what's wrong with it.

AnnotRift ingests model responses, human ratings, labels, the rubric, and review notes — then works the data through a fixed reliability pass. Every AI suggestion shows its rules, historical cases, and evidence; a human confirms the final call.

Check rules

Ratings and labels tested against the rubric.

Surface rifts

Find reviewer-to-reviewer disagreement.

Spot confusion

Isolate easily-confused labels and criteria.

Flag bad data

Duplicate, contaminated, thin samples.

Queue experts

Route hard cases for human review.

Ship gold

Human-verified Gold Dataset + report.
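
To make the "Flag bad data" step concrete, here is a minimal Python sketch of two of the checks named above: exact duplicates and thin samples. It is an illustration under stated assumptions, not AnnotRift's implementation. The record layout (`id`, `response`, `ratings`), the function names, and the `min_ratings` threshold are all hypothetical, and real duplicate detection would likely need fuzzy matching as well.

```python
import hashlib
from collections import defaultdict


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variations hash identically.
    return " ".join(text.lower().split())


def find_duplicates(records):
    """Group record ids whose normalized response text is identical."""
    groups = defaultdict(list)
    for rec in records:
        digest = hashlib.sha256(normalize(rec["response"]).encode("utf-8")).hexdigest()
        groups[digest].append(rec["id"])
    # Only groups with more than one member are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


def find_thin(records, min_ratings=2):
    """Return ids of records with fewer ratings than min_ratings (assumed threshold)."""
    return [rec["id"] for rec in records if len(rec["ratings"]) < min_ratings]


# Hypothetical sample records, for illustration only.
records = [
    {"id": "a", "response": "Hello  World", "ratings": ["accept", "reject"]},
    {"id": "b", "response": "hello world", "ratings": ["accept"]},
    {"id": "c", "response": "Something else", "ratings": ["accept", "accept"]},
]
duplicate_groups = find_duplicates(records)
thin_ids = find_thin(records)
```

With the sample data, records `a` and `b` land in one duplicate group and `b` is also flagged as thin, since it has a single rating.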

Make disagreement useful

Where reviewers apply the rubric differently.

Every reviewer is a track. When their scores for the same response cross, that's not noise — it's a rift worth resolving. AnnotRift tells you whether it's an error, an ambiguous rule, missing context, or a fair judgment call.

Crossing lines mark contested ratings
Classified: error · ambiguity · missing context · subjective
Send recurring rifts straight to Rubric Studio
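
As a rough illustration of what a "crossing" means in data terms, the Python sketch below finds responses where reviewers gave different labels and scores how split they are. It assumes ratings arrive as `(response_id, reviewer, label)` tuples, and the disagreement score (one minus the majority share) is a simple stand-in chosen for this example. It only detects rifts. Classifying them as error, ambiguity, missing context, or subjective is not attempted here.

```python
from collections import Counter, defaultdict


def contested_responses(ratings):
    """Map each contested response id to a disagreement score in (0, 1).

    ratings: iterable of (response_id, reviewer, label) tuples (assumed layout).
    The score is 1 - (share of reviewers who chose the most common label).
    """
    by_response = defaultdict(dict)
    for response_id, reviewer, label in ratings:
        # Keep one label per reviewer per response; a later entry overwrites an earlier one.
        by_response[response_id][reviewer] = label

    contested = {}
    for response_id, labels in by_response.items():
        counts = Counter(labels.values())
        if len(counts) > 1:  # at least two distinct labels: the tracks cross
            majority = counts.most_common(1)[0][1]
            contested[response_id] = 1 - majority / len(labels)
    return contested


# Hypothetical ratings: three reviewers split on one response and agree on another.
ratings = [
    ("resp-1", "R1", "accept"),
    ("resp-1", "R2", "reject"),
    ("resp-1", "R3", "borderline"),
    ("resp-2", "R1", "accept"),
    ("resp-2", "R2", "accept"),
    ("resp-2", "R3", "accept"),
]
rifts = contested_responses(ratings)
```

Here only `resp-1` is reported, because its three reviewers chose three different labels, while `resp-2` is unanimous and drops out.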

Explore reviewer differences

Reviewer tracks · summarization batch

From guidelines to tests

Turn the rubric everyone argues about into rules you can test.

AnnotRift breaks your scoring guide into Definition, Positive Evidence, Negative Evidence, Exceptions, Priority, and Examples. Edit a rule and the historical disagreement recomputes instantly — so you fix the rubric, not just the symptom.

Ambiguity Lens highlights vague, overlapping rules
Test a rule change against past disputes
Every version records owner and affected range
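
One way to picture a testable rule is as a structured record plus an executable check that can be replayed over past disputes. The Python sketch below assumes exactly that. The `RubricRule` class, its `decide` callable, the `replay` helper, and the dispute fields (`unsupported_claims`, `resolved_as`) are hypothetical names invented for this example, not Rubric Studio's data model. Only the six field names come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RubricRule:
    name: str
    definition: str
    # Hypothetical executable form of the rule: maps a record to a verdict string.
    decide: Callable[[dict], str]
    positive_evidence: list[str] = field(default_factory=list)
    negative_evidence: list[str] = field(default_factory=list)
    exceptions: list[str] = field(default_factory=list)
    priority: int = 0
    examples: list[str] = field(default_factory=list)


def replay(rule, disputes):
    """Return ids of past disputes whose recorded outcome differs from the rule's verdict."""
    return [d["id"] for d in disputes if rule.decide(d) != d["resolved_as"]]


# A strict reading of the accuracy rule: any unsupported claim means reject.
accuracy = RubricRule(
    name="Accuracy",
    definition="Every claim is supported by the provided source.",
    decide=lambda rec: "reject" if rec["unsupported_claims"] > 0 else "accept",
    negative_evidence=["Any unsupported policy or fee"],
)

# Hypothetical past disputes, for illustration only.
disputes = [
    {"id": "d1", "unsupported_claims": 1, "resolved_as": "borderline"},
    {"id": "d2", "unsupported_claims": 0, "resolved_as": "accept"},
]
changed = replay(accuracy, disputes)
```

In this example `changed` contains only `d1`: the strict rule would reject it, but it was previously resolved as borderline. Replaying a draft rule this way shows which historical decisions it would overturn before the change is adopted.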

Open Rubric Studio

Definition

"Accurate" = every claim is supported by the provided source.

Exception · flagged ambiguous

"Minor omissions are acceptable" — no threshold for "minor".

Negative evidence

Any unsupported policy or fee → Reject.

Find what makes your AI evaluations unreliable.

Point AnnotRift at your evaluation set and get a traceable quality report in minutes — every finding backed by data and rules, every decision confirmed by a human.

Audit a dataset · Book a demo