A practical maturity model for taking benchmarks from proof-of-concept to versioned, continuously evolving evaluation that keeps up with models, prompts, and agent workflows.
What happens when you let AI judge AI? A pioneering benchmark for quality estimation in machine translation.
GPT-5 is out now -- but how good is it, really? In this post, we'll show you how we created our own custom benchmark to evaluate GPT-5.
Designing effective LLM benchmarks means going beyond static tests. This guide walks through scoring methods, strategy evolution, and how to evaluate models as they scale.
Custom AI benchmarks play a crucial role in the success and scalability of AI systems by providing a standardized approach to running AI evaluations.
AI benchmarks are breaking under pressure. This blog explores four ways to rebuild trust: governance, transparency, better metrics, and centralized oversight.