May 22, 2026 · Guide

The RLHF Data Companies Landscape

RLHF data is the older, more mature sibling of RL environments — and most of the same companies sell both. Here's the landscape and who's doing what.

Before "RL environments" was a category, "RLHF data" already was. Frontier labs have been buying human preference judgments at industrial scale since the original ChatGPT moment, and the companies that built the supply networks to deliver that data are some of the most consequential — and most secretive — vendors in AI. Most of them have now extended into environments, which is exactly why the RL environment market took shape so quickly: the supply chain was already there.

What RLHF data actually is

At the simplest level: a human looks at two model outputs and picks the better one, or rates a single output against a written guideline. Scale that across thousands of human raters and millions of comparisons, and you have the training data for a reward model — the thing that tells the LLM "this kind of answer is good, this kind isn't" during reinforcement learning. The quality of that human signal is the difference between a model that feels helpful and one that doesn't, which is why labs pay so much for it.
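To make the "pick the better one" step concrete, here is a minimal sketch of how pairwise judgments could become a reward model. It assumes a Bradley-Terry style loss, a common choice for preference data, and a toy linear model over two made-up features per output. The feature names, the sample pairs, and the hyperparameters are all illustrative assumptions, not any vendor's or lab's actual pipeline.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def reward(weights, features):
    """Toy linear reward model: one scalar score per model output."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    """Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    # Stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def train(pairs, dim, lr=0.1, epochs=200):
    """Plain SGD over (chosen, rejected) feature pairs."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(margin) = -sigmoid(-margin)
            grad_scale = -sigmoid(-margin)
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per output: [helpfulness, verbosity].
# Each tuple is (output the rater chose, output the rater rejected).
pairs = [
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.7, 0.5], [0.3, 0.5]),
    ([0.8, 0.9], [0.2, 0.1]),
    ([0.6, 0.1], [0.4, 0.9]),
]

weights = train(pairs, dim=2)
mean_loss = sum(pairwise_loss(weights, c, r) for c, r in pairs) / len(pairs)
print("learned weights:", weights)
print("mean pairwise loss:", mean_loss)
```

In a real pipeline the linear scorer would be a neural network and the features would be the model outputs themselves, but the shape of the signal is the same: each human comparison nudges the score of the chosen output above the rejected one.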

The incumbents

Surge AI is the quiet giant: bootstrapped, reportedly over a billion dollars in revenue, and working with essentially every frontier lab. It recently spun up a dedicated internal organization to build RL environments as demand shifted from static preference data to interactive simulation.

Scale AI is the data-labeling incumbent of the chatbot era, now retooling for agents through its Forge offering and a dedicated agents / RL-environments product line. It runs the biggest delivery operation in the category but is less open than the specialists.

Mercor takes a different approach: an expert marketplace that pairs labs with credentialed humans (developers, doctors, lawyers) for both human data and the domain-specific environments those experts help construct. It operates at a reported ten-billion-dollar valuation.

Why the lines have blurred

A reward model trained on RLHF data is one piece of the RL training pipeline. An environment is another. The human experts who write good preference judgments are also exactly the people you want building, grading, and verifying environments. So the same companies that won the RLHF data market are uniquely positioned to win adjacent environment work — and they have.

The pure-play environment specialists (Mechanize, Halluminate) compete on depth and product polish. The data incumbents compete on scale, expert networks, and the convenience of one vendor across the whole post-training stack.

Browse the rest

See every company in the category — RLHF, environments, and evals — scored on the same public rubric, on the RL environment companies list.