May 22, 2026 · Guide

How AI Labs Train Agents with RL Environments

A walk through the actual training loop frontier labs use to teach agents with RL environments — rollouts, rewards, verifiers, and the bottlenecks that shape the market.

The high-level pitch is simple: drop a model into an environment, let it try the task, score the result, and update the model. The actual training loop is messier than that, and understanding the mess is the difference between picking a vendor that solves your bottleneck and one that doesn't.

The loop, step by step

A frontier-lab RL training run against an environment looks something like this. The lab takes a current model checkpoint and uses it as the policy. For each task in a batch, the policy runs against the environment (that's a rollout), taking a sequence of actions (edits, tool calls, clicks) until it succeeds, fails, or hits a step limit. The environment hands back a trajectory (every state, action, and observation) and a reward (the grader's score). A learning algorithm, typically a variant of PPO, GRPO, or a more recent off-policy method, uses those trajectories to nudge the policy toward behavior that earns higher rewards. Repeat for millions of rollouts.
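The loop above can be sketched in a few lines of Python. This is a toy illustration, not any lab's actual code: ToyEnv, rollout, and group_relative_advantages are hypothetical names, the environment is a trivial counter, and the scoring step only shows the group-relative reward normalization used in GRPO-style methods. The policy-gradient update itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)  # (observation, action) pairs
    reward: float = 0.0


class ToyEnv:
    """Hypothetical environment: succeed by moving a counter to `target`."""

    def __init__(self, target=3):
        self.target = target
        self.count = 0

    def reset(self):
        self.count = 0
        return self.count

    def step(self, action):
        # Apply the action and report whether the task is now solved.
        self.count += action
        return self.count, self.count == self.target

    def grade(self):
        # The "grader": 1.0 on success, 0.0 otherwise.
        return 1.0 if self.count == self.target else 0.0


def rollout(policy, env, max_steps=10):
    """Run one attempt until success or the step limit, then grade it."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        traj.steps.append((obs, action))
        obs, done = env.step(action)
        if done:
            break
    traj.reward = env.grade()
    return traj


def group_relative_advantages(rewards):
    """Normalize rewards within a group of attempts at the same task."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0:
        return [0.0] * len(rewards)  # identical rewards carry no contrast
    return [(r - mean) / std for r in rewards]


good = rollout(lambda obs: 1, ToyEnv())               # always steps toward target
bad = rollout(lambda obs: -1, ToyEnv(), max_steps=5)  # always steps away
advantages = group_relative_advantages([good.reward, bad.reward])
```

In a real run the policy is the model checkpoint and the advantages weight a gradient update. Here they only show which attempt would be reinforced and which would be discouraged.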

Where the bottlenecks actually are

Three bottlenecks dominate, and they explain almost every vendor pitch in the market.

Rollout throughput. RL training is rollout-hungry: a single training run can need millions of agent attempts. If your environment runs at one rollout per minute on a single machine, the run is infeasible before it starts. Vendors compete heavily on how parallel and how fast their environments are.
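To put rough numbers on that, here is a back-of-the-envelope calculation. It assumes a run of one million rollouts and perfectly linear scaling across workers. Both are simplifications, and wall_clock_hours is a hypothetical helper.

```python
def wall_clock_hours(total_rollouts, rollouts_per_minute_per_worker, workers):
    """Hours to finish, assuming perfectly linear scaling across workers."""
    minutes = total_rollouts / (rollouts_per_minute_per_worker * workers)
    return minutes / 60


single = wall_clock_hours(1_000_000, 1, 1)
parallel = wall_clock_hours(1_000_000, 1, 1_000)
print(f"1 worker: {single / 24:.0f} days")
print(f"1,000 workers: {parallel:.1f} hours")
```

Under these assumptions a single machine needs about 694 days, while a thousand parallel workers finish in under 17 hours. Real systems scale less cleanly, but the gap explains why parallelism is a headline feature.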

Reward signal quality. A noisy or sparse reward signal makes training brittle. A grader that's wrong even five percent of the time can teach the model the wrong lesson at scale. This is why "verifier" infrastructure (a separate model or rule-set that double-checks the grader's verdict) has become a category of its own.
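One simple way to picture a verifier is as a second, independent check whose disagreement with the grader flags a rollout for review instead of letting it into training. The sketch below is only an illustration: the grader, the rule-based verifier, and the sample fields are all hypothetical, and production verifiers are considerably more involved.

```python
def split_by_agreement(samples, grader, verifier):
    """Keep samples where grader and verifier agree; flag the rest."""
    trusted, flagged = [], []
    for sample in samples:
        if grader(sample) == verifier(sample):
            trusted.append(sample)
        else:
            flagged.append(sample)
    return trusted, flagged


def grader(sample):
    # Hypothetical grader: trusts the agent's own final status message.
    return sample["final_message"] == "done"


def verifier(sample):
    # Hypothetical rule-based verifier: checks the test results directly.
    return sample["tests_passed"] == sample["tests_total"]


samples = [
    {"id": "a", "final_message": "done", "tests_passed": 5, "tests_total": 5},
    {"id": "b", "final_message": "done", "tests_passed": 3, "tests_total": 5},
    {"id": "c", "final_message": "gave up", "tests_passed": 0, "tests_total": 5},
]
trusted, flagged = split_by_agreement(samples, grader, verifier)
```

Here rollout "b" claims success while failing two tests, so it is flagged and would not be rewarded as a pass.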

Task generation. Once a model has saturated a fixed task set, it stops learning from it. Labs need a pipeline producing fresh, calibrated task variants continuously — not a static catalog. Vendors that own a task-generation pipeline (often a mix of human expert authoring and LLM-assisted synthesis) command a premium.
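A minimal sketch of how saturation might be detected in practice: track each task's recent pass rate and keep only tasks the policy sometimes solves and sometimes fails. The thresholds, the task names, and the select_trainable_tasks helper are illustrative assumptions, not a standard.

```python
def select_trainable_tasks(pass_rates, low=0.1, high=0.9):
    """Return tasks whose pass rate is neither saturated nor hopeless."""
    return [task for task, rate in pass_rates.items() if low <= rate <= high]


# Hypothetical pass rates measured over recent rollouts.
pass_rates = {
    "fix-flaky-test": 1.0,   # saturated: always solved
    "migrate-schema": 0.45,  # informative: mixed outcomes
    "port-to-rust": 0.0,     # too hard for now: never solved
}
trainable = select_trainable_tasks(pass_rates)
```

Only "migrate-schema" survives the filter. A task-generation pipeline would then be asked to replace the saturated task with harder variants.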

Where the model improves and where it doesn't

RL on environments is extremely good at building capability (can the agent finish the multi-step task?) and largely orthogonal to judgment, which is what RLHF data shapes. Labs that try to teach taste through environment training alone usually end up with agents that are mechanically competent and conversationally awful. The two pieces are complementary, which is why frontier programs almost always run both in parallel.

Where to look

See the companies building each piece of the stack — environments, verifiers, RLHF data, and evals — scored on a public rubric, on the RL environment companies list. The companion guide RL Environments vs. RLHF Data vs. Evals breaks down which is which.