Benjamin Anderson

Don't Build an RL Environment Startup

Don't sell blood to vampires.

Posted Sep 7, 2025 by Benjamin Anderson

The first person who sold an RL environment to a frontier AI lab must have felt like they discovered an infinite money glitch. It's no longer a secret that frontier AI labs regularly pay hundreds of thousands, and sometimes millions, for clones of Linear and Salesforce. If you're reading this, you've probably thought about quitting your day job and starting a company that builds these unusually lucrative Next.js apps. In this post, I'll argue that you should hesitate before hopping on the bandwagon.

Background: What's an RL environment?

For those unfamiliar, an RL (reinforcement learning) environment is like a sandbox for AI models like Claude and GPT to learn from. It keeps track of an internal state, prompts the AI to take actions to complete a task, and assigns a score based on the outcome. The most obvious kind is a clone of a popular website or enterprise software tool like Doordash, Linear, or Amazon, which teaches the AI to click around and order pizza. It can also be text-only, like the TextArena project, which teaches AIs to play games like Set and Blackjack.
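The loop described above—keep internal state, prompt for an action, score the outcome—can be sketched as a toy Gym-style environment. Everything here, from the class name to the guessing game itself, is illustrative; it's not any lab's actual interface:

```python
import random

class GuessNumberEnv:
    """Toy RL environment: tracks hidden state, accepts actions,
    and assigns a score (reward) based on the outcome."""

    def __init__(self, low=1, high=10, max_turns=5, seed=0):
        self.low, self.high, self.max_turns = low, high, max_turns
        self.rng = random.Random(seed)

    def reset(self):
        # Internal state the agent never sees directly.
        self.target = self.rng.randint(self.low, self.high)
        self.turns = 0
        return f"Guess a number between {self.low} and {self.high}."

    def step(self, action: int):
        """Take an action; return (observation, reward, done)."""
        self.turns += 1
        if action == self.target:
            return "Correct!", 1.0, True
        if self.turns >= self.max_turns:
            return f"Out of turns; it was {self.target}.", 0.0, True
        hint = "higher" if action < self.target else "lower"
        return f"Try {hint}.", 0.0, False
```

A real environment for computer use would wrap a browser or a cloned web app instead of a number, but the contract is the same shape: reset, step, scalar reward.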

RL environments are having a moment because of the rise of a new paradigm in LLM post-training that teaches models new skills based on verifiable rewards. OpenAI realized sometime in 2023 that if you ask a model to "think" before solving a math problem, and reinforce the "thoughts" that led to right answers, the model learns to think more useful thoughts, and gets really good at math. Since then, labs have been rushing to generalize this paradigm, teaching AIs to use the computer, conduct research online, and work on long-horizon coding tasks.
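The verifiable-rewards idea can be sketched in a few lines: grade each sampled chain of thought against a known answer, and keep only the ones that got it right to reinforce. The function names and the exact-match grading below are my own simplification, not any lab's actual recipe:

```python
def verify(completion: str, answer: str) -> float:
    """Binary verifiable reward: 1.0 if the final line matches the answer."""
    final = completion.strip().splitlines()[-1]
    return 1.0 if final == answer else 0.0

def filter_for_training(samples: list[str], answer: str) -> list[str]:
    """Keep only the chains of thought that reached the right answer —
    the 'reinforce the thoughts that led to right answers' step."""
    return [s for s in samples if verify(s, answer) == 1.0]
```

Real graders are fuzzier (normalizing math expressions, running unit tests), but the principle is the same: the reward comes from checking an outcome, not from a human rating each transcript.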

As a result, a new slew of startups has cropped up whose sole purpose is to feed the voracious labs with ever more arcane and challenging environments. The theory is that the machine learning bit is solved now; the only remaining work is to provide models with more realistic "simulations" to learn from, so that they don't exhibit weird behaviors when we drop them into the "real world" and try to make them do economically valuable work.

If you're a solo researcher or you have a job that gives you a lot of free time, there's basically no downside to trying to whip up an RL environment and sell it to OpenAI or Anthropic (or Amazon and Meta, the dogs nipping at their heels). You might just be able to make a million bucks and retire to Tulum at age 35.

On the other hand, if you're trying to build a generational business that stands the test of time, I would not touch RL environments with a 1-million token stick.

The Rise and Fall of Crowdworkers

Before RL became the dominant post-training paradigm, all we had was a thing called supervised fine-tuning. It was much simpler back then—you would teach a model trained to predict the next word how to be a chatbot by showing it examples of chat conversations. Of course, someone had to write all those chat conversations; then once it was easy enough to make models chatty, someone had to rank all the chat transcripts, so labs could use only the finest chats to make the AI have the Correct left-liberal political opinions and not give detailed instructions on how to make meth. That work was predominantly done by sub-minimum-wage crowdworkers in the developing world.

At some point in 2023, studies started to come out showing that LLMs were better than crowdworkers at various text-annotation tasks. Anthropic had already famously been using "Constitutional AI" to improve its models, relying on AI to rate the quality of AI outputs against a rubric. Mistral AI, which got started a bit later, bragged publicly that they hadn't used human data to train their initial models. (Rumors later circulated that they were using their competitors' outputs instead.) Human data was still being consumed and purchased in vast quantities, but the quality floor had begun to rise rapidly, because the easy stuff could just be done by LLMs.

These days, despite all the synthetic data, human-data providers are bigger than ever, but they're not hiring low-wage Kenyan and Filipino crowdworkers anymore—they're hiring software engineers and PhDs for $50-200 an hour. But when those PhD skills are no longer useful for improving AI models, they'll be discarded just like the crowdworkers were.

When it comes down to it, building an RL environment for an AI lab puts you in the same precarious situation as those crowdworkers of 2023: useful for now, disposable later. You are providing a rung on the ladder to Artificial General Intelligence, but once everyone has climbed past that rung, they won't need you anymore. Once AI models have a skill, the value of datasets and environments that teach that skill tends to 0, as models can now provide unlimited, cheap data as a substitute.

Free Money Spigots are Short-Lived

"But Ben," you might say, "I know that the utility of these silly apps is ephemeral—that's why I plan to keep making more of them. That way, the labs won't clamber over my corpse!" That sounds nice—if they really are selling for a few million dollars a pop, you can make 10 RL environments a year, sell them each to 5 labs, and have something like a unicorn on your hands. But is that actually realistic long-term? I'm not so sure.

First, the low hanging fruit is getting taken. There's already a Doordash clone, an Amazon clone, a Linear clone, a Bass Pro Shops clone... you have to come up with something that hasn't been done. Soon, if the RL generalizes, like, at all, "clicking on websites" won't be enough; you'll have to make a high-fidelity heart surgery simulator. You can do that, sure, but it doesn't sound like free money anymore.

Second, if the free-money glitch doesn't go away, expect the supply side to become increasingly crowded. Computer science is now one of the most-unemployed majors, and the reason is that Claude Code is about as good at programming as a CS new grad. That means you can expect 100,000 unemployed 20-somethings and 100,000 copies of Claude to be chasing the same RL environment bag. I have personally merged a couple of game environments into TextArena with Claude's help.

Third, I would be remiss if I didn't talk about all the open-source activity in this space. Prime Intellect is leading the charge here with a hub of open-source, freely-available RL environments. I admire their vision. If PI, TextArena, and other open projects succeed in building a strong community around open-source AI training environments, that means anything proprietary will have to just be that much better—otherwise labs can just use the free stuff.

Probably the only actually-sustainable business here is the ops business that connects the labs with unemployed CS grads and underemployed PhDs—that's what Mercor and Surge do. But that's not a free money glitch; it's a brutal, grinding ops business that can collapse overnight (hello, Scale). And those will die too someday, if the only person smart enough to make models better is Terence Tao—OpenAI doesn't need your marketplace, they'll just pick up the phone and call him.

Conclusion

I don't know what the next generational startup ideas will be (and if I did, I wouldn't tell you!). The point of this post is just to encourage you to think twice before becoming an RL rent boy for the frontier labs. You will not build a unicorn by making a vibe-coded Spider Solitaire. If you do have the chops to make an environment that can teach a ChatGPT to perform open heart surgery, then maybe you should be the one training the AGI, instead of selling stuff to OpenAI.

As always, if you loved this post or if you hated it, come yell at me on Twitter: @andersonbcdefg.

