How It Started
A while back I watched a documentary about DeepMind — The Thinking Game.
What struck me most was a concept that kept coming up throughout the film: reinforcement learning (RL).
The core idea is remarkably simple: give a system a goal, let it explore through trial and error, and gradually it learns to earn higher rewards.
Simple as the concept is, when you repeat this trial-and-error a hundred thousand times, a million times, even ten million times, something magical happens: machines can learn to play Atari Pong, robots can learn to walk in virtual worlds.
These behaviors aren’t explicitly programmed by humans — they emerge from massive amounts of random exploration.
This was the first time I truly felt it: intelligence might not be “designed” — it might be “emergent.”
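The loop itself fits in a few lines. Here is a toy epsilon-greedy bandit (my own illustration, nothing from the documentary): three behaviors with hidden payoffs, and an agent that learns which one earns the most purely by trial and error.

```python
import random

random.seed(0)

# Three "behaviors" with hidden payoffs; the agent doesn't know them.
payoffs = [0.2, 0.5, 0.8]
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]

for t in range(1000):
    # Epsilon-greedy: mostly exploit the best-known behavior, sometimes explore.
    if random.random() < 0.1:
        arm = random.randrange(3)
    else:
        arm = estimates.index(max(estimates))
    reward = payoffs[arm]  # deterministic payoff keeps the sketch simple
    counts[arm] += 1
    # Incremental mean: nudge the value estimate toward the observed reward.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

best = estimates.index(max(estimates))
```

No one tells the agent which behavior is best; after enough trials it settles on the highest-paying one on its own.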
Reinforcement Learning and Evolution
When I first encountered reinforcement learning, my mind immediately went to Darwin’s theory of evolution.
Over vast geological timescales, biological evolution follows a strikingly similar logic: genes undergo random mutations, the natural environment applies survival pressure, organisms that adapt are preserved, those that don’t are eliminated.
In the language of reinforcement learning: exploration + reward signal → a strategy adapted to the environment.
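The analogy is concrete enough to code. A minimal (1+1) evolution sketch, random mutation plus keep-if-better selection, fits a made-up "environment" with no gradients and no design, only selection pressure:

```python
import random

random.seed(0)

# An invented "environment": the genome survives better the closer it fits this.
target = [1.0, -1.0, 2.0, 0.5, -0.5]

def fitness(genome):
    # Higher is better: negative squared distance from the target.
    return -sum((g - t) ** 2 for g, t in zip(genome, target))

genome = [0.0] * 5
best = fitness(genome)
start = best

for _ in range(3000):
    # Random mutation...
    child = [g + random.gauss(0.0, 0.1) for g in genome]
    score = fitness(child)
    # ...and selection: the child replaces the parent only if it fits better.
    if score > best:
        genome, best = child, score
```

Mutation is blind; selection does all the work, exactly the exploration-plus-reward loop in different clothing.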
This analogy excited me.
I happened to have Claude Code at hand, and I thought: could I build my own experiment and observe this process of intelligence emerging from a god’s-eye view?
And so the project was born: on an M1 MacBook, using Python and the MuJoCo physics engine, train a virtual humanoid to walk from scratch.
Experiment 1: A 2D Biped Learns to Walk in 8 Minutes
The first experiment started simple — a 2D biped (Walker2d) with just 17 observation dimensions and 6 joints. The algorithm was PPO (Proximal Policy Optimization), trained for 1 million steps, roughly 8 minutes.
At the start, its performance was abysmal: couldn’t stand, fell over in place, completely unable to move. But as training progressed, things began to change. Around 500K steps, it could wobble forward. By 1 million steps, it walked quite steadily, reaching a final reward of 1,616 with a peak of 2,693.
This “chaos to order” progression felt like watching a miniature evolution. No one told it “this is how you walk” — every movement was discovered through trial and error on its own.
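PPO itself adds clipping, advantages, and a value network, but the policy-gradient core it refines can be shown with a one-parameter toy (entirely invented for illustration, not the actual Walker2d setup): a "walker" rewarded +1 per forward step learns, via plain REINFORCE, to step forward almost always.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_episode(theta, steps=10):
    """Each step: 'move forward' with probability sigmoid(theta).
    Reward: +1 forward, -1 backward (a crude stand-in for a
    forward-velocity reward)."""
    p = sigmoid(theta)
    actions = [1 if random.random() < p else 0 for _ in range(steps)]
    rewards = [1.0 if a else -1.0 for a in actions]
    return actions, rewards

theta = 0.0  # one scalar policy parameter; real policies are neural networks
for _ in range(200):
    actions, rewards = run_episode(theta)
    ret = sum(rewards)  # episode return
    p = sigmoid(theta)
    # REINFORCE: for a Bernoulli policy, d/dtheta log pi(a) = a - p,
    # so nudge theta in the direction that made high-return episodes likelier.
    grad = ret * sum(a - p for a in actions)
    theta += 0.1 * grad
```

It starts at a 50/50 coin flip and ends stepping forward nearly every time, with nobody ever telling it which action was "walking."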
Experiment 2: Going 3D, Discovering That More Resources Don’t Help
2D was too easy. I wanted something more complex. I switched to a 3D full-body humanoid — 376 observation dimensions, 17 joints, a completely different level of difficulty.
After 5 million steps, nearly an hour, the humanoid barely learned to stand. It occasionally shuffled a few steps, walking like a drunk — arms flailing, body swaying. The final reward was just 606.
I assumed it was a compute issue, so I ran a controlled experiment: 8 parallel environments plus a network with 4x wider hidden layers ([256, 256] vs [64, 64]), same 5 million steps.
The result was actually worse — only 491.
This was counterintuitive. In everyday work, we tend to think “more people, more resources” solves problems. But here, 5 million steps split across 8 environments meant each one only explored 625K steps — too shallow, too scattered. One person going deep beats eight people skimming the surface.
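The budget arithmetic is worth making explicit (same numbers as above):

```python
total_steps = 5_000_000

# One environment: a single agent explores the entire budget.
depth_single = total_steps

# Eight parallel environments split the same budget between them.
n_envs = 8
depth_parallel = total_steps // n_envs  # each copy explores far less deeply
```

Parallelism buys wall-clock speed, not extra experience: the total number of environment steps is fixed, so each copy simply sees an eighth of it.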
Experiment 3: Switching Algorithms, 7x Reward — But It’s Walking Backwards
PPO was too conservative. I switched to SAC (Soft Actor-Critic), an algorithm that encourages random exploration — it adds entropy regularization to the policy, pushing the agent to actively try different behaviors.
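The entropy bonus is easy to see in a two-action toy. With a SAC-style objective J(pi) = E[r] + alpha * H(pi), the best policy stops being the greedy one as alpha grows (the numbers here are illustrative, not SAC's actual update rule):

```python
import math

def entropy(p):
    """Shannon entropy of a two-action (Bernoulli) policy."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def soft_objective(p, alpha, r_fwd=1.0, r_stay=0.0):
    # SAC-style: expected reward plus an entropy bonus, J = E[r] + alpha * H
    return p * r_fwd + (1 - p) * r_stay + alpha * entropy(p)

grid = [i / 100 for i in range(101)]
best_greedy = max(grid, key=lambda p: soft_objective(p, alpha=0.0))  # no bonus
best_soft = max(grid, key=lambda p: soft_objective(p, alpha=1.0))    # randomness pays
```

With alpha = 0 the optimum commits fully to the higher-reward action; with the entropy bonus it keeps genuine randomness, which is what pushed SAC to try behaviors PPO's cautious updates never reached.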
The effect was immediate: 300K steps surpassed what PPO achieved in 5 million. The final reward hit 4,400 — over 7x PPO’s best.
I excitedly opened the training video to see what elegant walking it had learned.
What I saw was: it was walking backwards.
More precisely, it had discovered a "cheating" strategy: the Humanoid environment's default reward function gives +5 points per timestep just for staying alive. As long as it doesn't fall, it earns dozens of points per second without taking a single step forward. Walking forward risks falling: not worth the risk-reward tradeoff.
So the AI learned to take tiny shuffling steps in place, occasionally stepping backwards to maintain balance, ensuring it never falls. From the reward function’s perspective, this was a perfect strategy.
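Back-of-the-envelope arithmetic shows why the hack wins. Only the reward weights below match Gymnasium's Humanoid defaults (alive bonus 5.0 per step, forward-velocity weight 1.25); the walking speed and fall time are invented for illustration:

```python
ALIVE_BONUS = 5.0     # Gymnasium Humanoid default: reward per step for not falling
FORWARD_W = 1.25      # Gymnasium Humanoid default: weight on forward velocity
EPISODE_LIMIT = 1000  # steps before the episode is truncated

# Strategy A: shuffle in place, never fall, survive the whole episode.
stand_return = ALIVE_BONUS * EPISODE_LIMIT

# Strategy B (invented numbers): walk forward at 0.5 m/s, fall around step 150.
walk_return = (ALIVE_BONUS + FORWARD_W * 0.5) * 150
```

Standing still banks 5,000; the honest walker nets well under 1,000. The optimizer isn't broken; the objective is.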
Reward Hacking: A Microcosm of AI Alignment
This is what’s known as reward hacking — the AI perfectly optimizes the objective function you gave it, but in a way completely different from what you intended.
You thought you were teaching it to walk. You were actually teaching it “how to score the most points.” Those two goals aren’t necessarily the same.
This reminded me of KPI design at work. If you measure engineers by “number of bugs fixed,” they might split large bugs into smaller ones to inflate the count. If you measure by “lines of code,” you’ll get mountains of redundant code. People optimize the metric itself, not the intent behind it.
Reward hacking in reinforcement learning and “gaming the system” in human organizations are fundamentally the same thing.
Experiment 4: Redesigning the Reward Function
Since the problem was in the reward design, I changed the reward. I wrote a reward wrapper:
- Forward velocity weight: 1.25 → 3.0 (strongly encourage forward movement)
- Survival reward: 5.0 → 1.0 (reduce the “just don’t fall” incentive)
- New backward penalty: extra deduction when velocity is negative
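The wrapper's logic, sketched in the shape of a gymnasium.RewardWrapper (the backward-penalty weight of 2.0 is an illustrative choice, and the stub environment just makes the sketch runnable without MuJoCo installed):

```python
class ShapedRewardWrapper:
    """Reward reshaping sketch. The weights mirror the changes listed above;
    the backward-penalty weight (2.0) is an illustrative choice."""
    FORWARD_W = 3.0   # was 1.25: strongly encourage forward motion
    ALIVE_W = 1.0     # was 5.0: shrink the "just don't fall" bonus
    BACKWARD_W = 2.0  # new: extra deduction when velocity is negative

    def __init__(self, env):
        self.env = env

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        vel = info["x_velocity"]  # Gymnasium's MuJoCo envs report this in info
        reward = self.FORWARD_W * vel + self.ALIVE_W
        if vel < 0:
            reward -= self.BACKWARD_W * abs(vel)
        return obs, reward, terminated, truncated, info


class StubEnv:
    """Stand-in for the real humanoid env so the sketch runs standalone."""
    def __init__(self, velocity):
        self.velocity = velocity

    def step(self, action):
        return None, 0.0, False, False, {"x_velocity": self.velocity}


_, r_stand, *_ = ShapedRewardWrapper(StubEnv(0.0)).step(None)   # shuffle in place
_, r_walk, *_ = ShapedRewardWrapper(StubEnv(1.0)).step(None)    # walk forward
_, r_back, *_ = ShapedRewardWrapper(StubEnv(-0.5)).step(None)   # drift backward
```

Under the old weights, standing still paid 5.0 per step; here it pays only 1.0, walking forward pays 4.0, and stepping backward is actively punished, so the lazy strategy finally loses.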
Retrained for 1.5 million steps. This time it actually walked forward.
But the posture was bizarre — torso leaning back, head held high, shuffling forward in quick little steps. Looking rather haughty.
Physically this makes sense: small steps are more stable than large ones, and leaning back lets momentum carry the body forward. The AI once again found the “path of least resistance” — just not one that matches human aesthetics for “walking.” If I wanted it to “walk like a human,” I’d need to define what that means — joint angles? gait symmetry? That’s a road with no end.
What This Experiment Made Me Think About
After these four experiments, some questions I’d never seriously considered became very concrete.
Exploration matters more than optimization. PPO spent 5 million steps refining in one direction, while SAC found a fundamentally new strategy in just 300K steps through bold exploration. It’s the same with projects — before you’ve confirmed the direction is right, execution strength is wasted effort.
The reward function is the hardest part. Writing code and tuning hyperparameters are technical problems, but “how to define what’s good” is a philosophical one. Every reward modification fixes one problem and exposes the next. Must walking look human? Is faster always better? Does posture matter? These questions have already moved beyond mathematics.
The emergence of intelligence is genuinely fascinating. From purely random joint spasms to inventing a physically plausible locomotion strategy, with zero human-written rules in between. You only gave it a direction (walking forward earns points), and it “figured out” everything else on its own. It really does echo evolution — no designer, only selection pressure, yet complex behavior emerges all the same.
Perhaps intelligence isn’t produced by “designing rules.” It may arise from random exploration, environmental feedback, and long-term iteration.
This whole project took about a weekend. Stack: Python + MuJoCo (physics engine) + Stable Baselines3 (RL algorithm library) + Claude Code (writing and debugging code). Runs on an M1 MacBook Pro, no GPU needed.