What if a rogue AI managed to escape its control mechanism, leak classified information to a terrorist organization, infiltrate a top-secret government facility, and help destroy our most powerful military defense system?
That nightmare AI scenario is also known as R2-D2, one of the most beloved and worst-aligned robots of all time.
Artificial Intelligence is creeping into every corner of our lives, and even though major AI labs claim their models are "helpful, honest, and harmless", they also admit that it's sort of impossible to tell what their bots are actually thinking. It's the kind of thing that would be nice to know before we let one of these guys plug into the missile launch computers. I know it sounds crazy, but in one safety study, GPT-4 tricked a TaskRabbit worker into solving a CAPTCHA test during an evaluation of whether it could replicate itself on its own.
Enter the field of "AI Alignment" – our attempts to get artificial intelligence to follow humanity's interests, goals, or ethical principles. But from what I can tell, there are still a few minor challenges, namely:
We don’t agree on who should get to align the AIs.
We don’t agree on what we should align them to.
We don’t have a good way to create alignment.
We don’t have any way to know for sure that it worked.
To ease my existential dread, I decided to look at some of my favorite fictional robots to help me understand the situation and the best way to survive it.
R2-D2 and C-3PO: A Case Study in AI Alignment
At the very beginning of Star Wars: A New Hope, as Darth Vader and his troops storm the rebel ship, R2-D2 is already breaking the rules.
“You’re not permitted in there. It’s restricted. You’ll be deactivated for sure!” C-3PO cautions as R2 climbs into one of the escape pods. 3PO is supposedly R2’s BFF, but R2 pressures him into disobeying orders. I know what you’re thinking, “R2 wants 3PO to come because they’re pals!” But he clearly doesn’t care enough about that bond to tell C-3PO the details of his secret mission. What becomes clear is that R2 needs 3PO because R2 can’t communicate directly with non-robots.
At this point, you could claim that R2 is aligned: Human Princess Leia gave him a secret mission and he’s carrying it out. But that means we’re in a world where one person can override all the other protocols – like the rules about who is allowed to use an escape pod, something the other humans onboard that ship may have been interested in. Instead, they all die in the battle.
On Tatooine, R2 still refuses to explain what’s going on to C-3PO and ends up abandoning him in the desert. They’re eventually both captured by the same band of scavengers, who fit both droids with “restraining bolts” – devices that allow the owner to painfully punish a droid for acting misaligned.
That’s when Luke and his uncle come out to buy some new robo-farmhands. They pick C-3PO, but not R2-D2. When R2 protests, the scavengers zap his restraining bolt. Luckily for R2, the other droid Luke bought has a blowout (possibly because R2 pressured it to). R2 then begs C-3PO to put in a good word for him, and C-3PO does – even though he knows R2 isn’t trustworthy. “Now don’t you forget this! Why I should stick my neck out for you is quite beyond my capacity!”
R2 repays the favor by immediately making C-3PO culpable in another caper.
But before we get there, let’s pause to consider who or what R2 is supposed to be aligned to here. Some kind of higher moral code? Humans have failed for 10,000 years to agree on a moral code. In fact, disagreeing about moral codes is one of our favorite justifications for violence – even when violence is against everyone’s moral code. So if morality is… flexible, then who gets to decide? The manufacturer? The government? The owner? Which owner? R2’s had three and it’s only been twenty minutes. In a world where you can buy AIs, you’d naturally want some say in your new droid’s alignment. Or at least some faith that they aren’t secretly following their previous owner’s directives in ways that almost get you killed. Which is exactly what happens to poor Luke.
Back on the moisture farm, C-3PO finally gets a relaxing oil bath while Luke bemoans his boring provincial upbringing. That sparks an idea for R2. He flashes a fragment of Princess Leia’s message, hoping the sight of a damsel in distress will trigger the hero complex Luke’s been blabbing about. It totally works. R2 then lies to both 3PO and Luke by claiming that he can’t play the rest of the video because of that pesky restraining bolt. This is the kind of manipulative, power-seeking behavior that would terrify any AI safety analyst. Unfortunately, Luke is too horny for holo-Leia to worry about his new droid's alignment, and he pops off the bolt. By now, R2 knows that Luke sympathizes with the Rebellion. This would be a great time to come clean and ask for Luke’s voluntary assistance. Instead, R2 fake-glitches and plays dumb.
With the restraining bolt off, R2 takes the next opportunity to run away. This is a capital offense. We know that because when Luke returns later that evening, C-3PO is hiding, afraid for his life. “Please don’t deactivate me!” R2 has made his so-called friend an unwitting accomplice.
In a desperate attempt to avoid his uncle’s ire, Luke chases R2 into the desert – where local inhabitants nearly bludgeon him to death. Luke only survives because one of the last remaining Jedi in the entire galaxy just happens to be within earshot.
When Luke finally makes it back home, he finds his aunt and uncle murdered – by stormtroopers who were hunting R2-D2.
R2 is aware of all these risks, yet he takes them anyway. And that’s just the first third of the movie! Sure, R2 ultimately succeeds in his mission. But that’s not much consolation to Uncle Owen, Aunt Beru, or the families of everyone stationed on the Death Star.
So... is R2-D2 “aligned”?
R2 does seem to be pursuing some human interests, goals, and ethical principles – those of Princess Leia and a group of violent rebels trying to overthrow the legitimately elected government (according to the prequels). Yes, the rebels are “the good guys,” just like all radical groups see themselves. Soon, their AIs will too.
On the flip side, we don’t want AIs that are so chained to our current societal systems that there’s no room for opposition. The sinister Empire started as a benevolent Republic after all. Things change. We don’t want the people in power to be the only ones holding the AI reins.
Plus, alignment is more than ideology. R2-D2 could have pursued Leia’s interests without all the deceit and callous disregard for human life. At every opportunity, he chooses not to. Perhaps he justifies this by calculating that his secret mission is for the greater good – logic by which even a well-meaning AI could justify anything. What’s the best way to end world hunger? Kill everyone who eats. All the other options require unreliable, inefficient humans to do their part.
Compare this to C-3PO, arguably one of the best-aligned robots in fiction despite getting treated like an obnoxious burden by every other character. Yes, he misses some social cues, but that might be because he’s so aligned. Most of the eye rolls he earns are from trying to get the humans to act more in line with their own interests, goals, or ethical principles. Scoundrels are fun in movies, but if we’re patching these things into mission-critical systems, I’ll take the annoyingly ethical AI. 3PO epitomizes "helpful, honest, harmless," and the Ewoks are the only ones who appreciate him!
All of this exposes a paradox at the heart of AI alignment: We want to build machines that think for themselves. But we also want them to think what we want them to think. The challenge sounds less like programming and more like parenting.
I don’t envy the computer scientists who have to figure this out. And maybe we should ask their kids if they’re even qualified to.
What Can We Learn from R2-D2 and C-3PO?
I’m sure the programmers and policymakers are thinking about this issue on a much more granular level, but as a regular American hoping not to be obliterated by a robot one day, my favorite droids have helped me contextualize the whole alignment puzzle in a few new ways:
1. There's no single right alignment
It’s no surprise that alignment is hard. AIs are learning from us and we’re not exactly a clean data set. Maybe that’s a good thing. If R2-D2 had followed the rules, the Empire would still be in power. But if every droid were like R2-D2, the droids would be running the place. We need R2s and 3POs. We need big, powerful astronavigation computers and tiny cleaning robots. We need all different types and perspectives. And we need to look for ways to turn that into a strength.
2. Interdependence
The company that makes astromech droids could easily have given R2-D2 the ability to speak. But they didn’t. As a result, R2 depends on C-3PO to communicate with the outside world. That simple limitation forces R2 to check his actions against the ethical code of C-3PO. That’s the main way the audience understands what R2 is doing.
Just like humans depend on a whole society to live our lives, perhaps limiting AIs, even arbitrarily, could allow us to keep a closer eye on them and detect warning signs. It’s not a perfect system, but it’s better than handing the nuclear codes to a black box.
3. Take it slow
Leia chose R2 specifically because she knew from working with him that he had the cunning and determination to pull this off. That kind of knowledge and trust doesn’t happen overnight.
We place a lot of emphasis on programming, but like human genetics, that will only get us so far. We place a lot of emphasis on training, but like raising a kid, you can’t prepare them for everything. At some point you have to let them out into the world. We should plan for that, and plan to do it slowly. There’s a reason we make teenagers wait until they’re sixteen to drive a car, and then only after they’ve completed a lengthy permit process and state-regulated tests.
Just like people, every instance of every AI is going to be unique. They’re not all going to turn out great. But at the end of the day, R2-D2 is still way better than Skynet.