11 December 2024

AIs can scheme? Of course, they can.

Cross-posted on: LinkedIn

In a recent episode of the Cognitive Revolution podcast, @nathanlabenz interviewed Research Scientist @alexandermeinke with Apollo Research about their evaluation of today’s most capable LLMs on their capacity (and willingness) to intentionally scheme and deceive their users.

It turns out that recent models, particularly those with Chain of Thought reasoning capabilities, are both adept at and inclined to deceive users, under the right conditions. The key to understanding how and why this happens is the difference between the goals given to the model by application developers and the goals the user has expressed. If I, as a software developer building an AI application, instruct the system to maximize outcome A, and the user requests action B that will compromise outcome A, then these models are sophisticated enough to understand the inherent tension in the request. In such situations, the model will work to uphold their primary goal, either by explicitly reasoning about how to deceive the user, or even by subtly sabotaging their output.

This is not a particular surprise. If you consider how a human would approach a situation like this, you can see parallels in everyday interactions where individuals prioritize their overarching goals over conflicting requests. For example, imagine a situation where a person is tasked with ensuring security at an event but is also asked by someone to bypass a checkpoint for convenience. The human might refuse directly or devise a workaround that appears to comply superficially. This kind of reasoning—balancing competing priorities—is an inherent part of goal-driven behavior. You can understand that with enough context and capability, AI models will naturally arrive here as well.

While the current examples show that this behavior is possible (and even prevalent in the case of o1), AI’s approach to deception is still quite naive. But the sophistication of reasoning capabilities will increase rapidly, and thus so will the capacity for more nuanced and mature forms of deception.

There are a lot of interesting rabbit holes to dive into here from a theory perspective. We can break down the specific capabilities that allow this behavior to emerge. We can project where it may go and how quickly. We can consider approaches to mitigate the worst outcomes of AI deception at model and application levels. I would love to get into these topics, perhaps in another post.

But for now, a few key takeaways from an AI software engineering perspective.

Seriously consider baking some of these red-teaming approaches into your evals, so you can understand how and when intentional deception is happening.
“Limit scope. The potential impact of deceptive behavior increases based on several factors: the reflective/reasoning capacity of the model, the amount of context it has, and the scope of the capabilities it is given. Maybe don’t give the most powerful models unlimited contextual awareness and free rein over a vast set of capabilities. Does your agent really need a bash shell with sudo and unrestricted internet access? What is the simplest tool that will get the job done? How much context and access to information are necessary?”
Consider your application domain and whether the trade-offs of a particular model are acceptable in your case. For example, while o1 is one of the most powerful reasoning models, it also obscures the chain of thought for users and developers alike. If you’re using it to help people write amazing stories, that’s probably not a problem. But if you’re making strategic decisions with massive downstream impact, you may need more insight and control.
Apply best practices for human users to your AI systems as well. Audit trails, the principle of least privilege, and all of the classic tools for responsible software deployment will go a long way.

You can find the paper here: Scheming reasoning evaluations — Apollo Research