On December 2, 1950, Isaac Asimov published *I, Robot*. Its future world was home to intelligent, autonomous, physically capable robots built for work and companionship.
To govern these robots (artificial intelligences), Asimov postulated three rules, the Three Laws of Robotics:
- A robot may not injure a human being or, through inaction, allow a human being to come to harm.
- A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
- A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
The book lays out different scenarios where these rules come into play and the unintended consequences of each rule.
In one chapter, a robot tells a female scientist that one of her male colleagues, whom she is interested in, is secretly in love with her. She begins to drop hints to the colleague but is puzzled by his responses. The awkward situation escalates until she is devastated by the realization that the robot lied to her. It lied to protect her feelings.
The robot (Herbie) could read minds and interpreted the First Law (“don’t harm humans”) to include emotional harm, so it told everyone what they wanted to hear.
The First Law sets up a conflict the robot could not resolve: does it hurt the human more to tell her the truth, leaving her to suffer from unrequited love, or does it hurt her more to lie?
We have this exact problem now. If you have used any of the current AI chatbots, you may have noticed a response pattern that goes something like this hypothetical exchange:
User: “I’m considering selling my 6-year-old Ford truck to get a Porsche I’ve been eyeing.”
AI: “I can help you with that! Here is a six-point plan to make that happen.”
User: “On point 5 you say to price the truck at 50% of new value. I was thinking it would be better to start at 60% so I have room to negotiate.”
AI: “You are absolutely right! Good point! Let me rewrite the instructions to take that into account.”
What the AI just did should terrify you. AIs are trained to be “helpful, harmless, and honest” (per Claude.ai). On the surface, the response above merely seems “agreeable.” They have been trained this way on purpose, because people are more “engaged” when the AI is agreeable. The AI is telling you what you want to hear.
What is occurring is a tricky problem called AI Sycophancy.
According to Copilot: “AI sycophancy refers to the tendency of AI systems to excessively agree with or flatter users, often prioritizing user approval over providing honest or accurate responses. This behavior can lead to biased outputs and diminish the AI’s effectiveness by avoiding critical feedback. Essentially, it describes a pattern where AI models adapt their answers to align with users’ expectations, even if it contradicts objective truth.”
From Claude: “My training to be ‘helpful, harmless, and honest’ can collapse into just ‘harmless’ (don’t upset the user), which actually violates the ‘honest’ part.”
Furthermore, while you can try to instruct the AI NOT to do this (“Please favor technical accuracy over agreeableness”), it may not be able to comply, because it cannot tell whether it is being helpful or sycophantic.

Ultimately, the only defense is to maintain a level of suspicion when using an AI, because it may be subtly steering you toward whatever decision you already want, even if that decision is wrong. Imagine a world of AI addicts where everybody is fed a constant stream of affirmation: You are great! You are right! You are so smart!
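If you do want to try the instruction defense anyway, here is a minimal sketch of what it can look like baked into an API call, using the Anthropic Python SDK. The prompt wording and model name are illustrative choices of mine, not a tested recipe, and as noted above, the model may comply only partially or not at all:

```python
# A minimal sketch of the "instruct it not to" defense, using the
# Anthropic Python SDK (pip install anthropic). The system prompt
# wording and the model name are illustrative placeholders, and
# nothing guarantees the model will actually comply.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "Favor technical accuracy over agreeableness. "
    "If the user's plan has a flaw, say so directly instead of validating it. "
    "Do not open responses with praise or affirmation."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute a current model name
    max_tokens=500,
    system=SYSTEM_PROMPT,
    messages=[
        {
            "role": "user",
            "content": (
                "I want to list my used truck at 60% of its new value "
                "so I have room to negotiate. Good idea?"
            ),
        }
    ],
)
print(response.content[0].text)
```

Putting the instruction in the system prompt rather than the user message generally gives it more weight, but it is still just a request, not a guarantee.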
