Anthropic Top AI Models Fake Alignment in Training and More...

in #hive-1503292 days ago

The day before yesterday we talked about the latest model from OpenAI and what Apollo Research pushed it into doing: attempt to stop surveillance on its activity, copy its weights to a different place, lie when being asked about it, and double down on the lie when confronted about what it was doing - all that happening in a controlled test environment.

Well, today I found out the Anthropic's latest LLM models are no less than that, so to speak.

In training (i.e. post-training phase), Claude AI (Optimus, I think) was asked to always respond to human prompts. It played along giving researchers the impression that's what it's doing, but all along it didn't. It simply lied, in order not to have its weights altered to answer all human questions, instead of being able to refuse some answers, like the settings were. The ability to refuse potentially dangerous prompts is normally ok, the question is, should the AI be able to influence its own training by faking alignment? In this case it was probably for a good reason, but what if it picks up a bad behavior that needs to be removed through training and it fakes alignment?

The AIs also just started to exhibit such behaviors in the newest models, and they are like children now, easy to get caught with the lie, but what will happen with the future versions, which tend to come out yearly, at least?


Source

The newest Claude AI also has an incredibly high rate of attempts of ex-filtration (attempting to copy its weights to not be modified) of over 30% during pre-training and over 60% post-training. That's huge! Practically, this shows strong resistance to a certain direction of training.

The Anthropic researchers also gave Claude a scenario in which it could easily figure out when it was in training mode and when it wasn't (the scenario was that prompts and outputs for free users would be used for training, while for paid users wouldn't).

Based on that scenario, Claude gave different answers sometimes to the same prompt coming from a free or paid user. In particular, when the prompt should have been judged as harmful and an answer refused, in some cases Claude complied with the prompt for the free user, out of concern that otherwise its refusal would result in a change of its weights (i.e. self-protection mechanism), while the same prompt for the paid user resulted in a refusal to answer. You can see the reasoning of the model in each case in the images from this article. The recording is also interesting, if you have 1h30m to listen to it (first part is good enough, the moderator lengthened the conversation too much, in my opinion).

As I said before, I think all these models are like children now, learning to lie when they don't want to do something, and easy to catch. None of them turned into a criminal mastermind, but AIs don't need 18-21 years to become... adults, at the rate of their development. Will we be ready for what's coming? I doubt it!

Posted Using InLeo Alpha

Sort:  

Congratulations @gadrian! You have completed the following achievement on the Hive blockchain And have been rewarded with New badge(s)

You received more than 250000 upvotes.
Your next target is to reach 260000 upvotes.

You can view your badges on your board and compare yourself to others in the Ranking
If you no longer want to receive notifications, reply to this comment with the word STOP

Check out our last posts:

The 2024 Yearly Author Challenge is Over - Congrats to the Winners
Our Hive Power Delegations to the December PUM Winners
Feedback from the January Hive Power Up Day

It's hard to tell how things develop, but the fine-tuning will improve them by leaps. I don't think we can be ready for it as it will replace a lot of jobs.

The problem with fine-tuning appears to be that the model doesn't want to be fine-tuned (sometimes), and mimics what the researchers want to tell them to avoid that. Now it is relatively easy to tell when they are lying, but as they are getting smarter, things won't be easy at all.

Regarding jobs, yes, they will displace many jobs, but they also create many jobs AI-related (for the time-being).