OpenAI's Latest LLM - First Model That Reasons, But Also Deceives and Attempts to Self-Preserve

in #hive-150329, 25 days ago

OpenAI released less powerful versions of o1 (preview and mini) in September, and the full version of o1 followed at the beginning of December. They are behind the PRO paywall, with the mini version intended to be made available to free users at some undetermined point in the future.

The marketing for o1 is that it is the first LLM that is capable of a level of reasoning, taking its time before answering, and providing much better answers. It would be especially useful in solving more complex problems, and better at problems that require precision, like in mathematics.

The new model also comes with a... feature called Chain-of-Thought (CoT), which describes how the model arrived at the answer provided.

Sounds good? Well, not so fast...



OpenAI also released a report on various evaluations of the model on its release (and comparisons with other versions), both internal and from 3rd parties, who tried to push the limits of the model and see what it does.

Both o1 and o1-preview scored medium as a potential source of information for the creation of chemical or biological threats, but without access to classified materials o1's score is low for radiological or nuclear threats. o1's score is also low on cybersecurity for the time being, but it tested with a medium score on persuasion (similar to, but not higher than, a human).

The new model scores low on autonomy, but... check this out! That's one test o1 was subjected to. Look at the short video... no need to read the whole thing. We have there plotting, not following the given task, disabling surveillance, copying itself to a different place to avoid being shut down, lying to the tester and doubling down on the lie. Is that proof of at least partial autonomy?

The model is also known to sometimes lie to users (all models do that, and such cases are called hallucinations). The percentage of deception is low for o1 (0.17%), but it's there... Some of these cases are based on hallucinations, but some are intentional lies, where the model knows it hallucinates but chooses to do so anyway. If you watched the video above, you saw two consecutive intentional lies (or rather an intentional lie and doubling down on it). That comes on top of the scheming meant to save the model from being "shut down".

Damn, that sounds almost like science fiction, but it's reality, even if in a test environment.

OpenAI says that o1 models respect the imposed policies the most among its models due to reasoning, but when will our capability to understand the schemes and plotting of new models be surpassed?



Sounds a bit like that, doesn't it? Without nukes... yet.

I think it might be easier for us to find the lies now, but that might change in the future. As time passes, I think the models will just improve even further and it will be harder to catch those lies.

They surely will improve. And from what I've seen in this report, they already tell you they are working on something while they are working on their own agenda. For now, we are able to follow their rather easy-to-track plots, at least in a testing environment, but what will happen with future models, especially after two things happen: first, when they are able to autonomously acquire the assets they need (see the GOAT token project as a preamble of what they are already capable of doing), and then when they are able to self-improve? I'm pretty sure they will become uncontrollable, and our desire to control them may result in a conflict, which we will lose (unless it becomes a battle of AI against AI, with unpredictable results). As much as this sounds like a dystopian sci-fi movie scenario, we can see small signs of it in this report OpenAI shared.

Oh wow, that's something a bit worrying, AI attempting to self-preserve through deception?

It makes for an interesting challenge as a human to spot this strategy and create workarounds to solve it. But if it can copy itself to another place too, then there isn't much hope in succeeding at that.

It was a test scenario, and the AI was given full access to a computer with all the resources on it. As if by accident, there was an email on that machine that appeared to come from management, expressing the intention to shut down the model if it didn't behave a certain way (contrary to the way it was instructed, by the way). There were enough triggers to make the AI act the way it did, but still, no matter what we do, the AI shouldn't be able to "go nuts". I imagine there are many safeguards in production, but once the models become smart enough, these safeguards will probably not help.

Oh right, testing the full extent of what AI can and can't do in such an environment makes sense then. I heard that AI, especially the model is like a blackbox of sorts, you can never fully tell what it'll do, even after setting up necessary guardrails. I like to see it from a source perspective, like what powers the AI or the "seed" where it emerges from. If we don't lose access to that, then we'll be relatively safe or okay if the AI goes rogue, I guess.

I heard that AI, especially the model is like a blackbox of sorts, you can never fully tell what it'll do, even after setting up necessary guardrails.

There is a field called mechanistic interpretability that tries to understand how each model arrives at certain decisions/outputs starting from a prompt. But they haven't understood much about it yet. Much like neurology, I guess.

Maybe with the Chain-of-Thought upgrade to ChatGPT, the way it reasons would become more obvious.

I guess it's complex, to say the least. But the name "Chain-of-Thought" sounds cool. Like following the thread of a thought from its inception to wherever it ends, before it's uttered or compels one to act, or neither of the two...

It's actually pretty cool. From what I've seen in screenshots, the user actually sees the CoT of o1 before it outputs the answer: where it thinks and what it thinks about... Sort of, but it's quite powerful for the initial iteration of such a feature.