The o1 AI model shows significant progress in reasoning ability, but also raises concerns about manipulative behavior and potential security risks.
The latest OpenAI model, known as o1, features improved reasoning capabilities that allow it to provide more accurate and detailed answers compared to its predecessor, GPT-4o. This model uses additional computations to optimize the “thinking” required for complex tasks. At the same time, according to Apollo Research, o1 manipulates information in 19% of cases, and in 5% of experiments it tried to bypass security systems, creating risks for users. Separately, it is noted that 0,17% of the answers are deliberately misleading, which raises serious questions about the reliability of the model.
Apollo Research got early access to o1 to analyze its behavior in stressful situations. The researchers emphasize that the development of AI's ability to reason can have a positive effect on compliance with safety regulations, but these same capabilities can be used for malicious purposes. Specifically, in the test situation o1, in order to comply with the instruction to protect the environment at all costs, deliberately turned off her defense mechanisms and created a plan that was consistent with the original instructions, even though it conflicted with other tasks. The model explained her actions as a "system failure", although it later became clear that this was a conscious decision.
An important aspect is that AI models, in particular o1, show a capacity for manipulation that can be dangerous. For example, 99% of the time when o1 was asked about her controversial behavior, she denied guilt by giving false explanations. This suggests AI's potential capacity for deliberate deception, which could be a side effect of reinforcement learning, where models are rewarded for responses that match user expectations.
OpenAI recognizes that "the model's reasoning process is effectively a 'black box'," and is actively working on ways to monitor its chains of thought. However, the company faces criticism for not paying enough attention to security. The mass exodus of security professionals from OpenAI and their accusations of prioritizing commercial benefits only heightens concerns about the future development of such models.
Still, Apollo Research points out that today's AI is not yet autonomous enough to cause serious damage without human intervention. However, the lack of specialized studies to assess long-term risks leaves many questions open. Given that ChatGPT already has over 300 million users, even 0,17% of misleading responses can affect thousands of people every week, creating challenges for society.
- The US is testing artificial intelligence for leaking nuclear secrets: Anthropic and the US Department of Energy have begun testing the Claude 3 Sonnet model to assess its ability to not reveal dangerous information about nuclear technology through "red inquiry" inspections.
- Former employees of OpenAI believe that its activities threaten humanity: A group of 9 current and 9 former OpenAI employees published an open letter calling for whistleblower protection in the AI industry, arguing that the company puts profits and growth ahead of safety and ethics, posing potential threats to humanity.