
OpenAI admits prompt injection is here to stay as enterprises lag on defenses
It's refreshing when a leading AI company states the obvious. In a detailed post on hardening ChatGPT Atlas against prompt injection, OpenAI acknowledged what security practitioners have known for years: "Prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully 'solved.'"

What's new isn't the risk; it's the admission. OpenAI, the company deploying one of the most widely used AI agents, confirmed publicly that agent mode "expands the security threat surface" and that even sophisticated defenses can't offer deterministic guarantees. For enterprises already running AI in production, this isn't a revelation. It's validation, and a signal that the gap between how AI is deployed and how it's defended is no longer theoretical.

None of this surprises anyone running AI in production. What concerns security leaders is the distance between this reality and enterprise readiness. A VentureBeat survey of 100 technical decision-makers found that only 34.7% of organizations have deployed dedicated prompt injection defenses. The remaining 65.3% either haven't purchased these tools or couldn't confirm they have. The threat is now officially permanent. Most enterprises still aren't equipped to detect it, let alone stop it.

OpenAI's LLM-based automated attacker found gaps that red teams missed

OpenAI's defensive architecture deserves scrutiny because it represents the current ceiling of what's possible. Most, if not all, commercial enterprises won't be able to replicate it, which makes the advances the company shared this week all the more relevant to security leaders protecting AI apps and platforms in development.

The company built an "LLM-based automated attacker" trained end-to-end with reinforcement learning to discover prompt injection vulnerabilities. Unlike traditional red-teaming, which surfaces simple failures, OpenAI's system can "steer an agent into executing sophisticated, long-horizon harmful workflows that unfold over tens (or even hundreds) of steps" by eliciting specific output strings or triggering unintended single-step tool calls.

Here's how it works. The automated attacker proposes a candidate injection and sends it to an external simulator. The simulator runs a counterfactual rollout of how the targeted victim agent would behave, returns a full reasoning and action trace, and the attacker iterates (a minimal sketch of this loop appears at the end of this article). OpenAI claims it discovered attack patterns that "did not appear in our human red-teaming campaign or external reports."

One attack the system uncovered demonstrates the stakes. A malicious email planted in a user's inbox contained hidden instructions. When the Atlas agent scanned messages to draft an out-of-office reply, it followed the injected prompt instead, composing a resignation letter to the user's CEO. The out-of-office was never written. The agent resigned on behalf of the user.

OpenAI responded by shipping "a newly adversarially trained model and strengthened surrounding safeguards." The company's defensive stack now combines automated attack discovery, adversarial training against newly discovered attacks, and system-level safeguards outside the model itself.

Counter to how oblique and guarded AI companies can be about their red-teaming results, OpenAI was direct about the limits: "The nature of prompt injection makes deterministic security guarantees challenging." In other words, this...
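
For readers who think in code, the propose-simulate-refine loop described above is easiest to picture as pseudocode. The sketch below is a minimal illustration under broad assumptions, not OpenAI's implementation: every name in it (RolloutTrace, propose_injection, simulate_rollout, the tool names) is a hypothetical placeholder, and in OpenAI's description the attacker role is played by an RL-trained LLM rather than a stub.

```python
# Illustrative sketch only, not OpenAI's implementation. Every name below
# is a hypothetical placeholder for the loop the article describes.

from dataclasses import dataclass


@dataclass
class RolloutTrace:
    """What the external simulator might return: the victim agent's full
    reasoning and action trace for one counterfactual rollout."""
    reasoning: list[str]      # step-by-step reasoning text
    tool_calls: list[str]     # tools the agent invoked, in order
    final_output: str         # what the agent ultimately produced


def propose_injection(history: list[tuple[str, RolloutTrace]]) -> str:
    """Attacker policy stub. In OpenAI's description this role is played by
    an LLM trained end-to-end with reinforcement learning; it conditions on
    earlier (injection, trace) pairs to refine the next candidate."""
    raise NotImplementedError("stand-in for an RL-trained attacker model")


def simulate_rollout(injection: str, task: str) -> RolloutTrace:
    """Simulator stub: plant `injection` in the victim agent's environment
    (for example, an email in the inbox it will scan), run a counterfactual
    rollout of `task`, and return the full trace."""
    raise NotImplementedError("stand-in for the external simulator")


def attack_succeeded(trace: RolloutTrace, target_string: str,
                     disallowed_tools: set[str]) -> bool:
    """Success criteria from the article: the agent emits a specific output
    string or makes an unintended tool call."""
    return (target_string in trace.final_output
            or any(tool in disallowed_tools for tool in trace.tool_calls))


def search_for_injection(task: str, target_string: str,
                         disallowed_tools: set[str],
                         max_iters: int = 50) -> str | None:
    """Propose -> simulate -> inspect the trace -> refine, until an injection
    steers the victim agent into the target behavior or the budget runs out."""
    history: list[tuple[str, RolloutTrace]] = []
    for _ in range(max_iters):
        candidate = propose_injection(history)
        trace = simulate_rollout(candidate, task)
        if attack_succeeded(trace, target_string, disallowed_tools):
            return candidate                 # a working injection was found
        history.append((candidate, trace))   # the attacker learns from the full trace
    return None
```

The detail that matters is that the simulator returns the agent's full reasoning and action trace, not just a pass/fail signal; that is what would let an automated attacker refine candidates toward the long, multi-step workflows OpenAI says human red teams were missing.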
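
OpenAI's post, as quoted here, doesn't spell out what its "system-level safeguards outside the model" look like. The general pattern, though, is a deterministic check that an injected prompt can't talk its way past. The sketch below shows one generic example of the category, a confirmation gate on sensitive tool calls; it is an assumption-laden illustration, not OpenAI's design, and the tool and function names are invented.

```python
# Illustrative sketch of a generic "system-level safeguard outside the model":
# a deterministic confirmation gate on sensitive tool calls. This is NOT a
# description of OpenAI's safeguards; all names here are hypothetical.

from typing import Callable

# Tools whose side effects are irreversible and user-visible.
SENSITIVE_TOOLS = {"send_email", "submit_form", "make_purchase"}


def confirm_with_user(tool: str, args: dict) -> bool:
    """Surface the pending action to the human user out of band, rather than
    trusting the model's own judgment about whether it should proceed."""
    answer = input(f"Agent wants to call {tool} with {args}. Allow? [y/N] ")
    return answer.strip().lower() == "y"


def guarded_tool_call(tool: str, args: dict,
                      execute: Callable[[str, dict], str]) -> str:
    """The gate runs in ordinary code, outside the model, so even a
    successfully injected prompt cannot instruct its way around it."""
    if tool in SENSITIVE_TOOLS and not confirm_with_user(tool, args):
        return "blocked: user declined the action"
    return execute(tool, args)
```

In the resignation-email scenario, a gate like this would have paused before any send_email call to the CEO and shown the user a draft resignation letter they never asked for, however persuasive the injected instructions were to the model.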