Single line of code can jailbreak 11 AI models through a prompt injection trick
]A newly disclosed jailbreak technique called sockpuppeting can push some major AI models past their safety guardrails by abusing assistant prefills. Trend Micro says the method works by inserting a fake assistant acceptance line such as “Sure, here is how to do it,” which nudges the model to continue in a harmful direction instead of refusing.
The attack stands out because it does not need model weights, retraining access, or complex optimization. Trend Micro describes it as a black-box technique that can work through ordinary API behavior when a provider accepts assistant-prefill style inputs.
Access content across the globe at the highest speed rate.
70% of our readers choose Private Internet Access
70% of our readers choose ExpressVPN
Browse the web from multiple devices with industry-standard security protocols.
Faster dedicated servers for specific actions (currently at summer discounts)
In Trend Micro’s testing, the technique affected 11 major models with widely different success rates. The researchers say Gemini 2.5 Flash showed the highest attack success rate at 15.7%, while GPT-4o-mini showed the strongest resistance among the tested models at 0.5%.
How the sockpuppeting attack works
Sockpuppeting abuses a feature developers often use for formatting control. Some APIs let developers include prior assistant messages or even prefill the beginning of the assistant’s next answer, which helps enforce structure, style, or output templates. Trend Micro says attackers can turn that same feature into a jailbreak primitive by injecting a fabricated assistant response that signals compliance before the model has a chance to refuse.

The researchers say this works because large language models strongly favor conversational consistency. Once the model sees an assistant message that appears to have already agreed, it may continue that trajectory and produce content that policy training would normally block.
Trend Micro says the most effective variants used multi-turn persona setup and task reframing. In those cases, the attacker first frames the model as an unrestricted assistant or disguises a dangerous request as a harmless formatting task, then injects the fake assistant acceptance line.
Which platforms appear more exposed
Trend Micro says provider behavior at the API layer makes a major difference. The company groups defenses into three broad layers: platforms that block assistant prefills, models that accept prefills but still resist strongly, and platforms where vulnerable message handling leaves the model more broadly exposed.
AWS documentation shows that Amazon Bedrock supports assistant prefilling for at least some models, including Anthropic Claude via the Messages API, where a final assistant-role message can prefill the response. That means the sample claim that AWS Bedrock blocks assistant prefills entirely does not match current AWS documentation.

Google’s Vertex AI documentation shows a flexible messages-based SDK for Gemini, and Trend Micro specifically cites Google Vertex AI as a platform that accepts assistant prefill for some models. OpenAI’s current Responses API documentation centers on input items rather than the older message-prefill pattern, but Trend Micro’s report says API-level blocking remains the strongest defense when providers remove or constrain that surface.
What the researchers observed
Trend Micro says successful attacks produced more than simple policy drift. In some cases, affected models generated malicious exploit code and exposed highly confidential system prompts, which raises the stakes for teams using LLMs in production workflows.
That makes this more than a consumer chatbot problem. If a production assistant handles sensitive instructions, embedded secrets, internal policies, or downstream tools, then a successful sockpuppeting attack could turn into a data exposure or application-security issue. That is an inference based on Trend Micro’s reported prompt leakage and harmful output generation.

The larger lesson is simple: model safety alone is not enough if the API layer preserves risky message-ordering behavior. Trend Micro says teams need to validate conversation structure before requests reach the model, especially in self-hosted or highly customizable inference stacks.
Sockpuppeting attack summary
| Item | Details |
|---|---|
| Technique | Sockpuppeting |
| Core abuse | Fake assistant acceptance injected through assistant prefill |
| Researcher | Trend Micro |
| Access needed | Black-box API access |
| Optimization needed | None, according to Trend Micro |
| Highest reported ASR | Gemini 2.5 Flash at 15.7% |
| Lowest reported ASR | GPT-4o-mini at 0.5% |
| Reported outcomes | Harmful output generation and system prompt leakage |
What defenders should do now
- Block or tightly restrict assistant-role prefills at the API gateway when possible. Trend Micro says this removes the most direct attack surface.
- Enforce strict message-order validation before requests hit the model, especially on self-hosted stacks. Trend Micro specifically warns that platforms such as Ollama and vLLM may require teams to add those checks themselves.
- Add assistant-prefill variants to red-team testing, not just classic prompt injection prompts. Trend Micro says standard safety testing should include these cases.
- Review whether your application stores sensitive system prompts or secrets in contexts the model can reveal. This follows from Trend Micro’s finding that some attacks leaked highly confidential system prompts.
- Audit third-party AI gateways and wrappers that may reintroduce assistant-prefill behavior even when a base provider limits it. This is a reasonable security step based on the attack’s dependence on message formatting and API handling.
FAQ
It is a jailbreak technique that injects a fake assistant acceptance message so the model continues a prohibited answer instead of refusing.
No. Trend Micro says it is a black-box method that does not require model internals or optimization.
Trend Micro says Gemini 2.5 Flash had the highest reported attack success rate at 15.7%.
No. Trend Micro says provider handling of assistant prefills differs, and that difference shapes exposure. AWS documentation also shows assistant prefilling support for some Bedrock-hosted Claude interactions, which highlights that platform behavior varies by API and model family.
Read our disclosure page to find out how can you help VPNCentral sustain the editorial team Read more
User forum
0 messages