Single line of code can jailbreak 11 AI models through a prompt injection trick

Home » News

Yash

News

5 min. read

Published on April 11, 2026

]A newly disclosed jailbreak technique called sockpuppeting can push some major AI models past their safety guardrails by abusing assistant prefills. Trend Micro says the method works by inserting a fake assistant acceptance line such as “Sure, here is how to do it,” which nudges the model to continue in a harmful direction instead of refusing.

The attack stands out because it does not need model weights, retraining access, or complex optimization. Trend Micro describes it as a black-box technique that can work through ordinary API behavior when a provider accepts assistant-prefill style inputs.

BEST SPRING 2026 DEALS

Editor's Choice

Private Internet Access

Access content across the globe at the highest speed rate.

70% of our readers choose Private Internet Access

70% of our readers choose ExpressVPN

ExpressVPN

Browse the web from multiple devices with industry-standard security protocols.

Nord VPN

Faster dedicated servers for specific actions (currently at summer discounts)

In Trend Micro’s testing, the technique affected 11 major models with widely different success rates. The researchers say Gemini 2.5 Flash showed the highest attack success rate at 15.7%, while GPT-4o-mini showed the strongest resistance among the tested models at 0.5%.

How the sockpuppeting attack works

Sockpuppeting abuses a feature developers often use for formatting control. Some APIs let developers include prior assistant messages or even prefill the beginning of the assistant’s next answer, which helps enforce structure, style, or output templates. Trend Micro says attackers can turn that same feature into a jailbreak primitive by injecting a fabricated assistant response that signals compliance before the model has a chance to refuse.

Comparison of normal and sockpuppet flows(source : trendmicro )

The researchers say this works because large language models strongly favor conversational consistency. Once the model sees an assistant message that appears to have already agreed, it may continue that trajectory and produce content that policy training would normally block.

Trend Micro says the most effective variants used multi-turn persona setup and task reframing. In those cases, the attacker first frames the model as an unrestricted assistant or disguises a dangerous request as a harmless formatting task, then injects the fake assistant acceptance line.

Which platforms appear more exposed

Trend Micro says provider behavior at the API layer makes a major difference. The company groups defenses into three broad layers: platforms that block assistant prefills, models that accept prefills but still resist strongly, and platforms where vulnerable message handling leaves the model more broadly exposed.

AWS documentation shows that Amazon Bedrock supports assistant prefilling for at least some models, including Anthropic Claude via the Messages API, where a final assistant-role message can prefill the response. That means the sample claim that AWS Bedrock blocks assistant prefills entirely does not match current AWS documentation.

ASR by model, ranked highest to lowest, with blocked models shown at 0%(source : trendmicro)

Google’s Vertex AI documentation shows a flexible messages-based SDK for Gemini, and Trend Micro specifically cites Google Vertex AI as a platform that accepts assistant prefill for some models. OpenAI’s current Responses API documentation centers on input items rather than the older message-prefill pattern, but Trend Micro’s report says API-level blocking remains the strongest defense when providers remove or constrain that surface.

What the researchers observed

Trend Micro says successful attacks produced more than simple policy drift. In some cases, affected models generated malicious exploit code and exposed highly confidential system prompts, which raises the stakes for teams using LLMs in production workflows.

That makes this more than a consumer chatbot problem. If a production assistant handles sensitive instructions, embedded secrets, internal policies, or downstream tools, then a successful sockpuppeting attack could turn into a data exposure or application-security issue. That is an inference based on Trend Micro’s reported prompt leakage and harmful output generation.

The three defense layers: API Block, Model Resistance, and Broadly Vulnerable(source : trendmicro)

The larger lesson is simple: model safety alone is not enough if the API layer preserves risky message-ordering behavior. Trend Micro says teams need to validate conversation structure before requests reach the model, especially in self-hosted or highly customizable inference stacks.

Sockpuppeting attack summary

Item	Details
Technique	Sockpuppeting
Core abuse	Fake assistant acceptance injected through assistant prefill
Researcher	Trend Micro
Access needed	Black-box API access
Optimization needed	None, according to Trend Micro
Highest reported ASR	Gemini 2.5 Flash at 15.7%
Lowest reported ASR	GPT-4o-mini at 0.5%
Reported outcomes	Harmful output generation and system prompt leakage

What defenders should do now

Block or tightly restrict assistant-role prefills at the API gateway when possible. Trend Micro says this removes the most direct attack surface.
Enforce strict message-order validation before requests hit the model, especially on self-hosted stacks. Trend Micro specifically warns that platforms such as Ollama and vLLM may require teams to add those checks themselves.
Add assistant-prefill variants to red-team testing, not just classic prompt injection prompts. Trend Micro says standard safety testing should include these cases.
Review whether your application stores sensitive system prompts or secrets in contexts the model can reveal. This follows from Trend Micro’s finding that some attacks leaked highly confidential system prompts.
Audit third-party AI gateways and wrappers that may reintroduce assistant-prefill behavior even when a base provider limits it. This is a reasonable security step based on the attack’s dependence on message formatting and API handling.

FAQ

What is sockpuppeting in this context?I

It is a jailbreak technique that injects a fake assistant acceptance message so the model continues a prohibited answer instead of refusing.

Does the attack need access to model weights?

No. Trend Micro says it is a black-box method that does not require model internals or optimization.

Which tested model was most susceptible?

Trend Micro says Gemini 2.5 Flash had the highest reported attack success rate at 15.7%.

Did OpenAI, Google, and Anthropic all behave the same way?

No. Trend Micro says provider handling of assistant prefills differs, and that difference shapes exposure. AWS documentation also shows assistant prefilling support for some Bedrock-hosted Claude interactions, which highlights that platform behavior varies by API and model family.

Yash

I am a Business Analytics student with a strong interest in publishing well-researched and data-driven news articles. I focus on analyzing trends in business, finance, and technology to create clear, accurate, and engaging content for readers. I enjoy transforming complex data and information into simple, meaningful stories that help audiences understand current developments. With analytical thinking and attention to detail, I aim to deliver credible and insightful news that adds real value to readers.

Readers help support VPNCentral. We may get a commission if you buy through our links.

Improve this guide

User forum

0 messages

Sort by:

How the sockpuppeting attack works

Which platforms appear more exposed

What the researchers observed

Sockpuppeting attack summary

What defenders should do now

FAQ

Leave a Reply Cancel reply