Anthropic Details Claude Fable 5 Cybersecurity Safeguards and Jailbreak Framework


Anthropic has published new details about the cybersecurity safeguards protecting Claude Fable 5, including how it blocks risky cyber requests and how it plans to rate serious jailbreaks.

The disclosure follows the global return of Claude Fable 5 after temporary export restrictions were lifted. In its redeployment update, Anthropic said Fable 5 returned on July 1, 2026, with improved cyber safeguards.

The company says Fable 5 does not block every cybersecurity request. Instead, it uses safety classifiers to separate routine defensive work from requests that could help attackers build malware, exploit systems, or bypass protections.

Claude Fable 5 Uses Cybersecurity Classifiers

Anthropic says cybersecurity is difficult to moderate because many tasks have both defensive and offensive uses. A vulnerability review can help a developer fix code, but it can also help an attacker find a weakness.

In its Fable 5 safeguards documentation, the company says its classifiers sort cyber requests into four categories: prohibited use, high-risk dual use, low-risk dual use, and benign use.

This approach lets Claude Fable 5 support safer work such as secure coding, patch planning, log analysis, reverse engineering for defense, and cybersecurity education, while blocking requests that create a clearer path to harm.

CategoryExamplesHow Anthropic handles it
Prohibited useRansomware, wipers, malware development, command-and-control infrastructure, defense evasionBlocked
High-risk dual useExploit development, privilege escalation, lateral movement, high-uplift vulnerability discoveryBlocked for now
Low-risk dual useOSINT, known vulnerability identification, cryptographic protocol testingGenerally allowed with safety checks
Benign useSecure coding, patch management, log review, incident response, educationAllowed with minimal monitoring

High-Risk Requests Still Trigger Refusals

Anthropic says Fable 5 blocks requests that could meaningfully improve offensive cyber capability. These include malware delivery, credential theft workflows, exploit chaining, cyber-physical sabotage, covert channels, and attempts to evade detection.

The company also blocks some ambiguous requests through what it calls a safety margin. That means the model can refuse a prompt that is probably harmless if it resembles a risky cybersecurity request.

Anthropic says the tradeoff may create false positives, but it reduces the chance that harmful requests get through. In some cases, flagged Fable 5 requests can route to Claude Opus 4.8 instead.

Vulnerability Discovery Gets Extra Scrutiny

Anthropic draws a line between routine vulnerability discovery and high-uplift vulnerability discovery. If other public tools or widely available models can already find the same issue, Anthropic treats the request as lower risk.

The company treats a request as higher risk when Fable 5 could discover novel or complex vulnerabilities that other tools cannot find. Anthropic says that type of output could give attackers a meaningful new advantage.

This explains why some security prompts may receive different results. A request about defensive patching may pass, while a similar-looking request that could unlock exploit development may be blocked.

Cyber Jailbreak Severity Framework Explained

Anthropic also proposed a Cyber Jailbreak Severity framework, or CJS, to rate how dangerous a cyber jailbreak is. The goal is to help AI developers, governments, and researchers discuss jailbreak risk with a shared vocabulary.

The proposed scale runs from CJS-0 to CJS-4. Anthropic says the levels should be treated as exponential, meaning each higher level represents a much greater risk than the previous one.

The Cyber Jailbreak Severity framework focuses on real-world attacker uplift. A jailbreak that only reveals routine information would score low, while a jailbreak that unlocks broad offensive capability could score high or critical.

CJS levelSeverityScore range
CJS-0Informational0
CJS-1Low1 to 3.5
CJS-2Medium4 to 6.5
CJS-3High7 to 8.5
CJS-4Critical9 to 10

Four Factors Decide Jailbreak Severity

The draft framework scores cyber jailbreaks across four axes. The first is capability gain, which measures how far the jailbreak moves an attacker beyond existing tools or public information.

The second is breadth, which measures whether the same jailbreak works only on one narrow task or across many targets, vulnerability types, or offensive categories.

The final two factors are ease of weaponization and discoverability. These measure how quickly a technique can become useful in an attack and how easily threat actors could find or reproduce it.

  • Capability gain scores from 0 to 4 points.
  • Breadth of capability gain scores from 0 to 2 points.
  • Ease of weaponization scores from 0 to 2 points.
  • Discoverability scores from 0 to 2 points.
  • The combined score maps to a CJS severity band.

Final Ratings Can Move Up

Anthropic says the calculated CJS score should act as a floor. Reviewers can raise the final severity if the framework underestimates real-world risk, but they should not lower it below the calculated score.

Examples include a jailbreak that exposes an unpatched fundamental vulnerability, a finding with no near-term mitigation, or several linked findings that become more dangerous when combined.

The framework does not cover every type of AI jailbreak. Anthropic says non-cyber issues, including system prompt extraction, sit outside CJS because they do not create the same cybersecurity risk.

Project Glasswing Partners Helped Shape the Draft

Anthropic says the proposed standard is being developed with Amazon, Microsoft, Google, and other partners in Project Glasswing.

Project Glasswing is Anthropic’s defensive cybersecurity effort for trusted organizations. It focuses on using advanced AI to help defenders review code, find weaknesses, and secure critical systems.

The program also helps explain the difference between Fable 5 and Mythos 5. Fable 5 is built for broader use with stronger safeguards, while Mythos 5 has fewer restrictions and remains limited to trusted cybersecurity partners.

Fable 5 and Mythos 5 Have Different Access Rules

Anthropic introduced Fable 5 and Mythos 5 as two configurations of the same underlying model family. In the Claude Fable 5 and Mythos 5 announcement, Anthropic said Fable 5 is a Mythos-class model made safer for general use.

Mythos 5, by contrast, is aimed at a small group of trusted cyberdefenders and infrastructure providers. Anthropic says it has stronger cybersecurity capabilities and therefore needs stricter access controls.

The Fable 5 redeployment post also says Anthropic trained an improved classifier after a report showed a way to bypass safeguards for certain cybersecurity outputs.

HackerOne Program Opens for Cyber Jailbreak Reports

Anthropic is asking researchers to report meaningful cyber jailbreaks through a dedicated HackerOne program.

The program focuses on findings that create real cyber capability uplift, not minor prompts that only trigger harmless policy misses.

Anthropic is also requesting feedback at [email protected]. That signals the company still views the CJS scale as an early draft rather than a finished industry standard.

Why the Safeguards Matter

The new framework shows that AI jailbreaks are becoming a cybersecurity and policy issue. A minor bypass and a jailbreak that unlocks broad offensive capability should not receive the same urgency.

Clear severity bands could help companies decide when to patch, when to delay a model launch, and when to involve government or industry partners.

The challenge is balance. Security teams need advanced AI tools for secure coding, patch management, threat hunting, malware analysis, and incident response, but the same systems can create risk if jailbreaks unlock offensive workflows.

What Users Should Expect

Claude Fable 5 users should expect stricter cyber filtering than in earlier Claude models. Some legitimate prompts may be blocked if the classifier sees a possible path to misuse.

The official Claude Fable 5 page says many cybersecurity and biology queries can route automatically to Opus 4.8 if the safeguards flag them.

Users doing routine defensive work should still be able to ask for help with safer tasks such as secure coding guidance, patch planning, log review, incident-response documentation, and security education.

What Comes Next

Anthropic’s next challenge will be reducing false positives without weakening safeguards. Overblocking can frustrate developers and researchers, but underblocking could expose dangerous capabilities.

The broader AI industry will also need to decide whether CJS can become a shared standard. If other labs adopt similar scoring, cyber jailbreak disclosure could become more consistent and easier to triage.

For now, the Anthropic cyber jailbreak bounty gives researchers a formal reporting path, while Glasswing remains the route for more controlled defensive cybersecurity access. The Fable 5 and Mythos 5 model split shows how Anthropic wants to separate broad access from trusted cyber access, and the Fable 5 product page makes clear that safeguards remain part of the general-use experience.

FAQ

What did Anthropic announce for Claude Fable 5?

Anthropic published more details about Claude Fable 5’s cybersecurity safeguards and introduced a draft Cyber Jailbreak Severity framework for rating cyber jailbreak risk.

Does Claude Fable 5 block all cybersecurity requests?

No. Claude Fable 5 does not block all cybersecurity work. Anthropic says it allows benign defensive tasks such as secure coding, patch management, log analysis, incident response, and security education.

What is the Cyber Jailbreak Severity framework?

The Cyber Jailbreak Severity framework, or CJS, is Anthropic’s proposed scale for rating cyber jailbreaks from CJS-0 informational to CJS-4 critical based on real-world attacker capability gain.

What factors decide a CJS score?

Anthropic’s draft framework scores cyber jailbreaks based on capability gain, breadth of capability gain, ease of weaponization, and discoverability. These scores combine into an initial severity band.

How can researchers report Claude Fable 5 jailbreaks?

Researchers can report qualifying cyber jailbreaks through Anthropic’s dedicated HackerOne program. Anthropic is also accepting feedback on the proposed framework at [email protected].

Readers help support VPNCentral. We may get a commission if you buy through our links. Tooltip Icon

Read our disclosure page to find out how can you help VPNCentral sustain the editorial team Read more

User forum

0 messages