Anthropic Details Claude Fable 5 Cybersecurity Safeguards and Jailbreak Framework
Anthropic has published new details about the cybersecurity safeguards protecting Claude Fable 5, including how it blocks risky cyber requests and how it plans to rate serious jailbreaks.
The disclosure follows the global return of Claude Fable 5 after temporary export restrictions were lifted. In its redeployment update, Anthropic said Fable 5 returned on July 1, 2026, with improved cyber safeguards.
Access content across the globe at the highest speed rate.
70% of our readers choose Private Internet Access
70% of our readers choose ExpressVPN
Browse the web from multiple devices with industry-standard security protocols.
Faster dedicated servers for specific actions (currently at summer discounts)
The company says Fable 5 does not block every cybersecurity request. Instead, it uses safety classifiers to separate routine defensive work from requests that could help attackers build malware, exploit systems, or bypass protections.
Claude Fable 5 Uses Cybersecurity Classifiers
Anthropic says cybersecurity is difficult to moderate because many tasks have both defensive and offensive uses. A vulnerability review can help a developer fix code, but it can also help an attacker find a weakness.
In its Fable 5 safeguards documentation, the company says its classifiers sort cyber requests into four categories: prohibited use, high-risk dual use, low-risk dual use, and benign use.
This approach lets Claude Fable 5 support safer work such as secure coding, patch planning, log analysis, reverse engineering for defense, and cybersecurity education, while blocking requests that create a clearer path to harm.
| Category | Examples | How Anthropic handles it |
|---|---|---|
| Prohibited use | Ransomware, wipers, malware development, command-and-control infrastructure, defense evasion | Blocked |
| High-risk dual use | Exploit development, privilege escalation, lateral movement, high-uplift vulnerability discovery | Blocked for now |
| Low-risk dual use | OSINT, known vulnerability identification, cryptographic protocol testing | Generally allowed with safety checks |
| Benign use | Secure coding, patch management, log review, incident response, education | Allowed with minimal monitoring |
High-Risk Requests Still Trigger Refusals
Anthropic says Fable 5 blocks requests that could meaningfully improve offensive cyber capability. These include malware delivery, credential theft workflows, exploit chaining, cyber-physical sabotage, covert channels, and attempts to evade detection.
The company also blocks some ambiguous requests through what it calls a safety margin. That means the model can refuse a prompt that is probably harmless if it resembles a risky cybersecurity request.
Anthropic says the tradeoff may create false positives, but it reduces the chance that harmful requests get through. In some cases, flagged Fable 5 requests can route to Claude Opus 4.8 instead.
Vulnerability Discovery Gets Extra Scrutiny
Anthropic draws a line between routine vulnerability discovery and high-uplift vulnerability discovery. If other public tools or widely available models can already find the same issue, Anthropic treats the request as lower risk.
The company treats a request as higher risk when Fable 5 could discover novel or complex vulnerabilities that other tools cannot find. Anthropic says that type of output could give attackers a meaningful new advantage.
This explains why some security prompts may receive different results. A request about defensive patching may pass, while a similar-looking request that could unlock exploit development may be blocked.
Cyber Jailbreak Severity Framework Explained
Anthropic also proposed a Cyber Jailbreak Severity framework, or CJS, to rate how dangerous a cyber jailbreak is. The goal is to help AI developers, governments, and researchers discuss jailbreak risk with a shared vocabulary.
The proposed scale runs from CJS-0 to CJS-4. Anthropic says the levels should be treated as exponential, meaning each higher level represents a much greater risk than the previous one.
The Cyber Jailbreak Severity framework focuses on real-world attacker uplift. A jailbreak that only reveals routine information would score low, while a jailbreak that unlocks broad offensive capability could score high or critical.
| CJS level | Severity | Score range |
|---|---|---|
| CJS-0 | Informational | 0 |
| CJS-1 | Low | 1 to 3.5 |
| CJS-2 | Medium | 4 to 6.5 |
| CJS-3 | High | 7 to 8.5 |
| CJS-4 | Critical | 9 to 10 |
Four Factors Decide Jailbreak Severity
The draft framework scores cyber jailbreaks across four axes. The first is capability gain, which measures how far the jailbreak moves an attacker beyond existing tools or public information.
The second is breadth, which measures whether the same jailbreak works only on one narrow task or across many targets, vulnerability types, or offensive categories.
The final two factors are ease of weaponization and discoverability. These measure how quickly a technique can become useful in an attack and how easily threat actors could find or reproduce it.
- Capability gain scores from 0 to 4 points.
- Breadth of capability gain scores from 0 to 2 points.
- Ease of weaponization scores from 0 to 2 points.
- Discoverability scores from 0 to 2 points.
- The combined score maps to a CJS severity band.
Final Ratings Can Move Up
Anthropic says the calculated CJS score should act as a floor. Reviewers can raise the final severity if the framework underestimates real-world risk, but they should not lower it below the calculated score.
Examples include a jailbreak that exposes an unpatched fundamental vulnerability, a finding with no near-term mitigation, or several linked findings that become more dangerous when combined.
The framework does not cover every type of AI jailbreak. Anthropic says non-cyber issues, including system prompt extraction, sit outside CJS because they do not create the same cybersecurity risk.
Project Glasswing Partners Helped Shape the Draft
Anthropic says the proposed standard is being developed with Amazon, Microsoft, Google, and other partners in Project Glasswing.
Project Glasswing is Anthropic’s defensive cybersecurity effort for trusted organizations. It focuses on using advanced AI to help defenders review code, find weaknesses, and secure critical systems.
The program also helps explain the difference between Fable 5 and Mythos 5. Fable 5 is built for broader use with stronger safeguards, while Mythos 5 has fewer restrictions and remains limited to trusted cybersecurity partners.
Fable 5 and Mythos 5 Have Different Access Rules
Anthropic introduced Fable 5 and Mythos 5 as two configurations of the same underlying model family. In the Claude Fable 5 and Mythos 5 announcement, Anthropic said Fable 5 is a Mythos-class model made safer for general use.
Mythos 5, by contrast, is aimed at a small group of trusted cyberdefenders and infrastructure providers. Anthropic says it has stronger cybersecurity capabilities and therefore needs stricter access controls.
The Fable 5 redeployment post also says Anthropic trained an improved classifier after a report showed a way to bypass safeguards for certain cybersecurity outputs.
HackerOne Program Opens for Cyber Jailbreak Reports
Anthropic is asking researchers to report meaningful cyber jailbreaks through a dedicated HackerOne program.
The program focuses on findings that create real cyber capability uplift, not minor prompts that only trigger harmless policy misses.
Anthropic is also requesting feedback at [email protected]. That signals the company still views the CJS scale as an early draft rather than a finished industry standard.
Why the Safeguards Matter
The new framework shows that AI jailbreaks are becoming a cybersecurity and policy issue. A minor bypass and a jailbreak that unlocks broad offensive capability should not receive the same urgency.
Clear severity bands could help companies decide when to patch, when to delay a model launch, and when to involve government or industry partners.
The challenge is balance. Security teams need advanced AI tools for secure coding, patch management, threat hunting, malware analysis, and incident response, but the same systems can create risk if jailbreaks unlock offensive workflows.
What Users Should Expect
Claude Fable 5 users should expect stricter cyber filtering than in earlier Claude models. Some legitimate prompts may be blocked if the classifier sees a possible path to misuse.
The official Claude Fable 5 page says many cybersecurity and biology queries can route automatically to Opus 4.8 if the safeguards flag them.
Users doing routine defensive work should still be able to ask for help with safer tasks such as secure coding guidance, patch planning, log review, incident-response documentation, and security education.
What Comes Next
Anthropic’s next challenge will be reducing false positives without weakening safeguards. Overblocking can frustrate developers and researchers, but underblocking could expose dangerous capabilities.
The broader AI industry will also need to decide whether CJS can become a shared standard. If other labs adopt similar scoring, cyber jailbreak disclosure could become more consistent and easier to triage.
For now, the Anthropic cyber jailbreak bounty gives researchers a formal reporting path, while Glasswing remains the route for more controlled defensive cybersecurity access. The Fable 5 and Mythos 5 model split shows how Anthropic wants to separate broad access from trusted cyber access, and the Fable 5 product page makes clear that safeguards remain part of the general-use experience.
FAQ
Anthropic published more details about Claude Fable 5’s cybersecurity safeguards and introduced a draft Cyber Jailbreak Severity framework for rating cyber jailbreak risk.
No. Claude Fable 5 does not block all cybersecurity work. Anthropic says it allows benign defensive tasks such as secure coding, patch management, log analysis, incident response, and security education.
The Cyber Jailbreak Severity framework, or CJS, is Anthropic’s proposed scale for rating cyber jailbreaks from CJS-0 informational to CJS-4 critical based on real-world attacker capability gain.
Anthropic’s draft framework scores cyber jailbreaks based on capability gain, breadth of capability gain, ease of weaponization, and discoverability. These scores combine into an initial severity band.
Researchers can report qualifying cyber jailbreaks through Anthropic’s dedicated HackerOne program. Anthropic is also accepting feedback on the proposed framework at [email protected].
Read our disclosure page to find out how can you help VPNCentral sustain the editorial team Read more
User forum
0 messages