LLM-Generated Mythic Agents Show How Disposable Red-Team Tooling Is Moving From Prompt to Deployment
Security researchers at SpecterOps have shown that large language models can generate basic Mythic agents from a written prompt, then test and prepare them for deployment with help from an automated validation harness. The SpecterOps research frames this as “disposable tooling,” where a team can create short-lived agents for one assessment instead of maintaining the same implant for years.
The finding matters for defenders because disposable tooling weakens static detection. If a new agent can be generated quickly in a different language or structure, signatures that rely on known code patterns may lose value faster than before.
Mythic is a post-exploitation and command-and-control framework used in authorized red-team operations. The Mythic documentation describes it as a cross-platform framework with a plug-and-play architecture for agents, communication channels and operational customization.
What SpecterOps Built
The project started with a direct question: could an LLM take a short specification and produce a working Mythic agent without a human steering each step? Early attempts produced code that compiled but failed in real operation because of incorrect API assumptions, key exchange problems and Docker path issues.
That led SpecterOps to build a structured testing environment called Oracle. Instead of relying on prompting alone, the model had to generate code, run tests, inspect logs, fix failures and repeat the process until the agent passed validation.
Mythic's design suits this type of automation. As the Mythic GitHub repository describes, the framework separates core infrastructure from payload types and C2 profiles, so agents can be developed and installed as separate components.
| Part of the workflow | Role in the research | Why it matters |
|---|---|---|
| Prompt specification | Defines the agent goal, target platform and supported commands | Turns tool creation into a repeatable request |
| Oracle harness | Runs staged validation and forces fixes after failures | Prevents the model from declaring success too early |
| LabKit | Supports controlled execution and process checks on Windows | Gives the model feedback from a real test host |
| Mythicd | Handles Mythic deployment tasks and container logs | Reduces brittle interaction with the server |
The Testing Harness Was the Breakthrough
The most important part of the work was not the first generated code. It was the validation loop around it. The model needed a way to learn from real failures instead of relying on confidence in its own output.
The first tier used local validation, unit tests and protocol tests against a mock Mythic server. The second tier moved to a live Mythic instance and a Windows target, where the generated agent had to check in, complete key exchange and execute supported commands.
The third tier used a QA sub-agent with a clean context window. This reviewer tested the release build and returned a pass or fail result. If it failed, the main model had to fix the agent and restart validation from the beginning.
- Generated agents were tested before release.
- Failures triggered code changes and repeated validation.
- The workflow reduced brittle manual debugging.
- The agents remained basic stage-zero tooling, not long-term polished implants.
- The research showed a path toward faster one-off tooling for authorized teams.
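The tiered loop described above, in which any failure forces a fix and validation restarts from the first tier, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Oracle harness itself: the names `run_tiers`, `validate`, `fix` and `max_rounds` are hypothetical, and each check is simply a function that returns `None` on a pass or a failure message.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a check returns None on pass, or a failure message.
Check = Callable[[str], Optional[str]]
Tier = Tuple[str, List[Check]]


def run_tiers(build: str, tiers: List[Tier]) -> Optional[str]:
    """Run every tier in order and return the first failure, or None if all pass."""
    for tier_name, checks in tiers:
        for check in checks:
            failure = check(build)
            if failure is not None:
                return f"{tier_name}: {failure}"
    return None


def validate(
    build: str,
    tiers: List[Tier],
    fix: Callable[[str, str], str],
    max_rounds: int = 5,
) -> Tuple[str, bool, int]:
    """Repeat fix-and-revalidate until all tiers pass or the round budget runs out.

    Every fix restarts validation from the first tier, so a late-stage
    failure cannot be patched without re-proving the earlier stages.
    """
    for round_no in range(1, max_rounds + 1):
        failure = run_tiers(build, tiers)
        if failure is None:
            return build, True, round_no
        build = fix(build, failure)  # hypothetical stand-in for the model's repair step
    return build, False, max_rounds
```

The design choice worth noting is that success is decided only by the checks, never by the component that produced the build, which mirrors the article's point about preventing the model from declaring success too early.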
Why Mythic Makes This Easier
Mythic’s architecture helped the experiment because agents, containers and communication profiles can be handled separately. The official Mythic docs say the platform uses a web front end, Docker containers, GraphQL APIs, WebSockets, PostgreSQL and RabbitMQ to support modular operations.
That modular design gives an automated system clear pieces to generate and validate. Instead of building an entire command-and-control stack, the model can focus on the agent and the integration needed for Mythic.
The Mythic C2 profile documentation explains that C2 profiles define how an agent communicates with Mythic for tasking and responses. This separation helps explain why agent generation can become more repeatable when a framework already provides the surrounding infrastructure.
What the Results Showed
After the Oracle harness was in place, SpecterOps said development averaged just over two hours per agent. Later tests produced working stage-zero agents in Python, Go, Zig, C# and Rust.
The SpecterOps post was clear that the generated code was not polished long-term implant code. The point was that a usable single-use agent could now be created quickly enough to change how red teams think about tooling.

That has direct defensive implications. MITRE ATT&CK lists Mythic as an open-source, cross-platform post-exploitation and command-and-control platform, and notes that deployed Mythic C2 servers have been observed as part of potentially malicious infrastructure.
| Defender concern | Why disposable tooling changes the problem |
|---|---|
| Static signatures | New agents can look different even when they serve a similar purpose |
| Binary pattern matching | Language and structure can change between generated builds |
| Tool reputation | One-off tools may not appear in malware databases before use |
| Detection engineering | Behavior, infrastructure and protocol patterns become more important |
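The contrast in the table can be shown with a toy example. The sketch below assumes two placeholder byte strings standing in for separate builds and a hypothetical list of event names; none of these are real telemetry fields or real samples. It shows only that an exact-hash blocklist misses a changed build, while a check on the ordered sequence of observed behavior still matches.

```python
import hashlib
from typing import List

# Placeholder byte strings standing in for two builds of similar tooling.
build_a = b"build A: check in, exchange keys, receive tasking"
build_b = b"build B: check in, exchange keys, receive tasking"

# A blocklist that only knows about the first build.
known_bad_hashes = {hashlib.sha256(build_a).hexdigest()}


def hash_match(sample: bytes) -> bool:
    """Exact-hash lookup: any byte-level change defeats it."""
    return hashlib.sha256(sample).hexdigest() in known_bad_hashes


def behavior_match(observed_events: List[str]) -> bool:
    """True if the expected steps appear in order, allowing unrelated events between them."""
    expected = ["checkin", "key_exchange", "tasking"]  # hypothetical event names
    remaining = iter(observed_events)
    # 'step in remaining' consumes the iterator, so order is enforced.
    return all(step in remaining for step in expected)
```

For instance, `hash_match(build_b)` is false even though the two builds differ by a single character, while `behavior_match(["dns", "checkin", "key_exchange", "sleep", "tasking"])` is true.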
Why Defenders Should Focus on Behavior
Disposable tooling does not make detection impossible. It shifts the best signals. Defenders should focus less on exact file hashes and more on how agents behave after execution.
That means looking for unusual check-in timing, suspicious key exchange behavior, unexpected C2 traffic patterns, abnormal parent-child process chains and activity that maps to known post-exploitation behavior. The MITRE Mythic entry is useful for mapping the framework to techniques such as application-layer protocols, encrypted channels, automated collection and fallback channels.
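As one illustration of the check-in timing signal, a defender could measure how regular the gaps are between outbound connections from a host to a single destination. The sketch below is an assumption-laden starting point rather than a production detection: the function names and the thresholds `max_cv` and `min_events` are hypothetical, and real agents often add jitter that would require a looser threshold or a different statistic.

```python
from statistics import mean, pstdev
from typing import List, Optional


def interval_cv(timestamps: List[float]) -> Optional[float]:
    """Coefficient of variation of the gaps between consecutive events.

    Values near 0 mean evenly spaced connections. Returns None when
    there are too few events or the mean gap is zero.
    """
    if len(timestamps) < 3:
        return None
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return None
    return pstdev(gaps) / average_gap


def looks_periodic(
    timestamps: List[float], max_cv: float = 0.1, min_events: int = 6
) -> bool:
    """Flag a connection series whose spacing is unusually regular."""
    if len(timestamps) < min_events:
        return False
    cv = interval_cv(timestamps)
    return cv is not None and cv <= max_cv
```

A series such as `[0, 60, 121, 180, 241, 300, 360]` (seconds) is flagged as periodic, while an irregular series such as `[0, 5, 200, 210, 900, 905, 2000]` is not. Because the measure depends only on timing, it is unaffected by the language or structure of whatever generated the traffic.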
Security teams should also review how they handle legitimate red-team tooling. Internal allowlists, assessment infrastructure and engagement-specific indicators should not become permanent blind spots that attackers can copy or abuse.
- Prioritize behavioral analytics over single-file signatures.
- Monitor new or unusual outbound C2 patterns.
- Review endpoint activity around agent execution and tasking.
- Separate authorized red-team infrastructure from production trust decisions.
- Use detections that survive minor code and language changes.
The AI Access Question
The research also sits inside a broader debate about how advanced models should support cybersecurity work. OpenAI’s Trusted Access for Cyber program was created to give vetted defenders more useful access for legitimate security work while keeping safeguards against misuse.
SpecterOps used the research to show how authorized teams may soon build and test tooling with much less manual effort. At the same time, defenders should assume that similar automation will influence real adversary workflows.

OpenAI’s Trusted Access program describes this tension clearly: frontier cyber capabilities can help defenders move faster, but they also require verification, scope controls and safeguards because malicious actors may try to use similar tools.
What Security Teams Should Do Now
Organizations should not wait for fully automated offensive tooling to become common before updating their detection strategy. The shift has already started, and basic stage-zero agents are enough to challenge defenses that depend too heavily on signatures.
The C2 profile documentation shows why communication patterns deserve close attention. Agents still need to communicate, receive tasking and return results, even when their source code changes from one build to another.
Defenders should treat generated offensive tooling as a reason to strengthen telemetry, not as a reason to abandon detection. The Mythic project also demonstrates how modular red-team platforms create clear places for defenders to study expected behavior, infrastructure patterns and operator workflows.
FAQ
What are LLM-generated Mythic agents?
LLM-generated Mythic agents are Mythic-compatible agents created by a large language model from a written specification. In the SpecterOps research, the model generated basic stage-zero agents and used a testing harness to validate them before deployment.
What is disposable tooling?
Disposable tooling means creating short-lived tools for a specific assessment or task instead of maintaining the same implant for a long time. The goal is to reduce reuse and make each tool less predictable.
Does the research disclose a new Mythic vulnerability or attack?
No. The research describes an automation workflow for building and testing Mythic agents. It does not disclose a new Mythic vulnerability or a confirmed criminal attack using the same method.
How does disposable tooling affect detection?
It challenges traditional detection because generated agents can vary by language, structure and implementation. Static signatures, hashes and fixed binary patterns become less reliable when tooling can change quickly.
What should defenders focus on?
Defenders should focus on behavior, telemetry and C2 patterns instead of relying only on file signatures. Useful signals include unusual check-ins, key exchange behavior, outbound traffic patterns, process activity and post-exploitation actions.