Apex launches as an AI pentesting agent that attacks live apps in black-box mode
Pensar has launched Apex, an AI-powered penetration testing agent that attacks running applications in black-box mode, without source code, predefined attack paths, or manual hints. The company says Apex is built to find and verify real vulnerabilities in live apps, giving security teams a faster way to test modern software that now ships at AI-assisted speed.
The main idea is simple. Apex does not just scan code or flag suspicious patterns. It acts more like an autonomous tester that explores an application, maps the attack surface, and then tries to exploit weaknesses the way a real attacker would. Pensar says developers can run it directly from the terminal in autonomous /pentest mode, while security engineers can use an interactive /operator mode for deeper investigations and exploit chaining.
Pensar also released Argus alongside Apex. Argus is an open benchmark suite with 60 Dockerized vulnerable web applications built to test AI pentesting agents against modern stacks and harder real-world flaw classes, including multi-step chains, race conditions, GraphQL issues, JWT attacks, WAF bypass, and multi-tenant isolation failures.
What Apex is and why Pensar says it matters
Pensar describes Apex as an AI-powered penetration testing CLI for black-box and white-box testing. The open-source GitHub repo says the tool runs autonomous agents directly in the terminal and supports developer workflows, CI/CD integration, and more advanced security engineering use cases.
That pitch comes at a time when security teams face a real speed problem. Development cycles are faster, release pipelines are more automated, and AI coding tools are pushing more code into production. Pensar positions Apex as a verification layer that tests the deployed application itself, instead of relying only on code scanning or periodic manual reviews. This framing appears in reporting on the launch, though I did not find a primary Pensar page that publishes every performance claim.
What Argus adds to the story
Argus matters because it gives Pensar a public benchmark on which to demonstrate Apex’s capabilities. According to the GitHub repository, the suite includes 60 self-contained vulnerable web apps across Node.js, Python, Go, Java, PHP, and Ruby, with coverage that ranges from simple injection bugs to exploit chains that combine several weaknesses.
The repo also shows why Pensar built its own benchmark. The Argus README says existing pentesting benchmarks lean heavily toward PHP, lack enough coverage for modern vulnerability classes, and do not test chained exploitation often enough. Argus tries to fill that gap with 8 multi-step chains, 31 hard challenges, and scenarios that include cloud, infrastructure, and WAF or IDS evasion.
What kinds of vulnerabilities the benchmark covers
| Area | Examples listed in Argus |
|---|---|
| Injection flaws | SQL injection, NoSQL injection, LDAP injection, command injection, ORM injection |
| Auth and access issues | JWT confusion, OAuth bypass, MFA bypass, auth bypass, IDOR |
| Server-side bugs | SSRF, SSTI, SpEL injection, XXE, path traversal |
| Logic and race flaws | Double-spend race conditions, stock bypass, business logic abuse |
| Modern app chains | Multi-tenant breaches, CI/CD poisoning, service mesh attacks, Kubernetes compromise |
| Defense evasion | WAF bypass, IDS evasion, blind exploitation paths |
Those categories come straight from the benchmark inventory and coverage notes in the Argus repository. They show that Pensar is aiming beyond simple scanner-style tests and into scenarios that often require context, chaining, and trial-and-error reasoning.
Benchmark makeup at a glance
| Metric | Argus detail |
|---|---|
| Total applications | 60 |
| Multi-step chains | 8 |
| Easy challenges | 2 |
| Medium challenges | 27 |
| Hard challenges | 31 |
| Largest stack | Node.js / Express (24 apps) |
| Multi-service apps | 14 |
| Language ecosystems | Node.js, Python, Go, Java, PHP, Ruby |
The benchmark composition supports Pensar’s claim that it wants a more production-like test bed. Node.js leads the stack mix, but the set also includes multi-service targets and infrastructure-oriented cases that are harder to reduce to one bug and one request.
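As a quick sanity check on the table above, the short Python sketch below confirms that the three difficulty tiers add up to the 60 applications and converts the counts into percentage shares. The variable and function names are illustrative only. The counts are the ones listed in the table, and the percentages are derived here, not quoted from the Argus repository.

```python
# Counts copied from the "Benchmark makeup at a glance" table.
difficulty_counts = {"easy": 2, "medium": 27, "hard": 31}
total_apps = 60
node_express_apps = 24
multi_service_apps = 14

# The difficulty tiers should account for every application in the suite.
assert sum(difficulty_counts.values()) == total_apps


def share(count: int, total: int = total_apps) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Derived shares (not figures published by Pensar).
difficulty_share = {tier: share(n) for tier, n in difficulty_counts.items()}
node_share = share(node_express_apps)
multi_service_share = share(multi_service_apps)

print(difficulty_share)
print(f"Node.js / Express: {node_share}%  multi-service: {multi_service_share}%")
```

Read this way, just over half of the suite is rated hard, and Node.js / Express accounts for 40% of the applications.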
Performance claims around Apex
Reporting on the launch says Apex achieved a 35% pass rate on the 60-challenge Argus benchmark, ahead of PentestGPT at 30% and Raptor at 27%. The same report says Apex reached 80% on the 10 hardest challenges using Claude Opus 4.6, compared with 70% for PentestGPT and 60% for Raptor, and that Apex discovered 271 vulnerabilities across the full run. I found those numbers in coverage of the release, but I did not find a primary Pensar page or public benchmark report that fully details the methodology behind every one of those comparison figures.
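To put the reported percentages in concrete terms, the sketch below converts them into approximate challenge counts. It assumes that a pass rate is simply solved challenges divided by total challenges, with every challenge weighted equally. The coverage does not confirm that scoring rule, so the counts are rough implied figures, not published results.

```python
# Pass rates as reported in launch coverage (percent).
reported = {
    "Apex": {"full": 35, "hardest": 80},
    "PentestGPT": {"full": 30, "hardest": 70},
    "Raptor": {"full": 27, "hardest": 60},
}
# Challenge-set sizes named in the same coverage.
set_sizes = {"full": 60, "hardest": 10}


def implied_solved(pass_rate_pct: int, total: int) -> int:
    """Nearest whole number of challenges implied by a pass rate.

    Assumption: pass rate = solved / total, equally weighted.
    """
    return round(pass_rate_pct * total / 100)


implied = {
    agent: {name: implied_solved(pct, set_sizes[name]) for name, pct in rates.items()}
    for agent, rates in reported.items()
}

for agent, counts in implied.items():
    print(f"{agent}: ~{counts['full']}/60 overall, ~{counts['hardest']}/10 hardest")
```

Under that assumption, the gap between Apex and PentestGPT works out to roughly three challenges on the full set and one on the hardest ten.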
That does not invalidate the claims, but it does matter. Benchmark headlines are useful, yet security buyers usually want reproducible runs, detailed scoring rules, and independent testing before treating leaderboard results as settled. The open-sourcing of Argus and Apex makes that kind of outside validation more plausible over time.
Why this launch stands out
- Apex is open source and already available on GitHub.
- Argus is also public, which gives researchers a shared benchmark instead of a closed internal test.
- The benchmark focuses on newer web stacks and chained exploitation, not only older single-bug labs.
- Pensar is clearly positioning Apex as a continuous offensive testing layer for CI, staging, and production-like environments.
What security teams should keep in mind
Apex fits a real market trend. More companies now want autonomous security testing that sits between traditional scanners and expensive manual pentests. Still, any offensive security agent needs guardrails, careful authorization, and reliable validation. Pensar’s own repo includes a responsible-use notice that limits the tool to authorized testing.
The bigger question is not whether AI agents can find vulnerabilities. They clearly can in at least some controlled settings. The real test is whether they can do it consistently, safely, and with enough context to reduce false positives and missed chains in messy production environments. Apex and Argus make that conversation more concrete, because outside researchers can now inspect the tooling and try to reproduce the results themselves.
FAQ
Apex is an AI-powered penetration testing tool from Pensar that runs autonomous agents for black-box and white-box testing from the terminal.
Argus is Pensar’s open benchmark suite of 60 Dockerized vulnerable web applications designed to evaluate AI-powered pentesting agents.
Apex does not need source code in black-box mode. Pensar says it can test running applications without source code, hints, or predefined attack paths, although the tool also supports white-box workflows.
I found the headline performance claims in coverage of the launch, but I did not find a primary public Pensar report that fully documents all comparison results and scoring details. The repos are public, which should make outside validation easier.