Twelve months ago we ran our first AI-assisted security audit. What started as an experiment has become its own discipline — with tools, workflows and a list of pitfalls nobody had written down.
What changed
Second-generation language models read well-structured code almost as reliably as an experienced reviewer. We use them in three stages:
- Static triage: models scan diffs and PRs, flag suspicious patterns (auth bypass, unsafe deserialization, race conditions) and rank them by likelihood. This does not replace SAST; it catches what linters cannot grasp, namely what the code is meant to do.
- Hypothesis testing — for each finding the pipeline builds an automated reproduction attempt from the code path. False positives are filtered out early because they refuse to reproduce.
- Contextual rating — we check whether the issue is actually exploitable in the customer’s setup. An RCE in a service without a public endpoint isn’t an RCE as far as the incident playbook is concerned.
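The three stages above can be sketched as a small pipeline. This is a minimal illustration, not our production tooling: the `Finding` fields and the `reproduce` and `is_reachable` callbacks are hypothetical stand-ins for the model output, the reproduction harness and the deployment check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Finding:
    rule: str            # e.g. "auth-bypass" (hypothetical label)
    path: str            # file the model flagged
    likelihood: float    # model's ranking score, assumed to be in [0, 1]
    reproduced: bool = False
    exploitable: bool = False


def run_pipeline(
    candidates: Iterable[Finding],
    reproduce: Callable[[Finding], bool],
    is_reachable: Callable[[Finding], bool],
) -> List[Finding]:
    # Stage 1: static triage. Rank flagged patterns, most likely first.
    triaged = sorted(candidates, key=lambda f: f.likelihood, reverse=True)

    confirmed: List[Finding] = []
    for finding in triaged:
        # Stage 2: hypothesis testing. Anything that does not reproduce is dropped.
        finding.reproduced = reproduce(finding)
        if not finding.reproduced:
            continue
        # Stage 3: contextual rating. Record whether it is exploitable in this setup.
        finding.exploitable = is_reachable(finding)
        confirmed.append(finding)
    return confirmed
```

Keeping reproduction and reachability as injected callbacks means the same pipeline can run against different customers' setups without changing the triage logic.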
What did not work
- Pure prompt audits (“scan this for vulnerabilities”) return lists that are 80% hallucination. We build tooling, not consumer chats.
- Fully automated tickets: without human triage the queue fills with noise. We let the LLM pre-sort findings, not close them.
- “Just pick the biggest model”: cost-prohibitive, and the security-relevant quality jump is smaller than the marketing slides claim.
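The "pre-sort, not close" rule from the second point can be enforced in code rather than by convention. A rough sketch, assuming a hypothetical `QueueItem` type and an `actor` string that distinguishes model from human; a real ticket system would use its own identity and permission model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Status(Enum):
    OPEN = "open"
    CONFIRMED = "confirmed"
    DISMISSED = "dismissed"


@dataclass
class QueueItem:
    title: str
    model_score: float          # model's priority estimate (hypothetical field)
    status: Status = Status.OPEN


def presort(items: Iterable[QueueItem]) -> List[QueueItem]:
    """Order the review queue by model score. Never touches the status."""
    return sorted(items, key=lambda item: item.model_score, reverse=True)


def resolve(item: QueueItem, status: Status, actor: str) -> QueueItem:
    """Only a human reviewer may confirm or dismiss an item."""
    if actor != "human":
        raise PermissionError("model output may reorder the queue, not resolve items")
    item.status = status
    return item
```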
What we’d do again
- Version results like code. Findings live in Git, linked to a hash of the source. Re-audits stay reproducible.
- Treat confidence scores as a tax, not a garnish. We only trust model output that we have cross-checked against a second method (static analysis, symbolic execution, manual review).
- Keep the human. Models read code; they rarely understand the operation behind it. The hardest findings (who can trigger this, and when?) still come from talking to ops teams, not from tokens.
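The first two points above translate fairly directly into a record format. The sketch below is one possible shape, not our actual schema: the field names are hypothetical, and it assumes "a hash of the source" means a SHA-256 of the file contents (a Git commit hash would serve the same purpose).

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Second methods named in the text; the identifiers themselves are assumptions.
SECOND_METHODS = {"static_analysis", "symbolic_execution", "manual_review"}


@dataclass
class FindingRecord:
    finding_id: str
    path: str
    source_sha256: str                 # pins the finding to the audited source
    model_confidence: float
    confirmed_by: List[str] = field(default_factory=list)

    def is_trusted(self) -> bool:
        # The model's own confidence is deliberately ignored here:
        # trust requires confirmation by at least one second method.
        return any(method in SECOND_METHODS for method in self.confirmed_by)


def make_record(finding_id: str, path: str, source: bytes,
                model_confidence: float) -> FindingRecord:
    digest = hashlib.sha256(source).hexdigest()
    return FindingRecord(finding_id, path, digest, model_confidence)


def to_json(record: FindingRecord) -> str:
    # Sorted keys and fixed indentation keep Git diffs stable between re-audits.
    return json.dumps(asdict(record), indent=2, sort_keys=True) + "\n"


def is_stale(record: FindingRecord, current_source: bytes) -> bool:
    # A changed hash means the finding must be re-audited, not carried over.
    return record.source_sha256 != hashlib.sha256(current_source).hexdigest()
```

Committing the output of `to_json` next to the audited code gives each finding a history, and `is_stale` shows at re-audit time which findings still refer to unchanged source.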
Where we go from here
2026 is the year AI-assisted security pipelines move from consulting offerings into product work. Our bet: whoever builds clean tooling now, treating models as inspectors — not autopilots — will have a measurable lead in mean time to patch in 18 months.
We see this shift in customer engagements every day. What we don’t see is this discipline disappearing within the next 24 months.
Curious what AI-driven vulnerability management would actually look like in your stack? Drop us a line — we’ll reply personally.