Black Hat Europe 2025 | Flaw And Order: Finding The Needle In The Haystack Of CodeQL Using LLMs

Black Hat | May 29, 2026

Why It Matters

By automating false‑positive filtering, the method promises faster, cheaper vulnerability discovery, letting security teams spend review time on the findings most likely to be real.

Key Takeaways

  • Simple LLM prompts generate hallucinated vulnerabilities, not real CVEs.
  • Combining CodeQL static analysis with an LLM reduces false positives.
  • "Where" and "what" problems hinder LLM-only vulnerability detection.
  • Context extraction (full function) is essential for accurate LLM assessment.
  • Indexing large codebases for context is time‑consuming and impractical.

Summary

At Black Hat Europe 2025, Simcha Kosman of CyberArk presented a method for finding software flaws by pairing CodeQL static analysis with large language models (LLMs). He argued that the hype around LLM‑only vulnerability scans is misplaced, as simple prompts produce hallucinated issues that would be rejected by bug‑bounty platforms.

Kosman highlighted two fundamental challenges: the “where” problem (locating the exact vulnerable line) and the “what” problem (identifying the vulnerability type). Community attempts such as Google’s Big Sleep and OpenAI’s Aardvark address one of these dimensions but rely on existing patches or commit monitoring, limiting long‑term efficacy.

His approach runs CodeQL across large repositories, generating tens of thousands of potential findings. Because static analysis yields a high false‑positive rate, an LLM is fed the precise location and vulnerability type to confirm or discard each issue. The key insight is that the LLM must receive full function context—not just a single line—to make reliable judgments, prompting the need for sophisticated code‑indexing to retrieve surrounding code, macros, and type information.
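The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's tool: it assumes the CodeQL results were exported as SARIF (the standard JSON format CodeQL can emit), and the helper names `load_findings`, `enclosing_function`, and `build_prompt` are hypothetical. The brace-counting context extractor is a deliberately naive stand-in for the code indexing the talk calls for, and the prompt wording is illustrative only.

```python
import json


def load_findings(sarif_text):
    """Flatten a SARIF document into (rule_id, file, line, message) tuples."""
    findings = []
    for run in json.loads(sarif_text).get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                phys = loc.get("physicalLocation", {})
                findings.append((
                    result.get("ruleId", "unknown"),
                    phys.get("artifactLocation", {}).get("uri", ""),
                    phys.get("region", {}).get("startLine", 0),
                    result.get("message", {}).get("text", ""),
                ))
    return findings


def enclosing_function(source, line_no):
    """Return the top-level brace block around line_no (1-based), or "".

    Assumption: a naive heuristic for C-like code. It ignores braces inside
    strings and comments, and it also returns whatever precedes the block
    since the previous top-level block (signature, comments, includes).
    """
    lines = source.splitlines()
    depth = 0
    start = 0       # index of the first line belonging to the current block
    prev_end = -1   # index of the last line of the previous top-level block
    for i, line in enumerate(lines):
        for ch in line:
            if ch == "{":
                if depth == 0:
                    start = prev_end + 1
                depth += 1
            elif ch == "}" and depth > 0:
                depth -= 1
                if depth == 0:
                    if start <= line_no - 1 <= i:
                        return "\n".join(lines[start:i + 1]).strip()
                    prev_end = i
    return ""


def build_prompt(finding, context):
    """Give the LLM the 'where' and the 'what' plus the whole function."""
    rule_id, path, line, message = finding
    return (
        f"A static analyzer flagged {path}:{line} with rule '{rule_id}': "
        f"{message}\n"
        "Using the full function below, answer TRUE_POSITIVE or "
        "FALSE_POSITIVE and explain why.\n\n"
        f"{context}"
    )
```

Each finding would be turned into one prompt and sent to whichever model is in use. The sketch shows why a single flagged line is not enough: the function body is what lets the model see bounds checks or guards that make a finding harmless.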

If refined, this hybrid pipeline could dramatically cut triage time for security teams and bug‑bounty programs, turning an otherwise endless manual review into a scalable process. However, practical obstacles—slow indexing of massive codebases and the need for richer context extraction—must be solved before widespread adoption.

Original Description

Running CodeQL's built-in queries on Redis gave me over 6,800 potential issues. Doable, maybe. But when I tried FFmpeg, I got over 51,000. That's way too much for me. And how many of those are real vulnerabilities? Probably around 0.01%. The sheer number of false positives makes static code analysis impractical - who wants to manually sift through tens of thousands of results just to find a few actual security flaws?
To fix this, we built an open-source tool that fuses CodeQL with an LLM-driven agent. This agent autonomously navigates the code, running targeted queries to extract only the relevant context. On top of that, we introduced Guided Questioning, an advanced reasoning technique that keeps the LLM focused, improving accuracy even for complex vulnerabilities.
Using this approach, we reduced false positives by up to 97% and uncovered more than a dozen real-world security issues in Linux, Apache, FFmpeg, Bullet3, Libvips, libretro, Linenoise, and other widely used open-source projects.
By: Simcha Kosman | Senior Security Researcher, CyberArk
