Black Hat Europe 2025 | Flaw And Order: Finding The Needle In The Haystack Of CodeQL Using LLMs
Why It Matters
By automating false-positive filtering of static-analysis results, the method promises faster, cheaper vulnerability discovery for security teams that would otherwise triage every finding by hand.
Key Takeaways
- Simple LLM prompts generate hallucinated vulnerabilities, not real CVEs.
- Combining CodeQL static analysis with an LLM reduces false positives.
- The "where" and "what" problems hinder LLM-only vulnerability detection.
- Context extraction (the full function) is essential for accurate LLM assessment.
- Indexing large codebases for context is time-consuming and impractical.
Summary
At Black Hat Europe 2025, Simcha Kosman of CyberArk Labs presented a method for finding software flaws by pairing CodeQL static analysis with large language models (LLMs). He argued that the hype around LLM-only vulnerability scans is misplaced, because simple prompts produce hallucinated issues that would be rejected by bug-bounty platforms.
Kosman highlighted two fundamental challenges: the “where” problem (locating the exact vulnerable line) and the “what” problem (identifying the vulnerability type). Industry efforts such as Google’s Big Sleep and OpenAI’s Aardvark each address one of these dimensions, but they rely on existing patches or commit monitoring, which limits their long-term efficacy.
His approach runs CodeQL across large repositories, generating tens of thousands of potential findings. Because static analysis yields a high false-positive rate, an LLM is then asked to confirm or discard each finding. Each finding already carries a precise location and a vulnerability type, so the “where” and “what” questions are answered before the LLM is involved. The key insight is that the LLM must receive the full function context, not just a single line, to make reliable judgments. That in turn requires sophisticated code indexing to retrieve surrounding code, macros, and type information.
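To make the pipeline concrete, here is a minimal Python sketch of the triage step. It is not the speaker's implementation. It assumes the CodeQL results are exported as SARIF 2.1.0, and the function names (`load_findings`, `enclosing_function`, `build_prompt`), the sample rule ID, the sample message, and the prompt wording are all hypothetical. The context extractor is a naive brace counter that ignores braces inside strings, comments, and macros, which is exactly the gap that real code indexing has to close. The LLM call itself is left out.

```python
import json


def load_findings(sarif_text):
    """Yield (rule_id, message, file_uri, line) for each result in a SARIF 2.1.0 log."""
    sarif = json.loads(sarif_text)
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Use only the first reported location of each result.
            for loc in result.get("locations", [])[:1]:
                phys = loc["physicalLocation"]
                yield (
                    result.get("ruleId", "unknown"),
                    result["message"]["text"],
                    phys["artifactLocation"]["uri"],
                    phys["region"]["startLine"],
                )


def enclosing_function(source, line_no):
    """Return the top-level brace block containing line_no (1-based), or None.

    Assumption: braces are balanced and never appear in strings or comments.
    """
    lines = source.splitlines()
    depth = 0
    start = None
    for idx, text in enumerate(lines, start=1):
        for ch in text:
            if ch == "{":
                if depth == 0:
                    start = idx
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0 and start is not None and start <= line_no <= idx:
                    # If the opening brace sits on its own line, keep the signature above it.
                    first = start - 1 if start > 1 and lines[start - 1].strip() == "{" else start
                    return "\n".join(lines[first - 1:idx])
    return None


def build_prompt(rule_id, message, uri, line_no, context):
    """Give the LLM the 'where', the 'what', and the surrounding function."""
    return (
        f"A static analyzer flagged {uri}:{line_no}.\n"
        f"Rule: {rule_id}\n"
        f"Analyzer message: {message}\n\n"
        f"Enclosing function:\n{context}\n\n"
        "Is this a true positive? Answer TRUE_POSITIVE or FALSE_POSITIVE, "
        "then give a one-sentence reason."
    )


# Illustrative sample data, not taken from the talk.
SAMPLE_SOURCE = """\
#include <string.h>

void copy_name(char *dst, const char *src)
{
    char buf[16];
    strcpy(buf, src);
    strcpy(dst, buf);
}

int main(void) { return 0; }
"""

SAMPLE_SARIF = json.dumps({
    "version": "2.1.0",
    "runs": [{"results": [{
        "ruleId": "cpp/unbounded-write",
        "message": {"text": "Call to strcpy may overflow the destination buffer."},
        "locations": [{"physicalLocation": {
            "artifactLocation": {"uri": "src/name.c"},
            "region": {"startLine": 6},
        }}],
    }]}],
})

# Stand-in for reading files from the analyzed repository.
sources = {"src/name.c": SAMPLE_SOURCE}

prompts = []
for rule_id, message, uri, line_no in load_findings(SAMPLE_SARIF):
    # Fall back to the single flagged line when no enclosing block is found.
    context = enclosing_function(sources[uri], line_no) or sources[uri].splitlines()[line_no - 1]
    prompts.append(build_prompt(rule_id, message, uri, line_no, context))

print(prompts[0])
```

Each prompt would then be sent to an LLM of choice, and findings judged false positives would be dropped from the review queue. Keeping one prompt per finding means every verdict can be traced back to a single CodeQL result.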
If refined, this hybrid pipeline could dramatically cut triage time for security teams and bug‑bounty programs, turning an otherwise endless manual review into a scalable process. However, practical obstacles—slow indexing of massive codebases and the need for richer context extraction—must be solved before widespread adoption.