DEF CON 33 Recon Village - Autonomous Video Hunter AI Agents for Real Time OSINT - Kevin Dela Rosa

DEF CON · Dec 31, 2025

Why It Matters

The tool enables rapid, low‑cost extraction of actionable intelligence from ever‑growing video data, reshaping how analysts conduct OSINT and threat monitoring.

Key Takeaways

  • AI agent transforms raw video into searchable OSINT insights instantly
  • Uses LLM planning to orchestrate vision, audio, and text tools
  • Demonstrated celebrity tracking and logo detection across TikTok clips
  • Runs on CPU, enabling deployment on standard laptops
  • Open-source components allow customization for diverse intelligence applications

Summary

Kevin Dela Rosa presented the Autonomous Video Hunter, an AI‑driven system that extracts real‑time open‑source intelligence from video streams. Built on his startup Cloud Glue, the proof‑of‑concept combines multimodal video processing, large‑language‑model planning, and a suite of vision and audio tools to turn unstructured footage into searchable, queryable data.

The architecture begins with a video listener that scrapes sources such as TikTok via a no‑code platform (Gumloop). Each clip is fed into a processor that runs a vision‑language model, generating dense scene descriptions, speech transcriptions, and on‑screen text. These outputs are indexed in a searchable context store. A LangGraph‑based analysis agent then creates a task plan, invokes sub‑agents for face detection, logo matching, or zero‑shot object detection, and finally compiles a Markdown intelligence report.
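The flow described above (context store, task plan, sub‑agents, Markdown report) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the presenter's code: every name here (`ClipRecord`, `ContextStore`, `stub_planner`, `run_agent`) is hypothetical, keyword matching stands in for embedding search, the planner is a fixed stub where the real system would call an LLM, and the sub‑agents are placeholders for actual face and logo detectors.

```python
from dataclasses import dataclass, field


@dataclass
class ClipRecord:
    """Text extracted from one clip (hypothetical field names)."""
    clip_id: str
    scene_description: str
    transcript: str
    on_screen_text: str

    def text(self) -> str:
        parts = [self.scene_description, self.transcript, self.on_screen_text]
        return " ".join(parts).lower()


@dataclass
class ContextStore:
    """Toy searchable store; keyword match stands in for embedding search."""
    records: list = field(default_factory=list)

    def add(self, record: ClipRecord) -> None:
        self.records.append(record)

    def search(self, query: str) -> list:
        terms = query.lower().split()
        return [r for r in self.records if all(t in r.text() for t in terms)]


# Placeholder sub-agents: a real system would run detectors on the frames.
def logo_agent(store: ContextStore, target: str) -> list:
    return [r.clip_id for r in store.search(target)]


def face_agent(store: ContextStore, target: str) -> list:
    return [r.clip_id for r in store.search(target)]


SUB_AGENTS = {"logo": logo_agent, "face": face_agent}


def stub_planner(question: str) -> list:
    """Stand-in for LLM planning: returns fixed (sub_agent, target) steps."""
    return [("logo", "starbucks"), ("face", "will smith")]


def run_agent(question: str, store: ContextStore, planner=stub_planner) -> str:
    """Execute each planned step and compile a Markdown report."""
    lines = [f"# Report: {question}", ""]
    for agent_name, target in planner(question):
        hits = SUB_AGENTS[agent_name](store, target)
        lines.append(f"- {agent_name} '{target}': {len(hits)} clip(s) {hits}")
    return "\n".join(lines)
```

Keeping the planner as a swappable argument mirrors the separation the summary describes: the plan decides which tools run, while each sub‑agent only needs the shared store and a target.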

Demo highlights included locating a Will Smith appearance in TikTok videos from Zurich, identifying Starbucks logos across fifteen clips, and a real‑time screen‑recorder integration that summarized desktop activity via the Model Context Protocol. Under the hood, the system leverages CPU‑friendly models such as DeepFace, RetinaFace, OpenCV’s SIFT descriptors, and the Grounded‑SAM zero‑shot detector, all coordinated by a “deep agent” that maintains memory across steps.
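One common way SIFT descriptors are used for logo matching is Lowe's ratio test: a descriptor from the reference logo counts as matched only when its nearest neighbour in the frame is clearly closer than the second nearest. The summary does not say whether the tool uses this test, so the sketch below is an assumption for illustration. The function names and thresholds are hypothetical, and the 2‑D tuples are toy stand‑ins for real 128‑dimensional SIFT descriptors (which OpenCV produces via `cv2.SIFT_create().detectAndCompute`).

```python
import math


def ratio_test_matches(query_desc, ref_desc, ratio=0.75):
    """Return (query_index, ref_index) pairs that pass Lowe's ratio test."""
    matches = []
    for qi, q in enumerate(query_desc):
        # Rank all reference descriptors by Euclidean distance to q.
        ranked = sorted((math.dist(q, r), ri) for ri, r in enumerate(ref_desc))
        if len(ranked) < 2:
            continue  # the test needs a second-nearest neighbour
        (best, best_i), (second, _) = ranked[0], ranked[1]
        # Keep the match only if the best is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((qi, best_i))
    return matches


def logo_present(frame_desc, logo_desc, min_matches=2, ratio=0.75):
    """Hypothetical decision rule: enough logo descriptors found in the frame."""
    return len(ratio_test_matches(logo_desc, frame_desc, ratio)) >= min_matches
```

The ratio threshold trades recall for precision: a lower value rejects more ambiguous matches, which matters when a frame contains repetitive textures that resemble the logo.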

By automating the extraction of visual, auditory, and textual cues from massive video feeds, the Autonomous Video Hunter democratizes OSINT capabilities for security analysts, marketers, and investigators. Its open‑source, CPU‑only design lowers barriers to entry, suggesting a near future in which large‑scale video intelligence becomes a routine component of threat‑hunting and situational‑awareness workflows.
