Comfy Internals | How We Got Four Rival AI Labs to Fight over Our Code Reviews

ComfyUI Blog, Jun 9, 2026

Key Takeaways

  • Four AI models from different labs review each PR
  • Runs in CI at a flat cost of $200/month
  • A judge model consolidates the eight reviews into at most ten findings
  • Detects bugs missed by humans, like hidden moderation defaults
  • Open‑source workflow enables any team to adopt the system

Pulse Analysis

As AI agents draft a growing share of production code, human review struggles to keep pace with the volume and with subtle, domain‑specific bugs. Comfy’s engineers reasoned that a single model reviewing code carries the blind spots of its own training, so they assembled a heterogeneous ensemble of four LLMs from different labs (OpenAI’s GPT‑5, Anthropic’s Claude, Google’s Gemini, and Moonshot’s Kimi) to get adversarial reviews whose blind spots are less likely to overlap. Each model runs two passes, one assuming the change is broken and another probing edge cases, for eight reviews per PR that together surface defects a lone reviewer would likely overlook.
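The four-model, two-pass fan-out described above can be sketched as a simple cross product. This is only an illustration of how the eight review jobs might be enumerated, not Comfy's actual workflow: the model labels come from the article, while the pass names, prompt wording, and the `build_review_jobs` helper are hypothetical.

```python
from itertools import product

# Model labels as named in the article; real API model identifiers may differ.
MODELS = ["gpt-5", "claude", "gemini", "kimi"]

# Pass names and prompt wording are illustrative assumptions.
PASSES = {
    "assume-broken": "Assume this change is broken and find the defect.",
    "edge-cases": "Probe this change for unhandled edge cases.",
}


def build_review_jobs(models=MODELS, passes=PASSES):
    """Return one review job per (model, pass) pair."""
    return [
        {"model": model, "pass": name, "prompt": prompt}
        for model, (name, prompt) in product(models, passes.items())
    ]


jobs = build_review_jobs()
for job in jobs:
    print(f'{job["model"]:>7}  {job["pass"]}')
```

Because each job is independent of the others, a list like this maps naturally onto a CI matrix in which every entry runs as its own parallel job.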

The eight reviews run as a GitHub Actions matrix, and their raw output goes to a dedicated judge model (Claude Opus) that validates each finding against the files actually changed and ranks the findings by severity. Capping the final report at ten high‑signal items keeps it from overwhelming developers while still surfacing critical bugs such as hidden moderation defaults or incorrect API contracts. The whole pipeline runs on a Cursor Ultra seat at a flat $200 per month, which keeps costs predictable for fast‑moving teams.
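The judge step amounts to three operations: discard findings that do not point at a changed file, rank what remains by severity, and cap the report at ten items. In the article the judge is itself a model, so the deterministic sketch below is only a stand-in for that logic. The `Finding` fields, the four-level severity scale, and the text-based deduplication are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed severity scale; lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    model: str      # which reviewer produced the finding
    file: str       # path the finding refers to
    severity: str   # one of the SEVERITY_RANK keys
    summary: str    # short description of the issue


def consolidate(findings, changed_files, limit=10):
    """Keep findings on changed files, drop duplicates, rank, and cap."""
    changed = set(changed_files)
    seen = set()
    kept = []
    for finding in findings:
        if finding.file not in changed:
            continue  # cannot be validated against the diff
        key = (finding.file, finding.summary.strip().lower())
        if key in seen:
            continue  # another reviewer already reported the same issue
        seen.add(key)
        kept.append(finding)
    # Unknown severities sort after all known ones; the sort is stable.
    kept.sort(key=lambda f: SEVERITY_RANK.get(f.severity, len(SEVERITY_RANK)))
    return kept[:limit]


reviews = [
    Finding("gpt-5", "api.py", "low", "Unused import"),
    Finding("claude", "api.py", "critical", "Moderation defaults to off"),
    Finding("gemini", "docs.md", "high", "Refers to a file outside the diff"),
]
for finding in consolidate(reviews, changed_files=["api.py"]):
    print(finding.severity, finding.file, finding.summary)
```

A model-based judge can merge findings that are worded differently but describe the same bug, which the exact-match deduplication here cannot do.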

Beyond Comfy, the open‑source workflow offers a blueprint for other organizations wrestling with AI‑generated code. It suggests that lineage diversity, meaning models drawn from distinct research labs, can compensate for the blind spots in any single model’s training data, a principle that could carry over to automated security testing. Formal benchmarking and dynamic model rotation remain future work, but the early results suggest that multi‑lab ensembles can provide a cost‑effective, high‑coverage safety net for software development.
