Evaluating AI Coding Agents with TeamCity and SWE-Bench - Ernst Haagsman

DataTalks.Club, Apr 29, 2026

Why It Matters

SWE‑bench provides a rigorous, production‑grade metric for AI coding assistants, enabling firms to track performance, avoid overfitting, and safely integrate AI into software development pipelines.

Key Takeaways

  • SWE‑bench evaluates AI agents using real GitHub issues
  • JetBrains’ Junie generates unit tests and runs them in response to prompts
  • Benchmark mimics Arrange‑Act‑Assert unit‑test structure
  • Ground‑truth patches and test suites verify agent solutions
  • Continuous updates needed to prevent model memorization

Summary

The session introduced SWE‑bench, a benchmark that evaluates AI software‑development agents by giving them real GitHub issues and checking whether they produce correct fixes. Ernst Haagsman, Product Leader at JetBrains, demonstrated with Go examples how JetBrains’ AI coding agent, Junie, can generate unit tests, run them, and iterate on the code, illustrating the end‑to‑end workflow.

The talk walked through the benchmark’s data format: each task contains a repository snapshot, the issue description, a gold‑standard patch, and additional test cases. The agent receives the problem statement and proposes a solution, and the harness runs the supplied tests to score it. This mirrors the arrange‑act‑assert pattern of traditional unit testing, but because AI output is nondeterministic, results are reported as aggregate pass/fail outcomes across many instances rather than as a single verdict.
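The task format and scoring loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual SWE‑bench harness: the `TaskInstance` field names, the `Agent` and `TestRunner` signatures, and the stub agent are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TaskInstance:
    """One benchmark task. Field names are illustrative assumptions."""
    instance_id: str
    repo_snapshot: str        # e.g. a commit hash identifying the repository state
    issue_text: str           # the problem statement shown to the agent
    gold_patch: str           # reference fix, kept away from the agent
    tests: Tuple[str, ...]    # names of tests that must pass after the fix


# Hypothetical signatures: an agent maps a prompt to a patch, and a test
# runner applies that patch and reports pass/fail per test name.
Agent = Callable[[str], str]
TestRunner = Callable[[TaskInstance, str], Dict[str, bool]]


def evaluate(task: TaskInstance, agent: Agent, run_tests: TestRunner) -> bool:
    # Arrange: the agent sees only the problem statement.
    prompt = task.issue_text
    # Act: the agent proposes a patch.
    candidate_patch = agent(prompt)
    # Assert: every supplied test must pass with the candidate patch applied.
    results = run_tests(task, candidate_patch)
    return all(results.get(name, False) for name in task.tests)


def resolve_rate(outcomes: List[bool]) -> float:
    """Fraction of evaluated instances that were resolved."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


if __name__ == "__main__":
    task = TaskInstance("demo-1", "abc123", "Off-by-one in pager",
                        "<gold diff>", ("test_last_page",))
    stub_agent = lambda prompt: "<candidate diff>"          # stands in for a real agent
    stub_runner = lambda t, patch: {"test_last_page": True}  # stands in for a Docker test run
    outcomes = [evaluate(task, stub_agent, stub_runner) for _ in range(4)]
    print(f"resolved {resolve_rate(outcomes):.0%} of runs")
```

Reporting a rate over many instances, rather than a single pass/fail, is what lets a nondeterministic agent be compared from one run to the next.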

Haagsman highlighted practical details, such as configuring model prompts, exposing tool‑calling capabilities, and ensuring the AI does not inadvertently see the gold patches. He also noted challenges like model memorization of benchmark solutions and the need for periodic dataset refreshes, referencing community discussions on Hacker News.
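One way to keep the gold patch out of the agent's view is to filter each task before building the prompt and to run a crude leak check on the result. The sketch below is an assumption-laden illustration: the dictionary keys, the choice to also hide the test list, and the `min_len` threshold are hypothetical and not taken from the talk.

```python
from typing import Dict

# Assumption: these keys hold material the agent must not see.
# Hiding the test list as well is a choice made for this sketch.
HIDDEN_FIELDS = frozenset({"gold_patch", "tests"})


def agent_view(task: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the task without the hidden fields."""
    return {key: value for key, value in task.items() if key not in HIDDEN_FIELDS}


def leaks_gold_patch(prompt: str, gold_patch: str, min_len: int = 8) -> bool:
    """Crude check: does any non-trivial added line of the gold patch
    appear verbatim in the prompt? min_len skips short lines such as '}'."""
    for line in gold_patch.splitlines():
        # Lines added by a unified diff start with '+'; '+++' is a file header.
        if line.startswith("+") and not line.startswith("+++"):
            added = line[1:].strip()
            if len(added) >= min_len and added in prompt:
                return True
    return False
```

A verbatim check like this only catches accidental prompt leaks. It does not detect a model that has memorized the solution during training, which is why the dataset itself needs periodic refreshes.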

For developers, SWE‑bench offers a reproducible framework to benchmark custom agents, align development roadmaps, and quantify improvements. By integrating the harness into CI pipelines like TeamCity, organizations can continuously validate AI‑driven code assistance against real‑world bugs, driving safer, more reliable automation.
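As a rough sketch of the CI side, a build step could read the harness report and print TeamCity service messages so the resolve rate shows up as a build statistic and a tag. The report layout (`resolved` and `unresolved` lists), the statistic key `sweBenchResolveRate`, and the tag format are assumptions for this example. The `##teamcity[...]` lines follow TeamCity's service message syntax as best understood here and should be checked against the TeamCity documentation before use.

```python
from typing import Dict, List


def escape_tc(value: str) -> str:
    """Escape a value for a TeamCity service message.
    The pipe is escaped first so later replacements are not double-escaped."""
    replacements = [("|", "||"), ("'", "|'"), ("\n", "|n"),
                    ("\r", "|r"), ("[", "|["), ("]", "|]")]
    for old, new in replacements:
        value = value.replace(old, new)
    return value


def service_messages(report: Dict[str, List[str]]) -> List[str]:
    """Build service messages from a hypothetical report with
    'resolved' and 'unresolved' lists of instance ids."""
    resolved = len(report["resolved"])
    total = resolved + len(report["unresolved"])
    rate = resolved / total if total else 0.0
    tag = escape_tc(f"resolved-{resolved}-of-{total}")
    return [
        f"##teamcity[buildStatisticValue key='sweBenchResolveRate' value='{rate:.3f}']",
        f"##teamcity[addBuildTag '{tag}']",
    ]


if __name__ == "__main__":
    # In a real build step the report would be loaded from the harness output file.
    demo_report = {"resolved": ["a", "b", "c"], "unresolved": ["d"]}
    for message in service_messages(demo_report):
        print(message)
```

Publishing the rate as a statistic rather than failing the build on any unresolved instance fits the stochastic nature of the agent: a trend across builds is more informative than a single red or green result.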

Original Description

In this talk, Ernst Haagsman, Product Leader at JetBrains, shares his expertise on scaling developer tools from his early days on the PyCharm team to his current role leading TeamCity and AI integration. We explore the practical challenges of evaluating AI coding agents using SWE-bench and how to build a robust CI/CD pipeline for non-deterministic AI outputs.
You’ll learn about:
- The architecture of SWE-bench and how it uses real-world GitHub issues as benchmarks.
- How to apply the "Arrange, Act, Assert" framework to AI agent evaluation.
- Technical strategies for caching dependencies and using Docker to reduce evaluation costs.
- Scaling parallel AI workloads using TeamCity, Kotlin DSL, and AWS infrastructure.
- Techniques for managing LLM API rate limits and handling stochastic model behavior.
- Building custom data sets for specialized AI agents like customer support bots or transcribers.
- The future of "Agentic Development" with a first look at JetBrains Air.
TIMECODES:
00:00 Live demo: Go-based monitoring for Synology NAS
06:11 Automated unit test generation using Junie
09:16 Core testing framework: Arrange, Act, Assert
12:51 Mapping GitHub issues to ground truth and gold patches
16:00 Strategies for model contamination and data leaks
19:12 Four-stage evaluation workflow in Docker
22:53 Optimizing CI/CD costs with dependency caching
26:50 Scaling evaluations via Kotlin DSL and parallel builds
30:12 Orchestrating AWS instances for high-concurrency tasks
34:23 Infrastructure as code: Cloud profiles and launch templates
37:54 Technical deep dive: JetBrains AI and TeamCity integration
42:20 Environment preparation for specific task instances
46:47 Agent execution: Running Junie via shell scripts
50:13 Interpreting JSON reports and tagging build status
54:14 Managing API rate limits with shared resource locks
58:24 Performance analysis of high-parallelism workflows
1:02:13 Comparing Python-centric vs. multilingual benchmarks
1:06:51 Data visualization and success rate statistics
1:11:11 Benchmarking LLM performance: Gemini vs. Claude
1:14:57 Custom data set creation for specialized AI agents
1:19:14 TeamCity licensing for individuals and startups
1:22:57 Future of coding: JetBrains Air agentic environment
This workshop is designed for Machine Learning Engineers, Data Scientists, and DevOps professionals who are building or evaluating AI agents and need to move from manual testing to automated, scalable benchmarks. It is particularly valuable for those looking to integrate LLM evaluation into their existing CI/CD workflows.
Connect with DataTalks.Club:
- Join the community - https://datatalks.club/slack.html
- Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
- Check other upcoming events - https://lu.ma/dtc-events
Check our free online courses:
- ML Engineering course - http://mlzoomcamp.com
👋🏼 Support/inquiries
If you want to support our community, use this link - https://github.com/sponsors/alexeygrigorev
If you’re a company, reach us at alexey@datatalks.club
#AI #MachineLearning #AIAgents #SWEbench #JetBrains #TeamCity #SoftwareEngineering #LLM #DevOps #CICD #DataScience #Python #Automation #CodingAgents #KotlinDSL #AWS #Docker #TechWorkshop #AIResearch #datatalksclub
