Evaluating AI Coding Agents with TeamCity and SWE-Bench - Ernst Haagsman

DataTalks.Club • April 29, 2026

Why It Matters

SWE‑bench provides a rigorous, production‑grade metric for AI coding assistants, enabling firms to track performance, avoid overfitting, and safely integrate AI into software development pipelines.

Key Takeaways

• SWE‑bench evaluates AI agents using real GitHub issues
• JetBrains’ Junie writes tests and runs them in response to prompts
• Benchmark mimics Arrange‑Act‑Assert unit‑test structure
• Ground‑truth patches and test suites verify agent solutions
• Continuous updates needed to prevent model memorization

Summary

The session introduced SWE‑bench, a benchmark that measures software‑development AI agents by feeding them authentic GitHub issues and checking whether they can produce correct fixes. Ernst Haagsman, a product leader at JetBrains, demonstrated how the company’s AI coding agent, Junie, generates unit tests, executes them, and iterates on the code in a Go project, illustrating the end‑to‑end workflow.

Key insights included the benchmark’s data format: each task contains the repository snapshot, issue description, gold‑standard patch, and additional test cases. Agents receive the problem statement, propose a solution, and the harness runs the supplied tests to score success. This mirrors traditional unit‑testing’s arrange‑act‑assert pattern, but accounts for AI’s inherent nondeterminism by aggregating pass/fail outcomes across many instances.
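The arrange-act-assert loop described above can be sketched roughly as follows. This is a minimal illustration, not the SWE-bench harness itself: the field names follow the public SWE-bench dataset, while `Agent`, `TestRunner`, `evaluate`, and `resolve_rate` are hypothetical names introduced here, and the real harness runs the tests inside Docker rather than through a Python callback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class TaskInstance:
    """One benchmark task; field names follow the public SWE-bench dataset."""
    instance_id: str
    repo: str
    base_commit: str        # repository snapshot the agent starts from
    problem_statement: str  # issue text shown to the agent
    patch: str              # gold patch, kept as ground truth only
    test_patch: str         # tests that accompany the fix, applied by the harness


# Hypothetical callback shapes, assumed for this sketch.
Agent = Callable[[str, str, str], str]            # (repo, commit, issue) -> candidate patch
TestRunner = Callable[[TaskInstance, str], bool]  # (task, candidate patch) -> tests passed?


def evaluate(tasks: Iterable[TaskInstance], agent: Agent,
             run_tests: TestRunner) -> Dict[str, bool]:
    """Run every task once and record whether the candidate patch passed."""
    results: Dict[str, bool] = {}
    for task in tasks:
        # Arrange: the agent only receives the snapshot and the issue text.
        inputs = (task.repo, task.base_commit, task.problem_statement)
        # Act: the agent proposes a patch.
        candidate_patch = agent(*inputs)
        # Assert: the supplied tests decide pass or fail.
        results[task.instance_id] = run_tests(task, candidate_patch)
    return results


def resolve_rate(results: Dict[str, bool]) -> float:
    """Aggregate pass/fail outcomes into a single success fraction."""
    return sum(results.values()) / len(results) if results else 0.0
```

Because a single run of a nondeterministic agent says little, the aggregate from `resolve_rate` over many instances is the number worth tracking from build to build.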

Haagsman highlighted practical details, such as configuring model prompts, exposing tool‑calling capabilities, and ensuring the AI does not inadvertently see the gold patches. He also noted challenges like model memorization of benchmark solutions and the need for periodic dataset refreshes, referencing community discussions on Hacker News.
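One way to keep the gold patch out of the agent's context is an allowlist of fields, plus a check that no withheld value ended up in the prompt. The sketch below assumes records shaped like rows of the public SWE-bench dataset; `AGENT_VISIBLE_FIELDS`, `agent_view`, and `leaked_fields` are hypothetical names, and this is not necessarily how the demo repository handles it.

```python
from typing import Any, Dict, List

# Assumption: only these dataset fields are safe to show to the agent.
AGENT_VISIBLE_FIELDS = ("instance_id", "repo", "base_commit", "problem_statement")


def agent_view(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record containing only allowlisted fields."""
    return {key: record[key] for key in AGENT_VISIBLE_FIELDS if key in record}


def leaked_fields(prompt: str, record: Dict[str, Any]) -> List[str]:
    """Name every withheld string field whose full value appears in the prompt."""
    leaked = []
    for key, value in record.items():
        if key in AGENT_VISIBLE_FIELDS:
            continue
        if isinstance(value, str) and value.strip() and value in prompt:
            leaked.append(key)
    return leaked
```

An allowlist fails closed: a field added to the dataset later stays hidden until someone decides it is safe to expose. The substring check only catches verbatim leaks, so it is a guard rail rather than proof of isolation.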

For developers, SWE‑bench offers a reproducible framework to benchmark custom agents, align development roadmaps, and quantify improvements. By integrating the harness into a CI/CD server such as TeamCity, organizations can continuously validate AI‑driven code assistance against real‑world bugs, driving safer, more reliable automation.

Original Description

In this talk, Ernst Haagsman, Product Leader at JetBrains, shares his expertise on scaling developer tools from his early days on the PyCharm team to his current role leading TeamCity and AI integration. We explore the practical challenges of evaluating AI coding agents using SWE-bench and how to build a robust CI/CD pipeline for non-deterministic AI outputs.

You’ll learn about:

- The architecture of SWE-bench and how it uses real-world GitHub issues as benchmarks.

- How to apply the "Arrange, Act, Assert" framework to AI agent evaluation.

- Technical strategies for caching dependencies and using Docker to reduce evaluation costs.

- Scaling parallel AI workloads using TeamCity, Kotlin DSL, and AWS infrastructure.

- Techniques for managing LLM API rate limits and handling stochastic model behavior.

- Building custom data sets for specialized AI agents like customer support bots or transcribers.

- The future of "Agentic Development" with a first look at JetBrains Air.
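The rate-limit point in the list above comes down to capping how many evaluations call the LLM API at once. The talk does this with TeamCity shared resource locks across builds; the sketch below is only an in-process analogy using a semaphore, and `run_with_limit` and its parameters are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def run_with_limit(jobs: Sequence[Callable[[], T]],
                   max_concurrent_calls: int,
                   workers: int = 8) -> List[T]:
    """Run jobs in parallel, letting at most max_concurrent_calls run at once."""
    gate = threading.Semaphore(max_concurrent_calls)

    def guarded(job: Callable[[], T]) -> T:
        # Each job must acquire the shared gate before doing its work.
        with gate:
            return job()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order in the returned results.
        return list(pool.map(guarded, jobs))
```

Keeping the worker count and the gate size separate mirrors the CI setup: many agents can prepare environments in parallel while only a few hold the API quota at any moment.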

Links:

- Repository: https://github.com/jetbrains/teamcity-ai-agent-testing-demo

- Dataset: https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite

TIMECODES:

00:00 Live demo: Go-based monitoring for Synology NAS

06:11 Automated unit test generation using Junie

09:16 Core testing framework: Arrange, Act, Assert

12:51 Mapping GitHub issues to ground truth and gold patches

16:00 Strategies for model contamination and data leaks

19:12 Four-stage evaluation workflow in Docker

22:53 Optimizing CI/CD costs with dependency caching

26:50 Scaling evaluations via Kotlin DSL and parallel builds

30:12 Orchestrating AWS instances for high-concurrency tasks

34:23 Infrastructure as code: Cloud profiles and launch templates

37:54 Technical deep dive: JetBrains AI and TeamCity integration

42:20 Environment preparation for specific task instances

46:47 Agent execution: Running Junie via shell scripts

50:13 Interpreting JSON reports and tagging build status

54:14 Managing API rate limits with shared resource locks

58:24 Performance analysis of high-parallelism workflows

1:02:13 Comparing Python-centric vs. multilingual benchmarks

1:06:51 Data visualization and success rate statistics

1:11:11 Benchmarking LLM performance: Gemini vs. Claude

1:14:57 Custom data set creation for specialized AI agents

1:19:14 TeamCity licensing for individuals and startups

1:22:57 Future of coding: JetBrains Air agentic environment
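For the 50:13 segment on reports and build status, a script can turn an evaluation report into TeamCity service messages printed to the build log. The sketch below assumes a report that maps each instance id to an object with a boolean `resolved` field; the report in the demo repository may be shaped differently, and `resolveRate` and the tag names are made up for illustration. The `buildStatisticValue`, `buildStatus`, and `addBuildTag` messages and the `|` escaping are standard TeamCity features.

```python
import json
from typing import List


def tc_escape(value: str) -> str:
    """Escape a value for a TeamCity service message attribute."""
    # The escape character itself must be handled first.
    for raw, escaped in (("|", "||"), ("'", "|'"), ("\n", "|n"),
                         ("\r", "|r"), ("[", "|["), ("]", "|]")):
        value = value.replace(raw, escaped)
    return value


def service_messages(report_json: str) -> List[str]:
    """Build service messages from a report (assumed shape: id -> {"resolved": bool})."""
    report = json.loads(report_json)
    total = len(report)
    resolved = sum(1 for entry in report.values() if entry.get("resolved"))
    rate = resolved / total if total else 0.0
    summary = f"{resolved}/{total} resolved"
    # Hypothetical tag names, chosen only for this sketch.
    tag = "all-resolved" if total and resolved == total else "has-unresolved"
    return [
        f"##teamcity[buildStatisticValue key='resolveRate' value='{rate:.4f}']",
        f"##teamcity[buildStatus text='{tc_escape(summary)}']",
        f"##teamcity[addBuildTag '{tc_escape(tag)}']",
    ]
```

Printing each returned line to standard output during a build step is enough for TeamCity to pick it up.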

This workshop is designed for Machine Learning Engineers, Data Scientists, and DevOps professionals who are building or evaluating AI agents and need to move from manual testing to automated, scalable benchmarks. It is particularly valuable for those looking to integrate LLM evaluation into their existing CI/CD workflows.

Connect with Ernst:

- LinkedIn - https://www.linkedin.com/in/ernsthaagsman/

Connect with DataTalks.Club:

- Join the community - https://datatalks.club/slack.html

- Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/r?cid=ZjhxaWRqbnEwamhzY3A4ODA5azFlZ2hzNjBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

- Check other upcoming events - https://lu.ma/dtc-events

- GitHub: https://github.com/DataTalksClub

- LinkedIn - https://www.linkedin.com/company/datatalks-club/

- Twitter - https://twitter.com/DataTalksClub

- Website - https://datatalks.club/

Connect with Alexey:

- Twitter - https://twitter.com/Al_Grigor

- LinkedIn - https://www.linkedin.com/in/agrigorev/

Check our free online courses:

- ML Engineering course - http://mlzoomcamp.com

- Data Engineering course - https://github.com/DataTalksClub/data-engineering-zoomcamp

- MLOps course - https://github.com/DataTalksClub/mlops-zoomcamp

- LLM course - https://github.com/DataTalksClub/llm-zoomcamp

- Open-source LLM course: https://github.com/DataTalksClub/open-source-llm-zoomcamp

- AI Dev Tools course: https://github.com/DataTalksClub/ai-dev-tools-zoomcamp

👉🏼 Read about all our courses in one place - https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html

👋🏼 Support/inquiries

If you want to support our community, use this link - https://github.com/sponsors/alexeygrigorev

If you’re a company, reach us at alexey@datatalks.club

#AI #MachineLearning #AIAgents #SWEbench #JetBrains #TeamCity #SoftwareEngineering #LLM #DevOps #CICD #DataScience #Python #Automation #CodingAgents #KotlinDSL #AWS #Docker #TechWorkshop #AIResearch #datatalksclub
