DevOps • AI

Fast & Asynchronous: Drift Your AI, Not Your GPU Bill // Artem Yushkovskiy

MLOps Community • February 19, 2026

Why It Matters

By shifting AI processing from costly cloud APIs to self‑hosted asynchronous actors, companies can slash expenses, eliminate rate limits, and reliably scale complex generative workloads in near‑real time.

Key Takeaways

  • Asynchronous actor framework eliminates batch-pipeline bottlenecks
  • Self-hosted GPU models cut cloud costs dramatically for enterprises
  • Message-driven routing enables dynamic, fault-tolerant AI workflows across services
  • Open-source Asya library abstracts infrastructure away from data scientists
  • Near-real-time processing scales from zero to hundreds of GPUs

Summary

The talk introduced Asya, an open-source asynchronous-actor framework designed to replace traditional batch pipelines for generative AI workloads. By decoupling each processing step into self-hosted GPU actors that communicate via message queues, the team at a global food-delivery platform eliminated rate limiting, reduced engineering overhead, and gained fine-grained control over scaling.

Initially, a Kubeflow pipeline calling external AI APIs hit random errors and consumed 60‑80% of engineering effort just to stay operational, while cloud API costs ballooned. The solution migrated the models in‑house, wrapped them in actors that auto‑scale on demand, and introduced a root‑step message format that carries payload enrichment through a cascade of actors. Built‑in error handling routes failed messages to retry or dead‑letter queues, and a lightweight synchronous gateway lets developers invoke complex flows via a simple HTTP call.
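The retry and dead-letter behavior described above can be sketched with plain asyncio queues. This is a minimal illustration, not the Asya API: the message fields (`payload`, `retries`) and the `MAX_RETRIES` limit are assumptions for the example.

```python
import asyncio

MAX_RETRIES = 3  # illustrative retry budget, not an Asya setting

async def actor(inbox, outbox, dlq, handler):
    """Consume messages, enrich the payload, and forward downstream.

    Failed messages are re-queued up to MAX_RETRIES times, then routed
    to the dead-letter queue with the error attached."""
    while True:
        msg = await inbox.get()
        try:
            msg["payload"] = handler(msg["payload"])  # enrich payload
            await outbox.put(msg)
        except Exception as exc:
            msg["retries"] = msg.get("retries", 0) + 1
            if msg["retries"] < MAX_RETRIES:
                await inbox.put(msg)   # simple retry: back onto the inbox
            else:
                msg["error"] = str(exc)
                await dlq.put(msg)     # give up: dead-letter queue
        finally:
            inbox.task_done()

async def demo():
    inbox, outbox, dlq = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()

    def enrich(payload):
        if payload is None:
            raise ValueError("empty payload")
        return payload.upper()

    task = asyncio.create_task(actor(inbox, outbox, dlq, enrich))
    await inbox.put({"payload": "caption this image"})
    await inbox.put({"payload": None})  # will fail and end up in the DLQ
    await inbox.join()                  # wait until every message settles
    task.cancel()
    return await outbox.get(), await dlq.get()

ok, dead = asyncio.run(demo())
print(ok["payload"])   # → CAPTION THIS IMAGE
print(dead["retries"])  # → 3
```

Re-queuing onto the same inbox keeps the sketch short; a production system would typically use a separate retry queue with backoff.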

Key examples highlighted the system’s performance: throughput reached the limits of the GPU cluster without hitting rate limits, and the architecture scaled from zero to 100 GPUs handling diffusion models. The framework supports near‑real‑time latencies measured in minutes rather than milliseconds, and developers can dynamically re‑route messages using an LLM‑powered router, enabling fan‑out/fan‑in patterns for future enhancements.
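The dynamic, message-driven routing mentioned above can be illustrated with a stub classifier standing in for the LLM router. The queue names and the `route` function are hypothetical, used only to show the pattern:

```python
import asyncio

def route(payload):
    """Stub router: a real deployment might call an LLM here to choose
    the next actor. Keyword matching stands in for the model."""
    if "image" in payload:
        return "diffusion"
    if "translate" in payload:
        return "llm"
    return "default"

async def router(inbox, queues):
    """Forward each message to the queue chosen by the router."""
    while True:
        msg = await inbox.get()
        await queues[route(msg["payload"])].put(msg)
        inbox.task_done()

async def demo():
    inbox = asyncio.Queue()
    queues = {name: asyncio.Queue() for name in ("diffusion", "llm", "default")}
    task = asyncio.create_task(router(inbox, queues))
    await inbox.put({"payload": "enhance this image"})
    await inbox.put({"payload": "translate this menu"})
    await inbox.join()
    task.cancel()
    return queues["diffusion"].qsize(), queues["llm"].qsize()

diffusion_n, llm_n = asyncio.run(demo())
print(diffusion_n, llm_n)  # → 1 1
```

Fan-out is the same pattern with the router putting one message on several queues; fan-in is an actor that waits for all correlated replies before forwarding.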

For enterprises, Asya offers a path to dramatically lower AI operating costs, avoid vendor lock-in, and accelerate the deployment of sophisticated AI pipelines. Its open-source release invites broader adoption and community contributions, positioning it as a foundational tool for AI-ops teams seeking scalable, cost-effective, and resilient workflows.
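The lightweight synchronous gateway the summary mentions amounts to a request/reply bridge over the async pipeline: publish a message with a correlation ID, then block until the matching reply arrives. A minimal sketch under those assumptions (all names are hypothetical):

```python
import asyncio
import uuid

async def gateway(request_q, replies, payload):
    """Synchronous-style entry point over an async pipeline.

    Registers a future keyed by correlation ID, publishes the request,
    and awaits the reply. In a real system this would sit behind an
    HTTP endpoint."""
    cid = uuid.uuid4().hex
    fut = asyncio.get_running_loop().create_future()
    replies[cid] = fut                     # register before publishing
    await request_q.put({"cid": cid, "payload": payload})
    return await fut                       # block until the pipeline replies

async def worker(request_q, replies):
    """Stand-in for the async AI flow: consume, compute, reply."""
    while True:
        msg = await request_q.get()
        result = msg["payload"][::-1]      # placeholder for real inference
        replies.pop(msg["cid"]).set_result(result)

async def demo():
    request_q, replies = asyncio.Queue(), {}
    task = asyncio.create_task(worker(request_q, replies))
    out = await gateway(request_q, replies, "hello")
    task.cancel()
    return out

result = asyncio.run(demo())
print(result)  # → olleh
```

A production gateway would also need a timeout on the awaited future so a lost message cannot hang the HTTP call indefinitely.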

Original Description

March 3rd at the Computer History Museum: CODING AGENTS CONFERENCE. Come join us while there are still tickets left.
https://luma.com/codingagents
Thanks to @ProsusGroup for collaborating on the Agents in Production Virtual Conference 2025.
Abstract //
Stop thinking of `POST /predict` when someone says "serving AI". At Delivery Hero, we've rethought Gen AI infrastructure from the ground up, with async message queues, actor-model microservices, and zero-to-infinity autoscaling - no orchestrators, no waste, no surprising GPU bills. Here's the paradigm shift: treat every AI step as an independent async actor (we call them "asyas"). Data ingestion? One asya. Prompt construction? Another. Smart model routing? Another. Pre-processing, analysis, backend logic, even agents — dozens of specialized actors coexist on the same GPU cluster and talk to each other, each scaling from zero to whatever capacity you need. The result? Dramatically lower GPU costs, true composability, and a maintainable system that actually matches how AI workloads behave. We'll show the evolution of our project - DAGs to distributed stateless async actors - and demonstrate how naturally this architecture serves real-world production needs. The framework is open-source as `Asya`. If time permits, we'll also discuss bridging these async pipelines with synchronous MCP servers when real-time responses are required. Come see why async isn't an optimization — it's a paradigm shift for AI infrastructure.
Bio //
Sr ML Engineer at Delivery Hero, for the last 7+ years building ML platforms and ML use-cases. Now scaling a global image auto-enhancement service that rides on a massive self-hosted Kubernetes infra. Passionate about MLOps, distributed systems, and anything that bends infrastructure to the will of AI.
A Prosus | MLOps Community Production
