DevOps Blogs - Page 6

Blog•Mar 15, 2026

SLIs, SLOs, and SLAs: How to Measure and Enforce System Reliability

System reliability engineering addresses hardware degradation, software bugs, and network partitions that can trigger cascading outages. The article distinguishes reliability from mere availability and stresses the need to eliminate single points of failure. It introduces Service Level Indicators, Objectives, and Agreements (SLIs, SLOs, SLAs) as measurable frameworks to enforce reliability targets. By adopting proactive monitoring and resilient design, organizations can safeguard business continuity.

By System Design Nuggets
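
As a rough illustration of how an SLI can be checked against an SLO, the sketch below computes a request-success SLI and the remaining error budget. The error-budget framing is a common SRE convention, not something the summary above states, and the function name, field names, and numbers are all hypothetical.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare a measured availability SLI against an SLO target.

    Assumption: the SLI is the fraction of successful requests, and the
    error budget is the number of failures the SLO tolerates in the window.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")

    # SLI: the measured fraction of requests that succeeded.
    sli = (total_requests - failed_requests) / total_requests
    # Failures the SLO allows over the same window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "budget_remaining": allowed_failures - failed_requests,
    }


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
report = error_budget(0.999, 1_000_000, 400)
```

An SLA would typically sit on top of this as a contractual threshold, usually looser than the internal SLO.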

Blog•Mar 15, 2026

Day 149: Orchestrating Your Log Processing Empire with Kubernetes

The post walks readers through turning a complex, distributed log‑processing stack—collectors, RabbitMQ, query engines, and storage—into a single Kubernetes deployment. By providing complete manifests, it shows how to launch the entire ecosystem with one command, while Kubernetes handles health checks,...

By Hands On System Design Course - Code Everyday

Blog•Mar 15, 2026

Preventing Cascading Failures: How to Decouple Microservices with Async Design

Modern microservice architectures often suffer cascading failures when a single downstream component slows or crashes, causing synchronous calls to block threads and exhaust memory. The blog explains how synchronous communication forces services to wait for network responses, leading to system-wide...

By System Design Nuggets

Blog•Mar 15, 2026

Kafka Vs. RabbitMQ: How to Choose the Right Message Queue for Your Microservices

Modern microservices rely on asynchronous messaging to avoid cascading failures. The article contrasts Kafka and RabbitMQ, outlining each broker’s architecture, delivery guarantees, and typical use cases. RabbitMQ is described as a smart‑broker with a push model and fine‑grained routing, while...

By System Design Nuggets

Blog•Mar 14, 2026

Troubleshooting Guide: Running Qwen3.5-35B with Reasoning & Tool Calling Using vLLM on Nvidia DGX Spark

The post details how to run the Qwen3.5-35B MoE model (35B parameters, 4‑bit AWQ quantization, and a 131K context window) on Nvidia DGX Spark using vLLM. Standard vLLM Docker images (e.g., nvcr.io/nvidia/vllm:26.01-py3) ship with Transformers versions that do not recognize the...

By Agentic AI

Blog•Mar 14, 2026

Google Now Using AutoFDO To Enhance Android's Linux Kernel Performance

Google’s Android LLVM toolchain team announced that it has started using AutoFDO, an automatic feedback‑directed optimization technique, for building the Linux kernel in Android. By incorporating real‑world profiling data, the compiler can generate more efficient kernel binaries. Early measurements on...

By Phoronix

Blog•Mar 14, 2026

How to Debug AI Backend Systems

The article recounts a three‑day debugging nightmare caused by a faulty document‑chunking strategy in an AI Retrieval‑Augmented Generation (RAG) pipeline, highlighting how traditional logging failed to surface the issue. It argues that AI systems require a dedicated observability stack—structured logging,...

By Backend Weekly

Blog•Mar 13, 2026

Debaudit Announced As Debian Source Package Auditor

Debaudit, a new suite of verification tools, was announced to audit Debian source packages. It includes upstream2orig, git2dsc, and git2orig, each checking different stages of the source‑to‑binary pipeline. The tools confirm that upstream tarballs, Git repositories, and generated originals match...

By Phoronix

Blog•Mar 13, 2026

Day 43: Implement Log Compaction for State Management

The post outlines a production‑grade state management layer built on Kafka log‑compacted topics, featuring a keyed state producer, a consumer that materializes current snapshots, and a Redis‑backed query API. By retaining only the latest record per entity key, log compaction...

By Hands On System Design Course - Code Everyday
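
The "latest record per entity key" idea in the entry above can be sketched in a few lines. This is an in-memory model of what a consumer materializing a snapshot might do, not the post's implementation; the function name and sample records are hypothetical, and a `None` value is treated as a delete marker in the way Kafka treats tombstones.

```python
from typing import Dict, Iterable, Optional, Tuple


def materialize_snapshot(log: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Replay a keyed log in order and keep only the latest value per key."""
    snapshot: Dict[str, str] = {}
    for key, value in log:
        if value is None:
            # Tombstone: the entity was deleted, so drop it from the snapshot.
            snapshot.pop(key, None)
        else:
            # A later record for the same key overwrites the earlier one.
            snapshot[key] = value
    return snapshot


# Hypothetical keyed log, oldest record first.
records = [
    ("user-1", "active"),
    ("user-2", "active"),
    ("user-1", "suspended"),
    ("user-2", None),
]
current = materialize_snapshot(records)
```

Replaying the compacted log and the full log yields the same snapshot, which is why compaction is safe for state that only needs the current value.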

Blog•Mar 12, 2026

Your Java Singleton Choice Could Make Your App 871x Slower

The article benchmarks three Java singleton implementations—synchronized, double‑checked locking (DCL), and initialization‑on‑demand holder—and finds the holder pattern up to 871 times faster than the synchronized version and 115 times faster than DCL. In a billion‑operation test the holder took just 4 ms, while...

By Algorythm

Blog•Mar 12, 2026

ENISA Technical Advisory on Secure Package Managers: Essential DevSecOps Guidance

ENISA has published its first Technical Advisory on Secure Package Managers (v1.1), incorporating feedback from 15 stakeholders and experts. The document details common supply‑chain risks of third‑party dependencies and offers concrete practices for selecting, integrating, monitoring, and mitigating vulnerabilities across...

By Security Affairs

Blog•Mar 11, 2026

The Beginner’s Guide to Shipping with Claude Code + 3 Support Prompts

The post demystifies Claude Code for beginners, breaking down the jargon‑filled path from concept to live AI product. It outlines a clear workflow—idea, local development, GitHub repository, hosting, and deployment—while highlighting essential terms like API, webhook, and environment variables. The...

By Excellent AI Prompts

Blog•Mar 11, 2026

Judgment and Creativity Are All You Need.

Imprint transitioned from manual deployments and hand‑run database migrations to a fully automated continuous‑deployment pipeline within three months, leveraging Kubernetes, ArgoCD, and coding agents. The migration mirrors Uber’s 2014 service migration but swaps platform‑building for platform‑consumption, allowing a three‑engineer team...

By Irrational Exuberance

Blog•Mar 11, 2026

Starts at the Node

The article advocates starting platform engineering at the node—the smallest unit that delivers value, such as a microservice, developer workstation, or container. By tackling concrete developer pain points like build latency, CI flakiness, and credential handling, teams can create reusable...

By Future of CIO

Blog•Mar 10, 2026

Feature Flag Systems

Feature flag systems let companies separate code deployment from feature release, enabling instant toggles without redeploying. The architecture consists of a central flag management service, SDK clients embedded in applications, and a real‑time sync layer that propagates changes fleet‑wide. Flags...

By System Design Interview Roadmap

Blog•Mar 9, 2026

Impact of Scale Conferences 2026 in Los Angeles

The 2026 SCALE conference in Los Angeles gathered developers, DevOps engineers, and security professionals to showcase the latest in open‑source AI, cloud‑native automation, and supply‑chain security. Sessions emphasized self‑hosting large language models, building internal developer platforms, and hands‑on workshops that...

By Future of CIO

Blog•Mar 9, 2026

Day 42: Exactly-Once Processing Semantics in Distributed Log Systems

The post details a new Kafka‑based log pipeline that guarantees exactly‑once processing, eliminating duplicate handling even during failures. It combines idempotent producers, transactional consumer commits, a Redis‑backed deduplication layer, and a state‑reconciliation service to create an end‑to‑end exactly‑once flow. The...

By Hands On System Design Course - Code Everyday
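
The deduplication layer mentioned in the entry above can be pictured with a minimal single-process sketch. It assumes every message carries a unique ID and uses an in-memory set where the post describes a Redis-backed store; the class and method names are hypothetical.

```python
class DedupProcessor:
    """Apply each message's effect at most once, keyed by message ID.

    Assumption: a plain set stands in for the external dedup store. In a
    distributed setup, recording the ID and applying the effect would need
    to happen atomically, which this sketch does not attempt.
    """

    def __init__(self) -> None:
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id: str, amount: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            # Redelivery after a retry or failure: skip the side effect.
            return False
        self.total += amount
        self.seen_ids.add(message_id)
        return True


processor = DedupProcessor()
processor.handle("msg-1", 10)
processor.handle("msg-1", 10)  # redelivered, ignored
processor.handle("msg-2", 5)
```

After the three calls, `processor.total` reflects each distinct message once, regardless of how often it was delivered.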

Blog•Mar 9, 2026

GenAI-Based Development Platform - Part 1: Guardrails

The article introduces a GenAI‑based development platform, dubbed Harness, that layers deterministic guardrails around coding agents such as Claude Code. It outlines four protective mechanisms—pre‑commit checklist skill, pre‑commit Git hook, GitHub Actions workflows, and automated pull‑request reviews—to catch errors and...

By Microservices.io (Chris Richardson)

Blog•Mar 9, 2026

Automating Netlab-Based Cisco SD-WAN Deployment

Netlab, an open‑source lab generator, does not include native support for Cisco SD‑WAN. Sebastien d’Argoeuves created a GitHub repository that automates Cisco SD‑WAN deployment once a netlab lab is launched. The solution reads netlab’s JSON/YAML topology, maps device roles, and...

By ipSpace.net

Blog•Mar 9, 2026

Writing Tests with Claude Code - Part 1 - Initial Results

The author used Claude Code to auto‑generate a suite of 23 REST Assured/JUnit tests for a simple Spring Boot banking API. Within minutes Claude produced passing tests that achieved 95% line coverage and 91% mutation coverage according to PIT. A...

By On Test Automation

Blog•Mar 7, 2026

CreateOS Reading Club

The NodeOps Reading Club post examines how tool fragmentation and constant context switching sap productivity for solo founders, small dev teams, and beginner "vibe coders." It breaks down the hidden runway cost of juggling support tickets, logs, billing, and incident...

By NodeOps

Blog•Mar 5, 2026

Presentation: Netlab Overview and Use Cases

At the recent NetBCN event, a concise presentation showcased netlab’s expanding portfolio of use cases, adding a dedicated “use cases” section to the standard deck. The speaker highlighted roughly a dozen scenarios, ranging from rapid prototyping of network designs to...

By ipSpace.net

Blog•Mar 3, 2026

Observability, Governance at Scale

The article outlines how observability, governance, and safe automation together form a resilient IT foundation. Observability leverages metrics, logs, and traces to detect issues before they affect users. Governance establishes policies, RBAC, and compliance monitoring to align technology with business...

By Future of CIO

Blog•Mar 3, 2026

MLOps Coding Skills: Bridging the Gap Between Specs and Agents

The article introduces Agent Skills, a lightweight markdown‑based tool that injects organization‑specific engineering standards into AI coding agents. By converting sections of the MLOps Coding Course into SKILL.md files, the author shows how agents can automatically apply preferred tools such...

By MLOps Community

Blog•Mar 3, 2026

Day 146: Time Series Database Integration - Turning Logs Into Queryable Metrics

Today's post highlights the shift from raw log files to queryable metrics using time‑series databases. It explains why traditional relational databases falter with high‑write, append‑only workloads and showcases InfluxDB and TimescaleDB as purpose‑built solutions. The article illustrates how these databases...

By Hands On System Design Course - Code Everyday

Blog•Mar 2, 2026

Evals Skills for Coding Agents

Hamel Husain released evals‑skills, an open‑source plugin that equips AI coding agents with a toolbox for product‑specific evaluation. The package introduces an eval‑audit skill that inspects six diagnostic areas of an evaluation pipeline and a suite of targeted skills for...

By Hamel Husain

Blog•Mar 1, 2026

IT's Evolving Role in Advancing Organizational Growth

IT is transitioning from a back‑office system provider to a strategic, customer‑facing partner that drives end‑to‑end change. Leaders are urged to co‑create transformation roadmaps, adopt outcome‑based KPIs, and build modular, API‑first platforms that reduce duplication. Lightweight, proportional governance combined with...

By Future of CIO

Blog•Mar 1, 2026

I Rewrote My Step Function as a Durable Function

The author rewrote a serverless weather‑checking workflow from AWS Step Functions to the newly announced Lambda Durable Functions, publishing both implementations on GitHub. Both versions perform identical tasks—polling OpenWeatherMap every ten minutes and updating a static S3 site—but the coding...

By Danielle Heberling

Blog•Mar 1, 2026

Why GenAI-Based Coding Agents Are a Complex Domain in Cynefin - and What that Means for Adoption

The piece frames generative‑AI coding agents as a complex problem space within the Cynefin framework, emphasizing that prompt‑to‑output behavior is inherently unpredictable. Unlike traditional developer tools that sit in clear or complicated domains, LLM‑driven agents require safe‑to‑fail experiments, rapid feedback,...

By Microservices.io (Chris Richardson)

Blog•Feb 27, 2026

LXD 6.7 Released With AMD GPU Passthrough Support

Canonical released LXD 6.7, the latest update to its container and virtual‑machine manager for Ubuntu. The release introduces AMD GPU passthrough support using the new AMD CDI interface and a gpu_cdi_amd extension. It also upgrades VM GPU passthrough with newer QEMU...

By Phoronix

Blog•Feb 27, 2026

Refactoring the RestAssured.Net Code with Claude Code

The author used Claude Code’s Opus 4.6 model to refactor the large ExecutableRequest class in the RestAssured.Net library, creating a new RequestBodyFactory and consolidating arguments into a RequestBodySettings object. Guardrails such as excluding test files, manual code review, and incremental...

By On Test Automation

Blog•Feb 25, 2026

Fragments: February 25

Laura Tacho’s recent study shows 92.6% of developers rely on AI assistants, claiming roughly four saved hours per week and that AI now writes about 27% of code autonomously. The data also suggests AI can halve onboarding time, yet averages...

By Martin Fowler

Blog•Feb 24, 2026

Knowledge Priming

The article introduces *knowledge priming* – the practice of feeding AI coding assistants curated project context before asking for code. It shows how generic AI output often clashes with a team’s conventions, leading to a frustrating regenerate‑fix loop. By supplying...

By Martin Fowler

Blog•Feb 24, 2026

Microservices Platforms - Part 5: Observability Platform

The fifth installment of the Microservices Platforms series introduces an Observability platform that centralizes metrics, logs, and tracing for microservices. It explains how a dedicated platform team delivers shared observability capabilities, allowing service teams to concentrate on their core domain...

By Microservices.io (Chris Richardson)

Blog•Feb 23, 2026

Red Hat Releases Tuned 2.27 For Adaptively Tuning Linux To Different Workloads

Red Hat has released Tuned 2.27, the latest version of its open‑source tuning framework for Linux. The update adds CPU partitioning autodetection, a systemd workaround, and enables CPU boost in performance profiles. It also introduces OpenShift‑specific TCP optimizations, forces SAP HANA latency...

By Phoronix

Blog•Feb 23, 2026

What Is GitHub? How To Use GitHub To Manage Code

In this episode Ash Moosa explains what GitHub is and how it helps small businesses manage evolving e‑commerce code through version control. He walks through the core concepts—repositories, branches, forks, pull requests, issues, GitHub Actions, and GitHub Pages—and shows how they...

By eCommerce Fastlane

Blog•Feb 18, 2026

Frigate with Hailo for Object Detection on a Raspberry Pi

Jeff Geerling details how to pair Frigate NVR software with Hailo‑8 or Hailo‑8L AI coprocessors on a Raspberry Pi 5 or CM5. He outlines driver installation, Frigate configuration, and a PCIe driver tweak to resolve a max_desc_page_size error. After the fix, the...

By Jeff Geerling (blog)

Blog•Feb 17, 2026

Harness Engineering

OpenAI’s team spent five months building a "harness" that lets AI agents maintain a production‑grade codebase exceeding one million lines, without a single line of manually typed code. The harness blends three pillars—continuous context engineering, deterministic architectural constraints, and periodic...

By Martin Fowler

Blog•Feb 14, 2026

Lots of AI SRE, No AI Incident Management

AI SRE platforms such as PagerDuty, Datadog, and several startups are emerging to automate incident diagnostics and mitigation, but they largely ignore the coordination side of incident response. The author argues that incident management—aligning multiple responders, preventing fixation, and maintaining...

By Surfing Complexity

Blog•Feb 12, 2026

Software Testing Podcast - Agentic AI Quality Engineering - The Evil Tester Show Episode 030

The Evil Tester Show episode 030 features Dragan Spiridonov discussing his open‑source Agentic QE fleet, a suite of AI‑driven agents and skills that extend Claude Code for quality engineering. The tooling can automate browser interactions via Playwright or Vibium, generate test...

By Evil Tester Blog

Blog•Feb 11, 2026

Why Declarative (Lakeflow) Pipelines Are the Future of Spark

Spark is evolving from low‑level RDD and notebook‑driven workflows to declarative pipelines, branded as Lakeflow on Databricks. The new framework lets engineers define flows, datasets, and pipelines in a configuration‑first manner, while Spark handles execution for both batch and streaming....

By Confessions of a Data Guy

Blog•Feb 11, 2026

Versioning and Testing Data Solutions: Applying CI and Unit Tests on Interview-Style Queries

The article walks through solving a Tesla interview question in Python, calculating each car maker’s net product launch change between 2019 and 2020 using pandas. It then refactors the script into a reusable function and adds a unit‑test suite to...

By KDnuggets

Blog•Feb 6, 2026

Tools Aren’t Enough: Scaling AI Adoption for Engineering Teams

Block reports that roughly 95% of its engineers now rely on AI‑assisted coding tools, with most operating at advanced stages that require multiple parallel agents. To move teams from early experimentation to orchestrated multi‑agent workflows, Block launched an Engineering AI...

By Angie Jones (blog)

Blog•Feb 6, 2026

There Is a Lot of Knowledge and Wisdom Packed Into Speakeasy's Agent Skills

Speakeasy has released a detailed catalog of Agent Skills that codify the actions needed to generate, test, and manage SDKs and Terraform providers from OpenAPI specs. The list includes steps such as starting new projects, diagnosing failures, customizing runtime behavior,...

By API Evangelist

Blog•Feb 6, 2026

Where Is Governance (Guidance) Going?

The article reflects on a recent conversation with product marketer Anna Daugherty about the future of API governance, emphasizing a shift toward consumer‑first perspectives. It introduces "Spotlight rules" as the next evolution of Spectral and Vacuum linting, extending governance beyond...

By API Evangelist
