Today's Big Data Pulse

Leadership Gaps Hamper Data Engineering Teams, Survey Finds
Three 2026 surveys of 1,629 data professionals reveal organizational issues now dominate data‑engineering bottlenecks. In January, weak leadership direction and poor requirements accounted for 40% of top‑bottleneck votes, while by April 50% cited lack of clear ownership as the biggest pain point. Legacy systems and tooling were far lower priorities, at 25% and under 5% respectively.
Also developing:
By the numbers: Sensor Tower acquires AppMagic to expand SMB offering
Iranian-Backed Hacks Spur Call for Data‑Driven Threat Intelligence Platforms
A New Yorker investigation highlights a wave of Iranian‑backed cyber intrusions—from a New York dam to a Pennsylvania water system—pressuring U.S. utilities and security firms to adopt sophisticated, data‑driven threat‑intelligence platforms. Experts warn that without such tools, small municipalities remain vulnerable and the broader critical‑infrastructure risk escalates.
![The $22K Neural Search Pipeline That Was Silently 7 Days Behind [Edition #6]](/cdn-cgi/image/width=1200,quality=75,format=auto,fit=cover/https://substackcdn.com/image/fetch/$s_!fOxT!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444d8dff-2e3d-4216-b86d-30b379177d49_1200x1200.png)
The $22K Neural Search Pipeline That Was Silently 7 Days Behind [Edition #6]
Briefly.ly, a Series B newsletter aggregator with 5.2 M daily users, runs a two‑tower neural retrieval system costing about $22.6 K per month. The pipeline trains on a six‑month static snapshot and refreshes its FAISS index only once a week, leading to...
Devin Booker's $35K Fine Highlights Big Data Gap in NBA Officiating
Phoenix Suns guard Devin Booker was fined $35,000 after publicly denouncing a technical foul, reigniting debate over the NBA’s missing real‑time officiating analytics. The incident spotlights a broader industry gap: the league’s limited use of big‑data tools to evaluate and...
Palantir’s $130 Million IRS Contract Powers Massive Financial‑Crime Data Mining
Palantir’s Lead and Case Analytics platform, paid over $130 million by the IRS since 2018, is being used to aggregate and analyze millions of records for financial‑crime investigations. The contract, spanning both Gotham and Foundry applications, gives the agency a single...
OpenAI Launches GPT-5.5 with 1M-Token Context, Boosting Big‑data AI Workflows
OpenAI released GPT-5.5, a frontier model with a 1 million‑token context window, image input, and built‑in tool use, positioning it as the most capable AI for end‑to‑end work automation. The launch includes benchmark gains—82.7% accuracy on Terminal‑Bench 2.0 and half‑cost pricing—prompting...
From Patchwork to Platform: How Blue Cross Blue Shield Meets the Modernization Challenge
Blue Cross Blue Shield plans are confronting legacy technology debt and fragmented data silos, prompting a shift toward modular, cloud‑ready architectures. A HIMSS session outlined practical strategies—multicloud adoption, data unification, and robust governance—to boost agility and member experience. Speakers from...
Overstock.com Boosts Personalization with 500% Faster Data Science Velocity
Overstock.com disclosed that its new analytics platform has increased data‑science velocity by more than 500% and halved the cost of moving models into production. The changes enable the retailer to deliver personalized product recommendations across its catalog of nearly 5 million...
Omni Secures $120 Million Series C at $1.5 B Valuation, Led by ICONIQ
San Francisco‑based AI analytics firm Omni closed a $120 million Series C round at a $1.5 billion valuation, with ICONIQ Capital as lead investor. The funding will power platform development, enterprise expansion and a $30 million employee tender, underscoring strong venture appetite for AI...
Google Launches Agentic Data Cloud to Power AI‑Driven Enterprise Context
Google introduced the Agentic Data Cloud architecture, a unified semantic layer that lets AI agents reason over enterprise data at scale. The platform stitches together BigQuery, Dataplex, Vertex AI and a new Knowledge Catalog, targeting the biggest barrier to production...
Your Data Platform Costs More Than It Should
A Snowflake migration revealed unexpectedly high cloud spend, prompting a deep dive into data platform economics. The author demonstrates how simple SQL queries can surface the most credit‑hungry warehouses and queries, exposing idle compute and full‑table scans. By adjusting auto‑suspend...
Coffee Giants Deploy Satellite‑AI System to Meet EU Deforestation Rules
JDE Peet’s, Tchibo, Louis Dreyfus, Neumann Kaffee, Touton and Sucafina have formed the Coffee Canopy Partnership, using Airbus satellite imagery and AI to map coffee farms and avoid EU deforestation penalties. The rollout begins in East Africa with a goal of global coverage...
Strider Unveils Agentic Operating System to Centralize Strategic Intelligence
Strider Technologies announced the launch of its Agentic Operating System, a centralized layer that orchestrates strategic intelligence for enterprises. The platform promises to align KPIs, streamline decision‑making, and integrate disparate data sources, marking a new approach to management‑focused AI.
Abu Dhabi’s AI Census Wins Award, Fuels Smart‑City Housing Strategy
The Statistics Centre – Abu Dhabi (SCAD) earned the Strategic Impact Initiative Award for its AI‑driven register‑based census, which now updates population figures almost yearly. The system, showing a 7.5% annual rise to 4.14 million residents, is being used to pinpoint...
Microsoft Fabric Roadshow Unveils 2026 Enhancements for Integration, Analytics and Governance
At a February 16 roadshow in Brisbane, Microsoft announced a suite of 2026 updates to its Fabric data platform, emphasizing tighter governance, new AI catalog features, and expanded support for migrating from Azure Data Factory and legacy tools. The enhancements...

OpenText and Google Target the Data Layer Gap Holding Back Enterprise Agentic AI
OpenText and Google Cloud announced an expanded partnership to deliver a full agentic AI stack focused on context engineering, data sovereignty, and open interoperability. The collaboration builds on OpenText's Aviator Studio, a no‑code platform that governs and connects enterprise data...

How to Develop a Data Governance Strategy: 7 Key Steps
Developing a data governance strategy is now a top C‑suite priority as organizations grapple with exploding data volumes, regulatory scrutiny, and AI‑driven workloads. The article outlines a seven‑step framework that starts with documenting existing processes and securing executive sponsorship, then...

How We Stopped Babysitting Our Data and Got Faster at Ford
Ford’s data engineering team replaced its legacy on‑premise framework with a cloud‑native architecture that batches data instead of saving it sequentially. The shift leverages auto‑scaling services to handle sudden data surges, eliminating bottlenecks and reducing IT overhead. Early results show...

How Delta Parquet Is Cutting Data Costs by 99.7%
Delta Parquet, an open‑source extension of the Parquet columnar format, is delivering dramatic efficiency gains for financial‑services data pipelines. A 1 TB CSV file shrinks to roughly 130 GB, an 87% reduction, while query times plunge from 236 seconds to 6.78 seconds—a 34‑fold speed...

External Tables Persist: Legacy Need Meets Modern Data Access
Have you ever thought why Databricks, BigQuery, and others are still adding features such as External Tables? I've used them when starting my career in 2003, but why are they still used today? And what are the modern versions of...
Wedbush Keeps Outperform on SoundHound AI After $43 M LivePerson Deal
SoundHound AI announced an all‑stock acquisition of LivePerson valued at $43 million, a 22% premium, and an enterprise value of roughly $250 million. Wedbush reaffirmed its Outperform rating with a $12 12‑month target, noting the deal creates a massive data foundation and...
Faranak Firozan Calls for Human Audits to Safeguard HR Data Integrity
Technical Program Manager Faranak Firozan warned on April 24 that relying solely on automated HR analytics can produce dangerous blind spots, citing a recent audit that uncovered 11 missing managers. She argues that a manual "Human Audit" is essential to...
US Warns of China‑Backed AI Model‑Theft Campaigns Threatening Big Data Assets
The White House Office of Science and Technology Policy disclosed a large‑scale campaign by Chinese actors to steal proprietary AI models, using tens of thousands of fake accounts to bypass safeguards. Officials say the effort threatens U.S. intellectual property and...

Christophe Pettus: Postgres Goes to the Lake, Two Ways
Snowflake and Databricks have turned their recent acquisitions into competing "Postgres‑in‑the‑lakehouse" offerings. Snowflake released pg_lake, an open‑source Postgres extension that federates queries to Iceberg tables stored in object storage. Databricks launched Lakebase, a serverless Postgres built on Neon’s storage‑compute separation...

Storage News Ticker - 24 April 2024
DataHub expanded its Google Cloud partnership by open‑sourcing a Knowledge Catalog connector and adding native Iceberg Rest Catalog support, helping joint customers such as Etsy and Trustpilot build AI‑ready context layers. Denodo’s AI Trust Gap Report revealed that 63% of...

BI Dashboards Are Dying; Prepare for the Next Wave
RIP BI Dashboards. Tools like Tableau and PowerBI are about to become extinct. This is what's coming (and how to prepare):
UK Biobank Data of 500,000 Volunteers Listed for Sale on Alibaba
UK Biobank’s de‑identified health and genetic data for 500,000 volunteers appeared on the Chinese e‑commerce platform Alibaba. The breach, traced to three Chinese research institutions, prompted a parliamentary briefing, a temporary shutdown of the Biobank’s research platform, and renewed calls...
Anthropic’s Mythos Model Draws White House Attention Amid Global AI Policy Scrutiny
Anthropic’s latest AI system, Mythos, has prompted a flurry of high‑level meetings in Washington, a coordinated response from Indian banks, and early adoption by Mozilla for vulnerability hunting. The model’s unprecedented ability to locate software flaws is forcing regulators and...

How Spotify Used Agents to Migrate 1,800 Data Pipelines and Save 10 Weeks of Dev Work
Spotify’s internal Honk tool deployed autonomous agents to migrate roughly 1,800 data pipelines across its backend. The system generated and applied code changes automatically, eliminating the need for manual rewrites. By the end of the effort, Spotify saved an estimated...

From Imperative to Declarative: Rethink Data System Design
Not just syntax. A fundamental shift in how you think about data systems. Imperative: dictate exact steps. Declarative: describe what you want, system figures out how. https://www.ssp.sh/brain/declarative-vs-imperative
Dataplex Becomes a Dynamic, Always‑on Enterprise Knowledge Catalog
"To address this problem, we are evolving Dataplex into a dynamic, always-on Knowledge Catalog that serves as the universal context engine for your enterprise, helping agents execute complex tasks with accuracy." https://t.co/C1akH2b5yu < I'm listening.

Data Lakes Do Not Leak, Permissions Do
Modern analytics platforms are failing not because data lakes store too much information, but because permission models lag behind platform ambition. Organizations often apply broad, shared‑folder style access for convenience, which becomes a governance nightmare as the lake expands to...
CaliberMind Launches Unified B2B MMM Platform to Bridge Attribution Gap
CaliberMind introduced a native Marketing Mix Modeling (MMM) platform that combines multi‑touch attribution (MTA) with strategic budget planning in a single system. The tool promises continuous data integration, eliminating the costly gap between tactical campaign insights and long‑term revenue forecasting...
Tredence Unveils Agentic AI Suite on Google Cloud, Promising Up to 40% Cost Cuts for Retailers
Tredence announced a new agentic AI suite on Google Cloud aimed at e‑commerce and retail decision‑makers. The platform promises to cut total cost of ownership by up to 40%, automate up to 98% of manual processes, and shrink deployment cycles...
NielsenIQ Launches Commerce Lab to Build AI‑driven Data Layer for Retail
NielsenIQ announced the launch of its NIQ Commerce Lab, a technology hub designed to build the data platforms, APIs and measurement systems that power AI‑driven commerce. The initiative, led by new Head of AI Commerce Lisa Lovallo Ceppos, seeks to...
OZMOSI Teams with Planview to Deploy AI‑Driven Portfolio Planning for Pharma R&D
OZMOSI announced a strategic partnership with Planview, integrating its machine‑readable clinical intelligence into Planview’s AI‑driven portfolio management platform. The deal gives pharmaceutical teams access to data from more than 800,000 trials, 35,000 drugs and 4,000 diseases, promising faster, data‑backed R&D...
Bedrock Data Extends ArgusAI Governance to Google Vertex AI, Unifying AI Data Controls
Bedrock Data announced today that its ArgusAI governance platform now supports Google Vertex AI Search and Dialogueflow, enabling a single policy model for AI agents across Amazon Bedrock, Snowflake Cortex AI, ChatGPT Enterprise and Google's AI services. The move aims...
Fivetran Unveils Benchmark Highlighting API Limits and Throttling for AI Workloads
Fivetran launched the Open Data Infrastructure (ODI) Data Access Benchmark, a new industry standard that measures how software vendors restrict data access for AI workloads. The benchmark spotlights API limits, throttling, and egress fees as critical bottlenecks for enterprises scaling...
Unstructured Data Security Hindered by Visibility Gaps
One of the biggest security challenges with unstructured data is the lack of visibility and lineage as information moves across systems, clouds, and teams." #DataLineage #AI https://t.co/PYomJYHDkY
Former MBTA Chief Brian Shortsleeve Launches Governor Bid, Promises Data‑Driven Reforms
Brian Shortsleeve, the former chief administrator of the Massachusetts Bay Transportation Authority, has entered the Republican primary for governor. He vows to run the state with the same data‑driven, cost‑cutting playbook he used to halve the T’s operating deficit, positioning...
Snowflake Unveils AI Automation Upgrades to Power the Agentic Enterprise
Snowflake rolled out major upgrades to its Snowflake Intelligence and Cortex Code platforms, adding AI‑driven automation, new integrations and a mobile app. The enhancements aim to turn the data cloud into a control plane for an "agentic enterprise," with more...
Databricks Opens Public Preview of Lakeflow Designer, a No‑Code AI‑Native Data Prep Tool
Databricks announced today that its Lakeflow Designer is entering public preview, offering a visual, no‑code, AI‑native environment for data preparation directly within the lakehouse. The launch targets analysts and domain experts, promising faster, governed workflows without moving data, and signals...

How I Solved for Data Validation with AI
During a company hack week, an analytics engineering team tackled the persistent problem of validating data changes introduced by AI‑driven code refactoring. Using Claude Code, they built an AI skill that automatically opens a GitHub pull request and launches a...

30,000 Tables, Zero Context: Why Legacy Data Architecture Remains AI’s Biggest Enemy
Enterprise AI projects are stalling not due to model complexity but because legacy data architectures lack the cohesion needed for large‑scale intelligent workloads. Wiley, a 219‑year‑old publisher, discovered 30,000 fragmented tables across business units, prompting a shift to a unified...
Google Pitches Agentic Data Cloud to Help Enterprises Turn Data Into Context for AI Agents
Google unveiled the Agentic Data Cloud, an architecture that layers a unified semantic Knowledge Catalog atop its existing data services—BigQuery, Dataplex and Vertex AI. The offering adds preview tools such as a LookML‑based agent, a BigQuery feature for embedding business...

Governance Success Starts with People, Not Tools
Most data governance projects fail because they start with the tool. Not the process. Not the people. Colruyt Group flipped that: they aligned roles, tested the model, then scaled In the AI era, governance isn’t control—it’s enablement. by @ronald_vanloon, Grimme Bogaerts & Martijn...

Proxy Governance for Alternative Data: A Practical Playbook for Funds
The HedgeThink playbook outlines how funds can harvest alternative data through proxy‑enabled web scraping while meeting investor due‑diligence and regulator expectations. It urges teams to start with a narrow, documented use case, verify site terms, and map GDPR obligations—where a...

Meet the AI Startup That Gives Hotel Operators an Expert Data Team on Demand - By Ivana Johnston
Ladera.ai, founded in 2023 and based in Redwood City, offers an AI‑driven platform that unifies hotel PMS, CRM, and marketing data and lets commercial teams ask plain‑English questions. The system acts as a virtual analyst, strategist and data scientist, delivering...

OpenDataLoader PDF: Trending Parser Boosts RAG Pipelines
Someone just open-sourced a PDF parser that converts documents to Markdown, JSON, and HTML — and it's currently the #1 trending repository on GitHub. It's called OpenDataLoader PDF. Here's why data scientists building RAG pipelines should pay attention.
Data Trust and Transparency Demand Urgent Reform
A cracking podcast with @MrMBrown and @DamianPudner of @GreatBritishTT talking Data: Trust, Transparency, and the Need for Reform https://t.co/fOQA35x2rN
DuckDB 1.5.2 Boosts Performance 10% and Adds Production‑Ready Lakehouse Features
DuckDB announced version 1.5.2, a patch that lifts its TPC‑H benchmark score by roughly 10% and ships a production‑ready DuckLake v1.0 lakehouse implementation. The update also adds Iceberg‑compatible features, a new WebAssembly shell, and fixes a primary‑key conflict bug uncovered...