
A real AWS Data Science pipeline looks like this: Raw data → S3 → ETL (AWS Glue) → Query (Athena) → Training (SageMaker) → Deployment (Endpoints) → Monitoring (CloudWatch). Add streaming with Kinesis and orchestration with Step Functions, and you have a full production ML platform. This is the architecture behind many modern AI systems.
Though the US military's big data initiative Maven has sped up the planning of strikes for years, the comments suggest that generative AI is now adding a new interpretative layer to such deliberations.
Data lineage traces your data's journey from source to destination. Where did this number come from? What would break if I changed this table? Who's using this data? Good lineage answers these questions. Bad lineage makes you grep through code. Tools like dbt...
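The "what would break if I changed this table?" question is just a reachability query over a dependency graph. A minimal pure-Python sketch of the idea (the table names in the lineage graph are made up for illustration):

```python
def downstream(table, graph):
    """Return every table that transitively depends on `table`,
    i.e. everything that would break if `table` changed."""
    seen, stack = set(), [table]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Hypothetical lineage: source table -> tables built from it.
lineage = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_revenue", "dim_customers"],
    "fct_revenue": ["revenue_dashboard"],
}
```

Real tools build this graph for you by parsing your SQL; the query on top is this simple.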
Have you tried out pipe syntax instead of traditional SQL? I've only messed around with it a bit. I can see how it's an improvement for different types of queries. This post shows you how to try it out (at no...
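For anyone who hasn't seen it: pipe syntax chains each step with `|>` instead of nesting clauses, so the query reads top to bottom in execution order. A hedged sketch in GoogleSQL's pipe dialect, against a hypothetical `orders` table:

```sql
-- Hypothetical table and columns, for illustration only.
FROM orders
|> WHERE order_date >= '2024-01-01'
|> AGGREGATE COUNT(*) AS n, SUM(amount) AS revenue GROUP BY region
|> ORDER BY revenue DESC;
```

Each `|>` takes the previous step's output as input, which makes it easy to build and debug queries incrementally.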
RT You can't have responsible AI without responsible data. Classify AI data, extend zero trust, encrypt in use, and spell out non-negotiable governance policies from day one. #AISecurity #DataGovernance @Star_CIO https://t.co/aiB5P99ido
SAP's BTP platform streamlines cloud migrations by offering tools for data quality, master data management, and analytics. It supports a 'clean core' approach, enabling organizations to differentiate with custom processes without complex upgrades. #SAP #CloudMigration #BTP https://t.co/94wouRrLLt

Big news: I added a new design pattern chapter called «Dynamic Query Design Pattern». The problem statement for this design pattern goes like this: 1. Provide immediate answers 2. How you model it matters 3. Dumping everything into the lake is painful The core challenge is enabling...
"Shift left" comes from software engineering - finding bugs earlier in the development process. In data, shifting left means: validate data at the source, not after it breaks your dashboard. Instead of hoping bad data doesn't show up in your warehouse, you...
SQLMesh takes dbt's concept and adds semantic understanding of SQL. It parses SQL statements, translates between dialects automatically, and offers compile-time validation. Built by Tobiko Data (now Fivetran). If you're starting fresh, it deserves serious consideration. https://www.ssp.sh/brain/sqlmesh
I had a good friend tell me the apps I created were trash because I didn't have a formal database. I took that as a challenge. In 5 hours, I personally built, populated, and deployed a database containing all the content...
CLI-first is eating development. Email, calendar—they all have CLIs now. Why not your business metrics? For data/analytics engineers building with agents: DuckDB + MotherDuck + Rill give you an agent-friendly, #localfirst frontend—exact context via SQL and YAML. https://www.rilldata.com/blog/building-an-agent-friendly-local-first-analytics-stack-with-motherduck-and-rill

Understanding #Data Fabric is Key to Modern Data Management and Efficiency by @antgrasso #DataScience #BigData https://t.co/6OxSioKNji
Dagster's key innovation is software-defined assets. Instead of: "Run this job on a schedule" You declare: "I need this table to exist, here's how to build it" The difference is subtle but profound. Assets have identities, dependencies, and history. Jobs are just tasks. When...
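The declaration-over-scheduling idea can be sketched in a few lines of plain Python. This is a toy registry, not Dagster's actual API, and the asset names are invented for illustration:

```python
# Toy "software-defined assets": declare what should exist and how to
# build it; the framework resolves dependencies and build order.
ASSETS = {}

def asset(fn):
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn

def materialize(name, cache=None):
    """Build an asset, recursively materializing its dependencies first."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        deps = fn.__code__.co_varnames[: fn.__code__.co_argcount]
        cache[name] = fn(*(materialize(d, cache) for d in deps))
    return cache[name]

@asset
def raw_orders():
    return [{"sku": "A", "amount": 10}, {"sku": "B", "amount": 5}]

@asset
def daily_revenue(raw_orders):
    return sum(row["amount"] for row in raw_orders)
```

Asking for `daily_revenue` builds `raw_orders` first, automatically. That inversion, declaring the table instead of scheduling the job, is the whole point.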
Imagine a place where you could: • Pick a data project • Follow a structured workflow • Build something real • Add it straight to your portfolio That's the direction we're exploring.
Slack is the most important text data source in most companies, but it has the worst data access policies in enterprise software. The only thing that will fix it is competition, and Anthropic is the right company to do it....
I spent years working with data warehouse automation tools before the modern data stack existed. The biggest lesson? There are two approaches to generating pipelines: Parametric - you define parameters, the tool generates SQL Template-based - you write SQL templates with variables Most modern...

Part 3 of 3 underused chart types worth knowing. A box plot with 15 points looks identical to one with 1,500. You lose all sense of where measurements actually cluster. Beeswarm plots fix this. Every data point is visible. Nothing gets absorbed into...

Question for your next meeting: "If 95% of AI projects fail before production, and the reason is data quality, what percentage of our AI budget goes to data quality and governance?" The follow-up that makes it uncomfortable: "How confident are we that...
Most FP&A teams don’t struggle with analytics. They struggle with data. 💡Finance leaders from PepsiCo, BILL, and Workday shared how they build strong data foundations and a single source of truth to enable AI and predictive decision-making: https://t.co/FnD9BnrjT6 #fpatrends
AWS and Azure both surging simultaneously. Oracle climbing. Elasticsearch tripled. It's not one cloud winning. It's ALL infrastructure growing as data demand outpaces capacity. The foundation layer is on fire.

Tableau is about to die. Introducing PandasAI, a free alternative for fast Business Intelligence. Let's dive in:
Why the most valuable AI systems are not the most accurate ones today, but the ones designed to learn tomorrow In the early days of enterprise AI, success was measured in a single moment: the model launch. A team would...
"Multicloud chaos is fundamentally a data problem, and genAI's edge is building a unified semantic layer over configs, logs, schemas, and lineage." #SRE #Cloud #CIO https://t.co/vBzM21vM14
Everyone's talking about "second brain" for AI. I added a new layer to mine. I built a context vault with 200-700 line summary docs of big areas of my life (business, 2026 goals, family, friends, a personal constitution). WAY fewer...

Principal Component Analysis (PCA) is the gold standard in dimensionality reduction. But PCA is hard to understand for beginners. Let me destroy your confusion:
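As a rough illustration of what PCA does mechanically (center the data, form the covariance matrix, project onto its top eigenvectors), here is a minimal 2-D sketch in pure Python, using the closed-form eigendecomposition of a symmetric 2x2 matrix:

```python
import math

def pca_2d(points):
    """Project 2-D points onto their principal axis (largest variance first).
    Returns (scores along the first component, (eigenvalue 1, eigenvalue 2))."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix, largest first.
    mean_ac, delta = (a + c) / 2, math.hypot((a - c) / 2, b)
    l1, l2 = mean_ac + delta, mean_ac - delta
    # Eigenvector for l1 is proportional to (b, l1 - a) when b != 0.
    vx, vy = (b, l1 - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    scores = [x * vx + y * vy for x, y in centered]
    return scores, (l1, l2)
```

For real work you would use a library implementation (e.g. scikit-learn's PCA), which handles any number of dimensions via SVD; this sketch just makes the mechanics visible.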
Yesterday I showed someone how to join tables in Snowflake using natural language, no SQL required. And she still said it was hard and confusing.

Part 1 of 3 underused chart types worth knowing You reach for a line plot to show ranking changes over time. The lines cross. It turns into spaghetti. Bump charts fix this. When you care about relative position — not raw values —...
#1 thing people don't know about Databricks and Apache Spark: the performance of Real-Time Mode (RTM). It's faster than Apache Flink and more robust. No more batching.

Python Tip: when pandas is too slow, other libraries come to the rescue.
- Modin: the easiest switch from pandas. Change one line (import modin.pandas as pd) and keep the same syntax; it uses all CPU cores.
- Dask: for when data > RAM. Processes data in chunks across CPU...

Stop Writing UNION ALL for Multi-Level Aggregations.
You need regional totals AND product totals AND grand totals. So you write three separate queries with UNION ALL. There's a better way: GROUPING SETS.
UNION ALL: scans the table 3 times. Slow.
GROUPING SETS: one...
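A hedged sketch of the GROUPING SETS version, against a hypothetical sales(region, product, amount) table. This is standard SQL, supported by most warehouses:

```sql
-- One scan, three aggregation levels (table and columns are made up).
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY GROUPING SETS (
    (region),   -- regional totals
    (product),  -- product totals
    ()          -- grand total
);
```

Rows outside a given grouping level come back with NULL in the ungrouped columns; the GROUPING() function distinguishes those from real NULLs in the data.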
Folks asked me "what's your plan for gwenchmarks"? At first, it was a joke. But... teaching people how to plan, execute and read benchmarks is a good goal. So I wrote The Gwenchmarks Manifesto as a start. Still a bit...
As I was building my MRR analysis feature, I realized that there is much more power in our MRR schedule than we realize. With the correct metadata, we have a revenue intelligence engine that will provide more insight for our...

"The columnar engine uses a specialized storage mechanism designed to accelerate analytical queries by speeding up scans up to 200 times on live operational data" The new @googlecloud Spanner capability means you can serve Iceberg lakehouse data ... https://t.co/dxmgEAI0cA https://t.co/TUe0vNnzfN
A strong data strategy is more than storage. It's context, quality, and governance. The "useless" data may hold insights GenAI needs, but without curation, access controls, and trust, innovation risks becoming noise instead of value. https://t.co/ParkENiwRg
There's a moment in every data engineer's career when they discover they can query a 10GB Parquet file on their laptop in seconds. That's the DuckDB moment. It changes how you think about what requires a cluster and what doesn't. Spoiler: most...

Instead of loading CSVs into pandas just to run one query, you can use DuckDB to run SQL directly on files. No loading. No waiting. Just query the file and get results. It’s also 20x faster and uses way less memory. Here’s how...
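A sketch of the idea in DuckDB's SQL dialect; the file name and columns are made up for illustration:

```sql
-- DuckDB queries the file in place: no import step, no intermediate table.
SELECT status, COUNT(*) AS n
FROM 'events.csv'
GROUP BY status
ORDER BY n DESC;
```

The same pattern works for Parquet and JSON files, and for globs like 'data/*.csv'.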

Agent standards like MCP and A2A are starting to show up in more types of packaged software. @confluentinc just shipped updates to their data products, including their "intelligence" platform that now supports A2A and MCP integrations. https://t.co/n8teMonFMW https://t.co/1f7afnSw8K
73% of companies are investing more in data and analytics skills right now. Not AI skills. Data skills. Your AI returns depend entirely on your data foundation. The benchmark comparison doesn't change that.
ClickHouse is trying to push Postgres + ClickHouse as the ultimate analytics DB stack. But tbh, adding an eventually consistent database that you need to sync to is anything but trivial. Love the product but I'd just use...

Snowflake prints 30% revenue growth and record bookings momentum $SNOW delivered Q4 revenue of $1.28B (+30% YoY) with product revenue at $1.23B (+30%) and EPS of $0.32, beating estimates by 23%. Growth is re-accelerating at scale. RPO climbed to $9.77B (+42%), billings...

The debate: which AI model to bet on. What I'm watching instead: Informatica surging. Java everywhere. GCP back on the shortlist. 73% of companies investing more in data skills. Not AI skills. Your AI strategy is only as strong as the data layer...
Data Product: why do we need this data? Data Asset: what is this data and how do we maintain it? I've found the most practical definition of data products comes from Dagster's software-defined assets. Each asset has a clear definition, dependencies, and...

Recently I caught up with Amiet Dhagat, Head of Data Services Analytics and AI for HCF, to get the inside story behind their stellar success in transitioning from static to dynamic data, and why real-time data has been so...
SONAR is having its best quarter in years. One of my board members asked if I was concerned about the rise of AI and whether it would be a drain on our data business. I told her that we are seeing...
Slack will be the Waterloo of open vs closed data. Someone is going to make a slack clone where you get unfettered access to your own data, and people really will switch en masse.
RT High-level policies aren't enough. It's time for audits, training, DSPM, and privacy-by-design in AI workflows. If privacy isn't built into how data moves, you're hoping - not leading. #DataGovernance #AI #CIO @Star_CIO https://t.co/Naq82FuMWZ
Why are hyperscalers racing to offer managed Iceberg? Because whoever controls the catalog controls the ecosystem. If your tables are in a managed Iceberg service, you can query them from any engine - Spark, Trino, DuckDB, whatever. But your metadata stays with...
Why we're excited about Lakebase GA: most database services are based on outdated assumptions leading to poor operability, scalability and devex.
Every data conversation I tracked this week led back to the same concept. Data Governance. Not as compliance. As the center of gravity. 47 data sources. 6 different owners. 3 definitions of "customer." Your AI agent has no idea who to believe. https://t.co/ZrW6iptSqs

Are we still using table catalogs for open table formats? Haven't heard too much lately. I like OTFs, but making a catalog non-optional isn't great. That's why I prefer the option to use one without. But if you...