Open Formats, Commercial Compute
Why open data formats — Apache Iceberg in particular — have become central to how I think about data platform decisions. Not a migration story. A question about optionality.
I’m drawn to open source tools and open data formats. It’s a mix of instinct and circumstance, but the two don’t always point in the same direction.
The instinct is toward portability and independence. Own your data in a format that doesn’t depend on any single vendor’s continued goodwill or business model. That preference has been consistent everywhere I’ve worked.
The circumstance varies. At smaller organizations where the primary resource was my own time, open tools lowered the barrier to getting started. A proof of concept that’s really just one person doesn’t justify a platform contract. It needs something that works, that you can learn on your own, and that will still be there if the company behind it changes direction.
At a larger organization, the calculus is different. Staff time has a clearer dollar value. Turnkey platforms that reduce operational burden earn their cost. The appeal of a managed service isn’t laziness; it’s a responsible accounting of what your team’s hours are worth and how quickly you need to see a return on investment. Sometimes the circumstance points away from the instinct. When it does, circumstance wins.
The binary is dissolving
For a long time, “open vs. proprietary” was a necessary choice with real tradeoffs. Open formats gave you portability and independence but required more engineering effort. Proprietary platforms gave you convenience and performance but locked your data into someone else’s ecosystem.
The choice isn’t as stark as it used to be. The major platforms are converging on support for open table formats — Apache Iceberg in particular — and the competitive dynamics are shifting.
Snowflake’s investment here is a clear signal. CEO Sridhar Ramaswamy has talked openly about the direction: “In 10 years, data is going to be sitting mostly in the cloud, mostly in cloud storage, which is very cheap, mostly in interoperable formats accessible via open catalogs.”
The bet is that the compute and workflow layer is where the value is, not the data format. Whether that becomes a durable industry posture is an open question, but the direction is clear enough for me to take seriously.
The emerging landscape
Apache Iceberg has become the open table format with the most momentum. Snowflake, AWS, Databricks, and others have converged on supporting it, or, in Databricks’ case, on moving its own Delta Lake format closer to Iceberg. If that holds, the storage layer becomes standardized and open, and the interesting choices move to compute.
DuckDB and MotherDuck are another part of this picture: lightweight analytical compute that reads open formats directly, embeddable in DuckDB’s case and hosted in MotherDuck’s. I haven’t gone deep in that world, but I’ve explored enough to see why it’s compelling, particularly for smaller organizations or individual practitioners who want analytical power without a platform commitment.
What I’m testing
I recently refactored our data pipeline so that instead of landing all raw data directly in Snowflake, we land it in S3 in Apache Iceberg format, managed by an AWS Glue catalog. Snowflake connects to it directly and discovers new external Iceberg tables within about 30 seconds. They perform at near-parity with Snowflake-native tables, and the refactor was not nearly as much of a project as I expected it to be. PyIceberg’s upsert handling after schema changes is still rough, but we navigated it. Not a dealbreaker.
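For a concrete picture, here is a minimal sketch of what a landing step like this could look like with PyIceberg. It is not the actual pipeline code. The function names, region, bucket, and table name are all hypothetical, and it assumes PyIceberg 0.9 or later (for `upsert`), pyarrow, and AWS credentials already available in the environment. The schema step is one plausible way to handle new columns before an upsert, not a guaranteed fix for the rough edges mentioned above.

```python
from typing import Any


def glue_catalog_properties(region: str, warehouse: str) -> dict[str, str]:
    """Build PyIceberg properties for an AWS Glue catalog (values are illustrative)."""
    return {
        "type": "glue",          # use the Glue catalog implementation
        "glue.region": region,   # region of the Glue Data Catalog
        "warehouse": warehouse,  # S3 prefix where table data and metadata land
    }


def land_batch(
    batch: Any,  # expected to be a pyarrow.Table
    table_name: str,
    key_columns: list[str],
    properties: dict[str, str],
) -> tuple[int, int]:
    """Evolve the table schema if needed, then upsert one batch.

    Returns (rows_updated, rows_inserted).
    """
    # Imported lazily so the helper above works without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("glue", **properties)
    table = catalog.load_table(table_name)

    # Add any columns that exist in the batch but not yet in the table.
    with table.update_schema() as update:
        update.union_by_name(batch.schema)
    table.refresh()  # pick up the committed schema before writing

    # Match rows on the key columns: update matches, insert the rest.
    result = table.upsert(batch, join_cols=key_columns)
    return result.rows_updated, result.rows_inserted
```

A call might look like `land_batch(batch, "raw.events", ["event_id"], glue_catalog_properties("us-east-1", "s3://example-bucket/warehouse"))`, where every argument is a placeholder. Keeping the catalog properties in a plain dictionary means the same function could point at a different Iceberg catalog later without changing the write logic.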
This wasn’t a migration away from Snowflake. It was a way to make more of the AWS ecosystem available alongside it, and to take advantage of the partner dynamic between the two. Both Snowflake-native tools and AWS-native services are now available for any given project, drawing from the same underlying data.
I’m still developing a clear point of view on when to reach for which. The heuristic that’s forming, and it’s still early, is that Snowflake’s turnkey tools like Cortex Analyst and Cortex Search make good sense for internal, staff-facing applications where scale isn’t the primary constraint. For consumer-facing applications where scalability and fine-grained customization matter more, AWS-native services may be the better fit. I don’t have a firm rubric yet. Building one is part of the experiment.
What’s still unresolved
The key question is whether this hybrid approach adds complexity that a small team can sustain. I’m still largely the only person hands-on with these systems while we build the business case for a larger investment in scaling. Optionality is valuable in theory, but if it means maintaining two sets of tooling, two mental models, and two sets of operational knowledge, the cost could outweigh the benefit.
“Right tool for the job” is a useful principle, but without a clear decision framework it can become an excuse for sprawl — a little AWS here, a little Snowflake there, with no coherent architecture underneath.
For organizations where every platform commitment carries real budget and capacity risk, I think open formats reduce that risk meaningfully. If the hybrid approach works, the data stays portable and the compute layer becomes a choice to revisit with each new project. Early results are promising.
I think this experiment is worth continuing. Whether the optionality actually pays off at our scale, I’ll know more in a year.