
ML Platform Build vs Buy in 2026: Real TCO

By Lazar Milicevic · June 26, 2026 · 9 min read

Last updated: June 2026.

A founder asked me last month whether to keep paying Databricks $38k/month or spin up Ray + MLflow on EKS. The team was small, the workloads were predictable, and the bill was growing 12% a quarter. That conversation pushed me to finally write down the cost model I've been using for the last two years of AI platform engagements, with the actual numbers instead of the hand-wavy "it depends" most vendors and consultants offer.

This post is the spreadsheet, the decision tree, and the trade-offs. I'll walk through Databricks, SageMaker, Vertex AI, and a self-hosted stack (Ray + MLflow + Argo + Postgres + S3) at three company sizes. The numbers are from real deployments I've worked on or scoped, adjusted to USD list pricing as of mid-2026. If you want to copy the model, the structure is at the bottom.

The honest TCO formula most teams get wrong

True ML platform TCO is compute + storage + platform fee + engineering labor + egress + exit cost, with opportunity cost sitting on top as the part that never shows up on an invoice. Most build-vs-buy debates only count the first three, which is why "self-hosted is cheaper" stops being true around the 3-engineer mark.

The line item people forget is engineering labor. A self-hosted Ray + MLflow + Argo stack on EKS needs roughly 0.5 to 1.0 FTE of platform engineering once you're past prototype, just to keep upgrades, IAM, autoscaling, and the feature store working. At a $180k loaded US salary, that's $90k-$180k a year before you've trained a single model. Vendors aren't lying when they say their platform "pays for itself" — they're just not telling you it only pays for itself once you have enough scale that the labor offset matters.

The second forgotten line is exit cost. Databricks notebooks, Unity Catalog permissions, Delta Live Tables pipelines, and MLflow-on-Databricks experiment metadata are portable in theory and painful in practice. I budget 3-6 engineer-months to migrate a mature Databricks workspace to anything else. SageMaker Pipelines and Vertex Pipelines have the same problem in different dialects.
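To put the two forgotten lines into dollars, here is a minimal Python sketch. It uses the $180k loaded salary and the 0.5-1.0 FTE and 3-6 engineer-month ranges from the text. Pricing an engineer-month at salary / 12 and the 3-year stay are my assumptions, not figures from the engagements.

```python
LOADED_SALARY = 180_000  # USD per year, the loaded US salary quoted in the text


def labor_cost(fte: float, loaded_salary: float = LOADED_SALARY) -> float:
    """Annual cost of the platform-engineering share of headcount."""
    return fte * loaded_salary


def exit_cost(engineer_months: float, loaded_salary: float = LOADED_SALARY) -> float:
    """One-time migration cost. ASSUMPTION: an engineer-month costs salary / 12."""
    return engineer_months / 12 * loaded_salary


def amortized_exit_cost(engineer_months: float, years_on_platform: float,
                        loaded_salary: float = LOADED_SALARY) -> float:
    """Exit cost spread evenly over the years you expect to stay."""
    if years_on_platform <= 0:
        raise ValueError("years_on_platform must be positive")
    return exit_cost(engineer_months, loaded_salary) / years_on_platform


labor_range = (labor_cost(0.5), labor_cost(1.0))
exit_range = (exit_cost(3), exit_cost(6))
# HYPOTHETICAL: a 3-year stay; the text does not give an expected tenure.
yearly_exit_range = (amortized_exit_cost(3, 3), amortized_exit_cost(6, 3))

print(labor_range, exit_range, yearly_exit_range)
```

Under those assumptions the migration itself lands at $45k-$90k, which is small next to the labor line but large enough to change a close call.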

Here's the formula I actually use:

Annual TCO = (compute_hrs * $/hr * utilization_factor)
           + (storage_TB * $/TB/mo * 12)
           + (platform_markup)
           + (FTE_count * loaded_salary)
           + (egress_TB * $/TB)
           + (amortized_exit_cost / expected_years_on_platform)

utilization_factor is the one that surprises people. Read it as billed hours per useful hour, roughly 1 / utilization, so idle capacity inflates the compute line instead of shrinking it. Reserved Databricks clusters typically sit at 35-55% utilization in the teams I audit. On-demand SageMaker training jobs hit 70%+ because they're ephemeral. That gap is often worth more than the per-hour price difference.
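The formula translates almost line for line into code. The sketch below is one possible reading, not the spreadsheet itself. It assumes compute hours means useful hours and that utilization_factor is 1 / utilization, so low utilization raises the bill. All field names and the example inputs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TcoInputs:
    useful_compute_hours: float       # hours of work actually needed per year
    price_per_hour: float             # USD per compute hour
    utilization: float                # fraction of billed hours doing useful work
    storage_tb: float
    storage_price_per_tb_month: float
    platform_markup: float            # annual vendor fee on top of raw compute
    fte_count: float
    loaded_salary: float
    egress_tb: float
    egress_price_per_tb: float
    exit_cost: float                  # one-time migration cost
    expected_years_on_platform: float


def annual_tco(x: TcoInputs) -> dict[str, float]:
    """Return each line item of the formula plus the total, in USD per year."""
    if not 0 < x.utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if x.expected_years_on_platform <= 0:
        raise ValueError("expected_years_on_platform must be positive")
    # ASSUMPTION: utilization_factor = billed hours per useful hour.
    utilization_factor = 1 / x.utilization
    items = {
        "compute": x.useful_compute_hours * x.price_per_hour * utilization_factor,
        "storage": x.storage_tb * x.storage_price_per_tb_month * 12,
        "platform": x.platform_markup,
        "labor": x.fte_count * x.loaded_salary,
        "egress": x.egress_tb * x.egress_price_per_tb,
        "exit": x.exit_cost / x.expected_years_on_platform,
    }
    items["total"] = sum(items.values())
    return items


# HYPOTHETICAL inputs, chosen only to exercise the function.
example = annual_tco(TcoInputs(
    useful_compute_hours=1_000, price_per_hour=2.0, utilization=0.5,
    storage_tb=5, storage_price_per_tb_month=20.0, platform_markup=10_000,
    fte_count=0.5, loaded_salary=180_000, egress_tb=1, egress_price_per_tb=90.0,
    exit_cost=45_000, expected_years_on_platform=3,
))
print(example)
```

Returning the line items rather than a single number keeps the comparison honest: you can see at a glance whether compute or labor dominates.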

The four options, priced at three scales

I modeled three company profiles. All assume US-based engineers, AWS or GCP as the base cloud, and a workload mix of 60% batch training, 30% online inference, 10% interactive notebooks.

Profile A — Seed/Series A: 2 ML engineers, ~200 GPU-hours/month (mostly A10G/L4), 5 TB feature data, 1 production model serving ~50 req/s.

Profile B — Growth stage: 6 ML engineers, ~2,000 GPU-hours/month (mix of A10G and A100), 50 TB, 8 production models, ~500 req/s aggregate.

Profile C — Scale: 20 ML engineers, ~15,000 GPU-hours/month (heavy A100/H100), 500 TB, 30+ production models, ~5,000 req/s.

Annual TCO table (USD, 2026 list pricing, US regions)

Platform                                  | Profile A | Profile B | Profile C
Databricks (on AWS)                       | $94k      | $612k     | $4.1M
SageMaker                                 | $71k      | $478k     | $3.2M
Vertex AI (GCP)                           | $76k      | $495k     | $3.4M
Self-hosted (Ray + MLflow + Argo on EKS)  | $58k      | $402k     | $2.4M
Self-hosted + 1.0 FTE platform eng        | $238k     | $582k     | $2.58M

A few things jump out.

At Profile A, self-hosted only wins if you don't count the engineering you're already spending. Two ML engineers cannot also be a platform team. Allocate the ~50% of one engineer who ends up doing platform work and the honest number is about $148k ($58k plus $90k); at the full FTE in the table it is $238k. Either way, Databricks is cheaper and faster.

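At Profile B, self-hosted is genuinely competitive. You have enough scale to justify a dedicated platform engineer, and the markup you stop paying ($150k-$200k a year for Databricks DBU pricing or SageMaker management) roughly covers their $180k loaded salary. In the table that is $582k with one FTE against $612k for Databricks, though SageMaker ($478k) and Vertex AI ($495k) still list lower.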

At Profile C, self-hosted is dramatically cheaper on paper, but the operational risk is higher and you need 2-3 platform engineers, not one. At $180k each, that moves the table's $2.58M to roughly $2.76M-$2.94M, still below every managed option. The right answer is often hybrid: self-hosted for steady-state batch training, managed for the long tail.
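The table and the three observations above reduce to one question: how many platform FTEs can self-hosted absorb before it costs more than a managed option? A rough sketch, using only the table figures and the $180k loaded salary. The dictionary layout and function names are mine.

```python
LOADED_SALARY = 180_000  # USD per year, from the text

# Annual TCO from the table, in USD.
MANAGED = {
    "Databricks (on AWS)": {"A": 94_000, "B": 612_000, "C": 4_100_000},
    "SageMaker": {"A": 71_000, "B": 478_000, "C": 3_200_000},
    "Vertex AI (GCP)": {"A": 76_000, "B": 495_000, "C": 3_400_000},
}
SELF_HOSTED_INFRA = {"A": 58_000, "B": 402_000, "C": 2_400_000}


def self_hosted_total(profile: str, fte: float) -> float:
    """Self-hosted infrastructure cost plus platform-engineering labor."""
    return SELF_HOSTED_INFRA[profile] + fte * LOADED_SALARY


def break_even_fte(profile: str, platform: str) -> float:
    """Platform FTEs at which self-hosted costs the same as the managed option."""
    gap = MANAGED[platform][profile] - SELF_HOSTED_INFRA[profile]
    return gap / LOADED_SALARY


for profile in ("A", "B", "C"):
    for platform in MANAGED:
        print(profile, platform, round(break_even_fte(profile, platform), 2))
```

A break-even below the FTE count you realistically need means managed wins on list price. At Profile A against Databricks the break-even is 0.2 FTE, which is why half an engineer already tips the balance.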

Where each platform's pricing actually hurts

  • Databricks: DBU markup on top of EC2. For interactive clusters you're paying roughly 2-2.5x the underlying compute cost. Photon and Serverless SQL add another premium. The bill grows superlinearly with notebook usage.
  • SageMaker: Endpoint cost is the silent killer. A single ml.g5.xlarge real-time endpoint at 24/7 is ~$760/month. Multiply by 8 models with shadow variants and you're at $12k/month for inference alone before any training.
  • Vertex AI: Pipeline component overhead and managed pipeline storage add up. Endpoint pricing is similar to SageMaker. The Feature Store online serving has a per-node minimum that doesn't scale down.
  • Self-hosted on EKS: EKS control plane is trivial ($876/year per cluster). GPU node group cold-start, spot interruption handling, and Karpenter tuning are where the labor goes.
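Two of the figures above are simple enough to check by hand. The sketch below reproduces the SageMaker inference estimate and backs out the hourly rate implied by the EKS control plane figure. It assumes "shadow variants" means one production plus one shadow endpoint per model on the same instance type, which the text does not spell out.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def inference_bill_per_month(models: int, endpoint_monthly: float = 760.0,
                             variants_per_model: int = 2) -> float:
    """Monthly cost of always-on endpoints.

    ASSUMPTION: each model runs a production and a shadow variant
    (variants_per_model=2), both at the ~$760/month figure from the text.
    """
    return models * variants_per_model * endpoint_monthly


def implied_hourly_rate(annual_cost: float) -> float:
    """Hourly rate implied by an always-on annual charge."""
    return annual_cost / HOURS_PER_YEAR


print(inference_bill_per_month(8))   # 8 models with shadow variants
print(implied_hourly_rate(876))      # EKS control plane, per cluster
```

With 8 models this gives $12,160 a month, matching the roughly $12k in the text, and the EKS figure works out to about $0.10 per cluster-hour.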

The decision tree I actually use

I run prospective clients through this in about 20 minutes. It's not subtle, and that's the point.

1. Do you have a dedicated platform/infra engineer (not a "we'll figure it out" engineer)?

  • No → Buy. Pick based on your existing cloud. Don't argue with this.
  • Yes → Continue.

2. Are you doing >$300k/year in managed ML platform spend today (or projected within 12 months)?

  • No → Buy. The labor break-even isn't there.
  • Yes → Continue.

3. Is your workload >70% predictable batch (training, scheduled inference)?

  • Yes → Self-hosted is viable. Ray + MLflow + Argo on EKS with Karpenter spot instances is the stack I'd reach for.
  • No (heavy interactive notebooks, ad-hoc exploration, lots of small models) → Stay managed. The labor cost of building a good notebook + experiment tracking experience yourself is higher than the markup.

4. Do you need fine-grained data governance (Unity Catalog, lineage, row-level security)?

  • Yes and you have it working → Don't migrate off Databricks for cost reasons alone. The exit pain is real.
  • Yes and you don't have it → SageMaker + Lake Formation or Vertex + BigQuery row-level security are credible. Self-hosted means OpenMetadata or DataHub, which is another half-FTE.

5. Are you running models that touch regulated data or sit under a formal compliance regime (HIPAA, SOC 2 Type II, FedRAMP)?

  • Yes → Buy, at least for the regulated workloads. The compliance documentation alone is worth the markup.
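The five questions above can be sketched as a small function. This is my reading of the tree, not a tool I hand to clients. It treats questions 1-3 as the verdict and questions 4-5 as caveats layered on top, and every field name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    has_dedicated_platform_engineer: bool
    managed_spend_per_year: float          # USD, today or projected within 12 months
    predictable_batch_share: float         # 0.0-1.0 share of workload
    needs_fine_grained_governance: bool = False
    governance_working_on_databricks: bool = False
    touches_regulated_data: bool = False


def recommend(a: Answers) -> tuple[str, list[str]]:
    """Return (verdict, caveats) following questions 1-5 in order."""
    # Questions 1 and 2 are hard stops.
    if not a.has_dedicated_platform_engineer:
        return "buy", ["pick based on your existing cloud"]
    if a.managed_spend_per_year <= 300_000:
        return "buy", ["the labor break-even isn't there"]

    # Question 3 sets the verdict.
    if a.predictable_batch_share > 0.70:
        verdict = "self-hosted viable"
    else:
        verdict = "stay managed"

    # Questions 4 and 5 add caveats (ASSUMPTION: they do not flip the verdict).
    caveats: list[str] = []
    if a.needs_fine_grained_governance:
        if a.governance_working_on_databricks:
            caveats.append("don't migrate off Databricks for cost reasons alone")
        else:
            caveats.append("self-hosted governance adds roughly another half-FTE")
    if a.touches_regulated_data:
        caveats.append("buy for the regulated workloads at least")
    return verdict, caveats


print(recommend(Answers(True, 400_000, 0.8, touches_regulated_data=True)))
```

The early returns mirror the "Don't argue with this" tone of the first two questions: nothing later in the tree can override them.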

The pattern I see: teams that "should" build often don't have the discipline, and teams that "should" buy often overestimate what they'll outgrow. Most companies between Profile A and B are best served by a managed platform for 18-24 months while they figure out what their workloads actually look like, then revisit.

The number that decides it: utilization

If I had to pick one metric to drive build-vs-buy, it's GPU utilization on your current platform. Pull it from CloudWatch, Databricks cluster metrics, or Vertex monitoring.

  • Under 40%: You have an autoscaling and scheduling problem, not a platform problem. Fix that first. Switching platforms won't help.
  • 40-65%: Managed is the right call. You're paying for elasticity and you're using it.
  • 65-85%: The sweet spot for self-hosted. Your workload is predictable enough to provision for, and the markup you're paying is no longer buying you elasticity you need.
  • Over 85%: You're under-provisioned and your engineers are waiting. Add capacity before you change platforms.
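The four bands above map directly onto a lookup. A minimal sketch follows. The text does not say which band owns the boundaries, so I assume 40 and 65 belong to the higher band and 85 to the "sweet spot".

```python
def utilization_advice(gpu_utilization_pct: float) -> str:
    """Map average GPU utilization (0-100) to the advice band from the list above."""
    if not 0 <= gpu_utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    if gpu_utilization_pct < 40:
        return "fix autoscaling and scheduling first"
    if gpu_utilization_pct < 65:
        return "managed is the right call"
    if gpu_utilization_pct <= 85:  # ASSUMPTION: exactly 85 still counts as the sweet spot
        return "sweet spot for self-hosted"
    return "add capacity before changing platforms"


for pct in (30, 50, 75, 90):
    print(pct, utilization_advice(pct))
```

Feed it a two-week average rather than a single day, since one busy training run can push a snapshot into the wrong band.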

I've watched teams spend six months migrating to "save money" when their actual problem was a 30% utilization rate that a week of Karpenter tuning would have fixed.

Hybrid is usually the right answer at scale

The Profile C teams I work with almost never run pure self-hosted or pure managed. The split that works:

  • Training and batch inference: Self-hosted on Ray + Kubernetes with spot instances. This is where the GPU bill lives, and where self-hosted savings are real.
  • Experiment tracking and model registry: MLflow self-hosted (it's fine, Postgres + S3, done) OR Weights & Biases if you want polish without engineering it.
  • Online inference: Managed endpoints (SageMaker, Vertex, or Modal/Replicate for spiky workloads). The operational cost of running production inference yourself — autoscaling, blue/green, canary, GPU memory management — is the highest hidden cost in the entire stack.
  • Feature store: Feast self-hosted on Redis + Postgres if your team has the chops, otherwise managed.
  • Data governance: Whatever your data team already uses. Don't pick a new tool here.

This hybrid pattern lets you capture 60-70% of the self-hosted savings without taking on the full operational burden. I built a version of this for a content automation pipeline last year — Ray on EKS for the embedding and reranking batch jobs, managed inference for the user-facing latency-sensitive paths. The combined bill was about 45% of what a pure Databricks build would have cost, and the on-call burden was manageable for one engineer.

What I'd do

If I were starting an ML platform decision tomorrow, here's the order:

  1. Measure utilization on what you have for two weeks. Don't skip this. Most "we need to switch platforms" conversations end here.
  2. Model TCO honestly, including the FTE you'll need. If you can't name the engineer who will own the self-hosted stack, you can't build it.
  3. Default to managed for the first 18 months of any new platform decision. Early on, an hour spent on platform engineering is almost always an hour better spent shipping models.
  4. Plan the exit on day one. Use open standards (MLflow tracking format, ONNX, Parquet, OpenLineage) so a future migration is a project, not a rewrite.
  5. Revisit annually. The right answer in 2026 may not be the right answer in 2027. Vendor pricing shifts, your scale shifts, your team shifts.

The build-vs-buy question is rarely about cost. It's about where you want your engineering attention. Managed platforms buy you focus. Self-hosted buys you control. Both are defensible. The wrong answer is the one you picked without modeling it.


If you're working through this decision and want a second set of eyes on the spreadsheet, the contact form is the easiest way to reach me. I keep a running version of this cost model and I'm happy to share the template. More posts on production AI infrastructure live on the blog.

Frequently asked questions

Should I self-host an ML platform or use Databricks/SageMaker/Vertex AI?

In my experience, the decision hinges on two things: whether you have a dedicated platform engineer and whether your managed spend is above ~$300k/year. Below that, buying (Databricks, SageMaker, or Vertex AI) almost always wins because a 0.5-1.0 FTE platform engineer at $90k-$180k loaded cost exceeds the vendor markup you'd save. Above that, self-hosting Ray + MLflow + Argo on EKS becomes genuinely competitive. At very large scale, a hybrid approach—self-hosted for steady-state batch training, managed for the long tail—is usually the right answer.

What is the real total cost of ownership (TCO) formula for an ML platform?

The honest TCO formula is: compute + storage + platform fee + engineering labor + egress + amortized exit cost. Most build-vs-buy debates only count the first three line items, which is why 'self-hosted is cheaper' breaks down once you need dedicated platform engineering. I also factor in a utilization_factor on compute, because reserved Databricks clusters typically run at 35-55% utilization while ephemeral SageMaker jobs hit 70%+. Exit cost matters too: I budget 3-6 engineer-months to migrate a mature Databricks, SageMaker, or Vertex workspace to anything else.

How much does Databricks really cost compared to self-hosting on EKS?

Based on my 2026 pricing models, Databricks runs about $94k/year for a seed-stage team, $612k for a growth-stage team, and $4.1M at scale. A self-hosted Ray + MLflow + Argo stack on EKS comes in at $58k, $402k, and $2.4M respectively—but once you add a 1.0 FTE platform engineer at ~$180k loaded, the self-hosted numbers jump to $238k, $582k, and $2.58M. So Databricks is actually cheaper at seed stage, roughly tied at growth stage, and meaningfully more expensive at scale. The Databricks DBU markup is typically 2-2.5x underlying EC2 compute for interactive clusters.

What hidden costs do teams miss when pricing SageMaker and Vertex AI?

On SageMaker, the silent killer is real-time endpoint cost—a single ml.g5.xlarge endpoint at 24/7 runs about $760/month, so 8 models with shadow variants easily hits $12k/month for inference alone before any training. On Vertex AI, pipeline component overhead, managed pipeline storage, and the Feature Store online serving per-node minimum (which doesn't scale down) add up quickly. Both platforms also have non-trivial exit costs because Pipelines metadata and managed feature stores are portable in theory but painful in practice. I always model these line items explicitly rather than trusting list-price calculators.

At what company size does self-hosting an ML platform start to make financial sense?

In the deployments I've scoped, self-hosting starts to pay off around the growth stage: roughly 6+ ML engineers, ~2,000 GPU-hours/month, and managed platform spend approaching or exceeding $300k/year. At that point, the vendor markup you stop paying (typically $150k-$200k/year) roughly covers a dedicated platform engineer's ~$180k loaded salary. Below that threshold, two ML engineers cannot also be a platform team, and the realistic TCO of self-hosting, once you allocate ~50% of an engineer to platform work, exceeds managed offerings. At very large scale you'll need 2-3 platform engineers, not one, and operational risk goes up.

Lazar Milicevic

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
