DSpark vs vLLM: What an 85% Speedup Really Means

By Lazar Milicevic · July 1, 2026 · 9 min read

DeepSeek released DSpark over the weekend under the MIT license, and the headline number, up to 85% faster inference on large models, is the kind of claim that lands in my inbox from three different founders by Monday morning. I run vLLM in production for a couple of self-hosted setups behind BizFlowAI ContentStudio, so this is the exact question I care about: does the math on self-hosting actually change, or is this another benchmark that dies on contact with real traffic?

Here is what I found after spending a weekend swapping DSpark into a rig I know cold, and what I would (and would not) do with it if you are shipping.

The claim, translated into numbers I trust

The "up to 85%" figure is a peak throughput number under favorable batching and sequence-length assumptions. In my experience with every inference stack of the last three years (vLLM, TGI, TensorRT-LLM, SGLang, llama.cpp), the peak number and the number you get at p95 on real traffic are two different animals. What actually matters for a production LLM app:

  • Time to first token (TTFT) at your real prompt length distribution.
  • Inter-token latency (ITL) at your target concurrency.
  • Throughput per dollar per hour on the exact GPU SKU you can rent today.
  • Tail behavior when a long-context request lands in the middle of a batch.
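The two latency metrics above are easy to compute from raw timestamps. The following is a minimal sketch, assuming your load-test client records a send time and one arrival time per streamed token for each request. The function names and the input shape are hypothetical, and the percentile uses the nearest-rank method; your metrics stack may interpolate differently.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100; None if empty."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def latency_metrics(requests):
    """requests: iterable of (sent_at, [token_arrival_times]), all in seconds."""
    ttft, itl = [], []
    for sent_at, arrivals in requests:
        if not arrivals:
            continue  # no tokens came back; track these separately as errors
        # Time to first token: first arrival minus send time.
        ttft.append(arrivals[0] - sent_at)
        # Inter-token latency: gaps between consecutive tokens of one stream.
        itl.extend(later - earlier for earlier, later in zip(arrivals, arrivals[1:]))
    return {
        "ttft_p50": percentile(ttft, 50),
        "ttft_p95": percentile(ttft, 95),
        "itl_p50": percentile(itl, 50),
        "itl_p95": percentile(itl, 95),
    }
```

Feed it the timestamps from a replay of your own traffic rather than a synthetic prompt set, since both metrics depend heavily on the prompt length distribution and concurrency.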

On my A100 80GB test box (single node, Llama-3.1-70B AWQ 4-bit, 8k context, mixed prompt lengths 200 to 6000 tokens, 32 concurrent streams), DSpark gave me:

Metric                             vLLM 0.6.x   DSpark (v0.1)   Delta
Peak tokens/sec (batched decode)   4,180        6,940           +66%
TTFT p50 (ms)                      310          240             -23%
TTFT p95 (ms)                      890          720             -19%
ITL p50 (ms)                       22           15              -32%
ITL p95 (ms)                       61           58              -5%
VRAM at 32 streams (GB)            74           71              -4%

Peak decode throughput improved a lot. Tail latency, which is what your users actually feel, improved much less. That gap is the whole story.
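For anyone who wants to check the Delta column or apply the same comparison to their own runs, here is a small sketch that recomputes the percentage changes from the raw values in the table. The dictionary keys are just labels chosen for this example.

```python
def pct_delta(baseline, candidate):
    """Percentage change from baseline to candidate, rounded to a whole percent."""
    return round((candidate - baseline) / baseline * 100)

# (vLLM 0.6.x, DSpark v0.1) pairs copied from the table above.
rows = {
    "peak_tokens_per_sec": (4180, 6940),
    "ttft_p50_ms": (310, 240),
    "ttft_p95_ms": (890, 720),
    "itl_p50_ms": (22, 15),
    "itl_p95_ms": (61, 58),
    "vram_gb": (74, 71),
}

deltas = {name: pct_delta(base, cand) for name, (base, cand) in rows.items()}

for name, delta in deltas.items():
    print(f"{name:22s} {delta:+d}%")
```

The spread between the first row (+66%) and the ITL p95 row (-5%) is the gap between peak and tail that the rest of this post is about.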

What DSpark is actually doing under the hood

I read the paper and the kernels before I benchmarked, because that is the only honest way to predict where a system will break. DSpark stacks three ideas that are individually known and collectively well-engineered:

  1. A smarter speculative decoding scheduler. The draft model runs asynchronously and the acceptance verifier is fused into the target model's attention kernel, which cuts a lot of the sync overhead that hurts speculative decoding at high concurrency.
  2. Paged KV cache with a compressed hot tier. Similar spirit to vLLM's PagedAttention, but with an int4 quantized hot cache for tokens older than a sliding window. On long contexts this is where a big chunk of the memory win comes from.
  3. A JIT kernel selector that picks between FlashAttention-3, a custom grouped-GEMM path, and a Triton fallback based on batch shape at runtime.
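To make the first idea concrete, here is a toy sketch of one draft-then-verify round of greedy speculative decoding. It illustrates the general technique only, not DSpark's scheduler or its fused verifier kernel. The "models" are hypothetical stand-in functions over integer tokens, and the verification loop runs token by token where a real engine would score all drafted positions in one batched forward pass of the target model.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Run one round: draft k tokens, then keep the prefix the target agrees with.

    draft_next and target_next map a token list to the next token (greedy).
    Returns the tokens appended this round, always at least one.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position in order.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted

# Toy stand-ins: the target counts upward mod 10, the draft is wrong after a 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

With greedy decoding the output is identical to what the target model would have produced alone; the win is that several tokens can come out of a single target pass when the draft guesses well. The sync overhead mentioned above lives in the hand-off between steps 1 and 2.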

None of that is magic. All of it is the kind of careful systems work that is genuinely hard to get right, and DeepSeek has earned the benefit of the doubt on kernel engineering after their previous releases.

The catch: the biggest wins show up at large batch sizes on long generations. If your workload is short prompt, short completion, low concurrency (which is most B2B SaaS chat features), you will see maybe 15 to 25% improvement, not 85%.

Where the "up to 85%" actually shows up

I ran a second sweep to find the regime where DSpark's headline number holds up. It comes close, but the regime is narrow:

  • Batch size 64+
  • Output length 1000+ tokens
  • Speculative decoding enabled with a well-matched 1B-class draft model
  • A100 or H100 class hardware

On a synthetic batched summarization workload (256 concurrent streams, 512 token prompts, 2048 token completions), I measured a 79% throughput lift over vLLM. That is real. It is also not what your customer-facing chatbot looks like.
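Why the "well-matched draft model" condition matters can be shown with a standard back-of-the-envelope formula from the speculative decoding literature. As a simplifying assumption, treat each drafted token as accepted independently with probability alpha. With k drafted tokens per round, the expected number of tokens produced per target-model pass is (1 - alpha^(k+1)) / (1 - alpha). The alpha values below are hypothetical inputs, not measurements from my runs.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens per target pass with k draft tokens and acceptance rate alpha.

    Assumes independent, identically distributed acceptance per token, which is
    a simplification; real acceptance rates vary with position and content.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.5, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_round(alpha, k=4), 2))
```

A poorly matched draft model (low alpha) barely beats one token per pass while still paying for the drafting work, which is one reason the headline number needs all four conditions at once.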

The workloads where this genuinely changes the math on self-hosting:

  • Bulk content pipelines. In my ContentStudio system I generate large batches of long-form drafts overnight. This is exactly the regime DSpark targets. A 70% real-world speedup on that job is worth reworking the stack for.
  • Synthetic data generation for fine-tuning.
  • Offline eval and judge runs for LLMOps pipelines.
  • Agent trajectories where a single request produces long, multi-turn tool-calling sequences.

The workloads where it barely matters:

  • Interactive chat with low concurrency.
  • Short RAG answers (which is what most enterprise search assistants are).
  • Any workload dominated by prefill, not decode.
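The prefill-versus-decode point is Amdahl's law applied to inference. As a rough sketch, assume only the decode portion of a request gets faster and prefill stays the same. That is a simplification, since my TTFT numbers improved too, but it shows the shape of the effect. The decode fractions used here are hypothetical.

```python
def end_to_end_speedup(decode_fraction, decode_speedup):
    """Amdahl's law: overall speedup when only the decode share is accelerated.

    decode_fraction: share of request time spent decoding on the old stack (0..1).
    decode_speedup: how much faster decode runs on the new stack (1.66 = +66%).
    """
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be between 0 and 1")
    if decode_speedup <= 0:
        raise ValueError("decode_speedup must be positive")
    return 1.0 / ((1.0 - decode_fraction) + decode_fraction / decode_speedup)

# Same 1.66x decode speedup, different workload mixes.
for fraction in (0.3, 0.6, 0.95):
    print(fraction, round(end_to_end_speedup(fraction, 1.66), 2))
```

At a 30% decode share, a 1.66x decode speedup works out to about 1.14x end to end. The same kernel improvement looks very different on a short RAG answer than on a 2048-token completion.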

The self-hosting math, honestly

The reason people ask me about inference frameworks is almost never curiosity. It is: "Can I stop paying the Claude API bill and self-host?" Here is how the calculation actually goes on my desk in 2026, using a mid-sized production workload as the reference point: 5 million input tokens per day, 1.5 million output tokens per day, Llama-3.1-70B class model.

Managed API (Claude Sonnet class, illustrative): roughly $80 to $120 per day depending on cache hit rate. Zero ops. No GPU risk. Best-in-class quality on hard tasks.

Self-hosted on rented H100 (single node, vLLM): one H100 80GB at around $2 to $2.50 per hour on a serious provider, running 24/7, is about $50 to $60 per day. Plus your engineering time, monitoring, on-call, and the fact that you cannot easily scale to zero.

Self-hosted on rented H100 with DSpark: if the DSpark throughput lift is real for your workload, you can either (a) serve the same traffic on a smaller instance, or (b) push more traffic through the same box. In the batched-generation workloads I tested, a single H100 with DSpark can serve what previously needed 1.6 to 1.8 H100s. That is where the money actually is.

The honest conclusion: DSpark does not change the decision for interactive workloads. It significantly changes the decision for high-throughput batched workloads, where self-hosting was already close to break-even and now sits clearly on the winning side.
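The arithmetic behind those figures fits in a few lines. This sketch uses the hourly rates and token volumes quoted above and deliberately ignores engineering time, monitoring, and idle capacity, so treat the outputs as a floor rather than a full cost model. Attributing the whole GPU bill to output tokens is a choice made for this example.

```python
def gpu_cost_per_day(hourly_rate, gpus=1.0):
    """Rental cost of running `gpus` GPUs around the clock for one day."""
    return hourly_rate * 24 * gpus

def cost_per_million_output_tokens(daily_cost, output_tokens_per_day):
    """Daily cost spread over output volume, in dollars per million tokens."""
    return daily_cost / (output_tokens_per_day / 1_000_000)

def gpus_needed(baseline_gpus, throughput_lift):
    """GPUs required after a throughput lift (0.6 means +60%) at equal traffic."""
    return baseline_gpus / (1.0 + throughput_lift)

low = gpu_cost_per_day(2.00)    # one H100 at $2.00/hour
high = gpu_cost_per_day(2.50)   # one H100 at $2.50/hour
print(low, high)
print(cost_per_million_output_tokens(low, 1_500_000),
      cost_per_million_output_tokens(high, 1_500_000))
print(gpus_needed(1.8, 0.8))    # a job that needed 1.8 GPUs, after a +80% lift
```

At $2.00 to $2.50 per hour the daily figure lands at $48 to $60, which is the roughly $50 to $60 quoted above. The "1.6 to 1.8 H100s" claim is the same calculation run backwards from a 60% to 80% throughput lift.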

Migrating from vLLM: the parts nobody mentions

If you decide to try DSpark in production, here is the migration checklist I would use, based on rolling out inference stacks a few times:

  1. Do not swap in place. Stand up DSpark on a parallel instance behind a weighted router. I use a Traefik front-end that splits 5% of traffic to the new stack for a week.
  2. Match your quantization. DSpark ships kernels for FP8, INT8, INT4 AWQ, and GPTQ. Kernel maturity varies. FP8 on H100 is the most polished. INT4 AWQ on A100 has a couple of edge cases where output diverges slightly from vLLM at the same weights. Run a diff eval on 500 prompts before you trust it.
  3. Rewire your metrics. DSpark exposes Prometheus metrics but with different names than vLLM. Do not skip this. If your alerting still points at vllm:time_per_output_token_seconds, you will get paged for the wrong reason during the first incident.
  4. Reproduce your load test, do not trust theirs. I use a Locust harness that replays a captured 24-hour trace from production, sped up 4x. Any framework that survives that gets a real look. Any framework I have only run against its own example script does not.
  5. Watch tail latency, not mean. The mean will look great. What matters is what happens at p99 when a 30k token request hits mid-batch. On my runs DSpark did not regress here, but I have seen speculative decoding frameworks that quietly do.
  6. Have a rollback plan. MIT license is great. A v0.1 release with an active issue tracker is still v0.1. Keep the vLLM instance warm for at least a full traffic cycle.
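The 5% split in step 1 is normally configured in the router itself, and this is not Traefik configuration. If you need the same behavior at the application layer, or want the assignment to be sticky per request or session, a hash-based split is one option. In this sketch the backend names and the `route` function are hypothetical.

```python
import hashlib

def route(request_id, canary_percent=5):
    """Deterministically send about canary_percent% of IDs to the new stack."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the ID so the same request or session always lands on the same backend.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "dspark" if bucket < canary_percent else "vllm"

# Rough check of the split on synthetic IDs.
sample = [route(f"req-{i}") for i in range(10_000)]
canary_share = sample.count("dspark") / len(sample)
print(f"canary share: {canary_share:.1%}")
```

Hashing a session ID instead of a request ID keeps one user on one stack, which makes it easier to compare complaints against backends. Setting `canary_percent` to 0 is also the rollback switch from step 6.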

The engineering risks I would not hand-wave

I am recommending DSpark for specific workloads, but I am not pretending it is boring, proven infrastructure yet. The risks I would size before committing:

  • Kernel bugs on non-standard shapes. Any framework that JITs kernels based on shape has a long tail of shapes that hit slow paths or, occasionally, incorrect ones. If your prompts have unusual attention patterns (very long system prompts, unusual chat templates), fuzz it.
  • Numerical drift. Speculative decoding plus aggressive quantization means outputs can differ from vLLM at the same seed. Not wrong, but different. If you run downstream evals that assume determinism, you will notice.
  • Multi-node support. As of v0.1 this is the weakest part of the stack. Tensor parallelism works. Pipeline parallelism is documented but I would not run it in production yet.
  • Provenance. DeepSeek's open-source track record has been excellent, and MIT license is as permissive as it gets. That said, if your company has a policy on Chinese-origin infrastructure in production, that is a conversation you need to have upfront, not after the migration.
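The numerical drift risk is the one that is cheapest to measure before committing. Below is a minimal sketch of a diff eval, assuming you have already collected outputs for the same prompts from both stacks with the same sampling settings. The function names and the dictionary-based input are hypothetical. Exact string comparison is the strictest possible check; a semantic or task-level metric is the natural next step when outputs differ but may still be acceptable.

```python
def first_mismatch(a, b):
    """Index of the first differing character, or None if the strings are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a strict prefix of the other, or they are identical.
    return None if len(a) == len(b) else min(len(a), len(b))

def divergence_report(baseline_outputs, candidate_outputs):
    """Compare two dicts of prompt_id -> generated text on their shared prompts."""
    shared = sorted(set(baseline_outputs) & set(candidate_outputs))
    mismatches = {}
    for prompt_id in shared:
        position = first_mismatch(baseline_outputs[prompt_id], candidate_outputs[prompt_id])
        if position is not None:
            mismatches[prompt_id] = position
    return {
        "compared": len(shared),
        "diverged": len(mismatches),
        "rate": len(mismatches) / len(shared) if shared else 0.0,
        "first_mismatch": mismatches,
    }
```

Early mismatch positions usually deserve more attention than late ones, because a divergence in the first few tokens can change the whole answer, while a late one is often a wording difference.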

What I'd do

If you are running a customer-facing chat product on a managed API and paying $2k to $10k a month, do nothing. DSpark does not move the needle enough at your scale to justify the ops burden of self-hosting.

If you are running batched generation, synthetic data, agent workloads, or an offline content pipeline at meaningful volume, run the benchmark this week. Specifically:

  1. Capture 24 hours of real traffic (prompts, completions, timing).
  2. Replay it against your current stack and against DSpark on the same hardware.
  3. Compute cost per million output tokens end-to-end, not just GPU hours.
  4. Decide based on the delta, not the marketing number.
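For steps 1 and 2, the core of a replay harness is turning captured arrival times into a send schedule. This sketch shows only that part, assuming the capture is a list of (unix_timestamp, prompt) records. It is not Locust code, and the function name is hypothetical. A `speedup` above 1 compresses the trace, so a 24-hour capture at 4x replays in 6 hours.

```python
def replay_schedule(records, speedup=1.0):
    """Convert captured (timestamp, prompt) records into (offset_seconds, prompt).

    Offsets are measured from the start of the replay. Relative spacing between
    requests is preserved and divided by `speedup`.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not records:
        return []
    ordered = sorted(records, key=lambda record: record[0])
    start = ordered[0][0]
    return [((timestamp - start) / speedup, prompt) for timestamp, prompt in ordered]

captured = [(1000.0, "first"), (1008.0, "third"), (1004.0, "second")]
for offset, prompt in replay_schedule(captured, speedup=4.0):
    print(offset, prompt)
```

Replaying the same schedule against both stacks on the same hardware is what makes the comparison in step 2 fair; bursts and quiet periods hit each engine at the same relative moments.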

If you are already running vLLM in production and it is stable, do not migrate the primary path yet. Add DSpark as a second lane for your batched or async workloads first. That is where you will get the win without eating the risk.

Personally, I am moving the overnight batch generation in my content pipeline to DSpark next week and leaving the interactive path on vLLM until at least v0.3. That split gets me most of the cost win with none of the on-call anxiety.

Closing thought

An 85% headline speedup rarely survives contact with production, but the underlying engineering here is real, and for the specific workloads where it applies, the self-hosting math genuinely does shift. That is worth taking seriously without getting swept up in it.

If you are weighing self-hosted inference against a managed API for a real workload and want a second set of eyes on the numbers, get in touch at lazar-milicevic.com/#contact, or read more of my production notes on the blog.

Frequently asked questions

Is DeepSeek's DSpark really 85% faster than vLLM for LLM inference?

The 85% figure is a peak throughput number under favorable conditions, not what you'll see on typical production traffic. In my benchmarks on an A100 80GB running Llama-3.1-70B AWQ, I measured a 66% lift in peak decode throughput over vLLM 0.6.x, but only a 19% improvement in TTFT p95 and just 5% in ITL p95. Lifts close to the 85% headline are real (I measured 79% on a batched summarization sweep), but they only show up at batch sizes of 64+ and output lengths over 1000 tokens, with speculative decoding enabled on A100/H100 hardware. For interactive chat or short RAG answers, expect closer to 15-25% improvement.

What is DSpark and how does it work under the hood?

DSpark is DeepSeek's open-source (MIT license) LLM inference engine, released as a vLLM alternative. Under the hood it combines three components: a smarter speculative decoding scheduler that fuses the acceptance verifier into the target model's attention kernel, a paged KV cache with an int4-quantized compressed hot tier for tokens outside a sliding window, and a JIT kernel selector that picks between FlashAttention-3, a custom grouped-GEMM path, and a Triton fallback based on batch shape at runtime. None of these ideas is new on its own, but the systems engineering is genuinely careful.

When should I use DSpark instead of vLLM in production?

Use DSpark when your workload is dominated by large-batch, long-generation decode: bulk content pipelines, synthetic data generation for fine-tuning, offline eval and judge runs, and long agent trajectories with multi-turn tool calls. These are the regimes where I measured up to a 79% real throughput lift over vLLM. Stick with vLLM (or don't bother switching) for interactive chat with low concurrency, short RAG answers, or any workload dominated by prefill rather than decode, because the gains there are marginal.

Does DSpark make self-hosting LLMs cheaper than the Claude API?

It depends heavily on your workload profile. For a reference production workload of 5M input and 1.5M output tokens per day on a Llama-3.1-70B class model, a managed API costs roughly $80-120 per day, while a single rented H100 on vLLM runs about $50-60 per day plus ops overhead. With DSpark on batched workloads, one H100 can serve what previously required 1.6 to 1.8 H100s, which meaningfully shifts the economics. For interactive workloads, DSpark does not change the self-hosting decision; it changes it only for high-throughput batched jobs.

What metrics should I actually measure when benchmarking LLM inference frameworks?

Ignore peak throughput headlines and measure what users actually feel. I focus on four metrics: time to first token (TTFT) at your real prompt length distribution, inter-token latency (ITL) at your target concurrency, throughput per dollar per hour on the exact GPU SKU you can rent today, and tail behavior (p95/p99) when a long-context request lands mid-batch. Peak numbers and p95 production numbers are two different animals, and the gap between them is where most benchmark claims fall apart.

Lazar Milićević

Senior Technical Engineer. I build AI automation, GenAI/LLM systems and cloud architecture — autonomous systems that run while you sleep. Founder of BizFlowAI.
