Faster AI, lower costs: DSpark eases inference bottlenecks and chip strain, says DeepSeek

Start-up unveils speculative decoding framework that speeds up inference by up to 85 per cent amid China’s push to overcome US AI curbs

DeepSeek says DSpark delivers faster results and lower costs, marking a breakthrough in inference. Photo: Reuters
Ben Jiang in Beijing
Chinese artificial intelligence start-up DeepSeek has rolled out a major upgrade to its flagship V4 model aimed at sharply accelerating AI response generation, as competition among Chinese developers increasingly shifts to reducing serving costs and enhancing user experience.
DeepSeek said that by adopting DSpark, which it called a speculative decoding framework, it increased per-user response speeds by up to 85 per cent, an efficiency gain that could reduce AI systems’ reliance on larger, more powerful chip infrastructure.

AI models’ conventional token-by-token output often slowed when responses were lengthy, leading to low utilisation of graphics processing units (GPUs) and long user-perceived waiting times, which was a “primary bottleneck in serving AI”, the company said in research published on Saturday.

DeepSeek said the DSpark module accelerated AI response generation, also known as AI inference, which refers to serving a trained model to respond to user queries. It did so by using a lightweight draft model to propose candidate responses and then verifying them in batches with a larger model.
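The draft-then-verify loop can be illustrated with a minimal Python sketch. It is not DeepSeek's code: the toy `toy_target` and `toy_draft` functions, the `speculative_decode` name and the `k` parameter are all assumptions made for illustration. The sketch uses the simple greedy variant of speculative decoding, in which a drafted token is kept only if the larger model would have chosen the same token.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "next token" function


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical large model: counts upwards modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical small model: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], max_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification passes."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks every drafted position. A real system would
        #    score all k positions in one batched forward pass; this toy
        #    version simulates that pass with a plain loop.
        target_passes += 1
        accepted: List[Token] = []
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected == tok:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the same pass yields one more token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[: len(prompt) + max_new], target_passes
```

Because every kept token is one the larger model would have produced anyway, the output matches ordinary token-by-token decoding. The gain comes from needing fewer passes through the larger model than there are generated tokens.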

DSpark further refined the approach with a semi-autoregressive generation method, allowing the model to produce small chunks of tokens rather than strictly one at a time.

The new technique could reduce the computing resources needed to serve AI systems, according to a programmer. Shutterstock

It also introduced a confidence-based scheduling system that dynamically adjusted how much verification was applied based on computing demand, helping balance speed and output quality.
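The article does not describe how the scheduler works internally. As a purely hypothetical illustration of the general idea, the Python sketch below decides how many drafted tokens to send for verification: it keeps the longest run of draft tokens whose combined confidence stays above a threshold, and raises that threshold as the system gets busier. The function name `verification_budget`, the `load` value between 0 and 1, and the threshold formula are all assumptions, not details from DeepSeek's research.

```python
from typing import List


def verification_budget(confidences: List[float], load: float, max_k: int = 8) -> int:
    """Return how many drafted tokens to submit for verification.

    confidences: the draft model's probability for each drafted token, in order.
    load: hypothetical measure of current computing demand, 0.0 (idle) to 1.0 (saturated).
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0 and 1")

    # Assumed rule: a busier system demands more confident drafts,
    # so fewer speculative tokens are verified under heavy load.
    threshold = 0.2 + 0.6 * load

    keep = 0
    joint = 1.0  # running probability that the whole prefix is correct
    for p in confidences[:max_k]:
        joint *= p
        if joint < threshold:
            break
        keep += 1
    return keep
```

With draft confidences of 0.9, 0.9, 0.8, 0.5 and 0.9, this rule would verify all five tokens on an idle system but only the first two on a saturated one, trading some speculative speed-up for less wasted verification work.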
