Faster AI, lower costs: DSpark eases inference bottlenecks and chip strain, says DeepSeek
Start-up unveils speculative decoding framework that speeds up inference by up to 85 per cent amid China’s push to overcome US AI curbs

Conventional token-by-token generation slowed down when responses were long, leaving graphics processing units (GPUs) underused and users waiting longer, which was a “primary bottleneck in serving AI”, the company said in research published on Saturday.
DeepSeek said the DSpark module accelerated AI response generation, also known as AI inference, which refers to serving a trained model to respond to user queries. It did so by using a lightweight draft model to propose candidate responses, which a larger model then verified in batches, speeding up output.
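The draft-and-verify loop described above can be illustrated with a minimal sketch of generic greedy speculative decoding. It is not DeepSeek's implementation: `target_model`, `draft_model` and `speculative_decode` are hypothetical toy stand-ins operating on integer tokens, and the draft model is deliberately wrong at every fourth position so that rejections occur.

```python
from typing import List, Tuple

VOCAB = 50  # assumed toy vocabulary size


def target_model(prefix: List[int]) -> int:
    """Hypothetical large model: deterministic greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_model(prefix: List[int]) -> int:
    """Hypothetical small model: agrees with the target except when the
    prefix length is a multiple of four."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % VOCAB


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_model(tokens))
    return tokens


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens cheaply, then keep the prefix the target agrees with."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens one after another.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. The target checks every drafted position. A real system would
        #    score all k positions in one batched forward pass, so this
        #    whole loop is counted as a single target pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_model(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # draft was right, keep it
            else:
                accepted.append(expected)  # replace the first wrong token
                break                      # and discard the rest of the draft
        tokens.extend(accepted)
    return tokens[:goal], target_passes


if __name__ == "__main__":
    out, passes = speculative_decode([1, 2, 3], n_new=20, k=4)
    print(out == greedy_decode([1, 2, 3], 20), passes)
```

Because every accepted token is one the target model would have chosen anyway, the output matches plain greedy decoding while needing fewer target passes. Production systems typically verify against probability distributions rather than exact matches.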
DSpark further refined the approach with a semi-autoregressive generation method, allowing the model to produce small chunks of tokens rather than strictly one at a time.
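The difference between one-at-a-time and chunked generation can be sketched as follows. This is an illustration under stated assumptions rather than DSpark's method: `toy_chunk_model` is a hypothetical stand-in that emits a whole chunk per call, and the example only counts how many model calls each setting needs.

```python
from typing import List, Tuple


def toy_chunk_model(prefix: List[int], chunk: int) -> List[int]:
    """Hypothetical model call that returns `chunk` tokens at once.
    The toy rule simply continues counting from the last token."""
    start = prefix[-1] + 1 if prefix else 0
    return [(start + i) % 100 for i in range(chunk)]


def generate(prompt: List[int], n_new: int, chunk: int) -> Tuple[List[int], int]:
    """Generate n_new tokens, emitting up to `chunk` tokens per model call."""
    if chunk < 1:
        raise ValueError("chunk must be at least 1")
    tokens = list(prompt)
    goal = len(prompt) + n_new
    calls = 0
    while len(tokens) < goal:
        step = min(chunk, goal - len(tokens))  # do not overshoot the goal
        tokens.extend(toy_chunk_model(tokens, step))
        calls += 1
    return tokens, calls


if __name__ == "__main__":
    one_by_one, calls_1 = generate([0], n_new=12, chunk=1)
    chunked, calls_4 = generate([0], n_new=12, chunk=4)
    print(calls_1, calls_4, one_by_one == chunked)
```

In this toy the two settings produce identical tokens. With a real model, tokens inside a chunk are predicted with less context than strictly sequential decoding would give them, which is why a later verification step matters.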

It also introduced a confidence-based scheduling system that dynamically adjusted how much verification was applied based on computing demand, helping balance speed and output quality.
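One way such a scheduler could work is sketched below. The rule is an assumption made for illustration, not the one in DeepSeek's research: the function `tokens_to_verify`, its `load` input and the `base_threshold` default are all hypothetical. The idea is that drafted tokens are sent for verification only while the draft model's running confidence stays above a threshold, and the threshold rises as the system gets busier.

```python
from typing import Sequence


def tokens_to_verify(confidences: Sequence[float], load: float,
                     base_threshold: float = 0.3) -> int:
    """Return how many leading draft tokens to send for verification.

    confidences: draft-model probability for each drafted token, in order.
    load: assumed system load in [0, 1]; 0 is idle, 1 is saturated.
    """
    load = min(max(load, 0.0), 1.0)  # clamp to the valid range
    # Higher load means a stricter threshold, so fewer tokens are verified.
    threshold = base_threshold + (1.0 - base_threshold) * load
    joint = 1.0  # running probability that every token so far is correct
    count = 0
    for p in confidences:
        joint *= p
        if joint < threshold:
            break
        count += 1
    return count


if __name__ == "__main__":
    draft_conf = [0.9, 0.8, 0.7, 0.6]
    for load in (0.0, 0.5, 1.0):
        print(load, tokens_to_verify(draft_conf, load))
```

With these example confidences, an idle system verifies all four drafted tokens, a half-loaded one verifies two, and a saturated one verifies none, falling back to ordinary decoding.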