Tencent’s ‘training-free’ AI model improvement technique sparks debate
Paper argues that large language models can improve through experience on the job without needing to change their parameters

Researchers at Tencent Holdings have proposed a new “lightweight” technique that lets AI models improve from “experience” without retraining, sparking debate over whether it could be the key to more cost-effective continual learning.
The paper, titled “Training-Free Group Relative Policy Optimisation” and published last week on the open-access repository arXiv, argued that large language models (LLMs) can improve through on-the-job experience without needing to change their parameters.
Current training methods for making LLMs more useful in real-life tasks rely on techniques such as reinforcement learning, in which the model’s parameters – the variables encoding its “intelligence” – are adjusted through algorithms such as Group Relative Policy Optimisation (GRPO).
Under GRPO, the model makes multiple attempts at a task, then adjusts its parameters based on the scores of those attempts. However, this process can be slow and computationally costly.
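In rough terms, GRPO’s scoring step works like the toy Python sketch below: each attempt’s reward is compared with the average across its group, and those relative scores are what would normally drive a parameter update. The reward values and names here are illustrative only, not drawn from the paper.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Score each attempt relative to the group average, scaled by the spread."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    if sigma == 0:
        sigma = 1.0  # all attempts scored the same; avoid division by zero
    return [(r - mu) / sigma for r in rewards]

# Four attempts at the same task, each scored by some task-specific reward.
attempt_rewards = [0.2, 0.9, 0.4, 0.9]
print(group_relative_advantages(attempt_rewards))
# In standard GRPO, these relative scores weight a gradient update that changes
# the model's parameters, which is the slow and costly step the paper avoids.
```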
Instead, the researchers from Tencent’s AI research lab suggested that LLMs could simply log the rules and heuristics from this GRPO process in an “experience library” and deploy them when faced with new tasks.

The paper provided examples of the kinds of heuristics the model comes up with on its own, such as: “When solving geometry problems with intersections, validate solutions lie within bounded regions or segments, not on extensions, to avoid extraneous answers.”
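The sketch below is a hypothetical illustration, not the paper’s implementation, of how such an experience library might work in practice: heuristics like the one above are stored as plain text and prepended to the prompt for new tasks, so the model’s weights never change. The function names, including the call_llm placeholder, are invented for this example.

```python
# Hypothetical sketch, not the paper's code: instead of updating weights, store
# natural-language heuristics distilled from past attempts and prepend them to
# the prompt for each new task. `call_llm` stands in for any chat-completion
# API and is an assumed placeholder.
experience_library: list[str] = []

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    raise NotImplementedError("plug in a model client here")

def record_experience(heuristic: str) -> None:
    """Log a rule or heuristic distilled from a batch of scored attempts."""
    experience_library.append(heuristic)

def solve_with_experience(task: str) -> str:
    """Answer a new task with the accumulated experience prepended as context."""
    lessons = "\n".join(f"- {h}" for h in experience_library)
    prompt = f"Useful lessons from past tasks:\n{lessons}\n\nTask: {task}"
    return call_llm(prompt)

record_experience(
    "When solving geometry problems with intersections, validate solutions lie "
    "within bounded regions or segments, not on extensions."
)
```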