CW Pakistan
  • Wired
University Researchers Introduce Inference-Time Hyper-Scaling For Faster AI Reasoning

  • December 28, 2025
Scaling Artificial Intelligence usually comes at a high cost, demanding more memory and longer processing time. Researchers from the University of Warsaw, NVIDIA, and the University of Edinburgh have developed a new approach called Inference-Time Hyper-Scaling that allows Large Language Models to reason more efficiently while consuming less memory. The technique builds on Dynamic Memory Sparsification, a method that optimizes memory usage during text generation, enabling faster and more effective AI reasoning without extensive hardware resources.

Modern Large Language Models, including OpenAI’s o1 and DeepSeek’s R1, rely on generating long chains of thought to enhance their reasoning abilities. However, as a model produces more text, its Key-Value (KV) cache grows linearly, creating a memory bottleneck: retrieving the ever-larger cache from memory becomes a significant cost and slows down generation, leaving the model both slower and more memory-intensive. In effect, the more the model attempts to think, the greater the demand on memory and bandwidth, limiting overall efficiency and throughput.
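The linear growth described above is easy to make concrete with a back-of-the-envelope calculation. The sketch below estimates KV cache size for an assumed 32B-class model; the layer count, KV head count, and head dimension are illustrative figures, not taken from the article:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Approximate KV cache size: one key and one value vector per token,
    per layer, per KV head, stored in fp16 (2 bytes) by default."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K + V
    return per_token * seq_len

# Assumed 32B-class shape: 64 layers, 8 KV heads (grouped-query attention),
# head dimension 128 -- every figure here is an illustration.
print(kv_cache_bytes(64, 8, 128, 1_000) / 2**20)   # 250.0 MiB at 1k tokens
print(kv_cache_bytes(64, 8, 128, 32_000) / 2**30)  # 7.8125 GiB at 32k tokens
```

Doubling the chain-of-thought length doubles this cost, which is exactly the bottleneck the researchers target.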

Dynamic Memory Sparsification (DMS) addresses this challenge by introducing a smart token eviction policy. Instead of removing tokens immediately, DMS employs a delayed eviction system that keeps tokens in a temporary sliding window, allowing the model to extract critical information before discarding them. This approach requires only 1,000 training steps to achieve an 8x compression ratio and can be retrofitted onto existing pre-trained models using logit distillation. Unlike traditional compression methods, DMS avoids costly retraining while efficiently managing memory.
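As a rough illustration of delayed eviction, the toy sketch below holds each token in a sliding window before deciding whether to keep it. The importance score and threshold here are hypothetical stand-ins: the real DMS learns its eviction decisions via logit distillation rather than using a fixed rule.

```python
from collections import deque

def delayed_eviction_cache(tokens, window=4, keep_score=0.5):
    """Toy delayed-eviction cache: a token is only considered for
    eviction after it leaves the sliding window, so the model can
    still attend to it while it is recent."""
    kept = []
    buf = deque()  # the temporary sliding window
    for tok in tokens:
        buf.append(tok)
        if len(buf) > window:
            candidate = buf.popleft()
            # Deferred decision: keep only high-importance tokens
            # (hypothetical fixed gate; DMS learns this per head).
            if candidate["score"] >= keep_score:
                kept.append(candidate)
    kept.extend(buf)  # everything still inside the window survives
    return kept

toks = [{"id": i, "score": 0.9 if i % 4 == 0 else 0.1} for i in range(12)]
print([t["id"] for t in delayed_eviction_cache(toks)])  # [0, 4, 8, 9, 10, 11]
```

Half the tokens are evicted, yet the most recent window is always intact, mirroring the compression-without-forgetting behaviour the paragraph describes.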

The impact of the resulting Inference-Time Hyper-Scaling is significant. By compressing the Key-Value cache, models can explore more reasoning paths within the same computational budget, improving performance across multiple benchmarks. For example, a DMS-equipped Qwen-R1 32B model achieved a 12-point improvement on the AIME 24 benchmark and notable gains on GPQA and LiveCodeBench. DMS also outperforms other efficiency baselines such as Quest and TOVA, providing better accuracy at lower memory usage. Smaller models, such as Qwen3-8B, match the accuracy of their uncompressed counterparts while delivering up to five times higher throughput.
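The "more reasoning paths within the same budget" claim follows from simple arithmetic. Assuming a hypothetical 80 GiB memory budget and an 8 GiB per-chain KV cache (illustrative numbers, not from the paper), an 8x cache compression lets eight times as many chains of thought run concurrently:

```python
def parallel_chains(memory_budget_gib, cache_gib_per_chain, compression=1):
    """How many reasoning chains fit in a fixed memory budget when the
    per-chain KV cache is shrunk by the given compression factor."""
    return int(memory_budget_gib // (cache_gib_per_chain / compression))

# Hypothetical budget and cache size, purely for illustration.
print(parallel_chains(80, 8, compression=1))  # 10 chains uncompressed
print(parallel_chains(80, 8, compression=8))  # 80 chains at 8x compression
```

More concurrent chains means more candidate reasoning paths to sample and select from, which is where the benchmark gains come from.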

This research demonstrates that achieving smarter AI does not always require larger GPUs or more computational power. By effectively managing memory and employing intelligent compression strategies like Dynamic Memory Sparsification, Large Language Models can operate faster, more efficiently, and with reduced hardware strain, paving the way for more accessible and scalable AI solutions.

Follow the SPIN IDG WhatsApp Channel for updates across the Smart Pakistan Insights Network covering all of Pakistan’s technology ecosystem. 

Related Topics
  • AI
  • AI Efficiency
  • Deep Learning
  • Dynamic Memory Sparsification
  • Inference-Time Hyper-Scaling
  • Large Language Models
  • Machine Learning
  • NVIDIA
  • University of Edinburgh
  • University of Warsaw