Quantization Process - 搜索 News

Dnotitia's STAR-KV cuts KV cache by up to 20x, earns ICML 2026 Spotlight selection

KV, a low-rank KV cache compression method achieving up to 20x reduction, with the paper selected as a Spotlight at ICML 2026 ...

TMCnet

Dnotitia Unveils STAR-KV, Achieving UP to 20x KV Cache Compression, Selected as an ICML ...

Introduces a low-rank-based approach to KV cache compression, one of the key bottlenecks in long-context AI; Speeds up ...

4 天

Changing AI math could reduce the hardware burden, researchers show

Sophisticated AI models tend to require a lot of memory and take up a lot of storage space. One of the ways to reduce that ...

FinanceFeeds

What is the Zcash Crosslink Hybrid Model?

Zcash is building a new consensus layer that keeps mining alive while adding a stake-based finality check. The proposed ...

XDA Developers on MSN

My 7-year-old GPU runs local AI perfectly, and I don't need my cloud subscriptions anymore

You don't always need an RTX 5090 to run useful models ...

29 天

The latest Gemma 4 models use a training trick to slash their on-device memory footprint

You can now download Gemma 4 models with quantization-aware training to reduce the amount of mobile memory required to 1GB.

Hackaday

TurboQuant: Reducing LLM Memory Usage With Vector Quantization

Large language models (LLMs) aren’t actually giant computer brains. Instead, they are massive vector spaces in which the probabilities of tokens occurring in a specific order is encoded. Billions of ...

IEEE

Quantization via Distillation and Contrastive Learning

Abstract: Quantization is a critical technique employed across various research fields for compressing deep neural networks (DNNs) to facilitate deployment within resource-limited environments. This ...

VentureBeat

Google's new TurboQuant algorithm speeds up AI memory 8x, cutting costs by 50% or more

As Large Language Models (LLMs) expand their context windows to process massive documents and intricate conversations, they encounter a brutal hardware reality known as the "Key-Value (KV) cache ...

GitHub

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Atom is an accurate low-bit weight-activation quantization algorithm that combines (1) mixed-precision, (2) fine-grained group quantization, (3) dynamic activation quantization, (4) KV-cache ...

一些您可能无法访问的结果已被隐去。

显示无法访问的结果