Large Language Models Benchmarks

MathEval: a comprehensive benchmark for evaluating large language models on mathematical ...

This study introduces MathEval, a comprehensive benchmarking framework designed to systematically evaluate the mathematical reasoning capabilities of large language models (LLMs). Addressing key ...

The Lancet (Opinion)

Deception in clinical large language models: an under-recognised safety risk

Large language models (LLMs) are rapidly being integrated into clinical workflows, supporting tasks such as diagnosis ...

2 days ago

Anthropic launches Claude Sonnet 5 AI model with coding, safety upgrades

Anthropic PBC today debuted Claude Sonnet 5, a midrange large language model that outperforms its predecessor in several ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

7 days ago

Small Language Models Outperform Frontier AI On Cost, Speed And Accuracy

Bigger has defined AI from day one. New data says task-specific small models beat frontier LLMs on accuracy, cost and speed — ...

From MSN

Elon Musk’s xAI Grok 4.1 Gets Big Upgrade: Check Features, Benchmarks And How To Use It

Elon Musk’s xAI has announced the arrival of Grok 4.1, the newest version of its AI model, and users are already noticing the difference. Musk shared the update on X, highlighting a major jump in ...
