Large Language Model Benchmarks

Deception in clinical large language models: an under-recognised safety risk

Large language models (LLMs) are rapidly being integrated into clinical workflows, supporting tasks such as diagnosis ...

EurekAlert!

MathEval: a comprehensive benchmark for evaluating large language models on mathematical ...

This study introduces MathEval, a comprehensive benchmarking framework designed to systematically evaluate the mathematical reasoning capabilities of large language models (LLMs). Addressing key ...

3 days ago

Anthropic launches Claude Sonnet 5 AI model with coding, safety upgrades

Anthropic PBC today debuted Claude Sonnet 5, a midrange large language model that outperforms its predecessor in several ...

16 days ago on MSN

China's Z.ai GLM-5.2 tops OpenAI’s GPT 5.5 model on key benchmarks

Chinese startup Z.ai has launched GLM-5.2, a powerful AI model for complex coding projects. This new large language model boasts a massive 1 million token context window, allowing it to handle ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

The Economist

Top AI models underperform in languages other than English

TO GET THE most accurate answer from a large language model, make sure to prompt it in the right language. An English-speaking user asking a world-leading model what to do about swollen legs late in ...
