Large Language Models Benchmarks

16h

Z.ai’s open-weights GLM-5.2 beats GPT-5.5 on multiple long-horizon coding benchmarks for 1/6th the cost

It allows engineering teams to host frontier-level AI on their own sovereign infrastructure, entirely eliminating vendor lock ...

5h on MSN

China's Z.ai GLM-5.2 tops OpenAI’s GPT 5.5 model on key benchmarks

Chinese startup Z.ai has launched GLM-5.2, a powerful AI model for complex coding projects. This new large language model ...

13h

Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again

VibeThinker-3B, a 3-billion-parameter AI model, is challenging OpenAI, Google and DeepSeek on math and coding benchmarks while reigniting ...

AI has passed the test but not the exam: Why ‘Humanity’s Last Exam’ matters

There is a temptation, when AI systems begin to outperform human baselines on established tests, to interpret this as a sign ...

Becker's Hospital Review

AI tools score high on exams, low on real clinical text: Study

Mass General Brigham's BRIDGE benchmark found top AI models scored 92% on medical exams but just 44.8% on real-world clinical tasks.

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

13d

These LLMs are the best at resisting Russian propaganda

Unsurprisingly, recent frontier models showed a much stronger tendency to resist Russian propaganda than models from just a ...

Geeky Gadgets

How to Build Custom LLM Benchmarks for Your AI Applications

Have you ever wondered why off-the-shelf large language models (LLMs) sometimes fall short of delivering the precision or context you need for your specific application? Whether you’re working in a ...

Morning Overview on MSN

Google unveiled TurboQuant, a method that cuts the memory bottleneck slowing large AI models

Companies running large language models face a persistent bottleneck: the memory consumed by key-value caches during ...

News-Medical.Net

Leading AI models ace many vaccine questions but falter on clinical rules

A multilingual benchmark of 1,886 vaccine-related questions found that large language models answered most items accurately ...
