Language Model Evaluation

This is the most misunderstood graph in AI

To some, METR’s “time horizon plot” indicates that AI utopia—or apocalypse—is close at hand. The truth is more complicated.

MIT Technology Review

The Download: attempting to track AI, and the next generation of nuclear power

This is today's edition of The Download, our weekday newsletter that provides a daily dose of what's going on in the world of ...

Micro1 Shows Why AI’s Hardest Problem Is Evaluation, Not Intelligence

Micro1 is building the evaluation layer for AI agents, providing contextual, human-led tests that decide when models are ready ...

The Lancet

CARDBiomedBench: a benchmark for evaluating the performance of large language models in biomedical research

Although large language models (LLMs) have the potential to transform biomedical research, their ability to reason accurately across complex, data-rich domains remains unproven. To address this ...

Qwen3-Coder-Next offers vibe coders a powerful open source, ultra-sparse model with 10x higher throughput for repo tasks

On SWE-Bench Verified, the model achieved a score of 70.6%. This performance is notably competitive when placed alongside ...

Caura.ai Introduces PeerRank: A Breakthrough Framework Where AI Models Evaluate Each Other Without Human Supervision

TEL AVIV, Israel, Feb. 4, 2026 /PRNewswire/ -- Caura.ai today published research introducing PeerRank, a fully autonomous evaluation framework in which large language models generate tasks, answer ...

Medscape

AI Tool Accurately Flags Stroke Patients Ineligible for Thrombolysis

Manual review of electronic health records (EHRs) to screen for contraindications to thrombolysis during stroke evaluation is ...

AI companies want you to stop chatting with bots and start managing them

In this vision, developers and knowledge workers effectively become middle managers of AI. That is, not writing the code or ...

OpenAI’s GPT-5.3-Codex drops as Anthropic upgrades Claude — AI coding wars heat up ahead of Super Bowl ads

OpenAI launched GPT-5.3-Codex as Anthropic released Claude Opus 4.6 in a simultaneous drop that kicks off the AI coding wars, ...

Savvy Gamer on MSN

Why does AI hallucinate?

Throw AI a vague question, and what you’ll get back will likely sound plausible enough to be true, even if the question isn’t meant to have a real answer. But from the response you get back, it might ...

Devdiscourse

Public health needs structure before scaling AI

Governance and regulation constitute the fifth element. The authors argue that public health AI requires dedicated oversight mechanisms addressing transparency, explainability, data protection, and ...
