Language Model Evaluation

HISD broke barriers with the first public French program in Texas. Now, it plans to dismantle it.

The program will be reduced from a 50/50 immersion model to a less intensive "enrichment" program, according to presentation ...

New AI model enables native speakers and foreign learners to read undiacritized Arabic texts with greater fluency

Reading an Arabic newspaper, a book, or academic prose fluently, whether digital or in print, remains challenging for many ...

Medscape

AI Tool Accurately Flags Stroke Patients Ineligible for Thrombolysis

The results of a retrospective cohort study showed that the tool demonstrated high sensitivity and specificity in identifying ...

Communications of the ACM

The Swiss LLM Apertus

Apertus was released in early September 2025. It is a multilingual model developed by the Swiss Federal Institutes of Technology in Zurich (ETH) and Lausanne (EPFL). The model was pretrained with 60% ...

Qwen3-Coder-Next offers vibe coders a powerful open source, ultra-sparse model with 10x higher throughput for repo tasks

On SWE-Bench Verified, the model achieved a score of 70.6%. This performance is notably competitive when placed alongside ...

technext24.com

From summative to formative assessment: How ‘Listening Trivia’ uses AI to humanise soft skills assessment

In the corporate world, few rituals are as universally dreaded as the mandatory compliance training. It is often a passive, click-next-until-it’s-over exercise designed to generate a certificate ...

Micro1 Shows Why AI’s Hardest Problem Is Evaluation, Not Intelligence

Micro1 is building the evaluation layer for AI agents, providing contextual, human-led tests that decide when models are ready ...

Nextgov Opinion

DOD’s AI acceleration strategy

According to Secretary of Defense Pete Hegseth’s memorandum on the Strategy, this AI-first status is to be achieved ...

The Lancet

CARDBiomedBench: a benchmark for evaluating the performance of large language models in biomedical research

Although large language models (LLMs) have the potential to transform biomedical research, their ability to reason accurately across complex, data-rich domains remains unproven. To address this ...

GitHub

Provider-agnostic, open-source evaluation infrastructure for language models

openbench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long ...

ADL finds Grok is the worst AI chatbot at countering antisemitism

Grok, the large language model of Elon Musk’s social platform X, came in last place in a new ranking of AI chatbots’ ability ...

SiliconANGLE

Databricks expands tools for governing and evaluating AI agents

Databricks Inc. today announced a series of updates to its flagship artificial intelligence product, Agent Bricks, aimed at improving governance, accuracy and model flexibility for enterprise AI ...
