gpt-oss-120B marks a milestone for open-weight models. It delivers frontier-level reasoning performance and is likely to see mainstream adoption, serving as the new baseline for benchmarking and the de facto production choice. I recommend it for workloads that demand top-tier intelligence and high serving throughput.
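If you want to try it yourself, here is a minimal inference sketch via Hugging Face transformers. It assumes the weights are hosted under the repo id openai/gpt-oss-120b and that you have enough GPU memory to shard a 120B-parameter model; treat it as a starting point, not a production serving setup.

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-120b",  # assumed Hugging Face repo id
    torch_dtype="auto",
    device_map="auto",  # requires `accelerate`; shards across available GPUs
)

messages = [
    {"role": "user",
     "content": "Summarize mixture-of-experts routing in two sentences."},
]
out = generator(messages, max_new_tokens=256)
# With chat-style input, the pipeline returns the full conversation;
# the assistant's reply is the last message.
print(out[0]["generated_text"][-1]["content"])
```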
The OpenAI team built a new benchmark dataset called SimpleQA that evaluates the ability of large language models (LLMs) to answer short factual questions. A particularly intriguing aspect of the paper is how, in this era of LLMs, the researchers leverage LLMs in their own workflow to design, iterate on, and analyze a new dataset.
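One concrete way LLMs enter such a workflow is as automated graders. The sketch below shows an LLM-as-grader pattern in that spirit: a grader model classifies a predicted answer against the gold answer as CORRECT, INCORRECT, or NOT_ATTEMPTED. The prompt wording and the gpt-4o-mini grader choice are illustrative assumptions, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading a factual question-answering task.
Question: {question}
Gold answer: {gold}
Model answer: {predicted}
Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."""

def grade(question: str, gold: str, predicted: str) -> str:
    # Ask the grader LLM to compare the predicted answer to the gold answer.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative grader model, not the paper's
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(
                question=question, gold=gold, predicted=predicted
            ),
        }],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip()

print(grade("Who wrote 'Middlemarch'?", "George Eliot", "George Eliot"))
```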
Thought Preference Optimization (TPO): prompt the model to generate an internal thought process before its final response, then train on preference pairs scored by a judge model that sees only the responses. TPO demonstrates significant performance gains in non-reasoning categories such as translation, marketing, and health; reasoning categories like math and analysis also show improvements.
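A minimal sketch of how such preference pairs could be constructed is shown below. The generate and judge_score callables are hypothetical stand-ins for the policy model and judge model, and the output format matches what a standard DPO-style preference trainer expects; this is a sketch of the data loop under those assumptions, not the paper's implementation.

```python
import random
from typing import Callable

def build_preference_pair(
    prompt: str,
    generate: Callable[[str], tuple[str, str]],  # hypothetical: -> (thought, response)
    judge_score: Callable[[str, str], float],    # hypothetical: (prompt, response) -> score
    num_samples: int = 8,
) -> dict:
    # Sample several thought+response completions for the same prompt.
    samples = [generate(prompt) for _ in range(num_samples)]
    # The judge scores only the visible response, never the hidden thought.
    scored = sorted(
        ((judge_score(prompt, resp), thought, resp) for thought, resp in samples),
        key=lambda item: item[0],
        reverse=True,
    )
    best, worst = scored[0], scored[-1]
    # Chosen/rejected texts keep the thought, so preference training also
    # shapes the thoughts even though the judge never sees them.
    return {
        "prompt": prompt,
        "chosen": f"{best[1]}\n{best[2]}",
        "rejected": f"{worst[1]}\n{worst[2]}",
    }

# Dummy stand-ins so the sketch runs end to end.
pair = build_preference_pair(
    "Translate 'good morning' to French.",
    generate=lambda p: ("<thought>pick a natural register</thought>", "Bonjour."),
    judge_score=lambda p, r: random.random(),
)
print(pair["chosen"])
```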