Reading Note on gpt-oss - a new milestone for open-weight models

gpt-oss-120B marks a milestone for open-weight models. It delivers frontier-level reasoning performance and is expected to see mainstream adoption. It will most likely serve as the new baseline for benchmarking and the de facto production choice. I recommend it for workloads that demand top intelligence and high serving throughput.

Reading Note on SimpleQA - Build High Quality Benchmark Dataset with LLM + Human

The OpenAI team built a new benchmark dataset, SimpleQA, that evaluates the ability of large language models (LLMs) to answer factual questions. A particularly intriguing aspect of the paper is how the researchers use LLMs in their own workflow to design, iterate on, and analyze the new dataset.
