OpenAI released the gpt-oss models on August 5, 2025 as state-of-the-art open-weight LLMs. At launch, they sparked excitement across research and industry. One month later, with the initial hype settled, I feel more comfortable sharing my notes and reflections on these models.
In short: gpt-oss-120B marks a milestone for open-weight models. It surpasses the Llama 3 series and Llama 4 Maverick. It is not the strongest open-weight model available (Qwen3 and Kimi-K2 are pushing the performance frontier), but gpt-oss delivers top-tier reasoning performance and is expected to see mainstream adoption. It is likely to become a key baseline for benchmarking and the de facto production choice. I recommend it for workloads that demand top intelligence and high serving throughput.
Architecture
gpt-oss comes in two sizes:
Configuration | gpt-oss-120B | gpt-oss-20B |
---|---|---|
Layers | 36 | 24 |
Active Parameters at serving (B) | 5.13 | 3.61 |
Total Parameters (B) | 116.83 | 20.91 |
Total Experts | 128 | 32 |
Active Experts at serving | 4 | 4 |
Context Length | 128k | 128k |
Checkpoint Size (GiB) | 60.8 | 12.8 |
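Both models adopt a Mixture-of-Experts (MoE) design: each token is routed to only 4 experts, so just a fraction of the total parameters is active per token. This improves serving efficiency while maintaining strong quality.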
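To make the efficiency argument concrete, here is a small back-of-the-envelope sketch using only the numbers from the table above. The helper names (`active_fraction`, `bits_per_param`) are my own, and the bits-per-parameter figure is a rough estimate that treats the whole checkpoint as weights.

```python
GIB = 2**30  # bytes in one GiB

# Figures copied from the configuration table above.
MODELS = {
    "gpt-oss-120B": {"total_b": 116.83, "active_b": 5.13, "ckpt_gib": 60.8},
    "gpt-oss-20B": {"total_b": 20.91, "active_b": 3.61, "ckpt_gib": 12.8},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters that are active for a single token."""
    return active_b / total_b


def bits_per_param(total_b: float, ckpt_gib: float) -> float:
    """Average storage per parameter, assuming the checkpoint holds only weights."""
    return ckpt_gib * GIB * 8 / (total_b * 1e9)


for name, m in MODELS.items():
    frac = active_fraction(m["total_b"], m["active_b"])
    bits = bits_per_param(m["total_b"], m["ckpt_gib"])
    print(f"{name}: {frac:.1%} of parameters active, ~{bits:.2f} bits per parameter")
```

Roughly 4% of the 120B model's parameters and 17% of the 20B model's are touched per token, and the average of well under 8 bits per parameter suggests the checkpoints are stored in a heavily quantized form.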
They use rotary position embeddings (RoPE) and extend the context length to 131,072 tokens (the 128k listed above) via YaRN, now a standard choice in open-weight models such as Qwen3.
The tokenizer extends the one used by GPT-4o and o4-mini, with a vocabulary of 201,088 tokens.
These models are text-only: no multimodal input support.
Training cost: OpenAI reports that gpt-oss-20B cost ≈ $0.5M to train, while gpt-oss-120B cost ≈ $5M — conveniently in line with DeepSeek V3’s $5.6M.
Reasoning
In post-training, the gpt-oss models are trained to use Chain-of-Thought (CoT) reasoning, similar to the o-series. This enables test-time scaling: the model can allocate more tokens to reasoning when needed, which helps on complex tasks. For example, gpt-oss-20B uses over 20k CoT tokens per AIME problem, significantly boosting math performance.
Reasoning effort is configurable at three levels (low, medium, and high) via the system prompt (see the Harmony section). The model adjusts CoT length accordingly.
Higher reasoning effort often improves benchmark scores (e.g., AIME), but not universally. In some of my evaluation tasks, medium effort achieves the best accuracy, while high effort sometimes drifts into excessive reasoning that hurts results.
Another observation: the smaller model often consumes more reasoning tokens on difficult tasks, likely to compensate for its more limited knowledge and capacity.
Takeaway: reasoning level should be tuned per task, with both accuracy and latency/cost in mind.
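One way to turn this takeaway into a selection rule is sketched below. It assumes you have already measured accuracy and mean generated tokens per effort level on your own task. The `EffortResult` type, the `pick_effort` helper, and the numbers in the example are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class EffortResult:
    effort: str         # "low", "medium", or "high"
    accuracy: float     # task accuracy measured in your own evaluation
    mean_tokens: float  # mean generated tokens per request (a latency/cost proxy)


def pick_effort(results, token_budget, tolerance=0.01):
    """Pick the cheapest effort level whose accuracy is within `tolerance`
    of the best level that fits the token budget."""
    affordable = [r for r in results if r.mean_tokens <= token_budget]
    if not affordable:
        raise ValueError("no effort level fits the token budget")
    best = max(r.accuracy for r in affordable)
    near_best = [r for r in affordable if r.accuracy >= best - tolerance]
    return min(near_best, key=lambda r: r.mean_tokens).effort


# Made-up measurements, for illustration only.
results = [
    EffortResult("low", accuracy=0.71, mean_tokens=900),
    EffortResult("medium", accuracy=0.82, mean_tokens=3500),
    EffortResult("high", accuracy=0.80, mean_tokens=12000),
]
print(pick_effort(results, token_budget=20000))
print(pick_effort(results, token_budget=1000))
```

With these illustrative numbers the rule prefers medium when the budget allows it and falls back to low under a tight budget. The tolerance controls how much accuracy you are willing to trade for fewer tokens.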
Evaluation
Reported Benchmarks
- General tasks: gpt-oss-120B approaches o4-mini on coding (Codeforces), general problem solving (MMLU, HLE), and tool use (TauBench).
- Health: at high reasoning, gpt-oss-120B nearly matches o3 on HealthBench (within 2%) and outperforms other closed OpenAI models.
- Multilingual: delivers solid results on MMMLU; third-party benchmarks also report competitive performance.
- Hallucination: weaker than o4-mini on SimpleQA and PersonQA, likely due to less world knowledge. Browsing or retrieval helps mitigate this.
The model card also highlights AI self-improvement benchmarks — real-world coding and research tasks:
- SWE-Bench: generate PRs that pass hidden tests given a GitHub issue and codebase.
- OpenAI PRs: reproduce actual pull requests by OpenAI engineers, using command-line tools and Python.
- PaperBench: reimplement 20 ICML 2024 Spotlight and Oral papers, decomposed into graded subtasks.
On these tasks, o3 and o4-mini outperform gpt-oss.
My Benchmarks
- gpt-oss-120B outperforms Llama 3.3 70B and Llama 4 Maverick by a wide margin.
- It matches gpt-5-mini, but trails gpt-5 and Claude Sonnet-4.
- gpt-oss-20B falls behind Qwen3-30B-a3b, and gpt-oss-120B behind Qwen3-235B-a22b — so they are not necessarily the strongest open-weight models available.
Harmony
The Harmony response format was introduced alongside gpt-oss. It is not a new API (fortunately); rather, it is a convention the models are trained on. To get the best results, inference providers and users are strongly encouraged to adopt this format.
Roles and Hierarchy
Harmony defines five roles, each with a specific purpose.
- system: best left untouched except for setting reasoning effort.
- developer and user: carry most of the task instructions.
- assistant and tool: used by the model to call tools, process results, and generate final responses.
The model is trained to follow an information hierarchy:
system > developer > user > assistant > tool
Role | Purpose |
---|---|
system | Specifies reasoning effort, meta info (e.g., knowledge cutoff), and built-in tools. |
developer | Provides task instructions and tool definitions (what is typically the “system prompt”). |
user | Represents the input or query to the model. |
assistant | The model’s output: either natural language, a tool call, or a message routed to a particular channel. |
tool | Outputs of tool calls. The tool name is used as the role. |
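The hierarchy is something the model learns in training, not something you run in code. Still, a tiny sketch of the ordering can be handy, for example to lint prompts for conflicting instructions before sending them. The helper names here are hypothetical, and treating any unknown role as a tool name is my own assumption based on the table above.

```python
# Highest priority first, following: system > developer > user > assistant > tool
ROLE_PRIORITY = ["system", "developer", "user", "assistant", "tool"]


def rank(role: str) -> int:
    """Lower rank means higher priority. Tool messages use the tool name as
    their role, so any role not in the list is ranked at the tool level."""
    if role in ROLE_PRIORITY:
        return ROLE_PRIORITY.index(role)
    return ROLE_PRIORITY.index("tool")


def winning_instruction(instructions):
    """Given (role, text) pairs that conflict, return the pair from the
    highest-priority role."""
    return min(instructions, key=lambda pair: rank(pair[0]))


conflict = [
    ("user", "Reply in French."),
    ("developer", "Always reply in English."),
]
print(winning_instruction(conflict))
```

Here the developer instruction wins over the user instruction, which is the behavior the model is trained to exhibit.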
Here is an example of a complete prompt including system, developer, and user messages:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-06-28
reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|>
<|start|>developer<|message|># Instructions
Use a friendly tone.
# Tools
## functions
namespace functions {
// Gets the current weather in the provided location.
type get_current_weather = (_: {
// The city and state, e.g. San Francisco, CA
location: string,
format?: "celsius" | "fahrenheit", // default: celsius
}) => any;
} // namespace functions<|end|>
<|start|>user<|message|>What is the weather like in SF?<|end|>
<|start|>assistant
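If your serving stack does not render Harmony for you, the prompt above can be assembled with plain string formatting. The following is a minimal sketch, not an official renderer: the function names are hypothetical, the system text is copied from the example above, and joining messages without separators is my assumption. Where an official renderer is available, prefer it.

```python
EFFORT_LEVELS = ("low", "medium", "high")


def system_message(effort, knowledge_cutoff="2024-06", current_date="2025-06-28"):
    """Build the system text from the example above with a chosen reasoning effort."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    return (
        "You are ChatGPT, a large language model trained by OpenAI.\n"
        f"Knowledge cutoff: {knowledge_cutoff}\n"
        f"Current date: {current_date}\n\n"
        f"reasoning: {effort}\n\n"
        "# Valid channels: analysis, commentary, final. "
        "Channel must be included for every message."
    )


def render_message(role, content):
    """Wrap one message in the Harmony start/message/end tokens."""
    return f"<|start|>{role}<|message|>{content}<|end|>"


def render_prompt(system, developer, user):
    """Concatenate the three messages and open the assistant turn."""
    messages = [("system", system), ("developer", developer), ("user", user)]
    rendered = "".join(render_message(role, text) for role, text in messages)
    return rendered + "<|start|>assistant"


prompt = render_prompt(
    system_message("low"),
    "# Instructions\n\nUse a friendly tone.",
    "What is the weather like in SF?",
)
print(prompt)
```

Keeping the effort level as the only parameter of the system message mirrors the advice above to leave the rest of that message untouched.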
Function Calling
Functions are defined in the developer message in a TypeScript-like syntax, placed in a dedicated Tools section.
When the model decides to call a tool, it emits a message on the commentary channel whose header names the recipient with to= (for example, to=functions.get_current_weather) and that ends with the <|call|> token instead of <|end|>.
Example:
<|channel|>analysis<|message|>Need to use function get_current_weather.<|end|>
<|start|>assistant<|channel|>commentary to=functions.get_current_weather <|constrain|>json<|message|>{"location":"San Francisco"}<|call|>
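On the serving side, the tool call has to be pulled out of the raw completion before it can be executed. Below is a minimal regex-based sketch that handles output shaped like the example above, assuming the JSON payload uses straight ASCII quotes. The `parse_tool_call` helper is hypothetical, and a production parser should follow the official Harmony specification and work on tokens, not on strings.

```python
import json
import re

# Matches a commentary-channel message addressed to a recipient and ending in <|call|>.
CALL_RE = re.compile(
    r"<\|channel\|>commentary\s+to=(?P<recipient>[\w.]+)"
    r".*?<\|message\|>(?P<body>.*?)<\|call\|>",
    re.DOTALL,
)


def parse_tool_call(completion: str):
    """Return (recipient, arguments) for the first tool call, or None if absent."""
    match = CALL_RE.search(completion)
    if match is None:
        return None
    return match.group("recipient"), json.loads(match.group("body"))


completion = (
    "<|channel|>analysis<|message|>Need to use function get_current_weather.<|end|>"
    "<|start|>assistant<|channel|>commentary to=functions.get_current_weather "
    '<|constrain|>json<|message|>{"location":"San Francisco"}<|call|>'
)
print(parse_tool_call(completion))
```

The analysis message is skipped because the pattern only matches the commentary channel. The returned recipient can then be mapped to a local function, and its result sent back as a tool message.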
Structured Outputs
To encourage structured responses, you can define response formats as JSON schemas at the end of the developer message.
This does not guarantee perfect adherence, but it nudges the model strongly. For strict compliance, you will need to implement retry logic or structured (constrained) decoding.
Example:
<|start|>developer<|message|># Instructions
You are a helpful shopping assistant
# Response Formats
## shopping_list
{"properties":{"items":{"type":"array","description":"entries on the shopping list","items":<|end|>
<|start|>user<|message|>I need to buy coffee, soda and eggs<|end|>
<|start|>assistant
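The retry option mentioned above can be sketched as follows. This is an illustration under stated assumptions: `generate` stands in for whatever function calls your model and returns its final text, and the validator assumes the shopping list is an object with an "items" array of strings, which goes beyond what the truncated schema above shows.

```python
import json


def validate_shopping_list(text: str) -> dict:
    """Parse `text` and check the assumed shape {"items": [str, ...]}.
    Raises ValueError with a short reason when the check fails."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    items = data.get("items")
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError('"items" must be a list of strings')
    return data


def generate_with_retry(generate, validate, max_attempts=3):
    """Call `generate(last_error)` until `validate` accepts its output.
    `last_error` is None on the first attempt, so the caller can feed the
    failure reason back into the next prompt."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(last_error)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = str(exc)
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")


# Stand-in for a model call: fails once, then returns valid JSON.
fake_outputs = iter(["Sure! Here is your list.", '{"items": ["coffee", "soda", "eggs"]}'])
result = generate_with_retry(lambda err: next(fake_outputs), validate_shopping_list)
print(result)
```

Retries cost extra latency and tokens, so constrained decoding is usually the better fit when the serving engine supports it. The retry loop is the fallback when it does not.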
Summary
The release of gpt-oss marks a new milestone for open-weight LLMs. It is not the absolute strongest (Qwen3 and Kimi-K2 continue to push the performance frontier), but gpt-oss-120B stands out for its balance of quality and efficiency, and it is likely to become the most widely supported option. It surpasses Llama 3 and Llama 4, matches or exceeds many closed models in certain domains, and is positioned to become the default baseline for research and deployment.
Strengths:
- Efficient Mixture-of-Experts architecture with strong serving throughput.
- Competitive reasoning ability with configurable effort levels.
- Solid performance across coding, tool use, health, and general benchmarks.
Limitations:
- No multimodal input support (text-only).
- Harmony format required for best results, which may require updates to serving pipelines.
- Trails top-tier frontier models (gpt-5, Claude Sonnet-4, Qwen3-235B) on cutting-edge tasks.
Practical advice for production:
- Adopt Harmony response format at serving time for best instruction following.
- Tune reasoning effort per task based on evaluation: medium often balances accuracy and efficiency best, and high effort does not guarantee better results.
- Mitigate hallucination by pairing with retrieval or browsing when factuality matters.