GPT-5.5 Returns to Cutting Edge in Coding, But OpenAI Switches Benchmarks After Losing to Opus 4.7

Gate News message, April 27 — SemiAnalysis, a semiconductor and AI analysis firm, released a comparative benchmark of coding assistants including GPT-5.5, Claude Opus 4.7, and DeepSeek V4. The key finding: GPT-5.5 marks OpenAI’s first return to the cutting edge in coding models in six months, with SemiAnalysis engineers now alternating between Codex and Claude Code after previously relying almost exclusively on Claude. GPT-5.5 is based on a new pretraining approach codenamed “Spud” and represents OpenAI’s first expansion of pretraining scale since GPT-4.5.

In practical testing, a clear division of labor emerged. Claude handles new project planning and initial setup, while Codex excels at reasoning-intensive bug fixes. Codex demonstrates stronger data structure comprehension and logical reasoning but struggles with inferring ambiguous user intent. On a single dashboard task, Claude automatically replicated the reference page layout but fabricated large amounts of data, while Codex skipped the layout but delivered significantly more accurate data.

The analysis also flags a switch in benchmarks: in a February blog post, OpenAI urged the industry to adopt SWE-bench Pro as the new standard for coding benchmarks, yet the GPT-5.5 announcement reports a new benchmark called “Expert-SWE” instead. The reason, buried in the fine print, is that Opus 4.7 beat GPT-5.5 on SWE-bench Pro, and GPT-5.5 fell well short of Anthropic’s unreleased Mythos (77.8%).

Regarding Opus 4.7, Anthropic published a postmortem one week after release, acknowledging three bugs in Claude Code that persisted for several weeks from March to April and affected nearly all users. Multiple engineers had previously reported performance degradation in version 4.6, but their reports were dismissed as subjective. Additionally, Opus 4.7’s new tokenizer increases token usage by up to 35%, which Anthropic openly admitted and which effectively amounts to a hidden price increase.

DeepSeek V4 was assessed as “keeping pace with the frontier but not leading,” positioning itself as the lowest-cost alternative to closed-source models. The analysis also noted that “Claude continues to outperform DeepSeek V4 Pro on high-difficulty Chinese writing tasks,” commenting that “Claude won against the Chinese model in its own language.”

The article introduces a key concept: model pricing should be evaluated by “cost per task” rather than “cost per token.” GPT-5.5’s pricing is double that of GPT-5.4 (input $5, output $30 per million tokens), but it completes the same tasks using fewer tokens, making the actual cost not necessarily higher. Initial SemiAnalysis data shows Codex’s input-to-output ratio at 80:1, lower than Claude Code’s 100:1.
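
The cost-per-task argument can be made concrete with a small sketch. The GPT-5.5 prices below are the ones quoted above; the GPT-5.4 prices are inferred as half of those from the "double" statement, and the token counts are purely hypothetical numbers chosen to keep the 80:1 input-to-output ratio, not measurements from SemiAnalysis.

```python
def cost_per_task(input_tokens, output_tokens, input_price, output_price):
    """Return the dollar cost of one task; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# GPT-5.5 prices as quoted in the article.
GPT_5_5 = {"input_price": 5.00, "output_price": 30.00}
# Assumption: GPT-5.4 prices inferred as exactly half of GPT-5.5's.
GPT_5_4 = {"input_price": 2.50, "output_price": 15.00}

# Hypothetical token counts for the same task (both at an 80:1 ratio).
older = cost_per_task(2_000_000, 25_000, **GPT_5_4)
newer = cost_per_task(800_000, 10_000, **GPT_5_5)

print(f"GPT-5.4: ${older:.3f} per task")
print(f"GPT-5.5: ${newer:.3f} per task")
```

Under these assumed numbers the newer model costs less per task despite the doubled per-token price. More generally, at an unchanged input-to-output ratio, a model priced at twice the rate breaks even when it finishes the task with half the tokens.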
