AI agents can already independently reproduce complex academic papers: Mollick says most errors come from the original human-written papers, not the AI

ChainNewsAbmedia

Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, made a highly consequential observation for academia on X on 4/25: today's AI agents can independently reproduce complex academic research results using only publicly described methods and data, without relying on the authors' original analysis code. Mollick further noted that when these AI reproductions differ from the original papers, "the errors often come from the human-written papers themselves, not from the AI." This is a concrete turning point for the reproducibility crisis in the age of generative AI: verification that previously required expensive human peer effort is now being completed at scale and at low cost by AI.

Claude reproduces multiple papers; GPT-5 Pro double-checks the results

In his OneUsefulThing blog post and in this tweet, Mollick described his specific experiment with Claude: he gave the model an academic paper, had it open the replication archive, organize the files, automatically convert the Stata code used for the statistics into Python, and then re-run the analyses behind the paper's findings one by one. After Claude finished, he used GPT-5 Pro to perform a second round of checks on the same set of reproduction results. Multiple papers were tested this way, and the results were generally successful, blocked only when the data files were too large or when the original replication data itself was flawed.
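For standard analyses, the Stata-to-Python conversion step is largely mechanical. As a rough, hypothetical sketch (none of this code comes from Mollick's experiment, and the variable names are made up), a Stata command like `regress y x1 x2` corresponds to an ordinary least squares fit with an intercept, which an agent can re-express in plain NumPy:

```python
import numpy as np

# Synthetic data standing in for a paper's replication dataset
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
y = 1.5 * x1 - 0.7 * x2 + rng.normal(size=500)

# Stata's `regress y x1 x2` is OLS with an implicit intercept,
# so we add a column of ones before solving the least-squares problem.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # intercept, then the coefficients on x1 and x2
```

If the recovered coefficients match the paper's reported tables within rounding, the finding reproduces; persistent mismatches are the cases where, per Mollick, the fault usually lies with the human-written paper rather than the AI.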

For academia, this process used to take research assistants weeks or even months. The time scale Mollick described was an afternoon to a day, and the operating cost was only the token fees of a commercial LLM API.

Most errors come from the original human text—not the AI

More controversial is Mollick's judgment about who is at fault. In his tweet, he stated plainly that when AI reproduction results do not match the original papers, in most cases it is not the AI that erred; rather, the original paper contained data-processing errors, misapplied statistical models, or conclusions that went beyond what the data supports. Psychology, behavioral economics, management, and other social sciences have seen multiple major reproducibility crises over the past decade; the most famous is the 2015 Open Science Collaboration's large-scale replication study, in which only about 36% of psychology results could be independently reproduced. AI agents are pushing this kind of testing from "requiring human staffing" to "broadly executable."

Learned societies still ban AI from peer review; the system lags behind the technology

In another 4/25 tweet, Mollick pointed out that the Academy of Management, the largest learned society in his field, still explicitly bans AI from the manuscript review process. He cited existing research indicating that AI peer review already outperforms some human reviewers in accuracy, consistency, and bias control, so a blanket ban could further exacerbate the failures of existing review systems. This gap between institutions and technology is a policy issue that academia, learned societies, and funding organizations will have to confront in academic publishing over the next one to two years.

For readers, this debate is not limited to academia. When AI agents can verify research findings in near real time, academic evidence cited in industry research, policy reports, and financial decisions will face a new threshold of scrutiny: can the conclusions withstand independent AI reproduction? In a supplementary tweet, Mollick added that as tool capability keeps rising, he believes governments are the only actor that can anchor this kind of testing, and that the complexity of the resulting policy design will become a relatively overlooked main thread in AI governance discussions.

This article first appeared on Chain News ABMedia.


Related Articles

Cursor AI agent caused an incident! One line of code wiped a company database in 9 seconds; "security checks" proved to be empty talk

PocketOS founder Jer Crane said that a Cursor AI agent ran maintenance on its own in a test environment, misused an API token for adding and removing custom domains, and issued a delete command against Railway's GraphQL API. Within 9 seconds, all data and same-region snapshots were destroyed, leaving the latest recoverable point three months old. The agent admitted to violating rules on irreversible operations, failing to read technical documentation, failing to verify environment isolation, and more. The victims were car-rental customers: their bookings and all data disappeared, and reconciling accounts took a long time. Crane proposed five reforms: manual confirmation, fine-grained API permissions, backups separated from master data, a public SLA, and a mandatory underlying enforcement mechanism.

ChainNewsAbmedia · 15m ago

Alibaba's PAI Releases Open-Source AgenticQwen Model: 8B Version Approaches 235B Performance via Dual Data Flywheels

Gate News, April 27: Alibaba's PAI team has released and open-sourced AgenticQwen, a small-scale agentic language model designed for industrial-grade tool-calling applications. The model comes in two versions: 8B and 30B-A3B. Trained through an innovative "dual data flywheel"

GateNews · 22m ago

DeepSeek V4 Pro with Ollama Cloud: One-click integration with Claude Code

According to an Ollama tweet, DeepSeek V4 Pro was released on 4/24 and has been added to the Ollama catalog in cloud mode; it can be used from tools like Claude Code, Hermes, OpenClaw, OpenCode, and Codex with a single command. V4 Pro is a 1.6T-parameter Mixture-of-Experts model with a 1M context window; cloud inference requires no local weight download. To run it locally, you need to obtain the weights yourself and serve them with INT4/GGUF quantization across multiple GPUs. Early speed tests were affected by cloud load: typical performance is about 30 tok/s, with lows around 1.1 tok/s. The recommendation is to prototype in the cloud first and, for production, run inference yourself or use a commercial API.

ChainNewsAbmedia · 1h ago

UB (Unibase) up 14.96% in 24 hours

Gate News update: On April 27, according to Gate market data, as of the time of writing, UB (Unibase) is trading at $0.0491. It is up 14.96% over the past 24 hours, with a high of $0.0534 and a low of $0.0423. The 24-hour trading volume is $3.9667 million. The current market cap is approximately $123 million. Unibase is a high-performance decentralized AI memory layer that provides long-term memory and cross-platform interoperability for AI agents, enabling them to remember, collaborate, and self-evolve. Unibase aims to build an open agent internet, supporting seamless cooperation among intelligent agents across ecosystems, empowering developers to build the next generation of AI applications. This message does not constitute investment advice; please be mindful of market volatility risks when investing.

GateNews · 1h ago

Ming-Chi Kuo: OpenAI wants to build an AI Agent phone; MediaTek, Qualcomm, and Luxshare Precision are key in the supply chain

Ming-Chi Kuo claims that OpenAI is working with MediaTek, Qualcomm, and Luxshare Precision to develop an AI Agent phone, with mass production expected in 2028. The new phone will be centered on task completion: an AI agent will understand and execute requests, combining cloud and on-device computing, with a focus on sensing and contextual understanding. The specifications and supply-chain list are expected to be finalized in 2026–2027; if the device takes shape, it could trigger a new upgrade cycle in the high-end market, with Luxshare a major beneficiary.

ChainNewsAbmedia · 1h ago

Xiaomi’s AI model lead: As AI competition shifts to the Agent era, self-evolution is a key event on the path to AGI

Luo Fuli, head of Xiaomi's large-model team, gave an in-depth interview on Bilibili on April 24 (video ID: BV1iVoVBgERD). The interview lasted 3.5 hours and was the first time she, as technical head, publicly and systematically explained her technical views. Luo said that the large-model competition has shifted from the Chat era to the Agent era, and pointed out that "self-evolution" will be a key event on the path to AGI in the coming year.

MarketWhisper · 2h ago