Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, made an observation on X on April 25 with far-reaching implications for academia: today's AI agents can independently reproduce complex academic research results using only publicly described methods and data, without relying on the authors' original code. Mollick further noted that when these AI reproductions differ from the original papers, "the errors often come from the human-written papers themselves, not from the AI." This marks a concrete turning point for the reproducibility crisis in the generative-AI era: peer verification that once required expensive human effort is now being completed at scale, and at low cost, by AI.
Claude reproduces multiple papers, with GPT-5 Pro as a double check
In his One Useful Thing blog post and in the tweet, Mollick described his experiment with Claude: he gave Claude an academic paper, had it open the replication archive, organize the files, automatically translate the Stata code used for the statistical analyses into Python, and then re-run every finding in the paper one by one. After Claude finished, he used GPT-5 Pro to perform a second round of checks on the same set of reproduction results. Multiple papers were tested this way, and the reproductions generally succeeded, getting blocked only when the data files were too large or when the original replication data itself was flawed.
For academia, this process used to take research assistants weeks or even months. The timescale Mollick described was an afternoon to a day, at an operating cost of only the token fees of a commercial LLM API.
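The final step of the workflow described above, checking whether the re-run results actually match the numbers reported in the paper, can be sketched as a simple tolerance comparison. This is a minimal illustration, not Mollick's actual tooling; the function name, the 1% relative tolerance, and the example statistics are all my own assumptions:

```python
# Hypothetical sketch: compare an AI agent's reproduced estimates against
# the values reported in a paper, flagging each statistic as a match,
# a divergence (a candidate human or AI error), or missing entirely.
import math

def check_reproduction(reported: dict, reproduced: dict,
                       rel_tol: float = 0.01) -> dict:
    """Classify each reported statistic by how the reproduction compares."""
    results = {}
    for name, reported_value in reported.items():
        if name not in reproduced:
            results[name] = "missing"      # agent could not compute it
        elif math.isclose(reported_value, reproduced[name], rel_tol=rel_tol):
            results[name] = "match"
        else:
            results[name] = "diverges"     # worth a human (or second-model) look
    return results

# Example: one coefficient matches within 1%; one diverges well beyond it.
paper = {"beta_treatment": 0.42, "r_squared": 0.31}
agent = {"beta_treatment": 0.4201, "r_squared": 0.27}
print(check_reproduction(paper, agent))
# → {'beta_treatment': 'match', 'r_squared': 'diverges'}
```

Any statistic flagged as diverging is exactly the case Mollick describes: it then takes investigation to decide whether the paper or the reproduction is at fault.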
Most errors come from the original human text, not the AI
More controversial is Mollick's judgment about who is wrong. In his tweet, he stated plainly that when AI reproduction results do not match the original papers, in most cases it is not the AI that erred: the original paper had data-processing mistakes, misapplied its statistical models, or drew conclusions beyond what the data supports. Psychology, behavioral economics, management, and other social sciences have seen multiple major reproducibility crises over the past decade; the most famous is the 2015 Open Science Collaboration's large-scale replication study, in which only about 36% of psychology results could be independently reproduced. AI agents are shifting this kind of verification from something that requires dedicated human staffing to something that can be run broadly and routinely.
Learned societies still ban AI from peer review; the system lags behind the technology
In another tweet the same day, Mollick pointed out that the Academy of Management, the largest learned society in his field, still explicitly bans AI from the manuscript review process. He cited existing research indicating that AI peer review already outperforms some traditional human reviewers on accuracy, consistency, and bias control, so a blanket ban could end up worsening the failures of the existing review system. This lag between institutions and technology is a policy question that academia, learned societies, and funding bodies will have to confront in academic publishing over the next one to two years.
For readers, this debate is not limited to academia. When AI agents can verify research findings in near real time, academic evidence cited in industry research, policy reports, and financial decisions will face a new bar of scrutiny: can the conclusions withstand independent AI reproduction? In a follow-up tweet, Mollick added that as tool capability continues to rise, he believes governments are the only actors positioned to anchor this kind of verification, and that the complexity of the policy design involved will become a relatively overlooked thread in AI-governance discussions.
This article, "AI agents can independently reproduce complex academic papers: Mollick says most errors come from the human original text rather than the AI," first appeared on Chain News ABMedia.