According to Anthropic's system card, the Mythos 5 model enabled generalist microbiologists to outperform specialists in a 16-hour biodefense red-team exercise: 2 of the 3 generalist teams surpassed all 3 expert teams on scientific quality and feasibility. Experts estimated that the task would typically require 40 to 95 working days without AI assistance, with an average estimate of 72.5 days.
However, Anthropic noted that Mythos 5 remains limited in autonomous research. The model showed weak open-ended ideation, tended to recombine existing literature rather than propose novel approaches, and could keep pursuing a flawed framework even after identifying its defects. The CUSP scientific forecasting benchmark corroborated these findings: GPT-5.4 achieved 81.9% accuracy on mechanism identification tasks but only 45.3% to 51.9% on binary classification of whether a scientific advance would actually succeed, which is near the level of random guessing.