SAN FRANCISCO, US – Experienced software developers using artificial intelligence (AI) coding tools took 19% longer to complete programming tasks, despite believing they were 20% faster, according to a randomized controlled trial conducted by METR (Model Evaluation & Threat Research), a nonprofit organization that studies AI safety and performance.

The study, published July 10, 2025, examined 16 experienced open-source developers working on their own familiar repositories, collections of code they had each contributed to for multiple years. Researchers randomly assigned 246 real programming issues to either allow or prohibit AI tool usage, creating the first controlled experiment to measure AI’s actual impact on experienced developers in realistic work environments rather than artificial test scenarios.
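For readers curious what per-issue randomization looks like in practice, here is a minimal sketch in Python; the 50/50 coin-flip assignment and issue labels are illustrative assumptions, not METR’s actual tooling or procedure.

```python
import random

def assign_issues(issue_ids, seed=42):
    """Randomly assign each issue to an 'AI-allowed' or 'AI-disallowed' condition.

    Illustrative only: the study randomized 246 real issues, but its exact
    assignment procedure and tooling are not described in this article.
    """
    rng = random.Random(seed)
    return {issue: rng.choice(["AI-allowed", "AI-disallowed"]) for issue in issue_ids}

# Example: 246 hypothetical issue IDs split between the two conditions
assignments = assign_issues(range(1, 247))
print(sum(label == "AI-allowed" for label in assignments.values()), "issues allow AI")
```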

Developers used frontier AI tools, including Cursor Pro with Claude 3.5 and 3.7 Sonnet, when AI was permitted. The study focused on large repositories averaging more than 22,000 stars (a measure of popularity) and over one million lines of code, representing substantial real-world software projects where developers had deep expertise and familiarity with the existing codebase (the project’s complete collection of source code).

Perception Conflicts with Measured Performance

The most striking finding was the gap between developer perception and measured reality. After completing tasks, developers estimated that AI tools reduced their completion time by 20%, while actual measurements showed a 19% increase in time required. This 39-percentage-point difference between perception and reality suggests developers may struggle to accurately assess AI’s impact on their own work.
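The arithmetic behind that gap is straightforward, as the short illustrative snippet below shows: a perceived 20% speedup and a measured 19% slowdown sit 39 percentage points apart.

```python
perceived_change = -0.20  # developers estimated AI cut their completion time by 20%
measured_change = 0.19    # measured completion time was actually 19% longer with AI

# Gap between perception and measured reality, in percentage points
gap_pp = (measured_change - perceived_change) * 100
print(f"Perception-reality gap: {gap_pp:.0f} percentage points")  # -> 39
```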

“We conducted a randomized controlled trial to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories,” wrote lead researchers Joel Becker and Nate Thomas. “Surprisingly, we find that when developers use AI tools, they take 19% longer than without it; AI tooling slows them down.”

The study ruled out several potential explanations for the slowdown, confirming that developers used frontier models, complied with their treatment assignments, and submitted similar quality code whether using AI or not. The performance decrease persisted across different measurement methods and data analysis approaches.

Five Factors Contributing to Slower Performance

METR researchers conducted a detailed analysis to understand why experienced developers performed worse with AI tools. Their investigation identified five factors that consistently contributed to the 19% slowdown across the study participants.

1. Over-Optimism About AI Usefulness

Developers entered tasks with unrealistic expectations about what AI coding tools could accomplish in their specific work environment. The study found that participants predicted AI would speed them up by 24% before starting tasks, demonstrating significant overconfidence in the technology’s capabilities for their particular use case.

Even after experiencing the actual 19% slowdown, developers still believed AI had improved their productivity by 20%, showing persistent overestimation. This cognitive bias led developers to rely more heavily on AI suggestions than was optimal, potentially wasting time on approaches that weren’t well-suited to their complex, familiar codebases.

2. High Developer Familiarity with Repositories

The study participants averaged five years of experience and 1,500 commits (code contributions) on their specific repositories, giving them deep institutional knowledge that AI tools couldn’t match. This extensive familiarity meant that experienced developers already knew the most efficient approaches for their projects, making AI suggestions less valuable.

Researchers found greater slowdowns on tasks where developers had particularly high prior experience. The AI tools offered general programming assistance that was redundant with knowledge the experienced developers already possessed, failing to provide the specialized, context-aware insights that would justify the time investment.

3. Large and Complex Repositories

The repositories in the study averaged over 22,000 stars (indicating high popularity) and more than one million lines of code. These massive, mature codebases contained intricate dependencies, established architectural patterns, and sophisticated system designs that overwhelmed the AI tools’ ability to track context and relationships.

Working on such large, established projects requires understanding not just individual code functions, but how changes propagate through interconnected systems. AI tools struggled to grasp these complex interdependencies, leading to suggestions that appeared reasonable in isolation but caused integration problems requiring additional debugging time.

4. Low AI Code Acceptance and Extensive Review Requirements

Developers accepted less than 44% of AI-generated code suggestions, indicating the tools frequently produced inappropriate or low-quality output for the specific tasks. Additionally, 75% of participants reported reading every line of AI-generated code, while 56% made major modifications to clean up AI suggestions.

This extensive review process consumed significant time without providing proportional benefits. The need to carefully examine, test, and modify AI output often exceeded the time that would have been required to write the code from scratch using the developers’ existing expertise and familiarity with their projects.

5. High Quality Standards and Implicit Requirements

The study revealed that AI capabilities were comparatively lower in settings with very high quality standards and many implicit requirements: documentation, testing coverage, code formatting, and style guidelines that experienced developers had internalized over years of work.

Unlike benchmark environments where algorithmic scoring determines success, real-world development requires code that passes human review including style requirements, comprehensive testing, and thorough documentation. These implicit quality standards, which experienced developers naturally incorporate, proved difficult for AI tools to understand and implement consistently.

Contradicts Popular AI Coding Benchmarks

The METR study findings directly contradict the impressive performance scores that AI companies and researchers typically cite when promoting coding tools. Popular benchmarks like SWE-Bench (Software Engineering Benchmark) often show AI systems achieving human-expert or superhuman performance on programming tasks, creating expectations that these tools will dramatically boost developer productivity.

However, these benchmarks fundamentally misrepresent real-world software development conditions. They use carefully crafted, self-contained problems that don’t require understanding massive existing codebases, navigating complex dependencies, or meeting the rigorous code quality standards that experienced developers must satisfy in production environments.

“While coding and agentic benchmarks have proven useful for understanding AI capabilities, they typically sacrifice realism for scale and efficiency,” the researchers explained. “The tasks are self-contained, don’t require prior context to understand, and use algorithmic evaluation that doesn’t capture many important capabilities.”

The Marketing Problem: When Benchmarks Mislead

When AI companies market their coding tools based on benchmark performance, they create a dangerous disconnect between expectations and reality. Developers and organizations invest time and money expecting the 26% productivity gains or 55% speed improvements cited in earlier studies, only to discover that experienced developers working on complex, real-world projects may actually become 19% slower.

This marketing approach based on artificial benchmarks leads to several harmful outcomes. First, it wastes organizational resources as companies deploy tools expecting immediate productivity gains that don’t materialize. Second, it creates frustration among experienced developers who feel pressured to use tools that actually hinder their work. Third, it damages trust in AI capabilities overall when the promised benefits fail to appear in practice.

The Honest Assessment: Limited but Real Benefits

The truth about current AI coding capabilities is more nuanced than marketing materials suggest. The tools do provide genuine value in specific scenarios – particularly for beginners learning programming concepts, developers working in unfamiliar codebases, or when generating boilerplate code and documentation. These scenarios more closely match the controlled conditions where AI performs well in benchmark tests.

However, experienced developers working on large, familiar projects with high quality standards – arguably the most valuable and productive software development work – may find current AI tools more hindrance than help. The 19% slowdown occurs precisely in the scenarios where organizations most need productivity gains: complex, mission-critical software development by their most skilled engineers.

An honest marketing approach would acknowledge these limitations while highlighting genuine use cases. Instead of claiming universal productivity improvements, AI companies should specify that their tools work best for junior developers, unfamiliar codebases, educational contexts, and rapid prototyping rather than production-quality development on complex systems.

Implications for Software Development Industry

The METR research reveals that the software development industry is making costly decisions based on misleading information about AI coding tools. Companies are investing in AI assistants expecting immediate productivity gains for their most experienced developers, but may actually be slowing down their most valuable contributors by 19%.

What Companies Should Do Immediately

First, stop assuming AI coding tools will universally improve productivity across all developer skill levels and project types. The study shows that experienced developers working on complex, familiar codebases – often the most critical and expensive development work – become significantly slower with current AI tools.

Second, measure actual productivity impact rather than relying on developer self-reports. The METR study found that developers believed they were 20% faster even while being 19% slower, demonstrating that subjective assessments are unreliable. Companies should track objective metrics like task completion times, code quality, and debugging requirements when evaluating AI tool effectiveness (a minimal measurement sketch follows these recommendations).

Third, target AI tool deployment strategically rather than organization-wide. The research suggests these tools may benefit junior developers, teams working on unfamiliar codebases, or projects with lower quality standards. Companies should pilot AI tools in these specific contexts before expanding to experienced developers on mission-critical projects.
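As a rough illustration of the objective comparison the second recommendation calls for, the sketch below compares average completion times between AI-allowed and AI-disallowed tasks. The log format and the simple mean comparison are assumptions for illustration, not a prescribed methodology; a real evaluation would need many more tasks and proper statistical controls.

```python
from statistics import mean

# Hypothetical completion-time log (hours per task), tagged by condition;
# a real evaluation would also track code quality and debugging follow-ups
task_log = [
    {"task": "fix-auth-bug", "ai_allowed": True,  "hours": 3.5},
    {"task": "add-caching",  "ai_allowed": False, "hours": 2.8},
    {"task": "refactor-io",  "ai_allowed": True,  "hours": 5.1},
    {"task": "update-docs",  "ai_allowed": False, "hours": 1.9},
]

def mean_hours(log, ai_allowed):
    """Average completion time for tasks completed under one condition."""
    return mean(t["hours"] for t in log if t["ai_allowed"] == ai_allowed)

with_ai = mean_hours(task_log, ai_allowed=True)
without_ai = mean_hours(task_log, ai_allowed=False)
change = (with_ai - without_ai) / without_ai * 100
print(f"AI-allowed tasks took {change:+.0f}% time relative to AI-disallowed tasks")
```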

What the Industry Must Acknowledge

The software development industry must confront the fact that current AI marketing claims are based on benchmarks that don’t represent real-world development conditions. As METR researchers noted, benchmarks “sacrifice realism for scale and efficiency” and use “algorithmic evaluation that doesn’t capture many important capabilities” required in production environments.

Companies purchasing AI coding tools should demand evidence from realistic deployment scenarios rather than accepting benchmark scores. The 26% productivity gains and 55% speed improvements cited in marketing materials come from controlled experiments that don’t match the complex, quality-demanding environments where most valuable software development occurs.

Additionally, the industry must recognize that anecdotal reports of AI helpfulness are “inaccurate and overoptimistic,” as the METR study demonstrated. The gap between perception and reality means that developer enthusiasm for AI tools doesn’t translate to actual productivity improvements.

Realistic Deployment Strategy

Based on the research findings, companies should implement a contextual approach to AI coding tools. Deploy them for junior developers learning programming concepts, teams working on unfamiliar codebases, rapid prototyping projects, and documentation generation – scenarios that more closely match conditions where AI performs well.

For experienced developers on complex projects, companies should either avoid AI coding tools entirely or implement them with careful measurement and realistic expectations. The tools might provide value for specific subtasks like generating boilerplate code or exploring alternative approaches, but shouldn’t be expected to improve overall productivity.

Most importantly, organizations should budget for the hidden costs of AI tool adoption: the time experienced developers spend reviewing and correcting AI suggestions, the potential for introducing subtle bugs, and the learning curve required to use tools effectively. The study found developers accepted less than 44% of AI suggestions and spent significant time modifying AI-generated code.

The randomized controlled trial, conducted by METR’s research team over several months in early 2025, involved 16 experienced developers from large open-source repositories and analyzed 246 real programming issues to gauge real-world AI tool effectiveness in production-quality development environments.
