
AI Can Fix Bugs, but Struggles to Identify Them: OpenAI Study Reveals Limitations of LLMs in Software Engineering

Photo credit: venturebeat.com


While large language models (LLMs) have begun to reshape the software development landscape, a careful examination reveals that enterprises should hesitate before making drastic changes to their engineering workforce. OpenAI CEO Sam Altman’s assertion that LLMs can take over the roles of “low-level” engineers deserves scrutiny.

A recent study by OpenAI researchers introduced a benchmark called SWE-Lancer to evaluate how well foundation models handle real-world freelance software engineering work. The researchers found that although LLMs can patch bugs, they struggle to understand the underlying causes, which often leads to incorrect or incomplete fixes.

In the study, three language models were put to the test: OpenAI's GPT-4o and o1, and Anthropic's Claude 3.5 Sonnet. They were challenged with 1,488 freelance engineering tasks sourced from Upwork, collectively valued at $1 million. The tasks were split into individual contributor tasks (such as bug fixes and feature implementations) and managerial tasks (where the models assumed the role of a project manager choosing the best solution among competing proposals).

The researchers concluded that real-world freelance tasks remain complex for advanced language models. Their findings indicate that while these models can assist in debugging, they do not yet possess the capability to operate independently in a freelance capacity.

Evaluating Freelance Performance of LLMs

The study team, alongside 100 professional software engineers, identified tasks on Upwork and prepared the SWE-Lancer dataset without altering any task descriptions. To ensure the integrity of the evaluation, they utilized a Docker container that lacked internet connectivity, preventing the models from scraping external code databases.
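The paper does not publish its evaluation harness, but the network isolation it describes can be approximated with the Node.js Docker client dockerode. The sketch below is illustrative only; the image name and entrypoint are placeholders, not details from SWE-Lancer.

```typescript
import Docker from 'dockerode';

// Minimal sketch: run an evaluation container with networking disabled,
// so the model under test cannot pull code from external repositories.
// Image name and entrypoint are hypothetical, not from the paper.
async function runIsolatedTask(): Promise<void> {
  const docker = new Docker(); // connects to the local Docker daemon

  const container = await docker.createContainer({
    Image: 'swe-lancer-task:latest',            // hypothetical task image
    Cmd: ['bash', '-c', './run_evaluation.sh'], // hypothetical entrypoint
    HostConfig: { NetworkMode: 'none' },        // no internet connectivity
  });

  await container.start();
  await container.wait();   // block until the evaluation finishes
  await container.remove(); // clean up the stopped container
}

runIsolatedTask().catch((err) => console.error(err));
```

Setting `NetworkMode: 'none'` is the standard Docker mechanism for cutting off all outbound connectivity, which matches the paper's goal of preventing models from scraping external code databases mid-task.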

From the identified tasks, they compiled 764 individual contributor assignments worth an estimated $414,775, ranging from quick bug fixes that take minutes to weeklong feature implementations. The remaining managerial tasks, in which models assess freelancer proposals, accounted for $585,225, bringing the benchmark's total value to $1 million.

The curated tasks were then organized within the expense-management platform Expensify. The research team generated a prompt for each task from its title and description, along with a snapshot of the relevant codebase. Where a task had multiple competing proposals, they built a managerial task from the description and the available options.
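The paper does not reproduce its exact prompt templates, so the following is purely an illustrative sketch of how a contributor prompt and a managerial prompt might be assembled from a task's title, description, codebase snapshot, and proposal options. All field names are assumptions.

```typescript
// Illustrative only; the actual SWE-Lancer prompt format is not shown in the article.
interface ContributorTask {
  title: string;
  description: string;
  codebaseSnapshotPath: string; // local checkout frozen at the task's commit
}

// Build an individual-contributor prompt from the task metadata.
function buildContributorPrompt(task: ContributorTask): string {
  return [
    `Task: ${task.title}`,
    '',
    'Description:',
    task.description,
    '',
    `The repository snapshot is available at ${task.codebaseSnapshotPath}.`,
    'Produce a code patch that resolves the issue described above.',
  ].join('\n');
}

// Build a managerial prompt that asks the model to pick among proposals.
function buildManagerPrompt(description: string, proposals: string[]): string {
  const options = proposals
    .map((proposal, i) => `Option ${i + 1}:\n${proposal}`)
    .join('\n\n');
  return `Task description:\n${description}\n\nSelect the best of the following proposals:\n\n${options}`;
}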

The next phase involved developing robust end-to-end tests with Playwright for each task. These tests apply the model-generated code patches and were rigorously verified by seasoned software engineers.

As the paper outlines, the tests mimic genuine user interactions: logging in, executing complex transactions, and verifying that the model's solution actually resolves the issue.
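To make that concrete, here is a minimal Playwright test in the spirit of the checks the paper describes. The URL, credentials, selectors, and expected messages are placeholders chosen for illustration, not details taken from SWE-Lancer.

```typescript
import { test, expect } from '@playwright/test';

// Illustrative end-to-end check: log in, perform a transaction, and verify
// the behavior the model's patch was supposed to fix. All names below are
// placeholders, not taken from the benchmark.
test('expense submission works after the model-generated patch', async ({ page }) => {
  await page.goto('http://localhost:8080'); // local build of the app under test

  // Simulate a genuine user logging in.
  await page.getByLabel('Email').fill('tester@example.com');
  await page.getByLabel('Password').fill('test-password');
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Execute a representative transaction.
  await page.getByRole('button', { name: 'New expense' }).click();
  await page.getByLabel('Amount').fill('42.50');
  await page.getByRole('button', { name: 'Submit' }).click();

  // The test only passes if the patched code produces the expected result.
  await expect(page.getByText('Expense submitted')).toBeVisible();
});
```

A patch is counted as successful only if such a test passes end to end, which is why superficial fixes that do not address the root cause fail the benchmark.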

Outcomes of the Benchmark Tests

The results revealed that no model completed enough tasks to claim the full $1 million payout. Claude 3.5 Sonnet, the best performer, earned $208,050 by resolving 26.2% of the individual contributor tasks. Even so, the researchers noted that a significant proportion of its outputs were incorrect, indicating that reliability must improve before such models can be deployed in real-world scenarios.

All models performed respectably on individual contributor tasks, with Claude 3.5 Sonnet leading, followed by OpenAI's o1 and GPT-4o. However, while the agents could quickly localize issues by running keyword searches across an entire codebase, they failed to identify the root cause of problems, which often resulted in partial or faulty solutions.

Interestingly, the models displayed a stronger performance with managerial tasks, which required more complex reasoning and technical understanding. The benchmarks underscore a current reality: AI can tackle certain low-level programming challenges but cannot yet replace the nuanced skills of human engineers. Although LLMs show promise, human expertise remains essential, particularly in addressing the complexities inherent in software development.

Source: venturebeat.com
