

With AI in the Trenches:
Our Copilot Experiment in Real-life Software Development


Published on: 26 Apr 2024 8 min read

Generative Artificial Intelligence is changing the way we work and has sparked a lot of discussion among developers as they explore and experiment with different AI coding tools. Is AI going to make developers obsolete? How will it alter the nature of software engineering? Can we see a ten-fold increase in developers’ productivity with the introduction of AI-powered tools?

Our Copilot Experiment in Real-life Software Development

To answer these questions, we conducted a four-month experiment with GitHub Copilot, one of the most popular AI-driven software development tools.

Copilot promises a steady increase in developers’ productivity and efficiency through real-time AI code suggestions, contextual guidance, and fast generation of unit tests. However, there has been limited public research on the tool’s impact on software development in real-world scenarios. Therefore, we put it to the test in real software projects to better understand and quantify its effects on software engineering.

You can read our full whitepaper with all our findings and data here: AI Writing Our Code: Experiments with Copilot

And here are some of the discoveries we made during our experiment… 

What is Copilot? 

Copilot is a cloud-based AI tool that helps developers in their day-to-day work by suggesting code and providing real-time support. It acts as a pair programmer that understands natural-language prompts, usually written as code comments, and turns them into code suggestions to help with your coding.

Copilot gives developers new ways to solve old problems and makes it easier to automate tedious tasks or learn new languages, frameworks, and platforms. One of its best features is that it tackles the boilerplate code problem: code repeated in multiple places with slight variation, which is time-consuming and tedious for developers to write.

Some key features of GitHub Copilot include context-aware code suggestions and support for multiple languages and frameworks.
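To make the comment-driven workflow above concrete, here is a minimal, hypothetical sketch in Python. The class and function names are invented for this illustration: the developer typically writes only the natural-language comment and the signature, and Copilot tends to suggest boilerplate along the lines of the body below.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Customer:
    id: int
    name: Optional[str] = None
    email: Optional[str] = None

# Convert a Customer into a JSON-serializable dictionary, skipping fields that are None.
def customer_to_dict(customer: Customer) -> dict:
    result = {}
    for field_name in ("id", "name", "email"):
        value = getattr(customer, field_name)
        if value is not None:
            result[field_name] = value
    return result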

Research Methodology 

Our experiment spanned 7 sprints, totaling approximately 4 months. The first two sprints were used to collect a baseline without Copilot.

We involved 3 agile teams consisting of the same developers throughout the experiment to eliminate variables associated with changes in team composition. After completing a task or resolving a bug, developers filled out forms to ensure structured data collection. Before submitting their code, developers ran a local instance of SonarQube to measure codebase quality. To safeguard against data leakage, we used a custom Docker container with no internet access.

Developer Satisfaction

Even the most feature-rich tool won’t be used if the people who are supposed to use it don’t like it. This is why we measured developer satisfaction using a weighted Net Promoter Score (NPS), a metric ranging from -100 to +100.
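As a rough sketch of how such a score is computed (the classic NPS formula: the share of promoters minus the share of detractors), here is a minimal Python example. The weights and sample responses below are placeholders, not our survey data; our exact weighting scheme is described in the whitepaper.

# Minimal sketch of a weighted NPS calculation.
def weighted_nps(responses):
    """responses: list of (score 0-10, weight) tuples."""
    total = sum(weight for _, weight in responses)
    promoters = sum(weight for score, weight in responses if score >= 9)   # scores 9-10
    detractors = sum(weight for score, weight in responses if score <= 6)  # scores 0-6
    return 100.0 * (promoters - detractors) / total

# Example with three weighted responses -> roughly -16.7
print(weighted_nps([(10, 1.0), (8, 0.5), (6, 1.5)]))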

Overall, our engineers were quite satisfied with Copilot’s performance. They were particularly pleased with the ease of writing code, and the overall NPS we measured was above 60, indicating high user satisfaction. Developers found Copilot helpful both in fixing bugs and in resolving small and large tasks.

Weighted NPS shows developers' satisfaction when using Copilot to resolve bugs and various tasks—small, medium, and large.

Developers seem happiest using Copilot to tackle small or large tasks. 

Complexity and Efficiency 

When we started gathering data, we expected Copilot to generate a lot of excess code for tasks. It turned out that, on average, developers were producing half as much code per story point as before, which is great for maintainability.

The lines of code (LOC) produced with Copilot to resolve a bug were roughly three times those written by developers on their own. At first glance, it seems that Copilot generates unnecessary lines of code. In fact, it generates many more unit tests, leading to better maintainability and code coverage.

We found another surprise when we looked at the time required to implement a new task (in hours). With Copilot, small tasks took significantly longer than the same effort without the tool. The explanation is straightforward: our developers were spending more time exploring Copilot’s suggestions than they would have spent writing the code on their own. Still, we see that as a positive trade-off because it introduces a valuable educational element for junior developers.

In a nutshell, Copilot generated more LOC when resolving bugs and much less code when working on new features.

A graph showing lines of code generated per story point and another one showing development effort per story point in hours.

On average, we have 80 LOC per story point without using Copilot. By using Copilot, we achieve more with less, except when working on small tasks.

Do Developers Produce More Code with Copilot? 

That is quite a common question in the developer community. At Scalefocus, we don’t believe that more code translates into higher productivity. If anything, more code means higher risk in various respects.

Our data shows that with Copilot, developers produce fewer LOC. That suggests a DRY-er (don’t repeat yourself) and more optimized codebase. On average, a developer produces 12-13 LOC per hour without Copilot. With Copilot, developers produce less code but, surprisingly, deliver features faster. This is a win on multiple fronts: faster delivery and a more maintainable codebase with less risk.
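For readers who want to follow the arithmetic behind these averages, here is a minimal sketch of the two metrics referred to throughout this section. The sample task below is illustrative, chosen only to match our no-Copilot averages, and is not a record from our data set.

# Minimal sketch of the productivity metrics discussed above.
def loc_per_hour(lines_of_code, hours_spent):
    return lines_of_code / hours_spent

def loc_per_story_point(lines_of_code, story_points):
    return lines_of_code / story_points

# An illustrative 3-point task that took 20 hours and produced 240 LOC:
print(loc_per_hour(240, 20))        # 12.0 LOC per hour (close to our no-Copilot baseline)
print(loc_per_story_point(240, 3))  # 80.0 LOC per story point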

A graph showing how developers are producing fewer lines of code per hour worked with AI.

With Copilot, our team’s productivity increased while producing fewer lines of code. The average dropped to 9 LOC per hour, yet we slightly increased the speed of task delivery.

Code Quality 

Using a customized Docker container running SonarQube, a popular static code analysis tool, we measured several aspects of code quality. One of the interesting end results is that, despite expectations, duplicated code blocks decreased on average with Copilot. The popular sentiment is that Copilot doesn’t care about duplication and introduces it all the time; our data shows otherwise in a real software development environment.

We also see a slight improvement in so-called code smells: pieces of code that can contribute to the project’s technical debt in terms of maintainability and performance, or that reflect potentially bad coding practices. According to our data, Copilot has a positive impact on that metric as well.

What is concerning is that, in our case, a new security vulnerability appeared on average for every fourth completed task. This finding debunks the myth that AI produces flawless code; it was trained on human code, after all. Developers should monitor this very closely when using tools like Copilot, preferably with specialized static code analysis focused on security, plus software composition analysis. Developers shouldn’t rely on any AI tool to meet the highest quality standards, and everything should be double-checked.
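For teams that want to track the same indicators, here is a minimal sketch of pulling them from SonarQube’s Web API. The server URL, project key, and token are placeholders for your own setup; the measures endpoint and metric keys shown are standard SonarQube ones.

# Minimal sketch: fetching the quality metrics discussed above from SonarQube.
# SONAR_URL, PROJECT_KEY, and TOKEN are placeholders for your own environment.
import requests

SONAR_URL = "http://localhost:9000"
PROJECT_KEY = "my-project"
TOKEN = "squ_xxx"

def fetch_quality_metrics():
    response = requests.get(
        f"{SONAR_URL}/api/measures/component",
        params={
            "component": PROJECT_KEY,
            "metricKeys": "code_smells,duplicated_lines_density,vulnerabilities,bugs,coverage",
        },
        auth=(TOKEN, ""),  # SonarQube accepts the token as the basic-auth username
    )
    response.raise_for_status()
    measures = response.json()["component"]["measures"]
    return {m["metric"]: m["value"] for m in measures}

# Print each metric so it can be compared sprint over sprint.
for metric, value in fetch_quality_metrics().items():
    print(metric, value)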

Analysis of the impact of using Copilot on SonarQube quality metrics during software development.

We used SonarQube to monitor major KPIs and understand Copilot’s impact on the quality of the codebase. 

Challenges in Adapting to Copilot

Getting used to Copilot definitely takes some time. Our developers were given a short training session on best practices for using Copilot to ensure baseline competence with the tool from the get-go. Copilot also learns from their way of working and their code, improving its suggestions over time while analyzing everything in the project.

This cuts both ways. Copilot becomes handier in the process, but at the same time your code goes to a third party, and that is one of the main concerns developers and companies have. This is why we used a custom Docker container without internet access, ensuring there were no customer code or data leaks.

Last but not least, let’s not forget the dependence that software engineers can develop on AI technologies for specific tasks. It can impact their ability to think critically and understand the codebase, and eventually we might end up with worse code quality as a result.

Conclusion 

Copilot provides a really smart way to automate some of the software development work, but it is still in its early stages. It is far from writing perfect code, and it positively impacts productivity, but not for every type of task.

We even reached a data-driven conclusion on the increased productivity, which you can find in our whitepaper: AI Writing Our Code: Experiments with Copilot.

The comprehensive AI research provides much more data, such as the sweet spot beyond which developers see Copilot as extremely useful. It also offers insights into the maintainability and complexity of the code, security, and how Copilot affects bugs, vulnerabilities, and code smells. There were some surprises there as well.

Tools like Copilot will become much better because they will thrive in a much more competitive environment, with so many new AI systems coming out all the time. Generative AI evolves, but human skills will evolve with it, and we will find better ways to be part of the software development process.  

In the near future, every developer will have an AI code reviewer by their side, but it’s not the end of the human software developers. It is the beginning of an era where developers will have more time to focus on crafting more creative and innovative software solutions. 

About the Author:

Dimitar Grancharov

Content Team Manager

An ambitious, highly motivated, and results-driven professional with versatile experience in creative storytelling, copywriting, marketing, corporate communication, journalism, business processes, and management.
