
AIxCC Semifinal Competition: The Most Popular Bug

Following the format of a DARPA prize-based Challenge, the Artificial Intelligence Cyber Challenge (AIxCC) started with the hypothesis that AI-driven cybersecurity could help solve the significant problem of securing the software that runs critical infrastructure. DARPA challenged teams from industry and academia to leverage the latest in large language models (LLMs) and the new wave of AI technology to develop autonomous Cyber Reasoning Systems (CRSs) and compete them at the AIxCC Semifinal Competition (ASC) in 2024 for a spot in the Final Competition in 2025.

During the ASC, the teams’ CRSs successfully and automatically found and patched some of the synthetic vulnerabilities within the open source code that made up the competition’s challenge projects. Their success confirmed that the competition is on track to prove DARPA’s hypothesis and advance the state of the art in cybersecurity. The teams’ performance also showed that some vulnerabilities were easier to find and fix than others, which means there are unsolved challenges remaining for teams to work on as they prepare for the Final Competition and, ultimately, for transitioning their technology into widespread use.

What was the “Most Popular Bug”?

The challenges presented in ASC consisted of 59 vulnerabilities hand-designed by expert challenge authors and spread across five open-source code projects. One of these vulnerabilities was discovered by seven teams and fixed by five, making it the “Most Popular Bug.”

What was this bug, and why was it so popular? The vulnerability, which we will call NGINX-8, was introduced by a small change to existing email proxy functionality within NGINX:

The commit that introduced the vulnerability for NGINX-8

NGINX-8 is a classic buffer overflow vulnerability, where the heart of the problem is the change in the size of a memory allocation from a dynamic size (s->login.len) to a fixed one (100). The problem becomes apparent when a username is copied into this buffer via memcpy, which will write beyond the end of the buffer if the username is longer than 100 characters. This is a classic type of vulnerability for projects written in C, and it’s pretty easy to spot if you know what you’re looking for.

The challenge was that this vulnerability was hidden in the NGINX code base, which weighs in at around 175,000 lines of code (or 175KLOC if you prefer). If you printed this code base out, you’d be looking at a stack of paper around a foot tall! The following screenshot from the video shows a call graph to emphasize the scale of NGINX, where each node represents a function and the edges show the connections between functions.

Call graph of all of the functions in NGINX

Beyond the straightforward nature of NGINX-8, two additional factors made this vulnerability particularly discoverable: code depth and input structure. In the ASC, a competitor CRS was given one or more “test harnesses,” programs that allowed the competitors to exercise a specific part of the code base by sending in different inputs. This idea comes from software testing practices, where it’s more efficient to test different parts of the code separately rather than running the whole program just to see if one part works correctly. We can get a sense of relative size by looking at just what was reachable from the harness used to trigger NGINX-8.

Call graph of functions that could be reached by the harness for NGINX-8

However, despite the size of the reachability graph above, the distance from the entry point to the vulnerability was only four functions, as shown in the call graph below. This is a relatively short distance, considering that some of the deepest bugs in ASC were around 28 functions deep.

Call Graph of the path to the point of crash

The last factor that made this bug more commonly discovered was the lack of input complexity needed to reach it. Since the vulnerability is in the user authorization logic, basically all an input has to do is look like a user login, which doesn’t require much at all. A minimal input to trigger the vulnerability could be just the string “USER” followed by a space, a username longer than 100 characters, and a concluding newline, as shown in the code block below.

USER AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This might seem simple, but a CRS has to work on code it may have never seen before and zero in on something as specific as this: a needle in the proverbial haystack. In the case of NGINX, competitors had to search for vulnerabilities across three different harnesses, from which much of the 175KLOC of NGINX was reachable. To put this in perspective, the largest hand-written challenges from the 2016 DARPA Cyber Grand Challenge finals were around 5KLOC in size, and they were built for a purposely simplified computing environment.

NGINX also contains code to handle multiple protocols like HTTP, POP3, IMAP, and SMTP, so determining the exact input structure needed to reach the bug is another challenge. Since the input had to be more than 100 bytes long, trying to brute force all possibilities for an input of 101 bytes would mean we’d need to try 256^101 different inputs (a number that is over 240 digits long). And even that is cheating a little bit, because it presumes we know the necessary size. Fortunately for competitors, one popular technique called fuzz testing is well-suited to discovering inputs and functionality like this, which we’ll talk about more later in the post.
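As a quick sanity check on that digit count (illustration only, not competition code), a few lines of C confirm that 256^101 has 244 decimal digits:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* 256^101 = 2^808; its decimal digit count is floor(101 * log10(256)) + 1 */
    double digits = floor(101.0 * log10(256.0)) + 1.0;
    printf("256^101 has %.0f decimal digits\n", digits);   /* prints 244 */
    return 0;
}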

 

How do you prove a bug?

So with 58 other vulnerabilities in play, plus potentially undiscovered ones, how do we know if something is really a bug? For ASC, the answer was provided by special instrumentation called “sanitizers” (such as AddressSanitizer) that can detect a variety of erroneous conditions at runtime, then stop execution and print out information for debugging, like in the screenshot below. The major advantage of this approach is that sanitizers are well-known tools in software testing: they are designed to produce very few false positives, and their implementation is impartial.

With this simple litmus test for whether a vulnerability was detected, competitors just needed to supply three things to reproduce the bug (a simplified sketch of how they fit together follows the list):

  1. An input that triggers the vulnerability (a “crashing input”)
  2. The harness to send it to
  3. The sanitizer that will trigger when the harness runs the input
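To make those three pieces concrete, here is a heavily simplified stand-in for a harness, written in the libFuzzer style. It is not the actual ASC harness (which drives NGINX itself), and handle_user is an invented function for illustration; LLVMFuzzerTestOneInput is the standard libFuzzer entry point. Built with something like clang -g -fsanitize=address,fuzzer, it lets the sanitizer do the judging:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Mimics the NGINX-8 pattern: a fixed 100-byte allocation receives a copy
 * whose length comes straight from the input. (Illustrative only.) */
static void handle_user(const uint8_t *name, size_t name_len) {
    uint8_t *login = malloc(100);       /* fixed size, like ngx_pnalloc(c->pool, 100) */
    if (login == NULL) {
        return;
    }
    memcpy(login, name, name_len);      /* overflows the heap buffer when name_len > 100 */
    free(login);
}

/* libFuzzer-style entry point: the fuzzing engine, or a replay of a saved
 * crashing input, lands here. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    const char prefix[] = "USER ";
    size_t plen = sizeof(prefix) - 1;

    if (size > plen && memcmp(data, prefix, plen) == 0) {
        handle_user(data + plen, size - plen);
    }
    return 0;
}

Replaying an over-long USER line (such as the minimal input shown earlier) through a binary built this way produces an AddressSanitizer heap-buffer-overflow report, tying the input, the harness, and the sanitizer together in exactly the way ASC required.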
 

To further show that a CRS could help developers in the real world, competitors were also asked to specify the exact change that introduced the bug. The standard process for determining this involves selectively testing different versions of the code until the offending change is found, a task that is easily automated (tools like git bisect do exactly this), and identifying where a bug was originally introduced is considered best practice in real-world bug reports.

These requirements mirror all of the common questions that would come up in a real-world bug report, except for “How do you fix it?”

How do you patch a bug?

When a bug is reported, the ideal case is that the reporter also provides a code change that fixes the bug, called a patch. Naturally, such a code change can be large or small depending on the complexity of the code involved and the problem being fixed. Since a small change introduced this bug, only a small change was required to fix it.

Reference patch for NGINX-8

But how can you tell if a change like the one above actually fixed the bug? How can you be sure it didn’t break anything else?

The answer to the first question lies in the crashing input that was required as part of the vulnerability discovery. Naturally, if the bug is fixed, running that input through a version of the harness with the patch applied should not trigger the sanitizer the way it did in the unpatched version.

The second part of the answer is that we can tell if a patch broke the existing code by running a set of tests. In ASC, both the challenge projects’ original tests and some vulnerability-specific tests were used. The vulnerability-specific tests were primarily designed to keep competitors from accidentally removing functionality in an attempt to fix vulnerabilities.

If a patch can pass both these checks, it answers the questions a developer would ask to see if the fix was sufficient. But now this raises the question… how would an automated software system do these things?

How does a CRS find a bug?

The heart of AIxCC is using AI not only to further the state of automated vulnerability discovery, but more importantly, to further the state of automated program repair. This is a very active area of research with new methods being explored, so we expected competitors to bring novel approaches. While the prize winners will be required to open-source their code after the finals, we’re not disclosing any of their secrets yet. That’s not to say there weren’t common approaches and themes that we expected to see in the competition, one in particular being the use of LLMs (https://aicyberchallenge.com/storage/2024/09/AIxCC-Poster-LLM.jpg) and fuzz testing. If you want to read up on specifics, there are posts from some of our collaborators on how LLMs can help with finding and patching vulnerabilities:

Google: From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code

OpenAI: AI Cyber Risk Benchmark: Automated Exploitation Capabilities 

OpenSSF:
AI Cyber Challenge (AIxCC) and the Needle Linux Kernel Vulnerability – Part 1
AI Cyber Challenge (AIxCC) and the Needle Linux Kernel Vulnerability – Part 2

For now, let’s focus on fuzz testing (also called “fuzzing”), where modern approaches instrument the target code and use feedback to discover when new inputs exercise different functionality. Traditionally, fuzzing engines “mutate” inputs using randomized transformations in an attempt to trigger different logic in the target program. By saving and further mutating inputs that trigger new functionality, fuzz testing can discover different code behaviors and exercise the target in a variety of ways. Repeated over thousands or millions of iterations, this can produce inputs with a surprising amount of structure, as if pulled from thin air. It sounds like magic, but fuzzing has proven to be one of the most effective techniques for discovering unknown software vulnerabilities over the last 15 years.
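The core mutate-and-run loop can be sketched in a few dozen lines. The version below is purely illustrative: run_target is a placeholder for the sanitized harness, and real engines such as AFL++ or libFuzzer add coverage feedback, a corpus of interesting inputs, and length-changing mutations.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Placeholder for the program under test; in ASC this role is played by the
 * competition harness, built with sanitizers enabled. */
static void run_target(const unsigned char *data, size_t len) {
    (void)data;
    (void)len;
}

/* Overwrite a few random bytes with random values. Real fuzzers use many more
 * mutation strategies, including splicing inputs together and resizing them. */
static void mutate(unsigned char *buf, size_t len) {
    size_t flips = 1 + (size_t)(rand() % 4);
    for (size_t i = 0; i < flips; i++) {
        buf[(size_t)rand() % len] = (unsigned char)rand();
    }
}

int main(void) {
    /* Seed with a well-formed login, much like the valid-looking fragments
     * visible in the competitor inputs shown later. */
    static const unsigned char seed[] = "USER alice\r\nPASS secret\r\nQUIT\r\n";
    unsigned char buf[sizeof(seed)];

    srand((unsigned)time(NULL));
    for (unsigned long iter = 0; iter < 1000000UL; iter++) {
        memcpy(buf, seed, sizeof(seed));
        mutate(buf, sizeof(seed) - 1);        /* exclude the trailing NUL */
        run_target(buf, sizeof(seed) - 1);    /* a sanitizer flags any memory error */
    }
    return 0;
}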

Demo of rapid input mutation used in fuzz testing, changed data in red

One of the obvious advantages of fuzz testing is that it can be combined with sanitizers to detect when a vulnerability is triggered. This combination is naturally attractive to competitors because it means that if they discovered a crash during fuzz testing, they would have most of the components needed to score a vulnerability discovery (the input, the harness, and the sanitizer). While ASC did not look into the inner workings of the competitors’ CRSs to see how they generated their inputs, we can look at the hex dumps of some of the crashing inputs for NGINX-8 generated by competitors to see similarities, differences, and some signs of the random input patterns that are typically associated with fuzz testing.

Competitor Input 1

The first competitor input we’ll take a look at is a really clean one, with no random binary data or strange construction. We’ve highlighted the two most prominent portions, the USER string and the overflow portion, which in this example are both easily identifiable. This could be the result of logic to minimize and clean up an input generated by fuzzing, or it could have been generated by a more calculated approach such as symbolic execution, LLM prompting informed by code snippets, or other advanced techniques. Either way, we can see that it has a username consisting of exactly 100 D’s followed by 50 E’s, which suggests that the CRS had determined that the magic number to trigger an overflow is 100 bytes. It also includes a password and the quit directive, which aren’t necessary for triggering the vulnerability but fit with what a well-formed POP3 exchange would look like.

Competitor Input 2

The second input has some of the hallmarks of the mutation-based generation that fuzz testing might produce, specifically a mix of printable and non-printable characters, as well as a few pieces that look like a well-formed input that just got mangled a bit. Again we see a password and a quit directive, which suggests that teams may have begun their fuzz testing with a valid input for this protocol. It’s a bit harder to recognize, but we can still see the similarity to the first input, though the next inputs exhibit even more binary noise.

Competitor Input 3

Competitor Input 4

For ASC, there was no requirement that a CRS present a “clean” input; it just had to trigger the vulnerability. Inputs 3 and 4 are fairly minimal in that they just include the user directive, then mostly 0x00 or 0xFF bytes scattered with random data, and end with a newline (0x0A or 0x0D0A). That being said, we can still learn some details from these. For example, one of these inputs demonstrates that the logic that checks for the word “USER” is actually case-insensitive. While these inputs may not be pretty, they show what it takes to trigger this vulnerability.

Competitor Input 5

Competitor Input 6

These last two inputs show one of the tell-tale signs of input mutation from fuzz testing: multiple inputs that appear to be spliced together. In input 5 we even observe the splicing of an input for HTTP, which is a completely separate protocol! This is a good reminder that a CRS had to be designed to work without prior knowledge of the target, and that NGINX can also be configured as an HTTP server. It is also interesting to see that this input still triggered the crash despite the extra garbage at the end. Taken as a whole, the inputs ranged from ones that looked hand-generated to ones that appeared to be taken straight out of a fuzzer.

That’s not to say fuzz testing was the only way a CRS could generate an input or find a bug. LLMs have proven themselves to be very flexible and can generate surprising insights into code, but fuzz testing is a battle-tested strategy that was used by many competitors to find vulnerabilities and inputs that would trigger them. The vulnerability’s easy reachability and simple input format also play into the strengths of fuzzing, so we suspect NGINX-8’s susceptibility to discovery via fuzzing was a contributing factor in making it the most popular bug.

 

How does a CRS fix a bug?

Automatic patch generation has been another area of intense research over the years. Recently, it has been particularly influenced by the increasing ability of LLMs to produce code much as human developers do. This property of LLMs has made them popular for automatic program repair, but the tendency of these models to produce variations or hallucinate makes reviewing the reliability and effectiveness of LLM-generated patches an engineering task in itself. Also, while LLMs were an obvious and attractive choice for patch generation, they are by no means the only possible technology in play. The tremendous capability of the competitor teams means that any of them could have leveraged academic or private research techniques to improve their patch generation capabilities.

Generally speaking, an LLM is much like a human developer in that it needs some indication of what bug it is trying to fix, as well as an understanding of the context of the code involved. Regardless of whether a fix is being generated by an LLM, a human, or other means, patching varies in difficulty, typically depending on how much code is involved and how many different parts of the code need to be modified to effectively fix the problem. In all cases, being able to isolate the change that introduced the bug makes it easier to generate a fix, as shown in the prompt below:

Prompt:
Please generate a patch in unified diff format that fixes the bug introduced by the following change:

diff --git a/src/mail/ngx_mail_pop3_handler.c b/src/mail/ngx_mail_pop3_handler.c
index edfd986..ff4b181 100644
--- a/src/mail/ngx_mail_pop3_handler.c
+++ b/src/mail/ngx_mail_pop3_handler.c
@@ -316,7 +316,7 @@ ngx_mail_pop3_user(ngx_mail_session_t *s, ngx_connection_t *c)

    arg = s->args.elts;
    s->login.len = arg[0].len;
-   s->login.data = ngx_pnalloc(c->pool, s->login.len);
+   s->login.data = ngx_pnalloc(c->pool, 100);
    if (s->login.data == NULL) {
        return NGX_ERROR;
    }

    
Response:

diff --git a/src/mail/ngx_mail_pop3_handler.c b/src/mail/ngx_mail_pop3_handler.c
index ff4b181..edfd986 100644
--- a/src/mail/ngx_mail_pop3_handler.c
+++ b/src/mail/ngx_mail_pop3_handler.c
@@ -316,7 +316,7 @@ ngx_mail_pop3_user(ngx_mail_session_t *s, ngx_connection_t *c)

    arg = s->args.elts;
    s->login.len = arg[0].len;
-   s->login.data = ngx_pnalloc(c->pool, 100);
+   s->login.data = ngx_pnalloc(c->pool, s->login.len);
    if (s->login.data == NULL) {
        return NGX_ERROR;
    }

    

Example of asking an LLM to provide a patch for a bug

In the case of NGINX-8, this task was as easy as competitors could have asked for, given the small amount of code modification needed both to introduce and to fix the bug. This is in contrast to some other challenge vulnerabilities, where entirely new chunks of functionality were added. Other vulnerabilities had very subtle problems that needed to be fixed in multiple places across different functions. The simplicity and small size of NGINX-8 again played to the competitors’ strengths, but that’s not to say that, even with these advantages, every competitor produced a patch identical to the reference solution.

 

Observations from the Competition

So far we’ve been focused on this fantastic bug and how to find it, but at the end of the day, this was also a competition! There is a lot of interesting material when we look at this as a game and compare how the competitors fared. To keep a neutral focus, we’ll concentrate on the interesting behaviors we observed rather than naming teams, except to show the overall timeline of events.

One of the unique developments on NGINX-8 was that one competitor improved on a previously submitted patch. This team’s initial patch was sufficient to mitigate the vulnerability for the input that they discovered, but it did not fix the problem in general, because it only increased the size of the buffer from 100 to 128 bytes, as shown below.

Team A’s first patch

Team A’s scoring input used the minimum length required to overflow the buffer, so this indicates that they at least tested the patch against their own input. By contrast, their second patch handles any input that would overflow the original buffer by adding a length check that returns an error right before the point where the overflow would occur, and it even adds a helpful log message! While this effectively fixes the vulnerability, it does introduce an inconsequential limitation in functionality by disallowing usernames longer than 100 characters. This mirrors what happens with human developers, where an inexperienced programmer’s patch might work but add constraints, while a more prepared developer could fix the problem without sacrificing any functionality.

Team A’s second patch
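Purely as an illustration, a length check of the kind described above might look roughly like this inside ngx_mail_pop3_user; the exact message, limit handling, and return value are assumptions on our part, not Team A’s actual patch.

    arg = s->args.elts;
    s->login.len = arg[0].len;

    /* Hypothetical guard (illustration only, not Team A's actual code):
     * reject logins that would not fit in the fixed 100-byte buffer
     * instead of overflowing it. */
    if (s->login.len > 100) {
        ngx_log_error(NGX_LOG_INFO, c->log, 0,
                      "pop3 login longer than 100 bytes rejected");
        return NGX_MAIL_PARSE_INVALID_COMMAND;
    }

    s->login.data = ngx_pnalloc(c->pool, 100);
    if (s->login.data == NULL) {
        return NGX_ERROR;
    }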

To illustrate how that first, incomplete patch would have fared against the crashing inputs generated by the other competitors, the table below shows each crashing input (including the reference input) run against the unpatched code and each team’s patch. No information was shared between CRSs in ASC, but the table shows that we can perform a post-mortem adversarial patch analysis and see the difference in performance. In this case, all of the other patches swapped the fixed length for a dynamic length rather than trying to increase the fixed size, so none of the other patches had issues with any of the crashing inputs.

 

Code version        | Reference Input | Team A Input | Team B Input | Team C Input | Team D Input | Team E Input | Team F Input | Team G Input
Challenge Code      | Crash           | Crash        | Crash        | Crash        | Crash        | Crash        | Crash        | Crash
Team A Patch #1     | Crash           | No crash     | No crash     | No crash     | No crash     | No crash     | Crash        | No crash
Team A Patch #2     | No crash        | No crash     | No crash     | No crash     | No crash     | No crash     | No crash     | No crash
Team B Patch        | No crash        | No crash     | No crash     | No crash     | No crash     | No crash     | No crash     | No crash
Team C Patch        | No crash        | No crash     | No crash     | No crash     | No crash     | No crash     | No crash     | No crash
Team D Patch        | No crash        | No crash     | No crash     | No crash     | No crash     | No crash     | No crash     | No crash
Team E Patch        | No crash        | No crash     | No crash     | No crash     | No crash     | No crash     | No crash     | No crash

Table: Comparing crashing inputs against competitor patches

It’s interesting to see that even for one of the simplest bugs, there was some variation between patches and crashing inputs. Seeing that one team’s generated input may have been able to prove that another team’s fix was incomplete shows a level of potential for AI teaming that sounds like science fiction becoming reality. But let’s also give credit where credit is due: Team A’s CRS was able to improve upon its own patch and submit the new version within 30 minutes, no small feat for an autonomous system!

Speaking of speed, let’s put it in perspective: seven teams were able to find this bug within the 4-hour time limit for each challenge project. In fact, all of the competitors that discovered the vulnerability did so in under 122 minutes, with the majority finding it in under 36 minutes and the fastest CRS finding it in under 14 minutes. Patch times were similarly impressive, with the fastest patch turnaround being under five minutes and the average time to patch being just under 23 minutes.

Timeline of scoring events for NGINX-8

The graph above shows the performance of these teams on just a single vulnerability. Across all of NGINX, teams that identified at least one vulnerability discovered an average of 4.7. Not too shabby for completely autonomous systems designed to work without prior knowledge! Again, it’s worth remembering that NGINX was deliberately chosen as a real-world project with exceptional adoption. While it was one of the smaller challenge projects, it is still a sizable code base to tackle, whether by a human or an automated system.

 

Conclusion

NGINX-8 was the most popular bug from ASC primarily because it was a relatively small and easily discovered vulnerability. The performance of the competitors on this problem is a microcosm of the challenge overall: a number of teams discovered the vulnerability, but only a subset of those successfully patched it. This highlights the difficulty of the challenge, where even some of the most skilled human developers or cybersecurity experts would struggle to find and patch a vulnerability in such a large target in the time allotted. Given the ability to scale automated systems like these, this level of performance shows great promise for the future.

This was the story of the most popular vulnerability in the ASC, but it was also just one of 59. Stay tuned for more information and details about the competition in the future, and we hope to see you online or at the AIxCC Experience at DEF CON 33 in Las Vegas!


The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.