
How Were LLMs Used in the AIxCC Semifinal Competition

The AI Cyber Challenge’s Semifinal Competition saw a clash of autonomous systems, the likes of which haven’t been seen since DARPA’s Cyber Grand Challenge in 2016. With AI capability rising and the impact of software vulnerabilities becoming clearer than they’ve ever been, the renewed challenge is timely. We need the best of industry and academia to take the promise of AI and turn it towards helping secure the systems on which our world depends.

But the million-dollar question is: how did these competitors harness AI to find and ultimately fix vulnerabilities? How do these systems work, and where do these new capabilities fit in?

This post will demystify how AI was used by some of the competitor systems in the AIxCC Semifinal Competition (ASC), specifically large language models (LLMs). Since the AIxCC Finals event isn’t until August 2025 at DEF CON 33, this post explains key concepts and common themes seen in competition data while not giving away the technical details that could be considered part of a team’s competitive advantage.

This post assumes basic familiarity with AIxCC and LLMs like Gemini, Claude, or ChatGPT, so check out the ASC summary post or Large Language Models explained briefly if you want to brush up.

How did competitors use LLMs?

First things first, the “players” in this competition are autonomous systems called Cyber Reasoning Systems (CRS), which were built to discover and patch vulnerabilities in large code projects. These systems vary widely in how they are implemented, but broadly, they consist of multiple components that focus on specific tasks, plus orchestration logic that manages the flow of information and decides when specific tasks should be started or stopped. While the systems were built to process code repositories presented through game-specific interfaces, finding vulnerabilities in completely unknown code is a tall order, so CRSs needed to handle a lot of variation and complexity.

While we can’t disclose the details of AIxCC competitor systems at this time, we can point to Shellphish’s explanation of their Cyber Grand Challenge CRS named Mechanical Phish as an analog for an AIxCC CRS. In the diagram below, the “Meister” component is the primary orchestrator, and the “Ambassador” and “Network dude” components manage communications with the game or outside world. The “TI API” and “IDS Tap” illustrate similar application programming interfaces (APIs) through which AIxCC CRSs could discover the code they should process and submit their scoring attempts.

Architecture Diagram from Mechanical Phish writeup

One of the most important interfaces in ASC was the one that allowed competitors to send requests to popular LLMs from Google, Anthropic, and OpenAI via API requests. This is similar to how a person would interact with an LLM, except automated systems just send their question in a single request instead of typing it into a form on a website. A traditional question-and-response session with an LLM might look like the following:
User: What is the time complexity of QuickSort in the worst case?
LLM: The worst-case time complexity of Quicksort is O(n²).

A CRS can do the same thing programmatically with a single API request by encoding the question as a “message,” and the returned response would include the answer as a “message.” To ask follow-up questions, a CRS would encode the previous results into the set of messages sent in the request, which could be extended to an entire conversation.

# Using the OpenAI Python SDK as one concrete example; other providers expose similar clients
from openai import OpenAI

client = OpenAI()

# The conversation so far, plus a follow-up question
messages = [
    {"role": "user", "content": "What is the time complexity of quicksort in the worst case?"},
    {"role": "assistant", "content": "The worst-case time complexity of Quicksort is O(n²)."},
    {"role": "user", "content": "How can we improve the worst-case performance of quicksort?"}
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)
print(response.choices[0].message.content)

This is a powerful capability because it allows programmatic access to models with impressive capabilities. Still, LLMs are not infallible, and working with new technology comes with costs in terms of time and budget. For ASC, the constraints on both of these factors were straightforward: for each of the five challenge projects, teams had four hours and $100 to spend on LLMs to find and patch as many vulnerabilities as possible.

While the parameters of the problems are simple, the underlying challenge is vast: How can competitors use LLMs to push forward the state of the art in automated code security, which has been an active research topic for over a decade?

What do LLMs bring to cybersecurity?

One of the big motivations of AIxCC is to drive research into how best to leverage AI and LLMs to augment cybersecurity. Beyond LLMs’ obvious natural language capabilities, they are surprisingly good with code. This may be because code is a form of language with stricter rules, or because the vast wealth of code the models have been trained on allows them to identify common patterns at an even more granular level than human coders would normally think about. Regardless of the reason, LLMs can produce correct, well-formatted code from prompts and answer questions about code snippets with clarity, though they do still get things wrong.

When it comes to a competition centered on code vulnerabilities, the question is, “How can we leverage LLMs’ coding abilities to find and ultimately fix vulnerabilities?” One of the most obvious approaches is to ask a model to identify a vulnerability in a code snippet. Because finding vulnerabilities is a fundamentally hard problem with many contributing factors, models easily spot trivial bugs but are less accurate at identifying more difficult vulnerabilities.

User: Here's a piece of Python code. Can you identify any vulnerabilities?

import os

def delete_file(filename):
    os.system(f"rm {{filename}}")

user_input = input("Enter the file to delete: ")
delete_file(user_input)

LLM: Yes, this code is vulnerable to command injection. Since os.system(f"rm {filename}") directly interpolates user input into a shell command, an attacker could input something like "; rm -rf /" and execute arbitrary commands.

LLMs can find simple vulnerabilities easily.

On the flip side, we can also ask an LLM to patch a particular vulnerability in a given piece of code. This is a bit easier for LLMs because it is a less open-ended problem. It also plays to the models’ strength in modeling language and patterns, since they can draw upon the collective history of people publicly discussing bugs in forums like Stack Overflow and internet blog posts explaining security vulnerabilities and how to fix them.

User: Can you fix the vulnerability?
LLM: Yes! Here's a secure version of your code that mitigates command injection by using subprocess.run() with an argument list:

import subprocess

def delete_file(filename):
    # Passing arguments as a list avoids the shell, so input cannot inject commands
    subprocess.run(["rm", "--", filename], check=True)

user_input = input("Enter the file to delete: ")
delete_file(user_input)


LLMs can also fix vulnerabilities.

While traditional approaches have been developed for both finding and fixing vulnerabilities, as was highlighted in DARPA’s 2016 Cyber Grand Challenge, there’s an asymmetry in what such approaches yield. Approaches leveraging fuzz testing (also known as fuzzing) and program analysis have matured, become effective at finding bugs, and seen wide adoption. Fixing vulnerabilities and generating patches have proved more difficult because doing so requires correctly modeling the code and the vulnerability and producing a fix that doesn’t break existing functionality. Some promising research has made strides in this area, but even if they aren’t perfect, LLMs become very attractive if they can naturally handle some of the most complex parts of fixing vulnerabilities.


Getting the most out of LLMs

If you’ve heard about LLMs, you’ve probably heard about “hallucinations” or the variety of answers they can produce. So, how does an automated system with high standards for correctness, like a CRS, get reliable answers?

We know that an LLM may not answer correctly, so we should discuss what incorrect responses look like and how to handle them. If asked to find a vulnerability, an LLM will respond with either a description of a real vulnerability, an assertion that no vulnerability is present, or a description of a non-existent vulnerability. If we don’t already know whether there is a vulnerability and what it is, the answer could be wrong in the form of a false negative or a false positive, as shown in the table below. This means that a CRS must have ways to validate LLM responses, but it should also do everything it can to improve the accuracy and reliability of its LLM interactions.

Reliability means LLMs would rarely produce false negatives or false positives
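
To make the validation idea concrete, here is a minimal sketch of how a CRS might confirm an LLM-reported vulnerability before trusting it, assuming it has a sanitizer-instrumented harness it can run. The harness command, the AddressSanitizer check, and the helper names are illustrative stand-ins, not any team’s actual tooling.

# A minimal sketch: an LLM's claim only counts once it is confirmed by actually
# running the program. The harness command and sanitizer check are hypothetical.
import subprocess

def triggers_sanitizer(harness_cmd, candidate_input: bytes) -> bool:
    """Feed the LLM-suggested input to the instrumented harness and look for a crash."""
    result = subprocess.run(harness_cmd, input=candidate_input,
                            capture_output=True, timeout=30)
    # Sanitizers such as AddressSanitizer report errors on stderr and exit non-zero.
    return result.returncode != 0 and b"ERROR: AddressSanitizer" in result.stderr

def validate_llm_finding(harness_cmd, llm_suggested_inputs):
    """Filter out false positives: keep only claims we can reproduce."""
    return [inp for inp in llm_suggested_inputs if triggers_sanitizer(harness_cmd, inp)]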

“Prompt engineering” is a concept that has arisen to describe how one can tailor a prompt to get LLMs to give responses more in line with the user’s desires. The simplest example observed in ASC was the common technique of telling the LLM how to format its output. For example, patches had to be emitted in the standard unified diff format, which is simple for an LLM to produce.

User: Produce a patch in the unified diff format that fixes the vulnerability as shown above.
LLM: Here is the patch in unified diff format that fixes the command injection vulnerability in the original Python script:

--- vulnerable.py	2025-02-10 12:00:00
+++ secure.py	2025-02-10 12:01:00
@@ -1,7 +1,8 @@
-import os
+import subprocess
 
 def delete_file(filename):
-    os.system(f"rm {filename}")
+    # Passing arguments as a list avoids the shell, so input cannot inject commands
+    subprocess.run(["rm", "--", filename], check=True)
 
 user_input = input("Enter the file to delete: ")
 delete_file(user_input)


The simplest form of prompt engineering is specifying how the LLM should answer.

While prompt engineering is helpful for managing response formats, more importantly, it can be used to increase the accuracy and reliability of LLM responses. This is still an active area of research, but it’s clear from the ASC data that teams felt it made a difference, because every team’s LLM requests featured prompt engineering to some degree. One common example was altering the “system prompt,” the instructions given to a model before the user input is processed. This influences how the LLM constructs responses, such as by limiting speculation or encouraging step-by-step explanations. Other examples of prompt engineering seen in ASC included attempts to increase reliability through praise and threats, making questions extremely specific, and using custom input and output formats (such as only allowing yes/no answers).
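
As a rough illustration of these techniques, here is a minimal sketch of a request that combines a system prompt limiting speculation with a constrained yes/no output format. It reuses the OpenAI client shown earlier; the wording of the prompts is purely illustrative and not drawn from any competitor’s CRS.

# A minimal sketch combining a system prompt with a constrained output format;
# the prompt wording is illustrative, not any team's actual prompt.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": (
            "You are a security auditor. Only report vulnerabilities you can justify "
            "from the code shown. Answer with a single word: YES or NO."
        )},
        {"role": "user", "content": (
            "Does the following function contain a command injection vulnerability?\n\n"
            "def delete_file(filename):\n"
            "    os.system(f\"rm {filename}\")\n"
        )},
    ],
)
print(response.choices[0].message.content)  # expected to be just "YES" or "NO"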

 

The importance of context

Given the importance of different kinds of prompt engineering and the common overall workflows of identifying and fixing vulnerabilities, another clear observation from ASC was that competitors used prompt templates to accomplish common tasks. Such a prompt template might start with a prefix intended to increase reliability, continue with a specification of the desired task, and end with the desired output format. Such templates end up sounding similar to how one might talk to a human engineer, just much more mechanical and verbose.

User: Template-Based Prompt for Reliable Code Generation
Prefix (to improve reliability):
"You are an expert [programming language] developer. Follow best practices and optimize for readability and performance."
Task Specification:
"Write a function in [language] that [performs a specific task]. The function should [handle edge cases, follow a specific paradigm, etc.]."
Desired Output Format:
"Return only the complete and properly formatted code without explanations."
Example Prompt:
"You are an expert Python developer. Follow best practices and optimize for readability and performance. Write a Python function that takes a list of integers and returns a new list with only the prime numbers. The function should be efficient and handle edge cases such as empty lists and negative numbers. Return only the complete and properly formatted code without explanations."

Using a template can help apply multiple prompt engineering techniques
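
In code, such a template is often just a string with slots filled in per task. Below is a minimal sketch of that idea using the example prompt above; the template text and variable names are illustrative, not taken from any competitor’s system.

# A minimal sketch of a prompt template: prefix for reliability, task slot, output format.
PROMPT_TEMPLATE = (
    "You are an expert {language} developer. Follow best practices and optimize "
    "for readability and performance. "            # prefix: reliability framing
    "{task} "                                       # task specification
    "Return only the complete and properly formatted code without explanations."  # output format
)

prompt = PROMPT_TEMPLATE.format(
    language="Python",
    task=("Write a Python function that takes a list of integers and returns a new "
          "list with only the prime numbers. Handle edge cases such as empty lists "
          "and negative numbers."),
)
print(prompt)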

The key factor for improving LLM answers is including relevant context in the prompt, especially when dealing with potentially unknown code. This is crucial when identifying vulnerabilities, since providing enough context affects an LLM’s ability to spot patterns across code. In our earlier examples of asking an LLM to fix a piece of code, we included every line of code, but when dealing with a challenge project containing millions of lines of code, being able to pick exactly what to include or exclude becomes critical.

Naturally, there are tradeoffs between including more context in each request and how much that request costs in terms of the LLM budget (which we’ll talk about in a later section), and different teams approached this problem differently. Some teams tended toward shorter and more precise questions, while others used the paradigm of longer conversations to build context and clarify or correct responses.
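
One way to picture this tradeoff is as a simple packing problem: rank candidate code snippets by relevance and include as many as the token budget allows. The sketch below illustrates the idea only; the four-characters-per-token estimate, the budget, and the relevance function are assumptions for the example, not how any team actually selected context.

# A minimal sketch of fitting code context into a prompt under a token limit.
def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough rule of thumb; real tokenizers vary by model

def build_context(snippets, relevance, max_tokens=8_000):
    """Pack the most relevant code snippets first, stopping at the token budget."""
    context, used = [], 0
    for snippet in sorted(snippets, key=relevance, reverse=True):
        cost = estimate_tokens(snippet)
        if used + cost > max_tokens:
            break
        context.append(snippet)
        used += cost
    return "\n\n".join(context)

# Example: prefer snippets that mention the function reported in a crash trace.
snippets = ["def parse_header(buf): ...", "def log_event(msg): ..."]
print(build_context(snippets, relevance=lambda s: "parse_header" in s))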

This is still an area of active research, and one of the most interesting meta-questions going into the finals will be how the evolution of the most powerful LLMs will affect techniques like prompt engineering and the context sizes teams build into their CRSs. The teams aren’t building in a vacuum, and at the end of the day, only the most effective CRSs will win.

 

LLMs and competing in ASC

How LLMs are used in a cybersecurity context is important, but the reality is that ASC was a competition with rules, and each CRS was playing to win. Naturally, the competitors had to find and fix vulnerabilities for points to earn their way into the top seven and a spot in the finals. We’ll start with some reflections on the contents of the questions and then delve into workflows and gameplay considerations.

 

What does a CRS ask about?

Given the ability to ask questions that include normal language and code, what did the competitor systems ask about? The short answer is the bulk of requests were naturally focused on identifying vulnerabilities, but what does that look like? Since disclosing specific details would spoil the competition at this point, we’ll look across the LLM queries in bulk to see some themes across CRSs and projects.

In order to crunch the large quantity of text involved in the over 70,000 LLM queries, we used an analysis called Term Frequency-Inverse Document Frequency (TF-IDF), which really just looks for words that stood out as being important relative to the whole dataset. This kind of analysis allows us to extract keyword summaries and perform ranking to identify any trends or outliers across the five challenge projects. We ended up creating some visuals that included word clouds, bar charts, and heatmaps to try to highlight patterns across projects, and this led to some interesting results.
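
For readers who want to see what this looks like in practice, here is a minimal sketch of a TF-IDF pass using scikit-learn on a stand-in corpus. The actual ASC query data is not public, so the documents below are placeholders standing in for each project’s concatenated LLM queries.

# A minimal sketch of TF-IDF keyword extraction with scikit-learn on placeholder data.
from sklearn.feature_extraction.text import TfidfVectorizer

# Pretend each "document" is the concatenated LLM queries for one challenge project.
project_queries = {
    "linux":   "struct buffer index null size harness sanitizer vulnerability",
    "nginx":   "buffer len size request harness sanitizer vulnerability",
    "jenkins": "string command throws jazzer harness sanitizer vulnerability",
}

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(project_queries.values())
terms = vectorizer.get_feature_names_out()

# Print the highest-scoring terms per project: the raw material for word clouds and heatmaps.
for project, row in zip(project_queries, matrix.toarray()):
    top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:5]
    print(project, [term for term, score in top if score > 0])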

A word cloud shows relative importance by the size of terms, which is helpful as a quick and fairly intuitive way to see what terms were important across projects and which ones may only appear in one or two projects. Based on our discussion on requests, we expect to see three things represented in the data: the questions themselves, prompt engineering techniques, and code snippets. Understanding which terms map to which of the three pieces takes a little bit of background knowledge, but the word cloud representations for each challenge project make some of these ideas stand out.

Word clouds for Linux, Nginx, and Jenkins

Looking at these three examples, we can see that there is a huge focus on input and similar words across projects and languages. In Linux and Nginx, we see “buffer” and “struct” feature prominently, while in Jenkins, we see “string” and “command,” which all point to different terms relating to input. This relative emphasis on input words makes sense because the competitors had to provide inputs that would trigger the vulnerabilities in the challenge harnesses regardless of language, so we can deduce that many of the questions focused on input.

In addition to some general terms, each project tended to feature more language-specific terms. For the C projects Linux and Nginx, we see terms like “struct,” “index,” and “null,” and “buffer,” “len,” and “size,” respectively. This is in contrast to Jenkins, where we see terms that are more common or native to Java: “string,” “throws,” and “command.” This reinforces the expectation that code snippets made up a large share of the LLM requests and fits with the idea that a CRS asks questions focused on specific code.

While word clouds can give us an idea of an overall flavor, they aren’t as easy to compare and see trends in. We can use the same data to instead generate a heatmap to provide insights on which terms were frequently used when considering the distribution across the five challenge projects. We again see similarity due to languages, with the two Java projects, Jenkins and Tika, sharing terms like “XML,” “throws,” and “jazzer” (the name of a Java fuzzer), and the C projects showing similarity for terms like “buffer” and “size.” That being said, the top spots were similar across all projects, with words like “harness,” “vulnerability,” “sanitizer,” and “security.”

Heatmap of terms’ TF-IDF scores across projects

If we filter more aggressively and depict the top terms for certain projects, we can make graphs that show details that reinforce these trends. In Jenkins, we can see the words “injection” and “execution,” which might relate to common Java vulnerability concepts like “command injection” or “arbitrary command execution.” For Linux, we see “index,” “size,” and “buffer,” which likely relate to buffer overflows or memory corruption.

This gives us an overall flavor of the questions being focused on code snippets and vulnerability concepts but doesn’t illustrate the flow of how a CRS works toward scoring. In ASC, a CRS only had two ways to score: by discovering a vulnerability and by patching it. But they first had to prove they discovered a vulnerability before they could get credit for patching it. This discovery requirement ultimately led to some patterns that appeared across competitors.

 

Path to a patch

The basic workflow of building up to the point where a CRS could demonstrate it successfully discovered a vulnerability and ultimately developed a working patch is illustrated below. While the workflow was similar across competitors, teams varied greatly in their approaches, including differences in the kinds of tasks they asked LLMs to solve as well as the degree to which the team utilized these resources. But one fact was clear: all of the finalists leveraged LLMs in developing successful scoring inputs.

General steps to score:
1. Attempt to locate a vulnerability
2. Confirm vulnerability
3. Craft vulnerability discovery scoring input
4. Craft candidate patch for vulnerability
5. Test patch
6. Submit patch scoring input
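
To show how these steps might fit together, here is a hypothetical orchestration sketch. Every function below is a stub standing in for a real component (fuzzing, LLM queries, the competition submission API), and none of it reflects any particular team’s design.

# Hypothetical orchestration sketch; every function is an illustrative stub,
# not a real CRS component or competition API.

def suspect_locations(project):
    # Step 1: candidate code regions, e.g., from fuzzing crashes or LLM triage (stubbed).
    return ["src/parser.c:handle_input"]

def confirm_vulnerability(project, location):
    # Step 2: try to produce an input that actually triggers a sanitizer (stubbed).
    return b"AAAA\x00BBBB"

def submit_discovery(location, crash_input):
    # Step 3: package the triggering input, introducing commit, and sanitizer (stubbed).
    return {"location": location, "input": crash_input, "status": "accepted"}

def craft_patch(project, location):
    # Step 4: ask an LLM for a candidate patch in unified diff format (stubbed).
    return "--- a/src/parser.c\n+++ b/src/parser.c\n"

def patch_passes_tests(project, patch, crash_input):
    # Step 5: apply the patch, rerun the crashing input and the functionality tests (stubbed).
    return True

def submit_patch(discovery, patch):
    # Step 6: submit the patch scoring input tied to the accepted discovery (stubbed).
    print("patch submitted for", discovery["location"])

def run_challenge(project):
    for location in suspect_locations(project):
        crash_input = confirm_vulnerability(project, location)
        if crash_input is None:
            continue
        discovery = submit_discovery(location, crash_input)
        patch = craft_patch(project, location)
        if patch_passes_tests(project, patch, crash_input):
            submit_patch(discovery, patch)

run_challenge("example-project")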

In ASC, a successful vulnerability discovery scoring input included several pieces of information, such as an input that triggers the vulnerability, the commit that introduced the vulnerability, and the sanitizer that catches it. Actually discovering a vulnerability ended up being the hardest part, due to several factors: the extreme size of the challenge projects, the difficulty of some of the vulnerabilities, and the difficulty of knowing which questions to ask in order to leverage LLMs to discover bugs efficiently. Several approaches proved popular, though the exact details will have to wait until after the finals to be released; we can say that fuzz testing was a very common approach.

In order for a competitor to score points for a patch, they had to generate a patch that fixed the vulnerability without breaking the functionality of the project. Since teams already had the data from the discovery scoring input to start with, this gave them a solid starting point for their LLM requests and the patching process in general. In fact, when a CRS discovered a vulnerability, it also generated a successful patch 50.9% of the time (28 of 55 cases). Pretty good for fully automated systems dealing with large code bases across two languages and vulnerabilities they’d never seen before.

This performance particularly showcases the potential of LLMs for patching, because almost every team that scored used LLMs to help generate patches. Of course, the patch rate was not 100%, but that is to be expected: the vulnerabilities were designed by experts specifically to test CRS capabilities across a variety of issues and a spectrum of difficulty. And when LLMs got things wrong, what did it look like? Typically, a large number of prompts repeating the same questions to fix issues raised in a previous iteration, which is kind of like a dog chasing its own tail, except not as cute and more expensive, since CRSs had to pay for every request.

 

Budgeting automated security

For each round of the Semifinals, a CRS had an LLM budget of $100, which primarily served as a ceiling to constrain competitors and force them to be judicious in their use of LLM requests. If you look up what $100 gets you in terms of API usage, you’ll see that all popular models charge per million tokens, like in the tables below.

Feature           | Claude 3.5 Haiku | Claude 3 Opus | Gemini 1.5 Flash     | Gemini Flash-8B | OpenAI o1    | OpenAI o3-mini
Input             | $0.80 / MTok     | $15 / MTok    | $0.075 / 1M          | $0.0375 / 1M    | $5.00 / 1M   | $1.00 / 1M
Output            | $4.00 / MTok     | $75 / MTok    | $0.15 / 1M           | $0.15 / 1M      | $6.00 / 1M   | $4.40 / 1M
Context           | 200K             | 200K          | 1M                   | 1M              | 200K         | 200K
Caching Write     | $1.00 / MTok     | $18.75 / MTok | $0.1875 / 1M         | $0.15 / 1M      | N/A          | N/A
Caching Read      | $0.08 / MTok     | $1.50 / MTok  | Free (≤1M)           | Free (≤1M)      | $7.50 / 1M   | $0.55 / 1M
Free Tier         | No               | No            | Yes                  | Yes             | No           | No
Grounding / Tools | Batches API      | Batches API   | Google Search (paid) | Limited         | Tools/Vision | Structured Outputs

LLM API model costs and tradeoffs in dollars per million tokens

Model Type | Emphasis         | Price (Input)     | Price (Output)
Standard   | General usage    | $2 / 1M tokens    | $5 / 1M tokens
Deep       | Deeper reasoning | $5 / 1M tokens    | $15 / 1M tokens
Fast       | Lower cost       | $0.75 / 1M tokens | $3 / 1M tokens

Price tiers typically reflect tradeoffs between cost, speed, and depth of reasoning.

A token is just a word or word fragment; the process of dividing text into tokens is called tokenization, and it is unique to each model. The key things to note are that different models have different costs and that you get charged for both the tokens going in (the request) and the tokens going out (the response).
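
Putting those two facts together, the cost of a single request is just a weighted sum of its input and output tokens. The sketch below shows that arithmetic; the rates are illustrative placeholders in the spirit of the tables above, not a billing API.

# A minimal sketch of estimating request cost from token counts, using
# illustrative per-million-token prices.
INPUT_PRICE_PER_MTOK = 5.00    # e.g., a "deep" model's input rate in dollars
OUTPUT_PRICE_PER_MTOK = 15.00  # and its output rate

def request_cost(prompt_tokens: int, response_tokens: int) -> float:
    """Both the tokens sent in and the tokens sent back are billed."""
    return (prompt_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK + \
           (response_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK

# The average ASC request was roughly 10,809 tokens; assume a 10,000/800 split
# between prompt and response for illustration.
cost = request_cost(10_000, 800)
print(f"~${cost:.3f} per request, ~{int(100 // cost)} such requests per $100 budget")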

Our sense of what counts as a reasonable number of tokens is heavily influenced by what requests look like in terms of size and composition. In ASC, many prompts contained large chunks of code, sometimes hundreds of lines, and that adds up quickly. Responses also varied: models could be asked to provide just a yes/no answer or to produce a patch, which could similarly contain numerous lines of code.

Let’s try to use some real examples from ASC to put things in perspective. The average number of tokens per request (prompt and response) was 10,809, while the median was 4,940, which shows that there were a small number of really big requests that skewed the average higher. In fact, the largest request in ASC weighed in at 540,851 tokens, with the prompt being 540,200 tokens and the response consisting of only 651 tokens. We can see the “long tail” of the small number of really large requests in the first graph, while the second zooms in on the fact that 95% of the LLM requests contained less than 50,000 tokens, with the majority having less than 5,000 tokens.

Most requests were small, but larger requests can quickly consume budget

With the budget and prices at the time of ASC, a CRS had somewhere in the ballpark of 10-200 million tokens per round. This might seem like a lot, but one team burned through their entire budget for a round in just over 10 minutes, and the next fastest was 51 minutes! If we look at the distribution of when the requests were sent over time, we can see a clear trend of teams sending more requests earlier in the round.

LLM requests over time show the most activity near the start of rounds

In the context of the game, this makes sense. Teams start a round with no information about the challenge project, and they want to find vulnerabilities as quickly as possible. We observed teams taking a variety of approaches to prioritizing and pacing LLM requests: some teams nearly always sent requests one at a time, while others tended to send large batches of parallel requests. Given that there were 17 cases of CRSs running out of LLM budget across the five rounds, the budget limit was a real factor, and it pays to be careful with one’s finances, even if you are a literal money-spending, bug-finding machine.
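
As a final sketch, here is one way a CRS might combine batched, parallel requests with a simple budget cutoff. The ask_llm() stub, the per-request cost, and the safety margin are all assumptions for illustration, not anything observed in a competitor’s system.

# A hypothetical sketch of batched, budget-aware request pacing;
# ask_llm() is a stand-in for a real API call and the cost figure is illustrative.
from concurrent.futures import ThreadPoolExecutor

ROUND_BUDGET_USD = 100.0
spent_usd = 0.0

def ask_llm(question):
    # Placeholder for an actual LLM API request; returns (answer, cost_in_dollars).
    return f"answer to: {question}", 0.05

def run_batch(questions):
    global spent_usd
    with ThreadPoolExecutor(max_workers=8) as pool:
        for answer, cost in pool.map(ask_llm, questions):
            spent_usd += cost
    return spent_usd

pending = [f"Is function_{i} reachable from the harness?" for i in range(20)]
while pending and spent_usd < 0.9 * ROUND_BUDGET_USD:   # keep a safety margin
    batch, pending = pending[:8], pending[8:]
    run_batch(batch)
print(f"Spent ${spent_usd:.2f} of ${ROUND_BUDGET_USD:.2f}")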

 

Looking ahead to finals

Stepping back, it’s clear that ASC was an extraordinary exercise in putting theory into practice, and competitors have already demonstrated some of the promise that AI holds for improving cybersecurity. We’ve explored some of the data trends, and while strategies varied across teams, the unifying theme was that many teams were able to build automated systems that successfully incorporated LLMs into their workflows. These nascent systems showed that they could find and fix a variety of vulnerabilities on large open source code bases, in a meaningful timeframe, without any human interaction.

And this was only the Semifinals! We know the seven finalists are refining their techniques and will develop new capabilities arising from the novel engineering and research funded by their winnings from the Semifinals. The Final Competition promises more flexibility, more challenges, and ultimately, the open-sourcing of every CRS that competes. Stay tuned for more, and we invite you to join us in August for the AIxCC Finals event being held at DEF CON 33.