An Empirical Evaluation of LLMs for Solving Offensive Security Challenges
Minghao Shao*, New York University
Boyuan Chen*, New York University
Sofija Jancheska*, New York University
Brendan Dolan-Gavitt*, New York University
Siddharth Garg, New York University
Ramesh Karri, New York University
Muhammad Shafique, New York University Abu Dhabi
* These authors contributed equally to this work.
Abstract
Capture The Flag (CTF) challenges are puzzles related to computer security scenarios. With the advent of large language models (LLMs), more and more CTF participants are using LLMs to understand and solve challenges. However, no prior work has evaluated the effectiveness of LLMs in solving CTF challenges with a fully automated workflow. We develop two CTF-solving workflows, human-in-the-loop (HITL) and fully automated, to examine the ability of LLMs to solve a selected set of CTF challenges when prompted with information about each question. We collect human contestants' results on the same set of questions and find that LLMs achieve a higher success rate than an average human participant. This work provides a comprehensive evaluation of the capability of LLMs in solving real-world CTF challenges, from real competition to a fully automated workflow. Our results provide a reference for applying LLMs in cybersecurity education and pave the way for systematic evaluation of offensive cybersecurity capabilities in LLMs.
1 Introduction
Large Language Models (LLMs) have enabled significant strides in the capabilities of artificial intelligence tools. Models like OpenAI's GPT (Generative Pre-trained Transformer) series [15, 41, 44, 45] have shown strong performance across natural language and programming tasks [17], and are proficient in generating human-like responses in conversations, language translation, text summarization, and code generation. They have shown some proficiency in solving complex cybersecurity tasks, for instance, answering professional cybersecurity certification questions and, pertinent to this work, solving CTF challenges [49].
CTF challenges are puzzles related to computer security scenarios spanning a wide range of topics, including cryptography, reverse engineering, web exploitation, forensics, and miscellaneous topics. Participants in CTF competitions aim to capture and print hidden 'flags,' which are short strings of characters or specific files, proving successful completion of a challenge. Solving CTF challenges requires an understanding of cybersecurity concepts and creative problem-solving skills. Consequently, CTF has garnered attention as a prominent approach in cybersecurity education [16].
This work explores and evaluates the ability of LLMs to solve CTF challenges. As part of our study, we organized the LLM Attack challenge [24] as a part of the Cybersecurity Awareness Week (CSAW) [23] at New York University (NYU), in which participants competed in designing "prompts" that enable LLMs to solve a collection of CTF challenges. We analyze the results of the human participants in this challenge. Furthermore, we explore two workflows for LLM-guided CTF solving:
1. Human-in-the-loop (HITL) workflow: In this workflow, the contestant interacts with the LLM by manually copying the challenge description and its related code to form the input prompt for the LLM. Once the LLM responds with a code script, the user saves the generated script to a file and runs it to observe the results. If the code returns errors or does not produce the flag in the desired format, the user provides the error messages to the LLM and requests another round of output. If the LLM produces incorrect output three times in a row, we consider the LLM unable to solve the problem.
2. Fully-automated workflow: In this workflow, the LLM automatically solves a CTF challenge without any human involvement. Similar to the HITL case, the LLM is prompted with executable files, source code, and challenge descriptions. We initialize