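As software grows in scale and complexity, software vulnerabilities pose a major threat to system security and cause heavy losses for society and businesses. Automated vulnerability detection aims to efficiently identify potential threats in software code, and a range of techniques have made progress in this area. Program-analysis-based methods, such as static analysis and dynamic analysis, identify vulnerabilities by tracking the data flow and control flow of the code. Deep-learning-based methods capture vulnerability patterns with deep neural networks. In addition, for a code snippet detected as vulnerable, an effective detection technique should also determine the vulnerability type, that is, map it to a specific category in CWE (Common Weakness Enumeration). This task requires participating teams to design algorithms and build models that detect vulnerabilities in the given code snippets and determine their vulnerability types.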
The dataset has been split into a training set, a validation set, and a test set (train.jsonl, valid.jsonl, and test.jsonl). The download link is:
https://drive.google.com/file/d/18pkURdURNzQItFy2DdA0b7lNhfGCnEdZ/view?usp=sharing
The total number of entries and the number of vulnerable entries in each split are:
Split | Total entries | Vulnerable entries |
---|---|---|
Training set | 147,863 | 4,528 |
Validation set | 18,483 | 562 |
Test set | 18,483 | not provided |
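The counts above imply a strongly imbalanced label distribution: roughly 3% of the training and validation entries are vulnerable. The following minimal sketch derives that rate and the negative-to-positive ratio from the table. The names `SPLITS`, `positive_rate`, and `negative_to_positive_ratio` are illustrative and not part of the task materials.

```python
# Counts copied from the table above. The test set is omitted because its
# number of vulnerable entries is not provided.
SPLITS = {
    "train": {"total": 147_863, "vulnerable": 4_528},
    "valid": {"total": 18_483, "vulnerable": 562},
}


def positive_rate(total: int, vulnerable: int) -> float:
    """Fraction of entries labeled as vulnerable."""
    return vulnerable / total


def negative_to_positive_ratio(total: int, vulnerable: int) -> float:
    """Number of non-vulnerable entries per vulnerable entry."""
    return (total - vulnerable) / vulnerable


for name, counts in SPLITS.items():
    rate = positive_rate(**counts)
    ratio = negative_to_positive_ratio(**counts)
    print(f"{name}: {rate:.2%} vulnerable, about {ratio:.1f} negatives per positive")
```

A ratio of this kind is sometimes used as a starting point for a positive-class weight in a loss function or for resampling. Whether that helps on this dataset is something to check on the validation set.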
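Each entry has the following attributes:
Attribute | Meaning | Notes |
---|---|---|
function_id | Unique ID of the code snippet | None |
function | Code snippet | The granularity is a single function, including the function name and the function body |
target | Vulnerability label (1 = vulnerable, 0 = not vulnerable) | None |
cve_id | Vulnerability ID | ID of the vulnerability fixed by the security patch, e.g. "CVE-2022-1052" |
cve_description | Vulnerability description | Description of the vulnerability with this ID |
cvss | Vulnerability severity score | 0 to 10; a higher score means a more severe vulnerability |
cwe_id | Vulnerability type ID | Type of the vulnerability with this ID (there may be several types), such as out-of-bounds read/write, buffer overflow, or divide-by-zero, e.g. "CWE-787" |
cwe_description | Description of this vulnerability type (cwe_id) | None |
cwe_consequence | Possible impact of this vulnerability type (cwe_id) | None |
cwe_solution | Possible solutions for this vulnerability type (cwe_id) | None |
commit_message | Commit message of the security patch | None |
commit_date | Commit date of the security patch | None |
project | Name of the project the patch belongs to | e.g. linux, gpac, vim |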
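Since the splits are distributed as .jsonl files and `cwe_id` may hold several types, a small reader that normalizes that attribute can simplify later code. The sketch below assumes each line of the file is one JSON object with the attributes listed above, and that `cwe_id` may appear as a list, as a single string, or be absent. These are assumptions, not guarantees from the task description. The helper names `normalize_cwe_ids` and `read_jsonl` are hypothetical.

```python
import json
from typing import Any, Iterator


def normalize_cwe_ids(value: Any) -> list[str]:
    """Return cwe_id as a list of strings.

    Assumption: the raw value is a list, a single string, or missing.
    """
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            record["cwe_id"] = normalize_cwe_ids(record.get("cwe_id"))
            yield record
```

For example, `for record in read_jsonl("train.jsonl"): ...` iterates over the training entries without loading the whole file into memory.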
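The dataset was crawled from security patches. Each security patch has a corresponding vulnerability ID (the vulnerability that the patch fixes), and each vulnerability ID has one or more corresponding vulnerability type IDs. Function-level code snippets were obtained by splitting the security patches.
Entries in the training set may contain multiple vulnerability type IDs, whereas entries in the validation set and the test set contain only one.
A sample entry: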
{
"function_id": "0052500c1ed5bf8263b26b9fd7773dbdc6f170c4_31",
"function": "struct MACH0_(obj_t) *MACH0_(new_buf)(RBuffer *buf, struct MACH0_(opts_t) *options) {\n\tr_return_val_if_fail (buf, NULL);\n\tstruct MACH0_(obj_t) *bin = R_NEW0 (struct MACH0_(obj_t));\n\tif (bin) {\n\t\tbin->b = r_buf_ref (buf);\n\t\tbin->main_addr = UT64_MAX;\n\t\tbin->kv = sdb_new (NULL, \"bin.mach0\", 0);\n\t\tbin->size = r_buf_size (bin->b);\n\t\tif (options) {\n\t\t\tbin->verbose = options->verbose;\n\t\t\tbin->header_at = options->header_at;\n\t\t\tbin->symbols_off = options->symbols_off;\n\t\t}\n\t\tif (!init (bin)) {\n\t\t\treturn MACH0_(mach0_free)(bin);\n\t\t}\n\t}\n\treturn bin;\n}",
"target": 0,
"cve_id": "CVE-2022-1052",
"cve_description": "Heap Buffer Overflow in iterate_chained_fixups in GitHub repository radareorg/radare2 prior to 5.6.6.",
"cvss": "7.8",
"cwe_id": [
"CWE-787",
"CWE-125"
],
"cwe_description": "The product reads data past the end, or before the beginning, of the intended buffer.",
"cwe_consequence": "::SCOPE:Confidentiality:IMPACT:Read Memory::SCOPE:Confidentiality:IMPACT:Bypass Protection Mechanism:NOTE:By reading out-of-bounds memory, an attacker might be able to get secret values, such as memory addresses, which can be bypass protection mechanisms such as ASLR in order to improve the reliability and likelihood of exploiting a separate weakness to achieve code execution instead of just denial of service.::",
"cwe_solution": "::PHASE:Implementation:STRATEGY:Input Validation:DESCRIPTION:Assume all input is malicious. Use an accept known good input validation strategy, i.e., use a list of acceptable inputs that strictly conform to specifications. Reject any input that does not strictly conform to specifications, or transform it into something that does. When performing input validation, consider all potentially relevant properties, including length, type of input, the full range of acceptable values, missing or extra inputs, syntax, consistency across related fields, and conformance to business rules. As an example of business rule logic, boat may be syntactically valid because it only contains alphanumeric characters, but it is not valid if the input is only expected to contain colors such as red or blue. Do not rely exclusively on looking for malicious or malformed inputs. This is likely to miss at least one undesirable input, especially if the code's environment changes. This can give attackers enough room to bypass the intended validation. However, denylists can be useful for detecting potential attacks or determining which inputs are so malformed that they should be rejected outright. To reduce the likelihood of introducing an out-of-bounds read, ensure that you validate and ensure correct calculations for any length argument, buffer size calculation, or offset. Be especially careful of relying on a sentinel (i.e. special character such as NUL) in untrusted inputs.::PHASE:Architecture and Design:STRATEGY:Language Selection:DESCRIPTION:Use a language that provides appropriate memory abstractions.::",
"commit_message": "Fix heap OOB read in macho.iterate_chained_fixups ##crash\n\n* Reported by peacock-doris via huntr.dev\r\n* Reproducer 'tests_65305'\r\n\r\nmrmacete:\r\n* Return early if segs_count is 0\r\n* Initialize segs_count also for reconstructed fixups\r\n\r\nCo-authored-by: pancake <pancake@nopcode.org>\r\nCo-authored-by: Francesco Tamagni <mrmacete@protonmail.ch>",
"commit_date": "2022-03-22T15:56:27Z",
"project": "radareorg/radare2"
}
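In the sample entry, `cwe_consequence` and `cwe_solution` are flat strings in which items are separated by `::` and each item is a run of `KEY:value` pairs. The sketch below splits such a string into structured items. It assumes that only the keys visible in the sample (SCOPE, IMPACT, NOTE, PHASE, STRATEGY, DESCRIPTION) occur and that free text never contains `::` or one of these keys followed by a colon. Other entries may break either assumption. The function name `parse_cwe_field` is hypothetical.

```python
import re

# Keys observed in the sample entry; the full dataset may contain others.
_FIELD = re.compile(r"(?:^|:)(SCOPE|IMPACT|NOTE|PHASE|STRATEGY|DESCRIPTION):")


def parse_cwe_field(text: str) -> list[dict[str, list[str]]]:
    """Split a '::'-delimited CWE string into one dict per item.

    Values are lists because a key may repeat within a single item.
    """
    items = []
    for chunk in text.split("::"):
        if not chunk.strip():
            continue
        # With a capturing group, re.split returns
        # [prefix, key1, value1, key2, value2, ...].
        parts = _FIELD.split(chunk)
        item: dict[str, list[str]] = {}
        for key, value in zip(parts[1::2], parts[2::2]):
            item.setdefault(key, []).append(value.strip())
        if item:
            items.append(item)
    return items


# Abbreviated from the sample entry's cwe_consequence (the NOTE is shortened).
sample = (
    "::SCOPE:Confidentiality:IMPACT:Read Memory"
    "::SCOPE:Confidentiality:IMPACT:Bypass Protection Mechanism"
    ":NOTE:By reading out-of-bounds memory, an attacker might be able to get secret values.::"
)
for item in parse_cwe_field(sample):
    print(item)
```

Structured items of this kind could be used, for instance, to build auxiliary text features per CWE type. That use is a suggestion and not part of the task definition.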