
The Unified Multimodal NIDS Dataset Tool performs the standardization of network intrusion detection datasets by extracting comprehensive flow, payload, and contextual features from raw PCAP files, ensuring consistency across datasets and enhancing machine learning-driven threat detection and analysis.


SyedWaliAbbas/UM-NIDS-Tool


  Unified, Multimodal NIDS Dataset Tool

The Unified, Multimodal NIDS Dataset tool is designed to address the issue of inconsistent feature sets across publicly available network intrusion detection system (NIDS) datasets. Many existing datasets vary significantly in their features, have inconsistent labeling, and often exclude crucial payload and contextual information, limiting the effectiveness of machine learning-based models.

Our tool resolves these challenges by providing a standardized method for converting any raw PCAP dataset into a uniform format with consistent features, ensuring compatibility and enhancing the potential for cross-dataset analysis. It extracts key flow-level features such as source and destination IPs, ports, protocols, packet counts, and flow duration, while also integrating detailed payload data, which is often excluded in traditional datasets. This is particularly important for identifying attacks that rely on specific payload characteristics, such as SQL injection or malware. Additionally, the tool generates contextual features based on a sliding time window, capturing historical and temporal patterns in network traffic.

These contextual features are crucial for detecting advanced persistent threats (APTs), where the timing and sequence of packets play a significant role. A comprehensive description of all features produced by the tool, including flow, payload, and contextual features, is available in the `features_meta_data.pdf` document, which details the structure and content of the unified dataset.
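As a rough illustration of what a sliding-window contextual feature can look like, the sketch below counts, for each flow, how many earlier flows from the same source IP started within the preceding window. The function name `window_flow_counts`, the `(start_time, src_ip)` tuple layout, and the choice of feature are assumptions made for this example; the features the tool actually computes are those listed in `features_meta_data.pdf`.

```python
from collections import deque


def window_flow_counts(flows, window_seconds):
    """For each (start_time, src_ip) flow, count the earlier flows from the
    same source IP that started within the preceding `window_seconds`."""
    recent = {}  # src_ip -> deque of start times still inside the window
    counts = []
    for start, src_ip in sorted(flows):
        queue = recent.setdefault(src_ip, deque())
        # Drop start times that have fallen out of the window.
        while queue and start - queue[0] > window_seconds:
            queue.popleft()
        counts.append((start, src_ip, len(queue)))
        queue.append(start)
    return counts


# Hypothetical flows: (start time in seconds, source IP).
flows = [
    (0.0, "10.0.0.1"),
    (1.0, "10.0.0.1"),
    (2.5, "10.0.0.2"),
    (4.0, "10.0.0.1"),
    (12.0, "10.0.0.1"),
]
result = window_flow_counts(flows, window_seconds=5.0)
for row in result:
    print(row)
```

A larger window smooths over short bursts but keeps more history per source, which is the trade-off the rolling window size parameter controls.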

Dataset Coverage

The Unified Multimodal NIDS Dataset includes processed data from four publicly available, well-established network intrusion detection datasets: CIC-IDS 2017, CIC-IoT 2023, UNSW-NB15, and a DDoS-specific dataset. These datasets have been standardized to ensure consistency in feature sets, including flow, payload, and contextual data. Additionally, the tool supports the extension of the dataset by allowing users to process and add new datasets, converting raw PCAP files into the same unified format. This flexibility ensures that the dataset can be expanded further to accommodate evolving research needs.

You can access the dataset here.

Tool Usage Guide

Step 1: Processing PCAP Files

The process of generating the Unified Multimodal (UM) dataset begins with processing raw PCAP files. The first step involves using the tool to convert PCAP files into CSV format containing payload content, statistical flow features, and contextual window-based features. This can be achieved with the following command:

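from pcap_process.flow_payload import *
# All arguments are passed positionally, in the order shown; folder_name is the dataset folder.
# (Python does not allow a positional argument after a keyword argument.)
pcap_process(folder_name, window_size, vulnerable_ports_list, http_ports_list, idle_timeout, active_timeout, flowlimit)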

Key parameters such as the rolling window size, the list of ports to monitor, and flow termination criteria (based on active timeout, idle timeout, or packet limit) can be customized to suit your specific needs. This allows for flexible dataset generation based on the features most relevant to your analysis.

The tool extracts flow-level features, payload content, and contextual features based on a sliding time window, providing a detailed dataset ready for further processing and labeling in the next steps.
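To make the three flow termination criteria concrete, here is a minimal sketch of how a stream of packet timestamps from one connection might be split into flows. It is not the tool's implementation: the function `split_into_flows`, its parameter names, and the exact comparison rules (strict `>` for timeouts, `>=` for the packet limit) are assumptions for illustration.

```python
def split_into_flows(timestamps, idle_timeout, active_timeout, packet_limit):
    """Split ascending packet timestamps (seconds) into flows.

    A new flow starts when the gap since the previous packet exceeds
    `idle_timeout`, when the current flow has lasted longer than
    `active_timeout`, or when it already holds `packet_limit` packets.
    """
    flows = []
    current = []
    for ts in timestamps:
        if current:
            idle = ts - current[-1] > idle_timeout
            too_long = ts - current[0] > active_timeout
            full = len(current) >= packet_limit
            if idle or too_long or full:
                flows.append(current)
                current = []
        current.append(ts)
    if current:
        flows.append(current)
    return flows


# Hypothetical packet arrival times for a single connection.
packets = [0, 1, 2, 3, 20, 23, 26, 27]
flows = split_into_flows(packets, idle_timeout=10, active_timeout=5, packet_limit=3)
print(flows)
```

In this example the first split is caused by the packet limit, the second by the idle timeout, and the third by the active timeout.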

Step 2: Preparing Pre-labeled CSVs

In the second step, the tool requires pre-labeled CSVs. These CSVs must include:

  • Timestamp Column: This can be in Unix format or a timestamp in any timezone, but the timezone must be known to the user.
  • Flow Duration Column: A column indicating the duration of each flow.
  • Source/Destination IP and Ports: Columns containing source and destination IPs and ports for labeling purposes.

To label the processed CSVs from Step 1, use the following commands:

from label.parallel_label import *
# Extract metadata from pre-labeled CSVs
meta_data = extract_time_ranges_from_csvs(folders, timestamp_column='timestamp', timezone='None', batch_size=5)
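The timezone requirement exists because a wall-clock timestamp only maps to a single point in time once its timezone is known. The sketch below shows that conversion using only the standard library. The helper `to_unix_ms` and the timestamp format string are hypothetical and are not part of the tool; a fixed UTC-3 offset stands in for Atlantic daylight time.

```python
from datetime import datetime, timedelta, timezone


def to_unix_ms(value, tz=None, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timestamp to Unix milliseconds.

    Numbers are treated as Unix seconds. Strings are parsed with `fmt`
    and interpreted in the timezone `tz`, which must then be provided.
    """
    if isinstance(value, (int, float)):
        return int(value * 1000)
    if tz is None:
        raise ValueError("a timezone is required for wall-clock timestamps")
    local = datetime.strptime(value, fmt).replace(tzinfo=tz)
    return int(local.timestamp() * 1000)


# Assumption: a fixed UTC-3 offset (Atlantic daylight time) for this example.
atlantic_summer = timezone(timedelta(hours=-3))
from_string = to_unix_ms("2017-07-03 09:00:00", tz=atlantic_summer)
from_number = to_unix_ms(1499083200)
print(from_string, from_number)
```

A fixed offset ignores daylight saving changes; on Python 3.9 or later, passing `zoneinfo.ZoneInfo("Canada/Atlantic")` as `tz` would handle them, provided time zone data is installed on the system.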

Step 3: Labeling Processed CSVs

Finally, use the metadata to label the processed CSVs:

label_csvs(input_folder, meta_data, output_folder="labeled_csv", timezone='Canada/Atlantic', num_workers=2, unit='ms', timestamp_col='timestamp', flowduration_col='flowduration', label_col='label')
  • input_folder: Directory containing the processed CSV files.
  • meta_data: Extracted metadata for matching and labeling.
  • output_folder: The folder where labeled CSVs will be saved (will be created in the input folder).
  • timezone: Specify the timezone.
  • num_workers: Number of parallel workers for processing.

This workflow ensures smooth processing and labeling of PCAP files into a unified dataset format, ready for machine learning analysis.
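Conceptually, labeling matches each processed flow against the labeled records that share its endpoints and overlap it in time. The following is a simplified sketch of that idea only. It is not the logic of `label_csvs`: the function `label_flow`, the dictionary layout, the field names, and the `"unlabeled"` default are all assumptions.

```python
def label_flow(flow, labeled_ranges, default="unlabeled"):
    """Return the label of the first labeled range that has the same
    endpoints as `flow` and overlaps it in time, else `default`."""
    key = (flow["src_ip"], flow["src_port"], flow["dst_ip"], flow["dst_port"])
    flow_end = flow["start_ms"] + flow["duration_ms"]
    for rng in labeled_ranges.get(key, []):
        # Two intervals overlap when each one starts before the other ends.
        if flow["start_ms"] <= rng["end_ms"] and flow_end >= rng["start_ms"]:
            return rng["label"]
    return default


# Hypothetical metadata: endpoints -> labeled time ranges (Unix milliseconds).
labeled_ranges = {
    ("10.0.0.1", 4444, "10.0.0.9", 80): [
        {"start_ms": 1000, "end_ms": 2000, "label": "DoS"},
    ],
}
matching = {"src_ip": "10.0.0.1", "src_port": 4444, "dst_ip": "10.0.0.9",
            "dst_port": 80, "start_ms": 1500, "duration_ms": 100}
too_late = dict(matching, start_ms=5000)
print(label_flow(matching, labeled_ranges), label_flow(too_late, labeled_ranges))
```

Both sides must use the same time unit and reference timezone for the overlap test to be meaningful, which is why the `unit` and `timezone` arguments matter.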

Example Usage

This repository contains example Jupyter notebooks (.ipynb files) that demonstrate the processing of all four datasets included in the UM-NIDS dataset. It also includes a performance evaluation of a Random Forest-based machine learning classifier.

You will also find examples of payload-based NIDS processing in the file payload_based_Cross_validation.ipynb, where we cross-validate payload-specific attacks.

Moreover, we have trained and tested various classifiers on the undersampled version of the UM-NIDS dataset in undersampled.ipynb, showcasing the tool’s flexibility and ease of use in different machine learning scenarios.
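For readers unfamiliar with undersampling, the sketch below shows one common variant, random undersampling, in which every class is reduced to the size of the smallest class. This README does not state which procedure `undersampled.ipynb` uses, so the function `undersample`, the `label` key, and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def undersample(rows, label_key="label", seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced data: 10 benign rows and 3 attack rows.
rows = [{"id": i, "label": "benign"} for i in range(10)]
rows += [{"id": 100 + i, "label": "attack"} for i in range(3)]
balanced = undersample(rows)
print(len(balanced))
```

Undersampling discards majority-class rows, so it trades data volume for class balance.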

Citation

If you use our tool, please cite our paper, which outlines the details of the graph modeling and processing.

@ARTICLE{10720901,
 author={Wali, Syed and Farrukh, Yasir Ali and Khan, Irfan and Bastian, Nathaniel D.},
 journal={IEEE Data Descriptions}, 
 title={Meta: Towards a Unified, Multimodal Dataset for Network Intrusion Detection Systems}, 
 year={2024},
 volume={},
 number={},
 pages={1-8},
 keywords={Payloads;Feature extraction;Metadata;Labeling;Pipelines;Machine learning;Data mining;Network intrusion detection;Telecommunication traffic;Analytical models;Network Intrusion Detection Systems;Multimodal Dataset;Machine Learning;Security;Payload},
 doi={10.1109/IEEEDATA.2024.3482286}
}
