key08 Security has surpassed 3,000 followers, meaning that a significant portion of cybersecurity professionals in China are keeping an eye on it. So, it's time for a big project.
While working in the domestic cybersecurity field, I realized that the overall technical level still has a lot of room to improve. Many people working in cybersecurity might also be interested in how the security software on their computers actually works. Some might even dream of developing their own antivirus software, or see it as a long-term goal.
So, I felt there was a need to systematically document the working principles of an antivirus engine. While working on this, I noticed that the information available online is close to zero. The few available sources only describe outdated technologies like signature-based scanning and cloud antivirus from before 2006. Antivirus software seems to be treated like a black box.
To systematically educate, rather than spread misinformation or meme-based security practices like some other public security accounts, I spent two days developing an antivirus engine that aligns with modern security practices (as of 2025).
Now, I will explain how it works, what its weaknesses are, and at the end of the chapter, I will even open-source the code, which can be compiled directly using Visual Studio, making learning more convenient.
⚠️ WARNING: This code is provided for learning purposes only. The datasets for machine learning, signature analysis, and dynamic behavior detection are extremely small, so detection effectiveness is very limited. Do not use this code for your "bypass AV" tests and then complain that it fails to detect certain samples. This is not intended for antivirus evasion testing. If you want to improve it, study the issues yourself instead of copying and pasting the code and then asking why it doesn't work!
Currently, all major security vendors promote their so-called NGAV (Next-Gen Antivirus), but in reality, most detection engines fall into these four categories:
1. Cloud-Based Detection. This includes:
   - Fuzzy hashing engines (such as `ssdeep`, `simhash`, etc.), which are used to compare the similarity of files (some vendors call this "virus DNA").
   - Traditional hash-based engines, which rely on SHA1, SHA256, etc.
   - Various cloud-based sandbox, manual or automated analysis systems.
2. Signature-Based Detection
3. AI & Machine Learning-Based Detection
4. Heuristic-Based Sandbox Detection
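To make the fuzzy-hashing idea concrete, here is a minimal SimHash-style sketch. It is not how any vendor's cloud engine works, and it is not the `ssdeep` algorithm (which uses context-triggered piecewise hashing). The function names, the 4-byte n-gram size, and the 64-bit fingerprint width are all assumptions chosen for illustration.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed fingerprint width


def simhash(data: bytes, ngram: int = 4) -> int:
    """Return a SimHash-style fingerprint over overlapping byte n-grams."""
    votes = [0] * FINGERPRINT_BITS
    for i in range(max(len(data) - ngram + 1, 1)):
        # Hash every n-gram to 64 bits (first 8 bytes of SHA-256).
        digest = hashlib.sha256(data[i:i + ngram]).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each n-gram votes +1 or -1 on every fingerprint bit.
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set when the majority of n-grams voted for it.
    return sum(1 << bit for bit in range(FINGERPRINT_BITS) if votes[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: bytes, b: bytes) -> float:
    """1.0 for identical fingerprints, around 0.5 for unrelated data."""
    return 1.0 - hamming_distance(simhash(a), simhash(b)) / FINGERPRINT_BITS
```

Unlike SHA1 or SHA256, where changing one byte produces a completely different hash, files that share most of their content end up with fingerprints that differ in only a few bits. That is what allows a lookup for "similar" samples rather than only identical ones.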
Cloud-based engines are extremely complex and are typically a core capability of each security company, so we won't discuss their implementation here (except for those who simply use VirusTotal (VT) as their cloud engine).
That leaves categories 2, 3, and 4, which are typically combined in AV solutions.
Each has its own strengths and weaknesses:
- Signature-Based Detection: Does not have heuristic capabilities and fully relies on manual rule creation, but it is the most effective. Each security vendor's detection capabilities heavily rely on their signature database.
- Heuristic-Based Sandbox Detection: Has weak detection capabilities, is easily bypassed, and lags behind evolving threats. It also tends to generate false positives.
- AI/Machine Learning-Based Detection: Provides high detection rates but also produces high false positive rates, often negatively impacting business operations (e.g., compiling a simple Hello World! application in Visual Studio might trigger an alert). Many AI-based engines are overly aggressive and flag almost anything without a digital signature.
Today, we will create a combined Machine Learning + Behavior-Based Sandbox Engine.
We are not implementing a signature-based engine because that would be too simple (if you're interested in signature matching, check out YARA).
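For readers who have never seen one, a signature matcher can be sketched in a few lines. This is only an illustration of the idea, not YARA's implementation or rule syntax. The rule names and byte patterns below are made up, and `??` is used as a single-byte wildcard.

```python
from typing import Dict, List, Optional


def parse_signature(pattern: str) -> List[Optional[int]]:
    """Parse 'DE AD ?? EF' into byte values, using None for wildcards."""
    return [None if tok == "??" else int(tok, 16) for tok in pattern.split()]


def find_signature(data: bytes, sig: List[Optional[int]]) -> int:
    """Return the first offset where the signature matches, or -1."""
    for off in range(len(data) - len(sig) + 1):
        if all(b is None or data[off + i] == b for i, b in enumerate(sig)):
            return off
    return -1


# Hypothetical rule set, for illustration only.
SIGNATURES: Dict[str, str] = {
    "Demo.CallPop": "E8 00 00 00 00 5D",  # call $+5 ; pop ebp (get-EIP idiom)
    "Demo.Marker": "DE AD ?? EF",         # fixed bytes with one wildcard
}


def scan(data: bytes) -> List[str]:
    """Return the names of all signatures found in the buffer."""
    return [name for name, pattern in SIGNATURES.items()
            if find_signature(data, parse_signature(pattern)) >= 0]
```

The hard part of a real signature engine is not this loop but the rule database behind it, which is why it relies so heavily on manual rule creation.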
The overall engine structure is as follows:
We need to implement two core modules:
- Sandbox Behavior Analysis Module
- Machine Learning-Based Detection Module
We will introduce each module step by step.
A sandbox module is typically used for unpacking and behavior analysis. Essentially, it is a PE file emulator.
In our system, we use Unicorn Engine to simulate CPU execution. Unicorn Engine is a lightweight, cross-platform CPU emulation framework that supports multiple architectures, including MIPS, ARM, PowerPC, x86, and x64. It is based on QEMU and was first introduced at Black Hat USA 2015 by Nguyen Anh Quynh and Dang Hoang Vu.
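A CPU emulator only executes instructions. It knows nothing about the PE format, so the sandbox has to lay the file out in memory the way the Windows loader would. The sketch below illustrates two of those loader steps, expanding sections to their virtual addresses and applying base relocations, on a plain byte buffer. It is not the code from this project: `Section`, `map_image`, and `apply_relocations` are hypothetical names, and the section table and header sizes are assumed to be parsed already.

```python
import struct
from dataclasses import dataclass
from typing import List

IMAGE_REL_BASED_ABSOLUTE = 0   # padding entry, nothing to patch
IMAGE_REL_BASED_HIGHLOW = 3    # patch a 32-bit absolute address
IMAGE_REL_BASED_DIR64 = 10     # patch a 64-bit absolute address


@dataclass
class Section:                 # hypothetical, pre-parsed section header fields
    virtual_address: int       # RVA of the section once loaded
    raw_offset: int            # file offset of the section data
    raw_size: int              # number of bytes stored on disk


def map_image(file_data: bytes, sections: List[Section],
              size_of_headers: int, size_of_image: int) -> bytearray:
    """Expand the on-disk file into its larger in-memory layout."""
    image = bytearray(size_of_image)               # zero-filled pages
    headers = file_data[:min(size_of_headers, size_of_image)]
    image[:len(headers)] = headers
    for s in sections:
        raw = file_data[s.raw_offset:s.raw_offset + s.raw_size]
        raw = raw[:max(size_of_image - s.virtual_address, 0)]  # stay in bounds
        image[s.virtual_address:s.virtual_address + len(raw)] = raw
    return image


def apply_relocations(image: bytearray, reloc_rva: int, reloc_size: int,
                      delta: int) -> None:
    """Add `delta` (new base minus preferred base) to every listed address."""
    pos, end = reloc_rva, reloc_rva + reloc_size
    while pos + 8 <= end:
        page_rva, block_size = struct.unpack_from("<II", image, pos)
        if block_size < 8:
            break                                  # malformed block, stop
        for entry_pos in range(pos + 8, min(pos + block_size, end) - 1, 2):
            (entry,) = struct.unpack_from("<H", image, entry_pos)
            kind, offset = entry >> 12, entry & 0x0FFF
            target = page_rva + offset
            if kind == IMAGE_REL_BASED_HIGHLOW:
                (value,) = struct.unpack_from("<I", image, target)
                struct.pack_into("<I", image, target,
                                 (value + delta) & 0xFFFFFFFF)
            elif kind == IMAGE_REL_BASED_DIR64:
                (value,) = struct.unpack_from("<Q", image, target)
                struct.pack_into("<Q", image, target,
                                 (value + delta) & 0xFFFFFFFFFFFFFFFF)
        pos += block_size
```

In a real emulator, the resulting buffer would then be copied into the emulator's mapped memory at the chosen base address before execution starts.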
- Initialize the Emulation Environment
  - Relocate PE file sections
  - Set up stack memory
  - Initialize Unicorn Engine and allocate virtual memory
  - Map the PE file into the virtual environment
  - Load required DLLs into the virtual machine
  - Hook critical DLL functions to monitor behavior
  - Set up essential handles, stack, PEB, TEB, etc.
  - Store important PE metadata for unpacking
- Relocation Processing
  - If the PE file contains a relocation table, Windows patches the absolute addresses it lists when the image is loaded at a different base address, so the emulator has to apply the same fixups before execution.
- Memory and Stack Allocation
  - The stack memory must be fully emulated for the execution environment.
- Mapping PE Sections into Memory
  - A PE file's size on disk differs from its actual size when loaded in memory.
  - We must expand it and map each section accordingly.
- Load Required DLLs
  - Parse the Import Table and map necessary DLLs into our virtual machine.
- Intercept API Calls
  - Hook imported API functions.
- Shellcode & Packed Malware Detection
  - Monitor for self-modifying code execution, which often indicates packed malware.
- Behavior-Based Detection
  - Detect suspicious behavior, such as:
    - Downloading executable files via `WinHttp`
    - Excessive `sleep` delays
    - Accessing sensitive directories
    - Direct access to `LDR` structures (used to detect stealth malware)
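The last three steps can be pictured as a small monitor that receives events from the emulator's hooks and turns them into findings. The sketch below is a simplified illustration, not the project's code. The class name, the event method names, the argument keys (`path`, `ms`), the 60-second sleep threshold, and the list of sensitive directories are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set

PAGE_SHIFT = 12                  # track memory in 4 KiB pages
LONG_SLEEP_MS = 60_000           # assumed threshold for "excessive" sleep
SENSITIVE_DIRS = (               # assumed examples, lower-case
    "\\system32\\drivers",
    "\\start menu\\programs\\startup",
)


@dataclass
class BehaviorMonitor:
    ldr_pointer_address: int     # where the emulated PEB stores its Ldr pointer
    written_pages: Set[int] = field(default_factory=set)
    findings: List[str] = field(default_factory=list)

    def _report(self, finding: str) -> None:
        if finding not in self.findings:
            self.findings.append(finding)

    def on_memory_write(self, address: int) -> None:
        # Remember every page the sample has written to.
        self.written_pages.add(address >> PAGE_SHIFT)

    def on_memory_read(self, address: int) -> None:
        # Reading PEB->Ldr directly is a way to find modules without API calls.
        if address == self.ldr_pointer_address:
            self._report("direct LDR access")

    def on_execute(self, address: int) -> None:
        # Running code from a page the sample wrote itself suggests unpacking.
        if (address >> PAGE_SHIFT) in self.written_pages:
            self._report("self-modifying code")

    def on_api_call(self, name: str, args: Dict[str, Any]) -> None:
        path = str(args.get("path", "")).lower()
        if name == "WinHttpOpenRequest" and path.endswith(".exe"):
            self._report("downloads executable via WinHttp")
        elif name == "Sleep" and int(args.get("ms", 0)) >= LONG_SLEEP_MS:
            self._report("excessive sleep")
        elif name in ("CreateFileW", "CreateFileA"):
            if any(d in path for d in SENSITIVE_DIRS):
                self._report("touches sensitive directory")
```

Rules this simple are easy to bypass and easy to trigger by accident, which matches the weaknesses of heuristic sandbox detection described earlier.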
Here’s an example detection result:
The machine learning module is used to classify files based on extracted PE features.
We extract the following feature sets:
- PE Header Features (Presence of Import Tables, TLS sections, relocations, etc.)
- Imported DLLs (Checks for specific suspicious DLLs)
- File Entropy (Measures randomness)
- Entry Point Byte Sequence (Examines the first 64 bytes of code)
- Section Analysis (Checks PE section sizes and entropy)
- Code-to-Data Ratio (Compares code section size vs. total PE file size)
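Several of these features are simple to compute once the PE file has been parsed. The sketch below covers file entropy, the entry point byte sequence, and the code-to-file ratio. It assumes the parsing has already been done and the relevant byte ranges are passed in. The function names and the dictionary keys are hypothetical, and the imported-DLL and per-section features are left out for brevity.

```python
import math
from collections import Counter
from typing import Dict, List


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(data).values())


def entry_point_features(code: bytes, entry_offset: int,
                         count: int = 64) -> List[int]:
    """First `count` bytes at the entry point, zero-padded to a fixed length."""
    window = code[entry_offset:entry_offset + count]
    return list(window) + [0] * (count - len(window))


def build_feature_row(file_data: bytes, code_section: bytes, entry_offset: int,
                      has_imports: bool, has_tls: bool,
                      has_relocs: bool) -> Dict[str, float]:
    """Assemble one fixed-width feature row for a sample."""
    row: Dict[str, float] = {
        "has_import_table": float(has_imports),
        "has_tls": float(has_tls),
        "has_relocations": float(has_relocs),
        "file_entropy": shannon_entropy(file_data),
        "code_entropy": shannon_entropy(code_section),
        "code_to_file_ratio": len(code_section) / max(len(file_data), 1),
    }
    for i, byte in enumerate(entry_point_features(code_section, entry_offset)):
        row[f"ep_{i}"] = float(byte)
    return row
```

Packed or encrypted files tend to have entropy close to 8 bits per byte, which is why entropy shows up both as a whole-file feature and per section.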
We collected 1,000 benign samples and 1,000 malicious samples, saved their features into a CSV file, and used them for training.
⚠️ NOTE: The dataset is too small for real-world performance. A proper dataset should have at least 100,000 benign and 100,000 malicious samples.
We use XGBoost for training and then export the trained model to pure C++ code using m2cgen.
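Exporting works because a trained gradient-boosted model is just a set of decision trees. Each tree is a chain of threshold comparisons on the feature vector, the leaf values are added up, and a logistic function turns the sum into a probability. The toy below is hand-written to show that shape. It is not actual m2cgen output, and the feature order, thresholds, and leaf values are invented.

```python
import math
from typing import List

# Assumed feature order for this toy model:
#   x[0] = file_entropy, x[1] = code_to_file_ratio, x[2] = has_import_table


def tree_0(x: List[float]) -> float:
    if x[0] < 7.0:                       # low entropy
        return -0.4 if x[2] >= 0.5 else 0.1
    return 0.6                           # high entropy leans malicious


def tree_1(x: List[float]) -> float:
    if x[1] < 0.05:                      # very little code in the file
        return 0.3
    return -0.2


def predict_malicious_probability(x: List[float]) -> float:
    """Sum the leaf values of all trees, then apply the logistic function."""
    margin = tree_0(x) + tree_1(x)
    return 1.0 / (1.0 + math.exp(-margin))
```

A real exported model has the same structure with many more trees and branches, which is why it can be compiled into the engine without shipping a machine learning runtime.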
This is a basic but modern antivirus engine using sandbox-based behavior analysis and machine learning-based detection.
The full source code is available on GitHub (link below). 🚀
🔗 GitHub Repository: [INSERT LINK HERE]