This repository contains all code used for the experimental work in my Bachelor thesis on Approximation Algorithms for Graph Edit Distance (GED). The experiments focus on benchmarking approximation methods against exact GED computations, using AStar-BMao, GEDLIB and SimGNN as the primary backend.
- Graph Edit Distance (GED) Evaluation: Compare multiple GED approximation algorithms.
- Dataset Processing & Conversion: Convert datasets to different formats (TXT → GXL/XML, JSON).
- GEDLIB Benchmarking: Execute and log results from AStar- and GEDLIB-based algorithms.
- SimGNN Model Training & Evaluation: Train and test a neural network-based GED predictor.
- Results Analysis & Visualization: Compare accuracy, runtime, and memory usage across methods.
To get started with this project, clone the repository to your local machine using git:
git clone https://github.com/michaelflppv/ged-approximation.git
cd ged-approximation
This repository includes precompiled datasets and large files (e.g., GXL/XML files, JSON graph pairs, and pre-trained models). To ensure these files are correctly downloaded, Git LFS (Large File Storage) must be installed.
- Download and install Git LFS.
- Run the setup command:
git lfs install
- Pull large files manually:
git lfs pull
📦 ged-approximation
├── 📜 README.md # Project documentation
├── 📂 data/ # Raw graph datasets (AIDS, IMDB, etc.)
│ ├── 📂 AIDS/
│ ├── 📂 IMDB-BINARY/
│ ├── 📂 PROTEINS/
│ └── ...
├── 📂 processed_data/ # Preprocessed data for different tools
│ ├── 📂 gxl/ # GXL graphs for GEDLIB
│ ├── 📂 json_pairs/ # JSON graph pairs for SimGNN
│ ├── 📂 synthetic_graphs/ # Synthetic graphs for experiments
│ ├── 📂 txt/ # TXT graph pairs for AStar-BMao
│ ├── 📂 xml/ # XML graph pair collections
├── 📂 results/ # Stores output of GED computations
│ ├── 📂 exact_ged/ # Ground truth edit distances
│ ├── 📂 extracted_paths/ # Edit paths from GEDLIB
│ ├── 📂 lower_bound/ # Lower bound estimations
│ ├── 📂 simgnn/ # SimGNN predictions
│ ├── 📂 gedlib/ # GEDLIB results
│ └── 📂 label_diversity/ # Label diversity stats
├── 📂 heuristics/ # Heuristic lower bound estimations
│ ├── 📂 plots/ # Visualizations of lower bounds
│ ├── 📜 estimate_lower_bound.py
│ └── 📜 validate_lower_bounds.py
├── 📂 SimGNN/ # Neural GED model (SimGNN)
│ ├── 📂 assets/
│ ├── 📂 dataset/ # Train/test data in JSON format
│ ├── 📂 models/ # Saved PyTorch models
│ └── 📂 src/ # Model code (SimGNN, training, eval)
│ ├── layers.py, simgnn.py, ...
│ └── simgnn_extract_edit_path.py, ...
📂 src/ # Main processing and analysis scripts
├── 📂 analysis/ # Scripts and notebooks for analyzing GED results
│ ├── 📂 notebooks/ # Jupyter Notebooks for visual exploration
│ │ ├── lower_bound_analysis.ipynb # Analyze lower bound estimations
│ │ ├── plot_analysis.ipynb # Plot comparison metrics
│ │ └── statistics_analysis.ipynb # General dataset statistics
│ ├── 📂 C++_parsers/ # Python wrappers for C++ GED results
│ │ ├── astar_exact_ged.py # Parse A* GED output
│ │ ├── gedlib_edit_path.py # Extract GEDLIB edit paths
│ │ └── gedlib_parser.py # General GEDLIB result parser
├── 📂 converters/ # Convert original TXT datasets into structured formats
│ ├── 📂 gxl_xml/ # Convert to GXL/XML for GEDLIB
│ │ ├── preprocess_aids.py
│ │ ├── preprocess_imdb.py
│ │ ├── preprocess_proteins.py
│ │ └── preprocess_mutag.py
│ ├── 📂 json/ # Convert to JSON for SimGNN
│ │ └── preprocess_all.py
│ ├── 📂 txt/ # TXT conversion handling
│ │ └── preprocess_all.py
├── 📂 edit_path_test/ # Tools for evaluating edit paths (ground-truth vs predicted)
│ ├── 📂 generate_synthetic_graphs/ # Scripts for generating synthetic test data
│ │ ├── generate_gxl_collection.py
│ │ └── generate_json_pairs.py
│ ├── 📂 test/ # Edit path validation utilities
│ │ └── gedlib_validate_edit_path.py # Validate GEDLIB paths
│ └── 📜 apply_edit_path.py # Apply and simulate edit path execution
├── 📂 helper_functions/ # Miscellaneous utility scripts
│ └── 📜 label_diversity_calculator.py # Computes label diversity in datasets
├── 📂 gedlib/ # GEDLIB C++ source and interface
│ ├── 📂 src/, include/, lib/ # C++ logic and libraries
│ ├── 📜 main.cpp, CMakeLists.txt # Entry and build files
│ └── 📜 install.py # Installation script
├── 📂 median/ # Placeholder (possibly for GED median)
├── 📂 tests/ # Unit and functional tests
├── 📂 venv/ # Python virtual environment (optional)
└── 📜 LICENSE, .gitignore, ... # Meta files
- Linux: Ubuntu 18.04+ / Debian / Fedora (recommended)
- Windows: Windows 10+ (WSL recommended)
- macOS: macOS 10.14+ (M1/M2 compatible)
- Python ≥ 3.8 (recommended 3.9)
- C++17‑compatible compiler (e.g. GCC ≥ 7, Clang ≥ 5, MSVC ≥ 2017)
The required Python packages are listed in the requirements.txt
file. You can install them using the following command:
pip install -r requirements.txt
- CMake: Required for building the C++ components. Install it from CMake.
- Doxygen: Required for generating documentation. Install it from Doxygen.
- OpenMP: Required for parallel processing. Ensure your compiler supports OpenMP.
Find more information on how to install these tools in GEDLIB.
This repository partially relies on GEDLIB for GED computation. The required repository and its external libraries should already be installed within this project. If not, refer to the GEDLIB for more information.
To compile the C++ code, navigate to the gedlib
directory and run:
cd gedlib
python install.py --clean --doc --lib gxl
mkdir build && cd build
cmake ..
make
My repostory called mixup contains a backup copy of GEDLIB with the source code, required to compile this project.
This project also relies on the Graph Edit Distance (GED) repository by Lijun Chang for exact GED computation.
To use this repository:
- Clone the repository:
git clone https://github.com/LijunChang/Graph_Edit_Distance.git cd Graph_Edit_Distance
- Follow the build instructions provided in the repository to compile and set up the exact GED computation framework.
To convert datasets into the required formats, follow these steps:
- Navigate to the src/converters directory.
- Run the appropriate conversion script for your dataset. For example, to convert the AIDS dataset to GXL:
- Choose gxl_xml directory.
- Select the appropriate script (e.g.,
preprocess_aids.py
) or specify the dataset name in the script (e.g.,preprocess_all.py
). - Run the script:
python preprocess_aids.py
To estimate lower bounds for the graph pairs:
- Navigate to heuristics and run:
python estimate_lower_bound.py
- The results will be saved in the
results/lower_bound
directory.
To compute the exact GED using the AStar-BMao algorithm:
- Set up the environment as described in the Installation & Setup section.
- Navigate to src/c++_parsers.
- Run the AStar-BMao script:
python astar_exact_ged.py
- If you want, you can adjust the amount of threads and graph pairs used for the computation in the script.
- The results will be saved in the
results/exact_ged
directory.
To compute an approximate GED using any algorithm available in the GEDLIB:
- Set up the environment as described in the Installation & Setup section.
- Navigate to src/c++_parsers.
- Select the appropriate script (e.g.,
gedlib_parser.py
) and the algorithm you want to use. - For changing the algorithm, you can modify the
command = [GED_EXECUTABLE, dataset_path, preprocessed_xml, "IPFP"]
line in the script. - Run the script:
python gedlib_parser.py
To train the model, navigate to SimGNN/src and run main.py
:
python main.py
For more information and hyperparameter settings, refer to original SimGNN repository.
To test the model, navigate to SimGNN/src and run:
python simgnn_evaluate.py
To test the model on a specific dataset, you can modify the paths in the script. The results will be saved in the results/simgnn
directory.
The repository includes tools for extracting and validating edit paths using GEDLIB algorithms and SimGNN. To extract edit paths:
- To extract edit path for a pair of graphs, navigate either to src/c++_parsers or SimGNN/src for GEDLIB and SimGNN respectively.
- Run the appropriate script (e.g.,
gedlib_edit_path.py
orsimgnn_extract_edit_path.py
) and specify the graph pair and dataset path.
- Run the appropriate script (e.g.,
- To validate or apply edit paths, navigate to src/edit_path_test and choose an appropriate script (e.g.,
gedlib_validate_edit_path.py
orapply_edit_path.py
). Modify the paths in the script to point to the edit path files and dataset. - To validate the edit paths for SimGNN, navigate to simgnn_validate_edit_path.py and follow the same steps.
This repository includes Jupyter Notebooks for analyzing and visualizing the results of the experiments. To explore the results:
- Navigate to notebooks.
- Open the desired notebook (e.g.,
lower_bound_analysis.ipynb
,plot_analysis.ipynb
, orstatistics_analysis.ipynb
) and run the cells to visualize the results.
The repository includes several datasets for benchmarking the GED algorithms. The datasets are stored in the data
directory and include:
- AIDS: A dataset of molecular graphs.
- IMDB-BINARY: A dataset of binary graphs representing movie co-appearances.
- PROTEINS: A dataset of protein structures represented as graphs.
These datasets were taken from TUDataset, which is a collection of benchmark datasets for graph machine learning. For more information on the datasets, refer to the TUDataset website.
If you use this code in your work, please cite:
@misc{Filippov2025,
author = {Mikhail Filippov},
title = {Approximation Algorithms for Graph Edit Distance (GED)},
year = {2025},
url = {https://github.com/michaelflppv/ged-approximation},
note = {Bachelor Thesis, University of Mannheim}
}
For GEDLIB, refer to the official repository. The source code of GEDLIB is distributed under the GNU Lesser General Public License.
- Blumenthal, D. B., Bougleux, S., Gamper, J., & Brun, L. (2019). GEDLIB: A C++ library for graph edit distance computation. In Graph-Based Representations in Pattern Recognition (GbRPR 2019). Paper.
- Blumenthal, D. B., Boria, N., Gamper, J., Bougleux, S., & Brun, L. (2020). Comparing heuristics for graph edit distance computation. VLDB Journal, 29(1), 419-458. Paper.
- Chang, L., Feng, X., Lin, X., Qin, L., Zhang, W., & Ouyang, D. (2020). Speeding Up GED Verification for Graph Similarity Search. In Proceedings of the 36th International Conference on Data Engineering (ICDE'20). Paper.
- Chang, L., Feng, X., Yao, K., Qin, L., & Zhang, W. (2022). Accelerating Graph Similarity Search via Efficient GED Computation. IEEE Transactions on Knowledge and Data Engineering (TKDE). Paper.
- Bai, Y., Ding, H., Bian, S., Chen, T., Sun, Y., & Wang, W. (2019). SimGNN: A Neural Network Approach to Fast Graph Similarity Computation. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining (WSDM 2019). Paper.
The SimGNN implementation is based on the original repository by Benedek Rozemberczki.
For questions, create an issue or reach out via email.