
SemPCA

DOI

SemPCA is an artifact of our empirical study: Try with Simpler – An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection.
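For readers unfamiliar with the PCA family of detectors that the paper evaluates, the sketch below shows the classic recipe on log event count vectors: fit PCA on the counts, reconstruct each sequence from the "normal" subspace, and flag sequences whose squared prediction error is large. This is a minimal, illustrative example built on scikit-learn, not the SemPCA implementation itself; the toy count matrix and the percentile threshold are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy event-count matrix: one row per log sequence (e.g., an HDFS block),
# one column per log template; values are occurrence counts.
X = np.array([
    [5, 3, 0, 1],
    [4, 3, 0, 1],
    [5, 2, 1, 1],
    [4, 3, 0, 2],
    [0, 9, 7, 0],   # an unusual sequence
], dtype=float)

# Keep the top-k principal components as the "normal" subspace.
k = 2
pca = PCA(n_components=k)
pca.fit(X)

# Reconstruct each sequence from the normal subspace; the residual is the
# part of the sequence that the normal subspace cannot explain.
X_hat = pca.inverse_transform(pca.transform(X))
spe = np.sum((X - X_hat) ** 2, axis=1)   # squared prediction error (Q-statistic)

# The classic approach derives the threshold from the residual eigenvalues;
# a simple percentile is used here purely for illustration.
threshold = np.percentile(spe, 95)
print("anomalous sequences:", np.where(spe > threshold)[0])
```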


Project Structure

├─approaches        # LogBar main entrance
├─conf              # configurations for Drain
├─datasets          # open-source log datasets, i.e., HDFS, BGL and Spirit
├─entities          # instances for log data and DL models
├─logs
├─models            # LSTM, attention-based GRU, Cluster and PCA models
├─module            # anomaly detection modules, including classifier, Attention, etc.
├─outputs
├─parsers           # Drain parser
├─preprocessing     # preprocessing code, data loaders and cutters
├─representations   # log template and sequence representation methods
├─scripts           # running scripts for reproduction
└─utils

Datasets

We used three open-source log datasets: HDFS, BGL, and Spirit. The table below summarizes their basic information.

| Software System | Description | Time Span | # Messages | Data Size | Link |
| --- | --- | --- | --- | --- | --- |
| HDFS | Hadoop distributed file system log | 38.7 hours | 11,175,629 | 1.47 GB | LogHub |
| BGL | Blue Gene/L supercomputer log | 214.7 days | 4,747,963 | 708.76 MB | Usenix-CFDR Data |
| Spirit | Spirit supercomputer log | 2.5 years | 272,298,969 | 37.34 GB | Usenix-CFDR Data |

You can find the dataset files via DOI 10.5281/zenodo.6375627, or simply click the DOI badge under the title. Please note that, due to size limitations, the Zenodo archive does not include the original log files (e.g., the HDFS log file), but it should still allow you to run the scripts directly. If anything goes wrong, please download the log files and put them into the corresponding folders (e.g., datasets/HDFS/HDFS.log for the HDFS log data).
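If you are unsure whether the raw log files are in place, a quick check like the one below can save a failed run. The paths follow the folder layout described above; the HDFS path is taken from the example, while the BGL and Spirit file names are assumptions that follow the same pattern and may need adjusting.

```python
from pathlib import Path

# Expected locations of the raw log files, following the layout above.
# Adjust the list to the datasets you actually downloaded.
expected = [
    Path("datasets/HDFS/HDFS.log"),
    Path("datasets/BGL/BGL.log"),      # assumed file name
    Path("datasets/Spirit/Spirit.log"),  # assumed file name
]

for path in expected:
    status = "found" if path.is_file() else "MISSING - download it from LogHub / Usenix-CFDR"
    print(f"{path}: {status}")
```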

Environment

To reproduce the results in our paper, please install the suggested versions of the key packages. The other packages needed to run our experiments are listed in requirements.txt.

Key Packages: The packages used by SemPCA are listed below. The versions of the ones crucial for reproducibility are specified.

Python 3.8.3
hdbscan 0.8.27
overrides 6.1.0
scikit-learn 0.24.2  # (the latest version of scikit-learn is not supported)
PyTorch 1.10.1       # (please refer to the official guidelines)
Drain3               # https://github.com/IBM/Drain3 (use `pip install drain3` if the conda install fails)
tqdm
numpy
regex
pandas
scipy

We have prepared a requirements.txt for quick installation. Please note that PyTorch is not included in the file; you may need to visit its official site for installation instructions.

NB: Due to a known issue with joblib, scikit-learn versions newer than 0.24 are not supported here. We will keep watching for a fix.
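Drain3 is needed for log parsing. The snippet below is a minimal, stand-alone sketch of what template mining does: it clusters raw messages into templates, replacing variable parts (block IDs, sizes, ...) with wildcards. The repository ships its own parser under parsers/ with configurations in conf/, so this example (with made-up log messages and Drain3's default configuration) is for orientation only.

```python
from drain3 import TemplateMiner

# Drain3 with its default configuration; the repo's parser uses the
# configurations under conf/ instead.
miner = TemplateMiner()

messages = [
    "Received block blk_123 of size 67108864 from /10.0.0.1",
    "Received block blk_456 of size 67108864 from /10.0.0.2",
    "Verification succeeded for blk_123",
]

for msg in messages:
    # add_log_message() returns a dict describing the matched/created cluster.
    result = miner.add_log_message(msg)
    print(result["template_mined"])
```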

Reproducibility

Follow the steps below to reproduce our experimental results.

  • Step 1: Create a directory under the datasets folder using a unique and memorable name (e.g., HDFS or BGL).
  • Step 2: Move a target log file with the .log extension (plain text, one log message per line) into the folder created in Step 1.
  • Step 3: Download glove.6B.300d.txt from the Stanford NLP word embeddings and put it under the datasets folder (see the sketch after this list for how these vectors are used).
  • Step 4: Enter the scripts/HDFS folder and run PCA_PlusPlus.sh to perform anomaly detection on HDFS with PCA++. Other techniques can be executed with the corresponding scripts in a similar way.
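To give a feel for why Step 3 needs GloVe vectors, the sketch below averages the word vectors of a parsed log template into a fixed-size semantic representation. It is a simplified illustration of the kind of template representation built under representations/, not the exact code path used by the scripts; the tokenization and averaging scheme here are assumptions.

```python
import numpy as np

def load_glove(path="datasets/glove.6B.300d.txt"):
    """Load GloVe word vectors into a {word: vector} dict."""
    vectors = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def template_embedding(template, glove, dim=300):
    """Average the GloVe vectors of the template's words (unknown words and wildcards are skipped)."""
    words = [w.lower() for w in template.split() if w.isalpha()]
    vecs = [glove[w] for w in words if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

glove = load_glove()
print(template_embedding("Received block <*> of size <*> from <*>", glove)[:5])
```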

Contact

The authors are anonymous at the current stage.

| Name | Email Address |
| --- | --- |

* corresponding author
