file-type-detection-by-byte-blocks

In this project, we detect file types from the bytes that constitute them. For each file, we take the first block, a body block, and the last block as stored on disk, so that the models see blocks both with and without file-format signatures. We train feed-forward (FFNN), convolutional (CNN), GRU, and LSTM models on these blocks, then make predictions and evaluate the performance of each model. The experimental computer uses an SSD with a block size of 4 KB (4096 bytes). The selected blocks differ in nature: the first and last blocks may contain headers and trailers for certain file types, whereas the body block is more challenging, as it may lack the distinct patterns often found in the other two.
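As an illustration of the block selection described above, here is a minimal sketch that reads the first, a body, and the last 4096-byte block of a file. The function name `read_blocks`, the random choice of the body block, and the zero-padding of short blocks are assumptions made for this example; the project's toolkit may handle these details differently.

```python
import os
import random

BLOCK_SIZE = 4096  # 4 KB block size, as stated above


def read_blocks(path, block_size=BLOCK_SIZE, rng=None):
    """Return (first, body, last) blocks of a file as bytes objects.

    Assumptions for this sketch (not confirmed by the project):
    - the body block is drawn at random from the blocks strictly
      between the first and the last one, when such blocks exist;
    - blocks shorter than block_size are right-padded with zero bytes.
    """
    rng = rng or random.Random()
    size = os.path.getsize(path)
    # Ceiling division; an empty file is treated as one (empty) block.
    n_blocks = max(1, -(-size // block_size))

    def read_block(f, index):
        f.seek(index * block_size)
        data = f.read(block_size)
        return data.ljust(block_size, b"\x00")

    with open(path, "rb") as f:
        first = read_block(f, 0)
        last = read_block(f, n_blocks - 1)
        if n_blocks > 2:
            body_index = rng.randrange(1, n_blocks - 1)
        else:
            # Too few blocks for a distinct body block: reuse an existing one.
            body_index = n_blocks // 2
        body = read_block(f, body_index)
    return first, body, last
```

Each returned block can then be converted to a fixed-length sequence of integers in the range 0 to 255, for example with `list(first)`, before being passed to a model.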

Dataset

The dataset used for this project consists of files that can be downloaded here. Alternatively, the web-scraping script in "toolkit/scrape.py" can be used to download the dataset.

Working dir

Data visualisation: A dedicated notebook that manages dataset download and sampling, creates and analyses visualisations, performs feature extraction, and helps interpret data trends, validate assumptions, and communicate insights effectively.
Models random search: A specific notebook designed for hyperparameter optimisation using random search, enabling us to efficiently explore a range of parameter values and improve model performance.
Venv: A Python virtual environment used to isolate the dependencies installed for the project, useful for avoiding conflicting library versions.
Requirements: A text file listing the required dependencies to install in the project’s virtual environment.
HPS results: A folder that stores the hyperparameter search results for each model.
Toolkit: A Python package developed for the project.
Govdocs1: A folder containing the dataset to be used in the project, consisting of files of mixed types.
Systems 1-6: Separate notebooks that focus on model training and evaluation, providing a structured approach to experimenting with different algorithms and hyperparameters.
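The random search mentioned for the "Models random search" notebook can be sketched as follows. This is a generic, standard-library illustration and not the notebook's actual code: the function `random_search`, the hyperparameter names, and the toy `evaluate` function are all hypothetical. In practice, `evaluate` would train a model with the sampled parameters and return its validation accuracy.

```python
import random


def random_search(search_space, evaluate, n_trials=20, seed=0):
    """Sample n_trials random configurations and keep the best-scoring one.

    search_space maps each hyperparameter name to a list of candidate values.
    evaluate takes a dict of sampled parameters and returns a score
    (higher is better).
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; the names and values are illustrative only.
space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_units": [64, 128, 256],
    "dropout": [0.0, 0.25, 0.5],
}


def toy_evaluate(params):
    # Stand-in for "train a model and return validation accuracy".
    return -abs(params["learning_rate"] - 1e-3) - abs(params["dropout"] - 0.25)


best_params, best_score = random_search(space, toy_evaluate, n_trials=30)
```

Unlike a grid search, the number of trials is fixed in advance and does not grow with the number of hyperparameters, which is what makes the exploration affordable for several model families.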

