Global Terrorism Database Clustering

Introduction

This repository contains a visualization of the Global Terrorism Database (GTD), which is hosted on Kaggle and provided by the START Consortium. The dataset contains more than 170,000 terrorist attacks from around the world between 1970 and 2016 (Figure 1). To visualize it, I implemented k-means in PySpark to cluster the geographic locations of the attacks.

Preprocessing

Preprocessing was relatively straightforward for this dataset. Using the Pandas library, I created a DataFrame containing only the latitude and longitude columns, then dropped every row in which either value was missing. I wrote the result to a CSV file without indices or headers so that it could be run from Amazon S3.
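The preprocessing step described above could look roughly like the following. This is a minimal sketch rather than the exact script used here: the column names `latitude` and `longitude`, the helper names `extract_coordinates` and `write_clean_csv`, and the input file name in the commented usage are assumptions and may need adjusting to match the Kaggle download.

```python
import pandas as pd


def extract_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the coordinate columns and drop rows missing either value.

    Assumes the columns are named 'latitude' and 'longitude'.
    """
    coords = df[["latitude", "longitude"]]
    return coords.dropna(subset=["latitude", "longitude"], how="any")


def write_clean_csv(df: pd.DataFrame, path_or_buffer="clean_data.csv") -> None:
    """Write the cleaned coordinates without an index column or header row."""
    extract_coordinates(df).to_csv(path_or_buffer, index=False, header=False)


# Hypothetical usage (file name and encoding are assumptions):
# raw = pd.read_csv("globalterrorismdb.csv", encoding="ISO-8859-1",
#                   usecols=["latitude", "longitude"])
# write_clean_csv(raw, "clean_data.csv")
```

Writing the file without a header means each line is just `latitude,longitude`, which is simple to parse line by line in a Spark job.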

Method

As an illustrative example of clustering on this dataset, I set k=5 and used the great-circle distance metric. I chose k=5 because it roughly corresponds to one cluster per continent, and the great-circle distance because we are dealing with the approximately spherical geometry of the Earth. The resulting centroids (Figure 2) appear to correspond to regions where terrorist attacks are concentrated.
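To make the method concrete, here is a small single-machine sketch of k-means with the great-circle distance, written in plain Python rather than PySpark. It is not the code in Kmeans.py. The haversine formula, the Earth radius constant, the centroid update (averaging points as 3D unit vectors and projecting back onto the sphere), the random initialization, and all function names are assumptions chosen for illustration.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def great_circle_km(p, q):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def closest_centroid(point, centroids):
    """Index of the centroid nearest to point under the great-circle distance."""
    return min(range(len(centroids)),
               key=lambda i: great_circle_km(point, centroids[i]))


def spherical_mean(points):
    """Average points as 3D unit vectors, then convert back to (lat, lon)."""
    x = y = z = 0.0
    for lat, lon in points:
        lat, lon = math.radians(lat), math.radians(lon)
        x += math.cos(lat) * math.cos(lon)
        y += math.cos(lat) * math.sin(lon)
        z += math.sin(lat)
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))


def kmeans(points, k, max_iter=100, tol_km=0.1, seed=0):
    """Iterate assignment and update steps until centroids move less than tol_km."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest_centroid(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [spherical_mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        shift = max(great_circle_km(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tol_km:
            break
    return centroids
```

In a PySpark version, the assignment step would typically become a map over the points that emits the index of the closest centroid, and the update step a per-cluster aggregation. The plain loops above stand in for those two stages.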

Results

Figure 1: The entire Global Terror Attacks (1970 - 2016) dataset. The solid red circles are terrorist attacks before clustering is performed.

Figure 2: A map containing the centroids resulting from the k-means algorithm using k=5 and the great-circle distance metric on the whole Global Terror Attacks (1970 - 2016) dataset. It shows clusters of terrorist attacks in the following locations: Oceania, South Asia, Central Africa, Middle East, and South America.

The five resulting centroid coordinates are:

Latitude	Longitude
2.361956292733226	29.36602923946585
9.207740262716625	115.80481474889932
28.406414022007812	74.12808393025578
3.2801139683765075	-79.74785539671865
38.414250505940146	28.417748608803993

Figure 1: These are the locations where terrorist attacks have occurred.

Figure 2: After clustering, we find the following centroids.

Figure 3: Plotting both together, we can see the following map.
