
TripHawk

This is the backbone of the big data management part of the project. We build data pipelines that ingest data from API calls and other data sources and finally store it in a distributed database.

For this project we use Python as our main language to create the data pipelines. Data collection is done with Python scripts and the requests module.
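
As a rough illustration only (not the project's actual collector code), a fetch step with requests might look like the sketch below; the endpoint, parameters, and function name are assumptions for illustration.

```python
import requests

def fetch_places(api_key: str, city: str) -> list[dict]:
    """Fetch venues for a city from the Foursquare Places API (illustrative sketch)."""
    # NOTE: endpoint and parameter names are assumptions; the real collection
    # scripts live under scripts/ and may differ.
    response = requests.get(
        "https://api.foursquare.com/v3/places/search",
        headers={"Authorization": api_key},
        params={"near": city, "limit": 50},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("results", [])
```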

The crontab module is used to create the scheduler that calls the APIs and stores the data at regular intervals.
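
A minimal sketch of how such a scheduler could register a recurring job with the python-crontab module is shown below; the command, comment, and interval are assumptions, and the project's actual scheduler lives in scripts/scheduler.py.

```python
from crontab import CronTab

# Register a cron job for the current user (illustrative sketch only;
# see scripts/scheduler.py for the project's scheduler).
cron = CronTab(user=True)
job = cron.new(command="python3 scripts/pipelines/attractions.py",
               comment="triphawk-attractions")
job.every(6).hours()   # assumed interval: run every 6 hours
cron.write()           # persist the job to the user's crontab
```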

HDFS is used as a distributed file system where all the data is stored right after it is fetched.
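
For example, a temporary loader could write a fetched JSON payload to HDFS with the hdfs Python client; the host, port, user, and path below are placeholders, and the project's loaders in scripts/temp_loaders may use a different client or file layout.

```python
import json
from hdfs import InsecureClient

def store_raw(payload: list[dict], hdfs_path: str) -> None:
    """Write a raw API payload to HDFS as JSON (illustrative sketch)."""
    # The URL must point at your NameNode's WebHDFS address (placeholder below).
    client = InsecureClient("http://localhost:9870", user="bdm")
    client.write(hdfs_path, data=json.dumps(payload),
                 encoding="utf-8", overwrite=True)
```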

MongoDB is used as a distributed NoSQL database, which is fed with the data from the API calls after it has been stored in HDFS.
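
The permanent load step could then read the JSON back from HDFS and insert it into MongoDB with pymongo, as in the sketch below; the database and collection names are placeholders, and the actual loader is scripts/perm_loaders/mongo.py.

```python
import json
from hdfs import InsecureClient
from pymongo import MongoClient

def load_to_mongo(hdfs_path: str, collection_name: str) -> None:
    """Read a JSON file from HDFS and insert its documents into MongoDB (sketch)."""
    hdfs_client = InsecureClient("http://localhost:9870", user="bdm")
    with hdfs_client.read(hdfs_path, encoding="utf-8") as reader:
        documents = json.loads(reader.read())

    # HOST:PORT must match your MongoDB server (placeholder below).
    mongo = MongoClient("mongodb://localhost:27017")
    mongo["triphawk"][collection_name].insert_many(documents)
```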

Tools Required

In order to run this project you need the following tools:

  • Python3

  • MongoDB server

  • HDFS

  • API keys from Foursquare Places and Yelp Fusion

  • Preferably a VM to run the code via SSH.

Installation

First, make sure you install all the required modules from requirements.txt.

To be able to run and call the APIs, you need API keys, which you can get from Foursquare Places and Yelp Fusion. The dictionary with the keys should be put in credentials/keys.py. This folder is ignored by git.
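
The exact structure of that dictionary is defined by the scripts that import it; a plausible minimal credentials/keys.py could look like this (the dictionary name and key names are assumptions):

```python
# credentials/keys.py -- not tracked by git; fill in your own API keys.
keys = {
    "foursquare": "YOUR_FOURSQUARE_PLACES_API_KEY",
    "yelp": "YOUR_YELP_FUSION_API_KEY",
}
```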

Both HDFS and MongoDB should be up and running in the background on the machine. Make sure that you change the HOST:PORT in both files in the scripts/temp_loaders directory, as well as for MongoDB in scripts/perm_loaders/mongo.py.
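
The variable names in those files may differ; the values to adjust are typically connection strings of this form (hypothetical names shown here):

```python
# Hypothetical configuration constants -- adapt the actual ones in
# scripts/temp_loaders/* and scripts/perm_loaders/mongo.py to your machine.
HDFS_URL = "http://<HOST>:9870"        # WebHDFS address of the HDFS NameNode
MONGO_URI = "mongodb://<HOST>:27017"   # MongoDB server address
```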

While testing, we used UPC's virtual machines: virtech.fib.upc.edu/

There are multiple ways you can fetch and store data.

  1. Run crontab so that each data pipeline runs at a set interval. To do this, simply run python3 scripts/scheduler.py. To check whether the cron jobs are registered, run $ crontab -l. If you want to change the intervals, run $ crontab -e.
  2. Run the pipelines manually using python3 scripts/pipelines/*, choosing between accommodation, attractions, and businesses.

Running a pipeline fetches/scrapes the data from the sources, uploads it to HDFS, reads it back from HDFS, and inserts it into MongoDB.

Start HDFS: /home/bdm/BDM_Software/hadoop/sbin/start-dfs.sh

Start Mongo: ~/BDM_Software/mongodb/bin/mongod --bind_ip_all --dbpath /home/bdm/BDM_Software/data/mongodb_data

Authors

Filip Sotiroski

GitHub Filip

Zhicheng Luo

GitHub Zhicheng
