Merge pull request #17 from celebi-pkg/dev
v1.2.0 Complete overhaul of scraping setup
kcelebi authored Jun 11, 2023
2 parents c92ce2a + 3583bde commit e73aa71
Showing 6 changed files with 451 additions and 127 deletions.
54 changes: 42 additions & 12 deletions README.md
@@ -1,17 +1,20 @@
[![kcelebi](https://circleci.com/gh/celebi-pkg/flight-analysis.svg?style=svg)](https://circleci.com/gh/celebi-pkg/flight-analysis)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Live on PyPI](https://img.shields.io/badge/PyPI-1.1.0-brightgreen)](https://pypi.org/project/google-flight-analysis/)
[![Live on PyPI](https://img.shields.io/badge/PyPI-1.2.0-brightgreen)](https://pypi.org/project/google-flight-analysis/)
[![TestPyPI](https://img.shields.io/badge/PyPI-1.1.1--alpha.11-blue)](https://test.pypi.org/project/google-flight-analysis/1.1.1a11/)

# Flight Analysis

This project provides tools and models for users to analyze, forecast, and collect data regarding flights and prices. There are currently many features in initial stages and in development. The current features (as of 4/5/2023) are:
This project provides tools and models for users to analyze, forecast, and collect data regarding flights and prices. There are currently many features in initial stages and in development. The current features (as of 5/25/2023) are:

- Scraping tools for Google Flights
- Detailed scraping and querying tools for Google Flights
- Ability to store data locally or to SQL tables
- Base analytical tools/methods for price forecasting/summary

The features in development are:

- Models to demonstrate ML techniques on forecasting
- Querying of advanced features
- API for access to previously collected data

## Table of Contents
@@ -59,19 +62,46 @@ For GitHub repository cloners, import as follows from the root of the repository

Here is some quick starter code to accomplish the basic tasks. Find more in the [documentation](https://kcelebi.github.io/flight-analysis/).

# Try to keep the dates in format YYYY-mm-dd
result = Scrape('JFK', 'IST', '2023-07-20', '2023-08-10') # obtain our scrape object
dataframe = result.data # outputs a Pandas DF with flight prices/info
origin = result.origin # 'JFK'
dest = result.dest # 'IST'
date_leave = result.date_leave # '2023-07-20'
date_return = result.date_return # '2023-08-10'
# Keep the dates in format YYYY-mm-dd
result = Scrape('JFK', 'IST', '2023-07-20', '2023-08-20') # obtain our scrape object, represents our query
result.type # This is in a round-trip format
result.origin # ['JFK', 'IST']
result.dest # ['IST', 'JFK']
result.dates # ['2023-07-20', '2023-08-20']
print(result) # get unqueried str representation

You can also scrape for one-way trips now:
A `Scrape` object represents a Google Flights query to be run. It maintains flights as a sequence of one or more one-way flights, each of which has an origin, destination, and flight date. The above object for a round-trip flight from JFK to IST is a sequence of JFK --> IST, then IST --> JFK. We can obtain the data as follows:

ScrapeObjects(result) # runs selenium through ChromeDriver, modifies results in-place
result.data # returns pandas DF
print(result) # get queried representation of result

You can also scrape for one-way trips:

result = Scrape('JFK', 'IST', '2023-08-20')
result.data.head() #see data
ScrapeObjects(result)
result.data #see data

You can also scrape chain-trips, which are defined as a sequence of one-way flights that have no direct relation to each other, other than being in chronological order.

# chain-trip format: origin, dest, date, origin, dest, date, ...
result = Scrape('JFK', 'IST', '2023-08-20', 'RDU', 'LGA', '2023-12-25', 'EWR', 'SFO', '2024-01-20')
result.type # chain-trip
ScrapeObjects(result)
result.data # see data

You can also scrape perfect-chains, which are defined as a sequence of one-way flights such that the destination of the previous flight is the origin of the next and the origin of the chain is the final destination of the chain (a cycle).

# perfect-chain format: origin, date, origin, date, ..., first_origin
result = Scrape("JFK", "2023-09-20", "IST", "2023-09-25", "CDG", "2023-10-10", "LHR", "2023-11-01", "JFK")
result.type # perfect-chain
ScrapeObjects(result)
result.data # see data
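
To make the perfect-chain argument format concrete, here is a minimal sketch of how such an argument list could be expanded into one-way legs. The helper `expand_perfect_chain` is hypothetical and not part of the package; the actual `Scrape` class may parse its arguments differently.

```python
def expand_perfect_chain(*args):
    """Hypothetical helper: expand origin, date, ..., first_origin
    into a list of (origin, dest, date) one-way legs."""
    # A valid chain alternates airport, date, ..., and ends on an airport,
    # so it always has an odd number of arguments (at least three).
    if len(args) < 3 or len(args) % 2 == 0:
        raise ValueError("expected origin, date, ..., first_origin")
    airports = args[0::2]  # every even position is an airport code
    dates = args[1::2]     # every odd position is a flight date
    if airports[0] != airports[-1]:
        raise ValueError("a perfect-chain must end at its first origin")
    # Leg i departs airports[i] on dates[i] and lands at airports[i + 1].
    return [(airports[i], airports[i + 1], dates[i]) for i in range(len(dates))]


legs = expand_perfect_chain(
    "JFK", "2023-09-20", "IST", "2023-09-25",
    "CDG", "2023-10-10", "LHR", "2023-11-01", "JFK",
)
for leg in legs:
    print(leg)
```

Each destination becomes the next origin, and the last leg returns to the first airport, which is the cycle property described above.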

You can read more about the different types of trips in the documentation. Scrape objects can be added to one another to create larger queries, under the following conditions:

1. The objects being added are the same type of trip (one-way, round-trip, etc)
2. The objects being added are either both unqueried or both queried

## Updates & New Features

7 changes: 4 additions & 3 deletions requirements.txt
@@ -1,6 +1,7 @@
tqdm
numpy
pandas==2.0.1
pandas
selenium
pytest==7.2.2
sqlalchemy
pytest
sqlalchemy
chromedriver-autoinstaller
2 changes: 2 additions & 0 deletions setup.cfg
@@ -24,6 +24,8 @@ install_requires =
numpy
pandas
selenium
sqlalchemy
chromedriver-autoinstaller

[options.packages.find]
where = src
8 changes: 4 additions & 4 deletions src/google_flight_analysis/flight.py
@@ -8,12 +8,12 @@

class Flight:

def __init__(self, dl, *args):
def __init__(self, date, *args):
self._id = 1
self._origin = None
self._dest = None
self._date = dl
self._dow = datetime.strptime(dl, '%Y-%m-%d').isoweekday() # day of week
self._date = date
self._dow = datetime.strptime(date, '%Y-%m-%d').isoweekday() # day of week
self._airline = None
self._flight_time = None
self._num_stops = None
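
The `_dow` field relies on `isoweekday()`, which numbers days from 1 (Monday) to 7 (Sunday). A minimal standalone sketch, using a date from the README examples; the helper name `day_of_week` is hypothetical and not part of the package:

```python
from datetime import datetime


def day_of_week(date: str) -> int:
    # Hypothetical helper mirroring the strptime/isoweekday call in Flight.__init__
    return datetime.strptime(date, '%Y-%m-%d').isoweekday()


print(day_of_week('2023-07-20'))  # 4, since 2023-07-20 is a Thursday
```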
@@ -105,7 +105,7 @@ def time_arrive(self):
return self._time_arrive

def _classify_arg(self, arg):
if ('AM' in arg or 'PM' in arg) and len(self._times) < 2:
if ('AM' in arg or 'PM' in arg) and len(self._times) < 2 and ':' in arg:
# arrival or departure time
delta = timedelta(days = 0)
if arg[-2] == '+':
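
The rest of `_classify_arg` is not shown in this hunk. As a rough illustration of the idea, the sketch below parses a time string with an optional day offset. It assumes inputs such as `'10:30 PM'` or `'6:05 AM+1'` with a single-digit offset; the function `parse_flight_time` is hypothetical and is not the package's implementation. It also shows why the added `':' in arg` check is plausibly useful: a string such as `'LATAM'` contains `'AM'` without being a time.

```python
from datetime import datetime, timedelta


def parse_flight_time(arg, date):
    """Hypothetical helper. Returns a datetime for strings like
    '10:30 PM' or '6:05 AM+1', or None if arg does not look like a time."""
    # Same guard as the updated condition: AM/PM marker plus a colon.
    if not (('AM' in arg or 'PM' in arg) and ':' in arg):
        return None
    delta = timedelta(days=0)
    # Assumed format: a trailing '+N' means arrival N days later (single digit).
    if len(arg) >= 2 and arg[-2] == '+':
        delta = timedelta(days=int(arg[-1]))
        arg = arg[:-2]
    parsed = datetime.strptime(date + ' ' + arg.strip(), '%Y-%m-%d %I:%M %p')
    return parsed + delta


print(parse_flight_time('10:30 PM', '2023-07-20'))
print(parse_flight_time('6:05 AM+1', '2023-07-20'))
print(parse_flight_time('LATAM', '2023-07-20'))
```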