Commit

feature/implement-unit-testing - Implement Unit Testing

Abel Tavares committed Feb 10, 2024
1 parent 6227082 commit 86be005
Showing 8 changed files with 1,401 additions and 147 deletions.
27 changes: 27 additions & 0 deletions .github/workflows/run_black.yml
@@ -0,0 +1,27 @@
name: Run Black Formatter

on:
  push:
    branches:
      - main

jobs:
  format:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.10.12

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install black
      - name: Run Black formatter
        run: black .
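
The workflow above simply runs `black .` over the repository on pushes to `main`. For a local dry run before pushing, a minimal sketch along these lines can be used (assuming Black is installed in the current environment; `--check` only reports files that would be reformatted instead of rewriting them):

```python
import subprocess
import sys

# Run Black in check-only mode; a non-zero exit code means files would be reformatted.
result = subprocess.run([sys.executable, "-m", "black", "--check", "."])
sys.exit(result.returncode)
```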
40 changes: 40 additions & 0 deletions .github/workflows/run_tests.yml
@@ -0,0 +1,40 @@
name: Run Tests

on:
  pull_request:
    types: [opened, synchronize, reopened]
  push:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.10.12

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install requests
          pip install psycopg2-binary
          pip install python-dotenv
          pip install apache-airflow==2.8.1
          pip install apache-airflow[cncf.kubernetes]
          pip install pandas
          pip install Flask-Session==0.5.0
      - name: Initialize Airflow database
        run: airflow db migrate

      - name: Run tests
        run: |
          python -m unittest discover tests
          python tests/dags_test.py
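
The test step runs `unittest` discovery and then executes `tests/dags_test.py` directly. The repository's actual tests are not shown in this diff; a minimal DAG-integrity test in the spirit of `dags_test.py` might look like the sketch below (the DAG folder path is an assumption, not taken from the repository):

```python
import unittest

from airflow.models import DagBag


class TestDagIntegrity(unittest.TestCase):
    def setUp(self):
        # Parse the project's DAG files without loading Airflow's example DAGs.
        self.dagbag = DagBag(dag_folder="dags", include_examples=False)

    def test_no_import_errors(self):
        # Any syntax or import problem in a DAG file shows up here.
        self.assertEqual(self.dagbag.import_errors, {})

    def test_dags_are_loaded(self):
        # At least one DAG should be discovered in the dags/ folder.
        self.assertGreater(len(self.dagbag.dags), 0)


if __name__ == "__main__":
    unittest.main()
```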
17 changes: 17 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,17 @@
repos:
  - repo: local
    hooks:
      - id: unit-tests
        name: Run Unit Tests
        entry: |
          python3 -c "
          import subprocess
          import sys
          TEST_RESULT = subprocess.call(['python3', '-m', 'unittest', 'discover', 'tests'])
          sys.exit(TEST_RESULT)
          "
        language: system
  - repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
      - id: black
234 changes: 181 additions & 53 deletions README.md
@@ -9,94 +9,213 @@

# MarketTrackPipe

MarketTrackPipe is an automated Apache Airflow data pipeline for collecting, storing, and backing up stock and cryptocurrency market data. The pipeline retrieves daily data for the top 5 stocks and top 5 cryptocurrencies based on market performance from Alpha Vantage, Financial Modeling Prep, and CoinMarketCap APIs and stores it in a PostgreSQL database. Additionally, the pipeline includes a monthly backup function that stores the data from the database in an AWS S3 bucket. The pipeline is containerized using Docker and written in Python 3.
MarketTrackPipe is an automated Apache Airflow data pipeline for collecting and storing stock and cryptocurrency market data. The pipeline retrieves daily data for the top 5 stocks and top 5 cryptocurrencies based on market performance from Alpha Vantage, Financial Modeling Prep, and CoinMarketCap APIs and stores it in a PostgreSQL database. The pipeline is containerized using Docker and written in Python 3.

## Project Components

The pipeline consists of two Python scripts in the `dags` folder:

- `data_collection_storage.py`: Contains functions for retrieving stock and crypto performance data from APIs and storing the data in a PostgreSQL database, as well as a function for backing up the data to Amazon S3.
- `market_data_dag.py`: Sets up the DAGs for collecting and storing stock data from the Financial Modeling Prep and Alpha Vantage APIs, as well as cryptocurrency data from the CoinMarketCap API. Additionally, it sets up a DAG for backing up the data in the PostgreSQL database to Amazon S3 on the last day of every month.

The `data_collection_storage_stocks` DAG consists of the following tasks:

1. `get_stocks`: Retrieves the symbol of the top 5 stocks according to market performance.

2. `get_stock_data`: Retrieves detailed information of the stocks retrieved in task 1.
The pipeline follows object-oriented programming principles to ensure modularity, maintainability, and extensibility. Each component of the pipeline is designed as a separate class with well-defined responsibilities.

3. `store_stock_data`: Stores the stock data in a PostgreSQL database.
Unit testing is implemented throughout the workflow to ensure the reliability and efficiency of the pipeline. These tests validate the functionality of each component and help identify any potential issues or bugs.
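
As an illustration of the approach (not the repository's actual test code), a test for `MarketDataEngine` could stub out its collaborators with `unittest.mock`; the constructor arguments and the exact calls asserted below are assumptions about how `process_stock_data()` hands data from the API client to the storage layer:

```python
import unittest
from unittest.mock import MagicMock

from core.market_data_processor import MarketDataEngine


class TestMarketDataEngine(unittest.TestCase):
    def test_process_stock_data_stores_fetched_data(self):
        # Stub the API client and storage so no real HTTP or database calls happen.
        api_client = MagicMock()
        api_client.get_data.return_value = {"gainers": [{"symbol": "AAPL"}]}
        storage = MagicMock()

        engine = MarketDataEngine(api_client, storage)  # assumed constructor signature
        engine.process_stock_data()

        # The engine is expected to fetch from the client and pass the result to storage.
        api_client.get_data.assert_called_once()
        storage.store_data.assert_called_once()


if __name__ == "__main__":
    unittest.main()
```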

DAG runs every day at 11 PM from Monday to Friday.

The `data_collection_storage_crypto` DAG consists of the following tasks:

1. `get_crypto_data`: Retrieves data for the top 5 cryptocurrencies according to market performance.

2. `store_crypto_data`: Stores the cryptocurrency data in a PostgreSQL database.
## Project Components

DAG runs every day at 11 PM.

The `backup_data` DAG consists of the following task:
```
├── core
│   ├── __init__.py
│   └── market_data_processor.py
├── dags
│   └── market_data_dag.py
├── docker-compose.yaml
├── init.sql
└── tests
    ├── dags_test.py
    └── tests_market_data_processor.py
```

- `core`: Contains core functionality for processing market data.
```mermaid
classDiagram
    class BaseApiClient {
        <<abstract>>
        +logger: logging.Logger
        <<abstractmethod>>
        +@abstractmethod get_data(): Dict[str, List[str]]
    }
    class StockApiClient {
        +ALPHA_API_KEY: str
        +PREP_API_KEY: str
        +ALPHA_BASE_URL: str
        +PREP_BASE_URL: str
        +logger: logging.Logger
        +get_stocks(): Dict[str, List[str]]
        +get_data(symbols: Dict[str, List[str]]): Dict[str, List[Dict]]
    }
    class CryptoApiClient {
        +COIN_API_KEY: str
        +logger: logging.Logger
        +get_data(): Dict[str, List[Dict]]
    }
    class Storage {
        +host: str
        +port: int
        +database: str
        +user: str
        +password: str
        +conn
        +cur
        +logger: logging.Logger
        +_connect()
        +_close()
        +store_data(data: Dict[str, List[Dict[str, any]]], data_type: str): None
    }
    class MarketDataEngine {
        +api_client: BaseApiClient
        +db_connector: Storage
        +logger: logging.Logger
        +process_stock_data()
        +process_crypto_data()
    }
    BaseApiClient <|-- StockApiClient
    BaseApiClient <|-- CryptoApiClient
    MarketDataEngine "1" --> "1" BaseApiClient
    MarketDataEngine "1" --> "1" Storage
```
<br>

- `dags`: Contains the Apache Airflow DAG definitions for orchestrating the data collection and storage process.
- `tests`: Contains the unit tests for testing individual components of the project.
- `init.sql`: SQL script for creating and initializing the database schema.
```mermaid
graph TD;
    subgraph DB
        schema[market_data]
        stock[stock_data]
        crypto[crypto_data]
    end
    subgraph Fields
        date_collected
        symbol
        name
        market_cap
        volume
        price
        change_percent
    end
    schema --> |Schema| stock & crypto
    stock & crypto -->|Table| gainers & losers & actives
    gainers & losers & actives --> Fields
```
<br>

- `docker-compose.yml`: Defines the services and configures the project's containers, setting up the environment (postgres, pgadmin, airflow).

The `MarketDataEngine` class within `core/market_data_processor.py` encapsulates the logic for retrieving and storing market data. The `market_data_dag.py` file within the `dags` directory sets up the Apache Airflow DAGs for collecting and storing market data.
<br>
```mermaid
graph TD;
    subgraph MarketTrackPipe
        A((Airflow))
        D(Docker)
        P(PostgreSQL)
        G(pgAdmin)
    end
    subgraph Core
        MDE(MarketDataEngine)
        SAPI(StockApiClient)
        CAPI(CryptoApiClient)
        STR(Storage)
    end
    subgraph Dags
        MD_DAG_stocks(process_stock_data)
        MD_DAG_crypto(process_crypto_data)
    end
    D --> A & P & G
    P --> G
    A --> Dags
    Dags --> MDE
    MDE --> SAPI & CAPI
    SAPI & CAPI --> API
    API --> SAPI & CAPI
    SAPI & CAPI --> STR
    STR --> P
    style A fill:#f9f,stroke:#333,stroke-width:4px;
    style D fill:#bbf,stroke:#333,stroke-width:2px;
    style P fill:#f9f,stroke:#333,stroke-width:4px;
    style MDE fill:#f9f,stroke:#333,stroke-width:4px;
    style MD_DAG_stocks fill:#f9f,stroke:#333,stroke-width:4px;
    style MD_DAG_crypto fill:#f9f,stroke:#333,stroke-width:4px;
```
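
Based on the class diagram above, the wiring between the DAG file and the core classes would look roughly like the sketch below. This is a simplified illustration only: it assumes all classes live in `core/market_data_processor.py`, and the constructor arguments, environment variable names, task layout, and schedule strings in the real `market_data_dag.py` may differ.

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from core.market_data_processor import (
    CryptoApiClient,
    MarketDataEngine,
    StockApiClient,
    Storage,
)

# Assumed environment variable names; the project reads its settings from .env.
storage = Storage(
    host=os.environ["POSTGRES_HOST"],
    port=5432,
    database="market_data",
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)

stock_engine = MarketDataEngine(StockApiClient(), storage)
crypto_engine = MarketDataEngine(CryptoApiClient(), storage)

with DAG(
    dag_id="process_stock_data",
    start_date=datetime(2024, 1, 1),
    schedule="0 23 * * 1-5",  # 11 PM, Monday to Friday
    catchup=False,
):
    PythonOperator(
        task_id="process_stock_data",
        python_callable=stock_engine.process_stock_data,
    )

with DAG(
    dag_id="process_crypto_data",
    start_date=datetime(2024, 1, 1),
    schedule="0 23 * * *",  # 11 PM daily
    catchup=False,
):
    PythonOperator(
        task_id="process_crypto_data",
        python_callable=crypto_engine.process_crypto_data,
    )
```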

1. `backup_data`: Extracts data from the PostgreSQL database and stores it in an Amazon S3 bucket in parquet file format.
## Requirements

The `docker-compose.yml` file is used to define the services and configure the project's containers, setting up the environment (postgres, pgadmin, airflow).
- [Docker](https://www.docker.com/get-started)
- [pre-commit](https://pre-commit.com/) (Developer)

The `init.sql` file is used to create and initialize the database schema when the docker compose command is executed.

It creates two schemas in the `market_data` database, one for `stock_data` and another for `crypto_data`, and then creates tables within each schema to store `gainer`, `loser`, and `active` data for both stocks and crypto.
## Setup

The columns for each table are as follows:
1. Clone the repository:

- `id` : a unique identifier for each row in the table
- `date_collected` : the date on which the data was collected, defaulting to the current date
- `symbol` : the stock or crypto symbol
- `name` : the name of the stock or crypto
- `market_cap` : the market capitalization of the stock or crypto
- `volume` : the trading volume of the stock or crypto
- `price` : the price of the stock or crypto
- `change_percent` : the percentage change in the price of the stock or crypto
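
For example, a minimal query against one of these tables from Python could look like the following sketch (the `stock_data.gainers` table name follows the schema diagram above, and the connection values are placeholders for whatever is configured in `.env`):

```python
import psycopg2

# Placeholder connection values; use the credentials from your .env file.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="market_data",
    user="postgres",
    password="postgres",
)

with conn, conn.cursor() as cur:
    # Fetch today's top gaining stocks as collected by the pipeline.
    cur.execute(
        """
        SELECT symbol, name, price, change_percent
        FROM stock_data.gainers
        WHERE date_collected = CURRENT_DATE
        ORDER BY change_percent DESC;
        """
    )
    for row in cur.fetchall():
        print(row)

conn.close()
```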
```bash
git clone https://github.com/abeltavares/MarketTrackPipe.git
```

## Requirements
2. Create an '.env' file in the project's root directory with the required environment variables (refer to the example .env file in the project).

- [Docker](https://www.docker.com/get-started)
3. Start the Docker containers:

```bash
docker-compose up
```

## Setup
4. Access the Airflow web server:

1. Clone the repository: <br>
Go to the Airflow web UI at http://localhost:8080 and turn on the DAGs.

$ git clone https://github.com/abeltavares/MarketTrackPipe.git
Alternatively, you can trigger the DAGs manually by running the following commands in your terminal:

2. Create an '.env' file in the project's root directory with the required environment variables (refer to the example .env file in the project).
```bash
airflow dags trigger data_collection_storage_stocks
airflow dags trigger data_collection_storage_crypto
```

3. Start the Docker containers:<br>

$ docker-compose up
## Setting up Pre-commit Hooks (Developer Setup)

4. Access the Airflow web server:<br>
To ensure code quality and run unit tests before committing changes, MarketTrackPipe uses [pre-commit](https://pre-commit.com/) hooks. Follow these steps to set it up:

Go to the Airflow web UI at http://localhost:8080 and turn on the DAGs.
1. Install `pre-commit` by running the following command in your terminal:

Alternatively, you can trigger the DAG manually by running the following command in your terminal:
```bash
pip install pre-commit
```

$ airflow trigger_dag data_collection_storage_stocks
$ airflow trigger_dag data_collection_storage_crypto
$ airflow trigger_dag backup_data
2. Run the following command to set up pre-commit:

That's it! You should now be able to collect and store stock and cryptocurrency data using MarketTrackPipe.
```bash
pre-commit install
```

This will install the pre-commit hook into your git repository.
<br>
3. Now, every time you commit changes, pre-commit automatically runs the unit tests to ensure code quality. These tests also run in a GitHub Actions workflow on every pull request to the repository.

## Usage

After setting up the workflow, you can access the Apache Airflow web UI to monitor the status of the tasks and the overall workflow.

To access the data stored in the PostgreSQL database, you have two options:

1. Use the command-line tool `psql` to run SQL queries directly. The database credentials and connection information can be found in the '.env' file as well. Using psql, you can connect to the database, execute queries, and save the output to a file or use it as input for other scripts or applications.

$ docker exec -it my-postgres psql -U postgres -d market_data
1. **Command-line tool `psql`**: You can use `psql` to run SQL queries directly. Find the database credentials and connection information in the '.env' file. Use the following command in your terminal to connect to the database:

```bash
docker exec -it [container] psql -U [user] -d market_data
```
2. Use `pgAdmin`, a web-based visual interface. To access it, navigate to http://localhost:5050 in your web browser and log in using the credentials defined in the `.env` file in the project root directory. From there, you can interactively browse the tables created by the pipeline, run queries, and extract the desired data for analysis or visualization.

Choose the option that suits you best depending on your familiarity with SQL and preference for a graphical or command-line interface.
@@ -112,3 +231,12 @@ This project is open to contributions. If you have any suggestions or improvemen

## Copyright
© 2023 Abel Tavares


The codebase of this project follows the [black](https://github.com/psf/black) code style. To ensure consistent formatting, the [pre-commit](https://pre-commit.com/) hook is set up to run the black formatter before each commit.

Additionally, a GitHub Action is configured to automatically run the black formatter on every pull request, ensuring that the codebase remains formatted correctly.

Please make sure to run `pip install pre-commit` and `pre-commit install` as mentioned in the setup instructions to enable the pre-commit hook on your local development environment.

Contributors are encouraged to follow the black code style guidelines when making changes to the codebase.