
Commit

Updated Readme and test files
manvith1604 committed Jun 2, 2024
1 parent a590768 commit 7b8b775
Showing 7 changed files with 98 additions and 19 deletions.
111 changes: 95 additions & 16 deletions README.md
@@ -37,43 +37,122 @@ Source : [Access the dataset here](https://www.imf.org/en/Publications/SPROLLS/w
Ensure you have the following installed on your system:

- Docker
- Docker Compose
- Git
- Jupyter Notebook
- DVC
- TensorFlow
- MLFlow
- FastAPI/Flask
- Python > 3.8
- Google Cloud Platform

### Step-by-Step Setup

1. **Clone the Repository**
```bash
git clone https://github.com/namansnghl/World-Econ-Growth-Forecast
cd World-Econ-Growth-Forecast
```

2. **Check Python version**
```bash
python --version
```

3. **Install Python Dependencies**
```bash
pip install -r requirements.txt
```

4. **Install Airflow**
```bash
pip install "apache-airflow[celery]==2.9.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.9.1/constraints-3.8.txt"
```
Make sure the `constraints-3.x` suffix in the URL above matches your Python version (the command shows `constraints-3.8.txt`; adjust it if you run a different Python).

5. **Check that you have enough memory to run Docker (at least 4 GB recommended)**
```bash
docker run --rm "debian:bullseye-slim" bash -c 'numfmt --to iec $(echo $(($(getconf _PHYS_PAGES) * $(getconf PAGE_SIZE))))'
```

6. **Download docker-compose.yaml**
```bash
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.9.1/docker-compose.yaml'
```

7. **Initialize the database (first run only)**
```bash
docker compose up airflow-init
```

8. **Run Airflow with Docker**
```bash
docker compose up
```

9. **Visit localhost:8080 and log in with the default credentials**
```bash
user:airflow
password:airflow
```

10. **Run the DAG by clicking the play button on the right side of the window**<br/>
To hide the example DAGs, set load examples to false in docker-compose.yaml:
```yaml
AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
```

11. **DVC Setup**
```bash
pip install dvc
```

12. **Initialize DVC**
```bash
dvc init
```

13. **Add files to DVC**
```bash
dvc add <file-path>
```

## Tools Used for MLOps
- GitHub
- Jupyter Notebook
- Docker
- Airflow
- DVC
- MLflow
- TensorFlow
- Flask
- Streamlit
- Google Cloud Platform (GCP)

<img src="assets/Logo.jpg" alt="airflow" width="900" height="200">

1. **GitHub**: GitHub hosts the project's source code and documentation and manages issues and pull requests. The repository has three branches: `main`, `test-stage`, and `dev`. `pytest` is configured with GitHub Actions, which builds and runs the tests on every push.

2. **Jupyter Notebook**: Used to experiment with different data cleaning and feature engineering techniques and to visualize initial model results.

3. **DVC (Data Version Control)**: Manages versioning of datasets and machine learning models, ensuring reproducibility and efficient data handling. DVC lets us track data and model versions, making it easy to reproduce results and collaborate on data-intensive tasks without version conflicts. Remote storage is configured on `GCP`.
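
As an illustration, a DVC-tracked file can also be read directly from Python via `dvc.api`; the file path below is a placeholder, not the repository's actual layout:
```python
# Illustrative use of the DVC Python API; the tracked path is an assumption.
import pandas as pd
import dvc.api

# Open a DVC-tracked dataset from the current repository checkout.
with dvc.api.open("data/raw_data.csv", mode="r") as f:
    df = pd.read_csv(f)

print(df.shape)
```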

4. **PyTest**: For writing and running unit tests to ensure code quality and functionality of individual components.
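
For illustration, a minimal test in this style might look like the sketch below, assuming `import_data` takes a file path and returns a pandas DataFrame (the actual signature and data path may differ):
```python
# Hypothetical test sketch; assumes import_data(path) returns a pandas DataFrame.
import os
import pandas as pd
from data_loader import import_data

# Mirror the project layout used by the existing tests.
PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


def test_import_data_returns_dataframe():
    # The dataset location is assumed; adjust to the repository layout.
    data_path = os.path.join(PROJECT_DIR, "data", "raw_data.xlsx")
    df = import_data(data_path)
    assert isinstance(df, pd.DataFrame)
    assert not df.empty
```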

5. **GitHub Actions**: To automate workflows for continuous integration and continuous deployment (CI/CD), including running tests, building Docker images, and deploying models.

6. **Docker**: Containerizes applications and their dependencies, ensuring consistency across different environments and simplifying deployment.

7. **Airflow**: Manages the entire data pipeline, scheduling and monitoring tasks to ensure timely and reliable execution of the data processing and model training workflows.
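
A minimal sketch of how such a pipeline can be expressed as a DAG is shown below; the DAG id, task names, and callables are placeholders, not the project's actual DAG:
```python
# Illustrative DAG sketch; the real pipeline's tasks and schedule may differ.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator


def load():       # placeholder for the data loading step
    pass


def clean():      # placeholder for the data cleaning step
    pass


def transform():  # placeholder for the transformation step
    pass


with DAG(
    dag_id="weo_forecast_pipeline",   # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,                    # trigger manually from the UI
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="load_data", python_callable=load)
    t2 = PythonOperator(task_id="clean_data", python_callable=clean)
    t3 = PythonOperator(task_id="transform_data", python_callable=transform)
    t1 >> t2 >> t3
```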


<img src="assets/airflow_dag.png" alt="airflow" width="800" height="400">


8. **TensorFlow**: Provides a comprehensive ecosystem for developing, training, and deploying machine learning models. TensorFlow is used to build and train the predictive models in this project, leveraging its powerful APIs and tools to handle complex data and modeling tasks.
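
For illustration, a small Keras regression model of the kind this stack supports might look like the sketch below; the architecture, features, and data are placeholders, not the project's actual model:
```python
# Illustrative Keras model; layer sizes, features, and training data are assumptions.
import numpy as np
import tensorflow as tf

# Toy stand-in for engineered economic indicators (n_samples x n_features).
X = np.random.rand(100, 8).astype("float32")
y = np.random.rand(100, 1).astype("float32")  # e.g. next-year growth target

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
```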

9. **MLFlow**: Manages the machine learning lifecycle, including experimentation, reproducibility, and model deployment, and tracks metrics and parameters.
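
A minimal tracking sketch is shown below; the experiment name, parameters, and metric values are placeholders:
```python
# Illustrative MLflow tracking sketch; names and values are placeholders.
import mlflow

mlflow.set_experiment("weo-growth-forecast")  # assumed experiment name

with mlflow.start_run():
    mlflow.log_param("epochs", 50)
    mlflow.log_param("batch_size", 16)
    # ... train and evaluate the model here ...
    mlflow.log_metric("val_mae", 0.42)        # placeholder metric value
```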

10. **FastAPI/Flask**: Serves as the web framework for building RESTful APIs that expose the machine learning models as services for integration with other applications.
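
A minimal sketch of a Flask prediction endpoint is shown below; the route, payload shape, and dummy prediction logic are placeholders, not the project's actual service:
```python
# Illustrative Flask service; the real API's routes and payload may differ.
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()          # e.g. {"features": [ ... ]}
    features = payload.get("features", [])
    # Placeholder: a real service would run the trained model here.
    prediction = sum(features) / len(features) if features else 0.0
    return jsonify({"prediction": prediction})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```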

11. **Logging**: Implements logging mechanisms to track the performance, errors, and usage of the deployed models. Logging provides insights into the model's behavior and performance in production, helping to identify and troubleshoot issues quickly.
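
For illustration, a basic setup of this kind might look like the sketch below; the logger name, format, and log file are assumptions:
```python
# Illustrative logging setup; handler targets and format are assumptions.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler("model_service.log"),  # assumed log file name
    ],
)
logger = logging.getLogger("weo_forecast")

logger.info("Model loaded")
logger.warning("Input payload missing %d features", 2)
```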

12. **Looker**: Utilized for business intelligence and data visualization to create interactive dashboards and reports that provide insights into model performance and business impact.

13. **Streamlit**: Creates interactive web applications for visualizing data, model predictions, and performance metrics, facilitating easier engagement.
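
A minimal sketch of a Streamlit page for exploring predictions is shown below; the widgets, columns, and data are placeholders:
```python
# Illustrative Streamlit app; column names and data source are placeholders.
import pandas as pd
import streamlit as st

st.title("World Economic Growth Forecast")

# Placeholder data; the real app would load model predictions instead.
df = pd.DataFrame(
    {"year": [2022, 2023, 2024], "predicted_growth": [3.1, 2.8, 3.0]}
)

country = st.selectbox("Country", ["United States", "India", "Germany"])
st.line_chart(df.set_index("year")["predicted_growth"])
st.caption(f"Illustrative forecast for {country}")
```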

14. **Google Cloud Platform**: Provides scalable cloud infrastructure for hosting, managing, and deploying machine learning models and applications. GCP offers the necessary infrastructure to deploy models at scale, ensuring high availability and performance for the deployed services.
Binary file added assets/Logo.jpg
Binary file added assets/airflow_dag.png
Empty file added tests/__init__.py
Empty file.
2 changes: 1 addition & 1 deletion tests/test_data_cleaner.py
@@ -2,7 +2,7 @@
import pandas as pd
import numpy as np
import pytest
- from src.data_cleaner import process_data
+ from data_cleaner import process_data

# Determine the absolute path of the project directory
PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
2 changes: 1 addition & 1 deletion tests/test_data_loader.py
@@ -1,6 +1,6 @@
import os
import pytest
- from src.data_loader import import_data
+ from data_loader import import_data

# Determine the absolute path of the project directory
PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
2 changes: 1 addition & 1 deletion tests/test_transform.py
@@ -1,7 +1,7 @@
import os
import pandas as pd
import pytest
- from src.transform import transform_data, melt_dataframe, pivot_dataframe
+ from transform import transform_data, melt_dataframe, pivot_dataframe

# Determine the absolute path of the project directory
PROJECT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
