Building a Data Warehouse to Store Data on Ethiopian Medical Business Data Scraped from Telegram Channels

KAIM Week 7 Challenges

Project Overview

This project aims to create a data warehouse for Ethiopian medical businesses by scraping relevant data from public Telegram channels and analyzing images through object detection using the YOLO (You Only Look Once) algorithm. The system includes processes for data scraping, data cleaning, data transformation, and data storage, as well as providing API access to the processed data.

Key Objectives:

Scraping Images from Telegram Channels: Scrape images and metadata from specified channels using the Telegram API.
Data Warehousing: Store scraped images and their metadata in a relational database.
Object Detection Preparation: Set up data for object detection, ensuring proper storage and accessibility.
Data Transformation: Use DBT (Data Build Tool) to transform the stored data for object detection and further processing.
API Development: Develop an API to expose processed data for real-time insights and analysis.

Requirements

Python 3.x
Telethon for Telegram API access
SQLAlchemy for database management
PostgreSQL or SQLite for data warehousing
Pillow (PIL) for image processing
DBT (Data Build Tool) for data transformation
YOLOv5 (for object detection in future tasks)

Task Breakdown

Task 1: Telegram Scraping

Overview:

This task focuses on scraping images from Telegram channels using the Telethon library. Images are downloaded into a local folder, and metadata is collected for each image, including:

File path
Source channel
Timestamp

Key Files:

scrape_telegram.py: Handles the Telegram scraping and metadata extraction.

Task 2: Data Warehousing

Overview:

This task stores the image metadata (from Task 1) into a relational database. The database helps manage image metadata, ensuring future scalability and accessibility for object detection.

Database Schema:

Table: images
- id: Primary key (auto-increment).
- file_path: Path to the saved image.
- source_channel: The channel from where the image was scraped.
- timestamp: Time when the image was downloaded.

Key Files:

database.py: Manages database operations, including storing image metadata.

Task 3: Object Detection

In the next phase, object detection will be performed on the scraped images using models like YOLOv5. This will involve:

Loading images from the database.
Running detection models on the images.
Storing results in the database.

Task 4: Data Transformation with DBT

DBT will be used for transforming the data in the warehouse, ensuring it’s structured properly for object detection models. The transformations will include:

Cleaning and organizing metadata.
Generating datasets optimized for model input.

Task 5: API Development

An API will be developed to expose the processed data and object detection results for real-time insights. The API will be built using Flask or FastAPI and will provide endpoints for querying detection results and metadata.

Python Libraries

The main Python libraries required are listed below. Install them using pip:

pip install telethon dbt opencv-python torch torchvision fastapi uvicorn pydantic sqlalchemy

Project Structure

Here’s a high-level overview of the project’s structure:

├── app/
│   ├── templates/   
│   ├── crud.py               
│   └── database.py
│   ├── main.py
│   ├── models.py                
│   └── schemas.py
│   ├── telegram_scraper.py     
│   ├── yolo_object_detection.py                                  
├── data/
├── images/
├── logs/
├── dbt_medical_data/
│   ├── analaysis/     
│   ├── macros              
│   └── models/    
│   ├── seeds/    
│   ├── snapshots                
│   └── tests/                          
├── notebooks/
│   ├── telegram_scraper.py     
│   ├── utils.py                
│   └── raw_data/               
├── scripts/
│   ├── __init__.py     
│   ├── main.py                
│   └── dbt_setup.py        
├── src/
│   ├── telegram_scraper.py     
│   ├── utils.py                
│   └── raw_data/               
├── tests/
│   ├── __init__.py     
│   ├── test_data_loader.py                              
├── yolov5/
│   ├── models/
│   ├── runs/               
│   └── utils/
│   ├── detect.py     
│   ├── export.py                
│   └── yolov5.pt                             
├── .gitignore               
├── requirements.txt
└── README.md                   #

Setup Instructions

1. Data Scraping

Description

The first step involves scraping textual and image data from public Telegram channels that focus on Ethiopian medical businesses. The data is collected using Python scripts and the Telethon library, which interfaces with Telegram's API.

Telegram Channels Scraped

DoctorsET
Chemed Telegram Channel
Yetenaweg
EAHCI
Additional channels from https://et.tgstat.com/medicine

Setup and Execution

Install Dependencies:
```
pip install telethon
```
Run the Scraper: Before running, make sure to create a .env file with your Telegram API credentials (API ID, API hash, and phone number).

Example .env file:
```
API_ID=your_api_id
API_HASH=your_api_hash
PHONE=your_phone_number
```
Execute the script:
```
python src/message_scraper.py
```
Output:
- Text data and metadata will be saved in a local database.
- Image files will be stored in the images/ directory.

2. Data Cleaning and Transformation

Description

After scraping, the raw data is cleaned and transformed using DBT (Data Build Tool). This process involves removing duplicates, handling missing values, and standardizing formats for easy querying and analysis.

Setup and Execution

Install DBT: Install DBT and initialize a new DBT project:
```
pip install dbt
dbt init dtb_medical_data
```

Define DBT Models:

Define SQL models in the dbt_medical_data/models/ directory for cleaning and transforming data.

Sample DBT model file:

-- models/cleaned_telegram_data.sql
select
    distinct message_id,
    message_text,
    timestamp::timestamp as message_time,
    channel_name
from raw_data
where message_text is not null

Run DBT Models: Apply the transformations by running the DBT models:
```
dbt run
```
Testing: Test data quality using DBT's built-in test features:
```
dbt test
```

3. Object Detection using YOLO

Description

In this task, we perform object detection on the scraped images using YOLOv5 to detect medical equipment, promotional materials, and other objects related to Ethiopian medical businesses.

Setup and Execution

Install YOLO Dependencies: Install PyTorch and YOLOv5:

pip install torch torchvision
git clone https://github.com/ultralytics/yolov5.git
cd yolov5
pip install -r requirements.txt

Prepare Images: Place the scraped images from the images/ folder directory for object detection.
Run YOLO: Run the YOLOv5 object detection script:
```
cd yolov5
python detect.py
```
Store Detection Results: The detection results (bounding boxes, class labels, and confidence scores) will be saved in a structured format, which will later be loaded into the data warehouse.

4. Data Warehouse Design and Implementation

Description

The data warehouse stores all the cleaned, transformed, and enriched data, enabling efficient querying and analysis. The data includes textual Telegram posts, image metadata, and YOLO object detection results.

Setup and Execution

Install PostgreSQL: Install and configure PostgreSQL, or alternatively, use SQLite for local testing.

Database Models: Define your database schema in app/models.py using SQLAlchemy:

from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.orm import relationship

class ImageMetadata(Base):
    __tablename__ = 'image_metadata'
    id = Column(Integer, primary_key=True)
    image_path = Column(String, nullable=False)
    channel_name = Column(String, nullable=False)
    timestamp = Column(String, nullable=False)

class ObjectDetection(Base):
    __tablename__ = 'object_detection'
    id = Column(Integer, primary_key=True)
    image_id = Column(Integer, ForeignKey('image_metadata.id'))
    bounding_box = Column(String, nullable=False)
    confidence = Column(Float, nullable=False)
    class_label = Column(String, nullable=False)

    image = relationship("ImageMetadata", back_populates="detections")

Migrate Database: Initialize and migrate the database to create the tables:
```
python app/database.py
```

FastAPI for Data Access

Description

To expose the processed data via an API, FastAPI is used to create RESTful endpoints. These endpoints allow users to query the data warehouse for images, detections, and associated metadata.

Setup and Execution

Install FastAPI:
```
pip install fastapi uvicorn
```
Create FastAPI Application:
- Define routes in app/main.py:

     from fastapi import FastAPI, Depends
     from sqlalchemy.orm import Session
     from .crud import get_detections
     from .database import SessionLocal

     app = FastAPI()

     @app.get("/detections/{image_id}")
     def read_detections(image_id: int, db: Session = Depends(get_db)):
         detections = get_detections(db, image_id=image_id

)
         return detections

Run FastAPI: Start the FastAPI server:
```
uvicorn app.main:app --reload
```
Access the API: Visit http://127.0.0.1:8000/ to explore the automatically generated API documentation.

Future Improvements

Data Enrichment: Add more sources of data, such as public medical directories or customer reviews, to provide a richer dataset.
Machine Learning Models: Build predictive models to analyze trends in medical products or promotional effectiveness.
Fine-tune YOLO: Train the YOLO model on specific Ethiopian medical products and packaging to improve detection accuracy.

By following these steps, you can set up a fully operational data pipeline for scraping, cleaning, transforming, analyzing, and querying data on Ethiopian medical businesses.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.github/workflows		.github/workflows
app		app
data		data
dbt_medical_data		dbt_medical_data
images		images
notebooks		notebooks
scripts		scripts
src		src
tests		tests
yolov5		yolov5
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
session_name.session		session_name.session
session_name.session-journal		session_name.session-journal

tedoaba/KAIM-W7

Folders and files

Latest commit

History

Repository files navigation

Building a Data Warehouse to Store Data on Ethiopian Medical Business Data Scraped from Telegram Channels

KAIM Week 7 Challenges

Project Overview

Key Objectives:

Table of Contents

Requirements

Task Breakdown

Task 1: Telegram Scraping

Overview:

Key Files:

Task 2: Data Warehousing

Overview:

Database Schema:

Key Files:

Task 3: Object Detection

Task 4: Data Transformation with DBT

Task 5: API Development

Python Libraries

Project Structure

Setup Instructions

1. Data Scraping

Description

Telegram Channels Scraped

Setup and Execution

2. Data Cleaning and Transformation

Description

Setup and Execution

3. Object Detection using YOLO

Description

Setup and Execution

4. Data Warehouse Design and Implementation

Description

Setup and Execution

FastAPI for Data Access

Description

Setup and Execution

Future Improvements

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages