Building a Data Warehouse for Ethiopian Medical Business Data Scraped from Telegram Channels
This project aims to create a data warehouse for Ethiopian medical businesses by scraping relevant data from public Telegram channels and analyzing images through object detection using the YOLO (You Only Look Once) algorithm. The system includes processes for data scraping, data cleaning, data transformation, and data storage, as well as providing API access to the processed data.
- Scraping Images from Telegram Channels: Scrape images and metadata from specified channels using the Telegram API.
- Data Warehousing: Store scraped images and their metadata in a relational database.
- Object Detection Preparation: Set up data for object detection, ensuring proper storage and accessibility.
- Data Transformation: Use DBT (Data Build Tool) to transform the stored data for object detection and further processing.
- API Development: Develop an API to expose processed data for real-time insights and analysis.
- Project Overview
- Requirements
- Setup Instructions
- Task Breakdown
- Project Structure
- Challenges and Solutions
- Python 3.x
- Telethon for Telegram API access
- SQLAlchemy for database management
- PostgreSQL or SQLite for data warehousing
- Pillow (PIL) for image processing
- DBT (Data Build Tool) for data transformation
- YOLOv5 (for object detection in future tasks)
This task focuses on scraping images from Telegram channels using the Telethon library. Images are downloaded into a local folder, and metadata is collected for each image, including:
- File path
- Source channel
- Timestamp
- scrape_telegram.py: Handles the Telegram scraping and metadata extraction.
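A minimal sketch of how the three metadata fields could be collected per image. It assumes a connected Telethon TelegramClient is passed in as `client`; the names `ImageRecord` and `collect_images` are hypothetical and are not taken from the scraper script. Only `iter_messages`, `photo`, `id`, and `download_media` are Telethon calls.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class ImageRecord:
    """Metadata kept for each downloaded image (hypothetical container)."""
    file_path: str
    source_channel: str
    timestamp: datetime


async def collect_images(client, channel: str, out_dir: str, limit: int = 100):
    """Download photos from one channel and return their metadata.

    `client` is assumed to be a connected telethon.TelegramClient.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    records = []
    async for message in client.iter_messages(channel, limit=limit):
        if message.photo is None:  # skip posts that carry no photo
            continue
        target = str(Path(out_dir) / f"{channel}_{message.id}.jpg")
        saved = await message.download_media(file=target)
        if saved:  # download_media returns None when nothing was saved
            downloaded_at = datetime.now(timezone.utc)
            records.append(ImageRecord(saved, channel, downloaded_at))
    return records
```

With a real client this would be awaited inside the client's session, for example `await collect_images(client, "DoctorsET", "images")`.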
This task stores the image metadata (from Task 1) into a relational database. The database helps manage image metadata, ensuring future scalability and accessibility for object detection.
- Table: images
  - id: Primary key (auto-increment).
  - file_path: Path to the saved image.
  - source_channel: The channel from which the image was scraped.
  - timestamp: Time when the image was downloaded.
- database.py: Manages database operations, including storing image metadata.
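As an illustration of the images table, here is a minimal sketch that uses the standard-library sqlite3 module rather than SQLAlchemy, so it runs without extra dependencies. The function names `init_db` and `save_image` are hypothetical; the column names follow the table described above.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection) -> None:
    """Create the images table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS images (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            file_path TEXT NOT NULL,
            source_channel TEXT NOT NULL,
            timestamp TEXT NOT NULL
        )
        """
    )


def save_image(conn: sqlite3.Connection, file_path: str, source_channel: str) -> int:
    """Insert one metadata row and return its generated id."""
    downloaded_at = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO images (file_path, source_channel, timestamp) VALUES (?, ?, ?)",
        (file_path, source_channel, downloaded_at),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
init_db(conn)
image_id = save_image(conn, "images/DoctorsET_1.jpg", "DoctorsET")
```

Pointing `sqlite3.connect` at a file path instead of `:memory:` would persist the rows between runs.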
In the next phase, object detection will be performed on the scraped images using models like YOLOv5. This will involve:
- Loading images from the database.
- Running detection models on the images.
- Storing results in the database.
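The three steps above can be sketched as a single loop. This is only an outline under stated assumptions: the detector is passed in as a plain callable (`detect`, hypothetical) so the loop does not depend on any particular model, images are read from the images table of Task 2, and the `object_detection` table layout shown here is an assumption.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

# (class_label, confidence, bounding_box encoded as "x1,y1,x2,y2")
Detection = Tuple[str, float, str]


def run_detection(conn: sqlite3.Connection,
                  detect: Callable[[str], Iterable[Detection]]) -> int:
    """Run `detect` on every stored image and save what it returns."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS object_detection (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            image_id INTEGER NOT NULL REFERENCES images(id),
            class_label TEXT NOT NULL,
            confidence REAL NOT NULL,
            bounding_box TEXT NOT NULL
        )
        """
    )
    stored = 0
    # Step 1: load image paths from the database
    rows = conn.execute("SELECT id, file_path FROM images").fetchall()
    for image_id, file_path in rows:
        # Step 2: run the detector on one image
        for label, confidence, box in detect(file_path):
            # Step 3: store each detection next to its image id
            conn.execute(
                "INSERT INTO object_detection"
                " (image_id, class_label, confidence, bounding_box)"
                " VALUES (?, ?, ?, ?)",
                (image_id, label, confidence, box),
            )
            stored += 1
    conn.commit()
    return stored
```

Keeping the detector behind a callable means a YOLOv5 wrapper can be swapped in later without touching the storage code.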
DBT will be used for transforming the data in the warehouse, ensuring it’s structured properly for object detection models. The transformations will include:
- Cleaning and organizing metadata.
- Generating datasets optimized for model input.
An API will be developed to expose the processed data and object detection results for real-time insights. The API will be built using Flask or FastAPI and will provide endpoints for querying detection results and metadata.
The main Python libraries required are listed below. Install them using pip (dbt is installed as dbt-core plus a database adapter, here the PostgreSQL one):
    pip install telethon dbt-core dbt-postgres opencv-python torch torchvision fastapi uvicorn pydantic sqlalchemy
Here’s a high-level overview of the project’s structure:
├── app/
│   ├── templates/
│   ├── crud.py
│   ├── database.py
│   ├── main.py
│   ├── models.py
│   ├── schemas.py
│   ├── telegram_scraper.py
│   └── yolo_object_detection.py
├── data/
├── images/
├── logs/
├── dbt_medical_data/
│   ├── analaysis/
│   ├── macros/
│   ├── models/
│   ├── seeds/
│   ├── snapshots/
│   └── tests/
├── notebooks/
│   ├── telegram_scraper.py
│   ├── utils.py
│   └── raw_data/
├── scripts/
│   ├── __init__.py
│   ├── main.py
│   └── dbt_setup.py
├── src/
│   ├── telegram_scraper.py
│   ├── utils.py
│   └── raw_data/
├── tests/
│   ├── __init__.py
│   └── test_data_loader.py
├── yolov5/
│   ├── models/
│   ├── runs/
│   ├── utils/
│   ├── detect.py
│   ├── export.py
│   └── yolov5.pt
├── .gitignore
├── requirements.txt
└── README.md
The first step involves scraping textual and image data from public Telegram channels that focus on Ethiopian medical businesses. The data is collected using Python scripts and the Telethon library, which interfaces with Telegram's API.
- DoctorsET
- Chemed Telegram Channel
- Yetenaweg
- EAHCI
- Additional channels from https://et.tgstat.com/medicine
- Install Dependencies:
    pip install telethon
- Run the Scraper: Before running, create a .env file with your Telegram API credentials (API ID, API hash, and phone number). Example .env file:
    API_ID=your_api_id
    API_HASH=your_api_hash
    PHONE=your_phone_number
  Execute the script:
    python src/message_scraper.py
- Output:
  - Text data and metadata will be saved in a local database.
  - Image files will be stored in the images/ directory.
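For the credentials step, a minimal sketch of reading the three .env entries with the standard library only. The helper names `load_env` and `telegram_credentials` are hypothetical; the scraper itself may well use a package such as python-dotenv instead.

```python
from pathlib import Path


def load_env(path: str = ".env") -> dict:
    """Parse KEY=value lines from a .env file, ignoring blanks and comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def telegram_credentials(env: dict) -> tuple:
    """Return (api_id, api_hash, phone); Telegram's API ID is an integer."""
    missing = [k for k in ("API_ID", "API_HASH", "PHONE") if not env.get(k)]
    if missing:
        raise KeyError(f"Missing .env entries: {', '.join(missing)}")
    return int(env["API_ID"]), env["API_HASH"], env["PHONE"]
```

Failing early on a missing entry gives a clearer error than letting the Telegram login fail later.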
After scraping, the raw data is cleaned and transformed using DBT (Data Build Tool). This process involves removing duplicates, handling missing values, and standardizing formats for easy querying and analysis.
- Install DBT: Install dbt Core with the PostgreSQL adapter and initialize a new DBT project:
    pip install dbt-core dbt-postgres
    dbt init dbt_medical_data
- Define DBT Models: Define SQL models in the dbt_medical_data/models/ directory for cleaning and transforming data. Sample DBT model file:
    -- models/cleaned_telegram_data.sql
    select distinct
        message_id,
        message_text,
        timestamp::timestamp as message_time,
        channel_name
    from raw_data
    where message_text is not null
- Run DBT Models: Apply the transformations by running the DBT models:
    dbt run
- Testing: Test data quality using DBT's built-in test features:
    dbt test
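To see what the sample model does to the data, the same query can be tried against a throwaway SQLite database. This is only an illustration: dbt runs the model against the warehouse itself, the rows below are made up, and the PostgreSQL cast `timestamp::timestamp` is left out because SQLite does not support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_data (message_id INTEGER, message_text TEXT,"
    " timestamp TEXT, channel_name TEXT)"
)
# Made-up sample rows: one exact duplicate and one post without text
conn.executemany(
    "INSERT INTO raw_data VALUES (?, ?, ?, ?)",
    [
        (1, "Sample product post", "2024-01-01 10:00:00", "DoctorsET"),
        (1, "Sample product post", "2024-01-01 10:00:00", "DoctorsET"),
        (2, None, "2024-01-01 11:00:00", "DoctorsET"),
        (3, "Sample announcement", "2024-01-02 09:00:00", "EAHCI"),
    ],
)

# Same logic as the sample model: drop duplicates and rows without text
cleaned = conn.execute(
    """
    SELECT DISTINCT message_id, message_text,
           timestamp AS message_time, channel_name
    FROM raw_data
    WHERE message_text IS NOT NULL
    ORDER BY message_id
    """
).fetchall()

for row in cleaned:
    print(row)
```

Of the four input rows, the duplicate and the text-less post are dropped, leaving messages 1 and 3.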
In this task, we perform object detection on the scraped images using YOLOv5 to detect medical equipment, promotional materials, and other objects related to Ethiopian medical businesses.
- Install YOLO Dependencies: Install PyTorch and YOLOv5:
    pip install torch torchvision
    git clone https://github.com/ultralytics/yolov5.git
    cd yolov5
    pip install -r requirements.txt
- Prepare Images: Make sure the scraped images are in the images/ directory so they can be passed to the detector.
- Run YOLO: Run the YOLOv5 object detection script on that directory:
    cd yolov5
    python detect.py --source ../images
- Store Detection Results: The detection results (bounding boxes, class labels, and confidence scores) will be saved in a structured format, which will later be loaded into the data warehouse.
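One way to get the results into a structured form is to read YOLOv5's text labels. The sketch below assumes detect.py was run with the `--save-txt --save-conf` flags, which write one line per detection as class index, normalized x-center, y-center, width, height, and confidence. The names `Detection` and `parse_label_file` are hypothetical.

```python
from pathlib import Path
from typing import List, NamedTuple


class Detection(NamedTuple):
    class_id: int
    x_center: float   # all box values are normalized to the 0..1 range
    y_center: float
    width: float
    height: float
    confidence: float


def parse_label_file(path: str) -> List[Detection]:
    """Read one YOLOv5 label file written with --save-txt --save-conf."""
    detections = []
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        if len(parts) != 6:  # skip blank or malformed lines
            continue
        cls, x, y, w, h, conf = parts
        detections.append(
            Detection(int(float(cls)), float(x), float(y),
                      float(w), float(h), float(conf))
        )
    return detections
```

Each parsed `Detection` can then be written to the warehouse alongside the id of the image it came from.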
The data warehouse stores all the cleaned, transformed, and enriched data, enabling efficient querying and analysis. The data includes textual Telegram posts, image metadata, and YOLO object detection results.
- Install PostgreSQL: Install and configure PostgreSQL, or alternatively, use SQLite for local testing.
- Database Models: Define your database schema in app/models.py using SQLAlchemy:
    from sqlalchemy import Column, Float, ForeignKey, Integer, String
    from sqlalchemy.orm import declarative_base, relationship

    # Base class shared by all ORM models (import it instead if app/database.py already defines one)
    Base = declarative_base()

    class ImageMetadata(Base):
        __tablename__ = 'image_metadata'
        id = Column(Integer, primary_key=True)
        image_path = Column(String, nullable=False)
        channel_name = Column(String, nullable=False)
        timestamp = Column(String, nullable=False)
        # Counterpart of ObjectDetection.image, required by back_populates
        detections = relationship("ObjectDetection", back_populates="image")

    class ObjectDetection(Base):
        __tablename__ = 'object_detection'
        id = Column(Integer, primary_key=True)
        image_id = Column(Integer, ForeignKey('image_metadata.id'))
        bounding_box = Column(String, nullable=False)
        confidence = Column(Float, nullable=False)
        class_label = Column(String, nullable=False)
        image = relationship("ImageMetadata", back_populates="detections")
- Migrate Database: Initialize the database and create the tables:
    python app/database.py
To expose the processed data via an API, FastAPI is used to create RESTful endpoints. These endpoints allow users to query the data warehouse for images, detections, and associated metadata.
- Install FastAPI:
    pip install fastapi uvicorn
- Create FastAPI Application: Define routes in app/main.py:
    from fastapi import FastAPI, Depends
    from sqlalchemy.orm import Session
    from .crud import get_detections
    from .database import SessionLocal

    app = FastAPI()

    def get_db():
        # Open one session per request and always close it afterwards
        db = SessionLocal()
        try:
            yield db
        finally:
            db.close()

    @app.get("/detections/{image_id}")
    def read_detections(image_id: int, db: Session = Depends(get_db)):
        detections = get_detections(db, image_id=image_id)
        return detections
- Run FastAPI: Start the FastAPI server:
    uvicorn app.main:app --reload
- Access the API: Visit http://127.0.0.1:8000/docs to explore the automatically generated API documentation.
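Once the server is running, the `/detections/{image_id}` route can also be called from a script. A minimal client sketch using only the standard library follows; the function name `fetch_detections` and the `opener` parameter are hypothetical, and the shape of the returned JSON depends on what `get_detections` returns, which is not shown here.

```python
import json
from urllib.request import urlopen


def fetch_detections(image_id: int,
                     base_url: str = "http://127.0.0.1:8000",
                     opener=urlopen):
    """GET /detections/{image_id} and decode the JSON body.

    `opener` exists only so the HTTP call can be replaced in tests.
    """
    url = f"{base_url}/detections/{image_id}"
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_detections(1)` requests the detections stored for the image with id 1 from the local development server.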
- Data Enrichment: Add more sources of data, such as public medical directories or customer reviews, to provide a richer dataset.
- Machine Learning Models: Build predictive models to analyze trends in medical products or promotional effectiveness.
- Fine-tune YOLO: Train the YOLO model on specific Ethiopian medical products and packaging to improve detection accuracy.
By following these steps, you can set up a fully operational data pipeline for scraping, cleaning, transforming, analyzing, and querying data on Ethiopian medical businesses.