WARNING: This project only runs on ARM64 chips.
- Project Objective
- Datasets Selection
- System Architecture
- Technologies Used
- Deployment
- Result
- Authors
The project aims to build a Smart Traffic Management System
that leverages real-time data processing, advanced visualization, and monitoring to optimize traffic flow, enhance safety, and support sustainable urban mobility. Key objectives include:
- Real-Time Insights: Deliver actionable traffic data on congestion, accidents, and vehicle movement.
- Data Integration: Process and store high-volume traffic data from IoT devices and sensors.
- Visualization: Provide intuitive dashboards for traffic trends, weather, and parking analysis.
- Monitoring: Ensure system reliability with real-time infrastructure monitoring.
- User Applications: Offer real-time traffic updates through user-friendly apps.
This system transforms raw traffic data into valuable insights, enabling smarter, safer, and greener cities.
- Source: ParkingDB_HCMCity_PostgreSQL
This database manages operations for a parking lot system in Ho Chi Minh City, Vietnam, tracking everything from parking records to customer feedback. The database contains operational data for managing parking facilities, including vehicle tracking, payment processing, customer management, and staff scheduling.
- Source: GasStationDB_HCMCity_PostgreSQL
This database manages operations for a chain of gas stations in Ho Chi Minh City, Vietnam, tracking everything from fuel sales to inventory management. The database contains operational data for managing gas stations, including sales transactions, inventory tracking, customer management, and employee records.
- Source: IOT_RoadTransport_HCMCity
The dataset provided describes information about a vehicle (in this case, a motorbike) moving along a specific road in Ho Chi Minh City. It includes various details about the vehicle, its owner, weather conditions, traffic status, and alerts related to the vehicle during its journey. This data can be used in traffic monitoring systems, vehicle operation analysis, or smart transportation services.
Here is the schema of the provided JSON, described in a hierarchical structure:
```json
{
  "vehicle_id": "string",
  "owner": {
    "name": "string",
    "license_number": "string",
    "contact_info": {
      "phone": "string",
      "email": "string"
    }
  },
  "speed_kmph": "float",
  "road": {
    "street": "string",
    "district": "string",
    "city": "string"
  },
  "timestamp": "string",
  "vehicle_size": {
    "length_meters": "float",
    "width_meters": "float",
    "height_meters": "float"
  },
  "vehicle_type": "string",
  "vehicle_classification": "string",
  "coordinates": {
    "latitude": "float",
    "longitude": "float"
  },
  "engine_status": {
    "is_running": "boolean",
    "rpm": "int",
    "oil_pressure": "string"
  },
  "fuel_level_percentage": "int",
  "passenger_count": "int",
  "internal_temperature_celsius": "float",
  "weather_condition": {
    "temperature_celsius": "float",
    "humidity_percentage": "float",
    "condition": "string"
  },
  "estimated_time_of_arrival": {
    "destination": {
      "street": "string",
      "district": "string",
      "city": "string"
    },
    "eta": "string"
  },
  "traffic_status": {
    "congestion_level": "string",
    "estimated_delay_minutes": "int"
  },
  "alerts": [
    {
      "type": "string",
      "description": "string",
      "severity": "string",
      "timestamp": "string"
    }
  ]
}
```
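As a minimal sketch of how a record matching this schema could enter the streaming pipeline, the snippet below publishes one vehicle reading to Kafka with `kafka-python`. The broker address and the topic name `iot_vehicle` are illustrative assumptions, not the project's actual configuration, and the record only fills a subset of the schema's fields.

```python
# Sketch: publish one IoT vehicle record (subset of the schema above) to Kafka.
# Broker address and topic name are assumptions for illustration only.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

record = {
    "vehicle_id": "59-X1-234.56",
    "owner": {"name": "Nguyen Van A", "license_number": "790123456789",
              "contact_info": {"phone": "+84 90 000 0000", "email": "a@example.com"}},
    "speed_kmph": 32.5,
    "road": {"street": "Nguyen Hue", "district": "District 1", "city": "Ho Chi Minh City"},
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "coordinates": {"latitude": 10.7769, "longitude": 106.7009},
    "engine_status": {"is_running": True, "rpm": 3500, "oil_pressure": "normal"},
    "fuel_level_percentage": 70,
    "passenger_count": 1,
    "traffic_status": {"congestion_level": "moderate", "estimated_delay_minutes": 5},
    "alerts": [],
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("iot_vehicle", record)                  # hypothetical topic name
producer.flush()
```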
- Source: Traffic_Accidents_HCMCity
The dataset provided contains information about road accidents that took place on various roads in Ho Chi Minh City, Vietnam. It includes details about each accident, such as the vehicles involved, severity, accident time, recovery time, and the traffic congestion caused by the accident. This data can be useful for traffic management systems, accident reporting, and analyzing traffic patterns.
Here is the schema of the provided JSON, described in a hierarchical structure:
```json
{
  "road_name": "string",
  "district": "string",
  "city": "string",
  "vehicles_involved": [
    {
      "vehicle_type": "string",
      "vehicle_id": "string"
    }
  ],
  "accident_severity": "int",
  "accident_time": "string",
  "number_of_vehicles": "int",
  "estimated_recovery_time": "string",
  "congestion_km": "float",
  "description": "string"
}
```
- Source: VisualCrossing
The dataset provided contains detailed weather information for Ho Chi Minh City, Vietnam, including temperature, humidity, wind conditions, precipitation, and other meteorological measurements. This data is collected hourly and aggregated daily, making it useful for weather forecasting, climate analysis, and urban planning applications.
Here is the schema of the provided JSON, described in a hierarchical structure:
```json
{
  "latitude": "number",
  "longitude": "number",
  "resolvedAddress": "string",
  "address": "string",
  "timezone": "string",
  "tzoffset": "number",
  "days": [
    {
      "datetime": "string",
      "datetimeEpoch": "number",
      "tempmax": "number",
      "tempmin": "number",
      "temp": "number",
      "feelslike": "number",
      "humidity": "number",
      "precip": "number",
      "windspeed": "number",
      "winddir": "number",
      "pressure": "number",
      "cloudcover": "number",
      "visibility": "number",
      "uvindex": "number",
      "sunrise": "string",
      "sunset": "string",
      "conditions": "string",
      "hours": [
        {
          "datetime": "string",
          "temp": "number",
          "feelslike": "number",
          "humidity": "number",
          "precip": "number",
          "windspeed": "number",
          "winddir": "number",
          "conditions": "string"
        }
      ]
    }
  ]
}
```
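Payloads in this shape can be pulled directly from VisualCrossing's public Timeline API. The sketch below is one way to fetch them with `requests`; the endpoint and query parameters follow VisualCrossing's documented Timeline API, while the API key is a placeholder you must supply yourself.

```python
# Sketch: fetch daily + hourly weather JSON for Ho Chi Minh City from the
# VisualCrossing Timeline API. The API key below is a placeholder.
import requests

API_KEY = "YOUR_VISUALCROSSING_KEY"  # placeholder, not a real key
url = (
    "https://weather.visualcrossing.com/VisualCrossingWebServices"
    "/rest/services/timeline/Ho Chi Minh City"
)
resp = requests.get(
    url,
    params={"unitGroup": "metric", "key": API_KEY, "contentType": "json"},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()

# The structure matches the schema above: one entry per day, each with hourly readings.
for day in payload["days"][:1]:
    print(day["datetime"], day["temp"], day["conditions"])
    for hour in day.get("hours", [])[:3]:
        print(" ", hour["datetime"], hour["temp"], hour["conditions"])
```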
The Data Lakehouse architecture implemented in this project is designed to accommodate both batch and real-time streaming data, integrating multiple data sources into a cohesive analytics platform. It follows the Medallion Architecture pattern, which organizes data into Bronze, Silver, and Gold layers, each serving a specific role in the data lifecycle.
The system is divided into several key components, each responsible for specific tasks within the data pipeline:
- Seatunnel: Handles CDC (Change Data Capture) using Debezium format
- Apache NiFi: Manages data routing and transformation
- Apache Kafka & Zookeeper: Handles real-time data streaming
- Redpanda UI: Provides Kafka management interface
- Apache Spark: Batch processing
- Apache Flink: Stream processing
- Apache Airflow: Workflow orchestration
- Redis: Data caching
- MinIO: Object storage
- lakeFS: Data versioning
- Apache Hudi: Storage format for Silver layer
- Apache Hive: Data warehouse for Gold layer
- PostgreSQL: Metastore backend
- ClickHouse: OLAP database
- Prometheus: Metrics storage
- Trino: Distributed SQL query engine
- Streamlit: Interactive dashboards
- Metabase: Business analytics
- Apache Superset: Data exploration
- Grafana: Metrics visualization
The Bronze stage serves as the initial landing zone for raw data, implementing a robust ingestion and validation pipeline:
- Weather API Data: Live weather metrics and forecasts
- Traffic Data: Real-time traffic conditions and events
- Vehicle IOT Data: Streaming vehicle telemetry and sensor data
- Initial Collection
  - `Apache NiFi` orchestrates data collection from all sources
  - Implements initial data validation and formatting
  - Ensures data completeness and basic quality checks
- Stream Processing
  - Data streams are directed to `Kafka` (managed via `Redpanda UI`)
  - Implements message validation and schema verification
  - Maintains data lineage and source tracking
The Bronze stage implements a sophisticated version-controlled storage strategy:
- Branch Management
  - Creates temporary branches from the main Bronze repository
  - Implements atomic commits for data consistency
  - Maintains data versioning through `lakeFS`
- Commit Workflow
  - Automated validation triggers on commit
  - `Airflow` DAGs orchestrate `Spark` jobs for data verification (see the sketch after this list)
  - Successful validation merges changes to the main branch
- Caching Layer: `Redis` implementation for frequently accessed data
- Monitoring Stack:
  - `Prometheus` metrics collection
  - `Grafana` dashboards for real-time monitoring
  - Performance metrics and SLA tracking
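As a concrete illustration of the commit workflow above, here is a minimal sketch of the kind of validation job an `Airflow` DAG could submit to `Spark`: it reads raw data from a temporary lakeFS branch through the lakeFS S3 gateway and fails if basic completeness checks do not pass. The repository, branch, path, endpoint, and credential values are illustrative assumptions, not the project's actual configuration.

```python
# Sketch: Bronze-stage validation job. Reads raw weather JSON from a temporary
# lakeFS branch (via the S3 gateway) and fails fast on incomplete records.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("bronze_weather_validation")
    .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000")   # assumed lakeFS S3 gateway
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")          # placeholder credentials
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# s3a://<repo>/<branch>/<path> — hypothetical Bronze repository and temp branch
df = spark.read.json("s3a://bronze/tmp-weather-20240101/weather/")

# Basic completeness checks before the branch is merged into main.
bad_rows = df.filter(F.col("address").isNull() | F.col("days").isNull()).count()
if bad_rows > 0:
    raise ValueError(f"Validation failed: {bad_rows} incomplete weather records")
print(f"Validated {df.count()} records; branch can be merged into main")
```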
The Silver stage combines multiple data streams and implements advanced processing:
- PostgreSQL Integration:
  - Real-time monitoring of `ParkingLot` and `StorageTank` tables
  - `Seatunnel` implementation with `Debezium` format
  - Maintains data consistency and transaction order
- Real-time Processing Pipeline
  - `Flink` processes incoming `Kafka` streams
  - Implements business logic and data enrichment
  - Maintains stateful processing for complex operations
- Data Quality Management
  - `Spark Streaming` validates processed data (see the sketch after this list)
  - Implements data quality rules and business constraints
  - Maintains data consistency across branches
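A minimal sketch of the data-quality step above, using Spark Structured Streaming: read the enriched traffic stream from Kafka, parse it, filter out rows that violate simple business constraints, and write the valid rows to the staging area. The topic name, broker address, output path, and constraint values are illustrative assumptions.

```python
# Sketch: Spark Structured Streaming quality check on the enriched traffic stream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("silver_traffic_quality").getOrCreate()

schema = StructType([
    StructField("vehicle_id", StringType()),
    StructField("speed_kmph", DoubleType()),
    StructField("timestamp", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")       # assumed broker
    .option("subscribe", "silver_traffic_enriched")         # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

parsed = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("r")).select("r.*")

# Example business constraint: speed must be non-negative and below a sanity bound.
valid = parsed.filter((F.col("speed_kmph") >= 0) & (F.col("speed_kmph") < 200))

query = (
    valid.writeStream.format("parquet")
    .option("path", "s3a://silver/staging/traffic/")          # assumed staging location
    .option("checkpointLocation", "s3a://silver/_checkpoints/traffic/")
    .start()
)
query.awaitTermination()
```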
The Silver stage implements a sophisticated branching strategy:
- Change Detection
  - Continuous monitoring of data changes
  - `Spark` jobs compare incoming data with staging
  - Implements delta detection algorithms
- Branch Management
  - Creates temporary branches for changes
  - Implements validation before commit
  - Maintains data lineage and audit trails
- Hourly Synchronization:
  - `Airflow` DAGs orchestrate main branch updates (see the DAG sketch below)
  - Implements merge conflict resolution
  - Maintains data consistency and quality
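A minimal sketch of an hourly synchronization DAG in the spirit described above: validate the staging branch with a Spark job, then merge it into main. The task commands, script path, and lakeFS repository and branch names are illustrative assumptions, not the project's actual DAG code.

```python
# Sketch: hourly Airflow DAG that validates the staging branch and merges it to main.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="silver_staging_to_main_hourly_sync",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    validate = BashOperator(
        task_id="validate_staging",
        bash_command="spark-submit /opt/jobs/validate_staging.py",   # hypothetical job path
    )
    merge_to_main = BashOperator(
        task_id="merge_staging_into_main",
        # lakectl ships with lakeFS; repository and branch names are assumptions.
        bash_command="lakectl merge lakefs://silver/staging lakefs://silver/main",
    )
    validate >> merge_to_main
```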
- `ClickHouse` collects processed `Kafka` data
  - `Superset` provides visualization capabilities
  - Implements real-time analytics queries
- `Streamlit` applications consume processed data (see the sketch after this list)
  - Combines `Redis` cache with `PostgreSQL` data
  - Provides real-time user interfaces
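A minimal sketch of a Streamlit page following the serving pattern above, combining the Redis cache with PostgreSQL. The key name, table name, and connection settings are illustrative assumptions.

```python
# Sketch: Streamlit page that mixes a Redis cache lookup with a PostgreSQL query.
import json

import psycopg2
import redis
import streamlit as st

st.title("Live Traffic Overview")

# Latest congestion snapshot cached in Redis (hypothetical key name).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
snapshot = r.get("traffic:congestion:latest")
if snapshot:
    st.subheader("Current congestion (from Redis cache)")
    st.json(json.loads(snapshot))

# Historical aggregates served from PostgreSQL (hypothetical table).
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="traffic",
    user="ren294", password="CHANGE_ME",   # placeholder credentials
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT road_name, avg_speed_kmph FROM hourly_road_speed "
        "ORDER BY avg_speed_kmph LIMIT 10"
    )
    rows = cur.fetchall()
st.subheader("Slowest roads in the last hour (from PostgreSQL)")
st.table(rows)
```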
The Gold stage implements a robust dimensional model:
- Daily Processing
  - Scheduled `Airflow` DAGs process Silver layer data
  - Implements slowly changing dimension logic
  - Maintains historical accuracy and tracking
- Aggregation Pipeline
  - `Spark` jobs perform complex aggregations
  - Implements business rules and calculations
  - Maintains data consistency and accuracy
- Real-time Updates
  - Triggered by Silver layer changes
  - Implements upsert operations for fact tables
  - Maintains transactional consistency
- Batch Processing
  - Daily scheduled updates from the Silver layer
  - Implements full refresh of fact tables (see the sketch further below)
  - Maintains historical accuracy
- Workflow Management:
  - `Airflow` orchestrates all `Spark` jobs
  - Implements dependency management
  - Maintains process monitoring and logging
- Caching Strategy:
  - `Redis` caches frequently accessed data
  - Implements cache invalidation rules
  - Maintains performance SLAs
- Quality Assurance:
  - `Prometheus` metrics collection
  - `Grafana` dashboards for monitoring
  - Implements SLA monitoring and alerting
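As a minimal sketch of the Gold batch path described above, the job below reads a Silver table stored in Hudi format, aggregates it into a daily fact, and fully refreshes the corresponding table in the Hive warehouse. The table names, paths, and column names are illustrative assumptions, not the project's actual job code.

```python
# Sketch: daily Gold-layer Spark job — Silver (Hudi) -> aggregated fact table (Hive).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("gold_fact_traffic_daily")
    .enableHiveSupport()          # requires a configured Hive metastore
    .getOrCreate()
)

# Silver traffic data stored as a Hudi table (path is an assumption; the Hudi
# Spark bundle must be on the classpath).
traffic = spark.read.format("hudi").load("s3a://silver/main/traffic/")

fact = (
    traffic.withColumn("date_key", F.to_date("timestamp"))
    .groupBy("date_key", "street", "district")
    .agg(
        F.avg("speed_kmph").alias("avg_speed_kmph"),
        F.count("*").alias("vehicle_count"),
    )
)

# Full refresh of the Gold fact table in the Hive warehouse (batch path).
fact.write.mode("overwrite").saveAsTable("gold.fact_traffic_daily")
```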
The Gold Layer in a data lakehouse architecture represents the final, refined stage where clean, processed, and analytics-ready data is stored. This layer is specifically designed for consumption by business intelligence tools, data scientists, and analysts. In the context of a Traffic Data Warehouse, the Gold Layer is where critical business metrics, aggregated datasets, and key insights are stored in an optimized format.
The schema of the Data Warehouse:
To deploy and run this Data Lakehouse project, the following system requirements are necessary:
- Processor: `ARM64` chip with at least 12 CPU cores.
- Memory: 32 GB of RAM.
- Storage: 50 GB of free storage space.
- Operating System: A Linux-based OS supporting the `ARM64` architecture.
- Docker: Pre-installed to run all components in isolated containers.
- Docker Compose: Pre-installed for orchestrating multi-container Docker applications.
- Git: For version control and project deployment
- Run the following command to clone the project repository:
  ```bash
  git clone https://github.com/Ren294/SmartTraffic_Lakehouse_for_HCMC.git
  ```
- Navigate to the project directory:
  ```bash
  cd SmartTraffic_Lakehouse_for_HCMC
  ```
2.1. Grant Execution Permissions for Shell Scripts
- Once inside the project directory, ensure all shell scripts have the correct execution permissions:
  ```bash
  chmod -R +x ./*
  ```
- This command grants execute permissions to all shell scripts in the project directory, allowing them to be run without any issues.
2.2. Start Docker Services
- Start all services with Docker Compose:
  ```bash
  docker-compose up -d
  ```
2.3. Update LakeFS Initial Credentials
- Watch the `LakeFS` container logs to get the initial credentials:
  ```bash
  docker logs lakefs | grep -A 2 "Login"
  ```
- You should see output similar to this:
  ```
  Login at http://127.0.0.1:8000/
  Access Key ID    : XXXXXXXXXXXXXXXXXXX
  Secret Access Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  ```
- Edit the init.sh file and update these lines with your `LakeFS` credentials:
  ```bash
  username='AKIAJC5AQUW4OXQYCRAQ'                          # Replace with your Access Key ID
  password='iYf4H8GSjY+HMfN70AMquj0+PRpYcgUl0uN01G7Z'      # Replace with your Secret Access Key
  ```
2.4. Initialize the Project Environment
- Run the initialization script:
  ```bash
  ./init.sh
  ```
- This script will perform the following tasks:
  - Download required data from Kaggle
  - Update `LakeFS` credentials across all necessary files
  - Initialize storage (`MinIO` buckets and `LakeFS` repositories)
  - Set up the `Airflow` connector
  - Create the `Debezium` connector
  - Upload `Flink` jobs
  - Submit `Spark Streaming` jobs
2.5. Configure and Run Apache NiFi
To kick off your data processing workflows, follow these steps to configure and run Apache NiFi:
- Open the `NiFi Web UI` by navigating to `http://localhost:8443/nifi` in your browser.
- Add the template for the `NiFi` workflow:
- You will now see several key components in your NiFi workflow:
  - `TrafficDataIngestion`: Collects and integrates raw traffic data.
  - `TrafficAccidentsIngestion`: Collects and integrates raw traffic accident data.
  - `WeatherIngestion`: Collects and integrates raw weather data.
  - `ProcessFromIngestion`: Sends data from the ingestion pipelines to `Kafka` for downstream consumption.
  - `TrafficDataCommitToBronzeLayer`: Transfers traffic data into the Bronze layer for storage.
  - `TrafficAccidentCommitToBronzeLayer`: Loads traffic accident data into the Bronze layer for storage.
  - `WeatherCommitToBronzeLayer`: Stores ingested weather data into the Bronze layer for foundational analytics.
2.6. Run Apache Airflow DAGs
- Open the `Airflow Web UI` by navigating to `http://localhost:6060` in your browser.
- Log in with:
  - Username: ren294
  - Password: ren294
- After logging in, activate the DAGs (Directed Acyclic Graphs) that control the data workflows. The project ships with the pre-configured `DAGs` listed below, which you can activate from the `Airflow` dashboard.
Here are the DAGs and their respective functions:
- `Bronze_Traffic_accident_Validation_DAG`: Validates traffic accident data and merges it into the main branch.
- `Bronze_Traffic_data_Validation_DAG`: Validates traffic data and integrates it into the main branch.
- `Bronze_Weather_Validation_DAG`: Validates weather data and integrates it into the main branch.
- `Silver_Staging_Gasstation_Validate_DAG`: Handles the synchronization of gas station data from PostgreSQL to Hudi tables.
- `Silver_Staging_Parking_Validate_DAG`: Handles the synchronization of parking data from PostgreSQL to Hudi tables.
- `Silver_to_Staging_Accidents_Merge_DAG`: Merges traffic accident data from a temporary branch into the staging branch.
- `Silver_to_Staging_Parkinglot_Merge_DAG`: Merges parking lot data from a temporary branch into the staging branch.
- `Silver_to_Staging_Storagetank_Merge_DAG`: Merges storage tank data from a temporary branch into the staging branch.
- `Silver_to_Staging_Traffic_Merge_DAG`: Merges traffic data from a temporary branch into the staging branch.
- `Silver_to_Staging_Weather_Merge_DAG`: Merges weather data from a temporary branch into the staging branch.
- `Silver_Main_Gasstation_Sync_DAG`: Merges gas station data from the staging branch into the main branch.
- `Silver_Main_Parking_Sync_DAG`: Merges parking data from the staging branch into the main branch.
- `Staging_to_Main_Accidents_Merge_DAG`: Merges traffic accident data from the staging branch into the main branch.
- `Staging_to_Main_Traffic_Merge_DAG`: Merges traffic data from the staging branch into the main branch.
- `Staging_to_Main_Weather_Merge_DAG`: Merges weather data from the staging branch into the main branch.
- `Gold_Dimension_Tables_Load_DAG`: Creates dimension tables and commits them to the main branch.
- `Gold_Fact_Tables_WithUpsert_Load_DAG`: Updates and commits fact tables with upsert functionality into the main branch.
- `Gold_Fact_Tables_WithOverwrite_Load_DAG`: Updates and commits fact tables with overwrite functionality into the main branch.
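If you prefer scripting over the UI, DAGs can also be triggered through Airflow's stable REST API. This is a sketch only: it assumes the API is enabled with basic authentication and reachable on the same port as the web UI shown above, using the login credentials from the previous step.

```python
# Sketch: trigger one of the project's DAGs via the Airflow REST API.
import requests

AIRFLOW = "http://localhost:6060/api/v1"   # assumed to match the web UI port above
AUTH = ("ren294", "ren294")                 # credentials from the login step above

resp = requests.post(
    f"{AIRFLOW}/dags/Bronze_Weather_Validation_DAG/dagRuns",
    auth=AUTH,
    json={"conf": {}},
    timeout=30,
)
resp.raise_for_status()
print("Started run:", resp.json()["dag_run_id"])
```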
2.7. Apache Flink
- Open the `Apache Flink` Web UI by navigating to `http://localhost:8085` in your browser. You can view active jobs, completed jobs, and failed jobs, along with their associated details such as execution plans and performance metrics.
2.8. Kafka via Redpanda UI
- Access the `Redpanda` Web UI by navigating to `http://localhost:1010` in your browser. This intuitive interface provides an efficient way to manage and monitor your Kafka infrastructure.
2.9. Redis
- Access the `Redis Insight` dashboard by visiting `http://localhost:6379` in your browser. `Redis Insight` allows you to inspect keys, monitor performance metrics, and visualize data structures like hashes, sets, and lists.
- Use this dashboard to perform administrative tasks, such as analyzing key usage and monitoring latency, to ensure `Redis` is performing optimally for your application. The intuitive UI helps streamline debugging and data analysis.
2.10. ClickHouse via DBeaver
- To connect to the `ClickHouse` database through `DBeaver`, configure the connection using the following settings:
  - Host: localhost
  - Port: 18123
  - Database/Schema: smart_traffic_db
  - Username: ren294
  - Password: trungnghia294
- Once the connection is established, the `DBeaver` interface will display the connected database, as shown below:
- Explore the schema and table structures within the smart_traffic_db database to understand the relationships and data stored.
- Use `DBeaver`'s SQL editor to execute queries, analyze data, and visualize query results directly within the tool.
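The same queries can also be sent programmatically to ClickHouse's HTTP interface, which is the port DBeaver connects to above. This is a minimal sketch; the table name in the query is an illustrative assumption.

```python
# Sketch: run a query against ClickHouse's HTTP interface using the connection
# settings listed above. The table name is hypothetical.
import requests

resp = requests.post(
    "http://localhost:18123/",
    params={"database": "smart_traffic_db"},
    auth=("ren294", "trungnghia294"),
    data="SELECT count() FROM traffic_events FORMAT JSON",   # hypothetical table
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"])
```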
2.11. Trino via DBeaver
- Connect to the `Trino` query engine via `DBeaver` by setting up a connection with the following details:
  - Host: localhost
  - Port: 8080
  - Database/Schema: default
  - Username: admin
  - Password: (leave blank)
- After successfully connecting, you can view the Trino instance in DBeaver as shown below:
- The database schema will appear in a graphical representation for easy navigation and exploration:
2.12. Prometheus
- Launch the `Prometheus` Web UI by navigating to `http://localhost:9090` in your browser.
- Use the `Prometheus` query interface to explore metrics and analyze performance trends. The metrics browser allows you to filter and visualize time-series data based on specific query expressions.
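The same metrics can be pulled programmatically through Prometheus' HTTP API, which is handy for quick health checks. The PromQL expression `up` below simply lists which scrape targets are currently reachable.

```python
# Sketch: query Prometheus' HTTP API for the "up" metric of all scrape targets.
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "up"},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("job"), "=>", result["value"][1])
```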
- Access `Superset` by navigating to `http://localhost:8081` in your web browser.
- Log in using the following credentials:
  - Username: ren294
  - Password: trungnghia294
- Import the pre-configured templates for charts, dashboards, datasets, and queries from the superset/Template folder. These templates are designed to provide seamless insights and accelerate your visualization setup.
- After successfully importing, you will have access to a comprehensive collection of visualizations, including:
  - Dashboards:
  - Charts:
  - Datasets:
- Superset's interactive interface allows you to drill down into data, create new visualizations, and customize dashboards to meet your analytical needs.
- Open `Metabase` by navigating to `http://localhost:3000` in your browser.
- Connect to the `Trino` query engine by setting up a connection with the following details:
  - Host: trino-coordinator
  - Port: 8080
  - Database/Schema: hive
  - Username: admin
  - Password: (leave blank)
- Use the SQL query files located in the metabase/dashboard folder to create visually appealing and interactive dashboards. These pre-written queries are optimized for extracting meaningful insights, making it easier to visualize complex datasets.
- With `Metabase`, you can effortlessly build and customize dashboards, explore data trends, and share insights with your team.
- Access `Grafana` for monitoring by navigating to `http://localhost:3001` in your browser.
- Log in using the following credentials:
  - Username: ren294
  - Password: trungnghia294
- Import the pre-built monitoring dashboards from the grafana/Dashboard folder to instantly visualize critical system metrics and performance indicators.
- `Grafana` enables real-time monitoring with customizable dashboards, allowing you to track system health, detect anomalies, and ensure smooth operation. The intuitive interface provides detailed insights into your infrastructure and application performance.
Gain instant insights into real-time traffic conditions with this comprehensive dashboard, offering detailed analytics on road usage and congestion patterns.
Monitor and analyze accident data in real-time to identify trends and enhance road safety measures effectively.
Stay updated on real-time weather conditions, helping to correlate traffic and safety insights with environmental factors.
Dive into detailed parking transaction data, enabling better resource planning and operational insights.
Track vehicle movements across key areas to understand traffic flow and optimize transport infrastructure.
Visualize and explore traffic incident patterns to improve emergency response and preventive measures.
Analyze fuel consumption patterns and gas station transactions to better understand travel behavior and fuel efficiency.
- Open the `Streamlit` Web UI by navigating to `http://localhost:8501` in your browser.
Access live updates on traffic conditions for the current road, providing essential details for drivers.
Stay informed about real-time traffic on three nearby roads, empowering better route planning and decision-making.
Visualize and track the performance and health of your Airflow workflows with a dedicated monitoring dashboard.
Keep a close eye on ClickHouse database performance metrics to ensure efficient query processing.
Track the performance and stability of Apache Flink jobs, ensuring optimal real-time data processing.
Monitor Kafka brokers and topics in real-time to maintain smooth message streaming and processing pipelines.
Oversee MinIO storage performance, ensuring reliability and availability for data storage.
Gain insights into the performance of Apache NiFi workflows, tracking throughput and data processing efficiency.
Monitor Redis key-value store performance and optimize caching and session management tasks.
Track Apache Spark job performance, resource utilization, and system health to maintain efficient batch and streaming processing.
Ensure the stability of your containerized environment by monitoring Docker performance metrics in real-time.
Nguyen Trung Nghia
- Contact: trungnghia294@gmail.com
- GitHub: Ren294
- LinkedIn: tnghia294