The US Candy Distributor SQL Project is a data analysis project centered around SQL queries. Using sales and geospatial data from a candy distributor, the goal of this project is to explore and analyze the data through a series of SQL queries, providing actionable insights into shipping efficiency, product margins, and sales performance.
This project focuses on optimizing shipping routes, analyzing product margins, and identifying areas for improvement in the candy distributor’s operations using SQL. It is designed for data engineers to practice and showcase their SQL skills in solving real-world business problems.
- Efficient Factory-to-Customer Shipping Routes: Analyze shipping routes between factories and customers using SQL to identify the most and least efficient routes.
- Product Margin Analysis: Use SQL to calculate profit margins for products and identify which product lines generate the highest margins.
- Factory Relocation Insights: Analyze the current factory-to-customer distribution and recommend product line reallocation based on efficiency and cost.
- Sales Target Achievement: Query sales data to determine how well the distributor is meeting targets and where improvements can be made.
- SQL Server: For querying and analysis,Join multiple tables to combine relevant data,Perform aggregations, calculations, and filtering to derive insights.
- Python: For data cleaning and transformations.
- Microsoft Excel: For initial data exploration and formatting.
Follow these steps to set up the project environment:
Navigate to your GitHub repository:us_candy_distribution_analysis_project Clone the repository locally using the green Code button and choose your preferred method (HTTPS, SSH, or GitHub CLI).
Install and configure SQL Server on your system. Create a new database for the project using the SQL scripts provided in the schema folder.
Download the datasets from Maven Analytics Data Playground Populate the database using the provided SQL scripts. Clean and preprocess the datasets using Python scripts available in the scripts folder.
This project uses multiple related tables, each containing valuable information for the analysis.
-
Candy_Factories
Contains factory locations with geographic data (latitude, longitude) to help optimize shipping routes. -
Candy_Products
Provides detailed information about products, their pricing, and manufacturing factories. -
Candy_Sales
Records sales transactions, including order details, shipment dates, product sales, and costs. -
Candy_Targets
Holds target sales data for each division, enabling performance tracking and comparison. -
US_zips
Geospatial data about US zip codes includes city, state, population, and geographic coordinates to support route analysis.
The key areas for analysis in this project are:
-
Factory-to-Customer Shipping Route Optimization
Analyzing shipping routes between candy factories and customer locations to identify the most and least efficient routes. Optimize these routes by considering shipping costs, travel distances, and factory capacity. -
Product Margin Analysis
Calculate the margins for each product by comparing unit costs with unit prices, and identify the most profitable product lines. -
Factory Relocation Strategy
Suggest which products should be moved to different factories based on their shipping efficiency and geographic location, potentially reducing overall transportation costs. -
Sales Target Achievement
Analyze how well the candy distributor is meeting sales targets for different product divisions and recommend improvement strategies.
- Maven Analytics for providing the dataset and allowing us to explore this real-world business problem.
- US Candy Distributor for their data-driven approach to enhancing logistics and product sales.
- Data Engineers and Analysts who contributed to curating and analyzing the dataset.
This dataset is provided under the Public Domain license, allowing you to use and share the data freely for educational and analytical purposes.
This project offers a comprehensive look at data engineering in a logistics and product sales context. By working with real-world data, you will gain valuable insights into optimizing factory locations, shipping routes, and product margins, all while refining your skills as a data engineer.
The following are a few of the questions/scenarios that are answered in this project. check out the full SQL queries in the detailed article here
- Retrieve All Products from a Specific Factory
- Calculate Product Margins
- Get Total Sales and Units Sold by Division
- Find Top 5 Cities by Total Sales
- Get Factory-Wise Product Count
- Find Products with Maximum Price Difference Between Units
- Identify Orders with Late Shipments
- Search for Factory Names Containing Special Characters
- Calculate Running Totals of Sales and Gross Profit by Region
- Find Top Products by Units Sold Using Percentile<
- Calculate Average Sales per Day for Each Division
- Find the Most Recent Orders for Each Division
- Calculate Year-over-Year Sales Growth by Division
- Find the most populated cities for each state
- Product Margin Calculation
- Factory Optimization Recommendation
- Identify the Most Efficient Shipping Routes (Based on Distance)
- Compare Sales vs Target by Division
- Get Sales Target Achievement by Division
- Determine Which Regions Exceeded Their Targets
- Generate a List of Consecutive Orders for Each Customer
- Calculate Monthly Sales Trends for the Top Regions
- Identify Orders That Miss Sales Targets by Division and Region
- Find the Most Profitable Product by Factory and Division