This project leverages PySpark for a comprehensive analysis of sales data, combining it with menu data to generate insightful reports and visualizations. The project demonstrates how to use PySpark for ETL (Extract, Transform, Load) processes and data analysis, providing valuable insights into customer spending, product popularity, and sales trends.
```
.
├── Data
│   ├── Dashboard.gif
│   ├── menu.csv.txt
│   └── sales.csv.txt
├── Pyspark_project.html
├── Pyspark_project_notebook.html
├── Pyspark_project_notebook.ipynb
└── README.md
```
The project reads sales and menu data from CSV files, processes them with PySpark, and runs a series of analyses to derive insights. The following operations are performed:
- Data Loading: Reading sales and menu data from CSV files.
- Data Transformation: Extracting year, month, and quarter from the order dates.
- Data Aggregation and Analysis:
  - Total amount spent by each customer.
  - Total amount spent on each product.
  - Monthly, yearly, and quarterly sales amounts.
  - Product sales counts and top-selling products.
  - Customer visit analysis.
  - Sales by location.
  - Sales by order source.
- Sales Data: Contains details of each sale, including product ID, customer ID, order date, location, and source of the order. Schema:
  - Product_id: Integer
  - Customer_id: String
  - Order_date: Date
  - Location: String
  - Source_order: String
- Menu Data: Contains details of each product, including product ID, product name, and price. Schema:
  - Product_id: Integer
  - Product_name: String
  - Price: String
- **Data Loading**:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

# Define the schema for the sales data
schema = StructType([
    StructField("Product_id", IntegerType(), True),
    StructField("Customer_id", StringType(), True),
    StructField("Order_date", DateType(), True),
    StructField("Location", StringType(), True),
    StructField("Source_order", StringType(), True)
])

sales_df = spark.read.csv("/FileStore/tables/sales_csv.txt", schema=schema)
```
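The joins in the analysis steps also need `menu_df`, which the snippets here never define. A minimal loading sketch, assuming a hypothetical path `/FileStore/tables/menu_csv.txt` next to the sales file, with `Price` read as a string per the schema above (Spark casts it implicitly when summing):

```python
# Hypothetical path and schema for the menu data, mirroring the sales file
menu_schema = StructType([
    StructField("Product_id", IntegerType(), True),
    StructField("Product_name", StringType(), True),
    StructField("Price", StringType(), True)
])

menu_df = spark.read.csv("/FileStore/tables/menu_csv.txt", schema=menu_schema)
```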
- **Data Transformation**:

```python
from pyspark.sql.functions import month, year, quarter

# Derive year, month, and quarter columns from the order date
sales_df = sales_df.withColumn("order_year", year(sales_df.Order_date))
sales_df = sales_df.withColumn("order_month", month(sales_df.Order_date))
sales_df = sales_df.withColumn("order_quarter", quarter(sales_df.Order_date))
```
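To sanity-check the derived columns, the transformed frame can be inspected; a quick usage sketch:

```python
# Show a few rows of the derived date columns
sales_df.select("Order_date", "order_year", "order_month", "order_quarter").show(5)
```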
- **Data Aggregation and Analysis**:
**Total amount spent by each customer**:

```python
# Attach prices via the menu, then sum per customer
total_amount_spent = (sales_df.join(menu_df, "Product_id")
                      .groupBy('Customer_id')
                      .agg({'Price': 'sum'})
                      .orderBy('Customer_id'))
```
**Total amount spent on each product**:

```python
# Sum revenue per product (a distinct name avoids clobbering the per-customer result)
total_product_spent = (sales_df.join(menu_df, "Product_id")
                       .groupBy('Product_name')
                       .agg({'Price': 'sum'})
                       .orderBy('Product_name'))
```
**Monthly, yearly, and quarterly sales amounts** (yearly and quarterly follow the same pattern; see the sketch below):

```python
# Monthly sales amount, grouped on the derived order_month column
monthly_amount = (sales_df.join(menu_df, "Product_id")
                  .groupBy('order_month')
                  .agg({'Price': 'sum'})
                  .orderBy('order_month'))
```
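The snippet above covers the monthly figure only; a minimal sketch for the yearly and quarterly aggregates, using the order_year and order_quarter columns created in the transformation step (the variable names yearly_amount and quarterly_amount are hypothetical):

```python
# Yearly sales amount
yearly_amount = (sales_df.join(menu_df, "Product_id")
                 .groupBy('order_year')
                 .agg({'Price': 'sum'})
                 .orderBy('order_year'))

# Quarterly sales amount
quarterly_amount = (sales_df.join(menu_df, "Product_id")
                    .groupBy('order_quarter')
                    .agg({'Price': 'sum'})
                    .orderBy('order_quarter'))
```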
**Product sales count and top-selling products**:

```python
from pyspark.sql.functions import count

# Count orders per product and keep the five best sellers
product_sale = (sales_df.join(menu_df, "Product_id")
                .groupBy('Product_name')
                .agg(count('Product_id').alias('product_count'))
                .orderBy('product_count', ascending=False)
                .limit(5))
```
**Customer visits analysis**:

```python
from pyspark.sql.functions import countDistinct

# Count distinct order dates per customer for restaurant orders
customer_visit = (sales_df.filter(sales_df.Source_order == 'Restaurant')
                  .groupBy('Customer_id')
                  .agg(countDistinct('Order_date').alias('No of Visits'))
                  .orderBy('Customer_id'))
```
**Sales by location**:

```python
country_sales = (sales_df.join(menu_df, 'Product_id')
                 .groupBy('Location')
                 .agg({'Price': 'sum'})
                 .orderBy('Location'))
```
**Sales by order source**:

```python
mode_of_sales = (sales_df.join(menu_df, 'Product_id')
                 .groupBy('Source_order')
                 .agg({'Price': 'sum'})
                 .orderBy('Source_order'))
```
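The dashboard in `Data/Dashboard.gif` was presumably built from aggregates like these. A minimal plotting sketch, assuming matplotlib is available and reusing the monthly_amount frame from above (dict-style `agg` names the result column `sum(Price)`):

```python
import matplotlib.pyplot as plt

# Collect the small aggregate to the driver and plot it
pdf = monthly_amount.toPandas()
plt.bar(pdf["order_month"], pdf["sum(Price)"].astype(float))
plt.xlabel("Month")
plt.ylabel("Total sales")
plt.title("Monthly sales amount")
plt.show()
```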
**Technologies Used**:

- PySpark: For data processing and analysis.
- Jupyter Notebook: For interactive development and visualization.
- Databricks: For running the Spark cluster.
This project showcases the power of PySpark for handling large datasets and performing complex data transformations and analyses. The insights derived from this analysis can help businesses understand customer behavior, product popularity, and sales trends, enabling data-driven decision-making.