Skip to content

Ingest, Enrich, Sessionize, aggregate, and store clickstream Data in real time

License

Notifications You must be signed in to change notification settings

AnthonyByansi/Clickstream-Data-Processing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Clickstream Data Processing

This project demonstrates a real-time data processing pipeline for analyzing user behavior using Scala, Apache Kafka, and Apache Spark. The pipeline ingests clickstream data, enriches it, performs sessionization, aggregates the results, and stores the processed data for further analysis.


Table of Contents


Introduction

Modern organizations collect clickstream data from websites or applications to gain actionable insights into user behavior. This project provides a robust and scalable real-time data processing pipeline using Scala, Apache Kafka, and Apache Spark.

Key Highlights:

  • Data Ingestion: Real-time ingestion of clickstream data into Kafka.
  • Data Enrichment: Enhancing raw data with useful context (e.g., IP geolocation, user-agent parsing).
  • Sessionization: Grouping clickstream events into sessions based on user activity patterns.
  • Aggregation: Calculating metrics such as session duration, page views, and user engagement.
  • Storage: Processed data is stored for further analysis.

Project Architecture

The following diagram outlines the project architecture:

+------------------+       +------------------+       +------------------+
| Kafka Producer   | ----> | Kafka Topic      | ----> | Spark Streaming  |
+------------------+       +------------------+       +------------------+
                                                           |
                                                           v
                                       +------------------------------------+
                                       | Data Enrichment, Sessionization,   |
                                       | Aggregation                       |
                                       +------------------------------------+
                                                           |
                                                           v
                                            +-----------------------------+
                                            | Processed Data Storage      |
                                            | (HDFS / HBase / Database)   |
                                            +-----------------------------+
                                                           |
                                                           v
                                     +--------------------------------+
                                     | Visualization & Data Analysis |
                                     | (Superset / Tableau / Spark SQL)|
                                     +--------------------------------+

Features

  1. Real-time Data Ingestion

    • Simulated clickstream data is sent to a Kafka topic using a Kafka producer.
  2. Data Enrichment

    • Perform IP address lookups to determine geolocation.
    • Parse user-agent strings to extract browser and device information.
  3. Sessionization

    • Define logic for identifying sessions based on inactivity thresholds.
    • Group events into user sessions.
  4. Aggregation

    • Calculate metrics such as session duration, total page views, bounce rates, and more.
  5. Storage

    • Store processed data in Hadoop Distributed File System (HDFS), HBase, or relational databases.
  6. Visualization

    • Analyze and visualize processed clickstream data using tools like Apache Superset, Tableau, or Spark SQL.

Prerequisites

Before running the project, ensure you have the following tools installed:


Installation

Follow these steps to set up and run the project:

1. Clone the Repository:

https://github.com/AnthonyByansi/Clickstream-Data-Processing.git
cd ClickstreamDataProcessing

2. Install Dependencies:

If using SBT:

sbt clean compile

If using Maven:

mvn clean install

3. Start Apache Kafka:

  • Start Zookeeper and Kafka broker:
bin/zookeeper-server-start.sh config/zookeeper.properties  
bin/kafka-server-start.sh config/server.properties  
  • Create a Kafka topic for clickstream data:
bin/kafka-topics.sh --create --topic clickstream-data --bootstrap-server localhost:9092

Usage

  1. Simulate Clickstream Data

    • Run the Kafka producer to generate or simulate clickstream data.
    scala src/main/scala/producer/KafkaClickstreamProducer.scala
  2. Run the Data Processing Pipeline

    • Start the Spark Streaming job to consume and process the clickstream data.
    spark-submit --class processor.ClickstreamSessionizer \
      --master local[*] target/scala-2.12/clickstream-data-processing.jar
  3. Store the Processed Data

    • Output processed data to HDFS, HBase, or a database. Update configurations in application.conf.
  4. Analyze the Results

    • Use Spark SQL or visualization tools like Tableau/Superset to analyze the processed data.

Data Pipeline Overview

  1. Kafka Producer: Simulates or ingests raw clickstream data.
  2. Spark Streaming: Processes the real-time clickstream data.
  3. Sessionization & Aggregation: Groups events into sessions and calculates metrics.
  4. Storage: Saves the processed data for further analysis.
  5. Visualization: Generates reports or dashboards for insights.

Contributing

We welcome contributions! Follow these steps to contribute:

  1. Fork the repository.
  2. Create a new branch: git checkout -b feature/your-feature-name.
  3. Commit your changes: git commit -m "Add feature description".
  4. Push your branch: git push origin feature/your-feature-name.
  5. Submit a pull request explaining your changes.

License

This project is licensed under the MIT License. See LICENSE for details.

About

Ingest, Enrich, Sessionize, aggregate, and store clickstream Data in real time

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages