
This project simulates real-time streaming of movie details using Kafka. It uses Python, Amazon EC2, Apache Kafka, AWS Glue, Amazon Athena, and SQL.


gakas14/Kafka_streaming_project


Kafka_streaming_project

The dataset for this project comes from Kaggle and contains the metadata for all TV shows and movies on Netflix. The project simulates real-time streaming of movie details using Kafka. It uses Python, Amazon EC2, Apache Kafka, AWS Glue, Amazon Athena, and SQL.

[Image: kafka_netflix_data]

Launch an EC2 instance on AWS

[Screenshot: EC2 instance launch]

Connect to your instance

[Screenshot: connecting to the instance]

Install Kafka

  # Older Kafka releases are moved from downloads.apache.org to archive.apache.org
  wget https://archive.apache.org/dist/kafka/3.7.0/kafka_2.13-3.7.0.tgz
  tar -xvf kafka_2.13-3.7.0.tgz

Install Java

  sudo yum install java-1.8.0
  java -version

Edit the inbound rules of the instance's security group to allow requests from your local machine on port 9092 (the Kafka port used in the commands that follow)
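The inbound rule can also be added from Python instead of the console. The following is a minimal sketch, assuming boto3 is installed and configured with credentials. The function names, the security group ID, and the IP address are hypothetical placeholders, not values from this project.

```python
def kafka_ingress_rule(local_ip, port=9092):
    """Build an inbound rule that admits one IPv4 address on the Kafka port."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # A /32 CIDR block restricts the rule to a single address.
        "IpRanges": [{"CidrIp": f"{local_ip}/32", "Description": "Kafka client"}],
    }


def allow_kafka_from(ec2_client, group_id, local_ip):
    """Attach the rule to a security group through a boto3 EC2 client."""
    return ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[kafka_ingress_rule(local_ip)],
    )


# Example call (placeholder values, replace with your own):
# import boto3
# allow_kafka_from(boto3.client("ec2"), "sg-0123456789abcdef0", "203.0.113.10")
```

Restricting the rule to one address is a design choice for this sketch; it avoids opening the broker port to the whole internet.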

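Configure the Kafka server to advertise the public IP of the EC2 instance. In config/server.properties this is typically done by setting advertised.listeners to PLAINTEXT://{Public IP of your EC2 Instance}:9092.

cd kafka_2.13-3.7.0
sudo nano config/server.properties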
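The same change can be scripted rather than made by hand in nano. This is a minimal sketch, assuming the setting to change is advertised.listeners and that the listener uses the PLAINTEXT protocol on port 9092; the function name is hypothetical.

```python
import re


def set_advertised_listener(properties_text, public_ip, port=9092):
    """Return server.properties text with advertised.listeners set to public_ip."""
    new_line = f"advertised.listeners=PLAINTEXT://{public_ip}:{port}"
    # Match the setting whether it is active or commented out with '#'.
    pattern = re.compile(r"^#?[ \t]*advertised\.listeners=.*$", re.MULTILINE)
    if pattern.search(properties_text):
        return pattern.sub(lambda _: new_line, properties_text, count=1)
    # Setting not present at all: append it as a new line.
    return properties_text.rstrip("\n") + "\n" + new_line + "\n"


# Example use (the path is relative to the Kafka directory; the IP is a placeholder):
# with open("config/server.properties") as handle:
#     updated = set_advertised_listener(handle.read(), "203.0.113.10")
# with open("config/server.properties", "w") as handle:
#     handle.write(updated)
```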

Start ZooKeeper (from the kafka_2.13-3.7.0 directory)

bin/zookeeper-server-start.sh config/zookeeper.properties

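Start the Kafka server (in a second terminal session, since ZooKeeper keeps running in the first)

cd kafka_2.13-3.7.0
# Cap the JVM heap so the broker fits on a small instance
export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
bin/kafka-server-start.sh config/server.properties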

Create a topic

bin/kafka-topics.sh --create --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092} --replication-factor 1 --partitions 1

[Screenshot: topic creation]

Start Producer

cd kafka_2.13-3.7.0
bin/kafka-console-producer.sh --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092}

Start Consumer

cd kafka_2.13-3.7.0
bin/kafka-console-consumer.sh --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092}

Create an S3 bucket

[Screenshot: S3 bucket creation]

Open Jupyter Notebook and create a producer and consumer

Producer

[Screenshots: producer notebook code]
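The producer code in the original README survives only as screenshots. The following is a minimal sketch of what such a producer could look like, not the author's code. It assumes the kafka-python package, a CSV export of the Kaggle dataset (the file name in the example call is a placeholder), and the netflix_data topic created earlier. The function names and the one-second delay are hypothetical.

```python
import csv
import json
import time


def serialize(record):
    """Encode one record as UTF-8 JSON bytes, the form sent to Kafka."""
    return json.dumps(record).encode("utf-8")


def stream_rows(rows, producer, topic="netflix_data", delay_seconds=1.0):
    """Send each row to the topic, pausing between sends to mimic a live feed."""
    sent = 0
    for row in rows:
        producer.send(topic, value=row)
        sent += 1
        time.sleep(delay_seconds)
    producer.flush()  # block until buffered messages are delivered
    return sent


def run(csv_path, bootstrap_server):
    """Stream a CSV file to Kafka; requires `pip install kafka-python`."""
    # Imported here so the helpers above work without the package installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=[bootstrap_server],
        value_serializer=serialize,
    )
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return stream_rows(csv.DictReader(handle), producer)


# Example call (placeholder values):
# run("netflix_titles.csv", "<EC2 public IP>:9092")
```

Sending rows one at a time with a pause is what makes the static dataset behave like a stream; set delay_seconds to 0 to load it as fast as the broker accepts it.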

Consumer

[Screenshot: consumer notebook code]
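The consumer code is likewise only available as a screenshot. As a hedged sketch rather than the author's implementation, a consumer that reads the topic and writes each message to S3 might look like this. It assumes kafka-python and boto3 are installed and AWS credentials are configured. The function names, the bucket name, and the one-object-per-message key scheme are hypothetical.

```python
import json


def deserialize(raw_bytes):
    """Decode UTF-8 JSON bytes received from Kafka back into a Python object."""
    return json.loads(raw_bytes.decode("utf-8"))


def store_messages(messages, s3_client, bucket, prefix="netflix_data"):
    """Write each message value to S3 as its own JSON object; return the keys."""
    keys = []
    for count, message in enumerate(messages):
        key = f"{prefix}_{count}.json"
        s3_client.put_object(
            Bucket=bucket,
            Key=key,
            Body=json.dumps(message.value).encode("utf-8"),
        )
        keys.append(key)
    return keys


def run(bootstrap_server, bucket, topic="netflix_data"):
    """Consume the topic until interrupted and copy every message to S3."""
    # Imported here so the helpers above work without these packages installed.
    import boto3
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=[bootstrap_server],
        value_deserializer=deserialize,
    )
    return store_messages(consumer, boto3.client("s3"), bucket)


# Example call (placeholder values):
# run("<EC2 public IP>:9092", "my-example-bucket")
```

Writing one small JSON object per message keeps the sketch simple and lets a Glue crawler infer the schema directly, at the cost of many small files.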

Check the data in the S3 bucket

[Screenshot: objects in the S3 bucket]

Build a crawler in AWS Glue

Add the S3 bucket as a data source

[Screenshot: crawler data source]

Create a database

[Screenshot: Glue database creation]

Run the crawler

[Screenshots: crawler run]
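The console steps above (data source, database, crawler run) can also be expressed with boto3. This is a minimal sketch under stated assumptions: boto3 is installed, credentials are configured, and an IAM role that Glue can assume already exists. The crawler name, role ARN, and S3 path are hypothetical placeholders; the database name netflix_movies_db is the one used in the Athena queries later in this README.

```python
def crawler_definition(name, role_arn, database, s3_path):
    """Describe a crawler that scans one S3 path into one Glue database."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client, definition):
    """Create the target database and the crawler, then start a crawl."""
    glue_client.create_database(
        DatabaseInput={"Name": definition["DatabaseName"]}
    )
    glue_client.create_crawler(**definition)
    glue_client.start_crawler(Name=definition["Name"])
    return definition["Name"]


# Example call (placeholder values):
# import boto3
# definition = crawler_definition(
#     "netflix-crawler",
#     "arn:aws:iam::123456789012:role/ExampleGlueRole",
#     "netflix_movies_db",
#     "s3://my-example-bucket/",
# )
# create_and_start_crawler(boto3.client("glue"), definition)
```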

Run queries on the table in Athena

[Screenshots: Athena query editor]

We can run different types of queries

Query titles released in 2020

SELECT * FROM "netflix_movies_db"."gakas_kafka_netflix_data" WHERE release_year = 2020;

[Screenshots: query results for titles released in 2020]

Count titles by type

SELECT type, COUNT(*) FROM "netflix_movies_db"."gakas_kafka_netflix_data" GROUP BY type;

[Screenshot: query results for title counts by type]
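The same queries can be submitted from Python rather than the Athena console. The following is a rough sketch, assuming boto3 is installed and credentials are configured. The database and table names come from the queries above; the function name, the polling interval, and the S3 output location are hypothetical (Athena needs a location to write query results).

```python
import time

RELEASED_IN_2020 = (
    'SELECT * FROM "netflix_movies_db"."gakas_kafka_netflix_data" '
    "WHERE release_year = 2020;"
)
COUNT_BY_TYPE = (
    'SELECT type, COUNT(*) FROM "netflix_movies_db"."gakas_kafka_netflix_data" '
    "GROUP BY type;"
)


def run_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query and poll until Athena reports a final state."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        # QUEUED and RUNNING are the non-final states; keep polling on those.
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Example call (placeholder output location):
# import boto3
# run_query(boto3.client("athena"), COUNT_BY_TYPE,
#           "netflix_movies_db", "s3://my-example-bucket/athena-results/")
```

Once a query has succeeded, the rows can be fetched with the client's get_query_results call using the returned query ID.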
