Kafka_streaming_project

This project uses a Kaggle dataset containing metadata for the TV shows and movies on Netflix. It simulates real-time streaming of movie details with Kafka. The stack is Python, Amazon EC2, Apache Kafka, Amazon S3, AWS Glue, Amazon Athena, and SQL.

Launch an EC2 instance on AWS

Connect to your instance

Install Kafka

  wget https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz
  tar -xvf kafka_2.13-3.7.0.tgz

Install Java

  sudo yum install java-1.8.0
  java -version

Edit the security group's inbound rules to allow requests from your local machine (the broker is reached on port 9092)

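Change the server to advertise the public IP of the EC2 instance (this is typically the advertised.listeners setting in config/server.properties)

  cd kafka_2.13-3.7.0
  sudo nano config/server.properties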
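
If you would rather script the edit than make it by hand in nano, the change can be sketched in Python. This is a minimal sketch, not the project's own method: the helper name `set_advertised_listener` is hypothetical, and it assumes the stock `server.properties` layout, where `advertised.listeners` ships commented out.

```python
import re


def set_advertised_listener(properties_text: str, public_ip: str, port: int = 9092) -> str:
    """Return server.properties text with advertised.listeners set to public_ip:port."""
    new_line = f"advertised.listeners=PLAINTEXT://{public_ip}:{port}"
    # Match the setting whether it is active or commented out with '#'.
    pattern = re.compile(r"^[ \t]*#?[ \t]*advertised\.listeners[ \t]*=.*$", re.MULTILINE)
    if pattern.search(properties_text):
        # Replace the first occurrence only; a function replacement keeps the
        # new line from being interpreted as a regex template.
        return pattern.sub(lambda _match: new_line, properties_text, count=1)
    # Setting absent: append it on its own line.
    separator = "" if not properties_text or properties_text.endswith("\n") else "\n"
    return f"{properties_text}{separator}{new_line}\n"


# Example (hypothetical address; run from the Kafka directory):
# path = "config/server.properties"
# text = open(path, encoding="utf-8").read()
# open(path, "w", encoding="utf-8").write(set_advertised_listener(text, "<EC2 public IP>"))
```

The function only transforms text, so the original file stays untouched until you choose to write the result back.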

Start ZooKeeper

  bin/zookeeper-server-start.sh config/zookeeper.properties

Start the Kafka server in a second terminal session

  cd kafka_2.13-3.7.0
  export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
  bin/kafka-server-start.sh config/server.properties
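
Before creating the topic, it can help to confirm from your local machine that the inbound rule and the broker port are reachable. The sketch below is only a TCP check, assuming the broker listens on port 9092; the helper name `broker_reachable` is hypothetical, and a successful connection shows the port is open, not that Kafka is healthy.

```python
import socket


def broker_reachable(host: str, port: int = 9092, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


# Example (placeholder address):
# print(broker_reachable("<EC2 public IP>"))
```

If this returns False while the broker is running, the security group rule or the advertised address is the usual place to look.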

Create a topic

  bin/kafka-topics.sh --create --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092} --replication-factor 1 --partitions 1

Start a console producer

  cd kafka_2.13-3.7.0
  bin/kafka-console-producer.sh --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092}

Start a console consumer

  cd kafka_2.13-3.7.0
  bin/kafka-console-consumer.sh --topic netflix_data --bootstrap-server {Put the Public IP of your EC2 Instance:9092}

Create an S3 bucket

Open Jupyter Notebook and create a producer and consumer

Producer
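
The notebook code is not reproduced here, so the following is a sketch of what a producer could look like rather than the contents of netflix_producer.ipynb. It assumes the kafka-python package and the netflix_data topic created earlier; the function names are hypothetical. It reads the CSV row by row, keeps `release_year` numeric so later SQL filters compare numbers, and pauses between sends to imitate a live feed.

```python
import csv
import json
import time
from typing import Iterable, Iterator


def read_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield each CSV record as a dict keyed by the header row."""
    for row in csv.DictReader(lines):
        record = dict(row)
        year = record.get("release_year")
        if year and year.strip().isdigit():
            record["release_year"] = int(year)  # keep numeric for SQL filters
        yield record


def stream_rows(producer, topic: str, rows: Iterable[dict], delay_s: float = 1.0) -> int:
    """Send rows one at a time, sleeping between sends; return the number sent."""
    sent = 0
    for row in rows:
        producer.send(topic, value=row)
        sent += 1
        time.sleep(delay_s)
    producer.flush()  # block until buffered messages are delivered
    return sent


def make_producer(bootstrap_server: str):
    """Build a kafka-python producer that serializes dicts as UTF-8 JSON."""
    from kafka import KafkaProducer  # third-party: pip install kafka-python
    return KafkaProducer(
        bootstrap_servers=[bootstrap_server],
        value_serializer=lambda value: json.dumps(value).encode("utf-8"),
    )


# Example (needs a running broker; the address is a placeholder):
# with open("netflix_titles.csv", newline="", encoding="utf-8") as f:
#     stream_rows(make_producer("<EC2 public IP>:9092"), "netflix_data", read_rows(f))
```

Passing the producer in as an argument keeps the streaming loop easy to try out with a stand-in object before a broker is available.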

Consumer
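
As with the producer, this is a sketch under stated assumptions and not the contents of netflix_consumer.ipynb. It assumes kafka-python for reading the topic and boto3 for writing to S3; the function names and the `netflix` key prefix are hypothetical, and the bucket name is whatever you created above. Each message is stored as its own JSON object so a crawler can infer the schema later.

```python
import json
from typing import Callable, Iterable


def store_messages(messages: Iterable[dict], put_object: Callable[..., object],
                   bucket: str, prefix: str = "netflix") -> int:
    """Write each message as a separate JSON object; return how many were written."""
    count = 0
    for count, message in enumerate(messages, start=1):
        key = f"{prefix}/record_{count:06d}.json"
        put_object(Bucket=bucket, Key=key, Body=json.dumps(message).encode("utf-8"))
    return count


def consume_to_s3(bootstrap_server: str, topic: str, bucket: str) -> None:
    """Read the topic until interrupted and copy every message into the bucket."""
    import boto3                     # third-party: pip install boto3
    from kafka import KafkaConsumer  # third-party: pip install kafka-python

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=[bootstrap_server],
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    s3 = boto3.client("s3")
    # Each ConsumerRecord carries the deserialized payload in .value.
    store_messages((record.value for record in consumer), s3.put_object, bucket)


# Example (placeholders for the address and bucket):
# consume_to_s3("<EC2 public IP>:9092", "netflix_data", "<your bucket name>")
```

Note that the key counter restarts at 1 on every run, so a second run overwrites earlier objects under the same prefix; use a different prefix per run if you want to keep them.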

Check the data in the S3 bucket

Build a crawler in AWS Glue

Add the S3 bucket as a data source

Create a database

Run the crawler
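
The Glue steps above are done in the console, but they can also be sketched with boto3. This is an illustration only: the helper name `crawl_bucket` is hypothetical, and the crawler name, IAM role, and S3 path are values you would supply yourself. The role must already exist and be allowed to read the bucket.

```python
def crawl_bucket(glue, crawler_name: str, role_arn: str, database: str, s3_path: str) -> None:
    """Create a Glue database and a crawler over s3_path, then start the crawler."""
    # Raises an AlreadyExistsException if the database is already there.
    glue.create_database(DatabaseInput={"Name": database})
    glue.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # The crawl runs asynchronously; tables appear once it finishes.
    glue.start_crawler(Name=crawler_name)


# Example (all names except the database are placeholders):
# import boto3
# crawl_bucket(boto3.client("glue"), "<crawler name>", "<IAM role ARN>",
#              "netflix_movies_db", "s3://<your bucket name>/")
```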

Run queries on the table in Athena

We can run different types of queries

Query titles released in 2020

  SELECT * FROM "netflix_movies_db"."gakas_kafka_netflix_data" WHERE release_year = 2020;

Count titles by type

  SELECT type, COUNT(*) FROM "netflix_movies_db"."gakas_kafka_netflix_data" GROUP BY type;
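
The same queries can be submitted from Python instead of the Athena console. The sketch below assumes boto3 and an S3 location for query results that you choose yourself; the helper name `run_athena_query` is hypothetical. It submits the SQL, polls until Athena reports a terminal state, and returns that state.

```python
import time


def run_athena_query(athena, sql: str, database: str, output_location: str,
                     poll_s: float = 1.0) -> str:
    """Submit sql to Athena, wait for it to finish, and return the final state."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_s)  # still QUEUED or RUNNING


# Example (the results location is a placeholder):
# import boto3
# sql = 'SELECT type, COUNT(*) FROM "gakas_kafka_netflix_data" GROUP BY type'
# print(run_athena_query(boto3.client("athena"), sql, "netflix_movies_db",
#                        "s3://<your results bucket>/athena/"))
```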

Repository files

README.md
kafka_commands.txt
netflix_consumer.ipynb
netflix_producer.ipynb
netflix_titles.csv
