This project entails an in-depth analysis of Netflix's movies and TV shows dataset through SQL. The aim is to derive meaningful insights and address several business-related queries using the data. This README offers a thorough overview of the project’s objectives, business challenges, solutions, findings, and conclusions.
- Examine the distribution of content types (movies versus TV shows).
- Determine the most frequent ratings for each content type.
- Organize and evaluate content by release year, country, and duration.
- Investigate and classify content based on particular criteria and keywords.
The dataset used in this project is obtained from Kaggle.
-
Dataset Link: Movies Dataset
DROP TABLE IF EXISTS netflix;
CREATE TABLE netflix
(
show_id VARCHAR(5),
type VARCHAR(10),
title VARCHAR(250),
director VARCHAR(550),
casts VARCHAR(1050),
country VARCHAR(550),
date_added date,
release_year INT,
rating VARCHAR(15),
duration VARCHAR(15),
listed_in VARCHAR(250),
description VARCHAR(550)
);
SELECT * FROM netflix;
SELECT
type,
count(*)
FROM netflix
GROUP BY type;
SELECT
type,
rating
FROM (
SELECT
type,
rating,
COUNT(*) as rating,
RANK() OVER(PARTITION BY type ORDER BY COUNT(*) DESC ) Ranking
FROM netflix
GROUP BY type, rating
) AS t1
WHERE Ranking = 1;
SELECT *
FROM netflix
WHERE release_year = 2018;
WITH Most_Content AS (
SELECT
Trim(UNNEST(STRING_TO_ARRAY(country,','))) AS country,
count(*) AS contents
FROM netflix
GROUP BY 1
)
SELECT
country,
contents
FROM Most_Content
ORDER BY 2 DESC
LIMIT 7;
SELECT *
FROM netflix
WHERE type = 'Movie' AND duration = (
SELECT
MAX(duration)
FROM netflix
);
SELECT *
FROM netflix
WHERE type = 'TV Show'
AND SPLIT_PART(duration, ' ', 1)::INT >= 7;
SELECT *
FROM netflix
WHERE date_added >= CURRENT_DATE - INTERVAL '7 year' AND date_added IS NOT NULL;
SELECT
UNNEST(STRING_TO_ARRAY(listed_in, ',')) AS genre,
COUNT (*) AS content_number
FROM netflix
GROUP BY genre
ORDER BY 1;
SELECT
country,
release_year,
COUNT(show_id) AS total_release,
ROUND(
COUNT(show_id)::numeric /
(SELECT COUNT(show_id) FROM netflix WHERE country ILIKE '%India%')::numeric * 100, 2
) AS avg_release
FROM netflix
WHERE country ILIKE '%India%'
GROUP BY country, release_year
ORDER BY avg_release DESC
LIMIT 5;
SELECT *
FROM netflix
WHERE type = 'Movie' AND listed_in ILIKE '%Documentaries%';
SELECT *
FROM netflix
WHERE director IS NULL;
SELECT
UNNEST(STRING_TO_ARRAY(casts, ',')) AS actor,
COUNT(*)
FROM netflix
WHERE country = 'India'
GROUP BY actor
ORDER BY COUNT(*) DESC
LIMIT 10;
SELECT *
FROM netflix
WHERE casts ILIKE '%Shah Rukh Khan%'
AND date_added>= CURRENT_DATE - INTERVAL '5 year';
WITH No_Of_Category AS (
SELECT *,
CASE
When description ILIKE '%Kill%'
OR
description ILIKE '%Violence%'
THEN 'Bad Category'
ELSE 'Good Category'
END Category
FROM netflix
)
SELECT
Category,
COUNT(*)
FROM No_Of_Category
GROUP BY Category
ORDER BY 1 DESC;
1. Content Distribution: The dataset features a variety of movies and TV shows across different ratings and genres.
2. Popular Ratings: Analyzing common ratings offers insights into the intended audience for the content.
3. Geographical Trends: Leading countries and average content releases, particularly in India, shed light on regional content trends.
4. Content Classification: Sorting content by specific keywords aids in understanding the types of material available on Netflix.
This analysis delivers an in-depth look at Netflix's offerings and supports strategic content planning and decision-making.