This project analyzes an Airbnb dataset to explore patterns in the data, clean it, and extract insights on pricing, availability, and location. The following steps were undertaken:
- Data Exploration
- Data Cleaning
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Visualizations
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('datasets.csv')
- Initial Inspection:
  - `df.head()`: Displays the first few rows of the dataset.
  - `df.shape`: Provides the dimensions of the dataset.
  - `df.info()`: Summarizes the dataset, including data types and non-null counts.
- The dataset contains missing values and duplicate entries.
- The `id` column is stored as a float but should be categorical.
df.isnull().sum()
df.dropna(inplace=True)
df.isnull().sum()
- Rows with missing values were identified and dropped so that later steps work on complete records.
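Dropping every row that has any missing value can discard more data than the analysis needs. A minimal sketch of checking the share of missing values per column and dropping rows only where a required column is empty. The frame, its values, and the `house_rules` column are made up for illustration and are not taken from `datasets.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real data comes from datasets.csv.
sample = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, 80.0],
    "beds": [1.0, 2.0, np.nan, 1.0],
    "house_rules": [np.nan, "No smoking", np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = sample.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Drop rows only when a column the analysis depends on is missing.
required = ["price", "beds"]  # assumed choice of required columns
cleaned = sample.dropna(subset=required)
print(len(sample), "->", len(cleaned))
```

On this sample a plain `dropna()` would leave no rows at all, because the sparsely filled text column is missing almost everywhere.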
df.duplicated().sum()
df.drop_duplicates(inplace=True)
df.duplicated().sum()
- Duplicate entries were removed.
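`duplicated()` with no arguments only flags rows that match an earlier row in every column. If listing ids are expected to be unique (an assumption about this dataset, not something the source states), a check on `id` alone can catch rows that repeat a listing with slightly different values. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical sample: one exact duplicate and one repeated id with a changed price.
sample = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "price": [100, 100, 250, 260],
})

# Exact duplicates: every column matches an earlier row.
exact_dupes = sample.duplicated().sum()

# Repeated ids: same id as an earlier row, even if other columns differ.
repeated_ids = sample.duplicated(subset=["id"]).sum()

# Keep the first row seen for each id.
deduped = sample.drop_duplicates(subset=["id"], keep="first")
print(exact_dupes, repeated_ids, len(deduped))
```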
df['id'] = df['id'].astype('object')
- The `id` column was converted to an object data type.
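Casting a float column to `object` changes the dtype but leaves each value a float, so ids still print with a trailing `.0`. One possible alternative, sketched here on made-up ids and assuming the column has no missing values left, is to cast to integer first and then to text:

```python
import pandas as pd

# Hypothetical ids stored as floats.
ids = pd.Series([1001254.0, 1002102.0, 1002403.0], name="id")

# Object dtype: the values themselves are still floats.
as_object = ids.astype("object")

# Integer then string: ids become plain text labels without ".0".
as_text = ids.astype("int64").astype(str)

print(as_object.iloc[0])
print(as_text.tolist())
```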
- Shape:
df.shape
- Price Distribution
sns.histplot(data=df, x='price')
- Identified extreme outliers in the `price` column using a boxplot.
sns.boxplot(data=df, x='price')
- Excluded prices of 16,000 and above for focused analysis.
data = df[df['price'] < 16000].copy()  # copy so new columns can be added without a SettingWithCopyWarning
sns.histplot(data=data, x='price', bins=120)
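The 16,000 cutoff is a fixed value chosen by eye from the boxplot. A common alternative, shown here only as a sketch and not as the method used in this project, is the 1.5 × IQR rule, which derives the upper limit from the data. The prices below are made up:

```python
import pandas as pd

# Hypothetical prices with one extreme value.
prices = pd.Series([50, 80, 100, 120, 150, 200, 20000], name="price")

# Interquartile range and the conventional upper fence.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Keep only values at or below the fence.
trimmed = prices[prices <= upper_fence]
print(upper_fence, len(trimmed))
```

A data-driven fence is usually much stricter than a hand-picked cutoff on skewed price data, so it is worth checking how many rows it removes before adopting it.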
- Availability (365 Days)
sns.histplot(data=df, x='availability_365', bins=30)
- Price by Neighborhood Group
data.groupby('neighbourhood_group')['price'].mean()
- Average price varies significantly across neighborhood groups.
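Means are sensitive to the few very expensive listings that remain after filtering. A minimal sketch of reporting the count and median next to the mean for each group. The group labels and prices are made up for illustration:

```python
import pandas as pd

# Hypothetical sample; group "A" contains one expensive listing.
sample = pd.DataFrame({
    "neighbourhood_group": ["A", "A", "A", "B", "B"],
    "price": [100, 120, 800, 90, 110],
})

# Count, mean, and median price per group, ordered by median.
summary = (
    sample.groupby("neighbourhood_group")["price"]
    .agg(["count", "mean", "median"])
    .sort_values("median", ascending=False)
)
print(summary)
```

In this sample the mean for group "A" is pulled far above its median by a single listing, which is the kind of gap the extra columns make visible.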
- Price Per Bed
data['price_per_bed'] = data['price'] / data['beds']
data.groupby('neighbourhood_group')['price_per_bed'].mean()
- Introduced a new feature to normalize pricing.
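If any listing reports zero beds, the division above produces `inf`, which then distorts the group means. A sketch of one way to guard against that, assuming zero beds should be treated as missing (the sample values are made up):

```python
import pandas as pd

# Hypothetical sample; the last listing reports zero beds.
sample = pd.DataFrame({
    "price": [100.0, 240.0, 90.0],
    "beds": [1.0, 2.0, 0.0],
})

# Keep bed counts only where they are positive; others become NaN.
valid_beds = sample["beds"].where(sample["beds"] > 0)

# Dividing by NaN yields NaN, which mean() skips by default.
sample["price_per_bed"] = sample["price"] / valid_beds
print(sample["price_per_bed"].mean())
```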
- Neighborhood Group vs. Price by Room Type
sns.barplot(data=data, x='neighbourhood_group', y='price', hue='room_type')
- Reviews vs. Price
sns.scatterplot(data=data, x='number_of_reviews', y='price', hue='neighbourhood_group')
- Explored the relationship between the number of reviews and price.
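A scatterplot shows the shape of the relationship; a rank correlation puts a number on it. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks. The values are made up, so the result says nothing about the actual dataset:

```python
import pandas as pd

# Hypothetical sample in which price falls as reviews rise.
sample = pd.DataFrame({
    "number_of_reviews": [0, 5, 20, 50, 120],
    "price": [300, 180, 150, 120, 90],
})

# Spearman correlation: Pearson correlation applied to the ranks.
review_ranks = sample["number_of_reviews"].rank()
price_ranks = sample["price"].rank()
rho = review_ranks.corr(price_ranks)
print(round(rho, 3))
```

A rank-based measure is used here because price is heavily skewed, and ranks are not affected by how extreme the largest values are.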
sns.pairplot(data=data, vars=['price', 'number_of_reviews', 'availability_365', 'beds'], hue='room_type')
- Examined relationships among numerical columns.
sns.scatterplot(data=data, x='longitude', y='latitude', hue='room_type')
- Visualized the spatial distribution of Airbnb listings.
- Histograms: Displayed distributions of `price` and `availability_365`.
- Boxplots: Highlighted outliers in pricing.
- Scatterplots: Explored geographic and review-based trends.
- Pairplot: Showed interrelationships between numeric variables.
- Outliers: High-priced listings (16,000 and above) were removed for a more focused analysis.
- Neighborhood Insights:
- Average price and price per bed differ across neighborhood groups.
- Within each neighborhood group, price varies strongly with room type.
- Geographical Trends:
- Listings are clustered in specific geographic locations.
This analysis highlights key trends and relationships within the Airbnb dataset and sets the foundation for further modeling or operational insights.