Airbnb has revolutionized the traditional hospitality industry as more and more travelers choose Airbnb as their primary accommodation provider. Since its launch in 2008, Airbnb has seen enormous growth, with the number of rentals listed on its website growing exponentially each year.
Paris has more Airbnb listings than any other city in the world: over 52,000 listings, 86% of them entire homes. Over roughly the same period that Airbnb has been active in Paris, an estimated 20,000 apartments have disappeared from the traditional rental market.
In the past couple of months, the Covid-19 pandemic has shut down the travel industry, and suddenly Airbnb is facing an existential crisis.
So what could possibly go wrong, and how has the Covid-19 pandemic impacted Airbnb?
Udacity-DEND-Capstone-Project
│   README.md                              # Project description
│   requirements.txt                       # Python dependencies
│   docker-compose.yml                     # Docker containers configuration
│
└───notebooks                              # Python notebooks to run ETL locally
│       ETL_airbnb.ipynb
│       ETL_covid.ipynb
│       ETL_weather.ipynb
│
└───data
│   └───airbnb_paris_data                  # Airbnb data directory
│   │       listings.csv
│   │       reviews.csv
│   └───covid19_data                       # Covid-19 data directory
│   │       confirmed_data.csv
│   │       deaths_data.csv
│   │       recovered_data.csv
│   └───weather_data                       # Weather data directory
│           weather_paris.json
│
└───assets
│       raw_datalake_dag_graph_view.JPG    # Move data to S3 DAG graph view
│       airbnb_etl_dag_graph_view.JPG      # Airbnb data ETL DAG graph view
│
└───src
    └───airflow                            # Airflow home
    │   └───dags                           # DAG definitions
    │   │       dag_airbnb_data_etl.py     # Airbnb data ETL DAG
    │   │       dag_move_data_to_s3.py     # Move data to S3 DAG
    │   └───plugins
    │       └───operators                  # Custom operators
    │               create_s3_bucket.py    # CreateS3BucketOperator
    │               data_quality.py        # CheckS3FileCountOperator
    │               upload_file_s3.py      # UploadFilesToS3Operator
    └───download_data_script
    │       covid_data_download.py         # Python script to download Covid-19 data
    │       weather_data_download.py       # Python script to download weather data
    └───script
    │       airbnb_etl.py                  # Airbnb data ETL script
    │       covid_etl.py                   # Covid data ETL script
    │       weather_etl.py                 # Weather data ETL script
    │       check_data_quality.py          # Data quality check script
    └───emr
            emr_config.json                # EMR cluster configuration
- Install Python3 and Jupyter
- Install Docker and Docker Compose
- Create an AWS Account and make sure to create an IAM Role
The idea is to create a data lake on AWS, enabling users to analyze the effect of Covid-19 on Airbnb data and extract insights. The main goal of this project is to build an end-to-end data pipeline that is capable of working with large volumes of data. We are going to extract, transform and load the data into an optimized data lake on S3. The data lake will consist of logical tables partitioned by year and month to optimize query latency.
The datasets will be cleaned and enriched with additional information, then loaded into a data lake. For this, we will use the following tools and technologies:
1. AWS S3:
- Amazon S3 is an object storage service that offers industry-leading scalability, availability and durability. We are going to use Amazon S3 to:
- Store the raw data
- Create a data lake
2. AWS EMR:
- We are going to use Apache Spark running on an EMR cluster on AWS to ingest the data from S3 and perform the ETL process, denormalizing the datasets to create the data lake.
3. Apache Airflow:
- We are going to build a data pipeline using Apache Airflow to orchestrate everything. Airflow provides an intuitive UI where we can track the progress and bottlenecks of our pipelines.
4. Python:
- Python is an easy-to-use programming language that supports a wide variety of libraries for tasks such as building pipelines and analyzing data.
5. Docker:
- Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and deploy it as one package.
We are going to work with 3 data sources:
- Airbnb Data
- Inside Airbnb is an independent, non-commercial set of tools and data that allows you to explore how Airbnb is really being used in cities around the world.
- We are going to use Paris Data by downloading 2 datasets:
- Detailed Listings data for Paris. (66 336 records, CSV)
- Detailed Review Data for listings in Paris. (1 261 965 records, CSV)
- Covid-19 Data
- This dataset contains COVID-19 daily time series summary tables, including confirmed cases, deaths and recoveries.
- Weather Data
- This dataset contains daily weather information; we are going to use the Meteostat API to download it in JSON format.
Airbnb Data
One of the biggest issues with Airbnb data is getting the occupancy rate for each host or listing. Inside Airbnb, our data source, uses an occupancy model which they call the "San Francisco Model" with the following methodology:
- A review rate of 50% is used to convert reviews to estimated bookings, meaning that only 50% of all visitors are assumed to write a review. With that said, the number of reviews divided by the review rate gives us an estimate of the actual number of visitors.
- An average length of stay for each city is usually published by Airbnb. This number, multiplied by the estimated bookings for each listing over a period of time, gives the occupancy rate.
- Finally, the estimated income is the occupancy rate multiplied by the price.
So for the revenue fact table we have to add two calculated fields (see the PySpark sketch after this list):
- Occupancy Rate = Average Length of Stay * (Reviews Count / Review Rate)
- Estimated Income = Occupancy Rate * Price
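As a rough illustration, here is a minimal PySpark sketch of these two calculated fields. The column names (`number_of_reviews`, `price`, `id`, `host_id`), the 50% review rate and the assumed average length of stay come from the model described above, not from the actual `airbnb_etl.py` script.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("airbnb_revenue_sketch").getOrCreate()

REVIEW_RATE = 0.5          # assume only 50% of guests leave a review
AVG_LENGTH_OF_STAY = 3.0   # assumed average length of stay (nights) for Paris

listings = spark.read.csv("data/airbnb_paris_data/listings.csv",
                          header=True, inferSchema=True)

revenue = (
    listings
    # price arrives as a string like "$85.00", so strip symbols before casting
    .withColumn("price_num", F.regexp_replace("price", "[$,]", "").cast("float"))
    # occupancy = average length of stay * (reviews / review rate)
    .withColumn("occupancy",
                F.lit(AVG_LENGTH_OF_STAY) * (F.col("number_of_reviews") / REVIEW_RATE))
    # estimated income = occupancy * price
    .withColumn("income_est", F.col("occupancy") * F.col("price_num"))
    .select("id", "host_id", "occupancy", "income_est")
)
```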
Covid-19 Data
For the Covid-19 data, since the downloaded files include all countries, we start by filtering to keep only the France data, then we transpose columns to rows, and finally we join the 3 files together, as sketched below.
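A hedged sketch of that transformation in PySpark. The wide layout with `Province/State` and `Country/Region` columns follows the common JHU-style time series format and is an assumption, not a copy of `covid_etl.py`.

```python
import pyspark.sql.functions as F

def france_long(df, value_name):
    """Keep the metropolitan France row and unpivot the wide date columns into rows."""
    df = df.filter((F.col("Country/Region") == "France") & F.col("Province/State").isNull())
    date_cols = [c for c in df.columns
                 if c not in ("Province/State", "Country/Region", "Lat", "Long")]
    # stack(n, 'col1', `col1`, 'col2', `col2`, ...) turns columns into (date, value) rows
    stack_expr = "stack({}, {}) as (date, {})".format(
        len(date_cols),
        ", ".join("'{0}', `{0}`".format(c) for c in date_cols),
        value_name)
    return df.selectExpr(stack_expr)

confirmed = france_long(spark.read.csv("data/covid19_data/confirmed_data.csv", header=True), "confirmed")
deaths    = france_long(spark.read.csv("data/covid19_data/deaths_data.csv", header=True), "deaths")
recovered = france_long(spark.read.csv("data/covid19_data/recovered_data.csv", header=True), "recovered")

# join the three unpivoted files on the date column
covid = (confirmed.join(deaths, on="date").join(recovered, on="date")
         .withColumn("date", F.to_date("date", "M/d/yy")))
```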
Weather Data
The weather dataset includes some unnecessary columns, which will be removed (see the sketch below).
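For example, keeping only the fields the weather table actually uses (the field names assume the Meteostat daily schema and an existing `spark` session; `weather_etl.py` may differ slightly):

```python
import pyspark.sql.functions as F

weather_raw = spark.read.json("data/weather_data/weather_paris.json")

# keep only the columns needed for the weather dimension and drop the rest
weather = weather_raw.select(
    F.to_date("date").alias("date"),
    F.col("tavg").cast("float"),
    F.col("tmin").cast("float"),
    F.col("tmax").cast("float"),
)
```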
We are going to use a star schema with 1 fact table and 5 dimension tables.
Fact table
- revenue table
Column Name | Data Type | Description |
---|---|---|
date | DateType | The date |
listing_id | int | Unique id of listing |
host_id | int | Unique id of host |
occupancy | float | Calculated occupancy rate |
income_est | float | Estimated income |
year | int | The year |
month | int | The month |
day | int | The day |
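As mentioned earlier, the data lake tables are partitioned by year and month. A minimal sketch of how the revenue fact table might be written to S3, assuming a `revenue` DataFrame with the columns above and a placeholder bucket name:

```python
import pyspark.sql.functions as F

revenue_fact = (revenue
                .withColumn("year", F.year("date"))
                .withColumn("month", F.month("date"))
                .withColumn("day", F.dayofmonth("date")))

(revenue_fact.write
 .mode("overwrite")
 .partitionBy("year", "month")                   # partition layout used by the data lake
 .parquet("s3a://my-datalake-bucket/revenue/"))  # placeholder bucket name
```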
Dimension tables
- host table
Column Name | Data Type | Description |
---|---|---|
host_id | int | Unique id of host, Primary Key |
host_name | string | Host name |
host_location | string | Host location |
host_since | DateType | Date when host started using airbnb |
host_response_time | string | Host response time |
host_response_rate | string | Host response rate |
host_is_superhost | string | Whether host is superhost or not |
host_acceptance_rate | string | Host acceptance rate |
- listing table
Column Name | Data Type | Description |
---|---|---|
id_listing | int | unique id of listing, Primary Key |
name | string | listing name |
description | string | listing description |
price | string | listing price |
neighbourhood_group_cleansed | string | listing neighbourhood group |
latitude | float | listing latitude |
longitude | float | listing longitude |
property_type | string | listing property type |
accommodates | int | listing accommodates |
bathrooms | int | number of bathrooms |
bedrooms | int | number of bedrooms |
beds | int | number of beds |
- covid table
Column Name | Data Type | Description |
---|---|---|
date | DateType | The date |
confirmed | int | Number of confirmed cases |
deaths | int | Number of deaths |
recovered | int | Number of recovered |
year | int | The year |
month | int | The month |
day | int | The day |
- weather table
Column Name | Data Type | Description |
---|---|---|
date | DateType | The date |
tavg | float | The average air temperature in °C |
tmin | float | The minimum air temperature in °C |
tmax | float | The maximum air temperature in °C |
year | int | The year |
month | int | The month |
day | int | The day |
- date table
Column Name | Data Type | Description |
---|---|---|
date | DateType | The date |
year | int | The year |
month | int | The month |
day | int | The day |
weekday | int | The weekday |
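The date table can be derived from the distinct dates already present in the data; a quick sketch, assuming a `covid` DataFrame with a `date` column (like the one sketched earlier) as the calendar source:

```python
import pyspark.sql.functions as F

date_dim = (covid.select("date").distinct()
            .withColumn("year", F.year("date"))
            .withColumn("month", F.month("date"))
            .withColumn("day", F.dayofmonth("date"))
            .withColumn("weekday", F.dayofweek("date")))   # 1 = Sunday, ..., 7 = Saturday
```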
git clone https://github.com/amine-akrout/Udacity-DEND-Capstone-Project.git
cd Udacity-DEND-Capstone-Project
python3 -m venv venv # create virtualenv
source venv/bin/activate # activate virtualenv
pip install -r requirements.txt # install requirements
- Airbnb Data: Download and save the data in the folder data/airbnb_paris_data/
- Covid-19 Data: Use the script covid_data_download.py to automatically download the Covid-19 data and save it in CSV format
cd src/download_data_script
python covid_data_download.py
- Weather Data: Use the script weather_data_download.py to automatically download the weather data using the Meteostat API and save it in JSON format
python weather_data_download.py
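For reference, a hedged sketch of what the weather download could look like using the meteostat Python package (the Paris coordinates and date range are assumptions, and the actual script may call the Meteostat JSON API directly instead):

```python
from datetime import datetime
from meteostat import Point, Daily

# daily weather for Paris (48.8566 N, 2.3522 E) over an assumed date range
paris = Point(48.8566, 2.3522)
daily = Daily(paris, datetime(2020, 1, 1), datetime(2020, 12, 31)).fetch()

# persist as JSON so the ETL can read it (path relative to the repository root)
daily.reset_index().to_json("data/weather_data/weather_paris.json",
                            orient="records", date_format="iso")
```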
We are going to use the Bitnami image for the Airflow container
cd ../..
docker-compose up
Access localhost:8080 in your browser and log in to Airflow.
- Username: user
- Password: password
- Airflow connections

  Navigate to Admin >> Connections >> Add a new record, then enter the following values:
  - Conn Id: Enter aws_credentials.
  - Conn Type: Enter Amazon Web Services.
  - Login: Enter your Access key ID from the IAM User credentials.
  - Password: Enter your Secret access key from the IAM User credentials.
  - Extra: Add the default region name: { "region_name": "eu-west-3" }
- EMR config

  Navigate to Admin >> Connections and select the 'emr_default' connection. Copy everything from src/emr/emr_config.json, paste it into the 'Extra' field, then save.
Navigate to the Airflow UI and start the move_data_to_s3_dag by switching its state from OFF to ON.
Refresh the page and click on the trigger DAG button.
Navigate to Graph View to view the current DAG state. This pipeline creates the 3 S3 buckets (one for our raw data, a second for our data lake and a third for our scripts) and uploads the data from the local machine. Wait until the pipeline has completed successfully (it should take around 5-10 minutes).
Step | Type | Description |
---|---|---|
Begin_execution | DummyOperator | Dummy task with no action |
Create_S3_bucket, Create_script_bucket, Create_datalake_bucket | CreateS3BucketOperator | Creates an S3 bucket |
Upload_weather_data, Upload_covid_data, Upload_airbnb_data | UploadFilesToS3Operator | Moves data from local to the S3 bucket |
Check_data_quality | CheckS3FileCountOperator | Checks whether the files were uploaded or not |
Stop_execution | DummyOperator | Dummy task with no action |
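For context, a stripped-down sketch of what a bucket-creation operator like CreateS3BucketOperator might look like. It uses boto3 directly with the default credential chain, so it is not necessarily how plugins/operators/create_s3_bucket.py is implemented:

```python
import boto3
from airflow.models import BaseOperator


class CreateS3BucketOperator(BaseOperator):
    """Create an S3 bucket in the given region if it does not already exist."""

    def __init__(self, bucket_name, region_name="eu-west-3", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bucket_name = bucket_name
        self.region_name = region_name

    def execute(self, context):
        # credentials are resolved by boto3's default chain here; the real operator
        # would more likely read them from the aws_credentials Airflow connection
        s3 = boto3.client("s3", region_name=self.region_name)
        existing = [b["Name"] for b in s3.list_buckets()["Buckets"]]
        if self.bucket_name not in existing:
            s3.create_bucket(
                Bucket=self.bucket_name,
                CreateBucketConfiguration={"LocationConstraint": self.region_name},
            )
            self.log.info("Created bucket %s", self.bucket_name)
```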
Navigate back to the Airflow UI and start the airbnb_data_etl_dag by switching its state from OFF to ON.
Refresh the page and click on the trigger DAG button.
Navigate to "Graph View" to view the current DAG state.
This pipeline extracts the data from the raw data bucket, transforms it with a Spark cluster on EMR to create the dimension and fact tables, and writes them back to S3 as a data lake.
Step | Type | Description |
---|---|---|
Begin_execution | DummyOperator | Dummy task with no action |
Upload_etl_code | UploadFilesToS3Operator | Uploads the ETL scripts to S3 |
Create_EMR_cluster | EmrCreateJobFlowOperator | Creates an EMR cluster |
Add_jobflow_steps | EmrAddStepsOperator | Adds steps to an existing EMR JobFlow |
weather_processing_step, covid_processing_step, airbnb_processing_step, data_quality_check_step | EmrStepSensor | Asks for the state of the step until it reaches a terminal state; if the step fails, the sensor errors and the task fails |
Stop_execution | DummyOperator | Dummy task with no action |
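To illustrate how the processing steps reach the cluster, here is a hedged sketch of the step definition EmrAddStepsOperator expects, submitting airbnb_etl.py via spark-submit. The bucket names are placeholders and the --input/--output arguments are assumptions about how the scripts are parameterized:

```python
AIRBNB_PROCESSING_STEP = [
    {
        "Name": "airbnb_processing_step",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-script-bucket/airbnb_etl.py",   # placeholder script bucket
                "--input", "s3://my-raw-data-bucket/",   # placeholder raw data bucket
                "--output", "s3://my-datalake-bucket/",  # placeholder data lake bucket
            ],
        },
    }
]
```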
1. The data was increased by 100x
- The raw data storage and data lake would not require significant changes since we have a flexible schema and S3 is meant for storing big data.
- For the ETL job, we should probably use m5.large or m5.xlarge EC2 instances instead of m4.large and add more workers.
2. The pipelines would be run on a daily basis by 7 am every day.
We could schedule the Airflow pipelines so that they run daily at 7 am (see the sketch after this list).
3. The database needed to be accessed by 100+ people.
- We would need more CPU resources to provide a fast experience; we should use a distributed data warehouse solution like Redshift in order to improve replication and partitioning and get faster query results for each user.
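A minimal sketch of what the daily 7 am schedule looks like in the DAG definition (the DAG name matches the one above; the other arguments are simplified):

```python
from datetime import datetime
from airflow import DAG

dag = DAG(
    "airbnb_data_etl_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 7 * * *",   # run every day at 7 am
    catchup=False,
)
```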