02. Data Wrangling

Architecture

In this lab, we'll use Azure Databricks to wrangle the data in blob storage. Before you follow the steps, please make sure you understand the Azure Data Factory (ADF) concepts from the previous lab. In this lab you will:

  • Mount the blob storage to Databricks (a mount sketch follows the architecture diagram below)
  • Use Python in a Databricks notebook
  • Work with Spark DataFrames

(Architecture diagram)
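
Below is a minimal, hedged sketch of what mounting a blob container from a Databricks notebook can look like. The storage account, container, secret scope, and mount point names are placeholders for illustration only; the lab notebook you import later handles this step with its own parameter values.

```python
# Minimal sketch of mounting an Azure Blob Storage container in Databricks.
# All names below (account, container, secret scope, mount point) are placeholders;
# substitute the values from your own lab resource group.
storage_account = "azhol92storage"   # hypothetical storage account name
container = "data"                   # hypothetical container name
# For a lab you can paste the key directly; a secret scope is the cleaner option.
storage_key = dbutils.secrets.get(scope="azhol", key="storage-key")

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/azhol",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net": storage_key
    }
)

# Verify the mount by listing its contents
display(dbutils.fs.ls("/mnt/azhol"))
```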

1. Create Azure Databricks

From the Azure Portal, create a new Azure Databricks service

Type an Azure Databricks workspace name, e.g. azhol92

Select the appropriate Azure subscription

Select 'Use Existing' and find your hands-on lab resource group name in the drop-down box, e.g. azhol-92-rg

Select 'West US' for location

Select 'Premium (+Role-based access controls)' for Pricing Tier

Pin Azure Databricks on your Azure Portal dashboard

2. Create Azure Databricks cluster (10 mins)

Open your Azure Databricks workspace from the Azure Portal, or open Azure Databricks directly in your browser

Click the 'Clusters' icon on the left panel of the screen

Click '+ Create Cluster' and fill out the form as follows:

| Name | Value |
| --- | --- |
| Cluster Name | azhol92 |
| Cluster Mode | Standard Mode |
| Databricks Runtime Ver. | 4.2 |
| Python Ver. | 3 |
| Driver Type | Same as worker |
| Worker Type | Standard DS3 v2 |
| Workers | 2 |
| Enable autoscaling | Uncheck |
| Auto Termination | Check, 10 minutes |

(Screenshot: Create cluster)

3. Interact with data via Notebook

3.1 Import notebook

Click 'Workspace' on Azure Databricks Portal

Click 'Users', click the small icon next to your user name, and then click 'Import' to import an existing notebook into your Azure Databricks workspace

(Screenshot: Import menu)

Select 'URL', copy the URL below, and paste it into the import window:

https://raw.githubusercontent.com/xlegend1024/az-cloudscale-adv-analytics/master/AzureDatabricks/02.datawrangling.ipynb

(Screenshot: Import notebook dialog)

Click the 'Import' button; the notebook will open automatically in your browser

Click the 'Detached' menu and then select your cluster from the list.

(Screenshot: Attach notebook to cluster)

3.2. Update widget parameters for the lab

Find your blob storage account name and key, and then update the notebook's widget parameters (a sketch of the widgets follows below)

(Screenshot: Notebook widget parameters)
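
If you are curious how those parameters are wired up, Databricks notebooks expose them through widgets. The snippet below is a sketch only; the widget names in the imported notebook may differ, so use the names it actually defines.

```python
# Sketch of defining and reading notebook widgets for the storage settings.
# Widget names here are illustrative; check the imported notebook for the
# exact parameter names it expects.
dbutils.widgets.text("storage_account", "", "Blob storage account name")
dbutils.widgets.text("storage_key", "", "Blob storage account key")

storage_account = dbutils.widgets.get("storage_account")
storage_key = dbutils.widgets.get("storage_key")
```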

3.3 Run command from the notebook

Run commands in different languages (Python, Scala, R, SQL) to load data from blob storage into Databricks and wrangle it; a Python sketch follows below.
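
As one illustration, a Python cell that loads a CSV file from the mounted path into a Spark DataFrame could look like the sketch below; the file name is a placeholder, and the notebook's other cells do the equivalent with %scala, %r, and %sql magic commands.

```python
# Load a CSV file from the mounted blob path into a Spark DataFrame.
# The file name is a placeholder; use the dataset the notebook points at.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/azhol/source_data.csv"))   # hypothetical file name

display(df)       # preview the data in the notebook
df.printSchema()  # inspect the inferred schema
```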

3.4 Save final training dataset to blob

Run the command from the notebook on Databricks to write the final training dataset back to blob storage.
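
For reference, writing the wrangled DataFrame back to the mounted container might look like the sketch below; the transformation and output path are placeholders, and the notebook's own cell is the source of truth.

```python
# Persist the wrangled DataFrame as the final training dataset on blob storage.
df_training = df.dropna()                   # stand-in for the notebook's wrangling steps

(df_training.write
    .mode("overwrite")
    .option("header", "true")
    .csv("/mnt/azhol/training/"))           # hypothetical output folder
```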

3.5 (optional) Run Machine Learning in Databricks

Optionally, you can run machine learning directly in Databricks.
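
A hedged sketch of what such an optional step could look like with Spark MLlib is shown below; the column names and algorithm choice are illustrative only and may not match the lab notebook.

```python
# Optional sketch: train a simple Spark MLlib regression model on the saved dataset.
# Column names and the algorithm choice are illustrative, not the lab's actual code.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

train_df = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/mnt/azhol/training/"))

assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")  # hypothetical feature columns
lr = LinearRegression(featuresCol="features", labelCol="label")                        # hypothetical label column

model = Pipeline(stages=[assembler, lr]).fit(train_df)
model.transform(train_df).select("label", "prediction").show(5)
```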


Next > 03. Modeling


Main