02. Data Wrangling

Architecture

In this lab, we'll use Azure Databricks to wrangle the data in blob storage. Before you follow the steps, please make sure you understand the Azure Data Factory (ADF) concepts from the previous lab. In this lab you will:

  • Mount the blob storage to Databricks (a mount sketch follows the architecture diagram below)
  • Use Python in a Databricks notebook
  • Work with Spark DataFrames

(Architecture diagram)
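
Below is a minimal, hedged sketch of what mounting a blob container from a Databricks notebook can look like. The storage account, container, secret scope, and mount point names are placeholders for illustration only; the lab notebook you import later handles this step with its own parameter values.

```python
# Minimal sketch of mounting an Azure Blob Storage container in Databricks.
# All names below (account, container, secret scope, mount point) are placeholders;
# substitute the values from your own lab resource group.
storage_account = "azhol92storage"   # hypothetical storage account name
container = "data"                   # hypothetical container name
# For a lab you can paste the key directly; a secret scope is the cleaner option.
storage_key = dbutils.secrets.get(scope="azhol", key="storage-key")

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/azhol",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net": storage_key
    }
)

# Verify the mount by listing its contents
display(dbutils.fs.ls("/mnt/azhol"))
```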

1. Create Azure Databricks

From the Azure Portal, create a new Azure Databricks service

Type an Azure Databricks workspace name, e.g. azhol92

Select the appropriate Azure subscription

Select 'Use Existing' and find your hands-on lab resource group name in the drop-down box, e.g. azhol-92-rg

Select 'West US' for location

Select 'Premium (+Role-based access controls)' for Pricing Tier

Pin Azure Databricks on your Azure Portal dashboard

2. Create Azure Databricks cluster (10 mins)

Open your Azure Databricks workspace from the Azure Portal, or open Azure Databricks directly in your browser

Click the 'Clusters' icon on the left panel of the screen

Click '+ Create Cluster' and fill out the form as follows:

| Name | Value |
| --- | --- |
| Cluster Name | azhol92 |
| Cluster Mode | Standard Mode |
| Databricks Runtime Ver. | 4.2 |
| Python Ver. | 3 |
| Driver Type | Same as worker |
| Worker Type | Standard DS3 v2 |
| Workers | 2 |
| Enable autoscaling | Uncheck |
| Auto Termination | Check, 10 minutes |

(Screenshot: Create cluster)

3. Interact with data via Notebook

3.1 Import notebook

Click 'Workspace' on Azure Databricks Portal

Click 'Users', click the small icon next to your user name, and then click 'Import' to import an existing notebook into your Azure Databricks workspace

(Screenshot: Import menu)

Select 'URL', copy the URL below, and paste it into the import window:

https://raw.githubusercontent.com/xlegend1024/az-cloudscale-adv-analytics/master/AzureDatabricks/02.datawrangling.ipynb

(Screenshot: Import notebook dialog)

Click the 'Import' button; the notebook will open automatically in your browser

Click the 'Detached' menu and then select your cluster from the list.

(Screenshot: Attach notebook to cluster)

3.2. Update widget parameters for the lab

Find your blob storage account name and key, and then update the notebook's widget parameters (a sketch of the widgets follows below)

(Screenshot: Notebook widget parameters)
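
If you are curious how those parameters are wired up, Databricks notebooks expose them through widgets. The snippet below is a sketch only; the widget names in the imported notebook may differ, so use the names it actually defines.

```python
# Sketch of defining and reading notebook widgets for the storage settings.
# Widget names here are illustrative; check the imported notebook for the
# exact parameter names it expects.
dbutils.widgets.text("storage_account", "", "Blob storage account name")
dbutils.widgets.text("storage_key", "", "Blob storage account key")

storage_account = dbutils.widgets.get("storage_account")
storage_key = dbutils.widgets.get("storage_key")
```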

3.3 Run command from the notebook

Run commands in different languages (Python, Scala, R, SQL) to load data from blob storage into Databricks and wrangle it; a Python sketch follows below.
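
As one illustration, a Python cell that loads a CSV file from the mounted path into a Spark DataFrame could look like the sketch below; the file name is a placeholder, and the notebook's other cells do the equivalent with %scala, %r, and %sql magic commands.

```python
# Load a CSV file from the mounted blob path into a Spark DataFrame.
# The file name is a placeholder; use the dataset the notebook points at.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/azhol/source_data.csv"))   # hypothetical file name

display(df)       # preview the data in the notebook
df.printSchema()  # inspect the inferred schema
```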

3.4 Save final training dataset to blob

Run the command from the notebook on Databricks to write the final training dataset back to blob storage.
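
For reference, writing the wrangled DataFrame back to the mounted container might look like the sketch below; the transformation and output path are placeholders, and the notebook's own cell is the source of truth.

```python
# Persist the wrangled DataFrame as the final training dataset on blob storage.
df_training = df.dropna()                   # stand-in for the notebook's wrangling steps

(df_training.write
    .mode("overwrite")
    .option("header", "true")
    .csv("/mnt/azhol/training/"))           # hypothetical output folder
```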

3.5 (optional) Run Machine Learning in Databricks

Optionally, you can run machine learning directly in Databricks.
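
A hedged sketch of what such an optional step could look like with Spark MLlib is shown below; the column names and algorithm choice are illustrative only and may not match the lab notebook.

```python
# Optional sketch: train a simple Spark MLlib regression model on the saved dataset.
# Column names and the algorithm choice are illustrative, not the lab's actual code.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

train_df = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/mnt/azhol/training/"))

assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")  # hypothetical feature columns
lr = LinearRegression(featuresCol="features", labelCol="label")                        # hypothetical label column

model = Pipeline(stages=[assembler, lr]).fit(train_df)
model.transform(train_df).select("label", "prediction").show(5)
```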


Next > 03. Modeling


Main