This repo illustrates how to create a machine learning pipeline in the cloud with the Python SDK. The pipeline develops a model that predicts a patient's blood pressure from several features, e.g. insulin, BMI, and more. It is based on the UC Irvine Diabetes dataset found below.
https://archive.ics.uci.edu/dataset/34/diabetes
This scenario solves a regression problem: predicting a continuous numeric value for blood pressure. The main purpose is to guide you through creating a pipeline using Azure Machine Learning and the Python SDK.
There are many ways to create pipelines in Azure Machine Learning. This repo uses a programmatic approach with the Python SDK (software development kit), which is ideal for developing and experimenting with pipelines. For production, I recommend the CLI and the AML extension, which help create a production-grade pipeline.
```mermaid
flowchart TB
subgraph pipeline with sdkv2
c1(data prep component) --> c2(training component)
end
```
The files are set up to follow a sequence of steps that mimics an ML workflow, starting with p01.
p01_development.ipynb
Assuming the role of an MLOps or ML engineer, we have received a notebook that creates an ML model.
p02_refactor.py
- We convert our notebook into a Python script (.py) because pipeline jobs require scripts. Note, however, that we will ultimately create two scripts, one for each component, found under the components folder.
- To make things easier to manage, we turn our steps into functions.
- This step also introduces the argparse package, which is used to pass data, or any other input such as a parameter, into the Python script (see the sketch below).
- Note that we have removed references to our workspace and data filepaths.
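Below is a minimal sketch of how argparse can be wired into the refactored script; the argument names and the test_size parameter are illustrative, not necessarily the ones used in p02_refactor.py.

```python
import argparse

import pandas as pd


def parse_args():
    # Define the inputs the script expects when it is launched as a pipeline job.
    parser = argparse.ArgumentParser(description="Refactored training script")
    parser.add_argument("--input_data", type=str, help="Path to the input CSV file")
    parser.add_argument("--test_size", type=float, default=0.2, help="Fraction of data held out for testing")
    return parser.parse_args()


def main(args):
    # Read the data from the path passed in by the pipeline instead of a hard-coded filepath.
    df = pd.read_csv(args.input_data)
    print(f"Loaded {len(df)} rows, test_size={args.test_size}")


if __name__ == "__main__":
    main(parse_args())
```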
```mermaid
flowchart LR
script:::bar --> notebook:::foobar --> component
classDef bar stroke:#0f0
classDef foobar stroke:#00f
```
We create components using two approaches: with the Python SDK we define the structure of our component programmatically, and in a second example we define a component with a YAML file.
Compute and environment can also be specified at the component level.
We start with Python SDK v2, the programmatic approach.
components/p03_data_prep.py
We take our functions from the prep stage of our p02_refactor.py script and make a few adjustments. This becomes a component that is independent of the training portion of our pipeline and is used to scale our features.
This is the staging point for data ingestion into our pipeline, so we define two arguments: one for the input, which is the data we ultimately want to use, and one for the output, which is the data that has been transformed and is ready for training.
Within our main() function we ultimately want to return our prepped data, so we adjust the function to write the data as a .csv and save it to the output argument named --prepped_data.
Arguments
- --input_data - argument that references a filepath/filename, as it is defined as a uri_file
- --prepped_data - argument that references a filepath, as it is defined as a uri_folder
We do not yet provide the filepath to our data or the location for the transformed data. At this point we are only creating the skeleton of our pipeline.
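A minimal sketch of what components/p03_data_prep.py could look like. Only the --input_data and --prepped_data arguments come from this repo; the scaling logic and the blood_pressure column name are assumptions.

```python
import argparse
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import StandardScaler


def main(args):
    # --input_data is a uri_file, so it points directly at a single CSV file.
    df = pd.read_csv(args.input_data)

    # Scale the feature columns; the target column name here is an assumption.
    features = df.drop(columns=["blood_pressure"])
    scaled = pd.DataFrame(StandardScaler().fit_transform(features), columns=features.columns)
    scaled["blood_pressure"] = df["blood_pressure"].values

    # --prepped_data is a uri_folder, so we write the transformed CSV into that folder.
    output_dir = Path(args.prepped_data)
    output_dir.mkdir(parents=True, exist_ok=True)
    scaled.to_csv(output_dir / "prepped_data.csv", index=False)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, help="Input CSV file (uri_file)")
    parser.add_argument("--prepped_data", type=str, help="Output folder for prepped data (uri_folder)")
    main(parser.parse_args())
```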
p03_create_data_prep_component.ipynb
We create the actual component in a notebook using the Python SDK's command() function. We also take an additional step and register this component in our AML workspace.
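A hedged sketch of what the notebook does with the SDK's command() function; the component name, environment, and workspace details below are placeholders.

```python
from azure.ai.ml import MLClient, command, Input, Output
from azure.identity import DefaultAzureCredential

# Placeholder workspace details; replace with your own.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Define the data prep component around the p03_data_prep.py script.
data_prep_component = command(
    name="p03_data_prep",
    display_name="Data prep for diabetes regression",
    inputs={"input_data": Input(type="uri_file")},
    outputs={"prepped_data": Output(type="uri_folder")},
    code="./components",
    command=(
        "python p03_data_prep.py "
        "--input_data ${{inputs.input_data}} "
        "--prepped_data ${{outputs.prepped_data}}"
    ),
    # Curated environment name is an assumption; any environment with pandas/sklearn works.
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)

# Register the component in the AML workspace so it can be reused in pipelines.
data_prep_component = ml_client.create_or_update(data_prep_component.component)
```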
```mermaid
flowchart LR
yaml:::foo & script:::bar --> notebook:::foobar --> component
classDef foo stroke:#f00
classDef bar stroke:#0f0
classDef foobar stroke:#00f
```
This second approach defines our component using a YAML file.
p04_create_register_component.ipynb
This notebook loads the YAML file and creates a component for us (note that the script is referenced in the YAML file).
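A minimal sketch of loading and registering the YAML-defined component, assuming an MLClient named ml_client is already configured as in the earlier sketch.

```python
from azure.ai.ml import load_component

# Load the component definition (inputs, outputs, command, environment) from the YAML file.
train_component = load_component(source="p04_training.yaml")

# Register it in the workspace alongside the data prep component.
train_component = ml_client.create_or_update(train_component)
```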
p04_training.yaml
YAML file holding the component's inputs, outputs, script reference, and environment.
p04_training.py
The actual script that is executed when the training component runs.
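A rough sketch of what a training script like p04_training.py could contain; the argument names, target column, and choice of model are illustrative assumptions.

```python
import argparse
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression


def main(args):
    # The prepped data arrives as a folder (uri_folder) produced by the data prep component.
    df = pd.read_csv(Path(args.prepped_data) / "prepped_data.csv")

    # Target column name is an assumption to match the data prep sketch.
    X = df.drop(columns=["blood_pressure"])
    y = df["blood_pressure"]

    # Fit a simple regression model; the repo's actual model may differ.
    model = LinearRegression().fit(X, y)

    # Persist the trained model to the output folder.
    output_dir = Path(args.model_output)
    output_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, output_dir / "model.joblib")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--prepped_data", type=str, help="Folder containing prepped_data.csv")
    parser.add_argument("--model_output", type=str, help="Folder to write the trained model to")
    main(parser.parse_args())
```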
p05_create_pipeline.ipynb
This last notebook creates our pipeline. Because we did not specify a compute type in our components, we specify the compute resource used to run the pipeline here. We could also define a different compute per component; it would make sense to use a GPU or Spark cluster for just the training component and CPU compute for data prep.
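A rough sketch of how the pipeline could be assembled with the dsl.pipeline decorator, assuming the data_prep_component and train_component objects registered in the earlier sketches and a hypothetical compute cluster named cpu-cluster.

```python
from azure.ai.ml import dsl, Input


# Compute name is a placeholder for a compute cluster in your workspace.
@dsl.pipeline(compute="cpu-cluster", description="Diabetes blood pressure regression pipeline")
def diabetes_pipeline(pipeline_input_data):
    # First component: scale the raw features.
    data_prep_job = data_prep_component(input_data=pipeline_input_data)

    # Second component: train on the prepped data (input/output names assumed from the sketches above).
    train_job = train_component(prepped_data=data_prep_job.outputs.prepped_data)
    return {"model_output": train_job.outputs.model_output}


# Point the pipeline at the raw diabetes CSV (path is a placeholder) and submit it.
pipeline_job = diabetes_pipeline(
    pipeline_input_data=Input(type="uri_file", path="<path-to-diabetes-data.csv>")
)
pipeline_job = ml_client.jobs.create_or_update(pipeline_job, experiment_name="diabetes-pipeline")
```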
Benefits
- we have a pipeline stood up primarily in Python
Drawbacks
- reproducibility is important, and the CLI is better suited for it