Copy editing Beam ML notebooks (apache#26226)

* Copy editing the Beam ML notebooks
* typo fixes
* updated note capitalization
* Update examples/notebooks/beam-ml/automatic_model_refresh.ipynb

Co-authored-by: Danny McCormick <dannymccormick@google.com>

---------

Co-authored-by: Danny McCormick <dannymccormick@google.com>
m-trieu · Apr 12, 2023 · 78db671
1 parent 0e8c3c2
commit 78db671

Showing 9 changed files with 171 additions and 151 deletions.
diff --git a/examples/notebooks/beam-ml/automatic_model_refresh.ipynb b/examples/notebooks/beam-ml/automatic_model_refresh.ipynb
@@ -15,16 +15,6 @@
     }
   },
   "cells": [
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "view-in-github",
-        "colab_type": "text"
-      },
-      "source": [
-        "<a href=\"https://colab.research.google.com/github/AnandInguva/beam/blob/notebook/beam/examples/notebooks/beam-ml/side_Input_model_updates.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
-      ]
-    },
     {
       "cell_type": "code",
       "source": [
@@ -57,7 +47,16 @@
     {
       "cell_type": "markdown",
       "source": [
-        "# Update ML models in running pipelines"
+        "# Update ML models in running pipelines\n",
+        "\n",
+        "<table align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/automatic_model_refresh.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/colab_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/automatic_model_refresh.ipynb\"><img src=\"https://raw.githubusercontent.com/google/or-tools/main/tools/github_32px.png\" />View source on GitHub</a>\n",
+        "  </td>\n",
+        "</table>\n"
       ],
       "metadata": {
         "id": "ZUSiAR62SgO8"
@@ -66,13 +65,13 @@
     {
       "cell_type": "markdown",
       "source": [
-        "The pipeline in this notebook uses a RunInference `PTransform` to run inference on images using TensorFlow models. To update the model, it uses a side input `PCollection` that emits `ModelMetadata`.\n",
-        "\n",
-        "You can use side inputs to update your model in real-time, even while the Apache Beam pipeline is running. The side input is passed in a `ModelHandler` configuration object. You can update the model either by leveraging one of Apache Beam's provided patterns, such as the `WatchFilePattern`, or by configuring a custom side input `PCollection` that defines the logic for the model update.\n",
+        "This notebook demonstrates how to perform automatic model updates without stopping your Apache Beam pipeline.\n",
+        "You can use side inputs to update your model in real time, even while the Apache Beam pipeline is running. The side input is passed in a `ModelHandler` configuration object. You can update the model either by leveraging one of Apache Beam's provided patterns, such as the `WatchFilePattern`, or by configuring a custom side input `PCollection` that defines the logic for the model update.\n",
         "\n",
+        "The pipeline in this notebook uses a RunInference `PTransform` with TensorFlow machine learning (ML) models to run inference on images. To update the model, it uses a side input `PCollection` that emits `ModelMetadata`.\n",
         "For more information about side inputs, see the [Side inputs](https://beam.apache.org/documentation/programming-guide/#side-inputs) section in the Apache Beam Programming Guide.\n",
         "\n",
-        "This example uses `WatchFilePattern` as a side input. `WatchFilePattern` is used to watch for the file updates matching the `file_pattern` based on timestamps. It emits the latest `ModelMetadata`, which is used in the RunInference `PTransform` to automatically update the ML model without stopping the Apache Beam pipeline.\n"
+        "This example uses `WatchFilePattern` as a side input. `WatchFilePattern` is used to watch for file updates that match the `file_pattern` based on timestamps. It emits the latest `ModelMetadata`, which is used in the RunInference `PTransform` to automatically update the ML model without stopping the Apache Beam pipeline.\n"
       ],
       "metadata": {
         "id": "tBtqF5UpKJNZ"
@@ -84,7 +83,7 @@
         "## Before you begin\n",
         "Install the dependencies required to run this notebook.\n",
         "\n",
-        "To use RunInference with side inputs for automatic model updates, install `Apache Beam` version `2.46.0` or later."
+        "To use RunInference with side inputs for automatic model updates, use Apache Beam version 2.46.0 or later."
       ],
       "metadata": {
         "id": "SPuXFowiTpWx"
@@ -147,11 +146,14 @@
     {
       "cell_type": "markdown",
       "source": [
-        "## Runner\n",
+        "## Configure the runner\n",
         "\n",
-        "This pipeline runs on the Dataflow Runner. Ensure that you have all the required permissions to run the pipeline on Dataflow.\n",
+        "This pipeline uses the Dataflow Runner. To run the pipeline, you need to complete the following tasks:\n",
         "\n",
-        "Configure the pipeline options for the pipeline to run on Dataflow. Make sure the pipeline is using streaming mode."
+        "* Ensure that you have all the required permissions to run the pipeline on Dataflow.\n",
+        "* Configure the pipeline options for the pipeline to run on Dataflow. Make sure the pipeline is using streaming mode.\n",
+        "\n",
+        "In the following code, replace `BUCKET_NAME` with the name of your Cloud Storage bucket."
       ],
       "metadata": {
         "id": "ORYNKhH3WQyP"
@@ -172,7 +174,7 @@
         "# Set the Google Cloud region that you want to run Dataflow in.\n",
         "options.view_as(GoogleCloudOptions).region = 'us-central1'\n",
         "\n",
-        "# IMPORTANT: Update the following line to choose a Cloud Storage location.\n",
+        "# IMPORTANT: Replace BUCKET_NAME with the name of your Cloud Storage bucket.\n",
         "dataflow_gcs_location = \"gs://BUCKET_NAME/tmp/\"\n",
         "\n",
         "# The Dataflow staging location. This location is used to stage the Dataflow pipeline and the SDK binary.\n",
@@ -220,10 +222,12 @@
     {
       "cell_type": "markdown",
       "source": [
-        "## TensorFlow ModelHandler\n",
-        " This example uses `TFModelHandlerTensor` as the model handler and the `resnet_101` model trained on imagenet as our initial model used for inference.\n",
+        "## Use the TensorFlow model handler\n",
+        " This example uses `TFModelHandlerTensor` as the model handler and the `resnet_101` model trained on [ImageNet](https://www.image-net.org/).\n",
+        "\n",
+        " Download the model from [Google Cloud Storage](https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet101_weights_tf_dim_ordering_tf_kernels.h5) (link downloads the model), and place it in the directory that you want to use to update your model.\n",
         "\n",
-        " Download the model from [Google Cloud Storage](https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet101_weights_tf_dim_ordering_tf_kernels.h5) (link downloads the model), and place it in the directory that you want to use to update your model."
+        "In the following code, replace `BUCKET_NAME` with the name of your Cloud Storage bucket."
       ],
       "metadata": {
         "id": "_AUNH_GJk_NE"
@@ -319,8 +323,7 @@
       "source": [
         "1. Create a `PeriodicImpulse` transform, which emits output every `n` seconds. The `PeriodicImpulse` transform generates an infinite sequence of elements with a given runtime interval.\n",
         "\n",
-        "  In this example, `PeriodicImpulse` mimics the Pub/Sub source. Because the inputs in a streaming pipeline arrive in intervals, use `PeriodicImpulse` to output elements at `m` intervals.\n",
-        "\n",
+        "   In this example, `PeriodicImpulse` mimics the Pub/Sub source. Because the inputs in a streaming pipeline arrive in intervals, use `PeriodicImpulse` to output elements at `m` intervals.\n",
         "To learn more about `PeriodicImpulse`, see the [`PeriodicImpulse` code](https://github.com/apache/beam/blob/9c52e0594d6f0e59cd17ee005acfb41da508e0d5/sdks/python/apache_beam/transforms/periodicsequence.py#L150)."
       ],
       "metadata": {
@@ -353,7 +356,7 @@
       "source": [
         "2. To read and pre-process the images, use the `read_image` function. This example uses `Cat-with-beanie.jpg` for all inferences.\n",
         "\n",
-        "  **Note**: Image used for prediction is licensed in CC-BY, creator in listed in the [LICENSE.txt](https://storage.googleapis.com/apache-beam-samples/image_captioning/LICENSE.txt) file."
+        "  **Note**: The image used for prediction is licensed under CC-BY. The creator is listed in the [LICENSE.txt](https://storage.googleapis.com/apache-beam-samples/image_captioning/LICENSE.txt) file."
       ],
       "metadata": {
         "id": "8-sal2rFAxP2"
@@ -385,7 +388,8 @@
       "cell_type": "markdown",
       "source": [
         "3. Pass the images to the RunInference `PTransform`. RunInference takes `model_handler` and `model_metadata_pcoll` as input parameters.\n",
-        "  * `model_metadata_pcoll` is a [side input](https://beam.apache.org/documentation/programming-guide/#side-inputs) `PCollection` to the RunInference `PTransform`. This side input is used to update the `model_uri` in the `model_handler` without needing to stop the Apache Beam pipeline. Use `WatchFilePattern` as side input to watch a `file_pattern` matching `.h5` files. In this case, the `file_pattern` is `'gs://BUCKET_NAME/*.h5'`.\n",
+        "  * `model_metadata_pcoll` is a side input `PCollection` to the RunInference `PTransform`. This side input is used to update the `model_uri` in the `model_handler` without needing to stop the Apache Beam pipeline.\n",
+        "  * Use `WatchFilePattern` as side input to watch a `file_pattern` matching `.h5` files. In this case, the `file_pattern` is `'gs://BUCKET_NAME/*.h5'`.\n",
         "\n"
       ],
       "metadata": {
@@ -418,8 +422,7 @@
       "cell_type": "markdown",
       "source": [
         "4. Post-process the `PredictionResult` object.\n",
-        "\n",
-        "  When the inference is complete, RunInference outputs a `PredictionResult` object that contains the fields `example`, `inference`, and `model_id`. The `model_id` field identifies the model used to run the inference. The `PostProcessor` returns the predicted label and the model ID used to run the inference on the predicted label."
+        "When the inference is complete, RunInference outputs a `PredictionResult` object that contains the fields `example`, `inference`, and `model_id`. The `model_id` field identifies the model used to run the inference. The `PostProcessor` returns the predicted label and the model ID used to run the inference on the predicted label."
       ],
       "metadata": {
         "id": "lTA4wRWNDVis"
@@ -442,9 +445,9 @@
     {
       "cell_type": "markdown",
       "source": [
-        "**How to watch for the automatic model update**\n",
+        "### Watch for the model update\n",
         "\n",
-        "  After the pipeline starts processing data and when you see output emitted from the RunInference `PTransform`, upload a `resnet152` model saved in `.h5` format to a Google Cloud Storage bucket location that matches the `file_pattern` you defined earlier. You can download a copy of the model by clicking [this link](https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet152_weights_tf_dim_ordering_tf_kernels.h5). RunInference uses `WatchFilePattern` as a side input to update the `model_uri` of `TFModelHandlerTensor`."
+        "After the pipeline starts processing data and when you see output emitted from the RunInference `PTransform`, upload a `resnet152` model saved in `.h5` format to a Google Cloud Storage bucket location that matches the `file_pattern` you defined earlier. You can [download a copy of the model](https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet152_weights_tf_dim_ordering_tf_kernels.h5) (link downloads the model). RunInference uses `WatchFilePattern` as a side input to update the `model_uri` of `TFModelHandlerTensor`."
       ],
       "metadata": {
         "id": "wYp-mBHHjOjA"
@@ -453,7 +456,9 @@
     {
       "cell_type": "markdown",
       "source": [
-        "## Run the pipeline"
+        "## Run the pipeline\n",
+        "\n",
+        "Use the following code to run the pipeline."
       ],
       "metadata": {
         "id": "_ty03jDnKdKR"

diff --git a/examples/notebooks/beam-ml/run_inference_multi_model.ipynb b/examples/notebooks/beam-ml/run_inference_multi_model.ipynb
@@ -72,7 +72,7 @@
         "\n",
         "For more information about the RunInference API, review the [RunInference notebook](https://colab.research.google.com/drive/111USL4VhUa0xt_mKJxl5nC1YLOC8_yF4?usp=sharing#scrollTo=746b67a7-3562-467f-bea3-d8cd18c14927).\n",
         "\n",
-        "**Note:** all images are licensed CC-BY, creators are listed in the [LICENSE.txt](https://storage.googleapis.com/apache-beam-samples/image_captioning/LICENSE.txt) file."
+        "**Note:** All images are licensed CC-BY, and creators are listed in the [LICENSE.txt](https://storage.googleapis.com/apache-beam-samples/image_captioning/LICENSE.txt) file."
       ],
       "metadata": {
         "id": "6vZWSLyuM_P4"

diff --git a/examples/notebooks/beam-ml/run_inference_pytorch.ipynb b/examples/notebooks/beam-ml/run_inference_pytorch.ipynb
@@ -68,7 +68,7 @@
         "id": "A8xNRyZMW1yK"
       },
       "source": [
-        "This notebook demonstrates the use of the RunInference transform for PyTorch. Apache Beam includes implementations of the [ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler) class for [users of PyTorch](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.pytorch_inference.html). For more information about the RunInference API, see [Machine Learning](https://beam.apache.org/documentation/sdks/python-machine-learning) in the Apache Beam documentation.\n",
+        "This notebook demonstrates the use of the RunInference transform for PyTorch. Apache Beam includes implementations of the [ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler) class for [users of PyTorch](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.pytorch_inference.html). For more information about using RunInference, see [Get started with AI/ML pipelines](https://beam.apache.org/documentation/ml/overview/) in the Apache Beam documentation.\n",
         "\n",
         "\n",
         "This notebook illustrates common RunInference patterns, such as:\n",

diff --git a/examples/notebooks/beam-ml/run_inference_sklearn.ipynb b/examples/notebooks/beam-ml/run_inference_sklearn.ipynb
@@ -69,7 +69,7 @@
       },
       "source": [
         "This notebook demonstrates the use of the RunInference transform for [scikit-learn](https://scikit-learn.org/), also called sklearn.\n",
-        "Apache Beam [RunInference](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference) has implementations of the [ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler) class prebuilt for scikit-learn. For more information about the RunInference API, see [Machine Learning](https://beam.apache.org/documentation/sdks/python-machine-learning) in the Apache Beam documentation.\n",
+        "Apache Beam [RunInference](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference) has implementations of the [ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler) class prebuilt for scikit-learn. For more information about using RunInference, see [Get started with AI/ML pipelines](https://beam.apache.org/documentation/ml/overview/) in the Apache Beam documentation.\n",
         "\n",
         "You can choose the appropriate model handler based on your input data type:\n",
         "* [NumPy model handler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.sklearn_inference.html#apache_beam.ml.inference.sklearn_inference.SklearnModelHandlerNumpy)\n",