Commit f79eadd

Preparing the data processing notebooks for import to devsite (apache#29937)

* Preparing the data processing notebooks for import to devsite

* Update the title casing for the notebook link in the readme
rszper authored Jan 5, 2024
1 parent 0b0d973 commit f79eadd
Showing 5 changed files with 57 additions and 64 deletions.
19 changes: 11 additions & 8 deletions examples/notebooks/beam-ml/README.md
@@ -16,7 +16,7 @@
specific language governing permissions and limitations
under the License.
-->
-# ML Sample Notebooks
+# ML sample notebooks

Starting with the Apache Beam SDK version 2.40, users have access to a
[RunInference](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference)
@@ -27,13 +27,13 @@ The model handler abstracts the user from the configuration needed for
specific frameworks, such as Tensorflow, PyTorch, and others. For a full list of supported frameworks,
see the [About Beam ML](https://beam.apache.org/documentation/ml/about-ml/) page.

-## Using The Notebooks
+## Use the notebooks

These notebooks illustrate ways to use Apache Beam's RunInference transforms, as well as different
use cases for [`ModelHandler`](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler) implementations.
Beam comes with [multiple `ModelHandler` implementations](https://beam.apache.org/documentation/ml/about-ml/#modify-a-python-pipeline-to-use-an-ml-model).
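
A minimal sketch of the pattern (the bucket path and input values are placeholders, not from these notebooks):

```python
import numpy

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

# The model handler hides the framework-specific loading and prediction
# details; a PyTorch or TensorFlow handler slots in the same way.
model_handler = SklearnModelHandlerNumpy(
    model_uri='gs://your-bucket/sklearn_model.pkl')  # placeholder path

with beam.Pipeline() as pipeline:
  _ = (
      pipeline
      | 'CreateExamples' >> beam.Create([numpy.array([1.0, 2.0])])
      | 'RunInference' >> RunInference(model_handler)
      | 'Print' >> beam.Map(print))  # emits PredictionResult objects
```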

-### Loading the Notebooks
+### Load the notebooks

1. To get started quickly with notebooks, use [Colab](https://colab.sandbox.google.com/).
2. In Colab, open the notebook from GitHub using the notebook URL, for example:
@@ -48,6 +48,14 @@ to your project and bucket.

This section contains the following example notebooks.

+### Data processing
+
+* [Generate text embeddings by using the Vertex AI API](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb)
+* [Generate text embeddings by using Hugging Face Hub models](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb)
+* [Compute and apply vocabulary on a dataset](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/compute_and_apply_vocab.ipynb)
+* [Use MLTransform to scale data](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/scale_data.ipynb)
+* [Preprocessing with the Apache Beam DataFrames API](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb)
+
### Prediction and inference with pretrained models

* [Apache Beam RunInference for PyTorch](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_pytorch.ipynb)
@@ -85,8 +93,3 @@
### Model Evaluation

* [Use TFMA to evaluate and compare model performance](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/tfma_beam.ipynb)
-
-### Data processing
-
-* [Preprocess data with MLTransform](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/mltransform_basic.ipynb)
-* [Preprocessing with the Apache Beam DataFrames API](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb)
@@ -45,7 +45,7 @@
{
"cell_type": "markdown",
"source": [
"# Compute and Apply Vocabulary on a dataset using `MLTransform`\n",
"# Compute and apply vocabulary on a dataset\n",
"\n",
"<table align=\"left\">\n",
" <td>\n",
@@ -63,12 +63,12 @@
{
"cell_type": "markdown",
"source": [
"[ComputeAndApplyVocabulary](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html#apache_beam.ml.transforms.tft.ComputeAndApplyVocabulary) is a data processing transform that computes a unique vocabulary from a dataset and then maps each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning tasks.\n",
"[`ComputeAndApplyVocabulary`](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html#apache_beam.ml.transforms.tft.ComputeAndApplyVocabulary) is a data processing transform that computes a unique vocabulary from a dataset and then maps each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning (ML) tasks.\n",
"\n",
"Generating a vocabulary on the incoming dataset is a crucial preprocessing step while training machine learning models that deal with text data. By mapping words to numerical indices, the vocabulary reduces the complexity and dimensionality of dataset, allowing ML models to process the same words in a consistent way.\n",
"When you train ML models that use text data, generating a vocabulary on the incoming dataset is a crucial preprocessing step. By mapping words to numerical indices, the vocabulary reduces the complexity and dimensionality of dataset. This step allows ML models to process the same words in a consistent way.\n",
"\n",
"This notebook shows how to use `MLTransform` to complete the following tasks:\n",
"* Use `write` mode in `MLTransform` to generate a vocabulary on the input text and assign an index value to each token.\n",
"* Use `write` mode to generate a vocabulary on the input text and assign an index value to each token.\n",
"* Use `read` mode to use the generated vocabulary and assign an index to a different dataset.\n",
"\n",
"`MLTransform` uses the `ComputeAndApplyVocabulary` transform, which is implemented by using `tensorflow_transform` to generate the vocabulary.\n",
@@ -120,15 +120,13 @@
{
"cell_type": "markdown",
"source": [
"### Artifact location\n",
"## Artifact location\n",
"\n",
"Artifact location are used to store the artifacts, such as vocabulary file generated by `ComputeAndApplyVocabulary`, in `MLTransform` write mode.\n",
"In `write` mode, the artifact location is used to store artifacts, such as the vocabulary file generated by `ComputeAndApplyVocabulary`.\n",
"\n",
"**NOTE**: Artifact location must be empty otherwise a `RuntimeError` will be raised.\n",
"**NOTE**: The artifact location must be empty. If it isn't empty, a `RuntimeError` occurs.\n",
"\n",
"During the `MLTransform` read mode, `MLTransform` will fetch artifacts from the specified artifact location.\n",
"\n",
"**NOTE**: In read mode, make sure to pass the same artifact location that was used in write mode. Otherwise, it could result in `RuntimeError` or `MLTransform` will produce unexpected results in read mode.\n"
"In `read` mode, `MLTransform` fetches artifacts from the specified artifact location. Pass the same artifact location that you used in `write` mode. Otherwise, a `RuntimeError` occurs or `MLTransform` produces unexpected results in `read` mode.\n"
],
"metadata": {
"id": "vfarBxAMFvRA"
@@ -165,9 +163,9 @@
{
"cell_type": "markdown",
"source": [
"In this example, `MLTransform` in `write` mode uses `ComputeAndApplyVocabulary` to generate vocabulary on the incoming dataset. The incoming text data is split into tokens and each token is assigned an unique index.\n",
"In this example, in `write` mode, `MLTransform` uses `ComputeAndApplyVocabulary` to generate vocabulary on the incoming dataset. The incoming text data is split into tokens and each token is assigned an unique index.\n",
"\n",
" The generated vocabulary is stored in an artifact location that you can use on a different dataset in `read` mode with `MLTransform`."
" The generated vocabulary is stored in an artifact location that you can use on a different dataset in `read` mode."
],
"metadata": {
"id": "oETBJNVfRws_"
@@ -210,15 +208,15 @@
{
"cell_type": "markdown",
"source": [
"### Understanding and Visualizing Vocabulary in Data Processing\n",
"## Understand and visualize vocabulary\n",
"\n",
"When working with text data in machine learning, one common step is the generation of a vocabulary index. This process is effectively demonstrated through the `MLTransform` using `ComputeAndApplyVocabulary` transformation. Here, each unique word in your text data is assigned a specific index. This index is then used to represent the text in a numerical format, which is essential for machine learning algorithms.\n",
"When working with text data in machine learning, one common step is the generation of a vocabulary index. `MLTransform` completes this step by using the `ComputeAndApplyVocabulary` transformation. Each unique word in your text data is assigned a specific index. This index is then used to represent the text in a numerical format, which is needed for machine learning algorithms.\n",
"\n",
"In the provided example, the `ComputeAndApplyVocabulary` transformation is applied to the `feature` column, creating a vocabulary index for each unique word found in this column.\n",
"In this example, the `ComputeAndApplyVocabulary` transformation is applied to the `feature` column. A vocabulary index is created for each unique word found in this column.\n",
"\n",
"To visualize and understand this generated vocabulary, you can use the `ArtifactsFetcher` class. This class allows you to retrieve the vocabulary list from your specified location. Once you have this list, you can easily see the index associated with each word in your vocabulary. This index corresponds to the numerical representation used in the transformation output of `ComputeAndApplyVocabulary`.\n",
"To visualize and understand this generated vocabulary, use the `ArtifactsFetcher` class. This class allows you to retrieve the vocabulary list from your specified location. When you have this list, you can see the index associated with each word in your vocabulary. This index corresponds to the numerical representation used in the transformation output of `ComputeAndApplyVocabulary`.\n",
"\n",
"By examining this vocabulary index, you gain insight into how your text data is being processed and represented numerically. This understanding is crucial for debugging and improving your machine learning models that rely on text data."
"Examine this vocabulary index to understand how your text data is being processed and represented numerically. This understanding is useful for debugging and improving machine learning models that rely on text data."
],
"metadata": {
"id": "hvTvzOw8iBi9"
@@ -272,7 +270,7 @@
{
"cell_type": "markdown",
"source": [
"### Frequency Threshold\n",
"## Frequency Threshold\n",
"\n",
"The `frequency_threshold` parameter identifies the elements that appear frequently in the dataset. This parameter limits the generated vocabulary to elements with an absolute frequency greater than or equal to the specified threshold. If you don't specify the parameter, the entire vocabulary is generated.\n",
"\n",
@@ -319,7 +317,7 @@
{
"cell_type": "markdown",
"source": [
"In the above output, if the frequency of the token is less than the specified frequency, it is assigned to a `default_value` of `-1`. For the other tokens, a vocabulary file is generated."
"In the output, if the frequency of the token is less than the specified frequency, it is assigned to a `default_value` of `-1`. For the other tokens, a vocabulary file is generated."
],
"metadata": {
"id": "h1s4a6hzxKrb"
@@ -361,7 +359,7 @@
"source": [
"## `MLTransform` for inference workloads\n",
"\n",
"When `MLTransform` is in `write` mode, it produces artifacts, such as vocabulary files for `ComputeAndApplyVocabulary`. This allows you to ensure that you are applying the same vocabulary (and any other preprocessing transforms you apply) when you are training your model and serving it in production or testing its accuracy.\n",
"When `MLTransform` is in `write` mode, it produces artifacts, such as vocabulary files for `ComputeAndApplyVocabulary`. These artifacts allow you to apply the same vocabulary, and any other preprocessing transforms, when you train your model and serve it in production, or when you test its accuracy.\n",
"\n",
"When `MLTransform` is used `read` mode, it uses the previously generated vocabulary files to map the incoming text data. If the incoming vocabulary isn't found in the generated vocabulary, then the incoming vocabulary is mapped to a `default_value` provided during `write` mode. In this case, the `default_value` is `-1`.\n",
"\n",
@@ -63,12 +63,9 @@
{
"cell_type": "markdown",
"source": [
"\n",
"## Text embeddings\n",
"\n",
"Use text embeddings to represent text as numerical vectors. This process lets computers understand and process text data, which is essential for many natural language processing (NLP) tasks.\n",
"\n",
"### Uses of text embeddings\n",
"The following NLP tasks use embeddings:\n",
"\n",
"* **Semantic search:** Find documents or passages that are relevant to a query when the query doesn't use the exact same words as the documents.\n",
@@ -148,7 +145,7 @@
"The following text inputs come from the Hugging Face blog [Getting Started With Embeddings](https://huggingface.co/blog/getting-started-with-embeddings).\n",
"\n",
"\n",
"`MLTransform` operates on dictionaries of data. To generate embeddings for specific columns, provide the column names as input to the `columns` argument in the `SentenceTransformerEmbeddings` package.\""
"`MLTransform` operates on dictionaries of data. To generate embeddings for specific columns, provide the column names as input to the `columns` argument in the `SentenceTransformerEmbeddings` package."
],
"metadata": {
"id": "Dbkmu3HP6Kql"