From 0b6ee6194cd293516a95a05019136fa79ae2a407 Mon Sep 17 00:00:00 2001
From: Keqiu Hu <khu@apache.org>
Date: Mon, 26 Jul 2021 10:59:04 -0700
Subject: [PATCH] Add ORC reader tutorial (#1465)

* Add ORC reader tutorial

* clean up notebook

* address comments

* address comments

* address comments

* address comment: remove outputs and add desc for dataset

* fix lint

* fix lint: Prefer second person instead of first person.

* address comments

* fix typo
---
 docs/tutorials/_toc.yaml |   3 +-
 docs/tutorials/orc.ipynb | 333 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 335 insertions(+), 1 deletion(-)
 create mode 100644 docs/tutorials/orc.ipynb
diff --git a/docs/tutorials/_toc.yaml b/docs/tutorials/_toc.yaml
index 1c2ee891d..7edeee8bf 100644
--- a/docs/tutorials/_toc.yaml
+++ b/docs/tutorials/_toc.yaml
@@ -36,4 +36,5 @@ toc:
   path: /io/tutorials/elasticsearch
 - title: "Avro"
   path: /io/tutorials/avro
-
+- title: "ORC"
+  path: /io/tutorials/orc
diff --git a/docs/tutorials/orc.ipynb b/docs/tutorials/orc.ipynb
new file mode 100644
index 000000000..e94ac6cf3
--- /dev/null
+++ b/docs/tutorials/orc.ipynb
@@ -0,0 +1,333 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Tce3stUlHN0L"
+      },
+      "source": [
+        "##### Copyright 2021 The TensorFlow Authors."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {
+        "cellView": "form",
+        "id": "tuOe1ymfHZPu"
+      },
+      "outputs": [],
+      "source": [
+        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+        "# you may not use this file except in compliance with the License.\n",
+        "# You may obtain a copy of the License at\n",
+        "#\n",
+        "# https://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing, software\n",
+        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+        "# See the License for the specific language governing permissions and\n",
+        "# limitations under the License."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qFdPvlXBOdUN"
+      },
+      "source": [
+        "# Apache ORC Reader"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "MfBg1C5NB3X0"
+      },
+      "source": [
+        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://www.tensorflow.org/io/tutorials/orc\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/orc.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/io/blob/master/docs/tutorials/orc.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View on GitHub</a>\n",
+        "  </td>\n",
+        "  <td>\n",
+        "    <a href=\"https://storage.googleapis.com/tensorflow_docs/io/docs/tutorials/orc.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
+        "  </td>\n",
+        "</table>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "xHxb-dlhMIzW"
+      },
+      "source": [
+        "## Overview\n",
+        "\n",
+        "Apache ORC is a popular columnar storage format. The tensorflow-io package provides a default implementation for reading [Apache ORC](https://orc.apache.org/) files."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "MUXex9ctTuDB"
+      },
+      "source": [
+        "## Setup"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "1Eh-iCRVBm0p"
+      },
+      "source": [
+        "Install the required packages, and restart the runtime.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {
+        "id": "g7cxbf1-skn6"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install tensorflow-io"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "id": "IqR2PQG4ZaZ0"
+      },
+      "outputs": [],
+      "source": [
+        "import tensorflow as tf\n",
+        "import tensorflow_io as tfio"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "EyHfC3nEzseN"
+      },
+      "source": [
+        "### Download a sample dataset file in ORC format"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ZjEeF6Fva8UO"
+      },
+      "source": [
+        "The dataset you will use here is the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris) from UCI. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It has 4 attributes: (1) sepal length, (2) sepal width, (3) petal length, (4) petal width, and the last column contains the class label."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "id": "zaiXjZiXzrHs"
+      },
+      "outputs": [],
+      "source": [
+        "!curl -OL https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc\n",
+        "!ls -l iris.orc"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "7DG9JTJ0-bzg"
+      },
+      "source": [
+        "## Create a dataset from the file"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 35,
+      "metadata": {
+        "id": "ppFAjXAYsj-z"
+      },
+      "outputs": [],
+      "source": [
+        "dataset = tfio.IODataset.from_orc(\"iris.orc\", capacity=15).batch(1)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "4xPr3f4LVdeN"
+      },
+      "source": [
+        "Examine the dataset:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 42,
+      "metadata": {
+        "id": "9B1QUKG70Lzs"
+      },
+      "outputs": [],
+      "source": [
+        "for item in dataset.take(1):\n",
+        "    print(item)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "03qncHJPVNK3"
+      },
+      "source": [
+        "Next, walk through an end-to-end example of training a `tf.keras` model with an ORC dataset based on the iris dataset."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "tDkpKRMVcPfb"
+      },
+      "source": [
+        "### Data preprocessing"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "nDgkfWFRVjKz"
+      },
+      "source": [
+        "Configure which columns are features and which column is the label:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 47,
+      "metadata": {
+        "id": "R1OYAybz07dr"
+      },
+      "outputs": [],
+      "source": [
+        "feature_cols = [\"sepal_length\", \"sepal_width\", \"petal_length\", \"petal_width\"]\n",
+        "label_cols = [\"species\"]\n",
+        "\n",
+        "# select feature columns\n",
+        "feature_dataset = tfio.IODataset.from_orc(\"iris.orc\", columns=feature_cols)\n",
+        "# select label columns\n",
+        "label_dataset = tfio.IODataset.from_orc(\"iris.orc\", columns=label_cols)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "GSYMP48vVvV0"
+      },
+      "source": [
+        "Create a lookup table that maps species names to integer class indices for model training:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 48,
+      "metadata": {
+        "id": "TQvuE7OgVs1q"
+      },
+      "outputs": [],
+      "source": [
+        "vocab_init = tf.lookup.KeyValueTensorInitializer(\n",
+        "    keys=tf.constant([\"virginica\", \"versicolor\", \"setosa\"]),\n",
+        "    values=tf.constant([0, 1, 2], dtype=tf.int64))\n",
+        "vocab_table = tf.lookup.StaticVocabularyTable(\n",
+        "    vocab_init,\n",
+        "    num_oov_buckets=4)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 49,
+      "metadata": {
+        "id": "lpf0w41iWAZ4"
+      },
+      "outputs": [],
+      "source": [
+        "label_dataset = label_dataset.map(vocab_table.lookup)\n",
+        "dataset = tf.data.Dataset.zip((feature_dataset, label_dataset))\n",
+        "dataset = dataset.batch(1)\n",
+        "\n",
+        "def pack_features_vector(features, labels):\n",
+        "    \"\"\"Pack the features into a single array.\"\"\"\n",
+        "    features = tf.stack(list(features), axis=1)\n",
+        "    return features, labels\n",
+        "\n",
+        "dataset = dataset.map(pack_features_vector)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "R1Tyf3AodC2Y"
+      },
+      "source": [
+        "## Build, compile and train the model"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "oVB9Q0B-WDn4"
+      },
+      "source": [
+        "Finally, you are ready to build the model and train it! You will build a 3-layer Keras model to predict the class of the iris plant from the dataset you just processed."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 50,
+      "metadata": {
+        "id": "tToy0FoOWG-9"
+      },
+      "outputs": [],
+      "source": [
+        "model = tf.keras.Sequential(\n",
+        "    [\n",
+        "        tf.keras.layers.Dense(\n",
+        "            10, activation=tf.nn.relu, input_shape=(4,)\n",
+        "        ),\n",
+        "        tf.keras.layers.Dense(10, activation=tf.nn.relu),\n",
+        "        tf.keras.layers.Dense(3),\n",
+        "    ]\n",
+        ")\n",
+        "\n",
+        "model.compile(optimizer=\"adam\", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=[\"accuracy\"])\n",
+        "model.fit(dataset, epochs=5)"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [
+        "Tce3stUlHN0L"
+      ],
+      "name": "orc.ipynb",
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}