From 0b6ee6194cd293516a95a05019136fa79ae2a407 Mon Sep 17 00:00:00 2001 From: Keqiu Hu Date: Mon, 26 Jul 2021 10:59:04 -0700 Subject: [PATCH] Add ORC reader tutorial (#1465) * Add ORC reader tutorial * clean up notebook * address comments * address comments * address comments * address comment: remove outputs and add desc for dataset * fix lint * fix lint: Prefer second person instead of first person. * address comments * fix typo --- docs/tutorials/_toc.yaml | 3 +- docs/tutorials/orc.ipynb | 333 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 335 insertions(+), 1 deletion(-) create mode 100644 docs/tutorials/orc.ipynb diff --git a/docs/tutorials/_toc.yaml b/docs/tutorials/_toc.yaml index 1c2ee891d..7edeee8bf 100644 --- a/docs/tutorials/_toc.yaml +++ b/docs/tutorials/_toc.yaml @@ -36,4 +36,5 @@ toc: path: /io/tutorials/elasticsearch - title: "Avro" path: /io/tutorials/avro - +- title: "ORC" + path: /io/tutorials/orc diff --git a/docs/tutorials/orc.ipynb b/docs/tutorials/orc.ipynb new file mode 100644 index 000000000..e94ac6cf3 --- /dev/null +++ b/docs/tutorials/orc.ipynb @@ -0,0 +1,333 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Tce3stUlHN0L" + }, + "source": [ + "##### Copyright 2021 The TensorFlow Authors." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "cellView": "form", + "id": "tuOe1ymfHZPu" + }, + "outputs": [], + "source": [ + "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qFdPvlXBOdUN" + }, + "source": [ + "# Apache ORC Reader" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MfBg1C5NB3X0" + }, + "source": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " View on TensorFlow.org\n", + " \n", + " Run in Google Colab\n", + " \n", + " View on GitHub\n", + " \n", + " Download notebook\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xHxb-dlhMIzW" + }, + "source": [ + "## Overview\n", + "\n", + "Apache ORC is a popular columnar storage format. The tensorflow-io package provides a default implementation for reading [Apache ORC](https://orc.apache.org/) files." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MUXex9ctTuDB" + }, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Eh-iCRVBm0p" + }, + "source": [ + "Install the required packages, and restart the runtime.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "g7cxbf1-skn6" + }, + "outputs": [], + "source": [ + "!pip install tensorflow-io" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "IqR2PQG4ZaZ0" + }, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "import tensorflow_io as tfio" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EyHfC3nEzseN" + }, + "source": [ + "### Download a sample dataset file in ORC" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZjEeF6Fva8UO" + }, + "source": [ + "The dataset you will use here is the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris) from UCI. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It has 4 attributes: (1) sepal length, (2) sepal width, (3) petal length, (4) petal width, and the last column contains the class label."
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "zaiXjZiXzrHs" + }, + "outputs": [], + "source": [ + "!curl -OL https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc\n", + "!ls -l iris.orc" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7DG9JTJ0-bzg" + }, + "source": [ + "## Create a dataset from the file" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "id": "ppFAjXAYsj-z" + }, + "outputs": [], + "source": [ + "dataset = tfio.IODataset.from_orc(\"iris.orc\", capacity=15).batch(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4xPr3f4LVdeN" + }, + "source": [ + "Examine the dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "id": "9B1QUKG70Lzs" + }, + "outputs": [], + "source": [ + "for item in dataset.take(1):\n", + " print(item)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "03qncHJPVNK3" + }, + "source": [ + "Next, walk through an end-to-end example of training a tf.keras model with an ORC dataset based on the Iris data set."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tDkpKRMVcPfb" + }, + "source": [ + "### Data preprocessing" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nDgkfWFRVjKz" + }, + "source": [ + "Configure which columns are features and which column is the label:" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "id": "R1OYAybz07dr" + }, + "outputs": [], + "source": [ + "feature_cols = [\"sepal_length\", \"sepal_width\", \"petal_length\", \"petal_width\"]\n", + "label_cols = [\"species\"]\n", + "\n", + "# select feature columns\n", + "feature_dataset = tfio.IODataset.from_orc(\"iris.orc\", columns=feature_cols)\n", + "# select label columns\n", + "label_dataset = tfio.IODataset.from_orc(\"iris.orc\", columns=label_cols)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GSYMP48vVvV0" + }, + "source": [ + "Build a lookup table that maps species names to integer class IDs for model training:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "id": "TQvuE7OgVs1q" + }, + "outputs": [], + "source": [ + "vocab_init = tf.lookup.KeyValueTensorInitializer(\n", + " keys=tf.constant([\"virginica\", \"versicolor\", \"setosa\"]),\n", + " values=tf.constant([0, 1, 2], dtype=tf.int64))\n", + "vocab_table = tf.lookup.StaticVocabularyTable(\n", + " vocab_init,\n", + " num_oov_buckets=4)" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "id": "lpf0w41iWAZ4" + }, + "outputs": [], + "source": [ + "label_dataset = label_dataset.map(vocab_table.lookup)\n", + "dataset = tf.data.Dataset.zip((feature_dataset, label_dataset))\n", + "dataset = dataset.batch(1)\n", + "\n", + "def pack_features_vector(features, labels):\n", + " \"\"\"Pack the features into a single array.\"\"\"\n", + " features = tf.stack(list(features), axis=1)\n", + " return features, labels\n", + "\n", + "dataset = dataset.map(pack_features_vector)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"R1Tyf3AodC2Y" + }, + "source": [ + "## Build, compile and train the model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oVB9Q0B-WDn4" + }, + "source": [ + "Finally, you are ready to build the model and train it! You will build a 3-layer Keras model to predict the class of the iris plant from the dataset you just processed." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": { + "id": "tToy0FoOWG-9" + }, + "outputs": [], + "source": [ + "model = tf.keras.Sequential(\n", + " [\n", + " tf.keras.layers.Dense(\n", + " 10, activation=tf.nn.relu, input_shape=(4,)\n", + " ),\n", + " tf.keras.layers.Dense(10, activation=tf.nn.relu),\n", + " tf.keras.layers.Dense(3),\n", + " ]\n", + ")\n", + "\n", + "model.compile(optimizer=\"adam\", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=[\"accuracy\"])\n", + "model.fit(dataset, epochs=5)" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [ + "Tce3stUlHN0L" + ], + "name": "orc.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}