Add ORC reader tutorial (#1465)
* Add ORC reader tutorial

* clean up notebook

* address comments

* address comments

* address comments

* address comment: remove outputs and add desc for dataset

* fix lint

* fix lint: Prefer second person instead of first person.

* address comments

* fix typo
oliverhu authored Jul 26, 2021
1 parent 77da6bc commit 0b6ee61
Showing 2 changed files with 335 additions and 1 deletion.
3 changes: 2 additions & 1 deletion docs/tutorials/_toc.yaml
Expand Up @@ -36,4 +36,5 @@ toc:
path: /io/tutorials/elasticsearch
- title: "Avro"
path: /io/tutorials/avro

- title: "ORC"
path: /io/tutorials/orc
333 changes: 333 additions & 0 deletions docs/tutorials/orc.ipynb
@@ -0,0 +1,333 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Tce3stUlHN0L"
},
"source": [
"##### Copyright 2021 The TensorFlow Authors."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"cellView": "form",
"id": "tuOe1ymfHZPu"
},
"outputs": [],
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qFdPvlXBOdUN"
},
"source": [
"# Apache ORC Reader"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MfBg1C5NB3X0"
},
"source": [
"<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://www.tensorflow.org/io/tutorials/orc\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/orc.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://github.com/tensorflow/io/blob/master/docs/tutorials/orc.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View on GitHub</a>\n",
" </td>\n",
" <td>\n",
" <a href=\"https://storage.googleapis.com/tensorflow_docs/io/docs/tutorials/orc.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xHxb-dlhMIzW"
},
"source": [
"## Overview\n",
"\n",
"Apache ORC is a popular columnar storage format. The tensorflow-io package provides a default implementation for reading [Apache ORC](https://orc.apache.org/) files."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MUXex9ctTuDB"
},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1Eh-iCRVBm0p"
},
"source": [
"Install the required packages, then restart the runtime.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "g7cxbf1-skn6"
},
"outputs": [],
"source": [
"!pip install tensorflow-io"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "IqR2PQG4ZaZ0"
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"import tensorflow_io as tfio"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EyHfC3nEzseN"
},
"source": [
"### Download a sample dataset file in ORC"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZjEeF6Fva8UO"
},
"source": [
"The dataset you will use here is the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris) from UCI. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It has 4 attributes: (1) sepal length, (2) sepal width, (3) petal length, (4) petal width, and the last column contains the class label."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "zaiXjZiXzrHs"
},
"outputs": [],
"source": [
"!curl -OL https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc\n",
"!ls -l iris.orc"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7DG9JTJ0-bzg"
},
"source": [
"## Create a dataset from the file"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"id": "ppFAjXAYsj-z"
},
"outputs": [],
"source": [
"dataset = tfio.IODataset.from_orc(\"iris.orc\", capacity=15).batch(1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4xPr3f4LVdeN"
},
"source": [
"Examine the dataset:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"id": "9B1QUKG70Lzs"
},
"outputs": [],
"source": [
"for item in dataset.take(1):\n",
"    print(item)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "03qncHJPVNK3"
},
"source": [
"The following sections walk through an end-to-end example of training a tf.keras model on the Iris dataset stored in ORC format."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tDkpKRMVcPfb"
},
"source": [
"### Data preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nDgkfWFRVjKz"
},
"source": [
"Configure which columns are features and which column is the label:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"id": "R1OYAybz07dr"
},
"outputs": [],
"source": [
"feature_cols = [\"sepal_length\", \"sepal_width\", \"petal_length\", \"petal_width\"]\n",
"label_cols = [\"species\"]\n",
"\n",
"# select feature columns\n",
"feature_dataset = tfio.IODataset.from_orc(\"iris.orc\", columns=feature_cols)\n",
"# select label columns\n",
"label_dataset = tfio.IODataset.from_orc(\"iris.orc\", columns=label_cols)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GSYMP48vVvV0"
},
"source": [
"Create a lookup table that maps species names to integer IDs for model training:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"id": "TQvuE7OgVs1q"
},
"outputs": [],
"source": [
"vocab_init = tf.lookup.KeyValueTensorInitializer(\n",
" keys=tf.constant([\"virginica\", \"versicolor\", \"setosa\"]),\n",
" values=tf.constant([0, 1, 2], dtype=tf.int64))\n",
"vocab_table = tf.lookup.StaticVocabularyTable(\n",
" vocab_init,\n",
" num_oov_buckets=4)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"id": "lpf0w41iWAZ4"
},
"outputs": [],
"source": [
"label_dataset = label_dataset.map(vocab_table.lookup)\n",
"dataset = tf.data.Dataset.zip((feature_dataset, label_dataset))\n",
"dataset = dataset.batch(1)\n",
"\n",
"def pack_features_vector(features, labels):\n",
"    \"\"\"Pack the features into a single array.\"\"\"\n",
"    features = tf.stack(list(features), axis=1)\n",
"    return features, labels\n",
"\n",
"dataset = dataset.map(pack_features_vector)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "R1Tyf3AodC2Y"
},
"source": [
"## Build, compile and train the model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oVB9Q0B-WDn4"
},
"source": [
"Finally, you are ready to build the model and train it! You will build a 3-layer Keras model to predict the class of the iris plant from the dataset you just processed."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"id": "tToy0FoOWG-9"
},
"outputs": [],
"source": [
"model = tf.keras.Sequential(\n",
" [\n",
" tf.keras.layers.Dense(\n",
" 10, activation=tf.nn.relu, input_shape=(4,)\n",
" ),\n",
" tf.keras.layers.Dense(10, activation=tf.nn.relu),\n",
" tf.keras.layers.Dense(3),\n",
" ]\n",
")\n",
"\n",
"model.compile(optimizer=\"adam\", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=[\"accuracy\"])\n",
"model.fit(dataset, epochs=5)"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"Tce3stUlHN0L"
],
"name": "orc.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
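
Note on the label lookup in the notebook above: the vocabulary table maps the three known species to the IDs 0, 1 and 2, and `num_oov_buckets=4` reserves four further IDs (3 to 6) for any string that is not in the vocabulary. The final `Dense(3)` layer can only represent classes 0 to 2, so training relies on `iris.orc` containing just the three listed species. The sketch below illustrates that ID layout in plain Python. It is not TensorFlow's implementation: the `lookup` helper is hypothetical, and the MD5-based bucket hash is a stand-in chosen for illustration, so the specific bucket an unknown string lands in will differ from what `tf.lookup.StaticVocabularyTable` returns.

```python
import hashlib

# Same vocabulary order and bucket count as the notebook cell.
VOCAB = ["virginica", "versicolor", "setosa"]
NUM_OOV_BUCKETS = 4


def lookup(species, vocab=VOCAB, num_oov_buckets=NUM_OOV_BUCKETS):
    """Return an integer ID for a species name (hypothetical helper)."""
    table = {name: idx for idx, name in enumerate(vocab)}
    if species in table:
        # In-vocabulary keys get their position in the vocabulary.
        return table[species]
    # Out-of-vocabulary keys are hashed into one of the extra buckets,
    # which sit directly after the vocabulary IDs.
    # Assumption: MD5 is only a stand-in for the library's own hash.
    digest = hashlib.md5(species.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_oov_buckets
    return len(vocab) + bucket


ids = [lookup(name) for name in ["virginica", "versicolor", "setosa"]]
print(ids)  # [0, 1, 2]

oov_id = lookup("unknown-species")
print(len(VOCAB) <= oov_id < len(VOCAB) + NUM_OOV_BUCKETS)  # True
```

If a different ORC file might contain unexpected label strings, one option is to check that every looked-up ID is below 3 before calling `model.fit`, since an ID of 3 or higher has no matching output unit.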
