<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Introduction to Hadoop</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://citi.clemson.edu" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/paw.gif" width="100px" height="auto" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Introduction to Hadoop</h1></a>
<h2 class="subtitle">Interacting with Hadoop</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Learn how to access the web-based Jupyter notebook.</li>
<li>Learn how to use the Hadoop command in Jupyter shells.</li>
<li>Learn how to access the web UI of the Hadoop Distributed File System.</li>
</ul>
</div>
</section>
<p>In this workshop, we will leverage the Jupyter infrastructure at Clemson University to directly interact with Hadoop.</p>
<h2 id="jupyter">Jupyter</h2>
<p>To start using the Jupyter notebook, go to <strong>https://webapp01-ext.palmetto.clemson.edu:8443</strong> and sign in with your Clemson credentials. Next, click <strong>Start My Server</strong> to spawn a new Jupyter notebook. You should see the contents of your home directory on Palmetto under <strong>Files</strong>. <br> <img src="fig/jupyter/login.png" alt="Login" style="height:300px"> <br> Under <strong>New</strong>, create a new folder. This folder will appear immediately in your home directory with the name <strong>Untitled Folder</strong>. Check the selection box next to this folder, and a button called <strong>Rename</strong> will appear below the <strong>Files</strong> tab. Click this button to give the folder a name of your choice, then click on the folder to go to the next level. <br> <img src="fig/jupyter/folder.png" alt="Create New Folder" style="height:300px"> <br> Use the menu under <strong>New</strong> once again to create a new Jupyter notebook using Python 3.0 distributed through Anaconda 2.5.0 by Continuum. <br> <img src="fig/jupyter/notebook.png" alt="Create New Notebook" style="height:300px"> <br> Change the name of this notebook to “Introduction to Hadoop”. <br> <img src="fig/jupyter/notebook-2.png" alt="Rename Notebook" style="height:300px"> <br> For this workshop, code inside a cell is interpreted as Python by default. However, any line that begins with <strong>!</strong> is executed as a Linux system command. <br></p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">print</span> <span class="st">"Hello World"</span></code></pre></div>
<pre class="output"><code>Hello World</code></pre>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ls</span> -l /</code></pre></div>
<pre class="output"><code>total 424
dr-xr-xr-x. 2 root root 4096 Feb 26 04:18 bin
drwxrws--- 30 root bioengr 4096 Apr 18 18:31 bioengr
dr-xr-xr-x. 4 root root 4096 Jan 26 17:22 boot
drwxr-xr-x. 10 root root 4096 Oct 17 2014 cgroup
drwxr-xr-x 3 root root 4096 Aug 4 2015 cgroups_test
drwxr-xr-x 24 root root 4096 Apr 27 16:54 common
lrwxrwxrwx. 1 root root 7 Oct 17 2014 common1 -> /common
drwxrwxr-x 17 root cugi 4096 Mar 4 16:27 cugi
drwxr-xr-x 18 root root 4240 Feb 26 14:09 dev
drwxr-xr-x. 148 root root 12288 May 10 14:49 etc
drwxr-xr-x. 2 root root 4096 Oct 17 2014 fast
drwxrws--- 24 root feltus 4096 Apr 8 11:17 feltus
drwxr-xr-x 3 root root 4096 Apr 26 2015 hadoop
drwxr-xr-x 17 root root 4096 Dec 2 11:31 hdp_service_accounts
drwxr-xr-x 1845 root root 77824 May 10 14:47 home
dr-xr-xr-x. 13 root root 4096 Feb 26 04:18 lib
dr-xr-xr-x. 10 root root 12288 Feb 26 04:18 lib64
drwxr-xr-x. 4 root root 4096 Apr 26 2015 local
drwx------. 2 root root 16384 Oct 17 2014 lost+found
drwxr-xr-x. 2 root root 4096 Jan 28 17:10 media
drwxr-xr-x 8 root root 4096 May 16 2014 misc
drwxr-xr-x. 2 root root 4096 Jul 20 2011 mnt
drwxr-xr-x 3 nagios nagios 4096 Oct 10 2014 nagios
drwxr-xr-x. 2 root root 4096 Sep 25 2013 net
drwxr-xr-x 9 root root 4096 Nov 3 2014 nsr
drwxrwxr-x. 56 root root 4096 Mar 9 19:36 opt
drwxrws--- 23 root panicle 4096 Feb 11 13:15 panicle
dr-xr-xr-x 481 root root 0 Feb 26 14:08 proc
dr-xr-x---. 15 root root 4096 May 10 15:00 root
dr-xr-xr-x. 2 root root 12288 Feb 26 04:18 sbin
drwxr-xr-x 2 root root 4096 Sep 2 2015 scratch1
drwxr-xr-x 1740 root root 1740 May 10 14:47 scratch2
drwxr-xr-x. 2 root root 4096 Oct 17 2014 selinux
drwxrws--- 14 root smlc 4096 Apr 18 13:21 smlc
drwxr-xr-x 117 root root 4096 Apr 13 14:42 software
drwxr-xr-x. 2 root root 4096 Jul 20 2011 srv
drwxr-xr-x 13 root root 0 Feb 26 14:08 sys
drwxrwxrwt. 574 hbase hadoop 45056 May 10 06:06 tmp
drwxr-xr-x. 18 root root 4096 Feb 5 2015 usr
drwxr-xr-x. 23 root root 4096 Oct 17 2014 var
drwx------. 9 root root 4096 Oct 17 2014 xcatpost</code></pre>
<h2 id="hdfs-commands">HDFS commands</h2>
<p>HDFS provides a set of commands for users to interact with the system from a Linux-based terminal. To view all available HDFS system commands, run the following in a cell:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs</code></pre></div>
<pre class="output"><code>Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
where COMMAND is one of:
dfs run a filesystem command on the file systems supported in Hadoop.
classpath prints the classpath
namenode -format format the DFS filesystem
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
journalnode run the DFS journalnode
zkfc run the ZK Failover Controller daemon
datanode run a DFS datanode
dfsadmin run a DFS admin client
haadmin run a DFS HA admin client
fsck run a DFS filesystem checking utility
balancer run a cluster balancing utility
jmxget get JMX exported values from NameNode or DataNode.
mover run a utility to move block replicas across
storage types
oiv apply the offline fsimage viewer to an fsimage
oiv_legacy apply the offline fsimage viewer to an legacy fsimage
oev apply the offline edits viewer to an edits file
fetchdt fetch a delegation token from the NameNode
getconf get config values from configuration
groups get the groups which users belong to
snapshotDiff diff two snapshots of a directory or diff the
current directory contents with a snapshot
lsSnapshottableDir list all snapshottable dirs owned by the current user
Use -help to see options
portmap run a portmap service
nfs3 run an NFS version 3 gateway
cacheadmin configure the HDFS cache
crypto configure HDFS encryption zones
storagepolicies list/get/set block storage policies
version print the version
Most commands print help when invoked w/o parameters.</code></pre>
<p>For this workshop, we are interested in file system commands. Create a new cell and run the following:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs</code></pre></div>
<pre class="output"><code>Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]</code></pre>
<p>We can see that HDFS provides a number of file system commands that are quite similar to their Linux counterparts. For example, <strong><em>-chown</em></strong> and <strong><em>-chmod</em></strong> change the ownership and permissions of HDFS files and directories, <strong><em>-ls</em></strong> lists the contents of a directory, <strong><em>-mkdir</em></strong> creates a new directory, <strong><em>-rm</em></strong> removes files and directories, and so on.</p>
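<p>For instance, listing the root directory of HDFS mirrors the Linux <strong><em>ls</em></strong> we ran earlier (a minimal sketch, assuming the same <strong>dsciu001</strong> gateway and <strong>!</strong> shell prefix used throughout this lesson):</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hdfs dfs -ls /</code></pre></div>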
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="hadoop-fs-and-hdfs-dfs"><span class="glyphicon glyphicon-pushpin"></span><strong><em>hadoop fs</em></strong> and <strong><em>hdfs dfs</em></strong></h2>
</div>
<div class="panel-body">
<p><strong><em>hadoop fs</em></strong> is an older syntax for <strong><em>hdfs dfs</em></strong>. While both commands produce the same results, you are encouraged to use <strong><em>hdfs dfs</em></strong> instead.</p>
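<p>For example, the following two invocations are interchangeable (a sketch, again assuming the <strong>dsciu001</strong> gateway):</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ssh</span> dsciu001 hadoop fs -ls /
!<span class="kw">ssh</span> dsciu001 hdfs dfs -ls /</code></pre></div>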
</div>
</aside>
<p>When a Hadoop cluster is first started, it contains no data. Users typically import data into the cluster from the traditional Linux-based file system using the <strong><em>-put</em></strong> option. Vice versa, to move data from HDFS back to a Linux-based file system, the <strong><em>-get</em></strong> option is used.</p>
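<p>A minimal sketch of both directions (the file name <strong><em>example.txt</em></strong> and the HDFS destination here are placeholders for illustration, not part of the lesson data):</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"># import a local (Linux) file into HDFS
!<span class="kw">ssh</span> dsciu001 hdfs dfs -put example.txt /user/your-username/
# export it from HDFS back to the local file system
!<span class="kw">ssh</span> dsciu001 hdfs dfs -get /user/your-username/example.txt .</code></pre></div>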
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="hadoop-distributed-file-system-is-not-the-linux-file-system"><span class="glyphicon glyphicon-pushpin"></span>Hadoop Distributed File System is not the Linux File System</h2>
</div>
<div class="panel-body">
<p>It is important to distinguish between the files and directories that are stored on HDFS and those that are stored on the Linux file system. In the Hadoop usage guide, the prefix <strong><em>local</em></strong> implies a path to a file/directory that is on a Linux file system. Anything else implies a path to a file/directory on HDFS.</p>
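<p>A quick way to see that the two namespaces are separate (a sketch, assuming the <strong>dsciu001</strong> gateway): listing your home directory on each file system returns different contents.</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"># your home directory on the Linux (Palmetto) file system
!<span class="kw">ls</span> ~
# your user directory on HDFS (defaults to /user/your-username)
!<span class="kw">ssh</span> dsciu001 hdfs dfs -ls</code></pre></div>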
</div>
</aside>
<h2 id="hdfs-web-interface">HDFS Web Interface</h2>
<p>At Clemson University, the Hadoop big data infrastructure is called the Cypress cluster. It runs an open source flavor of Hadoop distributed by Hortonworks. HDFS provides a web-based user interface for users to view stored data. The interface is hosted on HDFS’ NameNode, which is replicated to ensure uninterrupted operation. The URLs of the NameNode replicas are:</p>
<pre class="output"><code>http://namenode1.palmetto.clemson.edu:50070
http://namenode2.palmetto.clemson.edu:50070</code></pre>
<p><img src="fig/Namenodes.png" alt="Namenodes" style="height:500px"> <br> This figure shows the interfaces of the two HDFS NameNode replicas. Only the active instance (left) can be used to view files and directories.</p>
<h2 id="check-your-understanding-using-jupyter-shell-to-download-data" class="callout">Check your understanding: Using Jupyter shell to download data</h2>
<p>Create a directory named <strong>intro-to-hadoop</strong> in your home directory on Palmetto.</p>
<p>From inside this directory, run the following command to get data from GitHub:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">git</span> clone https://github.com/clemsoncoe/Introduction-to-Hadoop-data.git</code></pre></div>
<p>View this newly cloned directory to confirm that you have the file <strong><em>gutenberg-shakespeare.txt</em></strong>.</p>
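<p>One way to check (a sketch; <strong><em>git clone</em></strong> places the data in a subdirectory named after the repository):</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">!<span class="kw">ls</span> -l Introduction-to-Hadoop-data/</code></pre></div>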
<h2 id="check-your-understanding-view-files-and-directories-on-hdfs" class="challenge">Check your understanding: View files and directories on HDFS</h2>
<p>View the contents of your HDFS user directory (/user/<strong>your-username</strong>) on Cypress.</p>
<h2 id="check-your-understanding-create-directory-on-hdfs" class="challenge">Check your understanding: Create directory on HDFS</h2>
<p>Create a directory in your HDFS user directory named <strong>intro-to-hadoop</strong>.</p>
<h2 id="check-your-understanding-import-file-to-hdfs" class="challenge">Check your understanding: Import file to HDFS</h2>
<p>Copy the file <strong><em>gutenberg-shakespeare.txt</em></strong> from Palmetto to this newly created <strong>intro-to-hadoop</strong> directory on HDFS using <strong><em>-put</em></strong>. View the contents of the <strong>intro-to-hadoop</strong> directory to confirm that the file has been successfully uploaded.</p>
</div>
</div>
</article>
<div class="footer">
<a class="label clemson-orange" href="http://citi.clemson.edu">CITI</a>
<a class="label clemson-orange" href="https://github.com/clemsoncoe/hpc-workshop">Source</a>
<a class="label clemson-orange" href="mailto:atrikut@clemson.edu">Contact</a>
<a class="label clemson-orange" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>