This project contains:
- A stand-alone tool to convert the Wikipedia article dump (as XML) into multiple
text files, each consisting of one <page>xxxx</page> record per line. This is
then suitable for input to Hadoop. The use of this tool is described in the
separate README-Splitting file.
- A Hadoop-based workflow that processes the dump, extracts ngrams, and
generates counts. The rest of this document describes that tool.
======================================================
Prerequisites
======================================================
This section assumes that you are set up to use Elastic MapReduce.
If not, you should complete the steps described in the
first module of this course.
In particular, you'll need your AWS Access Key, Secret Key,
and a keypair generated by AWS.
Separately, you may want to pre-configure FoxyProxy in
Firefox if you wish to view the job details via the Hadoop
JobTracker GUI. For instructions, see the "How to Install Foxy
Proxy" section of the Amazon Elastic MapReduce Developer Guide,
which you can view or download from
http://aws.amazon.com/documentation/elasticmapreduce/
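The rest of this document uses the AWS Console for all AWS operations. Purely as an
optional sanity check - and assuming you happen to have the AWS command-line tools
installed and configured with the credentials above - you can confirm that your account
and keypair are usable with:
% aws ec2 describe-key-pairs
% aws s3 ls
If those commands fail, fix your credentials before continuing.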
======================================================
Running the processor job using EMR
======================================================
1. Create a bucket in S3, using the AWS Console.
https://console.aws.amazon.com/s3/home
For example, call this bucket "aws-test-99"
Inside this bucket, create three directories called "job", "logs", and "results".
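If you have the AWS command-line tools installed, a command-line alternative to the
console for this step looks roughly like the following (the bucket name is the example
one from above; note that S3 "directories" are really just key prefixes, so they are
also created implicitly the first time you upload something under them):
% aws s3 mb s3://aws-test-99
% aws s3api put-object --bucket aws-test-99 --key job/
% aws s3api put-object --bucket aws-test-99 --key logs/
% aws s3api put-object --bucket aws-test-99 --key results/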
2. Build the job jar
[On your dev machine]
% ant clean job
This will create the wikipedia-ngrams-job.jar Hadoop job jar
file in your build sub-directory. If you have Hadoop installed on
your development machine, you can try running it locally via:
% hadoop jar build/wikipedia-ngrams-job.jar -inputfile src/test/resources/enwiki-split.xml -outputdir build/test
This will generate text output files in build/test/raw-counts and build/test/sorted-counts.
To view the results, you can dump the output (these are text files), e.g.
% cat build/test/sorted-counts/part-r-00000
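For a quick look at the local output without opening the whole file, the usual text
tools work. The lines should be ngram<TAB>count pairs - an assumption based on Hadoop's
default text output format, not something this README specifies:
% head -20 build/test/sorted-counts/part-r-00000
% wc -l build/test/sorted-counts/part-r-00000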
3. Upload the job jar to <bucket name>/job/, using the AWS Console.
For example, put it into aws-test-99/job/wikipedia-ngrams-job.jar
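If you prefer the command line and have the AWS command-line tools installed, the
equivalent upload is roughly:
% aws s3 cp build/wikipedia-ngrams-job.jar s3://aws-test-99/job/wikipedia-ngrams-job.jar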
4. Start the Job Flow, using the AWS Console
https://console.aws.amazon.com/elasticmapreduce/home
- Click the "Create New Job Flow" button. This will start you down the six-dialog path to enlightenment...
Define Job Flow
===============
- Give it a reasonable name, and set the "Choose a Job Type" menu to "Custom JAR"
- Click the "Continue" button.
Specify Parameters
==================
- Set the JAR Location to the job jar you uploaded (e.g. aws-test-99/job/wikipedia-ngrams-job.jar)
- Set the JAR Arguments to be "-inputfile s3n://datasets.elasticmapreduce/wikipediaxml/part-100.xml -outputdir s3n://<my bucket>/results -percent 10 -numreducers 1"
[NOTE - you must change <my bucket> to the bucket you created above, e.g. aws-test-99. The arguments themselves are summarized in a note after this step.]
- Click the "Continue" button.
Configure EC2 Instances
=======================
- Set the Master Instance Group's Instance Type menu to "Small (m1.small)"
- Set the Core Instance Group's Instance Count to 2, and the Instance Type menu to "Large (m1.large)"
- Leave the Task Instance Group's Instance Count set to 0.
- Click the "Continue" button.
Advanced Options
================
- Set the Amazon EC2 Key Pair menu to the name of the key pair you created previously.
- Set the Amazon S3 Log Path to be s3n://<my bucket>/logs
[NOTE - you must change <my bucket> to the bucket you created above, e.g. aws-test-99]
- Leave everything else unchanged.
- Click the "Continue" button.
Bootstrap Actions
=================
- Leave the "Proceed with no Bootstrap Actions" radio button selected.
- Click the "Continue" button.
Review
======
- Behold the myriad settings you have specified.
- Click the "Create Job Flow" button.
- Click the "Close" button on the final dialog.
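For reference, here is a rough summary of the JAR arguments entered in the "Specify
Parameters" dialog. The flags are defined by the job's main class, so treat the note
on -percent in particular as an assumption rather than a specification:
# -inputfile     s3n:// path of the Wikipedia XML split to process
# -outputdir     s3n:// path under which raw-counts/ and sorted-counts/ are written
# -percent 10    presumably limits the run to roughly 10% of the input pages
# -numreducers 1 number of reduce tasks; a single reducer produces a single part-r-00000 file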
5. Monitor the Job Flow
The AWS Console will list your job in the Elastic MapReduce tab:
https://console.aws.amazon.com/elasticmapreduce/home
The state will initially be "STARTING", which will eventually change to "RUNNING".
If the job fails for any reason, wait about 5 minutes, then download and inspect
the log files that have been uploaded to S3. These will be found inside the
<bucket name>/logs/ path you specified when defining the job, in the job-specific
subdirectory. For example, aws-test-99/logs/j-T6AYPJJ31MRH/
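If you have the AWS command-line tools installed, pulling an entire log directory down
for inspection is a single command (the bucket name and job flow id are the examples
from above):
% aws s3 sync s3://aws-test-99/logs/j-T6AYPJJ31MRH/ ./emr-logs/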
6. Using the Hadoop GUI
If you want to view high-level details of the job as it's running, using your browser,
then it's easy - just get the "Master Public DNS Name" by selecting the running job
in the list of Elastic MapReduce Job Flows, paste that into your browser, set the
port to 9100, and you'll be able to monitor things like percentage completion for
the various map and reduce tasks, total data read and written, etc.
7. [Advanced] - Proxying the Hadoop GUI
If you want to view all the details of the job as it's running, using the Hadoop
GUI via your browser, then you will need to have previously installed and configured
FoxyProxy in Firefox as described above.
Once you have successfully configured FoxyProxy (e.g. to proxy port 8157), you need to
set up an SSH SOCKS server. Open a new terminal window, and enter:
% ssh -i <path to keypair file> -ND 8157 hadoop@<public DNS name for master server>
The public DNS name is available via the AWS Management Console, as per above.
Once the SSH SOCKS server is running, you can open a browser window to the URL:
<public DNS name>:9100
This will show you the Hadoop JobTracker GUI. Note that once the job terminates, this
GUI will no longer be available, so you'll only have a few minutes to try this out.
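If the GUI doesn't load, a quick way to check that the SSH tunnel itself is working
(independently of your FoxyProxy configuration) is to fetch the JobTracker page through
the SOCKS port with curl, assuming curl is available:
% curl --socks5-hostname localhost:8157 http://<public DNS name>:9100/
If that prints HTML, the tunnel is fine and the problem is in the proxy settings.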
8. When the job has completed (about 6-10 minutes with the above configuration),
you can download and view the results. You should use the AWS Management Console
to download the <my bucket>/results/sorted-counts/part-r-00000 file to your
local disk, and then open it with any text editor.
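As with the upload in step 3, the download can also be done from the command line if
the AWS command-line tools are installed, e.g.:
% aws s3 cp s3://aws-test-99/results/sorted-counts/part-r-00000 .
% head -20 part-r-00000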