<!DOCTYPE html>
<html lang="en" prefix="og: http://ogp.me/ns# fb: https://www.facebook.com/2008/fbml">
<head>
<title>Fun with macros and parquet-avro - Stackdiver as a Service</title>
<!-- Using the latest rendering mode for IE -->
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="canonical" href="https://www.lyh.me/fun-with-macros-and-parquet-avro.html">
<meta name="author" content="Neville Li" />
<meta name="keywords" content="avro,macro,parquet,scala" />
<meta name="description" content="I recently had some fun building parquet-avro-extra, an add-on module for parquet-avro using Scala macros. I did it mainly to learn Scala macros but also to make it easier to use Parquet with Avro in a data pipeline. Parquet and Avro Parquet is a columnar storage system designed for HDFS. It offers some nice improvements over row-major systems including better compression and less I/O with column projection and predicate pushdown. Avro is a data serialization system that enables type-safe access to structured data with complex schemas. The parquet-avro module makes it possible to store data in Parquet format on disk and process it as Avro objects inside a JVM data pipeline like Scalding or Spark. Projection Parquet allows reading only a subset of columns via projection. Here’s a Scalding example from Tapad. Projection[Signal](&quot;field1&quot;, &quot;field2.field2a&quot;) Note that field specifications are strings even though the API has access to the Avro type Signal, which has strongly typed getter methods. This is slightly counter-intuitive since most Scala developers are used to transformations like pipe.map(_.getField). However, this can be easily solved with a macro, since the syntax tree of the getter lambdas is accessible. A modified version has the signature …" />
<meta property="og:site_name" content="Stackdiver as a Service" />
<meta property="og:type" content="article"/>
<meta property="og:title" content="Fun with macros and parquet-avro"/>
<meta property="og:url" content="https://www.lyh.me/fun-with-macros-and-parquet-avro.html"/>
<meta property="og:description" content="I recently had some fun building parquet-avro-extra, an add-on module for parquet-avro using Scala macros. I did it mainly to learn Scala macros but also to make it easier to use Parquet with Avro in a data pipeline. Parquet and Avro Parquet is a columnar storage system designed for HDFS. It offers some nice improvements over row-major systems including better compression and less I/O with column projection and predicate pushdown. Avro is a data serialization system that enables type-safe access to structured data with complex schemas. The parquet-avro module makes it possible to store data in Parquet format on disk and process it as Avro objects inside a JVM data pipeline like Scalding or Spark. Projection Parquet allows reading only a subset of columns via projection. Here’s a Scalding example from Tapad. Projection[Signal](&quot;field1&quot;, &quot;field2.field2a&quot;) Note that field specifications are strings even though the API has access to the Avro type Signal, which has strongly typed getter methods. This is slightly counter-intuitive since most Scala developers are used to transformations like pipe.map(_.getField). However, this can be easily solved with a macro, since the syntax tree of the getter lambdas is accessible. A modified version has the signature …"/>
<meta property="article:published_time" content="2015-01-08" />
<meta property="article:section" content="code" />
<meta property="article:tag" content="avro" />
<meta property="article:tag" content="macro" />
<meta property="article:tag" content="parquet" />
<meta property="article:tag" content="scala" />
<meta property="article:author" content="Neville Li" />
<!-- Bootstrap -->
<link rel="stylesheet" href="https://www.lyh.me/theme/css/bootstrap.min.css" type="text/css"/>
<link href="https://www.lyh.me/theme/css/font-awesome.min.css" rel="stylesheet">
<link href="https://www.lyh.me/theme/css/pygments/monokai.css" rel="stylesheet">
<link href="https://www.lyh.me/theme/css/typogrify.css" rel="stylesheet">
<link rel="stylesheet" href="https://www.lyh.me/theme/css/style.css" type="text/css"/>
<link href="https://www.lyh.me/feeds/all.atom.xml" type="application/atom+xml" rel="alternate"
title="Stackdiver as a Service ATOM Feed"/>
<link href="https://www.lyh.me/feeds/code.atom.xml" type="application/atom+xml" rel="alternate"
title="Stackdiver as a Service code ATOM Feed"/>
</head>
<body>
<div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-ex1-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a href="https://www.lyh.me/" class="navbar-brand">
Stackdiver as a Service </a>
</div>
<div class="collapse navbar-collapse navbar-ex1-collapse">
<ul class="nav navbar-nav">
<li><a href="https://www.lyh.me/pages/about-me.html">
About Me
</a></li>
</ul>
<ul class="nav navbar-nav navbar-right">
</ul>
</div>
<!-- /.navbar-collapse -->
</div>
</div> <!-- /.navbar -->
<!-- Banner -->
<!-- End Banner -->
<!-- Content Container -->
<div class="container">
<div class="row">
<div class="col-sm-9">
<section id="content">
<article>
<header class="page-header">
<h1>
<a href="https://www.lyh.me/fun-with-macros-and-parquet-avro.html"
rel="bookmark"
title="Permalink to Fun with macros and parquet-avro">
Fun with macros and parquet-avro
</a>
</h1>
</header>
<div class="entry-content">
<div class="panel">
<div class="panel-body">
<footer class="post-info">
<span class="label label-default">Date</span>
<span class="published">
<i class="fa fa-calendar"></i><time datetime="2015-01-08T14:01:00-05:00"> Thu 08 January 2015</time>
</span>
<span class="label label-default">Category</span>
<a href="https://www.lyh.me/category/code.html">code</a>
<span class="label label-default">Tags</span>
<a href="https://www.lyh.me/tag/avro.html">avro</a>
/
<a href="https://www.lyh.me/tag/macro.html">macro</a>
/
<a href="https://www.lyh.me/tag/parquet.html">parquet</a>
/
<a href="https://www.lyh.me/tag/scala.html">scala</a>
</footer><!-- /.post-info --> </div>
</div>
<p>I recently had some fun building <a href="https://github.com/nevillelyh/parquet-avro-extra">parquet-avro-extra</a>, an add-on module for <a href="https://github.com/Parquet/parquet-mr/tree/master/parquet-avro">parquet-avro</a> using <a href="http://scalamacros.org/">Scala macros</a>. I did it mainly to learn Scala macros but also to make it easier to use <a href="http://parquet.incubator.apache.org/">Parquet</a> with <a href="http://avro.apache.org/">Avro</a> in a data pipeline.</p>
<h2>Parquet and Avro</h2>
<p>Parquet is a columnar storage system designed for <span class="caps">HDFS</span>. It offers some nice improvements over row-major systems, including better compression and less I/O through column projection and predicate pushdown. Avro is a data serialization system that enables type-safe access to structured data with complex schemas. The <code>parquet-avro</code> module makes it possible to store data in Parquet format on disk and process it as Avro objects inside a <span class="caps">JVM</span> data pipeline like <a href="https://github.com/twitter/scalding">Scalding</a> or <a href="http://spark.apache.org/">Spark</a>.</p>
<h2>Projection</h2>
<p>Parquet allows reading only a subset of columns via projection. Here’s a Scalding <a href="https://github.com/epishkin/scalding/tree/parquet_avro/scalding-parquet">example</a> from <a href="http://www.tapad.com/">Tapad</a>.</p>
<div class="highlight"><pre><span></span><code><span class="nc">Projection</span><span class="o">[</span><span class="kt">Signal</span><span class="o">](</span><span class="s">"field1"</span><span class="o">,</span> <span class="s">"field2.field2a"</span><span class="o">)</span>
</code></pre></div>
<p>Note that field specifications are strings even though the <span class="caps">API</span> has access to the Avro type <code>Signal</code>, which has strongly typed getter methods.</p>
<p>This is slightly counter-intuitive since most Scala developers are used to transformations like <code>pipe.map(_.getField)</code>. However, this can be easily solved with a macro, since the syntax tree of the getter lambdas is accessible. A modified version has the signature <code>def apply[T](getters: (T =&gt; Any)*): Schema</code> and can be used like this:</p>
<div class="highlight"><pre><span></span><code><span class="nc">Projection</span><span class="o">[</span><span class="kt">Signal</span><span class="o">](</span><span class="k">_</span><span class="o">.</span><span class="n">getField1</span><span class="o">,</span> <span class="k">_</span><span class="o">.</span><span class="n">getField2</span><span class="o">.</span><span class="n">getField2a</span><span class="o">)</span>
</code></pre></div>
<p>The macro version looks more natural, gives you auto-completion in the <span class="caps">IDE</span>, and catches typos at compile time.</p>
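<p>To make the idea concrete, here is a minimal sketch of the translation such a macro has to perform. It is not the actual <code>parquet-avro-extra</code> code: it assumes Avro's usual <code>getXyz</code> getter naming and uses a hypothetical two-case <code>Tree</code> type as a stand-in for the compiler's syntax tree. <code>GetterPaths</code>, <code>Param</code> and <code>Select</code> are illustrative names.</p>
<div class="highlight"><pre><span></span><code>// Hypothetical stand-in for the compiler's syntax tree of a getter chain.
sealed trait Tree
case object Param extends Tree // the lambda parameter, e.g. `_`
final case class Select(qualifier: Tree, method: String) extends Tree

object GetterPaths {
  // "getField2a" -&gt; "field2a"; names without a "get" prefix are left unchanged.
  def fieldName(getter: String): String =
    if (getter.startsWith("get") &amp;&amp; getter.length &gt; 3)
      getter.substring(3, 4).toLowerCase + getter.substring(4)
    else getter

  // Walk from the outermost call inwards, prepending field names as we go.
  def path(tree: Tree): String = {
    def loop(t: Tree, acc: List[String]): List[String] = t match {
      case Param =&gt; acc
      case Select(qualifier, method) =&gt; loop(qualifier, fieldName(method) :: acc)
    }
    loop(tree, Nil).mkString(".")
  }

  def main(args: Array[String]): Unit = {
    // Models the lambda `_.getField2.getField2a`.
    val chain = Select(Select(Param, "getField2"), "getField2a")
    println(path(chain)) // prints field2.field2a
  }
}
</code></pre></div>
<p>A real macro would pattern match on the compiler's own trees in much the same way, only at compile time.</p>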
<h2>Predicate</h2>
<p>Predicate is an even better use case for macros. The Parquet predicate <span class="caps">API</span> supports a fixed set of column types and operators, and the user must use the correct factory methods to construct an expression tree. A simple lambda like <code>i: Item =&gt; i.getPrice &lt; 100 &amp;&amp; i.getReviews &gt;= 10</code> becomes this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">parquet.filter2.predicate.FilterApi</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">parquet.filter2.predicate.FilterPredicate</span><span class="p">;</span>
<span class="n">FilterPredicate</span> <span class="n">p</span> <span class="o">=</span> <span class="n">FilterApi</span><span class="p">.</span><span class="na">and</span><span class="p">(</span>
<span class="n">FilterApi</span><span class="p">.</span><span class="na">lt</span><span class="p">(</span><span class="n">FilterApi</span><span class="p">.</span><span class="na">floatColumn</span><span class="p">(</span><span class="s">"price"</span><span class="p">),</span> <span class="mi">100</span><span class="n">f</span><span class="p">),</span>
<span class="n">FilterApi</span><span class="p">.</span><span class="na">gtEq</span><span class="p">(</span><span class="n">FilterApi</span><span class="p">.</span><span class="na">intColumn</span><span class="p">(</span><span class="s">"reviews"</span><span class="p">),</span> <span class="mi">10</span><span class="p">));</span>
</code></pre></div>
<p>Obviously this is very cumbersome and even worse than the projection case. But with macros it feels almost like writing a regular Scala predicate lambda:</p>
<div class="highlight"><pre><span></span><code><span class="nc">Predicate</span><span class="o">[</span><span class="kt">Item</span><span class="o">](</span><span class="n">i</span> <span class="k">=&gt;</span> <span class="n">i</span><span class="o">.</span><span class="n">getPrice</span> <span class="o">&lt;</span> <span class="mi">100</span> <span class="o">&amp;&amp;</span> <span class="n">i</span><span class="o">.</span><span class="n">getReviews</span> <span class="o">&gt;=</span> <span class="mi">10</span><span class="o">)</span>
</code></pre></div>
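<p>As a rough illustration of the rewriting involved, the sketch below models the lambda body with a hypothetical mini-<span class="caps">AST</span> and renders the matching factory calls as a string. It only covers the operators and column types used in the example above, and all type and object names are made up for illustration; a real macro would work on compiler trees and emit actual <code>FilterApi</code> calls instead of text.</p>
<div class="highlight"><pre><span></span><code>// Hypothetical column types; each knows the name of its FilterApi factory method.
sealed trait Col { def factory: String }
case object IntCol extends Col { val factory = "intColumn" }
case object FloatCol extends Col { val factory = "floatColumn" }

// Hypothetical comparison operators, mapped to FilterApi method names.
sealed trait Op { def method: String }
case object Lt extends Op { val method = "lt" }
case object GtEq extends Op { val method = "gtEq" }

// Mini-AST for the lambda body: comparisons joined by logical and.
sealed trait Expr
final case class Compare(op: Op, getter: String, col: Col, literal: String) extends Expr
final case class And(left: Expr, right: Expr) extends Expr

object PredicateSketch {
  // "getPrice" -&gt; "price", assuming Avro-style getter names.
  private def field(getter: String): String = {
    val name = getter.stripPrefix("get")
    name.take(1).toLowerCase + name.drop(1)
  }

  // Recursively render the expression as nested factory calls.
  def render(expr: Expr): String = expr match {
    case And(l, r) =&gt;
      s"and(${render(l)}, ${render(r)})"
    case Compare(op, getter, col, literal) =&gt;
      s"""${op.method}(${col.factory}("${field(getter)}"), $literal)"""
  }

  def main(args: Array[String]): Unit = {
    // Models `i =&gt; i.getPrice &lt; 100 &amp;&amp; i.getReviews &gt;= 10`.
    val body = And(
      Compare(Lt, "getPrice", FloatCol, "100f"),
      Compare(GtEq, "getReviews", IntCol, "10"))
    println(render(body))
    // prints and(lt(floatColumn("price"), 100f), gtEq(intColumn("reviews"), 10))
  }
}
</code></pre></div>
<p>Note that the column factory is chosen from the getter's type rather than from the literal, which is why the <code>100</code> in the lambda ends up as <code>100f</code> next to a float column.</p>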
<h2>Lessons learned</h2>
<p>I have some experience with C++ templates and Clojure macros, but Scala macros are still pretty challenging to get started with. A couple of notes:</p>
<ul>
<li>Since one can write pretty complex code inside a macro, I had to remind myself that it doesn’t compute data at runtime, but merely transforms the syntax tree at compile time.</li>
<li>Pattern matching and deconstruction are handy for extracting tree elements.</li>
<li>Quasiquotes can be chained and returned from recursive calls for complex transformations; just be aware of the different types involved first.</li>
<li>It’s a great exercise for your recursion skills.</li>
</ul>
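<p>The pattern matching and quasiquote points can be tried outside a macro, since quasiquotes are also available through runtime reflection. Here is a small sketch, assuming Scala 2.11 or later with <code>scala-reflect</code> on the classpath. The <code>and</code>, <code>lt</code> and <code>gtEq</code> names in the resulting tree are just unresolved identifiers chosen to echo the predicate example, and <code>QuasiquoteSketch</code> is a made-up name.</p>
<div class="highlight"><pre><span></span><code>import scala.reflect.runtime.universe._

object QuasiquoteSketch {
  // Deconstruct the tree with quasiquote patterns and rebuild it recursively.
  // Both the matched parts and the returned values are Trees.
  def rewrite(tree: Tree): Tree = tree match {
    case q"$lhs &amp;&amp; $rhs" =&gt; q"and(${rewrite(lhs)}, ${rewrite(rhs)})"
    case q"$lhs &lt; $rhs"  =&gt; q"lt($lhs, $rhs)"
    case q"$lhs &gt;= $rhs" =&gt; q"gtEq($lhs, $rhs)"
    case other           =&gt; other // leave anything else untouched
  }

  def main(args: Array[String]): Unit = {
    // An untyped tree; `price` and `reviews` are never resolved or evaluated.
    val input = q"price &lt; 100 &amp;&amp; reviews &gt;= 10"
    println(showCode(rewrite(input)))
  }
}
</code></pre></div>
<p>Nothing here is evaluated: the function takes a tree and returns another tree, which is the same shape of work a macro implementation does at compile time.</p>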
</div>
<!-- /.entry-content -->
<hr />
<!-- AddThis Button BEGIN -->
<div class="addthis_toolbox addthis_default_style">
<a class="addthis_button_facebook_like" fb:like:layout="button_count"></a>
<a class="addthis_button_tweet"></a>
<a class="addthis_button_google_plusone" g:plusone:size="medium"></a>
</div>
<!-- AddThis Button END -->
<hr/>
<section class="comments" id="comments">
<h2>Comments</h2>
<div id="disqus_thread"></div>
<script type="text/javascript">
/* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE * * */
var disqus_shortname = 'lyh'; // required: replace example with your forum shortname
var disqus_config = function () {
this.language = "en";
this.page.identifier = '2015-01-08-fun-with-macros-and-parquet-avro';
this.page.url = 'https://www.lyh.me/fun-with-macros-and-parquet-avro.html';
};
/* * * DON'T EDIT BELOW THIS LINE * * */
(function () {
var dsq = document.createElement('script');
dsq.type = 'text/javascript';
dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by
Disqus.</a></noscript>
<a href="http://disqus.com" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</section>
</article>
</section>
</div>
<div class="col-sm-3" id="sidebar">
<aside>
<div id="aboutme">
<p>
<img width="100%" class="img-thumbnail" src="https://www.lyh.me//avatar.jpg"/>
</p>
<p>
<strong>About Neville Li</strong><br/>
Data infrastructure @<a href="https://twitter.com/Spotify">Spotify</a>, ex-@<a href="https://twitter.com/Yahoo">Yahoo</a> search, creator of <a href="https://github.com/spotify/scio">Scio</a>, technical cave & wreck diver, lefty guitar player
</p>
</div><!-- Sidebar -->
<section class="well well-sm">
<ul class="list-group list-group-flush">
<!-- Sidebar/Social -->
<li class="list-group-item">
<h4><i class="fa fa-home fa-lg"></i><span class="icon-label">Social</span></h4>
<ul class="list-group" id="social">
<li class="list-group-item"><a href="https://open.spotify.com/user/sinisa_lyh"><i class="fa fa-spotify fa-lg"></i> Spotify</a></li>
<li class="list-group-item"><a href="https://github.com/nevillelyh"><i class="fa fa-github-square fa-lg"></i> GitHub</a></li>
<li class="list-group-item"><a href="https://twitter.com/sinisa_lyh"><i class="fa fa-twitter-square fa-lg"></i> Twitter</a></li>
<li class="list-group-item"><a href="https://www.slideshare.net/sinisalyh"><i class="fa fa-slideshare fa-lg"></i> SlideShare</a></li>
<li class="list-group-item"><a href="https://www.youtube.com/user/sinisalyh/videos"><i class="fa fa-youtube-square fa-lg"></i> YouTube</a></li>
<li class="list-group-item"><a href="https://www.instagram.com/sinisa/"><i class="fa fa-instagram fa-lg"></i> Instagram</a></li>
<li class="list-group-item"><a href="https://www.flickr.com/photos/sinisa_lyh"><i class="fa fa-flickr fa-lg"></i> Flickr</a></li>
</ul>
</li>
<!-- End Sidebar/Social -->
<!-- Sidebar/Recent Posts -->
<li class="list-group-item">
<h4><i class="fa fa-home fa-lg"></i><span class="icon-label">Recent Posts</span></h4>
<ul class="list-group" id="recentposts">
<li class="list-group-item"><a href="https://www.lyh.me/magnolify.html">Magnolify</a></li>
<li class="list-group-item"><a href="https://www.lyh.me/featran.html">Featran</a></li>
<li class="list-group-item"><a href="https://www.lyh.me/automatic-type-class-derivation-with-shapeless.html">Automatic type-class derivation with Shapeless</a></li>
<li class="list-group-item"><a href="https://www.lyh.me/lambda-serialization.html">Lambda serialization</a></li>
<li class="list-group-item"><a href="https://www.lyh.me/lawfulness-of-aggregatebykey.html">Lawfulness of aggregateByKey</a></li>
</ul>
</li>
<!-- End Sidebar/Recent Posts -->
<!-- Sidebar/Categories -->
<li class="list-group-item">
<h4><i class="fa fa-home fa-lg"></i><span class="icon-label">Categories</span></h4>
<ul class="list-group" id="categories">
<li class="list-group-item">
<a href="https://www.lyh.me/category/code.html"><i class="fa fa-folder-open fa-lg"></i>code</a>
</li>
<li class="list-group-item">
<a href="https://www.lyh.me/category/misc.html"><i class="fa fa-folder-open fa-lg"></i>misc</a>
</li>
</ul>
</li>
<!-- End Sidebar/Categories -->
<!-- Sidebar/Twitter Timeline -->
<li class="list-group-item">
<h4><i class="fa fa-twitter fa-lg"></i><span class="icon-label">Latest Tweets</span></h4>
<div id="twitter_timeline">
<a class="twitter-timeline" data-width="250" data-height="300" data-dnt="true" data-theme="light" href="https://twitter.com/sinisa_lyh">Tweets by sinisa_lyh</a> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
</div>
</li>
<!-- End Sidebar/Twitter Timeline -->
</ul>
</section>
<!-- End Sidebar --> </aside>
</div>
</div>
</div>
<!-- End Content Container -->
<footer>
<div class="container">
<hr>
<div class="row">
<div class="col-xs-10">© 2020 Neville Li
· Powered by <a href="https://github.com/getpelican/pelican-themes/tree/master/pelican-bootstrap3" target="_blank">pelican-bootstrap3</a>,
<a href="http://docs.getpelican.com/" target="_blank">Pelican</a>,
<a href="http://getbootstrap.com" target="_blank">Bootstrap</a> <p><small> <a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/deed.en"><img alt="Creative Commons License" style="border-width:0" src="//i.creativecommons.org/l/by-nc/4.0/80x15.png" /></a>
Content
licensed under a <a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/deed.en">Creative Commons Attribution-NonCommercial 4.0 International License</a>, except where indicated otherwise.
</small></p>
</div>
<div class="col-xs-2"><p class="pull-right"><i class="fa fa-arrow-up"></i> <a href="#">Back to top</a></p></div>
</div>
</div>
</footer>
<script src="https://www.lyh.me/theme/js/jquery.min.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="https://www.lyh.me/theme/js/bootstrap.min.js"></script>
<!-- Enable responsive features in IE8 with Respond.js (https://github.com/scottjehl/Respond) -->
<script src="https://www.lyh.me/theme/js/respond.min.js"></script>
<!-- Disqus -->
<script type="text/javascript">
/* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE * * */
var disqus_shortname = 'lyh'; // required: replace example with your forum shortname
/* * * DON'T EDIT BELOW THIS LINE * * */
(function () {
var s = document.createElement('script');
s.async = true;
s.type = 'text/javascript';
s.src = '//' + disqus_shortname + '.disqus.com/count.js';
(document.getElementsByTagName('HEAD')[0] || document.getElementsByTagName('BODY')[0]).appendChild(s);
}());
</script>
<!-- End Disqus Code -->
<!-- Google Analytics -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-6988688-5']);
_gaq.push(['_trackPageview']);
(function () {
var ga = document.createElement('script');
ga.type = 'text/javascript';
ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(ga, s);
})();
</script>
<!-- End Google Analytics Code -->
<script type="text/javascript">var addthis_config = {"data_track_addressbar": true};</script>
<script type="text/javascript" src="//s7.addthis.com/js/300/addthis_widget.js#pubid=sinisalyh"></script>
</body>
</html>