Generate Pelican site
nevillelyh committed Mar 11, 2020
1 parent 10a41fa commit 5864413
Showing 109 changed files with 3,575 additions and 2,368 deletions.
2 changes: 0 additions & 2 deletions archives.html
@@ -33,8 +33,6 @@
title="Das Keyboard Shredder ATOM Feed"/>




</head>
<body>

102 changes: 50 additions & 52 deletions author/neville-li.html

Large diffs are not rendered by default.

6 changes: 2 additions & 4 deletions author/neville-li2.html
@@ -33,8 +33,6 @@
title="Das Keyboard Shredder ATOM Feed"/>




</head>
<body>

@@ -326,8 +324,8 @@ <h2>Parquet and&nbsp;Avro</h2>
<p>Parquet is a columnar storage system designed for <span class="caps">HDFS</span>. It offers some nice improvements over row-major systems including better compression and less I/O with column projection and predicate pushdown. Avro is a data serialization system that enables type-safe access to structured data with complex schema. The <code>parquet-avro</code> module makes it possible to store data in Parquet format on disk and process them as Avro objects inside a <span class="caps">JVM</span> data pipeline like <a href="https://github.com/twitter/scalding">Scalding</a> or <a href="http://spark.apache.org/">Spark</a>.</p>
<h2>Projection</h2>
<p>Parquet allows reading only a subset of columns via projection. Here&#8217;s a Scalding <a href="https://github.com/epishkin/scalding/tree/parquet_avro/scalding-parquet">example</a> from <a href="http://www.tapad.com/">Tapad</a>.</p>
-<div class="highlight"><pre><span></span><span class="nc">Projection</span><span class="o">[</span><span class="kt">Signal</span><span class="o">](</span><span class="s">&quot;field1&quot;</span><span class="o">,</span> <span class="s">&quot;field2.field2a&quot;</span><span class="o">)</span>
-</pre></div>
+<div class="highlight"><pre><span></span><code><span class="nc">Projection</span><span class="o">[</span><span class="kt">Signal</span><span class="o">](</span><span class="s">&quot;field1&quot;</span><span class="o">,</span> <span class="s">&quot;field2.field2a&quot;</span><span class="o">)</span>
+</code></pre></div>


<p>Note that field specifications are strings even though the <span class="caps">API</span> has access to the Avro type <code>Signal</code>, which has strongly typed getter&nbsp;methods.</p>
30 changes: 14 additions & 16 deletions author/neville-li3.html
@@ -33,8 +33,6 @@
title="Das Keyboard Shredder ATOM Feed"/>




</head>
<body>

@@ -93,28 +91,28 @@ <h2><a href="https://www.lyh.me/using-cql-with-legacy-column-families.html">Usin
</footer><!-- /.post-info --> </div>
<div class="summary"><p>We use <a href="http://cassandra.apache.org/">Cassandra</a> extensively <a href="http://www.slideshare.net/JimmyMrdell/playlists-at-spotify-cassandra-summit-london-2013?related=1">at work</a>, and up till recently we&#8217;ve been using mostly Cassandra 1.2 with <a href="https://github.com/Netflix/astyanax">Astyanax</a> and <a href="https://thrift.apache.org/">Thrift</a> protocol in Java applications. Very recently we started adopting Cassandra 2.0 with <span class="caps">CQL</span>, <a href="https://github.com/datastax/java-driver">DataStax Java Driver</a> and binary&nbsp;protocol.</p>
<p>While one should move to <span class="caps">CQL</span> schema to take full advantage of the new protocol and storage engine, it&#8217;s still possible to use <span class="caps">CQL</span> and the new driver on existing clusters. Say we have a legacy column family with <code>UTF8Type</code> for row/column keys and <code>BytesType</code> for values. It would look like this in <code>cassandra-cli</code>:</p>
-<div class="highlight"><pre><span></span><span class="k">create</span> <span class="k">column</span> <span class="n">family</span> <span class="k">data</span>
+<div class="highlight"><pre><span></span><code><span class="k">create</span> <span class="k">column</span> <span class="n">family</span> <span class="k">data</span>
<span class="k">with</span> <span class="n">column_type</span> <span class="o">=</span> <span class="s1">&#39;Standard&#39;</span>
<span class="k">and</span> <span class="n">comparator</span> <span class="o">=</span> <span class="s1">&#39;UTF8Type&#39;</span>
<span class="k">and</span> <span class="n">default_validation_class</span> <span class="o">=</span> <span class="s1">&#39;BytesType&#39;</span>
<span class="k">and</span> <span class="n">key_validation_class</span> <span class="o">=</span> <span class="s1">&#39;UTF8Type&#39;</span><span class="p">;</span>
-</pre></div>
+</code></pre></div>


<p>And this in <code>cqlsh</code> after setting <code>start_native_transport: true</code> in <code>cassandra.yaml</code>:</p>
-<div class="highlight"><pre><span></span><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="k">data</span> <span class="p">(</span>
+<div class="highlight"><pre><span></span><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="k">data</span> <span class="p">(</span>
<span class="k">key</span> <span class="nb">text</span><span class="p">,</span>
<span class="n">column1</span> <span class="nb">text</span><span class="p">,</span>
<span class="n">value</span> <span class="nb">blob</span><span class="p">,</span>
<span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="k">key</span><span class="p">,</span> <span class="n">column1</span><span class="p">)</span>
<span class="p">)</span> <span class="k">WITH</span> <span class="n">COMPACT</span> <span class="k">STORAGE</span><span class="p">;</span>
-</pre></div>
+</code></pre></div>


<p>In this table, <code>key</code> and <code>column1</code> correspond to the row and column keys in the legacy column family and <code>value</code> corresponds to the column&nbsp;value.</p>
<p>Queries to look up a column value, an entire row, and selected columns in a row would look like&nbsp;this:</p>
-<div class="highlight"><pre><span></span><span class="k">SELECT</span> <span class="n">value</span> <span class="k">FROM</span> <span class="n">mykeyspace</span><span class="p">.</span><span class="k">data</span> <span class="k">WHERE</span> <span class="k">key</span> <span class="o">=</span> <span class="s1">&#39;rowkey&#39;</span> <span class="k">AND</span> <span class="n">column1</span> <span class="o">=</span> <span class="s1">&#39;colkey&#39;</span><span class="p">;</span>
-<span class="k">SELECT</span> <span class="n">column1</span><span class="p">,</span> <span class="n">value</span> <span class="k">FROM</span> <span class="n">mykeyspace …</span></pre></div>
+<div class="highlight"><pre><span></span><code><span class="k">SELECT</span> <span class="n">value</span> <span class="k">FROM</span> <span class="n">mykeyspace</span><span class="p">.</span><span class="k">data</span> <span class="k">WHERE</span> <span class="k">key</span> <span class="o">=</span> <span class="s1">&#39;rowkey&#39;</span> <span class="k">AND</span> <span class="n">column1</span> <span class="o">=</span> <span class="s1">&#39;colkey&#39;</span><span class="p">;</span>
+<span class="k">SELECT</span> <span class="n">column1</span><span class="p">,</span> <span class="n">value</span> <span class="k">FROM</span> <span class="n">mykeyspace …</span></code></pre></div>
<a class="btn btn-default btn-xs" href="https://www.lyh.me/using-cql-with-legacy-column-families.html">more ...</a>
</div>
</article>
@@ -216,21 +214,21 @@ <h2><a href="https://www.lyh.me/how-many-copies.html">How many&nbsp;copies</a></
<div class="summary"><p>One topic that comes up a lot when optimizing Scala data applications is the performance of standard collections, or the hidden cost of temporary copies. The collections <span class="caps">API</span> is easy to learn and maps well to many Python concepts that a lot of data engineers are familiar with. But the performance penalty can be pretty big when it&#8217;s repeated over millions of records in a <span class="caps">JVM</span> with limited&nbsp;heap.</p>
<h2>Mapping&nbsp;values</h2>
<p>Let&#8217;s take a look at the most naive example first, mapping the values of a <code>Map</code>.</p>
-<div class="highlight"><pre><span></span><span class="k">val</span> <span class="n">m</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;A&quot;</span> <span class="o">-&gt;</span> <span class="mi">1</span><span class="o">,</span> <span class="s">&quot;B&quot;</span> <span class="o">-&gt;</span> <span class="mi">2</span><span class="o">,</span> <span class="s">&quot;C&quot;</span> <span class="o">-&gt;</span> <span class="mi">3</span><span class="o">)</span>
+<div class="highlight"><pre><span></span><code><span class="k">val</span> <span class="n">m</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;A&quot;</span> <span class="o">-&gt;</span> <span class="mi">1</span><span class="o">,</span> <span class="s">&quot;B&quot;</span> <span class="o">-&gt;</span> <span class="mi">2</span><span class="o">,</span> <span class="s">&quot;C&quot;</span> <span class="o">-&gt;</span> <span class="mi">3</span><span class="o">)</span>
<span class="n">m</span><span class="o">.</span><span class="n">toList</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">t</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="n">t</span><span class="o">.</span><span class="n">_1</span><span class="o">,</span> <span class="n">t</span><span class="o">.</span><span class="n">_2</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)).</span><span class="n">toMap</span>
-</pre></div>
+</code></pre></div>


<p>Looks simple enough but obviously not optimal. Two temporary <code>List[(String, Int)]</code> were created, one from <code>toList</code> and one from <code>map</code>. <code>map</code> also creates 3 copies of <code>(String, Int)</code>.</p>
<p>There are a few commonly seen variations. These don&#8217;t create temporary collections but still create key-value&nbsp;tuples.</p>
-<div class="highlight"><pre><span></span><span class="k">for</span> <span class="o">((</span><span class="n">k</span><span class="o">,</span> <span class="n">v</span><span class="o">)</span> <span class="k">&lt;-</span> <span class="n">m</span><span class="o">)</span> <span class="k">yield</span> <span class="n">k</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="n">v</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span>
+<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="o">((</span><span class="n">k</span><span class="o">,</span> <span class="n">v</span><span class="o">)</span> <span class="k">&lt;-</span> <span class="n">m</span><span class="o">)</span> <span class="k">yield</span> <span class="n">k</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="n">v</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span>
<span class="n">m</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="k">case</span> <span class="o">(</span><span class="n">k</span><span class="o">,</span> <span class="n">v</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">k</span> <span class="o">-&gt;</span> <span class="o">(</span><span class="n">v</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span> <span class="o">}</span>
-</pre></div>
+</code></pre></div>


<p>If one reads the <a href="http://www.scala-lang.org/api/2.10.4/index.html#scala.collection.immutable.Map">ScalaDoc</a> closely, there&#8217;s a <code>mapValues</code> method already and it probably is the shortest and most&nbsp;performant.</p>
-<div class="highlight"><pre><span></span><span class="n">m</span><span class="o">.</span><span class="n">mapValues</span><span class="o">(</span><span class="k">_</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span>
-</pre></div>
+<div class="highlight"><pre><span></span><code><span class="n">m</span><span class="o">.</span><span class="n">mapValues</span><span class="o">(</span><span class="k">_</span> <span class="o">+</span> <span class="mi">1</span><span class="o">)</span>
+</code></pre></div>
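<p>A minimal sketch of one caveat worth checking, assuming Scala 2.10 through 2.13: as far as I know, <code>mapValues</code> returns a lazy wrapper over the original map, so the function is re-applied on each lookup, whereas <code>transform</code> builds a new strict map up front. The object name <code>MapValuesSketch</code> and the <code>calls</code> counter are hypothetical and exist only for illustration.</p>

```scala
object MapValuesSketch {
  // Hypothetical counter: tracks how often the mapping function runs.
  var calls = 0

  val m = Map("A" -> 1, "B" -> 2, "C" -> 3)

  // mapValues wraps the original map without copying it; the function
  // is applied when a value is read, not when the wrapper is created.
  // (Deprecated on 2.13, where m.view.mapValues is the suggested form.)
  val lazyPlusOne = m.mapValues { v => calls += 1; v + 1 }

  // transform eagerly builds a new Map, applying the function once per entry.
  val strictPlusOne = m.transform((_, v) => v + 1)

  def main(args: Array[String]): Unit = {
    lazyPlusOne("A")
    lazyPlusOne("A") // same key again: the function is invoked again
    println(s"function calls after two lookups: $calls")
    println(s"strict copy: $strictPlusOne")
  }
}
```

<p>If the mapped values are read many times, the strict variant may end up cheaper; if they are read once or not at all, the lazy wrapper avoids the copy.</p>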


<h2>Java&nbsp;conversion</h2>
@@ -293,7 +291,7 @@ <h2><a href="https://www.lyh.me/light-table.html">Light&nbsp;Table</a></h2>
</footer><!-- /.post-info --> </div>
<div class="summary"><p>I recently picked up <a href="http://www.lighttable.com/">Light Table</a> for <a href="http://clojure.org/">Clojure</a> development and liked it. Form evaluation works out of the box and indentation is better than that in <a href="http://plugins.jetbrains.com/plugin/?id=4050">La Clojure</a> plugin for <a href="http://www.jetbrains.com/idea/">IntelliJ <span class="caps">IDEA</span></a>.</p>
<p>I particularly like the idea of the command bar, which allows you to search for Light Table commands by name and execute them quickly. I was already used to <span class="caps">IDEA</span>&#8217;s key map though (<code>Mac OS X 10.5+</code> which is more natural to Mac users than the default <code>Mac OS X</code>), and wanted something similar. The setting files are in Clojure so it&#8217;s easy to customize. This is what I have so far for <code>user.keymap</code>:</p>
-<div class="highlight"><pre><span></span><span class="p">{</span><span class="ss">:+</span> <span class="p">{</span><span class="ss">:app</span> <span class="p">{</span><span class="s">&quot;alt-space&quot;</span> <span class="p">[</span><span class="ss">:show-commandbar-transient</span><span class="p">]}</span>
+<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="ss">:+</span> <span class="p">{</span><span class="ss">:app</span> <span class="p">{</span><span class="s">&quot;alt-space&quot;</span> <span class="p">[</span><span class="ss">:show-commandbar-transient</span><span class="p">]}</span>

<span class="ss">:editor</span> <span class="p">{</span><span class="s">&quot;alt-w&quot;</span> <span class="p">[</span><span class="ss">:editor.watch.watch-selection</span><span class="p">]</span>
<span class="s">&quot;alt-shift-w&quot;</span> <span class="p">[</span><span class="ss">:editor.watch.unwatch</span><span class="p">]</span>
@@ -304,7 +302,7 @@ <h2><a href="https://www.lyh.me/light-table.html">Light&nbsp;Table</a></h2>
<span class="s">&quot;pmeta-shift-up&quot;</span> <span class="p">[</span><span class="ss">:editor.sublime.swapLineUp</span><span class="p">]</span>
<span class="s">&quot;pmeta-shift-down&quot;</span> <span class="p">[</span><span class="ss">:editor.sublime.swapLineDown</span><span class="p">]</span>
<span class="s">&quot;pmeta-/&quot;</span> <span class="p">[</span><span class="ss">:toggle-comment-selection</span> <span class="ss">:editor.line-down</span><span class="p">]}}}</span>
-</pre></div>
+</code></pre></div>


<p>Apart from these, I found myself using <code>"pmeta-enter" [:eval-editor-form]</code> and <code>"ctrl-d" [:editor.doc.toggle]</code> most when writing Clojure code. After all they are probably the most essential ones no matter what editor you use&nbsp;:)</p>
2 changes: 0 additions & 2 deletions authors.html
@@ -33,8 +33,6 @@
title="Das Keyboard Shredder ATOM Feed"/>




</head>
<body>
