Update README.md
Add info about the version to use, current status, etc
pgodwin authored Aug 12, 2019
1 parent b077705 commit c1fc260
Showing 1 changed file with 22 additions and 15 deletions.
README.md
Implements "OCF Compression" as used by Cerner PowerChart.

The compression is basically just the same LZW compression as used in TIFF images.
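
For orientation, a sketch of the control codes the TIFF flavour of LZW reserves (these values come from the TIFF 6.0 specification, not from this repository's source):

```csharp
// TIFF-style LZW reserves two control codes above the 256 literal byte values.
const int ClearCode = 256;        // resets the string table and the code width
const int EndOfInformation = 257; // marks the end of the compressed stream
const int FirstFreeCode = 258;    // first slot for dynamically built strings
```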

Please use the `OCfLzw2` version of this library. The previous version in the `OcfLzw` folder is
no longer under development, as it was too slow.
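
As a rough usage sketch (the class and method names here are assumptions for illustration; check the `OCfLzw2` source for the actual API):

```csharp
using System;
using System.IO;
using System.Text;
using OcfLzw2; // assumed namespace

byte[] compressed = File.ReadAllBytes("cerner_blob.bin"); // hypothetical input file
byte[] decompressed = LzwDecoder.Decode(compressed);      // assumed method name
Console.WriteLine(Encoding.UTF8.GetString(decompressed));
```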

An alternate implementation by Bruce Jackson may also be of interest:
* https://gist.github.com/pgodwin/7d66729444173146ad698d154f2b9b6c
* https://blog.brucejackson.info/2013/03/deconstructing-lzw-decompression.html/

## Current Status
The code has been extensively tested against more than 10 million "wild" LZW blobs without issue.
It handles compressed streams with a chunk-size of up to 13 bits.
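
A 13-bit maximum implies a string table of up to 8192 entries. A minimal sketch of the width schedule such a decoder would follow, assuming the standard TIFF-style growth rule extended by one bit (some encoders widen one code early; that detail is omitted here):

```csharp
// The code width grows as the string table fills; 13 bits allows 8192 codes.
static int CodeWidth(int nextFreeCode)
{
    if (nextFreeCode < 512)  return 9;
    if (nextFreeCode < 1024) return 10;
    if (nextFreeCode < 2048) return 11;
    if (nextFreeCode < 4096) return 12;
    return 13; // one bit beyond the TIFF 6.0 maximum of 12
}
```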

## History
I wanted to produce a simple LZW decoder which could specifically handle "BLOB" data from Cerner PowerChart.
I also wanted something which was somewhat "C# like" rather than just another port of the C implementations.
As such, it trades speed for "style".

This port is based on two Java implementations of TIFF LZW:

1. PDFBox was the original source (https://pdfbox.apache.org/).
2. TwelveMonkeys TIFF LZWDecoder (https://github.com/haraldk/TwelveMonkeys)

I've made use of PDFBox's NBitStream for reading the variable bit-widths from the stream.
However, its LZW decoder seemed to be very heap-intensive (and I really didn't need the hash overhead of a dictionary),
with decompressing 1 million values taking over 2 minutes on an i5-4590s.
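
For illustration, a minimal variable-width bit reader in the spirit of PDFBox's NBitStream (the name and shape are assumptions, not the library's actual code):

```csharp
using System.IO;

// Reads big-endian codes of arbitrary width (9-13 bits here) from a stream.
sealed class NBitReader
{
    private readonly Stream _input;
    private int _buffer; // accumulated bits, right-aligned
    private int _count;  // number of valid bits currently in _buffer

    public NBitReader(Stream input) => _input = input;

    // Returns the next 'bits'-wide code, or -1 at end of stream.
    public int Read(int bits)
    {
        while (_count < bits)
        {
            int b = _input.ReadByte();
            if (b < 0) return -1;
            _buffer = (_buffer << 8) | b; // append the next byte's bits
            _count += 8;
        }
        _count -= bits;
        return (_buffer >> _count) & ((1 << bits) - 1);
    }
}
```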

As I want to run this across millions of rows, the performance just wasn't there.
Instead, the main decoding loop is based on the TwelveMonkeys ImageIO extensions,
specifically https://github.com/haraldk/TwelveMonkeys/blob/master/imageio/imageio-tiff/src/main/java/com/twelvemonkeys/imageio/plugins/tiff/LZWDecoder.java

Rather than a dictionary and node-tree, it just uses an array. This reduces the number of objects created and has a much lower memory footprint.
As a result, running 1 million values on the same i5-4590s now completes in a fraction of that time.
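
A condensed sketch of that array-based approach, reusing the hypothetical `NBitReader` and `CodeWidth` helpers sketched above (simplified: the early-change width rule and bounds checks are omitted):

```csharp
using System.IO;

// Decode an LZW stream using a flat array as the string table.
static byte[] Decode(Stream input)
{
    var reader = new NBitReader(input);
    var output = new MemoryStream();

    var table = new byte[1 << 13][];  // up to 8192 entries for 13-bit codes
    for (int i = 0; i < 256; i++)
        table[i] = new[] { (byte)i }; // single-byte root strings
    int nextFree = 258;               // 256 = clear, 257 = end-of-information

    byte[] previous = null;
    int code;
    while ((code = reader.Read(CodeWidth(nextFree))) >= 0 && code != 257)
    {
        if (code == 256) { nextFree = 258; previous = null; continue; } // clear code

        byte[] entry = (code < nextFree && table[code] != null)
            ? table[code]                    // code already in the table
            : Append(previous, previous[0]); // the code-not-yet-defined case

        output.Write(entry, 0, entry.Length);
        if (previous != null)
            table[nextFree++] = Append(previous, entry[0]); // grow the table
        previous = entry;
    }
    return output.ToArray();
}

// Returns prefix + one extra byte.
static byte[] Append(byte[] prefix, byte b)
{
    var result = new byte[prefix.Length + 1];
    prefix.CopyTo(result, 0);
    result[prefix.Length] = b;
    return result;
}
```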

I think there's room for further optimisation, but I want to try and ensure the code remains readable.

### Todo:

* [ ] Package and publish as nuget packages
* [ ] Add SQL CLR functions (likely as a separate library; see the sketch below)
* [ ] Further performance improvements
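
As a rough illustration of what such a SQL CLR wrapper might look like (entirely hypothetical; the `SqlFunction` attribute pattern is standard SQL CLR, but the `OcfLzw2` call is an assumed API):

```csharp
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public static class OcfLzwSqlFunctions
{
    // Usable from T-SQL once deployed, e.g.:
    //   SELECT dbo.OcfLzwDecompress(blob_column) FROM some_table;
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlBytes OcfLzwDecompress(SqlBytes input)
    {
        if (input == null || input.IsNull)
            return SqlBytes.Null;

        byte[] decoded = OcfLzw2.LzwDecoder.Decode(input.Value); // assumed API
        return new SqlBytes(decoded);
    }
}
```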

## Licenses
License: http://www.apache.org/licenses/LICENSE-2.0

1 comment on commit c1fc260

@SolitaryBuscuit

This is awesome work, and works well for test cases as a SQL CLR function. I'm hoping to connect and pick your brain about some issues I'm having, specifically around the initial requirements for Cerner BLOBS.

I appreciate this and all your work, and let me know if we can connect so I can pick your brain.
