
Truncated string output #6

Open
thegoatherder opened this issue Mar 27, 2023 · 14 comments

thegoatherder commented Mar 27, 2023

We have found some Cerner blobs containing compressed RTF where the decoded output is truncated by your library. We would expect every RTF to end with a } character or }<whitespace>. Around 20% of the CE_BLOBs in a collection of 10M PowerChart RTF documents are being truncated and do not end with } - documents are being cut off mid-sentence. There seems to be no discernible pattern as to what is truncating them, and I have no idea how to debug this.

Is it possible there is a limit to the total size of the blob?

We experimented with another library implemented in Python and confirmed that no truncation occurred on the same files, so we are certain this isn't a problem with the source data. I also tried the Bruce Jackson alternative, but I can't find any suitable column in the Cerner Millennium schema to provide the stringSize parameter - we tried CE_BLOB.BLOB_LENGTH, but this resulted in an "Index was outside the bounds of the array" error.

Any suggestions you could provide would be greatly appreciated!

pgodwin (Owner) commented Mar 27, 2023

I'm not aware of those issues. Can you send some example code?
Remember that the notes in PowerChart can be spread across multiple rows.

Edit: I just tried it on a 75MB RTF spread across 1,149 CE_BLOB rows without any issues I could see.

  • Can you provide the code you're using to decode the files?
  • Make sure you're using OcfLzw.OcfLzw2.Decode; I really need to delete the largely untested v1. V2 is battle-tested and reasonably performant.
  • What size documents are you having issues with?
  • Are you trimming the EOF marker (ocf_blob\0) from the end of each blob? (See the sketch below for one way to strip it.)
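
A minimal sketch (not code from this repo) of one way to strip the trailing ocf_blob\0 marker before handing the bytes to the decoder. The helper class and method names are invented for illustration, and it assumes the Decode(byte[]) usage shown later in this thread:

using System;
using System.Linq;

static class BlobHelper
{
    // "ocf_blob\0" = 0x6F63665F626C6F6200
    static readonly byte[] EofMarker = { 0x6F, 0x63, 0x66, 0x5F, 0x62, 0x6C, 0x6F, 0x62, 0x00 };

    // Returns the blob without its trailing EOF marker, if one is present.
    public static byte[] TrimEofMarker(byte[] blob)
    {
        if (blob != null &&
            blob.Length >= EofMarker.Length &&
            blob.Skip(blob.Length - EofMarker.Length).SequenceEqual(EofMarker))
        {
            var trimmed = new byte[blob.Length - EofMarker.Length];
            Array.Copy(blob, trimmed, trimmed.Length);
            return trimmed;
        }
        return blob; // no marker found, pass through unchanged
    }
}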

thegoatherder (Author)

@pgodwin thanks very much for your quick response!

We are currently excluding all multi-row blobs from our conversion, so it's not that. I can confirm CE_BLOB.MAX_SEQUENCE_NBR = 1 for all documents we converted.

It doesn't seem to be related to large blobs. You can see in the following query that many of the affected documents have a low BLOB_LENGTH:

select top 100
    'broken' as [DESCRIPTION],
    ceb.BLOB_LENGTH,
    cebr.MAX_SEQUENCE_NBR,
    RIGHT(RTRIM(CONTENT_TEXT), 10) AS TrimmedLast10Chars,
    RIGHT(RTRIM(CONTENT_TEXT), 1) AS TrimmedLastChar,
    ASCII(RIGHT(RTRIM(CONTENT_TEXT), 1)) AS TrimmedLastCharASCIICode
from Extract.DOCUMENT_TEXT rtf
    JOIN Extract.CE_BLOB ceb ON ceb.EVENT_ID = rtf.EVENT_ID
    JOIN Extract.CE_BLOB_RESULT cebr ON cebr.EVENT_ID = rtf.EVENT_ID
WHERE RIGHT(RTRIM(CONTENT_TEXT),1) <> '}'
    AND RIGHT(RTRIM(CONTENT_TEXT),2) <> '}' + CHAR(0)

(screenshot of query results)

Note: the above example gets its data from Extract.DOCUMENT_TEXT - this is not a Cerner table but our own user-defined table used for storing the decoded blob data.

Currently we are decoding the blobs using a simple CLR function wrapper for your class:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using OcfLzw;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlBinary CernerBlobDecode(SqlInt64 id, SqlBinary blob)
    {
        try
        {
            return OcfLzw2.Decode((byte[])blob);
        }
        catch (Exception e)
        {
            throw new Exception($"ID: {id}; ERROR: {e.Message}", e);
        }
    }
}

As you can see, we are using OcfLzw2.Decode() not the v1 version.
Just to reiterate, this issue only affects around 20% of our documents extracted - the rest are perfect.

pgodwin (Owner) commented Mar 27, 2023

Thanks for the example. I too am using a CLR function in a similar way.
I'll double check the documents I've got to see if I can see the same issue.

pgodwin (Owner) commented Mar 27, 2023

I checked about 250,000 decoded RTF values, and couldn't find the same issue of truncated RTF files.

The code I use is an internal fork of this public one. There's only one difference I could find between my private version of the decoder and this public one, and that's line 115 of OcfLzw (if (nextCommand < 0)):

My internal version has:

if (nextCommand == -1)
   break; // EOF?

instead of

if (nextCommand < 0)
    break; // EOF?

EDIT:
Things to try:

  1. You could try modifying the code with the change above and try decompressing the problem blobs again. I'd be very curious to see if that fixes it.
  2. Check the BLOB_CONTENTS values in your database. They should each end with ocf_blob\0 (0x6F63665F626C6F6200). You can check this with SQL:

     declare @EOFMarker varbinary(max) = 0x6F63665F626C6F6200;

     SELECT EVENT_ID, VALID_UNTIL_DT_TM, BLOB_CONTENTS
       FROM [CE_BLOB]
       where right(BLOB_CONTENTS, datalength(@EOFMarker)) = @EOFMarker

  3. Check your joins between CE_BLOB and CE_BLOB_RESULT. The VALID_UNTIL_DT_TM field should be included in your joins. Perhaps your MAX_SEQUENCE_NBR is for an old version of the CE_BLOB record?

If you've checked all of that and are still having issues, see if you can find an example that does not include any patient data (i.e. a manually created test note), post it here, and I can try to see what is going on.

thegoatherder (Author)

@pgodwin thanks again for all your suggestions - so helpful!

> Things to try:
>
> You could try modifying the code with the change above and try decompressing the problem blobs again. I'd be very curious to see if that fixes it.

Tried this - same issue.

> Check the BLOB_CONTENTS values in your database. They should each end with ocf_blob\0 (0x6F63665F626C6F6200). You can check this with SQL:

Thanks, I didn't know this - but I have checked, and all affected files do end with the correct byte sequence.

> Check your joins between CE_BLOB and CE_BLOB_RESULT. The VALID_UNTIL_DT_TM field should be included in your joins. Perhaps your MAX_SEQUENCE_NBR is for an old version of the CE_BLOB record?

Going to look into this shortly with a colleague who wrote the ETL scripts.

But it does strike me that for exactly the same BLOB_CONTENTS we get a full extract using the Python port of your code. It seems very strange that it would work, given that it's a port and therefore presumably designed to be functionally identical.

I'm going to try to compare the C# and Python scripts to see if I can spot the difference.

thegoatherder (Author)

@pgodwin - actually, on closer inspection, the Python port is not a port of your code, but of the Bruce Jackson version you posted in the README.

I asked ChatGPT to analyse the differences between your OcfLzw2 and the Python port. Here's what it had to say:

Certainly! Here are some of the differences between the two scripts:

The C# script uses a LzwTable class to store the LZW dictionary, whereas the Python script uses a list of lzwItem objects to store the lookup table. The LzwTable class has a fixed size and is implemented using an array, whereas the Python list can grow dynamically.

The C# script uses a NBitStream class to read the compressed data, whereas the Python script reads the data byte by byte from the input buffer.

The C# script uses a StringBuilder to accumulate the decompressed string, whereas the Python script appends the decompressed data to a list.

The C# script uses the Encoding.Default property to decode the decompressed byte array into a string, whereas the Python script does not decode the data into a string.

The C# script uses a fixed dictionary size of 8192 codes, whereas the Python script does not have a fixed dictionary size.

The C# script uses a dynamic bit-width scheme to adjust the number of bits needed to represent each LZW code, whereas the Python script uses a fixed bit-width scheme.

The C# script has a Decode method that takes a Stream object as input, whereas the Python script has a decompress method that takes a byte array as input.

The C# script uses bitwise operations to read and manipulate the compressed data, whereas the Python script uses arithmetic operations and bit shifting.

Overall, both scripts implement the LZW algorithm to decompress the OCF compressed data, but they use different data structures and algorithms to achieve this. The C# script is optimized for performance and uses a fixed dictionary size and a dynamic bit-width scheme to minimize the memory overhead. The Python script is simpler and more flexible, using a list to store the lookup table and a byte-by-byte input scheme to handle the compressed data.

So it looks like they are completely different implementations - I think the key part is that yours is more performant.

I have also tried to use the Bruce Jackson C# code on my side. I couldn't get it to work out of the box, as it needs a stringSize input and I could not find this in the Cerner tables (only BLOB_LENGTH, which doesn't seem to work).

I adapted his code to use a MemoryStream instead, to avoid having to predeclare the length of the bytes array. When I ran it as a CLR function, I got the same result as your C# script (i.e. truncated RTF on some records) - but much slower.
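
For illustration only (this is not the Bruce Jackson code, and the method and parameter names here are invented), the general shape of that adaptation - accumulating output in a MemoryStream so the decoded length never has to be declared up front - looks something like this:

using System.Collections.Generic;
using System.IO;

static class GrowableDecodeSketch
{
    // Hypothetical helper: append decoded chunks to a growable MemoryStream
    // instead of writing into a pre-sized byte[stringSize] buffer.
    public static byte[] CollectDecodedBytes(IEnumerable<byte[]> decodedChunks)
    {
        using (var output = new MemoryStream())
        {
            foreach (var chunk in decodedChunks)      // chunks produced by the decoder loop
                output.Write(chunk, 0, chunk.Length); // stream grows as needed
            return output.ToArray();                  // final length known only at the end
        }
    }
}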

thegoatherder (Author) commented Mar 28, 2023

and some more...

Author:
Comparing the 2 scripts, can you suggest why the C# version sometimes truncates the decoded OCF? Tested on 10M files, we found 2M of them were truncated by the C# version - but they are fine in the Python version

ChatGPT:
Based on the code, it's difficult to determine exactly why the C# version is truncating the decoded OCF for some files, but one possibility is that it has a bug in how it handles the LZW dictionary.

In particular, it's possible that the LzwTable class used by the C# script has a bug that causes it to miss certain codes, resulting in an incomplete dictionary. This could cause the C# script to terminate prematurely when it encounters an undefined code. The Python script, on the other hand, uses a simple list to store the lookup table, which may be more resilient to bugs or errors.

Another possibility is that there may be differences in how the two scripts handle byte order or endianness. If the compressed data is stored in a different byte order than what the C# script is expecting, it could cause the C# script to misinterpret the data and truncate the output.

Without more information about the specific files that are being processed and the details of the bug or error that is causing the C# script to truncate the output, it's difficult to provide a more precise explanation or solution.

pgodwin (Owner) commented Mar 28, 2023

I'm no expert at compression, so I'm not ruling out a bug in the implementation of LZW in this code. Without an example blob I can't say for certain. I've checked the CRCs of my decompressed versions and they match the CRC in CE_BLOB_RESULT, suggesting it is decompressing items correctly.

How are you calling the Python version? Can you call the C# implementation through the same method?

How are you bringing the blobs into SQL Server? Could there be an issue there?

thegoatherder (Author)

@pgodwin I had missed the CRC code on CE_BLOB_RESULT as we didn't extract that column into our staging. I'll take a look at this.

I've tried to find you on LinkedIn... perhaps we can contract some time from you to assist with this issue?

pgodwin (Owner) commented Mar 29, 2023

I'm on LinkedIn, but I'm based in Australia, so it might be a bit difficult.

I think the fact that you're getting the same result in two separate, independently developed C# implementations suggests it may be something in your dataset.

I think it'd be useful to compare how you pass the compressed blobs to the Python code vs C#.

thegoatherder (Author)

@pgodwin
Apologies for the delay; I had to have some internal discussions about this.
I have managed to isolate the truncation issue to documents attached to TEST PATIENTs.
I have sent you an email at the Gmail address specified in the git log of this repo - if you don't receive it, please let me know how best to contact you.

plessbd commented May 9, 2023

I am the author of the Python port you mentioned and just want to offer any assistance I can. I have run the Python code on about 3.2M blobs with no issues (other than RTF being another thorn in my side...).

thegoatherder (Author)

@plessbd thanks for the offer to help. The issue in the end was not with the C# script but with the way I converted it to SQLCLR. So, my fault! I will post more info for others soon. By the way, the C# version running inside SQL Server is considerably faster than the Python version (around 600%), although both return the same results. Your Python script was a great help; we used it as a fallback while we sorted out the SQL implementation.
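
(The details of the SQLCLR issue were never posted in this thread. For anyone wrapping the decoder in SQLCLR themselves, one common cause of silently truncated output - offered here only as an assumption, not as what actually happened above - is registering the input or output as varbinary(8000) rather than varbinary(max), both in the CREATE FUNCTION statement and in the CLR signature. A hedged sketch of a signature that requests max-length binary on both sides, reusing the Decode(byte[]) call from the earlier comment:)

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using OcfLzw;

public partial class UserDefinedFunctions
{
    // Hypothetical variant of the earlier CLR function: SqlFacet(MaxSize = -1) on both
    // the parameter and the return value asks for varbinary(max) semantics, so neither
    // the incoming blob nor the decoded output is capped at 8000 bytes.
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlBinary CernerBlobDecodeMax(
        SqlInt64 id,
        [SqlFacet(MaxSize = -1)] SqlBinary blob)
    {
        return OcfLzw2.Decode((byte[])blob);
    }
}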

plessbd commented May 11, 2023

@thegoatherder yeah, I was not concerned about performance; for me this was a one-time dump. Though if I have time I might try to see where I could improve performance. Our bottleneck when doing this was pulling from Oracle. If you have any suggestions on performance, that would be wonderful as well.
