Truncated string output #6
I'm not aware of those issues. Can you send some example code?

Edit: I just tried it on a 75MB RTF spread across 1,149 CE_BLOB rows without any issues I could see.
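For what it's worth, multi-row blobs need their compressed segments concatenated in sequence order before a single decode - a minimal sketch of that, assuming `OcfLzw2.Decode` takes and returns `byte[]` (as in the CLR wrapper later in this thread) and that the segments have been fetched ordered by the blob's sequence column:

```csharp
using System.Collections.Generic;
using System.IO;
using OcfLzw;

static class MultiRowBlob
{
    // Reassemble a multi-row CE_BLOB: concatenate the compressed
    // segments in sequence order, then decode the result as one stream.
    // Decoding each row independently would fail or truncate, because
    // the compressed stream is continuous across row boundaries.
    public static byte[] Decode(IEnumerable<byte[]> segmentsInSequenceOrder)
    {
        using (var combined = new MemoryStream())
        {
            foreach (var segment in segmentsInSequenceOrder)
                combined.Write(segment, 0, segment.Length);
            return OcfLzw2.Decode(combined.ToArray());
        }
    }
}
```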
---
@pgodwin thanks very much for your quick response! We are currently excluding all multi-row blobs from our conversion, so it's not that - I can confirm it doesn't seem to be related to large blobs. You can see in the following query that many of the affected documents have a low `BLOB_LENGTH`:

```sql
SELECT TOP 100
    'broken' AS [DESCRIPTION],
    ceb.BLOB_LENGTH,
    cebr.MAX_SEQUENCE_NBR,
    RIGHT(RTRIM(CONTENT_TEXT), 10) AS TrimmedLast10Chars,
    RIGHT(RTRIM(CONTENT_TEXT), 1) AS TrimmedLastChar,
    ASCII(RIGHT(RTRIM(CONTENT_TEXT), 1)) AS TrimmedLastCharASCIICode
FROM Extract.DOCUMENT_TEXT rtf
JOIN Extract.CE_BLOB ceb ON ceb.EVENT_ID = rtf.EVENT_ID
JOIN Extract.CE_BLOB_RESULT cebr ON cebr.EVENT_ID = rtf.EVENT_ID
WHERE RIGHT(RTRIM(CONTENT_TEXT), 1) <> '}'
  AND RIGHT(RTRIM(CONTENT_TEXT), 2) <> '}' + CHAR(0)
```

Note: the above example gets its data from our own ETL extract. Currently we are decoding the blobs using a simple CLR function wrapper for your class:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using OcfLzw;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
    [return: SqlFacet(MaxSize = -1)] // varbinary(max) return, no 8000-byte cap
    public static SqlBinary CernerBlobDecode(SqlInt64 id, SqlBinary blob)
    {
        try
        {
            return OcfLzw2.Decode((byte[])blob);
        }
        catch (Exception e)
        {
            // Include the row ID so failing blobs can be located.
            throw new Exception($"ID: {id}; ERROR: {e.Message}");
        }
    }
}
```

As you can see, we are calling `OcfLzw2.Decode` directly.
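A minimal console harness like the one below can help separate decoder problems from SQLCLR problems; the harness name and the file-dump approach are assumptions for illustration, not part of the library. It assumes the compressed `BLOB_CONTENTS` bytes have been dumped to a file byte-for-byte:

```csharp
using System;
using System.IO;
using OcfLzw;

static class DecodeHarness
{
    // Decode a compressed blob dumped to disk, bypassing SQL Server and
    // the CLR wrapper entirely. If this produces complete RTF while the
    // SQLCLR path truncates, the fault is in the wrapper or transport,
    // not in the decoder itself.
    static void Main(string[] args)
    {
        byte[] compressed = File.ReadAllBytes(args[0]);
        byte[] decoded = OcfLzw2.Decode(compressed);
        File.WriteAllBytes(args[0] + ".rtf", decoded);
        Console.WriteLine($"{compressed.Length} bytes in, {decoded.Length} bytes out");
    }
}
```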
---

Thanks for the example. I too am using a CLR in a similar way.
---

I checked about 250,000 decoded RTF values and couldn't find the same issue of truncated RTF files. The code I use is an internal fork of this public one. There's only one difference I could find between my private version of the decoder and this public one, and that's line 115 of OcfLzw2 (at commit c1fc260).
My internal version has:

```csharp
if (nextCommand == -1)
    break; // EOF?
```

instead of:

```csharp
if (nextCommand < 0)
    break; // EOF?
```

EDIT: compressed OCF blobs should end with an EOF marker - `0x6F63665F626C6F6200` is the ASCII string `ocf_blob` followed by a NUL byte. You can check which rows have it with:

```sql
DECLARE @EOFMarker varbinary(max) = 0x6F63665F626C6F6200;

SELECT EVENT_ID, VALID_UNTIL_DT_TM, BLOB_CONTENTS
FROM [CE_BLOB]
WHERE RIGHT(BLOB_CONTENTS, DATALENGTH(@EOFMarker)) = @EOFMarker
```
If you've checked all of that and are still having issues, see if you can find an example that does not include any patient data (i.e. a manually created test note), post it here, and I can try to see what is going on.
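On the `nextCommand` check above: in a generic LZW read loop (sketch below, not this library's actual internals), the two conditions are equivalent as long as the code reader returns -1 only at end-of-stream; they diverge only if some other path can produce a negative value.

```csharp
static class LzwReadLoopSketch
{
    // Generic MSB-first n-bit code reader: returns -1 only when there
    // are not enough bits left for a complete code. Under that contract,
    // `nextCommand == -1` and `nextCommand < 0` behave identically.
    static int NextCode(byte[] data, ref int bitPos, int codeBits)
    {
        if (bitPos + codeBits > data.Length * 8)
            return -1; // EOF: input exhausted mid-code

        int code = 0;
        for (int i = 0; i < codeBits; i++, bitPos++)
            code = (code << 1) | ((data[bitPos / 8] >> (7 - bitPos % 8)) & 1);
        return code;
    }
}
```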
---

@pgodwin thanks again for all your suggestions - so helpful!

- Tried this - same issue.
- Thanks, I didn't know this - but I have checked and all affected files do end with the correct byte sequence.
- Going to look into this shortly with a colleague who wrote the ETL scripts. But it does strike me that for exactly the same BLOB_CONTENTS, we get a full extract using the Python port of your code. It seems very strange that that would work, given that a port should be designed to be functionally identical. I'm going to try to compare the two.
---

@pgodwin - actually, on closer inspection, the Python port is not of your code, but of the Bruce Jackson version you posted in the README. I asked ChatGPT to analyse the differences between your OcfLzw2 and the Python port. Here's what it had to say:

So it looks like they are completely different implementations - I think the key difference is that yours is more performant. I have also tried to use the Bruce Jackson C# code on my side. I couldn't get it to work out of the box as it needs a `stringSize` parameter. I adapted his code to use a...
---

And some more from ChatGPT:

> In particular, it's possible that the LzwTable class used by the C# script has a bug that causes it to miss certain codes, resulting in an incomplete dictionary. This could cause the C# script to terminate prematurely when it encounters an undefined code. The Python script, on the other hand, uses a simple list to store the lookup table, which may be more resilient to bugs or errors.
>
> Another possibility is that there may be differences in how the two scripts handle byte order or endianness. If the compressed data is stored in a different byte order than what the C# script is expecting, it could cause the C# script to misinterpret the data and truncate the output.
>
> Without more information about the specific files that are being processed and the details of the bug or error that is causing the C# script to truncate the output, it's difficult to provide a more precise explanation or solution.
---

I'm no expert at compression, so I'm not ruling out a bug in the implementation of LZW in this code, but without an example of the blob I can't say for certain. I've checked the CRCs of my decompressed versions and they match the stored CRC values.

How are you calling the Python version? Can you call the C# implementation through the same method? How are you bringing the blobs into SQL Server? Could there be an issue there?
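For anyone wanting to run the same check, below is a standard CRC-32 sketch (IEEE polynomial, reflected); whether Cerner's stored checksum uses this exact variant, and which column holds it, are assumptions to verify against known-good rows:

```csharp
static class Crc32Check
{
    // Standard reflected CRC-32 (polynomial 0xEDB88320), computed over
    // the decompressed bytes. NOTE: it is an assumption that Cerner's
    // stored checksum uses this variant - verify against known-good rows.
    public static uint Compute(byte[] data)
    {
        uint crc = 0xFFFFFFFF;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
        }
        return ~crc;
    }
}
```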
---

@pgodwin I had missed the CRC code you mentioned.

I've tried to find you on LinkedIn... perhaps we can contract some time from you to assist with this issue?
---

I'm on LinkedIn, but based in Australia, so it might be a bit difficult. I think the fact that you're getting the same result in two separate, independently developed C# implementations suggests it may be something in your dataset. I think it'd be useful to compare how you pass the compressed blobs to the Python code vs the C#.
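One cheap way to make that comparison concrete: fingerprint the raw bytes at the exact point each decoder receives them, and compare the fingerprints for the same EVENT_ID - a sketch (the logging arrangement is illustrative):

```csharp
using System;
using System.Security.Cryptography;

static class BlobFingerprint
{
    // Hash the compressed bytes exactly as the decoder receives them.
    // If the C# and Python pipelines log different fingerprints for the
    // same EVENT_ID, the corruption happens upstream of the decoder.
    public static string Compute(byte[] compressed)
    {
        using (var sha = SHA256.Create())
        {
            string hex = BitConverter.ToString(sha.ComputeHash(compressed)).Replace("-", "");
            return $"{compressed.Length} bytes, SHA-256 {hex}";
        }
    }
}
```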
---

@pgodwin
---

I am the author of the Python port you mentioned, and I just want to offer any assistance I can. I have run the Python code on about 3.2M blobs with no issues (other than RTF being another thorn in my side...).
---

@plessbd thanks for the offer to help. The issue in the end was not with the C# code but with the way I converted it to SQLCLR. So, my fault! I will post more info for others soon. By the way, the C# version running inside SQL Server is considerably faster than the Python version (around 600%), although both return the same results. Your Python script was a great help - we used it as a fallback while we sorted out the SQL implementation.
---

@thegoatherder yeah, I was not concerned about performance; for me this was a one-time dump. Though if I have time I might try to see where I could improve performance. Our bottleneck when doing this was pulling from Oracle. If you have any suggestions on performance, that would be wonderful as well.
---

**Original issue:**

We have found some Cerner blobs containing compressed RTF where the decoded output is truncated by your library. We would expect all RTFs to end with a `}` char or `}<whitespace>`. Around 20% of the CE_BLOBs in a collection of 10M PowerChart RTF documents are being truncated and do not end with `}` - documents are being cut off mid-sentence. There seems to be no discernible pattern as to what is truncating them, and I have no idea how to debug this. Is it possible there is a limit to the total size of the blob?

We experimented with another library implemented in Python and confirmed that no truncation occurred on the same files - so we are certain this isn't a problem with the source data. I also tried the Bruce Jackson alternative, but I can't find any suitable column in the Cerner Millennium schema to provide the `stringSize` parameter - we tried `CE_BLOB.BLOB_LENGTH` but this resulted in an `Index was outside the bounds of the array` error.

Any suggestions you could provide would be greatly appreciated!
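As a footnote for anyone hitting the same symptom, the `}` expectation above can be turned into a simple predicate over decoded output - a sketch, with the treatment of trailing NULs and whitespace based on the `CHAR(0)` handling in the queries earlier in the thread:

```csharp
static class RtfTruncationCheck
{
    // True if the decoded RTF looks truncated: after dropping trailing
    // NUL padding and whitespace, well-formed RTF should close with '}'.
    public static bool IsLikelyTruncated(byte[] decoded)
    {
        int i = decoded.Length - 1;
        while (i >= 0 && (decoded[i] == 0x00 || char.IsWhiteSpace((char)decoded[i])))
            i--;
        return i < 0 || decoded[i] != (byte)'}';
    }
}
```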