Duplicate marking produces malformed QS tags in SAM output #145

eboyden · 2022-01-06T03:01:07Z

With -so and writing to bam (file or stdout), every read gets a QS tag, including duplicates. But when writing to sam (file or stdout), it looks like reads marked as duplicates are missing the QS tag; instead, an empty field is present, which can cause downstream tools to crash when additional tags are appended (thus the empty field appears to be a "tag" with an improper format).

Command:

/snap/v2.0.0/snap-aligner paired /snap/index/ -pairedInterleavedFastq - -t 8 -mrl 30 -s 0 500 -fs -so -R "@RG\tID:id\tLB:lb\tPL:ILLUMINA\tPU:pu\tSM:sm" -o -sam NOINPUT.sam

Output:

cat -et NOINPUT.sam | egrep -v "^@" | head -11
SN0796:846:HK7WNBCX3:1:1201:20613:31700^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGAA^IHIIIIHG@HHHHEC@HDHEGHHFCHCE=HHEGH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:1207:5416:83002^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IIGIIIIIIIIIIIIIIHIIIIIHIHHIIHIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:1213:4603:47822^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGAA^IHHG@G<1<<@HEFCEHCC<1CD1HGE=HD111H^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:2104:2125:28883^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGCA^IHIIIIIIIIIHIIHIIIHIHHEH?GHHIHIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:2104:5543:45261^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IHIHIIHHHIHIIIHIIIIIIIIIIIIIIIIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:2215:16958:24783^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGGA^IHIHD1FIIIHF1IHGG<<1CC<<<0C/<G?HEH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1110:10832:12681^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IIIIIIIIIIHIIIIIIIIIHHHHIIIIIIHIHH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1115:14120:23567^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGGA^IGDHGIHHIHHG1HHGF@HHIIHHEHC<<IIIHH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1115:5463:88475^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGAA^IFIHGCFEEHCHHIIIHHIHHHEGC0C=FHHIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1201:13401:6342^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IFIIIHH@?HHIIHIHHHEGHGHGHIHCCHC@EH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:2203:7573:55404^I147^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGGA^IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^IQS:i:1240$

The text was updated successfully, but these errors were encountered:

bolosky · 2022-01-07T01:00:59Z

I put a partial fix for this in the dev branch. It at least trims the trailing tab from the line so that it's syntactically correct. However, it still removes the QS tag.

Unfortunately, fixing it properly will be more work. The SAM code writes SAM lines into string form into a buffer, and then can't extend the string size when it comes to mark duplicates later. The code clips out the QS tag to make extra space for the (possibly) expanded flags field. We'll need to figure out how not to do this ugliness, or else make it more ugly but functional.

Check it out and see if it helps before I put it in master. And let's keep this issue open until it's not clipping out tags.

eboyden · 2022-01-07T05:43:10Z

This seems to have done the trick, thanks. The missing QS tags aren't great but not the end of the world either; the empty fields are gone and it no longer breaks downstream tools when they append additional tags, and that's what I really care about.

bolosky · 2022-01-15T01:10:12Z

The fix for not being malformed is in 2.0.1, but it will still strip the QS tags, so I'm leaving this open.

bolosky self-assigned this Jan 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate marking produces malformed QS tags in SAM output #145

Duplicate marking produces malformed QS tags in SAM output #145

eboyden commented Jan 6, 2022

bolosky commented Jan 7, 2022

eboyden commented Jan 7, 2022

bolosky commented Jan 15, 2022

Duplicate marking produces malformed QS tags in SAM output #145

Duplicate marking produces malformed QS tags in SAM output #145

Comments

eboyden commented Jan 6, 2022

bolosky commented Jan 7, 2022

eboyden commented Jan 7, 2022

bolosky commented Jan 15, 2022