Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate marking produces malformed QS tags in SAM output #145

Open
eboyden opened this issue Jan 6, 2022 · 3 comments
Open

Duplicate marking produces malformed QS tags in SAM output #145

eboyden opened this issue Jan 6, 2022 · 3 comments
Assignees

Comments

@eboyden
Copy link

eboyden commented Jan 6, 2022

With -so and writing to bam (file or stdout), every read gets a QS tag, including duplicates. But when writing to sam (file or stdout), it looks like reads marked as duplicates are missing the QS tag; instead, an empty field is present, which can cause downstream tools to crash when additional tags are appended (thus the empty field appears to be a "tag" with an improper format).

Command:

/snap/v2.0.0/snap-aligner paired /snap/index/ -pairedInterleavedFastq - -t 8 -mrl 30 -s 0 500 -fs -so -R "@RG\tID:id\tLB:lb\tPL:ILLUMINA\tPU:pu\tSM:sm" -o -sam NOINPUT.sam

Output:

cat -et NOINPUT.sam | egrep -v "^@" | head -11
SN0796:846:HK7WNBCX3:1:1201:20613:31700^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGAA^IHIIIIHG@HHHHEC@HDHEGHHFCHCE=HHEGH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:1207:5416:83002^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IIGIIIIIIIIIIIIIIHIIIIIHIHHIIHIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:1213:4603:47822^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGAA^IHHG@G<1<<@HEFCEHCC<1CD1HGE=HD111H^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:2104:2125:28883^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGCA^IHIIIIIIIIIHIIHIIIHIHHEH?GHHIHIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:2104:5543:45261^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IHIHIIHHHIHIIIHIIIIIIIIIIIIIIIIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:1:2215:16958:24783^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGGA^IHIHD1FIIIHF1IHGG<<1CC<<<0C/<G?HEH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1110:10832:12681^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IIIIIIIIIIHIIIIIIIIIHHHHIIIIIIHIHH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1115:14120:23567^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGGA^IGDHGIHHIHHG1HHGF@HHIIHHEHC<<IIIHH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1115:5463:88475^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGAA^IFIHGCFEEHCHHIIIHHIHHHEGC0C=FHHIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:1201:13401:6342^I1171^Ichr6^I80129075^I70^I27M6S^I=^I80129245^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGTA^IFIIIHH@?HHIIHIHHHEGHGHGHIHCCHC@EH^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^I$
SN0796:846:HK7WNBCX3:2:2203:7573:55404^I147^Ichr6^I80129075^I70^I27M6S^I=^I80129246^I132^IGTAAACATTTAAATTACTTATTTCTCAGCGGGA^IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII^IPG:Z:SNAP^INM:i:0^IRG:Z:id^ILB:Z:lb^IPL:Z:ILLUMINA^IPU:Z:pu^ISM:Z:sm^IQS:i:1240$
@bolosky
Copy link
Contributor

bolosky commented Jan 7, 2022

I put a partial fix for this in the dev branch. It at least trims the trailing tab from the line so that it's syntactically correct. However, it still removes the QS tag.

Unfortunately, fixing it properly will be more work. The SAM code writes SAM lines into string form into a buffer, and then can't extend the string size when it comes to mark duplicates later. The code clips out the QS tag to make extra space for the (possibly) expanded flags field. We'll need to figure out how not to do this ugliness, or else make it more ugly but functional.

Check it out and see if it helps before I put it in master. And let's keep this issue open until it's not clipping out tags.

@bolosky bolosky self-assigned this Jan 7, 2022
@eboyden
Copy link
Author

eboyden commented Jan 7, 2022

This seems to have done the trick, thanks. The missing QS tags aren't great but not the end of the world either; the empty fields are gone and it no longer breaks downstream tools when they append additional tags, and that's what I really care about.

@bolosky
Copy link
Contributor

bolosky commented Jan 15, 2022

The fix for not being malformed is in 2.0.1, but it will still strip the QS tags, so I'm leaving this open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants