This repository has been archived by the owner on Oct 3, 2021. It is now read-only.

Update nla-digbench-scaling, account for recently found overflows in nla-digbench #1200

Closed

Conversation

MartinSpiessl
Contributor

This fixes #1198

Overflows were discovered in nla-digbench and fixed via #1155. This also needs to be propagated to the bounded versions in nla-digbench-scaling. For now it is enough to just disregard the tasks that have overflows (i.e., where the verdict for no-overflow is false) and regenerate the tasks, which is what this PR does.

This PR can be reproduced as usual by running git rm *bound*{c,i,yml} && python3 generate.py in the nla-digbench-scaling folder, so the changes to that script reflect what this PR will change.

@hernanponcedeleon
Contributor

The script has some file names hardcoded to apply some "per-file" solution. Don't you need to also update those with the correct new names?

The hardcoded fixes for which tasks are fixed by which bound were
broken because the file names changed. Now we have a more intelligent
regex matching to decide which fix to apply for which task name.
@MartinSpiessl
Contributor Author

The script has some file names hardcoded to apply some "per-file" solution. Don't you need to also update those with the correct new names?

Good point, I completely forgot about those verdict fixes! I improved the file name matching with regexes via 7eab343. The regexes make sure that hard.c and hard2.c are recognized as different tasks by matching hard.c via ^hard[^0-9] and hard2.c via ^hard2[^0-9], i.e., we demand that a non-digit character follows the specified prefix.

This would break if someone added a task with a name like hardening.c, so I updated the regexes to also exclude alphabetic characters, e.g. ^hard[^0-9] -> ^hard[^a-zA-Z0-9]. It seems unlikely that someone would add variants of a task named harda.c and hardb.c without a dash or underscore in between.
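The matching scheme can be sketched as follows (a hypothetical prefix_pattern helper; the actual generate.py may structure this differently):

```python
import re

def prefix_pattern(name):
    """Hypothetical helper mirroring the scheme above: the task-name
    prefix must be followed by a non-alphanumeric character, so
    'hard' matches neither 'hard2*' nor 'hardening*'."""
    return "^" + re.escape(name) + "[^a-zA-Z0-9]"

assert re.search(prefix_pattern("hard"), "hard-ll.c")
assert re.search(prefix_pattern("hard2"), "hard2_unwindbound1.c")
assert not re.search(prefix_pattern("hard"), "hard2_unwindbound1.c")
assert not re.search(prefix_pattern("hard"), "hardening.c")
assert not re.search(prefix_pattern("hard2"), "hard-ll.c")
```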

@MartinSpiessl
Contributor Author

I will merge #1204 into this PR and run generate.py again.

@hernanponcedeleon
Contributor

Don't we need to add the following assert not re.search(p("hard2"), "hard-ll.c")?

Also, I guess there is something missing to handle *-u.c files, right?

hernanponcedeleon and others added 3 commits October 30, 2020 13:54
For some tasks the overflows were fixed while the reachability
property is broken. These need to be treated differently
in nla-digbench-scaling.
@MartinSpiessl
Contributor Author

Don't we need to add the following assert not re.search(p("hard2"), "hard-ll.c")?

Sure, can't hurt!

Also, I guess there is something missing to handle *-u.c files, right?

You are right, I didn't expect that also versions of the tasks were added where the overflows are fixed but the reachability property is violated.

It makes sense of course to add these task versions. For these, the current verdict guesses are off anyway. I would have to run some experiments to check how the verdicts are affected by the bounds (but not doing these experiments would mean another PR later when the prerun results arrive).

Actually this would just be geo1-u and hard-u, right? For geo1 we do not have special handling yet, so this just affects hard-u.
Commit 7e8be55 should fix this, but I still think the reachability verdicts for the bounded versions of geo1-u might actually be true in some cases where the bound is so low that the unsigned overflow cannot occur. The bound on the values might have a similar effect. I still need to check this.

@MartinSpiessl
Contributor Author

@PhilippWendler I saw you mention in #1202 (review) that we could already add license headers to files. The files in nla-digbench-scaling are generated from files in nla-digbench, so once the license headers are there I could add it to the files in nla-digbench-scaling by just running the generation script again.

That leaves the README.md and the generation script generate.py here. Those have been written by me, so I can just add the usual Apache 2.0 license header (wrapped in <!-- --> for the README.md):

# This file is part of the SV-Benchmarks collection of verification tasks:
# https://github.com/sosy-lab/sv-benchmarks
#
# SPDX-FileCopyrightText: 2011-2020 The SV-Benchmarks Community
#
# SPDX-License-Identifier: Apache-2.0

or do you want me to add something different?

@PhilippWendler
Member

Sure, that would be great! The more reuse headers we get, the better. If you finish a directory (except for the task-definition files, where we don't need headers), we can even remove that directory from .reuse/dep5.

What you describe seems good.

@hernanponcedeleon
Contributor

Actually this would just be geo1-u and hard-u, right?

There is also dijkstra-u.c. But it is different from the others in the sense that the expected result is true.

For geo1 we do not have special handling yet, so this just affects hard-u.
Commit 7e8be55 should fix this, but I still think the reachability verdicts for the bounded versions of geo1-u might actually be true in some cases where the bound is so low that the unsigned overflow cannot occur. The bound on the values might have a similar effect. I still need to check this.

Without taking a deep look at any of the benchmarks, I think constraining either the loops or the nondet values can prevent the bug. This would be benchmark-dependent (short loops with big increases in variables might still trigger the false result). Thus, it might be that the expected result in the hard-u scaled versions is also true.

As you said, this would require more in-depth checking. For this year, I would just go for the solution that applies only to the benchmarks where unreach-call = true, since we are 100% sure the script generates sound benchmarks.
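The effect of bounding the nondet values can be illustrated with a hypothetical overflow check (not taken from the benchmarks; 64-bit unsigned C semantics simulated explicitly):

```python
def square_overflows_u64(z):
    """Hypothetical illustration: would z*z wrap around in 64-bit
    unsigned C arithmetic?"""
    return z * z > 2**64 - 1

# With values bounded (as in the valueboundN variants) the overflow
# is unreachable, so the expected no-overflow verdict flips to true:
assert not any(square_overflows_u64(z) for z in range(101))
# Without the bound, large values still overflow:
assert square_overflows_u64(2**33)
```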

@MartinSpiessl
Contributor Author

I ran our k-induction with a 200 seconds time limit on the tasks and found some verdicts that are currently wrong:
table.zip (contains the HTML results table, since GitHub doesn't allow uploading other formats 😑)

Labeled with unreach-call:false while it is actually true:

geo1-u_valuebound1.yml
hard-u_valuebound1.yml
hard-u_valuebound10.yml
hard-u_valuebound100.yml
hard-u_valuebound2.yml
hard-u_valuebound20.yml
hard-u_valuebound5.yml
hard-u_valuebound50.yml

Labeled with unreach-call:true while it is actually false:

dijkstra-u_unwindbound5.yml
egcd-ll_unwindbound1.yml
hard-ll_unwindbound1.yml
hard-ll_unwindbound2.yml

I will see to correcting the verdicts.

@hernanponcedeleon
Contributor

geo1-u and hard-u: you set the expected result to false. However, the overflow occurs when the non-deterministic values are large. Thus, in all instances where you restrict the values, the expected result should be true.

hard-ll_unwindbound1.yml and hard-ll_unwindbound2.yml: in the original benchmark, there are assertions at the end specifying that the final computation is correct. By finishing earlier, those assertions can be violated. I guess that for 100 iterations, there are still values for which the computation would not finish and thus the assertions can be violated. The expected result of all hard-ll_unwindboundX.yml has to be set to false.

I think the above shows that we need to do a per-benchmark evaluation for the cases where we bound the loops.

@MartinSpiessl
Contributor Author

geo1-u and hard-u: you set the expected result to false

au contraire, I wrote that they are currently set to false, and I will set them to true.

The expected result of all hard-ll_unwindboundX.yml has to be set to false

I agree, and my analyses also showed that hard-ll_unwindbound{1,2}.yml are unsafe (higher values probably timed out). I presented the intermediate findings only to show places where we need to look more closely; I was never stating that I would just change the verdicts of those tasks that were identified by the k-induction analysis.

I think the above shows that we need to do a per-benchmark evaluation for the cases where we bound the loops.

Yes, there are not that many benchmarks, so this is doable. We could additionally fuzz-test them to check our intuition. In practice, we could also just encode the verdicts to the best of our knowledge and count on the verifiers to find any mistakes in the labeling. This is what we usually do with new tasks, where we generally don't know whether the verdicts are correct.

@hernanponcedeleon
Contributor

au contraire, I wrote that they are currently set to false, and I will set them to true.

Yes, I meant in the current state of the files, not your previous comment.
Just a remark, this is the case for all geo1-u_valueboundX.yml and not just geo1-u_valuebound1.yml.

I agree, and my analyses also showed that hard-ll_unwindbound{1,2}.yml are unsafe (higher values probably timed out). I presented the intermediate findings only to show places where we need to look more closely; I was never stating that I would just change the verdicts of those tasks that were identified by the k-induction analysis.

In the way these benchmarks are generated, any time we find a benchmark *_unwindboundX.yml is false, we can also set to false any benchmark *_unwindboundY.yml with X < Y.
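That monotonicity rule could be applied as a small post-processing step over the expected verdicts (a sketch with hypothetical helper names, not part of generate.py; it assumes that, by the generation scheme, a violation reachable within X unwindings stays reachable for every larger bound):

```python
import re

def propagate_false_unwind_verdicts(verdicts):
    """Given {task_name: expected_verdict} for *_unwindboundX tasks,
    set *_unwindboundY to False for every Y > X once X is False."""
    updated = dict(verdicts)
    for name, verdict in verdicts.items():
        m = re.match(r"(.*_unwindbound)(\d+)$", name)
        if m is None or verdict is not False:
            continue
        prefix, bound = m.group(1), int(m.group(2))
        for other in verdicts:
            m2 = re.match(re.escape(prefix) + r"(\d+)$", other)
            if m2 and int(m2.group(1)) > bound:
                updated[other] = False
    return updated
```

For example, marking hard-ll_unwindbound1 false would automatically mark hard-ll_unwindbound2 and hard-ll_unwindbound5 false as well.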

In practice, we could also just encode the verdicts to the best of our knowledge and count on the verifiers to find any mistakes in the labeling. This is what we usually do with new tasks where we generally don't know whether the verdicts are correct.

Ok

@MartinSpiessl
Contributor Author

geo1-u and hard-u: you set the expected result to false. However, the overflow occurs when the non-deterministic values are large.

geo1-u actually has an underflow that triggers the violation! There were no violations with the value bound anymore because I accidentally removed 0 from the range of possible values. It fails if z is zero because later 1 is subtracted from z, which leads to an underflow and to the assertions being violated. I updated the bounds to include 0 for all tasks, as this was the original intention, and updated all verdicts to the best of my knowledge.
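The wrap-around can be simulated with explicit 32-bit unsigned arithmetic (a sketch; the actual variable types and loop structure in geo1-u may differ):

```python
U32 = 2**32

def dec_u32(z):
    """C-style decrement of an unsigned 32-bit variable:
    z - 1 wraps around when z == 0."""
    return (z - 1) % U32

assert dec_u32(5) == 4
assert dec_u32(0) == U32 - 1  # underflow: 0 - 1 wraps to 4294967295
```

Excluding 0 from the value bound therefore hid exactly the input that triggers the violation.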

I fixed every verdict where CPAchecker found a wrong verdict and also manually checked all programs where CPAchecker timed out even with unwindbound1. Checked boxes mark tasks that will fail with an unwindbound:

- [x] fermat1-ll: has a CEX A=11;R=4;
- [x] fermat2-ll: has a CEX A=11;R=4;
- [ ] geo1-ll: condition after the loop can be rewritten to be the same as the one in each loop iteration => bounding iterations cannot make the task fail
- [ ] geo2-ll: condition after the loop is equal to the one in the loop => bounding iterations cannot make the task fail
- [ ] geo3-ll: condition after the loop is equal to the one in the loop => bounding iterations cannot make the task fail
- [ ] knuth: no postconditions after the loop that need to be satisfied => bounding iterations cannot make the task fail
- [x] lcm1: post condition clearly different from the invariant in the loop => should fail when bounded; CEX: a=3;b=1;
- [ ] lcm2: post condition same as the condition in the loop => bounding iterations cannot make the task fail
- [x] prod4br-ll: post condition clearly different from the invariant in the loop => should fail when bounded; CEX: x=42;y=7;
- [x] prodbin-ll: post condition clearly different from the invariant in the loop => should fail when bounded; CEX: a=42;b=7;
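The "post condition equals loop condition" argument for the unchecked tasks above can be sketched abstractly (a hypothetical, simplified model, not the benchmark code): if the assertion after the loop is exactly the invariant maintained by every iteration, cutting the loop off early cannot violate it.

```python
def run_bounded(init, step, invariant, bound):
    """Run a loop that maintains `invariant`, cutting it off after
    `bound` iterations the way the unwindboundN variants do, and
    check the post-loop assertion when it equals the invariant."""
    state = init
    for _ in range(bound):
        assert invariant(state)
        state = step(state)
    return invariant(state)  # post-loop assertion == loop invariant

# Geometric-series example (fixed z): x = 1 + z + ... + z^(k-1),
# with invariant x*(z-1) + 1 == z**k preserved by every step.
z = 3
step = lambda s: (s[0] * z + 1, s[1] + 1)
invariant = lambda s: s[0] * (z - 1) + 1 == z ** s[1]

for bound in (1, 2, 5):
    assert run_bounded((1, 1), step, invariant, bound)
```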

@hernanponcedeleon
Contributor

Can you let me know the command to run CPAchecker with k-induction?

@MartinSpiessl
Contributor Author

scripts/cpa.sh -kInduction ~/sv-benchmarks/c/nla-digbench-scaling/geo1-u_valuebound100.c -spec config/properties/unreach-call.prp

The SMT formulas we build sometimes struggle with multiplications, it seems, so changing the theory from bitvectors to linear integer arithmetic sometimes makes sense (though it is potentially unsound). For that you need to add the following three configuration options to the command above:

-setprop solver.solver=SMTInterpol -setprop cpa.predicate.encodeBitvectorAs=INTEGER -setprop cpa.predicate.encodeFloatAs=RATIONAL


@hernanponcedeleon left a comment


The changes related to this PR are ok.
However, I still suspect there is an overflow in some of the base benchmarks (see #1166), but this is independent of this PR, so we can already merge.

Thanks @MartinSpiessl for the work!

@hernanponcedeleon
Contributor

@MartinSpiessl it seems the CI is not happy about the >=0 changes. Can you take a look so we can merge soon?

@MartinSpiessl
Contributor Author

MartinSpiessl commented Nov 4, 2020

Yes, I just modified the Makefile to ignore those checks, as the intention here is to make clear in the code what the boundaries are. Detecting the variable type would be overkill. I will push the branch to upstream as well so that we can get faster CI results.

EDIT: CI results will be here: https://gitlab.com/sosy-lab/software/sv-benchmarks/-/pipelines/211746436

@MartinSpiessl
Contributor Author

@dbeyer CI is green on GitLab for a copy of my fork at branch https://github.com/sosy-lab/sv-benchmarks/tree/digbench-scaling-overflows:
https://gitlab.com/sosy-lab/software/sv-benchmarks/-/pipelines/211746436

PR is approved by @hernanponcedeleon, thanks for the nice feedback!

@dbeyer I think that means it can be merged now.

@MartinSpiessl
Contributor Author

@dbeyer Travis CI now reached the same conclusion 😉

@holznerst
Contributor

holznerst commented Nov 5, 2020

@MartinSpiessl there are now some merge conflicts that need to be resolved first.

@kfriedberger
Member

@MartinSpiessl any updates?

MartinSpiessl added a commit to MartinSpiessl/sv-benchmarks that referenced this pull request Nov 10, 2020
@MartinSpiessl
Contributor Author

This PR is continued at PR #1220, as restarting is easier than fixing this PR (plus the commit history becomes cleaner).

dbeyer added a commit that referenced this pull request Nov 11, 2020
Update nla-digbench-scaling (continuation of PR #1200)

Successfully merging this pull request may close these issues.

Broken generation script