This repository has been archived by the owner on Oct 3, 2021. It is now read-only.

Update nla-digbench-scaling, account for recently found overflows in nla-digbench #1200

Closed

Conversation

MartinSpiessl
Contributor

This fixes #1198

Overflows were discovered in nla-digbench and fixed via #1155. This also needs to be propagated to the bounded versions in nla-digbench-scaling. For now it is enough to just disregard the tasks that have overflows (i.e., where the verdict for no-overflow is false) and regenerate the tasks, which is what this PR does.

This PR can be reproduced as usual by running git rm *bound*{c,i,yml} && python3 generate.py in the nla-digbench-scaling folder, so the changes to that script reflect what this PR will change.

@hernanponcedeleon
Contributor

The script has some file names hardcoded to apply some "per-file" solution. Don't you need to also update those with the correct new names?

The hardcoded fixes for which tasks are fixed by which bound were
broken because the file names changed. Now we have a more intelligent
regex matching to decide which fix to apply for which task name.
@MartinSpiessl
Contributor Author

The script has some file names hardcoded to apply some "per-file" solution. Don't you need to also update those with the correct new names?

Good point, I completely forgot about those verdict fixes! I improved the file name matching with regexes via 7eab343. The regexes make sure that hard.c and hard2.c are recognized as different tasks by matching hard.c via ^hard[^0-9] and hard2.c via ^hard2[^0-9], i.e., we demand that a non-digit character follows the specified prefix.

This would break if someone added a task with a name like hardening.c, so I updated the regexes to also exclude alphabetic characters, e.g. ^hard[^0-9] -> ^hard[^a-zA-Z0-9]. It seems unlikely that someone would add variants of a task named harda.c and hardb.c without a dash or underscore in between.
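The matching scheme can be sketched as follows (a hypothetical prefix_pattern helper; the actual generate.py may structure this differently):

```python
import re

def prefix_pattern(name):
    """Hypothetical helper mirroring the scheme above: the task-name
    prefix must be followed by a non-alphanumeric character, so
    'hard' matches neither 'hard2*' nor 'hardening*'."""
    return "^" + re.escape(name) + "[^a-zA-Z0-9]"

assert re.search(prefix_pattern("hard"), "hard-ll.c")
assert re.search(prefix_pattern("hard2"), "hard2_unwindbound1.c")
assert not re.search(prefix_pattern("hard"), "hard2_unwindbound1.c")
assert not re.search(prefix_pattern("hard"), "hardening.c")
assert not re.search(prefix_pattern("hard2"), "hard-ll.c")
```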

@MartinSpiessl
Contributor Author

I will merge #1204 into this PR and run generate.py again.

@hernanponcedeleon
Contributor

Don't we need to add the following assert not re.search(p("hard2"), "hard-ll.c")?

Also, I guess there is something missing to handle *-u.c files, right?

hernanponcedeleon and others added 3 commits October 30, 2020 13:54
For some tasks the overflows were fixed while the reachability
property is broken. These need to be treated differently
in nla-digbench-scaling.
@MartinSpiessl
Contributor Author

Don't we need to add the following assert not re.search(p("hard2"), "hard-ll.c")?

Sure, can't hurt!

Also, I guess there is something missing to handle *-u.c files, right?

You are right, I didn't expect that also versions of the tasks were added where the overflows are fixed but the reachability property is violated.

It makes sense of course to add these task versions. For these, the current verdict guesses are off anyway. I would have to run some experiments to check how the verdicts are affected by the bounds (but not doing these experiments would mean another PR later when the prerun results arrive).

Actually this would just be geo1-u and hard-u, right? For geo1 we do not have special handling yet, so this just affects hard-u.
Commit 7e8be55 should fix this, but I still think the reachability verdicts for the bounded versions of geo1-u might actually be true in some cases where the bound is so low that the unsigned overflow cannot occur. The bound on the values might have a similar effect. I still need to check this.

@MartinSpiessl
Contributor Author

@PhilippWendler I saw you mention in #1202 (review) that we could already add license headers to files. The files in nla-digbench-scaling are generated from files in nla-digbench, so once the license headers are there I could add it to the files in nla-digbench-scaling by just running the generation script again.

That leaves the README.md and the generation script generate.py here. Those have been written by me, so I can just add the usual Apache 2.0 license header (wrapped in <!-- --> for the README.md):

# This file is part of the SV-Benchmarks collection of verification tasks:
# https://github.com/sosy-lab/sv-benchmarks
#
# SPDX-FileCopyrightText: 2011-2020 The SV-Benchmarks Community
#
# SPDX-License-Identifier: Apache-2.0

or do you want me to add something different?

@PhilippWendler
Member

Sure, that would be great! The more reuse headers we get, the better. If you finish a directory (except for the task-definition files, where we don't need headers), we can even remove that directory from .reuse/dep5.

What you describe seems good.

@hernanponcedeleon
Contributor

Actually this would just be geo1-u and hard-u, right?

There is also dijkstra-u.c. But it is different from the others in the sense that the expected result is true.

For geo1 we do not have special handling yet, so this just affects hard-u.
Commit 7e8be55 should fix this, but I still think the reachability verdicts for the bounded versions of geo1-u might actually be true in some cases where the bound is so low that the unsigned overflow cannot occur. The bound on the values might have a similar effect. I still need to check this.

Without taking a deep look at any of the benchmarks, I think constraining either the loops or the nondet values can prevent the bug. This would be benchmark-dependent (short loops with big increases in variables might still trigger the false result). Thus, it might be that the expected result in the hard-u scaled versions is also true.

As you said, this would require more in-depth checking. For this year, I would just go for the solution that applies only to the benchmarks where unreach-call = true, since we are 100% sure the script generates sound benchmarks.
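The effect of bounding the nondet values can be illustrated with a hypothetical overflow check (not taken from the benchmarks; 64-bit unsigned C semantics simulated explicitly):

```python
def square_overflows_u64(z):
    """Hypothetical illustration: would z*z wrap around in 64-bit
    unsigned C arithmetic?"""
    return z * z > 2**64 - 1

# With values bounded (as in the valueboundN variants) the overflow
# is unreachable, so the expected no-overflow verdict flips to true:
assert not any(square_overflows_u64(z) for z in range(101))
# Without the bound, large values still overflow:
assert square_overflows_u64(2**33)
```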

@MartinSpiessl
Contributor Author

I ran our k-induction with a 200 seconds time limit on the tasks and found some verdicts that are currently wrong:
table.zip (contains the HTML results table, since GitHub doesn't allow uploading other formats 😑)

Labeled with unreach-call:false while it is actually true:

geo1-u_valuebound1.yml
hard-u_valuebound1.yml
hard-u_valuebound10.yml
hard-u_valuebound100.yml
hard-u_valuebound2.yml
hard-u_valuebound20.yml
hard-u_valuebound5.yml
hard-u_valuebound50.yml

Labeled with unreach-call:true while it is actually false:

dijkstra-u_unwindbound5.yml
egcd-ll_unwindbound1.yml
hard-ll_unwindbound1.yml
hard-ll_unwindbound2.yml

I will see to correcting the verdicts.

@hernanponcedeleon
Contributor

geo1-u and hard-u: you set the expected result to false. However, the overflow occurs when the non-deterministic values are large. Thus, in all instances where you restrict the values, the expected result should be true.

hard-ll_unwindbound1.yml and hard-ll_unwindbound2.yml: in the original benchmark, there are assertions at the end specifying that the final computation is correct. By finishing earlier, those assertions can be violated. I guess that for 100 iterations, there are still values for which the computation would not finish and thus the assertions can be violated. The expected result of all hard-ll_unwindboundX.yml has to be set to false.

I think the above shows that we need to do a per-benchmark evaluation for the cases where we bound the loops.

@MartinSpiessl
Contributor Author

geo1-u and hard-u: you set the expected result to false

au contraire, I wrote that they are currently set to false, and I will set them to true.

The expected result of all hard-ll_unwindboundX.yml has to be set to false

I agree, and my analyses also showed that hard-ll_unwindbound{1,2}.yml are unsafe (higher values probably timed out). I presented the intermediate findings only to show places where we need to look more closely; I was never stating that I would just change the verdicts of those tasks that were identified by the k-induction analysis.

I think the above shows that we need to do a per-benchmark evaluation for the cases where we bound the loops.

Yes, there are not that many benchmarks, so this is doable. We could additionally fuzz-test them to check our intuition. In practice, we could also just encode the verdicts to the best of our knowledge and count on the verifiers to find any mistakes in the labeling. This is what we usually do with new tasks, where we generally don't know whether the verdicts are correct.

@hernanponcedeleon
Contributor

au contraire, I wrote that they are currently set to false, and I will set them to true.

Yes, I meant in the current state of the files, not your previous comment.
Just a remark, this is the case for all geo1-u_valueboundX.yml and not just geo1-u_valuebound1.yml.

I agree, and my analyses also showed that hard-ll_unwindbound{1,2}.yml are unsafe (higher values probably timed out). I presented the intermediate findings only to show places where we need to look more closely; I was never stating that I would just change the verdicts of those tasks that were identified by the k-induction analysis.

In the way these benchmarks are generated, any time we find a benchmark *_unwindboundX.yml is false, we can also set to false any benchmark *_unwindboundY.yml with X < Y.
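That monotonicity rule could be applied as a small post-processing step over the expected verdicts (a sketch with hypothetical helper names, not part of generate.py; it assumes that, by the generation scheme, a violation reachable within X unwindings stays reachable for every larger bound):

```python
import re

def propagate_false_unwind_verdicts(verdicts):
    """Given {task_name: expected_verdict} for *_unwindboundX tasks,
    set *_unwindboundY to False for every Y > X once X is False."""
    updated = dict(verdicts)
    for name, verdict in verdicts.items():
        m = re.match(r"(.*_unwindbound)(\d+)$", name)
        if m is None or verdict is not False:
            continue
        prefix, bound = m.group(1), int(m.group(2))
        for other in verdicts:
            m2 = re.match(re.escape(prefix) + r"(\d+)$", other)
            if m2 and int(m2.group(1)) > bound:
                updated[other] = False
    return updated
```

For example, marking hard-ll_unwindbound1 false would automatically mark hard-ll_unwindbound2 and hard-ll_unwindbound5 false as well.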

In practice, we could also just encode the verdicts to the best of our knowledge and count on the verifiers to find any mistakes in the labeling. This is what we usually do with new tasks where we generally don't know whether the verdicts are correct.

Ok

@MartinSpiessl
Contributor Author

geo1-u and hard-u: you set the expected result to false. However, the overflow occurs when the non-deterministic values are large.

geo1-u actually has an underflow that triggers the violation! There were no violations with the value bound anymore because I accidentally removed 0 from the range of possible values. It fails if z is zero because later 1 is subtracted from z, which leads to an underflow and to the assertions being violated. I updated the bounds to include 0 for all tasks, as this was the original intention, and updated all verdicts to the best of my knowledge.
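The wrap-around can be simulated with explicit 32-bit unsigned arithmetic (a sketch; the actual variable types and loop structure in geo1-u may differ):

```python
U32 = 2**32

def dec_u32(z):
    """C-style decrement of an unsigned 32-bit variable:
    z - 1 wraps around when z == 0."""
    return (z - 1) % U32

assert dec_u32(5) == 4
assert dec_u32(0) == U32 - 1  # underflow: 0 - 1 wraps to 4294967295
```

Excluding 0 from the value bound therefore hid exactly the input that triggers the violation.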

I fixed every verdict where CPAchecker found a wrong verdict and also manually checked all programs where CPAchecker timed out even with unwindbound1. Checked boxes mark tasks that will fail with an unwindbound:

- [x] fermat1-ll: has a CEX A=11;R=4;
- [x] fermat2-ll: has a CEX A=11;R=4;
- [ ] geo1-ll: condition after the loop can be rewritten to be the same as the one in each loop iteration => bounding iterations cannot make the task fail
- [ ] geo2-ll: condition after the loop is equal to the one in the loop => bounding iterations cannot make the task fail
- [ ] geo3-ll: condition after the loop is equal to the one in the loop => bounding iterations cannot make the task fail
- [ ] knuth: no postconditions after the loop that need to be satisfied => bounding iterations cannot make the task fail
- [x] lcm1: post condition clearly different from the invariant in the loop => should fail when bounded; CEX: a=3;b=1;
- [ ] lcm2: post condition same as the condition in the loop => bounding iterations cannot make the task fail
- [x] prod4br-ll: post condition clearly different from the invariant in the loop => should fail when bounded; CEX: x=42;y=7;
- [x] prodbin-ll: post condition clearly different from the invariant in the loop => should fail when bounded; CEX: a=42;b=7;
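The "post condition equals loop condition" argument for the unchecked tasks above can be sketched abstractly (a hypothetical, simplified model, not the benchmark code): if the assertion after the loop is exactly the invariant maintained by every iteration, cutting the loop off early cannot violate it.

```python
def run_bounded(init, step, invariant, bound):
    """Run a loop that maintains `invariant`, cutting it off after
    `bound` iterations the way the unwindboundN variants do, and
    check the post-loop assertion when it equals the invariant."""
    state = init
    for _ in range(bound):
        assert invariant(state)
        state = step(state)
    return invariant(state)  # post-loop assertion == loop invariant

# Geometric-series example (fixed z): x = 1 + z + ... + z^(k-1),
# with invariant x*(z-1) + 1 == z**k preserved by every step.
z = 3
step = lambda s: (s[0] * z + 1, s[1] + 1)
invariant = lambda s: s[0] * (z - 1) + 1 == z ** s[1]

for bound in (1, 2, 5):
    assert run_bounded((1, 1), step, invariant, bound)
```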

@hernanponcedeleon
Contributor

Can you let me know the command to run CPAchecker with k-induction?

@MartinSpiessl
Contributor Author

scripts/cpa.sh -kInduction ~/sv-benchmarks/c/nla-digbench-scaling/geo1-u_valuebound100.c -spec config/properties/unreach-call.prp

The SMT formulas we build sometimes struggle with multiplications, it seems, so changing the theory from bitvectors to linear integer arithmetic sometimes makes sense (though it is potentially unsound). For that you need to add the following three configuration options to the command above:

-setprop solver.solver=SMTInterpol -setprop cpa.predicate.encodeBitvectorAs=INTEGER -setprop cpa.predicate.encodeFloatAs=RATIONAL


@hernanponcedeleon left a comment


The changes related to this PR are ok.
However, I still suspect there is an overflow in some of the base benchmarks (see #1166), but this is independent of this PR, so we can already merge.

Thanks @MartinSpiessl for the work!

@hernanponcedeleon
Contributor

@MartinSpiessl it seems the CI is not happy about the >=0 changes. Can you take a look so we can merge soon?

@MartinSpiessl
Contributor Author

MartinSpiessl commented Nov 4, 2020

Yes, I just modified the Makefile to ignore those checks, as the intention here is to make clear in the code what the boundaries are. Detecting the variable type would be overkill. I will push the branch to upstream as well so that we can get faster CI results.

EDIT: CI results will be here: https://gitlab.com/sosy-lab/software/sv-benchmarks/-/pipelines/211746436

@MartinSpiessl
Contributor Author

@dbeyer CI is green on GitLab for a copy of my fork at branch https://github.com/sosy-lab/sv-benchmarks/tree/digbench-scaling-overflows:
https://gitlab.com/sosy-lab/software/sv-benchmarks/-/pipelines/211746436

PR is approved by @hernanponcedeleon, thanks for the nice feedback!

@dbeyer I think that means it can be merged now.

@MartinSpiessl
Contributor Author

@dbeyer Travis CI now reached the same conclusion 😉

@holznerst
Contributor

holznerst commented Nov 5, 2020

@MartinSpiessl there are now some merge conflicts that need to be resolved first.

@kfriedberger
Member

@MartinSpiessl any updates?

MartinSpiessl added a commit to MartinSpiessl/sv-benchmarks that referenced this pull request Nov 10, 2020
@MartinSpiessl
Contributor Author

This PR is continued at PR #1220, as restarting is easier than fixing this PR (plus the commit history becomes cleaner).

dbeyer added a commit that referenced this pull request Nov 11, 2020
Update nla-digbench-scaling (continuation of PR #1200)

Successfully merging this pull request may close these issues.

Broken generation script