Update nla-digbench-scaling, account for recently found overflows in nla-digbench #1200
Conversation
The script has some file names hardcoded to apply some "per-file" solution. Don't you need to also update those with the correct new names?
The hardcoded fixes for which tasks are fixed by which bound were broken because the file names changed. Now we use more intelligent regex matching to decide which fix to apply to which task name.
Good point, I completely forgot about those verdict fixes! I improved the file name matching with regexes now via 7eab343. The regexes make sure that only the intended task names match. This would break if someone added a task with a name that also fits one of the patterns, though.
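To illustrate the idea, here is a minimal sketch of such regex-based matching. The patterns, verdict overrides, and helper names below are made up for illustration and are not the actual contents of `generate.py`:

```python
import re

# Hypothetical table: a regex on the base task name decides which expected
# verdicts are overridden for all scaled variants generated from that task.
# Anchored patterns (^...$) avoid accidental matches if a similarly named
# task is added later.
VERDICT_FIXES = {
    r"^hard2$": {"no-overflow": False},
    r"^egcd3(-ll)?$": {"unreach-call": True},
}

def fixes_for(base_name: str) -> dict:
    """Collect the verdict overrides whose pattern matches the base task name."""
    overrides = {}
    for pattern, fix in VERDICT_FIXES.items():
        if re.match(pattern, base_name):
            overrides.update(fix)
    return overrides

# A scaled task name like "hard2_unwindbound10" (illustrative) is first
# reduced to its base name before the lookup.
base = re.sub(r"_(unwind|value)bound\d+$", "", "hard2_unwindbound10")
print(fixes_for(base))  # -> {'no-overflow': False}
```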
I will merge #1204 into this PR and rerun the generation.
Don't we need to add the following tasks as well? Also, I guess there is something missing to handle the tasks where the overflows were fixed but the reachability property is violated.
For some tasks the overflows were fixed while the reachability property is broken. These need to be treated differently in nla-digbench-scaling.
Sure, can't hurt!
You are right, I didn't expect that versions of the tasks were also added where the overflows are fixed but the reachability property is violated. It makes sense of course to add these task versions. For their verdicts the current guesses are off anyway. I would have to run some experiments to check how the verdicts are affected by the bounds (but not doing these experiments would mean another PR later when the prerun results arrive). Actually, this would just be ...
@PhilippWendler I saw you mention in #1202 (review) that we could already add license headers to files. The files in nla-digbench-scaling would then get the following header (the task-definition files would be left as they are):

```
# This file is part of the SV-Benchmarks collection of verification tasks:
# https://github.com/sosy-lab/sv-benchmarks
#
# SPDX-FileCopyrightText: 2011-2020 The SV-Benchmarks Community
#
# SPDX-License-Identifier: Apache-2.0
```

or do you want me to add something different?
Sure, that would be great! The more reuse headers we get, the better. If you finish a directory (except for the task-definition files, where we don't need headers), we can even remove that directory from the list of directories that are still missing headers. What you describe seems good.
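For the generated files, something along these lines could write the header automatically. This is only a sketch under the assumption that the header is prepended by the generation script; the helper name and the suffix handling are my own, not part of this PR:

```python
from pathlib import Path

# Header text from the comment above; the comment marker depends on the file
# type (hash for scripts, C-style for the generated .c/.i files).
HEADER_LINES = [
    "This file is part of the SV-Benchmarks collection of verification tasks:",
    "https://github.com/sosy-lab/sv-benchmarks",
    "",
    "SPDX-FileCopyrightText: 2011-2020 The SV-Benchmarks Community",
    "",
    "SPDX-License-Identifier: Apache-2.0",
]

def prepend_header(path: Path) -> None:
    """Prepend the REUSE header to a file, choosing the comment style by suffix."""
    marker = "//" if path.suffix in {".c", ".i", ".h"} else "#"
    header = "\n".join(f"{marker} {line}".rstrip() for line in HEADER_LINES)
    text = path.read_text()
    if "SPDX-License-Identifier" not in text:  # don't add the header twice
        path.write_text(header + "\n\n" + text)
```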
There is also the question of whether bounding the loops or the nondet values is enough to prevent the overflows.
Without taking a deep look at any of the benchmarks, I think constraining either the loops or the nondet values can prevent the bug. This would be benchmark dependent (short loops with big increases in the variables might still trigger the overflow). As you said, this would require a more in-depth check. For this year, I would just go for the solution that applies only to the benchmarks where the no-overflow verdict is false.
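To make the two options concrete, here is a rough sketch of the two kinds of restriction as source transformations. This is not the actual `generate.py` logic; the regex rewriting and variable names are purely illustrative:

```python
import re

def bound_loops(source: str, bound: int) -> str:
    """Illustration of the 'bound the loops' option: force every while loop to
    exit after at most `bound` iterations via an extra counter. A short loop
    with big increases in the variables can still overflow.
    (The simplistic regex does not handle nested parentheses.)"""
    rewritten = re.sub(
        r"while\s*\((.*?)\)",
        rf"while ((\1) && __iter_counter++ < {bound})",
        source,
    )
    return f"unsigned long __iter_counter = 0;\n{rewritten}"

def bound_values(source: str, bound: int) -> str:
    """Illustration of the 'bound the nondet values' option: restrict the
    nondeterministic inputs to a small range instead of bounding the loops
    (a real task would more likely restrict the value via an assumption)."""
    return source.replace(
        "__VERIFIER_nondet_int()",
        f"(__VERIFIER_nondet_int() % {bound})",
    )
```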
I ran our k-induction with a 200 seconds time limit on the tasks and found some verdicts that are currently wrong, in particular some tasks that are labeled with `false` but should be `true`. I will see to correcting the verdicts.
I think the above shows that we need to do a per-benchmark evaluation for the cases where we bound the loops.
au contraire, I wrote that they are currently set to false, and I will set them to true.
I agree, and my analyses also showed that these should be `true`.
Yes, there are not that many benchmarks, so this is doable. We could additionally fuzz-test them to check our intuition. In practice, we could also just encode the verdicts to the best of our knowledge and count on the verifiers to find any mistakes in the labeling. This is what we usually do with new tasks, where we generally don't know whether the verdicts are correct.
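A quick-and-dirty way to do such fuzz-testing could look like the sketch below. The stub, the compiler flags, and the assumption that a non-zero exit code indicates a violated property are all illustrative and not part of this PR; which intrinsics actually need stubbing depends on the individual task:

```python
import subprocess
import tempfile
from pathlib import Path

# Replace the SV-COMP nondet intrinsic by seeded random values so a task can
# simply be executed natively.
STUB = """
#include <stdlib.h>
static int seeded;
int __VERIFIER_nondet_int(void) {
  if (!seeded) { const char *s = getenv("SEED"); srand(s ? atoi(s) : 1); seeded = 1; }
  return rand();  /* only non-negative values, good enough for a sanity check */
}
"""

def fuzz(task: Path, runs: int = 100) -> None:
    """Compile `task` together with the stub and UBSan, then run it with
    different seeds; a non-zero exit code hints at a reachable error call
    (abort) or a signed integer overflow."""
    with tempfile.TemporaryDirectory() as tmp:
        stub, exe = Path(tmp, "stub.c"), Path(tmp, "a.out")
        stub.write_text(STUB)
        subprocess.run(
            ["gcc", "-fsanitize=signed-integer-overflow", "-fno-sanitize-recover",
             str(task), str(stub), "-o", str(exe)],
            check=True,
        )
        for seed in range(runs):
            try:
                result = subprocess.run([str(exe)], env={"SEED": str(seed)}, timeout=10)
            except subprocess.TimeoutExpired:
                continue  # treat very long runs as inconclusive
            if result.returncode != 0:
                print(f"{task.name}: suspicious run with seed {seed}")
                return
        print(f"{task.name}: no violation observed in {runs} runs")
```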
Yes, I meant in the current state of the files, not your previous comment.
Given the way these benchmarks are generated, any time we find a benchmark ...
Ok
For ... I fixed every verdict where CPAchecker found a wrong verdict, and I also manually checked all programs where CPAchecker timed out even with ...
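For the verdict corrections themselves, a small script along these lines could adjust the task-definition files. This is only a sketch: the task name is invented, PyYAML is assumed to be available, and rewriting the YAML this way drops comments and formatting, so manual edits may be preferable:

```python
from pathlib import Path

import yaml  # assumes PyYAML is installed

# Hypothetical list of tasks whose unreach-call verdict turned out to be wrong;
# the real list would come from the CPAchecker results discussed above.
CORRECTIONS = {
    "hard2_unwindbound10.yml": ("unreach-call", True),
}

def fix_verdict(yml: Path, property_name: str, new_verdict: bool) -> None:
    """Set expected_verdict of the given property in one task-definition file."""
    data = yaml.safe_load(yml.read_text())
    for prop in data.get("properties", []):
        if property_name in prop.get("property_file", ""):
            prop["expected_verdict"] = new_verdict
    yml.write_text(yaml.safe_dump(data, sort_keys=False))

for name, (prop, verdict) in CORRECTIONS.items():
    fix_verdict(Path(name), prop, verdict)
```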
Can you let me know what the command is to run CPAchecker with k-induction?
The SMT formulas we build sometimes struggle with multiplications, it seems, so changing the theory from bitvectors to linear integer arithmetic sometimes makes sense (though it is potentially unsound). For that you need to add three additional configuration options to the command above.
The changes related to this PR are ok.
However, I still suspect there is an overflow in some of the base benchmarks (see #1166), but this is independent of this PR, so we can already merge.
Thanks @MartinSpiessl for the work!
@MartinSpiessl it seems the CI is not happy about the ...
Yes, I just modified the ...

EDIT: CI results will be here: https://gitlab.com/sosy-lab/software/sv-benchmarks/-/pipelines/211746436
@dbeyer CI is green at GitLab for a copy of my fork at the branch https://github.com/sosy-lab/sv-benchmarks/tree/digbench-scaling-overflows. The PR is approved by @hernanponcedeleon, thanks for the nice feedback! @dbeyer I think that means it can be merged now.
@dbeyer Travis CI now reached the same conclusion 😉
@MartinSpiessl there are now some merge conflicts that need to be resolved first.
@MartinSpiessl any updates?
This PR is continued at PR #1220, as restarting is easier than fixing this PR (plus the commit history becomes cleaner).
Update nla-digbench-scaling (continuation of PR #1200)
This fixes #1198
Overflows were discovered in `nla-digbench` and fixed via #1155. This also needs to be propagated to the bounded versions in `nla-digbench-scaling`. For now it is enough to just disregard the tasks that have overflows (i.e., the verdict for `no-overflow` is `false`) and regenerate the tasks, which is what this PR does.

This PR can be reproduced as usual by running `git rm *bound*{c,i,yml} && python3 generate.py` in the `nla-digbench-scaling` folder, so the changes to that script reflect what this PR will change.