Set policy for LLM-generated tests #202
Given their tendency to create output that looks plausible but isn't actually correct, I'd lean towards "don't allow", or at least increasing review requirements on them. We've been relatively permissive in accepting tests based on the assumption that a human has already reasoned their way through the spec while writing; if that's no longer a valid assumption, we should take that into account.
Yes, I'm overall of the "disallow officially" stance. If tests are LLM-generated but good/correct enough that we can't tell, then fine, but in my experience people who use LLMs like this are not particularly concerned about quality in the first place.
We also need to consider cases where a test is partially generated but human-modified (e.g. with Copilot-like tools). The practical arguments I've heard in favour of allowing LLM-[partially-]generated tests are basically:
Speaking personally, I'm definitely worried about the potential for lower-quality tests being written using these technologies, but as a practical matter I don't know how much we can enforce, so I'd only want to ban them if there's some strong non-technical reason (e.g. vendors who would be unable to use wpt). However, I'd be interested in ideas for marking such tests (assuming good-faith contributors) so that we could verify whether they cause more problems in the longer term compared to purely human-written tests.
Maybe we need to clarify the licensing issues associated with LLM-generated code? One of the reasons LLMs are banned at some corporations is that the license of LLM-generated code can be unclear, and thus not safe to use. What is the current policy regarding the license for WPTs contributed by a human?
I think broadly it's the submitter's responsibility to ensure that they're able to contribute it under the terms of the 3-Clause BSD License.
I have thought about using an LLM to generate a testing plan (giving it https://web-platform-tests.org/writing-tests/making-a-testing-plan.html plus the relevant part of the spec to test), and just tested it now with ChatGPT 4o for https://quirks.spec.whatwg.org/#the-html-element-fills-the-viewport-quirk and https://quirks.spec.whatwg.org/#the-body-element-fills-the-html-element-quirk

After clarifying that I wanted a list of 100 tests with short descriptions, it came up with various reasonable cases to test. I then asked it to write the first test, and it made a few mistakes (it included a standards mode doctype, unnecessary margin calculations, and unnecessary …

As a reviewer I would probably like to know when LLMs were used and whether the contributor put in some effort to clean up and verify correctness.

Prompts and output:
```html
<!DOCTYPE html>
<html>
<head>
  <title>Quirks Mode Test: HTML Width Auto in Vertical Writing Mode</title>
  <script src="/resources/testharness.js"></script>
  <script src="/resources/testharnessreport.js"></script>
  <style>
    html {
      width: auto;
      writing-mode: vertical-rl;
      margin: 0;
      padding: 0;
      border: 0;
    }
    body {
      margin: 0;
      padding: 0;
      border: 0;
    }
  </style>
</head>
<body>
  <script>
    // Define the test
    test(function() {
      // Retrieve the HTML element
      var htmlElement = document.documentElement;
      // Get the computed style of the HTML element
      var computedStyle = getComputedStyle(htmlElement);
      // Calculate the expected border box size
      var viewportWidth = window.innerWidth;
      var marginLeft = parseInt(computedStyle.marginLeft);
      var marginRight = parseInt(computedStyle.marginRight);
      var expectedBorderBoxWidth = Math.max(viewportWidth - marginLeft - marginRight, htmlElement.clientWidth);
      // Assert that the border box width matches the expected value
      assert_equals(htmlElement.offsetWidth, expectedBorderBoxWidth, "HTML element's border box width should be correctly calculated.");
    }, "Test HTML element with width: auto in vertical writing mode");
    // Report the results
    setup({explicit_done: true});
    add_completion_callback(function() {
      done();
    });
  </script>
</body>
</html>
```
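For contrast, here is a rough sketch of what a cleaned-up, hand-written version of the same test might look like after fixing the mistakes noted above: no doctype (so the page actually parses in quirks mode), no redundant margin arithmetic, and no `explicit_done`/`add_completion_callback` boilerplate (synchronous `test()` calls don't need it). This is only an illustration for discussion, not a reviewed wpt test; the exact assertion is an assumption about the intended behaviour:

```html
<!-- No doctype on purpose: the quirk under test only applies in quirks mode. -->
<title>html element fills the viewport quirk: width: auto in vertical-rl</title>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<style>
  html { width: auto; writing-mode: vertical-rl; margin: 0; }
</style>
<script>
test(() => {
  // Assumed expectation: in quirks mode the html element's border box
  // fills the viewport in the block dimension, which in vertical-rl
  // maps to the width.
  assert_equals(document.documentElement.offsetWidth, window.innerWidth,
                "html border box width should equal the viewport width");
}, "html element with width: auto fills the viewport in vertical-rl");
</script>
```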
CC @web-platform-tests/wpt-core-team
I was recently asked about the policy for using LLMs to generate tests that are submitted to wpt. Currently we don't have any explicit policy on this, but I think we need one, so it's clear to test authors what's permissible, and to downstream consumers whether wpt is in line with any other policies they might have with regard to LLMs.
To be clear, I expect that any policy here would not affect review requirements i.e. we'd still require human review for all tests to ensure they're correct and easy to follow.