Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/fuzzer: occasional TestFuzz failures #5674

Closed
a-nogikh opened this issue Jan 14, 2025 · 9 comments · Fixed by #5742
Closed

pkg/fuzzer: occasional TestFuzz failures #5674

a-nogikh opened this issue Jan 14, 2025 · 9 comments · Fixed by #5742
Labels

Comments

@a-nogikh
Copy link
Collaborator

Once in a while, we observe failures like this (both on syzbot and on GitHub CI):

2025-01-14T12:36:25.6163042Z 2025/01/14 12:36:21 [FATAL] check failed: execution of simple program fails: call 0 failed with errno 998
@a-nogikh a-nogikh added the bug label Jan 14, 2025
@a-nogikh
Copy link
Collaborator Author

One more case from the GitHub CI: https://github.com/google/syzkaller/actions/runs/13011351196/job/36289590277

I think we should run the test under -debug and display the output on such failures. Otherwise I don't see a reliable way to debug this.

a-nogikh added a commit to a-nogikh/syzkaller that referenced this issue Jan 28, 2025
@a-nogikh
Copy link
Collaborator Author

From #5711 (comment)

--- FAIL: TestFuzz (11.22s)
    testutil.go:35: seed=0
Error: seed=0
    fuzzer_test.go:184: executor output:
Error: executor output:
        connected to manager: procs=4 cover_edges=1 kernel_64_bit=1 slowdown=1 syscall_timeout=50 program_timeout=5000 features=0x10
        received info reply: covfilter=0
        recv exec request 1: type=0 flags=0x3 env=0x1c09 exec=0x10 size=14
        proc 0: handshaking to execute request 1
        proc 0: got handshake reply
        proc 0: start executing request 1
        proc 0: got output: [9ms] exec opts: reqid=1 type=0 procid=0 threaded=1 cover=0 comps=0 dedup=0 signal=0  sandbox=1/0/0/0 timeouts=50/5000/1 kernel_64_bit=1
        proc 0: got output: spawned worker pid 90923
        proc 0: got output: Fatal glibc error: malloc assertion failure in sysmalloc: (old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)
        proc 0: got output: Fatal glibc error: malloc assertion failure in sysmalloc: (old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)
        proc 0: got execute reply
        handle completion: completed=0 output_size=262144
        proc 0: output pipe EOF
        proc 0: restarting subprocess, current state 2 attempts 0
        proc 0: subprocess exit status 6
        SYZFAIL: failed to recv rpc
        fd=3 want=4 recv=0 n=0 (errno 9: Bad file descriptor)
    fuzzer_test.go:185: executor process exited: exit status 67

Now the mystery is

  1. Why do we corrupt the memory -- we don't execute any dangerous syscalls during the test.
  2. Why and where we call malloc() from the executor process.

@a-nogikh
Copy link
Collaborator Author

a-nogikh commented Jan 29, 2025

Okay, we're calling malloc() via pthread.

if (pthread_create(&th, &attr, fn, arg) == 0) {

https://codebrowser.dev/glibc/glibc/nptl/allocatestack.c.html#142
https://codebrowser.dev/glibc/glibc/elf/dl-tls.c.html#_dl_allocate_tls_storage

It doesn't seem that we can

  1. Avoid calling pthread_create() from the innermost process.
  2. Make pthread not do alloc.

Is making the TestFuzz/parent executor process more tolerant to these kinds of issues the way to go here?


We also shouldn't have corrupted the memory. The test executes some very very simple code that does no memory manipulation:

static void fake_crash(const char* name)
{
failmsg("crash", "{{CRASH: %s}}", name);
doexit(1);
}
static long syz_test_fuzzer1(volatile long a, volatile long b, volatile long c)
{
// We probably want something more interesting here.
if (a == 1 && b == 1 && c == 1)
fake_crash("first bug");
if (a == 1 && b == 2 && c == 3)
fake_crash("second bug");
return 0;
}

github-merge-queue bot pushed a commit that referenced this issue Jan 30, 2025
It should hopefully let us debug #5674.
@ramosian-glider
Copy link
Member

I tried building the test with ASan:

diff --git a/pkg/fuzzer/fuzzer_test.go b/pkg/fuzzer/fuzzer_test.go
index 6049ac615..40d3abaab 100644
--- a/pkg/fuzzer/fuzzer_test.go
+++ b/pkg/fuzzer/fuzzer_test.go
@@ -40,7 +40,7 @@ func TestFuzz(t *testing.T) {
        if sysTarget.BrokenCompiler != "" {
                t.Skipf("skipping, broken cross-compiler: %v", sysTarget.BrokenCompiler)
        }
-       executor := csource.BuildExecutor(t, target, "../..", "-fsanitize-coverage=trace-pc", "-g")
+       executor := csource.BuildExecutor(t, target, "../..", "-fsanitize-coverage=trace-pc", "-g", "-fsanitize=address")
 
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

, but to no success.

@a-nogikh
Copy link
Collaborator Author

a-nogikh commented Feb 3, 2025

A simplified (and much faster) test to trigger the problem:

func TestMachineCheck(t *testing.T) {
	target, err := prog.GetTarget(targets.TestOS, targets.TestArch64Fuzz)
	if err != nil {
		t.Fatal(err)
	}
	sysTarget := targets.Get(target.OS, target.Arch)
	if sysTarget.BrokenCompiler != "" {
		t.Skipf("skipping, broken cross-compiler: %v", sysTarget.BrokenCompiler)
	}
	executor := csource.BuildExecutor(t, target, "../..")
	for i := 0; i < 200; i++ {
		t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
			t.Parallel()
			ctx, cancel := context.WithCancel(context.Background())
			defer cancel()

			var output bytes.Buffer
			cfg := &rpcserver.LocalConfig{
				Config: rpcserver.Config{
					Config: vminfo.Config{
						Debug:    true,
						Target:   target,
						Features: flatrpc.FeatureSandboxNone,
						Sandbox:  flatrpc.ExecEnvSandboxNone,
					},
					Procs:    4,
					Slowdown: 1,
				},
				Executor:     executor,
				Dir:          t.TempDir(),
				OutputWriter: &output,
			}
			cfg.MachineChecked = func(features flatrpc.Feature, syscalls map[*prog.Syscall]bool) queue.Source {
				cancel()
				return queue.Callback(func() *queue.Request {
					return nil
				})
			}
			err = rpcserver.RunLocal(ctx, cfg)
			if err != nil {
				t.Logf("executor output:\n%s", output.String())
				t.Fatal(err)
			}
		})
	}
}

On my side, it also doesn't reproduce if I add "-fsanitize=address".

@a-nogikh
Copy link
Collaborator Author

a-nogikh commented Feb 3, 2025

diff --git a/executor/executor_test.h b/executor/executor_test.h
index c94de09a2..4f60fbcbf 100644
--- a/executor/executor_test.h
+++ b/executor/executor_test.h
@@ -22,7 +22,7 @@ static void os_init(int argc, char** argv, void* data, size_t data_size)
        if (getppid() == 1)
                exitf("the parent process was killed");
 #endif
-       void* got = mmap(data, data_size, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+       void* got = mmap(data, data_size, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE | MAP_FIXED_NOREPLACE, -1, 0);
        if (data != got)
                failmsg("mmap of data segment failed", "want %p, got %p", data, got);
        is_kernel_64_bit = sizeof(unsigned long) == 8;

The code is invoked here:

os_init(argc, argv, (char*)SYZ_DATA_OFFSET, SYZ_NUM_PAGES * SYZ_PAGE_SIZE);

The change above sometimes leads to a different error:

=== NAME  TestMachineCheck/93
    fuzzer_test.go:299: executor output:
        connected to manager: procs=4 cover_edges=1 kernel_64_bit=1 slowdown=1 syscall_timeout=50 program_timeout=5000 features=0x10
        received info reply: covfilter=0
        recv exec request 1: type=0 flags=0x3 env=0x1c09 exec=0x10 size=14
        proc 0: handshaking to execute request 1
        request pipe write failed (errno=32)
        proc 0: restarting subprocess, current state 1 attempts 0
        proc 0: subprocess exit status 67
        proc 0: got output: NOTFAIL: mmap of data segment failed
        want 0x20000000, got 0xffffffffffffffff (errno 17: File exists)
        proc 0: output pipe EOF
        handle completion: completed=0 output_size=14680064
        SYZFAIL: failed to recv rpc
        fd=3 want=4 recv=0 n=0 (errno 9: Bad file descriptor)
    fuzzer_test.go:300: execution of simple program fails: NOTFAIL: mmap of data segment failed. want 0x20000000, got 0xffffffffffffffff (errno 17: File exists). . process exited with status 67.

So SYZ_DATA_OFFSET sometimes clashes with the address of the pages mmapped by the malloc? But os_init() is called basically at the beginning of main() (it's after all "match argv and return" conditions), at what moment those pages may have already become touched?

Cc @dvyukov

@a-nogikh
Copy link
Collaborator Author

a-nogikh commented Feb 4, 2025

After a debugging session with @ramosian-glider:

For binaries compiled with -static, malloc indeed often brks the memory range around the 0x20000000 address specified in SYZ_DATA_OFFSET.

a-nogikh added a commit to a-nogikh/syzkaller that referenced this issue Feb 4, 2025
MAP_FIXED_NOREPLACE allows to fail early if we happened to overlap with
an existing memory mapping. It should help detects bugs google#5674 at an
earlier stage, before it led to memory corruptions.

MAP_FIXED_NOREPLACE is supported from Linux 4.17, which is okay for all
syzkaller use cases on syzbot.

There's no such option for some of the supported OSes, so set it
depending on the configuration we're building for.
@a-nogikh
Copy link
Collaborator Author

a-nogikh commented Feb 4, 2025

Alternatives:

  • Get rid of SYZ_DATA_OFFSET and allocate with mmap(0, ..).
  • Set SYZ_DATA_OFFSET to a much higher value. 0x20000000 (512 << 20) -> 0x800000000 (512 << 26)?

@dvyukov
Copy link
Collaborator

dvyukov commented Feb 4, 2025

Keeping SYZ_DATA_OFFSET looks better for reproducibility. We could also compile with -static-pie on linux.

Surprised pthread calls malloc for -static binaries.

a-nogikh added a commit to a-nogikh/syzkaller that referenced this issue Feb 4, 2025
The current default value sometimes intersects with the addresses used
by malloc, which causes executor memory corruptions.

Closes google#5674.
github-merge-queue bot pushed a commit that referenced this issue Feb 4, 2025
MAP_FIXED_NOREPLACE allows to fail early if we happened to overlap with
an existing memory mapping. It should help detects bugs #5674 at an
earlier stage, before it led to memory corruptions.

MAP_FIXED_NOREPLACE is supported from Linux 4.17, which is okay for all
syzkaller use cases on syzbot.

There's no such option for some of the supported OSes, so set it
depending on the configuration we're building for.
a-nogikh added a commit to a-nogikh/syzkaller that referenced this issue Feb 4, 2025
The current default value sometimes intersects with the addresses used
by malloc, which causes executor memory corruptions.

Closes google#5674.
github-merge-queue bot pushed a commit that referenced this issue Feb 4, 2025
The current default value sometimes intersects with the addresses used
by malloc, which causes executor memory corruptions.

Closes #5674.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants