pkg/fuzzer: occasional TestFuzz failures #5674

a-nogikh · 2025-01-14T12:43:59Z

Once in a while, we observe failures like this (both on syzbot and on GitHub CI):

2025-01-14T12:36:25.6163042Z 2025/01/14 12:36:21 [FATAL] check failed: execution of simple program fails: call 0 failed with errno 998

The text was updated successfully, but these errors were encountered:

a-nogikh · 2025-01-28T16:20:46Z

One more case from the GitHub CI: https://github.com/google/syzkaller/actions/runs/13011351196/job/36289590277

I think we should run the test under -debug and display the output on such failures. Otherwise I don't see a reliable way to debug this.

It should hopefully let us debug google#5674.

a-nogikh · 2025-01-29T17:02:28Z

From #5711 (comment)

--- FAIL: TestFuzz (11.22s)
    testutil.go:35: seed=0
Error: seed=0
    fuzzer_test.go:184: executor output:
Error: executor output:
        connected to manager: procs=4 cover_edges=1 kernel_64_bit=1 slowdown=1 syscall_timeout=50 program_timeout=5000 features=0x10
        received info reply: covfilter=0
        recv exec request 1: type=0 flags=0x3 env=0x1c09 exec=0x10 size=14
        proc 0: handshaking to execute request 1
        proc 0: got handshake reply
        proc 0: start executing request 1
        proc 0: got output: [9ms] exec opts: reqid=1 type=0 procid=0 threaded=1 cover=0 comps=0 dedup=0 signal=0  sandbox=1/0/0/0 timeouts=50/5000/1 kernel_64_bit=1
        proc 0: got output: spawned worker pid 90923
        proc 0: got output: Fatal glibc error: malloc assertion failure in sysmalloc: (old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)
        proc 0: got output: Fatal glibc error: malloc assertion failure in sysmalloc: (old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)
        proc 0: got execute reply
        handle completion: completed=0 output_size=262144
        proc 0: output pipe EOF
        proc 0: restarting subprocess, current state 2 attempts 0
        proc 0: subprocess exit status 6
        SYZFAIL: failed to recv rpc
        fd=3 want=4 recv=0 n=0 (errno 9: Bad file descriptor)
    fuzzer_test.go:185: executor process exited: exit status 67

Now the mystery is

Why do we corrupt the memory -- we don't execute any dangerous syscalls during the test.
Why and where we call malloc() from the executor process.

a-nogikh · 2025-01-29T17:35:00Z

Okay, we're calling malloc() via pthread.

syzkaller/executor/common.h

Line 352 in afe4eff

if (pthread_create(&th, &attr, fn, arg) == 0) {

https://codebrowser.dev/glibc/glibc/nptl/allocatestack.c.html#142
https://codebrowser.dev/glibc/glibc/elf/dl-tls.c.html#_dl_allocate_tls_storage

It doesn't seem that we can

Avoid calling pthread_create() from the innermost process.
Make pthread not do alloc.

Is making the TestFuzz/parent executor process more tolerant to these kinds of issues the way to go here?

We also shouldn't have corrupted the memory. The test executes some very very simple code that does no memory manipulation:

syzkaller/executor/common_test.h

Lines 154 to 168 in afe4eff

    
           static void fake_crash(const char* name) 
        
           { 
        
           	failmsg("crash", "{{CRASH: %s}}", name); 
        
           	doexit(1); 
        
           } 
        
           static long syz_test_fuzzer1(volatile long a, volatile long b, volatile long c) 
        
           { 
        
           	// We probably want something more interesting here. 
        
           	if (a == 1 && b == 1 && c == 1) 
        
           		fake_crash("first bug"); 
        
           	if (a == 1 && b == 2 && c == 3) 
        
           		fake_crash("second bug"); 
        
           	return 0; 
        
           }

It should hopefully let us debug #5674.

ramosian-glider · 2025-02-03T17:20:24Z

I tried building the test with ASan:

diff --git a/pkg/fuzzer/fuzzer_test.go b/pkg/fuzzer/fuzzer_test.go
index 6049ac615..40d3abaab 100644
--- a/pkg/fuzzer/fuzzer_test.go
+++ b/pkg/fuzzer/fuzzer_test.go
@@ -40,7 +40,7 @@ func TestFuzz(t *testing.T) {
        if sysTarget.BrokenCompiler != "" {
                t.Skipf("skipping, broken cross-compiler: %v", sysTarget.BrokenCompiler)
        }
-       executor := csource.BuildExecutor(t, target, "../..", "-fsanitize-coverage=trace-pc", "-g")
+       executor := csource.BuildExecutor(t, target, "../..", "-fsanitize-coverage=trace-pc", "-g", "-fsanitize=address")
 
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

, but to no success.

a-nogikh · 2025-02-03T19:02:36Z

A simplified (and much faster) test to trigger the problem:

func TestMachineCheck(t *testing.T) {
	target, err := prog.GetTarget(targets.TestOS, targets.TestArch64Fuzz)
	if err != nil {
		t.Fatal(err)
	}
	sysTarget := targets.Get(target.OS, target.Arch)
	if sysTarget.BrokenCompiler != "" {
		t.Skipf("skipping, broken cross-compiler: %v", sysTarget.BrokenCompiler)
	}
	executor := csource.BuildExecutor(t, target, "../..")
	for i := 0; i < 200; i++ {
		t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
			t.Parallel()
			ctx, cancel := context.WithCancel(context.Background())
			defer cancel()

			var output bytes.Buffer
			cfg := &rpcserver.LocalConfig{
				Config: rpcserver.Config{
					Config: vminfo.Config{
						Debug:    true,
						Target:   target,
						Features: flatrpc.FeatureSandboxNone,
						Sandbox:  flatrpc.ExecEnvSandboxNone,
					},
					Procs:    4,
					Slowdown: 1,
				},
				Executor:     executor,
				Dir:          t.TempDir(),
				OutputWriter: &output,
			}
			cfg.MachineChecked = func(features flatrpc.Feature, syscalls map[*prog.Syscall]bool) queue.Source {
				cancel()
				return queue.Callback(func() *queue.Request {
					return nil
				})
			}
			err = rpcserver.RunLocal(ctx, cfg)
			if err != nil {
				t.Logf("executor output:\n%s", output.String())
				t.Fatal(err)
			}
		})
	}
}

On my side, it also doesn't reproduce if I add "-fsanitize=address".

a-nogikh · 2025-02-03T19:17:30Z

diff --git a/executor/executor_test.h b/executor/executor_test.h
index c94de09a2..4f60fbcbf 100644
--- a/executor/executor_test.h
+++ b/executor/executor_test.h
@@ -22,7 +22,7 @@ static void os_init(int argc, char** argv, void* data, size_t data_size)
        if (getppid() == 1)
                exitf("the parent process was killed");
 #endif
-       void* got = mmap(data, data_size, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
+       void* got = mmap(data, data_size, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE | MAP_FIXED_NOREPLACE, -1, 0);
        if (data != got)
                failmsg("mmap of data segment failed", "want %p, got %p", data, got);
        is_kernel_64_bit = sizeof(unsigned long) == 8;

The code is invoked here:

syzkaller/executor/executor.cc

Line 583 in 8f267ce

os_init(argc, argv, (char*)SYZ_DATA_OFFSET, SYZ_NUM_PAGES * SYZ_PAGE_SIZE);

The change above sometimes leads to a different error:

=== NAME  TestMachineCheck/93
    fuzzer_test.go:299: executor output:
        connected to manager: procs=4 cover_edges=1 kernel_64_bit=1 slowdown=1 syscall_timeout=50 program_timeout=5000 features=0x10
        received info reply: covfilter=0
        recv exec request 1: type=0 flags=0x3 env=0x1c09 exec=0x10 size=14
        proc 0: handshaking to execute request 1
        request pipe write failed (errno=32)
        proc 0: restarting subprocess, current state 1 attempts 0
        proc 0: subprocess exit status 67
        proc 0: got output: NOTFAIL: mmap of data segment failed
        want 0x20000000, got 0xffffffffffffffff (errno 17: File exists)
        proc 0: output pipe EOF
        handle completion: completed=0 output_size=14680064
        SYZFAIL: failed to recv rpc
        fd=3 want=4 recv=0 n=0 (errno 9: Bad file descriptor)
    fuzzer_test.go:300: execution of simple program fails: NOTFAIL: mmap of data segment failed. want 0x20000000, got 0xffffffffffffffff (errno 17: File exists). . process exited with status 67.

So SYZ_DATA_OFFSET sometimes clashes with the address of the pages mmapped by the malloc? But os_init() is called basically at the beginning of main() (it's after all "match argv and return" conditions), at what moment those pages may have already become touched?

Cc @dvyukov

a-nogikh · 2025-02-04T10:03:23Z

After a debugging session with @ramosian-glider:

For binaries compiled with -static, malloc indeed often brks the memory range around the 0x20000000 address specified in SYZ_DATA_OFFSET.

MAP_FIXED_NOREPLACE allows to fail early if we happened to overlap with an existing memory mapping. It should help detects bugs google#5674 at an earlier stage, before it led to memory corruptions. MAP_FIXED_NOREPLACE is supported from Linux 4.17, which is okay for all syzkaller use cases on syzbot. There's no such option for some of the supported OSes, so set it depending on the configuration we're building for.

a-nogikh · 2025-02-04T11:28:40Z

Alternatives:

Get rid of SYZ_DATA_OFFSET and allocate with mmap(0, ..).
Set SYZ_DATA_OFFSET to a much higher value. 0x20000000 (512 << 20) -> 0x800000000 (512 << 26)?

dvyukov · 2025-02-04T11:36:19Z

Keeping SYZ_DATA_OFFSET looks better for reproducibility. We could also compile with -static-pie on linux.

Surprised pthread calls malloc for -static binaries.

The current default value sometimes intersects with the addresses used by malloc, which causes executor memory corruptions. Closes google#5674.

MAP_FIXED_NOREPLACE allows to fail early if we happened to overlap with an existing memory mapping. It should help detects bugs #5674 at an earlier stage, before it led to memory corruptions. MAP_FIXED_NOREPLACE is supported from Linux 4.17, which is okay for all syzkaller use cases on syzbot. There's no such option for some of the supported OSes, so set it depending on the configuration we're building for.

The current default value sometimes intersects with the addresses used by malloc, which causes executor memory corruptions. Closes google#5674.

The current default value sometimes intersects with the addresses used by malloc, which causes executor memory corruptions. Closes #5674.

a-nogikh added the bug label Jan 14, 2025

a-nogikh added a commit to a-nogikh/syzkaller that referenced this issue Jan 28, 2025

pkg/fuzzer: collect executor debug logs in tests

a7393de

It should hopefully let us debug google#5674.

a-nogikh mentioned this issue Jan 28, 2025

pkg/fuzzer: collect executor debug logs in tests #5711

Merged

github-merge-queue bot pushed a commit that referenced this issue Jan 30, 2025

pkg/fuzzer: collect executor debug logs in tests

ec43420

It should hopefully let us debug #5674.

a-nogikh mentioned this issue Feb 4, 2025

executor: favor MAP_FIXED_NOREPLACE over MAP_FIXED #5741

Merged

a-nogikh added a commit to a-nogikh/syzkaller that referenced this issue Feb 4, 2025

sys/targets: change the default DataOffset value

1ccc718

The current default value sometimes intersects with the addresses used by malloc, which causes executor memory corruptions. Closes google#5674.

a-nogikh mentioned this issue Feb 4, 2025

sys/targets: change the default DataOffset value #5742

Merged

a-nogikh added a commit to a-nogikh/syzkaller that referenced this issue Feb 4, 2025

sys/targets: change the default DataOffset value

fa78f9e

The current default value sometimes intersects with the addresses used by malloc, which causes executor memory corruptions. Closes google#5674.

github-merge-queue bot pushed a commit that referenced this issue Feb 4, 2025

sys/targets: change the default DataOffset value

f1930ee

The current default value sometimes intersects with the addresses used by malloc, which causes executor memory corruptions. Closes #5674.

a-nogikh closed this as completed in #5742 Feb 4, 2025

ramosian-glider mentioned this issue Feb 11, 2025

sys/targets: picking DataOffset is hard #5770

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pkg/fuzzer: occasional TestFuzz failures #5674

pkg/fuzzer: occasional TestFuzz failures #5674

a-nogikh commented Jan 14, 2025

a-nogikh commented Jan 28, 2025

a-nogikh commented Jan 29, 2025

a-nogikh commented Jan 29, 2025 •

edited

Loading

ramosian-glider commented Feb 3, 2025

a-nogikh commented Feb 3, 2025 •

edited

Loading

a-nogikh commented Feb 3, 2025

a-nogikh commented Feb 4, 2025

a-nogikh commented Feb 4, 2025

dvyukov commented Feb 4, 2025

pkg/fuzzer: occasional TestFuzz failures #5674

pkg/fuzzer: occasional TestFuzz failures #5674

Comments

a-nogikh commented Jan 14, 2025

a-nogikh commented Jan 28, 2025

a-nogikh commented Jan 29, 2025

a-nogikh commented Jan 29, 2025 • edited Loading

ramosian-glider commented Feb 3, 2025

a-nogikh commented Feb 3, 2025 • edited Loading

a-nogikh commented Feb 3, 2025

a-nogikh commented Feb 4, 2025

a-nogikh commented Feb 4, 2025

dvyukov commented Feb 4, 2025

a-nogikh commented Jan 29, 2025 •

edited

Loading

a-nogikh commented Feb 3, 2025 •

edited

Loading