
Fix flaky test with DataWriter (part 2) #376

Merged
merged 6 commits into main from grodowski/flaky-test-fix-round-2 on Dec 11, 2024

Conversation

grodowski
Member

Improves the DataWriter setup by creating threads and connecting earlier, in initialize. Includes synchronization on the DataWriter threads, so that calling #start ensures a first write to the db before returning and letting the test proceed.

This attempts to reduce the flakiness of test_interrupt_resume_inline_verifier_with_datawriter, which sometimes fails because the expected BinlogVerifyStore contents are empty on interrupt, when they should contain some data from the DataWriter.

Using the debug log entry in the diff below, I found that nothing reaches the inline verifier before the interrupt in the cases where test_interrupt_resume_inline_verifier_with_datawriter fails (feel free to try it out locally).

Hoping this will be the definitive (and correct) fix, as opposed to the speed-up from #371.


diff --git a/inline_verifier.go b/inline_verifier.go
index 552c88e..b1ae817 100644
--- a/inline_verifier.go
+++ b/inline_verifier.go
@@ -25,6 +25,7 @@ import (
 // TODO: remove IterativeVerifier and remove this comment.
 type BinlogVerifyStore struct {
 	EmitLogPerRowsAdded uint64
+	logger              *logrus.Entry
 
 	mutex *sync.Mutex
 	// db => table => paginationKey => number of times it changed.
@@ -110,6 +111,7 @@ func NewBinlogVerifyStore() *BinlogVerifyStore {
 		store:               make(map[string]map[string]map[uint64]int),
 		totalRowCount:       uint64(0),
 		currentRowCount:     uint64(0),
+		logger:              logrus.WithField("tag", "BinLogVerifier"),
 	}
 }
 
@@ -127,6 +129,8 @@ func (s *BinlogVerifyStore) Add(table *TableSchema, paginationKey uint64) {
 	s.mutex.Lock()
 	defer s.mutex.Unlock()
 
+	s.logger.Error("Add event to store DEBUG")
+
 	_, exists := s.store[table.Schema]
 	if !exists {
 		s.store[table.Schema] = make(map[string]map[uint64]int)

@started_callback_cmd << n unless @started
n += 1
# Kind of makes the following race condition a bit better...
# https://github.com/Shopify/ghostferry/issues/280
Member Author

This is still a problem 👀

begin
  until @stop_requested do
    write_data(connection, &on_write)
    @started_callback_cmd << n unless @started
Member Author

Lets #start return only after the first write.
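
For context, a rough sketch of the handshake this enables. It is illustrative only: @started_callback_cmd and @number_of_writers appear in the snippets here, but the class shape, write_data stand-in, and everything else are assumptions rather than the actual DataWriter.

require "mysql2"

# Sketch only (not the actual DataWriter): a Queue-based handshake where
# #start blocks until at least one write has happened before the test proceeds.
class DataWriterSketch
  def initialize(db_config, number_of_writers: 2)
    @number_of_writers = number_of_writers
    @started_callback_cmd = Queue.new   # thread-safe acknowledgement channel
    @started = false
    @stop_requested = false

    # Threads and connections are created up front, as in this PR.
    @threads = @number_of_writers.times.map do |n|
      Thread.new do
        connection = Mysql2::Client.new(db_config)
        begin
          until @stop_requested
            write_data(connection)                      # stand-in for the real writes
            @started_callback_cmd << n unless @started  # acknowledge the first write
            sleep(0.03)
          end
        ensure
          connection.close
        end
      end
    end
  end

  # Returns only after at least one writer thread has completed a write,
  # so the BinlogVerifyStore has something to verify before an interrupt.
  def start
    @started_callback_cmd.pop
    @started = true
  end

  def stop
    @stop_requested = true
    @threads.each(&:join)
  end

  private

  def write_data(connection)
    # Placeholder: the real helper issues INSERT/UPDATE/DELETE statements.
  end
end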

@number_of_writers.times do |i|
  @threads << Thread.new do
    @logger.info("data writer thread in wait mode #{i}")
    connection = Mysql2::Client.new(@db_config)
Contributor

we should still close the connection after the stop is requested?

Member Author

Yup, I forgot to add it back here after an experiment with the ConnectionPool gem (which I thought was overkill for this test setup).
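
For reference, using the connection_pool gem here would look roughly like this. This is an illustrative sketch only, not what was in the experiment or the PR; @number_of_writers, @db_config, write_data, and on_write are borrowed from the snippets above.

require "connection_pool"
require "mysql2"

# Sketch only: a shared pool instead of one connection per writer thread.
# Pool sizing and timeout values here are arbitrary.
pool = ConnectionPool.new(size: @number_of_writers, timeout: 5) do
  Mysql2::Client.new(@db_config)
end

# Each write checks a connection out of the pool and returns it afterwards.
pool.with do |connection|
  write_data(connection, &on_write)
end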

@grodowski grodowski force-pushed the grodowski/flaky-test-fix-round-2 branch from 95f5d10 to 8c817bf on December 9, 2024 14:57
sleep(0.03)
end
ensure
connection.close
Contributor

nit: this ensure will only get called if something goes wrong in the begin block, right? So if anything goes wrong before that, we won't close the connection.

Member Author

I don't think that's right; ensure is always called, after any rescue or at the end of the block. I just moved it from here.

Contributor

when I do this:

def test
  raise "here"

  begin
    puts "Hi"
  ensure
    puts "always ensure"
  end
end

ensure is not called.

irb(main):010> test
(irb):2:in `test': here (RuntimeError)

but when I do this:

def test
  puts "here"

  begin
    raise "Hi"
  ensure
    puts "always ensure"
  end
end

I get this:

irb(main):020> test
here
always ensure
(irb):15:in `test': Hi (RuntimeError)

Member Author

Ohhh right, I just saw what happened there when I moved the connection outside of the block. Simplified and fixed in 50c78b2.
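
For illustration, one way to guarantee the close regardless of where the failure happens (a sketch only, not necessarily the exact shape of 50c78b2; write_data, on_write, and the instance variables follow the snippets above):

# Sketch only: keep the connection handle visible to ensure, and tolerate
# the case where Mysql2::Client.new itself raises (connection stays nil).
Thread.new do
  connection = nil
  begin
    connection = Mysql2::Client.new(@db_config)
    write_data(connection, &on_write) until @stop_requested
  ensure
    connection&.close
  end
end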

@grodowski grodowski force-pushed the grodowski/flaky-test-fix-round-2 branch 3 times, most recently from 50c78b2 to 837f027 on December 10, 2024 12:00
@grodowski grodowski requested a review from mtaner on December 10, 2024 12:06
@grodowski
Member Author

grodowski commented Dec 10, 2024

Currently re-running the CI suite several times to evaluate its stability ⌛. I don't think this will resolve all concurrency issues, but hopefully InterruptResumeTest will be more consistent.

Edit: 96f92f5 is a side-quest for another go-test flake (where synchronization would be ideal too...)

Speed up the setup by creating threads and connecting earlier, in initialize. Includes synchronization on DataWriter threads, so that calling #start ensures a first write to the db before returning.

This attempts to reduce the flakiness of test_interrupt_resume_inline_verifier_with_datawriter.
@grodowski grodowski force-pushed the grodowski/flaky-test-fix-round-2 branch from 7148582 to 96f92f5 on December 10, 2024 12:59
…nts, potential source of flakiness

Hypothesis: MixedActionDataWriter doesn't have a chance to start (goroutine), so give it more time. The ideal solution would be to introduce some form of synchronization, but let's try this first. Partially reverts a speedup change from a288a28.
@grodowski grodowski force-pushed the grodowski/flaky-test-fix-round-2 branch from 96f92f5 to 7750793 on December 10, 2024 13:29
@grodowski grodowski merged commit 51d3960 into main Dec 11, 2024
9 checks passed
@grodowski grodowski deleted the grodowski/flaky-test-fix-round-2 branch December 11, 2024 11:41