
Fix flaky test with DataWriter (part 2) #376

Merged
merged 6 commits into main from grodowski/flaky-test-fix-round-2 on Dec 11, 2024

Conversation

grodowski
Member

Improves the DataWriter setup by creating threads and connecting earlier, in initialize. Includes synchronization on the DataWriter threads, so that calling #start ensures a first write to the db before returning and letting the test proceed.

This attempts to reduce the flakiness of test_interrupt_resume_inline_verifier_with_datawriter, which sometimes fails because the expected BinlogVerifyStore contents are empty on interrupt, when they should contain some data from the DataWriter.

Using the debug log entry in the diff below, I found that nothing reaches the inline verifier before the interrupt in the cases where test_interrupt_resume_inline_verifier_with_datawriter fails (feel free to try it out locally).

Hoping this will be the definitive (and correct) fix, as opposed to the speed-up from #371.


diff --git a/inline_verifier.go b/inline_verifier.go
index 552c88e..b1ae817 100644
--- a/inline_verifier.go
+++ b/inline_verifier.go
@@ -25,6 +25,7 @@ import (
 // TODO: remove IterativeVerifier and remove this comment.
 type BinlogVerifyStore struct {
 	EmitLogPerRowsAdded uint64
+	logger              *logrus.Entry
 
 	mutex *sync.Mutex
 	// db => table => paginationKey => number of times it changed.
@@ -110,6 +111,7 @@ func NewBinlogVerifyStore() *BinlogVerifyStore {
 		store:               make(map[string]map[string]map[uint64]int),
 		totalRowCount:       uint64(0),
 		currentRowCount:     uint64(0),
+		logger:              logrus.WithField("tag", "BinLogVerifier"),
 	}
 }
 
@@ -127,6 +129,8 @@ func (s *BinlogVerifyStore) Add(table *TableSchema, paginationKey uint64) {
 	s.mutex.Lock()
 	defer s.mutex.Unlock()
 
+	s.logger.Error("Add event to store DEBUG")
+
 	_, exists := s.store[table.Schema]
 	if !exists {
 		s.store[table.Schema] = make(map[string]map[uint64]int)

@started_callback_cmd << n unless @started
n += 1
# Kind of makes the following race condition a bit better...
# https://github.com/Shopify/ghostferry/issues/280
Member Author

This is still a problem 👀

begin
  until @stop_requested do
    write_data(connection, &on_write)
    @started_callback_cmd << n unless @started
Member Author

Lets #start return only after the first write.
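
For context, a rough sketch of the handshake this enables. It is illustrative only: @started_callback_cmd and @number_of_writers appear in the snippets here, but the class shape, write_data stand-in, and everything else are assumptions rather than the actual DataWriter.

require "mysql2"

# Sketch only (not the actual DataWriter): a Queue-based handshake where
# #start blocks until at least one write has happened before the test proceeds.
class DataWriterSketch
  def initialize(db_config, number_of_writers: 2)
    @number_of_writers = number_of_writers
    @started_callback_cmd = Queue.new   # thread-safe acknowledgement channel
    @started = false
    @stop_requested = false

    # Threads and connections are created up front, as in this PR.
    @threads = @number_of_writers.times.map do |n|
      Thread.new do
        connection = Mysql2::Client.new(db_config)
        begin
          until @stop_requested
            write_data(connection)                      # stand-in for the real writes
            @started_callback_cmd << n unless @started  # acknowledge the first write
            sleep(0.03)
          end
        ensure
          connection.close
        end
      end
    end
  end

  # Returns only after at least one writer thread has completed a write,
  # so the BinlogVerifyStore has something to verify before an interrupt.
  def start
    @started_callback_cmd.pop
    @started = true
  end

  def stop
    @stop_requested = true
    @threads.each(&:join)
  end

  private

  def write_data(connection)
    # Placeholder: the real helper issues INSERT/UPDATE/DELETE statements.
  end
end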

@number_of_writers.times do |i|
  @threads << Thread.new do
    @logger.info("data writer thread in wait mode #{i}")
    connection = Mysql2::Client.new(@db_config)
Contributor

we should still close the connection after the stop is requested?

Member Author

Yup, I forgot to add it back here after an experiment with the ConnectionPool gem (which I thought was overkill for this test setup).
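
For reference, using the connection_pool gem here would look roughly like this. This is an illustrative sketch only, not what was in the experiment or the PR; @number_of_writers, @db_config, write_data, and on_write are borrowed from the snippets above.

require "connection_pool"
require "mysql2"

# Sketch only: a shared pool instead of one connection per writer thread.
# Pool sizing and timeout values here are arbitrary.
pool = ConnectionPool.new(size: @number_of_writers, timeout: 5) do
  Mysql2::Client.new(@db_config)
end

# Each write checks a connection out of the pool and returns it afterwards.
pool.with do |connection|
  write_data(connection, &on_write)
end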

@grodowski grodowski force-pushed the grodowski/flaky-test-fix-round-2 branch from 95f5d10 to 8c817bf on December 9, 2024 14:57
sleep(0.03)
end
ensure
connection.close
Contributor

nit: this ensure will only get called if something goes wrong in the begin block, right? So if anything goes wrong before that, we won't close the connection.

Member Author

I don't think that's right; ensure is always called, after any rescue or at the end of the block. I just moved it from here.

Contributor

when I do this:

def test
  raise "here"

  begin
    puts "Hi"
  ensure
    puts "always ensure"
  end
end

ensure is not called.

irb(main):010> test
(irb):2:in `test': here (RuntimeError)

but when I do this:

def test
  puts "here"

  begin
    raise "Hi"
  ensure
    puts "always ensure"
  end
end

I get this:

irb(main):020> test
here
always ensure
(irb):15:in `test': Hi (RuntimeError)

Member Author

Ohhh right, I just saw what happened there when I moved the connection outside of the block. Simplified and fixed in 50c78b2.
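
For illustration, one way to guarantee the close regardless of where the failure happens (a sketch only, not necessarily the exact shape of 50c78b2; write_data, on_write, and the instance variables follow the snippets above):

# Sketch only: keep the connection handle visible to ensure, and tolerate
# the case where Mysql2::Client.new itself raises (connection stays nil).
Thread.new do
  connection = nil
  begin
    connection = Mysql2::Client.new(@db_config)
    write_data(connection, &on_write) until @stop_requested
  ensure
    connection&.close
  end
end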

@grodowski grodowski force-pushed the grodowski/flaky-test-fix-round-2 branch 3 times, most recently from 50c78b2 to 837f027 on December 10, 2024 12:00
@grodowski grodowski requested a review from mtaner on December 10, 2024 12:06
@grodowski
Member Author

grodowski commented Dec 10, 2024

Currently re-running the CI suite several times to evaluate its stability ⌛. I don't think this will resolve all concurrency issues, but hopefully InterruptResumeTest will be more consistent.

Edit: 96f92f5 is a side-quest for another go-test flake (where synchronization would be ideal too...)

Speed up the setup by creating threads and connecting earlier, in initialize. Includes synchronization on DataWriter threads, so that calling #start ensures a first write to the db before returning.

This attempts to reduce the flakiness of test_interrupt_resume_inline_verifier_with_datawriter.
@grodowski grodowski force-pushed the grodowski/flaky-test-fix-round-2 branch from 7148582 to 96f92f5 on December 10, 2024 12:59
…nts, potential source of flakiness

Hypothesis: MixedActionDataWriter doesn't have a chance to start (goroutine), so give it more time. The ideal solution would be to introduce some form of synchronization, but let's try this first. Partially reverts a speedup change from a288a28.
@grodowski grodowski force-pushed the grodowski/flaky-test-fix-round-2 branch from 96f92f5 to 7750793 on December 10, 2024 13:29
@grodowski grodowski merged commit 51d3960 into main Dec 11, 2024
9 checks passed
@grodowski grodowski deleted the grodowski/flaky-test-fix-round-2 branch December 11, 2024 11:41