Update microbatch end_time to the batch_size ceiling #10883
Merged
Resolves #10868
Problem
(From issue)
Sometimes the event_time column is a date instead of a datetime. Consider a model whose event_time is date_added, a date-typed column. The microbatch filter would then be something like date_added >= '2024-10-16 00:00:00' and date_added < '2024-10-16 11:29:34'. How does the data warehouse handle that? In some cases the datetime values get auto-truncated to dates, turning the filter into date_added >= '2024-10-16' and date_added < '2024-10-16'. That is problematic because this filter always returns zero rows. To get around this, one solution is to take the batch ceiling of the current time. That is, if our batch_size is day, round the upper bound up to the start of the next day.
Solution
Use an upper-bound timestamp that is the current (system) or CLI-specified timestamp, ceilinged to the batch boundary. For example, if the current timestamp is 2020-01-01 12:30:00 and the batch_size is day, the ceilinged timestamp is 2020-01-02 00:00:00. This guarantees that every batch filter spans the same interval (exactly one batch_size: hour, day, month, or year) and avoids the empty filters described in the issue.
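
As a rough sketch of the ceiling behavior described above (the function name and structure here are illustrative, not the actual dbt-core implementation), a helper like this rounds a timestamp up to the next batch boundary:

```python
from datetime import datetime, timedelta

def ceiling_timestamp(ts: datetime, batch_size: str) -> datetime:
    """Round ts up to the start of the next batch boundary.

    Timestamps already exactly on a boundary are returned unchanged,
    so a boundary input never gets pushed a full batch forward.
    """
    if batch_size == "hour":
        floored = ts.replace(minute=0, second=0, microsecond=0)
        return floored if floored == ts else floored + timedelta(hours=1)
    if batch_size == "day":
        floored = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        return floored if floored == ts else floored + timedelta(days=1)
    if batch_size == "month":
        floored = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        if floored == ts:
            return ts
        # timedelta has no "month", so roll the month (and year) manually.
        if floored.month == 12:
            return floored.replace(year=floored.year + 1, month=1)
        return floored.replace(month=floored.month + 1)
    if batch_size == "year":
        floored = ts.replace(
            month=1, day=1, hour=0, minute=0, second=0, microsecond=0
        )
        return floored if floored == ts else floored.replace(year=floored.year + 1)
    raise ValueError(f"unknown batch_size: {batch_size}")

# The example from the description: 2020-01-01 12:30:00 with a day batch_size
# ceilings to 2020-01-02 00:00:00.
assert ceiling_timestamp(datetime(2020, 1, 1, 12, 30), "day") == datetime(2020, 1, 2)
```

With the upper bound ceilinged this way, a date-typed event_time column still yields a non-empty filter (e.g. date_added >= '2024-10-16' and date_added < '2024-10-17') even if the warehouse truncates the datetime literals to dates.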
Checklist