Update microbatch end_time to the batch_size ceiling #10883

QMalcolm (Contributor) commented Oct 18, 2024

Resolves #10868

Problem

(From issue)

Currently, when running a microbatch model, the end time of the most recent batch is the current time. For the most part this just works, but there is a problem.

Sometimes the event_time column is a date instead of a datetime. Consider a model whose event_time is date_added, a column of type date. The microbatch filter would then be something like date_added >= '2024-10-16 00:00:00' and date_added < '2024-10-16 11:29:34'. How does the data warehouse handle that? In some cases the datetime values are automatically truncated to dates, turning the filter into date_added >= '2024-10-16' and date_added < '2024-10-16'. That is a problem because such a filter always returns zero rows. One way around this is to take the batch ceiling of the current time. That is, if our batch_size is day
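The empty-filter case described above can be reproduced outside a warehouse. The following is a minimal sketch, not dbt code: `rows_in_window` and the sample `date_added` values are hypothetical, and calling `.date()` on the bounds stands in for a warehouse that truncates datetime literals when comparing against a date column.

```python
from datetime import date, datetime


def rows_in_window(rows, start, end):
    """Return the values that fall in the half-open window [start, end)."""
    return [r for r in rows if start <= r < end]


# Hypothetical date_added values, for illustration only.
date_added = [date(2024, 10, 15), date(2024, 10, 16)]

batch_start = datetime(2024, 10, 16, 0, 0, 0)
batch_end = datetime(2024, 10, 16, 11, 29, 34)  # "now", partway through the day

# Assumption: the warehouse truncates both datetime bounds to dates.
# The window collapses to [2024-10-16, 2024-10-16) and matches nothing.
truncated = rows_in_window(date_added, batch_start.date(), batch_end.date())

# With the end bound moved to the day ceiling, the same truncation
# leaves a non-empty window [2024-10-16, 2024-10-17).
ceiled_end = datetime(2024, 10, 17, 0, 0, 0)
ceiled = rows_in_window(date_added, batch_start.date(), ceiled_end.date())

print(truncated)
print(ceiled)
```

Whether a given warehouse truncates the literal or promotes the column to a datetime varies by platform, so this only models the problematic case.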

Solution

Use an upper bound timestamp that is the current (system) or specified (via CLI) timestamp, rounded up to the batch_size ceiling. For example, if the current timestamp is 2020-01-01 12:30:00 and the batch_size is day, the ceilinged timestamp is 2020-01-02 00:00:00. This guarantees that every batch filter spans the same width (exactly one batch_size: hour, day, month, or year) and avoids the empty filters described in the issue.
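One way to compute such a ceiling is to truncate the timestamp to the start of its batch and, if that changed anything, step forward by one batch. The sketch below illustrates the idea only; the function names are hypothetical and it is not presented as the code merged in this PR. It also assumes that a timestamp already sitting exactly on a batch boundary is left unchanged.

```python
from datetime import datetime, timedelta


def truncate_timestamp(ts: datetime, batch_size: str) -> datetime:
    """Round ts down to the start of its hour, day, month, or year."""
    if batch_size == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if batch_size == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if batch_size == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if batch_size == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported batch_size: {batch_size!r}")


def next_boundary(floor: datetime, batch_size: str) -> datetime:
    """Step a truncated timestamp forward by exactly one batch."""
    if batch_size == "hour":
        return floor + timedelta(hours=1)
    if batch_size == "day":
        return floor + timedelta(days=1)
    if batch_size == "month":
        # floor is the first of a month, so changing the month is always valid.
        if floor.month == 12:
            return floor.replace(year=floor.year + 1, month=1)
        return floor.replace(month=floor.month + 1)
    if batch_size == "year":
        return floor.replace(year=floor.year + 1)
    raise ValueError(f"unsupported batch_size: {batch_size!r}")


def ceiling_timestamp(ts: datetime, batch_size: str) -> datetime:
    """Round ts up to the next batch boundary (assumed no-op on a boundary)."""
    floor = truncate_timestamp(ts, batch_size)
    if floor == ts:
        return ts
    return next_boundary(floor, batch_size)


print(ceiling_timestamp(datetime(2020, 1, 1, 12, 30, 0), "day"))
```

With this shape, the example from the description (2020-01-01 12:30:00 with a day batch_size) rounds up to 2020-01-02 00:00:00, and the last batch always covers one full batch_size.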

Checklist

  • I have read the contributing guide and understand what's expected of me.
  • I have run this code in development, and it appears to resolve the stated issue.
  • This PR includes tests, or tests are not required or relevant for this PR.
  • This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
  • This PR includes type annotations for new and modified functions.

QMalcolm added the Skip Changelog label (Skips GHA to check for changelog file) Oct 18, 2024
cla-bot added the cla:yes label Oct 18, 2024

codecov bot commented Oct 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.18%. Comparing base (8df5c96) to head (7cb7198).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10883      +/-   ##
==========================================
+ Coverage   89.13%   89.18%   +0.05%     
==========================================
  Files         183      183              
  Lines       23489    23496       +7     
==========================================
+ Hits        20938    20956      +18     
+ Misses       2551     2540      -11     
Flag          Coverage Δ
integration   86.57% <100.00%> (+0.12%) ⬆️
unit          62.08% <100.00%> (+0.01%) ⬆️

Components          Coverage Δ
Unit Tests          62.08% <100.00%> (+0.01%) ⬆️
Integration Tests   86.57% <100.00%> (+0.12%) ⬆️

QMalcolm marked this pull request as ready for review October 29, 2024 21:14
QMalcolm requested a review from a team as a code owner October 29, 2024 21:14
QMalcolm removed the Skip Changelog label (Skips GHA to check for changelog file) Oct 29, 2024
QMalcolm merged commit dd77210 into main Oct 29, 2024
63 checks passed
QMalcolm deleted the qmalcolm--10868-ceiling-microbatch-end-time-to-batch-size-ceiling branch October 29, 2024 22:26

Successfully merging this pull request may close these issues.

[Bug] Invalid where filter for latest batch when event_column is of type date