
Improved industrialization support: persisting of model (configuration), repeated model scoring, model monitoring #127

Open
sandervh14 opened this issue Apr 1, 2022 · 2 comments · May be fixed by #134
Labels
data engineering enhancement New feature or request

Comments

@sandervh14
Contributor

Add functionality to write away model metadata

It would be nice to have a function for writing out, for each new modeling attempt, the date and time, which variables were selected, which preprocessing was done, and what the resulting score was.

Task Description

This could comprise:

  • store model metadata (scores, datetime, version, etc.) in a table
  • store the files involved (the model and preprocessor pickles, and potentially the data) on some file storage or as database blobs

Provide the code for extracting this metadata, but allow a data scientist/engineer to write a plugin function that does the actual writing of the metadata to the database/file store of choice.
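
A minimal sketch of this plugin idea, assuming a hypothetical `MetadataWriter` callable and a SQLite example backend (none of these names exist in Cobra today; they are purely illustrative):

```python
# Cobra collects the metadata; the data scientist/engineer supplies the
# function that actually persists it. All names here are hypothetical.
import json
import sqlite3
from datetime import datetime, timezone
from typing import Callable

ModelMetadata = dict  # e.g. {"timestamp": ..., "selected_variables": [...], "score": ...}
MetadataWriter = Callable[[ModelMetadata], None]


def collect_model_metadata(selected_variables, preprocessing_steps, score) -> ModelMetadata:
    """Assemble the metadata of one modeling attempt into a plain dict."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "selected_variables": list(selected_variables),
        "preprocessing": list(preprocessing_steps),
        "score": float(score),
    }


def write_metadata_to_sqlite(metadata: ModelMetadata) -> None:
    """Example plugin: append the metadata as a JSON row to a local SQLite table."""
    conn = sqlite3.connect("model_metadata.db")
    with conn:  # commits on success
        conn.execute("CREATE TABLE IF NOT EXISTS model_runs (payload TEXT)")
        conn.execute("INSERT INTO model_runs (payload) VALUES (?)", (json.dumps(metadata),))
    conn.close()


# Cobra would call whichever writer the user passes in.
writer: MetadataWriter = write_metadata_to_sqlite
writer(collect_model_metadata(["var_a", "var_b"], ["target_encoding"], 0.81))
```

Swapping SQLite for PostgreSQL, BigQuery, a file store, etc. would then only mean passing a different writer function.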

@sandervh14 sandervh14 changed the title Add functionality to write away model metadata Add functionality to store model score and configuration (to model metadata database) Apr 27, 2022
@sandervh14 sandervh14 changed the title Add functionality to store model score and configuration (to model metadata database) Improved industrialization support: persisting of model (configuration), repeated model scoring, model monitoring Jun 2, 2022
@sandervh14
Contributor Author

sandervh14 commented Jun 2, 2022

Hi @sborms, @ZlaTanskY, @nicolasmorandi and @c-morey!

Before we can finish implementing this issue entirely and have better support for industrializing Cobra models, I think we need to discuss a few things.

Proposed use cases (and processes) throughout the industrialization & model usage phase that we could support better in Cobra:

  1. persisting the trained model (see the persistence sketch after this list):
  • the model has been trained with the existing Cobra API
  • persist the preprocessor pipeline configuration as a JSON file (1)
  • persist the model scores achieved (2)
  • persist the data (the basetable, containing all splits, for reproducibility) (3)
  2. model monitoring:
  • score the model periodically on new data (basetables), reusing persisted Cobra models
  • detect data drift: rather than reinventing the wheel, integrate with an existing solution; Nicolas and I thought of trying out NannyML
  • persist the findings (model scores, data metrics) of the above model monitoring steps (2)
  • persist the data (basetables) used for the above model monitoring steps (3)
  • visualize the model monitoring, preferably in a dashboard (4)
  3. retraining the model (proactive/reactive maintenance of the model): same steps as use case 1.

  4. facilitate easy deployment & running: provide example Python scripts for running production-grade Cobra models after the notebook phase, example Docker images/docker-compose files, ...
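
For use case 1, a rough sketch of what the persistence step could look like, assuming a JSON-serializable preprocessor configuration, a picklable model and a pandas basetable (function and file names are illustrative, not existing Cobra API):

```python
import json
import pickle
from pathlib import Path


def persist_run(run_dir: Path, preprocessor_config: dict, model, scores: dict, basetable) -> None:
    """Persist one trained Cobra run to a local directory (illustrative layout)."""
    run_dir.mkdir(parents=True, exist_ok=True)

    # (1) preprocessor pipeline configuration as JSON
    (run_dir / "preprocessor.json").write_text(json.dumps(preprocessor_config, indent=2))

    # the fitted model itself as a pickle
    with open(run_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)

    # (2) the achieved model scores
    (run_dir / "scores.json").write_text(json.dumps(scores, indent=2))

    # (3) the basetable, with all splits, for reproducibility (requires pyarrow)
    basetable.to_parquet(run_dir / "basetable.parquet")
```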

Integrations necessary for the above:

  • persisting files - see (1) above: we support writing to the local file system at the moment, but should consider supporting uploading to a server location, or storing as a blob in a database, etc. (a backend sketch follows this list)
  • persisting model scores - see (2) above
  • persisting data - see (3) above: support databases both locally (MySQL, PostgreSQL) and in the cloud (BigQuery etc.)
  • visualizing model monitoring findings - see (4) above: integrate with Power BI or other dashboarding software
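
For the file-persistence integration, one possible shape is a small storage-backend protocol so that writing locally versus uploading to a server or blob store stays interchangeable; the class and method names below are assumptions, not existing Cobra API:

```python
from pathlib import Path
from typing import Protocol


class FileStore(Protocol):
    def save(self, name: str, content: bytes) -> str:
        """Persist `content` under `name` and return its location/URI."""


class LocalFileStore:
    """Default backend: write artifacts to the local file system."""

    def __init__(self, root: str = "artifacts"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, content: bytes) -> str:
        path = self.root / name
        path.write_bytes(content)
        return str(path)


# A cloud or database-blob backend would implement the same `save` signature,
# so Cobra code that persists artifacts never needs to know where they end up.
def persist_model_pickle(store: FileStore, pickled_model: bytes) -> str:
    return store.save("model.pkl", pickled_model)
```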

Additional task: documenting the thoughts above
Also document any additions of yours, of course, as well as which of the above use cases and integrations we support at the moment and how to use them. While doing so, also fix the points listed in #133 (either fix them here and close #133, or fix #133 separately and mention this issue as linked).

Am I missing interesting use cases or integrations above? Feel free to suggest.

Also: we cannot implement everything right now, nor even in the coming years. We will have to pick the most interesting items each time and add use cases and integrations on the go as we industrialize Cobra for clients with different demands and infrastructure.

I've also gathered the files from the Brico pull request and started structuring them a bit, so we can integrate their efforts into Cobra; see the draft pull request linked on this page (FYI only). But I'd like to discuss the above thoughts first before proceeding with the gathered code, so we agree on what we want to do.

@sandervh14
Contributor Author

Nicolás is interested in investigating MLflow; that investigation could fit in this issue. See the details on MLflow's GitHub: all four MLflow components (Tracking for storing model parameters, Projects for reproducible runs, Models for easy deployment, and Model Registry for tracking model evolution throughout the model's lifecycle) are very interesting to build integrations for within Cobra. Up to you to decide, @nicolasmorandi @pietrodantuono.
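
To make the MLflow idea concrete, here is a sketch of what a Tracking integration could log for one Cobra run (only a possible shape, not an agreed design; the experiment name, parameters and metric values are illustrative):

```python
import mlflow

mlflow.set_experiment("cobra-model")

with mlflow.start_run():
    # model configuration (selected variables, preprocessing settings, ...)
    mlflow.log_params({"selected_variables": "var_a,var_b", "encoder": "target"})

    # achieved scores
    mlflow.log_metric("auc_selection", 0.81)
    mlflow.log_metric("auc_validation", 0.79)

    # persisted files (must exist locally when logging): preprocessor config,
    # basetable, pickled model, ...
    mlflow.log_artifact("preprocessor.json")
    mlflow.log_artifact("basetable.parquet")
    mlflow.log_artifact("model.pkl")
```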

@sandervh14 sandervh14 modified the milestones: 2023-03, 2023-05 Mar 9, 2023
@sandervh14 sandervh14 removed their assignment Mar 9, 2023