The following architecture shows the deployment of our movie recommendation system.
-
Data storage
- Apache Kafka is a distributed event store and stream-processing platform
- Collect Kafka log data
- Data (movies watched by user) --> for (re)training model and for online evaluation
- Rate (rating by user) --> for (re)training model and for online evaluation
- Request --> for online evaluation
- This pipeline, once run, continues to run until it is intentionally stopped.
- After online evaluation, expired data is automatically deleted.
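The routing of the three event types and the cleanup of expired records can be sketched as follows. This is a minimal illustration, not the project's actual consumer: the event-type labels (`data`, `rate`, `request`), the `Record` fields, and the retention window are all assumptions made for the example.

```python
from dataclasses import dataclass

# Which downstream uses each event type feeds, following the list above.
# The keys are hypothetical labels, not the real log format.
ROUTES = {
    "data": ("training", "online_evaluation"),
    "rate": ("training", "online_evaluation"),
    "request": ("online_evaluation",),
}


@dataclass
class Record:
    kind: str         # one of the ROUTES keys
    timestamp: float  # seconds since the epoch
    payload: str      # raw log line


def destinations(record):
    """Return the downstream uses for a record based on its event type."""
    return ROUTES[record.kind]


def drop_expired(records, now, max_age_seconds):
    """Keep only records that are at most max_age_seconds old."""
    return [r for r in records if now - r.timestamp <= max_age_seconds]


records = [
    Record("rate", timestamp=100.0, payload="example rating line"),
    Record("request", timestamp=900.0, payload="example request line"),
]
fresh = drop_expired(records, now=1000.0, max_age_seconds=600.0)
```

Here only the request record survives, because the rating record is older than the assumed 600-second window.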
-
Data preprocessing
- Preprocess the stored raw data
- Generate a compressed sparse row (CSR) matrix
- Split it into train/validation sets
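To make the CSR layout concrete, the sketch below builds the three CSR arrays by hand from (user, movie, rating) triples and splits the triples into train and validation sets. It is a plain-Python illustration under assumed inputs: the triple format, the 80/20 split, and the function names are hypothetical, and the split is done on the triples before the matrix is built, which may differ from the project's order of operations.

```python
import random


def build_csr(triples, n_users):
    """Build CSR arrays (data, indices, indptr) with one row per user."""
    rows = [[] for _ in range(n_users)]
    for user, movie, rating in triples:
        rows[user].append((movie, rating))

    data, indices, indptr = [], [], [0]
    for row in rows:
        for movie, rating in sorted(row):  # column indices sorted per row
            indices.append(movie)
            data.append(rating)
        indptr.append(len(data))           # where the next row starts
    return data, indices, indptr


def split_triples(triples, val_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and split into (train, validation)."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


triples = [(0, 1, 4.0), (0, 0, 5.0), (2, 1, 3.0)]
data, indices, indptr = build_csr(triples, n_users=3)
train, validation = split_triples(triples * 5)  # 15 triples -> 12 / 3
```

Row `u` of the matrix is stored in `data[indptr[u]:indptr[u + 1]]`, so a user with no ratings (user 1 above) occupies no space beyond a repeated `indptr` entry.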
-
Model (re)training
- SVD
- SVD++
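For orientation, recommenders labeled "SVD" commonly predict a rating as the global mean plus user and item biases plus the dot product of the user and item latent factors. The sketch below assumes that formulation; the source does not state which library or variant is used, and all parameter names are illustrative. SVD++ extends the same rule by adding an implicit-feedback term to the user factors.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Biased matrix-factorization prediction: mu + b_u + b_i + q_i . p_u."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction


estimate = predict_rating(
    global_mean=3.5,
    user_bias=0.1,
    item_bias=-0.2,
    user_factors=[1.0, 2.0],
    item_factors=[0.5, 0.25],
)
```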
-
Offline evaluation
- RMSE (root mean squared error) is the metric for offline evaluation
- The process is integrated into a Jenkins pipeline, which runs automatically.
- The results can be viewed on Jenkins in a coverage-report format
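As a reminder of what the offline metric measures, RMSE is the square root of the mean squared difference between predicted and actual ratings. The function below is a minimal sketch; the function name and the sample values are made up for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


score = rmse([3.0, 4.0], [3.0, 2.0])  # errors 0 and 2 -> sqrt(4 / 2)
```

Lower values are better, and the score is in the same units as the ratings themselves.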
- Continuous integration
- Jenkins
- Unit tests 1 to 5 --> model management & offline evaluation (model) --> online evaluation
- Using the Blue Ocean plugin
- Provides a more visual pipeline dashboard than the classic Jenkins interface
- A commit to the master branch on GitHub --> the entire pipeline runs automatically
- Saving after building the pipeline --> the pipeline's Jenkinsfile is committed to the master branch on GitHub
- Using a freestyle project
- Runs automatically at a fixed interval
- Configured with the "Build periodically" option
- Rancher
- A complete container management platform that includes everything necessary for container management during the production process
- Deployment components
- Our system manages two recommendation models as different deployments in one cluster
- Each deployment consists of two pods, one a replica of the other, which share and process the workload
- Automatic Continuous Deployment with Jenkins
- We extend our integration pipeline to cover model deployment
- We use Jenkins to send the deployment signal to Rancher
- Whenever a commit is pushed to GitHub, the pipeline is executed:
- Continuous integration: data fetching, data preprocessing, model retraining
- Continuous deployment: build Docker images, push images to the Docker repository
- Model deployment: pull the Docker images for the retrained models and redeploy them through Rancher
- Zero downtime for model redeployment
- The new deployment also has two pods, one a replica of the other
- After one new pod is deployed, one existing pod is terminated
- After the second new pod is deployed, the remaining existing pod is also terminated --> zero downtime while deploying the retrained models
- All of these processes are controlled reliably by the Rancher platform
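The one-pod-at-a-time replacement described above can be simulated in a few lines. This is only a toy model of the rollout order, not Rancher's or Kubernetes' implementation: pod names are hypothetical, and each new pod is assumed to be ready as soon as it is started.

```python
def rolling_update(old_pods, new_pods):
    """Yield the list of running pods after each step of the rollout."""
    running = list(old_pods)
    remaining_old = list(old_pods)
    for new_pod in new_pods:
        running.append(new_pod)        # start a new pod first
        yield list(running)
        if remaining_old:
            running.remove(remaining_old.pop(0))  # then retire an old pod
            yield list(running)


steps = list(rolling_update(["old-1", "old-2"], ["new-1", "new-2"]))
min_serving = min(len(step) for step in steps)
```

Because an old pod is removed only after its replacement has been added, the number of serving pods never drops below the original two, which is what gives zero downtime.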
- Monitoring infrastructure
- Prometheus, Grafana and Node Exporter to monitor our infrastructure
- Memory usage
- CPU usage
- Request latency in Flask
- Model quality
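Request latency is typically measured by timing each handler call and exporting the observations to the metrics system. The sketch below shows only the timing part in plain Python; the in-memory list is a stand-in for a Prometheus metric, and the `recommend` handler and its return value are hypothetical.

```python
import time
from functools import wraps

latencies_seconds = []  # stand-in for a metrics backend


def timed(func):
    """Record the wall-clock duration of every call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the handler raises.
            latencies_seconds.append(time.perf_counter() - start)
    return wrapper


@timed
def recommend(user_id):
    # Hypothetical handler body; a real one would query the model.
    return [user_id, user_id + 1]


result = recommend(7)
```

Recording in a `finally` block means failed requests are timed too, which matters when slow failures are the thing being monitored.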
- Provenance
- DVC
- An open-source version control system for data and models
- DVC stores metadata about the dataset and the model in .dvc files
- Process
- Track modifications with DVC --> add the changes to Git --> push a Git tag
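The three steps can be spelled out as the commands one would typically run. The helper below only builds the command lists so they can be inspected before execution; the data path, the tag name, the commit message, and the remote name `origin` are assumptions for the example, not values taken from the project.

```python
import posixpath


def provenance_commands(data_path, tag):
    """Return the DVC and Git commands for one tracked, tagged version."""
    dvc_file = data_path + ".dvc"  # metadata file written by `dvc add`
    gitignore = posixpath.join(posixpath.dirname(data_path), ".gitignore")
    return [
        ["dvc", "add", data_path],                   # track modification
        ["git", "add", dvc_file, gitignore],         # add changes to Git
        ["git", "commit", "-m", f"Track {data_path} ({tag})"],
        ["git", "tag", tag],
        ["git", "push", "origin", tag],              # push the Git tag
    ]


commands = provenance_commands("data/ratings.csv", "v1.0")
```

Each entry can be passed to `subprocess.run(cmd, check=True)` inside a repository where both Git and DVC are initialized.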
- Collect data from Kafka Streaming and data preprocessing for movie recommendation model training
- Deploy and measure a model inference service
- Build and operate infrastructure
- A continuous integration infrastructure for evaluating a model in production
- A monitoring infrastructure for system health and model quality
- A continuous deployment infrastructure for automatic periodic retraining and versioning
- Design and implement a monitoring strategy to detect possible issues in ML systems