Event Detection Engine (EDE) for the H2020 Serrano research project. It is based in part on work done during the DICE H2020 project (specifically the DICE Anomaly Detection Platform), the ASPIDE H2020 project and the DIPET CHIST-ERA project.
In the following sections we use the terms events and anomalies somewhat interchangeably. However, we should note that the methods used for detecting anomalies are also applicable to events. The main difference lies in the fact that anomalies pose an additional level of complexity due to their sparse nature; some anomalies might have an occurrence rate well under 0.01%. Event and anomaly detection can be split into several categories based on the methods used and the characteristics of the available data. The simplest form of anomalies are point anomalies, which can be characterized by a single metric (feature). These types of anomalies are fairly easy to detect by applying simple rules (e.g. CPU usage above 70%). Other types of anomalies are more complex but ultimately yield a much deeper understanding of the inner workings of a monitored exascale system or application. These types of anomalies are fairly common in complex systems.
Contextual anomalies are extremely interesting in the case of complex systems. These types of anomalies occur when a certain constellation of feature values is encountered. In isolation these values are not anomalous, but viewed in context they represent an anomaly. Such anomalies may indicate application bottlenecks, imminent hardware failure or software misconfiguration. The last major type of anomaly relevant here is the temporal, or sometimes sequential, anomaly, where a certain event takes place out of order or at an incorrect time. These types of anomalies are very important in systems which have a strong spatio-temporal relationship between features, which is very much the case for exascale metrics.
The Event Detection Engine (EDE) has several sub-components based on a lambda-type architecture comprising a speed, a batch and a serving layer. Because of the heterogeneous nature of most modern computing systems (including exascale systems and mesh networks) and the substantial variety of solutions which could constitute a monitoring service, the data-ingestion component has to be able to fetch data from a plethora of systems. Connectors are implemented to serve as adapters for each solution. Furthermore, this component is also able to load data directly from static files (HDF5, CSV, JSON, or even raw format).
This aids in fine-tuning event and anomaly detection methods. Data ingestion can be done either directly via queries to the monitoring solution or streamed from the queuing service (after ETL, if necessary). This gives us the best chance of reducing the time between an event or anomaly happening and its detection.
The pre-processing component is in charge of taking the raw data from the data-ingestion component and applying several transformations. It handles data formatting (e.g. one-hot encoding), analysis (e.g. statistical information), splitting (e.g. splitting the data into training and validation sets) and finally augmentation (e.g. oversampling and undersampling).
As an example, the analysis and splitting steps are responsible for creating a stratified shuffle split for k-fold cross-validation during training, while the augmentation step might involve under- or oversampling techniques such as ADASYN or SMOTE. This component is also responsible for any feature engineering of the incoming monitoring data.
The training component (batch layer) is used to instantiate and train methods for event and anomaly detection. The end user is able to configure the hyper-parameters of the selected models as well as run automatic optimization on them (e.g. random search, Bayesian search). Users are not only able to set the parameters to be optimized but also to define the objectives of the optimization. More specifically, users can define what should be optimized, including but not limited to predictive performance and transprecise objectives (inference time, computational limitations, model size etc.).
Evaluation of the created predictive model on a holdout set is also handled by this component. Current research and the rankings of machine learning competitions show that an ensemble of different methods may yield statistically better results than single-model predictions. Because of this, ensembling capabilities have to be included.
Finally, the trained and validated models have to be saved in a way that allows them to be easily instantiated and used in a production environment. Several predictive model formats have to be supported, such as PMML, ONNX, HDF5 and JSON.
It is important to note that the task of event and anomaly detection can be broadly split into two main types of machine learning tasks: classification and clustering. Classification methods such as Random Forest, Gradient Boosting, Decision Trees, Naive Bayes and (deep) neural networks are widely used in the field of anomaly and event detection, while for clustering we have methods such as IsolationForest, DBSCAN and Spectral Clustering. Once a predictive model is trained and validated it is saved inside a model repository. Each saved model has metadata attached to it denoting its performance on the holdout set as well as other relevant information such as size, throughput etc.
The prediction component (speed layer) is in charge of retrieving the predictive model from the model repository and feeding it metrics from the monitored system. If and when an event or anomaly is detected, EDE is responsible for signaling this both to the monitoring-service reporting component and to other tools such as the resource manager, scheduler or any decision-support system. Figure 1 also shows that the prediction component gets its data either from the monitoring service via direct query or from the queuing service via the data-ingestion component.
For some situations a rule-based approach is better suited. For these circumstances the prediction component includes a rule engine and a rule repository. Naturally, detection of anomalies or any other events is of little practical significance if there is no way of handling them. There needs to be a component which, once an event has been identified, tries to resolve the underlying issues.
EDE is designed around a YAML-based configuration scheme. This allows the complete configuration of the tool by the end user, with little to no intervention in the source code.
It should be mentioned that some of these features are considered unsafe, as they allow the execution of arbitrary code (for example, the !!python/object/apply tags used to invoke user-defined methods).
The configuration file is split up into several categories:
- Connector - Deals with connection to the data sources
- Mode - Selects the mode of operation for EDE
- Filter - Used for applying filtering on the data
- Augmentation - User defined augmentations on the data
- Training - Settings for training of the selected predictive models
- Detect - Settings for the detection using a pre-trained predictive model
- Point - Settings for point anomaly detection
- Misc - Miscellaneous settings
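A minimal configuration skeleton illustrating how these categories fit together is sketched below. The values are placeholders, only the sections relevant to a given run need to be present, and the nesting of parameters under the top-level section keys is an assumption based on the category list above where the document does not show it explicitly:

# Minimal EDE configuration sketch (placeholder values)
Connector:
  PREndpoint: 10.0.0.10   # illustrative Prometheus host
  MPort: 9200
Mode:
  Training: true
  Detect: false
Filter:
  Fillna: true
Augmentation:
  Scaler:
    StandardScaler:
      copy: true
Training:
  Type: clustering
  Method: isoforest
  Export: my_model        # illustrative model name
Detect:
  Method: isoforest
  Type: clustering
  Load: my_model
Misc:
  checkpoint: true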
The current version of EDE supports three types of data sources: ElasticSearch, Prometheus and CSV/Excel. It also supports reporting mechanisms for ElasticSearch and Kafka: in the former case a new index containing the detected anomalies is created in ElasticSearch, while in the latter the detected anomalies are pushed to a new Kafka topic.
This section's parameters are:
- PREndpoint - Endpoint for fetching Prometheus data
- ESEndpoint - Endpoint for fetching ElasticSearch data
- MPort - Sets the monitoring port for the selected Endpoint (defaults to 9200)
- KafkaEndpoint - Endpoint for a pre-existing Kafka deployment
- KafkaPort - Sets the Kafka port for the selected Kafka Endpoint (defaults to 9092)
- KafkaTopic - Name of the kafka topic to be used
- Query - The query string to be used for fetching data:
  - In the case of ElasticSearch please consult the official documentation.
  - In the case of Prometheus please consult the official documentation.
    - For fetching all queryable data:
      {"query": '{__name__=~"node.+"}[1m]'}
    - For fetching specific metric data:
      { "query": 'node_disk_written_bytes_total[1m]'}
- MetricsInterval - Metrics datapoint interval definition
- QSize - Size in MB of the data to be fetched (only if ESEndpoint is used)
  - For no limit use `QSize: 0`
- Index - The name of the column to be set as index
  - The column has to have unique values; by default it is set to the column denoting the time when the metric was read
- QDelay - Polling period for metrics fetching
- Dask
  - SchedulerEndpoint - Denotes the Dask scheduler endpoint
    - If no pre-deployed Dask instance is available, EDE can deploy a local scheduler by setting this parameter to `local`
  - SchedulerPort - Port for the Dask scheduler
  - Scale - Sets the number of workers if the `local` scheduler is used
  - EnforceCheck - If set to true, EDE checks that the library versions in the Python environment of each Dask worker match those of the originating source
    - If this check fails the job will exit with an error message
    - This parameter can be omitted in the case of local deployments
- Local - Path to a CSV or Excel file to be used
Notes:
- Only one type of connector endpoint (PREndpoint or ESEndpoint) is supported at any given time.
- If Local is defined then all other data sources are ignored.
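Putting these parameters together, a sketch of a full Connector section is shown below. The endpoints are illustrative; the parameter values mirror the connector configuration returned by the EDE Service REST API (see GET /v1/config/connector later in this document), and the top-level Connector key is assumed from the category list:

# Connector sketch (illustrative endpoints)
Connector:
  PREndpoint: 10.0.0.10          # illustrative Prometheus host
  MPort: 9200
  KafkaEndpoint: 10.0.0.11       # illustrative Kafka host
  KafkaPort: 9092
  KafkaTopic: edetopic
  Query: { "query": '{__name__=~"node.+"}[1m]' }
  MetricsInterval: 1m
  QSize: 0
  Index: time
  QDelay: 10s
  Dask:
    SchedulerEndpoint: local
    SchedulerPort: 8787
    Scale: 3
    EnforceCheck: false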
The following settings select the mode in which EDE operates. There are three modes available in this version: Training, Validate and Detect.
- Training - If set to true a Dask worker or Python process for training is started
- Validate - If set to true a Dask worker or Python process for validation is started
- Detect - If set to true a Dask worker or Python process for detection is started
Notes:
- In the case of a local Dask deployment it is advised to have at least 3 workers started (see the Scale parameter in the previous section).
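As a short sketch (assuming the three flags nest under the Mode section from the category list), a configuration that only runs training would look like:

Mode:
  Training: true
  Validate: false
  Detect: false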
Once the data has been loaded by the EDE connector it is transformed into DataFrames. The data in these DataFrames can be filtered using the parameters listed below:
- Columns - Listing of columns which are to remain
- Rows - Filters rows in a given bound
- gd - Lower bound
- ld - Upper bound
- DColumns - list of columns to be deleted (dropped)
- Dlist - Expects an external YAML file containing a list of columns to be dropped (useful for removing a large number of features)
- Fillna - Fills `None` values with `0`
- Dropna - Deletes columns with `None` values
- LowVariance - Used to detect and remove low-variance features automatically
- DWild - Removes columns based on a regular expression
  - Regex - Regex to be used for filtering
  - Keep - If `True`, all selected columns are kept and the rest are dropped; otherwise the selected columns are dropped.
Notes:
- Some machine learning models cannot deal with `None` values; to this end the Fillna and Dropna parameters were introduced. It is important to note that Dropna will drop any column which has at least one `None` value.
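A sketch of a Filter section combining the parameters above. Column names are reused from the examples elsewhere in this document, and the Rows bounds are illustrative timestamps:

# Filter sketch (illustrative values)
Filter:
  Columns:
    - node_load1_10.211.55.101:9100
    - node_memory_Cached_bytes_10.211.55.101:9100
  Rows:
    gd: 1607982536   # illustrative lower bound
    ld: 1607982936   # illustrative upper bound
  DColumns:
    Dlist: data/low_variance.yaml
  Fillna: true
  DWild:
    Regex: 'node_network.*'
    Keep: false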
The following parameters define augmentations to be executed on the loaded DataFrames. The augmentations are chained together as defined by the user. The available parameters are:
- Scaler - Scaling/Normalizing the data. If not defined no scaler is used
  - ScalerType - We currently support all scaler types from scikit-learn; please consult the official documentation for further details.
    - When using a scikit-learn scaler, the exact name of the scaler must be given, followed by its parameters. Below is an example using the StandardScaler:

Scaler:
  StandardScaler:
    copy: True
    with_mean: True
    with_std: True
- Operations - Set of predefined operations that can be executed
  - STD - Calculates the standard deviation; expects a name and a list of metrics to use
  - Mean - Calculates the mean; expects a name and a list of metrics to use
  - Median - Calculates the median; expects a name and a list of metrics to use
- Example usage:
Operations:
  STD:
    - cpu_load1:
        - node_load1_10.211.55.101:9100
        - node_load1_10.211.55.102:9100
        - node_load1_10.211.55.103:9100
    - memory:
        - node_memory_Active_anon_bytes_10.211.55.101:9100
        - node_memory_Active_anon_bytes_10.211.55.101:9100
        - node_memory_Active_anon_bytes_10.211.55.101:9100
  Mean:
    - network_flags:
        - node_network_flags_10.211.55.101:9100
        - node_network_flags_10.211.55.102:9100
        - node_network_flags_10.211.55.103:9100
    - network_out:
        - node_network_mtu_bytes_10.211.55.101:9100
        - node_network_mtu_bytes_10.211.55.102:9100
        - node_network_mtu_bytes_10.211.55.103:9100
  Median:
    - memory_file:
        - node_memory_Active_file_bytes_10.211.55.101:9100
        - node_memory_Active_file_bytes_10.211.55.102:9100
        - node_memory_Active_file_bytes_10.211.55.103:9100
    - memory_buffered:
        - node_memory_Buffers_bytes_10.211.55.101:9100
        - node_memory_Buffers_bytes_10.211.55.102:9100
        - node_memory_Buffers_bytes_10.211.55.103:9100
- RemoveFiltered - If set to `True`, the metrics used during these operations will be deleted, with only the resulting augmented columns remaining
- Method - Expects user-defined augmentations (i.e. Python functions) for feature engineering
- Methods should be wrapped as can be seen in the wrapper_add_columns example
- All keyword arguments should be passable to the wrapped function
- Here is an example of a user defined method invocation:
Method: !!python/object/apply:edeuser.user_methods.wrapper_add_columns # user defined operation
  kwds:
    columns: !!python/tuple [node_load15_10.211.55.101:9100, node_load15_10.211.55.102:9100]
    column_name: sum_load15
- Categorical - Expects a list of categorical columns; if not defined, EDE will try to automatically detect categorical columns
- OH - If set to `True`, one-hot encoding is used for categorical features
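A short sketch of the categorical parameters, assuming they sit alongside Scaler and Operations under the Augmentation section; the column name is reused from the examples above for illustration:

Augmentation:
  Categorical:
    - node_network_flags_10.211.55.101:9100
  OH: True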
The following parameters are used to set up training mode and machine learning model selection and initialization.
- Type - Sets the type of machine learning problem. Currently supported are: clustering, classification, hpo and tpot.
- Method - Sets the machine learning method to be used. We support all scikit-learn based models as well as models from other machine learning libraries which follow scikit-learn API conventions, such as TensorFlow, Keras, LightGBM, XGBoost, CatBoost etc.
- Export - Name of the predictive model to be exported (serialized)
- MethodSettings - Settings dependent on the selected machine learning method.
- Target - Denotes the ground-truth column name to be used. This is mandatory in the case of classification. If no `target` is defined, the last column is used instead.
In the case of classification, several additional options are available:
- Verbose - Will save a full classification report, confusion matrix and feature importance (if applicable) for all folds
- PrecisionRecallCurve - Will plot the Precision Recall curve of the selected model
- ROCAUC - Will plot the ROCAUC for the selected model
- RFE - Will execute and plot recursive feature elimination. It will save a YAML file containing a list of features to be eliminated, usable via Dlist from DColumns.
  - scorer - Defines the scorer to be used
  - step - Defines the step for feature elimination. If there are a lot of features in the data, this can take a long time to execute; in that case a larger step is advised.
- DecisionBoundary - Will plot the decision boundary after executing PCA with 2 components. For a large number of classes the dimensionality reduction can result in noisy plots.
- LearningCurve - Shows the relationship between model performance and the number of features/training samples
  - sizes - Used to define the training-sample sizes for plotting; can be a list or a generator function, as seen in the example below
  - scorer - Scorer to be used
  - n_jobs - Number of jobs to be executed; if the Dask backend is used it will handle the scheduling of these jobs
- ValidationCurve - Used to fine-tune a specific hyper-parameter, checking out-of-sample performance
  - param_name - Name of the hyper-parameter to be optimized
  - param_range - Range of values to check (the list can be generated using generator functions, same as for LearningCurve)
  - scorer - Scorer to be used
  - n_jobs - Number of jobs to be executed; if the Dask backend is used it will handle the scheduling of these jobs
Note: when training an unsupervised method, EDE will by default generate decision-boundary and feature-separation plots for the selected model.
Example for clustering:
# Clustering example
Training:
Type: clustering
Method: isoforest
Export: clustering_1
MethodSettings:
n_estimators: 10
max_samples: 10
contamination: 0.1
verbose: True
bootstrap: True
Example for classification:
Training:
Type: classification
Method: randomforest
Export: classifier_1
MethodSettings:
n_estimators: 10
max_samples: 10
verbose: True
bootstrap: True
Target: target
LearningCurve:
sizes: !!python/object/apply:numpy.core.function_base.linspace
kwds:
start: 0.3
stop: 1.0
num: 10
scorer: f1_weighted
n_jobs: 5
ValidationCurve:
param_name: n_estimators
param_range:
- 10
- 20
- 60
- 100
- 200
- 600
scoring: f1_weighted
n_jobs: 8
PrecisionRecallCurve: 1
ROCAUC: 1
RFE:
scorer: f1_weighted
step: 10
DecisionBoundary: 1
Verbose: 1
Similar to how users can add their own implementations for augmentations, it is also possible to add custom machine learning methods. An example implementation, user_iso, can be found in the edeuser module.
The wrapper function should expose all parameters which are necessary for the defined method, and its return value should be an object which abides by scikit-learn API conventions.
Example for user defined method:
# User defined clustering custom
Training:
Type: clustering
Method: !!python/object/apply:edeuser.user_methods.user_iso
kwds:
n_estimators: 100
contamination: auto
max_features: 1
n_jobs: 2
warm_start: False
random_state: 45
bootstrap: True
verbose: True
max_samples: 1
Export: clustering_2
EDE supports a variety of cross-validation methods (all from scikit-learn). The parameters are as follows:
- CV - If an integer is used, standard (scikit-learn) CV is performed, with the integer representing the number of folds.
- Type - Required if a scikit-learn or user-defined CV method is used instead of an integer.
- Params - Parameters for the CV method
For defining simple CV with 5 folds:
CV: 5
For defining CV using a specific method such as StratifiedKFold:
CV:
Type: StratifiedKFold # user defined all from sklearn
Params:
n_splits: 5
shuffle: True
random_state: 5
EDE supports the inclusion of different scoring methods: all scikit-learn scoring methods as well as user-defined ones. The scoring functions are defined using the following parameters:
- Scorers
  - Scorer_list - List of scorers to be used
    - Scorer
      - Scorer_name - User-defined name of the scorer
      - skScorer - Scikit-learn scorer name

In the case of user-defined scorers, the user has to define a key/value pair: `scorer_name` and `scorer instance`.
An example Scorer chain definition can be seen below:
Scorers:
Scorer_list:
- Scorer:
Scorer_name: F1_weighted
skScorer: f1_weighted
- Scorer:
Scorer_name: Jaccard_Index
skScorer: jaccard_weighted # changes in scoring sklearn, for multiclass add suffix micro, weighted or sample
- Scorer:
Scorer_name: AUC
skScorer: roc_auc_ovr_weighted
User_scorer1: balanced_accuracy_score # key is user defined, can be changed same as Scorer_name
EDE also supports hyper-parameter optimization (HPO) methods such as grid and random search, Bayesian and evolutionary search, and the TPOT framework. The following parameters are used for HPO:
- HPOMethod - Name of the hyper-parameter optimization method to be used
- HPOParam - HPO parameters:
  - n_iter - Number of iterations (ignored in the case of grid search)
  - n_jobs - Number of threads (`-1` for all available)
  - refit - Name of the scoring metric used to determine the best-performing hyper-parameters. If multiple metrics are used, refit must be set to one metric name (mandatory)
  - verbose - If set to true, metrics about each iteration are output
- ParamDistribution - A dictionary whose keys are parameter names and whose values are lists (or Python code which generates a list according to a particular distribution) of values to try:
ParamDistribution:
  n_estimators:
    - 10
    - 100
  max_depth:
    - 2
    - 3
Example of HPO including CV and Scorers examples:
# For HPO methods
Training:
Type: hpo
HPOMethod: Random # random, grid, bayesian, tpot
HPOParam:
n_iter: 2
n_jobs: -1
refit: Balanced_Acc # if multi metric used, refit should be metric name, mandatory
verbose: True
Method: randomforest
ParamDistribution:
n_estimators:
- 10
- 100
max_depth:
- 2
- 3
Target: target
Export: hpo_1
CV:
Type: StratifiedKFold # user defined all from sklearn
Params:
n_splits: 5
shuffle: True
random_state: 5
Scorers:
Scorer_list:
- Scorer:
Scorer_name: AUC
skScorer: roc_auc
- Scorer:
Scorer_name: Jaccard_Index
skScorer: jaccard
- Scorer:
Scorer_name: Balanced_Acc
skScorer: balanced_accuracy
User_scorer1: f1_score # key is user defined, can be changed same as Scorer_name
Example of HPO using the evolutionary search method:
Training:
Type: hpo
HPOMethod: Evol # Random, Grid, Bayesian, tpot, Evol
HPOParam:
n_jobs: 1 # must be number, not -1 for all in case of Evol
scoring: f1_weighted
gene_mutation_prob: 0.20
gene_crossover_prob: 0.5
tournament_size: 4
generations_number: 30
population_size: 40 # if multi metric used, refit should be metric name, mandatory
verbose: 4
Method: randomforest
ParamDistribution:
n_estimators:
- 10
- 100
max_depth:
- 2
- 3
Target: target
Export: hpo_1_y2
CV:
Type: StratifiedKFold # user defined all from sklearn
Params:
n_splits: 5
shuffle: True
random_state: 5
Scorers:
Scorer_list:
- Scorer:
Scorer_name: F1_weighted
skScorer: f1_weighted
- Scorer:
Scorer_name: Jaccard_Index
skScorer: jaccard_weighted # changes in scoring sklearn, for multiclass add suffix micro, weighted or sample
- Scorer:
Scorer_name: AUC
skScorer: roc_auc_ovr_weighted
User_scorer1: balanced_accuracy_score
TPOT is an automated machine learning framework designed around scikit-learn (and any other framework which conforms to the scikit-learn API conventions). In contrast to other such tools it does not focus solely on the hyper-parameters of machine learning models but tries to optimize the pre- and post-processing methods as well. It does this by generating scikit-learn pipelines. The optimization is based on a genetic programming stochastic global search procedure. TPOT parameters are as follows:
- TPOTParam
  - generations - Number of generations to run
  - population_size - Size of the population (candidate configurations)
  - offspring_size - Number of new members added to the population at each generation
  - mutation_rate - Mutation rate to be used by the genetic algorithm
  - crossover_rate - Value defining the crossover used when generating new offspring
  - scoring - Scoring function to be used; it differs from scikit-learn's, see the TPOT documentation for details about TPOT scoring
  - max_time_mins - Limits the time for computing a generation
  - max_eval_time_mins - Limits the amount of time for evaluating a single pipeline
  - random_state - Random seed, enabling consistency between experiments
  - n_jobs - Number of concurrent jobs; if set to `-1` it will enable an unlimited number of potential jobs
  - verbosity - Logging detail
  - config_dict - Sets the range of methods available when building pipelines from a population. Possible values are:
    - Default - Includes all scikit-learn methods
    - TPOT light - A restricted range of methods
    - TPOT MDR - Extended feature selectors and multi-dimensional reduction models
  - use_dask - Use the Dask backend to run phenotypes from the population (i.e. each Dask worker runs a phenotype)
Example of TPOT usage:
# TPOT Optimizer
Training:
Type: tpot
TPOTParam:
generations: 2
population_size: 2
offspring_size: 2
mutation_rate: 0.9
crossover_rate: 0.1
scoring: balanced_accuracy # Scoring different from HPO check TPOT documentation
max_time_mins: 1
max_eval_time_mins: 5
random_state: 42
n_jobs: -1
verbosity: 2
config_dict: TPOT light # "TPOT light", "TPOT MDR", "TPOT sparse" or None
use_dask: True
Target: target
Export: tpotopt
CV:
Type: StratifiedKFold # user defined all from sklearn
Params:
n_splits: 5
shuffle: True
random_state: 5
Notes:
- Both HPO and TPOT are heavily based on Dask and utilize Dask workers for running different hyper-parameter configurations. Because of this it is recommended to use a pre-existing distributed Dask cluster.
- In contrast to other methods, TPOT returns the entire pipeline, not just the predictive model.
Prediction is largely unchanged between the various EDE modes. Its parameters are:
- Method - Name of the predictive model type used
- Type - Specifies what type the model is (i.e. clustering, classification, tpot etc.)
- Load - Name of the serialized predictive model to be instantiated. See the Export parameter from training.
- Scaler - Name of the scaler (if used). Once the scaler has been invoked during training, the result is serialized by EDE and can be reused for prediction.
- Analysis - Will attach root-cause analysis, in the form of computed Shapley values and feature importance, to all detected anomalous instances.
  - Plot - If set to `True`, plots are generated for each detected anomalous instance:
    - Clustering: feature importance, summary and heatmap
    - Classification: force, summary
Example of a prediction:
Detect:
Method: isoforest
Type: clustering
Load: clustering_1
Scaler: StandardScaler # Same as for training
#Analysis: True
Analysis: # if plotting of heatmap, summary and feature importance is required; if not, set to False or use the previous example
Plot: True
EDE is capable of running any user-defined analysis method on the data. Users can add data-exploration methods. Its parameters are:
- Analysis
  - Methods - List of methods to be used
    - Method - Information required for the instantiation of user methods (including keyword arguments)
  - Solo - If set to true, only the analysis is run and any other training or prediction tasks are ignored.
Example analysis implementations included in EDE range from Pearson correlation and line plots to ranking, PCA and manifold projections:
# Analysis example
Analysis:
Methods:
- Method: !!python/object/apply:edeuser.user_methods.wrapper_analysis_corr
kwds:
name: Pearson1
annot: False
cmap: RdBu_r
columns:
- node_load1_10.211.55.101:9100
- node_load1_10.211.55.102:9100
- node_load1_10.211.55.103:9100
- node_memory_Cached_bytes_10.211.55.101:9100
- node_memory_Cached_bytes_10.211.55.102:9100
- node_memory_Cached_bytes_10.211.55.103:9100
- time
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
- Method: !!python/object/apply:edeuser.user_methods.wrapper_analysis_plot
kwds:
name: line1
columns:
- node_load1_10.211.55.101:9100
- node_load1_10.211.55.102:9100
- node_load1_10.211.55.103:9100
- time
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
- Method: !!python/object/apply:edeuser.user_methods.wrapper_improved_pearson
kwds:
name: Test_Training
dcol:
- target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
show: False
- Method: !!python/object/apply:edeuser.user_methods.wrapper_rank2
kwds:
name: Test_rank
dcol:
- target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
algorithm: spearman
show: False
- Method: !!python/object/apply:edeuser.user_methods.wrapper_rank1
kwds:
name: Test_rank1
dcol:
- target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
algorithm: shapiro
- Method: !!python/object/apply:edeuser.user_methods.wrapper_pca_plot
kwds:
name: Test_PCA
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
projection: 3
target: target
# show: False
- Method: !!python/object/apply:edeuser.user_methods.wrapper_manifold
kwds:
name: Test_manifold
target: target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
manifold: tsne
n_neighbors: 10
- Method: !!python/object/apply:edeuser.user_methods.wrapper_manifold
kwds:
name: Test_manifold
target: target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
manifold: hessian
- Method: !!python/object/apply:edeuser.user_methods.wrapper_plot_on_features
kwds:
name: complete_columns
target: target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
Solo: True
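The Point section defines simple bound-based rules for point anomaly detection, grouped per metric category; as in the Filter section, gd denotes the lower bound and ld the upper bound. An example Point configuration: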
Point:
Memory:
cached:
gd: 231313
ld: 312334
buffered:
gd: 231313
ld: 312334
used:
gd: 231313
ld: 312334
Load:
shortterm:
gd: 231313
ld: 312334
midterm:
gd: 231313
ld: 312334
Network:
tx:
gd: 231313
ld: 312334
rx:
gd: 231313
ld: 312334
Miscellaneous settings:
- heap - Size of the JVM heap used for Weka-based methods (now deprecated, will be removed in the next version)
- checkpoint - All filtering and augmentation steps can be set to save their results to disk, so that in case of failure processing can be resumed from the last successfully executed step of the EDE processing pipeline
- delay - Sets how often new data is to be fetched from the data source. Same as QDelay
- interval - Query interval to be used when generating query strings. It is ignored if a user-defined query string is used
- resetindex - Deletes the anomaly index in case ElasticSearch is used for reporting
- point - Toggles point anomaly execution (now deprecated, will be removed in the next version)
Misc:
heap: 512m
checkpoint: True
delay: 15s
interval: 30m
resetindex: False
point: False
Complete example configurations are provided for the following scenarios:
- EDE Analysis
- EDE Clustering
- EDE Clustering user defined
- EDE Clustering Prediction
- EDE Classification
- EDE Classification Predicton
- EDE HPO
- EDE TPOT
- EDE TPOT Predict
The EDE Service is designed to offer a REST API for EDE. The current version only supports executing inference and not training.
GET /v1/config
Returns the current version of the configuration file. See EDE Configuration for more details.
PUT /v1/config
Uploads a new configuration file in yaml format. See EDE Configuration for more details.
GET /v1/config/augmentation
Returns the current augmentation configuration.
{
"Scaler": {
"StandardScaler": {
"copy": true,
"with_mean": true,
"with_std": true
}
}
}
PUT /v1/config/augmentation
Modifies augmentation part of the configuration.
GET /v1/config/connector
Returns the current connector configuration.
{
"Dask": {
"EnforceCheck": false,
"Scale": 3,
"SchedulerEndpoint": "local",
"SchedulerPort": 8787
},
"Index": "time",
"KafkaEndpoint": "10.9.8.136",
"KafkaPort": 9092,
"KafkaTopic": "edetopic",
"MPort": 9200,
"MetricsInterval": "1m",
"PREndpoint": "194.102.62.155",
"QDelay": "10s",
"QSize": 0,
"Query": {
"query": "{__name__=~\"node.+\"}[1m]"
}
}
PUT /v1/config/connector
Modifies connector part of the configuration.
GET /v1/config/filter
Returns the current filter configuration.
{
"DColumns": {
"Dlist": "..data/low_variance.yaml"
},
"Dropna": true,
"Fillna": true
}
PUT /v1/config/filter
Modifies filter part of the configuration.
GET /v1/config/inference
Returns the current inference configuration.
{
"Analysis": {
"Plot": true
},
"Load": "cluster_y2_v3",
"Method": "IForest",
"Scaler": "StandardScaler",
"Type": "clustering"
}
PUT /v1/config/inference
Modifies inference part of the configuration.
These resources deal with local data handling.
GET /v1/data
Returns a list of local data files. Currently only txt, csv, xlsx and json files are supported.
{
"files": [
"serrano_test_cluster.csv"
]
}
GET /v1/data/{data_file}
This resource fetches the datafile denoted by the data_file parameter.
PUT /v1/data/{data_file}
This resource allows external files to be uploaded to the EDE service. Currently only txt, csv, xlsx and json files are supported. Note that the data_file parameter must match the name of the file being uploaded.
POST /v1/inference
Starts the inference job using EDE based on the current configuration file.
GET /v1/logs
Returns EDE Service logs.
These resources are used to control RQ workers which wrap individual EDE instances. Each EDE instance can use a Dask cluster (local or remote).
GET /v1/service/jobs
Returns information about jobs from the service. It contains the unique IDs for 4 types of jobs: failed, finished, queued and started. An example response can be seen below:
{
"failed": [],
"finished": [
"a9784914-165c-488a-b5a6-7c58c6b421e6"
],
"queued": [],
"started": []
}
GET /v1/service/jobs/{job_id}
Returns information about a specific job denoted by its unique id (<job_uuid>). Some meta information is also included in the response as reported by the background process. This resource can be used to periodically check whether a particular job is finished. An example response can be seen below:
{
"finished": true,
"meta": {
"progress": "Finished inference"
},
"status": "finished"
}
GET /v1/service/jobs/worker
Returns a list of workers from the current service instance. The list also includes workers which are no longer active; see the status field in the response. Other information about the workers includes their unique id and their pid from the operating system. An example response can be seen below:
{
"workers": [
{
"id": "e3b0c442-98fc-11e7-8f38-2b66f5e7a637",
"pid": 1,
"status": "idle"
},
{
"id": "e3b0c442-98fc-11e7-8f38-2b66f5e7a638",
"pid": 2,
"status": "idle"
}
]
}
POST /v1/service/jobs/worker
Every time a POST request is issued to this resource it will start a background worker. The maximum number of workers depends on the number of physical CPU cores available. An example response can be seen below:
{
"status": "workers started"
}
If the maximum number of workers has been reached the following response will be given:
{
"warning": "maximum number of workers active!",
"workers": 4
}
DELETE /v1/service/jobs/worker
This resource enables the halting of workers. It needs to be accessed once for each worker that is to be stopped.
We use several environment variables to configure the service. The following variables are required:
- EDE_HOST - The host of the ede-service
- EDE_PORT - The port of the ede-service
- EDE_DEBUG - The debug level of the ede-service
- REDIS_END - The endpoint for the Redis queue
- REDIS_PORT - The port for the Redis queue
- WORKER_THRESHOLD - Threshold modifier for the number of supported workers; by default it is twice the number of CPU cores
- RQ_TIMEOUT - Timeout for RQ workers, default 3600 seconds

All environment variables have default values in accordance with the libraries used.
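As an illustration only, a hypothetical docker-compose style snippet wiring these variables is sketched below; the image names, port and debug values are assumptions, not part of EDE:

# Hypothetical deployment sketch; image names and values are illustrative
services:
  ede-service:
    image: ede-service:latest   # assumed image name
    environment:
      EDE_HOST: 0.0.0.0
      EDE_PORT: "5000"          # assumed port
      EDE_DEBUG: "0"
      REDIS_END: redis
      REDIS_PORT: "6379"
      WORKER_THRESHOLD: "2"
      RQ_TIMEOUT: "3600"
  redis:
    image: redis:6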
This is a work in progress and is not yet ready for production use. Additional features are being added and the API is subject to change. These features include:
- Integration with Serrano Telemetry System
- Support for training
- Support for more data sources
- Support for input-data validation and examples in Swagger
- Support for additional anomaly reporting (currently only via Kafka topic)