Event Detection Engine (EDE) for the H2020 Serrano research project. It is based in part on work done during the DICE H2020 project (specifically the DICE Anomaly Detection Platform), the ASPIDE H2020 project and the DIPET CHIST-ERA project.
In the following sections we use the terms events and anomalies somewhat interchangeably. However, we should note that the methods used for detecting anomalies are also applicable to events. The main difference lies in the fact that anomalies pose an additional level of complexity due to their sparse nature; some anomalies might have an occurrence rate well under 0.01%. Event and anomaly detection can be split into several categories based on the methods used and the characteristics of the available data. The simplest form of anomalies are point anomalies, which can be characterized by a single metric (feature). These types of anomalies are fairly easy to detect by applying simple rules (e.g. CPU usage above 70%). Other types of anomalies are more complex but ultimately yield a much deeper understanding of the inner workings of a monitored exascale system or application. These types of anomalies are fairly common in complex systems.
Contextual anomalies are extremely interesting in the case of complex systems. These types of anomalies occur when a certain constellation of feature values is encountered. In isolation these values are not anomalous, but viewed in context they represent an anomaly. Such anomalies may indicate application bottlenecks, imminent hardware failure or software misconfiguration. The last major type of anomaly relevant here is the temporal, or sometimes sequential, anomaly, where a certain event takes place out of order or at an incorrect time. These types of anomalies are very important in systems which have a strong spatio-temporal relationship between features, which is very much the case for exascale metrics.
The Event Detection Engine (EDE) has several sub-components based on a lambda-type architecture comprising a speed, a batch and a serving layer. Because of the heterogeneous nature of most modern computing systems (including exascale systems and mesh networks) and the substantial variety of solutions which could constitute a monitoring service, the data-ingestion component has to be able to fetch data from a plethora of systems. Connectors are implemented to serve as adapters for each solution. Furthermore, this component is also able to load data directly from static files (HDF5, CSV, JSON, or even raw format).
This aids in fine-tuning event and anomaly detection methods. Data ingestion can be done either directly via queries to the monitoring solution or streamed from the queuing service (after ETL, if necessary). This gives us the best chance of reducing the time between an event or anomaly happening and its detection.
The pre-processing component is in charge of taking the raw data from the data-ingestion component and applying several transformations. It handles data formatting (e.g. one-hot encoding), analysis (e.g. statistical information), splitting (e.g. splitting the data into training and validation sets) and finally augmentation (e.g. oversampling and undersampling).
As an example, the analysis and splitting steps are responsible for creating a stratified shuffle split for k-fold cross-validation during training, while the augmentation step might involve under- or oversampling techniques such as ADASYN or SMOTE. This component is also responsible for any feature engineering of the incoming monitoring data.
The training component (batch layer) is used to instantiate and train methods for event and anomaly detection. The end user is able to configure the hyper-parameters of the selected models as well as run automatic optimization on them (e.g. random search, Bayesian search). Users are not only able to set the parameters to be optimized but also to define the objectives of the optimization. More specifically, users can define what should be optimized, including but not limited to predictive performance and transprecise objectives (inference time, computational limitations, model size etc.).
Evaluation of the created predictive model on a holdout set is also handled by this component. Current research and the rankings of machine learning competitions show that an ensemble of different methods may yield statistically better results than single-model predictions. Because of this, ensembling capabilities have to be included.
Finally, the trained and validated models have to be saved in a way that allows them to be easily instantiated and used in a production environment. Several predictive model formats have to be supported, such as PMML, ONNX, HDF5 and JSON.
It is important to note that the task of event and anomaly detection can be broadly split into two main types of machine learning tasks: classification and clustering. Classification methods such as Random Forest, Gradient Boosting, Decision Trees, Naive Bayes and (deep) neural networks are widely used in the field of anomaly and event detection, while for clustering we have methods such as IsolationForest, DBSCAN and Spectral Clustering. Once a predictive model is trained and validated it is saved inside a model repository. Each saved model has metadata attached to it denoting its performance on the holdout set as well as other relevant information such as size, throughput etc.
The prediction component (speed layer) is in charge of retrieving the predictive model from the model repository and feeding it metrics from the monitored system. If and when an event or anomaly is detected, EDE is responsible for signaling this both to the monitoring-service reporting component and to other tools such as the resource manager, scheduler or any decision-support system. Figure 1 also shows that the prediction component gets its data either from the monitoring service via direct query or from the queuing service via the data-ingestion component.
For some situations a rule-based approach is better suited. For these circumstances the prediction component includes a rule engine and a rule repository. Naturally, detection of anomalies or any other events is of little practical significance if there is no way of handling them. There needs to be a component which, once an event has been identified, tries to resolve the underlying issues.
EDE is designed around a YAML-based configuration scheme. This allows the complete configuration of the tool by the end user, with little to no intervention in the source code.
It should be mentioned that some of these features are considered unsafe, as they allow the execution of arbitrary code (for example, the !!python/object/apply tags used to invoke user-defined methods).
The configuration file is split up into several categories:
- Connector - Deals with connection to the data sources
- Mode - Selects the mode of operation for EDE
- Filter - Used for applying filtering on the data
- Augmentation - User defined augmentations on the data
- Training - Settings for training of the selected predictive models
- Detect - Settings for the detection using a pre-trained predictive model
- Point - Settings for point anomaly detection
- Misc - Miscellaneous settings
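A minimal configuration skeleton illustrating how these categories fit together is sketched below. The values are placeholders, only the sections relevant to a given run need to be present, and the nesting of parameters under the top-level section keys is an assumption based on the category list above where the document does not show it explicitly:

# Minimal EDE configuration sketch (placeholder values)
Connector:
  PREndpoint: 10.0.0.10   # illustrative Prometheus host
  MPort: 9200
Mode:
  Training: true
  Detect: false
Filter:
  Fillna: true
Augmentation:
  Scaler:
    StandardScaler:
      copy: true
Training:
  Type: clustering
  Method: isoforest
  Export: my_model        # illustrative model name
Detect:
  Method: isoforest
  Type: clustering
  Load: my_model
Misc:
  checkpoint: true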
The current version of EDE supports three types of data sources: ElasticSearch, Prometheus and CSV/Excel. It also supports reporting mechanisms for ElasticSearch and Kafka: in the former case a new index containing the detected anomalies is created in ElasticSearch, while in the latter the detected anomalies are pushed to a new Kafka topic.
This section's parameters are:
- PREndpoint - Endpoint for fetching Prometheus data
- ESEndpoint - Endpoint for fetching ElasticSearch data
- MPort - Sets the monitoring port for the selected Endpoint (defaults to 9200)
- KafkaEndpoint - Endpoint for a pre-existing Kafka deployment
- KafkaPort - Sets the Kafka port for the selected Kafka Endpoint (defaults to 9092)
- KafkaTopic - Name of the kafka topic to be used
- Query - The query string to be used for fetching data:
  - In the case of ElasticSearch please consult the official documentation.
  - In the case of Prometheus please consult the official documentation.
    - For fetching all queryable data:
      {"query": '{__name__=~"node.+"}[1m]'}
    - For fetching specific metric data:
      { "query": 'node_disk_written_bytes_total[1m]'}
- MetricsInterval - Metrics datapoint interval definition
- QSize - Size in MB of the data to be fetched (only if ESEndpoint is used)
  - For no limit use `QSize: 0`
- Index - The name of the column to be set as index
  - The column has to have unique values; by default it is set to the column denoting the time when the metric was read
- QDelay - Polling period for metrics fetching
- Dask
  - SchedulerEndpoint - Denotes the Dask scheduler endpoint
    - If no pre-deployed Dask instance is available, EDE can deploy a local scheduler by setting this parameter to `local`
  - SchedulerPort - Port for the Dask scheduler
  - Scale - Sets the number of workers if the `local` scheduler is used
  - EnforceCheck - If set to true, EDE checks that the library versions in the Python environment of each Dask worker match those of the originating source
    - If this check fails the job will exit with an error message
    - This parameter can be omitted in the case of local deployments
- Local - Path to a CSV or Excel file to be used
Notes:
- Only one type of connector endpoint (PREndpoint or ESEndpoint) is supported at any given time.
- If Local is defined then all other data sources are ignored.
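Putting these parameters together, a sketch of a full Connector section is shown below. The endpoints are illustrative; the parameter values mirror the connector configuration returned by the EDE Service REST API (see GET /v1/config/connector later in this document), and the top-level Connector key is assumed from the category list:

# Connector sketch (illustrative endpoints)
Connector:
  PREndpoint: 10.0.0.10          # illustrative Prometheus host
  MPort: 9200
  KafkaEndpoint: 10.0.0.11       # illustrative Kafka host
  KafkaPort: 9092
  KafkaTopic: edetopic
  Query: { "query": '{__name__=~"node.+"}[1m]' }
  MetricsInterval: 1m
  QSize: 0
  Index: time
  QDelay: 10s
  Dask:
    SchedulerEndpoint: local
    SchedulerPort: 8787
    Scale: 3
    EnforceCheck: false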
The following settings select the mode in which EDE operates. There are three modes available in this version: Training, Validate and Detect.
- Training - If set to true a Dask worker or Python process for training is started
- Validate - If set to true a Dask worker or Python process for validation is started
- Detect - If set to true a Dask worker or Python process for detection is started
Notes:
- In the case of a local Dask deployment it is advised to have at least 3 workers started (see the Scale parameter in the previous section).
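As a short sketch (assuming the three flags nest under the Mode section from the category list), a configuration that only runs training would look like:

Mode:
  Training: true
  Validate: false
  Detect: false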
Once the data has been loaded by the EDE connector it is transformed into DataFrames. The data in these DataFrames can be filtered using the parameters listed below:
- Columns - Listing of columns which are to remain
- Rows - Filters rows in a given bound
- gd - Lower bound
- ld - Upper bound
- DColumns - list of columns to be deleted (dropped)
- Dlist - Expects an external YAML file containing a list of columns to be dropped (useful for removing a large number of features)
- Fillna - Fills `None` values with `0`
- Dropna - Deletes columns with `None` values
- LowVariance - Used to detect and remove low-variance features automatically
- DWild - Removes columns based on a regular expression
  - Regex - Regex to be used for filtering
  - Keep - If `True`, all selected columns are kept and the rest are dropped; otherwise the selected columns are dropped.
Notes:
- Some machine learning models cannot deal with `None` values; to this end the Fillna and Dropna parameters were introduced. It is important to note that Dropna will drop any column which has at least one `None` value.
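A sketch of a Filter section combining the parameters above. Column names are reused from the examples elsewhere in this document, and the Rows bounds are illustrative timestamps:

# Filter sketch (illustrative values)
Filter:
  Columns:
    - node_load1_10.211.55.101:9100
    - node_memory_Cached_bytes_10.211.55.101:9100
  Rows:
    gd: 1607982536   # illustrative lower bound
    ld: 1607982936   # illustrative upper bound
  DColumns:
    Dlist: data/low_variance.yaml
  Fillna: true
  DWild:
    Regex: 'node_network.*'
    Keep: false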
The following parameters define augmentations to be executed on the loaded DataFrames. The augmentations are chained together as defined by the user. The available parameters are:
- Scaler - Scaling/Normalizing the data. If not defined no scaler is used
  - ScalerType - We currently support all scaler types from scikit-learn; please consult the official documentation for further details.
    - When using a scikit-learn scaler, the exact name of the scaler must be given, followed by its parameters. Below is an example using the StandardScaler:

Scaler:
  StandardScaler:
    copy: True
    with_mean: True
    with_std: True
- Operations - Set of predefined operations that can be executed
  - STD - Calculates the standard deviation; expects a name and a list of metrics to use
  - Mean - Calculates the mean; expects a name and a list of metrics to use
  - Median - Calculates the median; expects a name and a list of metrics to use
- Example usage:
Operations:
  STD:
    - cpu_load1:
        - node_load1_10.211.55.101:9100
        - node_load1_10.211.55.102:9100
        - node_load1_10.211.55.103:9100
    - memory:
        - node_memory_Active_anon_bytes_10.211.55.101:9100
        - node_memory_Active_anon_bytes_10.211.55.101:9100
        - node_memory_Active_anon_bytes_10.211.55.101:9100
  Mean:
    - network_flags:
        - node_network_flags_10.211.55.101:9100
        - node_network_flags_10.211.55.102:9100
        - node_network_flags_10.211.55.103:9100
    - network_out:
        - node_network_mtu_bytes_10.211.55.101:9100
        - node_network_mtu_bytes_10.211.55.102:9100
        - node_network_mtu_bytes_10.211.55.103:9100
  Median:
    - memory_file:
        - node_memory_Active_file_bytes_10.211.55.101:9100
        - node_memory_Active_file_bytes_10.211.55.102:9100
        - node_memory_Active_file_bytes_10.211.55.103:9100
    - memory_buffered:
        - node_memory_Buffers_bytes_10.211.55.101:9100
        - node_memory_Buffers_bytes_10.211.55.102:9100
        - node_memory_Buffers_bytes_10.211.55.103:9100
- RemoveFiltered - If set to `True`, the metrics used during these operations will be deleted, with only the resulting augmented columns remaining
- Method - Expects user-defined augmentations (i.e. Python functions) for feature engineering
- Methods should be wrapped as can be seen in the wrapper_add_columns example
- All keyword arguments should be passable to the wrapped function
- Here is an example of a user defined method invocation:
Method: !!python/object/apply:edeuser.user_methods.wrapper_add_columns # user defined operation
  kwds:
    columns: !!python/tuple [node_load15_10.211.55.101:9100, node_load15_10.211.55.102:9100]
    column_name: sum_load15
- Categorical - Expects a list of categorical columns; if not defined, EDE will try to automatically detect categorical columns
- OH - If set to `True`, one-hot encoding is used for categorical features
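A short sketch of the categorical parameters, assuming they sit alongside Scaler and Operations under the Augmentation section; the column name is reused from the examples above for illustration:

Augmentation:
  Categorical:
    - node_network_flags_10.211.55.101:9100
  OH: True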
The following parameters are used to set up training mode and machine learning model selection and initialization.
- Type - Sets the type of machine learning problem. Currently supported are: clustering, classification, hpo and tpot.
- Method - Sets the machine learning method to be used. We support all scikit-learn based models as well as models from other machine learning libraries which follow scikit-learn API conventions, such as TensorFlow, Keras, LightGBM, XGBoost, CatBoost etc.
- Export - Name of the predictive model to be exported (serialized)
- MethodSettings - Settings dependent on the selected machine learning method.
- Target - Denotes the ground-truth column name to be used. This is mandatory in the case of classification. If no `target` is defined, the last column is used instead.
In the case of classification, several additional options are available:
- Verbose - Will save a full classification report, confusion matrix and feature importance (if applicable) for all folds
- PrecisionRecallCurve - Will plot the Precision Recall curve of the selected model
- ROCAUC - Will plot the ROCAUC for the selected model
- RFE - Will execute and plot recursive feature elimination. It will save a YAML file containing a list of features to be eliminated, usable via Dlist from DColumns.
  - scorer - Defines the scorer to be used
  - step - Defines the step for feature elimination. If there are a lot of features in the data, this can take a long time to execute; in that case a larger step is advised.
- DecisionBoundary - Will plot the decision boundary after executing PCA with 2 components. For a large number of classes the dimensionality reduction can result in noisy plots.
- LearningCurve - Shows the relationship between model performance and the number of features/training samples
  - sizes - Used to define the training-sample sizes for plotting; can be a list or a generator function, as seen in the example below
  - scorer - Scorer to be used
  - n_jobs - Number of jobs to be executed; if the Dask backend is used it will handle the scheduling of these jobs
- ValidationCurve - Used to fine-tune a specific hyper-parameter, checking out-of-sample performance
  - param_name - Name of the hyper-parameter to be optimized
  - param_range - Range of values to check (the list can be generated using generator functions, same as for LearningCurve)
  - scorer - Scorer to be used
  - n_jobs - Number of jobs to be executed; if the Dask backend is used it will handle the scheduling of these jobs
Note: when training an unsupervised method, EDE will by default generate decision-boundary and feature-separation plots for the selected model.
Example for clustering:
# Clustering example
Training:
Type: clustering
Method: isoforest
Export: clustering_1
MethodSettings:
n_estimators: 10
max_samples: 10
contamination: 0.1
verbose: True
bootstrap: True
Example for classification:
Training:
Type: classification
Method: randomforest
Export: classifier_1
MethodSettings:
n_estimators: 10
max_samples: 10
verbose: True
bootstrap: True
Target: target
LearningCurve:
sizes: !!python/object/apply:numpy.core.function_base.linspace
kwds:
start: 0.3
stop: 1.0
num: 10
scorer: f1_weighted
n_jobs: 5
ValidationCurve:
param_name: n_estimators
param_range:
- 10
- 20
- 60
- 100
- 200
- 600
scoring: f1_weighted
n_jobs: 8
PrecisionRecallCurve: 1
ROCAUC: 1
RFE:
scorer: f1_weighted
step: 10
DecisionBoundary: 1
Verbose: 1
Similar to how users can add their own implementations for augmentations, it is also possible to add custom machine learning methods. An example implementation, user_iso, can be found in the edeuser module.
The wrapper function should expose all parameters which are necessary for the defined method, and its return value should be an object which abides by scikit-learn API conventions.
Example for user defined method:
# User defined clustering custom
Training:
Type: clustering
Method: !!python/object/apply:edeuser.user_methods.user_iso
kwds:
n_estimators: 100
contamination: auto
max_features: 1
n_jobs: 2
warm_start: False
random_state: 45
bootstrap: True
verbose: True
max_samples: 1
Export: clustering_2
EDE supports a variety of cross-validation methods (all from scikit-learn). The parameters are as follows:
- CV - If an integer is used, standard (scikit-learn) CV is performed, with the integer representing the number of folds.
- Type - Required if a scikit-learn or user-defined CV method is used instead of an integer.
- Params - Parameters for the CV method
For defining simple CV with 5 folds:
CV: 5
For defining CV using a specific method such as StratifiedKFold:
CV:
Type: StratifiedKFold # user defined all from sklearn
Params:
n_splits: 5
shuffle: True
random_state: 5
EDE supports the inclusion of different scoring methods: all scikit-learn scoring methods as well as user-defined ones. The scoring functions are defined using the following parameters:
- Scorers
  - Scorer_list - List of scorers to be used
    - Scorer
      - Scorer_name - User-defined name of the scorer
      - skScorer - Scikit-learn scorer name

In the case of user-defined scorers, the user has to define a key/value pair: `scorer_name` and `scorer instance`.
An example Scorer chain definition can be seen below:
Scorers:
Scorer_list:
- Scorer:
Scorer_name: F1_weighted
skScorer: f1_weighted
- Scorer:
Scorer_name: Jaccard_Index
skScorer: jaccard_weighted # changes in scoring sklearn, for multiclass add suffix micro, weighted or sample
- Scorer:
Scorer_name: AUC
skScorer: roc_auc_ovr_weighted
User_scorer1: balanced_accuracy_score # key is user defined, can be changed same as Scorer_name
EDE also supports hyper-parameter optimization (HPO) methods such as grid and random search, Bayesian and evolutionary search, and the TPOT framework. The following parameters are used for HPO:
- HPOMethod - Name of the hyper-parameter optimization method to be used
- HPOParam - HPO parameters:
  - n_iter - Number of iterations (ignored in the case of grid search)
  - n_jobs - Number of threads (`-1` for all available)
  - refit - Name of the scoring metric used to determine the best-performing hyper-parameters. If multiple metrics are used, refit must be set to one metric name (mandatory)
  - verbose - If set to true, metrics about each iteration are output
- ParamDistribution - A dictionary whose keys are parameter names and whose values are lists (or Python code which generates a list according to a particular distribution) of values to try:
ParamDistribution:
  n_estimators:
    - 10
    - 100
  max_depth:
    - 2
    - 3
Example of HPO including CV and Scorers examples:
# For HPO methods
Training:
Type: hpo
HPOMethod: Random # random, grid, bayesian, tpot
HPOParam:
n_iter: 2
n_jobs: -1
refit: Balanced_Acc # if multi metric used, refit should be metric name, mandatory
verbose: True
Method: randomforest
ParamDistribution:
n_estimators:
- 10
- 100
max_depth:
- 2
- 3
Target: target
Export: hpo_1
CV:
Type: StratifiedKFold # user defined all from sklearn
Params:
n_splits: 5
shuffle: True
random_state: 5
Scorers:
Scorer_list:
- Scorer:
Scorer_name: AUC
skScorer: roc_auc
- Scorer:
Scorer_name: Jaccard_Index
skScorer: jaccard
- Scorer:
Scorer_name: Balanced_Acc
skScorer: balanced_accuracy
User_scorer1: f1_score # key is user defined, can be changed same as Scorer_name
Example of HPO using the evolutionary search method:
Training:
Type: hpo
HPOMethod: Evol # Random, Grid, Bayesian, tpot, Evol
HPOParam:
n_jobs: 1 # must be number, not -1 for all in case of Evol
scoring: f1_weighted
gene_mutation_prob: 0.20
gene_crossover_prob: 0.5
tournament_size: 4
generations_number: 30
population_size: 40 # if multi metric used, refit should be metric name, mandatory
verbose: 4
Method: randomforest
ParamDistribution:
n_estimators:
- 10
- 100
max_depth:
- 2
- 3
Target: target
Export: hpo_1_y2
CV:
Type: StratifiedKFold # user defined all from sklearn
Params:
n_splits: 5
shuffle: True
random_state: 5
Scorers:
Scorer_list:
- Scorer:
Scorer_name: F1_weighted
skScorer: f1_weighted
- Scorer:
Scorer_name: Jaccard_Index
skScorer: jaccard_weighted # changes in scoring sklearn, for multiclass add suffix micro, weighted or sample
- Scorer:
Scorer_name: AUC
skScorer: roc_auc_ovr_weighted
User_scorer1: balanced_accuracy_score
TPOT is an automated machine learning framework designed around scikit-learn (and any other framework which conforms to the scikit-learn API conventions). In contrast to other such tools it does not focus solely on the hyper-parameters of machine learning models but tries to optimize the pre- and post-processing methods as well. It does this by generating scikit-learn pipelines. The optimization is based on a genetic programming stochastic global search procedure. TPOT parameters are as follows:
- TPOTParam
  - generations - Number of generations to run
  - population_size - Size of the population (candidate configurations)
  - offspring_size - Number of new members added to the population at each generation
  - mutation_rate - Mutation rate to be used by the genetic algorithm
  - crossover_rate - Value defining the crossover used when generating new offspring
  - scoring - Scoring function to be used; it differs from scikit-learn's, see the TPOT documentation for details about TPOT scoring
  - max_time_mins - Limits the time for computing a generation
  - max_eval_time_mins - Limits the amount of time for evaluating a single pipeline
  - random_state - Random seed, enabling consistency between experiments
  - n_jobs - Number of concurrent jobs; if set to `-1` it will enable an unlimited number of potential jobs
  - verbosity - Logging detail
  - config_dict - Sets the range of methods available when building pipelines from a population. Possible values are:
    - Default - Includes all scikit-learn methods
    - TPOT light - A restricted range of methods
    - TPOT MDR - Extended feature selectors and multi-dimensional reduction models
  - use_dask - Use the Dask backend to run phenotypes from the population (i.e. each Dask worker runs a phenotype)
Example of TPOT usage:
# TPOT Optimizer
Training:
Type: tpot
TPOTParam:
generations: 2
population_size: 2
offspring_size: 2
mutation_rate: 0.9
crossover_rate: 0.1
scoring: balanced_accuracy # Scoring different from HPO check TPOT documentation
max_time_mins: 1
max_eval_time_mins: 5
random_state: 42
n_jobs: -1
verbosity: 2
config_dict: TPOT light # "TPOT light", "TPOT MDR", "TPOT sparse" or None
use_dask: True
Target: target
Export: tpotopt
CV:
Type: StratifiedKFold # user defined all from sklearn
Params:
n_splits: 5
shuffle: True
random_state: 5
Notes:
- Both HPO and TPOT are heavily based on Dask and utilize Dask workers for running different hyper-parameter configurations. Because of this it is recommended to use a pre-existing distributed Dask cluster.
- In contrast to other methods, TPOT returns the entire pipeline, not just the predictive model.
Prediction is largely unchanged between the various EDE modes. Its parameters are:
- Method - Name of the predictive model type used
- Type - Specifies what type the model is (i.e. clustering, classification, tpot etc.)
- Load - Name of the serialized predictive model to be instantiated. See the Export parameter from training.
- Scaler - Name of the scaler (if used). Once the scaler has been invoked during training, the result is serialized by EDE and can be reused for prediction.
- Analysis - Will attach root-cause analysis, in the form of computed Shapley values and feature importance, to all detected anomalous instances.
  - Plot - If set to `True`, plots are generated for each detected anomalous instance:
    - Clustering: feature importance, summary and heatmap
    - Classification: force, summary
Example of a prediction:
Detect:
Method: isoforest
Type: clustering
Load: clustering_1
Scaler: StandardScaler # Same as for training
#Analysis: True
Analysis: # if plotting of heatmap, summary and feature importance is required; if not, set to False or use the previous example
Plot: True
EDE is capable of running any user-defined analysis method on the data. Users can add data-exploration methods. Its parameters are:
- Analysis
  - Methods - List of methods to be used
    - Method - Information required for the instantiation of user methods (including keyword arguments)
  - Solo - If set to true, only the analysis is run and any other training or prediction tasks are ignored.
Example analysis implementations included in EDE range from Pearson correlation and line plots to ranking, PCA and manifold projections:
# Analysis example
Analysis:
Methods:
- Method: !!python/object/apply:edeuser.user_methods.wrapper_analysis_corr
kwds:
name: Pearson1
annot: False
cmap: RdBu_r
columns:
- node_load1_10.211.55.101:9100
- node_load1_10.211.55.102:9100
- node_load1_10.211.55.103:9100
- node_memory_Cached_bytes_10.211.55.101:9100
- node_memory_Cached_bytes_10.211.55.102:9100
- node_memory_Cached_bytes_10.211.55.103:9100
- time
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
- Method: !!python/object/apply:edeuser.user_methods.wrapper_analysis_plot
kwds:
name: line1
columns:
- node_load1_10.211.55.101:9100
- node_load1_10.211.55.102:9100
- node_load1_10.211.55.103:9100
- time
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
- Method: !!python/object/apply:edeuser.user_methods.wrapper_improved_pearson
kwds:
name: Test_Training
dcol:
- target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
show: False
- Method: !!python/object/apply:edeuser.user_methods.wrapper_rank2
kwds:
name: Test_rank
dcol:
- target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
algorithm: spearman
show: False
- Method: !!python/object/apply:edeuser.user_methods.wrapper_rank1
kwds:
name: Test_rank1
dcol:
- target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
algorithm: shapiro
- Method: !!python/object/apply:edeuser.user_methods.wrapper_pca_plot
kwds:
name: Test_PCA
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
projection: 3
target: target
# show: False
- Method: !!python/object/apply:edeuser.user_methods.wrapper_manifold
kwds:
name: Test_manifold
target: target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
manifold: tsne
n_neighbors: 10
- Method: !!python/object/apply:edeuser.user_methods.wrapper_manifold
kwds:
name: Test_manifold
target: target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
manifold: hessian
- Method: !!python/object/apply:edeuser.user_methods.wrapper_plot_on_features
kwds:
name: complete_columns
target: target
location: /Users/Gabriel/Documents/workspaces/Event-Detection-Engine/edeuser/analysis
Solo: True
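The Point section defines simple bound-based rules for point anomaly detection, grouped per metric category; as in the Filter section, gd denotes the lower bound and ld the upper bound. An example Point configuration: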
Point:
Memory:
cached:
gd: 231313
ld: 312334
buffered:
gd: 231313
ld: 312334
used:
gd: 231313
ld: 312334
Load:
shortterm:
gd: 231313
ld: 312334
midterm:
gd: 231313
ld: 312334
Network:
tx:
gd: 231313
ld: 312334
rx:
gd: 231313
ld: 312334
Miscellaneous settings:
- heap - Size of the JVM heap used for Weka-based methods (now deprecated, will be removed in the next version)
- checkpoint - All filtering and augmentation steps can be set to save their results to disk, so that in case of failure processing can be resumed from the last successfully executed step of the EDE processing pipeline
- delay - Sets how often new data is to be fetched from the data source. Same as QDelay
- interval - Query interval to be used when generating query strings. It is ignored if a user-defined query string is used
- resetindex - Deletes the anomaly index in case ElasticSearch is used for reporting
- point - Toggles point anomaly execution (now deprecated, will be removed in the next version)
Misc:
heap: 512m
checkpoint: True
delay: 15s
interval: 30m
resetindex: False
point: False
Complete example configurations are provided for the following scenarios:
- EDE Analysis
- EDE Clustering
- EDE Clustering user defined
- EDE Clustering Prediction
- EDE Classification
- EDE Classification Predicton
- EDE HPO
- EDE TPOT
- EDE TPOT Predict
The EDE Service is designed to offer a REST API for EDE. The current version only supports executing inference and not training.
GET /v1/config
Returns the current version of the configuration file. See EDE Configuration for more details.
PUT /v1/config
Uploads a new configuration file in yaml format. See EDE Configuration for more details.
GET /v1/config/augmentation
Returns the current augmentation configuration.
{
"Scaler": {
"StandardScaler": {
"copy": true,
"with_mean": true,
"with_std": true
}
}
}
PUT /v1/config/augmentation
Modifies augmentation part of the configuration.
GET /v1/config/connector
Returns the current connector configuration.
{
"Dask": {
"EnforceCheck": false,
"Scale": 3,
"SchedulerEndpoint": "local",
"SchedulerPort": 8787
},
"Index": "time",
"KafkaEndpoint": "10.9.8.136",
"KafkaPort": 9092,
"KafkaTopic": "edetopic",
"MPort": 9200,
"MetricsInterval": "1m",
"PREndpoint": "194.102.62.155",
"QDelay": "10s",
"QSize": 0,
"Query": {
"query": "{__name__=~\"node.+\"}[1m]"
}
}
PUT /v1/config/connector
Modifies connector part of the configuration.
GET /v1/config/filter
Returns the current filter configuration.
{
"DColumns": {
"Dlist": "..data/low_variance.yaml"
},
"Dropna": true,
"Fillna": true
}
PUT /v1/config/filter
Modifies filter part of the configuration.
GET /v1/config/inference
Returns the current inference configuration.
{
"Analysis": {
"Plot": true
},
"Load": "cluster_y2_v3",
"Method": "IForest",
"Scaler": "StandardScaler",
"Type": "clustering"
}
PUT /v1/config/inference
Modifies inference part of the configuration.
These resources deal with local data handling.
GET /v1/data
Returns a list of local data files. Currently only txt, csv, xlsx and json files are supported.
{
"files": [
"serrano_test_cluster.csv"
]
}
GET /v1/data/{data_file}
This resource fetches the datafile denoted by the data_file parameter.
PUT /v1/data/{data_file}
This resource allows external files to be uploaded to the EDE service. Currently only txt, csv, xlsx and json files are supported. Note that the data_file parameter must match the name of the file being uploaded.
POST /v1/inference
Starts the inference job using EDE based on the current configuration file.
GET /v1/logs
Returns EDE Service logs.
These resources are used to control RQ workers which wrap individual EDE instances. Each EDE instance can use a Dask cluster (local or remote).
GET /v1/service/jobs
Returns information about jobs from the service. It contains the unique IDs for 4 types of jobs: failed, finished, queued and started. An example response can be seen below:
{
"failed": [],
"finished": [
"a9784914-165c-488a-b5a6-7c58c6b421e6"
],
"queued": [],
"started": []
}
GET /v1/service/jobs/{job_id}
Returns information about a specific job denoted by its unique id (<job_uuid>). Some meta information is also included in the response as reported by the background process. This resource can be used to periodically check whether a particular job is finished. An example response can be seen below:
{
"finished": true,
"meta": {
"progress": "Finished inference"
},
"status": "finished"
}
GET /v1/service/jobs/worker
Returns a list of workers from the current service instance. The list also includes workers which are no longer active; see the status field in the response. Other information about the workers includes their unique id and their pid from the operating system. An example response can be seen below:
{
"workers": [
{
"id": "e3b0c442-98fc-11e7-8f38-2b66f5e7a637",
"pid": 1,
"status": "idle"
},
{
"id": "e3b0c442-98fc-11e7-8f38-2b66f5e7a638",
"pid": 2,
"status": "idle"
}
]
}
POST /v1/service/jobs/worker
Every time a POST request is issued to this resource it will start a background worker. The maximum number of workers depends on the number of physical CPU cores available. An example response can be seen below:
{
"status": "workers started"
}
If the maximum number of workers has been reached the following response will be given:
{
"warning": "maximum number of workers active!",
"workers": 4
}
DELETE /v1/service/jobs/worker
This resource enables the halting of workers. It needs to be accessed once for each worker that is to be stopped.
We use several environment variables to configure the service. The following variables are required:
- EDE_HOST - The host of the ede-service
- EDE_PORT - The port of the ede-service
- EDE_DEBUG - The debug level of the ede-service
- REDIS_END - The endpoint for the Redis queue
- REDIS_PORT - The port for the Redis queue
- WORKER_THRESHOLD - Threshold modifier for the number of supported workers; by default it is twice the number of CPU cores
- RQ_TIMEOUT - Timeout for RQ workers, default 3600 seconds

All environment variables have default values in accordance with the libraries used.
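As an illustration only, a hypothetical docker-compose style snippet wiring these variables is sketched below; the image names, port and debug values are assumptions, not part of EDE:

# Hypothetical deployment sketch; image names and values are illustrative
services:
  ede-service:
    image: ede-service:latest   # assumed image name
    environment:
      EDE_HOST: 0.0.0.0
      EDE_PORT: "5000"          # assumed port
      EDE_DEBUG: "0"
      REDIS_END: redis
      REDIS_PORT: "6379"
      WORKER_THRESHOLD: "2"
      RQ_TIMEOUT: "3600"
  redis:
    image: redis:6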
This is a work in progress and is not yet ready for production use. Additional features are being added and the API is subject to change. These features include:
- Integration with Serrano Telemetry System
- Support for training
- Support for more data sources
- Support for input-data validation and examples in Swagger
- Support for additional anomaly reporting (currently only via Kafka topic)