
8. Machine learning


Models

We configured and optimized several machine learning models for our metagenomic analysis: Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), and K-Nearest Neighbors (KNN) from Python's scikit-learn package, plus XGBoost from the xgboost package. Each model had specific hyperparameters to tune, with XGBoost being particularly suited to complex, high-dimensional data like ours.

Optimisation

To build models that will generalize well to new data, we optimize hyperparameters. Fine-tuning these settings helps the model learn patterns that aren’t tied too closely to the training data, so it can make accurate predictions on unseen data. By adjusting things like regularization strength or learning rate, we strike a balance that keeps the model from overfitting while still capturing the essential patterns. This optimization makes our models more adaptable and reliable when applied to new datasets.

We used scikit-optimize with Bayesian optimization to tune hyperparameters efficiently. This approach explores the parameter space based on past results, helping us quickly find the best settings for the regularization parameters and learning rate without excessive computation. It streamlined the tuning process, letting us improve model performance with minimal overhead. We ran 20 iterations of optimization per hyperparameter.
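As a rough sketch, a 20-iteration search of this kind could be set up with scikit-optimize's BayesSearchCV along the lines below; the estimator, search space, and toy data are placeholders, not our actual configuration.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Toy data standing in for the metagenomic feature table.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

search = BayesSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    search_spaces={
        "max_depth": Integer(10, 100),
        "max_samples": Real(0.1, 1.0),
    },
    n_iter=20,   # 20 optimization iterations, as described above
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```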

For all models, we tuned the scaler option to handle feature scaling, with the choices binary, minmax, standard, or robust, and used zeros_cutoff and features_cutoff to filter features based on zero-value prevalence or feature-selection thresholds.
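A minimal sketch of these shared options, assuming the scaler names map onto the usual scikit-learn transformers; the zero-prevalence filter is illustrative and features_cutoff is omitted for brevity.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical mapping from the scaler option to scikit-learn transformers.
SCALERS = {
    "binary": Binarizer(),
    "minmax": MinMaxScaler(),
    "standard": StandardScaler(),
    "robust": RobustScaler(),
}

def preprocess(X, scaler="standard", zeros_cutoff=0.9):
    """Drop features that are zero in more than zeros_cutoff of samples, then scale."""
    zero_fraction = (X == 0).mean(axis=0)
    kept = zero_fraction <= zeros_cutoff
    return SCALERS[scaler].fit_transform(X[:, kept])
```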

Hyperparameters

Random Forest (RF)

The hyperparameters included max_depth (10–100) for tree complexity, max_samples (0.1–1) to control the sample proportion drawn per tree, and parameters like min_samples_split (2–10), min_samples_leaf (1–10), and n_estimators (1–100) to refine how the trees are grown.
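For illustration, these ranges could be expressed as a scikit-optimize search space as follows (the dictionary name is hypothetical):

```python
from skopt.space import Integer, Real

# Hypothetical search-space definition mirroring the ranges above.
rf_space = {
    "max_depth": Integer(10, 100),
    "max_samples": Real(0.1, 1.0),
    "min_samples_split": Integer(2, 10),
    "min_samples_leaf": Integer(1, 10),
    "n_estimators": Integer(1, 100),
}
```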

Support Vector Machine (SVM)

The hyperparameters included tol (1e-4 to 1) for the stopping criterion, the kernel type (linear, rbf, or poly), and the regularization parameter C (0.001 to 100). Together, these control the trade-off between fitting the training data closely and generalizing to new data, while the kernel choice gives flexibility in how the data are mapped.
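Expressed as a scikit-optimize search space, these ranges might look like the sketch below; the log-uniform priors are an assumption, not part of our configuration.

```python
from skopt.space import Categorical, Real

# Hypothetical search-space definition mirroring the ranges above;
# the log-uniform priors are an assumption.
svm_space = {
    "tol": Real(1e-4, 1.0, prior="log-uniform"),
    "kernel": Categorical(["linear", "rbf", "poly"]),
    "C": Real(1e-3, 100.0, prior="log-uniform"),
}
```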

Linear Discriminant Analysis (LDA)

LDA has no additional hyperparameters beyond those shared by all models.

XGBoost

We optimized the hyperparameters max_depth for tree complexity, eta for learning rate, min_child_weight for minimum sum of instance weights, gamma to control tree split regularization, and subsample to prevent overfitting.
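As a sketch, the tuned parameters map onto the xgboost API as shown below; the values are placeholders rather than the optimized settings.

```python
from xgboost import XGBClassifier

# Placeholder values; the actual settings come out of the Bayesian search.
model = XGBClassifier(
    max_depth=6,          # tree complexity
    learning_rate=0.1,    # eta
    min_child_weight=1,   # minimum sum of instance weights in a child
    gamma=0.0,            # minimum loss reduction required to make a split
    subsample=0.8,        # row subsampling to limit overfitting
)
```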

K-Nearest Neighbors

We optimized the hyperparameter n_neighbors, which controls the number of neighbors considered in the algorithm. Higher values lead to a smoother decision boundary, reducing the risk of overfitting.

SHAP Values

To identify the features with the highest importance in the models' decision making, we computed Shapley values with the shap package. We scaled the absolute values returned so that all importances, including the base values, sum to 1.
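A minimal sketch of that computation, assuming a fitted tree-based model with a single output; the variable names are placeholders.

```python
import numpy as np
import shap

# `model` is a fitted tree-based classifier and `X` the feature matrix (assumptions).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)            # (n_samples, n_features) for a single output

mean_abs = np.abs(shap_values).mean(axis=0)       # mean |SHAP| per feature
base = abs(explainer.expected_value)              # base value of the explainer
importances = mean_abs / (mean_abs.sum() + base)  # importances and base value sum to 1
```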

Results

All results are saved in the results folder, which is created automatically when the first model is trained: results/{exp_name}_{n_features}features_mi{is_mi}/, where exp_name is the experiment name and n_features is the number of features used (the default value is -1, which uses all features). The parameter is_mi indicates whether mutual information feature selection was used.
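For reference, the directory name is assembled from those three values, for example:

```python
# Placeholder values illustrating the directory naming scheme.
exp_name, n_features, is_mi = "demo", -1, 0
results_dir = f"results/{exp_name}_{n_features}features_mi{is_mi}/"
# -> "results/demo_-1features_mi0/"
```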

The following results are created:

Best model

The best model's weights, best hyperparameters, and scores are saved in results/{exp_name}_{n_features}features_mi{is_mi}/confusion_matrix/.

Confusion matrices

The confusion matrices of the best model are saved in results/{exp_name}_{n_features}features_mi{is_mi}/confusion_matrix/. Three confusion matrices are saved, one each for the train, validation, and test sets. Each confusion matrix is saved in two formats: csv and png.

Data visualization plots

All ordination plots for visualization are in results/{exp_name}_{n_features}features_mi{is_mi}/ord/. These include:

  • Multidimensional Scaling (MDS)
  • Principal Components Analysis (PCA)
  • Fisher's Linear Discriminant Analysis (LDA)
  • Uniform Manifold Approximation and Projection (UMAP)

Histograms

Four different histograms are saved in results/{exp_name}_{n_features}features_mi{is_mi}/histograms/.

The first histogram allclasses.png represents the distribution of values in the outputs from your best model, using 30 bins. The x-axis indicates the output values, and the y-axis represents the frequency of those values.

The histogram zeros_per_feature_allclasses.png illustrates the distribution of zeros across the features in the dataset. The x-axis represents the number of zeros per feature, while the y-axis indicates the count of features that fall within each range of zeros.

The histogram zeros_per_sample_allclasses.png illustrates the distribution of zeros across the samples in the dataset. The x-axis represents the number of zeros per sample, while the y-axis indicates the count of samples that fall within each range of zeros.

If the option use_mi is enabled, the figure mutual_info_gain.png is also saved.