An AI-driven agent, built with LangChain and LangGraph and powered by ChatGPT, that automates the creation of Scikit-learn (Sklearn) machine learning pipelines. The system processes CSV files and generates end-to-end data processing and model training pipelines that are ready for deployment.
- Domain Identification: Determines the domain of the dataset from metadata and user descriptions.
- Dataset Type Detection: Automatically identifies if the dataset is tabular or time series.
- Task Classification: Identifies the task type (classification, regression, or anomaly detection) based on the data and user-provided input.
- Data Statistics Analysis: Computes key dataset statistics to guide further processing and model selection.
- Algorithm Recommendation: Selects the most suitable algorithm from the Scikit-learn library based on the dataset's characteristics.
- Data Preprocessing: Automatically applies column-specific transformations, including data cleaning, imputation, scaling, and more.
- Pipeline Generation: Combines preprocessing, algorithm selection, and task-specific configurations into a complete Sklearn pipeline for training and deployment.
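To make the statistics and task-detection features more concrete, here is a minimal sketch using only the Python standard library. How the project's nodes actually make these decisions is not shown in this README; the helper names (`column_stats`, `guess_task`) and the `max_classes` threshold are assumptions made purely for illustration.

```python
import csv
import io


def _is_float(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def column_stats(rows, columns):
    """Per column: missing-value count, numeric flag, and number of distinct values."""
    stats = {}
    for col in columns:
        values = [(row[col] or "").strip() for row in rows]
        present = [v for v in values if v != ""]
        stats[col] = {
            "missing": len(values) - len(present),
            "numeric": bool(present) and all(_is_float(v) for v in present),
            "distinct": len(set(present)),
        }
    return stats


def guess_task(target_stats, max_classes=10):
    """Hypothetical rule: a non-numeric or low-cardinality target suggests classification."""
    if not target_stats["numeric"] or target_stats["distinct"] <= max_classes:
        return "classification"
    return "regression"


# Small in-memory CSV with one missing value in feature2 (values are illustrative).
SAMPLE = """feature1,feature2,target
1.2,3.4,1
2.3,,0
3.1,5.0,1
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
stats = column_stats(rows, reader.fieldnames)
task = guess_task(stats["target"])
print(stats)
print(task)
```

A hand-written rule like `guess_task` is only a stand-in; an LLM-backed node could combine the same statistics with the user's description.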
Input:
- Accepts a CSV file as input.
- Optionally accepts a description of the dataset to improve task identification.
Multi-Node Processing:
- Domain Identification: Identifies the problem domain.
- Dataset Type Detection: Detects if the dataset is time series or tabular.
- Task Classification: Classifies the task type (classification, regression, or anomaly detection).
- Data Statistics Analysis: Computes key dataset statistics (e.g., missing values, distributions).
- Algorithm Recommendation: Selects the best algorithm from Sklearn based on the dataset.
- Data Preprocessing: Determines column-specific preprocessing steps.
- Pipeline Generation: Assembles a Sklearn pipeline for the task, including the data preprocessing steps.
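The node sequence above can be pictured as a chain of functions that each read a shared state and return a partial update to it. The sketch below is plain Python and does not use the LangGraph API; the node bodies are placeholder rules and every function name is hypothetical. In the real project the nodes are presumably LLM-backed and wired together with LangGraph.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def detect_dataset_type(state: State) -> State:
    # Placeholder rule: a timestamp-like column name hints at a time series.
    time_like = {"date", "time", "timestamp"}
    is_time_series = any(col.lower() in time_like for col in state["columns"])
    return {"dataset_type": "time series" if is_time_series else "tabular"}


def classify_task(state: State) -> State:
    # Placeholder rule: look for the task name in the user's description.
    description = str(state.get("description", "")).lower()
    for task in ("classification", "regression", "anomaly detection"):
        if task in description:
            return {"task": task}
    return {"task": "unknown"}


def recommend_algorithm(state: State) -> State:
    # Placeholder defaults; these are real Scikit-learn estimator names.
    defaults = {
        "classification": "LogisticRegression",
        "regression": "LinearRegression",
        "anomaly detection": "IsolationForest",
    }
    return {"algorithm": defaults.get(state["task"], "unknown")}


def run_nodes(nodes: List[Node], state: State) -> State:
    """Run nodes in order, merging each node's partial update into the state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state


initial = {
    "columns": ["feature1", "feature2", "target"],
    "description": "Binary classification task",
}
final = run_nodes([detect_dataset_type, classify_task, recommend_algorithm], initial)
print(final)
```

Returning partial updates rather than mutating the state keeps each node easy to test in isolation, which fits the per-node unit tests listed in the project layout.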
Output:
- Generates a complete Sklearn pipeline.
- The pipeline includes preprocessing, algorithm application, and task-specific configurations.
- The resulting pipeline is ready to train and deploy.
- Python 3.8+
- pip or another Python package manager
- Basic understanding of machine learning concepts
- Clone the repository:
git clone https://github.com/MLShet/LangraphMLBot.git
cd LangraphMLBot
- Install dependencies:
pip install -r requirements.txt
- Install LangChain and LangGraph (if they are not already listed in requirements.txt):
pip install langchain langgraph
- Prepare your dataset as a CSV file.
- Run the AI agent:
python main.py --input <path_to_csv> --description "<optional_description>"
- View the generated pipeline:
- The output will include a serialized Sklearn pipeline that you can use directly for training or deployment.
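For readers curious how the two command-line flags might be handled, here is a minimal `argparse` sketch. The internals of `main.py` are not shown in this README, so `build_parser` and the help strings are assumptions; only the flag names `--input` and `--description` come from the command above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags used in the usage command (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Generate a Sklearn pipeline from a CSV file."
    )
    parser.add_argument("--input", required=True, help="Path to the input CSV file.")
    parser.add_argument(
        "--description",
        default="",
        help="Optional free-text description of the dataset.",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--input", "data.csv", "--description", "Binary classification task"]
)
print(args.input)
print(args.description)
```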
Input CSV file:
feature1, feature2, target
1.2, 3.4, 1
2.3, 4.5, 0
...
Run the agent:
python main.py --input data.csv --description "Binary classification task"
Output (Example):
Pipeline(steps=[
    ('preprocessing', ColumnTransformer(...)),
    ('model', LogisticRegression())
])
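A pipeline of this shape can also be built and trained by hand. The sketch below fills in the elided `ColumnTransformer(...)` with one plausible choice (median imputation followed by standard scaling); that choice, the toy data, and the variable names are assumptions for illustration and not necessarily what the agent generates.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data shaped like the example CSV (values are made up for illustration).
df = pd.DataFrame(
    {
        "feature1": [1.2, 2.3, 0.9, 3.1, 2.8, 1.0],
        "feature2": [3.4, 4.5, float("nan"), 5.0, 4.8, 3.1],
        "target": [1, 0, 1, 0, 0, 1],
    }
)
numeric_features = ["feature1", "feature2"]

# Column-specific preprocessing: impute missing values, then scale.
numeric_steps = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)
preprocessing = ColumnTransformer(
    transformers=[("numeric", numeric_steps, numeric_features)]
)

# Full pipeline: preprocessing followed by the model.
pipeline = Pipeline(
    steps=[
        ("preprocessing", preprocessing),
        ("model", LogisticRegression()),
    ]
)

pipeline.fit(df[numeric_features], df["target"])
predictions = pipeline.predict(df[numeric_features])
print(predictions)
```

Because preprocessing lives inside the pipeline, the same imputation and scaling parameters learned during `fit` are reused at prediction time.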
.
├── README.md # Project documentation
├── main.py # Entry point for the AI agent
├── requirements.txt # Required Python packages
├── nodes/ # Multi-node logic for pipeline generation
├── states/ # Multi-node logic for pipeline generation
├── examples/ # Example datasets and usage
└── tests/ # Unit tests for each node
Contributions are welcome! Please fork the repository and create a pull request with your changes.
- Fork the repository.
- Create a new branch for your feature or bug fix:
git checkout -b feature-name
- Commit your changes and push to your fork:
git commit -m "Add feature-name"
git push origin feature-name
- Submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- LangChain for workflow orchestration.
- LangGraph for advanced graph-based task flows.
- Scikit-learn for machine learning capabilities.
- OpenAI's ChatGPT for natural language processing capabilities.