10 Underrated Python Packages for Data Science

Slide 1: CleanLab - Advanced Data Cleaning

CleanLab is a powerful library for automatically detecting and addressing data quality issues in machine learning datasets. It uses confident learning algorithms to identify label errors, outliers, and systematic noise patterns that could compromise model performance.

from cleanlab.classification import CleanLearning
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Initialize base classifier and wrap it with CleanLab
clf = RandomForestClassifier(n_estimators=100)
cl = CleanLearning(clf)

# Generate sample data with noisy labels
X = np.random.randn(1000, 10)
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)
y_noisy = y_true.copy()
noise_idx = np.random.choice(len(y_true), size=100, replace=False)
y_noisy[noise_idx] = 1 - y_noisy[noise_idx]

# Find label issues and clean dataset
cl.fit(X, y_noisy)
label_issues = cl.get_label_issues()
print(f"Found {label_issues['is_label_issue'].sum()} label issues")

Slide 2: LazyPredict - Rapid Model Evaluation

LazyPredict speeds up model selection by training and evaluating dozens of scikit-learn models with just a few lines of code. It handles basic preprocessing automatically and returns a leaderboard of performance metrics for quick comparison.

from lazypredict.Supervised import LazyRegressor, LazyClassifier
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load sample dataset (load_boston was removed from scikit-learn 1.2)
data = load_diabetes()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize and fit multiple models
reg = LazyRegressor(verbose=0, ignore_warnings=True)
models_train, predictions_train = reg.fit(X_train, X_test, y_train, y_test)

# Display performance metrics for all models
print(models_train)
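LazyClassifier, imported above, works the same way for classification problems; a short sketch using scikit-learn's breast cancer dataset:

from sklearn.datasets import load_breast_cancer

# Load a classification dataset and split it
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit and rank a broad set of classifiers
clf = LazyClassifier(verbose=0, ignore_warnings=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)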

Slide 3: Lux - Automated Data Visualization

Lux transforms data exploration by automatically generating relevant visualizations based on data characteristics. It integrates seamlessly with pandas DataFrames and provides intelligent recommendations for insightful visual analysis.

import lux
import pandas as pd

# Load sample dataset; importing lux attaches it to pandas DataFrames
df = pd.read_csv("sales_data.csv")

# In a Jupyter notebook, simply displaying the DataFrame shows a toggle
# between the standard pandas view and Lux's recommended visualizations
df

# Steer the recommendations with an explicit analysis intent
df.intent = ["Sales", "Region"]
df

# Access the recommendation sets programmatically
df.recommendation

Slide 4: PyForest - Smart Library Import Management

PyForest streamlines the data science workflow by importing commonly used libraries on demand. A library is only loaded the first time it is actually used, which keeps notebooks concise and avoids paying for imports you never touch, while active_imports() shows exactly what has been loaded.

from pyforest import *
# Libraries are imported only when first used
# Example usage without explicit imports:

# Using pandas (automatically imported)
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Using numpy (automatically imported)
arr = np.array([1, 2, 3])

# Using matplotlib (automatically imported)
plt.plot(arr)
plt.show()

# Check active imports
active_imports()

Slide 5: PivotTableJS - Interactive Data Analysis

PivotTableJS brings powerful pivot table functionality to Jupyter notebooks with an interactive interface. It enables complex data aggregation and visualization without writing code, making it ideal for exploratory data analysis and reporting.

from pivottablejs import pivot_ui
import pandas as pd
import numpy as np

# Create sample dataset (each column needs exactly 100 values)
data = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=100),
    'Category': np.random.choice(['A', 'B', 'C'], 100),
    'Sales': np.random.randint(100, 1000, 100),
    'Region': ['North', 'South', 'East', 'West'] * 25
})

# Launch interactive pivot table
pivot_ui(data, 
         rows=['Category'], 
         cols=['Region'],
         vals=['Sales'],
         aggregatorName='Sum')
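Under the hood, pivot_ui writes the interactive table to a standalone HTML file and embeds it in the notebook; passing outfile_path keeps that file for sharing outside Jupyter (the file name below is just an example):

# Write the interactive pivot table to a standalone HTML file
pivot_ui(data, outfile_path='sales_pivot.html')
# Open sales_pivot.html in any browser to explore the data without Jupyter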

Slide 6: Drawdata - Interactive Dataset Creation

Drawdata revolutionizes the way we create custom datasets for machine learning experimentation. It provides an intuitive drawing interface in Jupyter notebooks to generate 2D datasets, perfect for testing algorithms and understanding decision boundaries.

from drawdata import ScatterWidget
from sklearn.svm import SVC
from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.pyplot as plt
import numpy as np

# Create an interactive drawing canvas (displayed in a Jupyter cell)
widget = ScatterWidget()
widget

# Get the drawn points and labels once some classes have been sketched
X, y = widget.data_as_X_y

# Train a classifier on the drawn data
clf = SVC(kernel='rbf')
clf.fit(X, y)

# Visualize the decision boundary
fig, ax = plt.subplots(figsize=(10, 8))
DecisionBoundaryDisplay.from_estimator(clf, X, ax=ax, alpha=0.4)
_, y_codes = np.unique(y, return_inverse=True)  # encode class labels for coloring
ax.scatter(X[:, 0], X[:, 1], c=y_codes, edgecolor='k')
ax.set_title('SVM Decision Boundary on Hand-Drawn Dataset')
plt.show()

Slide 7: Black - Professional Code Formatting

Black is an uncompromising Python code formatter that enforces consistent style across projects. It automatically handles line length, quotes, whitespace, and complex formatting decisions, making code more readable and maintainable while following PEP 8 guidelines.

# Install black: pip install black
import black

# Sample unformatted code
unformatted_code = '''
def complex_function   (    x,y   ):
    if x>0:
        return {'result':x+y,
            'status':'positive'}
    else:
        return {'result':x-y,
        'status':'negative'}
'''

# Format the snippet in memory using Black's Python API
formatted_code = black.format_str(unformatted_code, mode=black.Mode())
print(formatted_code)
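Black is deliberately light on configuration, but the maximum line length can be adjusted through the Mode object; a small sketch assuming a 79-column limit is wanted:

# Reformat a long call with a custom maximum line length
long_call = "result = some_function(argument_one, argument_two, argument_three, argument_four)"
print(black.format_str(long_call, mode=black.Mode(line_length=79)))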

Slide 8: PyCaret - Automated Machine Learning Pipeline

PyCaret provides an end-to-end machine learning workflow that automates preprocessing, model selection, hyperparameter tuning, and deployment. It significantly reduces development time while maintaining full control over the ML pipeline.

from pycaret.regression import *
import pandas as pd

# Sample dataset
data = pd.read_csv('house_prices.csv')

# Initialize PyCaret setup
reg = setup(data, 
           target='Price',
           normalize=True,
           polynomial_features=True,
           session_id=123)

# Compare all models and keep the single best one
best_model = compare_models()

# Tune best model
tuned_model = tune_model(best_model)

# Generate predictions
predictions = predict_model(tuned_model)

# Plot model performance
plot_model(tuned_model, plot='feature')
plot_model(tuned_model, plot='residuals')
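The description above mentions deployment: PyCaret persists the entire preprocessing-plus-model pipeline with save_model and reloads it with load_model. A short sketch (the file name is arbitrary):

# Persist the whole pipeline (preprocessing + tuned model) to disk
final_model = finalize_model(tuned_model)
save_model(final_model, 'house_price_pipeline')

# Later, reload it and score new data
loaded_pipeline = load_model('house_price_pipeline')
new_predictions = predict_model(loaded_pipeline, data=data)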

Slide 9: PyTorch Lightning - Structured Deep Learning

PyTorch Lightning provides a structured approach to organizing PyTorch code, automatically handling training loops, distributed training, and complex optimizations while maintaining full flexibility of PyTorch's ecosystem.

import torch
import pytorch_lightning as pl
import torch.nn as nn
import torch.nn.functional as F

class LightningClassifier(pl.LightningModule):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        x = F.relu(self.layer1(x))
        return self.layer2(x)  # return raw logits; F.cross_entropy applies softmax internally
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss
    
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

# Train model (Lightning 2.x uses accelerator/devices instead of the old gpus argument)
trainer = pl.Trainer(max_epochs=10, accelerator='auto', devices=1)
model = LightningClassifier(784, 128, 10)
trainer.fit(model, train_dataloader)  # train_dataloader: a PyTorch DataLoader you provide
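The train_dataloader above is an ordinary PyTorch DataLoader supplied by the user; a minimal sketch with random tensors, just to make the example self-contained:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data shaped like flattened 28x28 images with 10 classes
X = torch.randn(1024, 784)
y = torch.randint(0, 10, (1024,))
train_dataloader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)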

Slide 10: Streamlit - Interactive Data Applications

Streamlit transforms Python scripts into shareable web applications with minimal code. It enables data scientists to create interactive dashboards, visualization tools, and machine learning demos without web development experience.

import streamlit as st
import pandas as pd
import plotly.express as px

def main():
    st.title("Data Analysis Dashboard")
    
    # File uploader
    uploaded_file = st.file_uploader("Upload CSV", type=['csv'])
    if uploaded_file:
        df = pd.read_csv(uploaded_file)
        
        # Interactive components
        st.sidebar.header("Filters")
        columns = st.sidebar.multiselect("Select columns", df.columns)
        
        # Data visualization
        if columns:
            fig = px.scatter_matrix(df[columns])
            st.plotly_chart(fig)
            
            # Statistical summary
            st.write("Statistical Summary")
            st.dataframe(df[columns].describe())

if __name__ == "__main__":
    main()
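Because Streamlit re-runs the whole script on every widget interaction, expensive steps such as parsing the uploaded CSV are usually cached. A minimal sketch using st.cache_data, which could wrap the pd.read_csv call above:

@st.cache_data
def load_data(file):
    # Cached across reruns: the CSV is only re-parsed when the upload changes
    return pd.read_csv(file)

The finished script is then launched from a terminal with streamlit run, followed by whatever file name the script was saved under.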

Slide 11: Real-world Application - Automated Data Quality Pipeline

This comprehensive example demonstrates the integration of CleanLab and PyCaret for automated data cleaning and model development in a production environment.

import pandas as pd
from cleanlab.classification import CleanLearning
from pycaret.classification import *

class DataQualityPipeline:
    def __init__(self, data_path):
        self.data = pd.read_csv(data_path)
        self.cl_model = None
        self.final_model = None

    def clean_data(self):
        # Setup PyCaret environment on the raw data
        clf_setup = setup(self.data, target='target', session_id=123)

        # Initial model for label quality detection
        initial_model = create_model('rf')

        # Wrap with CleanLab and refit on the raw features
        self.cl_model = CleanLearning(initial_model)
        self.cl_model.fit(self.data.drop('target', axis=1), self.data['target'])

        # Find and drop rows flagged as likely label errors
        label_issues = self.cl_model.get_label_issues()
        issue_mask = label_issues['is_label_issue'].to_numpy()
        self.data = self.data.loc[~issue_mask].reset_index(drop=True)

        return self.data

    def train_model(self):
        # Re-run setup on the cleaned data, then select the best model
        setup(self.data, target='target', session_id=123)
        self.final_model = compare_models()
        return self.final_model

# Usage example
pipeline = DataQualityPipeline('customer_data.csv')
clean_data = pipeline.clean_data()
final_model = pipeline.train_model()

Slide 12: Results for Data Quality Pipeline Implementation

This slide shows how to evaluate the automated data quality pipeline from the previous slide, comparing model performance on held-out data before and after cleaning.

# Performance Analysis Results
from sklearn.metrics import classification_report
from pycaret.classification import predict_model, pull
import pandas as pd

def analyze_pipeline_results(pipeline, test_data):
    # Metrics of the initial (noise-aware) model on held-out data
    initial_predictions = pipeline.cl_model.predict(test_data.drop('target', axis=1))
    initial_report = classification_report(test_data['target'], initial_predictions, output_dict=True)

    # Metrics of the final model trained on cleaned data;
    # pull() retrieves the scoring grid produced by predict_model
    predict_model(pipeline.final_model, data=test_data)
    final_metrics = pull()

    results = pd.DataFrame({
        'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
        'Before Cleaning': [initial_report['accuracy'],
                            initial_report['weighted avg']['precision'],
                            initial_report['weighted avg']['recall'],
                            initial_report['weighted avg']['f1-score']],
        'After Cleaning': [final_metrics['Accuracy'].iloc[0],
                           final_metrics['Prec.'].iloc[0],
                           final_metrics['Recall'].iloc[0],
                           final_metrics['F1'].iloc[0]]
    })

    print("Pipeline Performance Metrics:")
    print(results.to_string(index=False))

    print("\nData Quality Improvements:")
    label_issues = pipeline.cl_model.get_label_issues()
    print(f"Removed noisy training samples: {label_issues['is_label_issue'].sum()}")
    print(f"Accuracy improvement: {(final_metrics['Accuracy'].iloc[0] - initial_report['accuracy'])*100:.2f}%")

Slide 13: Interactive Data Exploration Dashboard

This example combines Streamlit, Lux, and PyCaret into an interactive dashboard that lets data scientists quickly understand a dataset's characteristics and train baseline models.

import streamlit as st
import streamlit.components.v1 as components
import lux
import pandas as pd
from pycaret.regression import *

def create_analysis_dashboard():
    st.title("Advanced Data Analysis Dashboard")

    uploaded_file = st.file_uploader("Upload Dataset", type=['csv'])
    if uploaded_file:
        df = pd.read_csv(uploaded_file)

        # Automated insights with Lux, exported as an embeddable HTML widget
        st.subheader("Automated Data Insights")
        df.save_as_html('lux_recommendations.html')
        with open('lux_recommendations.html') as f:
            components.html(f.read(), height=600, scrolling=True)

        # Quick model training with PyCaret
        target = st.selectbox("Select Target Variable", df.columns)
        if st.button("Train Models"):
            setup(df, target=target, session_id=123)
            best_models = compare_models(n_select=3)

            st.write("Model Performance Summary:")
            st.dataframe(pull())

if __name__ == "__main__":
    create_analysis_dashboard()

Slide 14: Additional Resources