toolbox

Curated libraries for a faster workflow

Phase: Data

Data Annotation

Data Collection

Words: curse-words, badwords, LDNOOBW, english-words (a text file containing over 466k English words), 10K most common words
Text Corpus: project gutenberg, oscar (big multilingual corpus), nlp-datasets, 1 trillion n-grams, The Big Bad NLP Database, litbank
Summarization Data: curation-corpus
Conversational data: conversational-datasets, cornell-movie-dialog-corpus
Image: 1 million fake faces, flickr-faces, CIFAR-10, The Street View House Numbers (SVHN), STL-10, imagenette, objectnet, Yahoo Flickr Creative Commons 100 Million (YFCC100m)
Dataset search engine: datasetlist, UCI Machine Learning Datasets, Google Dataset Search, fastai-datasets, Data For Everyone

Importing Data

Audio: pydub
Video: pytube (download YouTube videos), moviepy
Image: py-image-dataset-generator (auto fetch images from web for certain search)
News: news-please
PDF: camelot, tabula-py, Parsr, pdftotext
Excel: openpyxl
Remote file: smart_open
Crawling: pyppeteer (chrome automation), MechanicalSoup, libextract
Google sheets: gspread
Google drive: gdown, pydrive
Python API for datasets: pydataset
Google maps location data: geo-heatmap
Text to Speech: gtts
Databases: blaze (pandas and numpy interface to databases)

Data Augmentation

Text: nlpaug, noisemix
Image: imgaug, albumentations, augmentor, solt
Audio: audiomentations, muda
OCR data: TextRecognitionDataGenerator
Automatic augmentation: deepaugment (image)

Phase: Exploration

Data Preparation

Missing values: missingno
Split images into train/validation/test: split-folders
Class Imbalance: imblearn
Categorical encoding: category_encoders
Numerical data: numerizer (convert natural language numerics into ints and floats)
Data Validation: pandera (validation for pandas)
Data Cleaning: pyjanitor (janitor ported to python)
Parsing: pyparsing, parse
Natural date parser: dateparser
Unicode: text-unidecode
Emoji: emoji
Weak Supervision: snorkel
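
To make the "split into train/validation/test" entry concrete, here is a minimal standard-library sketch of the underlying idea. It is not the split-folders API; the function name `split_items`, the default ratios, and the example file names are assumptions made for illustration.

```python
import random


def split_items(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle items and split them into train/validation/test lists.

    Hypothetical helper for illustration; not part of split-folders.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# Made-up file names standing in for an image folder listing.
files = [f"img_{i:03d}.jpg" for i in range(100)]
train, val, test = split_items(files)
```

Fixing the seed means the same items land in the same split every time, which matters when comparing models trained on different days.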

Data Exploration

View Jupyter notebooks through CLI: nbdime
Parametrize notebooks: papermill
Access notebooks programmatically: nbformat
Convert notebooks to other formats: nbconvert
Extra utilities not present in frameworks: mlxtend
Maps in notebooks: ipyleaflet
Data Exploration: bamboolib (a GUI for pandas)

Phase: Feature Engineering

Feature Generation

Automatic feature engineering: featuretools, autopandas, tsfresh (automatic feature engineering for time series)
Custom distance metric learning: metric-learn, pytorch-metric-learning
Time series: python-holidays, skits
DAG based dataset generation: DFFML

Phase: Modeling

Model Selection

Brute-force through all scikit-learn models and parameters: auto-sklearn, tpot
Curations: bert-related-papers
Autogenerate ML code: automl-gs, mindsdb, autocat (auto-generate text classification models in spacy)
ML from command line (or Python or HTTP): DFFML
Pretrained models: modeldepot, pytorch-hub, papers-with-code, pretrained-models.pytorch
Find SOTA models: sotawhat
Gradient Boosting: catboost, lightgbm (GPU-capable), thundergbm (GPU-capable)
Hidden Markov Models: hmmlearn
Genetic Programming: gplearn
Active Learning: modal
Support Vector Machines: thundersvm (GPU-capable)
Rule based classifier: sklearn-expertsys
Probabilistic modeling: pomegranate
Graph Embedding and Community Detection: karateclub
Anomaly detection: adtk
Spiking Neural Network: norse
Fuzzy Learning: fylearn, scikit-fuzzy
Dimensionality reduction: fbpca
Noisy Label Learning: cleanlab
Few Shot Learning: keras-fewshotlearning

NLP

Libraries: spacy, nltk, corenlp, deeppavlov, kashgari, camphr (spacy plugin for transformers, elmo, udify), transformers, simpletransformers, ernie, stanza, scispacy (spacy for medical documents)
Preprocessing: textacy
Text Extraction: textract (Image, Audio, PDF)
Text Generation: gpt2-client, textgenrnn, gpt-2-simple
Summarization: textrank, pytldr, bert-extractive-summarizer
Spelling Correction: JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy
Contraction Mapping: contractions
Keyword extraction: rake, pke, phrasemachine
Multiple Choice Question Answering: mcQA
Sequence to sequence models: headliner
Transfer learning: finetune
Translation: googletrans, word2word, translate-python
Embeddings: pymagnitude (manage vector embeddings easily), chakin (download pre-trained word vectors), sentence-transformers, InferSent, bert-as-service, sent2vec, sense2vec, zeugma (pretrained-word embeddings as scikit-learn transformers), BM25Transformer, laserembeddings, glove-python
Multilingual support: polyglot, inltk (indic languages), indic_nlp
NLU: snips-nlu
Semantic parsing: quepy
Inflections: inflect
Contractions: pycontractions
Coreference Resolution: neuralcoref
Readability: homer
Language Detection: language-check
Topic Modeling: guidedlda, enstop, top2vec
Clustering: spherecluster (kmeans with cosine distance), kneed (automatically find number of clusters from elbow curve), kmodes
Metrics: seqeval (NER, POS tagging)
String match: jellyfish (perform string and phonetic comparison), flashtext (superfast extract and replace keywords), pythonverbalexpressions (verbally describe regex), commonregex (ready-made regex for email/phone etc.)
Sentiment: vaderSentiment (rule based)
Text distances: textdistance, editdistance, word-mover-distance, wmd-relax (word mover distance for spacy)
PII removal: scrubadub
Profanity detection: profanity-check
Visualization: stylecloud (wordclouds), scattertext
Fuzzy Search: fuzzywuzzy
Named Entity Recognition (NER): spaCy, Stanford NER, sklearn-crfsuite, med7 (spacy NER for medical records)
Fill blanks: fitbert
Dictionary: vocabulary
Nearest neighbor: faiss
Sentence Segmentation: nnsplit
Knowledge Distillation: textbrewer
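
As a rough illustration of what the "Text distances" entries compute, the sketch below implements plain Levenshtein edit distance in pure Python. It is a teaching sketch, not the API of textdistance or editdistance, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```

The dedicated libraries listed above are typically much faster because they use compiled code; this version only shows the dynamic-programming idea.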

Speech Recognition

Library: speech_recognition, pyannotate
Diarization: resemblyzer

RecSys

Factorization machines (FM), and field-aware factorization machines (FFM): xlearn, DeepCTR
Scikit-learn like API: surprise
Recommendation System in Pytorch: CaseRecommender
Apriori algorithm: apyori

Computer Vision

Image processing: scikit-image, imutils
Segmentation Models in Keras: segmentation_models
Face recognition: face_recognition, face-alignment (find facial landmarks)
GANs: mimicry
Face swapping: faceit, faceit-live
Video summarization: videodigest
Semantic search over videos: scoper
OCR: keras-ocr, pytesseract
Object detection: luminoth
Image hashing: ImageHash

Timeseries

Predict Time Series: prophet
Scikit-learn like API: sktime
ARIMA models: pmdarima

Framework extensions

Pytorch: Keras like summary for pytorch, skorch (wrap pytorch in scikit-learn compatible API), catalyst
Einstein notation: einops
Scikit-learn: scikit-lego, iterstrat (cross-validation for multi-label data)
Keras: keras-radam, larq (binarized neural networks), ktrain (fastai-like interface for keras), tavolo (useful techniques from kaggle as utilities), tensorboardcolab (make tensorboard work in colab), tf-sha-rnn
Tensorflow: tensorflow-addons

Phase: Validation

Model Training Monitoring

Learning curve: lrcurve (plot realtime learning curve in Keras), livelossplot
Notifications: knockknock (get notified by slack/email), jupyter-notify (notify when task is completed in jupyter)
Progress bar: fastprogress

Interpretability

Visualize keras models: keras-vis
Interpret models: eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance
Interpret BERT: exbert
Interpret word2vec: word2viz

Phase: Optimization

Hyperparameter Optimization

Keras: keras-tuner
Scikit-learn: sklearn-deap (evolutionary algorithm for hyperparameter search), hyperopt-sklearn
General: hyperopt, optuna, evol, talos
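
For readers new to hyperparameter optimization, the simplest baseline that these libraries improve on is random search. The sketch below uses only the standard library; `random_search`, `toy_objective`, and the search space are hypothetical names and values for illustration, and none of this is the API of hyperopt, optuna, evol, or talos.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample parameters uniformly from space and keep the lowest score.

    space maps a parameter name to a (low, high) range.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Stand-in for a validation loss; its minimum is at x=2, y=-1.
    return (params["x"] - 2.0) ** 2 + (params["y"] + 1.0) ** 2


space = {"x": (-5.0, 5.0), "y": (-5.0, 5.0)}
best_params, best_score = random_search(toy_objective, space, n_trials=200)
```

In practice the objective would train a model and return a validation metric, and the libraries above replace uniform sampling with smarter strategies.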

Visualization

Draw CNN figures: nn-svg
Visualization for scikit-learn: yellowbrick, scikit-plot
XKCD like charts: chart.xkcd
Convert matplotlib charts to D3 charts: mpld3
Generate graphs using markdown: mermaid
Visualize topic models: pyldavis
High dimensional visualization: umap
Visualization libraries: pygal, plotly, plotnine
Interactive charts: bokeh
Visualize architectures: netron
Activation maps for keras: keract
Create interactive charts online: flourish-studio
Color Schemes: open-color, mplcyberpunk (cyberpunk style for matplotlib)

Phase: Production

Model Serialization

Transpiling: sklearn-porter (transpile sklearn model to C, Java, JavaScript and others), m2cgen
Pickling extended: cloudpickle, jsonpickle

Scalability

Parallelize Pandas: pandarallel, swifter, modin
Parallelize numpy operations: numba

Benchmark

Profile pytorch layers: torchprof
Load testing: k6
Monitor GPU usage: nvtop

API

Configuration Management: config, python-decouple
Data Validation: schema, jsonschema, cerberus, pydantic, marshmallow, validators
Enable CORS in Flask: flask-cors
Caching: cachetools, cachew (cache to local sqlite)
Authentication: pyjwt (JWT)
Task Queue: rq, schedule
Database: flask-sqlalchemy, tinydb
Logging: loguru

Dashboard

Generate frontend with python: streamlit

Adversarial testing

Generate images to fool model: foolbox
Generate phrases to fool NLP models: triggers
General: cleverhans

Python libraries

Datetime compatible API for Bikram Sambat: nepali-date
Bloom filter: python-bloomfilter
Run python libraries in sandbox: pipx
Pretty print tables in CLI: tabulate
Leaflet maps from python: folium
Debugging: PySnooper
Date and Time: pendulum
Create interactive prompts: prompt-toolkit
Concurrent database: pickleshare
Async: tomorrow
Testing: crosshair (find failure cases for functions)
CLI tools: gitjk (undo what you just did in git)
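
Since a Bloom filter is less widely known than the other entries in this list, here is a minimal sketch of the data structure using only the standard library. It is not the python-bloomfilter API; the class name, default sizes, and the SHA-256-based hashing scheme are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("spam")
is_member = "spam" in seen
```

A lookup that returns False is definitive, while True only means "probably present", so the structure suits cheap pre-checks before an expensive exact lookup.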

petersandersen/toolbox
