Specialization classifier

Overview

The goal of this project was to train a classifier on patients' medical questions and predict which medical field each question belongs to. The classifier reached an F1-score of 0.915 and an accuracy of 0.913.

Keywords

NLP, Logistic Regression, TF-IDF, Supervised Learning, Multinomial Classification.

Files

Name	Description
config.py	Configuration variables
requirements.txt	Required Python packages
service.py	Script for model deployment
train_model.py	Script for model training

Installation

Requirements

Python 3.7+

Installing Python packages

Automatically

To avoid conflicts between package versions across projects, keep a separate environment for each project. Packages installed into a conda environment are available only inside that environment.

1. Install Anaconda.
2. Create a conda environment: conda create -y --name <env-name> python=3.7
3. Install the packages from requirements.txt: conda install --force-reinstall -y -q --name <env-name> -c conda-forge --file requirements.txt
4. Activate the environment: conda activate <env-name>

Manually

To install the packages manually, go through requirements.txt and install each listed package. For example, to resolve the numpy=1.19.5 line, run pip install numpy==1.19.5 (a plain pip install numpy installs the latest release rather than the pinned version).
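The manual route can be scripted. The sketch below is a hypothetical helper, not part of the repository: it assumes requirements.txt uses conda-style pins with a single `=` (as in the numpy=1.19.5 line) and rewrites each line into the `==` form that pip expects. The function names `conda_to_pip_spec` and `pip_commands` are made up for illustration.

```python
from typing import List, Optional

# Version operators that pip already understands; such lines are left untouched.
PIP_OPERATORS = ("==", ">=", "<=", "!=", "~=")


def conda_to_pip_spec(line: str) -> Optional[str]:
    """Convert one conda-style requirement line into a pip-style spec.

    Returns None for blank lines and comment-only lines.
    """
    spec = line.split("#", 1)[0].strip()  # drop trailing comments
    if not spec:
        return None
    if "=" not in spec or any(op in spec for op in PIP_OPERATORS):
        return spec  # bare package name, or already pip-compatible
    name, version = spec.split("=", 1)
    # conda also allows name=version=build; keep only the version part
    version = version.split("=", 1)[0]
    return f"{name.strip()}=={version.strip()}"


def pip_commands(lines: List[str]) -> List[str]:
    """Build one 'pip install' command per usable requirement line."""
    specs = (conda_to_pip_spec(line) for line in lines)
    return [f"pip install {spec}" for spec in specs if spec]
```

Passing the lines of requirements.txt to `pip_commands` yields the commands to run one by one.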

Usage instructions

The steps are summarized below:

Data preparation
1. Vectorizer data
2. Classifier data
Training
1. Run python train_model.py
Deployment
1. Run python service.py

Data preparation

train_model.py expects two CSV files: the first is used to train the vectorizer, the second to train the classifier.

SQL Query to fetch the data:

SELECT
  q.text AS question,
  mp.id AS spec_id
FROM question q
JOIN medical_problem mp ON q.medical_problem_id = mp.id;
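One way to turn this query into a CSV file is sketched below. The README does not say which database engine holds the data, so the example uses Python's built-in sqlite3 with an in-memory database as a stand-in. The table definitions and sample rows in `demo` are assumptions that go beyond the columns visible in the query, and `export_questions` is a hypothetical helper.

```python
import csv
import io
import sqlite3

QUERY = """
SELECT
  q.text AS question,
  mp.id AS spec_id
FROM question q
JOIN medical_problem mp ON q.medical_problem_id = mp.id;
"""


def export_questions(conn: sqlite3.Connection, out) -> int:
    """Run the query and write the rows as CSV with a header row.

    Returns the number of data rows written.
    """
    cursor = conn.execute(QUERY)
    writer = csv.writer(out)
    # Column aliases from the query become the CSV header: question, spec_id
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


def demo() -> str:
    """Build a tiny in-memory database (assumed schema) and export it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE medical_problem (id INTEGER PRIMARY KEY);
        CREATE TABLE question (
            id INTEGER PRIMARY KEY,
            text TEXT NOT NULL,
            medical_problem_id INTEGER REFERENCES medical_problem(id)
        );
        INSERT INTO medical_problem (id) VALUES (4), (11);
        INSERT INTO question (text, medical_problem_id) VALUES
            ('My tooth hurts', 4),
            ('I have a rash', 11);
        """
    )
    buffer = io.StringIO()
    export_questions(conn, buffer)
    conn.close()
    return buffer.getvalue()
```

With a real database, the same function can write to a file opened with `newline=""`, which is what the csv module recommends for output files.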

Place the datasets in the root folder with proper names as described in config.py:

VOCABULARY_DATA_PATH = 'data_vectorizer.csv'
CLASSIFIER_DATA_PATH = 'data_classifier.csv'

For vectorizer training, provide as much data as possible. For classifier training, you can provide a smaller but cleaner dataset (double-check that the specialization IDs are assigned correctly) to improve accuracy. Keep in mind that reducing the amount of data can also lower the final accuracy.

Expected CSV format:

Name	Type	Description
question	string	Question asked by the client
spec_id	number	Specialization ID
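Before training, it can help to check that a file matches this format. The following is a minimal sketch using only the standard library. `find_csv_problems` is a hypothetical helper, and treating spec_id as an integer is an assumption (the table only says "number").

```python
import csv
from typing import Iterable, List

EXPECTED_COLUMNS = ("question", "spec_id")


def find_csv_problems(lines: Iterable[str]) -> List[str]:
    """Return human-readable problems found in CSV text; empty list if none.

    `lines` can be an open text file or any iterable of CSV lines.
    """
    reader = csv.DictReader(lines)
    header = reader.fieldnames or []
    missing = [column for column in EXPECTED_COLUMNS if column not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    # Record 1 is the header, so data records are counted from 2.
    for row_number, row in enumerate(reader, start=2):
        if not (row["question"] or "").strip():
            problems.append(f"row {row_number}: empty question")
        try:
            int(row["spec_id"])  # assumption: IDs are integers
        except (TypeError, ValueError):
            problems.append(
                f"row {row_number}: spec_id is not an integer: {row['spec_id']!r}"
            )
    return problems
```

To check a real file, open it with `open('data_classifier.csv', newline='', encoding='utf-8')` and pass the file object to `find_csv_problems`. The UTF-8 encoding is also an assumption.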

Training

To train the model, run python train_model.py.

The training script expects these configuration variables in config.py:

Name	Type	Description	Example
PICKLE_VECTORIZER_NAME	string	Vectorizer file name	`'model_vectorizer.pickle'`
PICKLE_CLASSIFIER_NAME	string	Classifier file name	`'model_classifier.pickle'`
VOCABULARY_DATA_PATH	string	Vectorizer data file name	`'data_vectorizer.csv'`
CLASSIFIER_DATA_PATH	string	Classifier data file name	`'data_classifier.csv'`
RANDOM_STATE	number	Random number to ensure consistent training results	`42`
MAPPING	dictionary	Mapping of the specialization IDs	`{ 20: 2, 15: 2, 11: 1 }`
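Put together, the training-related part of config.py might look like the sketch below. The values are simply the ones from the Example column, not necessarily those shipped in the repository. The real file also holds deployment settings that are not shown here.

```python
# config.py (sketch): training-related variables only.
# Values are copied from the Example column of the table.

# Files the training script writes the pickled models to
PICKLE_VECTORIZER_NAME = 'model_vectorizer.pickle'
PICKLE_CLASSIFIER_NAME = 'model_classifier.pickle'

# Input CSV files
VOCABULARY_DATA_PATH = 'data_vectorizer.csv'
CLASSIFIER_DATA_PATH = 'data_classifier.csv'

# Fixed seed so that repeated training runs give consistent results
RANDOM_STATE = 42

# Source specialization ID -> target specialization ID
MAPPING = {20: 2, 15: 2, 11: 1}
```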

Output

Classifier model stored in the file named by PICKLE_CLASSIFIER_NAME
Vectorizer model stored in the file named by PICKLE_VECTORIZER_NAME

How to extend the mapping of specializations

To add another specialization, add a key-value pair to the MAPPING dictionary. The key is the source specialization ID and the value is the target specialization ID (e.g. to map ID 11 to ID 4, add the pair 11: 4).

There are two rules to follow:

Target ID 0 is reserved for the default/unmapped classes.
Target IDs must form a contiguous sequence from 1 to N, with no skipped numbers.
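The two rules can be checked mechanically before training. The sketch below uses hypothetical helpers that are not part of the repository. `apply_mapping` assumes that any source ID missing from the mapping falls back to target 0, which matches the stated purpose of ID 0 but may differ from how train_model.py actually applies the mapping.

```python
from typing import Dict, Iterable, List

DEFAULT_TARGET = 0  # reserved for default/unmapped classes


def validate_mapping(mapping: Dict[int, int]) -> None:
    """Raise ValueError if the mapping breaks either rule."""
    targets = set(mapping.values())
    if not targets:
        return
    if DEFAULT_TARGET in targets:
        raise ValueError("target ID 0 is reserved for unmapped classes")
    if min(targets) < 1:
        raise ValueError("target IDs must be positive")
    # Every number from 1 up to the largest target must be used
    missing = sorted(set(range(1, max(targets) + 1)) - targets)
    if missing:
        raise ValueError(f"target IDs skip these numbers: {missing}")


def apply_mapping(spec_ids: Iterable[int], mapping: Dict[int, int]) -> List[int]:
    """Translate source IDs to target IDs; unmapped IDs become 0 (assumption)."""
    return [mapping.get(spec_id, DEFAULT_TARGET) for spec_id in spec_ids]
```

For example, `{11: 1, 20: 3}` is rejected because target 2 is skipped, while `{20: 2, 15: 2, 11: 1}` passes: several source IDs may share one target.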

Deployment

To deploy the model, run python service.py. This script will deploy the model as a service using the Flask micro web framework and the Waitress WSGI server.

The host, port, API version and URL prefix can be configured in config.py.

REST API

Predictions

URL	/api/v1/predictions/specialization
query param	question: string
Method	GET
Response	200 OK: returns a number, the ID of the predicted specialization; 500 Internal Server Error
Example	Request: http://localhost:5000/api/v1/predictions/specialization?question='Trápí mě zubní kaz' (Czech for "Tooth decay is bothering me") Response: 4
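A client can call this endpoint with the standard library alone. The sketch below makes a few assumptions: the service runs at http://localhost:5000 as in the example, and the response body is a bare integer such as 4. Unlike the example request, it sends the question without surrounding quotes and lets `urlencode` percent-encode spaces and non-ASCII characters; whether the service expects the quotes is not stated in this README. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # assumption: default host and port
PREDICTION_PATH = "/api/v1/predictions/specialization"


def build_prediction_url(question: str, base_url: str = BASE_URL) -> str:
    """Build the GET URL with the question safely percent-encoded."""
    query = urlencode({"question": question})
    return f"{base_url}{PREDICTION_PATH}?{query}"


def predict_specialization(
    question: str, base_url: str = BASE_URL, timeout: float = 5.0
) -> int:
    """Ask the running service for a prediction and return the ID.

    urlopen raises urllib.error.HTTPError on a 500 response.
    """
    url = build_prediction_url(question, base_url)
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return int(body.strip())  # assumption: body is a bare integer
```

Calling `predict_specialization('Trápí mě zubní kaz')` requires the service started by python service.py to be running.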

Health check

URL	/api/v1/health-check
Method	GET
Response Status	200 OK; 404 Not Found

Copyright and licensing information

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
