This FastAPI-based service accepts text input and a model name, scrubs Personally Identifiable Information (PII) using spaCy Named Entity Recognition (NER) and regex patterns, then sends the sanitized input to a specified Large Language Model (LLM) endpoint. The results (including performance metrics, CPU usage, and PII detection details) are logged to a local SQLite database and returned as a JSON response.

NER is the Natural Language Processing task of detecting and categorizing named entities; this service performs it with the Python library spaCy.

Currently, the code supports Anthropic's legacy Text Completions API (https://docs.anthropic.com/en/api/complete), OpenAI's Chat Completions API (https://platform.openai.com/docs/api-reference/chat), and Ollama's text generation API (https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion).
- **PII Scrubbing:**
  - Uses spaCy's `en_core_web_sm` model to detect named entities that could be considered sensitive (e.g., `PERSON`, `GPE`, `ORG`, `DATE`, `TIME`, `MONEY`, `NORP`, `LOC`, `FAC`).
  - Additional PII patterns like `IP_ADDRESS`, `SSN`, `CREDIT_CARD`, and `PASSWORD` are detected using custom regex patterns.
  - All detected PII is replaced with placeholder tags (e.g., `<PERSON>`, `<IP_ADDRESS>`).
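A minimal sketch of this scrubbing layer. The regex patterns here are illustrative assumptions (the service's actual patterns may be stricter), and `scrub_regex`/`scrub_ner` are hypothetical names, not the service's own functions:

```python
import re

# Hypothetical regex table for the custom patterns the README describes;
# the actual patterns in the code may differ.
PII_PATTERNS = {
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"),
    "PASSWORD": re.compile(r"(?i)password\s*[:=]?\s*\S+"),
}

# Entity labels the README lists as sensitive.
SENSITIVE_LABELS = {"PERSON", "GPE", "ORG", "DATE", "TIME",
                    "MONEY", "NORP", "LOC", "FAC"}

def scrub_regex(text: str):
    """Replace regex-detected PII with <LABEL> placeholders.

    Returns the scrubbed text plus a list of detected entities.
    """
    detected = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.findall(text):
            detected.append({"text": match, "type": label})
        text = pattern.sub(f"<{label}>", text)
    return text, detected

def scrub_ner(text: str, nlp):
    """Replace sensitive spaCy entities with <LABEL> placeholders.

    `nlp` is a loaded pipeline, e.g. spacy.load("en_core_web_sm").
    """
    pieces, last = [], 0
    for ent in nlp(text).ents:
        if ent.label_ in SENSITIVE_LABELS:
            pieces.append(text[last:ent.start_char])
            pieces.append(f"<{ent.label_}>")
            last = ent.end_char
    pieces.append(text[last:])
    return "".join(pieces)
```

In the service, both passes run over the input before anything is sent to the LLM, so the backend only ever sees placeholder tags.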
- **Flexible LLM Backend:**
  - Local LLM (Ollama): For model names that do not start with `"gpt-"` or `"claude-"`, the service assumes a locally running Ollama LLM endpoint at `http://0.0.0.0:11434/api/generate`.
  - OpenAI ChatGPT: For model names starting with `"gpt-"`, it calls the OpenAI Chat Completions API. Requires `OPENAI_API_KEY` to be set as an environment variable.
  - Anthropic Claude: For model names starting with `"claude-"`, it calls the Anthropic API endpoint. Requires `ANTHROPIC_API_KEY` to be set as an environment variable.
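The dispatch rule above boils down to a prefix check; a minimal sketch (the function name is illustrative, not from the code):

```python
def pick_backend(model: str) -> str:
    """Return which API the service would call for a given model name."""
    if model.startswith("gpt-"):
        return "openai"      # requires OPENAI_API_KEY
    if model.startswith("claude-"):
        return "anthropic"   # requires ANTHROPIC_API_KEY
    return "ollama"          # local endpoint: http://0.0.0.0:11434/api/generate
```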
- **Performance and Metadata Logging:**
  - Logs execution details (timestamps, CPU usage, network latency, token usage) and the chosen model into a SQLite database.
  - Information is returned as a JSON response including the scrubbed input, detected entities, and LLM output.
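The latency side of this measurement can be sketched with the standard library alone; how the service samples CPU usage is an assumption (it may use psutil or similar), and `timed_call` is an illustrative helper, not the service's API:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, metrics) with wall-clock and CPU time.

    Sketch of the measurement idea: record counters before and after
    the LLM request, then log the deltas.
    """
    cpu0 = time.process_time()     # CPU seconds consumed by this process
    wall0 = time.perf_counter()    # high-resolution wall clock
    result = fn(*args, **kwargs)
    metrics = {
        "latency_s": time.perf_counter() - wall0,
        "cpu_s": time.process_time() - cpu0,
    }
    return result, metrics
```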
- **Local SQLite Database Logging:**
  - The code uses SQLAlchemy to store metadata for each request.
  - Columns include prompt, scrubbed PII, timestamps, durations, CPU usage, model name, and performance statistics.
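A sketch of what the SQLAlchemy mapping could look like. The column names are assumptions derived from the list above, and an in-memory database is used here for illustration, whereas the service writes to `ollama_log.db`:

```python
import json
from datetime import datetime, timezone

from sqlalchemy import (Column, DateTime, Float, Integer, String, Text,
                        create_engine)
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class RequestLog(Base):
    """Illustrative row schema; the real column names may differ."""
    __tablename__ = "requests"
    id = Column(Integer, primary_key=True)
    prompt = Column(Text)
    scrubbed_prompt = Column(Text)
    detected_pii = Column(Text)        # JSON-encoded entity list
    model_name = Column(String)
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
    duration_s = Column(Float)
    cpu_percent = Column(Float)
    tokens_per_second = Column(Float)

# The service uses sqlite:///ollama_log.db; in-memory keeps this self-contained.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(bind=engine)
Session = sessionmaker(bind=engine)
```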
- **Receive Request:**
  The `/process` endpoint accepts a JSON payload with:
  - `text`: The user-provided text containing potential PII.
  - `model`: The LLM model identifier (e.g., `"qwen2.5:0.5b-instruct-q4_K_M"`, `"gpt-4o"`, `"gpt-4o-mini"`, `"claude-2.1"`).
- **Scrub PII:**
  - The code uses spaCy NER to find entities labeled as PII.
  - Additional regex patterns detect things like IP addresses, SSNs, credit cards, and passwords.
  - Detected PII entities are replaced with placeholders (e.g., `<PERSON>`, `<IP_ADDRESS>`).
  - The scrubbed text and a list of detected entities are retained.
- **LLM Call:**
  - Based on the `model` string, the service decides which API to call:
    - Local/Ollama: If the name has no `gpt-` or `claude-` prefix, calls the Ollama endpoint.
    - OpenAI: If the model name starts with `"gpt-"`, calls OpenAI's API. Requires `OPENAI_API_KEY`.
    - Anthropic: If the model name starts with `"claude-"`, calls Anthropic's API. Requires `ANTHROPIC_API_KEY`.
  - Measures CPU usage and network latency before and after the request.
  - Extracts response text and performance metrics (e.g., tokens processed).
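The Ollama branch of this step can be sketched as follows. The endpoint and the `response`/`eval_count` fields come from Ollama's documented `/api/generate` API; the function names are illustrative, and `call_ollama` requires a running server:

```python
OLLAMA_URL = "http://0.0.0.0:11434/api/generate"

def build_ollama_payload(model: str, prompt: str) -> dict:
    """Non-streaming request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def call_ollama(model: str, prompt: str) -> dict:
    """POST to a running Ollama server (network call; model must be pulled).

    The returned dict includes "response" (generated text) and
    "eval_count"/"eval_duration", which can be used to derive tokens/sec.
    """
    import requests  # listed in the service's dependencies
    resp = requests.post(OLLAMA_URL,
                         json=build_ollama_payload(model, prompt),
                         timeout=120)
    resp.raise_for_status()
    return resp.json()
```

The OpenAI and Anthropic branches follow the same pattern with their respective endpoints and API keys.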
- **Logging and Response:**
  - All relevant data (prompt, scrubbed text, PII details, performance metrics, CPU usage, model name) are inserted into the local SQLite database.
  - The final JSON response returns:
    - `scrubbed_input`: The sanitized input.
    - `detected_entities`: A list of detected PII and their types.
    - `response`: The LLM's generated output.
    - `log`: A dictionary of metadata (durations, CPU usage, tokens/second, etc.).
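Putting those fields together, a response body might look like the following; the top-level field names come from the README, while the values and the exact keys inside `log` are assumptions:

```python
import json

# Illustrative /process response shape (values are made up for the example).
example_response = {
    "scrubbed_input": "Here is my <PASSWORD> and IP address <IP_ADDRESS>.",
    "detected_entities": [
        {"text": "super_secret123", "type": "PASSWORD"},
        {"text": "172.16.254.1", "type": "IP_ADDRESS"},
    ],
    "response": "I cannot see any sensitive data in that text.",
    "log": {"total_duration_s": 1.42, "cpu_percent": 12.5,
            "tokens_per_second": 38.0},
}

print(json.dumps(example_response, indent=2))
```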
- Python 3.9+
- `pip install fastapi uvicorn spacy requests sqlalchemy` or `pip install -r requirements.txt`
- spaCy English model: `python -m spacy download en_core_web_sm`
- Access to the desired LLM endpoints (Ollama, OpenAI, Anthropic).
- Appropriate API keys set as environment variables (if using OpenAI or Anthropic).
- For Ollama, first install it as per https://github.com/ollama/ollama, then pull or run the desired models. For Ollama-only PII scrubbing, run `python3 pii_scrub_ner_eng_ollama.py` and call it via curl as described below.
- **Install Dependencies:**
  `pip install -r requirements.txt` and `python -m spacy download en_core_web_sm`
- **Set Environment Variables for Remote LLMs:**
  - For OpenAI: `export OPENAI_API_KEY="your_openai_api_key"`
  - For Anthropic: `export ANTHROPIC_API_KEY="your_anthropic_api_key"`

  If you're only using a local LLM (Ollama), you do not need these keys. Start the FastAPI server (next step) in the same terminal session so it inherits these variables.
- **Run the FastAPI Uvicorn Server:**
  `uvicorn __main__:app --host 0.0.0.0 --port 5000 --reload`
  or
  `python3 pii_scrub_ner_eng_hybrid.py`

  This starts the FastAPI server on port 5000.
- **Database Setup:**
  - On the first run, `Base.metadata.create_all(bind=engine)` creates the `requests` table in `ollama_log.db`.
  - If you change the schema, remove or rename `ollama_log.db` before restarting so the schema is recreated.
Local LLM (Ollama):
curl -X POST "http://127.0.0.1:5000/process" \
-H "Content-Type: application/json" \
-d '{"text": "Here is my password: super_secret123 and IP address 172.16.254.1.", "model": "qwen2.5:0.5b-instruct-q4_K_M"}'
OpenAI (ChatGPT):
export OPENAI_API_KEY="your_openai_api_key"
curl -X POST "http://127.0.0.1:5000/process" \
-H "Content-Type: application/json" \
-d '{"text": "John Doe has an email johndoe@example.com", "model": "gpt-4o-mini"}'
Anthropic (Claude):
export ANTHROPIC_API_KEY="your_anthropic_api_key"
curl -X POST "http://127.0.0.1:5000/process" \
-H "Content-Type: application/json" \
-d '{"text": "Alice from California, email alice123@gmail.com", "model": "claude-2.1"}'
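The same calls can be made from Python with only the standard library. `build_process_request` and `call_process` are illustrative helper names, not part of the service, and `call_process` requires the server from the steps above to be running:

```python
import json
from urllib.request import Request, urlopen

def build_process_request(text: str, model: str,
                          base_url: str = "http://127.0.0.1:5000") -> Request:
    """Build the same POST the curl examples above send to /process."""
    body = json.dumps({"text": text, "model": model}).encode("utf-8")
    return Request(f"{base_url}/process", data=body,
                   headers={"Content-Type": "application/json"})

def call_process(text: str, model: str) -> dict:
    """Send the request and return the parsed JSON response (network call)."""
    with urlopen(build_process_request(text, model)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```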
Common entities recognized by spaCy and custom logic:
- PERSON: Individual human names (fictional or real).
- GPE: Geo-political entities like countries, states, and cities.
- ORG: Organizations, companies, agencies.
- DATE: Specific dates or date ranges.
- TIME: Times of day or durations within a day.
- MONEY: Monetary values and currencies.
- NORP: Nationalities, religious groups, political organizations.
- LOC: Non-GPE locations like mountains, lakes, etc.
- FAC: Facilities, buildings, airports.
- EMAIL, PHONE: Can be recognized via custom rules or regex.
- IP_ADDRESS, SSN, CREDIT_CARD, PASSWORD: Handled via regex patterns.
The service replaces these PII instances with placeholder tags (e.g., `<PERSON>`, `<IP_ADDRESS>`).
- Make sure to run `uvicorn` and set API keys in the same environment and session.
- If you encounter `no such table: requests`, remove `ollama_log.db` and restart the server.
- Adjust the model name and LLM endpoint logic as needed for different providers.
This README should help you understand how to use and extend this PII scrubbing and LLM integration service.