docs(weave): Document autopatch_settings for PII nb and general settings
J2-D2-3PO committed Jan 3, 2025
1 parent d995ff0 commit 6fd27d1
200 changes: 165 additions & 35 deletions docs/notebooks/pii.ipynb
@@ -10,7 +10,6 @@
"---\n",
"docusaurus_head_meta::end -->\n",
"\n",

"<!--- @wandbcode{cod-notebook} -->"
]
},
@@ -29,16 +28,16 @@
"id": "C70egOGRLCgm"
},
"source": [
"In this tutorial, we'll demonstrate how to utilize Weave while ensuring your Personally Identifiable Information (PII) data remains private. Weave supports removing PII from LLM calls and preventing PII from being displayed in the Weave UI. \n",
"In this guide, you'll learn how to use Weave while ensuring your Personally Identifiable Information (PII) data remains private. With Weave, you can remove PII from LLM input and output, prevent PII from being displayed in the Weave UI, and define \n",
"\n",
"To detect and protect our PII data, we'll identify and redact PII data and optionally anonymize it with the following methods:\n",
"1. __Regular expressions__ to identify PII data and redact it.\n",
"2. __Microsoft's [Presidio](https://microsoft.github.io/presidio/)__, a python-based data protection SDK. This tool provides redaction and replacement functionalities.\n",
"3. __[Faker](https://faker.readthedocs.io/en/master/)__, a Python library to generate fake data, combined with Presidio to anonymize PII data.\n",
"\n",
"Additionally, we'll make use of _Weave Ops input/output logging customization_ to seamlessly integrate PII redaction and anonymization into the workflow. See [here](https://weave-docs.wandb.ai/guides/tracking/ops/#customize-logged-inputs-and-outputs) for more information.\n",
"Additionally, you'll learn how to use both _Weave Ops input/output logging customization_ and _`autopatch_settings`_ to integrate PII redaction and anonymization into the workflow. For more information, see [Customize logged inputs and outputs](https://weave-docs.wandb.ai/guides/tracking/ops/#customize-logged-inputs-and-outputs).\n",
"\n",
"For this use-case, we will leverage Anthropic's Claude Sonnet to perform sentiment analysis while tracing the LLM calls using Weave's [Traces](https://wandb.github.io/weave/quickstart). Sonnet will receive a block of text and output one of the following sentiment classifications: _positive_, _negative_, or _neutral_."
"In this guide, you'll use Anthropic's Claude Sonnet to perform sentiment analysis while tracing LLM calls using [Traces](https://wandb.github.io/weave/quickstart). Claude Sonnet will receive a block of text and output one of the following sentiment classifications: _positive_, _negative_, or _neutral_."
]
},
{
@@ -89,7 +88,7 @@
"source": [
"# Setup\n",
"\n",
"Let's install the required packages and set up our API keys. Your Weights & Biases API key can be found [here](https://wandb.ai/authorize), and your Anthropic API keys are [here](https://console.anthropic.com/settings/keys)."
"1. First, install the required packages. "
]
},
{
@@ -110,6 +109,16 @@
"!pip install cryptography # to encrypt our data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Set up your API keys. You can find your API keys at the following the links.\n",
"\n",
" - [W&B](https://wandb.ai/authorize)\n",
" - [Anthropic](https://console.anthropic.com/settings/keys)."
]
},
{
"cell_type": "code",
"execution_count": 4,
@@ -126,6 +135,13 @@
"_ = set_env(\"WANDB_API_KEY\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Initialize your Weave project."
]
},
{
"cell_type": "code",
"execution_count": 6,
@@ -154,7 +170,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load our initial PII data. For demonstration purposes, we'll use a dataset containing 10 text blocks. A larger dataset with 1000 entries is available."
"4. Load the demo PII dataset, which contains 10 text blocks. "
]
},
{
@@ -184,16 +200,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Redaction Methods Implementation"
"## Redaction methods overview\n",
"\n",
"Once you've completed the [setup](#setup), you can \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 1: Regular Expression Filtering\n",
"### Method 1: Filter using regular expressions\n",
"\n",
"Our initial method is to use [regular expressions (regex)](https://docs.python.org/3/library/re.html) to identify PII data and redact it. It allows us to define patterns that can match various formats of sensitive information like phone numbers, email addresses, and social security numbers. By using regex, we can scan through large volumes of text and replace or redact information without the need for more complex NLP techniques. "
"The simplest method is to use [regular expressions (regex)](https://docs.python.org/3/library/re.html) to identify PII data and redact it. It allows us to define patterns that can match various formats of sensitive information like phone numbers, email addresses, and social security numbers. By using regex, we can scan through large volumes of text and replace or redact information without the need for more complex NLP techniques. "
]
},
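
The notebook's regex implementation is collapsed in this diff view. As a minimal sketch of the approach this cell describes (the pattern set and the `redact_with_regex` name are illustrative assumptions, not necessarily the notebook's exact code):

```python
import re

# Illustrative patterns; production PII matching usually needs more robust expressions
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_with_regex(text: str) -> str:
    """Replace every pattern match with a placeholder naming the PII type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact_with_regex("Call me at 212-555-5555 or mail alex@example.com"))
# -> Call me at <PHONE_NUMBER> or mail <EMAIL_ADDRESS>
```

Because each pattern is applied in turn, overlapping matches are resolved in dictionary order; adjust the ordering if one pattern should take precedence.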
{
@@ -287,15 +305,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 2: Microsoft Presidio Redaction\n",
"Our next method involves complete removal of PII data using Presidio. This approach redacts PII and replaces it with a placeholder representing the PII type. \n",
"\n",
"For example:\n",
"`\"My name is Alex\"` becomes `\"My name is <PERSON>\"`.\n",
"### Method 2: Redact using Microsoft Presidio \n",
"Our next method involves complete removal of PII data using [Microsoft Presidio](https://microsoft.github.io/presidio/). Presidio redacts PII and replaces it with a placeholder representing the PII type. For example, Presidio replaces `Alex` in `\"My name is Alex\"` with `<PERSON>`.\n",
"\n",
"Presidio comes with a built-in [list of recognizable entities](https://microsoft.github.io/presidio/supported_entities/). We can select the ones that are important for our use case. In the below example, we redact names, phone numbers, locations, email addresses, and US Social Security Numbers.\n",
"\n",
"We'll then encapsulate the Presidio process into a function."
"Presidio comes with a built-in support for [common entities](https://microsoft.github.io/presidio/supported_entities/). In the below example, we redact all entities that are a `PHONE_NUMBER`, `PERSON`, `LOCATION`, `EMAIL_ADDRESS` or `US_SSN`. The Presidio process is encapsulated in a function."
]
},
{
@@ -366,13 +379,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 3: Anonymization with Replacement using Fakr and Presidio\n",
"### Method 3: Anonymize with replacement using Faker and Presidio\n",
"\n",
"Instead of redacting text, we can anonymize it by swapping PII (like names and phone numbers) with fake data generated using the [Faker](https://faker.readthedocs.io/en/master/) Python library. For example:\n",
"Instead of redacting text, we can anonymize it by swapping PII like names and phone numbers with fake data generated using the [Faker](https://faker.readthedocs.io/en/master/) Python library. Using Faker, we can process the following data:\n",
"\n",
"`\"My name is Raphael and I like to fish. My phone number is 212-555-5555\"` \n",
"\n",
"might become\n",
"Once processing is complete, the data might look like:\n",
"\n",
"`\"My name is Katherine Dixon and I like to fish. My phone number is 667.431.7379\"`\n",
"\n",
@@ -543,24 +556,58 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Applying the Methods to Weave Calls\n",
"### Method 4: Use `autopatch_settings` \n",
"\n",
"You can use `autopatch_settings` to configure PII handling directly during initialization for one or more of the supported LLM integrations. The advantages of this method are:\n",
"\n",
"1. PII handling logic is centralized and scoped at initialization, reducing the need for scattered custom logic.\n",
"2. PII processing workflows can be customized or disabled entirely for specific intergations.\n",
"\n",
"To use `autopatch_settings` to configure PII handling, define `postprocess_inputs` and/or `postprocess_output` in `op_settings` for any one of the supported LLM integrations. \n",
"\n",
"```python \n",
"client = weave.init(\n",
" ...,\n",
" autopatch_settings={\n",
" \"openai\": {\n",
" \"op_settings\": {\n",
" \"postprocess_inputs\": ...,\n",
" \"postprocess_output\": ...,\n",
" }\n",
" },\n",
" \"anthropic\": {\n",
" \"op_settings\": {\n",
" \"postprocess_inputs\": ...,\n",
" \"postprocess_output\": ...,\n",
" }\n",
" }\n",
" },\n",
")\n",
"```\n"
]
},
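
The `postprocess_inputs` callable receives the dictionary of inputs that would be logged and returns a (possibly modified) dictionary. As a hedged sketch of such a function, independent of any particular integration (the function name and the shape of the inputs dict are illustrative assumptions):

```python
import re
from typing import Any

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub_email_inputs(inputs: dict[str, Any]) -> dict[str, Any]:
    """Redact email addresses in string-valued inputs before they are logged."""
    return {
        key: EMAIL_RE.sub("<EMAIL_ADDRESS>", value) if isinstance(value, str) else value
        for key, value in inputs.items()
    }


# A function like this would be passed as the `postprocess_inputs` value in `op_settings`.
print(scrub_email_inputs({"prompt": "mail alex@example.com", "n": 3}))
```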
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Applying the methods to Weave calls\n",
"\n",
"In these examples we will integrate our PII redaction and anonymization methods into Weave Models, and preview the results in Weave Traces.\n",
"In the following examples, we will integrate our PII redaction and anonymization methods into Weave Models and preview the results in Weave Traces.\n",
"\n",
"We'll create a [Weave Model](https://wandb.github.io/weave/guides/core-types/models) which is a combination of data (which can include configuration, trained model weights, or other information) and code that defines how the model operates. \n",
"First, we'll create a [Weave Model](https://wandb.github.io/weave/guides/core-types/models). A Weave Model is a combination of information like configuration settings, model weights, and code that defines how the model operates. \n",
"\n",
"In this model, we will include our predict function where the Anthropic API will be called. Additionally, we will include our postprocessing functions to ensure that our PII data is redacted or anonymized before it is sent to the LLM.\n",
"In our model, we will include our predict function where the Anthropic API will be called. Additionally, we will include our postprocessing functions to ensure that our PII data is redacted or anonymized before it is sent to the LLM.\n",
"\n",
"Once you run this code you will receive a links to the Weave project page as well as the specific trace (LLM calls)you ran."
"Once you run this code, you will receive a links to the Weave project page, as well as the specific trace (LLM calls) you ran."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regex Method \n",
"### Regex method \n",
"\n",
"In the simplest case, we can use regex to identify and redact PII data in the original text."
"In the simplest case, we can use regex to identify and redact PII data from the original text."
]
},
{
@@ -651,9 +698,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Presidio Redaction Method\n",
"### Presidio redaction method\n",
"\n",
"Here we will use Presidio to identify and redact PII data in the original text."
"Next, we will use Presidio to identify and redact PII data from the original text."
]
},
{
@@ -784,9 +831,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Faker + Presidio Replacement Method\n",
"### Faker and Presidio replacement method\n",
"\n",
"Here we will have Faker generate anonymized replacement PII data and use Presidio to identify and replace the PII data in the original text.\n"
"In this example, we use Faker to generate anonymized replacement PII data and use Presidio to identify and replace the PII data in the original text.\n"
]
},
{
@@ -886,18 +933,101 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checklist for Safely Using Weave with PII Data\n",
"### `autopatch_settings` method\n",
"\n",
"In the following example, we set `postprocess_inputs` for `anthropic` to the `postprocess_inputs_regex()` function () at initialization. The `postprocess_inputs_regex` function applies the`redact_with_regex` method defined in [Method 1: Regular Expression Filtering](#method-1-regular-expression-filtering). Now, `redact_with_regex` will be applied to all inputs to any `anthropic` models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from typing import Any\n",
"\n",
"import anthropic\n",
"\n",
"import weave\n",
"\n",
"client = weave.init(\n",
" ...,\n",
" autopatch_settings={\n",
" \"anthropic\": {\n",
" \"op_settings\": {\n",
" \"postprocess_inputs\": postprocess_inputs_regex,\n",
" }\n",
" }\n",
" },\n",
")\n",
"\n",
"\n",
"# Define an input postprocessing function that applies our regex redaction for the model prediction Weave Op\n",
"def postprocess_inputs_regex(inputs: dict[str, Any]) -> dict:\n",
" inputs[\"text_block\"] = redact_with_regex(inputs[\"text_block\"])\n",
" return inputs\n",
"\n",
"\n",
"# Weave model / predict function\n",
"class sentiment_analysis_regex_pii_model(weave.Model):\n",
" model_name: str\n",
" system_prompt: str\n",
" temperature: int\n",
"\n",
" async def predict(self, text_block: str) -> dict:\n",
" client = anthropic.AsyncAnthropic()\n",
" response = await client.messages.create(\n",
" max_tokens=1024,\n",
" model=self.model_name,\n",
" system=self.system_prompt,\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": text_block}]}\n",
" ],\n",
" )\n",
" result = response.content[0].text\n",
" if result is None:\n",
" raise ValueError(\"No response from model\")\n",
" parsed = json.loads(result)\n",
" return parsed"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create our LLM model with a system prompt\n",
"model = sentiment_analysis_regex_pii_model(\n",
" name=\"claude-3-sonnet\",\n",
" model_name=\"claude-3-5-sonnet-20240620\",\n",
" system_prompt='You are a Sentiment Analysis classifier. You will be classifying text based on their sentiment. Your input will be a block of text. You will answer with one the following rating option[\"positive\", \"negative\", \"neutral\"]. Your answer should be one word in json format: {classification}. Ensure that it is valid JSON.',\n",
" temperature=0,\n",
")\n",
"\n",
"print(\"Model: \", model)\n",
"# for every block of text, anonymized first and then predict\n",
"for entry in pii_data:\n",
" await model.predict(entry[\"text\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checklist for Safely Using Weave with PII Data\n",
"\n",
"### During Testing\n",
"#### During Testing\n",
"- Log anonymized data to check PII detection\n",
"- Track PII handling processes with Weave Traces\n",
"- Measure anonymization performance without exposing real PII\n",
"\n",
"### In Production\n",
"#### In production\n",
"- Never log raw PII\n",
"- Encrypt sensitive fields before logging\n",
"\n",
"### Encryption Tips\n",
"#### Encryption tips\n",
"- Use reversible encryption for data you need to decrypt later\n",
"- Apply one-way hashing for unique IDs you don't need to reverse\n",
"- Consider specialized encryption for data you need to analyze while encrypted"
