Llm prompt to docs #2541

Merged · 5 commits · Dec 6, 2024
26 changes: 16 additions & 10 deletions docs/demos/tutorials/00_Tutorial_Introduction.ipynb
@@ -46,15 +46,7 @@
"\n",
"Throughout the tutorial, we use the duckdb backend, which is the recommended option for smaller datasets of up to around 1 million records on a normal laptop.\n",
"\n",
"You can find these tutorial notebooks in the `docs/demos/tutorials/` folder of the [splink repo](https://github.com/moj-analytical-services/splink/tree/master/docs/demos/tutorials), or click the Colab links to run in your browser.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
"You can find these tutorial notebooks in the `docs/demos/tutorials/` folder of the [splink repo](https://github.com/moj-analytical-services/splink/tree/master/docs/demos/tutorials), or click the Colab links to run in your browser."
]
},
{
@@ -71,6 +63,20 @@
"\n",
"If you'd like to learn more about record linkage theory, an interactive introduction is available [here](https://www.robinlinacre.com/intro_to_probabilistic_linkage/)."
]
},
{
"cell_type": "markdown",
"id": "8c28bba7",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"source": [
"## LLM prompts\n",
"\n",
"If you're using an LLM to suggest Splink code, see [here](./topic_guides/llms/prompting_llms.md) for suggested prompts and context."
]
}
],
"metadata": {
@@ -99,4 +105,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
1 change: 1 addition & 0 deletions docs/getting_started.md
@@ -102,6 +102,7 @@ To get a basic Splink model up and running, use the following code. It demonstra
df_clusters = clusters.as_pandas_dataframe(limit=5)
```

If you're using an LLM to suggest Splink code, see [here](./topic_guides/llms/prompting_llms.md) for suggested prompts and context.

## Tutorials

99 changes: 99 additions & 0 deletions docs/topic_guides/llms/prompting_llms.md
@@ -0,0 +1,99 @@
# Using LLMs such as ChatGPT to help you write Splink code

We provide two files that summarise the Splink docs to help you use Splink with LLMs:

[Concise Splink Documentation Summary](https://gist.githubusercontent.com/RobinL/2257a0253d80ab8e651a5ec08bc3d00a/raw/e3b00bcd07cce609cf4ee60908d9e6d39c5b9c5f/llm_context_short.txt) - around 22,000 tokens

[Comprehensive Splink Documentation Summary](https://gist.githubusercontent.com/RobinL/2257a0253d80ab8e651a5ec08bc3d00a/raw/e3b00bcd07cce609cf4ee60908d9e6d39c5b9c5f/llm_context_long.txt) - around 72,000 tokens

At present, only the concise summary is short enough for ChatGPT. The longer one fits within Claude's current long-context (200k token) models.
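
If you want to check whether a summary will fit in your model's context window, you can count the tokens yourself. The sketch below shows one way to do this; it assumes the third-party `tiktoken` package is installed, and exact counts will vary between tokenisers and providers.

```python
import urllib.request

import tiktoken  # assumed installed: pip install tiktoken

CONCISE_URL = (
    "https://gist.githubusercontent.com/RobinL/2257a0253d80ab8e651a5ec08bc3d00a/"
    "raw/e3b00bcd07cce609cf4ee60908d9e6d39c5b9c5f/llm_context_short.txt"
)

# Download the concise documentation summary
with urllib.request.urlopen(CONCISE_URL) as response:
    summary_text = response.read().decode("utf-8")

# cl100k_base is the tokeniser used by many recent OpenAI models;
# other models will give somewhat different counts
encoding = tiktoken.get_encoding("cl100k_base")
print(f"Concise summary: ~{len(encoding.encode(summary_text)):,} tokens")
```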

### Recommended use

To use these files, we suggest copying and pasting their contents into the LLM and then appending your question.

For instance, you might paste the contents and then ask:

```
I have a dataset I want to dedupe using duckdb

It has columns full_name, dob, occupation, zip

Do some preparatory data cleaning and run a splink script to train a model and dedupe
```

This results in the following response (from the ChatGPT `o1` model):

<details>
<summary>Click to see example response</summary>

```python
import duckdb
from splink import Linker, SettingsCreator, DuckDBAPI, block_on
import splink.comparison_library as cl

# Load and clean data in DuckDB
con = duckdb.connect()
con.execute("""
CREATE TABLE raw_data AS SELECT * FROM read_csv('your_data.csv', HEADER=TRUE)
""")

# Example cleaning steps: lowercasing fields, removing punctuation from full_name, standardizing dob format
con.execute("""
CREATE TABLE cleaned AS
SELECT
regexp_replace(lower(full_name), '[^a-z ]', '', 'g') as full_name,
regexp_replace(dob, '[^0-9-]', '', 'g') as dob,
lower(occupation) as occupation,
regexp_replace(lower(zip), '[^a-z0-9 ]', '', 'g') as zip
FROM raw_data
""")

df = con.table("cleaned")

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("full_name"),
        block_on("dob"),
        block_on("zip"),
    ],
    comparisons=[
        cl.NameComparison("full_name"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
        cl.LevenshteinAtThresholds("zip", 2),
    ],
    retain_intermediate_calculation_columns=True,
)

db_api = DuckDBAPI()
linker = Linker(df, settings, db_api=db_api)

deterministic_rules = [
    "l.full_name = r.full_name and l.dob = r.dob",
    "l.zip = r.zip and levenshtein(l.full_name, r.full_name) < 2"
]

linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("full_name"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

predictions = linker.inference.predict(threshold_match_probability=0.9)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(predictions, threshold_match_probability=0.95)
```
</details>

!!! warning

Whilst these prompts can be helpful for generating some initial code, there's **no guarantee** it will be correct.

However, in many cases it can provide a useful starting point.

### How these prompts are generated

These files are generated with the following scripts, which you may wish to modify for your purposes:

[Create LLM Prompt Long Script](https://github.com/moj-analytical-services/splink/blob/master/scripts/create_llm_prompt_long.py)

[Create LLM Prompt Short Script](https://github.com/moj-analytical-services/splink/blob/master/scripts/create_llm_prompt_short.py)
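
If you want to regenerate these context files locally, the following sketch shows one way to do so. It assumes you run it from the `scripts/` directory of a cloned splink repo (both scripts read the docs via `../docs` relative paths and write their output to the current directory), and that you have network access, since the scripts also fetch some content from GitHub.

```python
import subprocess

# Regenerate both LLM context files; run from the scripts/ directory of a
# cloned splink repo, since the scripts use ../docs relative paths
for script in ["create_llm_prompt_long.py", "create_llm_prompt_short.py"]:
    subprocess.run(["python", script], check=True)

# Outputs are written to the current directory as
# llm_context_long.txt and llm_context_short.txt
```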
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -225,6 +225,7 @@ nav:
- Model Evaluation:
- accuracy chart from labels table: "charts/accuracy_analysis_from_labels_table.ipynb"
- threshold selection tool: charts/threshold_selection_tool_from_labels_table.ipynb
- LLM prompts: "topic_guides/llms/prompting_llms.md"
- Contributing:
- "dev_guides/index.md"
- Contributing to Splink:
57 changes: 46 additions & 11 deletions scripts/create_llm_prompt_long.py
@@ -1,6 +1,7 @@
import inspect
import logging
import os
import urllib.request

import nbformat

@@ -137,9 +138,7 @@ def extract_all_function_docstrings_from_module(module):
return docstrings


def save_docstrings_with_append(
docstrings, docstring_filename="llm_context_long.txt", append_filenames=None
):
def save_docstrings_with_append(docstrings, docstring_filename, append_filenames=None):
append_content = ""
if append_filenames:
for filename in append_filenames:
@@ -196,7 +195,7 @@ def save_docstrings_with_append(

print( # NOQA: T201
"\nDocstrings extracted, saved, and organized into sections in "
"llm_context_long.txt"
f"{docstring_filename}"
)


@@ -217,8 +216,32 @@ def extract_and_append_md_content(md_files, docstring_filename):
print(f"Warning: File {full_path} not found.") # NOQA: T201


def fetch_url_content(url):
with urllib.request.urlopen(url) as response:
return response.read().decode("utf-8")


additional_instructions_to_llms = """
If the user asks for Splink code, when writing Splink code use ONLY
functions and methods which you've seen used in the context provided. Do NOT guess at
other Splink methods or functions.

If you don't know how to do something, put a comment referring them to
https://moj-analytical-services.github.io/splink

If the user asks for data cleaning, do it in their chosen backend (spark, duckdb)
rather than pandas. In general avoid pandas unless specifically asked for.

In most cases, you should assume the incoming data is not clean and suggest some simple
data cleaning steps.
"""

# Main execution
if __name__ == "__main__":
# Clear the output file at start
output_filename = "llm_context_long.txt"
open(output_filename, "w").close()

# Extract docstrings for all public methods in specified Linker submodules
linker_docstrings = extract_method_docstrings(
linker,
@@ -277,7 +300,7 @@ def extract_and_append_md_content(md_files, docstring_filename):
print("Extracting and saving docstrings...") # NOQA: T201
save_docstrings_with_append(
all_docstrings,
"llm_context_long.txt",
output_filename,
append_filenames=[
"../docs/api_docs/settings_dict_guide.md",
"../docs/api_docs/datasets.md",
@@ -288,8 +311,8 @@ def extract_and_append_md_content(md_files, docstring_filename):
demos_examples_dir = "../docs/demos/examples"
demos_tutorials_dir = "../docs/demos/tutorials"

extract_and_append_notebook_content(demos_examples_dir, "llm_context_long.txt")
extract_and_append_notebook_content(demos_tutorials_dir, "llm_context_long.txt")
extract_and_append_notebook_content(demos_examples_dir, output_filename)
extract_and_append_notebook_content(demos_tutorials_dir, output_filename)

# New part: Append content from specified Markdown files
mds_to_append = [
@@ -304,9 +327,21 @@ def extract_and_append_md_content(md_files, docstring_filename):
"/docs/topic_guides/performance/performance_evaluation.md",
"/docs/api_docs/settings_dict_guide.md",
]
extract_and_append_md_content(mds_to_append, "llm_context_long.txt")
extract_and_append_md_content(mds_to_append, output_filename)

# Fetch and append content from the URL
url = "https://gist.githubusercontent.com/RobinL/edb10e93caeaf47c675cbfa189e4e30c/raw/fbe773db3002663dd3ddb439e38d2a549358e713/top_tips.md"
splink_tips = fetch_url_content(url)
with open(output_filename, "a", encoding="utf-8") as f:
f.write("\n\nSplink Tips:\n")
f.write(splink_tips)

# Append additional instructions to the output file
with open(output_filename, "a", encoding="utf-8") as f:
f.write("IMPORTANT Instructions to LLMs:")
f.write(additional_instructions_to_llms)

print( # NOQA: T201
"Docstrings extracted, saved, and all specified content "
"appended to llm_context_long.txt"
)
"Docstrings extracted, saved, and all specified content including tips and "
f"instructions appended to {output_filename}"
) # NOQA: T201
43 changes: 41 additions & 2 deletions scripts/create_llm_prompt_short.py
@@ -1,4 +1,5 @@
import os
import urllib.request

import nbformat

@@ -24,6 +25,11 @@ def extract_and_append_notebook_code(base_dir, output_filename):
for file in files:
if file.endswith(".ipynb") and not file.endswith("-checkpoint.ipynb"):
notebook_path = os.path.join(root, file)
# Skip files with athena or sqlite in path
if any(x in notebook_path.lower() for x in ["athena", "sqlite"]):
print(f"Skipping {notebook_path} due to athena/sqlite...") # noqa: T201
continue

if ".ipynb_checkpoints" not in notebook_path:
print(f"Processing {notebook_path}...") # noqa: T201
code = extract_notebook_code(notebook_path)
@@ -53,9 +59,30 @@ def extract_and_append_md_content(md_files, output_filename):
print(f"Warning: File {full_path} not found.") # noqa: T201


def fetch_url_content(url):
with urllib.request.urlopen(url) as response:
return response.read().decode("utf-8")


additional_instructions_to_llms = """
If the user asks for Splink code, when writing Splink code use ONLY
functions and methods which you've seen used in the context provided. Do NOT guess at
other Splink methods or functions.

If you don't know how to do something, put a comment referring them to
https://moj-analytical-services.github.io/splink

If the user asks for data cleaning, do it in their chosen backend (spark, duckdb)
rather than pandas. In general avoid pandas unless specifically asked for.

In most cases, you should assume the incoming data is not clean and suggest some simple
data cleaning steps.
"""

# Main execution
if __name__ == "__main__":
output_filename = "llm_context_short.txt"
open(output_filename, "w").close()

# Extract and save Python code from notebooks in the specified directories
demos_examples_dir = "../docs/demos/examples"
@@ -71,7 +98,19 @@ def extract_and_append_md_content(md_files, output_filename):
]
extract_and_append_md_content(mds_to_append, output_filename)

# Fetch and append content from the URL
url = "https://gist.githubusercontent.com/RobinL/edb10e93caeaf47c675cbfa189e4e30c/raw/fbe773db3002663dd3ddb439e38d2a549358e713/top_tips.md"
splink_tips = fetch_url_content(url)
with open(output_filename, "a", encoding="utf-8") as f:
f.write("\n\nSplink Tips:\n")
f.write(splink_tips)

# Append additional instructions to the output file
with open(output_filename, "a", encoding="utf-8") as f:
f.write("IMPORTANT Instructions to LLMs:")
f.write(additional_instructions_to_llms)

print( # noqa: T201
"Python code from notebooks and markdown content extracted and saved to "
"extracted_python_code_and_markdown.txt"
"Python code from notebooks, markdown content, Splink tips, and additional"
" instructions extracted and saved to llm_context_short.txt"
)