LLMs have shown impressive capabilities across a variety of tasks, such as semantic understanding, reasoning, and code synthesis. These capabilities have the potential to assist data scientists in data preparation and exploration tasks. Computational notebooks have become popular tools for data scientists to explore, analyze, and prepare data. Notebooks provide a richer contextual environment than single-step code generation tasks and benchmarks because of their live execution state and multi-step nature.
We build upon the ARCADE benchmark and experiments and propose novel prompting techniques that leverage contextual information in a stateful notebook environment. We explore combining execution state with various prompting methods to improve on the original ARCADE benchmark results, and we evaluate our prompting methods with the Llama 3 family of models. Our experiments highlight the importance of designing domain-specific prompts that incorporate all available contextual information to enable stronger model performance on data science tasks in a stateful notebook environment.
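The core idea can be illustrated with a minimal, self-contained sketch. All function names below are hypothetical and not the repository's actual API: the live notebook namespace is serialized into a textual summary and concatenated with the cell history and the natural-language intent to form the model prompt.

```python
import pandas as pd

def summarize_state(namespace: dict) -> str:
    """Render live notebook variables as prompt context (hypothetical helper)."""
    lines = []
    for name, value in namespace.items():
        if isinstance(value, pd.DataFrame):
            # Surface schema and a small preview rather than the full data.
            lines.append(f"# DataFrame `{name}`: columns={list(value.columns)}")
            lines.append(value.head(3).to_string())
        else:
            lines.append(f"# {name} = {value!r}")
    return "\n".join(lines)

def build_prompt(history_cells: list[str], namespace: dict, intent: str) -> str:
    """Combine prior cells, execution state, and the NL intent into one prompt."""
    return "\n\n".join([
        "\n".join(history_cells),    # multi-step notebook context
        summarize_state(namespace),  # live execution state
        f"# Task: {intent}",         # natural-language request
    ])

# Example usage with a toy notebook state:
df = pd.DataFrame({"city": ["Oslo", "Bergen"], "pop": [700_000, 290_000]})
print(build_prompt(["import pandas as pd"], {"df": df}, "Sort cities by population."))
```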
- execution.py: Generates execution metadata resources, including variables, outputs, and runtime information.
- analysis.py: Parses and extracts meta-information from code, such as structure, dependencies, and execution details.
- experiments.py: Core functionality for generating predictions, running experiments, and creating datasets.
- explore.py: Gradio application for exploring and visualizing experiment results interactively.
- llm.py: Interfaces with Large Language Models (LLMs).
- multistep.py: Manages multi-step message chains.
- prompt_templates.py: Contains prompt template strings (see the sketch after this list).
- prompt_utils.py: Utility functions for prompt generation.
- prompts.py: Core functions for building and managing prompts.
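As a rough illustration of how these modules fit together, here is a minimal, self-contained sketch in the spirit of prompt_templates.py and prompt_utils.py. The template text and helper name are illustrative assumptions, not the repository's actual contents.

```python
# Hypothetical template in the spirit of prompt_templates.py; the repo's
# actual templates may differ.
NOTEBOOK_TEMPLATE = (
    "You are completing a cell in a data-science notebook.\n"
    "Previous cells:\n{history}\n\n"
    "Live variables:\n{state}\n\n"
    "Task: {intent}\n"
    "Next cell:"
)

def render_prompt(history: str, state: str, intent: str) -> str:
    """Fill the template with notebook context (prompt_utils-style helper,
    name illustrative)."""
    return NOTEBOOK_TEMPLATE.format(history=history, state=state, intent=intent)

print(render_prompt(
    history="import pandas as pd\ndf = pd.read_csv('sales.csv')",
    state="df: DataFrame with columns ['region', 'revenue']",
    intent="Compute total revenue per region.",
))
```

The rendered string would then be passed to the model interface in llm.py, with multistep.py handling any follow-up turns in a message chain.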
- Notebooks: Prototyping and experimentation notebooks.
- Models: Tokenizer and code interface for Llama 3 models.
- Resources: Extracted execution information, return types, and exemplars for prompts.
- Artifacts: Raw datasets and notebooks forming the ARCADE dataset.
- Datasets: Generated prompt datasets for the different experiments.
- arcade_nl2code: Evaluation code and utilities for building the initial dataset from the original ARCADE paper.
Please refer to the ARCADE repository for instructions on building the original dataset.
Natural Language to Code Generation in Interactive Data Science Notebooks (Yin et al., ACL 2023)