Small tweaks to improve the glaive dataset blog post. (#938)
- Reduce progress-event logging to once every 0.5 seconds, then multiply
that interval by shard_count, cutting the overall number of log events
by 24x when running a 12-core multiprocess map. This is a hack; we
should use dask's progress reporting properly once we switch to streaming.
- Restart the dask client when executing a new task if no tasks are
pending. This makes the second run much faster because workers don't
GC intermittently.
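
The throttling scheme in the first bullet can be sketched as a standalone iterator wrapper. This is a minimal illustration of the idea (scale the emit interval by the shard count, and jitter it so workers don't report in lockstep); the names here are illustrative, not Lilac's actual `report_progress` API.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar('T')

# Base interval: emit progress at most every 0.5s for a single shard.
BASE_EMIT_SEC = 0.5


def throttled_progress(
  it: Iterable[T],
  shard_count: int = 1,
  emit: Callable[[str], None] = print,
) -> Iterator[T]:
  """Yield items from `it`, emitting progress at a shard-scaled, jittered interval."""
  # Scale the interval by shard_count so the total event volume stays roughly
  # constant no matter how many worker shards are emitting concurrently.
  emit_every_sec = BASE_EMIT_SEC * max(shard_count, 1)
  # Jitter the schedule so all workers don't emit at the same moment.
  jitter_sec = random.uniform(0, emit_every_sec)
  last_emit = time.time() - emit_every_sec - jitter_sec
  for i, item in enumerate(it):
    now = time.time() + jitter_sec
    if now - last_emit > emit_every_sec:
      emit(f'processed {i + 1} items')
      last_emit = now
    yield item
```

With a 12-shard map, each shard emits at most once every 6 seconds instead of the previous per-shard cadence, which is where the 24x reduction comes from.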

Coding dataset notebook:
- Add a limit example to the notebook showing how to check whether the
output is what you want.

Also:
- Add the glaive dataset, with its map outputs, to the huggingface demo.
This is currently a bit awkward since I ran the map locally and pushed
the results, so it's not completely reproducible.
- Improve the notebook for curating a coding dataset.
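
The client-restart heuristic described above (restart only when nothing is pending) can be sketched like this. The `FakeClient` stands in for a real `dask.distributed.Client`, of which only `restart()` is used; all names here are illustrative, not Lilac's internals.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(Enum):
  PENDING = 'pending'
  COMPLETED = 'completed'


@dataclass
class Task:
  status: TaskStatus


class FakeClient:
  """Stand-in for dask.distributed.Client; only restart() is exercised here."""

  def __init__(self) -> None:
    self.restarts = 0

  def restart(self) -> None:
    self.restarts += 1


@dataclass
class TaskManager:
  client: FakeClient
  tasks: dict = field(default_factory=dict)

  def restart_client_if_no_tasks(self) -> None:
    # Restart workers between runs so accumulated garbage doesn't slow them
    # down, but never while another task is still pending.
    if not any(t.status == TaskStatus.PENDING for t in self.tasks.values()):
      self.client.restart()
```

The key design point is the guard: restarting is safe and cheap when the cluster is idle, but would kill in-flight work if any task were still pending.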


https://lilacai-lilac.hf.space/datasets#lilac/glaive&expandedStats=%7B%22answer_formatted.has_edit%22%3Atrue%7D&query=%7B%22filters%22%3A%5B%7B%22path%22%3A%5B%22answer_formatted%22%2C%22has_edit%22%5D%2C%22op%22%3A%22equals%22%2C%22value%22%3A1%7D%5D%7D&compareColumns=%5B%7B%22column%22%3A%5B%22answer%22%5D%2C%22compareToColumn%22%3A%5B%22answer_formatted%22%2C%22answer%22%5D%2C%22swapDirection%22%3Afalse%7D%5D&rowId=%22fffc265c-845e-4a2b-b3ce-2caa61fed0f4%22

Fixes #928
nsthorat authored Dec 7, 2023
1 parent 177e61b commit 51be6ce
Showing 6 changed files with 286 additions and 50 deletions.
51 changes: 27 additions & 24 deletions docs/blog/curate-coding-dataset.md
Original file line number Diff line number Diff line change
@@ -1,29 +1,32 @@
# Curate a coding dataset with Lilac

Dec 2, 2023
_Dec 7, 2023_

Good data is the engine that drives progress in AI. Companies that have control of their data can
add unique capabilities and differentiate their product. Beyond differentiation, companies are also
recognizing that building models on their own data reduces cost, and improves speed, control and
compliance.
add unique capabilities and differentiate their product. Companies are also recognizing that
building models with their own data reduces cost, and improves speed, control and compliance.

Data curation is often the most effective way to control how AI models behave. This process involves
standard procedures like de-duplication and PII scrubbing. However, focusing on the long-tail of
product specific requirements can deliver an amazing user experience. At Lilac, we also believe that
having more eyes on data ultimately leads to fundamental discoveries of how a model will behave,
giving the developer more control of their downstream AI product.
standard procedures like de-duplication and PII scrubbing, but also the long-tail of product
specific requirements that can deliver an amazing user experience.

At Lilac, we also believe that having more eyes on data ultimately leads to fundamental discoveries
of how a model will behave, giving the developer more control of their downstream AI product.

In this blog post, we'll delve into the excellent
[Glaive coding assistant](https://huggingface.co/datasets/glaiveai/glaive-code-assistant) dataset
with the goal of fine-tuning a code assistant model. We'll modify the dataset so that code outputted
by our AI product follows consistent formatting rules, and we'll visualize how the dataset has
changed.

<img src="../_static/curate_coding_dataset/glaive_preview.png">

## A First Look at the Glaive Dataset

Let's load the Glaive dataset into Lilac from the HuggingFace hub. In this example, we're going to
be using a Jupyter notebook (follow along
[here](https://github.com/lilacai/lilac/blob/main/notebooks/CurateCodingDataset.ipynb)).
[here](https://github.com/lilacai/lilac/blob/main/notebooks/CurateCodingDataset.ipynb)) or
[view the live demo on HuggingFace](https://lilacai-lilac.hf.space/datasets#lilac/glaive&expandedStats=%7B%22answer_formatted.has_edit%22%3Atrue%7D&query=%7B%22filters%22%3A%5B%7B%22path%22%3A%5B%22answer_formatted%22%2C%22has_edit%22%5D%2C%22op%22%3A%22equals%22%2C%22value%22%3A1%7D%5D%7D&compareColumns=%5B%7B%22column%22%3A%5B%22answer%22%5D%2C%22compareToColumn%22%3A%5B%22answer_formatted%22%2C%22answer%22%5D%2C%22swapDirection%22%3Afalse%7D%5D&rowId=%22fffc265c-845e-4a2b-b3ce-2caa61fed0f4%22).

```python
import lilac as ll
Expand Down Expand Up @@ -51,7 +54,7 @@ INFO: Uvicorn running on http://127.0.0.1:5432 (Press CTRL+C to quit)

You can see that the dataset consists of `question` and `answer` pairs, where the answer is in
markdown format, often containing python code blocks. Immediately we can see that the python
formatting is not consistent with our style, which will result in an AI product producing
formatting is not consistent with our desired style, which will result in an AI product producing
inconsistent code.

Let's standardize the model's code output by running the excellent
Expand All @@ -63,8 +66,8 @@ In our Jupyter notebook, we'll define a simple function that takes one row from
returns a new `answer_formatted` column that has two sub-fields:

1. `answer`: the rewritten output with formatted python code
2. `has_edit`: a bit that is true if the code formatter made any changes. We will use the bit in the
UI to filter on the rows that got updated.
2. `has_edit`: true when the code formatter made a change. We will use the bit in the UI to filter
on the rows that got updated.

To modify the dataset in Lilac, we will use [](#Dataset.map). To learn more about `Dataset.map`, see
the guide on [](../datasets/dataset_edit.md).
Expand Down Expand Up @@ -104,7 +107,7 @@ ds.map(format_code, output_column='answer_formatted', num_jobs=-1, execution_typ

## Dataset.map

`Dataset.map` is the main vehicle of making edits to the data. It's similar to HuggingFace's
`Dataset.map` is the main vehicle for making edits to data. It's similar to HuggingFace's
[`Dataset.map()`](https://huggingface.co/docs/datasets/process#map) with a few key differences:

- The output of Lilac's `Dataset.map` is always stored in a separate column. This enables tracking
Expand Down Expand Up @@ -139,10 +142,10 @@ different examples by using the left and right arrow keys.
The process of refining data is iterative. If the diff is not exactly what we like, we can change
the parameters to the formatter, re-run the map with `overwrite=True`, and see the new results.

If some of the edits are not what we want, we can manually label them as "bad" to clicking on the
label in the top left corner of the example. Then we can apply a filter on the "bad" examples to
make sure the new version of our map improved those examples. Conversely, we can also label good
examples as "good" and filter on those in future versions to make sure we didn't regress.
If some of the edits are not ideal, we can click on the label in the top left corner of the example
and tag it as "bad". Then we can apply a filter for "bad" examples and make sure that new versions
of our map improved on those examples. Conversely, we can tag "good" examples and see if we regress
as we iterate on the dataset.

<img src="../_static/curate_coding_dataset/label.png">

Expand All @@ -152,13 +155,13 @@ using the download dialog or the python API. See

## Going forward

In this blog post, we've shown how to use Lilac to curate a dataset for a code assistant model. We
used a formatter to standardize the python code outputted by our AI product. We then visualized the
changes to the dataset to understand the behavior of the formatter and any side-effects.
We believe that text is becoming the new programming language. It is the source code of LLMs.

At Lilac, we are building the tooling to work with this new programming language, bringing the
tooling, rigor, and best practices from software engineering to the development of the data behind
AI systems.

Looking ahead, the landscape of programming is undergoing a paradigm shift with the emergence of
text as a new programming language. Lilac plays a big part in this transformation as the traditional
boundaries between code and data dissolve.
There is much more to come!

We hope you enjoyed this blog post. If you have any questions or feedback, please reach out to us on
If you have any questions or feedback, please reach out to us on
[Discord](https://discord.gg/jNzw9mC8pp) or [Github](https://github.com/lilacai/lilac).
1 change: 1 addition & 0 deletions lilac/data/dataset_duckdb.py
Original file line number Diff line number Diff line change
Expand Up @@ -879,6 +879,7 @@ def _compute_disk_cached(
task_step_id=task_step_id,
initial_id=start_idx,
shard_id=shard_id,
shard_count=shard_count,
estimated_len=estimated_len,
step_description=task_step_description,
)
Expand Down
42 changes: 34 additions & 8 deletions lilac/tasks.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
import builtins
import functools
import multiprocessing
import random
import time
import traceback
import uuid
Expand Down Expand Up @@ -163,15 +164,16 @@ def __init__(self, dask_client: Optional[Client] = None) -> None:
async def _update_tasks(self) -> None:
adapter = TypeAdapter(list[TaskStepInfo])
for task_id, task in list(self._tasks.items()):
task_progress_topic = _progress_event_topic(task_id)
if task.status == TaskStatus.COMPLETED:
if task_id in self._task_threadpools:
threadpool = self._task_threadpools[task_id]
threadpool.shutdown()
# Clean up threaded events.
del THREADED_EVENTS[_progress_event_topic(task_id)]
if task_progress_topic in THREADED_EVENTS:
del THREADED_EVENTS[task_progress_topic]
continue

task_progress_topic = _progress_event_topic(task_id)
if task_id in self._dask_futures:
try:
step_events = cast(Any, self._dask_client.get_events(task_progress_topic))
Expand Down Expand Up @@ -322,6 +324,12 @@ def _set_task_completed(self, task_id: TaskId, task_future: Union[DaskFuture, Fu
if task_id in self._dask_futures:
del self._dask_futures[task_id]

def _restart_client_if_no_tasks(self) -> None:
# Check if any tasks are not completed. If not, we restart the dask client to free up memory.
tasks_pending = any(task.status == TaskStatus.PENDING for task in self._tasks.values())
if not tasks_pending:
self._dask_client.restart()

def _set_task_shard_completed(
self, task_id: TaskId, task_future: Union[DaskFuture, Future], num_shards: int
) -> None:
Expand All @@ -336,6 +344,9 @@ def execute(self, task_id: str, type: TaskExecutionType, task: TaskFn, *args: An
task_info = self._tasks[task_id]

if type == 'processes':
# Restart the workers to avoid GC slowing down the workers.
self._restart_client_if_no_tasks()

dask_task_id = _dask_task_id(task_id, None)
task_future = self._dask_client.submit(
functools.partial(_execute_task, task, task_info, dask_task_id),
Expand Down Expand Up @@ -376,8 +387,10 @@ def execute_sharded(

for i, (task, args) in enumerate(subtasks):
if type == 'processes':
dask_task_id = _dask_task_id(task_id, i)
# Restart the workers to avoid GC slowing down the workers.
self._restart_client_if_no_tasks()

dask_task_id = _dask_task_id(task_id, i)
task_future = self._dask_client.submit(
functools.partial(_execute_task, task, task_info, dask_task_id), *args, key=dask_task_id
)
Expand Down Expand Up @@ -543,14 +556,18 @@ def show_progress(
pbar.update(total_len - pbar.n)


# The interval to emit progress events.
EMIT_EVERY_SEC = 0.5


def report_progress(
it: Union[Iterator[TProgress], Iterable[TProgress]],
task_step_id: Optional[TaskStepId],
shard_id: Optional[int] = None,
shard_count: Optional[int] = None,
initial_id: Optional[int] = None,
estimated_len: Optional[int] = None,
step_description: Optional[str] = None,
emit_every_s: float = 0.25,
) -> Generator[TProgress, None, None]:
"""An iterable wrapper that emits progress and yields the original iterable."""
if not task_step_id or task_step_id[0] == '':
Expand All @@ -576,11 +593,15 @@ def report_progress(

it_idx = initial_id if initial_id else 0
start_time = time.time()
last_emit = time.time() - emit_every_s
# Reduce the emit frequency if there are multiple shards to reduce the size of the event stream.
emit_every_sec = EMIT_EVERY_SEC if not shard_count else EMIT_EVERY_SEC * shard_count
# Add jitter to the emit frequency to avoid all workers emitting at the same time.
jitter_sec = random.uniform(0, emit_every_sec)
last_emit = time.time() - EMIT_EVERY_SEC - jitter_sec

for t in it:
cur_time = time.time()
if estimated_len and cur_time - last_emit > emit_every_s:
cur_time = time.time() + jitter_sec
if estimated_len and cur_time - last_emit > emit_every_sec:
elapsed_sec = cur_time - start_time
it_per_sec = ((it_idx or 0) - (initial_id or 0.0)) / elapsed_sec
set_worker_task_progress(
Expand Down Expand Up @@ -693,7 +714,12 @@ def set_worker_task_progress(

steps[step_id].it_idx = it_idx
steps[step_id].estimated_len = estimated_len
steps[step_id].estimated_total_sec = estimated_total_sec
current_estimated_total_sec = steps[step_id].estimated_total_sec
steps[step_id].estimated_total_sec = (
max(current_estimated_total_sec, estimated_total_sec)
if current_estimated_total_sec
else estimated_total_sec
)
steps[step_id].elapsed_sec = elapsed_sec
steps[step_id].it_per_sec = it_per_sec

Expand Down
17 changes: 17 additions & 0 deletions lilac_hf_space.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,23 @@ datasets:
source_name: huggingface
dataset_name: imdb

- namespace: lilac
name: glaive
source:
dataset_name: glaiveai/glaive-code-assistant
source_name: huggingface
settings:
tags: [machine-learning]
ui:
view_type: 'single_item'
ui:
media_paths:
- question
- answer
- - answer_formatted
- answer
markdown_paths: []

- name: open-asssistant-conversations
namespace: lilac
settings:
Expand Down
