Small tweaks to improve the glaive dataset blog post. (#938)
- Reduce progress-event logging to once every 0.5 seconds, then multiply
that interval by shard_count, cutting the overall number of log events
by 24x when running a 12-core multiprocess map. This is a hack; we
should use dask's progress reporting properly once we switch to streaming.
- Restart the dask client when executing a new task if no tasks are
pending. This makes the second run much faster because workers don't
GC intermittently.
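
The throttling scheme in the first bullet can be sketched as a standalone iterator wrapper. This is a minimal illustration of the idea (scale the emit interval by the shard count, and jitter it so workers don't report in lockstep); the names here are illustrative, not Lilac's actual `report_progress` API.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar('T')

# Base interval: emit progress at most every 0.5s for a single shard.
BASE_EMIT_SEC = 0.5


def throttled_progress(
  it: Iterable[T],
  shard_count: int = 1,
  emit: Callable[[str], None] = print,
) -> Iterator[T]:
  """Yield items from `it`, emitting progress at a shard-scaled, jittered interval."""
  # Scale the interval by shard_count so the total event volume stays roughly
  # constant no matter how many worker shards are emitting concurrently.
  emit_every_sec = BASE_EMIT_SEC * max(shard_count, 1)
  # Jitter the schedule so all workers don't emit at the same moment.
  jitter_sec = random.uniform(0, emit_every_sec)
  last_emit = time.time() - emit_every_sec - jitter_sec
  for i, item in enumerate(it):
    now = time.time() + jitter_sec
    if now - last_emit > emit_every_sec:
      emit(f'processed {i + 1} items')
      last_emit = now
    yield item
```

With a 12-shard map, each shard emits at most once every 6 seconds instead of the previous per-shard cadence, which is where the 24x reduction comes from.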

Coding dataset notebook:
- Add a limit example to the notebook showing how to check whether the
output is what you want.

Also:
- Add the glaive dataset, with its map outputs, to the huggingface demo.
This is currently a bit awkward since I ran the map locally and pushed
the results, so it's not completely reproducible.
- Improve the notebook for curating a coding dataset.
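
The client-restart heuristic described above (restart only when nothing is pending) can be sketched like this. The `FakeClient` stands in for a real `dask.distributed.Client`, of which only `restart()` is used; all names here are illustrative, not Lilac's internals.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(Enum):
  PENDING = 'pending'
  COMPLETED = 'completed'


@dataclass
class Task:
  status: TaskStatus


class FakeClient:
  """Stand-in for dask.distributed.Client; only restart() is exercised here."""

  def __init__(self) -> None:
    self.restarts = 0

  def restart(self) -> None:
    self.restarts += 1


@dataclass
class TaskManager:
  client: FakeClient
  tasks: dict = field(default_factory=dict)

  def restart_client_if_no_tasks(self) -> None:
    # Restart workers between runs so accumulated garbage doesn't slow them
    # down, but never while another task is still pending.
    if not any(t.status == TaskStatus.PENDING for t in self.tasks.values()):
      self.client.restart()
```

The key design point is the guard: restarting is safe and cheap when the cluster is idle, but would kill in-flight work if any task were still pending.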


https://lilacai-lilac.hf.space/datasets#lilac/glaive&expandedStats=%7B%22answer_formatted.has_edit%22%3Atrue%7D&query=%7B%22filters%22%3A%5B%7B%22path%22%3A%5B%22answer_formatted%22%2C%22has_edit%22%5D%2C%22op%22%3A%22equals%22%2C%22value%22%3A1%7D%5D%7D&compareColumns=%5B%7B%22column%22%3A%5B%22answer%22%5D%2C%22compareToColumn%22%3A%5B%22answer_formatted%22%2C%22answer%22%5D%2C%22swapDirection%22%3Afalse%7D%5D&rowId=%22fffc265c-845e-4a2b-b3ce-2caa61fed0f4%22

Fixes #928
nsthorat authored Dec 7, 2023
1 parent 177e61b commit 51be6ce
Showing 6 changed files with 286 additions and 50 deletions.
51 changes: 27 additions & 24 deletions docs/blog/curate-coding-dataset.md
Original file line number Diff line number Diff line change
@@ -1,29 +1,32 @@
# Curate a coding dataset with Lilac

Dec 2, 2023
_Dec 7, 2023_

Good data is the engine that drives progress in AI. Companies that have control of their data can
add unique capabilities and differentiate their product. Beyond differentiation, companies are also
recognizing that building models on their own data reduces cost, and improves speed, control and
compliance.
add unique capabilities and differentiate their product. Companies are also recognizing that
building models with their own data reduces cost, and improves speed, control and compliance.

Data curation is often the most effective way to control how AI models behave. This process involves
standard procedures like de-duplication and PII scrubbing. However, focusing on the long-tail of
product specific requirements can deliver an amazing user experience. At Lilac, we also believe that
having more eyes on data ultimately leads to fundamental discoveries of how a model will behave,
giving the developer more control of their downstream AI product.
standard procedures like de-duplication and PII scrubbing, but also the long-tail of product
specific requirements that can deliver an amazing user experience.

At Lilac, we also believe that having more eyes on data ultimately leads to fundamental discoveries
of how a model will behave, giving the developer more control of their downstream AI product.

In this blog post, we'll delve into the excellent
[Glaive coding assistant](https://huggingface.co/datasets/glaiveai/glaive-code-assistant) dataset
with the goal of fine-tuning a code assistant model. We'll modify the dataset so that code outputted
by our AI product follows consistent formatting rules, and we'll visualize how the dataset has
changed.

<img src="../_static/curate_coding_dataset/glaive_preview.png">

## A First Look at the Glaive Dataset

Let's load the Glaive dataset into Lilac from the HuggingFace hub. In this example, we're going to
be using a Jupyter notebook (follow along
[here](https://github.com/lilacai/lilac/blob/main/notebooks/CurateCodingDataset.ipynb)).
[here](https://github.com/lilacai/lilac/blob/main/notebooks/CurateCodingDataset.ipynb)) or
[view the live demo on HuggingFace](https://lilacai-lilac.hf.space/datasets#lilac/glaive&expandedStats=%7B%22answer_formatted.has_edit%22%3Atrue%7D&query=%7B%22filters%22%3A%5B%7B%22path%22%3A%5B%22answer_formatted%22%2C%22has_edit%22%5D%2C%22op%22%3A%22equals%22%2C%22value%22%3A1%7D%5D%7D&compareColumns=%5B%7B%22column%22%3A%5B%22answer%22%5D%2C%22compareToColumn%22%3A%5B%22answer_formatted%22%2C%22answer%22%5D%2C%22swapDirection%22%3Afalse%7D%5D&rowId=%22fffc265c-845e-4a2b-b3ce-2caa61fed0f4%22).

```python
import lilac as ll
Expand Down Expand Up @@ -51,7 +54,7 @@ INFO: Uvicorn running on http://127.0.0.1:5432 (Press CTRL+C to quit)

You can see that the dataset consists of `question` and `answer` pairs, where the answer is in
markdown format, often containing python code blocks. Immediately we can see that the python
formatting is not consistent with our style, which will result in an AI product producing
formatting is not consistent with our desired style, which will result in an AI product producing
inconsistent code.

Let's standardize the model's code output by running the excellent
Expand All @@ -63,8 +66,8 @@ In our Jupyter notebook, we'll define a simple function that takes one row from
returns a new `answer_formatted` column that has two sub-fields:

1. `answer`: the rewritten output with formatted python code
2. `has_edit`: a bit that is true if the code formatter made any changes. We will use the bit in the
UI to filter on the rows that got updated.
2. `has_edit`: true when the code formatter made a change. We will use the bit in the UI to filter
on the rows that got updated.

To modify the dataset in Lilac, we will use [](#Dataset.map). To learn more about `Dataset.map`, see
the guide on [](../datasets/dataset_edit.md).
Expand Down Expand Up @@ -104,7 +107,7 @@ ds.map(format_code, output_column='answer_formatted', num_jobs=-1, execution_typ

## Dataset.map

`Dataset.map` is the main vehicle of making edits to the data. It's similar to HuggingFace's
`Dataset.map` is the main vehicle for making edits to data. It's similar to HuggingFace's
[`Dataset.map()`](https://huggingface.co/docs/datasets/process#map) with a few key differences:

- The output of Lilac's `Dataset.map` is always stored in a separate column. This enables tracking
Expand Down Expand Up @@ -139,10 +142,10 @@ different examples by using the left and right arrow keys.
The process of refining data is iterative. If the diff is not exactly what we like, we can change
the parameters to the formatter, re-run the map with `overwrite=True`, and see the new results.

If some of the edits are not what we want, we can manually label them as "bad" to clicking on the
label in the top left corner of the example. Then we can apply a filter on the "bad" examples to
make sure the new version of our map improved those examples. Conversely, we can also label good
examples as "good" and filter on those in future versions to make sure we didn't regress.
If some of the edits are not ideal, we can click on the label in the top left corner of the example
and tag it as "bad". Then we can apply a filter for "bad" examples and make sure that new versions
of our map improved on those examples. Conversely, we can tag "good" examples and see if we regress
as we iterate on the dataset.

<img src="../_static/curate_coding_dataset/label.png">

Expand All @@ -152,13 +155,13 @@ using the download dialog or the python API. See

## Going forward

In this blog post, we've shown how to use Lilac to curate a dataset for a code assistant model. We
used a formatter to standardize the python code outputted by our AI product. We then visualized the
changes to the dataset to understand the behavior of the formatter and any side-effects.
We believe that text is becoming the new programming language. It is the source code of LLMs.

At Lilac, we are building the tooling to work with this new programming language, bringing the
tooling, rigor, and best practices from software engineering to the development of the data behind
AI systems.

Looking ahead, the landscape of programming is undergoing a paradigm shift with the emergence of
text as a new programming language. Lilac plays a big part in this transformation as the traditional
boundaries between code and data dissolve.
There is much more to come!

We hope you enjoyed this blog post. If you have any questions or feedback, please reach out to us on
If you have any questions or feedback, please reach out to us on
[Discord](https://discord.gg/jNzw9mC8pp) or [Github](https://github.com/lilacai/lilac).
1 change: 1 addition & 0 deletions lilac/data/dataset_duckdb.py
Original file line number Diff line number Diff line change
Expand Up @@ -879,6 +879,7 @@ def _compute_disk_cached(
task_step_id=task_step_id,
initial_id=start_idx,
shard_id=shard_id,
shard_count=shard_count,
estimated_len=estimated_len,
step_description=task_step_description,
)
Expand Down
42 changes: 34 additions & 8 deletions lilac/tasks.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
import builtins
import functools
import multiprocessing
import random
import time
import traceback
import uuid
Expand Down Expand Up @@ -163,15 +164,16 @@ def __init__(self, dask_client: Optional[Client] = None) -> None:
async def _update_tasks(self) -> None:
adapter = TypeAdapter(list[TaskStepInfo])
for task_id, task in list(self._tasks.items()):
task_progress_topic = _progress_event_topic(task_id)
if task.status == TaskStatus.COMPLETED:
if task_id in self._task_threadpools:
threadpool = self._task_threadpools[task_id]
threadpool.shutdown()
# Clean up threaded events.
del THREADED_EVENTS[_progress_event_topic(task_id)]
if task_progress_topic in THREADED_EVENTS:
del THREADED_EVENTS[task_progress_topic]
continue

task_progress_topic = _progress_event_topic(task_id)
if task_id in self._dask_futures:
try:
step_events = cast(Any, self._dask_client.get_events(task_progress_topic))
Expand Down Expand Up @@ -322,6 +324,12 @@ def _set_task_completed(self, task_id: TaskId, task_future: Union[DaskFuture, Fu
if task_id in self._dask_futures:
del self._dask_futures[task_id]

def _restart_client_if_no_tasks(self) -> None:
# Check if any tasks are not completed. If not, we restart the dask client to free up memory.
tasks_pending = any(task.status == TaskStatus.PENDING for task in self._tasks.values())
if not tasks_pending:
self._dask_client.restart()

def _set_task_shard_completed(
self, task_id: TaskId, task_future: Union[DaskFuture, Future], num_shards: int
) -> None:
Expand All @@ -336,6 +344,9 @@ def execute(self, task_id: str, type: TaskExecutionType, task: TaskFn, *args: An
task_info = self._tasks[task_id]

if type == 'processes':
# Restart the workers to avoid GC slowing down the workers.
self._restart_client_if_no_tasks()

dask_task_id = _dask_task_id(task_id, None)
task_future = self._dask_client.submit(
functools.partial(_execute_task, task, task_info, dask_task_id),
Expand Down Expand Up @@ -376,8 +387,10 @@ def execute_sharded(

for i, (task, args) in enumerate(subtasks):
if type == 'processes':
dask_task_id = _dask_task_id(task_id, i)
# Restart the workers to avoid GC slowing down the workers.
self._restart_client_if_no_tasks()

dask_task_id = _dask_task_id(task_id, i)
task_future = self._dask_client.submit(
functools.partial(_execute_task, task, task_info, dask_task_id), *args, key=dask_task_id
)
Expand Down Expand Up @@ -543,14 +556,18 @@ def show_progress(
pbar.update(total_len - pbar.n)


# The interval to emit progress events.
EMIT_EVERY_SEC = 0.5


def report_progress(
it: Union[Iterator[TProgress], Iterable[TProgress]],
task_step_id: Optional[TaskStepId],
shard_id: Optional[int] = None,
shard_count: Optional[int] = None,
initial_id: Optional[int] = None,
estimated_len: Optional[int] = None,
step_description: Optional[str] = None,
emit_every_s: float = 0.25,
) -> Generator[TProgress, None, None]:
"""An iterable wrapper that emits progress and yields the original iterable."""
if not task_step_id or task_step_id[0] == '':
Expand All @@ -576,11 +593,15 @@ def report_progress(

it_idx = initial_id if initial_id else 0
start_time = time.time()
last_emit = time.time() - emit_every_s
# Reduce the emit frequency if there are multiple shards to reduce the size of the event stream.
emit_every_sec = EMIT_EVERY_SEC if not shard_count else EMIT_EVERY_SEC * shard_count
# Add jitter to the emit frequency to avoid all workers emitting at the same time.
jitter_sec = random.uniform(0, emit_every_sec)
last_emit = time.time() - EMIT_EVERY_SEC - jitter_sec

for t in it:
cur_time = time.time()
if estimated_len and cur_time - last_emit > emit_every_s:
cur_time = time.time() + jitter_sec
if estimated_len and cur_time - last_emit > emit_every_sec:
elapsed_sec = cur_time - start_time
it_per_sec = ((it_idx or 0) - (initial_id or 0.0)) / elapsed_sec
set_worker_task_progress(
Expand Down Expand Up @@ -693,7 +714,12 @@ def set_worker_task_progress(

steps[step_id].it_idx = it_idx
steps[step_id].estimated_len = estimated_len
steps[step_id].estimated_total_sec = estimated_total_sec
current_estimated_total_sec = steps[step_id].estimated_total_sec
steps[step_id].estimated_total_sec = (
max(current_estimated_total_sec, estimated_total_sec)
if current_estimated_total_sec
else estimated_total_sec
)
steps[step_id].elapsed_sec = elapsed_sec
steps[step_id].it_per_sec = it_per_sec

Expand Down
17 changes: 17 additions & 0 deletions lilac_hf_space.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,23 @@ datasets:
source_name: huggingface
dataset_name: imdb

- namespace: lilac
name: glaive
source:
dataset_name: glaiveai/glaive-code-assistant
source_name: huggingface
settings:
tags: [machine-learning]
ui:
view_type: 'single_item'
ui:
media_paths:
- question
- answer
- - answer_formatted
- answer
markdown_paths: []

- name: open-asssistant-conversations
namespace: lilac
settings:
Expand Down
