Commit

Merge branch 'main' of https://github.com/lilacai/lilac into ds-blog-update
dsmilkov committed Jan 31, 2024
2 parents 0575e59 + bbb14ac commit 3b644e0
Showing 4 changed files with 55 additions and 24 deletions.
24 changes: 14 additions & 10 deletions docs/getting_started/quickstart.md
@@ -41,7 +41,8 @@ Fill in HuggingFace-specific fields:
```{note}
Lilac's sweet spot is ~100K-1M rows of data, although up to 10 million rows are possible.
This quickstart uses 10,000 rows so that clustering and embedding operations finish locally
in ~10 minutes even without a GPU.
in 10-20 minutes even without a GPU. [Lilac Garden](https://docs.lilacml.com/blog/introducing-garden.html) can help speed
up your computation for larger datasets.
```

Finally:
@@ -61,14 +62,16 @@ When we load a dataset, Lilac creates a default UI configuration, inferring which fields should be
presented differently in the UI.

Let's edit the configuration by clicking the `Dataset settings` button in the top-right corner. If
your media field contains markdown, you can enable markdown rendering.
your media field contains markdown, you can
[enable markdown rendering](../datasets/dataset_configure.md).

<video loop muted autoplay controls src="../_static/getting_started/orca-settings.mp4"></video>

## Cluster

Lilac can detect clusters in your dataset. Clusters are a powerful way to understand the types of
content present in your dataset, as well as to target subsets for removal from the dataset.
Lilac can detect [clusters in your dataset](../datasets/dataset_cluster.md). Clusters are a powerful
way to understand the types of content present in your dataset, as well as to target subsets for
removal from the dataset.

To cluster, open up the dataset schema tray to reveal the fields in your dataset. Here, you can
choose which field will get clustered.
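As a toy sketch of what clustering surfaces (Lilac's real clusters are embedding-based; the hand-written keyword rule below is purely illustrative):

```python
# Toy illustration of clustering: group rows of a chosen field into
# named buckets. Lilac's actual clusters come from text embeddings;
# this keyword rule is only a stand-in for that step.
rows = [
    'How do I cook pasta?',
    'What sauce goes with pasta?',
    'Explain how gravity works.',
]

def bucket(text):
    # Hypothetical rule standing in for embedding-based cluster assignment.
    return 'cooking' if 'pasta' in text else 'science'

clusters = {}
for row in rows:
    clusters.setdefault(bucket(row), []).append(row)

print(sorted(clusters))  # ['cooking', 'science']
```

Grouping rows this way is what lets you browse a dataset by topic rather than row by row.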
@@ -80,9 +83,10 @@ other fields in your dataset by changing the Explore and Group By selections.

## Tagging and Deleting rows

Lilac can curate your dataset by tagging or deleting rows.
Lilac can curate your dataset by [tagging](../datasets/dataset_labels.md) or
[deleting](../datasets/dataset_delete_rows.md) rows.

Deleting is not permanent - you can toggle visibility of deleted items - but it is a convenient way
Deleting is not permanent - you can toggle visibility of deleted items. Deleting is a convenient way
to iterate on your dataset by removing undesired slices of data. Later on, when you export data from
Lilac, deleted rows will be excluded by default.
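The non-permanent delete can be pictured as a soft-delete flag that export skips by default (the `deleted` field name here is illustrative, not necessarily Lilac's internal schema):

```python
# Sketch of soft deletion: rows are flagged rather than destroyed, and
# flagged rows are excluded at export time unless explicitly requested.
# 'deleted' is an illustrative field name, not Lilac's actual internals.
rows = [
    {'id': 1, 'text': 'useful example', 'deleted': False},
    {'id': 2, 'text': 'undesired slice', 'deleted': True},
    {'id': 3, 'text': 'another keeper', 'deleted': False},
]

def export(rows, include_deleted=False):
    # Deleted rows remain toggleable (nothing is destroyed), but are
    # skipped by default, matching the behavior described above.
    return [r for r in rows if include_deleted or not r['deleted']]

print(len(export(rows)))  # 2
```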

@@ -124,10 +128,10 @@ can open the statistics panel to see the distribution of concept scores.

## Download

Now that we've clustered, curated, and enriched the dataset, let's download it by clicking on the
`Download data` button in the top-right corner. This will download a json file with the same name as
the dataset. Once we have the data, we can continue working with it in a Python notebook, or any
other language.
Now that we've clustered, curated, and enriched the dataset, let's
[download it](../datasets/dataset_export.md) by clicking on the `Download data` button in the
top-right corner. This will download a json file with the same name as the dataset. Once we have the
data, we can continue working with it in a Python notebook, or any other language.

You can also get the dataset as a Pandas dataframe through the [Python API](quickstart_python.md).
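Loading the downloaded file back in a notebook is a one-liner with the standard library. This sketch simulates the download first so it is self-contained; the filename and the assumption that the export is a JSON array of row objects are illustrative:

```python
import json
import os
import tempfile

# Simulate the downloaded file (illustrative name and layout: we assume
# a JSON array of row objects), then load it back as you might in a
# notebook after clicking `Download data`.
sample_rows = [{'question': 'What is 2 + 2?', 'response': '4'}]
path = os.path.join(tempfile.mkdtemp(), 'open-orca-100k.json')
with open(path, 'w') as f:
    json.dump(sample_rows, f)

with open(path) as f:
    rows = json.load(f)

print(rows[0]['response'])  # 4
```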

14 changes: 8 additions & 6 deletions docs/getting_started/quickstart_python.md
@@ -66,7 +66,7 @@ dataset = ll.get_dataset('local', 'open-orca-100k')

## Compute clusters

Let's compute clusters on the `question` field.
Let's [compute clusters](../datasets/dataset_cluster.md#from-python) on the `question` field.

```python
dataset.cluster('question')
```

@@ -130,8 +130,8 @@ question__cluster:
## Select specific rows

Let's find all clusters that talk about movies via [](#Dataset.select_rows), which works very
similarly to a `SQL Select` statement. We do this by adding an [`exists`](#Filter.op) filter on
`question__cluster.cluster_title`.
similarly to a `SQL Select` statement. We do this by adding a [`regex_matches`](#Filter.op) filter
on `question__cluster.cluster_title`. (See [Querying](../datasets/dataset_query.md) for more.)

```py
df_with_emails = dataset.select_rows(
    ...)  # remaining arguments collapsed in the diff view
```

@@ -177,8 +177,9 @@ For more information on querying, see [](#Dataset.select_rows).
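Conceptually, a `regex_matches` filter is a regular-expression test applied row by row. A self-contained sketch (the cluster titles are invented, and `re.search` stands in for the filter operator):

```python
import re

# Self-contained sketch of a regex filter over cluster titles. The
# titles are invented; re.search stands in for regex_matches.
cluster_titles = [
    'Movie plot summaries',
    'Email etiquette questions',
    'Classic movie trivia',
]
pattern = re.compile(r'(?i)movie')  # case-insensitive match on "movie"
matches = [t for t in cluster_titles if pattern.search(t)]
print(len(matches))  # 2
```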
### Profanity detection

Let's also run the profanity concept on the `response` field to see if the LLM produced any profane
content. To do that we need to _index_ the `response` field using a text embedding. We only need to
index once. For a fast on-device embedding, we recommend the
content. To do that we need to _index_ the `response` field using a
[text embedding](../datasets/dataset_embeddings.md#from-python). We only need to index once. For a
fast on-device embedding, we recommend the
[GTE-Small embedding](https://huggingface.co/thenlper/gte-small).
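Why index only once? Each row's embedding is computed a single time; every concept score afterwards is a cheap vector comparison against that index. A toy illustration (the vectors and the "concept direction" are made up):

```python
import math

# Toy picture of index-then-score: embeddings are computed once per row,
# then each concept score is a cheap cosine similarity against a concept
# vector. All numbers here are invented for illustration.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

index = {'row1': [0.1, 0.9], 'row2': [0.8, 0.2]}  # "precomputed" embeddings
concept = [0.0, 1.0]                               # toy concept direction

scores = {rid: cosine(vec, concept) for rid, vec in index.items()}
print(scores['row1'] > scores['row2'])  # True
```

Running a second concept reuses the same `index`, which is why the indexing step dominates the cost and is only paid once.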

```py
@@ -221,7 +222,8 @@ Computing signal "concept_score" on local/open-orca-10k:('response',) took 0.025
4 [{'__span__': {'start': 0, 'end': 164}, 'score...
```

To compute the concept score over the entire dataset, we do:
To compute the [concept score](../datasets/dataset_concepts.md#from-python) over the entire dataset,
we do:

```py
dataset.compute_concept('lilac', 'profanity', embedding='gte-small', path='response')
```
37 changes: 31 additions & 6 deletions docs/poetry.lock

Some generated files are not rendered by default.

