Merge branch 'main' into feature/map-extract-v1
edwinjosechittilappilly authored Feb 25, 2025
2 parents 7c662d8 + 73551b0 commit c62afb4
Showing 48 changed files with 4,379 additions and 1,674 deletions.
2 changes: 1 addition & 1 deletion .gitattributes
@@ -32,4 +32,4 @@ Dockerfile text
*.mp4 binary
*.svg binary
*.csv binary

*.wav binary
51 changes: 33 additions & 18 deletions docs/docs/Components/components-vector-stores.md
@@ -37,31 +37,46 @@ For more information, see the [DataStax documentation](https://docs.datastax.com

| Name | Display Name | Info |
|------|--------------|------|
| collection_name | Collection Name | The name of the collection within Astra DB where the vectors will be stored (required) |
| token | Astra DB Application Token | Authentication token for accessing Astra DB (required) |
| api_endpoint | API Endpoint | API endpoint URL for the Astra DB service (required) |
| search_input | Search Input | Query string for similarity search |
| ingest_data | Ingest Data | Data to be ingested into the vector store |
| namespace | Namespace | Optional namespace within Astra DB to use for the collection |
| embedding_choice | Embedding Model or Astra Vectorize | Determines whether to use an Embedding Model or Astra Vectorize for the collection |
| embedding | Embedding Model | Allows an embedding model configuration (when using Embedding Model) |
| provider | Vectorize Provider | Provider for Astra Vectorize (when using Astra Vectorize) |
| metric | Metric | Optional distance metric for vector comparisons |
| batch_size | Batch Size | Optional number of data to process in a single batch |
| setup_mode | Setup Mode | Configuration mode for setting up the vector store (options: "Sync", "Async", "Off", default: "Sync") |
| pre_delete_collection | Pre Delete Collection | Boolean flag to determine whether to delete the collection before creating a new one |
| number_of_results | Number of Results | Number of results to return in similarity search (default: 4) |
| search_type | Search Type | Search type to use (options: "Similarity", "Similarity with score threshold", "MMR (Max Marginal Relevance)") |
| search_score_threshold | Search Score Threshold | Minimum similarity score threshold for search results |
| search_filter | Search Metadata Filter | Optional dictionary of filters to apply to the search query |
| token | Astra DB Application Token | The authentication token for accessing Astra DB. |
| environment | Environment | The environment for the Astra DB API Endpoint. For example, `dev` or `prod`. |
| database_name | Database | The database name for the Astra DB instance. |
| api_endpoint | Astra DB API Endpoint | The API endpoint for the Astra DB instance. This supersedes the database selection. |
| collection_name | Collection | The name of the collection within Astra DB where the vectors are stored. |
| keyspace | Keyspace | An optional keyspace within Astra DB to use for the collection. |
| embedding_choice | Embedding Model or Astra Vectorize | Choose an embedding model or use Astra vectorize. |
| embedding_model | Embedding Model | Specify the embedding model. Not required for Astra vectorize collections. |
| number_of_results | Number of Search Results | The number of search results to return (default: `4`). |
| search_type | Search Type | The search type to use. The options are `Similarity`, `Similarity with score threshold`, and `MMR (Max Marginal Relevance)`. |
| search_score_threshold | Search Score Threshold | The minimum similarity score threshold for search results when using the `Similarity with score threshold` option. |
| advanced_search_filter | Search Metadata Filter | An optional dictionary of filters to apply to the search query. |
| autodetect_collection | Autodetect Collection | A boolean flag to determine whether to autodetect the collection. |
| content_field | Content Field | A field to use as the text content field for the vector store. |
| deletion_field | Deletion Based On Field | When provided, documents in the target collection with metadata field values matching the input metadata field value are deleted before new data is loaded. |
| ignore_invalid_documents | Ignore Invalid Documents | A boolean flag to determine whether to ignore invalid documents at runtime. |
| astradb_vectorstore_kwargs | AstraDBVectorStore Parameters | An optional dictionary of additional parameters for the AstraDBVectorStore. |
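The `search_type` options are easiest to see as a conceptual sketch. The functions below are hypothetical, not the component's implementation (the component delegates real searches to the underlying vector store), but they show how the two non-default modes differ: a score threshold filters by relevance alone, while MMR trades relevance against redundancy among the results already chosen.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_with_threshold(query, docs, threshold, k=4):
    """Return up to k doc ids whose similarity to the query meets the threshold."""
    scored = [(cosine(query, vec), doc) for doc, vec in docs.items()]
    scored.sort(reverse=True)
    return [doc for score, doc in scored if score >= threshold][:k]

def mmr(query, docs, k=4, lambda_mult=0.5):
    """Greedy Max Marginal Relevance: balance query relevance against redundancy."""
    selected = []
    candidates = dict(docs)
    while candidates and len(selected) < k:
        def mmr_score(item):
            doc, vec = item
            relevance = cosine(query, vec)
            # Penalize similarity to anything already selected.
            redundancy = max((cosine(vec, docs[s]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best_doc, _ = max(candidates.items(), key=mmr_score)
        selected.append(best_doc)
        del candidates[best_doc]
    return selected
```

With a low `lambda_mult`, MMR skips a near-duplicate of the best match in favor of a less similar but more diverse document, which plain threshold search would not.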

### Outputs

| Name | Display Name | Info |
|------|--------------|------|
| vector_store | Vector Store | Astra DB vector store instance configured with the specified parameters. |
| search_results | Search Results | The results of the similarity search as a list of `Data` objects. |
| search_results | Search Results | The results of the similarity search as a list of [Data](/concepts-objects#data-object) objects. |

### Generate embeddings

The **Astra DB Vector Store** component offers two methods for generating embeddings.

1. **Embedding Model**: Use your own embedding model by connecting an [Embeddings](/components-embedding-models) component in Langflow.

2. **Astra Vectorize**: Use Astra DB's built-in embedding generation service. When creating a new collection, choose the embedding provider and model, such as NVIDIA's `NV-Embed-QA` model hosted by DataStax.

:::important
The embedding model selection is made when creating a new collection and cannot be changed later.
:::

For an example of using the **Astra DB Vector Store** component with an embedding model, see the [Vector Store RAG starter project](/starter-projects-vector-store-rag).

For more information, see the [Astra DB Serverless documentation](https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html).

## AstraDB Graph vector store

41 changes: 28 additions & 13 deletions docs/docs/Get-Started/get-started-quickstart.md
@@ -11,8 +11,8 @@ Get to know Langflow by building an OpenAI-powered chatbot application. After yo

* [An OpenAI API key](https://platform.openai.com/)
* [An Astra DB vector database](https://docs.datastax.com/en/astra-db-serverless/get-started/quickstart.html) with:
* An AstraDB application token
* [A collection in Astra](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection)
* An Astra DB application token scoped to read and write to the database
* A collection created in [Astra](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection) or a new collection created in the **Astra DB** component

## Open Langflow and start a new project

@@ -31,7 +31,7 @@ Continue to [Run the basic prompting flow](#run-basic-prompting-flow).

The Basic Prompting flow will look like this when it's completed:

![Completed basic prompting flow](/img/starter-flow-basic-prompting.png)

To build the **Basic Prompting** flow, follow these steps:

@@ -46,7 +46,7 @@ The [OpenAI](components-models#openai) model component sends the user input and

You should now have a flow that looks like this:

![Basic prompting flow with no connections](/img/quickstart-basic-prompt-no-connections.png)

With no connections between them, the components won't interact with each other.
You want data to flow from **Chat Input** to **Chat Output** through the connections between the components.
@@ -111,7 +111,7 @@ If you don't want to create a blank flow, click **New Flow**, and then select **

When completed, the flow with vector RAG added will look like this:

![Add document ingestion to the basic prompting flow](/img/quickstart-add-document-ingestion.png)

To build the flow, follow these steps:

@@ -120,24 +120,39 @@ To build the flow, follow these steps:
The [Astra DB vector store](/components-vector-stores#astra-db-vector-store) component connects to your **Astra DB** database.
3. Click **Data**, select the **File** component, and then drag it to the canvas.
The [File](/components-data#file) component loads files from your local machine.
4. Click **Processing**, select the **Split Text** component, and then drag it to the canvas.
The [Split Text](/components-processing#split-text) component splits the loaded text into smaller chunks.
5. Click **Processing**, select the **Parse Data** component, and then drag it to the canvas.
The [Data to Message](/components-processing#data-to-message) component converts the data from the **Astra DB** component into plain text.
6. Click **Embeddings**, select the **OpenAI Embeddings** component, and then drag it to the canvas.
The [OpenAI Embeddings](/components-embedding-models#openai-embeddings) component generates embeddings for the user's input, which are compared to the vector data in the database.
7. Connect the new components into the existing flow, so your flow looks like this:

![Add document ingestion to the basic prompting flow](/img/quickstart-add-document-ingestion.png)

8. Configure the **Astra DB** component.
1. In the **Astra DB Application Token** field, add your **Astra DB** application token.
The component connects to your database and populates the menus with existing databases and collections.
2. Select your **Database**.
If you don't have a database, select **New database**.
Complete the **Name**, **Cloud provider**, and **Region** fields, and then click **Create**. **Database creation takes a few minutes**.
3. Select your **Collection**. Collections are created in your [Astra DB deployment](https://astra.datastax.com) for storing vector data.
If you don't have a collection, see the [DataStax Astra DB Serverless documentation](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection).
4. Select **Embedding Model** to bring your own embedding model; in this flow, that is the connected **OpenAI Embeddings** component.
The **Dimensions** value must match the dimensions of your collection. You can find this value in your **Collection** in your [Astra DB deployment](https://astra.datastax.com).
:::info
If you select a collection embedded with NVIDIA through Astra's vectorize service, the **Embedding Model** port is removed, because you have already generated embeddings for this collection with the NVIDIA `NV-Embed-QA` model. The component fetches the data from the collection, and uses the same embeddings for queries.
:::

9. If you don't have a collection, create a new one within the component.
1. Select **New collection**.
2. Complete the **Name**, **Embedding generation method**, **Embedding model**, and **Dimensions** fields, and then click **Create**.

Your choice for the **Embedding generation method** and **Embedding model** depends on whether you want to use embeddings generated by a provider through Astra's vectorize service, or generated by a component in Langflow.

* To use embeddings generated by a provider through Astra's vectorize service, select the provider from the **Embedding generation method** dropdown menu, and then select the model from the **Embedding model** dropdown menu.
* To use embeddings generated by a component in Langflow, select **Bring your own** for both the **Embedding generation method** and **Embedding model** fields. In this starter project, the **OpenAI Embeddings** component connected to the **Astra DB** component provides the embeddings.
* The **Dimensions** value must match the dimensions of your collection. This field is **not required** if you use embeddings generated through Astra's vectorize service. You can find this value in the **Collection** in your [Astra DB deployment](https://astra.datastax.com).

For more information, see the [DataStax Astra DB Serverless documentation](https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html).
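The dimension-matching rule above can be sketched with a hypothetical helper (not part of Langflow): a collection is created with a fixed vector dimension, and an embedding of any other length cannot be stored in or compared against it.

```python
def validate_dimensions(vector, collection_dimensions):
    """Raise if the embedding cannot be stored in the collection."""
    if len(vector) != collection_dimensions:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, "
            f"but the collection expects {collection_dimensions}"
        )
    return vector

# For example, OpenAI's text-embedding-3-small model produces
# 1536-dimensional vectors, so the matching collection must be
# created with Dimensions = 1536.
```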


If you used Langflow's **Global Variables** feature, the RAG application flow components are already configured with the necessary credentials.

97 changes: 97 additions & 0 deletions docs/docs/Integrations/Nvidia/integrations-nvidia-ingest.md
@@ -0,0 +1,97 @@
---
title: Integrate NVIDIA Ingest with Langflow
slug: /integrations-nvidia-ingest
---

The **NVIDIA Ingest** component integrates with the [NVIDIA nv-ingest](https://github.com/NVIDIA/nv-ingest) microservice for data ingestion, processing, and extraction of text files.

The `nv-ingest` service supports multiple extraction methods for PDF, DOCX, and PPTX file types, and includes pre- and post-processing services like splitting, chunking, and embedding generation.

The **NVIDIA Ingest** component imports the NVIDIA `Ingestor` client, ingests files with requests to the NVIDIA ingest endpoint, and outputs the processed content as a list of [Data](/concepts-objects#data-object) objects. `Ingestor` accepts additional configuration options for data extraction from other text formats. To configure these options, see the [component parameters](/integrations-nvidia-ingest#parameters).

## Prerequisites

* An NVIDIA Ingest endpoint. For more information on setting up an NVIDIA Ingest endpoint, see the [NVIDIA Ingest quickstart](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart).

* The **NVIDIA Ingest** component requires additional dependencies in your Langflow environment. To install them in a virtual environment, run the following commands:
```bash
source YOUR_LANGFLOW_VENV/bin/activate
uv sync --extra nv-ingest
uv run langflow run
```

## Use the NVIDIA Ingest component in a flow

The **NVIDIA Ingest** component accepts **Message** inputs and outputs **Data**. The component calls an NVIDIA Ingest microservice endpoint to ingest a local file and extract the text.

To use the NVIDIA Ingest component in your flow, follow these steps:
1. In the component library, click the **NVIDIA Ingest** component, and then drag it onto the canvas.
2. In the **NVIDIA Ingestion URL** field, enter the URL of the NVIDIA Ingest endpoint.
Optionally, add the endpoint URL as a **Global variable**:
1. Click **Settings**, and then click **Global Variables**.
2. Click **Add New**.
3. Name your variable. Paste your endpoint in the **Value** field.
4. In the **Apply To Fields** field, select the field you want to globally apply this variable to. In this case, select **NVIDIA Ingestion URL**.
5. Click **Save Variable**.
3. In the **Path** field, enter the path to the file you want to ingest.
4. Select which text type to extract from the file.
The component supports text, charts, and tables.
5. Select whether to split the text into chunks.
Modify the splitting parameters in the component's **Configuration** tab.
6. Click **Run** to ingest the file.
7. To confirm the component is ingesting the file, open the **Logs** pane to view the output of the flow.
8. To store the processed data in a vector database, add an **AstraDB Vector** component to your flow, and connect the **NVIDIA Ingest** component to the **AstraDB Vector** component with a **Data** output.

![NVIDIA Ingest component flow](nvidia-component-ingest-astra.png)

9. Run the flow.
Inspect your Astra DB vector database to view the processed data.

## NVIDIA Ingest component parameters {#parameters}

The **NVIDIA Ingest** component has the following parameters.

For more information, see the [NV-Ingest documentation](https://nvidia.github.io/nv-ingest/user-guide/).

### Inputs

| Name | Display Name | Info |
|------|--------------|------|
| base_url | NVIDIA Ingestion URL | The URL of the NVIDIA Ingestion API. |
| path | Path | File path to process. |
| extract_text | Extract Text | Extract text from documents. Default: `True`. |
| extract_charts | Extract Charts | Extract text from charts. Default: `False`. |
| extract_tables | Extract Tables | Extract text from tables. Default: `True`. |
| text_depth | Text Depth | The level at which text is extracted. Support for `block`, `line`, and `span` varies by document type. Default: `document`. |
| split_text | Split Text | Split text into smaller chunks. Default: `True`. |
| split_by | Split By | How to split the text into chunks. `size` splits by number of characters. Default: `word`. |
| split_length | Split Length | The size of each chunk based on the 'split_by' method. Default: `200`. |
| split_overlap | Split Overlap | The number of segments to overlap from the previous chunk. Default: `20`. |
| max_character_length | Max Character Length | The maximum number of characters in each chunk. Default: `1000`. |
| sentence_window_size | Sentence Window Size | The number of sentences to include from previous and following chunks when `split_by=sentence`. Default: `0`. |
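A simplified sketch shows how the splitting parameters interact. The real chunking happens inside the nv-ingest service; this hypothetical version only models `split_by=word` with `split_length` and `split_overlap`, where consecutive chunks share `split_overlap` words.

```python
def split_words(text, split_length=200, split_overlap=20):
    """Split text into chunks of split_length words, overlapping by split_overlap."""
    if split_length <= split_overlap:
        raise ValueError("split_length must be greater than split_overlap")
    words = text.split()
    chunks = []
    step = split_length - split_overlap  # new words introduced per chunk
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the final chunk already reached the end of the text
    return chunks
```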

### Outputs

The **NVIDIA Ingest** component outputs a list of [Data](/concepts-objects#data-object) objects where each object contains:
- `text`: The extracted content.
- For text documents: The extracted text content.
- For tables and charts: The extracted table/chart content.
- `file_path`: The source file name and path.
- `document_type`: The type of the document ("text" or "structured").
- `description`: Additional description of the content.

The output varies based on the `document_type`:

- Documents with `document_type: "text"` contain:
- Raw text content extracted from documents, for example, paragraphs from PDFs or DOCX files.
- Content stored directly in the `text` field.
- Content extracted using the `extract_text` parameter.

- Documents with `document_type: "structured"` contain:
- Text extracted from tables and charts and processed to preserve structural information.
- Content extracted using the `extract_tables` and `extract_charts` parameters.
- Content stored in the `text` field after being processed from the `table_content` metadata.

:::note
Images are currently not supported and will be skipped during processing.
:::
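Downstream components can branch on `document_type`. The dicts below are a hypothetical stand-in mimicking only the documented fields of the output `Data` objects:

```python
# Hypothetical sample output: each entry mirrors the documented Data fields.
results = [
    {"text": "Quarterly revenue grew 12%.", "file_path": "report.pdf",
     "document_type": "text", "description": "Body paragraph"},
    {"text": "Q1|Q2|Q3", "file_path": "report.pdf",
     "document_type": "structured", "description": "Revenue table"},
]

def by_type(data, document_type):
    """Return only the entries with the given document_type."""
    return [d for d in data if d["document_type"] == document_type]

tables_and_charts = by_type(results, "structured")
plain_text = by_type(results, "text")
```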
@@ -22,7 +22,7 @@ This opens a starter flow with the necessary components to run an agentic applic

## Simple Agent flow

![Simple agent starter flow](/img/starter-flow-simple-agent.png)

The **Simple Agent** flow consists of these components:

