Merge branch 'main' into feature/map-extract-v1
edwinjosechittilappilly authored Feb 25, 2025
2 parents 7c662d8 + 73551b0 commit c62afb4
Showing 48 changed files with 4,379 additions and 1,674 deletions.
2 changes: 1 addition & 1 deletion .gitattributes
@@ -32,4 +32,4 @@ Dockerfile text
*.mp4 binary
*.svg binary
*.csv binary

*.wav binary
51 changes: 33 additions & 18 deletions docs/docs/Components/components-vector-stores.md
@@ -37,31 +37,46 @@ For more information, see the [DataStax documentation](https://docs.datastax.com

| Name | Display Name | Info |
|------|--------------|------|
| collection_name | Collection Name | The name of the collection within Astra DB where the vectors will be stored (required) |
| token | Astra DB Application Token | Authentication token for accessing Astra DB (required) |
| api_endpoint | API Endpoint | API endpoint URL for the Astra DB service (required) |
| search_input | Search Input | Query string for similarity search |
| ingest_data | Ingest Data | Data to be ingested into the vector store |
| namespace | Namespace | Optional namespace within Astra DB to use for the collection |
| embedding_choice | Embedding Model or Astra Vectorize | Determines whether to use an Embedding Model or Astra Vectorize for the collection |
| embedding | Embedding Model | Allows an embedding model configuration (when using Embedding Model) |
| provider | Vectorize Provider | Provider for Astra Vectorize (when using Astra Vectorize) |
| metric | Metric | Optional distance metric for vector comparisons |
| batch_size | Batch Size | Optional number of data to process in a single batch |
| setup_mode | Setup Mode | Configuration mode for setting up the vector store (options: "Sync", "Async", "Off", default: "Sync") |
| pre_delete_collection | Pre Delete Collection | Boolean flag to determine whether to delete the collection before creating a new one |
| number_of_results | Number of Results | Number of results to return in similarity search (default: 4) |
| search_type | Search Type | Search type to use (options: "Similarity", "Similarity with score threshold", "MMR (Max Marginal Relevance)") |
| search_score_threshold | Search Score Threshold | Minimum similarity score threshold for search results |
| search_filter | Search Metadata Filter | Optional dictionary of filters to apply to the search query |
| token | Astra DB Application Token | The authentication token for accessing Astra DB. |
| environment | Environment | The environment for the Astra DB API Endpoint. For example, `dev` or `prod`. |
| database_name | Database | The database name for the Astra DB instance. |
| api_endpoint | Astra DB API Endpoint | The API endpoint for the Astra DB instance. This supersedes the database selection. |
| collection_name | Collection | The name of the collection within Astra DB where the vectors are stored. |
| keyspace | Keyspace | An optional keyspace within Astra DB to use for the collection. |
| embedding_choice | Embedding Model or Astra Vectorize | Choose an embedding model or use Astra vectorize. |
| embedding_model | Embedding Model | Specify the embedding model. Not required for Astra vectorize collections. |
| number_of_results | Number of Search Results | The number of search results to return (default: `4`). |
| search_type | Search Type | The search type to use. The options are `Similarity`, `Similarity with score threshold`, and `MMR (Max Marginal Relevance)`. |
| search_score_threshold | Search Score Threshold | The minimum similarity score threshold for search results when using the `Similarity with score threshold` option. |
| advanced_search_filter | Search Metadata Filter | An optional dictionary of filters to apply to the search query. |
| autodetect_collection | Autodetect Collection | A boolean flag to determine whether to autodetect the collection. |
| content_field | Content Field | A field to use as the text content field for the vector store. |
| deletion_field | Deletion Based On Field | When provided, documents in the target collection with metadata field values matching the input metadata field value are deleted before new data is loaded. |
| ignore_invalid_documents | Ignore Invalid Documents | A boolean flag to determine whether to ignore invalid documents at runtime. |
| astradb_vectorstore_kwargs | AstraDBVectorStore Parameters | An optional dictionary of additional parameters for the AstraDBVectorStore. |
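The `search_type` options are easiest to see as a conceptual sketch. The functions below are hypothetical, not the component's implementation (the component delegates real searches to the underlying vector store), but they show how the two non-default modes differ: a score threshold filters by relevance alone, while MMR trades relevance against redundancy among the results already chosen.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_with_threshold(query, docs, threshold, k=4):
    """Return up to k doc ids whose similarity to the query meets the threshold."""
    scored = [(cosine(query, vec), doc) for doc, vec in docs.items()]
    scored.sort(reverse=True)
    return [doc for score, doc in scored if score >= threshold][:k]

def mmr(query, docs, k=4, lambda_mult=0.5):
    """Greedy Max Marginal Relevance: balance query relevance against redundancy."""
    selected = []
    candidates = dict(docs)
    while candidates and len(selected) < k:
        def mmr_score(item):
            doc, vec = item
            relevance = cosine(query, vec)
            # Penalize similarity to anything already selected.
            redundancy = max((cosine(vec, docs[s]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best_doc, _ = max(candidates.items(), key=mmr_score)
        selected.append(best_doc)
        del candidates[best_doc]
    return selected
```

With a low `lambda_mult`, MMR skips a near-duplicate of the best match in favor of a less similar but more diverse document, which plain threshold search would not.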

### Outputs

| Name | Display Name | Info |
|------|--------------|------|
| vector_store | Vector Store | Astra DB vector store instance configured with the specified parameters. |
| search_results | Search Results | The results of the similarity search as a list of `Data` objects. |
| search_results | Search Results | The results of the similarity search as a list of [Data](/concepts-objects#data-object) objects. |

### Generate embeddings

The **Astra DB Vector Store** component offers two methods for generating embeddings.

1. **Embedding Model**: Use your own embedding model by connecting an [Embeddings](/components-embedding-models) component in Langflow.

2. **Astra Vectorize**: Use Astra DB's built-in embedding generation service. When creating a new collection, choose the embedding provider and model, such as NVIDIA's `NV-Embed-QA` model hosted by DataStax.

:::important
The embedding model selection is made when creating a new collection and cannot be changed later.
:::

For an example of using the **Astra DB Vector Store** component with an embedding model, see the [Vector Store RAG starter project](/starter-projects-vector-store-rag).

For more information, see the [Astra DB Serverless documentation](https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html).

## AstraDB Graph vector store

41 changes: 28 additions & 13 deletions docs/docs/Get-Started/get-started-quickstart.md
@@ -11,8 +11,8 @@ Get to know Langflow by building an OpenAI-powered chatbot application. After yo

* [An OpenAI API key](https://platform.openai.com/)
* [An Astra DB vector database](https://docs.datastax.com/en/astra-db-serverless/get-started/quickstart.html) with:
* An AstraDB application token
* [A collection in Astra](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection)
* An Astra DB application token scoped to read and write to the database
* A collection created in [Astra](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection) or a new collection created in the **Astra DB** component

## Open Langflow and start a new project

@@ -31,7 +31,7 @@ Continue to [Run the basic prompting flow](#run-basic-prompting-flow).

The Basic Prompting flow will look like this when it's completed:

![Completed basic prompting flow](/img/starter-flow-basic-prompting.png)

To build the **Basic Prompting** flow, follow these steps:

@@ -46,7 +46,7 @@ The [OpenAI](components-models#openai) model component sends the user input and

You should now have a flow that looks like this:

![Basic prompting flow with no connections](/img/quickstart-basic-prompt-no-connections.png)

With no connections between them, the components won't interact with each other.
You want data to flow from **Chat Input** to **Chat Output** through the connections between the components.
@@ -111,7 +111,7 @@ If you don't want to create a blank flow, click **New Flow**, and then select **

When completed, the flow with vector RAG added will look like this:

![Add document ingestion to the basic prompting flow](/img/quickstart-add-document-ingestion.png)

To build the flow, follow these steps:

@@ -120,24 +120,39 @@ To build the flow, follow these steps:
The [Astra DB vector store](/components-vector-stores#astra-db-vector-store) component connects to your **Astra DB** database.
3. Click **Data**, select the **File** component, and then drag it to the canvas.
The [File](/components-data#file) component loads files from your local machine.
4. Click **Processing**, select the **Split Text** component, and then drag it to the canvas.
The [Split Text](/components-processing#split-text) component splits the loaded text into smaller chunks.
5. Click **Processing**, select the **Parse Data** component, and then drag it to the canvas.
The [Data to Message](/components-processing#data-to-message) component converts the data from the **Astra DB** component into plain text.
6. Click **Embeddings**, select the **OpenAI Embeddings** component, and then drag it to the canvas.
The [OpenAI Embeddings](/components-embedding-models#openai-embeddings) component generates embeddings for the user's input, which are compared to the vector data in the database.
7. Connect the new components into the existing flow, so your flow looks like this:

![Add document ingestion to the basic prompting flow](/img/quickstart-add-document-ingestion.png)

8. Configure the **Astra DB** component.
1. In the **Astra DB Application Token** field, add your **Astra DB** application token.
The component connects to your database and populates the menus with existing databases and collections.
2. Select your **Database**.
If you don't have a database, select **New database**.
Complete the **Name**, **Cloud provider**, and **Region** fields, and then click **Create**. **Database creation takes a few minutes**.
3. Select your **Collection**. Collections are created in your [Astra DB deployment](https://astra.datastax.com) for storing vector data.
If you don't have a collection, see the [DataStax Astra DB Serverless documentation](https://docs.datastax.com/en/astra-db-serverless/databases/manage-collections.html#create-collection).
4. Select **Embedding Model** to bring your own embedding model; in this flow, that is the connected **OpenAI Embeddings** component.
The **Dimensions** value must match the dimensions of your collection. You can find this value in your **Collection** in your [Astra DB deployment](https://astra.datastax.com).
:::info
If you select a collection embedded with NVIDIA through Astra's vectorize service, the **Embedding Model** port is removed, because you have already generated embeddings for this collection with the NVIDIA `NV-Embed-QA` model. The component fetches the data from the collection, and uses the same embeddings for queries.
:::

9. If you don't have a collection, create a new one within the component.
1. Select **New collection**.
2. Complete the **Name**, **Embedding generation method**, **Embedding model**, and **Dimensions** fields, and then click **Create**.

Your choice for the **Embedding generation method** and **Embedding model** depends on whether you want to use embeddings generated by a provider through Astra's vectorize service, or generated by a component in Langflow.

* To use embeddings generated by a provider through Astra's vectorize service, select the provider from the **Embedding generation method** dropdown menu, and then select the model from the **Embedding model** dropdown menu.
* To use embeddings generated by a component in Langflow, select **Bring your own** for both the **Embedding generation method** and **Embedding model** fields. In this starter project, the **OpenAI Embeddings** component connected to the **Astra DB** component provides the embeddings.
* The **Dimensions** value must match the dimensions of your collection. This field is **not required** if you use embeddings generated through Astra's vectorize service. You can find this value in the **Collection** in your [Astra DB deployment](https://astra.datastax.com).

For more information, see the [DataStax Astra DB Serverless documentation](https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html).
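The dimension-matching rule above can be sketched with a hypothetical helper (not part of Langflow): a collection is created with a fixed vector dimension, and an embedding of any other length cannot be stored in or compared against it.

```python
def validate_dimensions(vector, collection_dimensions):
    """Raise if the embedding cannot be stored in the collection."""
    if len(vector) != collection_dimensions:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, "
            f"but the collection expects {collection_dimensions}"
        )
    return vector

# For example, OpenAI's text-embedding-3-small model produces
# 1536-dimensional vectors, so the matching collection must be
# created with Dimensions = 1536.
```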


If you used Langflow's **Global Variables** feature, the RAG application flow components are already configured with the necessary credentials.

97 changes: 97 additions & 0 deletions docs/docs/Integrations/Nvidia/integrations-nvidia-ingest.md
@@ -0,0 +1,97 @@
---
title: Integrate NVIDIA Ingest with Langflow
slug: /integrations-nvidia-ingest
---

The **NVIDIA Ingest** component integrates with the [NVIDIA nv-ingest](https://github.com/NVIDIA/nv-ingest) microservice for data ingestion, processing, and extraction of text files.

The `nv-ingest` service supports multiple extraction methods for PDF, DOCX, and PPTX file types, and includes pre- and post-processing services like splitting, chunking, and embedding generation.

The **NVIDIA Ingest** component imports the NVIDIA `Ingestor` client, ingests files with requests to the NVIDIA ingest endpoint, and outputs the processed content as a list of [Data](/concepts-objects#data-object) objects. `Ingestor` accepts additional configuration options for data extraction from other text formats. To configure these options, see the [component parameters](/integrations-nvidia-ingest#parameters).

## Prerequisites

* An NVIDIA Ingest endpoint. For more information on setting up an NVIDIA Ingest endpoint, see the [NVIDIA Ingest quickstart](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart).

* The **NVIDIA Ingest** component requires additional dependencies in your Langflow environment. To install them in a virtual environment, run the following commands:
```bash
source YOUR_LANGFLOW_VENV/bin/activate
uv sync --extra nv-ingest
uv run langflow run
```

## Use the NVIDIA Ingest component in a flow

The **NVIDIA Ingest** component accepts **Message** inputs and outputs **Data**. The component calls an NVIDIA Ingest microservice endpoint to ingest a local file and extract the text.

To use the NVIDIA Ingest component in your flow, follow these steps:
1. In the component library, click the **NVIDIA Ingest** component, and then drag it onto the canvas.
2. In the **NVIDIA Ingestion URL** field, enter the URL of the NVIDIA Ingest endpoint.
Optionally, add the endpoint URL as a **Global variable**:
1. Click **Settings**, and then click **Global Variables**.
2. Click **Add New**.
3. Name your variable. Paste your endpoint in the **Value** field.
4. In the **Apply To Fields** field, select the field you want to globally apply this variable to. In this case, select **NVIDIA Ingestion URL**.
5. Click **Save Variable**.
3. In the **Path** field, enter the path to the file you want to ingest.
4. Select which text type to extract from the file.
The component supports text, charts, and tables.
5. Select whether to split the text into chunks.
Modify the splitting parameters in the component's **Configuration** tab.
6. Click **Run** to ingest the file.
7. To confirm the component is ingesting the file, open the **Logs** pane to view the output of the flow.
8. To store the processed data in a vector database, add an **AstraDB Vector** component to your flow, and connect the **NVIDIA Ingest** component to the **AstraDB Vector** component with a **Data** output.

![NVIDIA Ingest component flow](nvidia-component-ingest-astra.png)

9. Run the flow.
Inspect your Astra DB vector database to view the processed data.

## NVIDIA Ingest component parameters {#parameters}

The **NVIDIA Ingest** component has the following parameters.

For more information, see the [NV-Ingest documentation](https://nvidia.github.io/nv-ingest/user-guide/).

### Inputs

| Name | Display Name | Info |
|------|--------------|------|
| base_url | NVIDIA Ingestion URL | The URL of the NVIDIA Ingestion API. |
| path | Path | File path to process. |
| extract_text | Extract Text | Extract text from documents. Default: `True`. |
| extract_charts | Extract Charts | Extract text from charts. Default: `False`. |
| extract_tables | Extract Tables | Extract text from tables. Default: `True`. |
| text_depth | Text Depth | The level at which text is extracted. Support for `block`, `line`, and `span` varies by document type. Default: `document`. |
| split_text | Split Text | Split text into smaller chunks. Default: `True`. |
| split_by | Split By | How to split the text into chunks. `size` splits by number of characters. Default: `word`. |
| split_length | Split Length | The size of each chunk based on the 'split_by' method. Default: `200`. |
| split_overlap | Split Overlap | The number of segments to overlap from the previous chunk. Default: `20`. |
| max_character_length | Max Character Length | The maximum number of characters in each chunk. Default: `1000`. |
| sentence_window_size | Sentence Window Size | The number of sentences to include from previous and following chunks when `split_by=sentence`. Default: `0`. |
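A simplified sketch shows how the splitting parameters interact. The real chunking happens inside the nv-ingest service; this hypothetical version only models `split_by=word` with `split_length` and `split_overlap`, where consecutive chunks share `split_overlap` words.

```python
def split_words(text, split_length=200, split_overlap=20):
    """Split text into chunks of split_length words, overlapping by split_overlap."""
    if split_length <= split_overlap:
        raise ValueError("split_length must be greater than split_overlap")
    words = text.split()
    chunks = []
    step = split_length - split_overlap  # new words introduced per chunk
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the final chunk already reached the end of the text
    return chunks
```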

### Outputs

The **NVIDIA Ingest** component outputs a list of [Data](/concepts-objects#data-object) objects where each object contains:
- `text`: The extracted content.
- For text documents: The extracted text content.
- For tables and charts: The extracted table/chart content.
- `file_path`: The source file name and path.
- `document_type`: The type of the document ("text" or "structured").
- `description`: Additional description of the content.

The output varies based on the `document_type`:

- Documents with `document_type: "text"` contain:
- Raw text content extracted from documents, for example, paragraphs from PDFs or DOCX files.
- Content stored directly in the `text` field.
- Content extracted using the `extract_text` parameter.

- Documents with `document_type: "structured"` contain:
- Text extracted from tables and charts and processed to preserve structural information.
- Content extracted using the `extract_tables` and `extract_charts` parameters.
- Content stored in the `text` field after being processed from the `table_content` metadata.

:::note
Images are currently not supported and will be skipped during processing.
:::
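Downstream components can branch on `document_type`. The dicts below are a hypothetical stand-in mimicking only the documented fields of the output `Data` objects:

```python
# Hypothetical sample output: each entry mirrors the documented Data fields.
results = [
    {"text": "Quarterly revenue grew 12%.", "file_path": "report.pdf",
     "document_type": "text", "description": "Body paragraph"},
    {"text": "Q1|Q2|Q3", "file_path": "report.pdf",
     "document_type": "structured", "description": "Revenue table"},
]

def by_type(data, document_type):
    """Return only the entries with the given document_type."""
    return [d for d in data if d["document_type"] == document_type]

tables_and_charts = by_type(results, "structured")
plain_text = by_type(results, "text")
```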
@@ -22,7 +22,7 @@ This opens a starter flow with the necessary components to run an agentic applic

## Simple Agent flow

![Simple agent starter flow](/img/starter-flow-simple-agent.png)

The **Simple Agent** flow consists of these components:

