chore: fix Docling example (Colab env integration, typos) (#16425)

run-llama · Oct 8, 2024 · 7d9d8be
1 parent 40d1161
Showing 1 changed file with 56 additions and 76 deletions.
diff --git a/docs/docs/examples/data_connectors/DoclingReaderDemo.ipynb b/docs/docs/examples/data_connectors/DoclingReaderDemo.ipynb
@@ -27,7 +27,7 @@
    "source": [
     "[Docling](https://github.com/DS4SD/docling) extracts PDF documents into a rich representation (incl. layout, tables etc.), which it can export to Markdown or JSON.\n",
     "\n",
-    "The `DoclingReader` seamlessly integrates Docling into LlamaIndex, enabling you to:\n",
+    "Docling Reader and Docling Node Parser presented in this notebook seamlessly integrate Docling into LlamaIndex, enabling you to:\n",
     "- use PDF documents in your LLM applications with ease and speed, and\n",
     "- leverage Docling's rich format for advanced, document-native grounding."
    ]
@@ -36,31 +36,32 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Notebook setup"
+    "## Setup"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "> 👉 For best conversion speed, use GPU acceleration whenever available (e.g. if running on Colab, use a GPU-enabled runtime)."
+    "- 👉 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use GPU-enabled runtime.\n",
+    "- Notebook uses HuggingFace's Inference API; for increased LLM quota, token can be provided via env var `HF_TOKEN`.\n",
+    "- Requirements can be installed as shown below (`--no-warn-conflicts` meant for Colab's pre-populated Python env; feel free to remove for stricter usage):"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Note: you may need to restart the kernel to use updated packages.\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
-    "%pip install -q llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-readers-file python-dotenv"
+    "%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-readers-file python-dotenv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can now define the main parameters:"
    ]
   },
   {
@@ -74,14 +75,28 @@
     "import os\n",
     "from dotenv import load_dotenv\n",
     "\n",
+    "\n",
+    "def get_env_from_colab_or_os(key):\n",
+    "    try:\n",
+    "        from google.colab import userdata\n",
+    "\n",
+    "        try:\n",
+    "            return userdata.get(key)\n",
+    "        except userdata.SecretNotFoundError:\n",
+    "            pass\n",
+    "    except ImportError:\n",
+    "        pass\n",
+    "    return os.getenv(key)\n",
+    "\n",
+    "\n",
     "load_dotenv()\n",
-    "source = \"https://arxiv.org/pdf/2408.09869\"  # Docling Technical Report\n",
-    "query = \"Which are the main AI models in Docling?\"\n",
-    "embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\n",
-    "gen_model = HuggingFaceInferenceAPI(\n",
-    "    token=os.getenv(\"HF_TOKEN\"),\n",
+    "EMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\n",
+    "GEN_MODEL = HuggingFaceInferenceAPI(\n",
+    "    token=get_env_from_colab_or_os(\"HF_TOKEN\"),\n",
     "    model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n",
-    ")"
+    ")\n",
+    "SOURCE = \"https://arxiv.org/pdf/2408.09869\"  # Docling Technical Report\n",
+    "QUERY = \"Which are the main AI models in Docling?\""
    ]
   },
   {
@@ -96,7 +111,7 @@
    "metadata": {},
    "source": [
     "To create a simple RAG pipeline, we can:\n",
-    "- define a `DoclingPDFReader`, which by default exports to Markdown, and\n",
+    "- define a `DoclingReader`, which by default exports to Markdown, and\n",
     "- use a standard node parser for these Markdown-based docs, e.g. a `MarkdownNodeParser`"
    ]
   },
@@ -105,20 +120,6 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "4b7b5ee0f1b945f49103169144091dfa",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
     {
      "name": "stdout",
      "output_type": "stream",
@@ -153,12 +154,12 @@
     "node_parser = MarkdownNodeParser()\n",
     "\n",
     "index = VectorStoreIndex.from_documents(\n",
-    "    documents=reader.load_data(source),\n",
+    "    documents=reader.load_data(SOURCE),\n",
     "    transformations=[node_parser],\n",
-    "    embed_model=embed_model,\n",
+    "    embed_model=EMBED_MODEL,\n",
     ")\n",
-    "result = index.as_query_engine(llm=gen_model).query(query)\n",
-    "print(f\"Q: {query}\\nA: {result.response.strip()}\\n\\nSources:\")\n",
+    "result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\n",
+    "print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\n",
     "display([(n.text, n.metadata) for n in result.source_nodes])"
    ]
   },
@@ -174,7 +175,7 @@
    "metadata": {},
    "source": [
     "To leverage Docling's rich native format, we:\n",
-    "- create a `DoclingPDFReader` with JSON export type, and\n",
+    "- create a `DoclingReader` with JSON export type, and\n",
     "- employ a `DoclingNodeParser` in order to appropriately parse that Docling format.\n",
     "\n",
     "Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):"
@@ -185,20 +186,6 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "4f0bfd6b70ba4b79a7620cb08b209300",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
     {
      "name": "stdout",
      "output_type": "stream",
@@ -243,12 +230,12 @@
     "node_parser = DoclingNodeParser()\n",
     "\n",
     "index = VectorStoreIndex.from_documents(\n",
-    "    documents=reader.load_data(source),\n",
+    "    documents=reader.load_data(SOURCE),\n",
     "    transformations=[node_parser],\n",
-    "    embed_model=embed_model,\n",
+    "    embed_model=EMBED_MODEL,\n",
     ")\n",
-    "result = index.as_query_engine(llm=gen_model).query(query)\n",
-    "print(f\"Q: {query}\\nA: {result.response.strip()}\\n\\nSources:\")\n",
+    "result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\n",
+    "print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\n",
     "display([(n.text, n.metadata) for n in result.source_nodes])"
    ]
   },
@@ -277,8 +264,8 @@
     "import requests\n",
     "\n",
     "tmp_dir_path = Path(mkdtemp())\n",
-    "r = requests.get(source)\n",
-    "with open(tmp_dir_path / f\"{Path(source).name}.pdf\", \"wb\") as out_file:\n",
+    "r = requests.get(SOURCE)\n",
+    "with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n",
     "    out_file.write(r.content)"
    ]
   },
@@ -294,13 +281,6 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Loading files: 100%|██████████| 1/1 [00:10<00:00, 10.29s/file]\n"
-     ]
-    },
     {
      "name": "stdout",
      "output_type": "stream",
@@ -315,12 +295,12 @@
      "data": {
       "text/plain": [
        "[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n",
-       "  {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmpgrhz7355/2408.09869.pdf',\n",
+       "  {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmphcglm858/2408.09869.pdf',\n",
        "   'file_name': '2408.09869.pdf',\n",
        "   'file_type': 'application/pdf',\n",
        "   'file_size': 5566574,\n",
-       "   'creation_date': '2024-10-07',\n",
-       "   'last_modified_date': '2024-10-07',\n",
+       "   'creation_date': '2024-10-08',\n",
+       "   'last_modified_date': '2024-10-08',\n",
        "   'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',\n",
        "   'path': '#/main-text/37',\n",
        "   'heading': '3.2 AI models',\n",
@@ -330,12 +310,12 @@
        "    506.29705810546875,\n",
        "    407.3725280761719]}),\n",
        " ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n",
-       "  {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmpgrhz7355/2408.09869.pdf',\n",
+       "  {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmphcglm858/2408.09869.pdf',\n",
        "   'file_name': '2408.09869.pdf',\n",
        "   'file_type': 'application/pdf',\n",
        "   'file_size': 5566574,\n",
-       "   'creation_date': '2024-10-07',\n",
-       "   'last_modified_date': '2024-10-07',\n",
+       "   'creation_date': '2024-10-08',\n",
+       "   'last_modified_date': '2024-10-08',\n",
        "   'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',\n",
        "   'path': '#/main-text/10',\n",
        "   'heading': '1 Introduction',\n",
@@ -359,12 +339,12 @@
     ")\n",
     "\n",
     "index = VectorStoreIndex.from_documents(\n",
-    "    documents=dir_reader.load_data(source),\n",
+    "    documents=dir_reader.load_data(SOURCE),\n",
     "    transformations=[node_parser],\n",
-    "    embed_model=embed_model,\n",
+    "    embed_model=EMBED_MODEL,\n",
     ")\n",
-    "result = index.as_query_engine(llm=gen_model).query(query)\n",
-    "print(f\"Q: {query}\\nA: {result.response.strip()}\\n\\nSources:\")\n",
+    "result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\n",
+    "print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\n",
     "display([(n.text, n.metadata) for n in result.source_nodes])"
    ]
   }
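
Review note: the `get_env_from_colab_or_os` helper introduced in this commit can be checked in isolation. The sketch below copies the helper from the diff and calls it with two hypothetical variable names (`DEMO_DOCLING_TOKEN` and `DEMO_DOCLING_TOKEN_MISSING` are made up for this illustration and are not part of the notebook). It assumes a plain Python environment without `google.colab` installed, so the `ImportError` branch is taken and the lookup falls through to `os.getenv`.

```python
import os


def get_env_from_colab_or_os(key):
    # Same logic as the helper in the diff: prefer a Colab secret,
    # otherwise fall back to the regular OS environment.
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass  # secret not defined in Colab; try the OS environment
    except ImportError:
        pass  # not running on Colab
    return os.getenv(key)


# Hypothetical variable names, used only for this illustration.
os.environ["DEMO_DOCLING_TOKEN"] = "dummy-value"
os.environ.pop("DEMO_DOCLING_TOKEN_MISSING", None)

found = get_env_from_colab_or_os("DEMO_DOCLING_TOKEN")
missing = get_env_from_colab_or_os("DEMO_DOCLING_TOKEN_MISSING")
print(found, missing)
```

Under that assumption, a variable that is set is returned as-is and an unset one yields `None`, so `HuggingFaceInferenceAPI` receives `token=None` when no `HF_TOKEN` is configured, matching the behavior of the previous `os.getenv("HF_TOKEN")` call.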