Ollama does not seem to be able to process long documents #56

Open
benradey opened this issue Jan 7, 2025 · 5 comments

Comments

@benradey

benradey commented Jan 7, 2025

Describe the bug
I have Paperless-AI set up and working with a local Paperless-ngx and local Ollama. The instance successfully processes and tags short documents. However, the Paperless-AI instance silently fails on long documents.

I believe that Paperless-AI is sending Ollama a prompt that is longer than the context window, so the prompt is being truncated.

Looking at the logs for Paperless-AI, I see the following:

Fetching content for document: 7
Document Data: {
  id: 7,
  correspondent: null,
  document_type: null,
  storage_path: null,
  title: '2024-12-29 EZ BioResearch EZ Science Fair Project E-Book 2015-v3 m1',
  content: 'EZ BioResearch Bacteria Science Kit (10-Pack)\n' +
    '(Pre-poured LB Agar Plates and Cotton Swabs)\n' +
    '\n' +
    'EZ Science Fair Project E-Book 2015 V-3\n' +
    '\n' +
    '\n' +
    '\n' +
    '\n' +
    'www.ezbioresearch.com Tel: 1(800) 637-0262 Fax: 1(877) 693-4868\n' +
    'EZ Science Fair Project E-Book 2015\n' +
    '\n' +
    '\n' +
    'About EZ Science Fair Project E-Book\n' +
    'In this EZ Science Fair Project E-Book, we designed ten sets of experiments exploring various\n' +
    'aspects of bacteria and their interaction with us. The first four sets of experiments (experiments\n' +
    '#1, #2, #3 and #4) will show that we are surrounded by bacteria. Bacteria are present everywhere,\n' +
    'in our home where we live, in our school where we study, on personal objects which we touch\n' +
    'and use, and even on our own body parts that we take in our food and air. These experiments will\n' +
    'help children/students to visually understand that bacteria are present in our surrounding\n' +
    'although we cannot see them when bacteria are present in single or small numbers of cells. The\n' +
    'second two sets of experiments (experiments #5 and #6) will allow us to measure the\n' +
    'effectiveness of our cleaning methods, which will help children/students to establish good\n' +
    'personal hygiene. The next two sets of experiments (experiments #7 and #8) will explore the\n' +
    'fruit that we eat and the famous 5-second rule. Experiment #9 demonstrates that some bacteria\n' +
    'are beneficial to our life. The last experiment #10 will show children/students how to\n' +
    'quantitatively measure and compare the effectiveness of various antibacterial agents.\n' +
    '\n' +
    'These ten experiments with different variables can be easily expanded to about fifty or more\n' +
    'experiments. Hopefully these proposed experiments can provide you with some guide lines and\n' +
    'some stimulating ideas about designing award-winning science fair projects.\n' +
    '\n' +
    'We plan to add more experiments to our EZ Science Fair Project E-Book. Here is a link for you\n' +
    'to check and download any future update to the EZ Science Fair Project E-Book.\n' +
    '\n' +
    'http://www.ezbioresearch.com/E-Book2015v2_ep_54-1.html\n' +
    '\n' +
    'If you have any question or have trouble in downloading the E-Book, please feel free to contact\n' +
    'us at (800) 637-0262 or support@ezbioresearch.com.\n' +
    '\n' +
    'Enjoy and have fun!\n' +
    '\n' +
    '\n' +
    '\n' +
    '\n' +
    'i\n' +
    'EZ BioResearch, LLC.• 3830 Washington Ave, St. Louis, MO 63108 • www.ezbioresearch.com\n' +
    'Toll Free: 1-800-637-0262 • Fax: 1-877-693-4868\n' +
    'EZ Science Fair Project E-Book 2015\n' +
    '\n' +
    '\n' +
    'About EZ BioResearch LLC\n' +
    'EZ BioResearch is an innovation-driven biotech company which focuses on developing,\n' +
    'manufacturing and distributing advanced molecular biology tool kits and high quality laboratory\n' +
    'consumable products. (www.ezbioresearch.com)\n' +
    '\n' +
    'EZ BioResearch offers microbiology tools and reagents including bacteria science kits, bacteria\n' +
    'DNA isolation kits, agar plates, antibacterial agar plates, culture tubes, Petri dishes, inoculating\n' +
    'loops and needles, T-spreaders, etc. All our bacteria science kits are manufactured and packaged\n' +
    'in US by EZ BioResearch. All products and kits have gone through strict quality control\n' +
    'procedure to ensure high quality of the products.\n' +
    '\n' +
    'EZ BioResearch Bacteria Science Kit has been ranked #1 best selling product since 2012 in pre-\n' +
    'poured agar plate category. EZ BioResearch Bacteria Science Kit received more than 300\n' +
    'reviews, ten times more than other vendors. 99% of our customers are satisfied with their\n' +
    'purchases. Many customers won science fair project competitions using EZ BioResearch\n' +
    'Bacteria Science Kit. We are encouraged by the overwhelming number of the good reviews that\n' +
    'we received and we are also grateful for those customers with constructive feedback. We\n' +
    'actually improve the quality of our products by adding new desired features requested by our\n' +
    'customers.\n' +
    '\n' +
    'EZ BioResearch is proud to be your partner in your research and committed to deliver the most\n' +
    'innovative products with the highest quality. We continually improve and expand our product\n' +
    'portfolio to include the best products in the market. We have established a comprehensive\n' +
    'quality assurance program to meet and exceed the expectation of our customers. We guarantee\n' +
    'that the products that we offer are the best possible quality for the price.\n' +
    '\n' +
    '\n' +
    'Contact Information:\n' +
    'If you have any technical question, please call us at 1(800) 637-0262 or email us at\n' +
    'support@ezbioresearch.com.\n' +
    '\n' +
    'If you have any question or problem on your Amazon prime membership order, please call\n' +
    'Amazon Customer Support directly at 1-866-216-1072, 24 hours a day, 7 days a week.\n' +
    '\n' +
    'If you have any question or problem on your Amazon non-prime membership order, please call\n' +
    'us at 1(800) 637-0262 or email us at support@ezbioresearch.com.\n' +
    '\n' +
    'Copy Right Information:\n' +
    'Copyright ©2015 by EZ BioResearch, LLC. All rights reserved. No part of this E-book may be\n' +
    'reproduced in any form without written permission of EZ BioResearch LLC.\n' +
    '\n' +
    '\n' +
    '\n' +
    'ii\n' +
    'EZ BioResearch, LLC.• 3830 Washington Ave, St. Louis, MO 63108 • www.ezbioresearch.com\n' +
    'Toll Free: 1-800-637-0262 • Fax: 1-877-693-4868\n' +
    'EZ Science Fair Project E-Book 2015\n' +
    '\n' +
    '\n' +
    '\n' +
    '\n' +
    'Table of Contents\n' +
    '\n' +
    'Welcome and Congratulations!................................................................................................... - 2 -\n' +
    'Introduction ................................................................................................................................. - 3 -\n' +
    'Background Information on Bacteria.......................................................................................... - 5 -\n' +
    'General Experimental Procedure ................................................................................................ - 6 -\n' +
    'Experiment #1 ............................................................................................................................. - 7 -\n' +
    'Question: Where can you find bacteria inside your home? ........................................................ - 7 -\n' +
    'Experiment #2 ............................................................................................................................. - 9 -\n' +
    'Question: Where can you find bacteria in your school? ............................................................. - 9 -\n' +
    'Experiment #3 ........................................................................................................................... - 10 -\n' +
    'Question: What is the dirtiest object that you have?................................................................. - 10 -\n' +
    'Experiment #4 ........................................................................................................................... - 11 -\n' +
    'Question: Are there any bacteria on your body? ...................................................................... - 11 -\n' +
    'Experiment #5 ........................................................................................................................... - 12 -\n' +
    'Question: What is the best way to clean our hands?................................................................. - 12 -\n' +
    'Experiment #6 ........................................................................................................................... - 14 -\n' +
    'Question: What is the best way to clean our teeth? .................................................................. - 14 -\n' +
    'Experiment #7 ........................................................................................................................... - 16 -\n' +
    'Question: Are there any bacteria on fruit? ................................................................................ - 16 -\n' +
    'Experiment #8 ........................................................................................................................... - 17 -\n' +
    'Question: Is 5-second rule a fact or a fiction? .......................................................................... - 17 -\n' +
    'Experiment #9 ........................................................................................................................... - 19 -\n' +
    'Question: Are there any good bacteria? .................................................................................... - 19 -\n' +
    'Experiment #10 ......................................................................................................................... - 21 -\n' +
    'Question: What is the best anti-bacterial agent? ....................................................................... - 21 -\n' +
    '\n' +
    '\n' +
    '\n' +
    '\n' +
    '-1-\n' +
    'EZ BioResearch, LLC.• 3830 Washington Ave, St. Louis, MO 63108 • www.ezbioresearch.com\n' +
    'Toll Free: 1-800-637-0262 • Fax: 1-877-693-4868\n' +
    'EZ Science Fair Project E-Book 2015\n' +
    '\n' +
    '\n' +
    'Welcome and Congratulations!\n' +
    'Thank you for purchasing our #1 best selling EZ BioResearch Bacteria Science Kit! At\n' +
    'EZ BioResearch, we would like to congratulate you on making the right choice. There are many\n' +
    'reasons that you may choose EZ BioResearch Bacteria Science Kit over others. However, we\n' +
    'would like to point out to you two most critical advantages of purchasing EZ BioResearch\n' +
    'Bacteria Science Kit that you may or may not be aware of.\n' +
    '\n' +
    '1) SAFE FOR CHILDREN/STUDENTS: Our Bacteria Science Kit is safe for\n' +
    "children/students. Children/students' safety is of our utmost concern when we design bacteria\n" +
    'science kits. We use Luria Broth (LB) based medium in our agar plates. LB medium is a\n' +
    'nutrient rich medium which allows fast and proliferative growth of bacteria. LB is the most\n' +
    'commonly used medium in the research laboratories all over the world. It is safe for\n' +
    'children/students to use due to its non-selectivity. Not all bacteria growth media are safe for\n' +
    'children/students to use. Tryptic Soy (TS) medium, one of the other known nutrient media on\n' +
    'the market, is not safe for children/students to use. Due to its selectivity, it may selectively\n' +
    'produce or enrich harmful pathogens which may cause illness to the children/students. You\n' +
    'may check the science buddies website (http://www.sciencebuddies.org/science-fair-\n' +
    'projects/project_ideas/MicroBio_Agar.shtml) to get more safety information about the\n' +
    'bacterial culture medium. Many parents and teachers who purchased Tryptic Soy agar plates\n' +
    'might not be aware of the danger that they brought to their children/students. Please do spread\n' +
    'the words to other student'... 34510 more characters,
  tags: [],
  created: '2025-01-07T12:51:38-05:00',
  created_date: '2025-01-07',
  modified: '2025-01-07T13:11:19.207303-05:00',
  added: '2025-01-07T12:53:20.825003-05:00',
  deleted_at: null,
  archive_serial_number: null,
  original_file_name: '2024-12-29 EZ BioResearch EZ Science Fair Project E-Book 2015-v3 m1.pdf',
  archived_file_name: '2025-01-07 2024-12-29 EZ BioResearch EZ Science Fair Project E-Book 2015-v3 m1.pdf',
  owner: 3,
  user_can_change: true,
  is_shared_by_requester: false,
  notes: [],
  custom_fields: [],
  page_count: 25
}
No JSON found in response: This text appears to be a manual for a science experiment kit, specifically designed for kids or beginners. The kit is likely designed to teach students about the effects of antibacterial agents on bacterial growth.
The manual includes two experiments:
**Experiment 1: Zone of Inhibition**
* Objective: To test the effectiveness of different mouthwashes in stopping the growth of bacteria from teeth.
* Procedure:
	1. Collect a sample of bacteria from the teeth area using a cotton swab.
	2. Transfer the bacteria to an agar plate and let it incubate for 12-24 hours.
	3. Create disks with filter paper, soaked with different mouthwashes, including a negative control (bottle water).
	4. Place the disks on the agar plate and incubate for another 12-24 hours.
	5. Measure the diameter of the zone of inhibition around each disk and rank the effectiveness of antibacterial agents.
**Experiment 2: Antibiotic Effectiveness**
* Objective: To compare the effectiveness of different mouthwashes in stopping the growth of bacteria from teeth.
* Procedure:
	1. Follow the same steps as Experiment 1, up to step 6.
	2. Instead of placing the disks on the agar plate, use a transfer pipette to soak the filter paper disk with each mouthwash and place it on a new agar plate.
	3. Repeat step 10 for all the disks.
	4. Measure the diameter of the zone of inhibition around each disk and rank the effectiveness of antibacterial agents.
The manual provides detailed instructions, diagrams, and tips for conducting these experiments safely and effectively. The kit likely includes necessary materials, such as agar plates, filter paper, mouthwashes, and pipettes, to facilitate the experiments.

Looking at the Ollama logs, I only see the following:

time=2025-01-07T18:24:50.670Z level=WARN source=runner.go:129 msg="truncating input prompt" limit=2048 prompt=11478 keep=5 new=2048
[GIN] 2025/01/07 - 18:25:29 | 200 | 38.879079409s |      10.0.2.100 | POST     "/api/generate"

When I press the "Analyze with AI" button at the /manual route and inspect the network request in the browser console, the response contains no tags and no correspondent:

{"tags":[],"correspondent":null}

It looks like Paperless-AI may need to set the num_ctx parameter in its requests to Ollama: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size
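
For illustration, here is a minimal sketch of what a request body with an explicit context window could look like. The `options.num_ctx` field is the one described in the Ollama FAQ linked above; the function name, the model name, and the default of 8192 are assumptions made for this example and are not taken from the Paperless-AI code.

```javascript
// Sketch: build the JSON body for Ollama's POST /api/generate with an
// explicit context window. buildGenerateRequest, the model name, and the
// 8192 default are hypothetical, chosen only for this example.
function buildGenerateRequest(prompt, { model = "llama3.2", numCtx = 8192 } = {}) {
  return {
    model,
    prompt,
    stream: false,
    // Without options.num_ctx, Ollama uses its own limit (2048 in the
    // log above) and truncates anything longer.
    options: { num_ctx: numCtx },
  };
}

const body = buildGenerateRequest("Analyze this document ...", { numCtx: 16384 });
console.log(JSON.stringify(body, null, 2));
```

The resulting object would be sent as the JSON body of the POST to `/api/generate`. A larger `num_ctx` also raises memory use on the Ollama host, so the value has to match what the machine can hold.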

@clusterzx
Owner

Yeah, that's an issue I have already addressed in the next update.

Stay tuned. Sorry for the inconvenience.

@clusterzx
Owner

But you also have to increase that parameter in Ollama itself.

@paradizelost

This value is hard-coded to 10000 in https://github.com/clusterzx/paperless-ai/blob/main/services/ollamaService.js on lines 40 and 109.
I overrode that setting to 40000 on my instance, and most of the documents that had been hitting the limit then completed successfully. However, on one document Ollama ran out of memory and returned an HTTP 500 error, which caused the whole process to crash. @clusterzx, is there a way for Paperless-AI to limit what it sends to Ollama so the prompt stays under the context window, pre-truncating the request itself instead of letting Ollama truncate it?
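
A rough sketch of the kind of pre-truncation asked about here, assuming a simple heuristic of about 4 characters per token. A real tokenizer would give different counts, and all function and parameter names below are hypothetical rather than part of Paperless-AI.

```javascript
// Assumption: roughly 4 characters per token for English text.
// This is only an estimate; the model's real tokenizer will differ.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Trim the document content so that instructions + content + the tokens
// reserved for the model's reply fit inside the numCtx window.
function truncateToContext(instructions, content, numCtx, reservedForReply = 1024) {
  const budget = numCtx - reservedForReply - estimateTokens(instructions);
  if (budget <= 0) {
    throw new Error("Context window too small for the instructions alone");
  }
  const maxChars = budget * CHARS_PER_TOKEN;
  return content.length <= maxChars ? content : content.slice(0, maxChars);
}

// Example: a 50,000-character document against a 10000-token window.
const longContent = "x".repeat(50000);
const trimmed = truncateToContext("Return JSON with tags.", longContent, 10000);
console.log(`kept ${trimmed.length} of ${longContent.length} characters`);
```

Cutting from the end keeps the first pages of the document, which often hold the title and correspondent. Whether that is the right part to keep depends on the documents being tagged.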

@paradizelost

Also, would it be possible to move num_ctx to a UI setting or environment variable rather than hard-coding it in ollamaService.js? I wasn't sure whether this was one of the updates you already have in the works.
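
As a sketch of the environment-variable option: the variable name `OLLAMA_NUM_CTX` and the helper below are hypothetical, and the fallback of 10000 simply mirrors the hard-coded value mentioned above.

```javascript
// Read the context window size from the environment, falling back to a
// default when the variable is unset or not a positive integer.
// OLLAMA_NUM_CTX is a hypothetical name chosen for this example.
function resolveNumCtx(env = process.env, fallback = 10000) {
  const parsed = Number.parseInt(env.OLLAMA_NUM_CTX ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

console.log(resolveNumCtx({ OLLAMA_NUM_CTX: "40000" }));
console.log(resolveNumCtx({}));
```

Passing `env` as a parameter keeps the helper easy to test without touching the real process environment.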

@paradizelost

Also, lol, I just had one get cut off even with 40,000 because the prompt was 91631...
