Replies: 10 comments 14 replies
-
I have noticed the same thing. I have a much smaller setup, but I'm running Fabric in WSL on the same host as Ollama, so local only. I also tried creating a custom pattern but still get similar results to what is listed by the OP. Is there a source that could explain how to format a pattern so that local llama3 would output in a structured format? Thanks!
-
I had the same problem with llama3 locally. What's weird is that when I give the same prompt to Groq llama3-8b-8192, I get the correct result.
-
Have you tried llama3 8b running locally with LM Studio?
-
What hardware does everyone have? Is this related to GPU limitations by any chance? I am having the same issue.
-
I have a 7+ year old GTX 1070 running under Windows 10 WSL that works without any problems.
-
Llama3 (and any other LLM, for that matter) is really part of a family of models fine-tuned on different datasets. Different fine-tunes will follow certain instructions to different degrees. You might have more luck with the "instruct" variant of llama3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct. Of course, how you word and format the prompt and examples matters, but we're trying to abstract that away here.
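If you want to try that through Ollama, a rough sketch might look like the following (assuming the llama3:instruct tag in the Ollama library points at the instruct fine-tune, and reusing the extract_wisdom invocation from this thread; the video URL is just a placeholder):
# pull the instruct fine-tune and point fabric at it
ollama pull llama3:instruct
yt --transcript <video url> | fabric -m llama3:instruct -sp extract_wisdom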
-
I am experiencing the exact same issue. My setup is:
-
Hey everyone, check out the potential solution here: https://medium.com/@celobusana/solving-fabric-and-local-ollama-context-issues-a-step-by-step-guide-1d67e443e27e |
-
I had the exact same problem when I first started using fabric on the command line. The format of a pattern, like extract_wisdom, came out wonderful when using my OpenAI API key and their models. When using local models pulled from Ollama, I was only getting a very short summary.
I moved on, playing around with the fabric framework, and started stitching (piping) patterns together. Randomly, I tried stitching extract_wisdom together twice, just to see what it would output. Voila! Llama3 output a very nice response, with the proper format for that pattern! Try the pipeline sketched below and see if that gives you the proper response and format for that specific pattern (extract_wisdom). This allows for the proper response without changing any parameters. You do, however, have to pipe to the command a second time.
I have no idea why this is; it was just a random discovery. Maybe the prompt isn't being passed as a system prompt locally, in Ollama (or whatever you're using to run your local models). Or maybe it is another issue entirely. Someone smarter than me will figure it out. This is an amazing framework, so I'm sure it will be addressed in due time.
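A minimal sketch of that double-stitched pipeline, reusing the yt command and llama3:latest model already shown in the original post (the URL and model tag are just the ones from this thread):
# run extract_wisdom once, then feed its output through the same pattern a second time
yt --transcript https://www.youtube.com/watch?v=UbDyjIIGaxQ | fabric -m llama3:latest -sp extract_wisdom | fabric -m llama3:latest -sp extract_wisdom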
-
After reading all of the suggestions, and trying to resolve it by creating a custom model with a num_ctx PARAMETER (see the Modelfile sketch below), I was able to narrow it down to the ollama configuration, because everything works properly with LM Studio using any of the open-source models. For people with the same issue, I would suggest trying LM Studio until they find another solution.
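A minimal sketch of that kind of custom model, assuming an Ollama Modelfile built on the llama3:latest base with an illustrative 8192-token context window (the name llama3-8k is made up for this example):
# Modelfile: same llama3 base, larger context window
FROM llama3:latest
PARAMETER num_ctx 8192
# build the custom model and point fabric at it
ollama create llama3-8k -f Modelfile
yt --transcript <video url> | fabric -m llama3-8k -sp extract_wisdom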
-
Hi all,
I am playing with the fabric extract_wisdom pattern. While gpt-4 creates a really nice output with all the ideas, insights, quotes, habits, facts, references, takeaways, recommendations, etc., the local LLM produces something like 5-15 sentences without formatting or anything similar to the gpt-4 output.
Are there some settings for ollama that I can change, or is this an LLM limitation and nothing can be changed for it to produce similar results locally?
GPU: NVIDIA 4090
RAM: 32 GB
CPU: Intel 11900k
OS: Arch Linux
If there are options I can change in the ollama settings, or some parameter I can add to the input to make it generate longer text, I would be happy to hear about them :)
The input:
yt --transcript https://www.youtube.com/watch?v=UbDyjIIGaxQ | fabric -m llama3:latest -sp extract_wisdom
The output:
"The video is discussing the concept of fabric, an AI tool that helps users extract surprising, insightful, and interesting information from text content. The speaker, who is also the creator of fabric, shares his thoughts on how to use fabric effectively, including defining what you're trying to do with the tool and using it to distill down overwhelming amounts of information.
The video also touches on the idea of AI augmenting human capabilities, rather than replacing them. The speaker believes that AI can help humans become better by taking current capabilities and increasing them at a faster rate than before.
Additionally, the video discusses how the speaker uses fabric in his daily life, including recording conversations with friends and family, transcribing them with Whisper AI, and then using fabric to extract relevant information from those conversations. He also mentions that he has started using fabric to process Bible study discussions, which has helped him to better retain important details.
The video concludes by highlighting the Obsidian Save feature in fabric, which allows users to save notes directly to their Obsidian note-taking app. The speaker demonstrates how to set up this feature and uses it to save a note from his GC analyzer to Obsidian.
Overall, the video is about exploring the potential of AI tools like fabric to improve human capabilities and make life easier."
The expected outcome would be something similar to what gpt-4 outputs, if that is possible at all.
Thank you all!