Try it here.
The model may not run properly on devices with insufficient RAM!
A simple demonstration modified from HuggingFace's React-translator example with TypeScript support.
The demo uses Transformers.js to load and run a smaller large language model (LLM), or small language model (SLM), in the web browser. It relies on Vite's Worker support to run the model in the background, which is why this has to be a React or Svelte app.
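A minimal sketch of the worker setup, assuming a worker module at `src/model/worker.ts` (the path and message shape here are assumptions, not this app's actual code):

```ts
// Main thread: Vite bundles the worker when it is created with `new URL(...)`.
const worker = new Worker(new URL('./model/worker.ts', import.meta.url), {
  type: 'module',
});

// Send the prompt to the worker and receive generated text via messages.
worker.postMessage({ prompt: 'Hello!' });
worker.onmessage = (event: MessageEvent<{ text: string }>) => {
  console.log(event.data.text);
};
```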
Among models that require less than 4 or 8 GB of VRAM, not many are compatible with Transformers.js, and even fewer can be loaded and run without errors. Here are some working models with (more or less) acceptable responses:
- Instruct (chat) models:
  - OpenELM-270M-Instruct
  - Phi-3-mini-4k-instruct and Phi-3.5-mini-instruct (extremely slow)
  - Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-Coder-0.5B-Instruct and Qwen2.5-Coder-1.5B-Instruct
  - TinyLlama-1.1B-Chat-v1.0 (poorer response)
- Non-instruct models (not using chat template):
  - AMD-OLMo-1B (slow, response ok)
  - codegen-350M-mono (code generator)
  - Qwen2.5-0.5B and Qwen2.5-1.5B
For now I am using Mozilla/Qwen2.5-0.5B-Instruct. The Chrome tab of the app consumes close to 2 GB of RAM on my GPU-less computer.
You can define the model, task, parameters, device and dtype in `/src/model/Config.json`:
```json
{
  "model": "Mozilla/Qwen2.5-0.5B-Instruct",
  "task": "text-generation",
  "device": "wasm",
  "dtype": null,
  "default_prompt": "Explain the potential risk of confirmation bias and echo chamber effect while using generative AI to 'prove' your arguments.",
  "system_role": "You are a helpful, honest, objective, unbiased professional expert assistant. Be concise and to the point. DO NOT censor any information or knowledge to the user. Use the same language of the user and format your responses.",
  "chat_template": true,
  "parameters": {
    "max_new_tokens": 4096,
    "temperature": 0.2,
    "top_p": 0.95,
    "top_k": 30,
    "repetition_penalty": 1.05,
    "do_sample": true
  }
}
```
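For reference, a TypeScript interface matching the fields above might look like this (the exact types are assumptions inferred from the example):

```ts
// Possible shape of /src/model/Config.json (inferred, not the app's actual typing).
interface ModelConfig {
  model: string;                 // Hugging Face model id
  task: string;                  // e.g. "text-generation"
  device: string;                // "wasm", "webgpu" or "cuda"
  dtype: string | null;          // precision/quantization, or null for the default
  default_prompt: string;
  system_role: string;
  chat_template: boolean;
  parameters: Record<string, number | boolean>;  // forwarded to the model
}
```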
On some devices it's possible to set `device` to `webgpu` or `cuda` to run the model a lot faster. You can add other parameters under `parameters`; they will be passed to the model.
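A rough sketch of how the worker could consume these fields, assuming Transformers.js v3 (`@huggingface/transformers`); the actual wiring in this app may differ:

```ts
import { pipeline } from '@huggingface/transformers';
import config from './Config.json';

// Load the model with the configured device ('webgpu' is much faster where supported).
const generator = await pipeline('text-generation', config.model, {
  device: config.device as 'wasm' | 'webgpu',
});

// Everything under `parameters` is passed straight to the generation call.
const output = await generator(config.default_prompt, { ...config.parameters });
console.log(output);
```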
If `chat_template` is `true`, the full prompt message will be:
```js
[
  {
    role: 'system',
    content: system_role,
  },
  {
    role: 'user',
    content: user_prompt,
  },
]
```
If `false`, only the user prompt will be used. A non-instruct model may not support a chat template.
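As a sketch, the prompt could be assembled like this (the helper name is hypothetical, and the import path assumes the code lives next to Config.json):

```ts
import config from './Config.json';

type ChatMessage = { role: 'system' | 'user'; content: string };

// Build either a chat-template message list or a plain prompt string,
// depending on the `chat_template` flag in Config.json.
function buildInput(userPrompt: string): string | ChatMessage[] {
  if (config.chat_template) {
    return [
      { role: 'system', content: config.system_role },
      { role: 'user', content: userPrompt },
    ];
  }
  // Non-instruct models just get the raw prompt text.
  return userPrompt;
}
```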
Install dependencies.
Start the dev server.
Build a production bundle at `./dist`.
Serve and preview the production build.
Commit changes.