This fork of lit-gpt introduces Sliding Window Attention (SWA) as defined in the Mistral paper. The original repository supports the Mistral model with some limitations due to the absence of an SWA implementation.
SWA is introduced by adding two keywords to the config: a boolean to enable it and an optional integer to set the window size. The causal mask generated within lit-gpt and associated with the KV-cache is then updated to account for the sliding window, when enabled. Note that the same causal mask is generated when SWA is enabled but the KV-cache is not used; it is therefore assumed that this mechanism is used with causal models only.
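For intuition, here is a minimal sketch of what a sliding-window causal mask looks like (illustrative only; it is not the exact mask construction used inside lit-gpt):

```python
import torch

def build_swa_mask(block_size: int, sliding_window_size: int) -> torch.Tensor:
    """Boolean mask where entry [i, j] is True if query i may attend to key j."""
    # Standard causal constraint: token i attends to tokens j <= i.
    causal = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
    # Sliding-window constraint: token i attends only to tokens j > i - window.
    band = torch.triu(
        torch.ones(block_size, block_size, dtype=torch.bool),
        diagonal=-(sliding_window_size - 1),
    )
    return causal & band

# Example: with a window of 3, token 4 attends to tokens 2, 3 and 4 only.
print(build_swa_mask(block_size=6, sliding_window_size=3).int())
```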
A new set of tests has been added to validate the SWA implementation.
The Mistral configurations within this repo have not been changed to reflect the newly introduced SWA support.
An example of how to instantiate a Mistral model with SWA enabled follows (please refer to the lit-gpt tutorials for details on downloading and using model weight checkpoints).
```python
from lit_gpt import GPT, Config

config = Config.from_name(
    "Mistral-7B-Instruct-v0.1",  # Mistral checkpoint name
    block_size=32768,            # without SWA this is constrained to 4096
    use_sliding_window=True,     # enable SWA
    sliding_window_size=4096,
)
mistral = GPT(config)
```
The KV-cache of lit-gpt always returns tensors of the maximum possible size, and the cache mask then filters out all the values not covered by the SWA. However, in this way the benefits of SWA are lost, since attention still operates on the full block size (which, for the Mistral config above, is 8 times the SWA size). An ideal implementation would operate on tensors whose size equals the sliding window size.
To do so, an extra config parameter is introduced to enable a specific KV-cache optimisation (available only when SWA is enabled as well). This version of the KV-cache stores only the values covered by the sliding window, discarding the older values that are no longer required, effectively acting as a rolling cache.
To simplify the implementation, with S denoting the sliding window size, the optimised cache has size 2S and every new value is written twice, as follows: given p, the positional index of the new value, it is stored at indices p % S and p % S + S. In this way the S values required by the SWA always occupy consecutive positions within the cache, precisely the interval (p % S, p % S + S].
A dedicated cache mask then filters out the unneeded values.
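The indexing scheme is easier to see in code. Below is a minimal, self-contained sketch of the double-write idea (the `RollingKVCache` class and its `update` method are hypothetical names used for illustration, not the actual lit-gpt API; only keys are shown, values are handled identically):

```python
import torch

class RollingKVCache:
    """Illustrative rolling cache of size 2*S using the double-write scheme."""

    def __init__(self, S: int, n_head: int, head_size: int):
        self.S = S
        self.k = torch.zeros(n_head, 2 * S, head_size)  # cache of length 2*S

    def update(self, p: int, k_new: torch.Tensor) -> torch.Tensor:
        """Store the key of absolute position p twice, return the current window."""
        S = self.S
        self.k[:, p % S] = k_new      # first copy
        self.k[:, p % S + S] = k_new  # second copy, offset by S
        # The last S positions now occupy the consecutive slots (p % S, p % S + S];
        # slots not yet written (while p < S - 1) are removed later by the cache mask.
        return self.k[:, p % S + 1 : p % S + S + 1]
```

Note that position p and position p - S map to the same pair of slots, so old entries are overwritten exactly when they fall out of the sliding window.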
As a final note, this rolling KV-cache implementation does not reduce the asymptotic time complexity of attention or the cache memory requirements, which remain O(block_size² * n_embd) and O(block_size), respectively, but it does change the associated constant factors. As an example, in the case of Mistral the block size is 8 times the sliding window size S, and since the rolling cache holds 2S entries, the effective block size used within attention is 4 times smaller.
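As a quick sanity check of these factors, using the numbers from the config example above:

```python
block_size = 32768      # full context length from the config above
S = 4096                # sliding_window_size
rolling_cache = 2 * S   # entries kept by the rolling KV-cache

print(block_size // S)              # 8: block size vs. sliding window size
print(block_size // rolling_cache)  # 4: reduction of the effective attention span
```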
Hackable implementation of state-of-the-art open-source large language models released under the Apache 2.0 license.
Supports the following popular model checkpoints:
Model | Model size | Reference |
---|---|---|
Code Llama by Meta AI | 7B, 13B, 34B, 70B | Rozière et al. 2023 |
Dolly by Databricks | 3B, 7B, 12B | Conover et al. 2023 |
Falcon by TII UAE | 7B, 40B, 180B | TII 2023 |
FreeWilly2 (Stable Beluga 2) by Stability AI | 70B | Stability AI 2023 |
Function Calling Llama 2 by Trelis | 7B | Trelis et al. 2023 |
Llama 2 by Meta AI | 7B, 13B, 70B | Touvron et al. 2023 |
LongChat by LMSYS | 7B, 13B | LongChat Team 2023 |
Mistral and Mixtral by Mistral AI | 7B | Mistral website |
Nous-Hermes by NousResearch | 7B, 13B, 70B | Org page |
OpenLLaMA by OpenLM Research | 3B, 7B, 13B | Geng & Liu 2023 |
Phi by Microsoft Research | 1.3B, 2.7B | Li et al. 2023 |
Platypus by Lee et al. | 7B, 13B, 70B | Lee, Hunter, and Ruiz 2023 |
Pythia by EleutherAI | {14,31,70,160,410}M, {1,1.4,2.8,6.9,12}B | Biderman et al. 2023 |
RedPajama-INCITE by Together | 3B, 7B | Together 2023 |
StableCode by Stability AI | 3B | Stability AI 2023 |
StableLM by Stability AI | 3B, 7B | Stability AI 2023 |
StableLM Zephyr by Stability AI | 3B | Stability AI 2023 |
TinyLlama by Zhang et al. | 1.1B | Zhang et al. 2023 |
Vicuna by LMSYS | 7B, 13B, 33B | Li et al. 2023 |
This implementation builds on Lit-LLaMA and nanoGPT, and it is powered by Lightning Fabric ⚡.
🏆 NeurIPS 2023 Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day
The Lit-GPT repository was the official starter kit for the NeurIPS 2023 LLM Efficiency Challenge, which is a competition focused on finetuning an existing non-instruction tuned LLM for 24 hours on a single GPU.
This repository follows the main principle of openness through clarity.
Lit-GPT is:
- Simple: Single-file implementation without boilerplate.
- Correct: Numerically equivalent to the original model.
- Optimized: Runs fast on consumer hardware or at scale.
- Open-source: No strings attached.
Avoiding code duplication is not a goal. Readability and hackability are.
Join our Discord to build high-performance, truly open-source models for the common benefit of the community.
Clone the repo:
```bash
git clone https://github.com/Lightning-AI/lit-gpt
cd lit-gpt
```
Install with all dependencies (including CLI, quantization, tokenizers for all models, etc.):
```bash
pip install -r requirements-all.txt
```
To generate text predictions, you need to download the model weights. If you don't have them, check out our guide.
Run inference:
```bash
python generate/base.py --prompt "Hello, my name is"
```
This will run the 3B pretrained model and require ~7 GB of GPU memory using the `bfloat16` datatype.
Full guide for generating samples from the model.
You can also chat with the model interactively:
```bash
python chat/base.py
```
We support 4-bit quantization (as in QLoRA: bnb.nf4, bnb.nf4-dq, bnb.fp4, bnb.fp4-dq) and 8-bit quantization (bnb.int8) for inference by following this guide.
We provide simple training scripts (`finetune/adapter.py`, `finetune/adapter_v2.py`, and `finetune/lora.py`) that instruction-tune a pretrained model on the Alpaca dataset.
- Download the data and generate an instruction tuning dataset:
```bash
python scripts/prepare_alpaca.py
```
- Run the finetuning script
For example, you can use
Adapter (Zhang et al. 2023):
```bash
python finetune/adapter.py
```
or Adapter v2 (Gao et al. 2023):
```bash
python finetune/adapter_v2.py
```
or LoRA (Hu et al. 2021):
```bash
python finetune/lora.py
```
(Please see the tutorials/finetune_adapter for details on the differences between the two adapter methods.)
The finetuning requires at least one GPU with ~12 GB memory (RTX 3060).
It is expected that you have downloaded the pretrained weights as described above. More details about each finetuning method and how you can apply it to your own data can be found in our technical how-to guides.
These technical tutorials illustrate how to run the finetuning code.
Looking for conceptual tutorials and explanations? We have some additional articles below:
We provide simple training scripts based on Fabric if you want to venture into pretraining. Conversion scripts for our optimized streaming `PackedDataset` are included.
Follow this guide to start pretraining on
Lit-GPT includes a variety of dataset preparation scripts for finetuning and pretraining. Additional information about the datasets and dataset preparation is provided in the Preparing Datasets tutorial.
Lightning AI has partnered with Google to add first-class support for Cloud TPUs in Lightning’s frameworks and Lit-GPT, helping democratize AI for millions of developers and researchers worldwide.
Using TPUs with Lightning is as straightforward as changing one line of code.
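For instance, switching to TPUs comes down to the accelerator passed when the Fabric object is created (a sketch assuming the standard Lightning Fabric API; the scripts in the XLA directory may differ in detail):

```python
from lightning.fabric import Fabric

# fabric = Fabric(devices=1, accelerator="cuda")  # single GPU
fabric = Fabric(devices=8, accelerator="tpu")     # the one-line change: 8 TPU cores
```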
We provide scripts fully optimized for TPUs in the XLA directory.
We are on a quest towards fully open source AI.
Join us and start contributing, especially in the following areas:
We welcome all individual contributors, regardless of their level of experience or hardware. Your contributions are valuable, and we are excited to see what you can accomplish in this collaborative and supportive environment.
Unsure about contributing? Check out our How to Contribute to Lit-GPT and Lit-LLaMA guide.
Don't forget to join our Discord!
- @karpathy for nanoGPT
- @EleutherAI for GPT-NeoX and the Evaluation Harness
- @TimDettmers for bitsandbytes
- @Microsoft for LoRA
- @tridao for Flash Attention 2
If you use Lit-GPT in your research, please cite the following work:
```bibtex
@misc{lit-gpt-2023,
  author       = {Lightning AI},
  title        = {Lit-GPT},
  howpublished = {\url{https://github.com/Lightning-AI/lit-gpt}},
  year         = {2023},
}
```
Lit-GPT is released under the Apache 2.0 license.