diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md new file mode 100644 index 0000000..c5746a2 --- /dev/null +++ b/.github/CONTRIBUTING.md @@ -0,0 +1,45 @@ +# Welcome to the contribution guide 🀗 + +We are excited to invite the community to contribute to the repository! We appreciate all contributions, big or small. Your efforts help make this repository a valuable resource for everyone working with Llama models. + +Thank you for your time and happy coding! + +## πŸš€ How to Contribute + +1. **Fork the Repository** + - Click on the "Fork" button at the top right corner of this page to create your own copy of the repository. + + ![fork button](../assets/Fork.png) + +2. **Create a Branch** + - In your forked repository, create a new branch for your contribution: + ```bash + git checkout -b feature/your-feature-name + ``` + +3. **Make Your Changes** + - Add your scripts, notebooks, or any relevant files. + - **Don't forget to update the `README.md`** to include your example, + so others can easily find and use it. + +4. **Commit and Push** + - Commit your changes with a meaningful commit message: + ```bash + git commit -m "Add feature: your feature name" + ``` + - Push the changes to your forked repository: + ```bash + git push origin feature/your-feature-name + ``` + +5. **Open a Pull Request** + - Navigate to the original repository and click on "New Pull Request". + - Compare across forks and select your branch. + + ![pull request](../assets/PR.png) + - Provide a clear description of your contribution. + + +## πŸ’‘ Need Help? + +If you have any questions or need guidance, feel free to open an issue or draft PR. We're here to help! diff --git a/README.md b/.github/README.md similarity index 72% rename from README.md rename to .github/README.md index 36ffbd2..8ddc1e4 100644 --- a/README.md +++ b/.github/README.md @@ -1,6 +1,6 @@ # Hugging Face Llama Recipes -![thumbnail for repository](./assets/hf-llama-recepies.png) +![thumbnail for repository](../assets/hf-llama-recepies.png) 🀗🦙 Welcome! This repository contains *minimal* recipes to get started quickly with **Llama 3.x** models, including **Llama 3.1** and **Llama 3.2**. @@ -71,8 +71,6 @@ So do we! The memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory needed for different configurations: -### Llama 3.1 - | Model Size | Llama Variant | BF16/FP16 | FP8 | INT4 (AWQ/GPTQ/bnb) | | :--: | :--: | :--: | :--: | :--: | | 1B | 3.2 | 2.5 GB | 1.25 GB | 0.75 GB | @@ -88,15 +86,15 @@ implementation details and optimizations.
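The figures in the table above follow from simple arithmetic: roughly 2 bytes per parameter in BF16/FP16, 1 byte in FP8 and half a byte in INT4, plus some headroom for activations and the KV cache. Here is a rough back-of-the-envelope sketch; the 1.25x overhead factor is an assumption, not a measured value, and real INT4 checkpoints come out slightly higher because some layers usually stay unquantized.

```python
# Back-of-the-envelope weight-memory estimate behind the table above.
# The 1.25x overhead factor (activations, KV cache) is an assumption, not a measurement.
def estimate_memory_gb(params_billions: float, bits_per_param: int, overhead: float = 1.25) -> float:
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param * overhead

for bits in (16, 8, 4):  # BF16/FP16, FP8, INT4
    print(f"Llama 3.2 1B at {bits}-bit: ~{estimate_memory_gb(1, bits):.2f} GB")
```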
Working with the capable Llama 3.1 8B models: -* [Run Llama 3.1 8B in 4-bits with bitsandbytes](./4bit_bnb.ipynb) -* [Run Llama 3.1 8B in 8-bits with bitsandbytes](./8bit_bnb.ipynb) -* [Run Llama 3.1 8B with AWQ & fused ops](./awq.ipynb) +* [Run Llama 3.1 8B in 4-bits with bitsandbytes](../local_inference/4bit_bnb.ipynb) +* [Run Llama 3.1 8B in 8-bits with bitsandbytes](../local_inference/8bit_bnb.ipynb) +* [Run Llama 3.1 8B with AWQ & fused ops](../local_inference/awq.ipynb) Working on the 🐘 big Llama 3.1 405B model: -* [Run Llama 3.1 405B FP8](./fp8-405B.ipynb) -* [Run Llama 3.1 405B quantized to INT4 with AWQ](./awq_generation.py) -* [Run Llama 3.1 405B quantized to INT4 with GPTQ](./gptq_generation.py) +* [Run Llama 3.1 405B FP8](../local_inference/fp8-405B.ipynb) +* [Run Llama 3.1 405B quantized to INT4 with AWQ](../local_inference/awq_generation.py) +* [Run Llama 3.1 405B quantized to INT4 with GPTQ](../local_inference/gptq_generation.py) ## Model Fine Tuning: @@ -106,43 +104,44 @@ custom dataset. Here are some scripts showing how to fine-tune the models. Fine tune models on your custom dataset: -* [Fine tune Llama 3.2 Vision on a custom dataset](./Llama-Vision%20FT.ipynb) -* [Supervised Fine Tuning on Llama 3.2 Vision with TRL](./sft_vlm.py) -* [How to fine-tune Llama 3.1 8B on consumer GPU with PEFT and QLoRA with bitsandbytes](./peft_finetuning.py) -* [Execute a distributed fine tuning job for the Llama 3.1 405B model on a SLURM-managed computing cluster](./qlora_405B.slurm) +* [Fine tune Llama 3.2 Vision on a custom dataset](../fine_tune/Llama-Vision%20FT.ipynb) +* [Supervised Fine Tuning on Llama 3.2 Vision with TRL](../fine_tune/sft_vlm.py) +* [How to fine-tune Llama 3.1 8B on consumer GPU with PEFT and QLoRA with bitsandbytes](../fine_tune/peft_finetuning.py) +* [Execute a distributed fine tuning job for the Llama 3.1 405B model on a SLURM-managed computing cluster](../fine_tune/qlora_405B.slurm) ## Assisted Decoding Techniques Do you want to use the smaller Llama 3.2 models to speed up text generation of bigger models? These notebooks showcase assisted decoding (speculative decoding), which gives you up to 2x speedups for text generation on Llama 3.1 70B (with greedy decoding). -* [Run assisted decoding with 🐘 Llama 3.1 70B and 🀏 Llama 3.2 3B](./assisted_decoding_70B_3B.ipynb) -* [Run assisted decoding with Llama 3.1 8B and Llama 3.2 1B](./assisted_decoding_8B_1B.ipynb) -* [Assisted Decoding with 405B model](./assisted_decoding.py) +* [Run assisted decoding with 🐘 Llama 3.1 70B and 🀏 Llama 3.2 3B](../assisted_decoding/assisted_decoding_70B_3B.ipynb) +* [Run assisted decoding with Llama 3.1 8B and Llama 3.2 1B](../assisted_decoding/assisted_decoding_8B_1B.ipynb) +* [Assisted Decoding with 405B model](../assisted_decoding/assisted_decoding.py) ## Performance Optimization Let us optimize performance, shall we?
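Before diving into the linked scripts, here is a minimal sketch of the `torch.compile` pattern they build on; the checkpoint, prompt, and generation settings below are placeholders rather than the exact configuration used in the scripts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the linked scripts target Llama 3.x checkpoints.
ckpt = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

# A static KV cache gives torch.compile fixed tensor shapes to specialize on.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
# The first call is slow (graph compilation); later calls reuse the compiled graph.
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```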
-* [Accelerate your inference using torch.compile](./torch_compile.py) -* [Accelerate your inference using torch.compile and 4-bit quantization with torchao](./torch_compile_with_torchao.ipynb) -* [Quantize KV Cache to lower memory requirements](./quantized_cache.py) -* [How to reuse prompts with dynamic caching](./prompt_reuse.py) +* [Accelerate your inference using torch.compile](../performance_optimization/torch_compile.py) +* [Accelerate your inference using torch.compile and 4-bit quantization with torchao](../performance_optimization/torch_compile_with_torchao.ipynb) +* [Quantize KV Cache to lower memory requirements](../performance_optimization/quantized_cache.py) +* [How to reuse prompts with dynamic caching](../performance_optimization/prompt_reuse.py) +* [How to set up distributed training with DeepSpeed, mixed precision, and ZeRO-3 optimization](../performance_optimization/deepspeed_zero3.yaml) ## API inference Are these models too large for you to run at home? Would you like to experiment with Llama 70B? Try out the following examples! -* [Use the Inference API for PRO users](./inference-api.ipynb) +* [Use the Inference API for PRO users](../api_inference/inference-api.ipynb) ## Llama Guard and Prompt Guard In addition to the generative models, Meta released two new models: Llama Guard 3 and Prompt Guard. Prompt Guard is a small classifier that detects jailbreaks and prompt injections. Llama Guard 3 is a safeguard model that can classify LLM inputs and generations. Learn how to use them in the following notebooks: -* [Detecting jailbreaks and prompt injection with Prompt Guard](./prompt_guard.ipynb) +* [Detecting jailbreaks and prompt injection with Prompt Guard](../llama_guard/prompt_guard.ipynb) ## Synthetic Data Generation With today's data-hungry models, the need for synthetic data generation is on the rise. Here we show you how to build your very own synthetic dataset.
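The linked notebook builds this with `distilabel`; purely as an illustration of the idea, here is a minimal plain-`transformers` sketch that prompts an instruct model for question/answer pairs. The checkpoint, topics, and prompt are placeholders, not the workflow used in the notebook.

```python
from transformers import pipeline

# Placeholder checkpoint; any Llama 3.x instruct model can be swapped in.
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct", device_map="auto")

seed_topics = ["KV cache quantization", "speculative decoding"]
synthetic_rows = []
for topic in seed_topics:
    messages = [{"role": "user", "content": f"Write one question and a concise answer about {topic}."}]
    out = generator(messages, max_new_tokens=128)
    # With chat-style input the pipeline returns the full conversation;
    # the last message is the generated assistant reply.
    synthetic_rows.append({"topic": topic, "text": out[0]["generated_text"][-1]["content"]})

print(synthetic_rows)
```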
-* [Generate synthetic data with `distilabel`](./synthetic-data-with-llama.ipynb) +* [Generate synthetic data with `distilabel`](../synthetic_data_gen/synthetic-data-with-llama.ipynb) diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..496ee2c --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +.DS_Store \ No newline at end of file diff --git a/inference-api.ipynb b/api_inference/inference-api.ipynb similarity index 100% rename from inference-api.ipynb rename to api_inference/inference-api.ipynb diff --git a/assets/Fork.png b/assets/Fork.png new file mode 100644 index 0000000..819338a Binary files /dev/null and b/assets/Fork.png differ diff --git a/assets/PR.png b/assets/PR.png new file mode 100644 index 0000000..cc2de15 Binary files /dev/null and b/assets/PR.png differ diff --git a/assisted_decoding.py b/assisted_decoding/assisted_decoding.py similarity index 100% rename from assisted_decoding.py rename to assisted_decoding/assisted_decoding.py diff --git a/assisted_decoding_70B_3B.ipynb b/assisted_decoding/assisted_decoding_70B_3B.ipynb similarity index 100% rename from assisted_decoding_70B_3B.ipynb rename to assisted_decoding/assisted_decoding_70B_3B.ipynb diff --git a/assisted_decoding_8B_1B.ipynb b/assisted_decoding/assisted_decoding_8B_1B.ipynb similarity index 100% rename from assisted_decoding_8B_1B.ipynb rename to assisted_decoding/assisted_decoding_8B_1B.ipynb diff --git a/Llama-Vision FT.ipynb b/fine_tune/Llama-Vision FT.ipynb similarity index 100% rename from Llama-Vision FT.ipynb rename to fine_tune/Llama-Vision FT.ipynb diff --git a/peft_finetuning.py b/fine_tune/peft_finetuning.py similarity index 100% rename from peft_finetuning.py rename to fine_tune/peft_finetuning.py diff --git a/qlora_405B.slurm b/fine_tune/qlora_405B.slurm similarity index 100% rename from qlora_405B.slurm rename to fine_tune/qlora_405B.slurm diff --git a/sft_vlm.py b/fine_tune/sft_vlm.py similarity index 100% rename from sft_vlm.py rename to fine_tune/sft_vlm.py diff --git a/prompt_guard.ipynb b/llama_guard/prompt_guard.ipynb similarity index 100% rename from prompt_guard.ipynb rename to llama_guard/prompt_guard.ipynb diff --git a/4bit_bnb.ipynb b/local_inference/4bit_bnb.ipynb similarity index 100% rename from 4bit_bnb.ipynb rename to local_inference/4bit_bnb.ipynb diff --git a/8bit_bnb.ipynb b/local_inference/8bit_bnb.ipynb similarity index 100% rename from 8bit_bnb.ipynb rename to local_inference/8bit_bnb.ipynb diff --git a/awq.ipynb b/local_inference/awq.ipynb similarity index 100% rename from awq.ipynb rename to local_inference/awq.ipynb diff --git a/awq_generation.py b/local_inference/awq_generation.py similarity index 100% rename from awq_generation.py rename to local_inference/awq_generation.py diff --git a/fp8-405B.ipynb b/local_inference/fp8-405B.ipynb similarity index 100% rename from fp8-405B.ipynb rename to local_inference/fp8-405B.ipynb diff --git a/gptq_generation.py b/local_inference/gptq_generation.py similarity index 100% rename from gptq_generation.py rename to local_inference/gptq_generation.py diff --git a/deepspeed_zero3.yaml b/performance_optimization/deepspeed_zero3.yaml similarity index 100% rename from deepspeed_zero3.yaml rename to performance_optimization/deepspeed_zero3.yaml diff --git a/prompt_reuse.py b/performance_optimization/prompt_reuse.py similarity index 100% rename from prompt_reuse.py rename to performance_optimization/prompt_reuse.py diff --git a/quantized_cache.py b/performance_optimization/quantized_cache.py similarity index 100% rename from 
quantized_cache.py rename to performance_optimization/quantized_cache.py diff --git a/torch_compile.py b/performance_optimization/torch_compile.py similarity index 100% rename from torch_compile.py rename to performance_optimization/torch_compile.py diff --git a/torch_compile_with_torchao.ipynb b/performance_optimization/torch_compile_with_torchao.ipynb similarity index 100% rename from torch_compile_with_torchao.ipynb rename to performance_optimization/torch_compile_with_torchao.ipynb diff --git a/synthetic-data-with-llama.ipynb b/synthetic_data_gen/synthetic-data-with-llama.ipynb similarity index 100% rename from synthetic-data-with-llama.ipynb rename to synthetic_data_gen/synthetic-data-with-llama.ipynb