From c8bc2614a1fb7ecb2f16103d3888914646e937a7 Mon Sep 17 00:00:00 2001
From: Divyansh Khanna
Date: Sun, 18 Feb 2024 19:41:06 -0800
Subject: [PATCH] frugalGPT

---
 index.md | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/index.md b/index.md
index a0d5030..51fe775 100755
--- a/index.md
+++ b/index.md
@@ -2,6 +2,7 @@ layout: default
 ---
 
+[FrugalGPT: How to Use LLMs While Reducing Cost and Improving Performance](#frugalgpt)
 [Mathematics of Deep Learning](#vidal)
 [Wasserstein GAN](#wgan)
 [Why and How of Nonnegative Matrix Factorization](#nmf)
@@ -25,9 +26,34 @@ layout: default
 [Notification Volume Control and Optimization System at Pinterest](#pinterest_notification)
 [Class-Balanced Loss Based on Effective Number of Samples](#class_balanced_loss)
 [Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](#mmoe)
-
 ---
+## FrugalGPT: How to Use LLMs While Reducing Cost and Improving Performance
+* The premise of this paper is that we can't call the biggest LLM for everything, all the time
+* We need a way to reduce the cost per task solved by LLMs while maintaining quality
+* In a nutshell, the paper proposes an "ensemble"-style approach to using LLMs: keep several LLM APIs handy, figure out when to call which one, and cut the cost per call by shrinking the prompt or using a cache
+* They model this as an optimization problem - maximize performance subject to a budget constraint
+
+$$\max_{s} \;\; \mathbb{E}_{(q,a) \in Q \times A} \left[ r(a, \hat{a}(s,q)) \right] \;\; \text{s.t.} \;\; \mathbb{E}_{(q,a) \in Q \times A} \left[ c(s,q) \right] \leq b$$
+
+* Q is the query space, A is the answer space, $$\hat{a}(s,q)$$ is the answer produced under strategy s, r scores that answer against the reference answer a, c(s,q) is the cost of serving query q with strategy s, and b is the budget
+* The three strategies for cost reduction: prompt adaptation, LLM approximation, and LLM cascade
+* Prompt adaptation shortens the prompt while retaining its value, minimizing the input cost to the LLM API (cost is roughly linear in prompt size)
+* LLM approximation: use a "similar", cheaper source of answers when the LLM API is expensive to call - for example a completion cache (don't ask the LLM things you already have an answer for); a small sketch appears after this list
+    * This is simple but can be very powerful: across all queries the distribution is long-tailed, and the head queries are ripe for savings - cache their results once and serve repeat queries from the cache, much like any search engine or RecSys product does
+* Another form of LLM approximation is to fine-tune a smaller model
+* LLM cascade uses a scorer and a router: keep a chain of LLMs with the smaller/cheaper ones first
+* Score the output of the first model, judge it against a threshold, and either accept that answer or move on to the next, bigger LLM in the chain (see the cascade sketch after this list)
+    * The scoring function can be obtained by training a simple regression model that learns, from the query and a generated answer, whether the generation is correct
+* Learning which LLMs to chain and what thresholds to use can itself be modeled as a constrained optimization problem
+* There are restrictions to these approaches that make the problem nuanced, but overall this is a good direction of work that many small/medium players can leverage
+* The authors' experiments show FrugalGPT can match the performance of a big LLM with up to 98% cost savings, or improve accuracy by 4% at the same cost. Impressive.
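+
+A minimal sketch of the completion-cache flavour of LLM approximation (illustrative only, not the paper's code): `CompletionCache`, `answer_with_cache`, and `call_llm` are assumed names, and exact-match keying is a simplification.
+
+```python
+import hashlib
+from typing import Callable, Dict, Optional
+
+class CompletionCache:
+    """Exact-match cache from a query to a previously generated answer."""
+    def __init__(self) -> None:
+        self._store: Dict[str, str] = {}
+
+    def _key(self, query: str) -> str:
+        # Light normalization so trivially identical queries share one entry.
+        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
+
+    def get(self, query: str) -> Optional[str]:
+        return self._store.get(self._key(query))
+
+    def put(self, query: str, answer: str) -> None:
+        self._store[self._key(query)] = answer
+
+def answer_with_cache(query: str, cache: CompletionCache,
+                      call_llm: Callable[[str], str]) -> str:
+    cached = cache.get(query)
+    if cached is not None:
+        return cached          # repeat ("head") queries cost nothing
+    answer = call_llm(query)   # one paid API call only for unseen queries
+    cache.put(query, answer)
+    return answer
+```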
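+
+And a minimal sketch of the LLM cascade (again illustrative, not the paper's code): `chain` is an assumed list of (call_llm, threshold) pairs ordered cheapest-first, and `score_fn` stands in for the learned regression scorer.
+
+```python
+from typing import Callable, List, Tuple
+
+def cascade_answer(query: str,
+                   chain: List[Tuple[Callable[[str], str], float]],
+                   score_fn: Callable[[str, str], float]) -> str:
+    # Models are tried cheapest-first; score_fn(query, answer) returns a
+    # reliability score in [0, 1]. These names are assumptions, not the paper's.
+    answer = ""
+    for call_llm, threshold in chain:
+        answer = call_llm(query)
+        # Accept a cheap model's answer only if the scorer trusts it enough.
+        if score_fn(query, answer) >= threshold:
+            return answer
+    # Otherwise fall through and keep the last (largest) model's answer.
+    return answer
+```
+
+Which models go into the chain and which thresholds to use are exactly what the constrained optimization above would pick under the budget b.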
+
+
+References
+* [Paper](https://arxiv.org/abs/2305.05176)
+
+---
 
 ## Mathematics of Deep Learning
 * The paper talks about the principles of mathematics which underline the field
@@ -554,8 +580,6 @@ $$
 * Key takeaway: try using a gating network instead of a 'switch' like gate based on the input.
-
-
 References
 * [Paper](https://dl.acm.org/doi/pdf/10.1145/3219819.3220007)