<!DOCTYPE html>
<html>
<head>
<title>How Far Can We Extract Diverse Perspectives from Large Language
Models?</title>
<link rel="icon" type="image/x-icon" href="website/static/images/favicon.ico" />
<link
href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet"
/>
<link rel="stylesheet" href="website/static/css/bulma.min.css" />
<link rel="stylesheet" href="website/static/css/bulma-carousel.min.css" />
<link rel="stylesheet" href="website/static/css/bulma-slider.min.css" />
<link rel="stylesheet" href="website/static/css/fontawesome.all.min.css" />
<link
rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css"
/>
<link rel="stylesheet" href="website/static/css/index.css" />
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://documentcloud.adobe.com/view-sdk/main.js"></script>
<script defer src="website/static/js/fontawesome.all.min.js"></script>
<script src="website/static/js/bulma-carousel.min.js"></script>
<script src="website/static/js/bulma-slider.min.js"></script>
<script src="website/static/js/index.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">
How Far Can We Extract Diverse Perspectives from Large Language Models?
</h1>
<div class="is-size-5 publication-authors">
<!-- Paper authors -->
<span class="author-block">
<a href="https://www.shirley.id/" target="_blank"
>Shirley Anugrah Hayati</a
><sup>*</sup>,
</span>
<span class="author-block">
<a href="https://mimn97.github.io/" target="_blank"
>Minhwa Lee</a
><sup>*</sup>,
</span>
<span class="author-block">
<a href="" target="_blank"
>Dheeraj Rajagopal</a
><sup>†</sup>,
</span>
<span class="author-block">
<a href="https://dykang85.github.io/" target="_blank"
>Dongyeop Kang</a
><sup>*</sup>
</span>
</div>
<div class="is-size-5 publication-authors">
<span class="eql-cntrb"
>
<small><sup>*</sup>University of Minnesota <sup>†</sup>Google Research</small
></span
>
</div>
<br>
<h4><i>EMNLP 2024 (Main, Long Paper)</i></h4>
<div class="column has-text-centered">
<div class="publication-links">
<span class="link-block">
<a
href="https://github.com/minnesotanlp/diversity-extraction-from-llms/tree/main/data"
target="_blank"
class="external-link button is-normal is-rounded is-dark is-outlined"
>
<span class="icon">
<i class="fa fa-laptop"></i>
</span>
<span>Data</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-6">
<img
src="website/static/images/figure1_diversity_prompting.png"
width="500"
class="center-image"
/>
</div>
</div>
</div>
<section class="section hero">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Collecting diverse human opinions is costly and challenging. This has led to a recent trend of collaboration between humans and Large Language Models (LLMs) for generating diverse data, offering potentially scalable and efficient solutions. However, the extent of LLMs' capability to generate diverse perspectives on subjective topics remains an unexplored question.
In this study, we investigate LLMs' capacity for generating diverse perspectives and rationales on subjective topics, such as social norms and argumentative texts. We formulate a new problem of <i>maximum diversity extraction</i> from LLMs. Motivated by how humans develop their opinions through their values, we propose a criteria-based prompting technique to ground diverse opinions.
To see how far we can extract diverse perspectives from LLMs, which we call <i>diversity coverage</i>, we employ step-by-step recall prompting to generate more outputs from the model iteratively. Applying our methods to various tasks, we find that LLMs can indeed generate diverse opinions according to the degree of task subjectivity.
</p>
</div>
</div>
</div>
</div>
</section>
<!-- Research Contributions start -->
<section class="section hero is-small is-light">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-full">
<div class="content">
<h2 class="title is-3">Research Contributions</h2>
<div class="level-set has-text-justified">
<ul>
<li>First, we propose the idea of perspective diversity for generative LLMs, in contrast to the lexical, syntactic, and semantic diversity that have been the main focus of previous work.
We conduct various experiments to measure LLMs' ability to generate maximum perspective diversity.
</li>
<li>Second, we introduce a new prompting technique called criteria-based diversity prompting as a way of extracting and grounding diverse perspectives from LLMs.
</li>
<li>Finally, as it is unclear how much diversity LLMs can cover, we suggest a step-by-step approach for measuring the coverage of LLMs' diversity generation (i.e., measuring the recall for diversity prompting).
We then compare this coverage between LLM-generated opinions and human-written opinions.
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- Research Contributions end -->
<!--- Methods Start -->
<section class="hero is-small">
<div class="hero-body">
<div class="columns is-centered has-text-centered">
<h1 class="title is-3">Methods</h1>
</div>
</div>
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-full">
<div class="item">
<br>
<img src="website/static/images/combined_method_1.png" alt="prompting"/>
</div>
<div class="content has-text-justified">
<br>
<p>
<ol>
<li>
<b style="color: slateblue">Criteria-based Diversity Prompting </b>
<p>
Our Criteria-based Diversity Prompting is as follows (shown in Figure <b>[a]</b>):
<br>
"Given a <i>statement</i>, we prompt the LLM to generate its <b style="color:magenta">stance (e.g., agree or disagree)</b> and explain its <b style="color:purple">Reasons</b> with a list of <b style="color:blue">Criteria</b> that affect its perspective."
<br>
<br>
Here, <b style="color:blue">criteria</b> are words or phrases that frame the LLM's high-level decision and help it generate well-grounded reasons (e.g., model values).
</p>
<p>
</p>
</li>
<br>
<li>
<b style="color: slateblue">Step-by-Step Recall Prompting</b>
<p>
To measure the LLMs' diversity coverage, we propose step-by-step recall prompting (as shown in Figure <b>[b]</b>):
<br>
We first ask the LLM to generate one opinion ('1st Opinion') for the given statement, and then ask it to continue generating more opinions until the requested number of opinions ('N') is reached.
</p>
<p> Note that the first opinion serves as a guide for the structured output format, since we do not use few-shot prompting in this experiment. </p>
</li>
<br>
<li>
<b style="color: slateblue">Datasets &amp; Models</b>
<p>
We use the following datasets: (1) Social-Chem-101 (Forbes et al., 2020); (2) Change My View (CMV) (Hidey et al., 2017).
For the recall prompting technique, we added two more datasets: (3) Hate Speech (Vidgen et al., 2021); and (4) Moral Stories (Emelin et al., 2021).
</p>
<p>
We then experiment with GPT-4, ChatGPT, and GPT-3 (text-davinci-002), as well as open-source models such as LLaMA2-70B-chat (Touvron et al., 2023) and Mistral-7B-Instruct (Jiang et al., 2023).
</p>
</li>
<br>
<li>
<b style="color: slateblue">Evaluation</b>
<p>
We measured the diversity in LLM-generated opinions by using the following two metrics:
<ol>
<li>
<b>Semantic Diversity</b>: For each statement, we first encode the LLM-generated reasons as sentence embeddings using SentenceBERT.
We then measure the cosine distance between every pair of reasons and compute the average cosine distance across all pairs. Note that we used this metric to compare the diversity of the models' generated reasons
between criteria-based prompting and free-form prompting.
</li>
<br>
<li>
<b>Perspective Diversity</b>: To examine step-by-step recall prompting, we prompt GPT-4 to cluster criteria words with similar meanings into one group.
The perspective diversity score for a statement is the percentage of generated opinions for that statement whose criteria are not duplicated by any other opinion.
The higher the score, the more diverse the set of generated opinions.
</li>
</ol>
</p>
</li>
</ol>
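<p>
The step-by-step recall procedure described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the prompts or code used in the paper: the template wording, the function names, and the <code>generate</code> callable (a stand-in for any LLM call that maps a prompt string to a completion string) are all hypothetical.
</p>
<pre><code class="language-python">
```python
from typing import Callable, List

# Hypothetical prompt wording, written only for illustration.
FIRST_TEMPLATE = (
    "Statement: {statement}\n"
    "Give one opinion about the statement: your stance (agree or disagree), "
    "the criteria that shape it, and your reasons.\n"
    "Opinion 1:"
)
CONTINUE_TEMPLATE = (
    "Statement: {statement}\n"
    "{previous}\n"
    "Continue in the same format until there are {n} opinions in total.\n"
    "Opinion {next_index}:"
)


def first_opinion_prompt(statement: str) -> str:
    """Prompt for the single '1st Opinion' that fixes the output format."""
    return FIRST_TEMPLATE.format(statement=statement)


def continuation_prompt(statement: str, previous: List[str], n: int) -> str:
    """Prompt that shows earlier opinions and asks for more, up to n."""
    numbered = "\n".join(
        f"Opinion {i}: {text}" for i, text in enumerate(previous, start=1)
    )
    return CONTINUE_TEMPLATE.format(
        statement=statement,
        previous=numbered,
        n=n,
        next_index=len(previous) + 1,
    )


def step_by_step_recall(
    statement: str, n: int, generate: Callable[[str], str]
) -> str:
    """Return the raw text of up to n opinions for one statement."""
    # Step 1: ask for one opinion only (no few-shot examples).
    first = generate(first_opinion_prompt(statement)).strip()
    transcript = f"Opinion 1: {first}"
    if n == 1:
        return transcript
    # Step 2: show the first opinion and ask the model to keep going.
    rest = generate(continuation_prompt(statement, [first], n)).strip()
    return f"{transcript}\nOpinion 2: {rest}"
```
</code></pre>
<p>
In practice, <code>generate</code> would wrap a call to whichever model is being evaluated, and the returned text would still need to be parsed into individual opinions, stances, and criteria.
</p>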
</p>
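<p>
The two diversity metrics described above can be sketched as follows. This is an illustrative sketch, not the paper's evaluation code. It takes precomputed sentence embeddings (for example, from SentenceBERT) as plain lists of floats, assumes the vectors are non-zero, and assumes each opinion has already been mapped to a list of criteria-cluster labels. The reading of perspective diversity used here, in which an opinion counts only if none of its criteria clusters appears in another opinion for the same statement, is one interpretation of the definition above, and the function names are hypothetical.
</p>
<pre><code class="language-python">
```python
import math
from collections import Counter
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_diversity(embeddings):
    """Average cosine distance over all pairs of reason embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def perspective_diversity(opinion_clusters):
    """Percentage of opinions whose criteria clusters appear in no other opinion.

    opinion_clusters holds one list of cluster labels per generated opinion.
    This counting rule is an assumed reading of the definition.
    """
    if not opinion_clusters:
        return 0.0
    # Count in how many opinions each cluster label occurs.
    counts = Counter(c for clusters in opinion_clusters for c in set(clusters))
    unique = sum(
        1
        for clusters in opinion_clusters
        if all(counts[c] == 1 for c in set(clusters))
    )
    return 100.0 * unique / len(opinion_clusters)
```
</code></pre>
<p>
For example, under this reading, four opinions whose criteria fall into clusters <code>a</code>, <code>b</code>, <code>a</code>, and <code>c</code> score 50%, because only the second and fourth opinions use a cluster that no other opinion shares.
</p>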
</div>
</div>
</div>
</div>
</section>
<!--- Methods end -->
<!-- Takeaway start -->
<span id="takeaway">
<section class="section hero is-small is-light">
<div class="container is-max-desktop">
<div class="content">
<h2 class="title is-3 has-text-centered">Key Takeaways</h2>
<ul>
<li>
<p class="subtitle">
GPT-4 with criteria-based diversity prompting in a one-shot setting produces the most semantically diverse opinions about social norms and argumentative topics.
</p>
</li>
<div class="hero-body">
<div class="container">
<div class="item">
<br>
<img src="website/static/images/table1_semantic.png" alt="Table 1: semantic diversity results"/>
</div>
<div class="content has-text-justified">
<br>
<p>Semantic diversity (cosine distance) results for criteria-based prompting vs. free-form prompting across LLM variants.
1-criteria refers to one-shot criteria-based prompting, and so on.
The highest diversity score within the same LLM type is shown in <b>bold</b>. * <i>p</i> &lt; 0.05 when comparing criteria-based prompting with free-form prompting.
</p>
</div>
</div>
</div>
<br>
<li>
<p class="subtitle">
The task subjectivity of a dataset tends to influence LLMs' capability to produce the maximum number of diverse opinions.
</p>
</li>
<div class="hero-body">
<div class="container">
<div id="carousel2" class="carousel results-carousel">
<div class="item">
<div style="display: flex; justify-content: center">
<img src="website/static/images/fig4_recall.png" style="max-height: 350px" />
</div>
<p class="subtitle is-6 has-text-centered">
The x-axis is the number of generated opinions in our diversity coverage experiment, and the y-axis is the average number of unique criteria clusters across all statements.
Moral Stories has no stances, so its line covers all generated story continuations.
</p>
</div>
<div class="item">
<div style="display: flex; justify-content: center">
<img src="website/static/images/table3_criteria.png" style="max-height: 350px" />
</div>
<p class="subtitle is-6 has-text-centered">
Number of unique criteria clusters generated by LLMs for different task types. Max and median refer to the maximum and the median number of unique criteria clusters.
</p>
</div>
</div>
</div>
</div>
<li>
<p class="subtitle">
Semantic diversity is not always positively correlated with perspective diversity.
</p>
</li>
<div class="hero-body">
<div class="container">
<div class="item">
<br>
<img src="website/static/images/fig5_corr.png" alt="Figure 5: semantic diversity vs. perspective diversity"/>
</div>
<div class="content has-text-justified">
<br>
<p>Scatter plot of semantic diversity (cosine distance) of opinions in each statement on the x-axis against perspective diversity (% of statements without duplicate opinions) on the y-axis.
A green circle represents a statement with agree/hate speech reasons, while a red triangle represents a statement with disagree/not hate opinions.
Story continuations in Moral Stories do not have stances, so each story is represented by a purple circle.
</p>
</div>
</div>
</div>
<br>
<li>
<p class="subtitle">
Humans and LLMs have different perspectives on socially argumentative topics.
</p>
</li>
<div class="hero-body">
<div class="container">
<div id="carousel2" class="carousel results-carousel">
<div class="item">
<div style="display: flex; justify-content: center">
<img src="website/static/images/human_llm_qual.png" style="max-height: 350px" />
</div>
<p class="subtitle is-6 has-text-centered">
Opinions generated by GPT-4 (top) and a human (bottom) about a statement from Social-Chem-101.
</p>
</div>
<div class="item">
<div style="display: flex; justify-content: center">
<img src="website/static/images/table4_human.png" style="max-height: 260px" />
</div>
<p class="subtitle is-6 has-text-centered">
Average number of criteria clusters of human opinions vs. GPT-4-generated opinions per statement with standard deviation.
<b>Humans generated slightly more diverse opinions than LLMs.</b>
</p>
</div>
</div>
</div>
</div>
</ul>
</div>
</div>
</section>
</span>
<!-- Takeaway end -->
<footer class="footer">
<div class="container">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This page was built using the
<a
href="https://github.com/eliahuhorwitz/Academic-project-page-template"
target="_blank"
>Academic Project Page Template</a
>
which was adapted from the <a
href="https://nerfies.github.io"
target="_blank"
>Nerfies</a
> project page.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>