Rename articles, load different images in the prompting article. Refactor index page.

eLQeR committed Oct 15, 2024
1 parent be478a6 commit 6537e83
Showing 40 changed files with 642 additions and 165 deletions.
@@ -1,43 +1,44 @@
+++
title = 'Third Post'
title = 'Langsmith vs OpenAI Evals vs DeepEval: A Comprehensive Evaluation'
date = 2024-10-14T12:35:30+03:00
draft = false
author = 'Yaroslav Biziuk'
+++


# Evaluation and Testing of Langsmith, OpenAI Evals, and DeepEval
# Langsmith vs OpenAI Evals vs DeepEval: A Comprehensive Evaluation

## Evaluating and Testing

Langsmith offers an online UI for prompt testing using either a custom dataset or manual inputs. This feature is particularly convenient because it allows users to store prompts along with their commit history and test results. Unfortunately, DeepEval does not support testing directly through the UI: users must write code to create `TestCase` objects and then review the results in the UI. OpenAI also provides a powerful SDK for evaluating prompts, though it requires a bit more setup.


![img_1.png](img_1.png)
![img_1.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_1.png)

*Picture 1. Prompt saved in Langsmith*

![img_2.png](img_2.png)
![img_2.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_2.png)

*Picture 2. Langsmith SDK for testing*

![img_3.png](img_3.png)
![img_3_1.png](img_3_1.png)
![img_3.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_3.png)
![img_3_1.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_3_1.png)

*Picture 3. OpenAI SDK for testing and evaluating*
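
To make the code-first DeepEval workflow above concrete, here is a minimal sketch. The inputs and threshold are illustrative, and it assumes `deepeval` is installed, you are logged in via `deepeval login`, and an OpenAI key is available for the judge model; defaults may differ between versions.

```python
# Hedged sketch: a DeepEval test case defined and evaluated from code.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Which surfaces should be excluded from the estimate?",
    actual_output="Semi-exposed surfaces are excluded from the calculation.",
    expected_output="Semi-exposed surfaces must not be counted.",
)

# An LLM-as-judge metric with a pass/fail threshold (value is illustrative).
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric locally; when logged in, the results also appear in the UI.
evaluate([test_case], [metric])
```

Running this produces pass/fail results per metric, which can then be reviewed in the DeepEval UI in the same way as in the screenshots above.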

### Metrics

Metrics are crucial for evaluating LLM performance. All three frameworks—Langsmith, OpenAI Evals, and DeepEval—support both default metrics and custom metric creation. Here is how it's implemented across the frameworks:

![img_4.png](img_4.png)
![img_4.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_4.png)

*Picture 4. Default metrics in Langsmith*

![img_5.png](img_5.png)
![img_5.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_5.png)

*Picture 5. Creating custom metrics in Langsmith*

![img_6.png](img_6.png)
![img_6.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_6.png)

*Picture 6. Default metrics in DeepEval*
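
As a concrete counterpart to the screenshots, a custom metric in DeepEval can be sketched with `GEval` roughly as follows. The criteria text, threshold, and test data are invented for illustration, and the API may differ slightly between versions.

```python
# Hedged sketch of a custom LLM-as-judge metric in DeepEval.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# The criteria string defines what "correct" means for this metric.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output contradicts the expected output.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.6,  # illustrative pass/fail cut-off
)

case = LLMTestCase(
    input="List only the exposed surfaces.",
    actual_output="Exposed and semi-exposed surfaces: A1, B2.",
    expected_output="Exposed surfaces only: A1.",
)

correctness.measure(case)  # runs the judge model
print(correctness.score, correctness.reason)
```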
...
@@ -176,20 +177,20 @@ return llm_response, response_id
*Picture 26. Monitoring DeepEval LLM calls filtered by feedback*


![img_27.png](img_27.png)
![img_27.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_27.png)

*Picture 27. Detailed feedback page in DeepEval*
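
For context on the `return llm_response, response_id` snippet above, the monitoring and feedback flow can be sketched roughly as follows. The helper names follow DeepEval's monitoring documentation at the time of writing, and the event name, model name, and rating scale are assumptions; treat this as a sketch, not the exact code used here.

```python
# Hedged sketch: logging a production LLM call and attaching user feedback.
import deepeval

def answer(question: str, llm_response: str) -> tuple[str, str]:
    # Log the production LLM call so it shows up in the monitoring UI.
    response_id = deepeval.monitor(
        event_name="estimate-assistant",  # illustrative event name
        model="gpt-4o",                   # illustrative model name
        input=question,
        response=llm_response,
    )
    return llm_response, response_id

def record_feedback(response_id: str, thumbs_up: bool) -> None:
    # Attach user feedback to the logged call; it becomes filterable in the UI.
    deepeval.send_feedback(
        response_id=response_id,
        rating=5 if thumbs_up else 1,     # assumed 1-5 scale
        explanation="Rating collected from the in-app feedback widget.",
    )
```

This is the pattern behind the feedback filtering shown in Pictures 26 and 27: every monitored call gets a response id, and feedback sent against that id surfaces on the detailed feedback page.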

## Pricing

### LangSmith Pricing

![img.png](img_28.png)
![img.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_28.png)
*Picture 28. LangSmith Pricing*

### DeepEval Pricing

![img.png](img_29.png)...
![img.png](/dont_trust_ai/posts/comprehensive_frameworks_evaluation/img_29.png)
*Picture 29. DeepEval Pricing*

## Overall Conclusion
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes

File renamed without changes
@@ -5,8 +5,6 @@ draft = false
author = 'Yaroslav Biziuk'
+++

[Prompt Engineering through Structured Instructions and Advanced Techniques is described on Google Docs with photo examples](https://docs.google.com/document/d/10pz3nPghcG3tyN9RuzrNerfcbeP59kq1YJRXjBApTQY/edit)

# 1. Introduction
Large language models (LLMs) are powerful tools for a variety of tasks, but their effectiveness is highly dependent on prompt design. This article examines advanced techniques in prompt engineering, focusing on the impact of instruction order, the "Ask Before Answer" technique, and the "Chain of Thoughts" (CoT) method, among others. By optimizing these factors, we can significantly enhance the accuracy and reliability of LLM outputs.

@@ -15,7 +13,7 @@ Instruction order plays a crucial role in prompt engineering. Altering the seque

Example:

![img.png](/dont_trust_ai/posts/my-second-post/img.png)
![img.png](/dont_trust_ai/posts/prompt_engineering/img.png)
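
To illustrate the ordering effect in code form, here is a rough sketch; the wording is invented for this example and is not our production prompt (the picture above shows the real case).

```python
# Ordering that failed: the exclusion arrives as step 11, after the model has
# already walked through the surfaces step by step.
LATE_EXCLUSION_PROMPT = """\
1. List every surface in the drawing.
2. Measure each surface.
...
11. Do not consider semi-exposed surfaces.
"""

# Ordering that worked: the general exclusion precedes the step-by-step
# instructions, so it is applied throughout the whole procedure.
EARLY_EXCLUSION_PROMPT = """\
General rule: do not consider semi-exposed surfaces in any step below.

1. List every surface in the drawing.
2. Measure each surface.
...
"""
```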

# 3. The "Ask Before Answer" Technique
The "Ask Before Answer" technique is particularly effective when optimizing prompts to reduce the occurrence of hallucinations. By prompting the LLM to seek clarification before resolving a task, we can preempt misunderstandings that might lead to incorrect answers.
@@ -37,7 +35,7 @@ I will give more info if you need.
```
**Result:**

![img_1.png](/dont_trust_ai/posts/my-second-post/img_1.png)
![img_1.png](/dont_trust_ai/posts/prompt_engineering/img_1.png)


When applying this technique, we ask the LLM to identify specific areas where it may be uncertain or confused in resolving a test case. By doing so, we can pinpoint where hallucinations occur, understand why the LLM struggles with certain choices, and refine the prompt in those areas where the model tends to get confused. This method is highly effective in improving the quality of the instructions provided in the prompt.
@@ -46,14 +44,14 @@ When applying this technique, we ask the LLM to identify specific areas where it

**Result without CoT:**

![img_2.png](/dont_trust_ai/posts/my-second-post/img_2.png)
![img_2.png](/dont_trust_ai/posts/prompt_engineering/img_2.png)


One of the most critical steps in creating an effective prompt with complex instructions is the use of the Chain of Thoughts (CoT) technique. By including phrases like "You think step by step," "Take your time," or "Explain every step," the LLM is given time to reflect and process all input data. This approach significantly improves the results, making them more logical and coherent. However, caution is needed when using "Explain every step," as the LLM can sometimes provide the most likely correct answers without fully understanding why, leading to hallucinations.

**Result with CoT:**

![img_3.png](/dont_trust_ai/posts/my-second-post/img_3.png)
![img_3.png](/dont_trust_ai/posts/prompt_engineering/img_3.png)
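
As a minimal sketch of how the CoT phrases can be attached to a request (the model name and wording are illustrative; it assumes the `openai` Python SDK and an `OPENAI_API_KEY` in the environment):

```python
# Hedged sketch: appending Chain of Thoughts phrases to the system prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an estimator. Think step by step, take your time, "
    "and explain every step before giving the final answer."
)

completion = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Which surfaces count toward the total area?"},
    ],
)
print(completion.choices[0].message.content)
```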

# 5. Meta-Prompting: An Advanced Strategy in Prompt Engineering
Meta-prompting is an advanced technique in prompt engineering that goes beyond merely guiding a language model (LLM) through a task. It involves crafting prompts that instruct the model on how to think or approach a problem before the primary task begins. This strategic layer of prompting enhances the LLM's ability to navigate complex instructions by embedding a meta-level understanding of the task at hand. For example, instead of directly asking the LLM to solve a problem, a meta-prompt might instruct the model to first assess whether it fully understands the task, identify potential ambiguities, and request clarification if necessary.
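
A rough sketch of such a meta-prompt, sent to Claude via the `anthropic` SDK, might look like the following; the wording, model name, and task are invented for illustration and are not the exact prompt evaluated here.

```python
# Hedged sketch: a meta-prompt that tells the model how to approach the task
# before the task itself.
import anthropic

META_PROMPT = """\
Before solving the task below:
1. Restate the task in your own words and confirm you fully understand it.
2. List any ambiguities or missing information you notice.
3. If anything is unclear, ask clarifying questions instead of answering.
Only after these checks, solve the task.
"""

TASK = "Classify each surface in the attached spec as exposed or semi-exposed."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model name
    max_tokens=1024,
    messages=[{"role": "user", "content": META_PROMPT + "\n\nTask:\n" + TASK}],
)
print(message.content[0].text)
```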
@@ -62,7 +60,7 @@ When applied in Claude LLM, meta-prompting proved to be more effective than in G

**Example of how meta-prompting optimized Claude outputs:**

![img_4.png](/dont_trust_ai/posts/my-second-post/img_4.png)
![img_4.png](/dont_trust_ai/posts/prompt_engineering/img_4.png)

However, in our specific case, meta-prompting did not lead to the exceptional results we had hoped for. While it is a valuable technique, its effectiveness can vary depending on the complexity of the task and the model's inherent capabilities.

Binary file removed blog/public/images/img.png
Diff not rendered.
Binary file removed blog/public/images/img_1.png
Diff not rendered.
Binary file removed blog/public/images/img_2.png
Diff not rendered.
Binary file removed blog/public/images/img_3.png
Diff not rendered.
Binary file removed blog/public/images/img_4.png
Diff not rendered.
Binary file removed blog/public/images/prompt_engineering/img.png
Diff not rendered.
Binary file removed blog/public/images/prompt_engineering/img_1.png
Diff not rendered.
Binary file removed blog/public/images/prompt_engineering/img_2.png
Diff not rendered.
Binary file removed blog/public/images/prompt_engineering/img_3.png
Diff not rendered.
Binary file removed blog/public/images/prompt_engineering/img_4.png
Diff not rendered.
33 changes: 13 additions & 20 deletions blog/public/index.xml
@@ -8,40 +8,33 @@
<language>en-us</language>
<lastBuildDate>Mon, 14 Oct 2024 12:35:30 +0300</lastBuildDate>
<atom:link href="http://localhost:1313/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Third Post</title>
<link>http://localhost:1313/posts/my-third-post/third-post/</link>
<pubDate>Mon, 14 Oct 2024 12:35:30 +0300</pubDate>
<guid>http://localhost:1313/posts/my-third-post/third-post/</guid>
<description>&lt;h1 id=&#34;evaluation-and-testing-of-langsmith-openai-evals-and-deepeval&#34;&gt;Evaluation and Testing of Langsmith, OpenAI Evals, and DeepEval&lt;/h1&gt;&#xA;&lt;h2 id=&#34;evaluating-and-testing&#34;&gt;Evaluating and Testing&lt;/h2&gt;&#xA;&lt;p&gt;Langsmith offers an online UI for prompt testing using either a custom dataset or manual inputs. This feature is particularly convenient because it allows users to store prompts along with their commit history and test results. Unfortunately, in DeepEval, testing cannot be done directly through the UI. Instead, users must use code to create &lt;code&gt;TestCase&lt;/code&gt; objects and evaluate the results through the UI.&lt;/p&gt;</description>
</item>
<item>
<title>Second Post</title>
<link>http://localhost:1313/posts/my-second-post/</link>
<pubDate>Sat, 12 Oct 2024 13:52:37 +0300</pubDate>
<guid>http://localhost:1313/posts/my-second-post/</guid>
<description>&lt;p&gt;&lt;a href=&#34;https://docs.google.com/document/d/10pz3nPghcG3tyN9RuzrNerfcbeP59kq1YJRXjBApTQY/edit&#34;&gt;Prompt Engineering through Structured Instructions and Advanced Techniques is described on Google Docs with photo exapmles&lt;/a&gt;&lt;/p&gt;&#xA;&lt;h1 id=&#34;1-introduction&#34;&gt;1. Introduction&lt;/h1&gt;&#xA;&lt;p&gt;Language models (LLMs) are powerful tools for a variety of tasks, but their effectiveness is highly dependent on the design of prompts. This article examines advanced techniques in prompt engineering, focusing on the impact of instruction order, the &amp;ldquo;Ask Before Answer&amp;rdquo; technique, and the &amp;ldquo;Chain of Thoughts&amp;rdquo; (CoT) method, etc. By optimizing these factors, we can significantly enhance the accuracy and reliability of LLM outputs.&lt;/p&gt;</description>
</item>
<item>
<title>Welcome To LLM Integrations Research Blog by COXIT</title>
<link>http://localhost:1313/posts/my-first-post/</link>
<pubDate>Thu, 10 Oct 2024 19:19:14 +0300</pubDate>
<guid>http://localhost:1313/posts/my-first-post/</guid>
<description>&lt;h3 id=&#34;welcome&#34;&gt;Welcome&lt;/h3&gt;&#xA;&lt;p&gt;Welcome to the internal COXIT blog about Prompt Engeneering and LLM integrations.&lt;/p&gt;&#xA;&lt;p&gt;This is COXIT&amp;rsquo;s internal R&amp;amp;D project. We aim to investigate different tools, approaches, and prompt engineering techniques for building LLM-powered production-ready solutions. We aim to try different tools currently known on the market and define what works and what causes any problems for our specific use cases.&lt;/p&gt;&#xA;&lt;p&gt;As a result of this project, we will build a knowledge base describing which of the available LLM evaluation tools, frameworks, LLMs themselves, and prompt engineering techniques worked best for our specific LLM-related cases, explain what didn&amp;rsquo;t work, and why.&lt;/p&gt;</description>
</item>
<item>
<title>Langsmith vs OpenAI Evals vs DeepEval: A Comprehensive Evaluation</title>
<link>http://localhost:1313/posts/my-third-post/third-post/</link>
<pubDate>Mon, 14 Oct 2024 12:35:30 +0300</pubDate>
<guid>http://localhost:1313/posts/my-third-post/third-post/</guid>
<description>&lt;h1 id=&#34;langsmith-vs-openai-evals-vs-deepeval-a-comprehensive-evaluation&#34;&gt;Langsmith vs OpenAI Evals vs DeepEval: A Comprehensive Evaluation&lt;/h1&gt;&#xA;&lt;h2 id=&#34;evaluating-and-testing&#34;&gt;Evaluating and Testing&lt;/h2&gt;&#xA;&lt;p&gt;Langsmith offers an online UI for prompt testing using either a custom dataset or manual inputs. This feature is particularly convenient because it allows users to store prompts along with their commit history and test results. Unfortunately, in DeepEval, testing cannot be done directly through the UI. Instead, users must use code to create &lt;code&gt;TestCase&lt;/code&gt; objects and evaluate the results through the UI. Also OpenAI provides a powerful SDK for evaluating prompts. However, it requires a bit more setup.&lt;/p&gt;</description>
</item>
<item>
<title>Prompt Engineering through Structured Instructions and Advanced Techniques</title>
<link>http://localhost:1313/posts/my-second-post/my-second-post/</link>
<pubDate>Thu, 10 Oct 2024 11:59:14 +0300</pubDate>
<guid>http://localhost:1313/posts/my-second-post/my-second-post/</guid>
<description>&lt;p&gt;&lt;a href=&#34;https://docs.google.com/document/d/10pz3nPghcG3tyN9RuzrNerfcbeP59kq1YJRXjBApTQY/edit&#34;&gt;Prompt Engineering through Structured Instructions and Advanced Techniques is described on Google Docs with photo exapmles&lt;/a&gt;&lt;/p&gt;&#xA;&lt;h1 id=&#34;1-introduction&#34;&gt;1. Introduction&lt;/h1&gt;&#xA;&lt;p&gt;Language models (LLMs) are powerful tools for a variety of tasks, but their effectiveness is highly dependent on the design of prompts. This article examines advanced techniques in prompt engineering, focusing on the impact of instruction order, the &amp;ldquo;Ask Before Answer&amp;rdquo; technique, and the &amp;ldquo;Chain of Thoughts&amp;rdquo; (CoT) method, etc. By optimizing these factors, we can significantly enhance the accuracy and reliability of LLM outputs.&lt;/p&gt;</description>
<description>&lt;h1 id=&#34;1-introduction&#34;&gt;1. Introduction&lt;/h1&gt;&#xA;&lt;p&gt;Language models (LLMs) are powerful tools for a variety of tasks, but their effectiveness is highly dependent on the design of prompts. This article examines advanced techniques in prompt engineering, focusing on the impact of instruction order, the &amp;ldquo;Ask Before Answer&amp;rdquo; technique, and the &amp;ldquo;Chain of Thoughts&amp;rdquo; (CoT) method, etc. By optimizing these factors, we can significantly enhance the accuracy and reliability of LLM outputs.&lt;/p&gt;&#xA;&lt;h1 id=&#34;2-the-importance-of-instruction-order&#34;&gt;2. The Importance of Instruction Order&lt;/h1&gt;&#xA;&lt;p&gt;Instruction order plays a crucial role in prompt engineering. Altering the sequence of instructions or actions can drastically change the outcome produced by the LLM. For instance, when we previously placed an instruction about not considering &amp;ldquo;semi-exposed surfaces&amp;rdquo; as the eleventh step, the LLM would still process these surfaces as it followed each step sequentially, reaching the exclusion instruction too late to apply it effectively. However, when this instruction was moved to precede all other steps, the LLM correctly disregarded &amp;ldquo;semi-exposed&amp;rdquo; surfaces. This demonstrates the necessity of positioning general concepts or definitions above the specific step-by-step instructions, ensuring they are applied throughout the process.&lt;/p&gt;</description>
</item>
<item>
<title>Prompt Engineering through Structured Instructions and Advanced Techniques</title>
<link>http://localhost:1313/posts/second-post/</link>
<pubDate>Thu, 10 Oct 2024 11:59:14 +0300</pubDate>
<guid>http://localhost:1313/posts/second-post/</guid>
<description>&lt;p&gt;&lt;a href=&#34;https://docs.google.com/document/d/10pz3nPghcG3tyN9RuzrNerfcbeP59kq1YJRXjBApTQY/edit&#34;&gt;Prompt Engineering through Structured Instructions and Advanced Techniques is described on Google Docs with photo exapmles&lt;/a&gt;&lt;/p&gt;&#xA;&lt;h1 id=&#34;1-introduction&#34;&gt;1. Introduction&lt;/h1&gt;&#xA;&lt;p&gt;Language models (LLMs) are powerful tools for a variety of tasks, but their effectiveness is highly dependent on the design of prompts. This article examines advanced techniques in prompt engineering, focusing on the impact of instruction order, the &amp;ldquo;Ask Before Answer&amp;rdquo; technique, and the &amp;ldquo;Chain of Thoughts&amp;rdquo; (CoT) method, etc. By optimizing these factors, we can significantly enhance the accuracy and reliability of LLM outputs.&lt;/p&gt;</description>
<title>Differences in Prompting Techniques: Claude vs. GPT</title>
<link>http://localhost:1313/posts/difference-in-gpt-claude-prompting/difference-in-prompting-post/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>http://localhost:1313/posts/difference-in-gpt-claude-prompting/difference-in-prompting-post/</guid>
<description>&lt;h1 id=&#34;differences-in-prompting-techniques-claude-vs-gpt&#34;&gt;Differences in Prompting Techniques: Claude vs. GPT&lt;/h1&gt;&#xA;&lt;h2 id=&#34;1-introduction&#34;&gt;1. Introduction&lt;/h2&gt;&#xA;&lt;p&gt;When working with language models (LLMs) like Claude and GPT, the effectiveness of prompts can vary significantly based on the model&amp;rsquo;s architecture and capabilities. This article explores key differences in prompting strategies for Claude and GPT, using advanced techniques such as meta-prompting, XML tagging, and the Chain of Thoughts (CoT) method. By understanding these differences, users can optimize their prompts to enhance the accuracy and reliability of LLM outputs.&lt;/p&gt;</description>
</item>
</channel>
</rss>