Hi there,

I’m Mengdi, currently working on my thesis. My dataset consists of 29 clinical cases from the Merck Manual, each containing the patient information, a diagnosis question, and the correct answer. Here are my Essential Differential Diagnosis results using GPT-4o (a sketch of the per-case F1 computation follows the list):
• Baseline Evaluation (Temperature 0, 3 runs): Mean F1 = 0.6920
• Baseline Evaluation (Default Temperature, 3 runs): Mean F1 = 0.6802
• Prompt Engineering (No Embeddings): Mean F1 = 0.6080
• Prompt Engineering (With Embeddings): Mean F1 = 0.6061
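
For reference, a minimal sketch of one way the per-case F1 can be computed, assuming predicted and gold differentials are compared as sets of diagnosis strings (this set-based comparison is an assumption, not necessarily the exact scoring used):

```python
# Hypothetical per-case F1: compare predicted vs. gold differentials as sets
# of normalized diagnosis strings (an assumption about the scoring setup).
def f1_score(predicted: set[str], gold: set[str]) -> float:
    tp = len(predicted & gold)       # diagnoses present in both lists
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)  # fraction of predictions that are correct
    recall = tp / len(gold)          # fraction of gold diagnoses recovered
    return 2 * precision * recall / (precision + recall)

# Example: 2 of 3 predictions match a 4-item gold differential:
# precision = 2/3, recall = 2/4, so F1 ≈ 0.571
```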
For the baseline evaluation, I provided only the patient information and asked GPT-4o to generate responses. I tested this at both temperature 0 and the default temperature.
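
A minimal sketch of this baseline setup (assuming the `openai` Python SDK v1 interface; the prompt wording and function name are illustrative placeholders, not the exact prompt used):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def baseline_diagnosis(patient_info: str, temperature: float = 0.0) -> str:
    """Ask GPT-4o for a differential diagnosis from patient information alone."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,  # 0 for the deterministic runs; omit for default
        messages=[
            {"role": "system", "content": "You are a clinical diagnosis assistant."},
            {
                "role": "user",
                "content": f"Patient information:\n{patient_info}\n\n"
                           "List the essential differential diagnoses.",
            },
        ],
    )
    return response.choices[0].message.content
```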
For prompt engineering, I introduced few-shot examples and used chain-of-thought (CoT) reasoning to guide the model’s responses (F1 = 0.6080). I then incorporated embeddings, retrieving the top 3 most similar cases via cosine similarity before prompting the model with CoT (F1 = 0.6061).
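
The retrieval step, roughly (a sketch assuming OpenAI embeddings and cosine similarity over the case pool; the embedding model and helper names are placeholders):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of case texts (the embedding model here is an assumption)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_similar(query_vec: np.ndarray, case_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k cases most similar to the query by cosine similarity."""
    # Cosine similarity reduces to a dot product after L2-normalizing both sides.
    q = query_vec / np.linalg.norm(query_vec)
    c = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    return np.argsort(c @ q)[::-1][:k]
```

The retrieved cases (with their diagnoses and reasoning) are then formatted as few-shot CoT examples ahead of the new patient's information in the prompt.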
However, despite adding structured reasoning and dynamic few-shot prompting, performance decreased relative to both baselines. Do you have any insights into why this might be happening?
Looking forward to your thoughts!
Best,
Mengdi