Introduction

This is a project for the 2024 KRICT ChemDX Hackathon. It uses the KRICT Inorganic Phosphor Optical Property dataset. The goal is to predict the color that a phosphor emits, given only its chemical formula: e.g., Ba2La6.4Si6O26Eu1.6 -> Orange.

Motivated by my recent work on solid-state chemistry predictions using fine-tuned LLMs and text-embedding vectors, which I spoke about earlier in the week at the KRICT R&D Forum: Materials, AI, and Autonomous Laboratory, I will see how much can be learned directly from the formula strings. We compare four methods: string edit distance, SentenceBERT vector similarity, and two fine-tuned LLMs (Llama-3.1-8B and GPT-4o-mini).

Data Overview

The data was provided as an XLSX file (data/Inorganic_Phosphor_Optical.xlsx), in which the maximum emission wavelength is given in nanometers.

The script scripts/01_Data_preparation.wls converts each wavelength to the corresponding color name and performs a random 80/20 train/test split of the data. (There are 3008 training items and 752 test items.) It also generates some visualizations of the dataset, shown below:

Histogram of wavelengths

Color categories
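For reference, the core wavelength-to-color conversion and random split could look roughly like the sketch below (the column positions and wavelength bin boundaries are illustrative assumptions, not copied from 01_Data_preparation.wls):

```wolfram
(* Rough sketch, not the actual 01_Data_preparation.wls; column positions and
   wavelength bin boundaries are illustrative assumptions. *)
raw  = Import["data/Inorganic_Phosphor_Optical.xlsx", {"Data", 1}];  (* first sheet *)
rows = Rest[raw];                                 (* drop header row *)
formulas    = rows[[All, 1]];                     (* assumed: formula in column 1 *)
wavelengths = rows[[All, 2]];                     (* assumed: peak emission wavelength (nm) in column 2 *)

toColor[wl_?NumericQ] := Which[
   wl < 380, "Ultraviolet", wl < 450, "Violet", wl < 495, "Blue",
   wl < 570, "Green",       wl < 590, "Yellow", wl < 620, "Orange",
   wl < 750, "Red",         True, "Infrared"];

examples = Thread[formulas -> (toColor /@ wavelengths)];   (* formula -> color rules *)

SeedRandom[1234];                                          (* reproducible split *)
{train, test} = TakeDrop[RandomSample[examples], Round[0.8 Length[examples]]];
```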

Modeling Methods & Results

For reference, the most common class is "Orange" emission, and if you select this majority class all the time, you will be correct ~34% of the time.
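Given train/test lists of formula -> color rules like those in the sketch above, this baseline is easy to check:

```wolfram
(* Majority-class baseline: always predict the commonest training label *)
majority = First@Commonest[Values[train]];
N@Mean[Boole[#[[2]] === majority] & /@ test]   (* ~0.34 when the majority class is "Orange" *)
```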

String Edit Distance

Relevant code is in scripts/05_Edit_Distance.wls. In short, we compute the Levenshtein edit distance between a query chemical formula and the examples in the training set, and select the nearest one. The premise is that similar formulas should have similar results. This gives a surprisingly high accuracy of 79%.

Edit Distance Result
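A minimal sketch of this 1-nearest-neighbor classifier, again assuming the formula -> color rule lists from the data-preparation sketch:

```wolfram
(* Classify a query formula by the color of its nearest training formula
   under the Levenshtein (edit) distance. *)
nearestFormula = Nearest[Keys[train], DistanceFunction -> EditDistance];

predictByEditDistance[formula_String] := First[nearestFormula[formula]] /. train

(* Fraction of test formulas whose predicted color matches the true label *)
N@Mean[Boole[predictByEditDistance[#[[1]]] === #[[2]]] & /@ test]
```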

SentenceBERT vector similarity

Next, we examined encoding the chemical formulas using SentenceBERT. Naively, we do not expect this to work especially well, as SentenceBERT is designed to capture the semantic meaning of general text, but it is easy enough to try. The implementation is in scripts/02_SentenceBert_Similarity.wls.

Using the single nearest neighbor yields the best results (comparable to, or perhaps slightly better than, string edit distance):

SentenceBERT nearest-neighbor result

We can also look at the commonest label for the 10 nearest items, and this is noticeably worse:

SentenceBERT commonest-of-10 result
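A sketch of both variants, assuming the SentenceBERT model from the Wolfram Neural Net Repository and cosine distance (the model name below is an assumption; see 02_SentenceBert_Similarity.wls for the actual implementation):

```wolfram
(* Embed each formula string with SentenceBERT, then classify by nearest training vectors. *)
sbert = NetModel["SentenceBERT Trained on SNLI and MultiNLI Data"];   (* assumed model name *)

trainVecs  = Normal@sbert[Keys[train]];        (* one embedding vector per training formula *)
nearestVec = Nearest[trainVecs -> Values[train], DistanceFunction -> CosineDistance];

predictNearest[f_String]   := First[nearestVec[Normal@sbert[f]]]                 (* single nearest neighbor *)
predictCommonest[f_String] := First@Commonest[nearestVec[Normal@sbert[f], 10]]   (* vote over the 10 nearest *)
```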

Possible improvements:

  • Use a different/better similarity function
  • Tune the number of nearest-neighbors to consider

Fine-tuned LLMs (Llama-3.1-8B and GPT-4o-mini)

What about fine-tuning LLMs on this? We use the system prompt: You are an expert materials chemist predicting the emission color of a phosphor with the chemical composition provided. Answer only with the following color names and no other output: Ultraviolet, Violet, Blue, Green, Yellow, Orange, Red, Infrared.

The chemical formula is provided as the user content and the color is the assistant response. We fine-tuned this on OpenPipe using default hyperparameters, then downloaded the evaluated results (results/KRICT_Phosphors_(test_entries).jsonl) and analyzed them with scripts/04_Evaluate_Openpipe_results.wls. (Elsewhere I have written more extensive notes on LLM fine-tuning which may be useful if you are new to this field.)
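For reference, each line of data/train.jsonl pairs that system prompt with a formula/color exchange in the standard chat fine-tuning format; a sketch of how a record could be assembled:

```wolfram
(* Assemble one JSONL record (system/user/assistant messages) for chat-style fine-tuning. *)
systemPrompt = StringJoin[
  "You are an expert materials chemist predicting the emission color of a phosphor ",
  "with the chemical composition provided. Answer only with the following color names ",
  "and no other output: Ultraviolet, Violet, Blue, Green, Yellow, Orange, Red, Infrared."];

toRecord[(formula_String -> color_String)] := ExportString[
  <|"messages" -> {
      <|"role" -> "system",    "content" -> systemPrompt|>,
      <|"role" -> "user",      "content" -> formula|>,
      <|"role" -> "assistant", "content" -> color|>}|>,
  "JSON", "Compact" -> True];

(* e.g., toRecord["Ba2La6.4Si6O26Eu1.6" -> "Orange"] gives one line of the training file *)
Export["data/train.jsonl", StringRiffle[toRecord /@ train, "\n"], "Text"];
```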

The two fine-tuned LLMs are comparable (84-85% accuracy), a noticeable improvement over the edit-distance and embedding-similarity approaches. Not bad for a chatbot!

Fine-tuning the Llama-3.1 model cost $2.48 USD; I've put the LoRA BF16 weights in models.

Fine-tuning the GPT-4o-mini model cost $2.22 USD. You'll have to make your own if you want to reproduce this, but you can do so easily using the data/train.jsonl file.

Llama-3.1-8B

Llama-3.1-8B confusion matrix

GPT-4o-mini

GPT-4o-mini confusion matrix
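For scoring, once the true and predicted colors are extracted from the results file (see 04_Evaluate_Openpipe_results.wls for the actual parsing), the accuracy and confusion matrices reduce to a few lines, e.g.:

```wolfram
(* Accuracy and confusion matrix from paired lists of true and predicted color labels. *)
labels = {"Ultraviolet", "Violet", "Blue", "Green", "Yellow", "Orange", "Red", "Infrared"};

accuracy[actual_List, predicted_List] :=
  N@Mean[Boole@MapThread[SameQ, {actual, predicted}]]

confusionMatrix[actual_List, predicted_List] :=
  Table[Count[Transpose[{actual, predicted}], {a, p}], {a, labels}, {p, labels}]
```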

Possible improvements:

  • Play with fine-tuning hyperparameters and prompt
  • Perform a cross-validation to assess error bars
  • Examine the log-probabilities of output, evaluate perplexity, etc.

Conclusions and Future Directions

It is possible to predict the natural-language emission color of a phosphor from its chemical formula with reasonable accuracy using these text-based approaches. Additionally, a text-based approach may be more natural for scientific users.

summary barchart
