Large language models for scientific discovery in molecular property prediction

Author(s)
Zheng, Yizhen
Koh, Huan Yee
Ju, Jiaxin
Nguyen, Anh TN
May, Lauren T
Webb, Geoffrey I
Pan, Shirui
Date
2025
Abstract

Large language models (LLMs) are a form of artificial intelligence system encapsulating vast knowledge in the form of natural language. These systems are adept at numerous complex tasks including creative writing, storytelling, translation, question-answering, summarization and computer code generation. Although LLMs have seen initial applications in natural sciences, their potential for driving scientific discovery remains largely unexplored. In this work, we introduce LLM4SD, a framework designed to harness LLMs for driving scientific discovery in molecular property prediction by synthesizing knowledge from literature and inferring knowledge from scientific data. LLMs synthesize knowledge by extracting established information from scientific literature, such as molecular weight being key to predicting solubility. For inference, LLMs identify patterns in molecular data, particularly in Simplified Molecular Input Line Entry System-encoded structures, such as halogen-containing molecules being more likely to cross the blood–brain barrier. This information is presented as interpretable knowledge, enabling the transformation of molecules into feature vectors. By using these features with interpretable models such as random forest, LLM4SD can outperform the current state of the art across a range of benchmark tasks for predicting molecular properties. We foresee it providing interpretable and potentially new insights, aiding scientific discovery in molecular property prediction.
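
The step the abstract describes as "the transformation of molecules into feature vectors" can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three rules, the names `atoms`, `RULES` and `featurize`, and the regex-based atom scan are hypothetical stand-ins, not the rules or code from LLM4SD. Heavy-atom count is used only as a crude proxy for molecular size, since computing molecular weight properly requires a cheminformatics toolkit.

```python
import re

# Matches one atom per hit: a bracket atom such as [Cl-] or [nH] (group 1),
# or an unbracketed organic-subset atom (group 2). This is a simplified
# scan for illustration, not a full SMILES parser.
ATOM_PATTERN = re.compile(
    r"\[\d*([A-Za-z][a-z]?)[^\]]*\]|(Cl|Br|[BCNOPSFI]|[bcnops])"
)
HALOGENS = {"F", "Cl", "Br", "I"}


def atoms(smiles):
    """Return the element symbols of the heavy atoms written in a SMILES string."""
    symbols = []
    for bracket, plain in ATOM_PATTERN.findall(smiles):
        # Aromatic atoms are lowercase in SMILES; normalise "c" to "C", etc.
        symbol = (bracket or plain).capitalize()
        if symbol != "H":  # skip explicit hydrogens
            symbols.append(symbol)
    return symbols


# Hypothetical rules, each mapping a list of atom symbols to one number.
# In the framework described above, rules like these would come from an LLM;
# here they are hand-written placeholders.
RULES = {
    "contains_halogen": lambda a: float(any(s in HALOGENS for s in a)),
    "heavy_atom_count": lambda a: float(len(a)),
    "heteroatom_fraction": lambda a: sum(s != "C" for s in a) / len(a) if a else 0.0,
}


def featurize(smiles):
    """Apply every rule to one molecule and return its feature vector."""
    symbols = atoms(smiles)
    return [rule(symbols) for rule in RULES.values()]


if __name__ == "__main__":
    for smiles in ["CCO", "Clc1ccccc1"]:  # ethanol, chlorobenzene
        print(smiles, dict(zip(RULES, featurize(smiles))))
```

Because each feature corresponds to a named, human-readable rule, the resulting vectors could then be passed to an interpretable model such as a random forest, and the model's feature importances would map directly back to the rules.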

Journal Title

Nature Machine Intelligence

Volume

7

Issue

3

Funder(s)

ARC

Grant identifier(s)

DP240101547

Citation

Zheng, Y; Koh, HY; Ju, J; Nguyen, ATN; May, LT; Webb, GI; Pan, S, Large language models for scientific discovery in molecular property prediction, Nature Machine Intelligence, 2025, 7 (3), pp. 437-447
