Scientists use the language of molecules to accelerate material and drug discovery

The machine learning framework learns the language of molecules to generate new ones and predict the properties of materials, paving the way for material and drug discovery.
Tejasri Gururaj
Molecular structure


Predicting molecular properties and generating new molecules are critical for material and drug discovery. Advances in machine learning (ML) have led to its use in both areas.

However, one problem with using ML models for material and drug discovery is the training process, which often requires extensive datasets that are expensive and time-consuming to create.

Now, a team of researchers from the Massachusetts Institute of Technology (MIT) has built a unified framework that can predict molecular properties and generate new molecules while being trained on a relatively small dataset.

The team was led by Minghao Guo, a graduate student at MIT, who is also the study's first author. Their system is more efficient than traditional deep-learning approaches. 

"Our goal with this project is to use some data-driven methods to speed up the discovery of new molecules, so you can train a model to do the prediction without all of these cost-heavy experiments," said Guo in a press release.

The language of molecules

Traditional methods rely on ML models that learn from large datasets that are not domain-specific. As a result, the models perform poorly.

The research team decided to take a different approach by relying on the language of molecules. Atoms obey physical laws that dictate how they interact with each other to form molecules. The researchers used this molecular grammar to train their system.

The system can produce new compounds and anticipate their attributes in a data-efficient manner by learning this language and identifying the similarities between molecular structures.

"Once we have this grammar as a representation for all the different molecules, we can use it to boost the process of property prediction," explained Guo in the press release. 

The team used reinforcement learning to train the system on the production rules of molecular grammar. They simplified the learning process by breaking the molecular grammar into two components: a general metagrammar and a molecule-specific grammar.

Combined with reinforcement learning, this hierarchical approach accelerated learning and empowered the system to generate viable molecules and make accurate predictions about their properties.
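
To make the idea of production rules concrete, here is a minimal sketch of a toy grammar that builds SMILES-like strings by repeatedly expanding symbols. It is an illustration only: the rule set, the symbol names (CHAIN, UNIT, BRANCH) and the generate function are hypothetical, the rules are picked at random rather than learned, and the study's grammar operates on molecular graphs rather than strings.

```python
import random

# Hypothetical toy rules: each nonterminal maps to a list of possible expansions.
# They are illustrative only and are NOT the grammar learned in the study.
RULES = {
    "CHAIN": [["UNIT"], ["UNIT", "CHAIN"]],
    "UNIT": [["C"], ["O"], ["N"], ["C", "(", "BRANCH", ")"]],
    "BRANCH": [["C"], ["O"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` by applying randomly chosen production rules."""
    if symbol not in RULES:
        # Terminal symbol (an atom or a parenthesis): emit it unchanged.
        return symbol
    options = RULES[symbol]
    # Past max_depth, always take the first (non-recursive) rule so expansion ends.
    choice = options[0] if depth >= max_depth else rng.choice(options)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in choice)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same strings
    for _ in range(3):
        print(generate("CHAIN", rng))
```

In this toy version the choice of rule is random. In the system described above, that choice is what the training process shapes, so that the rules it applies lead to viable molecules.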

Making predictions

The researchers tested their system and found that it outperformed several state-of-the-art ML approaches at generating feasible molecules and polymers, as well as predicting their properties. It did so even when trained on a domain-specific dataset of only a hundred samples.

Some prior approaches also needed costly pretraining, which their system avoids. Their system performed remarkably well at predicting physical properties of polymers, such as the glass transition temperature. These properties are hard to determine experimentally because the measurements require very high pressures and temperatures.

The researchers also cut the training set by more than half, to only 94 samples, and still achieved comparable results.

"This grammar-based representation is very powerful. And because the grammar itself is a very general representation, it can be deployed to different kinds of graph-form data. We are trying to identify other applications beyond chemistry or material science," Guo said in the press release.

The researchers aim to extend their work to incorporate 3D geometry in order to study polymer chain interactions. They are also working on an interface that displays the learned grammar rules and gathers user feedback to improve accuracy.

Their findings were published in the Proceedings of the 40th International Conference on Machine Learning.


The prediction of molecular properties is a crucial task in the field of material and drug discovery. The potential benefits of using deep learning techniques are reflected in the wealth of recent literature. Still, these techniques are faced with a common challenge in practice: Labeled data are limited by the cost of manual extraction from literature and laborious experimentation. In this work, we propose a data-efficient property predictor by utilizing a learnable hierarchical molecular grammar that can generate molecules from grammar production rules. Such a grammar induces an explicit geometry of the space of molecular graphs, which provides an informative prior on molecular structural similarity. The property prediction is performed using graph neural diffusion over the grammar-induced geometry. On both small and large datasets, our evaluation shows that this approach outperforms a wide spectrum of baselines, including supervised and pre-trained graph neural networks. We include a detailed ablation study and further analysis of our solution, showing its effectiveness in cases with extremely limited data.
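
The abstract describes property prediction as diffusion over a geometry of molecular graphs. As a rough intuition for why that helps when labels are scarce, the sketch below spreads a few known values across a small similarity graph by repeated neighbor averaging. This is plain label propagation, not the graph neural diffusion model used in the study; the graph, the property values and the diffuse function are hypothetical, and the code assumes every unlabeled node has at least one neighbor.

```python
def diffuse(adjacency, known, steps=200):
    """Estimate values for unlabeled nodes by repeated neighbor averaging.

    adjacency: dict mapping node index -> list of neighboring node indices
    known:     dict mapping node index -> measured property value
    """
    n = len(adjacency)
    # Start every unlabeled node at the mean of the measured values.
    mean_known = sum(known.values()) / len(known)
    values = [known.get(i, mean_known) for i in range(n)]
    for _ in range(steps):
        updated = []
        for i in range(n):
            if i in known:
                # Measured values stay fixed throughout.
                updated.append(known[i])
            else:
                nbrs = adjacency[i]
                updated.append(sum(values[j] for j in nbrs) / len(nbrs))
        values = updated
    return values


# Hypothetical example: four molecules linked in a chain by structural
# similarity (0 - 1 - 2 - 3), with the property measured only at the two ends.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
known = {0: 0.0, 3: 3.0}
predictions = diffuse(adjacency, known)
```

With only two measurements, the two unlabeled molecules settle at values interpolated between them (close to 1.0 and 2.0 here). The structure of the graph supplies the information that the missing labels do not.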
