MIT scientists develop super speedy AI system for biology research

The system turns a months-long process into just a few hours. 
Loukia Papadopoulos
Representational image of machine learning.


MIT researchers led by Jim Collins, the Termeer Professor of Medical Engineering and Science in the Department of Biological Engineering, have developed BioAutoMATED, an automated machine-learning system for biology research. The system can select and build an appropriate model for a given dataset and even handle the laborious task of data preprocessing, whittling a months-long process down to just a few hours.

This is according to a press release by the institution published on Thursday.

“It would take many weeks of effort to figure out the appropriate model for our dataset, and this is a really prohibitive step for a lot of folks that want to use machine learning or biology,” said Jacqueline Valeri, a fifth-year PhD student of biological engineering in Collins’s lab and co-first author of the paper.

“The fundamental language of biology is based on sequences,” explained Luis Soenksen, who earned his doctorate in the MIT Department of Mechanical Engineering. “Biological sequences such as DNA, RNA, proteins, and glycans have the amazing informational property of being intrinsically standardized, like an alphabet. A lot of AutoML tools are developed for text, so it made sense to extend it to [biological] sequences.”
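To make the alphabet analogy concrete, here is a minimal Python sketch of one common way to turn a DNA sequence into numbers a model can read, known as one-hot encoding. The function name and the choice of encoding are assumptions for illustration; the article does not say how BioAutoMATED itself represents sequences.

```python
# Hypothetical helper for illustration; not taken from BioAutoMATED's code.
DNA_ALPHABET = "ACGT"

def one_hot_encode(sequence, alphabet=DNA_ALPHABET):
    """Return one row per character, with a 1 in the column of the matching letter."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    encoded = []
    for char in sequence.upper():
        row = [0] * len(alphabet)
        if char in index:  # characters outside the alphabet (e.g. N) stay all zeros
            row[index[char]] = 1
        encoded.append(row)
    return encoded

# Each letter becomes a fixed-length vector, much like characters in text.
encoded = one_hot_encode("GATN")
```

Because every sequence is spelled from the same small set of letters, the same encoding step works for any DNA dataset, which is what makes text-oriented tools a natural fit.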

Conventional AutoML tools have the disadvantage of being able to explore and build only a limited range of model types.

“But you can’t really know from the start of a project which model will be best for your dataset,” Valeri said. “By incorporating multiple tools under one umbrella tool, we really allow a much larger search space than any individual AutoML tool could achieve on its own.”
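The idea of searching across several kinds of models and keeping whichever performs best can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the two candidate models, the GC-content feature, and all function names are hypothetical and are not the tools BioAutoMATED wraps.

```python
# Illustrative only: candidates and names are hypothetical, not BioAutoMATED's.

def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def fit_mean(train):
    """Baseline candidate: always predict the mean training label."""
    mean = sum(y for _, y in train) / len(train)
    return lambda seq: mean

def fit_gc_linear(train):
    """Second candidate: least-squares line from GC content to the label."""
    xs = [gc_content(s) for s, _ in train]
    ys = [y for _, y in train]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return lambda seq: my + slope * (gc_content(seq) - mx)

def mse(model, data):
    """Mean squared error of a fitted model on (sequence, label) pairs."""
    return sum((model(s) - y) ** 2 for s, y in data) / len(data)

def select_best(candidates, train, valid):
    """Fit every candidate on train; return the best name and all validation errors."""
    scores = {name: mse(fit(train), valid) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores

# Toy data in which the label happens to equal the GC content.
train = [("GGCC", 1.0), ("AATT", 0.0), ("GATC", 0.5), ("GGCA", 0.75)]
valid = [("GCAT", 0.5), ("CCCG", 1.0), ("ATTA", 0.0)]
best, scores = select_best({"mean": fit_mean, "gc_linear": fit_gc_linear}, train, valid)
```

Adding another family of models only means adding another entry to the dictionary of candidates, which is the sense in which an umbrella tool widens the search space.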

BioAutoMATED can even help determine how much data is required to appropriately train the chosen model.
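One common way to judge how much data a model needs is a learning curve: train on progressively larger subsets and watch the error on held-out data. The Python sketch below shows that general technique on synthetic numbers. It is an assumption that a check of this kind is relevant here, since the article does not describe how BioAutoMATED makes its estimate, and every name in the code is hypothetical.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a + b * x on (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var = sum((x - mx) ** 2 for x, _ in points)
    slope = sum((x - mx) * (y - my) for x, y in points) / var if var else 0.0
    return lambda x: my + slope * (x - mx)

def learning_curve(train, valid, sizes):
    """Validation mean squared error after training on the first n points, per n."""
    curve = []
    for n in sizes:
        model = fit_line(train[:n])
        error = sum((model(x) - y) ** 2 for x, y in valid) / len(valid)
        curve.append((n, error))
    return curve

def sample(rng, n):
    """Synthetic data for the demo: y = 2x plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, 2 * x + rng.gauss(0, 0.5)))
    return data

rng = random.Random(0)  # fixed seed so the demo is repeatable
train, valid = sample(rng, 80), sample(rng, 200)
sizes = [5, 20, 80]
curve = learning_curve(train, valid, sizes)
for n, error in curve:
    print(f"{n:>3} training points -> validation MSE {error:.3f}")
```

If the error is still falling at the largest size, more data would probably help; if it has levelled off, collecting more is unlikely to change much.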

“Our tool explores models that are better-suited for smaller, sparser biological datasets as well as more complex neural networks,” Valeri said. This makes BioAutoMATED particularly useful for research groups with new data that may or may not be suited to a machine-learning problem.

“Conducting novel and successful experiments at the intersection of biology and machine learning can cost a lot of money,” Soenksen explained. “Currently, biology-centric labs need to invest in significant digital infrastructure and AI-ML trained human resources before they can even see if their ideas are poised to pan out. We want to lower these barriers for domain experts in biology.”

BioAutoMATED’s open-source code is publicly available, allowing researchers to run initial experiments and assess whether it is worthwhile to hire a machine-learning expert to build a different model for further experimentation.

“What we would love to see is for people to take our code, improve it, and collaborate with larger communities to make it a tool for all,” Soenksen said in the statement. “We want to prime the biological research community and generate awareness related to AutoML techniques, as a seriously useful pathway that could merge rigorous biological practice with fast-paced AI-ML practice better than it is achieved today.”