A Milestone in Machine Learning

Savoie research team builds largest dataset of reaction mechanisms in existence

Theoreticians have worked in tandem with experimentalists since the dawn of the scientific age. The advent of machine learning facilitated computational work on a larger scale and a faster timetable. While data about the properties of specific molecules has been available for years, predictions of how those molecules would react in different environments and under various conditions remained elusive.

After developing the automated computational method YARP — Yet Another Reaction Program — two years ago, a research team led by Brett Savoie, the Charles Davidson Associate Professor of Chemical Engineering, applied YARP’s technology to build the largest dataset of reaction mechanisms in existence. The research was published in the journal Scientific Data in March. “We didn’t develop YARP with the intention of creating a reaction dataset, but we recognized the opportunity to address a huge gap in the field because of our work in machine learning,” Savoie says. “We knew we could leverage our technology to develop a valuable resource for the scientific community.”

Graduate student Qiyuan Zhao worked with Savoie to develop YARP, a new approach to predicting reaction outcomes from scratch. YARP treats chemicals like graphs, which allows chemical reactions to be described in a way that computers can interpret and automate. To build the dataset, the research team attempted about 700,000 virtual reactions using YARP and, from those attempts, observed 175,000 interesting reactions. YARP’s hit rate is four times that of other existing computational methods.
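
To make the graph idea concrete, here is a minimal sketch in Python. It is not YARP's code or API: the names `bond`, `apply_reaction` and `hit_rate` are hypothetical, bond orders are ignored, and the example reaction (hydrogen cyanide rearranging to hydrogen isocyanide) was chosen only because it is small. The sketch assumes a molecule can be stored as a set of bonds between numbered atoms and a reaction as the bonds it breaks and forms. It also assumes that the hit rate is simply the number of observed reactions divided by the number attempted, which the article does not state.

```python
from typing import Dict, FrozenSet, Set

# A bond is an unordered pair of atom indices (bond order is ignored here).
Bond = FrozenSet[int]


def bond(a: int, b: int) -> Bond:
    """Build an undirected bond (graph edge) between atoms a and b."""
    return frozenset((a, b))


def apply_reaction(bonds: Set[Bond], broken: Set[Bond], formed: Set[Bond]) -> Set[Bond]:
    """Return the product graph after breaking and forming the given bonds."""
    if not broken <= bonds:
        raise ValueError("cannot break a bond that is not in the reactant")
    if formed & bonds:
        raise ValueError("cannot form a bond that already exists")
    return (bonds - broken) | formed


def hit_rate(observed: int, attempted: int) -> float:
    """Fraction of attempted reactions that were observed (assumed definition)."""
    return observed / attempted


# Toy reactant: H-C-N, with atoms numbered 0 (H), 1 (C), 2 (N).
atoms: Dict[int, str] = {0: "H", 1: "C", 2: "N"}
hcn: Set[Bond] = {bond(0, 1), bond(1, 2)}

# The rearrangement moves the hydrogen from carbon to nitrogen:
# break H-C, form H-N. The C-N edge is untouched.
hnc = apply_reaction(hcn, broken={bond(0, 1)}, formed={bond(0, 2)})

# Figures quoted in the article: 175,000 reactions observed in 700,000 attempts.
rate = hit_rate(175_000, 700_000)
```

Because a reaction in this picture is just a list of edges to remove and add, a program can enumerate candidate reactions mechanically, which is the sense in which a graph description lets computers "interpret and automate" chemistry. Under the assumed definition, the article's figures work out to a rate of one in four.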

“We’ve spent two years developing YARP, which is now in its third version,” Savoie says. “We’ve been constantly improving the algorithms it uses to find these transition states. At this point, YARP is both the fastest and most accurate transition state finding algorithm that exists to our knowledge.”

The reaction dataset enables researchers to predict how a material would react in different environments. Changes in moisture, temperature and acidity can affect the stability of molecules and cause them to break down, leading to instability in materials. Accurately predicting how materials will break down can determine whether a new material will remain viable in the field.

“Material stability is a bottleneck in the production of virtually anything that is synthesized,” Savoie says. “Using computational methods to predict susceptibilities earlier in the process is more cost-effective and time-efficient than experimenting with traditional iterations in a lab. This dataset will have far-reaching applications across multiple industries.”

The team is currently in the process of expanding the dataset with the goal of eventually covering all of reaction space, which could mean simulating billions of reactions using YARP. The university’s unrivaled computational resources enable the team to run extensive simulations on a relatively short timescale. While the initial dataset was publicly released, future versions may be licensed.


“Purdue gifted the scientific community with an extremely important asset by publicly releasing this dataset,” Savoie says. “We’ve generated a milestone for our field and reinforced the university’s reputation as a preeminent institution for algorithm engineering and machine learning, but we see a lot more opportunity to build on this success with future versions. Our ambition is to eventually characterize all classes of reactions so that any researcher wanting to understand or optimize a reaction will be able to find it in our dataset.

“The past century has shown how synthesized materials can dramatically increase our quality of life. Because of its immediate relevance to predicting stability, we expect this dataset to accelerate the creation of better materials with functionalities that haven’t yet been realized. I’m incredibly optimistic about the semiconductors of tomorrow, the medicines of tomorrow. The future is as bright as it has ever been with respect to what chemical engineering can do for us. The caveat is that the things we make can also be destructive. So as the barrier to the design and synthesis of new molecules and materials is reduced, we must be careful as we deploy these technologies.”

Mapping Glucose Pyrolysis

The computational methods being developed by the Savoie group are also finding other applications. In a recent project, Savoie’s team used YARP to characterize a record-sized reaction network for glucose pyrolysis, a map of how one specific system degrades under heat. Converting biomass waste into valuable chemical components can be challenging because glucose undergoes a myriad of reactions as it is heated. Engineering reactors by traditional methods to convert glucose into a subset of high-value products is difficult. The reaction network produced by the research team is the first to map the reactions from glucose to optimized products.

“Once you have that reaction network, you can start to rationally improve the yields of these products,” Savoie says. “We’ve developed the first method of its kind to map those pathways. It’s a landmark achievement in terms of the size of the network we’ve generated, the number of steps involved and using simulations to spontaneously discover the reaction mechanisms that produce these valuable chemical products from glucose.”

The research was published in the Proceedings of the National Academy of Sciences.


This story appeared in the Purdue University School of Chemical Engineering’s 2024 newsletter.