
From Formula to Code, From Library to AI

An article from carl 03|2025

by Frank Frick

AI tools make chemical structural formulae machine-readable. In doing so, they facilitate the automated input of molecular data in databases.

Forty years ago, chemistry professors would search in vain for their PhD students in the lab. Instead, they would find them in the library, where the up-and-coming scientists were trying to find out whether somebody else had already produced and described the substance that, according to analysis, they had just made in their own flasks.


They would be most likely to find the critical answer in the Chemical Abstracts, hugely expensive books that filled dozens of metres of shelves. Since 1965, employees of the Chemical Abstracts Service have sifted through the world’s chemistry-related literature, assigning a registration number called the CAS number “for unambiguous identification of each chemical substance” [1]. The structural formula published in the specialist literature, which shows how the atoms in the molecule are linked and spatially arranged, is used as the basis for registering organic substances.


Today, chemists search online for information on substances: it is easier and quicker, and the results are always up to date. But 2D molecular images in documents are unreadable for computers. For humans and machines to share their knowledge about a molecule, the structural information has to be translated, so to speak.

In the beginning, this was done by people, and during the 1990s, the first computer programmes came along. These worked on the basis of rules, detected atomic symbols using methods from book digitisation and classified bonds based on features such as length and line thickness. “These commercial programmes were not accessible to academic researchers,” reports Christoph Steinbeck, Professor of Analytical Chemistry, Cheminformatics and Chemometrics at the University of Jena.


It wasn’t until 2009 that researchers from the National Cancer Institute of the USA published an open source programme called Optical Structure Recognition Application (OSRA) [2]. The programme successfully converted around 90 percent of the structural formulae from high-quality images [3]. “Yet it fails if the structural formulae of molecules with ring-shaped elements are drawn with a slight distortion – so slight that a person wouldn’t notice,” said Steinbeck. 
 

Christoph Steinbeck (right) and Achim Zielesny’s team developed the AI tool DECIMER.ai, which can be used by researchers worldwide.

Inspired by the Go competition 

In 2020, the chemist from Jena, his staff member Kohulan Rajan and Achim Zielesny, Professor at the Westphalian University of Applied Sciences with sites in Gelsenkirchen, Bocholt and Recklinghausen, were the first to introduce a freely available tool that uses artificial intelligence to recognise structural formulae in scientific publications and translate them into machine-readable codes [4, 5]. Its name: DECIMER (Deep lEarning for Chemical IMagE Recognition).

Steinbeck and Zielesny were inspired by the spectacular success of the software AlphaGo and AlphaGo Zero, developed by Google DeepMind, in mastering the board game Go. At a tournament in 2016, AlphaGo, which uses neural networks and machine learning, defeated Lee Sedol, then the best Go player in the world. Until then, it had seemed unthinkable that AI could beat a human at this game.

“When AlphaGo Zero later achieved a superhuman playing level, no longer being trained with human players but repeatedly playing against itself, we realised that AI, with enough training data, can also solve other very complex problems,” recalls Steinbeck.
 

The chemists trained a neural network using data from over a hundred million organic molecules from the PubChem database [6]. They downloaded the machine-readable SMILES codes (SMILES: Simplified Molecular Input Line Entry System) of these molecules and created images of the molecular structures using software they developed themselves. “The trick was that we generated different images of each molecule by rotating, cropping or blurring the structural formulae, adding noise, atom numbering or arrows and using short symbols for functional groups,” explained Steinbeck.

Structural formula of the caffeine molecule and its translation into machine-readable SMILES code
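To give a sense of what "machine-readable" means here, the following minimal Python sketch reads the SMILES string listed for caffeine in PubChem and counts its non-hydrogen atoms and ring closures. It is an illustration only and not part of DECIMER: the function names `count_heavy_atoms` and `count_rings` are hypothetical, and the tokenizer handles just the simple "organic subset" of SMILES (no bracket atoms, no two-digit ring labels).

```python
import re
from collections import Counter

# Caffeine as a SMILES string, as listed in PubChem.
CAFFEINE = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

# Simplified atom pattern (assumption of this sketch): two-letter halogens
# first, then one-letter atoms; lower case marks aromatic atoms.
# Bracket atoms such as [NH4+] are deliberately not handled.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def count_heavy_atoms(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element; hydrogens are implicit in SMILES."""
    return Counter(token.capitalize() for token in ATOM.findall(smiles))


def count_rings(smiles: str) -> int:
    """Each single-digit ring-closure label appears twice: open and close."""
    return len(re.findall(r"\d", smiles)) // 2


if __name__ == "__main__":
    print(dict(count_heavy_atoms(CAFFEINE)))
    print(count_rings(CAFFEINE))
```

For caffeine this yields eight carbon, four nitrogen and two oxygen atoms in two fused rings, which matches the molecular formula C8H10N4O2 once the implicit hydrogens are added. A production system would use a full cheminformatics toolkit instead of a regular expression.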

Training with several hundred million pieces of data

In this way, the scientists killed two birds with one stone. First, they generated hundreds of millions of pieces of training data, a volume that was not accessible to them from the chemical literature. Second, they were able to train the AI with structural formulae in different formats or with poor image quality.

Examples of distorted versions of caffeine from the DECIMER training
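The idea of turning one drawing into many training images can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the software the DECIMER team wrote: the image is a small grid of 0s (background) and 1s (ink), and the helper names `rotate90`, `crop`, `add_noise` and `augment` are hypothetical. Real pipelines work on rendered bitmaps and also blur the image or add arrows and labels.

```python
import random

# Toy 4x6 "drawing": a single horizontal bond line (1 = ink, 0 = background).
BOND = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def rotate90(img):
    """Rotate the grid clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]


def crop(img, top, left, height, width):
    """Cut out a rectangular window."""
    return [row[left:left + width] for row in img[top:top + height]]


def add_noise(img, fraction, rng):
    """Flip the given fraction of pixels (salt-and-pepper noise)."""
    height, width = len(img), len(img[0])
    noisy = [row[:] for row in img]  # copy, so the input stays untouched
    coords = [(r, c) for r in range(height) for c in range(width)]
    for r, c in rng.sample(coords, int(fraction * height * width)):
        noisy[r][c] = 1 - noisy[r][c]
    return noisy


def augment(img, n_variants, seed=0):
    """Create n_variants randomly rotated and noised copies of one image."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    variants = []
    for _ in range(n_variants):
        variant = img
        for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
            variant = rotate90(variant)
        variants.append(add_noise(variant, 0.1, rng))
    return variants
```

Every variant still depicts the same molecule, so all of them can be paired with the same SMILES code as the training target. That is what makes synthetic augmentation a cheap source of labelled data.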

The result: the new AI tool DECIMER translates such representations of structural formulae correctly significantly more often than OSRA and other rule-based programmes. This shows, among other things, in the recognition of molecular structures in patent specifications. These are particularly difficult for software because they often use Markush formulae, in which variable fragments are replaced by short symbols such as R or X. In 2024, a team of researchers from Germany and the USA compared the performance of DECIMER and OSRA on 400 randomly selected molecular structures from patents: OSRA translated only 257 of these correctly, while DECIMER managed 337 [7]. Meanwhile, another freely available AI tool with similar performance has appeared: MolScribe [8].
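Expressed as recognition rates, the patent comparison above works out as follows. The short Python sketch uses only the three numbers reported in [7]; the names `TOTAL`, `CORRECT` and `accuracy` are illustrative.

```python
# Figures from the 2024 patent benchmark cited in the text [7].
TOTAL = 400
CORRECT = {"OSRA": 257, "DECIMER": 337}


def accuracy(correct: int, total: int) -> float:
    """Share of structures that were translated correctly."""
    return correct / total


if __name__ == "__main__":
    for tool, n_correct in CORRECT.items():
        print(f"{tool}: {n_correct}/{TOTAL} = {accuracy(n_correct, TOTAL):.2%}")
    gap = CORRECT["DECIMER"] - CORRECT["OSRA"]
    print(f"DECIMER translated {gap} more structures correctly")
```

OSRA therefore reaches 64.25 percent and DECIMER 84.25 percent on this sample, a gap of 80 structures or 20 percentage points.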

Anyone can use DECIMER in an internet browser: simply upload scientific articles that contain chemical structural formulae and the AI immediately begins its work [9].

Christoph Steinbeck, who is also a representative of the National Research Data Infrastructure for Chemistry (NFDI4Chem), hopes to make the chemical literature dating all the way back to the 1950s machine-readable. NFDI4Chem is discussing this plan with specialist publishing houses, which hold the rights to use the publications and data. The extent to which the Chemical Abstracts Service of the American Chemical Society will be persuaded by the project remains to be seen.

Glossary

Markush formulae are named after the chemist Eugene Markush, who registered a patent in 1924. In it, he used a new notation to cover a whole group of chemical compounds with a similar basic structure but different substituents.

[1] www.cas.org/about/cas-history
[2] www.sourceforge.net/projects/osra/
[3] K. Rajan et al., 2020, J. Cheminform., 12, 60, doi.org/10.1186/s13321-020-00465-0
[4] K. Rajan et al., 2021, J. Cheminform., 13, 61, doi.org/10.1186/s13321-021-00538-8
[5] K. Rajan et al., 2023, Nat. Commun., 14, 5045, doi.org/10.1038/s41467-023-40782-0
[6] www.pubchem.ncbi.nlm.nih.gov
[7] A. Krasnov et al., 2024, Digital Discovery, 3, 681, doi.org/10.1039/d3dd00228d
[8] www.github.com/thomas0809/MolScribe
[9] www.decimer.ai

Image credits: Anne Günthe, Uni Jena / Kohulan Rajan / Kohulan Rajan
