Open PhD Positions

The AiChemist Project will officially commence on 1st September 2023. Prospective candidates should apply ASAP!

Eligibility and Mobility Rules

On the date of recruitment by the host organization, doctoral candidates (DCs) must be in the first four years (full-time equivalent research experience) of their research careers and must not have been awarded a doctoral degree.

"Date of Recruitment normally means the first day of the employment of the fellow for the purposes of the project (i.e. the starting date indicated in the employment contract or equivalent direct contract)."

For all recruitments, the eligibility of the researcher will be determined at the date of their first recruitment in the project. This status will not evolve over the lifetime of the action, even if they are re-recruited at another participating organisation.

"Full-Time Equivalent Research Experience is measured from the date when a researcher obtained the degree which would formally entitle him/her to embark on a doctorate, either in the country in which the degree was obtained or in the country in which the researcher is recruited or seconded, irrespective of whether or not a doctorate is or was ever envisaged."

At the time of recruitment by the host organization, researchers must not have resided or carried out their main activity (work, studies, etc.) in the country of the first host organization for more than 12 months in the 3 years immediately prior to the reference date. Compulsory national service and/or short stays such as holidays are not taken into account. As far as international European interest organizations or international organizations are concerned, this rule does not apply to the hosting of eligible researchers. However, the appointed researcher must not have spent more than 12 months in the 3 years immediately prior to their recruitment at the host organization.

Researchers with refugee status benefit from a less restrictive mobility rule:

"For refugees under the Geneva Convention (1951 Refugee Convention and the 1967 Protocol), the refugee procedure (i.e. before refugee status is conferred) will not be counted as ‘period of residence/activity in the country of the beneficiary’."
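As a rough self-check of the mobility rule above, the arithmetic can be sketched in Python. This is only an illustrative sketch and not an official eligibility test: the function names are hypothetical, "12 months" is approximated as 365 days, stays are assumed not to overlap each other, and periods that do not count (compulsory national service, short stays such as holidays, the refugee procedure) are assumed to have been left out of the input already.

```python
from datetime import date, timedelta

# Assumption: "12 months" is approximated as 365 days for this sketch.
MAX_DAYS_IN_HOST_COUNTRY = 365
WINDOW_YEARS = 3


def window_start(reference_date, years=WINDOW_YEARS):
    """Return the date `years` years before the reference date."""
    try:
        return reference_date.replace(year=reference_date.year - years)
    except ValueError:
        # 29 February mapped onto a non-leap year: fall back to 28 February.
        return reference_date.replace(year=reference_date.year - years, day=28)


def days_in_host_country(stays, reference_date):
    """Count days spent in the host country within the 3-year window.

    `stays` is a list of (first_day, last_day) date pairs, both inclusive,
    assumed not to overlap each other.
    """
    start = window_start(reference_date)
    last_day = reference_date - timedelta(days=1)
    total = 0
    for stay_start, stay_end in stays:
        lo = max(stay_start, start)
        hi = min(stay_end, last_day)
        if hi >= lo:
            total += (hi - lo).days + 1
    return total


def satisfies_mobility_rule(stays, reference_date):
    """True if the counted days do not exceed the 12-month limit."""
    return days_in_host_country(stays, reference_date) <= MAX_DAYS_IN_HOST_COUNTRY
```

For example, a candidate recruited on 1 September 2023 who studied in the host country from 1 January to 30 June 2021 would have 181 counted days and pass this check. The host organization's own assessment remains decisive.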

More details are available here.

We welcome applications from people regardless of gender, nationality, ethnicity, sexual identity, physical abilities and religion. We adhere to the European Commission's Code of Conduct for the Recruitment of Researchers: https://euraxess.ec.europa.eu/jobs/charter/code

Common requirements for all DCs

All the applicants are expected to:

  1. Have a Master's degree in computer science, physics, chemistry, or engineering with a sincere interest in biology and the life sciences.
  2. Have some prior expertise in one or more of the following fields: machine learning, modeling and simulation.
  3. Be excellent in oral and written English with good presentation skills.
  4. Possess strong interpersonal skills, excellent written and verbal communication, and the ability to work effectively both independently and in cross-functional teams.
  5. Be a highly creative person with outstanding problem-solving ability and the willingness to undertake challenging analysis tasks in a timely fashion.

Furthermore, the following software skills are required:

  1. Excellent software engineering skills are essential. Programming skills in Python must be top-notch.
  2. Experience with relevant libraries (TensorFlow/PyTorch, the Python scientific stack) is necessary.
  3. Good command of modern software development tools, from git to continuous integration pipelines, is an additional plus.

The successful candidate will also demonstrate a passion for driving scientific questions, a positive and problem-solving attitude, and the willingness to undertake challenging analysis tasks in a timely fashion. Excellent spoken and written English is required, as is the ability to work effectively both independently and in cross-functional teams. You should enjoy teamwork, have a collaborative nature, and be an encouraging team member.

How to apply

  1. Make sure that you satisfy the eligibility and mobility rules! 
  2. Prepare your profile and provide sufficient details about your educational and work background, proof of your education (or expected award date of your MSc/Diploma), your CV, and motivation letter.
  3. Submit your application to apply@aichemist.eu. The screening and first wave of interviews are ongoing, so please do not hesitate to submit your application.

Note: A candidate may apply for up to three of the listed positions, as long as they have the relevant background and expertise. If you wish to apply for more than one position, please rank the positions according to your preference, i.e., indicate your first, second and third choices.

The screening procedure is as follows

  • Each application will be screened by the respective supervisors from the host organizations.
  • Prospective candidates will be contacted by the supervisors for individual interviews and the most suitable candidates will be shortlisted.
  • The shortlisted candidates will be interviewed by the recruitment commission either in person or over Skype/Zoom.
  • The candidates will be informed by e-mail about the results of their applications.

Descriptions of individual DCs’ Projects

For each position, academic and industrial hosts are listed in order of employment. For example, DC1 will start at HMGU (Germany) and then continue his/her work at AstraZeneca (Sweden). Check this order against the mobility rule.

DC1: Improving accuracy and applicability domain of models using representation learning 

Academic Host: HMGU (first 18 months), Industrial Host: AstraZeneca (second 18 months), Planned Secondment: Bayer

The MELLODDY project (IMI, 18M€) convincingly demonstrated the advantages of federated learning by developing a neural network embedding based on >2.7 billion data points trained on 40k assays from ten pharma companies. Models developed using this embedding consistently outperformed those developed using data from any single partner, both in accuracy and in the applicability domain (AD) covered. The main research question is whether modern representation learning methods (e.g., GNN, Transformers, equivariant NN), and those to be developed by other DCs, can provide better performance without the need for federated learning due to their design (i.e., transfer learning following pretraining on SMILES, quantum chemistry (QC) parameters). In-house and public (ChEMBL, Tox21) data will be used to benchmark these methods. The identified deficiencies in current approaches will be used to improve the respective methods. Specifically, the following questions will be addressed:

  1. Which representations provide the best results for the prediction of biological activity assays, toxicity assays, and ADME properties?
  2. Do models developed using representation learning have a wider AD than ML models based on descriptors, and can the AD of models be extended by pre-training on new chemical spaces?
  3. How does the quality of pre-training (i.e., using different QC methods) influence the accuracy of the developed models? 

The candidate will be working under the joint supervision of Dr. Igor Tetko of Helmholtz Munich German Research Centre for Environmental Health and Prof. Dr. Fabian Theis of Technische Universität München and HMGU for the first half of the doctoral studies (Germany, 18 months) while the second half of the doctorate will be carried out at AstraZeneca under the supervision of Dr. Ola Engkvist (Sweden, 18 months). A month-long secondment at Bayer is expected to be carried out mid-way through the PhD.

DC2: Using XAI to Develop Hybrid Chemotypes 

Academic Host: HMGU (first 12 months), Industrial Hosts: MolNet (12 months) and Pfizer (final 12 months), Planned Secondment: Bayer

Artificial intelligence (AI) approaches will be used to further develop novel hybrid chemotype rules, e.g., reactions and/or alerts which are transparently applied in regulatory settings. Traditionally, the identification of substructural fragments associated with specific chemical modes of action (e.g., molecular initiating events) has relied on human expertise in chemistry, biology, pharmacology, and toxicology. Chemotypes based on Chemical Structure & Reaction Mark-Up Language (CSRML) define hybrid rules representing structural motifs as well as atomic and molecular properties, chemical reactivities, and metabolic transformations. Although the CSRML methodology has yielded important results (e.g., ToxPrint chemotypes), development of new knowledge is resource intensive. Adoption of a machine learning approach within the hybrid chemotype definitions increases the predictive power; however, the approach still requires human expertise. This project will use XAI (existing and to be developed in the project) for chemically explainable multi-task molecular representations to generate new hybrid chemotypes that will go beyond what a human expert might initially consider but will still be interpretable and consistent with human knowledge. Due to the relatively large amount of available training data for genetic toxicity, skin sensitization and cardiotoxicity, this project will focus on developing novel hybrid chemotypes for these endpoints. The hybrid chemotypes can be tested and refined in collaboration with other doctoral candidates. 

The candidate will work under the joint supervision of Dr. Igor Tetko and Prof. Sattler of the Helmholtz Munich German Research Centre for Environmental Health for the first year of the doctoral studies (Germany, 12 months), before moving on to MolNet to work under the supervision of Dr. Chihae Yang for the second year (Germany, 12 months). The final year of the doctoral studies will be carried out at Pfizer (Germany, 12 months), under the supervision of Dr. Djork-Arné Clevert. A month-long secondment at Bayer is expected to be carried out within the first year of the PhD. 

DC3: Predicting chemical stability and degradation rates of the compounds in acidic, basic, oxidative and reductive media using combined metadynamics-MD and ML approach 

Academic Host: UCPH (first 18 months), Industrial Host: AstraZeneca (second 18 months). Planned secondment: UNISTRA

A large amount of experimental synthesis data has been generated and successfully used to build machine learning models that predict the success of experiments in a library synthesis setting. However, apart from the reactivities of the functional groups directly involved in the desired chemical reaction, other functional motifs or scaffolds may be incompatible with the reagents or reaction conditions, such as acid, base, oxidant or reductant. Compound instability or certain functional group transformations under these conditions often lead to synthesis failure and require alternative reaction conditions or the introduction of protecting groups. Metadynamics-MD simulations at the semiempirical level have been shown to estimate rates of low-barrier transformations with reasonable accuracy. Building on this, the candidate will work on developing an automatic workflow to detect potential decomposition pathways and thus assess rates of the following side reactions for the synthetic methods which are strategically important in medicinal chemistry:

  1. Base-induced degradation for SNAr and Suzuki reactions.
  2. Acid-induced degradation for acylation and electrophilic aromatic substitution.
  3. Reductant-induced degradation for reduction and reductive amination.

The resulting models will be used together with the desired synthetic transformation machine learning prediction models to improve the synthesis prediction accuracy as well as to suggest more suitable reaction conditions. As well as establishing an optimal metadynamics-based workflow for the estimation of reaction barriers associated with the various reaction systems, the candidate will explore the additional benefits of using bespoke physical models in conjunction with metadynamics-MD for capturing degradation pathways.

The candidate will work under the supervision of Prof. Jan Halborg Jensen of Københavns Universitet for the first half of the doctoral studies (Denmark, 18 months), before moving on to AstraZeneca to work under the supervision of Dr. Mikhail Kabeshov for the second half (Sweden, 18 months). A month-long secondment at UNISTRA is expected to be carried out within the second year of the PhD.

DC4: Prediction of optimal reaction conditions using Artificial Intelligence tools

Industrial Host: AstraZeneca (first 18 months), Academic Host: UNISTRA (second 18 months), Planned secondment: UCPH.

The automated in silico prediction of a synthesis plan and the subsequent synthesis of the target compound necessitate the careful selection of experimental conditions conducive to reasonable yield. This selection process is fraught with challenges, including the sparsity of reaction-condition data, absence of negative results, and a many-to-many relationship between chemical transformations and possible conditions. Given that the same reaction can potentially be executed under various conditions with only a fraction truly tested, the traditional machine-learning setup requiring a one-to-one correspondence between chemical structure and target property is not easily adaptable for optimal reaction conditions prediction.

In this project, the Likelihood Ranking Approach (LRA) will be employed to address these challenges. The LRA, an artificial neural network method, outputs a list of diverse conditions ranked by suitability for a specific chemical transformation (Int. J. Mol. Sci. 2022, 23, 248). It will be systematically applied to popular transformations in medicinal chemistry using either experimental data from the Reaxys database or in-house data collected at AstraZeneca for selected reaction classes such as amide coupling, acylation, palladium-catalyzed cross-coupling (e.g., Suzuki, Buchwald-Hartwig), and reductive amination reactions.

The candidate selected for this project will engage in research under the supervision of Dr. Thierry Kogej and Dr. Mikhail Kabeshov at AstraZeneca during the initial 18 months of doctoral studies in Sweden. Following this, the candidate will transition to the Université de Strasbourg in France, where they will work under the supervision of Prof. Alexandre Varnek for the remaining 18 months. Additionally, a month-long secondment at UCPH is planned to be undertaken during the first phase of the PhD program.

By maintaining alignment with both academic and industrial hosts, this project represents a robust approach to addressing a complex, multifaceted challenge in the field of chemical synthesis. Its success holds significant implications for both theoretical advancement and practical applications in medicinal chemistry.

DC5: Multi-task Neural Network reactivity prediction using in-silico simulations and synthesis experimental data

Academic Host: ULEI (first 18 months), Industrial Host: AstraZeneca (second 18 months). Planned secondment: UCPH

Machine learning models can vastly speed up synthesis prediction by efficiently predicting the success of experiments in a library synthesis setting. However, they need large amounts of data in every region of the investigated chemical space in order to be precise enough for any realistic use-case. Although a large amount of synthesis experimental data is available, it cannot cover the whole chemical space and thus has to be extended by simulations. At the same time, quantum chemistry approaches at the ab initio, Density Functional Theory (DFT) and semi-empirical levels have been widely used to describe mechanisms of chemical transformations and subsequently to predict chemical reaction outcomes. Although usually accurate and easy to interpret, they are often applied to small local reactivity spaces with exact sets of conditions and are too slow to be used for synthesis prediction at scale. Here, we propose a comprehensive study involving multi-task or transfer-learning experimentation to identify the shared knowledge between in silico quantum chemistry simulations on the one hand and the chemical reactivity needed for synthesis prediction on the other. As an essential part of this work, it is planned to identify the regions where available experimental data is not sufficient for reliable model building and augment them with additional in silico simulation data, thereby demonstrating the benefit of the latter for increasing the usefulness of the model. From a computer science perspective, this is a very challenging problem that recurs for all types of mixtures of real-world and simulation data. Determining where to generate additional data and how to integrate different data sources is difficult because the underlying chemical space is vast and complex, and even a distance between molecules is hard to define unambiguously.

The following research questions will be addressed:

  1. What is the benefit of using in silico simulated data on the performance and applicability domain of the model predicting reaction outcomes and suitable reaction conditions?
  2. For which reaction classes can this benefit be observed?
  3. Is there a benefit of using more expensive DFT and/or predicting structures of the reactive intermediates compared to the cheaper semi-empirical simulations of reactants and products?
  4. What is the most efficient model configuration and learning type (multi-task vs transfer learning)?
  5. Can the shared knowledge between the simulated and experimental data be rationalised and/or visualised? 

The candidate will work under the supervision of Dr. Mike Pruess at Universiteit Leiden for the first half of the doctoral studies (Netherlands, 18 months), before moving on to AstraZeneca to work under the supervision of Dr. Mikhail Kabeshov for the second half of the doctoral studies (Sweden, 18 months). A month-long secondment at UCPH is expected to be carried out within the first year of the PhD.

DC6: Advanced ML methods to predict and understand toxicity of drugs 

Academic Host: IRFMN (first 18 months), Industrial Host: Bayer (second 18 months). Planned secondment: HMGU

Predicting and understanding toxicological liabilities of small molecules is of utmost importance. The thesis will be aimed at the application of Machine Learning (ML) methods in order to develop predictive models for different endpoints of toxicity. Initially, the focus of the project will be on the cardiotoxicity, kidney and liver toxicity of drugs. State-of-the-art techniques such as Graph Neural Networks (GNNs), multitask and transfer learning will be used to predict the results of biological assays with the aim of developing models that can estimate not only the toxic effects but also the Modes of Action (MOAs) of the molecules. Explainable AI will reveal the most likely MOAs and the chemical substructures that contribute most to the identified risks. The doctoral candidate will work on building models of cardiotoxicity, liver toxicity and kidney toxicity, taking into account known and possible mechanisms, and will also explain the predictions of all models and correlate them with known issues for compounds with a known mode of action. The models will be tested on internal data at Bayer, and successful models will be deployed and shared within a multi-objective system.

The candidate will work under the supervision of Dr. Alessandra Roncaglioni at the Istituto di Ricerche Farmacologiche Mario Negri for the first half of the doctoral studies (Italy, 18 months), before moving on to Bayer to work under the supervision of Dr. Maria Garcia de Lomana for the second half of the doctoral studies (Germany, 18 months). A month-long secondment at HMGU is expected to be carried out within the first year of the PhD.

DC7: Generative language models for the design of tailored chemical transformations 

Academic Host: CSIC (first 18 months), Industrial Host: Bayer (second 18 months). Planned secondment: EPFL

Enzymes are attractive nanoscopic materials capable of accelerating chemical transformations by several orders of magnitude while working in sustainable, mild conditions. Understandably, enormous research efforts have been put into the engineering of enzymes that catalyze chemical reactions in a greener and cheaper fashion. The thesis will be aimed at developing state-of-the-art generative protein language models, in particular, exploiting the enormous advances of translation machines. One focus will be on understanding what those models learn and exploiting these rational principles to propose improved enzymes.

The DC will train seq2seq models for the design of tailored enzymes. The models’ encoder input will be chemical reaction representations, while the decoder will output enzymes that catalyze those reactions. This will make use of already established techniques, such as those used in the Molecular Transformer model, but will also enormously benefit from the advances in learning chemical representations (which other DCs will work on), while providing another benchmark for representation embedding being developed by other DCs. The hypotheses will be tested with experimental biophysical characterizations both at Bayer and CSIC, for the specific case of attractive but costly chemical reactions. This framework will also be useful for biocatalysis, since it can provide new scalable routes for new-to-nature de novo enzyme design.

Furthermore, the DC will apply explainable/visualization AI techniques to understand what the models learn, how fine-tuning modifies this learning, and to identify the key structural components of a specific enzyme-driven reaction. Finally, the student will work on building a visual analysis tool (similar to exBERT) to facilitate visualization and explainability of methods developed during the project, which will be publicly released for the benefit of other DCs and the wider community. 

The candidate will work under the supervision of Dr. Noelia Ferruz at the Agencia Estatal Consejo Superior De Investigaciones Científicas for the first half of the doctoral studies (Spain, 18 months), before moving on to Bayer to work under the supervision of Dr. Santiago Villalba for the second half of the doctoral studies (Germany, 18 months). A month-long secondment at EPFL is expected to be carried out within the second year of the PhD. 

DC8: Modeling drug response in image-based screens as function of chemical space 

Academic Host: TUM (first 18 months), Industrial Host: Bayer (second 18 months). Planned secondment: Pfizer

Existing phenotypic screening techniques like Cell Painting enable the generation of immense datasets of images, showing the response of cell lines to chemical perturbations. This project will aim at combining learned representations from the screening images (using modern ML techniques like self-supervised learning) and from the known chemical structure of tested compounds (using graph neural networks, for example) in an actionable way. It will be based on pre-existing work in Theis lab, with the key novelty being the use of advanced chemical representations, which will be developed by other DCs. The fellow will work on existing public and private datasets of hundreds of thousands of perturbations. 

Main objectives are:

  1. Learn an image-based “morphometry latent space” by adapting and extending existing methods
  2. Use feature attribution methods to make the latent space accessible and gain insights about the learned features from the morphological embedding
  3. Allow for better integration of assay data coming from different laboratories and batches by creating a condition-invariant phenotype representation
  4. Incorporate information from the chemical perturbation by providing meaningful and explainable encodings for the drugs
  5. Build a conditional generative model that can sample from the chemical space to obtain a desired phenotype

The candidate will work under the supervision of Prof. Fabian Theis at the Technische Universität München / Helmholtz Munich German Research Centre for Environmental Health for the first half of the doctoral studies (Germany, 18 months), before moving on to Bayer to work under the supervision of Dr. Paula A. Marin Zapata for the second half of the doctoral studies (Germany, 18 months). A month-long secondment at Pfizer is expected to be carried out mid-way through the PhD.

DC9: Explainable active learning for multi-objective de novo design 

Academic Host: TU/e (first 18 months), Industrial Host: Sanofi (second 18 months). Planned secondment: Bayer

Generative deep learning methods can design novel bioactive compounds from scratch and have great potential for chemical space exploration. These methods can propel active learning, whereby sets of molecules are selected, synthesized, and tested with the goal of iteratively improving the model while discovering innovative bioactive chemotypes. Generative AI can produce thousands of molecular designs on demand; thus, selecting molecules for synthesis is far from trivial. This selection crucially impacts the success of active learning, where the initial decisions will determine the following iterations and affect the overall success of the campaign. There are many reasons to (de)prioritize the synthesis of a compound (e.g., synthetic accessibility, physicochemical properties, or intuition on structure-activity relationship), and it is complex to simultaneously consider all of them. In this context, XAI has an untapped potential to provide “human-interpretable” explanations about multi-objective molecular information that is relevant for follow-up experiments. ‘Peeking into the black box’ can (a) increase the acceptance of the AI propositions by synthetic chemists, (b) augment human intuition in multi-objective design, and (c) provide new insights into structure-activity relationships. Here, XAI approaches will be designed to highlight molecular characteristics of a design based on an array of properties and criteria. The explanations will be tailored to industry-standard active learning and will be (a) provided in the form of natural language, and (b) obtained in a data-driven manner, by considering different evaluation grids (e.g., bioactivity, position/chemical nature of the modifications, synthetic accessibility, relationship with traditional medicinal chemistry moves).
This project will benefit from (and strengthen) research on advanced and multi-task chemical representations, toxicity models, and multitask learning, which will be carried out by other DCs. The developed XAI approaches will empower medicinal chemists by giving them the possibility to make an informed choice between the different possible sets of de novo designs for active learning, and to learn from underlying non-linear patterns captured by AI.

The candidate will work under the supervision of Prof. Francesca Grisoni at the Technische Universiteit Eindhoven for the first half of the doctoral studies (The Netherlands, 18 months), before moving on to Sanofi to work under the supervision of Dr. Marc Bianciotto for the second half of the doctoral studies (France, 18 months). A month-long secondment at Bayer is expected to be carried out mid-way through the PhD.

DC10: Simple quantum descriptors for actionable insights on ADMET-related properties 

Academic Host: ENS-PSL (first 18 months), Industrial Host: Sanofi (second 18 months). Planned secondment: IRFMN

Molecular orbital theory, usually based on simplified models like Hückel theory, is part of the chemist's curriculum because it can be used to explain reactivity, structure, stability and many other chemical phenomena. Yet, because of their simplicity, these models can be inaccurate approximations. Quantum chemistry methods, by contrast, are very accurate nowadays but are slow and costly to compute for screening purposes. The goal of this project is to use approximate molecular orbital theories as descriptors for machine learning, and these descriptors can, in turn, be used to explain the trained model's predictions. Among others, these QM-derived descriptors can be atomic charges, the HOMO-LUMO gap or electron affinities. These descriptors can be calculated from model Hamiltonians, Hückel theory, semi-empirical methods or from non-self-consistent Kohn-Sham calculations using machine-learned electronic densities. They will be used to build explainable and predictive models of CYP450 metabolism and of reactivity with GSH. The project will build upon recent well-performing approaches, with the objective of improving the metabolic stability prediction of closely related compounds in order to provide actionable insights for modulating the metabolic liabilities of drug candidates. By using quantum properties that are closely related to the physico-chemical phenomena of interest, the multicollinearity issues that limit the performance and explainability of many machine-learning approaches will be overcome.
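To give a flavour of what such a simple quantum descriptor looks like, the sketch below computes the Hückel HOMO-LUMO gap of a linear conjugated chain from the textbook closed-form orbital energies E_k = α + 2β cos(kπ/(n+1)). It is only an illustration, not the project's method: the function names are hypothetical, the values α = 0 and β = -1 are arbitrary energy units chosen for the example, and the formula applies only to unbranched chains with one π electron per atom.

```python
import math


def huckel_chain_energies(n_atoms, alpha=0.0, beta=-1.0):
    """Hückel pi-orbital energies of a linear chain of n_atoms centres.

    Uses the closed-form result E_k = alpha + 2*beta*cos(k*pi/(n+1)),
    k = 1..n. alpha and beta are illustrative values in arbitrary units.
    """
    energies = [
        alpha + 2.0 * beta * math.cos(k * math.pi / (n_atoms + 1))
        for k in range(1, n_atoms + 1)
    ]
    return sorted(energies)


def homo_lumo_gap(n_atoms, alpha=0.0, beta=-1.0):
    """HOMO-LUMO gap for an even-membered chain, one pi electron per atom."""
    if n_atoms < 2 or n_atoms % 2 != 0:
        raise ValueError("this sketch assumes an even number of atoms >= 2")
    energies = huckel_chain_energies(n_atoms, alpha, beta)
    homo = energies[n_atoms // 2 - 1]  # highest doubly occupied orbital
    lumo = energies[n_atoms // 2]      # lowest unoccupied orbital
    return lumo - homo


# Gap in units of |beta| for ethylene, butadiene and hexatriene.
gaps = {n: homo_lumo_gap(n) for n in (2, 4, 6)}
```

The gap shrinks as the chain grows (2.0 for ethylene, about 1.24 for butadiene in units of |β|), which is the kind of cheap, chemically interpretable trend such descriptors can feed into a machine-learning model.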

The candidate will work under the supervision of Prof. Rodolphe Vuilleumier at the École normale supérieure for the first half of the doctoral studies (France, 18 months), before moving on to Sanofi to work under the supervision of Dr. Marc Bianciotto for the second half of the doctoral studies (France, 18 months). A month-long secondment at IRFMN is expected to be carried out within the first year of the PhD.

DC11: Multi-instance explainable learning for decoding stereo-dependent biological effects 

Academic Host: UNISTRA (first 18 months), Industrial Host: Sanofi (second 18 months). Planned secondment: ENS-PSL

Stereoisomerism is widely recognized as a key feature in the rationalization of interactions between chemical entities in biological processes. In this regard, the thalidomide incident is often cited to illustrate the dramatic consequences of the presence of the wrong stereoisomer. However, stereoisomerism is a typical 3D feature that is largely missing from 2D QSAR approaches. 3D QSAR is more appropriate but suffers from the uncertainty arising from the somewhat arbitrary choice of the precise geometry of a molecule. To address this, we propose to use the multi-instance learning approach. This machine-learning paradigm takes advantage of ensembles of equivalent versions of each data point, here the multiple computed conformers of molecules, generated, for instance, using the MD studies carried out by other DCs. These methods work by weighting the conformers, and the resulting weights are interpretable: for a given molecule, it is possible to retrieve the specific conformations that explain the prediction. It is then possible to compare structural data in order to better validate models or cross the 3D QSAR predictions with other 3D methods (docking, pharmacophore) to make better decisions. The 3D quantum descriptors proposed by DC10 are particularly relevant here. The method will be applied to topoisomer datasets and publicly available datasets of specific angles of rotation.
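The conformer-weighting idea can be illustrated with a minimal sketch of attention-style pooling, one common multi-instance learning variant. It is an assumption that this variant resembles what the project will use; the function names, the linear scoring, and the weight vectors are all hypothetical, and in practice the weights would be learned from data.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_bag(conformer_features, instance_weights, attention_weights):
    """Predict a molecule-level value from a bag of conformer descriptors.

    Each conformer gets its own score and an attention weight; the bag
    prediction is the attention-weighted average of conformer scores.
    Returns (prediction, attention) so the most influential conformers
    can be inspected afterwards.
    """
    scores = [dot(x, instance_weights) for x in conformer_features]
    attention = softmax([dot(x, attention_weights) for x in conformer_features])
    prediction = sum(a * s for a, s in zip(attention, scores))
    return prediction, attention


# Three hypothetical conformers described by two made-up 3D descriptors.
conformers = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
prediction, attention = predict_bag(conformers, [2.0, -1.0], [3.0, 0.0])
most_influential = max(range(len(attention)), key=attention.__getitem__)
```

Because the attention weights sum to one, the conformer with the largest weight (here the first one) can be read off directly as the geometry that contributes most to the prediction.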

The candidate will work under the supervision of Prof. Alexandre Varnek and Dr. Gilles Marcou at the Université de Strasbourg for the first half of the doctoral studies (France, 18 months), before moving on to Sanofi to work under the supervision of Dr. Marc Bianciotto for the second half of the doctoral studies (France, 18 months). A two-month-long secondment at ENS-PSL is expected to be carried out within the second year of the PhD.

DC12: Learning chemically explainable multi-task molecular representations 

Academic Host (employer): USI (36 months). Industrial Secondments: Pfizer and Bayer. The funding for this DC will come from a Swiss funding body rather than the EU.

The application of artificial intelligence techniques for molecular property and reactivity prediction is still hindered by the fact that their predictions are rarely aligned with a chemist’s intuition. Current state-of-the-art models for a number of chemical modelling tasks are mostly adapted from natural language processing. Their inference procedure and the intrinsic lack of chemical interpretability are a source of suspicion from the chemists. Improvements in this direction will help professional chemists in the pharma industry leverage the full potential of AI models. Recent work has shown that it is possible to learn molecular representations possessing higher interpretability by design. The DC will advance this research direction by building neural architectures to learn molecular representations which are chemically interpretable, scalable and suitable for a broad range of modelling tasks. The DC will achieve this by using graph neural networks, group-equivariant neural networks and incorporating the physical and chemical priors into the architectures. 

The main objectives of the work will be:

  1. Development of a neural model for chemical property prediction which will possess intrinsic chemical interpretability.
  2. Benchmarking the model on cardiotoxicity prediction and molecular property prediction tasks (in collaboration with other DCs). 

The candidate will work under the supervision of Prof. Jürgen Schmidhuber at the Università Della Svizzera for the first half of the doctoral studies (Switzerland, 18 months), before moving on to Pfizer to work under the supervision of Dr. Djork-Arné Clevert for the second half of the doctoral studies (Germany, 18 months). A month-long secondment at Bayer is expected to be carried out mid-way through the PhD.

DC13: Explainable chemical representations and models for reaction outcome predictions 

Academic Host: EPFL (36 months), Industrial secondment: Pfizer. The funding for this DC will come from a Swiss funding body rather than the EU.

Synthesis is one of the key bottlenecks in the molecular design cycle. Recent advancements in machine learning for chemical reactions have made it possible to accurately predict the outcome of chemical reactions in well-defined reaction spaces. First steps towards understanding machine learning models for chemical reactions include unsupervised atom-mapping, which provides information on how atoms rearrange during a reaction. However, because the training is unsupervised, it is challenging to improve the models when mistakes are detected. New training methods are required to go further and achieve higher-quality atom-mapping. Recently, diffusion models have shown success in computer vision and molecule generation tasks. Diffusion models learn to recover data through a de-noising process. If the atom-mapping is known, the reaction prediction task can be framed as a de-noising process from precursors to products. We will develop explainable diffusion reaction prediction models from high-quality atom-mapped reactions, which will be tested in collaboration with DC4 to improve the prediction of reaction outcomes.

The candidate will work under the supervision of Prof. Philippe Schwaller at the Ecole Polytechnique Fédérale De Lausanne for the first half of the doctoral studies (Switzerland, 18 months), before moving on to Pfizer to work under the supervision of Dr. Djork-Arné Clevert for the second half of the doctoral studies (Germany, 18 months).

DC14: Development of XAI models for beyond rule of 5 chemical space molecules 

Academic Host: KIT (36 months). The funding for this DC will come from a Korean funding body rather than the EU.

Artificial intelligence (AI) models have been widely applied to the design of small-molecule drugs. Although more opportunities await outside the small-molecule-centered chemical space, a lack of cheminformatics tools hinders the application of AI to molecules beyond the traditional druggable chemical space, such as natural products and their semi-synthetic derivatives, covalent inhibitors, and metal complexes, i.e., beyond the traditional rule of 5 chemical space (bRo5). Explainable AI (XAI) machine learning (ML) models developed for traditional chemical space will need to be extended to be applicable to bRo5 chemical space. ML methods based on Natural Language Processing (NLP), which analyze representations of molecules as text (SMILES), are gaining popularity and success in the modelling of traditional chemical spaces. In this project, in addition to SMILES, we will explore novel line notations for bRo5 molecules based on SMILES extensions, such as BILN, BigSMILES and SELFIES specifically developed to handle large molecules, with respect to their efficiency for the development of ML models for bRo5 using representation learning. Their accuracy and expandability will be tested for the prediction of physicochemical properties, biological activities, and toxicities of molecules. The most accurate models will be used within reinforcement learning to design new active molecules for drug discovery. This research will make it possible to exploit these novel chemical spaces, which are gaining popularity in the pharma industry.
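For readers unfamiliar with the term, the rule of 5 referred to above is Lipinski's set of four thresholds: molecular weight of at most 500 Da, logP of at most 5, at most 5 hydrogen-bond donors and at most 10 hydrogen-bond acceptors. The sketch below counts violations from precomputed descriptors. The `Descriptors` class, the helper names, the descriptor values and the choice to flag a molecule as bRo5 when it has more than one violation are illustrative assumptions, not definitions taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float        # molecular weight in Da
    logp: float              # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(d):
    """Return the names of the Lipinski thresholds that the molecule exceeds."""
    checks = {
        "mol_weight > 500": d.mol_weight > 500,
        "logp > 5": d.logp > 5,
        "h_bond_donors > 5": d.h_bond_donors > 5,
        "h_bond_acceptors > 10": d.h_bond_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


def is_bro5(d, max_violations=1):
    """Flag a molecule as bRo5 if it exceeds the allowed violation count.

    Treating "more than one violation" as bRo5 is an assumption made
    for this sketch; other working definitions exist.
    """
    return len(ro5_violations(d)) > max_violations


# Illustrative, made-up descriptor values.
small_molecule = Descriptors(180.2, 1.2, 1, 4)
macrocycle = Descriptors(1202.6, 3.0, 5, 12)

print(ro5_violations(small_molecule), is_bro5(small_molecule))
print(ro5_violations(macrocycle), is_bro5(macrocycle))
```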

The candidate will work under the supervision of Prof. Hyun Kil Shin at the Korea Institute of Toxicology for the full duration of the doctoral studies.