From Data to Discovery in Biology and Health

ELLIS Summer School on Machine Learning in Healthcare and Biology – 11 to 13 June 2024  

Huw Day, Richard Lane, Christianne Fernée and Will Chapman, data scientists from the Jean Golding Institute, attended the European Laboratory for Learning and Intelligent Systems (ELLIS) Summer School on Machine Learning for Healthcare and Biology at The University of Manchester. Over three days, they learned about cutting-edge developments in machine learning for biology and healthcare from academic and industry leaders.

A major theme of the event was how researchers create real knowledge about biological processes at the convergence of deep learning and causal inference techniques. By combining the two, machine learning can help bridge the gap between well-controlled lab experiments and the messy real world of clinical treatments.

Advancing Medical Science using Machine Learning

Huw’s favourite talk was “From Data to Discovery: Machine Learning’s Role in Advancing (Medical) Science” by Mihaela van der Schaar, who is a Professor of ML, AI and Medicine at the University of Cambridge and Director of the Cambridge Centre for AI in Medicine.

Currently, machine learning models are excellent at detecting associations between variables. However, Mihaela argued that we need to go beyond this and understand the variables and their relationships, so that we can discover so-called “governing equations”. In the past, human experts have relied on domain knowledge, intuition and insight to extract governing equations from the underlying data.

The speaker’s research group have been working to deduce different types of underlying governing equations from black box models. They have developed techniques to extract explicit functions as well as more involved functional equations and various types of ordinary and partial differential equations.

Slide 39 from Mihaela van der Schaar’s talk. On the left, three graphs show the effects of chemotherapy on tumour volume over time for the observed data, D-CODE and SR-T. On the right are the derived governing equations for the D-CODE and SR-T plots.
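To make the idea of recovering a governing equation from data more concrete, here is a minimal sketch in Python. It is not D-CODE, SR-T or any other method from the talk: it swaps in a much simpler, well-known technique (sparse regression over a small library of candidate terms, in the spirit of SINDy). The logistic growth model, its parameters, the noise level and the threshold are all assumptions chosen purely for illustration.

```python
import numpy as np

def simulate_logistic(r=0.8, K=5.0, x0=0.2, t_end=10.0, n=400):
    """Closed-form solution of dx/dt = r*x*(1 - x/K) on a regular time grid."""
    t = np.linspace(0.0, t_end, n)
    x = K / (1.0 + (K / x0 - 1.0) * np.exp(-r * t))
    return t, x

def sparse_fit(library, target, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit the rest."""
    coefs = np.linalg.lstsq(library, target, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coefs) < threshold
        coefs[small] = 0.0
        if (~small).any():
            coefs[~small] = np.linalg.lstsq(library[:, ~small], target, rcond=None)[0]
    return coefs

# Hypothetical "observed" trajectory: the true curve plus a little measurement noise.
rng = np.random.default_rng(0)
t, x_true = simulate_logistic()
x_obs = x_true + rng.normal(scale=1e-4, size=x_true.shape)

# Estimate the time derivative numerically, then regress it on candidate terms.
dxdt = np.gradient(x_obs, t)
terms = ["1", "x", "x^2", "x^3"]
library = np.column_stack([np.ones_like(x_obs), x_obs, x_obs**2, x_obs**3])
coefs = sparse_fit(library, dxdt)

# The true equation is dx/dt = r*x - (r/K)*x^2, i.e. 0.8*x - 0.16*x^2.
equation = " + ".join(f"{c:.3f}*{name}" for c, name in zip(coefs, terms) if c != 0.0)
print("dx/dt =", equation)
```

The thresholding step is what turns an ordinary regression into a readable equation: terms with negligible coefficients are dropped, so only the candidate terms that the data support remain. Real clinical data are far noisier and more sparsely sampled than this toy trajectory, which is part of what makes the research problem hard.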

The implications for healthcare applications are immense if these methods can be reliably integrated into our existing machine learning analysis. On a more philosophical note, it raises interesting questions about how many systems in life sciences (and beyond) have governing equations and what proportion of these equations are possible to discover.

Gaussian processes and simulation of clinical trials

A highlight for Will and Christianne was the informative talk from Ti John, a practical introduction to Gaussian Processes (GPs) which furthered our understanding of how GPs learn non-linear functions from a dataset. By assuming that your data are a collection of realisations of some unknown random function (or combination of functions), and with a judicious choice of kernel, Gaussian Process modelling allows the estimation of both long-term trends and short-term fluctuations from noisy data. The presentation was enhanced with an interactive visualisation of GPs, alongside an analysis of how the blood glucose response to meals changes after bariatric surgery.

Another highlight was Alejandro Frangi’s talk on in silico clinical trials, in which he described how mechanistic modelling (like fluid dynamic simulations of medical devices) can be combined with generative modelling (to synthesise a virtual patient cohort) to explore how medical treatments may perform in patients who would never have qualified for inclusion in a real randomised controlled trial.
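The point about kernel choice in Gaussian Process modelling can be illustrated with a small NumPy sketch. It is not taken from the talk: the synthetic data, the two squared-exponential kernels and their hand-picked lengthscales and variances are all assumptions, and in practice these hyperparameters would usually be learned from the data rather than fixed.

```python
import numpy as np

def rbf(a, b, lengthscale, variance):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic data (illustrative only): slow trend + fast fluctuation + noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 120)
true_trend = 2.0 * np.sin(0.4 * x)
true_fluct = 0.5 * np.sin(5.0 * x)
noise_sd = 0.2
y = true_trend + true_fluct + rng.normal(scale=noise_sd, size=x.shape)

# Additive kernel: one long-lengthscale term and one short-lengthscale term.
k_long = rbf(x, x, lengthscale=3.0, variance=4.0)
k_short = rbf(x, x, lengthscale=0.3, variance=0.25)
k_y = k_long + k_short + noise_sd**2 * np.eye(x.size)

# Posterior mean of each additive component at the training inputs.
alpha = np.linalg.solve(k_y, y)
trend_mean = k_long @ alpha
fluct_mean = k_short @ alpha
signal_mean = trend_mean + fluct_mean

print("trend RMSE: ", rmse(trend_mean, true_trend))
print("signal RMSE:", rmse(signal_mean, true_trend + true_fluct))
```

Because the kernel is a sum of two terms, the posterior mean splits into one component per term, which is how a single model can report a long-term trend and short-term fluctuations separately.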

Causality

Richard’s favourite talk was by Cheng Zhang from Microsoft Research on causal models, titled “Causality: From Statistics to Foundation Models”. Cheng highlighted that an understanding of causality is vital for intelligent systems that have a role in decision-making. This area is at the cutting edge of research in AI for biology and healthcare: understanding consequences is necessary for any model that is expected to propose interventions. While association (statistics) is still the main use case for AI, such models have no model of the “true” nature of the world that they are reflecting, which leads to hallucinations such as images with too many fingers or nonsensical text generation. One recipe proposed by Cheng for building a causally-aware model was to:

  • Apply an attention mechanism/transformer to data so that the model focuses only on the most important parts
  • Use a penalised hinge loss: the model should learn from its mistakes, and should account for some mistakes being worse than others
  • Read off optimal balancing weights and the causal effect of actions: after training, we need to investigate the model to understand the impact of different actions

In essence, this is a blueprint for building a smart system that can look at a lot of complex data, decide what’s important, learn from its mistakes efficiently and help us understand the impact of different actions. As an example, we might be interested in how the amount of time spent studying affects students’ grades. At the outset, it’s hard to say whether studying more causes better grades, because the students who study more might also have other good habits, have access to more resources, etc. Such a model would try to balance these factors and give a clearer picture of what causes what. Effectively, it would try to find the true effect of studying more on a student’s grade, after removing the influence of other habits.
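The studying example can be sketched with simulated data. This is not the transformer-based recipe from the talk: it swaps in classical inverse propensity weighting, one of the simplest ways to construct balancing weights. Every number below (the 5-point true effect of studying, the 10-point effect of resources, the probabilities) is an assumption made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Hypothetical confounder: access to resources makes students both more
# likely to study more and more likely to get higher grades.
resources = rng.binomial(1, 0.5, size=n)
p_study = np.where(resources == 1, 0.8, 0.2)
studies_more = rng.binomial(1, p_study)

# Assumed data-generating process: the true causal effect of studying is +5.
grade = 60 + 5 * studies_more + 10 * resources + rng.normal(scale=5, size=n)

# Naive comparison: mixes the effect of studying with the effect of resources.
treated = studies_more == 1
naive_effect = grade[treated].mean() - grade[~treated].mean()

# Estimate the propensity to study more within each level of the confounder.
propensity = np.empty(n)
for level in (0, 1):
    mask = resources == level
    propensity[mask] = studies_more[mask].mean()

# Balancing weights: up-weight students who are under-represented in their group.
weights = np.where(treated, 1.0 / propensity, 1.0 / (1.0 - propensity))
ipw_effect = (
    np.average(grade[treated], weights=weights[treated])
    - np.average(grade[~treated], weights=weights[~treated])
)

print(f"naive difference in grades: {naive_effect:.2f}")
print(f"weighted estimate:          {ipw_effect:.2f}")
```

In this simulation the naive difference overstates the benefit of studying because well-resourced students are over-represented among those who study more. The weights rebalance the two groups so that they have the same mix of resources, which is the same goal the balancing weights in the recipe above serve, only here with a single, fully observed confounder.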

This behaviour is desired for the complex models being developed for healthcare and biology; for example, we may be interested in engineering CRISPR interventions to make crops more resilient to climate change or developing brain stimulation protocols to help with rehabilitation. A model proposing interventions for these should have a causal understanding of which genes impact a trait, or how different patterns of brain activity affect cognitive function.

Recordings of all the talks can be found here.