Chakaya Nyamvula’s JGI Placement 

Hi, I’m Chakaya. I am currently pursuing my MSc in AI and Data Science at Keele University and working as a Business Intelligence Analyst at iLabAfrica at Strathmore University in Nairobi, Kenya. This summer, thanks to the partnership between iLabAfrica and JGI, I had an amazing opportunity to work with JGI for my Master’s placement. I wanted to immerse myself in a research environment and connect with people in academia to help figure out my future career path. Working under the guidance of Dr Huw Day, I gained valuable insights into the world of research and expanded my professional network, all while experiencing life in the UK. 

Chakaya Nyamvula in front of a body of water
Chakaya Nyamvula, JGI Intern

What was the project about? 

Previously for a JGI funded Seedcorn project Mark Mumme, Eleanor Walsh, Dan Smith, Huw Day, and Debbie Johnson had surveyed researchers on their thoughts on how they might want to use synthetic data to help with their research. 

Synthetic data is when you take an existing dataset and create a synthetic (i.e. fake) version of it. You might want to do this so you can share something that looks like the data but preserves the privacy of individuals in it, whilst still having a flavour of what the data looks like and what statistical patterns might be present within it. This is useful for writing data pipelines whilst you go through necessary ethics checks to access sensitive data, amongst other things. 

For my summer placement with JGI, I worked with the MIMIC IV dataset of electronic health records and explored methods of generating synthetic versions of some of this data. It was also important to understand how you could measure or benchmark how successful your synthetic data generation has been, based on how well you had preserved privacy or how well the statistics of your synthetic data emulated those of your real data. 

What else did you do as part of your placement? 

Alongside my main work, I attended JGI Data Science meetings and learnt about some of the data science projects at the JGI including a project on antimicrobial resistance and another on 3D image analysis of CT scanned zebrafish to study bone development. 

For some of the more computationally demanding aspects of the project, I got taught how to make use of the JGI’s server (known within the office as “Jeeves”). 

I also had the opportunity to meet some PhD students at the University of Bristol, ask them about their research, and get advice on applying for PhDs in the future. 

Left to right, Huw Day, Elena Fillola Mayoral, Yujie Dai and Chakaya Nyamvula sat at a table at an ice cream shop
From left to right: Huw Day (JGI Data Scientist), Elena Fillola Mayoral (PhD student in AI for Climate), Yujie Dai (CDT in Digital Health) and Chakaya Nyamvula (JGI Intern) discussing PhDs over ice cream

What did you learn about? 

One deep learning method we used was something called a Generative Adversarial Network (GAN). Prior to this project, I had never worked with GANs before, so diving into this methodology was both challenging and exciting.  

A GAN works by having two competing neural networks, a generator and a discriminator. The generator’s job in this case was to take the original data and generate synthetic versions of that data. The discriminator’s job is to try and spot the difference between the real and the synthetic data that has been generated. One of the advantages of such a system is that you have two outputs: 1) a neural network which can generate synthetic data based on some training data and 2) a second neural network which can discriminate between real and synthetic data. This has advantages for applications where people might maliciously generate synthetic data, for example deep fake images. 

A good analogy for GANs is two people learning chess by playing against one another. If both start at similar skill levels, then as one person improves, the other slowly improves too. If you lose a chess game, you know you made a mistake and you might be able to work out how to improve for the next time. If you win, then you know you were doing something right.  

However, if you pit a chess grandmaster against a complete beginner, then the beginner will lose every time and will struggle to understand where they are going wrong, making it difficult to improve. Because the task of making synthetic data is quite complicated, when we began the process of training the GAN, the generator was frequently getting it wrong and wasn’t really able to figure out how to improve. 

To combat this, we did two things. First, you can handicap the discriminator a bit to give the generator a head start (imagine making your grandmaster play blindfolded). This helped, but still wasn’t enough. 

One of the pair plots showing generated vs real data a epoch 0
One of the pair plots showing generated vs real data a epoch 25000
Pair plots showing how well the real and the synthetic data matches by comparing each column. Real data is in blue, synthetic data is in red. The diagonal plots show histogram density plots of each column and how it compares between real and synthetic data. The off diagonal show scatter plots between pairs of variables. The left pair plot shows the output at the start of training, where the synthetic generator just randomly samples a scatter of points. You can see that this is not a good match for the original data. The right pair plot shows that after training, the generator does a lot more of a convincing job at emulating the real data. It is still not perfect, but it is particularly good at identifying clumps of data.

Secondly, you can start to think about how you inform your neural networks whether or not they were successful. Imagine if instead of “win” or “lose” as your outcome of the chess games, you got a measure of how well you performed, say a measure of how many good moves you made. With this more specific information, it becomes easier to decipher why you lost and how you might improve.  

To Be Continued? 

To finish my placement, I shared my experience with my placement supervisors at Keele University through a presentation and a report. I then had the opportunity to present my work to the Data Science Seminar at the University of Bristol, with several lecturers from the data science community in attendance, alongside JGI Data Scientists and some friends I made along the way.  

Additionally, all the code we worked on can be found in a public GitHub repository for other researchers to use and experiment with can be found on Chakaya’s Github.

Chakaya Nyamvula and Huw Day standing in front of a projector presenting at the Data Science Seminar. The projector has a slide on it that says 'Introduction to synthetic data' 
Chakaya Nyamvula (left) and Huw Day (right) presenting at the Data Science Seminar 

Reflecting on my placement at JGI, I can confidently say it was an incredible learning experience. I had the privilege of working with a fantastic supervisor, Dr Huw Day, who provided guidance throughout the project. Co-working with the talented data scientists at JGI was both inspiring and rewarding, and I thoroughly enjoyed networking with professionals in academia. The challenges I faced particularly working with GANs for the first time, pushed me to grow and expand my skill set.  Overall, this experience not only deepened my technical expertise but also solidified my interest in pursuing a career that bridges research and data science. 

New Turing Liaison Officers join the JGI team

As an active member of the Turing University Network, we have appointed a Turing Liaison Manager and two Turing Liaison Academics to support and enhance the partnership between Alan Turing Institute and the University of Bristol. These roles will be focusing on increasing engagement from Turing, developing external and internal networks around data science and AI, and supporting relevant interest groups, Enrichment students and Turing Fellows at the University of Bristol.

Turing Liaison Manager, Isabelle Halton and Turing Academic Liaisons, Conor Houghton and Emmanouil Tranos, are keen to build communities around data science and AI, providing support to staff and students who want to be more involved in Turing activity.

Isabelle previously worked in the Professional Liaison Network in the Faculty of Social Sciences and Law. She has extensive experience in building relationships and networks, project and event management and streamlining activities connecting academics and external organisations.

Conor is a Reader in the School of Engineering Mathematics and Technology, interested in linguistics and the brain. Conor is a Turing Fellow and a member of the TReX, the Turing ethics committee.

Emmanouil is currently a Turing Fellow and a Professor of Quantitative Human Geography, specialising primarily on the spatial dimensions of the digital economy.


If you’re interested in becoming more involved with Turing activity or have any questions about the partnership, please email Isabelle Halton, Turing Liaison Manager via the Turing Mailbox

Ask JGI Student Experience Profiles: Rachael Laidlaw

Rachael Laidlaw (Ask-JGI Data Science Support 2023-24) 

I first came into contact with the Jean Golding Institute last year at The Alan Turing Institute’s annual AI UK conference in London, and then again in the early stages of the DataFace project in collaboration with Cheltenham Science Festival. This meant that before I officially joined the team back in October, I already knew what a lovely group of people I’d be getting involved with! Having nice colleagues, however, was not my only motivation for applying to be an Ask-JGI student. On top of that, I’d decided that whilst starting out in my ecological computer-vision PhD niche, I didn’t want to forget all of the statistical skills that I’d developed back in my MSc degree. Plus, it sounded really fun to keep myself on my toes by exercising my mind tackling a variety of data-oriented requests from across the university’s many departments. 

Rachael Laidlaw in centre with two JGI staff members to the left and one JGI staff member to the right pointing towards a Data pin board at the JGI stall
Rachael Laidlaw (centre), second-year PhD student in Interactive Artificial Intelligence, and other JGI staff members at the JGI stall

During the course of my academic life, I’ve taken the plunge of changing disciplines twice, moving from pure mathematics to applied statistics and then again to computer science, and I liked the idea of supporting others to potentially do the same thing as they looked to enhance their work by delving into data. Through Ask-JGI, I kept my weeks interesting by having something other than my own research to sometimes switch my focus to, and it felt very fulfilling to be able to offer useful technical advice to those who were in the same position that I myself had been in not so long ago too! I therefore got stuck in with anything and everything, from training CNNs for rainfall forecasting or performing statistical tests to compare the antibiotic resistance of different bacteria, to modelling the outcomes of university spinouts or advising on the ethical considerations and potential bias present when designing and deploying a questionnaire-based study. And, of course, by exposing myself to these problems (alongside additional outreach initiatives and showcase events), I also learned a lot along the way, both from my own exploration and from the rest of the team’s insights. 

One especially exciting query revolved around automating the process of identifying from images which particular underground printing presses had been used to produce various historical political pamphlets, based on imperfections in the script. This piqued my interest immediately as it drew parallels with my PhD project, highlighting the copious amount of uses of computer vision and how it can save us time by speeding up traditionally manual processes: from the monitoring of animal biodiversity to carrying out detective work on old written records. 

All in all, this year has broadened my horizons by giving me great consultancy-style work experience through the opportunity to share my expertise and help a wide range of researchers. I would absolutely encourage other curious PhD students to apply and see what they can both give to and gain from the role! 

From Data to Discovery in Biology and Health

ELLIS Summer School on Machine Learning in Healthcare and Biology – 11 to 13 June 2024  

Huw Day, Richard Lane, Christianne Fernée and Will Chapman, data scientists from the Jean Golding Institute, attended the European Laboratory for Learning and Intelligent Systems (ELLIS) Summer School on Machine Learning for Healthcare and Biology at The University of Manchester. Across three days they learned about cutting edge developments in machine learning for biology and healthcare from academic and industry leaders.

A major theme of the event was how researchers create real knowledge about biological processes at the convergence of deep learning and causal inference techniques. Through this machine learning can be used to bridge the gap between well-controlled lab experiments and the messy real world of clinical treatments.

Advancing Medical Science using Machine Learning

Huw’s favorite talk was “From Data to Discovery: Machine Learning’s Role in Advancing (Medical) Science” by Mihaela van der Schaar who is a Professor of ML, AI and Medicine at the University of Cambridge and Director for the Cambridge Centre for AI in Medicine.

Currently, machine learning models are excellent at deducing the association between variables. However, Mihaela argued that we need to go beyond this to understand the variables and their relationships so that we can discover so-called “governing equations”. In the past, human experts have discovered governing equations with domain knowledge, intuition and insight to extract equations from underlying data.

The speaker’s research group have been working to deduce different types of underlying governing equations from black box models. They have developed techniques to extract explicit functions as well as more involved functional equations and various types of ordinary and partial differential equations.

On the left are three graphs showing temporal effects of chemotherapy on tumour volume for observed data, D-CODE and SR-T. On the right is the actual equations for the D-CODE and SR-T plots on the left.
Slide 39 from Mihaela van der Schaar’s talk, showing observed data of the effects of chemotherapy on tumour volume over time and then two examples of derived governing equations in plots on the left with the actual equations written out on the right

The implications for healthcare applications are immense if these methods are able to be reliability integrated into our existing machine learning analysis. On a more philosophical angle, it begs interesting questions about how many systems in life sciences (and beyond) have governing equations and what proportion of these equations are possible to discover.

Gaussian processes and simulation of clinical trials

A highlight for Will and Christianne was the informative talk from Ti John which was a practical introduction to Gaussian Processes (GP) which furthered our understanding of how GPs learn non-linear functions from a dataset. By assuming that your data are a collection of realisations of some unknown random function (or combination of functions), and judicious choice of kernel, Gaussian Process modelling can allow the estimation of both long-term trends from short-term fluctuations from noisy data. The presentation was enhanced with this interactive visualisation of GPs, alongside an analysis of how the blood glucose response to meals changes after bariatric surgery. Another highlight was Alejandro Frangi’s talk on in silico clinical trials in which he described how mechanistic modelling (like fluid dynamic simulations of medical devices) can be combined with generative modelling (to synthesise a virtual patient cohort) to explore how medical treatments may perform in patients who would never have qualified for inclusion in a real randomised controlled trial.

Causality

Richard’s favourite talk was by Cheng Zhang from Microsoft Research on causal models, titled “Causality: From Statistics to Foundation Models”. Cheng highlighted that an understanding of causality is vital for the intelligent systems that have a role in decision-making. This area is on the cutting edge of research in AI for biology and healthcare – understanding consequences is necessary for a model that should propose interventions. While association (statistics) is still the main use case for AI, such models have no model of the “true” nature of the world that they are reflecting which leads to hallucinations such as images with too many fingers or nonsensical text generation. One recipe proposed by Cheng was to build a causally-aware model to:

  • Apply an attention mechanism/transformer to data so that the model focuses only on the most important parts
  • Use a penalised hinge loss- the model should learn from its mistakes, and should account for some mistakes being worse than others
  • Read off optimal balancing weights + the causal effect of actions – after training, we need to investigate the model to understand the impact of different actions.

In essence, this is a blueprint to build a smart system that can look at a lot of complex data, decide what’s important, learn from its mistakes efficiently and can help us understand the impact of different actions. As an example, we might be interested in how the amount of time spent studying affects students’ grades. At the outset, it’s hard to say if studying more causes better grades because these students might also have other good habits, have access to more resources, etc. Such a model would try to balance these factors and give a clearer picture of what causes what- effectively, it would try to find the true effect of studying more on a student’s grade, after removing the influence of other habits.

This behaviour is desired for the complex models being developed for healthcare and biology; for example, we may be interested in engineering CRISPR interventions to make crops more resilient to climate change or developing brain stimulation protocols to help with rehabilitation. A model proposing interventions for these should have a causal understanding of which genes impact a trait, or how different patterns of brain activity affect cognitive function.

Recordings of all the talks can be found on here

Rigour, imagination and production in data-driven science 

A public event organised by The Alan Turing Institute – 20 June 2024 
Blog post by Léo Gorman, Data Scientist, Jean Golding Institute 

Let’s say you are a researcher approaching a new dataset. Often it seems that there is a virtually infinite number of legitimate paths you could take between loading your data for the first time and building a model that is useful for prediction or inference. Even if we follow statistical best practice, it can feel that even more established methods still don’t allow us to communicate our uncertainty in an intuitive way, to say where our results are relevant and where they are not, or to understand whether our models can be used to infer causality. These are not trivial issues. The Alan Turing Institute (the Turing) hosted a theory and methods challenge fortnight (TMCF), where leading researchers got together to discuss these issues.

JGI team members Patty Holley, James Thomas and Léo Gorman (left to right) at the Turing 

Three of the researchers from the TMCF (Andrew Gelman, Jessica Hullman, and Hadley Wickham) took part in a public lecture and panel discussion where they shared their thoughts on more active and thoughtful data science. 

Members of the Jean Golding Institute (Patty Holley, James Thomas, and Léo Gorman) went to London to participate in this event, and to meet with staff at the Turing to discuss opportunities for more collaboration between the Turing and the University of Bristol. 

In this post, I aim to provide a brief summary of my take-home messages that I hope you will find useful. At the end of this post, I recommend materials from all three speakers which will cover these topics in much more depth. 

Andrew Gelman – Beyond “push a button, take a pill” data science

 

Andrew Gelman presenting

Gelman mainly discussed how are statistics used to assess the impact of ‘interventions’ in modern science. Randomised controlled trials (RCTs) are considered the gold-standard option, but according to Gelman, the common interpretation of these studies could be improved. First, the trials need to be taken in context, and it needs to be understood that these findings might be different in another scenario. 

We need to move beyond the binary “it worked” or “it didn’t” outcomes. There are intermediate outcomes which help us understand how well a treatment worked. For example, let’s take cancer treatment trial. Rather than just looking at if a treatment worked for a group, we could look at how the size of the tumour changed, and whether this changed for different people. As Gelman says in his blog: “Real-world effects vary among people and over time”. 

Jessica Hullman – What do we miss with average model effects? How can simulation and data visualisation help?

Jessica Hullman presenting

Hullman’s talk expanded on some of the themes in Gelman’s talk, Let’s continue with the example of an RCT for cancer treatment. If we saw an average effect of 0.1 between treatment and control, how would that vary for different characteristics (represented by the x-axis in the quartet of graphs below). Hullman demonstrated how simulation and visualisation can help us understand how different scenarios can lead to the same conclusion. 

Causal quartets, as shown in Gelman, Hullman, and Kennedy’s paper. All four plots show an average effect of 0.1, but these effects vary as a function of an explanatory variable (x-axis)

Hadley Wickham – Challenges with putting data science into production 

Hadley Wickham presenting

Wickham’s talk focused on some of the main issues with conducting reproducible, collaborative, and robust data science. Wickham framed these challenges under three broad themes: 

  1. Not just once: an analysis likely needs to be runnable more than once, for example you may want to run the same code on new data as it is collected.  
  1. Not just on my computer: You may need to run some code on your own laptop, but also another system, such as the University’s HPC. 
  1. Not just me: Someone else may need to use your code in their workflow. 

According to Wickham, for people in an organisation to be able to work on the same codebase, they have the following needs (in order of priority), they need to be able to: 

  1. find the code 
  1. run the code 
  1. understand the code 
  1. edit the code. 

These challenges exist at all types of organisation, and there are surprisingly few cases where organisations fulfil all criteria. 

Panel discussion – Reflections on data science 

Cagatay Turkay, Roger Beecham, Hadley Wickham, Andrew Gelman, Jessica Hullman (left to right) at the Turing 

Following each of their individual talks, the panellists reflected more generally. Here are a few key points: 

Causality and complex relationships: When asked about the biggest surprises in data science over the past 10 years both Gelman and Hullman seemed surprised at the uptake of ‘blackbox’ machine learning methods. More work needs to be done to understand how these models work and to try and communicate uncertainty. The causal quartet visualisations, presented in the talk, only addressed simple/ideal cases for causal inference. Gelman and Hullman both said that figuring out how to understand complex causal relationships for high-dimensional data was at the ‘bleeding edge’ of data science. 

People problems not methods/tools problems: All three panellists agreed that most of the issues we face in data science are people problems rather than methods/tools problems. Much of the tools/methods exist already, but we need to think more careful. 

Léo’s takeaway 

The whole trip reminded me of the importance of continual learning, and I will definitely be spending more time going through some of Gelman’s books (see below). 

Gelman and Hullman’s talk in general encouraged people to think: At each point in my analysis, were there alternative choices that could have been made that would have been equally reasonable, and if so, how different would my results have looked had I made these choices? This made me want to think more about multiverse analyses (see analysis package and article). 

Further Reading 

Theory and Methods Challenge Fortnight – Garden of Forking Paths 

The speakers were there as part of the Turing’s Theory and Methods Challenge Fortnight (TMCF), more information can be found below: 

Andrew Gelman 

For people who have not heard of Andrew Gelman before, he is known to be an entertaining communicator (you can search for some of his talks online or look at the Columbia statistics blog). He also has several great books: 

Jessica Hullman 

Again, check the Columbia statistics blog, where Hullman also contributes. The home page of Hullman’s website also includes selected publications which cover causal quartets, but also reliability and interpretability for more complex models. 

Hadley Wickham 

Wickham has made many contributions for R and data science. He is chief scientist at Posit and is lead of the tidyverse team. His book R for Data Science is a particularly useful resource. Other work can be found on his website