Ask JGI Student Experience Profiles: Daniel Collins

Daniel Collins (Ask-JGI Data Science Support 2023-24)

Daniel Collins, 2nd year PhD Student in the School of Computer Science at the University of Bristol

I applied to Ask-JGI as a 2nd year PhD student on the Interactive AI CDT. Before starting my PhD, I spent several years working in Medical Physics for the NHS. Without a formal background in data science, transitioning to an AI-focused PhD felt like a significant shift. I was looking for opportunities to gain more practical experience in areas of data science outside of my research topic, and Ask-JGI has been the perfect place to do this! 

Working with Ask-JGI has been a hugely rewarding experience, and I’ve really appreciated the variety it introduced into my day-to-day work. With a PhD, you’re often working towards a long-term goal in a very specific domain area, with projects that can span several months at a time. With Ask-JGI, each query becomes a self-contained mini-project with a much smaller scope and timeline. These short bursts of exploration and learning have been really valuable to have alongside my PhD. 

The queries involve supporting researchers from various specialisms across the University, and can cover a broad range of topics and technical skills. I’ve particularly enjoyed queries that have involved writing demo code, e.g. for data processing, visualisation or modelling. One of the highlights has been my work with GenROC, visualising the number of children with different rare genetic conditions recruited to the study. To try to make it more engaging for the children and families involved, we developed a pipeline for creating 3D bubble plots with a space theme using the Blender Python API. This was great because I got to spend time learning a new software tool while also learning more about the important work the GenROC researchers are doing at the University!

Example of the Blender API bubble plots made for GenROC, with anonymised and randomised data

I wholeheartedly recommend joining the team if you have experience in any area of data science and you’re looking to develop your skills. The JGI team have created an incredibly friendly and supportive environment for learning and collaboration. It’s an excellent opportunity to learn from others, and gain exposure to the different ways data science can be applied in academic research!

From Data to Discovery in Biology and Health

ELLIS Summer School on Machine Learning in Healthcare and Biology – 11 to 13 June 2024  

Huw Day, Richard Lane, Christianne Fernée and Will Chapman, data scientists from the Jean Golding Institute, attended the European Laboratory for Learning and Intelligent Systems (ELLIS) Summer School on Machine Learning for Healthcare and Biology at The University of Manchester. Across three days they learned about cutting edge developments in machine learning for biology and healthcare from academic and industry leaders.

A major theme of the event was how researchers create real knowledge about biological processes at the convergence of deep learning and causal inference techniques. In this way, machine learning can be used to bridge the gap between well-controlled lab experiments and the messy real world of clinical treatments.

Advancing Medical Science using Machine Learning

Huw’s favourite talk was “From Data to Discovery: Machine Learning’s Role in Advancing (Medical) Science” by Mihaela van der Schaar, who is a Professor of ML, AI and Medicine at the University of Cambridge and Director of the Cambridge Centre for AI in Medicine.

Currently, machine learning models are excellent at deducing the association between variables. However, Mihaela argued that we need to go beyond this to understand the variables and their relationships so that we can discover so-called “governing equations”. In the past, human experts have discovered governing equations by applying domain knowledge, intuition and insight to the underlying data.

The speaker’s research group have been working to deduce different types of underlying governing equations from black box models. They have developed techniques to extract explicit functions as well as more involved functional equations and various types of ordinary and partial differential equations.

Slide 39 from Mihaela van der Schaar’s talk, showing observed data of the effects of chemotherapy on tumour volume over time and then two examples of derived governing equations in plots on the left with the actual equations written out on the right

The implications for healthcare applications are immense if these methods can be reliably integrated into our existing machine learning analysis. On a more philosophical note, it raises interesting questions about how many systems in life sciences (and beyond) have governing equations and what proportion of these equations are possible to discover.

Gaussian processes and simulation of clinical trials

A highlight for Will and Christianne was the informative talk from Ti John, a practical introduction to Gaussian Processes (GPs) that furthered our understanding of how GPs learn non-linear functions from a dataset. By assuming that your data are a collection of realisations of some unknown random function (or combination of functions), and with a judicious choice of kernel, Gaussian Process modelling allows the estimation of both long-term trends and short-term fluctuations from noisy data. The presentation was enhanced with this interactive visualisation of GPs, alongside an analysis of how the blood glucose response to meals changes after bariatric surgery.

Another highlight was Alejandro Frangi’s talk on in silico clinical trials, in which he described how mechanistic modelling (like fluid dynamic simulations of medical devices) can be combined with generative modelling (to synthesise a virtual patient cohort) to explore how medical treatments may perform in patients who would never have qualified for inclusion in a real randomised controlled trial.
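To make the Gaussian Process part of this section a little more concrete, here is a minimal sketch of GP regression in plain Python. It is not taken from Ti John's talk: the squared-exponential kernel, the length scale, the noise level and the toy sine-wave data are all assumptions chosen for illustration, and a real analysis would normally use a dedicated GP library rather than a hand-written solver.

```python
import math

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel: nearby inputs get highly correlated outputs."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def gp_posterior_mean(x_train, y_train, x_new, noise=0.1, length_scale=1.0):
    """Posterior mean of a zero-mean GP at x_new, given noisy observations."""
    n = len(x_train)
    # Covariance between training points, plus observation noise on the diagonal.
    K = [[rbf_kernel(x_train[i], x_train[j], length_scale)
          + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, y_train)  # alpha = (K + noise^2 I)^-1 y
    # Each prediction is a kernel-weighted combination of the training targets.
    return [sum(rbf_kernel(xs, x_train[i], length_scale) * alpha[i] for i in range(n))
            for xs in x_new]

# Toy data (an assumption for illustration): a smooth non-linear function
# observed at a handful of points.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [math.sin(x) for x in x_train]

pred = gp_posterior_mean(x_train, y_train, [2.5], noise=0.05)
print(f"prediction at 2.5: {pred[0]:.3f}, true value: {math.sin(2.5):.3f}")
```

The kernel is where the "judicious choice" comes in: a longer `length_scale` favours slow trends, while a shorter one lets the fit follow short-term fluctuations.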

Causality

Richard’s favourite talk was by Cheng Zhang from Microsoft Research on causal models, titled “Causality: From Statistics to Foundation Models”. Cheng highlighted that an understanding of causality is vital for intelligent systems that have a role in decision-making. This area is on the cutting edge of research in AI for biology and healthcare – understanding consequences is necessary for a model that should propose interventions. While association (statistics) is still the main use case for AI, such models have no model of the “true” nature of the world that they are reflecting, which leads to hallucinations such as images with too many fingers or nonsensical text generation. One recipe proposed by Cheng for building a causally-aware model was to:

  • Apply an attention mechanism/transformer to data so that the model focuses only on the most important parts
  • Use a penalised hinge loss – the model should learn from its mistakes, and should account for some mistakes being worse than others
  • Read off optimal balancing weights + the causal effect of actions – after training, we need to investigate the model to understand the impact of different actions.

In essence, this is a blueprint for building a smart system that can look at a lot of complex data, decide what’s important, learn from its mistakes efficiently and help us understand the impact of different actions. As an example, we might be interested in how the amount of time spent studying affects students’ grades. At the outset, it’s hard to say if studying more causes better grades because these students might also have other good habits, have access to more resources, etc. Such a model would try to balance these factors and give a clearer picture of what causes what. Effectively, it would try to find the true effect of studying more on a student’s grade, after removing the influence of other habits.
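The balancing idea in the studying example can be sketched with a much simpler, textbook technique: inverse-propensity weighting. This is not the transformer-based method from Cheng's talk. The data below are entirely hypothetical and are constructed so that the "true" effect of studying more is 5 grade points, with access to extra resources acting as the confounder.

```python
from collections import defaultdict

# Hypothetical toy data: (has_extra_resources, studies_more, grade).
# Grades are constructed as 50 + 10 * resources + 5 * studies_more,
# so the "true" effect of studying more is 5 points by design.
students = (
    [(1, 1, 65)] * 30 + [(1, 0, 60)] * 10 +
    [(0, 1, 55)] * 10 + [(0, 0, 50)] * 30
)

def mean(values):
    return sum(values) / len(values)

def naive_effect(rows):
    """Raw difference in mean grade between students who study more and those who do not."""
    treated = [g for _, t, g in rows if t == 1]
    control = [g for _, t, g in rows if t == 0]
    return mean(treated) - mean(control)

def balanced_effect(rows):
    """Inverse-propensity-weighted difference in mean grade."""
    # Propensity: share of students who study more within each resource group.
    counts = defaultdict(lambda: [0, 0])  # resources -> [n_not_studying_more, n_studying_more]
    for r, t, _ in rows:
        counts[r][t] += 1
    propensity = {r: n[1] / (n[0] + n[1]) for r, n in counts.items()}

    # Weight each student by the inverse probability of the choice they made,
    # so both resource groups count equally in each arm of the comparison.
    totals = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # studies_more -> [sum of w * grade, sum of w]
    for r, t, g in rows:
        p = propensity[r]
        w = 1 / p if t == 1 else 1 / (1 - p)
        totals[t][0] += w * g
        totals[t][1] += w
    return totals[1][0] / totals[1][1] - totals[0][0] / totals[0][1]

print(f"naive: {naive_effect(students):.1f}, balanced: {balanced_effect(students):.1f}")
```

On this toy data the naive comparison gives 10 points, because students with extra resources are over-represented among those who study more, while the weighted comparison recovers the 5 points built into the data. Real data would need every relevant confounder to be measured for this to work.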

This behaviour is desired for the complex models being developed for healthcare and biology; for example, we may be interested in engineering CRISPR interventions to make crops more resilient to climate change or developing brain stimulation protocols to help with rehabilitation. A model proposing interventions for these should have a causal understanding of which genes impact a trait, or how different patterns of brain activity affect cognitive function.

Recordings of all the talks can be found here.

Empowering schools to improve the data literacy of young people

What is DataFace?

Data science is everywhere in the modern world and is increasingly relevant to many careers. Part of Cheltenham Science Festival, and supported by the Jean Golding Institute and CyberFirst, the DataFace project gives secondary school students and their teachers the skills and confidence to dive into an open dataset, find an issue that catches their interest, and tell a story with creative data visualisations. Along with boosting their data literacy and building essential data science skills, DataFace breaks the stereotype that data science is just for those who study IT or computer science.

Why are we involved?

At the JGI, we support and train researchers in data science and develop data literacy skills across all disciplines at the University of Bristol. Through this work, we’ve realised that data literacy is important in every field, not just the traditional “STEM” subjects (science, technology, engineering, and mathematics).

The sooner we learn these skills, the sooner we can use them in our daily lives—whether it’s watching the news, understanding finances and cost of living or better understanding our contributions to global warming. This realisation inspired us to look for ways to engage more with the community, which led us to get involved with DataFace.

What did we do?

As part of the project, former JGI director Kate Robson Brown and data scientist James Thomas sat on the project steering group alongside representatives from Cheltenham Festivals and CyberFirst. With assistance from research software engineer Matt Williams they developed teaching resources, including core skills videos for teachers and pupils, dataset explainer videos, and the curated open datasets that the pupils go on to analyse. The JGI also engaged PhD students and postgraduate researchers to act as role models in the videos. These resources have since been expanded with the help of JGI data scientist Huw Day.

What happened at Cheltenham Science Festival?

On Monday, 3 June 2024, teams of students from 12 schools came together for the DataFace competition day. They had an opportunity to share their findings and creative data visualisations in a poster session to their peers, attendees and three judges (including Ask-JGI PhD student Rachael Laidlaw).

Left to right: Huw Day, Kate Robson Brown and Rachael Laidlaw

Huw Day (JGI data scientist), Kate Robson Brown (former JGI director) and Rachael Laidlaw (Ask-JGI PhD student) attending Cheltenham Science Festival for the DataFace competition day.

Rather than plot the usual graphs and charts that we’re used to, the students created all manner of visualisations. For example, Gloucester Academy looked at data about the cost of living and made different sized models of food that reflected the buying power between years, with smaller biscuits and sandwiches with a chunk bitten out, reflecting the cost of living rising over time.

Gloucester Academy team of four students and their teacher behind their table which contains their poster and data visualisation graphs
Dean Close School team consisting of 5 students standing behind a table that contains 3D printed shapes of countries
Dean Close School showed how rising sea levels displace inhabitants across different countries by 3D printing models of several countries (Nigeria, Netherlands, Thailand and Mexico), including their topography, and adding 3D printed mini houses to the maps. These were then placed in a trough filled with water, showing the areas and people that would be affected by flooding.

The top six schools from the poster sessions were selected by the judges. In the afternoon, they presented their work, sharing their experiences of working on the project and answering questions from the judges.

Stroud High School team consisting of 3 students standing behind a table that contains their poster

The winning school was Stroud High School for Girls, who visualised data about the environment. They made a bar chart using people to represent the number of threatened species in a country. Each person was holding a papier-mâché balloon – the size of each depicted a country’s emissions per capita.

Rachael Laidlaw talking to an attendee at the Cheltenham Science Festival
Rachael Laidlaw, Ask-JGI PhD student and DataFace Judge at Cheltenham Science Festival

“The enthusiasm and initiative shown by all students involved in the competition was incredibly inspiring. I especially loved hearing the stories of how each group creatively came up with their project ideas and it was impressive to see the range of skills and resources they’d managed to incorporate into their designs. The room was filled with effective and insightful portrayals of data which each communicated a really important message. The event proved to be a wonderful showcase of just how fruitful the DataFace program has been and it was so rewarding to see the impact of the training videos that I featured in last year!”
Rachael Laidlaw, Ask-JGI PhD student and DataFace Judge

What’s next?

Ready for next year, we’ll be curating more datasets for students to work with and the training materials to go alongside them. We’re excited to work with data that matters to students by creating a new dataset focused on mental health, using information from the World Happiness Report.

Students have different needs, so we’re putting together a more challenging dataset for those who are ready to dig deeper. This dataset will feature mental health data collected over the past decade from countries all over the world. This will give students different options to explore patterns they find interesting. Whether they want to look at lots of countries at one point in time or a few countries over a longer period, they’ll have the freedom to choose how they visualise and analyse the data.

The next big step in data literacy is being able to share your findings with others who might not be familiar with your data, the context, or your analysis methods. That’s why we’re creating a new training video called “how to share your findings”. This video will help students better prepare to share their projects in one-on-one conversations, poster sessions and presentations.

Rigour, imagination and production in data-driven science 

A public event organised by The Alan Turing Institute – 20 June 2024 
Blog post by Léo Gorman, Data Scientist, Jean Golding Institute 

Let’s say you are a researcher approaching a new dataset. Often it seems that there is a virtually infinite number of legitimate paths you could take between loading your data for the first time and building a model that is useful for prediction or inference. Even if we follow statistical best practice, it can feel that well-established methods still don’t allow us to communicate our uncertainty in an intuitive way, to say where our results are relevant and where they are not, or to understand whether our models can be used to infer causality. These are not trivial issues. The Alan Turing Institute (the Turing) hosted a theory and methods challenge fortnight (TMCF), where leading researchers got together to discuss these issues.

JGI team members Patty Holley, James Thomas and Léo Gorman (left to right) at the Turing 

Three of the researchers from the TMCF (Andrew Gelman, Jessica Hullman, and Hadley Wickham) took part in a public lecture and panel discussion where they shared their thoughts on more active and thoughtful data science. 

Members of the Jean Golding Institute (Patty Holley, James Thomas, and Léo Gorman) went to London to participate in this event, and to meet with staff at the Turing to discuss opportunities for more collaboration between the Turing and the University of Bristol. 

In this post, I aim to provide a brief summary of my take-home messages that I hope you will find useful. At the end of this post, I recommend materials from all three speakers which will cover these topics in much more depth. 

Andrew Gelman – Beyond “push a button, take a pill” data science

 

Andrew Gelman presenting

Gelman mainly discussed how statistics are used to assess the impact of ‘interventions’ in modern science. Randomised controlled trials (RCTs) are considered the gold-standard option, but according to Gelman, the common interpretation of these studies could be improved. First, the trials need to be taken in context, and it needs to be understood that the findings might be different in another scenario.

We need to move beyond the binary “it worked” or “it didn’t” outcomes. There are intermediate outcomes which help us understand how well a treatment worked. For example, let’s take a cancer treatment trial. Rather than just looking at whether a treatment worked for a group, we could look at how the size of the tumour changed, and whether this differed between people. As Gelman says in his blog: “Real-world effects vary among people and over time”.

Jessica Hullman – What do we miss with average model effects? How can simulation and data visualisation help?

Jessica Hullman presenting

Hullman’s talk expanded on some of the themes in Gelman’s talk. Let’s continue with the example of an RCT for cancer treatment. If we saw an average effect of 0.1 between treatment and control, how would that vary for different characteristics (represented by the x-axis in the quartet of graphs below)? Hullman demonstrated how simulation and visualisation can help us understand how different scenarios can lead to the same conclusion.

Causal quartets, as shown in Gelman, Hullman, and Kennedy’s paper. All four plots show an average effect of 0.1, but these effects vary as a function of an explanatory variable (x-axis)
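The point behind the causal quartet can be reproduced with a few lines of simulation. The sketch below uses four made-up effect patterns (they are illustrative assumptions, not the ones plotted in the paper) and a hypothetical covariate `x` spread evenly between 0 and 1. Every pattern averages to 0.1, yet they describe very different situations for individuals.

```python
# Hypothetical covariate values spread evenly over (0, 1).
xs = [(i + 0.5) / 100 for i in range(100)]

# Four made-up treatment-effect patterns, each a function of x.
patterns = {
    "constant": lambda x: 0.1,                             # everyone benefits equally
    "linear": lambda x: 0.2 * x,                           # benefit grows with x
    "subgroup only": lambda x: 0.2 if x > 0.5 else 0.0,    # only half benefit
    "opposite signs": lambda x: 0.6 if x > 0.5 else -0.4,  # half helped, half harmed
}

def summarise(effect):
    """Return the average, minimum and maximum effect across the population."""
    effects = [effect(x) for x in xs]
    return sum(effects) / len(effects), min(effects), max(effects)

summary = {name: summarise(f) for name, f in patterns.items()}
for name, (avg, low, high) in summary.items():
    print(f"{name:15s} average={avg:.3f} range=[{low:.3f}, {high:.3f}]")
```

Reporting only the average would make these four scenarios indistinguishable, which is why plotting the effect against `x` is so informative.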

Hadley Wickham – Challenges with putting data science into production 

Hadley Wickham presenting

Wickham’s talk focused on some of the main issues with conducting reproducible, collaborative, and robust data science. Wickham framed these challenges under three broad themes: 

  1. Not just once: An analysis likely needs to be runnable more than once, for example you may want to run the same code on new data as it is collected.
  2. Not just on my computer: You may need to run some code on your own laptop, but also on another system, such as the University’s HPC.
  3. Not just me: Someone else may need to use your code in their workflow.

According to Wickham, for people in an organisation to be able to work on the same codebase, they need to be able to do the following (in order of priority):

  1. find the code
  2. run the code
  3. understand the code
  4. edit the code.

These challenges exist in all types of organisation, and there are surprisingly few cases where organisations fulfil all criteria.

Panel discussion – Reflections on data science 

Cagatay Turkay, Roger Beecham, Hadley Wickham, Andrew Gelman, Jessica Hullman (left to right) at the Turing 

Following each of their individual talks, the panellists reflected more generally. Here are a few key points: 

Causality and complex relationships: When asked about the biggest surprises in data science over the past 10 years, both Gelman and Hullman seemed surprised at the uptake of ‘blackbox’ machine learning methods. More work needs to be done to understand how these models work and to try to communicate uncertainty. The causal quartet visualisations presented in the talk only addressed simple/ideal cases for causal inference. Gelman and Hullman both said that figuring out how to understand complex causal relationships for high-dimensional data was at the ‘bleeding edge’ of data science.

People problems not methods/tools problems: All three panellists agreed that most of the issues we face in data science are people problems rather than methods/tools problems. Many of the tools/methods exist already, but we need to think more carefully about how we use them.

Léo’s takeaway 

The whole trip reminded me of the importance of continual learning, and I will definitely be spending more time going through some of Gelman’s books (see below). 

Gelman and Hullman’s talks in general encouraged people to think: At each point in my analysis, were there alternative choices that could have been made that would have been equally reasonable, and if so, how different would my results have looked had I made these choices? This made me want to think more about multiverse analyses (see analysis package and article).
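As a rough illustration of what a multiverse analysis involves, the sketch below enumerates every combination of a few defensible analysis choices and records the result for each one. It does not use the analysis package mentioned above. The measurements, the outlier rule, the transforms and the summary statistics are all hypothetical choices made for the example.

```python
from itertools import product
from statistics import mean, median
import math

# Hypothetical measurements for two groups; the last control value is a possible outlier.
treatment = [5.1, 6.3, 5.8, 7.2, 6.1, 5.9]
control = [4.8, 5.2, 5.5, 4.9, 5.1, 12.0]

# Each dictionary holds the defensible options at one decision point in the analysis.
outlier_rules = {
    "keep all": lambda xs: xs,
    "drop values > 10": lambda xs: [x for x in xs if x <= 10],
}
transforms = {"raw": lambda x: x, "log": math.log}
summaries = {"mean": mean, "median": median}

def run_multiverse():
    """Compute the treatment-minus-control difference under every combination of choices."""
    results = {}
    for (o_name, o), (t_name, t), (s_name, s) in product(
        outlier_rules.items(), transforms.items(), summaries.items()
    ):
        diff = s([t(x) for x in o(treatment)]) - s([t(x) for x in o(control)])
        results[(o_name, t_name, s_name)] = diff
    return results

results = run_multiverse()
for choices, diff in results.items():
    print(choices, round(diff, 3))
```

With this toy data, the comparison of raw means changes sign depending on whether the large control value is kept, which is exactly the kind of sensitivity a multiverse analysis is meant to expose.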

Further Reading 

Theory and Methods Challenge Fortnight – Garden of Forking Paths 

The speakers were there as part of the Turing’s Theory and Methods Challenge Fortnight (TMCF). More information can be found below:

Andrew Gelman 

For people who have not heard of Andrew Gelman before, he is known to be an entertaining communicator (you can search for some of his talks online or look at the Columbia statistics blog). He also has several great books: 

Jessica Hullman 

Again, check the Columbia statistics blog, where Hullman also contributes. The home page of Hullman’s website also includes selected publications which cover causal quartets, but also reliability and interpretability for more complex models. 

Hadley Wickham 

Wickham has made many contributions to R and data science. He is chief scientist at Posit and leads the tidyverse team. His book R for Data Science is a particularly useful resource. Other work can be found on his website.

Updates from a previous JGI Seed Corn funded project:  Addressing the fetal alcohol spectrum disorder (FASD) ‘data gap’

We are delighted to announce a few updates regarding one of our previous seed corn funded projects. In 2022-2023, the JGI funded Cheryl McQuire’s (Bristol Medical School) project on “Addressing the fetal alcohol spectrum disorder (FASD) ‘data gap’: ascertaining the feasibility of establishing the first UK National linked database for FASD”. This project allowed Cheryl’s team to explore the feasibility of establishing a National Linked Database for Fetal Alcohol Spectrum Disorder (FASD), as landmark UK guidance has called for urgent action to increase identification, understanding, and support for those affected by this disorder.

FASD is caused by prenatal alcohol exposure and is thought to be particularly common in the UK population. The aim of the seed corn project was to take the initial steps towards forming a UK National Database for FASD, looking at feasibility, acceptability, key purposes and the data structure needed. Through questioning over 100 stakeholders including clinicians, data specialists, researchers, policy makers, charities, and people living with FASD, the project was able to demonstrate strong support for a national FASD database, although there was a common concern among stakeholders about privacy and data sharing. Full details of the project can be found on our previous blog post.

Cheryl and the team also collaborated with the Elizabeth Blackwell Institute (EBI) on “Developing a National Database for Fetal Alcohol Spectrum Disorder (Nat-FASD UK): incorporating the views and recommendations of people with FASD and their carers.” Their findings from the projects funded by JGI and EBI were presented at the ADR-UK conference 2023. The abstract for this work can be viewed here. In addition, a pre-print of their FASD National database workshop findings is now available here.

Importantly, this work has been selected to feature in the Office for National Statistics (ONS) Research Excellence Series 2024. Cheryl will be delivering a webinar on “Showcasing methods for diverse stakeholder involvement in database design: establishing the feasibility and acceptability of a National Database for Fetal Alcohol Spectrum Disorder (FASD)” on Thursday 13 June 10:30 to 11:30 BST. The webinar will cover how the team developed a tailored, multi-method approach to public and professional involvement activities, leading to high levels of engagement. In addition, you will also hear what people living with FASD and health care, policy and data science professionals had to say about the feasibility and acceptability of a UK National Linked Database for FASD. There will be an opportunity to ask Cheryl any questions during the dedicated Q&A section. You can register a place on the webinar here.  

The work from both projects has been crucial in paving the way for progress in FASD research within the UK. It has also allowed us to get closer to addressing the FASD data gap that has been stalling the progress in prevention, understanding, and appropriate support for too long. Since both projects, Cheryl’s team has continued working on the FASD database and is currently pursuing funding options to establish a National database for FASD.  

The Jean Golding Institute offers seed corn projects every year to support and promote activities that will foster interdisciplinary research in the area of data science, based on the principle that a small financial investment will lead on to bigger things. We anticipate that our next seed corn funding call will be announced in the autumn of 2024. Sign up to our mailing list to find out when the call goes live.