Ask JGI Student Experience Profiles: Emma Hazelwood

Emma Hazelwood (Ask-JGI Data Science Support 2023-24) 

Emma Hazelwood, final year PhD Student in Population Health Sciences

I am a final year PhD student in Population Health Sciences. I found out about the opportunity to support the JGI’s data science helpdesk through a friend who had done this job previously. I thought it sounded like a great way to do something a bit different, especially on those days when you need a bit of a break from your PhD topic.

I’ve learnt so many new skills from working within the JGI. The team are very friendly, and everyone is learning from each other. Picking up new skills, for instance Python, has also been very beneficial when considering what I want to do after my PhD. I’ve been able to see how the statistical methods that I know from my biomedical background can be used in completely different contexts, which has really changed the way I think about data.

I’ve worked on a range of topics through JGI, which have all been as interesting as they have been different. I’ve helped people with coding issues, thought about new ways to visualise data, and discussed what statistical methods would be most suitable for answering research questions. In particular, I’ve loved getting involved with a project in the Latin American studies department, where I’ve been mapping key locations from conferences throughout the early 20th century onto satellite images, bringing to life the routes that the conference attendees would have taken. 

This has been a great opportunity working with a very welcoming team, and one I’d recommend to anyone considering it!

Ask JGI Student Experience Profiles: Emilio Romero

Emilio Romero (Ask-JGI Data Science Support 2023-24)

Emilio Romero, 2nd year PhD Student in Translational Health Sciences

Over the past year, my experience helping with the Ask-JGI service has been really rewarding. I was keen to apply as I wanted to get more exposure to the research world in Bristol, meet different researchers and explore different ways of working and approaching data with them.

From a technical perspective, I had the opportunity to work on projects related to psychometric data, biological matrices, proteins, chemometrics and mapping. I worked mainly with R and, in some cases, SPSS, which offered different alternatives for data analysis and presentation.

One of the most challenging projects involved chemometric concentrations of different residues of chemical compounds extracted from vessels used in past human settlements. This challenge allowed me to talk to specialists in the field and to work in a multidisciplinary way in developing data matrices, extracting coordinates and creating maps in R. The most rewarding part was being able to use a colour scale to represent the variation in concentration of specific compounds across settlements. This was undoubtedly a great experience and a technique that I had never had the opportunity to practise.

Ask-JGI also promoted many events, especially Bristol Data Week, which allowed many interested people to attend courses at different levels specialising in the use of data analysis software such as Python and R.

The Ask-JGI team have made this year an enjoyable experience. As a cohort, we have come together to provide interdisciplinary advice to support various projects. I would highly recommend that anyone with an interest in data science and statistics apply. It is an incredible opportunity for development and networking and allows you to immerse yourself in the wider Bristol community, as well as learning new techniques that you can use during your time at the University of Bristol.

Ask JGI Student Experience Profiles: Daniel Collins

Daniel Collins (Ask-JGI Data Science Support 2023-24)

Daniel Collins, 2nd year PhD Student in the School of Computer Science at the University of Bristol

I applied to Ask-JGI as a 2nd year PhD student on the Interactive AI CDT. Before starting my PhD, I spent several years working in Medical Physics for the NHS. Without a formal background in data science, transitioning to an AI-focused PhD felt like a significant shift. I was looking for opportunities to gain more practical experience in areas of data science outside of my research topic, and Ask-JGI has been the perfect place to do this! 

Working with Ask-JGI has been a hugely rewarding experience, and I’ve really appreciated the variety it introduced into my day-to-day work. With a PhD, you’re often working towards a long-term goal in a very specific domain area, with projects that can span several months at a time. With Ask-JGI, each query becomes a self-contained mini-project with a much smaller scope and timeline. These short bursts of exploration and learning have been really valuable to have alongside my PhD. 

The queries involve supporting researchers from various specialisms across the University, and can involve a broad range of topics and technical skills. I’ve particularly enjoyed queries that have involved writing demo code e.g. for data processing, visualisation or modelling. One of the highlights has been my work with GenROC, visualising the number of children with different rare genetic conditions recruited to the study. To try to make it more engaging for the children and families involved, we developed a pipeline for creating 3D bubble plots with a space theme using the Blender Python API. This was great because I got to spend time learning a new software tool while also learning more about the important work the GenROC researchers are doing at the University!

Example of the Blender API bubble plots made for GenROC, with anonymised and randomised data

I wholeheartedly recommend joining the team if you have experience in any area of data science and you’re looking to develop your skills. The JGI team have created an incredibly friendly and supportive environment for learning and collaboration. It’s an excellent opportunity to learn from others, and gain exposure to the different ways data science can be applied in academic research!

From Data to Discovery in Biology and Health

ELLIS Summer School on Machine Learning in Healthcare and Biology – 11 to 13 June 2024  

Huw Day, Richard Lane, Christianne Fernée and Will Chapman, data scientists from the Jean Golding Institute, attended the European Laboratory for Learning and Intelligent Systems (ELLIS) Summer School on Machine Learning for Healthcare and Biology at The University of Manchester. Across three days they learned about cutting-edge developments in machine learning for biology and healthcare from academic and industry leaders.

A major theme of the event was how researchers create real knowledge about biological processes at the convergence of deep learning and causal inference techniques. Through this, machine learning can be used to bridge the gap between well-controlled lab experiments and the messy real world of clinical treatments.

Advancing Medical Science using Machine Learning

Huw’s favourite talk was “From Data to Discovery: Machine Learning’s Role in Advancing (Medical) Science” by Mihaela van der Schaar, who is a Professor of ML, AI and Medicine at the University of Cambridge and Director of the Cambridge Centre for AI in Medicine.

Currently, machine learning models are excellent at deducing the association between variables. However, Mihaela argued that we need to go beyond this to understand the variables and their relationships so that we can discover so-called “governing equations”. In the past, human experts have used domain knowledge, intuition and insight to extract governing equations from underlying data.

The speaker’s research group have been working to deduce different types of underlying governing equations from black box models. They have developed techniques to extract explicit functions as well as more involved functional equations and various types of ordinary and partial differential equations.

Slide 39 from Mihaela van der Schaar’s talk. On the left, three plots show the effects of chemotherapy on tumour volume over time for the observed data, D-CODE and SR-T; on the right, the derived governing equations for the D-CODE and SR-T plots are written out

The implications for healthcare applications are immense if these methods can be reliably integrated into our existing machine learning analysis. From a more philosophical angle, it raises interesting questions about how many systems in life sciences (and beyond) have governing equations and what proportion of these equations are possible to discover.

Gaussian processes and simulation of clinical trials

A highlight for Will and Christianne was the informative talk from Ti John, a practical introduction to Gaussian Processes (GPs) which furthered our understanding of how GPs learn non-linear functions from a dataset. By assuming that your data are a collection of realisations of some unknown random function (or combination of functions), and with a judicious choice of kernel, Gaussian Process modelling can estimate both long-term trends and short-term fluctuations from noisy data. The presentation was enhanced with an interactive visualisation of GPs, alongside an analysis of how the blood glucose response to meals changes after bariatric surgery.

Another highlight was Alejandro Frangi’s talk on in silico clinical trials, in which he described how mechanistic modelling (like fluid dynamic simulations of medical devices) can be combined with generative modelling (to synthesise a virtual patient cohort) to explore how medical treatments may perform in patients who would never have qualified for inclusion in a real randomised controlled trial.
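To make the Gaussian Process idea from Ti John’s talk a little more concrete, here is a minimal sketch in plain Python. It is not code from the talk: the synthetic data, the hand-picked kernel settings (`LONG`, `SHORT`, `NOISE_VAR`) and all function names are assumptions made purely for illustration. It adds a long-lengthscale kernel to a short-lengthscale one, then uses each kernel on its own at prediction time to pull the slow trend and the fast fluctuations apart.

```python
import math
import random


def rbf(a, b, variance, lengthscale):
    """Squared-exponential (RBF) kernel between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_spd(A, b):
    """Solve A x = b using Cholesky, then forward and backward substitution."""
    n = len(b)
    L = cholesky(A)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


# Assumed (variance, lengthscale) pairs, chosen by hand rather than learned.
LONG = (4.0, 5.0)     # slow trend
SHORT = (0.25, 0.3)   # fast fluctuations
NOISE_VAR = 0.05      # observation noise variance


def fit_gp(xs, ys):
    """Return alpha = (K_long + K_short + noise * I)^-1 y."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j], *LONG) + rbf(xs[i], xs[j], *SHORT)
          + (NOISE_VAR if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve_spd(K, ys)


def predict(x_new, xs, alpha, kernel_params):
    """Posterior mean of the component described by one kernel."""
    return sum(rbf(x_new, xi, *kernel_params) * a for xi, a in zip(xs, alpha))


# Synthetic data: a slow linear drift plus a fast oscillation plus noise.
random.seed(0)
xs = [0.25 * i for i in range(40)]
ys = [0.3 * x + 0.4 * math.sin(2 * math.pi * x / 1.5) + random.gauss(0, 0.2)
      for x in xs]

alpha = fit_gp(xs, ys)
trend = [predict(x, xs, alpha, LONG) for x in xs]     # long-term component
wiggle = [predict(x, xs, alpha, SHORT) for x in xs]   # short-term component
```

In practice the kernel settings would be learned from the data rather than fixed by hand, and a dedicated GP library would replace the hand-written linear solver.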

Causality

Richard’s favourite talk was by Cheng Zhang from Microsoft Research on causal models, titled “Causality: From Statistics to Foundation Models”. Cheng highlighted that an understanding of causality is vital for intelligent systems that have a role in decision-making. This area is on the cutting edge of research in AI for biology and healthcare: understanding consequences is necessary for a model that should propose interventions. While association (statistics) is still the main use case for AI, such models have no model of the “true” nature of the world that they are reflecting, which leads to hallucinations such as images with too many fingers or nonsensical text generation. One recipe Cheng proposed for building a causally aware model was to:

  • Apply an attention mechanism/transformer to data so that the model focuses only on the most important parts.
  • Use a penalised hinge loss: the model should learn from its mistakes, and should account for some mistakes being worse than others.
  • Read off optimal balancing weights and the causal effect of actions: after training, we need to investigate the model to understand the impact of different actions.

In essence, this is a blueprint for building a smart system that can look at a lot of complex data, decide what’s important, learn from its mistakes efficiently and help us understand the impact of different actions. As an example, we might be interested in how the amount of time spent studying affects students’ grades. At first glance, it’s hard to say if studying more causes better grades, because students who study more might also have other good habits, have access to more resources, etc. Such a model would try to balance these factors and give a clearer picture of what causes what: effectively, it would try to find the true effect of studying more on a student’s grade, after removing the influence of other habits.
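The studying example can be sketched with a small simulation. This is not the method from Cheng’s talk (there is no transformer, hinge loss or learned balancing weights here); it swaps in plain stratification on a single, fully observed confounder, which is the simplest form of balancing. The data-generating numbers, the `resources` variable and the `slope` helper are all assumptions invented for illustration.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


random.seed(1)
TRUE_EFFECT = 2.0  # assumed grade points gained per extra hour of study

# Simulated students: access to resources raises both study hours and grades,
# so it confounds the relationship between the two.
students = []
for _ in range(5000):
    resources = random.random() < 0.5
    hours = 2.0 + 3.0 * resources + random.gauss(0, 1)
    grade = 50.0 + TRUE_EFFECT * hours + 10.0 * resources + random.gauss(0, 5)
    students.append((resources, hours, grade))

# Naive estimate: regress grade on hours and ignore the confounder.
naive = slope([h for _, h, _ in students], [g for _, _, g in students])

# Adjusted estimate: fit the slope within each resource group, then average
# the two slopes weighted by group size.
adjusted = 0.0
for level in (False, True):
    group = [(h, g) for r, h, g in students if r == level]
    weight = len(group) / len(students)
    adjusted += weight * slope([h for h, _ in group], [g for _, g in group])

print(f"naive slope:    {naive:.2f}")
print(f"adjusted slope: {adjusted:.2f}")
```

The naive slope overstates the benefit of studying because it also absorbs the effect of resources, while the stratified estimate lands close to `TRUE_EFFECT`. Real data rarely has one tidy, fully observed confounder, which is why learned balancing weights are attractive.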

This behaviour is desired for the complex models being developed for healthcare and biology; for example, we may be interested in engineering CRISPR interventions to make crops more resilient to climate change or developing brain stimulation protocols to help with rehabilitation. A model proposing interventions for these should have a causal understanding of which genes impact a trait, or how different patterns of brain activity affect cognitive function.

Recordings of all the talks can be found here.

Empowering schools to improve the data literacy of young people

What is DataFace?

Data science is everywhere in the modern world and is increasingly relevant to many careers. Part of Cheltenham Science Festival, and supported by the Jean Golding Institute and CyberFirst, the DataFace project gives secondary school students and their teachers the skills and confidence to dive into an open dataset, find an issue that catches their interest, and tell a story with creative data visualisations. Along with boosting their data literacy and building essential data science skills, DataFace breaks the stereotype that data science is just for those who study IT or computer science.

Why are we involved?

At the JGI, we support and train researchers in data science and develop data literacy skills across all disciplines at the University of Bristol. Through this work, we’ve realised that data literacy is important in every field, not just the traditional “STEM” subjects (science, technology, engineering, and mathematics).

The sooner we learn these skills, the sooner we can use them in our daily lives—whether it’s watching the news, understanding finances and cost of living or better understanding our contributions to global warming. This realisation inspired us to look for ways to engage more with the community, which led us to get involved with DataFace.

What did we do?

As part of the project, former JGI director Kate Robson Brown and data scientist James Thomas sat on the project steering group alongside representatives from Cheltenham Festivals and CyberFirst. With assistance from research software engineer Matt Williams, they developed teaching resources, including core skills videos for teachers and pupils, dataset explainer videos, and the curated open datasets that the pupils go on to analyse. The JGI also engaged PhD students and postgraduate researchers to act as role models in the videos. These resources have since been expanded with the help of JGI data scientist Huw Day.

What happened at Cheltenham Science Festival?

On Monday, 3 June 2024, teams of students from 12 schools came together for the DataFace competition day. They had an opportunity to share their findings and creative data visualisations in a poster session to their peers, attendees and three judges (including Ask-JGI PhD student Rachael Laidlaw).

Left to right: Huw Day (JGI data scientist), Kate Robson Brown (former JGI director) and Rachael Laidlaw (Ask-JGI PhD student) at Cheltenham Science Festival for the DataFace competition day

Rather than plot the usual graphs and charts that we’re used to, the students created all manner of visualisations. For example, Gloucester Academy looked at data about the cost of living and made differently sized models of food to reflect how buying power changed between years, with smaller biscuits and sandwiches with a chunk bitten out representing the cost of living rising over time.

Gloucester Academy team of four students and their teacher behind their table which contains their poster and data visualisation graphs
Dean Close School team consisting of 5 students standing behind a table that contains 3D printed shapes of countries
Dean Close School showed how rising sea levels displace inhabitants across different countries by 3D printing models of several countries (Nigeria, Netherlands, Thailand and Mexico), including their topography, and adding 3D printed mini houses to the maps. These were then placed in a trough filled with water, showing the areas and people that would be affected by flooding.

The top six schools from the poster sessions were selected by the judges. In the afternoon, they presented their work, sharing their experiences of working on the project and answering questions from the judges.

Stroud High School team consisting of 3 students standing behind a table that contains their poster

The winning school was Stroud High School for Girls, who visualised data about the environment. They made a bar chart using people to represent the number of threatened species in a country. Each person held a papier-mâché balloon, the size of which depicted that country’s emissions per capita.

Rachael Laidlaw talking to an attendee at the Cheltenham Science Festival
Rachael Laidlaw, Ask-JGI PhD student and DataFace Judge at Cheltenham Science Festival

“The enthusiasm and initiative shown by all students involved in the competition was incredibly inspiring. I especially loved hearing the stories of how each group creatively came up with their project ideas and it was impressive to see the range of skills and resources they’d managed to incorporate into their designs. The room was filled with effective and insightful portrayals of data which each communicated a really important message. The event proved to be a wonderful showcase of just how fruitful the DataFace program has been and it was so rewarding to see the impact of the training videos that I featured in last year!”
Rachael Laidlaw, Ask-JGI PhD student and DataFace Judge

What’s next?

In preparation for next year, we’ll be curating more datasets for students to work with, along with the training materials to go alongside them. We’re excited to work with data that matters to students by creating a new dataset focused on mental health, using information from the World Happiness Report.

Students have different needs, so we’re putting together a more challenging dataset for those who are ready to dig deeper. This dataset will feature mental health data collected over the past decade from countries all over the world. This will give students different options to explore patterns they find interesting. Whether they want to look at lots of countries at one point in time or a few countries over a longer period, they’ll have the freedom to choose how they visualise and analyse the data.

The next big step in data literacy is being able to share your findings with others who might not be familiar with your data, the context, or your analysis methods. That’s why we’re creating a new training video called “how to share your findings”. This video will help students better prepare to share their projects in one-on-one conversations, poster sessions and presentations.