Children of the 90s and Synthetic Health Data 

JGI Seed Corn Funding Project Blog 2023/24: Mark Mumme, Eleanor Walsh, Dan Smith, Huw Day and Debbie Johnson

What is Children of the 90s? 

Children of the 90s (Co90s) is a multi-generational population-based study following the health and development of nearly 15,000 families living around Bristol, whose children were born in 1991 and 1992. 

Co90s initially recruited its participants during the early stages of the mums’ pregnancies and captures information prospectively, at key time points, using self-reported questionnaires, interviews, clinics and electronic health records (EHR). 

The Co90s supports about 20 project teams using NHS data at any one time.  

What is Synthetic Data? 

At its most basic, synthetic data is information generated artificially rather than recorded directly from real-world events. It is essentially a computer-generated version of a dataset that contains no real records, thereby preserving privacy and confidentiality. 

Privacy vs Fidelity

Generating synthetic data is frequently a balancing act between fidelity and privacy (Figure 1). 

“Fidelity”: how well does the synthetic data represent the real-world data?  

“Privacy”: can personal information be deduced from the synthetic data? 

Blue line with an arrowhead at each end. The left side is high privacy, low fidelity and the right side is low privacy, high fidelity
Figure 1: Privacy versus fidelity

Why synthetic NHS data: 

EHR data are an incredibly valuable and rich data source, yet there are significant barriers to accessing them, including financial costs and the time taken to complete multiple application forms and have these approved. 

Because authentic NHS data is so difficult to access, it is also not unusual for researchers to have never worked with, or possibly even seen, this type of data before. They often face a steep learning curve to understand how the data are structured, what variables are present and how those variables relate to each other. 

The journey a project must travel just to get NHS data (Figure 2) typically goes through the following stages: 

Multiple coloured boxes showing each stage a project has to go through to get NHS data, from initial grant application to data access
Figure 2: The stages a project goes through to get NHS data

Each of these stages can take several months, and they are usually sequential. It is not unheard of for projects to run out of time and/or money due to these lengthy timescales. 

Current synthetic NHS data: 

Recently, the NHS has released synthetic Hospital Episode Statistics (HES) data (available at https://digital.nhs.uk/services/artificial-data) which is, unfortunately, quite limited for practical purposes. This is because a very simple approach was adopted: each variable is randomly generated independently of all the others. While it is possible to infer broadly accurate descriptive statistics for single variables (e.g., age or sex), it is impossible to infer relations between variables (e.g., how the number of cancer diagnoses increases with age). In the terms introduced above, it has high privacy but low fidelity. As shown in the heatmap in Figure 3, we observe practically no association between diagnosis and treatment, because the synthetic NHS data are randomly generated variable by variable.  

Heat map with disease groupings on the right side and different treatments on the bottom
Figure 3: Heatmap displaying the relations between disease groupings (right side) and treatment (bottom) from the synthetic NHS data. The colour shadings represent the number of patients (e.g., the darker the shading, the higher the number). The similarity in shading within each diagnosis row shows that treatment and diagnosis were largely independent in this synthetic dataset.
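To make this concrete, here is a minimal sketch in Python (with entirely made-up diagnosis and treatment columns, not the real HES fields) of why sampling each variable independently preserves the marginal counts but destroys the relationships between variables:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy "real" dataset (hypothetical columns, not the actual HES fields):
# treatment depends strongly on diagnosis.
n = 10_000
diagnosis = rng.choice(["cancer", "fracture"], size=n)
treatment = np.where(
    diagnosis == "cancer",
    rng.choice(["chemotherapy", "surgery"], size=n, p=[0.8, 0.2]),
    rng.choice(["chemotherapy", "surgery"], size=n, p=[0.05, 0.95]),
)
real = pd.DataFrame({"diagnosis": diagnosis, "treatment": treatment})

# Variable-by-variable synthesis: shuffle each column independently.
# Marginal counts are preserved exactly, but the diagnosis-treatment
# relationship is destroyed.
synthetic = pd.DataFrame(
    {col: rng.permutation(real[col].to_numpy()) for col in real.columns}
)

print(pd.crosstab(real["diagnosis"], real["treatment"]))            # strong association
print(pd.crosstab(synthetic["diagnosis"], synthetic["treatment"]))  # roughly independent
```

In the real cross-tabulation the treatments cluster by diagnosis; in the shuffled version every treatment is spread roughly evenly across diagnoses, which is exactly the flat pattern seen in Figure 3.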

What do researchers want from synthetic data? 

We developed an anonymous survey and asked 230 researchers experienced with EHR data what would be important to them when considering using synthetic EHR data. Of the 24 who responded, most were epidemiologists at fellow or professor level. Researchers were then invited to an online discussion group to expand on insights from the survey. Seven researchers attended.   

Most researchers had more than three years of experience using EHRs, both within and outside of cohort studies. Although few had much knowledge of synthetic EHR data, many had heard of it and were interested in its application, particularly as a tool for training and learning about EHRs generally. 

The most important issues to researchers (Figure 4) were consistent patient details and having all the additional diagnosis & treatment codes rather than just the main ones: 

Horizontal bar chart showing different desirable qualities of synthetic EHR data against the number of responses.
Figure 4: What researchers look for in synthetic EHRs

The most important utility for these researchers was to test/develop code and understand the broad structure of the data, as shown below (Figure 5): 

Chart showing the priorities of researchers on a scale from first to last choice
Figure 5: Priorities of researchers when using synthetic data

This was reflected in their main concern: that the utility of the data should be maintained in the synthetic version through a high level of accuracy and attention to detail. 

During the discussion it was recognised that EHRs are “messy” and synthetic data should emulate this, providing an opportunity to prepare for real EHRs. 

Visual showing discussion points about emulating “messy” real data
Emulate “messy” real data discussion visual

Being able to prepare for the use of real EHRs was the main use case for synthetic data. No one suggested using the synthetic data as the analysis dataset in place of the real data.   

Visual showing factors to consider in relation to preparation for using real EHR data
Preparation for using real EHR data visual

It was suggested, in both survey responses and the discussion group, that any synthetic data should be bespoke to the requirements of each project. Further, it was observed that each research project only ever uses a portion of the complete dataset, so any synthetic data should be similarly minimised.  

“I think any synthetic data set based on any of the electronic health records should be stripped back to the key things that people use, because then the task of making it a synthetic copy [is] less.” (online participant) 

Summary

Following the survey and discussion with some researchers familiar with EHRs, a few key points came through: 

  • Training – using synthetic data to understand how EHRs work, and to develop code. 
  • Fidelity is important – using synthetic data as a way for researchers to experience using EHRs (e.g. the real data flaws, linkage errors, duplicates). 
  • Cost – the synthetic data set, and associated training, must be low cost and easily accessible.  

Next Steps

There is a demand for a synthetic data set with a higher level of fidelity than is currently available, and in particular a need for data which are much more consistent over time. 

The Co90s is well placed to respond to this demand, and will look to: 

Ask JGI Student Experience Profiles: Mike Nsubuga

Mike Nsubuga (Ask-JGI Data Science Support 2023-24) 

Embarking on a New Path 

Mike Nsubuga
Mike Nsubuga, first year PhD Student in Computational Biology and Bioinformatics

In the early days at Bristol, even before I began my PhD, I stumbled upon something extraordinary. AskJGI, a university initiative that provides data science support to researchers from all disciplines, caught my attention through a recruitment advert for data science support roles circulated by my PhD supervisor.

My journey started with hesitation. As a brand-new PhD student, who had just relocated to the UK, I questioned whether I was ready or suitable for such a role. Despite my reservations, my supervisor saw potential in me and encouraged me to seize this opportunity. Yielding to their encouragement, I applied, not fully realizing then how this decision would profoundly shape both my academic and professional paths. 

A World of Opportunities 

Joining AskJGI opened a door to a dynamic world brimming with ideas and innovations. My background in bioinformatics and computational biology meant that working on biomedical queries was particularly rewarding. These projects varied from analyzing protein expression data to studying infectious diseases, allowing me to use data science in meaningful ways. 

Among the initiatives I was involved in was developing models to predict protein production efficiency in cells from their genetic sequences. Our goal was clear yet impactful: to identify patterns in genetic sequences that indicate protein production efficiency. We employed advanced data analysis and machine learning techniques to achieve effective predictions. 

Additionally, I contributed to a project analyzing the severity of dengue infections by using statistical models to identify key biological markers. We pinpointed certain markers as critical for distinguishing between mild and severe cases of the infection. 

These projects showcased the transformative power of data science in understanding and potentially managing diseases, directly impacting public health strategies. 

Making Science Accessible: Community Engagement at City Hall

A highlight of my tenure with AskJGI was participating in Data Science Week at bustling Bristol City Hall. The event was not merely a showcase of data science but an opportunity to demystify complex concepts for the public. Engaging in lively discussions and simplifying intricate algorithms for curious visitors was incredibly fulfilling, especially seeing their excitement as they understood the concepts that are often discussed in our professional circles. 

Audience sitting in City Hall. Some audience members are raising their hands. There is a projector and a speaker at the front of the hall
AI and the Future of Society event as part of Bristol Data Week 2024

Fostering Connections and Gaining Insights 

AskJGI enhanced my technical skills and broadened my understanding of the academic landscape at the University of Bristol. The connections I forged were invaluable, sparking collaborations that would have been unthinkable in the more isolated environment of my earlier academic career. Reflecting on my transformative journey with AskJGI, I am convinced more than ever of the importance of interdisciplinary collaboration and the critical role of data science in tackling complex challenges. I encourage any researcher at the University of Bristol who is uncertain about their next step to explore what AskJGI has to offer. For PhD students looking to get involved, it represents not just a learning opportunity but a chance to make a significant societal impact. 

Ask JGI Student Experience Profiles: Emma Hazelwood

Emma Hazelwood (Ask-JGI Data Science Support 2023-24) 

Emma Hazelwood
Emma Hazelwood, final year PhD Student in Population Health Sciences

I am a final year PhD student in Population Health Sciences. I found out about the opportunity to support the JGI’s data science helpdesk through a friend who had done this job previously. I thought it sounded like a great way to do something a bit different, especially on those days when you need a bit of a break from your PhD topic.

I’ve learnt so many new skills from working within the JGI. The team are very friendly, and everyone is learning from each other. It’s also been very beneficial for me to learn some new skills, for instance Python, when considering what I want to do after my PhD. I’ve been able to see how the statistical methods that I know from my biomedical background can be used in completely different contexts, which has really changed the way I think about data. 

I’ve worked on a range of topics through JGI, which have all been as interesting as they have been different. I’ve helped people with coding issues, thought about new ways to visualise data, and discussed what statistical methods would be most suitable for answering research questions. In particular, I’ve loved getting involved with a project in the Latin American studies department, where I’ve been mapping key locations from conferences throughout the early 20th century onto satellite images, bringing to life the routes that the conference attendees would have taken. 

This has been a great opportunity working with a very welcoming team, and one I’d recommend to anyone considering it!

Ask JGI Student Experience Profiles: Emilio Romero

Emilio Romero (Ask-JGI Data Science Support 2023-24)

Emilio Romero
Emilio Romero, 2nd year PhD Student in Translational Health Sciences

Over the past year, my experience helping with the Ask-JGI service has been really rewarding. I was keen to apply as I wanted to get more exposure to the research world in Bristol, meet different researchers and explore with them different ways of working and approaching data.  

From a technical perspective, I had the opportunity to work on projects related to psychometric data, biological matrices, proteins, chemometrics and mapping. I also worked mainly with R and in some cases SPSS, which offered different alternatives for data analysis and presentation. 

One of the most challenging projects was working with chemometric concentrations of different residues of chemical compounds extracted from vessels used in human settlements in the past. This challenge allowed me to talk to specialists in the field and to work in a multidisciplinary way in developing data matrices, extracting coordinates and creating maps in R. The most rewarding part was being able to use a colour scale to represent the variation in concentration of specific compounds across settlements. This was undoubtedly a great experience and a technique that I had never had the opportunity to practice. 

Ask-JGI also promoted many events, especially Bristol Data Week, which allowed many interested people to attend courses at different levels specialising in the use of data analysis software such as Python and R. 

The Ask-JGI team have made this year an enjoyable experience. As a cohort, we have come together to provide interdisciplinary advice to support various projects. I would highly recommend anyone with an interest in data science and statistics to apply. It is an incredible opportunity for development and networking and allows you to immerse yourself in the wider Bristol community, as well as learning new techniques that you can use during your time at the University of Bristol. 

From Data to Discovery in Biology and Health

ELLIS Summer School on Machine Learning in Healthcare and Biology – 11 to 13 June 2024  

Huw Day, Richard Lane, Christianne Fernée and Will Chapman, data scientists from the Jean Golding Institute, attended the European Laboratory for Learning and Intelligent Systems (ELLIS) Summer School on Machine Learning for Healthcare and Biology at The University of Manchester. Across three days they learned about cutting edge developments in machine learning for biology and healthcare from academic and industry leaders.

A major theme of the event was how researchers create real knowledge about biological processes at the convergence of deep learning and causal inference techniques. Through this, machine learning can be used to bridge the gap between well-controlled lab experiments and the messy real world of clinical treatments.

Advancing Medical Science using Machine Learning

Huw’s favourite talk was “From Data to Discovery: Machine Learning’s Role in Advancing (Medical) Science” by Mihaela van der Schaar, who is a Professor of ML, AI and Medicine at the University of Cambridge and Director of the Cambridge Centre for AI in Medicine.

Currently, machine learning models are excellent at deducing associations between variables. However, Mihaela argued that we need to go beyond this to understand the variables and their relationships, so that we can discover so-called “governing equations”. In the past, human experts have discovered governing equations by applying domain knowledge, intuition and insight to the underlying data.

The speaker’s research group have been working to deduce different types of underlying governing equations from black box models. They have developed techniques to extract explicit functions as well as more involved functional equations and various types of ordinary and partial differential equations.

On the left are three graphs showing temporal effects of chemotherapy on tumour volume for observed data, D-CODE and SR-T. On the right are the actual equations for the D-CODE and SR-T plots on the left.
Slide 39 from Mihaela van der Schaar’s talk, showing observed data of the effects of chemotherapy on tumour volume over time, and two examples of derived governing equations: plots on the left, with the actual equations written out on the right
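As a toy illustration of the general idea (this is not D-CODE or SR-T, just a plain least-squares fit over an assumed library of candidate terms), one can recover a simple governing equation such as dy/dt = -k * y from noisy observations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy equation discovery: observe noisy y(t) generated by dy/dt = -k * y,
# then try to recover that governing equation from the data alone.
k_true = 0.5
t = np.linspace(0, 10, 201)
y = 3.0 * np.exp(-k_true * t) + rng.normal(scale=0.01, size=t.size)

# Estimate the derivative numerically, then regress it on a small library
# of candidate terms (y, y^2 and a constant).
dydt = np.gradient(y, t)
library = np.column_stack([y, y**2, np.ones_like(y)])
coeffs, *_ = np.linalg.lstsq(library, dydt, rcond=None)

print(coeffs)  # expect roughly [-0.5, ~0, ~0], i.e. dy/dt ≈ -0.5 * y
```

The methods described in the talk tackle far harder settings (noisier data, unknown functional forms, differential equations in several variables), but the sketch shows the basic shape of the problem: propose candidate terms and test them against the data.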

The implications for healthcare applications are immense if these methods can be reliably integrated into our existing machine learning analyses. On a more philosophical note, it raises interesting questions about how many systems in the life sciences (and beyond) have governing equations, and what proportion of those equations it is possible to discover.

Gaussian processes and simulation of clinical trials

A highlight for Will and Christianne was the informative talk from Ti John, a practical introduction to Gaussian Processes (GPs) which furthered our understanding of how GPs learn non-linear functions from a dataset. By assuming that your data are a collection of realisations of some unknown random function (or combination of functions), and with a judicious choice of kernel, Gaussian Process modelling allows long-term trends to be separated from short-term fluctuations in noisy data. The presentation was enhanced with an interactive visualisation of GPs, alongside an analysis of how the blood glucose response to meals changes after bariatric surgery.

Another highlight was Alejandro Frangi’s talk on in silico clinical trials, in which he described how mechanistic modelling (like fluid dynamic simulations of medical devices) can be combined with generative modelling (to synthesise a virtual patient cohort) to explore how medical treatments may perform in patients who would never have qualified for inclusion in a real randomised controlled trial.
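For readers who want to see the trend-versus-fluctuation idea from the GP talk in code, here is a minimal sketch using scikit-learn on a made-up signal (the kernel choices, data and hyperparameters are illustrative assumptions, not those from the talk):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)

# Made-up signal: a slow trend plus fast fluctuations plus observation noise.
X = np.linspace(0, 10, 120).reshape(-1, 1)
y = np.sin(0.5 * X[:, 0]) + 0.2 * np.sin(5.0 * X[:, 0]) + rng.normal(scale=0.1, size=120)

# A sum of kernels: a long length-scale RBF for the trend, a short length-scale
# RBF for the fluctuations, and a WhiteKernel for the noise. Fitting the kernel
# hyperparameters lets the model attribute variation to each component.
kernel = RBF(length_scale=5.0) + RBF(length_scale=0.3) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

mean, std = gp.predict(X, return_std=True)
print(gp.kernel_)         # optimised kernel hyperparameters
print(mean[:3], std[:3])  # posterior mean and uncertainty at the first few points
```

Printing gp.kernel_ shows how the fitted length-scales split the signal into slow and fast components, which is the same kind of decomposition used to pull apart long-term trends and short-term responses in noisy clinical measurements.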

Causality

Richard’s favourite talk was by Cheng Zhang from Microsoft Research on causal models, titled “Causality: From Statistics to Foundation Models”. Cheng highlighted that an understanding of causality is vital for intelligent systems that have a role in decision-making. This area is on the cutting edge of research in AI for biology and healthcare – understanding consequences is necessary for a model that should propose interventions. While association (statistics) is still the main use case for AI, such models have no model of the “true” nature of the world that they are reflecting, which leads to hallucinations such as images with too many fingers or nonsensical text generation. One recipe proposed by Cheng for building a causally-aware model was to:

  • Apply an attention mechanism/transformer to data so that the model focuses only on the most important parts
  • Use a penalised hinge loss – the model should learn from its mistakes, and should account for some mistakes being worse than others
  • Read off optimal balancing weights + the causal effect of actions – after training, we need to investigate the model to understand the impact of different actions.

In essence, this is a blueprint for building a smart system that can look at a lot of complex data, decide what’s important, learn from its mistakes efficiently and help us understand the impact of different actions. As an example, we might be interested in how the amount of time spent studying affects students’ grades. At the outset, it’s hard to say whether studying more causes better grades, because students who study more might also have other good habits, have access to more resources, and so on. Such a model would try to balance these factors and give a clearer picture of what causes what: effectively, it would try to find the true effect of studying more on a student’s grade, after removing the influence of other habits.
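As a rough sketch of the “balancing weights + causal effect” step, here is the studying-and-grades example estimated with classical inverse propensity weighting on simulated data (a plain logistic regression stands in for the propensity model; this is not Cheng’s transformer-based recipe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Simulated example: "good habits" confound both the decision to study and the
# grade, so a naive comparison of studiers vs non-studiers is biased upwards.
n = 5_000
good_habits = rng.normal(size=n)
studies = (good_habits + rng.normal(size=n) > 0).astype(int)      # confounded choice
grade = 60 + 5 * studies + 4 * good_habits + rng.normal(size=n)   # true effect = +5

naive = grade[studies == 1].mean() - grade[studies == 0].mean()

# Balancing via inverse propensity weighting: estimate how likely each student
# was to study given their habits, then reweight so the two groups are comparable.
propensity = (
    LogisticRegression()
    .fit(good_habits.reshape(-1, 1), studies)
    .predict_proba(good_habits.reshape(-1, 1))[:, 1]
)
weights = np.where(studies == 1, 1 / propensity, 1 / (1 - propensity))
weighted = (
    np.average(grade[studies == 1], weights=weights[studies == 1])
    - np.average(grade[studies == 0], weights=weights[studies == 0])
)

print(f"naive difference: {naive:.1f}, weighted estimate: {weighted:.1f}")
```

After weighting, the groups are balanced on the confounder, so the weighted difference should land close to the true effect of +5 rather than the inflated naive estimate.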

This behaviour is desired for the complex models being developed for healthcare and biology; for example, we may be interested in engineering CRISPR interventions to make crops more resilient to climate change or developing brain stimulation protocols to help with rehabilitation. A model proposing interventions for these should have a causal understanding of which genes impact a trait, or how different patterns of brain activity affect cognitive function.

Recordings of all the talks can be found here.