JGI Seed Corn Funding Project Blog 2023/24: Mark Mumme, Eleanor Walsh, Dan Smith, Huw Day and Debbie Johnson
What is Children of the 90s?
Children of the 90s (Co90s) is a multi-generational population-based study following the health and development of nearly 15,000 families living around Bristol, whose children were born in 1991 and 1992.
Co90s initially recruited its participants during the early stages of the mum’s pregnancy and captures information prospectively, at key time points, using self-reported questionnaires, interviews, clinics and electronic health records (EHR).
Co90s supports about 20 project teams using NHS data at any one time.
What is Synthetic Data?
At its most basic, synthetic data is information generated artificially rather than recorded directly from real-world events. It is essentially a computer-generated version of the data that contains no real records, thereby preserving privacy and confidentiality.
Privacy vs Fidelity
Generating synthetic data is frequently a balancing act between fidelity and privacy (Figure 1).
“Fidelity”: how well does the synthetic data represent the real-world data?
“Privacy”: can personal information be deduced from the synthetic data?
Why synthetic NHS data:
EHR data are incredibly valuable and rich data sources, yet there are significant difficulties in accessing them, including financial costs and the time taken to complete multiple application forms and have these approved.
Because the authentic NHS data is so difficult to access, it is also not unusual for researchers to have never worked with, or possibly even seen, this type of data before. They often face a learning curve to understand how the data is structured, what variables are present in the data and how those variables relate to each other.
Just to obtain NHS data, a project typically has to pass through the following stages (Figure 2):
Each of these stages can take several months, and they are usually sequential. It is not unheard of for projects to run out of time and/or money due to these lengthy timescales.
Current synthetic NHS data:
Recently, the NHS has released synthetic Hospital Episode Statistics (HES) data (available at https://digital.nhs.uk/services/artificial-data), which is, unfortunately, quite limited for practical purposes. This is because a very simple approach was adopted: each variable is randomly generated independently of all others. While it is possible to infer broadly accurate descriptive statistics for single variables (e.g., age or sex), it is impossible to infer relations between variables (e.g., how the number of cancer diagnoses increases with age). In the terms introduced above, it has high privacy but low fidelity. As the heatmap in Figure 3 shows, there is practically no association between diagnosis and treatment, because the synthetic NHS data is randomly generated variable by variable.
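To make the point about independent generation concrete, here is a minimal Python sketch. It is not the NHS method or ours: the toy "real" data, the age range, the link between age and diagnosis, and all variable names are assumptions made up purely for illustration. It resamples each column on its own and then compares the age/diagnosis correlation before and after.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy "real" records (invented): the chance of a diagnosis rises with age.
ages = [rng.randint(20, 90) for _ in range(5000)]
cancer = [1 if rng.random() < (age - 20) / 100 else 0 for age in ages]

# Variable-by-variable synthesis: each column is resampled on its own,
# so no synthetic row keeps the pairing between a real age and diagnosis.
synth_ages = [rng.choice(ages) for _ in ages]
synth_cancer = [rng.choice(cancer) for _ in cancer]

real_r = pearson(ages, cancer)
synth_r = pearson(synth_ages, synth_cancer)

print(f"mean age    real={sum(ages) / len(ages):.1f}  synthetic={sum(synth_ages) / len(synth_ages):.1f}")
print(f"age/diagnosis correlation  real={real_r:.2f}  synthetic={synth_r:.2f}")
```

In this toy setting the single-variable summaries (such as mean age) stay close to the original, while the correlation between age and diagnosis drops to roughly zero, which is the pattern described above.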
What do researchers want from synthetic data?
We developed an anonymous survey and asked 230 researchers experienced with EHR data what would be important to them when considering using synthetic EHR data. Of the 24 who responded, most were epidemiologists at fellow or professor level. Researchers were then invited to an online discussion group to expand on insights from the survey. Seven researchers attended.
Most researchers had more than 3 years of experience using EHRs, both within and outside of cohort studies. Although few had much knowledge of synthetic EHR data, many had heard of it and were interested in its application, particularly as a tool for training and learning about EHRs generally.
The most important issues to researchers (Figure 4) were consistent patient details and having all the additional diagnosis & treatment codes rather than just the main ones:
The most important uses for these researchers were to test and develop code and to understand the broad structure of the data, as shown below (Figure 5):
This was reflected in their main concerns, which were about maintaining the utility of the data in the synthetic version through a high level of accuracy and attention to detail.
During the discussion it was recognised that EHRs are “messy” and synthetic data should emulate this, providing an opportunity to prepare for real EHRs.
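As a rough illustration of what emulating "messy" records could look like, the sketch below takes clean toy records and adds duplicated rows and missing values. Everything in it is hypothetical: the field names (patient_id, diag_code), the error rates and the add_mess helper are assumptions for illustration, not part of any Co90s or NHS pipeline.

```python
import random

def add_mess(records, dup_rate=0.05, missing_rate=0.10, seed=0):
    """Return a new list of records with some blanked fields and duplicate rows.

    The input list is left untouched. The rates are illustrative guesses,
    not estimates of real EHR error rates.
    """
    rng = random.Random(seed)
    messy = []
    for record in records:
        row = dict(record)  # copy so the clean record is not modified
        if rng.random() < missing_rate:
            row["diag_code"] = None  # simulate a missing diagnosis code
        messy.append(row)
        if rng.random() < dup_rate:
            messy.append(dict(row))  # simulate the same row recorded twice
    return messy

# Hypothetical clean synthetic records; "C50" is used only as a sample code.
clean = [{"patient_id": i, "diag_code": "C50"} for i in range(1000)]
messy = add_mess(clean)

n_missing = sum(1 for row in messy if row["diag_code"] is None)
print(f"{len(clean)} clean rows -> {len(messy)} messy rows, {n_missing} with a missing code")
```

A researcher could then practise de-duplication and missing-data handling on output like this before meeting the same problems in real EHRs.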
Being able to prepare for the use of real EHRs was the main use case for synthetic data. No one suggested using the synthetic data as the analysis dataset in place of the real data.
It was suggested, in both survey responses and the discussion group, that any synthetic data should be bespoke to the requirements of each project. Further, it was observed that each research project only ever uses a portion of the complete dataset, so the synthetic data should likewise be kept to a minimum.
“I think any synthetic data set based on any of the electronic health records should be stripped back to the key things that people use, because then the task of making it a synthetic copy [is] less.” (online participant)
Summary
Following the survey and discussion with researchers familiar with EHRs, a few key points came through:
- Training – using synthetic data to understand how EHRs work, and to develop code.
- Fidelity is important – using synthetic data as a way for researchers to experience using EHRs (e.g. the real data flaws, linkage errors, duplicates).
- Cost – the synthetic data set, and associated training, must be low cost and easily accessible.
Next Steps
There is a demand for a synthetic data set with a higher level of fidelity than is currently available, and particularly there is a need for data which is much more consistent over time.
Co90s is well placed to respond to this demand, and will look to:
- Obtain approximately 10 years’ worth of NHS data – record level but pseudonymised.
- Explore and evaluate new methods for synthesising complex EHR data (for example statistical and machine learning methods from reviews such as: https://www.sciencedirect.com/science/article/pii/S0925231222004349?casa_token=PctLs_5KiZYAAAAA:VvibS1nKYZ6uZHuYKwkprs2Aah4C33lY-riaS0bwX801IyNP9pZ7Pw_rR__9quz0hp0HTe_0vQ).
- Focus on synthesising the key aspects requested by the researchers while maintaining the nuances and errors of the original EHR data.
- Develop an outline syllabus for a short course on NHS EHRs with synthetic data at the centre.