Data analysis – Jean Golding Institute News

Using topic modelling to study lived experiences in extreme weather conditions

Posted on 12 January 2026 by James Thomas

In 2025, Huw Day (formerly JGI Data Scientist, now Research Associate in Digital Health at the VIVO Hub for Enhanced Independent Living) completed a project working with Eunice Lo and Joanne Godwin, helping them with a study of lived experiences in extreme climate conditions in the South West of England.

This project has now been published in Weather and Climate Extremes, and in this blog, Huw explains his role in the project and the work he did.

What was your role?

Part of this survey included asking participants about adaptations they made during extreme weather events, such as heatwaves, storms, floods and cold snaps. My role was to extract discussion themes from this large volume of free-text responses. This analysis forms part of the recently published work!

In both panels, the number in each theme indicates the total number of unique mentions of actions within the theme (e.g., repeated mentions of an action in response to the same weather hazard by the same respondent only count as one action). The vertical length of each theme is proportional to the number of actions within it, so themes with few actions may be difficult to read. Examples include “Other: 52” in the orange box at the bottom of panel (a), and “Check in on people: 35” in the yellow box within “Precaution: 360” in panel (b). — **Adaptation actions by survey respondents.** Adaptation actions in response to (a) cold weather and warnings and (b) stormy weather and warnings, clustered into major themes (middle column) and minor themes (right column). *Figure 6 from Lo et al. (December 2025) CC BY 4.0.*

What is topic modelling?

Topic modelling is a natural language processing (NLP) task of taking in a large collection of documents (which could be social media posts, paper titles/abstracts or survey responses) and sorting them into groups. Given those groups of documents, we then seek to extract some sort of meaning from those groups. Are there certain words that are both unique and also common to a particular group of documents, for example?

When is it useful?

Qualitative data tells stories but quantitative data puts it in context, so topic modelling is a nice way to get an idea of what topics are being mentioned and how frequently. Understanding the lived experiences of individuals is important, but knowing how many individuals share that experience can help frame it further. In contrast, if you only know the number of people affected but not the specifics of their experience, then it can be hard to know how they might need to be supported.

If you have a huge amount of text data then it’s costly and time-consuming to do manual qualitative analysis on it all. Topic modelling (among other NLP tools) can help you focus your manual efforts and put manual qualitative analysis into wider context.

What tools did you use?

I made use of BERTopic for topic modelling. It works by converting sentences of text into high-dimensional vectors in an embedding space, projecting those vectors down into a lower-dimensional space whilst preserving local and global structure, and then clustering the projected vectors.

Each cluster contains a collection of sentences, so you use something called a bag-of-words approach, where you consider how often words appear in each cluster. If a word appears a lot across many different clusters, it is probably a common word in your documents and wouldn’t be a good candidate to represent a particular cluster. If a word appears frequently in your favourite cluster but seldom in the other clusters, then it’s a good candidate for a label to describe the contents of your favourite cluster.
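The intuition in the previous paragraph can be sketched in a few lines of plain Python. This is a toy illustration only: it is not BERTopic's own code (BERTopic uses a class-based TF-IDF weighting), and the clusters, sentences and the `score` function are invented for the example.

```python
import math
from collections import Counter

# Hypothetical clusters of survey-style sentences (invented for illustration).
clusters = {
    "heating": [
        "put the heating on",
        "kept the heating on overnight",
        "wore extra layers and put the heating on",
    ],
    "travel": [
        "avoided travel in the storm",
        "cancelled travel plans",
        "put off travel until the storm passed",
    ],
}


def word_counts(sentences):
    """Bag of words: count every word across all sentences in one cluster."""
    return Counter(word for s in sentences for word in s.lower().split())


counts = {name: word_counts(sents) for name, sents in clusters.items()}


def score(word, cluster):
    """High when a word is frequent in `cluster` but appears in few clusters."""
    tf = counts[cluster][word] / sum(counts[cluster].values())
    clusters_with_word = sum(1 for c in counts.values() if word in c)
    idf = math.log(1 + len(counts) / clusters_with_word)
    return tf * idf


def top_words(cluster, n=3):
    """Rank the words of a cluster by score, best candidates first."""
    return sorted(counts[cluster], key=lambda w: score(w, cluster), reverse=True)[:n]


print(top_words("travel"))
```

A word like "the" is frequent in both toy clusters, so it scores lower than "heating" or "travel", which are frequent in only one. In practice you would also remove stop words before counting.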

Where can people find out more?

The paper Sociodemographic vulnerability to cold and stormy weather in relation to health and life is published in Weather and Climate Extremes (December 2025).

There is a public GitHub repo for this project which includes:

Code for data loading (not the actual data though), including splitting longer phrases for more representative embeddings whilst keeping track of who said what for visualisations.
Code for running BERTopic for topic modelling, commented to explain how you can fine tune relevant parameters depending on the size and topic diversity of your documents.
A slide deck for a talk I gave on topic modelling to the climate dynamic group at Bristol. The slide deck notes include lots of links to nice explainers on relevant topics from some of my favourite content creators like StatQuest and 3Blue1Brown.

There is also a recording of my talk A Crash Course in Topic Modelling with BERTopic on the JGI YouTube channel.

Irish Financial Records from the Reign of Edward I: What Applying Data Science Techniques Can Reveal

Posted on 2 December 2025 by kerry.turcsi

JGI Seedcorn follow on funding 2023-25: Mike Jones and Brendan Smith

Introduction

This project was a follow-on extension of the JGI-funded seed-corn initiative titled ‘Digital Humanities meets Medieval Financial Records: The Receipt Rolls of the Irish Exchequer.’ The original project, along with a subsequent paper, ‘The Irish Receipt Roll of 1301–2: Data Science and Medieval Exchequer Practice,’ focused on a single receipt roll from the 1301–2 financial year. Building on this foundation, the follow-on project aimed to enhance software and techniques across a larger collection of receipt rolls from Edward I’s reign (1272–1307), offering broader insights into medieval financial practices. However, developing the scripts and troubleshooting errors took longer than expected, which reduced the time available for more in-depth analysis. Nevertheless, we managed to develop a data processing pipeline that allowed a broad analysis of the receipt rolls.

Data

The Irish Exchequer was a government institution responsible for collecting and disbursing income within the lordship of Ireland on behalf of the English Crown. Receipt rolls documented the money received each day by the Irish Exchequer from crown officials, private individuals, and communities. The entries in the rolls consisted of heavily abbreviated Medieval Latin.

There are forty surviving receipt rolls from the reign of Edward I held at the National Archives (TNA) in London. The Virtual Record Treasury of Ireland (VRTI) has translated the rolls from Latin into English for Edward I and later reigns. They have also encoded the translations into TEI/XML (https://tei-c.org), creating a machine-readable and structured digital corpus. The translations and high-quality images of the original documents are accessible to the public on the VRTI website. We gained early access to the TEI/XML documents for Edward I’s reign, which formed the foundation of our data corpus.

Data processing pipeline

To analyse the data, it was first necessary to parse the TEI/XML files and generate comma-separated (CSV) files that could be processed by Pandas, the standard Python library for data analysis, which would then allow us to create plots and visualisations with Matplotlib and Seaborn.
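As a rough illustration of this parsing step, the sketch below reads a small TEI-namespaced fragment with Python's standard library and writes CSV text. The element layout (`div`, `head`, `p`) is an assumption made for the example and is not the VRTI encoding. The Limerick entry is the example quoted in this post, and the other entries are placeholders.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: this layout is invented for the sketch.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="source">
      <head>Limerick</head>
      <p>From the debts of various people of Co. Limerick by James Keting: £40</p>
    </div>
    <div type="source">
      <head>Dublin</head>
      <p>Placeholder entry A: £10</p>
      <p>Placeholder entry B: £2</p>
    </div>
  </body></text>
</TEI>"""


def extract_rows(xml_text):
    """Return one dict per proffer, carrying the marginal heading as its source."""
    root = ET.fromstring(xml_text)
    rows = []
    for div in root.iterfind(".//tei:div[@type='source']", TEI_NS):
        source = div.findtext("tei:head", default="", namespaces=TEI_NS).strip()
        for p in div.iterfind("tei:p", TEI_NS):
            # Collapse any internal whitespace left over from the markup.
            rows.append({"source": source, "details": " ".join((p.text or "").split())})
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["source", "details"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(extract_rows(SAMPLE)))
```

The resulting CSV text could then be loaded with Pandas for plotting.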

Each payment given to the Irish Exchequer is called a proffer. Each row in the CSV should represent an individual proffer and should include several pieces of information, including:

The financial term. The year was divided into four terms – Michaelmas, Hilary, Easter and Trinity
The date of the proffer, e.g., ‘1286-09-30’
The day of the proffer, e.g., ‘Monday’
The source of the proffer, which is a marginal heading in the roll, e.g., ‘Limerick’
The details of the proffer, e.g., ‘From the debts of various people of Co. Limerick by James Keting: £40’
The extracted monetary offering, e.g., £40
The extracted monetary offering converted to pence, e.g., 9600.0
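The last two fields can be illustrated with a short sketch. It assumes amounts are written in pounds, shillings and pence (£1 = 20 shillings = 240 pence) in forms such as ‘£40’ or ‘£2 13s 4d’, and that the amount follows the final colon of the entry. The function names are hypothetical and the project's own script may work differently.

```python
import re

PENCE_PER_SHILLING = 12
PENCE_PER_POUND = 240  # 20 shillings of 12 pence each


def extract_amount(details):
    """Take the text after the final colon as the amount (an assumption)."""
    return details.rsplit(":", 1)[-1].strip()


def to_pence(amount):
    """Convert an amount such as '£40' or '£2 13s 4d' to pence.

    Returns None when no pounds, shillings or pence can be found.
    """
    pounds = re.search(r"£(\d+)", amount)
    shillings = re.search(r"(\d+)s\b", amount)
    pence = re.search(r"(\d+)d\b", amount)
    if not (pounds or shillings or pence):
        return None
    total = 0
    if pounds:
        total += int(pounds.group(1)) * PENCE_PER_POUND
    if shillings:
        total += int(shillings.group(1)) * PENCE_PER_SHILLING
    if pence:
        total += int(pence.group(1))
    return total


entry = "From the debts of various people of Co. Limerick by James Keting: £40"
print(to_pence(extract_amount(entry)))  # 9600
```

The sketch returns an integer; the ‘9600.0’ shown above suggests the project stores the value as a floating-point number.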

The pipeline consists of three stages: (1) generate a CSV for each roll; (2) categorise the proffers, for example, whether they relate to profits of justice or rents; and (3) merge all the CSV files into a single ‘mega’ file.

The development of the data processing pipeline in Python was an iterative process. The script was initially written to parse the 1301–2 roll. Although the TEI/XML encoding provided structure, not all the rolls adhered to the composition of the later receipt rolls. For instance, the earlier rolls do not record dates, and some rolls were only partially complete. Consequently, significant time was spent repeatedly refining the script to accommodate the different rolls, allowing us to establish a consistent CSV format.

Part of the iterative development involved error checking, which means verifying the total income calculated from the CSV files against the totals given by the Exchequer clerk on the original roll. Ideally, the values should be either identical or have only minor differences. If the computed total is lower, this may be due to details of the proffers being lost because of damage to the original roll. Computed totals might be higher if additional proffers were added to the roll after the clerk provided the total. Either could indicate parsing errors in the TEI/XML, and any discrepancies require investigation.
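A minimal sketch of such a check is shown below. It assumes the proffers for a roll are already available in pence and that the clerk's total has been transcribed separately. The function name, the tolerance parameter and the figures are invented for illustration.

```python
def check_roll(proffers_pence, clerk_total_pence, tolerance_pence=0):
    """Compare the computed total for a roll with the clerk's stated total.

    Returns (computed total, difference, status), all amounts in pence.
    """
    computed = sum(proffers_pence)
    difference = computed - clerk_total_pence
    if abs(difference) <= tolerance_pence:
        status = "match"
    elif difference < 0:
        # Possible causes: proffers lost to damage, or a parsing error.
        status = "computed lower"
    else:
        # Possible causes: proffers added after the clerk's total, or a parsing error.
        status = "computed higher"
    return computed, difference, status


print(check_roll([9600, 240], 9840))  # (9840, 0, 'match')
print(check_roll([9600], 9840))       # (9600, -240, 'computed lower')
```

Any roll whose status is not "match" would then be flagged for manual investigation.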

A plot showing the comparison between the total provided by the Exchequer clerk, versus the computed total. Most are matching, but there are outliers where the computed total is higher, which needed extra investigation. — *A plot showing the comparison between the total provided by the Exchequer clerk, versus the computed total.*

The error checking facilitated a productive conversation between the project and VRTI, enabling the identification of errors caused by typos in the translations and markup. It also highlighted interesting features in the original rolls. For example, for E 101/230/28, the computed total was significantly greater than that provided by the clerk. The archivists at the TNA re-examined the roll and postulated that membranes from other rolls had been sewn onto this roll during repairs in the Victorian period or later.

Early access to the TEI/XML documents likely meant that more errors were encountered, as not all documents had undergone the whole VRTI editorial process. This resulted in significant time being spent tracking errors, which was not anticipated when the JGI project was conceived.

Analysis and Visualisations

Limitation in scope

After the data was processed, it became possible to analyse and visualise the proffers to the Irish Exchequer. There are 40 existing rolls for the reign. However, due to resource constraints, the analysis is limited to the 21 rolls that are ‘general’ in nature, meaning those relating to proffers from various sources and for different reasons. It does not cover the specialised rolls, such as those related to taxation.

The ‘landscape’ of the rolls

One of the initial visualisations created was to understand the ‘landscape’ of the rolls, specifically what had survived and what had not. In the subsequent plot, we display for each financial year whether we have data for each financial term or whether payments were received outside of those terms. A red box with a tick indicates we have data, and a white box with a cross indicates a gap. As you can see, there are gaps in survival (1281–82, 1283–84, 1289–90, 1297–98, 1302–03, and 1303–04), as well as years with only partial survival (1284–85, 1294–95, 1304–05).

A plot showing the availability of data per financial year and terms. Most years are complete, but some, such as 1281–1282, are missing, or incomplete, like 1284–1285. — *A plot showing the availability of data per financial year and terms.*

However, even this does not provide a complete picture since 1280–1 has an incomplete entry for Michaelmas.

Annual and termly totals

Our dataset does not encompass all income received by the Crown. As noted, some years are missing or contain only partial data, and we do not include additional rolls related to specific sources of income, such as taxation. The subsequent plot depicts the total income from our available data for each financial year, not the actual income received by the Crown.

A plot showing total computed incomes recorded on the receipt rolls per financial year. Most complete years are approaching or over £5000. — *A plot showing total computed incomes recorded on the receipt rolls per financial year.*

We can break down the total income into what was received per term for each financial year. The data is presented as a heatmap, with the darker colours indicating a greater amount of income received. Different terms received the most income in various years. For example, Michaelmas in 1285–86, 1286–87, 1288–89; Easter in 1282–83, 1291–92, 1292–93, 1301–02; and Trinity in 1306–07.

A heatmap showing the total income per financial term and year. Different years have different terms providing the most income. For example, Michaelmas for 1285–1286, 1286–1287 — *A heatmap showing the total income per financial term and year.*

The following plot shows the income received per term as a percentage of the total recorded income for each financial year.

A plot showing the total received per term as a percentage of the year's total. — *A plot showing the total received per term as a percentage of the year’s total.*

Unlike the 1301–2 roll examined in the first project, Easter was not always the term that generated the highest income. However, similar to the 1301–2 roll, we can see in the following plot that, in terms of the number of proffers received each term as a percentage of the financial year, Michaelmas was often the busiest term.

A plot showing the number of proffers per term as a percentage of the year's total. Michaelmas is often the busiest in the number of proffers received. — *A plot showing the number of proffers per term as a percentage of the year’s total.*
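The two percentage views above (share of income and share of proffers per term) can be computed from the same rows. The sketch below uses only the standard library. The column names and the figures are invented for illustration and are not taken from the project's CSV files.

```python
from collections import defaultdict

# Invented rows: one dict per proffer, with a hypothetical column layout.
rows = [
    {"financial_year": "Year A", "term": "Michaelmas", "pence": 2400},
    {"financial_year": "Year A", "term": "Michaelmas", "pence": 2400},
    {"financial_year": "Year A", "term": "Michaelmas", "pence": 1200},
    {"financial_year": "Year A", "term": "Easter", "pence": 18000},
]


def term_shares(rows):
    """Return {year: {term: (income %, proffer-count %)}}."""
    income = defaultdict(lambda: defaultdict(int))
    count = defaultdict(lambda: defaultdict(int))
    for row in rows:
        income[row["financial_year"]][row["term"]] += row["pence"]
        count[row["financial_year"]][row["term"]] += 1

    shares = {}
    for year in income:
        total_income = sum(income[year].values())
        total_count = sum(count[year].values())
        shares[year] = {
            term: (
                100 * income[year][term] / total_income,
                100 * count[year][term] / total_count,
            )
            for term in income[year]
        }
    return shares


print(term_shares(rows))
```

In this invented year, Michaelmas has 75% of the proffers but only 25% of the income, which is the kind of contrast the two plots bring out.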

Types of business

The proffers were categorised into five broad categories, namely, ‘farms and rents’, ‘profits of justice’, ‘customs’, ‘profits of escheatry, wardships, and temporalities’, and ‘other revenues’. The following plot shows the total income received per category for each financial year. By far, the greatest source of income is from the ‘profits of justice’ category.

A graph showing the income received per broad category. Profits of justice are by far the most outstanding category. — *A graph showing the income received per broad category.*

A plot showing the income received by category as a percentage. Profits of Justice accounts for over 50% of the business. — *A graph showing the income received by category as a percentage.*

Further work is required here, such as distinguishing the profits of justice into fines and amercements: a fine was a voluntary payment made to the king to gain favour or a privilege, such as obtaining a royal writ, whereas an amercement was a financial penalty imposed by the king or a court.

Sources of income

All the rolls specify the ‘source’ of a proffer, often a place, e.g., ‘Dublin.’ However, it can also refer to a group or other entity, e.g., ‘English debts of the merchants of Lucca’, or a specific cause, e.g., ‘By writ of England.’ The following plot shows the total income received per source in the dataset, for the twenty sources that recorded the most proffers. Dublin accounts for by far the largest number of individual proffers.

A plot showing income from the top 20 sources. Dublin is the largest source of income, accounting for over £8,000, with Cork returning £7,000. — *A plot showing income from the top 20 sources.*

Conclusion

Like other Digital Humanities projects, this initiative relied heavily on human labour, especially from archivists and historians who translated the original Latin documents into English and encoded those translations into TEI/XML documents. Although we could process machine-readable datasets, extra effort was needed to clean the data and ensure its accuracy. This additional work was understandable, as the VRTI TEI/XML was created to support a digital edition of the receipt rolls rather than for statistical analysis. However, this limited the time available for detailed analysis, with most work focusing on understanding what was present in the datasets, their limitations due to document loss, and providing a general overview of the payments received. Nonetheless, the project demonstrated opportunities to develop and explore further research questions with additional funding and time.

The project was undertaken by Mike Jones of Research IT and Brendan Smith of the Department of History, with the assistance of Elizabeth Biggs of the Virtual Record Treasury of Ireland and Paul Dryburgh of The National Archives, UK.

AI in Health Awardees 2025-2026

Posted on 19 November 2025 (updated 21 November 2025) by kerry.turcsi

Over the summer, the Elizabeth Blackwell (EBI) and Jean Golding (JGI) Institutes, together with the Faculty of Health and Life Sciences strategic research support fund and University Hospitals Bristol and Weston NHS Foundation Trust, ran a pump-priming funding call to support innovative applications of AI in health or biomedical research. This funding call came with the expectation that the funded activities would provide a basis for developing and submitting external bids for future research programmes and projects that use or address AI in health and biomedical research contexts.

We are excited to announce 10 projects involving more than 30 researchers supported by this funding. Check out the successful awardees and their projects below.

AI-assisted personalisation of neurostimulation

Petra Fischer, School of Physiology, Pharmacology and Neuroscience
Conor Houghton, School of Engineering Mathematics and Technology

*Left to right: Petra Fischer and Conor Houghton*

Dystonia is a heterogeneous neurological disorder that causes involuntary muscle contractions, often resulting in pain and severely restricted movement, and affects millions of people worldwide.

A key challenge in neuroscience is understanding how brain networks process sensory input to control movement. Neural synchronisation plays a vital role in organising this activity, occurring both locally and across distant regions. Excessive synchronisation is linked to disorders like dystonia, Parkinson’s, and schizophrenia, and targeted modulation has emerged as a promising therapy.

Dr Fischer’s lab uses non-invasive, phase-specific vibrotactile stimulation to selectively enhance or disrupt synchronisation in dystonia, with the aim of improving symptoms. It is still unclear whether local or interregional modulation drives therapeutic effects. The interdisciplinary team will use existing dystonia data and AI-assisted analysis to:

Map brain-wide effects of stimulation, and
Predict outcomes and effective stimulation parameters based on neural data to replace a trial-and-error based search procedure

Findings will support an MRC funding application to develop a clinical stimulation tool, with potential extension to other techniques like transcranial electrical stimulation for direct cortical targeting.

AI-Organoid: A Smart Predictive Platform for Advanced Neurological Modelling

James Armstrong, Bristol Medical School
Qiang Liu, Engineering Mathematics & Technology

*Left to right: James Armstrong and Qiang Liu*

This interdisciplinary project brings together biomedicine and AI to develop AI-Organoid, a predictive, interpretable platform for tracking and forecasting the development of organoids—lab-grown cell models that mimic human organs. Organoids are vital for reducing animal testing and studying human-specific diseases, but their inconsistent growth limits pharmaceutical applications.

Building on promising pilot data (74% accuracy in predicting brain organoid outcomes), the project will refine AI-Organoid to improve reproducibility. Beyond forecasting organoid outcomes, the platform will identify when developmental trajectories diverge and provide mechanistic insights into the underlying biology of brain organoid growth.

The EBI-JGI grant will support data collection, model training, and dissemination, enabling future applications in disease modelling (e.g., neurodevelopmental disorders) and expansion to other organoid types (e.g., ovarian, intestinal, liver). Outputs will include publications, open-source tools, and a foundation for commercialisation and further funding bids.

An AI-integrated lung-on-a-chip platform for the rapid screening and optimisation of mesenchymal stem cell secretome therapeutics

Wael Kafienah, School of Biochemistry and Cellular and Molecular Medicine
Lucia Marucci, School of Engineering Mathematics and Technology
Darryl Hill, School of Biochemistry and Cellular and Molecular Medicine

*Left to right: Wael Kafienah, Lucia Marucci, and Darryl Hill*

Inflammatory lung diseases, such as acute respiratory distress syndrome (ARDS), are devastating conditions with high mortality and no effective drug treatments. A promising new therapy involves using the cocktail of healing molecules secreted by mesenchymal stem cells (MSCs). These cells can be manipulated to optimise the secretome composition towards a specific therapeutic target. However, identifying the most potent secretome composition is a major bottleneck, relying on slow, laborious methods and animal models that poorly predict human responses.

This project aims to develop an AI-integrated Lung-on-a-Chip (LoC) platform to accelerate discovery of effective MSC therapies for inflammatory lung diseases like ARDS, which currently lack treatments. By mimicking lung inflammation and analysing cell responses in real time, the AI will identify optimal MSC secretome compositions more efficiently than current methods.

The pilot will deliver proof-of-concept data, including a high-accuracy AI model and a rich imaging and gene expression dataset, laying the foundation for reducing reliance on animal models and enabling rapid development of regenerative therapies with commercial potential.

An Integrated AI and machine learning platform to enable high throughput, precision oncology driven drug testing

Deepali Pal, School of Biochemistry and Cellular and Molecular Medicine
Colin Campbell, School of Engineering Mathematics and Technology
Stephen Cross, Wolfson Bioimaging Unit, Faculty of Life Sciences
Rihuan Ke, School of Mathematics

*Left to right: Deepali Pal, Colin Campbell, Stephen Cross, and Rihuan Ke*

250 children in the UK die from cancer each year. Leukaemia, affecting the human bone marrow, is the commonest children’s cancer. Yet it is a rare disease, which makes studying new treatments in clinical trials challenging. Therefore preclinical prioritisation is key, requiring predictive patient-derived models. However, hospital samples are difficult to cultivate, and where complex tissue-like biomimetic models to allow patient-sample cultivation have been engineered, these are hard to read, making output data inaccessible.

This project will develop an AI-powered bioimaging analysis tool to accurately detect leukaemia cells within complex bone marrow microenvironments, enabling predictive personalised drug screening.

The integrated machine learning, deep learning and AI tool will analyse 3D bioprinted organoids to distinguish leukaemia from morphologically similar bone marrow cells, overcoming limitations of marker-based identification. The interdisciplinary team will apply image processing, expert annotation, and algorithm development to validate the tool.

Impact includes a proof-of-concept precision oncology organoid platform that generates high-throughput, interpretable drug screening data within clinically relevant timeframes. The project also offers commercial potential in the growing organoid market including applicability across other diseases.

Rational AI Driven Target Acquisition from Genomes (RAIDTAG)

Darryl Hill, School of Cellular and Molecular Medicine
Sean Davis, School of Chemistry

*Left to right: Darryl Hill and Sean Davis*

This pilot study uses AI to accelerate the discovery of highly repetitive DNA sequences for rapid microorganism identification—critical for healthcare, agriculture, and food production. These ‘repetitive signatures’, absent in closely related species, offer precise, cost-effective diagnostic markers.

Building on a proof of concept using gold nanoparticle detection, the project will deliver a minimum viable product (MVP): an AI pipeline that automates marker discovery and validates candidates in the lab. This interdisciplinary effort, combining expertise in AI and computer science with nanomaterials and diagnostics, will provide the foundation for future external funding applications and translational research.

Predicting PD-L1 status from H&E slides using AI

Tom Dudding, Bristol Dental School
Qiang Liu, School of Engineering Mathematics and Technology
Sarah Hargreaves, Bristol Dental School
Miranda Pring, Bristol Dental School

*Left to right: Tom Dudding, Qiang Liu, Sarah Hargreaves, and Miranda Pring*

There are approximately 377,700 newly diagnosed mouth cancers each year worldwide. Despite treatment, survival rates remain low, and side effects are life changing. Some people with early-stage cancer unexpectedly experience poor outcomes, such as recurrence or early death, and these high-risk patients are often hard to identify. New ways to detect and manage these high-risk cancers are needed.

PD-L1 is a marker found on many cells, including mouth cancer cells. It helps cancers hide from the immune system and may contribute to poorer outcomes. This pathway can, however, be targeted using drugs like pembrolizumab, which block the interaction between PD-L1 and its receptor PD-1 and help the patient’s immune system fight the cancer. Because of this, PD-L1 is now an important marker used in clinical practice to guide treatment decisions for people presenting with late-stage head and neck cancers.

This project aims to develop an AI tool to predict PD-L1 expression in mouth cancer directly from digital histology slides, bypassing costly and limited lab tests.

Using the HN5000 cohort—750 digitised slides with linked biospecimens and long-term follow-up—the pilot will create a proof-of-concept AI model. This will support future funding bids, improve diagnostic equity, and expand access to immunotherapy in both NHS and global settings. Deliverables include a validated AI tool for PD-L1 detection, benchmarking against immunohistochemistry to establish reliability, and preliminary analysis to underpin external bids enabling translation of AI-enabled PD-L1 testing into multi-centre validation and ultimately routine clinical practice.

Genetic doppelgangers: using AI to reveal the true face of streptococcal disease

Alice Halliday, Biochemistry and Cellular & Molecular Medicine
Colin Campbell, Engineering Mathematics and Technology
Rachel Bromell, Biochemistry and Cellular & Molecular Medicine
Anu Goenka, Bristol Medical School
Sion Bayliss, Bristol Veterinary School

*Left to right: Alice Halliday, Colin Campbell, Rachel Bromell, Anu Goenka and Sion Bayliss*

This project aims to use AI tools to evaluate and develop diagnostics to distinguish between Streptococcus pyogenes (GAS) and a related bacterium, Streptococcus dysgalactiae subspecies equisimilis (SDSE). GAS and SDSE are so similar genetically that they are akin to ‘genetic doppelgangers’. Modern DNA-based tests (qPCR) have not been well evaluated for SDSE detection and struggle with closely related bacteria. This diagnostic confusion impacts our understanding of responsible pathogens and clinical consequences, with increasing evidence that SDSE’s disease burden is significantly underestimated.

Using genetic material from bacteria isolated from clinical throat swabs, this new interdisciplinary team will build on Bristol AI expertise to develop a machine learning classification tool for distinguishing these bacterial species, based on genome ‘k-mers’.
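For readers unfamiliar with the term, a k-mer is simply a substring of length k taken from a sequence. The sketch below counts the k-mers in a short made-up DNA string. It only illustrates the idea and says nothing about how the team's classifier will be built.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# A made-up six-base sequence, not real genome data.
print(kmer_counts("ATGATG", 3))
```

Counts like these, computed for each genome, are one way of turning sequences into numeric features that a machine learning classifier can work with.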

Combining traditional microbiology, novel DNA-based assays and cutting-edge ML analysis of genome sequence data, the team aim to evaluate and design diagnostic tools that accurately identify both pathogens. This could lead to improved diagnostic capabilities, enhanced disease surveillance, better outbreak investigations, and improved patient outcomes.

Bristol Respiratory Infection Dashboard (BRID Project)

Andrew Dowsey, Bristol Veterinary School
Raul Santos-Rodriguez, Engineering Maths and Technology
Maha Albur, Consultant Microbiologist
Peter Muir, Consultant Clinical Scientist
Paul North, Business Support Manager and Data Analytics, Severn Pathology, North Bristol NHS Trust and UKHSA
Amy Carson, Academic Clinical Fellow
Gavin Deas, Doctor in Training
Marceli Wac, Engineering Maths and Technology
Jack Stanley, Academic Clinical Fellow, Severn Pathology, North Bristol NHS Trust and University of Bristol

*Left to right: Andrew Dowsey, Raul Santos-Rodriguez, Maha Albur, Peter Muir, and Marceli Wac*

This project will develop the AI-powered Bristol Respiratory Infection Dashboard (BRID) at Severn Pathology, serving the South West region of the UK. By integrating real-time data from pathology and care sources, BRID will enable early detection of respiratory infection trends and support targeted interventions.

Respiratory Tract Infections (RTIs) are a major cause of hospital admissions, with over 400,000 cases in 2024 and winter surges up to 80%. Despite available treatments, disparities in vaccine uptake and care access persist. Real-time surveillance is essential to guide equitable, effective responses and improve outcomes.

This project supports NHS England’s strategy for managing acute respiratory infections (ARIs) through integrated care and digital innovation. Backed by medical directors from both merging trusts, the project will compare AI modelling using local Electronic Patient Record (EPR) data versus the South West Secure Data Environment (SWSDE), evaluating data quality, linkage, and implementation.

A co-designed dashboard with clinicians will guide funding bids to scale the platform, aiming to reduce admissions, optimise resources, and improve public health.

Automated image analysis to facilitate the incorporation of quality assurance measures into surgical RCTs

Natalie Blencowe, Bristol Medical School
Michael Wray, School of Computer Science
Anni King, Bristol Medical School
Sheraz Marker, GOLF study, University of Oxford
Nainika Menon, GOLF study, University of Oxford

*Left to right: Natalie Blencowe, Michael Wray, Anni King, Sheraz Marker, and Nainika Menon*

This project aims to develop an AI model to streamline quality assurance (QA) in surgical randomised controlled trials (RCTs), addressing bias caused by variability in surgical technique and skill. Using annotated videos from the GOLF trial, the AI will assess key operative steps based on anatomical visibility as a proxy for quality.

This project provides valuable pilot work to further the application of AI to surgical videos, enabling QA processes to be efficiently embedded into surgical RCTs, meaning they can be adopted more widely. In turn this will improve RCT quality, ultimately improving patient outcomes. There is also potential for these methods, powered by AI, to be used in ‘real time’ during operations to alert surgeons if a key step has not been fully completed, immediately improving surgical quality. Both these applications have wider implications outside of research studies. In routine clinical practice, they could shorten surgeons’ learning curves through the provision of bespoke, real-time feedback. This could transform the way surgeons learn, as well as optimising patient care.

Explainable AI for early categorisation of child deaths: real-time insights for prevention

Karen Luyt, Bristol Medical School
Edwin Simpson, School of Engineering Mathematics and Technology
Brian Hoy, Bristol Medical School
James Gopsill, School of Electrical, Electronic and Mechanical Engineering
David Odd, School of Medicine, Cardiff University

*Left to right: Karen Luyt, Edwin Simpson, Brian Hoy, James Gopsill, and David Odd*

This project brings National Child Mortality Database analysts from the Faculty of Health and Life Sciences and AI experts from the Faculty of Science and Engineering together to develop an explainable, high-confidence early categorisation system for the cause of child deaths.

The Child Mortality Analysis Unit (CMAU) at the University of Bristol is the national hub for analysing statutory child death data in England. CMAU links over 25,000 notifications and 19,500 completed reviews with national datasets across health, education, social care, policing, and child safety. This enables the identification of patterns, causes, and risk factors in child mortality, informing preventative action and national policy.

Now, in partnership with analysts from the National Child Mortality Database and leading AI experts, CMAU is developing a pioneering early categorisation system to identify suspected causes of child deaths in real time. This innovation will enhance national surveillance and accelerate public health responses.

Future work will expand to unstructured data from documents such as clinical notes, unlocking insights and spotting patterns that are currently difficult to detect manually. This will mark a major leap forward in understanding and preventing child deaths.

An explainable, high-confidence early categorisation system could be a game-changer, revolutionising how services across sectors monitor, respond to, and ultimately prevent child mortality.

The project team will be sharing the findings via this website, LinkedIn, academic publications and industry events. Are you a national health provider who would like to do something similar? Please contact us to learn more.

Widening Participation (WP) Research Summer Internships

Posted on 13 August 2025 by kerry.turcsi

The Widening Participation (WP) Research Summer Internships provide undergraduates with hands-on experience of research during the summer holidays, with the aim of encouraging a career in research. Interns gain professional experience and knowledge through a funded placement in their chosen subject. This also supports application for postgraduate study and other research jobs.

This year, the JGI was very pleased to support four internships through the WP scheme. Each of the interns has provided valuable support to an array of diverse and interesting projects related to their fields of interest. We are delighted by the feedback that we have received from their project supervisors and look forward to watching their future progress. Read on for more information on their projects and their experience.

Frihah Farooq

My name is Frihah, and I’m a third-year undergraduate studying Mathematics here at the University of Bristol. My academic interests centre around applied data science and machine learning, and this summer I worked on a project involving the General Practice Workforce dataset published by NHS Digital. My focus was on building tools that could make data that is often scattered and difficult to navigate more accessible.

The aim of the project was to automate the downloading and linkage of open-access datasets, specifically in the context of healthcare services. Many of these records are stored in files with inconsistent formats and structures, often requiring manual effort to piece together a consistent narrative. I developed a codebase in R that could search for the appropriate files, extract the relevant information, and construct a complete dataset that can be used for longitudinal analysis without the need for repeated intervention. While the code was built around the workforce dataset, the methodology generalises well to other datasets published by NHS Digital.
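The linkage step described above can be sketched in a few lines. The project's codebase was written in R; the example below is only a Python illustration of the general idea, and the column names, the `COLUMN_ALIASES` mapping and the `read_release` function are all hypothetical rather than taken from the project or from the published files.

```python
import csv
import io

# Hypothetical mapping from column names seen in different releases
# to one consistent name. Real releases would need their own mapping.
COLUMN_ALIASES = {
    "PRAC_CODE": "practice_code",
    "practice_code": "practice_code",
    "GP_HC": "gp_headcount",
    "gp_headcount": "gp_headcount",
}


def read_release(text, period):
    """Parse one CSV release and rename its columns to the consistent names."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {"period": period}
        for column, value in raw.items():
            if column in COLUMN_ALIASES:  # keep only the columns we recognise
                row[COLUMN_ALIASES[column]] = value
        rows.append(row)
    return rows


# Two made-up releases with different headers for the same information.
release_2023 = "PRAC_CODE,GP_HC\nA001,5\nA002,3\n"
release_2024 = "practice_code,gp_headcount,extra\nA001,6,x\nA002,3,y\n"

# Stack the harmonised releases into one longitudinal table.
combined = read_release(release_2023, "2023") + read_release(release_2024, "2024")
for row in combined:
    print(row)
```

Keeping the renaming rules in a single mapping means a new release with yet another header style only needs new entries in that mapping, not new parsing code.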

One observation from the final merged dataset was the trend of decreasing row counts, likely due to restructuring, alongside an increase in the number of recorded variables, a sign that data collection has grown more sophisticated in recent years. This experience strengthened my foundation in data automation and my ability to work with evolving and imperfect data, skills I know will benefit me as I move further into research.

If you’d like to get in touch, you can reach me at cc22019@bristol.ac.uk

Grace Gilman

Grace's poster on Fair Tales project — *Grace Gilman’s headshot*

Hello, my name is Grace Gilman and I am starting my third year studying Computer Science with Artificial Intelligence at the University of Bath. I am hoping to go into academia in the future and pursue computing research specifically with medical applications. You can contact me at gcag20@bath.ac.uk.

Over the past six weeks, I have been participating in a research internship here at the University of Bristol, supported by the Jean Golding Institute. I have been working on a data science project called ‘Using AI to Study Gender in Children’s Books’, for the team Fair Tales, supervised by Chris McWilliams. During my internship I experimented with image analysis using ChatGPT and Vertex AI, for future integration into the Data Entry app that Fair Tales is producing to semi-automate character and transcript input. I have also been contributing to the database architecture and to the search and filtering options that let users interact with the database. Some of my work has involved analysing the corpus of children’s books using SQL. One pattern I found was that the difference between mother and father characters (1:0.75) is even more pronounced for grandmothers and grandfathers (1:0.5).

During my time at this internship, I have become much more confident in my ability to work on a project and to write code that will be used in a research setting. I have learnt more about how research is conducted and what skills are needed for this, and become more sure of an academic future.

Imogen Joseph

Research poster on 'R packages to guide handling of missing data' — *Imogen Joseph*

I am currently studying a Neuroscience MSci with a Year in Industry at the University of Bristol. I’m going into my final year, having just completed a placement year in Southampton General Hospital undertaking clinical research in neonatal respiratory physiology. I’m particularly interested in a career in academia and more specifically looking at molecular mechanisms behind disease for drug discovery.

This summer, I helped in the development of an R package, ‘midoc’ (Multiple Imputation DOCtor, found on CRAN), designed to guide researchers in analysis with missing data under my supervisor Elinor Curnow. I created several functions that resulted in the display of a summary table of missing data, alongside optional graphs to visualise the distributions of their missing data. This allows the user to explore what is actually missing, and additionally make inferences on whether missingness is random or related to particular variables.
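For readers who have not met this kind of summary before, the idea can be sketched in a few lines. The example below is a Python illustration with made-up data, not code from midoc (which is an R package). The function name `missing_summary` and the choice to treat `None` and empty strings as missing are assumptions made for the example.

```python
def missing_summary(rows, missing=(None, "")):
    """Count missing values per variable in a list of dictionaries."""
    variables = sorted({key for row in rows for key in row})
    summary = {}
    for var in variables:
        # A value counts as missing if it is absent, None, or an empty string.
        n_missing = sum(1 for row in rows if row.get(var) in missing)
        summary[var] = {
            "n_missing": n_missing,
            "pct_missing": round(100 * n_missing / len(rows), 1),
        }
    return summary


# Toy data, made up for illustration.
rows = [
    {"age": 34, "weight": 70.5, "smoker": "no"},
    {"age": None, "weight": 82.0, "smoker": "yes"},
    {"age": 51, "weight": None, "smoker": ""},
    {"age": 29, "weight": None, "smoker": "no"},
]

# Print one line per variable: name, number missing, percentage missing.
for var, stats in missing_summary(rows).items():
    print(f"{var:<8}{stats['n_missing']:>3}{stats['pct_missing']:>7}%")
```

A table like this shows how much is missing in each variable. Judging whether missingness is related to other variables needs a further step, such as comparing the distributions of observed variables between rows with and without the missing value.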

Before coming into this internship, my R ability was limited to self-teaching via YouTube videos. Ample training was provided in this project but more than anything, throwing myself in and actually writing code has been so beneficial to my learning. This knowledge is extremely useful for a career in research – I was even able to apply my acquired skills to the work carried out in my placement, and used R to analyse the data I gathered.

I am very grateful for this opportunity given to me under the JGI and will take what I’ve learnt with me into whatever I do next!

You can contact Imogen at imogenjoseph26@gmail.com

Sindenyi Bukachi

Using Big Data to Rethink Children’s Rights (bsindenyi@gmail.com)

MSci Psychology and Neuroscience, University of Bristol (Year 3)

Sindenyi Bukachi holding their research poster on 'Investigating attitudes towards children's rights (in education)' — *Sindenyi Bukachi holding their research poster*

Initially, the project was quite open – the only brief was to explore attitudes towards children’s rights using big data. My early research into Reddit threads, news stories and real-world discourse helped narrow our focus to something more urgent and measurable: children’s right to participation, specifically in educational settings as both my supervisors are based in the School of Education. This became the foundation for the rest of the project, and my supervisors later decided to take it forward as a grant proposal.

Over the first few weeks, I learned how to do structured literature reviews using academic databases like ERIC, build Boolean search strings, and track findings across a spreadsheet. I explored how participation is talked about and measured, and the themes I identified – like tokenism, power struggles between adults, and the emotional toll of being “heard” but not actually listened to – became central to our research direction.

In the second half, I moved from qualitative sources to dataset analysis. I used R and RStudio to explore datasets from the UK Data Service. I learned to work with tricky file types (.SAV, .TAB), use new packages, extract variables, visualise trends, and test relationships between predictors — all while thinking critically about how these datasets (often not made for this topic) could reflect participation and children’s agency.

I’ve gained confidence in data science, research strategy, and independent problem-solving – all skills I’ll take forward into my dissertations and future career. I’m so grateful to Dr Katherin Barg, Professor Claire Fox, and the JGI for the support and trust throughout.

How to make data science skills stick? Learnings from the OCSEAN project

Posted on 29 July 2025, updated 28 August 2025, by kerry.turcsi

Written by Catherine Upex and Rachel Wood

Left to right: Sena Darmasetiyawan; John Calorio; Komang Sumaryana; Chris Kinipi; Wahyu Widiatmika; Dendi Wijaya standing in front of the Fry Building — Visiting researchers from the OCSEAN project (from left to right: Sena Darmasetiyawan (Udayana University); John Calorio (Davao Medical School Foundation); Komang Sumaryana (Udayana University); Chris Kinipi (University of Papua New Guinea); Wahyu Widiatmika (Udayana University); Dendi Wijaya (Jakarta University))

Introduction

Earlier this summer, the University of Bristol and the JGI welcomed a group of visiting researchers from the “Oceanic and Southeast Asian Navigators” (OCSEAN) project. OCSEAN is a worldwide interdisciplinary consortium researching the demographic history of ancient seafarers across Oceania and Southeast Asia. The visiting humanities researchers from Indonesia and the Philippines arrived in Bristol with the aim of learning more about quantitative methods and how to apply them to their research, and of taking these skills home to help their research community do the same.

When asked, most said they had little to no knowledge or experience in coding. The task therefore was to design a training approach to help them feel confident independently using Python for research – all in the space of a few weeks.

Our Approach

The training style followed a traditional workshop format, but importantly with two instructors. This allowed one to talk through the course content, and the other to provide one-to-one help to individuals. Initially, the sessions consisted of lecture-style teaching, but as confidence grew, they transitioned to a more independent format, where small groups collaborated to solve data science problems directly related to their research interests.

As most participants had no prior coding experience, it was important not to assume any knowledge of technical terms. Over eight two-hour sessions spanning three weeks, the training slowly built up coding knowledge, covering the following topics:

Introduction to Python (e.g. variables, data types, operators, lists, dictionaries)
Intermediate concepts (e.g. using/writing functions, loops, conditional statements)
How to use chatbots for coding (e.g. how to write good prompts, refine responses, when/when not to use, error handling, and sanity checking)
Data analysis (e.g. loading/cleaning data, plotting using seaborn and matplotlib, summarising data)
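
To give a flavour of the first two topics, the short sketch below combines variables, a list of dictionaries, functions, a loop and a conditional statement. It is an illustrative example only: the data and the function names `total_samples` and `sites_without_samples` are hypothetical and are not taken from the course materials.

```python
# Toy data (made up for illustration): one dictionary per field site.
sites = [
    {"name": "Site A", "samples": 12},
    {"name": "Site B", "samples": 0},
    {"name": "Site C", "samples": 7},
]


def total_samples(records):
    """Add up the 'samples' value across a list of dictionaries."""
    total = 0
    for record in records:          # loop over the list
        total += record["samples"]  # read a value from the dictionary
    return total


def sites_without_samples(records):
    """Return the names of sites whose sample count is zero."""
    empty = []
    for record in records:
        if record["samples"] == 0:  # conditional statement
            empty.append(record["name"])
    return empty


print(total_samples(sites))
print(sites_without_samples(sites))
```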

The training also coincided with Bristol Data Week 2025, so the OCSEAN researchers had the opportunity to cement their knowledge by revisiting concepts in similar training sessions from the event.

Comparing training styles

The approach differed from a recent pilot training scheme run by JGI Research Data Science Advocates. The aim of the pilot was to run training on data analysis in Python in a low-stress environment, via a self-led approach. Participants were supplied with materials to work through independently, with optional contact time with facilitators.

Both training styles were designed for researchers with no prior coding experience. It was interesting to see how the hands-on and hands-off approaches compared in order to understand how to most effectively encourage engagement with data science.

Feedback from OCSEAN researchers

By the end of our training period, all the OCSEAN researchers said that they found the training very beneficial for their research. Many acknowledged that they found learning Python challenging. However, the format of the sessions, especially the opportunity to draw upon help from not only facilitators but also ChatGPT, and importantly each other, allowed them to get to grips with new concepts. Intensive successive trainings with a clear syllabus were seen as more beneficial than one-off unconnected sessions.

The importance of structured training was echoed by feedback from the self-led pilot training. Here, participants highlighted that despite a self-led approach being easier to fit into a working week, they would have benefitted from group discussions and the opportunity to compare their results with others. Additionally, while most of the self-led participants agreed that the pilot scheme facilitated their learning outcomes and expressed a desire to apply what they learnt to their work, some commented that they lacked the basic understanding of Python needed to apply these skills independently.

Importantly, OCSEAN researchers commented on how it wasn’t just the training structure that facilitated learning. Aspects such as the use of a small meeting room and the inclusion of regular breaks further encouraged collaboration between participants and drove better understanding. Additionally, the use of datasets adapted to participants’ research fields made coding seem much more accessible and engaging. This highlighted how important a supportive and personalised teaching environment is for fully grasping complex new concepts.

Training attendees with their course completion certificates standing beside Dr Dan Lawson, Rachel Wood, and Catherine Upex — Training attendees with their course completion certificates; featured with training facilitators from the University of Bristol: Dr Dan Lawson (Associate Professor of Data Science and member of OCSEAN project; School of Mathematics), Rachel Wood (PhD student; School of Mathematics); Catherine Upex (PhD student; Bristol Medical School)

Reflections and moving forward

This training was facilitated by two PhD students developing their own teaching skills, and the experience taught the team a lot about what makes effective data science training. To feel confident in independently using data science, intensive face-to-face training is needed to make sure basic coding skills are cemented. This can be difficult for many to fit in, but a weekly commitment, combined with a hands-on collaborative atmosphere, can effectively drive key concepts home.

Additionally, to drive engagement, particularly from disciplines with little data science background, it is important to tailor training to specific research questions in that field, for example by using relevant datasets. This way, participants can see how data science can help them in their own research and be more inspired to try for themselves.

So, what’s next? The aim of this training was to provide OCSEAN researchers with data science skills to apply to their own research. It’s been brilliant to see that some have already taken this leap. Using their coding skills and connections made in Bristol, many are developing new projects, applying for PhD positions and forming future collaborations. In the autumn, the team plans to travel to Bali to help OCSEAN researchers share coding skills with their research communities, as well as to develop more research collaborations.

This blog was written by Catherine Upex and Rachel Wood

Learn more about the OCSEAN project here or contact Daniel Lawson (Dan.Lawson@bristol.ac.uk) or Monika Karmin (monika.karmin@ut.ee) for more information.