data science – Page 4 – Jean Golding Institute News

Successful Seedcorn Awardees 2024-2025

Posted on 6 November 202420 May 2025 by kerry.turcsi

The Jean Golding Institute Seedcorn Funding is a fantastic opportunity to develop multi and interdisciplinary ideas while promoting collaboration in data science and AI.  We are delighted that a new cohort of multidisciplinary researchers has been supported through this funding.

Leighan Renaud – Building a Folk Map of St Lucia

Dr. Leighan Renaud is a lecturer in Caribbean Literatures and Cultures in the Department of English. Her research interests include twenty-first century Caribbean fiction, mothering and motherhood in the Caribbean, folk and oral traditions in the Anglophone Caribbean, and creative practices of neo-archiving.

Louise AC Millard – Using digital health data for tracking menstrual cycles

Dr. Louise Millard is a Senior Lecturer in Health Data Science in the MRC Integrative Epidemiology Unit (IEU) at the University of Bristol. Following an undergraduate Computer Science degree and MSc in Machine Learning and Data Mining, they completed an interdisciplinary PhD at the interface of Computer Science and Epidemiology. Their research interests lie in the development and application of computational methods for population health research, including using digital health and phenotypic data, and statistical and machine learning approaches.

Laura Fryer – Visualisation tool for Enhancing Public Engagement Using Supermarket Loyalty Card Data

Laura is a senior research associate in the Digital Footprints Lab based within the Bristol Medical School. Their aim is to use novel data to unlock insights into behavioural science for the purposes of public good. Laura is particularly passionate about broadening the public’s understanding of digital footprint data (e.g. from loyalty cards, bank transactions or wearable technology such as a smart watch) and demonstrating how vital it can be in developing our understanding of population health within the UK and beyond. Laura’s project is focused on developing a data-visualisation tool that will support public engagement activities and provide a tangible representation of the types of data that we use – building further trust between the public and scientific researchers.

Nicola A Wiseman – Cellular to Global Assessment of Phytoplankton Stoichiometry (C-GAPS)

Dr. Nicola Wiseman is a Research Associate in the School of Geographical Sciences. They received their PhD in Earth System Science from the University of California, Irvine, where they specialized in using ocean biogeochemical models to investigate the impacts of phytoplankton nutrient uptake flexibility on ocean carbon uptake. They also are interested in using statistical methods and machine learning to better understand the interactions between marine nutrient and carbon cycles, and the role of these interactions in regulating global climate.

Georgia Sains – Collecting & Analysing Multilingual EEG Data

Georgia Sains is a Doctoral Teaching Associate in the Neural Computation research group at the School of Computer Science. Her research is focused on the overlap between Computer Science, Neuroscience, and Linguistics. Georgia has worked on developing models to help understand how linguistic traits have evolved. More recently, she has been using Bayesian modelling to find patterns between grammar and neurological response and are now focused on using Electroencephalography experimentation to explore the relationship between linguistic upbringing and how the brain processes language.

Alex Tasker – Building a Strategic Critical Rapid Integrated Biothreat Evaluation (SCRIBE) data tool for research, policy, and practice

Dr. Tasker is a Senior Lecturer at the University of Bristol, a Research Associate at the KCL Conflict Health Research Group and Oxford Climate Change & (In)Security (CCI) project, and a recent ESRC Policy Fellow in National Security and International Relations. Dr. Tasker is an interdisciplinary researcher working across social and natural sciences to understand human-animal-environmental health in situations of conflict, criminality, and displacement using One Health approaches. Alongside this core focus, Dr. Tasker’s work also explores emerging areas of relevance to biosecurity and biothreat including engineering biology, antimicrobial resistance, subterranean spaces, and the use of new forms of evidence and expertise in a rapidly changing world for climate, security, and defense.

Exploring the Impact of Medical Influencers on Health Discourse Through Socio-Semantic Network Analysis

Posted on 2 October 2024 by kerry.turcsi

JGI Seed Corn Funding Project Blog 2023/24: Roberta Bernardi

This Photo by Unknown Author is licensed under CC BY-NC-ND

Project Background

Medical influencers on social media shape attitudes towards medical interventions but may also spread misinformation. Understanding their influence is crucial amidst growing mistrust in health authorities. We used a Twitter dataset of the top 100 medical influencers during Covid-19 to construct a socio-semantic network, mapping both medical influencers’ identities and key topics. Medical influencers’ identities and the topics they use to represent an opinion serve as vital indicators of their influence on public health discourse. We developed a classifier to identify influencers and their network of actors, used BERTopic to identify influencers’ topics, and mapped their identities and topics into a network.

Key Results

Identity classification

Most Twitter bios include job titles and organization types, which often have similar characteristics. So, we used a machine learning tool to see how accurately we could predict someone’s job based on their Twitter bio. Our main question is: How well can we guess occupations from Twitter bios using the latest techniques in Natural Language Processing (NLP), like few-shot classification and pre-trained sentence embeddings? We manually coded a training set of 2000 randomly selected bios from the to 100 medical influencers and their followers. Table 1 shows a sample of 10 users with (multi-)labels.

Table of users and their multi-labels — *Table 1. Users and their multi-labels*

We used six prompts to classify the identities of medical influencers and other actors in their social network. The ensemble method, which combines all prompts, demonstrated superior performance, achieving the highest precision (0.700), recall (0.752), F1 score (0.700), and accuracy (0.513) (Table 2).

Table of prompts and their identities classification — *Table 2. Comparison of different prompts for the identities classification*

Topic Modelling

We used BERTopic to identify topics from a corpus of 424,629 tweets posted by the medical influencers between December 2021 and February 2022 (Figure 1).

Coloured scatter graph of medical influencer topics — *Figure 1. Map of medical influencers’ topics*

In total, 665 topics were identified. The most prevalent topic is related to vaccine hesitancy (8919 tweets). The second most significant topic focuses on equitable vaccine distribution 6860 tweets. Figures 2a and 2b illustrate a comparison between the top topics identified by Latent Dirichlet Allocation (LDA) and those by BERTopic.

Word map of LDA top 5th topics on the left and bar charts of BERTopic top 8th topics on the right — *Figure 2. Comparisons of LDA topics and BERTopic topics*

The topics derived from LDA appear more general and lack specific meaning, whereas the topics from BERTopic are notably more specific and carry clearer semantic significance. For example, the BERTopic model shows either the “Hesitancy” or the “Equity” of the vaccine (topic 0, 1), while the LDA model only provides general topic information (topic 0).

Table 3 shows the three different topic representations generated from the same clusters by three different methods: Bag-of-Words with c-TF-IDF, KeyBERTInspired and ChatGPT.

Table of comparison of three different topic representations methods of BERTopic — *Table 3: Comparison of three different topic representations methods of BERTopic*

The Keyword Lists from Bag-of-Words with c-TF-IDF and KeyBERTInspired provide quick information about the content of the topic, while the narrative Summaries from ChatGPT offer a human-readable summary but may sacrifice some specific details that the keyword lists will provide. BERTopic captures deeper text meanings, essential for understanding conversation context and providing clear topics, especially in short texts like social media posts.

Mapping Identities and Topics in Networks

We mapped actors’ identities and the most prevalent topics from their tweets into a network (Figure 3).

*Figure 3. Network representation of actors’ identities and topics*

Each user node features an attribute detailing their identities, which defines the influence of medical influencers within their network and how their messages resonate across various user communities. This visualization reveals their influence and how they adapt discourse for different audiences based on group affiliations. It aids in exploring how the perspectives of medical influencers on health issues proliferate across social media communities.

Conclusion

Our work shows how to identify who medical influencers are and what topics they talk about. Our network representation of medical influencers’ identities and their topics provides insights into how these influencers change their messages to connect with different audiences. First, we used machine learning to categorize user identities. Then, we used BERTopic to find common topics among these influencers. We created a network map showing the connections between identities, social interactions, and the main topics. This innovative method helps us understand how the identities of medical influencers affect their position in the network and how well their messages connect with different user groups.

Contact details and links

For further information or to collaborate on this project, please contact Dr Roberta Bernardi (email: roberta.bernardi@bristol.ac.uk)

Acknowledgement

This blog post’s content is based on the work published in Guo, Z., Simpson, E., Bernardi, R. (2024). ‘Medfluencer: A Network Representation of Medical Influencers’ Identities and Discourse on Social Media,’ presented at epiDAMIK ’24, August 26, 2024, Barcelona, Spain

Foodscapes: visualizing dietary practices on the Roman frontiers

Posted on 26 September 202426 September 2024 by kerry.turcsi

JGI Seed Corn Funding Project Blog 2023/24: Lucy Cramp, Simon Hammann & Martin Pitts

*Table laid out with Roman pottery from Vindolanda ready for sampling for organic residue analysis as part of our ‘Roman Melting Pots’ AHRC-DFG funded project*

The extraction and molecular analysis of ancient food residues from pottery enable us to reconstruct the actual uses of vessels in the past. This means we can start to build up pictures of dietary patterns in the past, including foodways at culturally diverse communities such as the Roman frontiers. However, there remains a challenge in how we can interpret these complex residues, and both visualise and interrogate these datasets to explore use of resources in the past.

Nowadays, it is commonplace to extract organic residues from many tens, if not hundreds, of potsherds; within each residue, and especially using cutting-edge high-resolution mass spectrometric (HRMS) techniques, there might be several hundred compounds present, including some at very low abundance. Using an existing dataset of gas chromatography-high resolution mass spectrometric data from the Roman fort and associated settlement at Vindolanda, this project aimed to explore methods through which these dietary information could be spatially analysed across an archaeological site, with a view to developing methods that could be applied on a range of scales, from intra-site through to regional and even global. It was hoped that it would be possible to display the presence of different compounds in potsherds recovered from different parts of a site that are diagnostic of particular foodstuffs, in order to spatially analyse the distribution of particular resources within and beyond sites.

*A fragment from a Roman jar that was sampled from Vindolanda for organic residue analysis as part of our ‘Roman Melting Pots’ AHRC-DFG funded project*

The project started by processing a pilot dataset of GC-HRMS data from the site of Vindolanda, following a previously-published workflow (Korf et al. 2020). These pottery sherds came from different locations at the fort, occupied by peoples of different origins and social standings. This included the praetorium (commanding officer’s house), schola (‘officers’ mess’), infantry barracks (occupied by Tungrians, soldiers from modern-day Belgium and Netherlands), and the non-military ‘vicus’ outside of the fort walls likely occupied by locals, traders and families. Complex data, often containing several hundred compounds per residue were re-integrated using open-source mass spectrometry data processing software MZ Mine, supported by our collaborator from MZ IO gmbh, Dr Ansgar Korf. This produced a ‘feature list’ of compounds and their intensities across the sample dataset. This feature list was then presented to Emilio Romero, a PhD student in Translational Health Sciences, who worked as part of the Ask-JGI helpdesk to support academic researchers on projects such as these. Emilio developed data matrices and performed statistical analyses to identify significant compounds of interest that were driving differences between the composition of organic residues from different parts of the settlement. This revealed, for example, that biomarkers of plant origin appear to be more strongly associated with pottery recovered from inside the fort compared with the vicus outside the fort walls. He was then able to start exploring ways to spatially visualize these data, with input from Léo Gorman, a data scientist from the JGI, and Levi Wolf from the School of Geographical Sciences. Emilio says:

‘Over the past year, my experience helping with the Ask-JGI service has been truly rewarding. I was very excited to apply as I wanted to gain more exposure to the world of research in Bristol, meet different researchers and explore with them different ways of working and approaching data.

One of the most challenging projects was working with chemometric concentrations of different chemical compound residues extracted from vessels used in ancient human settlements. This challenge allowed me to engage in dialogue with specialists in the field and work in a multidisciplinary way in developing data matrices, extracting coordinates and creating maps in R. The most rewarding part was being able to use a colour scale to represent the variation in concentration of specific compounds in settlements through the development of a Shiny application in R. It was certainly an invaluable experience and a technique I had never had the opportunity to practice before.’

This work is still in progress, but we have planned a final workshop that will take place in mid-November. Joining us will be our project partners from the Vindolanda Trust, as well as colleagues from across the Roman Melting Pots project, the JGI and the University of Bristol. A funding application to develop this exploratory spatial analysis has been submitted to the AHRC.

Contact details and links

You can find out more about our AHRC-DFG funded project ‘Roman Melting Projects’ and news from this season’s excavations at Vindolanda and its sister site, Magna.

A real-time map-based traffic and air-quality dashboard for Bristol

Posted on 24 September 202426 September 2024 by kerry.turcsi

JGI Seed Corn Funding Project Blog 2023/24: James Matthews

Example screenshot of the Bristol Air Quality and Traffic (AQT) Dashboard with Key — *Example of the dashboard in use.*

A reduction in urban air quality is known to be a detrimental to health, affecting many conditions including cardiorespiratory health. Sources of poor air quality in urban areas include industry, traffic and domestic wood burning. Air quality can be tracked by many government, university and citizen held pollution sensors. Bristol has implemented a clean air zone, but non-traffic related sources, such as domestic wood burning, are not affected.

The project came about through the initiative of Christina Biggs who approached academics in the school of Engineering Mathematics and Technology (Nikolai Bode) and the School of Chemistry (James Matthews and Anwar Khan) with a vision for an easy to use data dashboard that could empower citizens by drawing data from citizen science, university and council air quality and traffic sensors in order to better understand the causes of poor air quality in their area. The aims were to (1) work with community groups to inform the dashboard design (2) create an online dashboard bringing together air quality and traffic data (3) use council air quality sensors to enable comparison with citizen science sensors for validation and (4) to use this to identify likely sources of poor air quality.

An online dashboard was created using R with Shiny and Leaflet, collecting data using API code, and tested offline. The latest version of the dashboard has been named the Bristol Air Quality and Traffic (AQT) dashboard. The dashboard allows PM_2.5 data and traffic numbers to be investigated in specific places and plotted as a time series. We are able to compare citizen sensor data to council and government data, and we can compare to known safety limits.

The dashboard collates traffic data from several sources including Telraam traffic report and Vivacity traffic data which provide information on car numbers from local sensors; and PM_2.5 data from different sources including Defra air quality stations and SensorCommunity (previously named as Luftdaten) citizen air quality stations. Clicking onto a data point provides the previous 24 hour time series of measurements. For example, in the screenshots below, one Telraam sensor shows a clear PM2.5 peak during the morning rush hour of 26^th June 2024 (a) which is likely related to traffic, while the second shows a higher PM2.5 peak in the evening (b) which could be related to domestic field burning, such as an outdoor barbecue. A nearby traffic sensor shows that the morning peak and smaller afternoon peak do agree with traffic numbers (c), but the evening peak might be unrelated. Data can be selected from historic data sets and is available to download for future interrogation.

Example of data output from the dashboard showing PM2.5 midnight to midnight on 26/06/2024 — *Figure (a) Example of data output from the dashboard showing PM_2.5*

*Figure (c) Example of data output from the dashboard showing traffic measured using local Bristol sensors*

It is a hope that these snapshots might provide an intuitive way for communities to understand the air quality in their location. Throughout the project, the project team held regular meetings with Stuart Phelps from Baggator, a community group based in Easton, Bristol, so that community needs were put to the forefront of the dashboard design.

We are currently planning a demonstration event with local stakeholders to allow them to interrogate the data and provide feedback that can be used to add explanatory text to the dashboard and enable easy and intuitive analysis of the data. We will then engage with academic communities to consider how to use the data on the dashboard to answer deeper scientific questions.

Contact details and links

Details of the dashboard can be found at the link below, and further questions can be sent to James Matthews at j.c.matthews@bristol.ac.uk

https://github.com/christinabiggs/Bristol-AQT-Dashboard/tree/main

The Royal Statistical Society Annual Conference 2024

Posted on 13 September 2024 by kerry.turcsi

The Royal Statistical Society meets annually for their internationally attended conference. It serves as the UK’s annual showcase for statistics and data science. This year they met in Brighton for a conference attended by over 600 attendees from around the world, including JGI Data Scientist Dr Huw Day.

The conference had over 250 presentations, including contributed talks, rapid-fire talks, and poster presentations. At any one time, there could be as many as 6 different talks going on, so it was impossible to go to everything but below are some of Huw’s highlights of the conference.

Pre-empting misunderstandings is part of trustworthy communication

*From left to right; Dr Huw Day, Professor Sir David Spiegelhalter and Dr Simon Day (RSS Fellow and Huw’s dad) at the RSS International Conference 2024.*

As part of a session on communicating data to the public, Professor Sir David Spiegelhalter talked about his experiences trying to pre-bunk misinformation when displaying data.

Data in June 2021 showed that the majority of COVID deaths are in the vaccinated group. The Brazilian president President Jair Bolsonaro used this data to support claims that Covid vaccines are killing people. Spiegelhalter and his colleague Anthony Masters tried explaining why this wasn’t a sign the vaccine was bad in an article in The Observer “Why most people who now die with Covid in England have had a vaccination”.

Consider the following analogy: most car passengers who die in car accidents are wearing seatbelts. Intuitively, we understand that just because these two variables are associated, it doesn’t mean that one causes the other. Having a story like that means you don’t have to talk about base rates, stratification or even start to use numbers in your explanations.

We should try to make the caveats clearer of data before we present them. We should be upfront from what you can and can’t conclude from the data.

Spiegelhalter pointed to an academic paper: “Transparent communication of evidence does not undermine public trust in evidence” where participants were shown either persuasive or balanced messages about the benefits of Covid vaccines and nuclear power. It’s perhaps not surprising to read that those who already had positive opinions about either topic continued to have positive views after reading either messages. Far more interesting is that the paper concluded that “balanced messages were consistently perceived as more trustworthy among those with negative or neutral prior beliefs about the message content.”

Whilst we should pre-empt misconceptions and caveats, being balanced and more measured might prove to be an antidote to those who are overly sceptical. Standard overly positive messaging is actively reducing trust in groups with more sceptical views.

Digital Twins of the Human Heart fueled Synthetic 3D Image Generation

Digital twins are a digital replica/simulator of something from the real world. Typically it includes some sort of virtual model which is informed by real world data.

Dr Dirk Husmeiser at the University of Glasgow has been exploring the application of digital twins of the human heart and other organs to investigate behaviour of the heart during heart attacks, as well as trying to use ultrasound to measure blood flow to estimate pulmonary blood pressure (blood pressure in the lungs). Usually, measuring pulmonary blood pressure is an extremely invasive procedure, so using ultrasound methods has clear utility.

One of the issues of building a digital twin is having data about what you’re looking at. In this case, the data looks like MRI scans of the human heart, taken at several “slices”. Because of limitations in existing data, Dr Vinny Davies and Dr Andrew Elliot, (both colleagues of Husmeiser at the University of Glasgow)have been attempting to develop methods of making synthetic 3D models of the human heart, based on their existing data. They broke the problem down into several steps, working to generate synthetic versions of the slices of the heart (which are 2D images) first.

The researchers were using a method called Generative Adversarial Networks (GANs), where two systems compete against each other. The generator system generates the synthetic model and the discriminator system tries to distinguish between real and synthetic images. You can read more about using GANs for synthetic data generation in a recent JGI blog about Chakaya Nyamvula’s JGI placement.

Slide on “Generating Deep Fake Left Ventricle Images for Improved Statistical Emulation”. — A slide from Dr Vinny Davies and Dr Andrew Elliot’s talk on “Generating Deep Fake Left Ventricle Images for Improved Statistical Emulation”. The slide depicts how progressive GANs work, where the generator learns how to generate smaller, less detailed images first and gradually improves until it can reproduce 2D slices of MRIs of the human heart.

Because the job of the generator is far harder than that of the discriminator (consider the task of reproducing a famous painting, versus spotting the difference between an original painting and a version drawn by an amateur), it’s important to find ways to make the generator’s job easier early on, and the discriminator’s job harder so that the two can improve together.

The researchers used a method called a Progressive GAN. Initially they gave the generator the task of drawing a lower resolution version of the image. This is easier and so the generator did easier. Once the generator could do this well, they then used the lower resolution versions as the new starting point and gradually improved the correlation. Consider trying to replicate a low resolution image – all you have to do is colour in a few squares in a convincing way. This then naturally makes the discriminator job’s harder, as it’s tasked with telling the difference between two, extremely low resolution images. This allows the two systems to gradually improve in proficiency.

The work is ongoing and the researchers at Glasgow are looking for a PhD student to get involved with the project!

Data Hazards

On the last day of the conference, Huw alongside Dr Nina Di Cara from the School of Psychology at the University of Bristol presented to participants about the Data Hazards project.

Participants (including Hadley Wickam, keynote speaker and author of the famous R package tidyverse) were introduced to the project, shown examples of how it has been used and then shown an example project where they were invited to take part in discussions about which different data hazards might apply and how you might go about mitigating for those hazards. They also discussed the importance of focussing on which hazards are most relevant and prominent.

Dr Huw Day (left) and Dr Nina Di Cara in front of a screen that says 'Data Hazards Workshop' — *Dr Huw Day (left) and Dr Nina Di Cara (right) about to give their Data Hazards workshop talk at the RSS International Conference 2024.*

All the participants left with their own set of the Data Hazard labels and a new way to think about communicating hazards of data science projects, as well as invites to this term’s first session of Data Ethics Club.