Chakaya Nyamvula’s JGI Placement 

Hi, I’m Chakaya. I am currently pursuing my MSc in AI and Data Science at Keele University and working as a Business Intelligence Analyst at iLabAfrica at Strathmore University in Nairobi, Kenya. This summer, thanks to the partnership between iLabAfrica and JGI, I had an amazing opportunity to work with JGI for my Master’s placement. I wanted to immerse myself in a research environment and connect with people in academia to help figure out my future career path. Working under the guidance of Dr Huw Day, I gained valuable insights into the world of research and expanded my professional network, all while experiencing life in the UK. 

Chakaya Nyamvula in front of a body of water
Chakaya Nyamvula, JGI Intern

What was the project about? 

Previously, for a JGI-funded Seedcorn project, Mark Mumme, Eleanor Walsh, Dan Smith, Huw Day, and Debbie Johnson surveyed researchers on how they might want to use synthetic data in their research.

Synthetic data is an artificial (i.e. fake) version of an existing dataset, generated to resemble the original. You might want to do this so you can share something that looks like the data but preserves the privacy of individuals in it, whilst still having a flavour of what the data looks like and what statistical patterns might be present within it. This is useful for writing data pipelines whilst you go through necessary ethics checks to access sensitive data, amongst other things.

For my summer placement with JGI, I worked with the MIMIC IV dataset of electronic health records and explored methods of generating synthetic versions of some of this data. It was also important to understand how you could measure or benchmark how successful your synthetic data generation has been, based on how well you had preserved privacy or how well the statistics of your synthetic data emulated those of your real data. 
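As a rough illustration of what this kind of benchmarking could look like, the sketch below compares a toy "real" column with two toy synthetic versions. The made-up data, the function names and the two measures (a gap in summary statistics for fidelity, and the distance to the closest real record for privacy) are assumptions chosen for illustration, not the metrics used in the project.

```python
import random
import statistics

def fidelity_gap(real, synthetic):
    """Absolute differences in mean and standard deviation (smaller = closer match)."""
    return {
        "mean_gap": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "std_gap": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
    }

def min_distance_to_real(real, synthetic):
    """For each synthetic value, the distance to the nearest real value.
    A distance of exactly 0 suggests a real record may simply have been copied."""
    return [min(abs(s - r) for r in real) for s in synthetic]

rng = random.Random(0)
real = [rng.gauss(70, 10) for _ in range(500)]        # hypothetical numeric column
good_synth = [rng.gauss(70, 10) for _ in range(500)]  # same distribution, fresh samples
copied_synth = list(real)                             # "synthetic" data that is just a copy

print(fidelity_gap(real, good_synth))
print(fidelity_gap(real, copied_synth))                # perfect fidelity...
print(min(min_distance_to_real(real, copied_synth)))   # ...but 0.0 distance: no privacy
```

The copied dataset scores perfectly on fidelity while offering no privacy at all, which is why both kinds of measure are needed together.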

What else did you do as part of your placement? 

Alongside my main work, I attended JGI Data Science meetings and learnt about some of the data science projects at the JGI, including a project on antimicrobial resistance and another on 3D image analysis of CT-scanned zebrafish to study bone development.

For some of the more computationally demanding aspects of the project, I was taught how to use the JGI’s server (known within the office as “Jeeves”).

I also had the opportunity to meet some PhD students at the University of Bristol, ask them about their research, and get advice on applying for PhDs in the future. 

Left to right, Huw Day, Elena Fillola Mayoral, Yujie Dai and Chakaya Nyamvula sat at a table at an ice cream shop
From left to right: Huw Day (JGI Data Scientist), Elena Fillola Mayoral (PhD student in AI for Climate), Yujie Dai (CDT in Digital Health) and Chakaya Nyamvula (JGI Intern) discussing PhDs over ice cream

What did you learn about? 

One deep learning method we used was a Generative Adversarial Network (GAN). Prior to this project, I had never worked with GANs, so diving into this methodology was both challenging and exciting.

A GAN works by having two competing neural networks, a generator and a discriminator. The generator’s job in this case is to generate synthetic versions of the original data. The discriminator’s job is to try and spot the difference between the real data and the synthetic data that has been generated. One of the advantages of such a system is that you have two outputs: 1) a neural network which can generate synthetic data based on some training data and 2) a second neural network which can discriminate between real and synthetic data. The second is useful for applications where people might maliciously generate synthetic data, for example deep fake images.
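To make the two-player setup concrete, here is a deliberately tiny sketch of the idea using only the Python standard library. It is not the model from the project: real GANs use neural networks for both players, whereas here the "generator" only learns a single shift `b` applied to random noise and the "discriminator" is a one-variable logistic classifier. The function name, the toy target distribution (mean 4) and all hyperparameters are assumptions made for illustration.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def train_toy_gan(steps=2000, batch=64, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data."""
    rng = random.Random(seed)
    b = 0.0          # generator parameter: fake sample = noise + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]

        # Discriminator step: gradient ascent on log D(real) + log(1 - D(fake)).
        grad_w = (sum((1 - sigmoid(w * x + c)) * x for x in real)
                  - sum(sigmoid(w * x + c) * x for x in fake)) / batch
        grad_c = (sum(1 - sigmoid(w * x + c) for x in real)
                  - sum(sigmoid(w * x + c) for x in fake)) / batch
        w += lr * grad_w
        c += lr * grad_c

        # Generator step: gradient ascent on log D(fake), i.e. try to fool D.
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        grad_b = sum((1 - sigmoid(w * x + c)) * w for x in fake) / batch
        b += lr * grad_b
    return b, w, c

b, w, c = train_toy_gan()
print("generator shift b:", b, "(the toy real data has mean 4.0)")
```

Note that the generator never sees the real data directly; it only learns about it through the discriminator's feedback. Even toy GANs like this one can oscillate rather than settle, so no particular final value of `b` is claimed here.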

A good analogy for GANs is two people learning chess by playing against one another. If both start at similar skill levels, then as one person improves, the other slowly improves too. If you lose a chess game, you know you made a mistake and you might be able to work out how to improve for the next time. If you win, then you know you were doing something right.  

However, if you pit a chess grandmaster against a complete beginner, then the beginner will lose every time and will struggle to understand where they are going wrong, making it difficult to improve. Because the task of making synthetic data is quite complicated, when we began the process of training the GAN, the generator was frequently getting it wrong and wasn’t really able to figure out how to improve. 

To combat this, we did two things. First, you can handicap the discriminator a bit to give the generator a head start (imagine making your grandmaster play blindfolded). This helped, but still wasn’t enough. 
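There are several common ways to handicap a discriminator, such as training it less often, giving it a smaller learning rate, or softening its training targets. The sketch below illustrates the last of these, often called one-sided label smoothing, purely as an example; it is an assumption for illustration and not necessarily the approach used in this project. The function names are hypothetical.

```python
import math

def bce(pred, target):
    """Binary cross-entropy for a single prediction strictly between 0 and 1."""
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

def discriminator_loss_on_real(pred, smooth=0.0):
    """Loss on a real sample; smooth > 0 lowers the target from 1.0 to 1 - smooth."""
    return bce(pred, 1.0 - smooth)

# How confident the discriminator is that a real sample is real.
for pred in (0.9, 0.99, 0.999):
    plain = discriminator_loss_on_real(pred)
    smoothed = discriminator_loss_on_real(pred, smooth=0.1)
    print(f"D(real)={pred}: plain loss={plain:.3f}, smoothed loss={smoothed:.3f}")
```

Without smoothing, the loss keeps falling as the discriminator becomes more certain. With smoothing, extreme confidence is penalised, which discourages the discriminator from racing too far ahead of the generator.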

One of the pair plots showing generated vs real data at epoch 0
One of the pair plots showing generated vs real data at epoch 25000
Pair plots showing how well the real and the synthetic data match by comparing each column. Real data is in blue, synthetic data is in red. The diagonal plots show histogram density plots of each column and how they compare between real and synthetic data. The off-diagonal plots show scatter plots between pairs of variables. The left pair plot shows the output at the start of training, where the generator just randomly samples a scatter of points. You can see that this is not a good match for the original data. The right pair plot shows that after training, the generator does a much more convincing job of emulating the real data. It is still not perfect, but it is particularly good at identifying clumps of data.

Secondly, you can start to think about how you inform your neural networks whether or not they were successful. Imagine if instead of “win” or “lose” as your outcome of the chess games, you got a measure of how well you performed, say a measure of how many good moves you made. With this more specific information, it becomes easier to decipher why you lost and how you might improve.  
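The difference between "win or lose" feedback and a graded score can be sketched with toy numbers. Below, a simple threshold rule plays the role of a discriminator giving a binary verdict, and the 1-D Wasserstein distance (the average gap between sorted samples) plays the role of a graded score. Both choices, the toy data and the function names are illustrative assumptions, not the loss function used in the project.

```python
import random

def win_or_lose(real, fake, threshold):
    """Binary feedback: can a simple threshold rule separate real from fake perfectly?"""
    caught = all(x < threshold for x in fake) and all(x >= threshold for x in real)
    return "generator loses" if caught else "generator wins"

def wasserstein_1d(real, fake):
    """Graded feedback: mean gap between sorted samples (equal sample sizes assumed)."""
    return sum(abs(r - f) for r, f in zip(sorted(real), sorted(fake))) / len(real)

rng = random.Random(0)
real = [rng.uniform(10, 11) for _ in range(200)]
bad_fake = [rng.uniform(0, 1) for _ in range(200)]     # far from the real data
better_fake = [rng.uniform(8, 9) for _ in range(200)]  # closer, but still separable

for name, fake in [("bad", bad_fake), ("better", better_fake)]:
    verdict = win_or_lose(real, fake, threshold=9.5)
    score = wasserstein_1d(real, fake)
    print(f"{name}: {verdict}, distance={score:.2f}")
```

Both generators "lose" under the binary verdict, so that signal cannot tell them apart. The graded distance is much smaller for the better generator, which gives it a direction in which to improve.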

To Be Continued? 

To finish my placement, I shared my experience with my placement supervisors at Keele University through a presentation and a report. I then had the opportunity to present my work to the Data Science Seminar at the University of Bristol, with several lecturers from the data science community in attendance, alongside JGI Data Scientists and some friends I made along the way.  

Additionally, all the code we worked on can be found in a public repository on Chakaya’s GitHub for other researchers to use and experiment with.

Chakaya Nyamvula and Huw Day standing in front of a projector presenting at the Data Science Seminar. The projector has a slide on it that says 'Introduction to synthetic data' 
Chakaya Nyamvula (left) and Huw Day (right) presenting at the Data Science Seminar 

Reflecting on my placement at JGI, I can confidently say it was an incredible learning experience. I had the privilege of working with a fantastic supervisor, Dr Huw Day, who provided guidance throughout the project. Co-working with the talented data scientists at JGI was both inspiring and rewarding, and I thoroughly enjoyed networking with professionals in academia. The challenges I faced, particularly working with GANs for the first time, pushed me to grow and expand my skill set. Overall, this experience not only deepened my technical expertise but also solidified my interest in pursuing a career that bridges research and data science.

Updates from a previous JGI Seed Corn funded project:  Addressing the fetal alcohol spectrum disorder (FASD) ‘data gap’

We are delighted to announce a few updates regarding one of our previous seed corn funded projects. In 2022-2023, the JGI funded Cheryl McQuire’s (Bristol Medical School) project on “Addressing the fetal alcohol spectrum disorder (FASD) ‘data gap’: ascertaining the feasibility of establishing the first UK National linked database for FASD”. This project allowed Cheryl’s team to explore the feasibility of establishing a National Linked Database for Fetal Alcohol Spectrum Disorder (FASD), as landmark UK guidance has called for urgent action to increase identification, understanding, and support for those affected by this disorder.

FASD is caused by prenatal alcohol exposure and is thought to be particularly common in the UK population. The aim of the seed corn project was to take the initial steps towards forming a UK National Database for FASD, looking at feasibility, acceptability, key purposes and the data structure needed. Through questioning over 100 stakeholders including clinicians, data specialists, researchers, policy makers, charities, and people living with FASD, the project was able to demonstrate strong support for a national FASD database, but there was a common concern among stakeholders about privacy and data sharing. Full details of the project can be found on our previous blog post.

Cheryl and their team also collaborated with the Elizabeth Blackwell Institute (EBI) on “Developing a National Database for Fetal Alcohol Spectrum Disorder (Nat-FASD UK): incorporating the views and recommendations of people with FASD and their carers.” Their findings from the projects funded by JGI and EBI were presented at the ADR-UK conference 2023. The abstract for this work can be viewed here. In addition, a pre-print of their FASD National database workshop findings is now available here.

Importantly, this work has been selected to feature in the Office for National Statistics (ONS) Research Excellence Series 2024. Cheryl will be delivering a webinar on “Showcasing methods for diverse stakeholder involvement in database design: establishing the feasibility and acceptability of a National Database for Fetal Alcohol Spectrum Disorder (FASD)” on Thursday 13 June 10:30 to 11:30 BST. The webinar will cover how the team developed a tailored, multi-method approach to public and professional involvement activities, leading to high levels of engagement. In addition, you will also hear what people living with FASD and health care, policy and data science professionals had to say about the feasibility and acceptability of a UK National Linked Database for FASD. There will be an opportunity to ask Cheryl any questions during the dedicated Q&A section. You can register a place on the webinar here.  

The work from both projects has been crucial in paving the way for progress in FASD research within the UK. It has also allowed us to get closer to addressing the FASD data gap that has been stalling the progress in prevention, understanding, and appropriate support for too long. Since both projects, Cheryl’s team has continued working on the FASD database and is currently pursuing funding options to establish a National database for FASD.  

The Jean Golding Institute offers seed corn projects every year to support and promote activities that will foster interdisciplinary research in the area of data science, based on the principle that a small financial investment will lead on to bigger things. We anticipate that our next seed corn funding call will be announced in the autumn of 2024. Sign up to our mailing list to find out when the call goes live.

Are you a researcher looking for data scientist support?

Researchers across the University benefit from our JGI Seedcorn Funding. Funding is great when you have someone to do the work – but what if you don’t have the right data science expertise in house? For that, this summer we are trialling a new JGI Data Scientist Support service. This provides an alternative support mechanism for researchers who need expertise and time, but not funding. 

The Jean Golding Institute’s team of data scientists and research software engineers is here to support researchers across the University of Bristol, fostering a collaborative research environment spanning multiple disciplines. Over the past seven years, our team has expanded thanks to various funding sources, reflecting the increasing importance of data science support in facilitating research outcomes and impact.

Get in touch with our team to find out how they can help you with: 

  • Data analysis – recommendations or support with tools and methods for statistics, modelling, machine learning, natural language processing, computer vision, geospatial datasets and reproducible data analysis. 
  • Software development – technical support, coding (for example: Python, R, MATLAB, SQL, bash scripts), code review and best practices. 
  • Data communication – data visualisation, dashboards and websites. 
  • Research planning – experimental design, data management plans, data governance, data hazards and ethics. 

Our aim is to support researchers and groups that may not have in-house expertise but have project ideas that can be developed into applications for funding. We’re seeking projects that can take place over the summer until early autumn (July – October 2024). 

How to apply 

Please complete an online expression of interest form  

Deadline: 15 July 2024 

Selection process 

The JGI team will get back to you within one week, to discuss your request.  

If demand exceeds our current resource levels, we’ll meet with applicants to help prioritise projects. As with seedcorn funding, priority will go to applications that match JGI strategic goals and have clear pathways to benefit, such as an identified funding call or impact case. 

Examples of data science projects 

  • Social mobility analysis project – using local and national level data to investigate how different people in Bristol and other UK cities feel about life in their local environment. The JGI data scientist worked as part of a multidisciplinary team including University of Bristol researchers and external stakeholders, for around 2 days per week for 3 months. They analysed survey and geospatial data using Python and presented findings to the group. The output of the project was a grant application in which a data scientist was costed longer-term.
  • Antimicrobial resistance project – examining patterns in observed levels of antimicrobial resistance during the COVID pandemic. The JGI data scientist worked with a University of Bristol researcher and collaborated with a public sector stakeholder, for around 4 days per week for 4 months. They performed statistical modelling using R, producing data visualisations of the trends found. The project has led to an Impact Acceleration Funding application to develop a tool used to support local health planning. 
  • Transport research-ready dataset grant – linking administrative datasets to support research into car and van use in the UK. The JGI data scientist developed data pipelines and provided methodological and data governance input into a successful ESRC funding application in a collaboration between researchers at the universities of Bristol and Leeds. The data scientist was a named researcher on the application and went on to perform data analysis as part of the project team.