Chakaya Nyamvula’s JGI Placement 

Hi, I’m Chakaya. I am currently pursuing my MSc in AI and Data Science at Keele University and working as a Business Intelligence Analyst at iLabAfrica at Strathmore University in Nairobi, Kenya. This summer, thanks to the partnership between iLabAfrica and JGI, I had an amazing opportunity to work with JGI for my Master’s placement. I wanted to immerse myself in a research environment and connect with people in academia to help figure out my future career path. Working under the guidance of Dr Huw Day, I gained valuable insights into the world of research and expanded my professional network, all while experiencing life in the UK. 

Chakaya Nyamvula in front of a body of water
Chakaya Nyamvula, JGI Intern

What was the project about? 

Previously, for a JGI-funded Seedcorn project, Mark Mumme, Eleanor Walsh, Dan Smith, Huw Day, and Debbie Johnson surveyed researchers on how they might want to use synthetic data in their research.

Synthetic data is what you get when you take an existing dataset and create a synthetic (i.e. fake) version of it. You might want to do this so you can share something that looks like the real data and gives a flavour of the statistical patterns within it, whilst preserving the privacy of the individuals in it. Amongst other things, this is useful for writing data pipelines whilst you go through the necessary ethics checks to access sensitive data.

For my summer placement with JGI, I worked with the MIMIC-IV dataset of electronic health records and explored methods of generating synthetic versions of some of this data. It was also important to understand how to measure or benchmark how successful the synthetic data generation had been, based on how well it preserved privacy or how well the statistics of the synthetic data emulated those of the real data.
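As a rough illustration of the two kinds of benchmark mentioned above (statistical similarity and privacy), here is a minimal Python sketch. It is not the code from the placement: the toy numbers, the function names and the choice of metrics (the gap between column means, and the distance from each synthetic row to its nearest real row) are all assumptions made for illustration.

```python
import math
import random
import statistics

def column_mean_gaps(real, synthetic):
    """Absolute difference between real and synthetic column means (fidelity)."""
    gaps = []
    for col in range(len(real[0])):
        real_mean = statistics.mean(row[col] for row in real)
        synth_mean = statistics.mean(row[col] for row in synthetic)
        gaps.append(abs(real_mean - synth_mean))
    return gaps

def nearest_real_distances(real, synthetic):
    """Distance from each synthetic row to its closest real row (privacy proxy).

    A distance of 0 means a real record has been copied verbatim.
    """
    return [min(math.dist(s, r) for r in real) for s in synthetic]

# Toy stand-in data (made up; NOT MIMIC-IV): two numeric columns.
rng = random.Random(0)
real = [(rng.gauss(70, 10), rng.gauss(120, 15)) for _ in range(200)]

# A naive "generator": sample each column independently from a fitted normal.
means = [statistics.mean(r[c] for r in real) for c in range(2)]
stdevs = [statistics.stdev(r[c] for r in real) for c in range(2)]
synthetic = [tuple(rng.gauss(m, s) for m, s in zip(means, stdevs))
             for _ in range(200)]

# Worst case for privacy: "synthetic" data that simply copies the real rows.
leaky = list(real)

print("mean gaps:", column_mean_gaps(real, synthetic))
print("closest match, synthetic:", min(nearest_real_distances(real, synthetic)))
print("closest match, leaky copy:", min(nearest_real_distances(real, leaky)))
```

The two measures pull in opposite directions: the leaky copy matches the real statistics perfectly but has a nearest-record distance of zero, which is why both kinds of check are needed.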

What else did you do as part of your placement? 

Alongside my main work, I attended JGI Data Science meetings and learnt about some of the data science projects at the JGI, including a project on antimicrobial resistance and another on 3D image analysis of CT-scanned zebrafish to study bone development.

For some of the more computationally demanding aspects of the project, I was taught how to use the JGI’s server (known within the office as “Jeeves”).

I also had the opportunity to meet some PhD students at the University of Bristol, ask them about their research, and get advice on applying for PhDs in the future. 

Left to right, Huw Day, Elena Fillola Mayoral, Yujie Dai and Chakaya Nyamvula sat at a table at an ice cream shop
From left to right: Huw Day (JGI Data Scientist), Elena Fillola Mayoral (PhD student in AI for Climate), Yujie Dai (CDT in Digital Health) and Chakaya Nyamvula (JGI Intern) discussing PhDs over ice cream

What did you learn about? 

One deep learning method we used was a Generative Adversarial Network (GAN). I had never worked with GANs before this project, so diving into this methodology was both challenging and exciting.

A GAN works by having two competing neural networks, a generator and a discriminator. The generator’s job in this case is to take the original data and learn to generate synthetic versions of it. The discriminator’s job is to try to spot the difference between the real data and the synthetic data that has been generated. One of the advantages of such a system is that you end up with two outputs: 1) a neural network which can generate synthetic data based on some training data and 2) a second neural network which can discriminate between real and synthetic data. The second is useful in applications where people might maliciously generate synthetic data, for example deep fake images.
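The tug-of-war between the two networks can be made concrete by looking at what each one is scored on. The sketch below uses the binary cross-entropy losses from the original GAN formulation. The source does not say which loss functions were used in the project, so treat this as an illustrative assumption. The discriminator outputs (its estimated probability that a row is real) are made-up numbers.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: low when real rows score near 1 and fakes near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: low when fakes are scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs: estimated probability that a row is real.
d_real = [0.9, 0.8, 0.95]          # scores given to real rows
d_fake_early = [0.05, 0.10, 0.02]  # early training: fakes are easy to spot
d_fake_late = [0.45, 0.55, 0.50]   # later: fakes are hard to tell apart

for name, d_fake in [("early", d_fake_early), ("late", d_fake_late)]:
    print(name,
          "D loss:", round(discriminator_loss(d_real, d_fake), 3),
          "G loss:", round(generator_loss(d_fake), 3))
```

As the generator improves, its loss falls while the discriminator's loss rises: each network's gain is the other's loss, which is what "competing" means here.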

A good analogy for GANs is two people learning chess by playing against one another. If both start at similar skill levels, then as one person improves, the other slowly improves too. If you lose a chess game, you know you made a mistake and you might be able to work out how to improve for the next time. If you win, then you know you were doing something right.  

However, if you pit a chess grandmaster against a complete beginner, then the beginner will lose every time and will struggle to understand where they are going wrong, making it difficult to improve. Because the task of making synthetic data is quite complicated, when we began the process of training the GAN, the generator was frequently getting it wrong and wasn’t really able to figure out how to improve. 

To combat this, we did two things. First, you can handicap the discriminator a bit to give the generator a head start (imagine making your grandmaster play blindfolded). This helped, but still wasn’t enough. 

One of the pair plots showing generated vs real data at epoch 0
One of the pair plots showing generated vs real data at epoch 25000
Pair plots showing how well the real and the synthetic data match by comparing each column. Real data is in blue, synthetic data is in red. The diagonal plots show histogram density plots of each column and how it compares between real and synthetic data. The off-diagonal plots show scatter plots between pairs of variables. The left pair plot shows the output at the start of training, where the synthetic generator just randomly samples a scatter of points. You can see that this is not a good match for the original data. The right pair plot shows that after training, the generator does a much more convincing job of emulating the real data. It is still not perfect, but it is particularly good at identifying clumps of data.

Secondly, you can start to think about how you inform your neural networks whether or not they were successful. Imagine if instead of “win” or “lose” as your outcome of the chess games, you got a measure of how well you performed, say a measure of how many good moves you made. With this more specific information, it becomes easier to decipher why you lost and how you might improve.  
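To make the difference between "win or lose" feedback and graded feedback concrete, here is a small sketch for a single numeric column. The source does not name the scoring method that was used. A distance between the real and generated distributions, the idea behind Wasserstein-style GANs, is one well-known option and is assumed here purely for illustration. The toy discriminator rule and the numbers are made up.

```python
def win_or_lose(real, fake, threshold=0.5):
    """Binary feedback: 'caught' if the sample means differ by more than threshold."""
    gap = abs(sum(real) / len(real) - sum(fake) / len(fake))
    return "caught" if gap > threshold else "fooled"

def wasserstein_1d(real, fake):
    """Graded feedback: 1-D Wasserstein distance for equal-sized samples.

    For two samples of the same size this is the mean absolute difference
    between their sorted values; smaller means a closer match.
    """
    if len(real) != len(fake):
        raise ValueError("this simple version needs equal-sized samples")
    return sum(abs(r - f) for r, f in zip(sorted(real), sorted(fake))) / len(real)

real = [1.0, 2.0, 3.0, 4.0, 5.0]           # made-up "real" column
bad_fake = [11.0, 12.0, 13.0, 14.0, 15.0]  # generator output, far off
better_fake = [4.0, 5.0, 6.0, 7.0, 8.0]    # improved, but still detectable

for name, fake in [("bad", bad_fake), ("better", better_fake)]:
    print(name, win_or_lose(real, fake), wasserstein_1d(real, fake))
```

Both attempts are "caught", so the binary verdict gives the generator nothing to learn from, whereas the distance drops from 10.0 to 3.0 and shows that the second attempt is moving in the right direction.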

To Be Continued? 

To finish my placement, I shared my experience with my placement supervisors at Keele University through a presentation and a report. I then had the opportunity to present my work to the Data Science Seminar at the University of Bristol, with several lecturers from the data science community in attendance, alongside JGI Data Scientists and some friends I made along the way.  

Additionally, all the code we worked on can be found in a public GitHub repository on Chakaya’s GitHub, for other researchers to use and experiment with.

Chakaya Nyamvula and Huw Day standing in front of a projector presenting at the Data Science Seminar. The projector has a slide on it that says 'Introduction to synthetic data' 
Chakaya Nyamvula (left) and Huw Day (right) presenting at the Data Science Seminar 

Reflecting on my placement at JGI, I can confidently say it was an incredible learning experience. I had the privilege of working with a fantastic supervisor, Dr Huw Day, who provided guidance throughout the project. Co-working with the talented data scientists at JGI was both inspiring and rewarding, and I thoroughly enjoyed networking with professionals in academia. The challenges I faced, particularly working with GANs for the first time, pushed me to grow and expand my skill set. Overall, this experience not only deepened my technical expertise but also solidified my interest in pursuing a career that bridges research and data science.

Working towards more universal skin cancer identification with AI 

JGI Seed Corn Funding Project Blog 2023/24: James Pope

9 examples of malignant/benign cancer marks on different skin types
Images from the International Skin Imaging Collaboration (https://www.isic-archive.com/)

Introduction

Open-source skin cancer datasets contain predominantly lighter skin tones, potentially leading to biased artificial intelligence (AI) models. This study aimed to analyse these datasets for skin tone bias.

What were the aims of the seed corn project? 

The project’s aims were to perform an exploratory data analysis of open-source skin cancer datasets and evaluate potential skin tone bias resulting from the models developed with these datasets.  Assuming biases were found and time permitting, a secondary goal was to mitigate the bias using data pre-processing and modelling techniques. 

What was achieved? 

Dataset collection

The project focused on the International Skin Imaging Collaboration (https://www.isic-archive.com/) archive, which contains over 20 datasets totalling over 100,000 images. The analysis required that the images provide some indication of skin tone. We found that only 3,623 images recorded the Fitzpatrick Skin Type, a scale from 1 (lighter) to 6 (darker). For each of these images, we mapped the Fitzpatrick Skin Type to light or dark skin tone. As future work, the project began exploring tone classification techniques to expand the set of images considered.

Artificial Intelligence Modelling

We then developed a typical artificial intelligence model, specifically a deep convolutional neural network, to classify whether the images are malignant (i.e. cancerous) or benign. The model was trained on 2/3 of the images and evaluated on the remaining 1/3. Due to computational limits, the model was only trained for 50 epochs. The model’s accuracy (how many correct classifications it made of either benign or malignant tumours out of all the tumours it was evaluated on) was comparatively poor at only 82%.
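The evaluation set-up described above can be sketched in a few lines of Python. This is a generic illustration rather than the project's code: the function names, the placeholder image IDs and the labels are all made up.

```python
import random

def split_two_thirds(items, seed=0):
    """Shuffle, then use the first 2/3 for training and the rest for evaluation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Share of all evaluated images whose predicted label matches the truth."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

image_ids = list(range(9))  # placeholder IDs
train_ids, eval_ids = split_two_thirds(image_ids)
print(len(train_ids), len(eval_ids))  # 6 3

# Made-up labels: 1 = malignant, 0 = benign.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 0.9
```

In the toy labels, accuracy is 0.9 even though one of the two malignant cases is missed, which illustrates why accuracy alone can look healthier than a model's ability to find cancers.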

Bias Analysis

The model was then evaluated relative to light and dark skin tones.  We found that the model was better at identifying cancer in light versus dark skin tone images.  The recall/true positive rate for dark skin tones was 0.26 while for light skin tones it was 0.45.  The resulting disparate impact (a measure used to indicate if a test is biased for certain groups) was found to be 0.58, which indicates the model is potentially biased.
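The disparate impact figure can be reproduced from the two recall values: it is the ratio of the rate for the disadvantaged group to the rate for the advantaged group. That the project computed it exactly this way is an inference from the reported numbers (0.26 / 0.45 ≈ 0.58). The 0.8 cut-off in the sketch is the widely used "four-fifths" rule of thumb, not something stated in the project, and the labels in the last line are made up.

```python
def recall(y_true, y_pred):
    """True positive rate: share of malignant cases (1) that the model flags."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(predictions_for_positives) / len(predictions_for_positives)

def disparate_impact(rate_unprivileged, rate_privileged):
    """Ratio of the two groups' rates; 1.0 would mean equal treatment."""
    return rate_unprivileged / rate_privileged

# Recall values reported above.
recall_dark, recall_light = 0.26, 0.45
di = disparate_impact(recall_dark, recall_light)
print(round(di, 2))  # 0.58

# Assumption: the common "four-fifths" rule of thumb flags ratios below 0.8.
print("potentially biased" if di < 0.8 else "within the 0.8 rule of thumb")

# recall() on made-up labels: 1 of 4 malignant cases found.
print(recall([1, 1, 1, 1, 0, 0], [1, 0, 0, 0, 0, 1]))  # 0.25
```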

Future plans for the project 

The project results were limited by the small subset of images with skin tone labels and by constrained computational resources. Future work is to further develop the tone classifier to expand the number of labelled images. Converting colour values from images into values more closely related to skin tone, and then comparing these with the tone labels of the image, might help train an AI model to exclude the tumour itself when classifying the skin tone of the whole image. This is important as we know that the tone of tumours themselves is often different to that of the surrounding skin.
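The source does not say which colour transformation is planned. One commonly used measure in the dermatology literature is the Individual Typology Angle (ITA), computed from the L* (lightness) and b* (yellow-blue) channels of the CIELAB colour space. The sketch below uses it purely as an assumed example, with made-up pixel colours.

```python
import math

def srgb_to_lab(r, g, b):
    """Convert an 8-bit sRGB colour to CIE L*a*b* (D65 white point)."""
    def linearise(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearise(r), linearise(g), linearise(b)
    x = 0.4124 * rl + 0.3576 * gl + 0.1805 * bl
    y = 0.2126 * rl + 0.7152 * gl + 0.0722 * bl
    z = 0.0193 * rl + 0.1192 * gl + 0.9505 * bl

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

def individual_typology_angle(r, g, b):
    """ITA in degrees: higher values correspond to lighter skin tones."""
    lightness, _, yellow_blue = srgb_to_lab(r, g, b)
    return math.degrees(math.atan2(lightness - 50, yellow_blue))

# Made-up example pixels (not taken from ISIC images).
lighter_pixel = (240, 210, 190)
darker_pixel = (90, 60, 40)
print(individual_typology_angle(*lighter_pixel))
print(individual_typology_angle(*darker_pixel))
```

Computed per pixel, a value like this could be compared with the range expected for the labelled skin type, so that pixels falling outside it, such as those belonging to the tumour, can be identified.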

Heat map showing where the skin tone matches the label
An example image from ISIC which had its Fitzpatrick Skin Type labelled. The light green indicates where individual pixels correspond with expected colours associated with the labelled skin type. Notice that the centre of the image, where the tumour is, does not match.

More powerful computational resources will be acquired and used to sufficiently train the model.   Future work will also employ explainable AI techniques to identify the source of the bias. 


Contact details and links 

James Pope: https://research-information.bris.ac.uk/en/persons/james-pope

Ayush Joshi: https://research-information.bris.ac.uk/en/persons/ayush-joshi

First Steps Towards a Crowd-Sourced Ancient Greek Encyclopaedia

JGI Seed Corn Funding Project Blog 2023/24: Naomi Scott

Passage of Ancient Greek text
A page from a 10th-century manuscript of Julius Pollux’s Onomasticon

In the second century A.D., Julius Pollux, Professor of Rhetoric at the Academy in Athens, wrote the Onomasticon (‘Book of Words’), and dedicated it to the Emperor Commodus. The work sits somewhere between an encyclopaedia and a lexicon. Chapters are organised by topic, and Pollux lists appropriate words on diverse themes such as ‘The Gods’, ‘Bakery Equipment’, ‘Diseases of Dogs’, and ‘Objects Found On Top Of Tables’. Throughout his work, Pollux quotes canonical authors such as Homer, Aeschylus, and Sappho in support of what he considers correct and elegant linguistic usage. This means that in addition to providing a wealth of information on everyday life in the ancient world, the Onomasticon is also one of our best sources of quotations from otherwise lost works of ancient Greek literature.   

Despite Pollux’s obvious importance, his work has not been translated into any modern language. The vast size of the Onomasticon (10 books in total, each comprised of around 250 chapters) means that it is unwieldy even for researchers able to study the original ancient Greek text. With seed-corn funding from the Jean Golding Institute, my project ‘Crowd Sourcing Julius Pollux’s Onomasticon’ has set to work on filling this gap. Eventually, my aim is to use crowd-sourcing to produce not only a translation of the Onomasticon, thereby making it accessible to researchers in a wide variety of disciplines, but an edition of the work which is fully data-tagged, so that researchers can better navigate the text, and produce key data about it: Which ancient authors and genres are most frequently cited as sources and in what contexts; what topics are granted the most or least coverage within the text; and how are different lexical categories distributed within the encyclopaedia? Without the answers to questions such as these, any individual chapter or citation within the Onomasticon cannot be placed in the wider context of the work as a whole.  

Creating a New Digital Edition 

While a digitised version of the ancient Greek text of the Onomasticon exists, it is based on the work of Erich Bethe, whose early twentieth-century edition of Pollux removed all the chapter titles which have been used to organise the text since it was first published as a printed book in 1502. Bethe did this because he did not consider the chapter titles to be Pollux’s own. Both for the purpose of splitting the text up into manageable short chunks for translation, and for the purpose of data-tagging, I decided it was essential to reinstate the titles. Additionally, my own examination of manuscripts of the Onomasticon dating as far back as the 10th century has revealed that the chapter titles are in fact much older than first thought, and that the text as we currently have it (abridged from Pollux’s even longer original!) may even have been conceived with the chapter titles. 

The first step in producing a digital edition suitable for crowd-sourcing and data-tagging is therefore to reinsert the titles into the text. This would be an enormous undertaking if done manually. Working with a brilliant team from Bristol’s Research IT department, led by Serena Cooper, Keiran Pitts, and Mike Jones, we have set about automating this process. Using Ancient Greek OCR (Optical Character Recognition) software designed by Professor Bruce Robertson at Mount Allison University in Canada, two editions of the text were scanned: Bethe’s chapterless version, and Karl Wilhelm Dindorf’s 1824 edition, which includes the titles. The next step is to use digital mapping software to combine the two texts, inserting the titles from Dindorf into the otherwise superior version of the text produced by Bethe.
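The source does not name the mapping software or describe how it works. As a rough sketch of the general idea, the Python below uses the standard library's difflib to find where each titled chapter begins in an untitled text, tolerating small OCR differences, and inserts the title at that point. The function names are hypothetical, and placeholder English text stands in for the scanned Greek.

```python
from difflib import SequenceMatcher

def find_chapter_start(target_words, opening_words, search_from=0):
    """Return the word index in the target text that best matches a chapter opening."""
    size = len(opening_words)
    wanted = " ".join(opening_words)
    best_pos, best_score = search_from, -1.0
    for pos in range(search_from, len(target_words) - size + 1):
        window = " ".join(target_words[pos:pos + size])
        score = SequenceMatcher(None, window, wanted).ratio()
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos

def insert_titles(untitled_text, titled_chapters, n_words=4):
    """Copy titles from a titled edition into an untitled one.

    titled_chapters is a list of (title, chapter_text) pairs in reading order.
    Fuzzy matching tolerates small OCR differences between the two editions.
    """
    words = untitled_text.split()
    starts = []
    search_from = 0
    for title, chapter_text in titled_chapters:
        opening = chapter_text.split()[:n_words]
        pos = find_chapter_start(words, opening, search_from)
        starts.append((pos, title))
        search_from = pos + 1

    lines = []
    if starts and starts[0][0] > 0:
        lines.append(" ".join(words[:starts[0][0]]))  # text before the first title
    for i, (pos, title) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(words)
        lines.append(title)
        lines.append(" ".join(words[pos:end]))
    return "\n".join(lines)

# Placeholder text with deliberate "OCR" misspellings in the titled edition.
untitled = ("first topic words go here second topic follows on after "
            "third topic ends the text")
titled = [
    ("TITLE ONE", "first topik words go here"),
    ("TITLE TWO", "second topic folows on after"),
    ("TITLE THREE", "third topic ends the text"),
]
print(insert_titles(untitled, titled))
```

The output keeps the wording of the untitled edition and takes only the titles from the titled one, which mirrors the aim of keeping Bethe's text while restoring Dindorf's titles.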

Next Steps 

Once the issue of the chapter titles has been resolved, the next step will be to create a prototype of around 20 chapters, which can then be made available to the scholarly community to begin translating and data-tagging the text. A prototype would allow us to get feedback from researchers around the world working with Pollux, and to better understand what kinds of data would be most useful to those seeking to understand the text. This feedback can then be integrated into an eventual complete edition of the text which can then be translated and data-tagged as a whole.  

Eventually, this project will not only make the Onomasticon more accessible to researchers, and help to revolutionise our understanding of this important work. A complete translation and data-tagged edition complete with chapter titles will also allow the Onomasticon to have an impact beyond the academic community. The eventual plan is to train arts professionals engaging with the ancient Greek world to use the digital edition and translation. The Onomasticon’s remarkably detailed picture of ordinary life and ordinary stuff in antiquity makes it a vital resource for anyone trying to recreate the ancient Greek world on stage, on screen, or in novels. The hope is that this project will therefore not only change the way that scholars understand the Onomasticon and its place in the history of the encyclopaedia. It can also offer artists a window onto antiquity, and through its impact on art, shape the public understanding of the ancient world.  

New Turing Liaison Officers join the JGI team

As an active member of the Turing University Network, we have appointed a Turing Liaison Manager and two Turing Liaison Academics to support and enhance the partnership between The Alan Turing Institute and the University of Bristol. These roles will focus on increasing engagement from Turing, developing external and internal networks around data science and AI, and supporting relevant interest groups, Enrichment students and Turing Fellows at the University of Bristol.

Turing Liaison Manager Isabelle Halton and Turing Academic Liaisons Conor Houghton and Emmanouil Tranos are keen to build communities around data science and AI, providing support to staff and students who want to be more involved in Turing activity.

Isabelle previously worked in the Professional Liaison Network in the Faculty of Social Sciences and Law. She has extensive experience in building relationships and networks, project and event management and streamlining activities connecting academics and external organisations.

Conor is a Reader in the School of Engineering Mathematics and Technology, interested in linguistics and the brain. Conor is a Turing Fellow and a member of the TReX, the Turing ethics committee.

Emmanouil is currently a Turing Fellow and a Professor of Quantitative Human Geography, specialising primarily in the spatial dimensions of the digital economy.


If you’re interested in becoming more involved with Turing activity or have any questions about the partnership, please email Isabelle Halton, Turing Liaison Manager, via the Turing Mailbox.

Ask JGI Student Experience Profiles: Rachael Laidlaw

Rachael Laidlaw (Ask-JGI Data Science Support 2023-24) 

I first came into contact with the Jean Golding Institute last year at The Alan Turing Institute’s annual AI UK conference in London, and then again in the early stages of the DataFace project in collaboration with Cheltenham Science Festival. This meant that before I officially joined the team back in October, I already knew what a lovely group of people I’d be getting involved with! Having nice colleagues, however, was not my only motivation for applying to be an Ask-JGI student. On top of that, I’d decided that whilst starting out in my ecological computer-vision PhD niche, I didn’t want to forget all of the statistical skills that I’d developed back in my MSc degree. Plus, it sounded really fun to keep myself on my toes by exercising my mind tackling a variety of data-oriented requests from across the university’s many departments. 

Rachael Laidlaw in centre with two JGI staff members to the left and one JGI staff member to the right pointing towards a Data pin board at the JGI stall
Rachael Laidlaw (centre), second-year PhD student in Interactive Artificial Intelligence, and other JGI staff members at the JGI stall

During the course of my academic life, I’ve taken the plunge of changing disciplines twice, moving from pure mathematics to applied statistics and then again to computer science, and I liked the idea of supporting others to potentially do the same thing as they looked to enhance their work by delving into data. Through Ask-JGI, I kept my weeks interesting by having something other than my own research to switch my focus to from time to time, and it felt very fulfilling to be able to offer useful technical advice to those who were in the same position that I had been in not so long ago! I therefore got stuck in with anything and everything, from training CNNs for rainfall forecasting or performing statistical tests to compare the antibiotic resistance of different bacteria, to modelling the outcomes of university spinouts or advising on the ethical considerations and potential bias present when designing and deploying a questionnaire-based study. And, of course, by exposing myself to these problems (alongside additional outreach initiatives and showcase events), I also learned a lot along the way, both from my own exploration and from the rest of the team’s insights.

One especially exciting query revolved around automating the process of identifying from images which particular underground printing presses had been used to produce various historical political pamphlets, based on imperfections in the script. This piqued my interest immediately as it drew parallels with my PhD project, highlighting the many uses of computer vision and how it can save us time by speeding up traditionally manual processes: from the monitoring of animal biodiversity to carrying out detective work on old written records.

All in all, this year has broadened my horizons by giving me great consultancy-style work experience through the opportunity to share my expertise and help a wide range of researchers. I would absolutely encourage other curious PhD students to apply and see what they can both give to and gain from the role!