First Steps Towards a Crowd-Sourced Ancient Greek Encyclopaedia

JGI Seed Corn Funding Project Blog 2023/24: Naomi Scott

A page from a 10th-century manuscript of Julius Pollux’s Onomasticon

In the second century A.D., Julius Pollux, Professor of Rhetoric at the Academy in Athens, wrote the Onomasticon (‘Book of Words’), and dedicated it to the Emperor Commodus. The work sits somewhere between an encyclopaedia and a lexicon. Chapters are organised by topic, and Pollux lists appropriate words on diverse themes such as ‘The Gods’, ‘Bakery Equipment’, ‘Diseases of Dogs’, and ‘Objects Found On Top Of Tables’. Throughout his work, Pollux quotes canonical authors such as Homer, Aeschylus, and Sappho in support of what he considers correct and elegant linguistic usage. This means that in addition to providing a wealth of information on everyday life in the ancient world, the Onomasticon is also one of our best sources of quotations from otherwise lost works of ancient Greek literature.   

Despite Pollux’s obvious importance, his work has not been translated into any modern language. The vast size of the Onomasticon (10 books in total, each comprising around 250 chapters) means that it is unwieldy even for researchers able to study the original ancient Greek text. With seed-corn funding from the Jean Golding Institute, my project ‘Crowd Sourcing Julius Pollux’s Onomasticon’ has set to work on filling this gap. Eventually, my aim is to use crowd-sourcing to produce not only a translation of the Onomasticon, making it accessible to researchers in a wide variety of disciplines, but also an edition of the work which is fully data-tagged, so that researchers can better navigate the text and extract key data about it: which ancient authors and genres are most frequently cited as sources, and in what contexts; which topics are granted the most or least coverage; and how are different lexical categories distributed within the encyclopaedia? Without answers to questions such as these, no individual chapter or citation within the Onomasticon can be placed in the wider context of the work as a whole.
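To make this concrete, the sketch below shows one hypothetical way a data-tagged chapter could be represented and queried. The field names, values and structure are purely illustrative and are not the project’s actual tagging scheme.

```python
# Hypothetical sketch of a data-tagged chapter record and a simple
# aggregation over it; schema and values are illustrative only.
from collections import Counter

chapters = [
    {
        "book": 10,
        "title": "Objects Found On Top Of Tables",
        "topic": "household objects",
        "citations": [
            {"author": "Homer", "genre": "epic"},
            {"author": "Aeschylus", "genre": "tragedy"},
        ],
    },
    # ... one record per chapter (roughly 250 per book, 10 books)
]

# Which authors are cited most often across the tagged edition?
author_counts = Counter(
    c["author"] for ch in chapters for c in ch["citations"]
)
print(author_counts.most_common(5))
```

The same structure could be filtered by topic or genre to answer the contextual questions described above.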

Creating a New Digital Edition 

While a digitised version of the ancient Greek text of the Onomasticon exists, it is based on the work of Erich Bethe, whose early twentieth-century edition of Pollux removed all the chapter titles which have been used to organise the text since it was first published as a printed book in 1502. Bethe did this because he did not consider the chapter titles to be Pollux’s own. Both for the purpose of splitting the text up into manageable short chunks for translation, and for the purpose of data-tagging, I decided it was essential to reinstate the titles. Additionally, my own examination of manuscripts of the Onomasticon dating as far back as the 10th century has revealed that the chapter titles are in fact much older than first thought, and that the text as we currently have it (abridged from Pollux’s even longer original!) may even have been conceived with the chapter titles. 

The first step in producing a digital edition suitable for crowd-sourcing and data-tagging is therefore to reinsert the titles into the text. This would be an enormous undertaking if done manually. Working with a brilliant team from Bristol’s Research IT department, led by Serena Cooper, Keiran Pitts, and Mike Jones, we have set about automating this process. Using Ancient Greek OCR (Optical Character Recognition) software designed by Professor Bruce Robertson at Mount Allison University in Canada, we scanned two editions of the text: Bethe’s chapterless version, and Karl Wilhelm Dindorf’s 1824 edition, which includes the titles. The next step is to use digital mapping software to combine the two texts, inserting the titles from Dindorf into the otherwise superior version of the text produced by Bethe.
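As an illustration of this mapping step, the sketch below aligns two line-based versions of the text with Python’s standard difflib module and treats lines present only in Dindorf as candidate chapter titles. It is a simplified stand-in, assuming line-by-line OCR output, for whatever mapping software the project actually uses.

```python
# A minimal sketch of merging chapter titles from Dindorf into Bethe's text,
# assuming both OCR'd editions are available as lists of lines.
import difflib

def merge_titles(bethe_lines, dindorf_lines):
    merged = []
    matcher = difflib.SequenceMatcher(None, bethe_lines, dindorf_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            # Lines only in Dindorf: candidate chapter titles to reinstate.
            merged.extend(dindorf_lines[j1:j2])
        else:
            # Keep Bethe's (superior) text wherever the two editions overlap.
            merged.extend(bethe_lines[i1:i2])
    return merged
```

In practice the two OCR outputs will not match line for line, so the real mapping step needs fuzzier alignment, but the same insert-detection idea applies.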

Next Steps 

Once the issue of the chapter titles has been resolved, the next step will be to create a prototype of around 20 chapters, which can then be made available to the scholarly community to begin translating and data-tagging the text. A prototype would allow us to get feedback from researchers around the world working with Pollux, and to better understand what kinds of data would be most useful to those seeking to understand the text. This feedback can then be integrated into an eventual complete edition of the text, which can be translated and data-tagged as a whole.

Eventually, this project will not only make the Onomasticon more accessible to researchers and help to revolutionise our understanding of this important work. A full translation and data-tagged edition, complete with chapter titles, will also allow the Onomasticon to have an impact beyond the academic community. The eventual plan is to train arts professionals engaging with the ancient Greek world to use the digital edition and translation. The Onomasticon’s remarkably detailed picture of ordinary life and ordinary stuff in antiquity makes it a vital resource for anyone trying to recreate the ancient Greek world on stage, on screen, or in novels. The hope is that this project will therefore not only change the way that scholars understand the Onomasticon and its place in the history of the encyclopaedia, but also offer artists a window onto antiquity and, through its impact on art, shape the public understanding of the ancient world.

New Turing Liaison Officers join the JGI team

As an active member of the Turing University Network, we have appointed a Turing Liaison Manager and two Turing Liaison Academics to support and enhance the partnership between the Alan Turing Institute and the University of Bristol. These roles will focus on increasing engagement with the Turing, developing external and internal networks around data science and AI, and supporting relevant interest groups, Enrichment students and Turing Fellows at the University of Bristol.

Turing Liaison Manager Isabelle Halton and Turing Liaison Academics Conor Houghton and Emmanouil Tranos are keen to build communities around data science and AI, providing support to staff and students who want to be more involved in Turing activity.

Isabelle previously worked in the Professional Liaison Network in the Faculty of Social Sciences and Law. She has extensive experience in building relationships and networks, project and event management and streamlining activities connecting academics and external organisations.

Conor is a Reader in the School of Engineering Mathematics and Technology, interested in linguistics and the brain. Conor is a Turing Fellow and a member of TReX, the Turing ethics committee.

Emmanouil is currently a Turing Fellow and a Professor of Quantitative Human Geography, specialising primarily in the spatial dimensions of the digital economy.


If you’re interested in becoming more involved with Turing activity or have any questions about the partnership, please email Isabelle Halton, Turing Liaison Manager, via the Turing Mailbox.

How Smartwatches Could Help People with Type 1 Diabetes 

JGI Seed Corn Funding Project Blog 2023/24: Miranda Armstrong

Introduction

Type 1 diabetes (T1D) requires consistent self-management, which places a large burden on those who live with it. We explored the role smartwatches could play in reducing that burden. 

Figure 1: Theoretical closed-loop system that uses smartwatch data in its algorithm. Closed-loop systems without smartwatch integration are the current state of the art in T1D technology. They begin to automate the T1D management process by using data from the ecosystem of devices (continuous glucose monitor, insulin pump and control algorithm) to predict future changes in blood glucose and adjust insulin dosage to counteract these changes.

Aims

The project aimed to collect and build a dataset that would allow exploration of the potential of smartwatches in T1D management. This would include both data from the smartwatches and the T1D technology the participants used, and participants’ experiences of using a smartwatch alongside their typical T1D management. To meet this aim, the following goals were set: 

  1. Collect data from participants, including from smartwatches and T1D devices, and in interviews and focus groups. 
  2. Clean, anonymise, and combine data from different sensors into a consistent format, and transcribe the interviews and focus groups. 
  3. Hold an online data challenge using a sample of the collected data to promote the dataset and highlight potential uses for it. 
  4. Release the dataset publicly to allow other researchers to use it as part of their work and therefore increase the value of the dataset. 

What was achieved

Figure 2: An example day of data from one participant. The upper plot shows blood glucose, insulin and carbohydrates over time, highlighting the data available to current commercial closed-loop systems; the lower plot shows heart rate and steps from the smartwatch over the same period.

Data Collection

The project recruited 24 participants, each of whom was given a smartwatch or could use their own. Over six months, participants donated data from their smartwatches and T1D devices to create a dataset aimed at exploring the integration of smartwatch data into a closed-loop algorithm. This dataset reflects real-life conditions, and participants used a range of T1D technology. Across the more than 2,000 days of donated data, coverage from all of the devices each participant used was high. During this time, participants took part in interviews and focus groups to discuss their opinions of the smartwatches and the potential roles they could see for them in T1D management. A total of 62 interviews and 11 focus groups were completed across the study period. 

Data Processing

We processed a large amount of data to prepare it for public use. The smartwatch and T1D data were cleaned and anonymised (so no one involved in the study could be identified) and then organised into two formats. One was an easy-to-use dataset for researchers to test their algorithms, and the other kept the data in its original form for deeper exploration. We also transcribed and anonymised the interviews and focus groups so other researchers could analyse them to understand the participants’ experience of using the smartwatch. 
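The sketch below illustrates the kind of cleaning and reformatting involved, assuming a pandas-based workflow and hypothetical column names (timestamp, heart_rate, steps); the study’s actual pipeline and identifiers may differ.

```python
# Illustrative sketch of the cleaning step; column names are hypothetical.
import pandas as pd

def prepare_smartwatch_data(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Drop direct identifiers so participants cannot be identified.
    df = df.drop(columns=["name", "device_serial"], errors="ignore")
    # Standardise timestamps to a single, timezone-naive format.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True).dt.tz_localize(None)
    # Remove physiologically implausible heart-rate readings.
    df = df[df["heart_rate"].between(30, 220)]
    # Resample onto a regular 5-minute grid for the "easy-to-use" format.
    return (
        df.set_index("timestamp")
          .resample("5min")
          .agg({"heart_rate": "mean", "steps": "sum"})
          .reset_index()
    )
```

The second, original-form dataset described above would skip the resampling step so that researchers can explore the raw sensor streams directly.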

Initial Findings

Initial engagement with the interview and focus group data has highlighted several potential uses for smartwatches in T1D management. These include acting as a device to display data quickly and discreetly to the user, as an interface with T1D technology for easier access, and as a data source to inform management decision-making around activity. The analysis also highlighted design implications. These include utilising automation to provide benefit without increasing user burden, allowing customisation to accommodate the wide range of user preferences and usage patterns to promote uptake, and building in flexibility so that these systems can adapt to changing user needs and ensure continued use of the device in the future. 

Future Plans

The data challenge and the public release of the dataset are scheduled for later this year. We plan to run the competition from mid-September to the end of November 2024, with £1600 in prize vouchers available across entries. If you would like to hear more details about the competition, please leave your details in this form. The whole dataset will then be published after the conclusion of the data competition. 

Additionally, we will conduct our own analysis on the data that has been collected. This will expand our initial findings on where and how a smartwatch could be used to improve T1D management. It will also test whether adding smartwatch data can improve the prediction of blood glucose by factoring in information on activity. This could be utilised in closed-loop systems (Figure 1) and would allow them to factor activity into their algorithms. For example, if the user were to go for a walk, such a system could detect that activity, predict the drop in blood glucose levels it would cause, and reduce insulin delivery to counteract this drop. Such a system would improve T1D management and reduce the burden placed on those managing it. 
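As a rough illustration of that planned test, the sketch below compares glucose prediction with and without smartwatch features using scikit-learn. The file name, column names and model choice are assumptions made for the example and do not reflect the study’s actual analysis.

```python
# A minimal sketch comparing glucose prediction with and without
# smartwatch features, on a hypothetical merged 5-minute dataset.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("merged_participant_data.csv")  # hypothetical file

# Predict glucose 30 minutes ahead (6 steps at 5-minute resolution).
df["glucose_future"] = df["glucose"].shift(-6)
df = df.dropna()

base_features = ["glucose", "insulin", "carbs"]
watch_features = base_features + ["heart_rate", "steps"]

for name, cols in [("closed-loop data only", base_features),
                   ("with smartwatch data", watch_features)]:
    X_train, X_test, y_train, y_test = train_test_split(
        df[cols], df["glucose_future"], shuffle=False, test_size=0.2
    )
    model = GradientBoostingRegressor().fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:.2f}")
```

A lower error for the smartwatch-augmented model would suggest that activity information helps anticipate glucose changes, which is exactly the signal a closed-loop algorithm could exploit.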


Contact details

Sam James: sam.james@bristol.ac.uk 

Miranda Armstrong: Miranda.armstrong@bristol.ac.uk 

Zahraa Abdallah: zahraa.abdallah@bristol.ac.uk 

Ask JGI Student Experience Profiles: Rachael Laidlaw

Rachael Laidlaw (Ask-JGI Data Science Support 2023-24) 

I first came into contact with the Jean Golding Institute last year at The Alan Turing Institute’s annual AI UK conference in London, and then again in the early stages of the DataFace project in collaboration with Cheltenham Science Festival. This meant that before I officially joined the team back in October, I already knew what a lovely group of people I’d be getting involved with! Having nice colleagues, however, was not my only motivation for applying to be an Ask-JGI student. On top of that, I’d decided that whilst starting out in my ecological computer-vision PhD niche, I didn’t want to forget all of the statistical skills that I’d developed back in my MSc degree. Plus, it sounded really fun to keep myself on my toes by exercising my mind on a variety of data-oriented requests from across the university’s many departments. 

Rachael Laidlaw (centre), second-year PhD student in Interactive Artificial Intelligence, with JGI staff members at the JGI stall

During the course of my academic life, I’ve taken the plunge of changing disciplines twice, moving from pure mathematics to applied statistics and then again to computer science, and I liked the idea of supporting others to potentially do the same thing as they looked to enhance their work by delving into data. Through Ask-JGI, I kept my weeks interesting by having something other than my own research to switch my focus to, and it felt very fulfilling to be able to offer useful technical advice to those who were in the same position that I myself had been in not so long ago! I therefore got stuck in with anything and everything, from training CNNs for rainfall forecasting or performing statistical tests to compare the antibiotic resistance of different bacteria, to modelling the outcomes of university spinouts or advising on the ethical considerations and potential bias present when designing and deploying a questionnaire-based study. And, of course, by exposing myself to these problems (alongside additional outreach initiatives and showcase events), I also learned a lot along the way, both from my own exploration and from the rest of the team’s insights. 

One especially exciting query revolved around automating the process of identifying from images which particular underground printing presses had been used to produce various historical political pamphlets, based on imperfections in the script. This piqued my interest immediately as it drew parallels with my PhD project, highlighting the many uses of computer vision and how it can save us time by speeding up traditionally manual processes: from monitoring animal biodiversity to carrying out detective work on old written records. 

All in all, this year has broadened my horizons by giving me great consultancy-style work experience through the opportunity to share my expertise and help a wide range of researchers. I would absolutely encourage other curious PhD students to apply and see what they can both give to and gain from the role! 

Children of the 90s and Synthetic Health Data 

JGI Seed Corn Funding Project Blog 2023/24: Mark Mumme, Eleanor Walsh, Dan Smith, Huw Day and Debbie Johnson

What is Children of the 90s? 

Children of the 90s (Co90s) is a multi-generational population-based study following the health and development of nearly 15,000 families living around Bristol, whose children were born in 1991 and 1992. 

Co90s initially recruited its participants during the early stages of the mothers’ pregnancies and captures information prospectively, at key time points, using self-reported questionnaires, interviews, clinics and electronic health records (EHR). 

The Co90s supports about 20 project teams using NHS data at any one time.  

What is Synthetic Data? 

At its most basic, synthetic data is information generated artificially rather than recorded directly from real-world events. It is essentially a computer-generated version of a dataset that contains no real records, thereby preserving privacy and confidentiality. 

Privacy vs Fidelity

Generating synthetic data is frequently a balancing act between fidelity and privacy (Figure 1). 

“Fidelity”: how well does the synthetic data represent the real-world data?  

“Privacy”: can personal information be deduced from the synthetic data? 

Figure 1: Privacy versus fidelity, a spectrum running from high privacy and low fidelity at one end to low privacy and high fidelity at the other

Why synthetic NHS data: 

EHR data are incredibly valuable and rich data sources, yet there are significant difficulties in accessing this data, including financial costs and the time taken to complete multiple application forms and have these approved. 

Because authentic NHS data is so difficult to access, it is also not unusual for researchers never to have worked with, or possibly even seen, this type of data before. They often face a steep learning curve to understand how the data is structured, what variables are present and how those variables relate to each other. 

The journey a project travels just to get NHS data (Figure 2) typically goes through the following stages: 

Figure 2: The stages a project goes through to get NHS data, from initial grant application to data access

Each of these stages can take several months, and they are usually sequential. It is not unheard of for projects to run out of time and/or money due to these lengthy timescales. 

Current synthetic NHS data: 

Recently, the NHS has released synthetic Hospital Episode Statistics (HES) data (available at https://digital.nhs.uk/services/artificial-data) which is, unfortunately, quite limited for practical purposes. This is because a very simple approach was adopted: each variable is randomly generated independently of all the others. While it is possible to infer broadly accurate descriptive statistics for single variables (e.g., age or sex), it is impossible to infer relations between variables (e.g., how the number of cancer diagnoses increases with age). In the terms introduced above, it has high privacy but low fidelity. As shown in the heatmap in Figure 3, we observe practically no association between diagnosis and treatment, because this synthetic NHS data is randomly generated variable by variable. 

Figure 3: Heatmap displaying the relations between disease groupings (right side) and treatment (bottom) from the synthetic NHS data. The colour shadings represent the number of patients (the darker the shading, the higher the number). The similarity in shading within each diagnosis row shows that treatment and diagnosis were largely independent in this synthetic dataset.

What do researchers want from synthetic data? 

We developed an anonymous survey and asked 230 researchers experienced with EHR data what would be important to them when considering using synthetic EHR data. Of the 24 who responded, most were epidemiologists at fellow or professor level. Researchers were then invited to an online discussion group to expand on insights from the survey. Seven researchers attended. 

Most researchers had more than 3 years of experience using EHRs, both within and outside of cohort studies. Although few had much knowledge of synthetic EHR data, many had heard of it and were interested in its application, particularly as a tool for training and learning about EHRs generally. 

The most important issues to researchers (Figure 4) were consistent patient details and having all the additional diagnosis and treatment codes rather than just the main ones: 

Figure 4: What researchers look for in synthetic EHRs (number of responses for each desirable quality)

The most important uses for these researchers were to test and develop code and to understand the broad structure of the data, as shown below (Figure 5): 

Figure 5: Priorities of researchers when using synthetic data, ranked from first to last choice

This was reflected in their main concern: maintaining the utility of the data in the synthetic version through a high level of accuracy and attention to detail. 

During the discussion it was recognised that EHRs are “messy” and synthetic data should emulate this, providing an opportunity to prepare for real EHRs. 


Being able to prepare for the use of real EHRs was the main use case for synthetic data. No one suggested using the synthetic data as the analysis dataset in place of the real data.   


It was suggested, in both the survey responses and the discussion group, that any synthetic data should be bespoke to the requirements of each project. Further, it was observed that each research project only ever uses a portion of the complete dataset, and therefore the synthetic data should be similarly minimised. 

“I think any synthetic data set based on any of the electronic health records should be stripped back to the key things that people use, because then the task of making it a synthetic copy [is] less.” (online participant) 

Summary

Following the survey and discussion with researchers familiar with EHRs, a few key points came through: 

  • Training – using synthetic data to understand how EHRs work, and to develop code. 
  • Fidelity is important – using synthetic data as a way for researchers to experience using EHRs (e.g. the flaws in the real data, linkage errors, duplicates). 
  • Cost – the synthetic data set, and associated training, must be low cost and easily accessible.  

Next Steps

There is a demand for a synthetic data set with a higher level of fidelity than is currently available, and particularly there is a need for data which is much more consistent over time. 

The Co90s is well placed to respond to this demand, and will look to: