Irish Financial Records from the Reign of Edward I: What Applying Data Science Techniques Can Reveal

JGI Seedcorn follow on funding 2023-25: Mike Jones and Brendan Smith

Introduction

This project was a follow-on extension of the JGI-funded seed-corn initiative titled ‘Digital Humanities meets Medieval Financial Records: The Receipt Rolls of the Irish Exchequer.’ The original project, along with a subsequent paper, ‘The Irish Receipt Roll of 1301–2: Data Science and Medieval Exchequer Practice,’ focused on a single receipt roll from the 1301–2 financial year. Building on this foundation, the follow-on project aimed to enhance software and techniques across a larger collection of receipt rolls from Edward I’s reign (1272–1307), offering broader insights into medieval financial practices. However, developing the scripts and troubleshooting errors took longer than expected, which reduced the time available for more in-depth analysis. Nevertheless, we developed a data processing pipeline that allowed a broad analysis of the receipt rolls.

Data

The Irish Exchequer was a government institution responsible for collecting and disbursing income within the lordship of Ireland on behalf of the English Crown. Receipt rolls documented the money received each day by the Irish Exchequer from crown officials, private individuals, and communities. The entries in the rolls consisted of heavily abbreviated Medieval Latin.

There are forty surviving receipt rolls from the reign of Edward I held at the National Archives (TNA) in London. The Virtual Record Treasury of Ireland (VRTI) has translated the rolls from Latin into English for Edward I and later reigns. They have also encoded the translations into TEI/XML (https://tei-c.org), creating a machine-readable and structured digital corpus. The translations and high-quality images of the original documents are accessible to the public on the VRTI website. We gained early access to the TEI/XML documents for Edward I’s reign, which formed the foundation of our data corpus.

Data processing pipeline

Data processing pipeline to convert TEI/XML to CSV files: each roll’s TEI/XML is converted to a CSV file, the data in the CSV files is categorised, and the files are then combined into a single ‘mega’ CSV file.

To analyse the data, it was first necessary to parse the TEI/XML files and generate comma-separated (CSV) files that could be processed by Pandas, the standard Python library for data analysis, which would then allow us to create plots and visualisations with Matplotlib and Seaborn.

Each payment given to the Irish Exchequer is called a proffer. Each row in the CSV should represent an individual proffer and should include several pieces of information, including:

  • The financial term. The year was divided into four terms – Michaelmas, Hilary, Easter and Trinity
  • The date of the proffer, e.g., ‘1286-09-30’
  • The day of the proffer, e.g., ‘Monday’
  • The source of the proffer, which is a marginal heading in the roll, e.g., ‘Limerick’
  • The details of the proffer, e.g., ‘From the debts of various people of Co. Limerick by James Keting: £40’
  • The extracted monetary offering, e.g., £40
  • The extracted monetary offering converted to pence, e.g., 9600.0

The pipeline consists of three stages: (1) generate a CSV for each roll; (2) categorise the proffers, for example, whether they relate to profits of justice or rents; and (3) merge all the CSV files into a single ‘mega’ file.
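As a rough illustration of stage (1), the sketch below parses a TEI/XML roll with Python’s standard library and writes one CSV row per proffer. The element and attribute names are hypothetical stand-ins rather than the actual VRTI markup, and the helper shows the pre-decimal conversion behind the pence column (£1 = 20s = 240d).

```python
# A minimal sketch, assuming hypothetical TEI element/attribute names;
# the real VRTI encoding is richer and varies between rolls.
import xml.etree.ElementTree as ET
import pandas as pd

TEI = "{http://www.tei-c.org/ns/1.0}"  # standard TEI namespace

def to_pence(pounds=0, shillings=0, pence=0):
    """Convert pre-decimal money to pence: £1 = 20s = 240d, 1s = 12d."""
    return pounds * 240 + shillings * 12 + pence   # e.g. £40 -> 9600

def roll_to_csv(xml_path, csv_path):
    root = ET.parse(xml_path).getroot()
    rows = []
    for entry in root.iter(f"{TEI}item"):          # hypothetical: one <item> per proffer
        rows.append({
            "term": entry.get("term"),             # Michaelmas, Hilary, Easter or Trinity
            "date": entry.get("when"),             # e.g. 1286-09-30 (absent in early rolls)
            "source": entry.get("source"),         # marginal heading, e.g. Limerick
            "details": " ".join("".join(entry.itertext()).split()),
            # the real pipeline extracts the amount from the details text
            # and converts it to pence with to_pence()
        })
    pd.DataFrame(rows).to_csv(csv_path, index=False)
```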

The development of the data processing pipeline in Python was an iterative process. The script was initially written to parse the 1301–2 roll. Although the TEI/XML encoding provided structure, not all the rolls adhered to the composition of the later receipt rolls. For instance, the earlier rolls do not record dates, and some rolls were only partially complete. Consequently, significant time was spent repeatedly refining the script to accommodate the different rolls, allowing us to establish a consistent CSV format.

Part of the iterative development involved error checking: verifying the total income calculated from the CSV files against the totals given by the Exchequer clerk on the original roll. Ideally, the values should be identical or differ only slightly. If the computed total is lower, this may be because details of proffers were lost through damage to the original roll. Computed totals might be higher if additional proffers were added to the roll after the clerk provided the total. Either case could also indicate parsing errors in the TEI/XML, so any discrepancy requires investigation.

A plot comparing the total provided by the Exchequer clerk with the computed total. Most totals match, but there are outliers where the computed total is higher, which required further investigation.
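A minimal sketch of this check, assuming illustrative column and file names (‘roll’ and ‘pence’ in the combined CSV, plus a separate table of the clerks’ recorded totals), might look like this:

```python
# Compare computed totals per roll against the clerk's recorded totals.
# Column and file names are illustrative, not the project's actual schema.
import pandas as pd

proffers = pd.read_csv("mega.csv")                  # one row per proffer
clerk = pd.read_csv("clerk_totals.csv")             # columns: roll, clerk_pence

computed = (proffers.groupby("roll", as_index=False)["pence"].sum()
                    .rename(columns={"pence": "computed_pence"}))

check = clerk.merge(computed, on="roll", how="left")
check["difference"] = check["computed_pence"] - check["clerk_pence"]

# Any non-zero difference is flagged for investigation.
print(check[check["difference"] != 0].sort_values("difference"))
```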

The error checking facilitated a productive conversation between the project and VRTI, enabling the identification of errors caused by typos in the translations and markup. It also highlighted interesting features in the original rolls. For example, for E 101/230/28, the computed total was significantly greater than that provided by the clerk. The archivists at the TNA re-examined the roll and postulated that membranes from other rolls had been sewn onto this roll during repairs in the Victorian period or later.

Early access to the TEI/XML documents likely meant that more errors were encountered, as not all documents had undergone the whole VRTI editorial process. This resulted in significant time being spent tracking errors, which was not anticipated when the JGI project was conceived.

Analysis and Visualisations

Limitation in scope

After the data was processed, it became possible to analyse and visualise the proffers to the Irish Exchequer. There are 40 existing rolls for the reign. However, due to resource constraints, the analysis is limited to the 21 rolls that are ‘general’ in nature, meaning those relating to proffers from various sources and for different reasons. It does not cover the specialised rolls, such as those related to taxation.

The ‘landscape’ of the rolls

One of the initial visualisations created was to understand the ‘landscape’ of the rolls, specifically what had survived and what had not. In the subsequent plot, we display for each financial year whether we have data for each financial term or whether payments were received outside of those terms. A red box with a tick indicates we have data, and a white box with a cross indicates a gap. As you can see, there are gaps in survival (1281–82, 1283–84, 1289–90, 1297–98, 1302–03, and 1303–04), as well as years with only partial survival (1284–85, 1294–95, 1304–05).

A plot showing the availability of data per financial year and term. Most years are complete, but some, such as 1281–82, are missing entirely, while others, like 1284–85, are incomplete.

However, even this does not provide a complete picture, since 1280–81 has an incomplete entry for Michaelmas.

Annual and termly totals

Our dataset does not encompass all income received by the Crown. As noted, some years are missing or contain only partial data, and we do not include additional rolls related to specific sources of income, such as taxation. The subsequent plot depicts the total income from our available data for each financial year, not the actual income received by the Crown.

A plot showing the total computed income recorded on the receipt rolls per financial year. Most complete years approach or exceed £5,000.

We can break down the total income into what was received per term for each financial year. The data is presented as a heatmap, with darker colours indicating a greater amount of income received. The term that received the most income varied from year to year: for example, Michaelmas in 1285–86, 1286–87, and 1288–89; Easter in 1282–83, 1291–92, 1292–93, and 1301–02; and Trinity in 1306–07.

A heatmap showing the total income per financial term and year. The term providing the most income differs between years, for example Michaelmas in 1285–86 and 1286–87.
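For readers curious about how such a heatmap can be produced, a short sketch with pandas and seaborn follows; the column names (‘year’, ‘term’, ‘pence’) are illustrative assumptions about the combined CSV rather than the project’s actual schema.

```python
# Term-by-year income heatmap; column names are illustrative assumptions.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("mega.csv")
terms = ["Michaelmas", "Hilary", "Easter", "Trinity"]

income = (df.pivot_table(index="year", columns="term",
                         values="pence", aggfunc="sum")
            .reindex(columns=terms) / 240)          # pence -> pounds

ax = sns.heatmap(income, cmap="Greens")             # darker = more income
ax.set(xlabel="Financial term", ylabel="Financial year")
plt.tight_layout()
plt.show()
```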

The following plot shows the income received per term as a percentage of the year’s total for each financial year.

A plot showing the total received per term as a percentage of the year’s total.

Unlike the 1301–2 roll examined in the first project, Easter was not always the term that generated the highest income. However, similar to the 1301–2 roll, we can see in the following plot that, in terms of the number of proffers received each term as a percentage of the financial year, Michaelmas was often the busiest term.

A plot showing the number of proffers per term as a percentage of the year’s total. Michaelmas is often the busiest term by number of proffers received.

Types of business

The proffers were categorised into five broad categories, namely, ‘farms and rents’, ‘profits of justice’, ‘customs’, ‘profits of escheatry, wardships, and temporalities’, and ‘other revenues’. The following plot shows the total income received per category for each financial year. By far, the greatest source of income is from the ‘profits of justice’ category.

A graph showing the income received per broad category. Profits of justice is by far the largest category.
A plot showing the income received by category as a percentage of the total. Profits of justice accounts for over 50% of the business.

Further work is required here, such as distinguishing the profits of justice into fines and amercements: a fine was a voluntary payment made to the king to gain favour or a privilege, such as obtaining a royal writ, whereas an amercement was a financial penalty imposed by the king or a court.

Sources of income

All the rolls specify the ‘source’ of a proffer, often a place, e.g., ‘Dublin.’ However, it can also refer to a group or other entity, e.g., ‘English debts of the merchants of Lucca’, or a specific cause, e.g., ‘By writ of England.’ The following plot shows the total income received per source in the dataset for the twenty sources that recorded the most proffers. Dublin, by far, accounts for the largest number of individual proffers.

A plot showing income from the top twenty sources. Dublin is the largest source of income, accounting for over £8,000, with Cork returning £7,000.

Conclusion

Like other Digital Humanities projects, this initiative relied heavily on human labour, especially from archivists and historians who translated the original Latin documents into English and encoded those translations into TEI-XML documents. Although we could process machine-readable datasets, extra effort was needed to clean the data and ensure its accuracy. This additional work was understandable, as the VRTI TEI/XML was created to support a digital edition of the receipt rolls rather than for statistical analysis. However, this limited the time available for detailed analysis, with most work focusing on understanding what was present in the datasets, their limitations due to document loss, and providing a general overview of the payments received. Nonetheless, the project demonstrated opportunities to develop and explore further research questions with additional funding and time.


The project was undertaken by Mike Jones of Research IT and Brendan Smith of the Department of History, with the assistance of Elizabeth Biggs of the Virtual Record Treasury of Ireland and Paul Dryburgh of The National Archives, UK.

MagMap – Accurate Magnetic Characteristic Mapping Using Machine Learning

PGR JGI Seed Corn Funding Project Blog 2023/24: Binyu Cui

Introduction:

Magnetic components, such as inductors, play a crucial role in nearly all power electronics applications and are typically known to be the least efficient components, significantly affecting overall system performance and efficiency. Despite extensive research and analysis on the characteristics of magnetic components, a satisfactory first-principle model for their characterization remains elusive due to the nonlinear mechanisms and complex factors such as geometries and fabrication methods. My current research focuses on the characterization and modelling of magnetic core loss, which is essential for power electronics design. This research has practical applications in areas such as the fast charging of electric vehicles and the design of electric motors.

Traditional modelling methods have relied on empirical equations, such as the Steinmetz equation and the Jiles-Atherton hysteresis model, which require parameters to be curve-fitted in advance. Although these methods have been refined over generations (e.g., MSE and iGSE), they still face practical limitations. In contrast, data-driven techniques, such as machine learning with neural networks, have demonstrated advantages in addressing multivariable nonlinear regression problems.
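For readers unfamiliar with these empirical models, the original Steinmetz equation estimates volumetric core loss from frequency and peak flux density as P_v = k·f^α·B^β. A small Python helper is sketched below; the parameter values are placeholders rather than fitted constants for any real material.

```python
# The classic Steinmetz equation as a helper function. k, alpha and beta are
# material-specific, curve-fitted parameters; the defaults below are
# placeholders for illustration, not measured values.
def steinmetz_core_loss(f_hz, b_peak_t, k=1.0, alpha=1.5, beta=2.5):
    """Volumetric core loss for sinusoidal excitation: P_v = k * f^alpha * B^beta."""
    return k * (f_hz ** alpha) * (b_peak_t ** beta)

# Example: loss density at 100 kHz and 0.1 T peak flux density
print(steinmetz_core_loss(100e3, 0.1))
```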

Thanks to the funding and support from the JGI, the interdisciplinary project “MagMap” has been initiated. This project encompasses testing platform modifications, database setup, and neural network development, advancing the characterization and modelling of magnetic core loss.

Outcome

Previously, a large-signal automated testing platform was developed to evaluate magnetic characteristics under various conditions. Fig. 1 shows the layout of the hardware section of the testing platform, and Fig. 2 shows the user interface of the software currently used for the testing. With the help of the JGI, I have updated the automated procedures of the platform, including the point-to-point testing workflow and large-signal inductance characterization. This testing platform is crucial for generating the practical database for the subsequent machine learning process, as its automation has greatly increased the testing efficiency for each operating point (approximately 6–8 s per data point).

Fig. 1. Layout of the automated testing platform.
Fig. 2. User interface of the automated testing platform.

Utilizing the current database, a Long Short-Term Memory (LSTM) model has been developed to predict core loss directly from the input voltage. The model performs better at deducing core loss than traditional empirical models such as the improved generalized Steinmetz equation. A screenshot of the code outcome is shown in Fig. 3, and an example result of the model for one material is shown in Fig. 4. A feedforward neural network has also been tried as a scalar-to-scalar model, deducing core loss directly from a series of input scalars including the magnetic flux density amplitude, frequency, and duty cycle. Despite the accuracy of the training process, there are limitations in the input waveform types it can handle. Convolutional neural networks were also tested before settling on the LSTM as a sequence-to-scalar model; however, the model size was significantly larger than the LSTM with hardly any improvement in accuracy.

Fig. 3. Demo outcome of the LSTM.
Fig. 4. Model performance against the ratio of validation sets used in the training.
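As an illustration of the sequence-to-scalar idea, here is a minimal LSTM in PyTorch that maps a sampled excitation waveform to a single core-loss value; the layer sizes and input shapes are illustrative and not the project’s actual architecture.

```python
# Minimal sequence-to-scalar LSTM sketch in PyTorch (illustrative only).
import torch
import torch.nn as nn

class CoreLossLSTM(nn.Module):
    def __init__(self, n_features=1, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, seq_len, n_features)
        _, (h_n, _) = self.lstm(x)             # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1]).squeeze(-1)  # one core-loss value per sample

model = CoreLossLSTM()
waveforms = torch.randn(8, 1024, 1)            # 8 sampled voltage waveforms
predicted_loss = model(waveforms)              # shape: (8,)
```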

Future Plan:

Core loss measurement and modelling remain a key issue in industrial applications. The difficulty stems from the non-linear relationship between the magnetic flux density and the magnetic field strength, which defines the permeability of the magnetic material. The permeability of ferromagnetic materials is very sensitive to a range of external parameters, including temperature, induced current, frequency, and input waveform type. With an accurate fit for the relationship between magnetic flux density and field strength, not only can the core loss be precisely calculated, but the modelling methods currently used in Ansys and COMSOL can also be improved.

Acknowledgement:

I would like to extend my gratitude to JGI for funding this research and for their unwavering support throughout the project. I am also deeply thankful to Dr. Jun Wang for his continuous support. Additionally, I would also like to express my appreciation to Mr. Yuming Huo for his invaluable advice and assistance with the neural network coding process.

Unveiling Hidden Musical Semantics: Compositionality in Music Ngram Embeddings 

PGR JGI Seed Corn Funding Project Blog 2023/24: Zhijin Guo 

Introduction

The overall aim of this project is to analyse music scores by machine learning. These of course are different from sound recordings of music, since they are symbolic representations of what musicians play. But with encoded versions of these scores (in which the graphical symbols used by musicians are rendered as categorical data) we have the chance to turn these instructions into various sequences of pitches, harmonies, rhythms, and so on.

What were the aims of the seed corn project? 

CRIM (Citations: The Renaissance Imitation Mass) concerns a special genre of works from sixteenth-century Europe in which a composer took some pre-existing piece and adapted the various melodies and harmonies in it to create a new but related composition. More specifically, the CRIM Project is concerned with polyphonic music, in which several independent lines are combined in contrapuntal combinations. As in the case of any given style of music, the patterns that composers create follow certain rules: they write using stereotypical melodic and rhythmic patterns. And they combine these tunes (‘soggetti’, from the Italian word for ‘subject’ or ‘theme’) in stereotypical ways. So, we have the dimensions of melody (line), rhythm (time), and harmony (what we’d get if we slice through the music at each instant).

Figure 1. An illustration of a music graph: nodes are music ngrams and edges are different relations between them. Image generated by DALL·E.

We might thus ask the following kinds of questions about music: 

  • Starting from a given composition, what would be its nearest neighbour, based on any given set of patterns we might choose to represent? A machine would of course not know anything about the composer, genre, or borrowing involved in those pieces, but it would be revealing to compare what a machine might tell us about such ‘neighbours’ in light of what a human might know about them.
  • What communities of pieces can we identify in a given corpus? That is, if we attempt to classify or group works in some way based on shared features, what kinds of communities emerge? Are these communities related to style? Genre? Composer? Borrowing?
  • In contrast, if we take the various kinds of soggetti (or other basic ‘words’) as our starting point, what can we learn about their context?  What soggetti happen before and after them?  At the same time as them?  What soggetti are most closely related to them? And through this what can we say about the ways each kind of pattern is used? 

Interval as Vectors (Music Ngrams) 

How can we model these soggetti? Of course they are just sequences of pitches and durations. But since musicians move these melodies around, it will not work simply to look for strings of pitches (since as listeners we can recognize that G-A-B sounds exactly the same as C-D-E). What we need instead is to model these as distances between notes. Musicians call these ‘intervals’, and you could think of them like musical vectors. They have direction (up/down) and they have some length (X steps along the scale).

Here is an example of how we can use our CRIM Intervals tools (a Python/Pandas library) to harvest this kind of information from XML encodings of our scores.  There is more to it than this, but the basic points are clear:  the distances in the score are translated into a series of distances in a table.  Each column represents the motions in one voice.  Each row represents successive time intervals in the piece (1.0 = one quarter note). 

Figure 2. An example of an ngram, [-3, 3, 2, -2], with intervals as vectors.
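To make the idea concrete, here is a toy sketch (not the CRIM Intervals library itself) showing that two melodies at different pitch levels reduce to the same interval vector. It counts diatonic scale steps only, within a single octave, using the musicians’ convention in which a step to the adjacent note is a ‘second’ (2).

```python
# Toy illustration: melodies as interval vectors (diatonic steps, one octave,
# no accidentals). Not the CRIM Intervals library.
STEPS = {"C": 1, "D": 2, "E": 3, "F": 4, "G": 5, "A": 6, "B": 7}

def to_intervals(notes):
    """Signed distances between successive notes, counted as musicians do
    (repeated note = 1/unison, adjacent step = 2/second, and so on)."""
    degrees = [STEPS[n] for n in notes]
    return [(b - a) + (1 if b >= a else -1)
            for a, b in zip(degrees, degrees[1:])]

print(to_intervals(["G", "A", "B"]))   # [2, 2]
print(to_intervals(["C", "D", "E"]))   # [2, 2] -- the same melodic shape
```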

Link Prediction 

We are interested in predicting unobserved or missing relations between pairs of ngrams in our musical graph. Given two ngrams (nodes in the graph), the goal is to ascertain the type and likelihood of a potential relationship (edge) between them, be it sequential, vertical, or based on thematic similarity. 

  • Sequential relations link tuples (ngrams) that occur near each other in time. This is the kind of ‘context’ a Large Language Model computes; the model then surfaces the semantic information that is latent in the data.
  • Vertical relations link tuples that happen at the same time: another kind of context.
  • Thematic relations are based on some measure of similarity.

Upon training, the model’s performance is evaluated on a held-out test set, providing metrics such as precision, recall, and F1-score for each type of relationship. The model achieved a prediction accuracy of 78%. 

Beyond its predictive capabilities, the model also generates embeddings for each ngram. These embeddings, which are high-dimensional vectors encapsulating the essence of each ngram in the context of the entire graph, can serve as invaluable tools for further musical analysis. 
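One simple use of these embeddings, sketched below under the assumption that they are available as a dictionary of NumPy vectors, is nearest-neighbour search by cosine similarity, which speaks directly to the ‘what is most closely related?’ questions above.

```python
# Nearest-neighbour lookup over ngram embeddings by cosine similarity.
# The embeddings dictionary is assumed to come from the trained model.
import numpy as np

def nearest_ngrams(query, embeddings, k=5):
    names = list(embeddings)
    matrix = np.stack([embeddings[n] for n in names])
    q = embeddings[query]
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    ranked = sorted(zip(names, sims), key=lambda p: -p[1])
    return [(n, float(s)) for n, s in ranked if n != query][:k]

# Example with a hypothetical key: nearest_ngrams("[-3, 3, 2, -2]", embeddings)
```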

From aerosol particles to network visualisation: Data science support enhancing research at the University of Bristol

AskJGI Example Queries from Faculty of Science and Engineering

All University of Bristol researchers (from PhD students upwards) are entitled to a day of free data science support from the Ask-JGI helpdesk. Just email ask-jgi@bristol.ac.uk with your query and one of our team will get back to you to see how we can support you. You can see more about how the JGI can support data science projects for University of Bristol-based researchers on our website.

We support queries from researchers across all faculties, and in this blog we’ll tell you about some of the researchers we’ve supported from the Faculty of Science and Engineering here at the University of Bristol.

Aerosol particles

A researcher approached us with Python code they’d written for simulating radioactive aerosol particle dynamics in a laminar flow. For particles smaller than 10 nanometers, they observed unexplained error “spikes” when comparing numerical to analytical results, suggesting that numerical precision errors were accumulating due to certain forces being orders of magnitude smaller than others for the tiny particles.

We provided documentation and advice for implementing higher-precision arithmetic using Python’s ‘mpmath’ library so that the researcher could use their domain knowledge to increase precision in critical calculation areas, balancing computational cost with simulation accuracy. We also wrote code to normalise the magnitude of different forces to similar scales to prevent smaller values from being lost in the calculation.
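A minimal sketch of the precision issue and the mpmath remedy follows, with placeholder force magnitudes rather than values from the researcher’s simulation:

```python
# In double precision, a force ~18 orders of magnitude smaller than another
# vanishes entirely when the two are summed; mpmath retains it.
from mpmath import mp, mpf

print(1.0e-12 + 3.0e-30 == 1.0e-12)     # True: the smaller term is lost

mp.dps = 50                              # work with 50 significant digits
net = mpf("1.0e-12") + mpf("3.0e-30")
print(net)                               # the small contribution is preserved
```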

This was a great query to work on. Although Ask-JGI didn’t have the same domain knowledge for understanding the physics of the simulation, the researcher worked closely with us to help find a solution. They provided clear and well documented code, understood the likely cause of their problem and identified the solutions that we explored. This work highlights how computational limitations can impact the simulation of physical systems, and demonstrates the value of collaborative problem-solving between domain specialists and data scientists.

Laminar flow (a) and turbulent flow (b) in a closed pipe. Image credit: SimScale

Training/course development

The JGI offers training in programming, machine learning and software engineering. We have some standard training courses that we offer, as well as new courses in development and shorter “lunch and learn” sessions on various topics.

Queries have come in to both Ask-JGI and the JGI training mailbox (jgi-training@bristol.ac.uk) asking follow-up questions from training courses which people have attended. Additionally, requests have come through for further training to be developed in specific areas (e.g. natural language processing, advanced data visualisation or LLM usage). The JGI training mailbox is the place to go, but Ask-JGI will happily redirect you!

Introduction to Python training session for Bristol Data Week 2025.

Network visualization

Recently Ask-JGI received a query from a PhD researcher in the School of Geographical Sciences. The Ask-JGI team offered support on exploring visualisation options for the data provided, producing example network visualisations of the similarity between the geographical distributions of UK industries. A documented code solution was also provided so that the graphs can be further customised and extended. At Ask-JGI, we are happy to help researchers who are already equipped with substantive domain knowledge and coding skills to complete small modules of their research output pipeline.

Network visualisation of similarity of UK industry geographical distribution.
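The sketch below shows the general shape of such a visualisation using networkx; the industries and similarity scores are placeholders, not the researcher’s data.

```python
# Illustrative network visualisation with networkx (placeholder data).
import networkx as nx
import matplotlib.pyplot as plt

edges = [                                   # (industry A, industry B, similarity)
    ("Manufacturing", "Construction", 0.8),
    ("Manufacturing", "Retail", 0.4),
    ("Finance", "Retail", 0.6),
]

G = nx.Graph()
G.add_weighted_edges_from(edges)

pos = nx.spring_layout(G, seed=42)          # force-directed layout
widths = [G[u][v]["weight"] * 4 for u, v in G.edges]
nx.draw_networkx(G, pos, width=widths, node_color="lightsteelblue")
plt.axis("off")
plt.show()
```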

Spin Network Optimisation

The aim of this query was to use parallel processing to accelerate the optimisation of a spin network: a network of nodes coupled together with given strengths, designed to transfer information (spin) from one node to another. The workflow involved a genetic algorithm (written in Fortran and executed via a bash script) and a Python-based gradient ascent algorithm.

Initial efforts focused on parallelizing the gradient ascent step. However, significant challenges arose due to the interaction between the parallelized Python code and the sequential execution of the Fortran-based Spinnet script.
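As a rough illustration of the intended parallelisation, the sketch below runs independent gradient-ascent starts across worker processes with Python’s multiprocessing module; gradient_ascent is a placeholder for the project’s real optimisation step, not code we had in this simplified form.

```python
# Illustrative parallelisation of independent gradient-ascent runs.
# `gradient_ascent` is a placeholder, not the project's actual code.
from multiprocessing import Pool

def gradient_ascent(start):
    """Climb from one starting point; return (score, parameters)."""
    params, score = list(start), 0.0
    # ... evaluate the spin-transfer fidelity, step along the gradient ...
    return score, params

if __name__ == "__main__":
    starts = [[0.1 * i] * 5 for i in range(8)]   # 8 independent starting points
    with Pool(processes=4) as pool:
        results = pool.map(gradient_ascent, starts)
    best_score, best_params = max(results, key=lambda r: r[0])
```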

Code refactoring was undertaken to improve readability and introduce minor speed enhancements by splitting the Python script into multiple files and grouping similar function calls.

Given the complexity and time investment associated with these code modifications, it was strongly recommended to explore the use of High-Performance Computing (HPC) facilities. Running the current code on an HPC system provided the desired speed improvements without requiring any code changes, as HPC is designed for computationally intensive tasks like this.

Grant development

The Ask-JGI helpdesk is the main place researchers get in contact with the JGI with regards to getting help with grant applications. The JGI can support with grant idea development, giving letters of support for applications and costing in JGI data scientists or research software engineers to support the workload for potential projects. You can read more about how the JGI team can support grant development on the JGI website!

Using ‘The Cloud’ to enhance UoB laboratory data security, storage, sharing, and management

JGI Seed Corn Funding Project Blog 2023/24: Peter Martin, Chris Jones & Duncan Baldwin

Introduction

As a world-leading research-intensive institution, the University of Bristol houses a multi-million-pound array of cutting-edge analytical equipment of all types, ages, functions, and sensitivities, distributed across its Schools, Faculties, Research Centres and Groups, as well as in dozens of individual labs. However, as more and more data are captured, how can they be appropriately managed to meet the needs of researchers and funders alike?

What were the aims of the seed corn project? 

When an instrument is purchased, the associated computing, data storage/resilience, and post-capture analysis are seldom, if ever, considered beyond the standard Data Management Plans.

Before this project, there existed no centralised or officially endorsed mechanism at UoB, supported by IT Services, to manage long-term instrument data storage and internal/external access to this resource, with every group, lab, and facility individually managing their own data retention, access, archiving, and security policies. This is not just a UoB challenge, but one that is endemic across the entire research sector. As the value of data is now becoming universally recognised, not just in academia but across society, the challenge is more pressing than ever, and an institution-wide solution is critically required, one that would be readily exportable to other universities and research organisations. At its core, this Seed Corn project sought to develop a ‘pipeline’ through which research data could be: (1) securely stored within a unified online environment/data centre in perpetuity, and (2) accessed via an intuitive, streamlined and equally secure online ‘front-end’, such as Globus, akin to how OneDrive and Google Drive seamlessly facilitate document sharing.

What was achieved? 

The Interface Analysis Centre (IAC), a University Research Centre in the School of Physics, currently operates a large and ever-growing suite of surface and materials science equipment with considerable numbers of both internal (university-wide) and external (industry and commercial) users. Over the past six months, working with leading solution architects, network specialists, and security experts at Amazon Web Services (AWS), the IAC/IT Services team have successfully developed a scalable data warehousing system deployed within an autonomous segment of the UoB’s network, eliminating the risk of single-copy data stored locally and the need to move it by portable hard drive or email it across the network. In addition to efficiently “getting the data out” from within the UoB network, using native credential management within Microsoft Azure/AWS, the team have developed a web-based front-end akin to Google Drive/OneDrive where specific experimental folders can be securely shared with specific users, compliant with industry and InfoSec standards. The proof of the pudding has been the positive feedback received from external users visiting the IAC, all of whom have been able to access their experiment data immediately following the conclusion of their work without the need to copy gigabytes or terabytes of data onto external hard drives!
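As a flavour of how cloud object storage supports this kind of per-user, time-limited sharing, here is a generic boto3 sketch using presigned URLs; the bucket and object names are placeholders, and this illustrates a common pattern rather than the specific IAC/AWS implementation.

```python
# Generic pattern: a time-limited presigned URL for a single object in S3.
# Bucket and key are placeholders; not the IAC system's actual configuration.
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-instrument-data",
            "Key": "visitor-42/experiment-2024-06/run1.zip"},
    ExpiresIn=7 * 24 * 3600,              # link valid for one week
)
print(url)                                # share this link with the external user
```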

Future plans for the project 

The success of the project has not only highlighted how researchers and various strands within UoB IT Services can together develop bespoke systems utilising both internal and external capabilities, but also how even a small amount of Seed Corn funding such as this can deliver the start of something powerful and exciting. Following the delivery of a robust ‘beta’ solution between the Interface Analysis Centre (IAC) labs and AWS servers, it is envisaged that the roll-out and expansion of this externally facing research storage gateway will continue, with the support of IT Services, to other centres and instruments. Given the large amount of commercial and external work performed across UoB, such a platform will hopefully enable and underpin data management across the University going forward, adopting a scalable and proven cloud-based approach.


Contact details and links

Dr Peter Martin & Dr Chris Jones (Physics) peter.martin@bristol.ac.uk and cj0810@bristol.ac.uk 

Dr Duncan Baldwin (IT Services) d.j.baldwin@bristol.ac.uk