Visualising group energy

Blog written by Hen Wilkinson, School for Policy Studies at the University of Bristol.

The project was funded by the annual Jean Golding Institute seed corn funding scheme. It emerged from Hen’s ESRC funded PhD research, supported by the SWDTC and School for Policy Studies.

Collaborative working is central to tackling the world’s complex problems but is not easy to sustain

Power dynamics and inequalities play out in all directions, in the relationships between individuals just as much as between organisations. By making ‘hot spots’ visible in group interactions it becomes easier to acknowledge and work with points of conflict that will inevitably arise and to deal with them in a creative and sustainable manner.

While researching ‘the space between’ individuals and organisations, qualitative researcher Hen Wilkinson and data scientist Bobby Stuijfzand developed a new methodology using computer software to visualize energy shifts in group interactions. While listening to audio recordings of groups working together on a task, they were struck by the impact of nonverbal elements on the group interaction: dynamics between participants were influenced just as much by the nonverbal content of laughs, silences, sighs, asides and interruptions as by the words spoken.

Visualizing shifts of energy – a new approach in qualitative research

Following this observation, the ambition to visualize these tangible shifts of ‘energy’ in the groups took hold. To date, little attention has been paid to generating computer visuals in qualitative research, so creating a rigorous, systematic visualization of energy shifts was lengthy, challenging and exciting. For more detail on the rationale and methodology we developed over the course of two years and to view the final interactive versions of the design, see Visualizing energy shifts in group interactions. Among the many challenges we faced were finding and adapting an instrument to use with small and interactive qualitative datasets; establishing interrater reliability; identifying what was meant by ‘energy’; deciding which nonverbal elements to visualize; and how to present the resulting data.

On the website we present four five-minute visualized extracts of group interaction, each drawn from a different group discussion – two held in the UK and two in the Netherlands. Each extract consists of 2.5 minutes of interaction either side of a central mid-point clash or strong challenge in the group. The five minutes of data were then scored by a team of raters listening independently to audio clips of the extract divided into meaning units, which are shown as ‘topic shifts’ on the visualizations. In this way, the qualitative data were converted into numerical values for the main variables – levels of mood and engagement as they shifted over a set period of time.
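The scoring step above can be sketched in code. This is a minimal illustration assuming invented meaning units, raters and score ranges – the study’s actual coding scheme may differ:

```python
# Minimal sketch: converting independent rater scores into per-unit values.
# The unit names, variable names and -2..+2 score range are illustrative
# assumptions, not the study's actual instrument.
from statistics import mean

# Each rater scores every meaning unit for mood and engagement.
rater_scores = {
    "rater_1": {"unit_1": {"mood": 1, "engagement": 2},
                "unit_2": {"mood": -1, "engagement": 0}},
    "rater_2": {"unit_1": {"mood": 2, "engagement": 2},
                "unit_2": {"mood": -2, "engagement": 1}},
}

def aggregate(scores):
    """Average each variable across raters, per meaning unit."""
    units = next(iter(scores.values())).keys()
    return {
        unit: {var: mean(r[unit][var] for r in scores.values())
               for var in ("mood", "engagement")}
        for unit in units
    }

# unit_1 averages to mood 1.5 and engagement 2 across the two raters.
print(aggregate(rater_scores))
```

The averaged values per meaning unit are then what gets plotted over time on the visualization.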

The support of seed corn funding from the Jean Golding Institute allowed us to work on the presentation of the visualizations, from building an interactive website that shows how the numerical data were derived, to refining the aesthetics of the design to encourage maximum engagement with the graphs and clarity of understanding in the viewer. Initial images were generated using ggplot2, a data visualization package for the statistical programming language R – see Initial visualizations.

Initial visualizations

Following the generation of these first images, we explored the significance of data presentation through extensive design research, working with designer Derek Edwards. This drew on multiple sources in a visual exploration of accessibility, the impact of colour, multi-layered research, and the use of pattern, texture, animation and shape in displaying qualitative data. Slides from the design research show some of the various considerations we were reflecting on:

Design considerations

The initial images generated with ‘R’ were then refined using D3.js, a powerful and well-regarded software library used extensively to create interactive data visualizations on the web. Refining the aesthetics of the design was important to the project, both in terms of encouraging maximum engagement with the graphs and in terms of data clarity. Each graph contains multiple layers of information, from group participant engagement levels to the overall mood of the group, points of topic shift in the group discussions and dropdown text boxes of the verbal interactions between participants at any topic shift point.

The example below – visualizing a strikingly bad-tempered interaction – uses the final design we settled on (see Visualizing energy shifts in group interactions) once all considerations had been taken into account. The ‘energy line’ running through the centre of the graph is a composite of the engagement and mood results and is cut across by a second nonverbal indicator of group dynamics – incidents of laughter, illustrating both their use and their function. As outlined in the methodology sketch, we developed a categorisation for the types of laughter heard in this study, ranging from cohesive (green) through self-focused (yellow) to divisive (red). In this group, laughter can be seen to anticipate the shifts in mood from positive (green) to negative (red) and back again.

This project has sparked considerable interest, both in terms of its early-day implications for qualitative and mixed methods research and in terms of its potential as an applied tool for teams, organisations and collaborations to use. Further funding in 2019 through an Impact Award has enabled the interdisciplinary team working on the project to embark on further developments and connections.

We are fully aware of the work-in-progress nature of this approach and are very interested to receive feedback, comments and ideas for future applications from anyone out there! If you would like more information on this visualization project or have a comment to share, please contact the lead researcher, Hen Wilkinson, via hen.wilkinson@bristol.ac.uk.

The Jean Golding Institute seed corn funding scheme

The JGI offers funding to a handful of small pilot projects every year through our seed corn funding scheme – our next round of funding will be launched in the Autumn of 2019. Find out more about the funding and the projects we have supported.

 

Reusing qualitative datasets to understand shifts in HIV prevention 1997-2013

Photo (copyright D. Kingstone): This image of Catherine and Ibi in conversation represents the varied layers of clarity/blurriness that were a constituent part of the anonymising process. Decisions about the removal of potentially identifying data were talked through by both researchers until they reached clarity about the best way forward.

A conversation between Dr Catherine Dodds and Dr Ibidun Fakoya

The project was funded by the annual Jean Golding Institute seed corn funding scheme.

Qualitative data re-use and open archiving

This project aimed to demonstrate the considerable value of qualitative data re-use and open archiving. Our team undertook in-depth anonymisation of two existing qualitative HIV datasets by applying and refining an anonymisation protocol. The two qualitative datasets anonymised were:

  • Relative Safety: contexts of risk for gay men with HIV (2008/09) 42 transcripts, Sigma Research
  • Plus One: HIV serodiscordant relationships among Black African people in England (2010/11) 60 transcripts, Sigma Research

The key aim of the project was to ready these materials by removing personally identifying information, so that the de-identified data can then be deposited with the UK Data Archive. We provided metadata (including participant recruitment materials, data collection templates, related research outputs, thematic lists used in coding and details of the anonymisation) for each submission.

In this blog post, Dr Catherine Dodds and Dr Ibidun Fakoya from the School of Public Policy converse about the ethical and practical considerations of archiving qualitative data. Catherine was involved in the original data collection and was the PI for this project; Ibidun undertook the bulk of the anonymising work.

Ibidun: What motivated you to deposit these datasets with the UK Data Archive?

Catherine: A few years back I was awarded funding from the Wellcome Trust to examine the feasibility of re-using and archiving multiple qualitative datasets relating to HIV in the UK. Working with a coalition of other UK researchers on that project, we learned that while the UK Data Archive makes the process of deposit incredibly straightforward, what takes much more time is the decision-making, data transfer and preparation of data for deposit. We were looking at depositing projects that go quite far back in time, before the notion of Open Data was a widespread concept, so there was a lot to be considered in terms of readying these transcripts for deposit in a way that is useful, ethical and responsible. A key outcome of that work was the development of an anonymisation protocol to assist with the practical and ethical decision-making that is involved when readying such data for sharing in an archive.

Ibidun: How did you decide which datasets to deposit?

Catherine: During my 16-year career with Sigma Research (latterly at LSHTM), I was involved with and led on a considerable array of qualitative studies. I selected the data from Plus One and Relative Safety II because both were undertaken just over ten years ago, at a time when it was becoming clearer that HIV pharmaceuticals were being positioned as HIV prevention technologies. Because this is an area of particular interest for me, I wanted to personally revisit these two studies first in order to re-use them, while also anonymising them in readiness for archiving.

Ibidun: Do you think the ethical considerations for depositing data are different for qualitative and quantitative data? If so, how?

Catherine: They are absolutely different, because qualitative data tend to focus on the experiences, perspectives and human stories of participants in ways that are rich and detailed. This is one of the real strengths of this type of data. This means we need to anonymise in a way that goes beyond just identifying and removing personal names and names of organisations or places. Instead, we need to consider whether the overall narrative a person offers (as a collection of life experiences) could itself identify an individual who should be allowed to remain anonymous. This requires a highly skilled approach to anonymisation. If we were anonymising quantitative datasets, the risk of potential identification would probably be much lower, and anonymisation might just involve removing a few fields from a database.

Catherine: You weren’t involved in the original data collection. What contextual detail helped you to get started when anonymising these materials in readiness for archiving?

Ibidun: It helped that I am familiar with HIV research in the UK. I started out working in HIV and sexual health back in 2001, so I was aware of the findings of these studies before I started the anonymisation. Nevertheless, it was useful to read through the original study materials such as the research protocols, topic guides, interview schedules and fieldnotes. Speaking to the original investigators also provided insights into the research landscape at that time. By understanding the original aims, objectives and findings of the studies, I was able to focus on the anonymisation alone rather than become distracted by the themes emerging from the data.

Catherine: How did you tackle the task of reading through so much text and remaining alert to the requirements of anonymisation?

Ibidun: Anonymisation takes a lot of concentration. You need to remain focussed on the transcripts and read every word to ensure that you do not miss any identifiers. I knew that I would struggle to remain alert if I tried to read the transcripts on my computer because I am used to skim reading articles on screen. Initially, I had thought about printing out all the transcripts, but I am conscious of wasting paper. Instead, I made use of the “Learning Tools” in Microsoft Word. I followed these steps to improve my focus and comprehension and ensure I read every word:

  1. Go to View > Learning Tools
  2. Select Column Width to compress sentence line length to make the page narrower
  3. Select Read Aloud, to hear the document as each word is highlighted.
  4. Increase the Read Aloud speed so you are reading approximately 300 words per minute.

It takes a little while to get used to the Read Aloud function, particularly at speed, but ultimately, I found this method to be the most efficient way to remain focused and quickly read through the large volume of text.

Catherine: How did you find the anonymisation protocol that was devised as a support tool?

Ibidun: The protocol was useful for getting started with the task, particularly for straightforward guidance on how to deal with direct identifiers and geographic locations. For more complicated anonymisation (e.g. “Situations, activities and contexts”) the guiding principles set out in the protocol provided only a starting point, meaning we needed to identify cases for team discussion where the potential for identification was high.

Catherine: What advice about anonymising qualitative datasets would you give others who want to archive similar materials for re-use?

Ibidun: My top three tips are:

  1. Keep a research diary like you would do for any other study. Keep note of your reflections and ideas as these may come in handy later.
  2. Work in a team of at least two so you can discuss any ambiguous decisions about de-identification.
  3. Your first duty is to protect the anonymity of the interviewee. If you cannot do that without destroying the integrity of the data (because you have to redact too much material) then err on the side of caution and keep the transcript out of the archive. When in doubt, do not deposit.

Catherine: Is there anything that surprised you in undertaking this work?

Ibidun: I was surprised by the emotional impact of reading the transcripts. Many of the interviewees recounted traumatic events or spoke about painful personal relationships. At times I found myself angry about injustices interviewees had faced, especially those who had been subject to criminal investigations for the reckless transmission of HIV. Anonymising such sensitive information therefore carries the same ethical considerations for researchers as undertaking original qualitative data collection. Researchers undertaking anonymisation also need to pause and reflect on the effects of engaging with emotionally charged narratives and be able to discuss these with colleagues.

Catherine: What value is there to making these materials available to other researchers through the UK Data Archive?

Ibidun: I hope researchers from outside the field of sexual health and HIV research can use these narratives in novel ways. It is possible that themes unrelated to the original research might emerge from the data for other researchers. For example, a linguist might want to examine changes in speech patterns among gay men in the UK. A sociologist might examine the datasets for the impact of unemployment on black African migrants in England. There’s a lot of potential in re-using these datasets, perhaps in combination with other data from the UK Data Archive.

Ibidun: I can throw that question back at you, how useful are qualitative datasets for other researchers?

Catherine: I suspect that will be up to them to decide. We have had interest from PhD students and other colleagues who want to use these data to interrogate specific historical aspects of HIV in the UK. It is a shame that we have not been able to undertake anonymisation and deposit more swiftly, but our learning is that to do this work retrospectively takes a great deal of re-familiarisation and case-by-case decision making. Archivists at the UKDA have been very excited about the prospect of having a themed set of qualitative data on social aspects of HIV in their collection and are convinced that this will be of use for researchers focussed on HIV. Social historians, LGBT and queer studies specialists, anthropologists and others might also have use for the data.

Ibidun: What are the methodological considerations when re-using qualitative data?

Catherine: I have just written a methods article on this subject, but in brief:

  1. It is essential that the person re-using the data becomes as familiar as possible with the original context and goals of the project from which the data emerges. Hopefully there will be metadata available to support them in this, and they also should seek to discuss the project with those originally involved (where possible). In my own case, even though I was one of the original data collectors, I was amazed by just how much I had already forgotten (or re-structured in my own mind’s eye).
  2. It is instructive to attend to Hammersley’s (2010) reflections on re-use, which encourage us to think about the given and constructed nature of the data we encounter in these endeavours. For instance, my colleague Peter Keogh and I have approached re-use purposively; we were interested in the theme of biomedicalization and so were selective about which particular projects and transcripts we chose to analyse. We wanted to capture both the mundane and the challenging aspects of life in close proximity to HIV. At the same time, some of the given elements of these data emerged from the shadows in ways that took us off guard and reminded us of what it was to work in different moments and places of an unfolding epidemic. Furthermore, as Irwin et al. (2012) have also argued, bringing data and researchers together across datasets can afford an opportunity for listening out for silences, which enables us to open up new interpretive avenues.


Metastable impressions

Blog written by Rob Arbon, Alex Jones, George Holloway and Pete Bennett

This project was funded by the annual Jean Golding Institute seed corn funding scheme.

The JGI funded project “Metastable impressions” sought to bring together statistical modelling, sound engineering, classical composition and deep learning to create an audio-visual artwork about the dynamics of proteins and their representations. The project grew out of work by PhD candidates Alex Jones and Rob Arbon (supervised by Dr Dave Glowacki) called ‘Sonifying Stochastic Walks on Biomolecular Energy Landscapes’. See also the related blog post.

The project team comprised Dr Pete Bennett (project supervisor), Alex Jones (sonification), Rob Arbon (animations and statistical modelling) and Dr George Holloway (composition). We shall first give a more detailed overview of the project and then hear from Rob, Alex, George and Pete about their specific contributions and thoughts on the project.

All of the publicly available materials are on our repository at the Open Science Framework.

Project overview

The core of this project is sonification: the process of turning information into sound. Take, for example, the popularity of the search term ‘Proteins’ on Google:

Here we map the popularity of the search term to the vertical position of a blue dot, and the time of each observation to its horizontal position. In this way we make a visual display of popularity over time. However, we could instead map the popularity to, say, the pitch of a piano note and play the notes in the order they were observed. This would result in an audio display (the other name for a sonification) of the information – in this case, the sound of a piano rising and falling in pitch over time.
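The mapping just described can be sketched in a few lines. This is an illustrative example only – the value range, frequency range and linear mapping are assumptions, not the project’s actual sonification design:

```python
# Minimal sonification sketch: map a data series to pitches.
# The linear value->frequency mapping and the ranges are illustrative.

def value_to_freq(value, v_min=0, v_max=100, f_min=220.0, f_max=880.0):
    """Linearly map a data value to a frequency in Hz (A3..A5 here)."""
    t = (value - v_min) / (v_max - v_min)
    return f_min + t * (f_max - f_min)

popularity = [10, 40, 85, 60, 25]  # e.g. weekly search interest, 0-100
pitches = [value_to_freq(v) for v in popularity]

# Played in order, the pitches rise and fall with the data:
print([round(f, 1) for f in pitches])  # [286.0, 484.0, 781.0, 616.0, 385.0]
```

Synthesising and playing those frequencies in sequence would give the audio display; plotting them against their index would give the visual one.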

The information that we wanted to sonify was a statistical model of the protein SETD8. The data for this protein came from the lab of John Chodera and was produced by Rafal Wiewiora. You can read about the amazing effort to produce this dataset in Computational ‘Hive Mind’ helps scientists solve an enzyme’s cryptic movements.

One criticism of sonification is that it is very hard to listen to for long periods of time, so we decided to recruit a classically trained composer to help us design a sonification that would be pleasing to listen to. To this end, the composer George Holloway composed the piece Metastable for string quartet, which was performed by the Ligeti String Quartet at a performance event in May. The whole event (unfortunately including the ‘click track’ intended for the quartet only) was streamed on Twitter.

In addition to the sonification and string quartet we wanted to produce novel visual representations of the protein. To do this we used a technique called style transfer to impart style from paintings to more traditional representations of the protein. One example can be found on the flyer for the performance:

The right-hand image is of a painting by the French artist Boucher and in the middle is a representation of the protein ‘in the style of’ Boucher.

The string quartet and visual representations of SETD8 were linked through a timeline of scientific thought loosely related to proteins, chemistry and statistical modelling. We highlighted five different scientists, corresponding to the five movements of Metastable, with contemporaneous (both geographically and temporally) artworks and composers. The five composers provided the musical style of each movement, while the five artworks provided the artistic style of the protein representations. Our final timeline was:

I. Medieval period, England

The scientist we chose was Roger Bacon for his work on developing the ‘scientific method’. The composer was Godric of Finchale and the artwork was an illumination from the Queen Mary Psalter by an unknown illuminator.

II. 17th Century, England

The scientist we chose was Robert Hooke for his investigation of the microscopic world which was beautifully illustrated in his book Micrographia. The artwork we chose was an image of a flea from this work and the composer was Henry Purcell.

III. 18th Century, France

Our understanding of protein dynamics is, in part, statistical, so for that reason we chose a pioneer of statistics, Pierre-Simon Laplace, as our scientist. The visual source material was Jupiter in the Guise of Diana and the Nymph Callisto by François Boucher, and the composer was the prolific opera composer Jean-Philippe Rameau.

IV. 19th Century, Russia

Protein motion has the property of being ‘memoryless’, which means that its future motion depends only on its current state, not on how it got there. Andrey Markov was a mathematician who studied this type of process. Russian contemporaries of Markov were the composer Modest Mussorgsky and the artist Ilya Repin, whose work Sadko gave us the fourth visual style.

V. 20th century, USA

Many of the advances in our understanding of proteins came from the UK, such as the work of Kendrew and Perutz (first protein structure determination by X-ray crystallography) and Dorothy Hodgkin (structures of vitamin B12 and insulin). However, we wanted to avoid more UK-based scientists, so we instead chose Berni Alder and Thomas Everett Wainwright for their work on simulating molecules, and Frances Arnold, whose Nobel prize-winning work has given us novel ways of creating enzymes (a type of protein). The composer we chose was Morton Feldman, whose work incorporated uncertainty and thus seemed a natural complement to the statistical nature of protein motion. The artwork we chose is Gothic by Jackson Pollock, another prominent artist on the New York art scene along with Feldman.

The artworks for each movement can be seen below:

Rob Arbon, animations and statistical modelling

My role in this project was twofold:  

  1. to create the statistical model of the protein dynamics from the data provided by the Chodera lab
  2. to produce animations of the protein in the style of the artists from the timeline.

The statistical tool I used was a Hidden Markov Model (HMM) which takes multiple time-series of the protein and classifies them as belonging to a small number of distinct states (in our case five) called Metastable States. Protein motion can then be thought of as hopping from metastable state to metastable state. We took five representative time-series and used how the protein hopped from state to state as a structure for each movement of the string quartet.  
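The idea of hopping between metastable states can be illustrated with a toy Markov chain. The 5×5 transition matrix below is invented for illustration; the real HMM was fitted to the SETD8 simulation data:

```python
# Sketch of "hopping" between metastable states under a Markov model.
# The transition matrix is illustrative, not the fitted HMM.
import random

N_STATES = 5
# Row i gives the probability of moving from state i to each state.
# Large diagonal entries make states "metastable": the chain lingers
# in a state for a while, then occasionally hops to another.
T = [[0.90 if i == j else 0.025 for j in range(N_STATES)]
     for i in range(N_STATES)]

def sample_trajectory(T, start=0, steps=20, seed=42):
    """Sample a state sequence by repeatedly drawing the next state."""
    rng = random.Random(seed)
    state, traj = start, [start]
    for _ in range(steps):
        state = rng.choices(range(len(T)), weights=T[state])[0]
        traj.append(state)
    return traj

traj = sample_trajectory(T)
print(traj)  # mostly runs of the same state, with occasional hops
```

A sequence of hops like this, taken from representative trajectories, is what provided the structure for each movement of the quartet.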

In our previous work, Alex Jones and I had already worked out how to sonify the information contained in the HMM, and which information to use, so we were able to apply that framework to this system.

To create the animations I used a technique called style transfer. This uses deep convolutional neural networks (CNNs) of the kind used to classify images. We can ask ourselves (mathematically, of course) what a CNN considers the ‘style’ of an image. Below, I asked Google’s Inception CNN what it thought the style of the Boucher image was.

On the left-hand side is the original image and on the right-hand side is the CNN’s conception of ‘style’ at a particular point in the classification process. (Other points in the network pick up different types of style, not shown here.) The style transfer algorithm takes this conception of style and blends it with an arbitrary ‘content’ image. In this case our content images are the traditional representations of proteins, as the image below shows.

The right-hand image here is a content image – a traditional representation of a protein showing only its surface atoms. The left-hand image is a blending of the pure Boucher style above with the content image. I did this for 10,000 still images of the protein and used these stills to create the animations accompanying the string quartet.
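The blending of style and content can be summarised by the objective that style transfer optimises. The sketch below shows only the weighted trade-off at its core; the weights and loss values are illustrative, and a real implementation (e.g. in a deep learning framework) would compute these losses from CNN feature maps:

```python
# Conceptual sketch of the style-transfer objective. The generated image
# is optimised to minimise a weighted sum of two losses: how far it is
# from the content image (the protein rendering) and how far it is from
# the style image (the painting). Values and weights here are invented.

def total_loss(content_loss, style_loss, alpha=1.0, beta=1000.0):
    """Weighted sum balancing content fidelity against style fidelity."""
    return alpha * content_loss + beta * style_loss

# A larger beta pushes the optimised image toward the painting's style:
print(round(total_loss(content_loss=0.5, style_loss=0.002), 3))             # 2.5
print(round(total_loss(content_loss=0.5, style_loss=0.002, beta=5000), 3))  # 10.5
```

Tuning this balance (along with which layer’s notion of ‘style’ to use) corresponds to the parameter choices described below.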

There were a number of challenges I faced in performing my part. The first was to find parameters of the statistical model which gave musical structures usable by George. The second was to find appropriate parameters for the style transfer process, most notably how to pre-process the images and which particular conception of style from each artwork to transfer to the content image.

Alex Jones, sonification

My role was to create a faithful sonification of the protein dynamics using electronic musical synthesis. The primary aim of the sonification is to allow the listener to hear accurately the information being aurally displayed. This is in contrast to the string quartet, which was primarily a piece of music ‘informed’ by the data.

In our previous work, Rob Arbon and I worked out which parameters of the HMM were most useful to represent, and my expertise in sound engineering allowed us to map these to synthesized sounds synchronised with traditional animations of the protein. In this way the user can see structural information while simultaneously hearing more abstract properties of the protein, such as its stability.

In this project the sound design was informed by my conversations with George on more traditional music theory topics such as the harmonic series and chord inversions. The end result was a sonification of the data underlying the first movement of Metastable, available in the project’s OSF repository.

There were a number of challenges in this project which needed to be overcome. The first was that the new harmonic language introduced by George was complex, making the mapping of information to sounds much more challenging than in our previous sonification. The second major challenge was overcoming the language barrier between the classical and sound engineering worlds. We were both surprised, however, at how much common ground we shared once our respective nomenclatures and conventions had been explained.

George Holloway, composer

I had a dual role in the project. The first was to work with Alex Jones on the sonification, to find readily audible musical structures that could meaningfully convey to an audience the aspects of the data we deemed to be relevant. In so doing, we had to consider not just the audibility, but also the listenability of the musical structures we chose. This naturally touches upon one’s individual judgement and taste, and so takes the sonification a step beyond the merely slavish translation of data into sound, into an aestheticised or crafted sonification. My second, and quite distinct, role was to compose a piece of music that was both data-sonification and an autonomous artwork – a sort of data-inspired music.

For both sonification and composition, the “legibility”, or more aptly, the audibility of the relationship between underlying data and heard result was crucial, but for subtly different reasons. The musical composition, while not in any way intended as a data-scanning tool like the sonification, and only weakly intended as a public science-communication tool, nonetheless is ineradicably bound to the underlying data. For the music to have a clear expressive purpose, one must be able to appreciate that there are processes of tension and change in the music evocative of the physical processes at work in the molecular dynamics. In the “Metastable Impressions” project this stricture was made even more acute by the combination of the music with a projected visualisation of the molecule. The music could not therefore be completely autonomous (freely treating the material in its own time and following its own development), but had necessarily to conform to the same time structure and transformations of material to which the visualisation conformed, as dictated by the underlying data.   

There had to be some accommodation to aesthetic considerations at the stage of choosing the portions of data to be sonified (the “trajectories”): I proposed parameters to which the data should conform in order for it to be usable for generating musical material. Once Robert Arbon had selected trajectories according to these parameters, however, it was clear that the data would entirely preclude a “traditional” musical syntax and phraseology.  

This influence of the data proved to be decidedly advantageous for me as the composer, and was perhaps the most valuable insight I gained from composing the piece: precisely because the time structures and repetitions of material dictated by the data precluded more expected “organic” development of the materials, the five movements that made up my piece Metastable naturally took on unexpected and spontaneous-feeling structures. The music felt elusive and yet not incoherent, at least to my mind.  

One final aspect to mention is the use of style-transfer in both the visualisation (using machine-learning) and in the musical composition (done “manually” by me as the composer). In a sense, my stylistic use of earlier composers, such as Purcell and Mussorgsky, sits in a time-honoured tradition of musical borrowing known as “transcription”. The idea was that both visualisation and music would take on stylistic aspects of the time periods and locations related to important developments in the history of science that led to the present research into molecular dynamics. This in itself added an entirely other layer of aesthetic considerations to a complex but very rewarding project. 

Dr Pete Bennett, project supervisor

I took a supervisory role in the Metastable Impressions project, attending the weekly meetings and overseeing the technical and artistic collaboration between Rob, George and Alex. Having a background in both computer science and music has been useful throughout the project, as it has allowed me to fully appreciate the outstanding work done by the team and occasionally to act as a bridge when miscommunications arose during the meetings. I’ve particularly enjoyed the approach to interdisciplinary working that this project has taken, with the three disciplines of music composition, computational chemistry and sonification all playing an equal role in leading the project forward. At no point was there a feeling that one discipline had greater importance or dominated the discussion. It was also great to see time and effort taken throughout the project to explain complex terminology and theory in a simple manner to all team members.

The result of this project is hard to describe – a string quartet playing a score based on molecular dynamics structures, musically influenced by both sonification techniques and the history of composition, accompanied by a visualisation that uses machine learning to transfer artistic styles from the history of chemistry. The difficulty of describing what was achieved arises from the fact that no single element takes precedence, and is overall testament to the truly interdisciplinary nature of the project. Despite the difficulty of explaining the project on paper, the concluding performance brought all of the strands together seamlessly into one clear artistic vision that was very well received by the audience, prompting a deep debate that spanned all the disciplines involved.

The Jean Golding Institute seed corn funding scheme

The JGI offers funding to a handful of small pilot projects every year in our seed corn funding scheme – our next round of funding will be launched in the Autumn of 2019. Find out more about the funding and the projects we have supported.

 

Multitask learning for AMR

Blog written by Rob Arbon, Data Scientist at the University of Bristol.

This project was funded by the annual Jean Golding Institute seed corn funding scheme.

Multitask learning for AMR

“Multitask learning for AMR” developed out of our collaboration with the Jean Golding Institute on the One Health Selection and Transmission of Antimicrobial Resistance (OH-STAR) project funded by the Natural Environment Research Council (NERC), the Biotechnology and Biological Sciences Research Council (BBSRC) and the Medical Research Council (MRC).

OH-STAR is seeking to understand how human activity and interactions between the environment, animals and people lead to the selection and transmission of Antimicrobial Resistance (AMR) within and between these three sectors. As part of this project we collected thousands of observations, from over 50 dairy farms, of the prevalence of AMR bacteria, as well as over 300 potential environmental variables and farm management practices that could lead to AMR transmission or selection (so-called risk factors). If we can identify these risk factors, the hope is that we can use this information to shape policy to reduce the spread of AMR into the human population, where it threatens to cause widespread death and disease.

Multitask learning (MTL) is a statistical technique that aims to relate different “tasks” in order to improve how we perform each task and to understand how the tasks are related. MTL has been used in many different areas, from improving image recognition to helping diagnose neurodegenerative diseases. In this project we wanted to see if MTL could be used to better understand the OH-STAR data, and also to sketch out ideas for potential grant applications to develop this idea further.

To evaluate MTL for our purposes, we focused our attention on two small subsets of the OH-STAR data: 600 faecal samples from pre-weaned young calves and 1800 from adult dairy cows. Each of these samples had been tested to see whether the Escherichia coli bacteria within them contained the CTX-M gene. This gene is important because it confers resistance to a range of antibiotics, such as penicillin-like and cephalosporin antibiotics, which are used to treat many different infections in humans and cattle.

In order to model the risk of something occurring, statisticians use a technique called logistic regression. With this technique you can quantify by how much a risk factor (e.g. which antibiotics a farmer uses) increases or decreases the risk of observing the CTX-M gene in samples from the farm. As an example, consider how one of our risk factors, atmospheric temperature, affects the risk of observing CTX-M. Each dot on the chart below represents one of our samples, showing whether it had CTX-M (right-hand axis) and the temperature when the sample was recorded (horizontal axis). The result of the logistic regression is the black line (left-hand axis): the risk of observing CTX-M as the temperature increases.

How atmospheric temperature affects the risk of observing CTX-M
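To make the shape of that black line concrete, here is a minimal sketch of a fitted logistic curve. The intercept is a made-up illustrative value; 0.4 is the log-odds-ratio estimated for temperature in this analysis:

```python
import math

def risk_of_ctx_m(temp_c, intercept=-1.5, log_odds_ratio=0.4):
    """Fitted logistic curve: risk of observing CTX-M at a given
    temperature. The intercept here is hypothetical; 0.4 is the
    log-odds-ratio for temperature from this analysis."""
    return 1 / (1 + math.exp(-(intercept + log_odds_ratio * temp_c)))

print(round(risk_of_ctx_m(0), 2))   # → 0.18
print(round(risk_of_ctx_m(10), 2))  # → 0.92 (warmer, so higher risk)
```

The curve is S-shaped: the risk stays between 0 and 1, rising smoothly as temperature increases.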

This relationship can be summarised by a single number called the log-odds-ratio; in this case the log-odds-ratio of temperature is 0.4. The fact that this number is greater than 0 means that temperature increases the risk of CTX-M; if it were less than zero, it would mean that it decreases the risk.
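As a quick worked example, a log-odds-ratio converts directly into a multiplicative effect on the odds:

```python
import math

log_odds_ratio = 0.4  # the value estimated for temperature above

# Each one-unit rise in temperature multiplies the odds of
# observing CTX-M by exp(log-odds-ratio)
odds_multiplier = math.exp(log_odds_ratio)
print(round(odds_multiplier, 2))  # → 1.49, i.e. roughly a 49% increase in the odds
```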

So, to quantify how each of the measured risk factors affects the risk of observing CTX-M, we could simply run logistic regression on all 300 risk factors, separately for the adults and the calves, and look at the log-odds-ratios for each risk factor. However, this approach suffers from two problems:

  1. fitting a standard logistic regression model with 300 risk factors and at most 1800 observations means your conclusions won’t necessarily hold for a wider population (because of overfitting)
  2. this approach treats understanding risk in adults and heifers as two separate tasks, when in fact they share many similarities.

To overcome the first problem, statisticians use a technique called regularization. In a nutshell, this reduces the complexity of your model to prevent it from fitting random fluctuations (“noise”) in your data, focusing the model on predicting the “true” signal.
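The shrinking effect of regularization can be sketched with a toy example (synthetic data and a simple L2 “ridge” penalty fitted by gradient descent; the OH-STAR analysis itself used the RMTL tooling):

```python
import numpy as np

def fit_logistic(X, y, lam=0.0, lr=0.1, iters=5000):
    """Logistic regression by gradient descent, with an optional L2
    ('ridge') penalty of strength lam (for simplicity the intercept
    is penalised too)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + lam * w)
    return w

# Toy data: first column is an intercept, second a single risk factor
X = np.array([[1, -2], [1, -1], [1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)

w_plain = fit_logistic(X, y)           # unregularised fit
w_ridge = fit_logistic(X, y, lam=0.5)  # regularised: the coefficient shrinks
print(w_plain[1], w_ridge[1])
```

The penalised coefficient is pulled towards zero, trading a little bias for less sensitivity to noise in the data.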

Multi-task learning (MTL) is an approach to overcoming the second problem. The idea is that some risk factors will pertain more to calves than to adults (e.g. the type of antibiotics given when they are young) or vice versa, so it makes sense for these risk factors to have different impacts in each model. However, there are some risk factors that will have very similar effects on both types of animal (e.g. outside temperature). The way MTL relates tasks is complicated, but it is very similar to the regularization technique linked to above. Interested readers can find reviews of MTL in A brief review on multi-task learning and An overview of multi-task learning in deep neural networks.
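As a rough sketch of how one MTL variant works (an illustrative proximal-gradient toy with made-up data, penalty strength and learning rate, not the RMTL implementation), an L2,1 penalty shrinks each risk factor's row of coefficients across both tasks, so factors irrelevant to both are driven to zero:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def joint_feature_mtl(tasks, lam=0.3, lr=0.05, iters=3000):
    """Proximal-gradient sketch of 'joint feature learning': each task
    is a logistic regression, and an L2,1 penalty group-soft-thresholds
    each risk factor's row of coefficients across all tasks."""
    d = tasks[0][0].shape[1]
    W = np.zeros((d, len(tasks)))           # one column of weights per task
    for _ in range(iters):
        for t, (X, y) in enumerate(tasks):  # gradient step on each task's loss
            p = sigmoid(X @ W[:, t])
            W[:, t] -= lr * (X.T @ (p - y) / len(y))
        # proximal step: shrink each feature's row norm across tasks
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W *= np.maximum(0.0, 1 - lr * lam / np.maximum(norms, 1e-12))
    return W

# Synthetic tasks: feature 0 drives both outcomes, features 1-2 are noise
rng = np.random.default_rng(0)
def make_task(n=200):
    X = rng.normal(size=(n, 3))
    y = (sigmoid(2 * X[:, 0]) > rng.random(n)).astype(float)
    return X, y

W = joint_feature_mtl([make_task(), make_task()])
print(np.round(np.linalg.norm(W, axis=1), 2))  # row 0 large, rows 1-2 near zero
```

Printing the row norms shows the shared signal feature survives in both tasks while the noise features are zeroed in both, which is exactly the behaviour exploited below.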

We looked at a number of different MTL techniques using the R package RMTL (package and accompanying paper), but for brevity we will consider only one here. Our two ‘tasks’ were logistic regressions finding the risk factors affecting pre-weaned calves (task 1, labelled ‘heifers’) and adults (task 2).

The technique I’d like to talk about here is called ‘joint feature learning’, which relates the tasks by encouraging them to have similar coefficient values for each risk factor. This means that risk factors which are not important to both tasks feature less strongly in each model. The results with (right) and without (left) MTL are shown below.

The red results are for heifers and the blue results are for the adults. Each bar denotes the effect of a risk factor on the risk of observing CTX-M: positive increases the risk while negative decreases it. Applying MTL suggested that the circled risk factors were not important to either task, as it drove their effects to zero in both models. This was important for two reasons.

First, this cut down the number of possible risk factors that needed further investigation. Second, it meant that those risk factors which did show different effects on the risk could be trusted as not being down to chance alone. For example, temperature was one of the most important features for the heifers but not for the adults – this could provide interesting hypotheses to test in future work.

MTL has potentially novel applications in real-world scenarios

The main conclusion from this work is that MTL has potentially novel applications in complicated “real-world” scenarios, but that the tools for MTL need further development. For example, the techniques in the RMTL package did not fully take into account the structure of the data – something that will need to be addressed in any future work.

In our wrap-up meeting we discussed the potential for using these techniques in future work. The main idea discussed was to use data collected as part of a completely separate project to help understand risk in the OH-STAR data, and vice versa. For example, One Health drivers of AMR in rural Thailand (OH-DART) is a similar project, funded by the Global Challenges Research Fund (GCRF). The distinct but related datasets from the OH-STAR and OH-DART projects could be analysed jointly using MTL to identify risk factors. Thanks to the funding from the JGI we now have the materials necessary to write such a proposal, and we will be watching calls from e.g. the GCRF and BBSRC in the near future to fund this work.

All of our code for this work can be found on the Open Science Framework.


 

 

Interactive visualisation of Antarctic mass trends from 2003 until present

Image of the calving front of the Brunt Ice Shelf, Antarctica. Image Credit: Ronja Reese (distributed via imaggeo.egu.eu), available under a Creative Commons Licence.

Blog written by Dr Stephen Chuter, Research Associate in Sea Level Research, School of Geographical Sciences, University of Bristol

This project was funded by the annual Jean Golding Institute seed corn funding scheme.

Antarctica and sea level – a grand socioeconomic challenge

Global sea level rise is one of the most pressing societal and economic challenges we face today. A rise of over 2 m by 2100 cannot be discounted (Ice sheet contributions to future sea-level rise from structured expert judgement), potentially displacing 187 million people from low-lying coastal communities. Because the Antarctic Ice Sheet is a major and increasing contributor to global sea level rise, it is of critical importance to understand and quantify trends in its mass. Additionally, effective communication of this information is vital for policy makers, researchers and the wider public.

Combining diverse observations requires new statistical approaches

The Antarctic Ice Sheet is larger in area than the contiguous United States, meaning the only way to get a complete picture of its change through time is by using satellite and/or airborne remote sensing to supplement relatively sparse field observations. Combining these datasets is a key computational and statistical challenge due to the large data volumes and the vastly different spatial and temporal resolutions of different techniques. This challenge, which continues to grow as the length of the observation record increases and as new remote sensing technologies provide ever higher resolution data, requires novel statistical approaches.

We previously developed a Bayesian hierarchical modelling approach as part of the NERC-funded RATES project, which combined diverse observations in a statistically rigorous manner. It allowed us to calculate the total mass trend at the scale of an individual drainage basin, in addition to the relative contributions of the individual component processes driving this change, such as variations in ice flow or snowfall. Being able to study the spatial and temporal pattern of the component processes allows researchers to better understand the key external drivers of ice sheet change, which becomes critical when making predictions about its future evolution.
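The statistical core of such a framework can be illustrated in miniature (with made-up numbers, and ignoring the spatial and temporal structure the real model captures) by the precision-weighted combination of two independent estimates of the same mass trend:

```python
def combine(mean_a, var_a, mean_b, var_b):
    """Precision-weighted (inverse-variance) combination of two
    independent Gaussian estimates -- the simplest case of the Bayesian
    updating a hierarchical framework performs at scale."""
    precision = 1 / var_a + 1 / var_b
    mean = (mean_a / var_a + mean_b / var_b) / precision
    return mean, 1 / precision

# Hypothetical basin mass trends (Gt/yr) from two observation techniques
mean, var = combine(-50, 20**2, -80, 10**2)
print(round(mean, 1), round(var**0.5, 1))  # → -74.0 8.9
```

The combined estimate sits between the two inputs, weighted towards the more precise one, and its uncertainty is smaller than either input's.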

Project goals

Our current work is aiming to achieve two major goals:

  1. Develop the Bayesian hierarchical framework and incorporate the latest observations in order to extend the mass trend time series until as close as possible to the present day.
  2. Synthesise the results in an easily accessible web application, so that users can interrogate and visualise the results at a variety of spatial scales.

The first of these goals is critical to providing model and dataset improvements, paving the way for the framework to be used over longer time series in order to better understand multi-decadal processes. The second goal, funded by the JGI seed corn award, is to provide these outputs in a manner that is easily used and understood by a range of stakeholders:

  • Scientists – Can use the latest available results within their own research and as part of large international collaborative inter-comparisons (e.g. the World Climate Research Programme).
  • Policy makers – Can easily interrogate the results and access an overview of the methodology, making it easier to include project outputs as evidence in policy-making decisions.
  • General public – Can engage with the research outputs and the methods used, raising awareness of the potential impacts of climate change on ice sheet dynamics.

Results

To date we have extended mass trends for the Antarctic Ice Sheet up to and including 2015. This has enabled us to see the onset of dynamic thinning (changes in mass due to increased ice flow into the ocean) over areas previously considered stable such as the Bellingshausen Sea Sector and glaciers flowing into the Getz Ice Shelf.

Rates of elevation change due to ice dynamic processes from 2003 to 2015
Time series of annual mass trends for the Antarctic Ice Sheet from 2003 to 2015

The cyan line represents total ice sheet mass change, the orange line represents changes due to ice dynamics (variations in ice flow) and the purple line represents changes in mass due to surface processes (variations in mass due to changes in precipitation and surface melt). The shaded areas around each line represent the one standard deviation uncertainty.  

In order to disseminate these results, a new web application has been developed. This allows users to interactively explore and download the updated results. Additionally, the web application features a second interactive page, aimed at the public and policy makers, which provides an overview of the datasets used in this work.

Future plans

The project has allowed us to make key advances in this methodology, laying the foundations for extending the time series nearer to the present day. Future plans include incorporating additional observations from the new ICESat-2 and GRACE Follow-On satellite missions. Ultimately, the extended time series will be an important input dataset for the GlobalMass ERC project, which aims to take the same statistical approach to study the global sea level budget.

The creation of the web application will allow for future updates to be quickly communicated, which allows our results to be incorporated in related research or policy making decisions. We hope to extend the web application functionality further to include more outputs from the GlobalMass project. 
