The Royal Statistical Society holds its internationally attended conference annually, serving as the UK’s showcase for statistics and data science. This year the conference met in Brighton, drawing over 600 attendees from around the world, including JGI Data Scientist Dr Huw Day.
The conference featured over 250 presentations, including contributed talks, rapid-fire talks and poster presentations. With as many as 6 different talks running at any one time, it was impossible to go to everything, but below are some of Huw’s highlights from the conference.
Pre-empting misunderstandings is part of trustworthy communication
As part of a session on communicating data to the public, Professor Sir David Spiegelhalter talked about his experiences trying to pre-bunk misinformation when displaying data.
Data from June 2021 showed that the majority of COVID deaths were in the vaccinated group. Brazilian President Jair Bolsonaro used this data to support claims that Covid vaccines were killing people. Spiegelhalter and his colleague Anthony Masters explained why this wasn’t a sign the vaccine was unsafe in an article in The Observer, “Why most people who now die with Covid in England have had a vaccination”.
Consider the following analogy: most car passengers who die in car accidents are wearing seatbelts. Intuitively, we understand that this is because the vast majority of passengers wear seatbelts, not because seatbelts cause deaths: just because two variables are associated, it doesn’t mean that one causes the other. Having a story like that means you don’t have to talk about base rates, stratification or even start to use numbers in your explanations.
We should try to make the caveats of data clearer before we present it, and be upfront about what can and can’t be concluded from the data.
Spiegelhalter pointed to an academic paper, “Transparent communication of evidence does not undermine public trust in evidence”, in which participants were shown either persuasive or balanced messages about the benefits of Covid vaccines and nuclear power. It’s perhaps not surprising to read that those who already had positive opinions about either topic continued to have positive views after reading either message. Far more interesting is that the paper concluded that “balanced messages were consistently perceived as more trustworthy among those with negative or neutral prior beliefs about the message content.”
Whilst we should pre-empt misconceptions and caveats, being balanced and more measured might prove to be an antidote for those who are overly sceptical: standard, overly positive messaging actively reduces trust in groups with more sceptical views.
Digital Twins of the Human Heart fuel Synthetic 3D Image Generation
A digital twin is a digital replica or simulator of something from the real world. Typically, it includes some sort of virtual model which is informed by real-world data.
Dr Dirk Husmeier at the University of Glasgow has been exploring the application of digital twins of the human heart and other organs to investigate the behaviour of the heart during heart attacks, as well as using ultrasound measurements of blood flow to estimate pulmonary blood pressure (blood pressure in the lungs). Measuring pulmonary blood pressure directly is an extremely invasive procedure, so an ultrasound-based method has clear utility.
One of the issues in building a digital twin is having data about what you’re looking at. In this case, the data takes the form of MRI scans of the human heart, taken at several “slices”. Because of limitations in existing data, Dr Vinny Davies and Dr Andrew Elliot (both colleagues of Husmeier at the University of Glasgow) have been attempting to develop methods for making synthetic 3D models of the human heart, based on their existing data. They broke the problem down into several steps, working first to generate synthetic versions of the slices of the heart (which are 2D images).
The researchers were using a method called Generative Adversarial Networks (GANs), where two systems compete against each other. The generator system generates the synthetic model and the discriminator system tries to distinguish between real and synthetic images. You can read more about using GANs for synthetic data generation in a recent JGI blog about Chakaya Nyamvula’s JGI placement.
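The adversarial setup can be sketched in a few lines of numpy. This is a toy illustration only, with linear maps standing in for the deep networks the researchers would actually use: the generator turns random noise into synthetic samples, and the discriminator scores samples as real (close to 1) or synthetic (close to 0), with each side's loss pulling against the other's.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, w, b):
    # Hypothetical linear generator: noise -> synthetic sample.
    return z * w + b

def discriminator(x, v, c):
    # Logistic "real vs synthetic" score in (0, 1).
    return 1.0 / (1.0 + np.exp(-(x * v + c)))

# Stand-in "real" data: samples from N(3, 1) that the generator should mimic.
real = rng.normal(3.0, 1.0, size=64)
noise = rng.normal(0.0, 1.0, size=64)
fake = generator(noise, w=1.0, b=0.0)

# Discriminator's binary cross-entropy loss: low when it separates
# real from fake well, high when it is fooled.
d_real = discriminator(real, v=1.0, c=-2.0)
d_fake = discriminator(fake, v=1.0, c=-2.0)
d_loss = -np.mean(np.log(d_real + 1e-9)) - np.mean(np.log(1 - d_fake + 1e-9))

# Generator's loss: low when the discriminator labels fakes as real.
g_loss = -np.mean(np.log(d_fake + 1e-9))

print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")
```

In training, the two losses would be minimised alternately by gradient descent, so that improvements on one side force improvements on the other.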
Because the job of the generator is far harder than that of the discriminator (consider the task of reproducing a famous painting, versus spotting the difference between an original painting and a version drawn by an amateur), it’s important to find ways to make the generator’s job easier early on, and the discriminator’s job harder so that the two can improve together.
The researchers used a method called a Progressive GAN. Initially they gave the generator the task of drawing a lower resolution version of the image. This is easier, so the generator improved quickly. Once the generator could do this well, they used the lower resolution versions as the new starting point and gradually increased the resolution. Consider trying to replicate a low resolution image: all you have to do is colour in a few squares in a convincing way. This also makes the discriminator’s job harder early on, as it’s tasked with telling the difference between two extremely low resolution images. This allows the two systems to gradually improve in proficiency together.
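The progressive idea can be sketched as a resolution schedule: training begins at a coarse resolution that doubles in stages, with the real images downsampled to match the generator's current output so the discriminator compares like for like. This is an assumed, simplified setup, not the Glasgow group's actual code.

```python
import numpy as np

def resolution_schedule(start=4, final=64):
    """Resolutions the GAN is trained at, coarsest first."""
    res = [start]
    while res[-1] < final:
        res.append(res[-1] * 2)
    return res

def downsample(image, res):
    """Block-average a square image down to res x res, so real data
    matches the generator's output at the current training stage."""
    n = image.shape[0]
    factor = n // res
    return image.reshape(res, factor, res, factor).mean(axis=(1, 3))

schedule = resolution_schedule(4, 64)                  # [4, 8, 16, 32, 64]
slice_2d = np.random.default_rng(1).random((64, 64))   # stand-in 2D heart slice

for res in schedule:
    target = downsample(slice_2d, res)
    # ...train generator and discriminator at this resolution,
    # then grow both networks and move to the next stage...
    assert target.shape == (res, res)
```

At the coarsest stage the generator only has to colour in a 4x4 grid convincingly; each doubling adds detail on top of what has already been learned.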
The work is ongoing and the researchers at Glasgow are looking for a PhD student to get involved with the project!
Data Hazards
On the last day of the conference, Huw, alongside Dr Nina Di Cara from the School of Psychology at the University of Bristol, presented the Data Hazards project to participants.
Participants (including Hadley Wickham, keynote speaker and author of the famous tidyverse R packages) were introduced to the project, shown examples of how it has been used, and then shown an example project where they were invited to take part in discussions about which different data hazards might apply and how you might go about mitigating those hazards. They also discussed the importance of focussing on the hazards that are most relevant and prominent.
All the participants left with their own set of the Data Hazard labels and a new way to think about communicating hazards of data science projects, as well as invites to this term’s first session of Data Ethics Club.