The University of Bristol's central hub for data science and data-intensive research, connecting a multidisciplinary community of experts across the University and beyond.
As an active member of the Turing University Network, we have appointed a Turing Liaison Manager and two Turing Liaison Academics to support and enhance the partnership between The Alan Turing Institute and the University of Bristol. These roles will focus on increasing engagement from the Turing, developing external and internal networks around data science and AI, and supporting relevant interest groups, Enrichment students and Turing Fellows at the University of Bristol.
Turing Liaison Manager Isabelle Halton and Turing Liaison Academics Conor Houghton and Emmanouil Tranos are keen to build communities around data science and AI, providing support to staff and students who want to be more involved in Turing activity.
Isabelle previously worked in the Professional Liaison Network in the Faculty of Social Sciences and Law. She has extensive experience in building relationships and networks, project and event management and streamlining activities connecting academics and external organisations.
Conor is a Reader in the School of Engineering Mathematics and Technology, interested in linguistics and the brain. He is a Turing Fellow and a member of TReX, the Turing ethics committee.
Emmanouil is currently a Turing Fellow and a Professor of Quantitative Human Geography, specialising primarily in the spatial dimensions of the digital economy.
If you’re interested in becoming more involved with Turing activity or have any questions about the partnership, please email Isabelle Halton, Turing Liaison Manager, via the Turing Mailbox.
A public event organised by The Alan Turing Institute – 20 June 2024
Blog post by Léo Gorman, Data Scientist, Jean Golding Institute
Let’s say you are a researcher approaching a new dataset. Often it seems that there is a virtually infinite number of legitimate paths you could take between loading your data for the first time and building a model that is useful for prediction or inference. Even if we follow statistical best practice, it can feel that well-established methods still don’t allow us to communicate our uncertainty in an intuitive way, to say where our results are relevant and where they are not, or to understand whether our models can be used to infer causality. These are not trivial issues. The Alan Turing Institute (the Turing) hosted a theory and methods challenge fortnight (TMCF), where leading researchers got together to discuss these issues.
JGI team members Patty Holley, James Thomas and Léo Gorman (left to right) at the Turing
Members of the Jean Golding Institute (Patty Holley, James Thomas, and Léo Gorman) went to London to participate in this event, and to meet with staff at the Turing to discuss opportunities for more collaboration between the Turing and the University of Bristol.
In this post, I aim to provide a brief summary of my take-home messages that I hope you will find useful. At the end of this post, I recommend materials from all three speakers which will cover these topics in much more depth.
Andrew Gelman – Beyond “push a button, take a pill” data science
Andrew Gelman presenting
Gelman mainly discussed how statistics are used to assess the impact of ‘interventions’ in modern science. Randomised controlled trials (RCTs) are considered the gold-standard option, but according to Gelman, the common interpretation of these studies could be improved. First, the trials need to be taken in context, and it needs to be understood that the findings might be different in another scenario.
We need to move beyond the binary “it worked” or “it didn’t” outcomes. There are intermediate outcomes which help us understand how well a treatment worked. For example, let’s take a cancer treatment trial. Rather than just looking at whether a treatment worked for a group, we could look at how the size of the tumour changed, and whether this change differed between people. As Gelman says in his blog: “Real-world effects vary among people and over time”.
Jessica Hullman – What do we miss with average model effects? How can simulation and data visualisation help?
Jessica Hullman presenting
Hullman’s talk expanded on some of the themes in Gelman’s talk. Let’s continue with the example of an RCT for cancer treatment. If we saw an average effect of 0.1 between treatment and control, how would that effect vary for different characteristics (represented by the x-axis in the quartet of graphs below)? Hullman demonstrated how simulation and visualisation can help us understand how different scenarios can lead to the same conclusion.
Causal quartets, as shown in Gelman, Hullman, and Kennedy’s paper. All four plots show an average effect of 0.1, but these effects vary as a function of an explanatory variable (x-axis)
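To make the idea concrete, here is a minimal sketch in Python of how very different effect patterns can share the same average. The four patterns below (`constant`, `linear`, `threshold` and `rare_large`) are hypothetical shapes chosen for illustration, not the exact curves from the paper, and the explanatory variable x is assumed to be spread evenly between 0 and 1.

```python
from statistics import fmean

# Hypothetical individual treatment effects as a function of x in [0, 1].
# These shapes are illustrative assumptions, not the curves from the paper.
def constant(x):
    return 0.1                         # everyone benefits equally

def linear(x):
    return 0.2 * x                     # effect grows steadily with x

def threshold(x):
    return 0.2 if x > 0.5 else 0.0     # only the upper half benefits

def rare_large(x):
    return 1.0 if x > 0.9 else 0.0     # large effect for the top 10% only

PATTERNS = {
    "constant": constant,
    "linear": linear,
    "threshold": threshold,
    "rare_large": rare_large,
}

def average_effects(n=1000):
    """Average each effect pattern over an evenly spaced grid of x values."""
    xs = [(i + 0.5) / n for i in range(n)]
    return {name: fmean([f(x) for x in xs]) for name, f in PATTERNS.items()}

for name, avg in average_effects().items():
    print(f"{name:12s} average effect = {avg:.3f}")
```

Every pattern averages to 0.1, so a report of the average alone cannot distinguish between them. Plotting each function against x would give a quartet in the same spirit as the figure above.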
Hadley Wickham – Challenges with putting data science into production
Hadley Wickham presenting
Wickham’s talk focused on some of the main issues with conducting reproducible, collaborative, and robust data science. Wickham framed these challenges under three broad themes:
Not just once: An analysis often needs to be run more than once. For example, you may want to run the same code on new data as it is collected.
Not just on my computer: You may need to run some code on your own laptop, but also on another system, such as the University’s HPC.
Not just me: Someone else may need to use your code in their workflow.
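As a small illustration of these three themes, the sketch below (in Python, although Wickham's own examples are more likely to be in R) wraps an analysis step in a function that takes its input path as an argument instead of hard-coding a location on one machine. The function name `summarise` and the column name `value` are hypothetical, chosen only for this example.

```python
import csv
from pathlib import Path
from statistics import fmean

def summarise(csv_path, column="value"):
    """Return the mean of one numeric column in a CSV file.

    The path is a parameter, so the same code can be rerun on new data
    ("not just once"), on another system ("not just on my computer"),
    and from a colleague's workflow ("not just me").
    """
    with Path(csv_path).open(newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    return fmean(values)
```

A colleague could then call, for example, `summarise("new_batch.csv")` on a file of their own (the file name here is made up) without editing the function itself.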
According to Wickham, for people in an organisation to work on the same codebase, they need to be able to do the following (in order of priority):
find the code
run the code
understand the code
edit the code.
These challenges exist in all types of organisation, and there are surprisingly few cases where organisations fulfil all four criteria.
Panel discussion – Reflections on data science
Cagatay Turkay, Roger Beecham, Hadley Wickham, Andrew Gelman, Jessica Hullman (left to right) at the Turing
Following each of their individual talks, the panellists reflected more generally. Here are a few key points:
Causality and complex relationships: When asked about the biggest surprises in data science over the past 10 years, both Gelman and Hullman seemed surprised at the uptake of ‘blackbox’ machine learning methods. More work needs to be done to understand how these models work and to communicate their uncertainty. The causal quartet visualisations presented in the talk only addressed simple/ideal cases for causal inference. Gelman and Hullman both said that figuring out how to understand complex causal relationships for high-dimensional data was at the ‘bleeding edge’ of data science.
People problems not methods/tools problems: All three panellists agreed that most of the issues we face in data science are people problems rather than methods/tools problems. Many of the tools and methods already exist, but we need to think more carefully.
Léo’s takeaway
The whole trip reminded me of the importance of continual learning, and I will definitely be spending more time going through some of Gelman’s books (see below).
Gelman and Hullman’s talks in general encouraged people to ask: at each point in my analysis, were there alternative choices that would have been equally reasonable, and if so, how different would my results have looked had I made those choices? This made me want to think more about multiverse analyses (see analysis package and article).
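The multiverse idea can be sketched in a few lines of Python: list the defensible choices at each step, run every combination, and compare the results. Everything below is an assumption made for illustration. The measurements are made up, and the two decision points (how to handle outliers, which summary statistic to use) are hypothetical. This is not the interface of the analysis package mentioned above.

```python
from itertools import product
from statistics import mean, median

# Made-up outcome measurements, for illustration only.
TREATMENT = [1.2, 0.9, 1.5, 1.1, 4.0]
CONTROL = [1.0, 0.8, 1.1, 0.9, 1.2]

# Decision point 1: how to handle possible outliers.
OUTLIER_RULES = {
    "keep_all": lambda values: list(values),
    "drop_above_3": lambda values: [v for v in values if v <= 3.0],
}

# Decision point 2: which summary statistic to compare between groups.
SUMMARIES = {"mean": mean, "median": median}

def multiverse():
    """Estimate the treatment-control difference under every combination of choices."""
    results = {}
    for (rule_name, rule), (stat_name, stat) in product(
        OUTLIER_RULES.items(), SUMMARIES.items()
    ):
        estimate = stat(rule(TREATMENT)) - stat(rule(CONTROL))
        results[(rule_name, stat_name)] = estimate
    return results

for spec, estimate in multiverse().items():
    print(spec, round(estimate, 3))
```

With this toy data, the four equally defensible specifications give estimates ranging from about 0.15 to about 0.74. Reporting only one of them would hide that spread.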
Further Reading
Theory and Methods Challenge Fortnight – Garden of Forking Paths
The speakers were there as part of the Turing’s Theory and Methods Challenge Fortnight (TMCF). More information can be found below:
TMCF website (worth checking for updates and write-ups)
Andrew Gelman
For people who have not heard of Andrew Gelman before, he is known to be an entertaining communicator (you can search for some of his talks online or look at the Columbia statistics blog). He also has several great books:
Jessica Hullman
Again, check the Columbia statistics blog, where Hullman also contributes. The home page of Hullman’s website also includes selected publications which cover causal quartets, but also reliability and interpretability for more complex models.
Hadley Wickham
Wickham has made many contributions to R and data science. He is chief scientist at Posit and leads the tidyverse team. His book R for Data Science is a particularly useful resource. Other work can be found on his website.