A Challenge Owner’s perspective of the inaugural Turing Network Data Study Group – Part 2

The challenge team with Challenge Owner Simon de Lusignan and Data Science Principal Investigator Mark Joy

Understanding and improving the reliability of disease monitoring in GP surgeries is the extensive yet critical task taken on by the team of researchers at the Royal College of General Practitioners (RCGP) and the University of Surrey, headed by Professor Simon de Lusignan and Dr Mark Joy. With this goal in mind, the team challenged participants of the first Turing Network Data Study Group to develop a predictive machine learning algorithm that corrects sub-optimal data, allowing for better disease monitoring.

In part two of our blog series focusing on a Challenge Owner’s perspective of the Turing Network Data Study Group, Professor de Lusignan and his team tell us about their experience of the DSG and the challenge they presented: Improving our ability to use routine data to inform the management of key disease areas. You can read part one of this series, where we spoke to another Challenge Owner, University of Bristol’s Danielle Paul, about her experience of the event, on the JGI blog.

Challenge Owner team – who was involved?

  • Prof Simon de Lusignan, University of Oxford/University of Surrey/Royal College of General Practitioners
  • Dr Mark Joy, University of Surrey
  • Rachel Byford, University of Oxford/University of Surrey
  • Dr John Williams, University of Oxford/University of Surrey
  • Dr Nadia Smith, University of Surrey/National Physical Laboratory

Can you give us a brief overview of the challenge you presented to the Data Study Group participants?

It is essential to monitor blood pressure in various chronic diseases (e.g. heart disease and diabetes). However, GPs tend to exhibit certain biases when recording measurements, for example a preference for round numbers. We have 47 million blood pressure readings and 7 million glycated haemoglobin (HbA1c) readings (a measure of diabetes control), and we were interested in finding the true blood pressure and HbA1c trends from the inaccurate data and comparing trends for different groups of patients (e.g. on various medications). Participants were challenged to develop a predictive machine learning algorithm that corrects suboptimal data, allowing for better disease monitoring. The challenge ended up being split into three sub-challenges:

  1. Identifying whether a case is new (incident) or a follow-up (prevalent) when this information is not recorded in the computerised medical record
  2. What is the true underlying blood pressure (BP) in a population where there is marked end-digit preference for zero, when data are recorded?
  3. What is the trend in diabetes control when there is additional testing at the time of ill health?
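To make the end-digit preference in sub-challenge 2 concrete, here is a minimal R sketch on simulated data (not the RCGP readings). It assumes a hypothetical recording process in which 30% of systolic readings are rounded to the nearest 10 mmHg; the rounding rate, the distribution of true values and the helper name `end_digit_test` are all illustrative assumptions.

```r
set.seed(1)

# Simulated "true" systolic blood pressures (mmHg); values are illustrative only.
n <- 10000
true_bp <- round(rnorm(n, mean = 130, sd = 15))

# Hypothetical recording bias: 30% of readings are rounded to the nearest 10.
rounded <- runif(n) < 0.3
recorded_bp <- ifelse(rounded, round(true_bp / 10) * 10, true_bp)

# Tabulate final digits and test them against a uniform spread over 0-9.
end_digit_test <- function(bp) {
  digits <- factor(bp %% 10, levels = 0:9)
  counts <- table(digits)
  list(counts = counts, test = chisq.test(counts))
}

unbiased <- end_digit_test(true_bp)
biased   <- end_digit_test(recorded_bp)

print(biased$counts)        # the count for digit 0 is inflated
print(biased$test$p.value)  # a very small p-value flags the preference
```

A check like this only detects the preference; recovering the underlying blood pressure distribution is the harder problem the team was asked to address.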

What kind of solutions did the challenge team come up with?

The solutions suggested to the three sub-challenges were as follows:

  1. Tree classifiers, as this is essentially a binary classification problem (is a GP visit a follow-up or a new, incident visit?); decision trees and random forests for classification of episodes into new and ongoing; data-driven approaches to finding the threshold and min-max range of the number of days between two episodes per disease.
  2. Latent variables, time series ideas
  3. Bayesian-type approach with an iterative procedure for uncovering the posterior (incorporating Neural Network classifiers for patient characteristics)

What are your hopes for the potential applications of the team’s findings from this week?

The team had to work together to find solutions to the three sub-challenges

We have two members of the group interested in carrying on this work. We hope to explore further the team’s approach to Sub-challenge 1, as we feel this is a promising area. The team’s contribution to Sub-challenge 2 is already planned to be incorporated into the RCGP report to Public Health England. It increases the scope and applicability of this report on the nation’s health in certain key disease areas. Sub-challenge 3 was arguably the most difficult challenge, and the team’s feedback has led us to reconsider how we engineer our data to better address this prediction problem.

As a Challenge Owner, what was your favourite part of the Data Study Group week?

New perspectives, and the opportunity to make more use of our data. We enjoyed engaging with the enthusiasm and energy of the team. Our favourite part was listening to the presentations at the end of the week.

Were there any surprises for you at the event?

How narrow population health and epidemiological techniques are compared with the wealth of ideas and approaches available.

Is there anything else you would like to tell us?

Two members of the group have been in contact about continuing this work: one to work on “episode types”, the other on end-digit preference in blood pressure recording. The event was immensely enjoyable, truly challenging for the team members, and a joy to participate in.

The Alan Turing Institute and Data Study Groups

The inaugural Turing Network Data Study Group was hosted in August 2019 by the Jean Golding Institute at the University of Bristol – one of The Alan Turing Institute’s 13 partner universities. The event united six Challenge Owners with 50 students, postdocs and senior academics to tackle real-world data science challenges spanning a variety of fields, from spectroscopy and analytical chemistry to text mining and digital humanities.

Building on the popular Data Study Groups (DSGs), held three times a year at Turing HQ in London, this ‘Turing Network’ event was the first of its kind to be hosted by a partner university. It followed the tried-and-tested format of a five-day collaborative hackathon. The Challenge Owners – organisations from industry, government and the third sector – provided real-world data challenges that were tackled by small groups of highly talented researchers. The results were presented on the final day.

Find out more about Data Study Groups, including how you can get involved as a researcher or Challenge Owner, on The Alan Turing Institute website.

JGI Seed corn funding call winners 2020 announced!

The Jean Golding Institute are delighted to announce the winners of the Seed corn funding call 2020.

This funding call has been running successfully for the last three years and aims to support activities that foster interdisciplinary research in data science (including AI) and data-intensive research.

The Jean Golding Institute has funded a total of 32 seed corn projects since 2016. This year, we have been able to fund 10 projects and are grateful to have received funds from the Faculty of Arts and Strategic Funding in order to offer additional awards. Our winners this year are:

  • Oliver Davis, Claire Haworth and Nina Di Cara with ‘Mood music: using Spotify to infer wellbeing’
  • Brendan Smith and Mike Jones with ‘Digital humanities meets Medieval financial records: the receipt rolls of the Irish exchequer’
  • Zoi Toumpakari, Ivan Palomares Carrascosa, Daniele Quercia and Luca Maria Aiello with ‘Automating food aggregation for nutrition and health research’
  • Avon Huxor, Emma Turner, Eleanor Walsh and Raul Santos-Rodriguez with ‘Elements of free text used in decision making: an exemplar from death reviews in prostate cancer and learning disabilities’
  • Jim Dunham, Gethin Williams, Nathan Lepora, Tony Pickering and Manuel Martinez Perez with ‘Decoding pain: development of a clinical tool to enable real-time data visualisation and analysis of human pain nerve activity’
  • Ranjeet Bhamber, Andrew Dowsey, Febe Van Maldegem and Julian Downward with ‘Super-charging single cell imaging pathology’
  • Elaine McGirr and Julian Warren with ‘Mapping Oliver Messel’
  • Liz Washbrook with ‘Mental health and educational achievement in two national contexts: a machine learning approach’
  • Ella Gale, Natalie Fey, Craig Butts and Varinder Aggarwal with ‘Chemspeed data capture and curation’
  • Pierangelo Gobbo and Lars Bratholm with ‘Machine learning assisted polymer design’.

We will be interested to hear how all these projects progress this year and will report back on their progress in the summer of 2020. Our next Seed corn funding call will be in the autumn of 2020.

To ensure you keep up to date with any other funding calls, news, events and other opportunities, please join the JGI mailing list.

A Challenge Owner’s perspective of the inaugural Turing Network Data Study Group

The DSG challenge team with Danielle Paul (far right)

Using AI and machine learning to increase understanding of cardiac muscle proteins – the molecular basis of heart disease – is a potentially daunting challenge. But it was one that Turing Fellow Danielle Paul, from the School of Physiology, Pharmacology & Neuroscience at the University of Bristol, was keen to explore. Danielle took part in the first Turing Network Data Study Group, hosted by the Jean Golding Institute at the University of Bristol – one of The Alan Turing Institute’s 13 partner universities. The event united six Challenge Owners with 50 students, postdocs and senior academics to tackle real-world data science challenges spanning a variety of fields, from spectroscopy and analytical chemistry to text mining and digital humanities.

Building on the popular Data Study Groups (DSGs), held three times a year at Turing HQ in London, this ‘Turing Network’ event was the first of its kind to be hosted by a partner university. It followed the tried-and-tested format of a five-day collaborative hackathon. The Challenge Owners – organisations from industry, government and the third sector – provided real-world data challenges that were tackled by small groups of highly talented researchers. The results were presented on the final day.

Here, Danielle tells us about her experience of the DSG, and the challenge she presented: Applying AI and machine learning to reveal the molecular basis of heart disease.

What was your challenge about? 

An example of the image data the challenge team were working on

This was an image-processing challenge with potential outcomes that could improve our fundamental understanding of cardiac muscle proteins. The proteins in our images are susceptible to mutations that cause hypertrophic cardiomyopathy, which is a known cause of adult sudden death.

To obtain high-resolution molecular models of these proteins we need to collect hundreds of thousands of images of our protein from noisy data obtained via cryogenic electron microscopy (cryo-EM). It took us six months to manually annotate the small dataset we used in the DSG. It’s a laborious process and it highlights the pressing need for a robust, machine-based approach. If we can automate the protein-identification step, it would overcome a significant bottleneck in our image-processing workflow.

What solutions did the challenge team generate?

The team implemented several deep learning algorithms, known as Convolutional Neural Networks, which were then trained to recognise our proteins in the images. The automated methods that they presented to us performed very well, achieving as much as 90% accuracy on testing.

What are your hopes for potential applications of the findings from the week?

The protein hunters: the challenge team hard at work finding a way to automate a laborious image-processing task

I hope that the various methods that were implemented and trained using our challenge dataset can now be put through their paces on our larger datasets. It will be interesting to see how they perform with slightly different data and imaging conditions. This methodology will be built upon as part of a Turing-funded research project to make a more general software tool to identify proteins in cryo-EM images.

As a Challenge Owner, what was your favourite part of the Turing Network DSG?

My favourite part was having discussions with the participants, in particular hearing the ideas and thoughts they had in response to the problems described in our image-processing workflow. I appreciate how much effort they put in and their enthusiasm towards addressing the challenge.

Were there any surprises during the event?

That the participants were keen to do more at the end!

Find out more about Data Study Groups, including how you can get involved as a researcher or Challenge Owner

The Alan Turing Institute 

The Alan Turing Institute’s goals are to undertake world-class research in data science and artificial intelligence, apply its research to real-world problems, drive economic impact and societal good, lead the training of a new generation of scientists and shape the public conversation around data. 

Find out more about The Alan Turing Institute. 

A brief introduction to colliders

Blog written by Sean Roberts, Research Associate, Anthropology and Archaeology, University of Bristol

Causal graphs

Causal graphs are great ways of expressing your idea about how the world works. They can also help you design your analyses, including choosing what you need to control for in order to exclude alternative explanations for the patterns you find. Standard methods about how to choose control variables are often vague (Pearl & Mackenzie, 2018), and many assume that controlling for more variables makes the central test more robust. However, controlling for some variables can create spurious correlations due to colliders. This is very worrying! But if we draw our ideas as causal graphs, then we can spot colliders and try to avoid them.

Pearl & Mackenzie (2018) talk about causality and correlation in terms of the flow of information. Causal effects ‘flow’ between nodes following the causal arrows, but correlation can flow both ways. In the example below, taking a certain medicine might affect your health. But your age might affect both your health and whether you remember to take the medicine. In this case, there might be a correlation between taking medicine and health either because of the causal connection (the causal path), or because of the confounding correlational path through age (the non-causal path):

In a randomised controlled experiment, the non-causal path between taking medicine and health can be blocked by intervening and randomly assigning who takes the medicine (and in this hypothetical example, ensuring that they do take the medicine). That is, the only thing that decides whether the medicine is taken is our experimenter’s random decision, so age no longer influences it. This means that the only path which connects medicine and health (in our hypothesis) is the causal path that we are interested in.

Blocking non-causal paths can also be done by ‘conditioning’, for example controlling for the effect of age in a statistical regression:
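As a rough illustration of conditioning, the R sketch below simulates the medicine, age and health example. The effect sizes are assumptions chosen for the demonstration (the medicine has a true effect of +1 on health, and age pushes up medicine-taking while lowering health); they are not estimates from any real data.

```r
set.seed(42)
n <- 5000

# Assumed toy world: age influences both taking the medicine and health,
# and the medicine has a true causal effect of +1 on health.
age      <- rnorm(n)
medicine <- 0.8 * age + rnorm(n)
health   <- 1.0 * medicine - 2.0 * age + rnorm(n)

# Without conditioning, the non-causal path through age biases the estimate.
naive <- unname(coef(lm(health ~ medicine))["medicine"])

# Conditioning on age blocks that path and recovers the causal effect.
adjusted <- unname(coef(lm(health ~ medicine + age))["medicine"])

print(round(c(naive = naive, adjusted = adjusted), 2))
```

In this simulated world the unadjusted coefficient lands near zero, because the confounding path almost cancels the real effect, while the age-adjusted coefficient is close to the true value of 1.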

The final thing that can block a correlation path is a collider. A collider is a node on a causal path with two causal links (arrows) pointing into it. In the graph below, X and Y affect Z. We wouldn’t expect a correlation between X and Y, because that path is blocked by the collider at Z.

Colliders behave differently to other causal structures. Below are the four different types of connection between three nodes (excluding ones where X and Y are connected). In the first three, there is a path between X and Y, so X and Y should be correlated.  This is the root of one of the central problems in research: we cannot tell the first three systems apart just by observation. We would have to manipulate one of the variables (e.g. in an intervention study) and see whether it had an effect.

In the first two graphs, Z is a ‘pipe’. It connects X and Y. If we were to intervene or control for Z, then the relationship between X and Y would be broken and there should be no correlation. In a statistical framework, we would not want to control for Z. In the third system, X and Y are correlated due to a ‘common cause’ or ‘fork’ in Z. However, the behaviour is the same: X and Y should be correlated except for when we control for Z.
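The behaviour of pipes and forks can be checked with a small simulation. This is a sketch under assumed linear relationships with unit effects and Gaussian noise; the variable names and the helper `x_coef` are illustrative.

```r
set.seed(7)
n <- 5000

# Pipe (chain): X -> Z -> Y
x_pipe <- rnorm(n)
z_pipe <- x_pipe + rnorm(n)
y_pipe <- z_pipe + rnorm(n)

# Fork (common cause): X <- Z -> Y
z_fork <- rnorm(n)
x_fork <- z_fork + rnorm(n)
y_fork <- z_fork + rnorm(n)

# Coefficient of X when predicting Y, without and with Z in the model.
x_coef <- function(y, x, z = NULL) {
  fit <- if (is.null(z)) lm(y ~ x) else lm(y ~ x + z)
  unname(coef(fit)["x"])
}

results <- rbind(
  pipe = c(without_z = x_coef(y_pipe, x_pipe),
           with_z    = x_coef(y_pipe, x_pipe, z_pipe)),
  fork = c(without_z = x_coef(y_fork, x_fork),
           with_z    = x_coef(y_fork, x_fork, z_fork))
)
print(round(results, 2))
```

In both structures X predicts Y on its own, and the coefficient for X shrinks towards zero once Z is included in the model.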

The final example is a ‘collider’, and it is different from the rest. Here, X and Y are not correlated except for when we control for Z, at which point they will become correlated (Elwert 2013).

To help understand this, imagine that you and I are working at a waterworks. We can each control the rate of flow in our pipe (X and Y) and our pipes are connected so that the final rate of flow is the combined rate from each of us, Z. I can turn my cog independently of you, and vice versa.

If you turn your rate up it increases the flow in Z, but that has no effect on me, so our rates (X and Y) should not be correlated.  However, then our manager calls and tells us that we have to maintain a certain rate of flow in Z (they are conditioning or fixing Z). Now, if I turn my rate up, then you have to turn your rate down to maintain the required rate in Z. And if you turn your rate down, I have to turn my rate up.  So someone observing X and Y would see a correlation.  That is, X and Y are not correlated, except for when conditioning on Z.

So, correlations will be blocked by a collider unless we control for it, at which point the correlation path ‘opens up’ again.
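The waterworks example can be simulated directly. In this sketch the flow rates are drawn uniformly between 0 and 10, and "conditioning on Z" is approximated by keeping only the observations where the combined flow falls in a narrow band; both the range and the band are arbitrary assumptions.

```r
set.seed(123)
n <- 5000

# Two operators set their flow rates independently; Z is the combined flow.
x <- runif(n, min = 0, max = 10)
y <- runif(n, min = 0, max = 10)
z <- x + y

# Unconditionally, X and Y are (close to) uncorrelated.
cor_all <- cor(x, y)

# Conditioning on Z: keep only observations where the total flow is near
# the level the manager requires (the band 9.5 to 10.5 is arbitrary).
keep <- z > 9.5 & z < 10.5
cor_fixed_z <- cor(x[keep], y[keep])

print(round(c(cor_all = cor_all, cor_fixed_z = cor_fixed_z), 2))
```

Across all observations the correlation is close to zero, while within the fixed-Z band it is strongly negative: when one rate goes up, the other must come down.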

Here’s another example. Imagine we’re filling sacks with potatoes and carrots. The weight depends independently on each vegetable, and there’s no correlation between them. But if we split the observations by weight, then the number of potatoes predicts the number of carrots.

The behaviour of colliders means that controlling for some variables can lead to spurious correlations (Elwert & Winship 2014, Ding & Miratrix 2015, Westfall & Yarkoni 2016, Rohrer 2018, Middleton et al 2016, York 2018).

Considering colliders is important when deciding which variables to control for in a statistical test. Imagine that we’re investigating reaction times (RT) in reading, and we have measured the length of a word, reaction times for reading the word, the frequency of the word and the word’s valence (the degree of pleasantness of the meaning of the word). We’re interested in testing whether word length affects reaction time. What should we control for? Let’s say that we have the following hypotheses: frequency is affected by length (Zipf’s law, Zipf 1935) and valence (Pollyanna hypothesis, Boucher & Osgood 1969), and valence affects reaction time (e.g. Kuperman, Stadthagen-Gonzalez & Brysbaert 2012):

Frequency is actually a collider along the path from length to valence. This means that, although there is a non-causal path between length and RT, the flow of information is blocked. In this case, we should not control for frequency in our statistical model, since doing so would cause length and valence to become correlated, opening up a non-causal path from length to RT.

This is not a hypothetical problem. It will genuinely affect real analyses. For example, in the R code below, we create some artificial data generated by a world where there is no causal path between length and RT:



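# Simulate a world with no causal path between length and RT
n <- 200
length <- sample(1:7, n, replace = TRUE)   # word length
valence <- sample(1:7, n, replace = TRUE)  # pleasantness of the word's meaning
freq <- length + valence + rnorm(n)        # collider: length -> freq <- valence
RT <- valence + rnorm(n)                   # reaction time depends only on valence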


We can run a statistical model, predicting reaction time by length and we see that there is no significant correlation (as we expected):


summary(lm(RT ~ length))

            Estimate Std. Error t value Pr(>|t|)
length      -0.03436    0.07971  -0.431    0.667


However, when we add frequency as an independent variable, suddenly both length and frequency are significantly correlated with RT:


summary(lm(RT ~ length + freq))

            Estimate Std. Error t value Pr(>|t|)
length      -0.83004    0.06520 -12.730   <0.001 ***
freq         0.85081    0.04647  18.310   <0.001 ***


Of course, this may not be the correct or complete causal model in the real world. There are many possible models (some of them don’t have colliders and so you should control for frequency). The point is that your hypothesis should affect the design of your statistical model or your experimental manipulations. Throwing all your variables into the model may actually worsen your ability to infer what’s happening.

It is therefore vital to clearly define our prior hypotheses, and causal graphs are an excellent way of doing this.

As Pearl & Mackenzie explain, we can use causal graphs to identify variables that we should control for. If we’re interested in the effect of X on Y, then we might be able to calculate:

  1. The observational relationship between X and Y (e.g. probability of observing Y given X).
  2. The state of Y if we were to manipulate X (an intervention)

A confounding variable is anything that leads to a difference between these two calculations. To remove confounding, we need to block every non-causal path without blocking any causal paths (block any ‘back door paths’). This means we should control for any variable Z on a non-causal path from X to Y that starts with an arrow pointing to X and where Z is not a descendant of X (there’s no way to get from X to Z following causal paths, see Shrier & Platt 2008).

In the example below, there is a path from X to Y through Z1 and Z2, so we should control for either Z1 or Z2 (or both) in order to block this non-causal path.

Other examples can become more complicated. For example, in the graph below, there is a non-causal path that needs to be closed: X – Z2 – Z3 – Y. However, controlling for Z2 creates a correlation between Z1 and Z3, opening a new non-causal path. In this case we should control for Z3 rather than Z2.

Tools like Dagitty have algorithms for calculating the options for which variables should be controlled for. The Causal Hypotheses in Evolutionary Linguistics Database (CHIELD) is a database of causal claims with tools for exploring connections between them. Graphs in CHIELD can be exported to Dagitty.

A causal approach to research asks us to be brave and make commitments about how we think the world works. If we do this, we might be able to make better decisions and to extract a lot more from our data than we expected.


CHIELD is a searchable database of causal hypotheses in evolutionary linguistics. It has tools for expressing and exploring hypotheses. Anyone can contribute, and the code is open source for other fields to build their own databases. Contact sean.roberts@bristol.ac.uk for more information.

Recommended reading

Rohrer, J. M. (2017). Thinking Clearly About Correlations and Causation: Graphical Causal Models for Observational Data.

Pearl, J., & Mackenzie, D. (2018). The book of why: the new science of cause and effect. Basic Books.

Roberts, S. (2018). Robust, causal and incremental approaches to investigating linguistic adaptation. Frontiers in Psychology, 9, 166.


Flying far from the nest – the biggest adventure or a mental health disaster?

A blog post by Angharad Stell, PhD student, Atmospheric Chemistry Research Group, School of Chemistry, University of Bristol

Mental health at university has been making headlines for all the wrong reasons. Every day seems to bring a new shocking article:

  • “one in four students suffer from mental health problems” [1]
  • “student suicide increase warning” [2]
  • “mental health: a university crisis” [3]

Whilst there are plenty of scary numbers out there, there is little knowledge of the cause. Perhaps if we can understand that, we can combat the issue more effectively. Here, data science is used to investigate whether moving away from family to a new environment has an impact on students’ loneliness.

University is often a young adult’s first move away from their family and the area they grew up in. This movement is visible in the UK’s internal migration data shown opposite, with peaks above the underlying trend for when students arrive and leave university.

Different universities will encourage different amounts of movement: a better or more specialised university will attract students from further away. So, if we compare prescriptions for mental health conditions (depression and social anxiety) across universities, using the GP nearest to each one, we should be able to see differences related to student movement.

However, there will be many other factors that might affect loneliness at different universities, which we will have to try to take account of:

  • Number of students – could large institutes and the associated anonymity be a cause of loneliness? Or could small institutes mean there is less chance to meet friends? It will also be harder to pick up smaller institutes in the data.
  • Proportion of UK students compared to international – coming from abroad is the biggest move you can make; does this impact loneliness?
  • University Ranking – are higher ranking institutes pressure cookers for young people? Here, the entry requirements, graduate prospects, and student satisfaction are used. Research quality has been left out as we are only looking at undergraduates. The other three criteria were not combined, as they measure quite different things.
  • Widening Participation – do students from less typical backgrounds find the transition harder? Here, the numbers of students that come from a state school, come from a low participation area, or receive Disabled Students’ Allowance (DSA), each compared to what it would be in a socially equal admissions scenario, are combined to form one indicator.

As the GP data is chosen by matching each university to its nearest GP, other factors will affect the data, since each practice serves a varying mix of students and the general population. For example, the University of Exeter’s nearest GP specialises in caring for the homeless, and the University of Salford’s cares for those in nursing homes. Therefore, an attempt is made to consider these external factors:

  • Deprivation of the area – do worse off areas have poorer mental health?
  • Median age – more young people are lonely [4], and suffer from mental health conditions [5]

The Sutton Trust has a dataset recording how far students have moved, or commute, from the family home to go to university. There are three distance groups: short (0-57 miles from home to university), medium (57-152 miles), and long (more than 152 miles). The complete dataset examined combines this with the university characteristics, and with the characteristics and mental health condition prescriptions of each university’s nearest GP. Once all the data is collected, the fun bit can begin: data analysis!

Visualising the Data

First, let’s examine the correlation matrix, shown on the right. Strong correlations are shown in red, strong anti-correlations are blue, and weak correlations are shown in the paler shades. There are some obvious relations: a better ranked university has more international students, and more movers. There is also a nod to the social inequality of higher ranked institutions taking fewer non-typical background students. Interestingly, student satisfaction is a poor indicator of everything!

Higher rates of mental health conditions appear to be related to:

  1. Higher number of students at a university
  2. Higher number of medium movers (lower numbers of short commuters)
  3. Higher university ranking (entry standards and graduate prospects)
  4. Lower median age

Points 1 and 4 likely just show that young people have higher rates of mental health issues, but 2 and 3 could be interesting. However, the relations are unlikely to be nice and simple linear ones, so just looking at this correlation matrix is not enough. Let’s try a better way to visualise the data.

Dimension Reduction

Humans are not good at dealing with graphs in more than two dimensions, let alone the 16 in this dataset. So, let’s reduce the dimensionality using a dimension-reducing algorithm (t-SNE). This algorithm attempts to plot points that are similar close to each other, and points that are dissimilar far apart. The result can be seen in the figure below, where each subplot is coloured by the values of the title variable (zero indicates the lowest value of the variable, and one is the highest). From the distribution of the points, there appear to be two clusters: a smaller one at the top left and a larger one at the bottom. From the colourings, there does also seem to be structure in the bottom cluster, so perhaps this can be further divided.


Clustering can be very fickle, especially for real datasets with noise, outliers, and clusters that are not nice circular clumps. So, four clustering algorithms have been put through their paces here, and the best one for our needs selected: K-means (though agglomerative hierarchical clustering gives a very similar answer and leads to the same inferences).
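For readers who want to try a similar comparison, here is a minimal base-R sketch of checking whether K-means and agglomerative hierarchical clustering agree. It runs on simulated stand-in data, not the dataset analysed in this post, and the number of clusters, the Ward linkage and the helper names are assumptions made for the example.

```r
set.seed(2020)

# Simulated stand-in for the university/GP table: three groups, four variables.
# (The real dataset has 16 variables and is not reproduced here.)
make_group <- function(n, centre) {
  matrix(rnorm(n * length(centre), mean = rep(centre, each = n), sd = 0.5),
         nrow = n)
}
raw <- rbind(make_group(40, c(0, 0, 0, 0)),
             make_group(40, c(4, 4, 0, 0)),
             make_group(40, c(0, 4, 4, 4)))

# Rescale every variable to [0, 1] so no single variable dominates distances.
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
scaled <- apply(raw, 2, min_max)

# K-means with three clusters and several random restarts.
km <- kmeans(scaled, centers = 3, nstart = 25)

# Agglomerative hierarchical clustering (Ward linkage), cut into three clusters.
hc <- cutree(hclust(dist(scaled), method = "ward.D2"), k = 3)

# Cross-tabulate the two labellings; one large count per row means agreement.
agreement <- table(kmeans = km$cluster, hierarchical = hc)
print(agreement)
```

Cluster labels are arbitrary, so agreement shows up as each row of the table having a single dominant cell rather than as matching label numbers.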

We can look at how the algorithm clustered the points on the reduced dimension projection in the plot on the right. We have here a top, bottom left and bottom right cluster, which seems to mainly conform to the structure that the dimension reduction algorithm suggested.

Now, let’s see what we can learn from these clusters. In the plot below, each cluster’s density is shown by the filled areas, over the 16 variables. The black lines show each cluster’s median value. On the x-axis, zero is the variable’s lowest value in the dataset, and one is the highest.

The top green cluster has the following characteristics:

  • Highest number of students
  • Lowest number of UK origin students (highest numbers of international students)
  • Highest numbers of medium movers and lowest numbers of short commuters
  • Highest entry standards and graduate prospects
  • Highest rates of depression and social anxiety
  • Lowest median age

The very low median age, high number of students, and high number of movers (commuters are unlikely to show up at the GP nearest their university) suggest that these GPs have an abnormally high proportion of students, which would explain the highest rates of depression and social anxiety. In fact, a quick Google search suggests that many of the GPs with the highest rates of depression and social anxiety specialise in student health.

The remaining two clusters have similar higher median age distributions, making them more comparable, as they likely have similar numbers of students relative to the general population. There are differences in the number of medium movers, number of short commuters, and deprivation, but these clusters have similar distributions of depression and social anxiety. This suggests that the distance moved to university does not affect the rate of mental health issues, or offsets the effect of deprivation.


Another method we can use to understand the dataset is regression modelling. After testing a few methods, a decision tree seemed to work best. In the plot below, starting at the top box, is the first statement true or false? Proceed to the next box based on your answer (left for true, right for false), and repeat until you get to the final box in your chain.

As in the clustering, the first split is in terms of median age: young people have more mental health issues than older people. In the high density of students branch (left-hand side), further division is done on the data from the university rankings: higher graduate prospects lead to higher rates of mental health problems. That group is then subdivided by student satisfaction: worse student satisfaction gives higher rates of mental illness.
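To give a feel for how a regression tree arrives at a first split like this, here is a small base-R sketch of the usual criterion: pick the variable and threshold that minimise the summed squared error of the two resulting groups. The data are simulated so that rates jump below a median age of 23.5, and it is an assumption that the tree used in this analysis follows the same criterion; the helper names are illustrative.

```r
set.seed(99)
n <- 300

# Simulated GP-level data (illustrative only): rates are higher where the
# median age is low, loosely mimicking the pattern described above.
median_age   <- runif(n, min = 20, max = 50)
satisfaction <- runif(n)                      # unrelated noise variable
rate <- ifelse(median_age <= 23.5, 0.8, 0.3) + rnorm(n, sd = 0.05)
dat <- data.frame(median_age, satisfaction)

# Sum of squared errors around the group mean.
sse <- function(v) sum((v - mean(v))^2)

# Best threshold on one variable: minimise SSE(left) + SSE(right).
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbours
  cost <- sapply(cuts, function(cut) sse(y[x <= cut]) + sse(y[x > cut]))
  list(threshold = cuts[which.min(cost)], cost = min(cost))
}

# The first split is the variable and threshold with the lowest cost.
splits <- lapply(dat, best_split, y = rate)
first_var <- names(which.min(sapply(splits, function(s) s$cost)))
print(first_var)
print(splits[[first_var]]$threshold)
```

A full tree repeats this search inside each resulting group until a stopping rule is met, which is how the later splits on graduate prospects and student satisfaction would arise.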

High Student Density Areas

From the above analysis, the main difference in mental health issues is driven by the number of students in an area, rather than the distance the students moved. If we take a quick look within the K-means top green cluster (the high density of students one), and at those GPs with a median age of less than or equal to 23.5 (as suggested by the first split in the decision tree), perhaps we can see differences within this small group where the student signal is clearest. Looking at the correlation matrices (see below), most of the initial correlation of the mental health conditions with medium movers and short commuters disappears. This tells us that the initial correlation comes from better universities having more students that move into the local area, and students in general suffering from more mental health issues.

In the low median age group, correlation between the mental health conditions and graduate prospects and student satisfaction is present (as seen by the divisions in the decision tree). Median age’s correlation is still there too, though perhaps this just continues to show that better universities attract more people to move to the local area, hence lowering the median age and pushing up the rates of mental health conditions.

For the high student density cluster, though, a fairly strong anti-correlation between the mental health conditions and the number of long movers has appeared, as well as a strong correlation with widening participation and UK origin. When plotted (see below), these relationships are not very convincing, and since this is now a small sample, it is hard to identify the outliers.


Students in general suffer from mental health conditions more than the rest of the general population, and the distance moved from their family does not have a major impact.

There is perhaps a hint of a relationship with long distance movers within the high density of students cluster, but this is too small a sample to draw conclusions from. In order to better explore this, different data would be beneficial: actual distance measures rather than short, medium, and long moves. In addition, a good way to remove the noise of the general population from the GP data would be to look at the rates of usage of university counselling services instead. This would only include students and would also include the commuters, who will just be too hard to find within the noise of the general population in the current dataset.

In terms of the immediately available datasets, it would also be good to look at monthly data: are there more prescriptions during exams? Or is it worse at the start of the academic year as students move away for the first time?

Overall, we have seen the power of data science. What first looked like an interesting correlation between mental health conditions and more students moving rather than commuting, has been shown to be misleading. Whilst this work’s conclusions did not succeed in revolutionising the university mental health care system, perhaps with more university specific data, this kind of approach could reveal interesting relationships, and be the beginnings of a solution.


  1. https://yougov.co.uk/topics/lifestyle/articles-reports/2016/08/09/quarter-britains-students-are-afflicted-mental-hea
  2. https://www.bbc.co.uk/news/education-43739863
  3. https://www.theguardian.com/education/series/mental-health-a-university-crisis
  4. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/771482/Community_Life_Survey_Focus_on_Loneliness_201718.pdf
  5. https://epi.org.uk/publications-and-research/prevalence-of-mental-health-issues-within-the-student-aged-population/


A special thanks goes to Matthew Boyd, who helped me find data and questioned my logic.

The Jean Golding Institute Data Competition

This project was one of the runners-up in our recent ‘Loneliness for Education’ competition. We run various data competitions throughout the year – find out more