Flying far from the nest – the biggest adventure or a mental health disaster?

A blog post by Angharad Stell, PhD student, Atmospheric Chemistry Research Group, School of Chemistry, University of Bristol

Mental health at university has been making headlines for all the wrong reasons. Every day seems to bring a new shocking article:

  • “one in four students suffer from mental health problems” [1]
  • “student suicide increase warning” [2]
  • “mental health: a university crisis” [3]

Whilst there are plenty of scary numbers out there, there is little knowledge of the cause. Perhaps if we can understand that, we can combat the issue more effectively. Here, data science is used to investigate whether moving away from family to a new environment has an impact on students’ loneliness.

University is often a young adult’s first move away from their family and the area they grew up in. This movement is visible in the UK’s internal migration data shown opposite, with peaks above the underlying trend for when students arrive and leave university.

Different universities will encourage different amounts of movement: a better or more specialised university will attract students from further away. So, if we compare prescriptions for mental health conditions (depression and social anxiety) at the GP nearest to each university, we should be able to see differences related to student movement.

However, there will be many other factors that might affect loneliness at different universities, which we will have to try to take account of:

  • Number of students – could large institutes and the associated anonymity be a cause of loneliness? Or could small institutes mean there is less chance to meet friends? It will also be harder to pick up smaller institutes in the data.
  • Proportion of UK students compared to international – coming from abroad is the biggest move you can make, does this impact loneliness?
  • University Ranking – are higher ranking institutes pressure cookers for young people? Here, the entry requirements, graduate prospects, and student satisfaction are used. Research quality has been left out as we are only looking at undergraduates. The other three criteria were not combined, as they measure quite different things.
  • Widening Participation – do students from less typical backgrounds find the transition harder? Here, the number of students that come from a state school, come from a low participation area, or receive Disabled Students’ Allowance (DSA), compared to what it would be in a socially equal admissions scenario, are combined to form one indicator.

As the GP data is chosen by matching each university to its nearest GP, it will reflect a varying mix of students and the general population, so other factors will affect it. For example, the University of Exeter’s nearest GP specialises in caring for the homeless, and the University of Salford’s cares for those in nursing homes. Therefore, an attempt is made to consider these external factors:

  • Deprivation of the area – do worse off areas have poorer mental health?
  • Median age – more young people are lonely [4], and suffer from mental health conditions [5]

The Sutton Trust has a dataset containing how far students have moved, or commute from the family home, to go to university. There are three distance groups: short (0-57 miles from home to university), medium (57-152 miles), and long (more than 152 miles). The complete dataset examined is a combination of this, the university characteristics, and its nearest GP mental health conditions prescriptions and characteristics. Once all the data is collected, the fun bit can begin: data analysis!

Visualising the Data

First, let’s examine the correlation matrix, shown on the right. Strong correlations are shown in red, strong anti-correlations are blue, and weak correlations are shown in the paler shades. There are some obvious relations: a better ranked university has more international students, and more movers. There is also a nod to the social inequality of higher ranked institutions taking fewer non-typical background students. Interestingly, student satisfaction is a poor indicator of everything!
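As a rough illustration of what sits behind a correlation matrix like this one, here is a minimal sketch of the Pearson correlation coefficient computed for every pair of variables. The column names and values in `toy` are made up for illustration and are not the dataset used in this post.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


# Hypothetical toy data: one value per university's nearest GP.
toy = {
    "median_age": [22.0, 24.0, 35.0, 41.0, 38.0],
    "depression_rate": [0.09, 0.08, 0.05, 0.04, 0.05],
    "student_satisfaction": [4.1, 3.9, 4.0, 4.2, 3.8],
}
matrix = correlation_matrix(toy)
```

Values near +1 would be drawn in strong red, values near -1 in strong blue, and values near zero in the paler shades. The coefficient only measures linear association, which is why the matrix alone is not enough.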

Higher rates of mental health conditions appear to be related to:

  1. Higher number of students at a university
  2. Higher number of medium movers (lower numbers of short commuters)
  3. Higher university ranking (entry standards and graduate prospects)
  4. Lower median age

1 and 4 likely just show that young people have higher rates of mental health issues, but 2 and 3 could be interesting. However, the relations are unlikely to be nice and simple linear ones, so just looking at this correlation matrix is not enough. Let’s try a better way to visualise the data.

Dimension Reduction

Humans are not good at dealing with graphs in more than two dimensions, let alone the 16 in this dataset. So, let’s reduce the dimensionality using a dimension-reducing algorithm (t-SNE). This algorithm attempts to plot points that are similar close to each other, and points that are dissimilar far apart. The result can be seen in the figure below, where each subplot is coloured by the values of the title variable (zero indicates the lowest value of the variable, and one is the highest). From the distribution of the points, you can see there appear to be two clusters: a smaller one at the top left and a larger one at the bottom. From the colourings, there also seems to be structure in the bottom cluster, so perhaps this can be further divided.


Clustering can be very fickle, especially for real datasets with noise, outliers, and clumps that are not nicely circular. So, four clustering algorithms have been put through their paces here, and the best one for our needs selected: K-means (though agglomerative hierarchical clustering gives a very similar answer and leads to the same inferences).
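For readers who have not met K-means before, the following is a minimal sketch of the basic algorithm (Lloyd's algorithm) in plain Python. It is not the code used for this analysis. The farthest-first initialisation is my own assumption, chosen only to keep the example deterministic, and the nine points are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic start: take the first point, then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm. Returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical points forming three tight, well-separated clumps.
points = [
    (0.00, 0.00), (0.05, 0.02), (0.02, 0.06),
    (1.00, 0.00), (0.95, 0.04), (1.02, 0.05),
    (0.50, 1.00), (0.52, 0.95), (0.47, 1.03),
]
centroids, labels = kmeans(points, k=3)
```

Because K-means assigns points by distance to a centroid, it favours roughly circular clumps. That is one reason it is worth comparing it against other algorithms on real data.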

We can look at how the algorithm clustered the points on the reduced dimension projection in the plot on the right. We have here a top, bottom left and bottom right cluster, which seems to mainly conform to the structure that the dimension reduction algorithm suggested.

Now, let’s see what we can learn from these clusters. In the plot below, each cluster’s density is shown by the filled areas, over the 16 variables. The black lines show each cluster’s median value. On the x-axis, zero is the variable’s lowest value in the dataset, and one is the highest.
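The zero-to-one axis described above is a min-max rescaling of each variable. A minimal sketch, with made-up median ages as input:

```python
def min_max_scale(values):
    """Rescale values so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical median ages for three GP areas.
scaled = min_max_scale([20, 30, 40])
```

Rescaling each of the 16 variables this way puts them on a common axis, so their distributions can be compared side by side even though their original units differ.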

The top green cluster has the following characteristics:

  • Highest number of students
  • Lowest number of UK origin students (highest numbers of international students)
  • Highest numbers of medium movers and lowest numbers of short commuters
  • Highest entry standards and graduate prospects
  • Highest rates of depression and social anxiety
  • Lowest median age

The very low median age, high number of students, and high number of movers (commuters are unlikely to show up at the GP nearest their university) suggest that these GPs have an abnormally high proportion of students among their patients, which would explain the highest rates of depression and social anxiety. In fact, a quick Google search suggests that many of the GPs with the highest rates of depression and social anxiety specialise in student health.

The remaining two clusters have similar higher median age distributions, making them more comparable, as they likely have similar numbers of students relative to the general population. There are differences in the number of medium movers, number of short commuters, and deprivation, but these clusters have similar distributions of depression and social anxiety. This suggests that the distance moved to university does not affect the rate of mental health issues, or offsets the effect of deprivation.


Another method we can use to understand the dataset is regression modelling. After testing a few methods, a decision tree seemed to work best. In the plot below, starting at the top box, is the first statement true or false? Proceed to the next box based on your answer, (left for true, right for false), repeat until you get to the final box in your chain.

As in the clustering, the first split is in terms of median age: young people have more mental health issues than older people. In the high density of students branch, (left hand side), further division is done on the data from the university rankings: higher graduate prospects leads to higher rates of mental health problems, and subdividing that group again is done by student satisfaction: worse student satisfaction gives higher rates of mental illness.
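To give a feel for how a regression tree picks a split such as the one on median age, here is a minimal sketch of the search for a single best threshold. It tries each midpoint between neighbouring values and keeps the one that minimises the summed squared error of the two resulting groups. The numbers are invented, and a real tree repeats this search over every variable and every branch.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(feature, target):
    """Return (threshold, total_sse) for the rule `feature <= threshold`
    that minimises the squared error of the two resulting groups."""
    pairs = sorted(zip(feature, target))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: median age of a GP area and its depression rate.
median_age = [20, 21, 22, 26, 35, 38, 40, 42]
depression_rate = [0.10, 0.09, 0.10, 0.05, 0.05, 0.04, 0.05, 0.04]
threshold, error = best_split(median_age, depression_rate)
```

In this toy data the three youngest areas have clearly higher rates, so the chosen threshold falls between ages 22 and 26.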

High Student Density Areas

From the above analysis, the main difference in mental health issues is caused by the number of students in an area, rather than the distance the students moved. If we take a quick look within the K-means top green cluster, (the high density of students one), and those with a median age of less than or equal to 23.5 (as suggested by the first split in the decision tree), perhaps we can see differences within this small group where the student signal is clearest. Looking at the correlation matrices, (see below), most of the initial correlation of the mental health conditions with medium movers and short commuters disappears. This tells us that the initial correlation comes from better universities having more students that move into the local area, and students in general suffering from more mental health issues.

In the low median age group, correlation between the mental health conditions and graduate prospects and student satisfaction is present, (as seen by the divisions in the decision tree). Median age’s correlation is still there too though, perhaps this is just continuing to show that better universities attract more people to move to the local area, hence lowering the median age, and pushing up the rates of mental health conditions.

For the high student density cluster, though, a fairly strong anti-correlation between the mental health conditions and the number of long movers has appeared, as well as a strong correlation with widening participation and UK origin. When plotted (see below), these are not very convincing, and since this is now a small sample, it is hard to identify the outliers.


Students in general suffer from mental health conditions more than the rest of the general population, and the distance moved from their family does not have a major impact.

There is perhaps a hint of a relationship with long distance movers within the high density of students cluster, but this is too small a sample to draw conclusions from. In order to better explore this, different data would be beneficial: actual distance measures rather than short, medium, and long moves. In addition, a good way to remove the noise of the general population from the GP data would be to look at the rates of usage of university counselling services instead. This would only include students and would also include the commuters, who will just be too hard to find within the noise of the general population in the current dataset.

In terms of the immediately available datasets, it would also be good to look at monthly data: are there more prescriptions during exams? Or is it worse at the start of academic year as students move away for the first time?

Overall, we have seen the power of data science. What first looked like an interesting correlation between mental health conditions and more students moving rather than commuting, has been shown to be misleading. Whilst this work’s conclusions did not succeed in revolutionising the university mental health care system, perhaps with more university specific data, this kind of approach could reveal interesting relationships, and be the beginnings of a solution.




A special thanks goes to Matthew Boyd, who helped me find data and questioned my logic.

The Jean Golding Institute Data Competition

This project was one of the runners-up in our recent ‘Loneliness for Education’ competition. We run various data competitions throughout the year.

The Beauty of Data 2019 – A JGI data visualisation competition

We are excited to announce that the winner of the 2019 JGI Beauty of Data competition is Vincent Cheng from Population Health Sciences with his visualisation project ‘Automated Forward Citation Snowballing using Google Scholar and Machine Learning’.

The winning visualisation is a short video, and you can view the full submission below:


About the winning project

This is an exploratory project that aims to understand how studies are being cited in Google Scholar, and to explore its application to evidence searches in a systematic review. In a systematic review, searching for studies is one of the most crucial steps. Although Google Scholar acts as a comprehensive database, its searching criteria and processing are not reproducible and transparent for conducting systematic reviews. This project uses a visualisation of a citation pattern in Google Scholar from a forward snowballing exercise (identifying new studies based on those papers citing the study being examined).

In the video, each node represents a search result (study) from Google Scholar. The size of a node represents the number of times a study has been cited. The width of an edge represents the number of duplicates. The visualisation starts from the search “Lassa fever ribavirin” on Google Scholar, with the first 10 search results extracted as a start set (Iteration 0; n=10 as shown in the video). A trained machine learning model then selected potential studies based on the title, abstract, authors and journal of each study for the forward snowballing process. In each iteration, the information from the first 10 studies citing a potential study was extracted. After 10 iterations, there were n=4,765 search results (with n=1,384 unique results). The number of retrieved studies increased with each iteration. However, the number of duplicated studies also increased in later iterations, suggesting inefficient retrievals.
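The bookkeeping behind the total and unique counts can be sketched as follows. This is not the project's code. The `cited_by` dictionary stands in for Google Scholar lookups, `is_relevant` is a hypothetical placeholder for the trained machine learning model, and following only newly seen studies into the next iteration is an assumption made for this example.

```python
def forward_snowball(cited_by, start_set, iterations, per_study=10,
                     is_relevant=lambda study: True):
    """Return a list of (iteration, total_retrieved, unique_retrieved)."""
    seen = set(start_set)
    total = len(start_set)
    frontier = list(start_set)
    history = [(0, total, len(seen))]
    for iteration in range(1, iterations + 1):
        next_frontier = []
        for study in frontier:
            if not is_relevant(study):
                continue  # the model rejected this study, so do not follow it
            # Extract only the first `per_study` studies citing this one.
            for citing in cited_by.get(study, [])[:per_study]:
                total += 1  # every retrieval counts, duplicates included
                if citing not in seen:
                    seen.add(citing)
                    next_frontier.append(citing)
        frontier = next_frontier
        history.append((iteration, total, len(seen)))
    return history


# Hypothetical citation graph: study -> studies that cite it.
cited_by = {
    "A": ["C", "D"],
    "B": ["D", "E"],
    "C": ["E", "F"],
    "D": ["F"],
    "E": ["F", "A"],
}
history = forward_snowball(cited_by, ["A", "B"], iterations=2)
```

In this toy graph the second iteration makes five retrievals but finds only one new study. This is the same pattern of growing duplication the visualisation shows in later iterations.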

The data visualisation provides spatial relationships between each iteration in a chronological order to inspect the change. The results provide an insight into the Google Scholar search algorithm and help us to search and utilise the database more efficiently.

More about the competition

The winner received £100 in prize money and was invited to present his visualisation as a poster at the Data Visualisation Symposium at the Alan Turing Institute in London on 13 September 2019. You can take a look at the full poster of the winning visualisation here: Turing AI Symposium Poster.

The two runners-up, each receiving £50, are Chris Moreno-Stokoe and Valeriia Haberland.

Take a look at their visualisations below:

‘History appears to have repeated itself with unsubstantiated claims about the effects of bilingualism’ by Chris Moreno-Stokoe.


‘From a data space to knowledge discovery’ by Valeriia Haberland

The Jean Golding Institute Beauty of Data competition challenges staff and students to submit their work to find the best University of Bristol data visualisation. You can take a look at past entries on our Flickr page.

The Jean Golding Institute data competitions

We run a number of competitions throughout the year – to find out more take a look at Data competitions.

Loneliness competition winners announced

Photo by Danielle MacInness on Unsplash

We are pleased to announce that the winners of the competition are Nina Di Cara from Population Health Sciences and Tiff Massey, Analyst at Ernst and Young, with their project ‘Is loneliness associated with movement for education?’. The specific research question assumes that in most cases, movement for primary and secondary education is associated with upward social mobility: that is, families moving to try to get into a better school than is available in their current local area.

The team’s research question was ‘Is community-level loneliness associated with the quality of local schools, and how far can this be attributed to the movement of families pursuing upward social mobility through education?’  

The winning team explored several models and created novel metrics to explore the relationship between loneliness and movement for education. They found that the population change caused by children aged 4-15 moving has a small impact on loneliness in communities. They hypothesised that children of this age mostly move to pursue better educational opportunities, and so movement for the purpose of education among primary and secondary students is associated with loneliness. We will hear more about the details of the analysis in Nina Di Cara’s upcoming blog, to be published on the ONS Data Science Campus website.

Nina Di Cara said “We were so excited at the opportunity to take part in the data challenge, especially since it gave us the chance to try out using open data to answer a question that has real-life significance. It was a lot of fun to work together and challenge ourselves – we both learned a lot by taking part so winning was a bonus!” 

Jasmine Grimsley, Senior Data Scientist at the ONS Data Science Campus, said “Congratulations to the winners of this year’s Jean Golding Institute Loneliness data challenge which provided an opportunity for students to use their cutting-edge analysis skills to answer current questions relevant to government. Students brought together alternative data sources, admin data, and combined it with existing open government data in novel ways.

“At the Data Science Campus we want to work with people from across the country to try new ways of analysing data to provide new information which can inform decisions. The methods our winners used are exciting and will help in future explorations of how the country can make better use of its data.”

The winners received £1,000 in prize money and have also been invited to the Office for National Statistics (ONS) Data Science Campus to share new ideas for data analysis. They will also have the opportunity to present their findings and spend a “Day in the life” of a Government Data Scientist. Furthermore, their work will be showcased on the Data Science Campus website in blog form.   

The two runners-up, each receiving £250, are Angharad Stell and Robert Eyre.

More about the competition

The Office for National Statistics has developed a loneliness index using open prescription data which is available at the MSOA (Middle layer Super Output Area in the ONS coding system) level across England. These data also provide information to identify MSOAs that are within geographical clusters where the loneliness index is high or low. We would like to understand if the mobility of people for education is associated with the risk of being in a high or low cluster. The movement of people for education can be local or across a great distance.

In this competition, we challenged participants to put forward a research question related to loneliness and movement for education, and answer it using the loneliness dataset provided (see below) alongside other suggested data sources.



Computer Experiments

Blog written by Jonathan Rougier, Professor of Statistical Science, University of Bristol

In a computer experiment we run our experiment in silico, in situations where it would be expensive or illegal to run it for real.

Computer code which is used as an analogue for the underlying system of interest is termed a simulator; often we have more than one simulator for a specified system. I have promoted the use of ‘simulator’ over the also-common ‘model’, because the word ‘model’ is very overloaded, especially in Statistics (see Second-order exchangeability analysis for multimodel ensembles).

Parameters and Calibration

The basic question in a computer experiment is how to relate the simulator(s) and the underlying system. We need to do this in order to calibrate the simulator’s parameters to system observations, and to make predictions about system behaviour based on runs of the simulator.

Parameters are values in the simulator which are adjustable. In principle every numerical value in the code of the simulator is adjustable, but we would usually leave physically-based values like the gravitational constant alone. It is common to find parameters in chunks of code which are standing-in for processes which are not understood, or which are being approximated at a lower resolution. In ocean simulators, for example, we distinguish between ‘molecular viscosity’, which is a measurable value, and ‘eddy viscosity’, which is the parameter used in the code.

The process of adjusting parameters to system observations is a statistical one, requiring specification of the ‘gap’ between the simulator and the system, termed the discrepancy, and the measurement errors in the observations. In a Bayesian analysis this process tends to be called calibration. When people refer to calibration as an inverse problem it is usually because they have (maybe implicitly) assumed that the simulator is perfect and the measurement error is Normal with a simple variance. These assumptions imply that the Maximum Likelihood value for the parameters is the value which minimizes the sum of squared deviations. But we do not have to make these assumptions in a statistical analysis, and often we can use additional insights to do much better, including quantifying uncertainty.
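The step from those two assumptions to least squares can be written out as follows. The notation is mine, not taken from the papers cited here: $z_i$ are the $n$ system observations, $f_i(\theta)$ is the simulator output corresponding to observation $i$ at parameter value $\theta$, and $\sigma^2$ is the common measurement error variance.

```latex
% Assumed: perfect simulator, independent Normal measurement errors
z_i = f_i(\theta) + e_i, \qquad e_i \sim \mathrm{N}(0, \sigma^2), \quad i = 1, \dots, n

% Log-likelihood of the parameters
\log L(\theta) = -\frac{n}{2} \log(2 \pi \sigma^2)
                 - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} \bigl( z_i - f_i(\theta) \bigr)^2

% Only the second term depends on theta, so
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \log L(\theta)
                           = \arg\min_{\theta} \sum_{i=1}^{n} \bigl( z_i - f_i(\theta) \bigr)^2
```

Once a discrepancy term is added to the first line, or the errors are no longer independent with a common variance, the likelihood changes and the sum of squared deviations is no longer the quantity to minimise.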

The dominant statistical model for relating the simulator and the system is the best input model, which asserts that there is a best value for the parameters, although we do not know what it is. Crucially, the best value does not make the simulator a perfect analogue of the system: there is still a gap. I helped to formalize this model, working with Michael Goldstein and the group at Durham University (e.g. Probabilistic formulations for transferring inferences from mathematical models to physical systems and Probabilistic inference for future climate using an ensemble of climate model evaluations). Michael Goldstein and I then proposed a more satisfactory reified model which was better-suited to situations where there was (or could be) more than one simulator (Reified Bayesian modelling and inference for physical systems). The paper has been well-cited but the idea has not (yet) caught on.

In a Bayesian analysis, calibration and prediction tend to be quite closely related, particularly because the same model of the gap between the simulator and the system has to be used for both calibration (using historical system behaviour) and prediction (future system behaviour). There are some applications where quite simplistic models have been widely used, such as ‘anomaly correction’ in paleoclimate reconstruction and climate prediction (see Climate simulators and climate projections).


Calibration and prediction are fairly standard statistical operations when the simulator is cheap enough to run that it can be embedded ‘in the loop’ of a statistical calculation. But many simulators are expensive to run; for example, climate simulators on super-computers run at about 100 simulated years per month. In this case, each run has to be carefully chosen to be as informative as possible. The crucial tool here is an emulator, which is a statistical model of the simulator.

In a nutshell, carefully-chosen (expensive) runs of the simulator are used to build the emulator, and (cheap) runs of the emulator are used ‘in the loop’ of the statistical calculation. Of course, there is also a gap between the emulator and the simulator.

Choosing where to run the simulator is a topic of experimental design.

Early in the process, a space-filling design like a Latin Hypercube is popular. As the calculation progresses, it is tempting to include system observations in the experimental design. This is possible and can be very advantageous, but the book-keeping in a fully-statistical approach can get quite baroque, because of keeping track of double-counting – see Bayes linear calibrated prediction for complex systems. It is quite common in a statistical calculation to split learning about the simulator on the one hand, and using the emulator to learn about the system on the other, for pragmatic reasons (Comment on article by Sanso et al (PDF)).
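For readers unfamiliar with the Latin Hypercube mentioned above, here is a minimal sketch of one way to generate such a design on the unit cube. The function name and the `seed` argument are my own choices. In practice each column would be rescaled to the range of the corresponding simulator parameter.

```python
import random


def latin_hypercube(n_runs, n_params, seed=0):
    """Return n_runs points in [0, 1)^n_params such that, in every
    dimension, each of the n_runs equal-width strata holds exactly one point."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_params):
        strata = list(range(n_runs))
        rng.shuffle(strata)  # independent random pairing across dimensions
        # Place one point uniformly at random inside each stratum.
        columns.append([(s + rng.random()) / n_runs for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_runs)]


design = latin_hypercube(n_runs=5, n_params=3)
```

Every one-dimensional projection of the design covers the whole parameter range evenly. This is a useful property when, early on, we do not know which parameters the simulator is most sensitive to.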

Sometimes the emulator will be referred to as the surrogate simulator, particularly in Engineering. Often the surrogate is a flexible fitter with a restricted statistical provenance (e.g. ‘polynomial chaos’ (PDF)). This makes it difficult to use surrogates for statistical calculations, because a well-specified uncertainty about the simulator is a crucial output from an emulator. Statistics and Machine Learning have widely adopted the Gaussian process as a statistical model for an emulator.
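To make the idea of an emulator with a well-specified uncertainty concrete, here is a minimal sketch of a Gaussian process emulator for a simulator with one parameter. Everything in it is an illustrative assumption: a zero prior mean, a squared-exponential covariance with fixed length scale and variance, and `math.sin` standing in for an expensive simulator. A real emulator would estimate these choices and would use a proper linear algebra library.

```python
import math


def rbf(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def emulate(train_x, train_y, x_new, jitter=1e-9):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    # Covariance between simulator runs, with a small jitter for stability.
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    k_star = [rbf(x_new, a) for a in train_x]
    mean = sum(k * w for k, w in zip(k_star, solve(K, train_y)))
    variance = rbf(x_new, x_new) - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, max(variance, 0.0)


# Four "expensive" runs of a stand-in simulator.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [math.sin(x) for x in train_x]

mean_at_run, var_at_run = emulate(train_x, train_y, 1.0)   # at an existing run
mean_mid, var_mid = emulate(train_x, train_y, 1.5)         # between runs
mean_far, var_far = emulate(train_x, train_y, 10.0)        # far from any run
```

The emulator reproduces the simulator at the runs it has seen, with essentially zero variance. It becomes more uncertain between runs, and reverts to its prior variance far from any run. It is this variance that a plain curve-fitting surrogate does not supply.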

Gaussian processes can be expensive to compute with, especially when the simulator output is high-dimensional, like a field of values (Efficient emulators for multivariate deterministic functions). The recent approach of inducing points looks promising (On sparse variational methods and the Kullback-Leibler divergence between stochastic processes (PDF)).

Emulators have also been used in optimization problems. Here the challenge is to approximately maximize an expensive function of the parameters; I will continue to refer to this function as the ‘simulator’. Choosing the parameter values at which to run the simulator is another experimental design problem. In the early stages of the maximization the simulator runs are performed mainly to learn about the gross features of the simulator’s shape, which means they tend to be widely-scattered in the input space. But as the shape becomes better known (i.e., the emulator’s uncertainty reduces), the emphasis shifts to homing-in on the location of the maximum, and the simulator runs tend to concentrate in one region. There are some very effective statistical criteria for managing this transition from explore to exploit. This topic tends to be known as ‘Bayesian optimization’ in Machine Learning, see Michael Osborne’s page for some more details.
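One widely used criterion of this kind is expected improvement, sketched below. I have picked it as an example and am not claiming it is the criterion used in any of the work cited here. It takes the emulator's predictive mean and standard deviation at a candidate parameter value, together with the best simulator value found so far, and returns the expected amount by which a new run would beat that best value.

```python
import math


def normal_pdf(z):
    """Standard Normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, sd, best_so_far):
    """Expected improvement (for maximisation) when the emulator predicts
    a Normal distribution with the given mean and standard deviation."""
    if sd <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / sd
    return (mean - best_so_far) * normal_cdf(z) + sd * normal_pdf(z)


# Two candidates with the same predicted mean, below the current best of 0.5.
ei_uncertain = expected_improvement(mean=0.0, sd=2.0, best_so_far=0.5)
ei_confident = expected_improvement(mean=0.0, sd=0.5, best_so_far=0.5)
```

The first term rewards candidates whose predicted mean is already high (exploit), and the second rewards candidates where the emulator is uncertain (explore). As the emulator's uncertainty shrinks, the first term dominates and the runs concentrate around the maximum.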



EPIC Lab: Generating a first-person (egocentric) vision dataset for practical chemistry – data analysis and educational opportunities

Blog written by Chris Adams, Teaching Fellow, School of Chemistry, University of Bristol

This project was funded by the annual Jean Golding Institute seed corn funding scheme.

Our project was a collaboration between the Schools of Computer Science and Chemistry. The computer science side stems from the Epic Kitchens project, which used head-mounted GoPro cameras to capture video footage from the user’s perspective of people performing kitchen tasks. They then used the resulting dataset to set challenges to the computer vision community: can a computer learn to recognise tasks that are being done slightly differently by different people? And if it can, can the computer learn to recognise whether the procedure is being done well? Conceptually this is not so far from what we as educators do in an undergraduate student laboratory; we watch the students doing their practicals, and then make judgements about their competency. Since chemistry is essentially just cooking, we joined forces to record some video footage of undergraduates doing chemistry experiments. Ultimately, one can imagine the end point being a computer trained to recognise if an experiment was being done well and providing live feedback to the student, like a surgeon doing an operation wearing a camera that can guide them. This is way in the future though…

There were some technical aspects that we were interested in exploring – for example, chemistry often involves colourless liquids in transparent vessels. The human eye generally copes with this situation without any problem, but it’s much more of a challenge for computers. There were also some educational aspects to think about – we use videos a lot in the guidance that we give students, but these are not first person, and are professionally filmed. How would footage of real students doing real experiments compare? It was also interesting to have recordings of what the students actually do (as opposed to what they’re told to do) so we can see at which points they deviate from the instructions.

We used the funding to purchase a couple of GoPros to augment those we already had, and to fund two students to help with the project. Over the course of a month, we collected film of about 30 different students undertaking the same first year chemistry experiment, each film being about 16 GB of data (thanks to the RDSF for looking after this for us). It was interesting to see how the mere fact of wearing the camera affected the students’ behaviour; several of them commented that they made mistakes which they wouldn’t normally have made, simply because they were being watched. As somebody who has sat music exams in the recent past I can testify that this is true…

One of the research students then sat down and watched the films, creating a list of which activities were being carried out at what times, and we’re now in the process of feeding that information to the computer and training it to recognize what’s going on. This analysis is still ongoing, so watch this space….

The Jean Golding Institute seed corn funding scheme

The JGI offer funding to a handful of small pilot projects every year in our seed corn funding scheme – our next round of funding will be launched in the Autumn of 2019. Find out more about the funding and the projects we have supported in our projects page.