Machine learning based online discourse analysis of mental health and medication use

JGI Seed Corn Funding Project Blog 2022-2023: Holly Fraser

Introduction

Extracting information from free text sources is a complex and exciting data science challenge. Textual data generated by humans is rich and complex, and its meaning is often contextual and packed with nuance. This project analysed data from Reddit, an online social news website and discussion forum community, using Natural Language Processing (NLP). NLP is a data science technique used to extract information from textual data sources. A particular aim of this project was to explore ways to model the types of discussions Reddit users were having about antidepressants (AD), a common intervention for the management of depression and anxiety symptoms.

Model exploration

With support from the Jean Golding Institute (JGI), I was able to explore the sentiment, emotion, and topics discussed on various subreddits using a range of data-driven techniques. For example, I used a sentiment analysis package[1] to extract the sentiment (fig 1) and emotion (fig 2) from a large data corpus (n=24183) extracted from the r/antidepressants subreddit. Figure 3 depicts a schematic of the workflow used to analyse the sentiment of the comments on this subreddit. I then used topic modelling to extract and cluster the topics from the data corpus using a cluster-based transformation technique[2] (fig 4).

Figure 1: Sentiment analysis of data from r/antidepressants (n=24183)
Figure 2: Emotion analysis of data from r/antidepressants (n=24183)
 
Figure 3: Schematic of sentiment analysis workflow
 
Figure 4: Example of clustered topic extractions
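
As a rough illustration of the sentiment step described above (a sketch under stated assumptions, not the code used in the project), the workflow amounts to classifying each comment and tallying the labels. The helper names `tally_labels`, `make_pysentimiento_classifier` and `keyword_classifier` are hypothetical, as are the example comments. The optional helper assumes the pysentimiento package[1] is installed and uses its `create_analyzer` function, while the toy keyword classifier only exists so the sketch runs without downloading a model.

```python
from collections import Counter
from typing import Callable, Iterable


def tally_labels(comments: Iterable[str], classify: Callable[[str], str]) -> Counter:
    """Count how many comments receive each predicted label."""
    return Counter(classify(text) for text in comments)


def make_pysentimiento_classifier(task: str = "sentiment") -> Callable[[str], str]:
    """Return a classifier backed by pysentimiento (assumed installed).

    The import is done lazily so the rest of the sketch runs without it.
    Use task="emotion" for emotion labels instead of sentiment labels.
    """
    from pysentimiento import create_analyzer

    analyzer = create_analyzer(task=task, lang="en")
    return lambda text: analyzer.predict(text).output


def keyword_classifier(text: str) -> str:
    """Toy stand-in classifier (hypothetical), used only for demonstration."""
    lowered = text.lower()
    if "better" in lowered:
        return "POS"
    if "worse" in lowered or "badly" in lowered:
        return "NEG"
    return "NEU"


if __name__ == "__main__":
    # Invented example comments, not taken from the Reddit corpus.
    comments = [
        "I feel much better since the dose change",
        "The side effects got worse this week",
        "I started a new prescription today",
    ]
    print(tally_labels(comments, keyword_classifier))
```

Swapping `keyword_classifier` for `make_pysentimiento_classifier()` would run the same tally with a pretrained model; the resulting counts are the kind of data a bar chart like figure 1 summarises.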

It was really valuable to be able to use these techniques to explore questions related to the lived experience of managing mental health challenges. My PhD research involves using population health data to explore questions related to depression and medication use, so using free text data to explore similar questions in a data-driven, exploratory way was thought-provoking. For example, how do you extract specific information relevant to mental health from a real-life, unstructured dataset? How could we use data analysis like this for impactful mental health research?

Interpretation of results

One of the biggest challenges of the project has been interpreting the model results. The output of the topic modelling was particularly difficult to interpret because many of the extracted topics contained words that had little meaning out of context, despite the strategies I used to remove these ‘noisy’ words.

The results of the sentiment and emotion analysis, however, are relatively easy to describe and interpret. For example, the sentiment model classified the majority of comments as having a negative sentiment (fig 1). The emotion analysis output is also relatively easy to interpret, but it is worth noting that the model struggled to classify the data into discrete emotion categories, with the ‘others’ column being the most densely populated (fig 2). This is not a surprising finding when looking at the raw data from a human perspective; many of the comments are long and complex, containing multiple stances. For example, the model struggled to correctly predict the emotion of comments which said things like ‘I was doing badly on X, but I’m doing much better now on Y’ (paraphrased). More work evaluating how well the models classify sentiment and emotion would therefore be valuable.

Knowing which types of text data the model struggled to classify gives an interesting insight into the challenges of NLP in this particular context, where the text is often complex and composed of several clauses expressing different emotions.

Conclusion and next steps

A valuable next step for this work would be to assess more formally, on a smaller data set and using human interpretation, how well the models classify sentiment and emotion and extract topics. Comparing the model output to human judgement in a structured way would give an insight into how well the models I used perform on Reddit data. Extracting important information related to health (e.g., patient experience of a healthcare intervention) from unstructured text data is a significant NLP challenge; having better insight into the complexity of this challenge has been one of the most valuable outcomes of this project.
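
One possible way to structure such a comparison, offered here only as a sketch and not as the planned evaluation, is to have a person label a small sample of comments and then compute the raw agreement with the model alongside Cohen's kappa, which corrects for agreement expected by chance. The function name `cohen_kappa`, the label names and the example labels below are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(human: Sequence[str], model: Sequence[str]) -> float:
    """Chance-corrected agreement between human labels and model labels."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    # Observed agreement: share of comments where both labels match.
    observed = sum(h == m for h, m in zip(human, model)) / n

    # Expected agreement if both raters labelled independently,
    # each at their own observed label frequencies.
    human_counts = Counter(human)
    model_counts = Counter(model)
    expected = sum(
        human_counts[label] * model_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Invented labels for four comments, for illustration only.
    human_labels = ["NEG", "NEG", "POS", "NEU"]
    model_labels = ["NEG", "POS", "POS", "NEU"]
    print(round(cohen_kappa(human_labels, model_labels), 3))
```

A kappa near 1 would indicate that the model closely matches human judgement, while a value near 0 would indicate agreement no better than chance. Looking at which comments the two disagree on would also help pinpoint the kinds of text the model finds hardest.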

If anyone is interested in hearing more about this work or my other projects, you can find me on Twitter @hollyalexfraser or email holly.fraser@bristol.ac.uk.

Note: Work carried out for academic purposes only.

References

[1] J. M. Pérez, J. C. Giudici, and F. Luque, “pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks.” arXiv, Jun. 17, 2021. Accessed: Jun. 17, 2023. [Online]. Available: http://arxiv.org/abs/2106.09462

[2] M. Grootendorst, “BERTopic: Neural topic modeling with a class-based TF-IDF procedure.” arXiv, Mar. 11, 2022. Accessed: Jun. 17, 2023. [Online]. Available: http://arxiv.org/abs/2203.05794

JGI’s Widening Participation Summer Internship Experience: Emily Anderton & Senyi Luo

Interns Momi, Senyi and Emily with the JGI team

We completed a six-week internship working with Dr Hen Wilkinson and the JGI team as part of the University of Bristol’s Widening Participation Research Summer Internship scheme. Our internship project was in the data science field and centered around the topic of PeaceTech.

Our experience:

Throughout the internship, I gained technical skills in using Tableau software and Python to create data visualisations, and I further developed my critical thinking skills when producing a scoping review of PeaceTech-related literature and researching stop and search statistics.

I felt it was valuable experience to be included in meetings with the wider JGI team, and the friendly nature of the team helped me feel confident when presenting updates on our project. It was also fascinating to learn of the other JGI data science projects currently underway during these meetings, especially since I am interested in pursuing a career in this field.

Although the internship was outside the field of my undergraduate degree (Accounting and Management), it helped me to develop transferable academic skills, including writing literature reviews and reports, which will prove useful as I enter the third year of my degree programme. The internship also enabled me to gain insight into the way research is conducted at the University of Bristol, which will be of great use to me when considering postgraduate study.

Overall, I felt that the internship was well structured, and the daily check-ins helped to keep everyone on the right track, which enabled me to learn so much in only six weeks. The JGI team were a pleasure to work with and I would definitely recommend an internship with them.

– Emily Anderton

During the internship at the JGI, I gained a deeper understanding of what PeaceTech is and got to know more about data science. I learned how to collect data, clean it, and finally visualize it. Alongside this, I learned how to carry out a literature review, which I had not done before and which will be helpful in my future studies. It was a fascinating introduction to the world of research at the University. You get an excellent insight into how research is conducted and how the University works.

The real-world project also gives you valuable hands-on experience of learning new things, solving problems and, most importantly, working as a team. The JGI team is very friendly and welcoming, they gave us a lot of support, and the atmosphere is great. The Widening Participation program is well organized, and there is a meet-up every week where you can share your experience of your project with interns from other departments, which gives you more points of view.

– Senyi Luo

We would like to thank the JGI and the University of Bristol for this incredible opportunity.

Addressing the fetal alcohol spectrum disorder (FASD) ‘data gap’

JGI Seed Corn Funding Project Blog 2022-2023: Cheryl McQuire

Cheryl McQuire on behalf of the study team: Amy Dillon, University of Bristol; Prof Raja Mukherjee, Surrey and Borders Partnership NHS Foundation Trust; Prof Penny Cook, University of Salford; Sandra Butcher, National Organisation for FASD; Andy Boyd, Director, UK Longitudinal Linkage Collaboration; Beverley Samways, University of Bristol; Dr Sarah Harding, University of Bristol

Twitter: @cheryl_mcquire

What’s the problem?

Landmark UK guidance has called for urgent action to increase identification, understanding, and support for those affected by fetal alcohol spectrum disorder (FASD), but a paucity of national data undermines the feasibility of achieving this.

Tell me more…

Fetal alcohol spectrum disorder (FASD) is caused by exposure to alcohol in pregnancy. It is the most common non-genetic cause of lifelong disability worldwide. FASD is associated with problems with learning and behaviour and an increased risk of physical, mental health, substance misuse, and social problems. Prevention, early diagnosis, and support for people living with FASD can improve outcomes and lead to societal cost savings.

In the UK, FASD is thought to be particularly common. A study in Manchester schools found that 2% of children had confirmed FASD, and 4% had possible FASD. UK health organisations have recommended urgent action to improve FASD prevention, diagnosis, and support. Publication of the National Institute for Health and Care Excellence (NICE) Quality Standard for FASD in 2022 sets the strongest precedent yet for improved prevention, assessment, and support for FASD.

In parallel, the UK government has called for a transformation in the way people’s information (data) is used to improve health. However, reliable and accessible data on FASD is not available. This makes it difficult to achieve important FASD research, policy, and healthcare goals.

A potential solution?

We believe that an important step towards addressing the FASD ‘data gap’ will be to produce the first UK National Linked Database for FASD. This would bring together de-identified FASD assessment records from NHS and private health settings that have not previously been available for research. These records would be stored in a trusted research environment, enabling researchers to use the data in a way that protects people’s privacy. FASD records could then be linked to other population records including health, education, employment, crime, and social care. The database would provide new insights into the characteristics and needs of people living with FASD and the impacts and costs of FASD in the UK, and would help identify opportunities for improving outcomes.

What were the aims of this seed corn project?

This seed corn funding allowed us to take the first steps towards making a UK National Database for FASD a reality. We used it to establish the feasibility, acceptability, key purposes, and data structure of the first linked national research database for fetal alcohol spectrum disorder.

What did we do?

We spoke to over 100 stakeholders including clinicians, data specialists, researchers, policy makers, charities, and people living with FASD to find out:

  1. What they want from a FASD database
  2. How this database could be used to advance policy, research and practice
  3. What UK data sources are currently available, and are due to become available, for FASD
  4. What data are commonly collected by FASD clinics
  5. What opportunities there are for standardisation/harmonisation of FASD data
  6. What should be considered in relation to ethical and data governance frameworks, data collation, transfer, storage, linkage, onward sharing and sustainability

To maximise engagement, we took a flexible and tailored approach, speaking to people by email and video conferencing and holding one in-person workshop to coincide with the UK Conference for FASD 2023 (Salford, March 2023).

What did we find?

There was strong support for a national FASD database. Charities and those living with FASD spoke of the benefits of increased awareness, understanding and support for FASD. Clinicians reported that the detailed clinical information provided on a national database could improve diagnosis, making assessment more efficient, potentially reducing long waiting lists. Researchers expressed enthusiasm for using it to better understand long-term outcomes, costs and opportunities for improved support. Policy makers identified clear alignment with current FASD and data transformation policy. The most common concern was around privacy and data sharing. The study team has been developing a data pipeline model to ensure that these concerns are appropriately addressed.

What’s next?

We are developing a ‘data pipeline’ model in collaboration with representatives from FASD clinics and people working in secure data environments, to take the initial steps in making the national database for FASD a reality.

We have clear plans for follow-on funding, maintaining the strong, widespread collaborations that we have developed and strengthened through this seed corn work.
We are presenting a summary of this public engagement work at the ADR-UK conference in November and have had this work accepted in the International Journal of Population Data Science.

Overall, this project has been invaluable in paving the way for progress in FASD in the UK. We hope to finally address this crucial FASD ‘data gap’ that has been stalling progress in prevention, understanding and appropriate support for too long.