Somaliland declared independence from Somalia in 1991 and has governed itself since, but remains internationally unrecognised. It is democratic, yet suffers some of the highest levels of poverty in the world. Somaliland’s government and industry have recognised ICT as a major opportunity for sustainable economic development, and the territory has all the key attributes needed to develop this sector bar one: local, skilled workers. This shortage is a key issue identified by the local government and business community.
Prof Velthuis and Dr De Sio from the UoB particle physics group travelled to Hargeisa to start a project with the University of Hargeisa, UoB’s strategic partner Transparency Solutions, and local partners including the industry player Telesom and NGOs such as Candlelight and ADAM Academy, to boost capability in data analysis and data-intensive research. More information on the project can be found here.
Data Science Course
Dr De Sio and Prof Velthuis delivered a course on Python programming and machine learning to students from universities, NGOs and industry. The course covered Python programming from the basics to experienced-user level and provided an introduction to machine learning techniques. After an initial in-person lecture at the University of Hargeisa, the course was taught in 15 online lectures and assessed by two assignments.
These were accompanied by 10 guest lectures from experts working in the field, showcasing applications of data science:
Mr Mark Gibbons – How we use data at the Centre for Sustainable Energy
Dr Richard Hugtenburg – Monte Carlo in medical physics
Dr Jaya Chakrabarti – Project VANA: AI, data and the environment
Dr Daniel Saunders – Data Science Applications: from Physics, to Gaming, to Crime Fighting
Dr Christian Thomay – Data in the Cockpit: Enhancing Pilot – Training with Eye Tracking and Data-driven Feedback
Dr Valerio Maggio – Teaching Machines to Recognise Emotion. An interactive online system to showcase human-in-the-loop AI
Dr Sudan Paramesvaran – The data challenge at CMS
Mr Dan Laasna Reuter – Why data is sexy
Dr Chris Lucas – AI in Practice
Dr Leonor Frazão – Using AI to Fight Financial Crime
The course was examined through two assignments. The first, whose successful completion was required to pass the course, involved analysing data obtained with the HiSPARC cosmic-ray detector network. The second, more advanced assignment involved either analysing the number of phone operators required in a call centre or analysing travel times in a lift; its successful completion was required to pass the course at the advanced level.
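Neither assignment’s data nor its solutions are public; purely as an illustration of the kind of analysis the call-centre assignment involves, here is a minimal Python sketch using the classical Erlang C formula (our choice of method for this illustration, not necessarily the one used in the course) to estimate how many operators keep the probability of a caller having to wait below a target:

```python
import math

def erlang_c(servers: int, load: float) -> float:
    """Probability that an arriving call must wait (Erlang C),
    for `servers` operators and an offered load `load` in Erlangs."""
    below = sum(load**k / math.factorial(k) for k in range(servers))
    top = load**servers / math.factorial(servers) * servers / (servers - load)
    return top / (below + top)

def operators_needed(calls_per_hour: float, mean_call_minutes: float,
                     max_wait_prob: float = 0.2) -> int:
    """Smallest number of operators keeping P(wait) below max_wait_prob."""
    load = calls_per_hour * mean_call_minutes / 60.0  # offered load in Erlangs
    n = math.ceil(load) + 1                           # a stable queue needs n > load
    while erlang_c(n, load) > max_wait_prob:
        n += 1
    return n
```

For example, `operators_needed(100, 3)` sizes a centre receiving 100 calls per hour with a 3-minute average call duration, an offered load of 5 Erlangs.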
The candidates who successfully completed the course were:
Abdiaziz Harun Mohamed
Abdillahi Sh. Ahmed Farah
Yasin Nour Moge
Salah Mohamed Osmaan
The candidates who successfully completed the course at the advanced level were:
Siraaj Jama Abdilahi
Abdirahman Hussein Mahad
Abdirizak Hussein Guled
Ahmed Mohamed Warsame
Mustafa Ahmed Muse
Mohamed Ibrahim Abdilahi
Mohamed Ali Mohamed
Abdalla Mohamed Yusuf
Abdirisaq Yusuf Adani
Ahmed Suleiman Bashir
Can sharing app data facilitate communication between young people and their mental health practitioner?
Bridget Ellis, Lucy Biddle, Helen Bould, Jon Bird
Mental health problems are increasing among young people, who have the highest prevalence of mental health problems of all age groups. Despite the adverse outcomes that result from this, young people access mental health services at a lower rate than other age groups, with barriers including communication, poor mental health literacy, embarrassment, fear of stigma and confidentiality concerns.
Research illustrates that digital peer support can help people with mental health difficulties, and the increased availability of mobile technologies is now being harnessed to deliver mental health support.
Our project was a collaboration with the company that created the award-winning, NHS-endorsed young person’s mental health app ‘Tellmi’ (www.Tellmi.help). The app is a fully moderated peer support environment, where young people anonymously share ‘tweet’-style posts about their emotional and mental health difficulties. A holistic dataset builds up for each individual which could have potential clinical value if shared with a healthcare practitioner. For example, the posts can be tagged for content, rated for severity, displayed longitudinally and presented in a shareable summary document.
A previous feasibility study used surveys and interviews to investigate the views of young people who used the Tellmi app, and of Child and Adolescent Mental Health Services (CAMHS) clinicians, about the acceptability and utility of sharing such a summary document during mental health consultations as a means of enhancing the clinical exchange. Our current study had two aims: i) to carry out in-depth thematic analysis of this previously collected data; and ii) to form a multidisciplinary working group and convene a one-day workshop to present and discuss our findings as preparation for a full-scale research proposal.
We conducted thematic analysis on interviews with five young people and four healthcare practitioners, and 120 survey responses from users of Tellmi.
“So I think finding the words and putting them on Tellmi makes it easier to be able to say them to someone who is in front of you”
A theme was identified around communication and how a summary document could facilitate it between young people and healthcare practitioners. One concern raised by young people was that the way they communicate varies depending on their audience: a summary of posts intended for peers may contain information they would not usually present to a clinician. Young people appeared to value the written communication of Tellmi and were enthusiastic about how it could provide a focus for, and inform, clinical sessions. For young people who struggle to communicate their level of distress to a clinician, being able to share their experiences through their Tellmi posts could help overcome this. Additionally, a written account of how young people have been feeling may help bridge the gap between the more honest and open information disclosed anonymously and that disclosed face-to-face with a clinician. However, young people did raise concerns that this written information could be misconstrued or misinterpreted.
“If I feel comfortable with them then I’ll be more likely to share but if I don’t feel comfortable then I would not share”
We found that trust would play a key role in the process of sharing: not only trust between a young person and their clinician, but also trust between a young person and Tellmi, and how sharing could change the way young people engage with the app going forward. Clinicians also raised questions about trusting the Tellmi app, in particular how successfully an algorithm can identify risk, and how the data being shared might be monetised.
“Tellmi posts do tend to be quite personal and honest and open because you expect to be talking to someone who isn’t really there so you can say whatever you like and there’s no judgement”
Young people seem to really value Tellmi as a safe space, a safety particularly facilitated by the anonymity it provides. They were concerned about how their data might be handled if it were no longer anonymous and was being shared with clinicians.
We also identified practicalities surrounding sharing that would need to be addressed. For example, young people required control over their data and how it is used and shared. The potential for young people to censor the information they present to their clinician was also discussed, as was the impact that revisiting old posts might have on them. Factors specific to clinicians could also affect sharing, with time a concern for both clinicians and young people.
“I think it’s [sharing a Tellmi summary] a great idea but the young people would need to have complete control of the information that is included to avoid endangering young people”
Our multidisciplinary working group consisted of three researchers from computer science and health sciences, two child and adolescent psychiatrists, representatives from Tellmi, and two young people with lived experience of mental health difficulties. We presented our findings from the thematic analysis then used discussion sessions and group work to consider implications for the design of future research. We discussed how data sharing is likely to be most beneficial; how acceptability can be enhanced for young people and clinicians; stakeholders’ evaluations of the dummy data summary document of Tellmi posts, including methods of data visualisation; and potential barriers to data sharing in practice.
Discussion of the design for a user study of Tellmi data-sharing in practice identified that this would involve varied stakeholders, including Tellmi users, researchers and clinicians. It was noted that recruitment could bring challenges, and discussion sought to identify the most appropriate pathways for recruiting clinicians and young people in paired groups so that both perspectives can be captured for each case of data sharing. If recruiting through clinicians, it was noted that young people may not be Tellmi users or may not have enough data to produce a summary document. A suggestion for overcoming this was to ask young people to engage with Tellmi while on a waiting list. However, one of the lived experience advisors highlighted a challenge: “I think another issue with recruiting people through the NHS is that no matter how good an app is, if you are a young person on a waiting list and a clinician says, ‘Use this app,’ it’s like, ‘No, I want you to help me, and why am I going to use an app?’” Alternatively, we discussed recruiting through the Tellmi platform, with young people approaching their clinicians to get involved. Again, however, there would be challenges with this approach, such as obtaining ethical approvals and clinical ‘buy-in’ when the relevant young people could be based all over the country.
We also discussed the practicalities of sharing and how a study procedure would be designed, focusing this discussion around the implications for design highlighted in our thematic analysis. This encompassed the details of how the process of sharing would actually take place. For example, we considered whether the summary would be shared as a physical document or an electronic copy, and whether it should be given to the young person to present to their clinician or be sent directly to the clinician. When to share is also a key consideration: our data show young people hold varied views on this, on whether sharing should be repeated and, if so, at what frequency. Additionally, methods of improving and encouraging sharing were discussed, as well as the overall design of a summary document and how it could be altered to ensure inclusivity for those with special educational needs.
Key to designing a research study were methods of evaluation and establishing outcome measures. Young people and clinicians flagged a range of potential outcomes. These included completing clinical tasks such as goal setting, and how successful a young person considers a session: “something else to measure would be how the young person feels coming out of the appointment. Has it empowered them or let them take control of their healthcare”. The view of the young person was considered key in determining how outcome would be measured: “it’s just making me think what is the actual point of sharing the data again? I guess that depends on the young person”.
The workshop provided a space for exciting discussion with input from stakeholders from different backgrounds. While we hoped it would allow elements of co-design to inform development of a data sharing document and research plans to evaluate this, challenges were raised which suggest further development work may be necessary before the process of sharing can be evaluated. The ideas and issues raised at our workshop will be explored through our continued collaboration with Tellmi.
“The workshop was incredibly insightful. It provided us the opportunity to discuss the findings of the study with a diverse group of experts including academics, clinicians and young people with lived experiences of poor mental health. It has helped us to completely rethink how to approach the problem and we look forward to continuing to work with the Bristol team.” Kerstyn Comley, Tellmi Co-CEO
Dr Josh Hoole, Dr Oliver Andrews, Dr Steve Bullock
Various UK organisations provide 24/7 Search and Rescue (SAR) capability year-round across land, sea and air. Data analytics provides a key route to supporting SAR operations and aerospace system design in the future.
Aims of the Project
The aim of this project was to explore what data are available to capture the variability present in SAR operations (including mission characteristics and weather), in order to support the future design of aerial systems for SAR. This aim was pursued through the following objectives:
Engagement with search and rescue organisations to establish:
Availability of data for characterising SAR mission profiles
Perceptions on developing Unmanned Aerial Vehicles (UAVs or ‘drones’) to support SAR
Data fusion across asset tracking data to characterise SAR mission profiles:
Exploitation of aircraft and vessel trajectories
Combining mission profiles with meteorological data
This project therefore lay at the exciting and valuable intersection between data science, aerospace systems, weather and climate analysis and SAR.
To date, the following project activities have been performed with the support of the Seedcorn funding:
Data Workshop with the Royal National Lifeboat Institution (RNLI)
A one-day workshop was held with the RNLI Data team at the RNLI College in Poole. Within this workshop, areas of interest and ideas were shared spanning the exploitation of data for mission analysis, future planning and the use of computer vision to support lifesaving activities. The University of Bristol team were simply amazed at the large amount of data-driven work performed by the RNLI and look forward to establishing stronger links between the RNLI and research institutions in the future (see contact details below).
There was also a tour of the RNLI’s training and lifeboat manufacturing facilities as part of the workshop to provide context to the RNLI’s activities. The Bristol team were overwhelmed by the vast and diverse capabilities present in a single location and thoroughly recommend a tour of the RNLI College and All-weather Lifeboat Centre.
Initial Assessment of Vessel Tracking Data
Maritime vessels are equipped with real-time tracking capability via Automatic Identification System (AIS) installations. Historic AIS data provides vessel trajectories which can be post-processed to characterise the mission performed. Building on prior work in the literature, an initial investigation into processing the AIS trajectories of RNLI lifeboats has been performed using data sourced from MarineTraffic. Using simple algorithms, AIS trajectories can be processed to identify the occurrence of lifeboat search manoeuvres and generate characteristics regarding the search operation (e.g. search time, search area, etc.). It is intended that such characteristics can be used in the future to support the post-mission reporting performed by the RNLI.
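The “simple algorithms” used to flag search manoeuvres aren’t detailed in the write-up; one plausible heuristic, sketched below, flags the tight, repeated turns of a search pattern by the mean heading change between successive AIS fixes (a flat-earth bearing approximation, adequate over the short distances between fixes; the 45°-per-leg threshold is an invented illustration):

```python
import math

def headings(track):
    """Approximate bearing (degrees) between consecutive (lat, lon) fixes."""
    out = []
    for (la1, lo1), (la2, lo2) in zip(track, track[1:]):
        out.append(math.degrees(math.atan2(lo2 - lo1, la2 - la1)) % 360)
    return out

def turn_rate(track):
    """Mean absolute heading change per leg (needs at least 3 fixes);
    high values suggest the repeated turns of a search pattern."""
    hs = headings(track)
    diffs = []
    for h1, h2 in zip(hs, hs[1:]):
        d = abs(h2 - h1) % 360
        diffs.append(min(d, 360 - d))        # shortest angular difference
    return sum(diffs) / len(diffs)

def is_searching(track, threshold=45.0):
    """Flag a track segment as a likely search manoeuvre."""
    return turn_rate(track) > threshold
```

A straight transit leg yields a turn rate near zero, while a ladder or expanding-square search produces large, regular heading changes; in practice a sliding window over the full trajectory would isolate the search phase and yield characteristics such as search time and area.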
Data Fusion to Enhance SAR Helicopter Tracking Data
A large number of aerospace vehicles are also equipped with real-time tracking capability via Automatic Dependent Surveillance-Broadcast (ADS-B) equipment. However, as a line-of-sight system, ADS-B derived trajectories are often lacking in the regions where SAR operations take place, such as at low altitude, close to obstructions or out at sea. SAR helicopters are also equipped with AIS equipment, permitting ADS-B and AIS data sources to be fused to greatly increase the coverage of SAR helicopter trajectories. The ADS-B/AIS fused trajectories can then be further processed to generate mission characteristics as for the maritime vessel trajectories.
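The fusion logic isn’t specified in the write-up; a minimal sketch of one reasonable approach is to keep every ADS-B fix and fill coverage gaps with AIS fixes, dropping AIS points that fall within a few seconds of an ADS-B point (the 5-second deduplication window is an invented illustration):

```python
def fuse_tracks(adsb, ais, gap=5.0):
    """Merge ADS-B and AIS fixes (each a (timestamp, lat, lon) tuple)
    into one time-ordered trajectory. AIS fixes within `gap` seconds
    of an ADS-B fix are dropped, so ADS-B data takes precedence."""
    adsb_times = sorted(t for t, *_ in adsb)

    def near_adsb(t):
        # linear scan is fine for a sketch; bisect would scale better
        return any(abs(t - ta) < gap for ta in adsb_times)

    fused = list(adsb) + [fix for fix in ais if not near_adsb(fix[0])]
    return sorted(fused)
```

The fused trajectory can then be fed to the same manoeuvre-characterisation step as the pure-AIS vessel tracks.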
Exploitation of Meteorological Data Products
Following completion of the SAR mission characterisation via AIS and ADS-B data sources, the project intends to couple the trajectories to meteorological data products to fully characterise the SAR operational environment. This level of data fusion could support automated post-mission reporting, draw correlations between search characteristics and the operating environment, and support future planning with respect to the impacts of climate change on UK SAR operations.
Engage further with Inland SAR Organisations (PhD projects)
So far, the project has focused on maritime SAR. Future work will engage more extensively with inland SAR organisations, and initial links have already been formed with the relevant organisations. Dr Steve Bullock has successfully secured funding for two PhD students in the area of SAR planning for UAVs, and these projects will aim to leverage the expertise from the SAR connections made during this project.
Future SAR Data Research Partnerships
The workshop with the RNLI highlighted a significant number of data-centric avenues that could be pursued within future research projects, including aspects of machine learning, computer vision, weather and climate, along with mission analysis. A future workshop is planned, and researchers from across the data community at the University of Bristol are encouraged to participate, so please get in touch via the contact details below. The University of Bristol team are also very keen to explore collaborative partnerships within this area with other research institutions (GW4 and beyond) and SAR organisations. Please send any expressions of interest regarding future opportunities to the contact details below.
Contact Details Dr Josh Hoole, Department of Aerospace Engineering, University of Bristol, firstname.lastname@example.org
Medfluencers: how medical experts influence discourse about practices against COVID-19 on Twitter
This project aims to investigate the role of medical experts on Twitter in influencing public health discourse and practice in the fight against COVID-19.
Aims of the Project
The project focuses on medical experts as the driving force of Networks of Practice (NoPs) on social media and investigates the influence of these networks on public health discourse and practice in the fight against COVID-19. NoPs are networks of people who share practices and mobilise knowledge on a large scale thanks to digital technologies. We chose Twitter as the focus of our analysis since it is an open platform widely used by UK and international medical experts to reach the public. A key methodological challenge that this project seeks to address is to extend existing text analytics and visual network analysis methods to identify latent topics that are representative of practices, and to construct multimodal networks with topics/practices and actors as nodes and Twitter affordances (e.g. retweets, @mentions) as edges. To address this challenge, the aims of this project are to:
Build a machine learning classifier of tweets that mention relevant practices in the fight against COVID-19.
Build a machine learning classifier of authors of tweets, which can distinguish between medical experts and other key actors (e.g. public health organisations, journalists).
1. Data Collection
We used the report from Onalytica to identify the top-100 influential medical experts on Twitter. After receiving academic access to Twitter API, we collected a total of 424,629 tweets from the official accounts of these medical experts with the R package academictwitteR from 01 December 2020 to 02 February 2022.
2. Build a machine learning classifier for relevant practices
After cleaning the data set, we randomly selected a sample of 1,200 tweets, which two independent coders manually labelled as either “relevant” or “non-relevant”; this sample was used to train the machine learning classifier. By relevant we mean representative of relevant practices in the fight against COVID-19 (e.g. wearing a mask, getting a vaccine). After training and testing a series of algorithms (support vector classifier, random forest, logistic regression and naïve Bayes), a support vector classifier (SVC) gave the best classification results, with 0.907 accuracy. To create the inputs to the classifier, we used a sentence transformer to convert each tweet to a feature vector (a sentence embedding) that aims to represent the semantics of the tweet (https://www.sbert.net/). We compared this to a feature vector representing the number of occurrences of individual words in the tweet, which achieved a lower accuracy of 0.873 with a random forest classifier. For reference, the baseline accuracy when labelling all tweets as relevant is 0.57, showing that the classifier can learn a lot even from simple word features. The performance of SVC, random forest and logistic regression was similar throughout our experiments, suggesting that the choice of classifier matters less than choosing suitable features to represent each tweet. We applied the chosen SVC + sentence-embedding classifier to the remaining tweets, resulting in 235,320 tweets classified as representative of relevant practices.
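As a toy illustration of the word-occurrence pipeline described above (the stronger variant swaps the count vectors for SBERT sentence embeddings; the tweets and labels below are invented stand-ins, not project data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Invented stand-in tweets: 1 = mentions a relevant practice, 0 = does not
tweets = [
    "wear a mask indoors", "get your vaccine booster",
    "masks reduce transmission", "vaccines protect the vulnerable",
    "keep testing before you travel", "wash your hands often",
    "lovely weather in bristol", "great football match today",
    "new coffee shop opened", "traffic is terrible again",
    "holiday photos from spain", "concert tickets on sale",
]
labels = [1] * 6 + [0] * 6

vec = CountVectorizer()                    # word-occurrence features
X = vec.fit_transform(tweets)
clf = SVC(kernel="linear").fit(X, labels)

new = vec.transform(["please get vaccinated and wear a mask"])
print(clf.predict(new))
```

To use sentence embeddings instead, `X` would be replaced by the matrix returned by a sentence-transformer’s `encode` call on the tweet texts.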
3. Topic modelling
We employed topic modelling to gain a better understanding of the types of practices discussed by the medical experts. After testing a number of model-selection indexes, we found that 20 latent topics were present in our data, and therefore fitted an LDA (Latent Dirichlet Allocation) topic model with 20 topics (Figure 1).
We selected 9 topics related to significant practices linked to the fight against Covid-19:
Topic 1 is about vaccines
Topics 2 and 16 are about global health policy/practices
Topic 6 is about prevention of long covid in children
Topic 9 is about immunity (either natural or vaccine-induced) against variants, hence related to COVID-19 public health measures or practices
Topic 13 is about reporting of COVID cases and therefore linked to effectiveness of public health measures or practices
Topic 18 is about masks (in schools)
Topic 19 is about testing
Topic 20 is not about a “public health” practice but a scientific practice about sequencing
4. Build a machine learning classifier of authors of tweets
One hundred Twitter bios of medical experts were not enough to build the machine learning classifier. We therefore upsized our sample by including the bio descriptions of the accounts that the medical experts followed on Twitter. This strategy allowed us to include bios of users who were not medical experts and therefore to differentiate between medical and non-medical “experts” or influencers. We collected the “following” accounts with the R package “twitteR”, resulting in a total of 315,589 bios, from which we randomly selected a smaller sample of 2,000 for manual coding. Following an inductive approach, two independent coders manually coded the bios into labels that classified individuals by their job occupation/profession and organisations by their sector or mission. The label “non-classifiable” was used for bios that could not be placed in any professional or organisational category. This produced a total of 188 labels, which were then aggregated into higher-level categories, giving a final list of 49 labels.
Future plans for the project
We will use the coded sample of Twitter bios to train a machine learning classifier of the authors of tweets. We will apply for further funding to improve our methodology and extend the scope of the project, for example by including more medical conditions, non-English-speaking countries, and platforms other than Twitter. We will improve the methodology by identifying experts from a sample of tweets on relevant topics representative of practices, and then sampling and classifying authors’ bios from this sample. This will give us a more representative sample of the individuals and organisational entities active in the public health discourse around COVID-19 and other medical conditions on Twitter. We will then classify practices and map the classified practices and authors onto a network to analyse how medical influencers affect discourse about public health practices on Twitter.
Mapping the linguistic topography of Sophocles’ plays: what Natural Language Processing can teach us about Sophoclean drama
Benjamin Folit-Weinberg, A.G. Leventis Postdoctoral Research Fellow (Institute for Greece, Rome, and the Classical Tradition & Department of Classics & Ancient History, University of Bristol) and Justus Schollmeyer, Data Scientist & Programmer
Scholars have long recognized that Sophocles, the great 5th Century B.C.E. tragedian, repeats thematically important words in his plays and that studying these repetitions can offer fundamental insights into his work. At present, however, identifying these repetitions is time-consuming and unsystematic, and the significance of specific repetitions is not always clear. Our project applies Natural Language Processing (NLP) and data visualization techniques to help scholars of Sophocles both identify linguistic patterns more efficiently and rigorously and interpret the significance of these patterns more insightfully.
Seed Corn funding provided by the Jean Golding Institute allowed us to create a feasibility prototype for an NLP and data visualization tool with several functions. The first function is heuristic and identifies the words or word families that appear most frequently in each of the seven fully extant plays of Sophocles. The second function is analytical and calculates how frequently a given word or word family is used in a specific play by Sophocles compared to the remaining six plays. The third function is hermeneutic and depicts the distribution of selected words within a specific play (see diagram below); the chart will ultimately include various overlays that demarcate units of the play and articulate relationships between uses of key words.
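The prototype itself isn’t public; the three functions can be sketched in miniature as below (real Greek text would first need lemmatisation so that word families are counted together, which this toy token-level sketch skips):

```python
from collections import Counter

def freq_per_1000(tokens):
    """Heuristic function: frequency of each word per 1,000 tokens."""
    counts = Counter(tokens)
    return {w: 1000 * c / len(tokens) for w, c in counts.items()}

def relative_use(word, play_tokens, rest_tokens):
    """Analytical function: how often `word` appears in one play
    versus the remaining plays, per 1,000 tokens."""
    return (freq_per_1000(play_tokens).get(word, 0.0),
            freq_per_1000(rest_tokens).get(word, 0.0))

def positions(word, tokens):
    """Hermeneutic function: normalised 0-1 positions of `word`
    through a play, ready to plot as a distribution chart."""
    return [i / len(tokens) for i, t in enumerate(tokens) if t == word]
```

The planned overlays (scene boundaries, relationships between uses) would then be drawn on top of the `positions` output.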
The successful development of this feasibility prototype has enabled us to apply for further funding to develop our tool; our goal is to make this available as a common good to anyone with an internet connection, regardless of their institutional affiliation or programming literacy. We are also exploring the possibility of scaling up our tool to address the entire 5th Century Athenian dramatic corpus and other corpora of texts from Greco-Roman antiquity.
For further information, please contact email@example.com
Transferring early disease detection classifiers for wearables on companion animals
Axel Montout, Andrew Dowsey, Tilo Burghardt, Ranjeet Bhamber, Melanie Hezzell and Emily Blackwell
Sensor-based health monitoring has been growing in popularity as it can provide insight into the wearer’s health in a practical and inexpensive way. Accelerometer-based sensors are a popular choice and have been used for various applications, ranging from human health to livestock monitoring.
Aim of the Project
This project aims to predict early signs of degenerative joint disease (DJD) in indoor cats with the use of accelerometers and machine learning techniques.
Methodology and Results
Data from a previous study investigating DJD in cats were used in this project [1]. Fifty-five pet cats were equipped with a wearable sensor that collected accelerometry data over a continuous 12-day period.
The raw data comprised 57 MS Excel spreadsheets containing the sensor data, along with a metadata spreadsheet describing the age and joint health of each cat and including an owner-generated mobility score (evaluated by the owner through a series of questions asking them to report changes in their cat’s mobility).
Out of the 57 sensors, five were set up to record activity counts every millisecond, while the rest collected counts every second. For this project, any data not already at one-second resolution were resampled to it (i.e. each count is the sum of the activity counts within that second).
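With the traces held as time-indexed pandas Series, this resampling is a one-liner; a small sketch on a hypothetical millisecond-resolution trace:

```python
import pandas as pd

# Hypothetical millisecond-resolution trace: 5 seconds of count-1 ticks
idx = pd.date_range("2021-06-01", periods=5000, freq="ms")
trace = pd.Series(1, index=idx, name="activity")

# Sum the counts within each second to match the second-resolution sensors
per_second = trace.resample("1s").sum()
print(per_second.head())
```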
Supervised machine learning with an SVM [2] requires building samples, each composed of an array of features and a ground-truth value: the value we want to predict. In this case, we decided to use activity counts alone as the features and the health score as the ground truth.
Accelerometry data for three of the cats were excluded from further analysis, as their sensors were calibrated differently from the rest. Although post-processing approaches could address such an issue, exclusion was chosen as a safeguard against biasing the analysis.
Given the limited number of cats (55), we decided to build multiple samples from each individual cat. We hypothesised that the effect of degenerative joint disease would be more apparent in a cat’s activity when it is performing higher-activity behaviours such as jumping and quick running. For that reason, we created samples by looking for peaks of activity in each cat’s trace and selecting a fixed amount of activity data before and after each peak, effectively building a window around it. We fixed the window length to 4 minutes (240 seconds) and selected the top 0.001% of peaks by activity count value. With this approach, we were able to generate 10 peaks per cat, giving a total of 550 samples.
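A sketch of this windowing step (the non-overlap constraint and edge handling are our assumptions; the study’s exact peak-selection routine isn’t given):

```python
import numpy as np

def peak_windows(activity, n_peaks=10, half=120):
    """Cut `half`-second windows either side of the highest activity
    counts (240 s total per window, 10 windows per cat as in the text)."""
    order = np.argsort(activity)[::-1]             # indices, highest count first
    used = np.zeros(len(activity), dtype=bool)
    windows = []
    for i in order:
        if len(windows) == n_peaks:
            break
        lo, hi = i - half, i + half
        if lo < 0 or hi > len(activity) or used[lo:hi].any():
            continue                               # skip trace edges and overlaps
        used[lo:hi] = True
        windows.append(activity[lo:hi])
    return np.array(windows)
```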
Before the dataset of peaks was fed to the classifier, the features were pre-processed to optimise its predictive power. Several different pre-processing pipelines were compared to establish which was optimal; only our ‘Baseline’ pipeline is discussed here. We first applied quotient normalisation to each sample’s activity array, which aims to normalise the amplitude of activity across samples, followed by standard scaling, which standardises features by removing the mean and scaling to unit variance. The standard score of a value x is calculated as z = (x – u) / s, where u is the mean and s the standard deviation of the feature.
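These two steps can be sketched as follows (the quotient normalisation here follows the common ‘probabilistic quotient’ recipe of scaling each sample by its median ratio to a reference sample; the study’s exact variant isn’t specified):

```python
import numpy as np

def quotient_normalise(X):
    """Scale each sample (row) by the median ratio of its values to a
    median reference sample, normalising overall activity amplitude."""
    ref = np.median(X, axis=0)                  # reference sample
    factors = np.median(X / ref, axis=1)        # per-sample dilution factor
    return X / factors[:, None]

def standard_scale(X):
    """z = (x - u) / s for each feature (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```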
The initial dataset was augmented (increasing the number of samples) by building permutations of peaks from the initial 550 sample dataset. Permutations of 2 and 3 peaks were built, creating datasets containing 4680 samples (2250 healthy and 2430 unhealthy) and 37440 samples (18000 healthy and 19440 unhealthy) respectively.
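The augmentation can be sketched with `itertools`: every ordered r-tuple of a cat’s peak windows is concatenated into one longer training sample, so 10 peaks give 10 × 9 = 90 ordered pairs or 10 × 9 × 8 = 720 ordered triples per cat (the toy windows below are invented stand-ins):

```python
from itertools import permutations

def augment(peak_windows, r):
    """Concatenate every ordered r-tuple of one cat's peak windows
    into a single longer training sample."""
    return [sum(combo, []) for combo in permutations(peak_windows, r)]

peaks = [[1, 1], [2, 2], [3, 3]]    # toy stand-in windows
pairs = augment(peaks, 2)           # 3 x 2 = 6 ordered pairs
```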
An SVM classifier (RBF kernel) was trained with the datasets described above. To evaluate prediction performance, we used leave-one-out cross-validation, training the model on all the samples in the dataset apart from those of one cat, which were held out for testing. This was repeated so that all 50 cats were tested against the rest. All samples were pre-processed with quotient normalisation and standard scaling before training the SVM.
We used the area under the receiver operating characteristic (ROC) curve (AUC) to evaluate performance. For the 1-, 2- and 3-peak datasets, testing AUCs of 57%, 65% and 69% were obtained respectively. Using more than three peaks did not improve the results.
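A sketch of the leave-one-cat-out evaluation (pooling the held-out decision scores into a single ROC curve is our assumption; per-cat AUCs could equally be averaged, and the features below are invented stand-ins):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def leave_one_cat_out_auc(X, y, cat_ids):
    """Hold out every sample of one cat at a time, train an RBF-kernel
    SVM on the rest, score the held-out samples, then pool the scores
    into a single AUC."""
    scores = np.empty(len(y), dtype=float)
    for cat in np.unique(cat_ids):
        test = cat_ids == cat
        clf = SVC(kernel="rbf").fit(X[~test], y[~test])
        scores[test] = clf.decision_function(X[test])
    return roc_auc_score(y, scores)

# Toy stand-in data: 4 cats, 10 samples each, well-separated classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),      # "healthy" samples
               rng.normal(3, 1, (20, 2))])     # "unhealthy" samples
y = np.array([0] * 20 + [1] * 20)
cats = np.array(([0] * 5 + [1] * 5 + [2] * 5 + [3] * 5) * 2)
print(leave_one_cat_out_auc(X, y, cats))
```

In the study, the quotient normalisation and standard scaling steps were applied to all samples before training; they are omitted here for brevity.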
Conclusion and Future Plans
Despite the limited amount of data available, a machine learning approach demonstrated promising results in the prediction of early joint disease in cats, with the help of our data augmentation approach. These results suggest that the data within small windows, centred around bursts of activity, contains enough information to discriminate a healthy cat from an unhealthy cat. Further work is currently ongoing to determine which part of the window of activity is most important for prediction. We hope that this will provide a novel insight into how the activity traces of cats with early joint disease differ from unaffected cats, when performing movements involving high activity.
Investigating biomarkers associated with Alzheimer’s Disease to boost multi-modality approach for early diagnosis
Alzheimer’s disease is a brain disorder that gradually destroys memory and thinking skills, as well as the ability to perform the most basic tasks. Most people with the disease have the late-onset form, with symptoms first appearing in their mid-60s. Early-onset Alzheimer’s disease is extremely rare and occurs between the ages of 30 and 60. Alzheimer’s disease is the leading cause of dementia in older people. According to recent studies, 5.8 million people in the United States aged 65 and up have Alzheimer’s disease, and the disease is estimated to affect 60-70% of the approximately 50 million people worldwide who have dementia.
Aim of the Project
Numerous recent studies have leveraged state-of-the-art machine learning techniques to predict biomarkers in Alzheimer’s disease. However, most of these studies focused on medical images of the brain. Despite showing promising results, such images are scarce and typically only acquired at later stages of the disease. In this project we explore an alternative approach, investigating biomarkers in non-image data instead. Specifically, we explore genomic and protein data for predicting the early stages of Alzheimer’s disease, particularly when combined with other modalities such as EHR (Electronic Health Records) and cognitive tests. The aims of this project can be summarised as follows:
Investigate the role of protein and genomic data in detecting Alzheimer’s disease.
Explore a multi-modality approach to combine various measures.
Assess ML models for the task and decide on the most suitable choice.
For this project, we use genetic data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (https://adni.loni.usc.edu/). In the scope of this project, we only use data from the ADNI 2 and ADNI GO cohorts. Subjects are medically confirmed to belong to one of five categories: cognitively normal (CN), Early Mild Cognitive Impairment (EMCI), Late Mild Cognitive Impairment (LMCI), Significant Memory Concern (SMC) or Alzheimer’s disease (AD). The data is combined with a set of cognitive test results.
Role of APOE in Alzheimer’s disease:
Apolipoprotein E (APOE) is a protein involved in the metabolism of fats in the body; one subtype is implicated in Alzheimer’s disease and cardiovascular disease. APOE has three common forms: APOE e2, APOE e3 and APOE e4. The e3 variant is the most common, while the e4 variant is known to be associated with developing the disease at an earlier age. Approximately 15% to 25% of the general population carries an APOE e4 allele.
We examine the e4 variant, which has been identified as the greatest known genetic risk factor for late-onset sporadic Alzheimer’s disease (AD) across a variety of ethnic groups. Table 1 shows the distribution of the e4 variant across different stages of Alzheimer’s. Carrying zero e4 alleles is most common in the control (CN) and EMCI groups; for the later LMCI and AD stages, carrying one or more e4 alleles is most common.
Age is also considered a significant factor related to APOE e4; Figure 1 shows the distribution of e4 allele counts across ages. A higher count is more common at younger ages, while at older ages (85+ years) a lower count is more common.
Considering only the APOE4 allele count and age, the AdaBoost ensemble method is used for predicting the four classes, attaining an average accuracy of 41% ± 9% using 10-fold cross-validation. This indicates that both factors play an important role in Alzheimer’s disease, yet neither should be used on its own to forecast the condition. We improved the model by adding four different cognitive tests:
CDRSB: Clinical Dementia Rating Scale – Sum of Boxes.
ADAS11: Alzheimer’s Disease Assessment Scale 11.
MMSE: Mini-Mental State Examination.
RAVLT_immediate: Rey Auditory Verbal Learning Test (sum of scores from 5 first trials).
Incorporating the cognitive tests, the accuracy of the model increased to 76% ± 6%.
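The 10-fold cross-validation protocol behind the accuracy figures above can be sketched in plain Python. This is a hedged illustration: `evaluate` is a hypothetical stand-in for training AdaBoost on the train fold and scoring accuracy on the test fold.

```python
import random
import statistics

# Sketch of k-fold cross-validation reported as mean +/- std accuracy.
# `evaluate(train_idx, test_idx)` is assumed to fit a model on the
# training indices and return accuracy on the test indices.
def kfold_accuracy(samples, evaluate, k=10, seed=0):
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        accs.append(evaluate(train, test))
    return statistics.mean(accs), statistics.pstdev(accs)
```

Reporting the spread across folds, as the 76% ± 6% figure does, gives a sense of how sensitive the model is to the particular train/test split.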
To analyse feature importance in the combined model, we use a forest of trees to evaluate the mean decrease in impurity, with inter-tree variability represented by error bars. This confirms that CDRSB is the most significant factor, as shown in Figure 2. In addition to the cognitive tests, age is a strong contributing factor, and education level also plays a role in the classification task.
Role of genetic sequence data:
We also investigate the role genetic data plays in predicting Alzheimer’s. The dataset has more than 7,000 feature variables describing the genetic constitution of the patients. A simplified decision tree diagram, shown in Figure 3, highlights the available decision paths. LOC5778 and LOC2303 are the most significant features according to the decision tree’s information gain scores.
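The information gain score a decision tree uses to rank features such as LOC5778 can be sketched directly: it is the entropy of the class labels minus the weighted entropy remaining after splitting on a feature. A minimal illustration for a discrete feature:

```python
import math
from collections import Counter

# Shannon entropy of a list of class labels, in bits.
def entropy(labels):
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total)
        for c in Counter(labels).values()
    )

# Information gain of splitting `labels` by the values of `feature`:
# entropy before the split minus the weighted entropy after it.
def information_gain(feature, labels):
    split = {}
    for f, y in zip(feature, labels):
        split.setdefault(f, []).append(y)
    remainder = sum(
        len(ys) / len(labels) * entropy(ys) for ys in split.values()
    )
    return entropy(labels) - remainder
```

A feature whose values separate the diagnostic classes cleanly has a gain close to the full label entropy, which is why the tree places such features near the root.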
We focused on developing prediction models using only the top 50 genetic features. Using the genetic features alone, the classifier achieves only 41% ± 6% accuracy. Combining genetic features with the other features (APOE4, EHR, cognitive tests and MRI tests) increases accuracy. Various models were tested, with accuracies shown in Figure 4. Tree-based methods perform best for this task: AdaBoost tops the list with an average accuracy of 71%, followed by the Decision Tree at 68%.
We have learnt a few lessons working through this project, which we summarise as follows:
A combination of modalities is the best approach for predicting Alzheimer’s.
Cognitive tests play a significant role in guiding protein and genomic data analysis.
Tree-based models showed the best performance for this task.
Future aims for the project include the following:
The impending third phase of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database will add fresh patient records to the dataset. We aim to expand this work to include more samples and to challenge more hypotheses and unknown biomarkers for Alzheimer’s disease, with more in-depth analysis of the role of apolipoprotein E4 in disease progression. Women also seem more likely than men to develop AD; the reasons for this are still unclear and need to be investigated further. Other contributing lifestyle factors can also be studied in our future work.
This project emphasised the importance of a multi-modality approach to Alzheimer’s classification. In our future work, we aim to incorporate additional modalities such as PET scans and MRI images. As the sample size increases, we aim to utilise deep learning approaches, especially for image-based modalities.
Bogdanovic, B., Eftimov, T., & Simjanoska, M. (2022). In-depth insights into Alzheimer’s disease by using explainable machine learning approach. Scientific Reports, 12(1), 1-26.
Kavitha, C., et al. (2022). Early-stage Alzheimer’s disease prediction using machine learning models. Frontiers in Public Health, 10, 853294. doi:10.3389/fpubh.2022.853294
Bachman, D. L., et al. (1992). Prevalence of dementia and probable senile dementia of the Alzheimer type in the Framingham Study. Neurology, 42, 115-119.
Non-invasive imaging of the eye to predict Alzheimer’s disease
Alzheimer’s Disease (AD) is an increasing global health burden, but despite intense research efforts, drug trials have shown little evidence of success. Thankfully, there is now exciting evidence that specialist imaging techniques like optical coherence tomography (OCT) can identify individuals at high risk of developing AD. OCT is a rapid, low-cost and non-invasive way to take high-resolution (3-5 µm) images of the retina and optic nerves at the back of our eyes and detect early signs of neurodegeneration. It is a technique that is also available in most high street opticians. Identifying high-risk individuals with this technique before they develop AD gives them the opportunity to change their lifestyles or enter drug trials at a much earlier stage.
Our aim was to find out how early signs of neurodegeneration in the eye are linked to AD. Using seed corn funding from the Jean Golding Institute, we learnt that measurements of the optic nerves at the back of our eyes can help to determine our future risk of AD. Optic nerve size is associated with eye and brain growth, education, and myopia (short-sightedness). The outcome of our analysis was that people who are more educated and who are more likely to be short-sighted have the lowest risk of AD. Therefore, having to wear glasses is not so bad! Our plan is to run further analysis on other lifestyle and environmental factors that could influence our risk of AD.
Maria Pregnolato, Lecturer in Infrastructure Resilience, University of Bristol
James Boyd, Head of Research, Brunel Institute/SS Great Britain Trust
Christopher Woods, Head of Research Software Engineering, Advanced Computing Research Centre, University of Bristol
Brunel’s Networks – Interactive has created a physical real-world exhibit of the online network graphing project Brunel’s Networks. The project uses the archives of the Brunel Institute, a collaboration between the SS Great Britain Trust and the University of Bristol, to digitally map the groups of individuals working with I.K. Brunel and the major 19th-century engineering innovations. The original project was developed as a web tool and lacked a physically interactive display. JGI Seedcorn funding has provided the resources to create a physically interactive experience, which is now installed and in use at the SS Great Britain site (Figure 1).
By loading the original network graphing code onto a stable, non-networked (!) unit, the interactive experience can run continuously in a robust console, helping visitors to the SS Great Britain site understand the history of engineering through innovative data visualisations. The console was presented in small table-top form at the JGI showcase event in June 2022 and received positive feedback from visitors and the public. Further feedback on the large-scale interactive exhibit will be gathered during August, allowing the project researchers to understand how data visualisations help the public to understand the past, and to appreciate the STEM skills (e.g. programming) behind such exhibits.
Public evaluations from Brunel’s Networks – Interactive will inform the ways in which the SS Great Britain Trust uses digital interactives and data visualizations in future exhibit use – a critical issue in developing contemporary visitor experience and public engagement. This interdisciplinary project combined historical research from the museum with research software engineering from the University, to improve the use of data visualisation in historical analysis, and to use data visualisation as an interpretive museum tool.
JGI Student Experience Profiles: Richard Lane (Ask-JGI Data Science Support 2021-22)
I’m very glad I applied to Ask-JGI; I wanted to get some broader experience with data science than my studies offered and to solve some problems that I wouldn’t ordinarily come across. I found that not only was I exposed to diverse and interesting areas of data science, but also got to be involved with things that I hadn’t considered – the JGI team have a wide range of interests and it was great being immersed in a community interested in everything from data hazards and data ethics to software development and best practices, to outreach and public engagement.
The JGI team and workplace culture were great – the JGI staff and PGR helpers were all lovely and the range of interests and personalities made for a really friendly and dynamic feel. The less academic environment was also a nice change of pace – Ask-JGI felt like a team of professionals working on small, self-contained problems, which contrasted nicely with my less well-defined, more bureaucratic PhD studies.
My favourite part of my JGI experience was taking part in the range of in-person workshops and events that the JGI is involved in throughout the year: I found myself tending a stall at the JGI’s Bristol Data & AI Showcase, organising materials for the Data Week Handbook Resource workshop and attending a research culture networking event. These were all really fun, hugely rewarding, and something I’d recommend to anyone interested in any of the work that the JGI does.
The time commitment doesn’t have to be huge – the work we did was flexible enough that it didn’t impose on my studies during crunch time, and in less busy periods I could tackle more work. The largest project I took on was with a team hoping to improve the student experience, which involved natural language processing of email data. I worked to make a prototype model to classify, label and cluster emails on similar topics. To do this I used several data-science techniques that were familiar to me, but I also had to consider data security, privacy and ethics challenges which I found surprisingly interesting and something I hadn’t encountered before.
Overall I would absolutely recommend joining the Ask-JGI team to anyone interested in engaging with data science and the wider Bristol research community – you’ll get to solve some interesting problems, meet some great people and immerse yourself in the diversity of Bristol’s research.