JGI Seed Corn Funding Project Blog 2021: Dr Zahraa Abdallah

Investigating biomarkers associated with Alzheimer’s Disease to boost multi-modality approach for early diagnosis


Alzheimer’s disease is a brain disorder that gradually destroys memory and thinking skills and, eventually, the ability to perform the most basic tasks. Most people with the disease have the late-onset form, with symptoms first appearing in their mid-60s. Early-onset Alzheimer’s disease is extremely rare and occurs between the ages of 30 and 60. Alzheimer’s disease is the leading cause of dementia in older people. According to recent studies, 5.8 million people in the United States aged 65 and over have Alzheimer’s disease. Alzheimer’s disease is estimated to affect 60-70% of the approximately 50 million people worldwide who have dementia.

Aim of the Project

Numerous recent studies have leveraged state-of-the-art machine learning techniques to predict biomarkers in Alzheimer’s disease. However, most of these studies focused on medical images of the brain. Despite showing promising results, such images are scarce and typically required at later stages of the disease. In this project, we investigate an alternative approach that looks for biomarkers in non-image data instead. Specifically, we explore genomic and protein data for predicting the early stages of Alzheimer’s disease, particularly when combined with other modalities such as EHR (Electronic Health Records) and cognitive tests. The aims of this project can be summarised as follows:

  • Investigate the role of protein and genomic data in detecting Alzheimer’s disease.
  • Explore a multi-modality approach to combine various measures.
  • Assess ML models for the task and decide on the most suitable choice.


For this project, we use genetic data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (https://adni.loni.usc.edu/). In the scope of this project, we only use data from the ADNI 2 and ADNI GO cohorts. Subjects are medically confirmed to belong to one of the following categories: cognitively normal (CN), Early Mild Cognitive Impairment (EMCI), Late Mild Cognitive Impairment (LMCI), Significant Memory Concern (SMC) or Alzheimer’s disease (AD). The data is combined with a set of cognitive test results.

Role of APOE in Alzheimer’s disease:

Apolipoprotein E (APOE) is a protein involved in the metabolism of fats in the body. One subtype is implicated in Alzheimer’s disease and cardiovascular disease. APOE has three common forms: APOEe2, APOEe3 and APOEe4. E3 is the most common form; the E4 variant is known to be associated with getting the disease at an earlier age. Approximately 15% to 25% of the general population carries an APOE e4 allele [2].

We examine the E4 variant, which has been identified as the greatest known genetic risk factor for late-onset sporadic Alzheimer’s disease (AD) in a variety of ethnic groups [1]. Table 1 shows the distribution of the E4 variant across the different stages of Alzheimer’s. Having zero e4 alleles is the most common case in the control and early MCI groups. In the later stages, LMCI and AD, having one or more e4 alleles is the most common case.

Table 1: Distribution of E4 variant in each AD type
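
A distribution of this kind is a cross-tabulation of e4 allele count against diagnosis. The following is a minimal sketch of how such a table could be computed. The records are made-up toy values, not ADNI data, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (diagnosis, number of e4 alleles). Not ADNI data.
records = [
    ("CN", 0), ("CN", 0), ("CN", 1),
    ("EMCI", 0), ("EMCI", 1),
    ("LMCI", 1), ("LMCI", 2), ("LMCI", 0),
    ("AD", 1), ("AD", 2), ("AD", 1),
]

def crosstab(rows):
    """Count subjects for each (diagnosis, e4 allele count) pair."""
    return Counter(rows)

def most_common_e4(rows, diagnosis):
    """Return the e4 allele count seen most often within one diagnosis group."""
    counts = Counter(e4 for dx, e4 in rows if dx == diagnosis)
    return counts.most_common(1)[0][0]

table = crosstab(records)
for dx in ["CN", "EMCI", "LMCI", "AD"]:
    # One row per diagnosis, one column per e4 allele count (0, 1, 2).
    print(dx, [table[(dx, k)] for k in (0, 1, 2)])
```

Missing combinations simply count as zero, because a `Counter` returns 0 for keys it has not seen.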

Age is also considered a significant factor related to APOE E4; Figure 1 shows the distribution of the various APOE4 types across ages. A higher value of the protein is more common at younger ages, while at older ages (85+ years) a smaller value is more common.

Figure 1: Age distribution according to genetic type

Considering only the APOE4 protein and age, we use the AdaBoost ensemble method to predict the four classes. Under 10-fold cross-validation, the average accuracy attained is 41% ± 9%. This indicates that both factors play an important role in Alzheimer’s disease, yet they should not be used on their own to forecast the condition. We improved the model by adding four cognitive tests:

  • CDRSB: Clinical Dementia Rating Scale – Sum of Boxes.
  • ADAS11: Alzheimer’s Disease Assessment Scale 11.
  • MMSE: Mini-Mental State Examination.
  • RAVLT_immediate: Rey Auditory Verbal Learning Test (sum of scores from 5 first trials).

Incorporating the cognitive tests, the accuracy of the model increased to 76% ± 6%.
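
The accuracy figures here are a mean and spread over the folds of a 10-fold cross-validation. The sketch below shows one way such a figure can be produced. It makes several assumptions: the data is synthetic, a simple nearest-centroid classifier stands in for AdaBoost to keep the example short, and the spread is taken to be the sample standard deviation across folds. All function names are hypothetical.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Return the per-fold accuracies of a classifier under k-fold CV."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Train on k-1 folds, then score on the held-out fold.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores

# Stand-in classifier (NOT AdaBoost): predict the class with the nearest mean.
def fit_centroids(X, y):
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_centroid(model, row):
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, model[c])))

# Synthetic two-feature data with two well-separated classes.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)] + \
    [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

scores = cross_val_accuracy(X, y, fit_centroids, predict_centroid, k=10)
print(f"accuracy: {statistics.mean(scores):.2f} ± {statistics.stdev(scores):.2f}")
```

Any classifier exposing a fit and a predict function can be passed in the same way, so swapping in a boosted ensemble would not change the evaluation loop.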

To analyse feature importance in the combined protein model, we use a forest of trees to evaluate the mean decrease in impurity; the error bars represent the variability between trees. This model confirms that CDRSB is the most significant factor, as shown in Figure 2. In addition to the cognitive tests, age is a strong contributing factor, and education level also plays a role in the classification task.

Figure 2: Feature importance in model based on APOE4 and cognitive tests
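
The mean decrease in impurity builds on a simple quantity: how much a single split reduces the impurity of a node. The sketch below computes this for one split, assuming Gini impurity as the criterion (the post does not name the criterion used). A forest accumulates such decreases per feature, weighted by the share of samples reaching each node, and averages them over its trees. The labels are toy values and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Toy node with two classes in equal proportion.
parent = ["CN", "CN", "AD", "AD"]

# A split that separates the classes perfectly removes all impurity.
print(impurity_decrease(parent, ["CN", "CN"], ["AD", "AD"]))  # 0.5

# A split whose children are as mixed as the parent gives no decrease.
print(impurity_decrease(parent, ["CN", "AD"], ["CN", "AD"]))  # 0.0
```

A feature that is repeatedly chosen for splits with large decreases ends up with a high importance score.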

Role of genetic sequence data:


Figure 3: Visualization of Decision tree based on genetic sequence

We also investigate the role genetic data plays in predicting Alzheimer’s. The dataset has more than 7000 feature variables describing the genetic constitution of the patients. A simplified decision tree diagram is shown in Figure 3, highlighting the available actionable paths. LOC5778 and LOC 2303 are the most significant features according to the decision tree information gain score.
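
Information gain measures how much a split on one feature reduces the entropy of the class labels. The following is a minimal sketch of scoring and ranking features this way. The feature names `gene_a` and `gene_b`, their values and the fixed threshold are hypothetical; a real decision tree searches over thresholds rather than fixing one.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples at value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Hypothetical toy data: two features measured on six subjects.
labels = ["CN", "CN", "CN", "AD", "AD", "AD"]
features = {
    "gene_a": [0.1, 0.2, 0.3, 0.8, 0.9, 1.0],  # separates the classes cleanly
    "gene_b": [0.1, 0.9, 0.2, 0.8, 0.3, 1.0],  # carries little information
}

gains = {name: information_gain(vals, labels, 0.5) for name, vals in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked)  # features ordered from most to least informative
```

Keeping only the best-scoring features then amounts to slicing the ranked list, for example `ranked[:k]` for the top `k`.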

We focused on developing prediction models using only the top 50 genetic features. Using the genetic features alone, the classifier achieves only 41% ± 6% accuracy. Combining the genetic features with the other features (APOE4, EHR, cognitive tests and MRI tests) increases accuracy. Various models were tested, and their accuracy is shown in Figure 4. Tree-based methods perform best for this task: AdaBoost tops the list with an average accuracy of 71%, followed by Decision Tree at 68%.

Figure 4: Comparison of accuracy in different ML models
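
One common way to combine modalities is feature-level fusion, in which the feature vectors of each subject are concatenated into a single row before training. The sketch below illustrates the idea. It is an assumption for illustration, not necessarily how the models above were built; the subject IDs and values are made up, and the function name is hypothetical.

```python
def fuse_modalities(*modalities):
    """Concatenate per-subject feature vectors from several modalities.

    Each modality is a dict mapping subject ID -> list of numeric features.
    Only subjects present in every modality are kept.
    """
    shared = set(modalities[0])
    for m in modalities[1:]:
        shared &= set(m)
    return {sid: [v for m in modalities for v in m[sid]] for sid in sorted(shared)}

# Hypothetical toy values, not ADNI data.
genetic = {"s1": [0.4, 1.2], "s2": [0.9, 0.3], "s3": [0.5, 0.5]}
apoe4 = {"s1": [0], "s2": [1], "s3": [2]}
cognitive = {"s1": [29, 0.0], "s2": [24, 1.5]}  # e.g. MMSE and CDRSB; s3 has no scores

fused = fuse_modalities(genetic, apoe4, cognitive)
for sid, row in fused.items():
    print(sid, row)
```

Dropping subjects with a missing modality is the simplest policy; imputing the missing values is an alternative when samples are scarce.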

We learnt a few lessons working through this project, which we summarise as follows:

  • A combination of modalities is the best approach for predicting Alzheimer’s.
  • Cognitive tests play a significant role in guiding protein and genomic data analysis.
  • Tree-based models showed the best performance for this task.

Future aims for the project include the following:

The upcoming phase 3 of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database will add fresh patient records to the dataset. We aim to expand this work to include more samples, test more hypotheses and explore as yet unknown biomarkers for Alzheimer’s disease. We also plan a more in-depth analysis of the role of apolipoprotein E4 in disease progression. In addition, women seem more likely to develop AD than men [3]; the reasons for this are still unclear and need to be investigated further. Other contributing lifestyle factors can also be studied in our future work.

This project emphasised the importance of a multi-modality approach for Alzheimer’s classification. In our future work, we aim to incorporate additional modalities such as PET scans and MRI images. As the sample size increases, we aim to utilise deep learning approaches, especially for image-based modalities.


[1] Bogdanovic, B., Eftimov, T., & Simjanoska, M. (2022). In-depth insights into Alzheimer’s disease by using explainable machine learning approach. Scientific Reports, 12(1), 1-26.

[2] Kavitha, C., et al. (2022). Early-stage Alzheimer’s disease prediction using machine learning models. Frontiers in Public Health, 10, 853294. doi:10.3389/fpubh.2022.853294

[3] Bachman, D. L., et al. (1992). Prevalence of dementia and probable senile dementia of the Alzheimer type in the Framingham study. Neurology, 42, 115–119.

Contact details and links

Harshit Maheshwari, Data Scientist vk20457@alumni.bristol.ac.uk

Dr Zahraa Abdallah Zahraa.abdallah@bristol.ac.uk, Lecturer in Data Science, Engineering Mathematics Department, University of Bristol